asmaier - Developer IT

Which Bliki (Blog+Wiki) solution can you recommend?

- by asmaier

I'm searching for a good Bliki solution, meaning a combination of blog and wiki that I can install on my own web space. I would like to be able to write articles in the wiki style much like with media wiki. So I want to use a wiki markup language, have a revision history, comments, internal links to other pages (maybe in other languages) and be able to collaboratively edit the articles. On the other side I would like to have a blog-like view on my articles, showing new articles (and changes to existing articles) in a time ordered fashion. It would be nice if it would be possible to search through the articles and also tag the articles, so one could generate a tag cloud for the articles. A nice feature would also be to be able to order the articles according to views or even a voting system for the articles. Good would also be a permission system to keep certain articles private, showing them only to people logged in to the platform. Apart from these nice to have features an absolute must have feature for the Bliki platform I'm searching is the possibility to handle math equations (written in LaTeX syntax) and display them either as pictures like media wiki or even better using Mathjax. At the moment I'm using a web service called wikiDot which offers some of the mentioned features, however the free version shows to much advertisements, the blog feature is not mature, the design is quite ugly and loading of the page is often slow. So I want to install a Bliki solution on my own webspace. Can you recommend any solution for that?

Read the article

Which virtualization solutions are available for non-x86 platforms?

- by asmaier

Are their any virtualization solutions like Xen, KVM and VMWare ESX available for non-x86 platform? Especially I'm interested if there are solutions available or if Xen, KVM can be made to run on the platforms POWER5/6/7 PowerPC Itanium 64 NEC SX-8/9 Cray X2 BlueGene What are your experiences? Is it possible to virtualize GPUs like Nvidias Tesla/Fermi?

Read the article

Which virtualization solutions are available for non-x86 platforms?

- by asmaier

Are their any virtualization solutions like Xen, KVM and VMWare ESX available for non-x86 platform? Especially I'm interested if there are solutions available or if Xen, KVM can be made to run on the platforms POWER5/6/7 PowerPC Itanium 64 NEC SX-8/9 Cray X2 BlueGene What are your experiences? Is it possible to virtualize GPUs like Nvidias Tesla/Fermi?

Read the article

How to optimize my PageRank calculation?

- by asmaier

In the book Programming Collective Intelligence I found the following function to compute the PageRank: def calculatepagerank(self,iterations=20): # clear out the current PageRank tables self.con.execute("drop table if exists pagerank") self.con.execute("create table pagerank(urlid primary key,score)") self.con.execute("create index prankidx on pagerank(urlid)") # initialize every url with a PageRank of 1.0 self.con.execute("insert into pagerank select rowid,1.0 from urllist") self.dbcommit() for i in range(iterations): print "Iteration %d" % i for (urlid,) in self.con.execute("select rowid from urllist"): pr=0.15 # Loop through all the pages that link to this one for (linker,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid): # Get the PageRank of the linker linkingpr=self.con.execute("select score from pagerank where urlid=%d" % linker).fetchone()[0] # Get the total number of links from the linker linkingcount=self.con.execute("select count(*) from link where fromid=%d" % linker).fetchone()[0] pr+=0.85*(linkingpr/linkingcount) self.con.execute("update pagerank set score=%f where urlid=%d" % (pr,urlid)) self.dbcommit() However, this function is very slow, because of all the SQL queries in every iteration >>> import cProfile >>> cProfile.run("crawler.calculatepagerank()") 2262510 function calls in 136.006 CPU seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 136.006 136.006 <string>:1(<module>) 1 20.826 20.826 136.006 136.006 searchengine.py:179(calculatepagerank) 21 0.000 0.000 0.528 0.025 searchengine.py:27(dbcommit) 21 0.528 0.025 0.528 0.025 {method 'commit' of 'sqlite3.Connecti 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler 1339864 112.602 0.000 112.602 0.000 {method 'execute' of 'sqlite3.Connec 922600 2.050 0.000 2.050 0.000 {method 'fetchone' of 'sqlite3.Cursor' 1 0.000 0.000 0.000 0.000 {range} So I optimized the function and came up with this: def calculatepagerank2(self,iterations=20): # clear out the current PageRank tables self.con.execute("drop table if exists pagerank") self.con.execute("create table pagerank(urlid primary key,score)") self.con.execute("create index prankidx on pagerank(urlid)") # initialize every url with a PageRank of 1.0 self.con.execute("insert into pagerank select rowid,1.0 from urllist") self.dbcommit() inlinks={} numoutlinks={} pagerank={} for (urlid,) in self.con.execute("select rowid from urllist"): inlinks[urlid]=[] numoutlinks[urlid]=0 # Initialize pagerank vector with 1.0 pagerank[urlid]=1.0 # Loop through all the pages that link to this one for (inlink,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid): inlinks[urlid].append(inlink) # get number of outgoing links from a page numoutlinks[urlid]=self.con.execute("select count(*) from link where fromid=%d" % urlid).fetchone()[0] for i in range(iterations): print "Iteration %d" % i for urlid in pagerank: pr=0.15 for link in inlinks[urlid]: linkpr=pagerank[link] linkcount=numoutlinks[link] pr+=0.85*(linkpr/linkcount) pagerank[urlid]=pr for urlid in pagerank: self.con.execute("update pagerank set score=%f where urlid=%d" % (pagerank[urlid],urlid)) self.dbcommit() This function is 20 times faster (but uses a lot more memory for all the temporary dictionaries) because it avoids the unnecessary SQL queries in every iteration: >>> cProfile.run("crawler.calculatepagerank2()") 64802 function calls in 6.950 CPU seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 0.004 0.004 6.950 6.950 <string>:1(<module>) 1 1.004 1.004 6.946 6.946 searchengine.py:207(calculatepagerank2 2 0.000 0.000 0.104 0.052 searchengine.py:27(dbcommit) 23065 0.012 0.000 0.012 0.000 {meth 'append' of 'list' objects} 2 0.104 0.052 0.104 0.052 {meth 'commit' of 'sqlite3.Connection 1 0.000 0.000 0.000 0.000 {meth 'disable' of '_lsprof.Profiler' 31298 5.809 0.000 5.809 0.000 {meth 'execute' of 'sqlite3.Connectio 10431 0.018 0.000 0.018 0.000 {method 'fetchone' of 'sqlite3.Cursor' 1 0.000 0.000 0.000 0.000 {range} But is it possible to further reduce the number of SQL queries to speed up the function even more?

Read the article

Python: How to read huge text file into memory

- by asmaier

I'm using Python 2.6 on a Mac Mini with 1GB RAM. I want to read in a huge text file $ ls -l links.csv; file links.csv; tail links.csv -rw-r--r-- 1 user user 469904280 30 Nov 22:42 links.csv links.csv: ASCII text, with CRLF line terminators 4757187,59883 4757187,99822 4757187,66546 4757187,638452 4757187,4627959 4757187,312826 4757187,6143 4757187,6141 4757187,3081726 4757187,58197 So each line in the file consists of a tuple of two comma separated integer values. I want to read in the whole file and sort it according to the second column. I know, that I could do the sorting without reading the whole file into memory. But I thought for a file of 500MB I should still be able to do it in memory since I have 1GB available. However when I try to read in the file, Python seems to allocate a lot more memory than is needed by the file on disk. So even with 1GB of RAM I'm not able to read in the 500MB file into memory. My Python code for reading the file and printing some information about the memory consumption is: #!/usr/bin/python # -*- coding: utf-8 -*- import sys infile=open("links.csv", "r") edges=[] count=0 #count the total number of lines in the file for line in infile: count=count+1 total=count print "Total number of lines: ",total infile.seek(0) count=0 for line in infile: edge=tuple(map(int,line.strip().split(","))) edges.append(edge) count=count+1 # for every million lines print memory consumption if count%1000000==0: print "Position: ", edge print "Read ",float(count)/float(total)*100,"%." mem=sys.getsizeof(edges) for edge in edges: mem=mem+sys.getsizeof(edge) for node in edge: mem=mem+sys.getsizeof(node) print "Memory (Bytes): ", mem The output I got was: Total number of lines: 30609720 Position: (9745, 2994) Read 3.26693612356 %. Memory (Bytes): 64348736 Position: (38857, 103574) Read 6.53387224712 %. Memory (Bytes): 128816320 Position: (83609, 63498) Read 9.80080837067 %. Memory (Bytes): 192553000 Position: (139692, 1078610) Read 13.0677444942 %. Memory (Bytes): 257873392 Position: (205067, 153705) Read 16.3346806178 %. Memory (Bytes): 320107588 Position: (283371, 253064) Read 19.6016167413 %. Memory (Bytes): 385448716 Position: (354601, 377328) Read 22.8685528649 %. Memory (Bytes): 448629828 Position: (441109, 3024112) Read 26.1354889885 %. Memory (Bytes): 512208580 Already after reading only 25% of the 500MB file, Python consumes 500MB. So it seem that storing the content of the file as a list of tuples of ints is not very memory efficient. Is there a better way to do it, so that I can read in my 500MB file into my 1GB of memory?

Read the article

How to prevent code/option injection in a bash script

- by asmaier

I have written a small bash script called "isinFile.sh" for checking if the first term given to the script can be found in the file "file.txt": #!/bin/bash FILE="file.txt" if [ `grep -w "$1" $FILE` ]; then echo "true" else echo "false" fi However, running the script like > ./isinFile.sh -x breaks the script, since -x is interpreted by grep as an option. So I improved my script #!/bin/bash FILE="file.txt" if [ `grep -w -- "$1" $FILE` ]; then echo "true" else echo "false" fi using -- as an argument to grep. Now running > ./isinFile.sh -x false works. But is using -- the correct and only way to prevent code/option injection in bash scripts? I have not seen it in the wild, only found it mentioned in ABASH: Finding Bugs in Bash Scripts.

Read the article

Which technology is best suited to store and query a huge readonly graph?

- by asmaier

I have a huge directed graph: It consists of 1.6 million nodes and 30 million edges. I want the users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database. But that solution is not very efficient and elegant, I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph). It was suggested to me to use a GraphDB like neo4j or AllegroGraph. However the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j on the other hand has only a very low level API (and the python interface is not mature yet). Both of them seem to be more suited for problems, where nodes and edges are frequently added or removed to a graph. For a simple search on a graph, these GraphDBs seem to be too complex. One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph. Another idea would be, to have a server process, storing the whole graph (500MB to 1GB) in memory. The clients could then query the server process and could transverse the graph very quickly, since the graph is stored in memory. Is there an easy possibility to write such a server (preferably in Python) using some existing framework? Which technology would you use to store and query such a huge readonly graph?

Read the article

Is it possible in any Java IDE to collapse the type definitions in the source code?

- by asmaier

Lately I often have to read Java code like this: LinkedHashMap<String, Integer> totals = new LinkedHashMap<String, Integer>(listOfRows.get(0)) for (LinkedHashMap<String, Integer> row : (ArrayList<LinkedHashMap<String,Integer>>) table.getValue()) { for(Entry<String, Integer> elem : row.entrySet()) { String colName=elem.getKey(); int Value=elem.getValue(); int oldValue=totals.get(colName); int sum = Value + oldValue; totals.put(colName, sum); } } Due to the long and nested type definitions the simple algorithm becomes quite obscured. So I wished I could remove or collapse the type definitions with my IDE to see the Java code without types like: totals = new (listOfRows.get(0)) for (row : table.getValue()) { for(elem : row.entrySet()) { colName=elem.getKey(); Value=elem.getValue(); oldValue=totals.get(colName); sum = Value + oldValue; totals.put(colName, sum); } } The best way of course would be to collapse the type definitions, but when moving the mouse over a variable show the type as a tooltip. Is there a Java IDE or a plugin for an IDE that can do this?

Read the article

How to optimize my PostgreSQL DB for prefix search?

- by asmaier

I have a table called "nodes" with roughly 1.7 million rows in my PostgreSQL db =#\d nodes Table "public.nodes" Column | Type | Modifiers --------+------------------------+----------- id | integer | not null title | character varying(256) | score | double precision | Indexes: "nodes_pkey" PRIMARY KEY, btree (id) I want to use information from that table for autocompletion of a search field, showing the user a list of the ten titles having the highest score fitting to his input. So I used this query (here searching for all titles starting with "s") =# explain analyze select title,score from nodes where title ilike 's%' order by score desc; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Sort (cost=64177.92..64581.38 rows=161385 width=25) (actual time=4930.334..5047.321 rows=161264 loops=1) Sort Key: score Sort Method: external merge Disk: 5712kB -> Seq Scan on nodes (cost=0.00..46630.50 rows=161385 width=25) (actual time=0.611..4464.413 rows=161264 loops=1) Filter: ((title)::text ~~* 's%'::text) Total runtime: 5260.791 ms (6 rows) This was much to slow for using it with autocomplete. With some information from Using PostgreSQL in Web 2.0 Applications I was able to improve that with a special index =# create index title_idx on nodes using btree(lower(title) text_pattern_ops); =# explain analyze select title,score from nodes where lower(title) like lower('s%') order by score desc limit 10; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=18122.41..18122.43 rows=10 width=25) (actual time=1324.703..1324.708 rows=10 loops=1) -> Sort (cost=18122.41..18144.60 rows=8876 width=25) (actual time=1324.700..1324.702 rows=10 loops=1) Sort Key: score Sort Method: top-N heapsort Memory: 17kB -> Bitmap Heap Scan on nodes (cost=243.53..17930.60 rows=8876 width=25) (actual time=96.124..1227.203 rows=161264 loops=1) Filter: (lower((title)::text) ~~ 's%'::text) -> Bitmap Index Scan on title_idx (cost=0.00..241.31 rows=8876 width=0) (actual time=90.059..90.059 rows=161264 loops=1) Index Cond: ((lower((title)::text) ~>=~ 's'::text) AND (lower((title)::text) ~<~ 't'::text)) Total runtime: 1325.085 ms (9 rows) So this gave me a speedup of factor 4. But can this be further improved? What if I want to use '%s%' instead of 's%'? Do I have any chance of getting a decent performance with PostgreSQL in that case, too? Or should I better try a different solution (Lucene?, Sphinx?) for implementing my autocomplete feature?

Read the article

How to improve my LDAP schema?

- by asmaier

Hello, I have a OpenLDAP Database and it holds some project objects that look like dn: cn=Proj1,ou=Project,ou=ua,dc=org cn: Proj1 objectClass: top objectClass: posixGroup member: 001ag member: 002ag System: ABEL System: PCx Budget: ABEL:1000000:0.3 Budget: PCx:300000:0.3 One can see that the Budget attribute is a ":"-separated string, where the first part holds the name of the system the budget is for, the second part holds some budget (which may change every month) and the last entry is a conversion factor for the budget of that system. Seeing this, I thought this is bad database design, since attribute values should always be atomic. But how can I improve that in LDAP, so that I can do a direct ldapsearch or a direct ldapmodify of the budget of System "ABEL" instead of writing a script, that will have to parse and split the ":"-separated string?

Read the article

How to build an interactive search engine web interface using python

- by asmaier

I have build a static web interface for searching data from some tables in my PostgreSQL database. The query website consists of a simple textfield for entering the search term, the result website presents the results as a simple html table. The server side code for searching the PostgreSQL database and returning the results is written in python using psycopg2. Now I would like to add some interactive "Ajax features" to my search engine. When entering the search term I would like to be able to see a list of possible search terms like Google does it. On the results page, I would like to be able to sort the table showing the results. What would be the easiest/recommended way to implement these features for my search engine web site? Do I need a full-fledged web framework like Django for that?

Developer IT