What should I do to accommodate large-scale data storage and retrieval?

Posted by kailashbuki on Stack Overflow, 2010-12-30

There are two columns in a table inside a MySQL database. The first column contains a fingerprint, while the second contains the list of documents that have that fingerprint, much like the inverted index built by search engines. An instance of a record in the table is shown below:

34 "doc1, doc2, doc45"

The number of fingerprints is very large (it can range up to trillions). There are basically two operations on the database: inserting/updating a record, and retrieving the record that matches a given fingerprint. The table-definition Python snippet is:

self.cursor.execute("CREATE TABLE IF NOT EXISTS `fingerprint` (fp BIGINT, documents TEXT)")

And the snippet for the insert/update operation is:

# cursor.execute() returns the number of affected rows; if the UPDATE
# matched nothing, the fingerprint is new and gets an INSERT instead.
if self.cursor.execute(
        "UPDATE `fingerprint` SET documents=CONCAT(documents,%s) WHERE fp=%s",
        ("," + newDocId, thisFP)) == 0:
    self.cursor.execute(
        "INSERT INTO `fingerprint` VALUES (%s, %s)", (thisFP, newDocId))

The only bottleneck I have observed so far is the query time in MySQL. My whole application is web based, so time is a critical factor. I have also thought of using Cassandra but have little knowledge of it. Please suggest a better way to tackle this problem.
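
One concrete alternative I am weighing (the table name fingerprint_doc and the doc_id column size below are just illustrative): normalize the document list into one (fp, doc_id) row per pair instead of an ever-growing comma-separated TEXT value. A sketch:

def create_normalized_table(cursor):
    # One row per (fingerprint, document) pair; the composite primary key
    # both deduplicates pairs and indexes lookups by fp.
    cursor.execute("CREATE TABLE IF NOT EXISTS `fingerprint_doc` ("
                   "fp BIGINT NOT NULL, doc_id VARCHAR(64) NOT NULL, "
                   "PRIMARY KEY (fp, doc_id))")

def add_document(cursor, fp, doc_id):
    # INSERT IGNORE silently skips an already-present (fp, doc_id) pair.
    cursor.execute("INSERT IGNORE INTO `fingerprint_doc` (fp, doc_id) "
                   "VALUES (%s, %s)", (fp, doc_id))

def documents_for(cursor, fp):
    cursor.execute("SELECT doc_id FROM `fingerprint_doc` WHERE fp=%s", (fp,))
    return [row[0] for row in cursor.fetchall()]

Would a layout like that scale better at this volume, or is something like Cassandra the right move?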
