Reading and conditionally updating N rows, where N > 100,000 for DNA Sequence processing

Posted by makerofthings7 on Programmers
Published on 2012-09-12T03:13:26Z Indexed on 2012/09/12 3:48 UTC

I have a proof-of-concept application that uses Azure tables to associate DNA sequences with "something".

Table 1 is the master table. It lists every DNA sequence exactly once. The PartitionKey (PK) is a load-balancing hash of the RowKey (RK). The RK is the unique encoded value of the DNA sequence.
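To make the key scheme concrete, here is a minimal sketch of deriving the PK from the RK. It is written in Python only to keep it short, and several details are assumptions that are not part of my actual setup: the function names, the bucket count of 64, the use of SHA-256, and the use of the raw sequence string as the encoded value.

```python
import hashlib


def partition_key(row_key: str, buckets: int = 64) -> str:
    """Map a RowKey to one of `buckets` partitions with a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and so would not give repeatable keys.
    The bucket count (64) is an arbitrary, hypothetical choice.
    """
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    return f"{bucket:04d}"


def master_key(encoded_sequence: str) -> tuple:
    """Return the (PK, RK) pair for a sequence in the master table.

    Assumes the encoded sequence itself is used as the RowKey.
    """
    return partition_key(encoded_sequence), encoded_sequence
```

Because the PK is a pure function of the RK, any thread can compute the full key for a sequence without a lookup, and the same sequence always maps to the same master row.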

Additional tables are created per subject. Each subject table holds a list of N DNA sequences, where N > 100,000, and each of those sequences has exactly one corresponding entry in the Master table.

It is possible for many tables to reference the same DNA sequence, but in this case only one entry will be present in the Master table.

My Azure dilemma:

I need to lock the reference in the Master table while I work with the data. I need to handle timeouts and prevent other threads from overwriting my data while one C# thread is working with the information. Other threads need to realise that a record is locked, move on to other unlocked records, and do the work there.
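One shape this could take is a lease stored on the master row and written with a conditional (ETag-style) update, so that a stale writer loses the race and an expired lease can be taken over. The sketch below is only an in-memory illustration in Python, not Azure SDK code: `FakeTable`, `PreconditionFailed`, `try_acquire`, `release`, and the `locked_by` / `lease_expires` fields are all hypothetical names, and the 60 second lease is an arbitrary value.

```python
import threading


class PreconditionFailed(Exception):
    """Raised when a conditional replace sees a changed version."""


class FakeTable:
    """In-memory stand-in for a table with ETag-style versioning."""

    def __init__(self):
        self._rows = {}  # key -> (entity dict, version number)
        self._guard = threading.Lock()

    def insert(self, key, entity):
        with self._guard:
            self._rows[key] = (dict(entity), 1)

    def get(self, key):
        with self._guard:
            entity, etag = self._rows[key]
            return dict(entity), etag

    def replace(self, key, entity, etag):
        # Succeeds only if nobody changed the row since it was read.
        with self._guard:
            _, current = self._rows[key]
            if current != etag:
                raise PreconditionFailed(key)
            self._rows[key] = (dict(entity), current + 1)


def try_acquire(table, key, owner, now, lease_seconds=60.0):
    """Try to take the lease on a row; return True on success."""
    entity, etag = table.get(key)
    if entity.get("locked_by") and entity.get("lease_expires", 0.0) > now:
        return False  # someone else holds an unexpired lease
    entity["locked_by"] = owner
    entity["lease_expires"] = now + lease_seconds
    try:
        table.replace(key, entity, etag)
    except PreconditionFailed:
        return False  # another worker won the race
    return True


def release(table, key, owner):
    """Clear the lease if `owner` still holds it; return True if cleared."""
    entity, etag = table.get(key)
    if entity.get("locked_by") != owner:
        return False
    entity["locked_by"] = None
    entity["lease_expires"] = 0.0
    try:
        table.replace(key, entity, etag)
    except PreconditionFailed:
        return False
    return True
```

A worker that gets `False` from `try_acquire` simply moves on to the next record. The timeout is handled by the expiry time: if a worker dies, its lease lapses and another worker can take the row without any explicit cleanup.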

Ideally I'd like to get some progress report of how my computation is going, and have the option to cancel the process (and unwind the locks).
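For the progress and cancel part, this is roughly the control flow I have in mind, again as a small Python sketch rather than real code. Everything here is an assumption for illustration: `process_rows` and its parameters are hypothetical, `acquire` / `release` stand for whatever locking mechanism ends up being used, and `threading.Event` plays the role that a `CancellationToken` would play in .NET.

```python
import threading


def process_rows(keys, acquire, release, work, cancel, on_progress):
    """Process each key that can be locked, reporting progress.

    acquire(key) -> bool, release(key), and work(key) are supplied by
    the caller. `cancel` is a threading.Event checked between rows.
    Returns (done, skipped).
    """
    done = 0
    skipped = 0
    total = len(keys)
    for key in keys:
        if cancel.is_set():
            break  # stop before taking any new lock
        if not acquire(key):
            skipped += 1  # locked by another worker, move on
            continue
        try:
            work(key)
            done += 1
        finally:
            # Unwind the lock even if work() raises.
            release(key)
        on_progress(done, skipped, total)
    return done, skipped
```

Cancellation is only checked between rows, so a row that is already being worked on finishes and its lock is released before the loop stops. The `finally` block is what unwinds the lock when the work itself fails.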

Question

What is the best approach for this?

I'm looking at these code snippets for inspiration:

http://blogs.msdn.com/b/jimoneil/archive/2010/10/05/azure-home-part-7-asynchronous-table-storage-pagination.aspx

http://stackoverflow.com/q/4535740/328397

© Programmers or respective owner
