How can I quickly parse large (>10GB) files?

Posted by Andrew on Stack Overflow, 2009-12-17

Hi - I have to process text files 10-20 GB in size, in the format: field1 field2 field3 field4 field5

I would like to parse the data from field2 of each line into one of several files; which file a line's data gets pushed into is determined, line by line, by the value in field4. There are 25 different possible values in field4 and hence 25 different files the data can get parsed into.
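For example (with made-up values), a line like

a1 1000 x 7 y

has field4 = 7, so the field2 value on that line (1000) is the data that should end up in the output file for value 7.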

I have tried using Perl (slow) and awk (faster but still slow) - does anyone have any suggestions or pointers toward alternative approaches?

FYI, here is the awk code I was trying to use; note that I had to resort to going through the large file 25 times, because I wasn't able to keep 25 output files open at once in awk:

# 25 possible values of field4 (chromosomes); one pass over the input per value
chromosomes=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25)
for chr in "${chromosomes[@]}"
do
    # keep only lines whose field4 matches this chromosome, then print
    # the positions field2 through field2+52, one per line
    awk -v pat="$chr" '$4 == pat { for (i = $2; i <= $2 + 52; i++) print i }' my_in_file_here >> my_out_file_"$chr".query
done
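Ideally I would do this in a single pass, writing each line's output straight to the file named by field4 - roughly the sketch below, which is the kind of thing that ran into the open-file problem for me (file names match the ones in the loop above; I believe gawk copes with more simultaneously open files than the awk I was using, but I haven't verified that):

awk '{
    # the output file name is determined by field4
    out = "my_out_file_" $4 ".query"
    # append the positions field2 .. field2+52 to that file
    for (i = $2; i <= $2 + 52; i++)
        print i >> out
}' my_in_file_here

Since there are only 25 distinct values of field4, at most 25 files ever need to be open; closing each file with close(out) after writing a line (or switching to gawk) might be enough to get around the limit, though reopening per line has its own cost.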
