What is the fastest way to find duplicates in multiple BIG txt files?

Posted by user2950750 on Stack Overflow, 2013-11-03

I am really in deep water here and I need a lifeline.

I have 10 txt files. Each file has up to 100,000,000 lines of data. Each line is simply a number representing something else. The numbers are up to 9 digits long.

I need to (somehow) scan these 10 files and find the numbers that appear in all 10 files.

And here comes the tricky part. I have to do it in less than 2 seconds.

I am not a developer, so I need an explanation for dummies. I have done enough research to learn that hash tables and MapReduce might be something I can make use of. But can they really make it this fast, or do I need a more advanced solution?
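From what I have read, I imagine the hash table idea looks roughly like the Python sketch below: load the numbers from the first file into a set, then keep only the numbers that also appear in every other file. The file names are made up, and I am not sure this is how a real developer would do it:

    # Rough sketch of the hash-table / set idea (file names are made up).
    # Keep only the numbers that appear in every file (set intersection).

    def numbers_in_all_files(paths):
        common = None
        for path in paths:
            with open(path) as f:
                seen = {int(line) for line in f if line.strip()}
            # keep only numbers seen in every file processed so far
            common = seen if common is None else common & seen
        return common

    files = ["file1.txt", "file2.txt", "file3.txt"]  # ... up to file10.txt
    result = numbers_in_all_files(files)
    print(len(result), "numbers appear in every file")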

I have also been thinking about cutting up the files into smaller files, so that 1 file with 100,000,000 lines is transformed into 100 files with 1,000,000 lines each.

But I do not know which is better: 10 files with 100 million lines each, or 1,000 files with 1 million lines each?
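If splitting helps at all, my guess (based on the MapReduce idea) is that the chunks would have to be made so that the same number always lands in the same chunk, so each chunk can be checked on its own. Something like this sketch is what I have in mind; the chunk count and file names are just examples:

    # Sketch of splitting by the number's value (modulo), so equal numbers
    # always end up in the same chunk file. Chunk count is an example.

    NUM_CHUNKS = 100

    def split_into_chunks(path, out_prefix):
        outs = [open(f"{out_prefix}_{i}.txt", "w") for i in range(NUM_CHUNKS)]
        try:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line:
                        # route the number to a chunk based on its value
                        outs[int(line) % NUM_CHUNKS].write(line + "\n")
        finally:
            for o in outs:
                o.close()

    split_into_chunks("file1.txt", "file1_chunk")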

When I try to open the 100-million-line file, it takes forever, so I think maybe it is just too big to use. But I do not know whether you can write code that scans it without opening the whole thing.
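Here is the kind of thing I imagine, if it is even possible: reading the file one line at a time instead of loading it all at once the way a text editor does (the file name is made up):

    # Sketch: stream a huge file line by line instead of loading it whole.

    count = 0
    with open("file1.txt") as f:
        for line in f:  # reads one line at a time from disk
            count += 1
    print("read", count, "lines without loading the whole file into memory")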

Speed is the most important factor here, and I need to know whether it can be done as fast as I need, or whether I have to store my data in another way, for example in a database like MySQL or something.

Thank you in advance to anybody that can give some good feedback.
