How to search for duplicate values in a huge text file having around Half Million records
- by Shibu
I have an input txt file which has data in the form of records (each row is a record and represents more or less like a DB table) and I need to find for duplicate values. For example:
Rec1: ACCOUNT_NBR_1*NAME_1*VALUE_1
Rec2: ACCOUNT_NBR_2*NAME_2*VALUE_2
Rec3: ACCOUNT_NBR_1*NAME_3*VALUE_3
In the above set, the Rec1 and Rec2 are considered to be duplicates as the ACCOUNT NUMBERS are same(ACCOUNT_NBR1).
Note: The input file shown above is a delimiter type file (the delimiter being *) however the file type can also be a fixed length file where each column starts and ends with a specified positions.
I am currently doing this with the following logic:
Loop thru each ACCOUNT NUMBER
  Loop thru each line of the txt file and record and check if this is repeated.
  If repeated record the same in a hashtable.
  End 
End
And I am using 'Pattern' & 'BufferedReader' java API's to perform the above task.
But since it is taking a long time, I would like to know a better way of handling it.
Thanks,
Shibu