De-dupe a list of hundreds of thousands of first name/last name/address/date of birth

Posted by Darren on Stack Overflow
Published on 2011-01-13T01:50:27Z

I have a large data set which I know contains many duplicate records. Basically, I have data on first name, last name, several address components, and date of birth.

I think the best approach is to match on name and date of birth, since if these match it is most likely the same person. There are probably many cases with slight differences in spelling (such as a typo that drops a single letter) or in how the name is entered (e.g. some records have a middle initial in the first name column). It would be good to account for these, but I'm not sure how to approach it.
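To make the idea concrete, here is a rough sketch of one way it could look in Python, using only the standard library. It is an illustration under assumptions, not a finished solution: the tuple layout `(id, first, last, dob)`, the function names, and the 0.85 similarity threshold are all made up for the example, and the rows would in practice be fetched from the MySQL table. Records are grouped ("blocked") by exact date of birth so that only people sharing a birthday are compared, which keeps the number of pairwise comparisons manageable with hundreds of thousands of rows.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize_name(name):
    """Lowercase the name and drop single-letter tokens such as a middle initial."""
    tokens = name.lower().replace(".", " ").split()
    kept = [t for t in tokens if len(t) > 1]
    # Fall back to the original tokens if the name was only initials.
    return " ".join(kept or tokens)


def similar(a, b, threshold=0.85):
    """True if the two strings are close enough to be a likely typo of each other."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def candidate_duplicates(records, threshold=0.85):
    """Return pairs of ids that look like the same person.

    records: iterable of (id, first_name, last_name, date_of_birth) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # Block on exact date of birth so only records sharing it are compared.
    blocks = defaultdict(list)
    for rec_id, first, last, dob in records:
        blocks[dob].append((rec_id, normalize_name(first), normalize_name(last)))

    pairs = []
    for group in blocks.values():
        for (id_a, first_a, last_a), (id_b, first_b, last_b) in combinations(group, 2):
            if similar(first_a, first_b, threshold) and similar(last_a, last_b, threshold):
                pairs.append((id_a, id_b))
    return pairs


# Made-up sample rows, purely for illustration.
sample = [
    (1, "Jonathan", "Smith", "1980-05-17"),
    (2, "Jonathan A.", "Smith", "1980-05-17"),   # middle initial in first name
    (3, "Jonthan", "Smith", "1980-05-17"),       # typo: missing letter
    (4, "Mary", "Jones", "1980-05-17"),          # same birthday, different person
    (5, "Jonathan", "Smith", "1975-01-02"),      # same name, different birthday
]

print(candidate_duplicates(sample))
```

With the sample rows, records 1, 2 and 3 are paired with each other, while 4 and 5 are left alone. The output is a list of candidate pairs to review rather than an automatic merge. Blocking on an exact date of birth would miss duplicates where the date itself is mistyped, so the address fields may be worth using as a second check.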

Are there any tools or articles on going about this process? The data is all in a MySQL database and I have a basic proficiency in SQL.

