Complex string matching with fuzzywuzzy

Posted by That1Guy on Programmers
Published on 2012-10-16T19:18:36Z Indexed on 2012/10/16 23:20 UTC

I'm attempting to write a process that matches obscure strings to a single 'master string' for further processing. I have a lot of data that looks something like this:

Basketball
Basket Ball
Football
BasketBallR
BBall
BBall - r
FootB

...and so on. These need to be mapped to a master record like so:

Basketball       = Basket Ball, BBall
Basketball - R   = BasketBallR, BBall - r
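
A minimal sketch of this mapping, assuming the variants can be listed by hand. The alias table, `normalize`, and `to_master` are hypothetical names, and the normalization rule (lowercase, strip everything that is not a letter or digit) is an assumption about what counts as "the same" name:

```python
import re

# Hypothetical alias table, built only from the examples above.
MASTER_ALIASES = {
    "Basketball": ["Basket Ball", "BBall"],
    "Basketball - R": ["BasketBallR", "BBall - r"],
}


def normalize(text):
    """Lowercase and drop spaces, hyphens and other punctuation."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_lookup(aliases):
    """Map every normalized variant (and the master itself) to its master name."""
    lookup = {}
    for master, variants in aliases.items():
        for name in [master, *variants]:
            lookup[normalize(name)] = master
    return lookup


LOOKUP = build_lookup(MASTER_ALIASES)


def to_master(raw):
    """Return the master name for a raw string, or None if it is unknown."""
    return LOOKUP.get(normalize(raw))
```

Anything that returns None here (FootB, for instance) would still need a fuzzy match or manual review.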

I also have instances of data resembling this format:

Football -r
FootBall - r-g/H,Q,HH

These compound entries need to be split into separate categories before being mapped. For example, FootBall - r-g/H,Q,HH should become:

Football - r
Football - g
Football - H
Football - Q
Football - HH
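
The split itself could be sketched as follows, assuming the first hyphen separates the base name from the suffix codes and that '-', '/' and ',' all delimit codes. `split_variants` is a hypothetical helper, and it keeps the base name's original casing, so FootBall is only corrected to Football in the later mapping step:

```python
import re


def split_variants(raw):
    """Expand 'Name - a-b/c,d' into one 'Name - code' string per code."""
    # Assumption: the first hyphen ends the base name.
    base, sep, suffixes = raw.partition("-")
    base = base.strip()
    if not sep:
        return [base]
    # Assumption: hyphen, slash and comma all separate suffix codes.
    codes = [c.strip() for c in re.split(r"[-/,]", suffixes) if c.strip()]
    if not codes:
        return [base]
    return [f"{base} - {code}" for code in codes]
```

This would break on base names that themselves contain a hyphen, so it is only a starting point.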

At this point, it still needs to be mapped to a master record...

I've tried several combinations of fuzzywuzzy matching methods, Levenshtein distance measurements, regex, etc., and can't find a reliable way to logically associate the different naming styles of a single item with its master name.
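
For reference, the kind of fuzzy lookup in question can be sketched with difflib from the standard library. This is a stand-in for fuzzywuzzy, not its API; `MASTER_NAMES`, `closest_master`, and the 0.6 cutoff are assumptions chosen for illustration:

```python
import difflib

# Hypothetical list of master names.
MASTER_NAMES = ["Basketball", "Football"]


def closest_master(raw, masters=MASTER_NAMES, cutoff=0.6):
    """Return the most similar master name, or None if nothing clears the cutoff."""
    # Compare case-insensitively, then map the hit back to its original spelling.
    by_key = {m.lower(): m for m in masters}
    hits = difflib.get_close_matches(raw.lower(), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None
```

A single similarity cutoff like this handles truncations such as FootB reasonably well, but short abbreviations and suffix codes are where it becomes unreliable.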

I'm throwing my hands up in desperation. Are there any existing Python resources that can help sort out my problem? Are there other options? Can anybody point out an obvious option that I might have overlooked?

Basically, any suggestion, solution, resource or alternative method is greatly appreciated.
