Technique to remove common words (and their plural versions) from a string

Posted by Jake M on Stack Overflow
Published on 2012-03-31T06:29:15Z Indexed on 2012/04/07 23:29 UTC

I am attempting to find tags (keywords) for a recipe by parsing a long string of text. The text contains the recipe's ingredients, directions and a short blurb.

What do you think would be the most efficient way to remove common words from the tag list?

By common words, I mean words like 'the', 'at', 'there', 'their', etc.

I have two methodologies I can use. Which do you think is more efficient in terms of speed, and do you know of a more efficient way I could do this?

Methodology 1:
- Determine the number of times each word occurs (using Counter from the collections module).
- Keep a list of common words and remove each one from the Counter object by deleting its key, if it exists.
- The speed will therefore be determined by the length of the list delims.

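from collections import Counter

def tag_counts(recipe_str):  # wrapper name is illustrative; the return needs a function
    delims = ['there', "there's", 'theres', 'they', "they're"]
    # the above will end up being a really long list!
    word_freq = Counter(recipe_str.lower().split())
    for delim in set(delims):
        del word_freq[delim]  # Counter does not raise KeyError for a missing key
    return word_freq.most_common()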
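
As a possible variation on Methodology 1, the common words could be kept in a set and skipped while counting, instead of being deleted afterwards. This is only a minimal sketch: the names STOP_WORDS and count_tags are hypothetical, and the stop list shown is a tiny placeholder.

```python
from collections import Counter

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = frozenset(
    ["the", "at", "there", "their", "there's", "theres", "they", "they're"]
)

def count_tags(recipe_str):
    """Count the words in recipe_str, skipping anything in STOP_WORDS."""
    words = recipe_str.lower().split()
    # Set membership is O(1) on average, so this is a single pass over the words.
    return Counter(w for w in words if w not in STOP_WORDS).most_common()
```

With this layout the work grows with the number of words in the recipe rather than with the length of the stop list, which may matter once the list gets long.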

Methodology 2:
- For common words that can take several forms, look at each word in the recipe string and check whether it contains the base form of a common word. E.g., for the string "There's a test", check each word to see if it contains "there" and delete it if it does.

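from collections import Counter

def tag_counts_partial(recipe_str):  # wrapper name is illustrative
    delims = ['this', 'at', 'them']  # words that can't be plural
    partial_delims = ['there', 'they']  # words that could occur in many forms
    word_freq = Counter(recipe_str.lower().split())
    for delim in set(delims):
        del word_freq[delim]
    # really slow: every remaining word is tested against every partial delim
    for delim in set(partial_delims):
        for word in list(word_freq):  # iterate over a copy so deleting is safe
            if delim in word:
                del word_freq[word]  # delete the matching word, not the delim
    return word_freq.most_common()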
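
One caveat with substring matching is that it also removes unrelated words: "therefore" contains "there", for example. A possible alternative, sketched here under assumptions, is to strip a small set of known suffixes and then do an exact set lookup. The names BASE_STOP_WORDS, SUFFIXES, PUNCTUATION, is_common and count_tags_normalized are all hypothetical, and the suffix list is only a guess at the forms of interest.

```python
from collections import Counter

# Hypothetical base forms; a real list would be much longer.
BASE_STOP_WORDS = frozenset(["this", "at", "them", "there", "they"])
# Hypothetical suffixes that turn a base form into a variant.
SUFFIXES = ("'re", "'s", "s")
# Punctuation stripped from both ends of each word.
PUNCTUATION = ".,;:!?\"()"

def is_common(word):
    """True if word, or word minus one known suffix, is a base stop word."""
    if word in BASE_STOP_WORDS:
        return True
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in BASE_STOP_WORDS:
            return True
    return False

def count_tags_normalized(recipe_str):
    """Count words in recipe_str, skipping stop words and their variants."""
    words = (w.strip(PUNCTUATION) for w in recipe_str.lower().split())
    return Counter(w for w in words if w and not is_common(w)).most_common()
```

Each word is checked with a few set lookups, so the nested loop over every word and every partial delim is avoided, and a word like "therefore" is kept.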

© Stack Overflow or respective owner