translate by replacing words inside existing text

Posted by Berry Tsakala on Stack Overflow
Published on 2010-02-08T21:13:10Z Indexed on 2011/01/01 14:53 UTC

What are common approaches for translating certain words (or expressions) inside a given text, when the text must be reconstructed afterwards (with punctuation and everything)?

The translation comes from a lookup table, and covers words, collocations, and emoticons like L33t, CUL8R, :-), etc.

Simple string search-and-replace is not enough, since it can also replace part of a longer word (with cat > dog, caterpillar becomes dogerpillar).
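
To make the pitfall concrete, here is a minimal sketch. The sample sentence is made up for illustration; only the cat > dog pair comes from the example above.

```python
import re

# Hypothetical sample sentence, not part of the original data.
text = "my cat saw a caterpillar"

# Plain substring replacement also rewrites the start of "caterpillar".
naive = text.replace("cat", "dog")

# \b anchors the match at word boundaries, so "caterpillar" is left alone.
bounded = re.sub(r"\bcat\b", "dog", text)

print(naive)    # my dog saw a dogerpillar
print(bounded)  # my dog saw a caterpillar
```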

Assume the following input:

s = "dogbert, started a dilbert dilbertion proces cat-bert :-)"

After translation, I should receive something like:

result = "anna, started a george dilbertion process cat-bert smiley"

I can't simply tokenize, since I would lose punctuation and word positions.

Regular expressions work for normal words, but they don't catch special expressions like the smiley :-).

import re
re.sub(r'\bword\b', 'translation', s)  # ==> translation
re.sub(r'\b:-\)\b', 'smiley', s)       # ==> :-) (no match, so the smiley is left unchanged)
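
The reason the second pattern never matches can be checked directly. This is a small sketch using the input string from above: \b only matches between a word character (\w) and a non-word character, and the smiley consists entirely of non-word characters.

```python
import re

s = "dogbert, started a dilbert dilbertion proces cat-bert :-)"

# ":" is preceded by a space and ")" is followed by the end of the string,
# so there is no word character on either side and neither \b can match.
with_boundaries = re.search(r"\b:-\)\b", s)

# Without the \b anchors the smiley is found.
without_boundaries = re.search(r":-\)", s)

print(with_boundaries)              # None
print(without_boundaries.group(0))  # :-)
```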

For now I'm using the regex above for normal words and a simple replace for the non-alphanumeric expressions, but it's far from bulletproof.
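
One possible direction, sketched under assumptions rather than offered as a definitive solution: compile the whole lookup table into a single alternation and replace \b with the lookarounds (?<!\w) and (?!\w), which only require that no word character touches the match. The table contents are inferred from the example input and output, and the names table and translate are hypothetical.

```python
import re

# Hypothetical lookup table, inferred from the example input and result.
table = {
    "dogbert": "anna",
    "dilbert": "george",
    "proces": "process",
    ":-)": "smiley",
}

# Longest keys first, so a longer expression wins over its own prefix.
keys = sorted(table, key=len, reverse=True)

# (?<!\w) and (?!\w) assert that no word character is adjacent to the match.
# Unlike \b, they also succeed around punctuation-only keys such as ":-)".
pattern = re.compile(
    r"(?<!\w)(?:%s)(?!\w)" % "|".join(re.escape(k) for k in keys)
)

def translate(text):
    # Replace every matched key with its table entry; all other text,
    # including punctuation and spacing, is left untouched.
    return pattern.sub(lambda m: table[m.group(0)], text)

s = "dogbert, started a dilbert dilbertion proces cat-bert :-)"
result = translate(s)
print(result)  # anna, started a george dilbertion process cat-bert smiley
```

Caveats of this sketch: matching is case-sensitive, and a key such as "cat" would still be replaced inside "cat-bert", because "-" is not a word character. Whether hyphenated compounds count as one word has to be decided separately.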

(P.S. I'm using Python.)

© Stack Overflow or respective owner
