Splitting string on probable English word boundaries

Posted by Sean on Stack Overflow
Published on 2010-02-13T18:26:52Z Indexed on 2010/05/10 1:38 UTC

Filed under: text-analysis

I recently used Adobe Acrobat Pro's OCR feature to process a Japanese kanji dictionary. The overall quality of the output is generally quite a bit better than I'd hoped, but word boundaries in the English portions of the text have often been lost. For example, here's one line from my file:

softening;weakening(ofthemarket)8 CHANGE [transform] oneselfINTO,takethe form of; disguise oneself

I could go through the file and insert the missing word boundaries by hand, but that would add to what is already a substantial task. I'm hoping there is software that can analyze text like this, where some of the words run together, and split it on probable word boundaries. Is there such a package?

I'm using Emacs, so it'd be extra-sweet if the package in question were already an Emacs package or could be readily integrated into Emacs, so that I could simply put my cursor on a line like the above and repeatedly invoke some command that splits the line on word boundaries in decreasing order of probable correctness.
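
To make the request concrete, here is a minimal sketch of one common way to do this kind of splitting: score every possible split of a run-together token with a unigram word-frequency model and keep the k best, so the candidates come out in decreasing order of probable correctness. This is not an existing package. The names `WORD_COUNTS`, `word_logprob` and `segment` are made up for illustration, and the tiny word list is a stand-in for counts that would have to come from a large English corpus.

```python
import heapq
import math

# Hypothetical toy counts, for illustration only. A real run would load
# word frequencies from a large English corpus.
WORD_COUNTS = {
    "of": 500, "the": 1000, "market": 50, "them": 90, "mark": 20,
    "take": 80, "form": 40, "into": 120, "oneself": 10, "one": 200,
    "self": 30,
}
TOTAL = sum(WORD_COUNTS.values())


def word_logprob(word):
    """Log-probability of a single word under the unigram model."""
    count = WORD_COUNTS.get(word)
    if count is not None:
        return math.log(count / TOTAL)
    # Unknown words get a penalty that grows with their length, so long
    # unknown chunks lose to sequences of known words.
    return -math.log(TOTAL) - len(word) * math.log(10)


def segment(text, k=3, max_word_len=20):
    """Return up to k segmentations of text, most probable first."""
    # best[i] holds the k best (score, words) pairs for the prefix text[:i].
    best = [[(0.0, ())]]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_word_len), i):
            piece = text[j:i]
            # Score on the lowercased piece, but keep the original casing.
            logprob = word_logprob(piece.lower())
            for score, words in best[j]:
                candidates.append((score + logprob, words + (piece,)))
        best.append(heapq.nlargest(k, candidates))
    return [" ".join(words) for _, words in best[-1]]


print(segment("ofthemarket", k=1))
print(segment("oneselfINTO", k=2))
```

With the toy counts above, the first call picks "of the market", and the second ranks "oneself INTO" ahead of "one self INTO". Punctuation and digits are not handled here; one option is to split a line on non-letter characters first and pass only the alphabetic runs to `segment`. An Emacs command could shell out to a script like this and cycle through the returned candidates.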
