Fuzzy Regex, Text Processing, Lexical Analysis?

Posted by justinzane on Stack Overflow
Published on 2012-05-27T23:23:16Z

Filed under: text-processing

I'm not quite sure what terminology to search for, so my title is funky... Here is the workflow I've got:

Semi-structured documents are scanned to file.
The files are OCR'd to text.
The text is parsed into Python objects.
The objects are serialized (to SQL, JSON, whatever) for use.
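
To make the last two steps concrete, here is a minimal sketch of what the parsed objects and their serialization could look like. The class names `Question` and `Choice`, their fields, and the helper `to_json` are hypothetical and chosen only for illustration; the post does not say how the objects are actually modeled.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Choice:
    label: str  # e.g. "A"
    text: str


@dataclass
class Question:
    number: int
    text: str
    choices: List[Choice] = field(default_factory=list)


def to_json(questions):
    """Serialize a list of Question objects to a JSON string."""
    # asdict() recurses into nested dataclasses, including those inside lists.
    return json.dumps([asdict(q) for q in questions], indent=2)


example = Question(
    number=1,
    text="Question Text... continued until now.",
    choices=[Choice("A", "Choice text..."), Choice("B", "Another Choice...")],
)
print(to_json([example]))
```

The same dictionaries produced by `asdict` could just as well be written to SQL rows; JSON is used here only because it needs nothing beyond the standard library.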

The documents are structured like this:

HEADER blah blah, Page ###

blah

Garbage text...

1. Question Text...

continued until now. A. Choice text...

adsadsf. B. Another Choice...

2. Another Question...

I need to extract the questions and choices. The problem is that, because the text is OCR output, there are occasional strange substitutions like '2' -> 'Z', which make ordinary regular expressions useless. I've tried the Levenshtein module and it helps, but it requires prior knowledge of what edit distance to expect.
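
On the edit-distance point, one possible workaround is a length-relative similarity score instead of an absolute distance, so that a single threshold can serve both short and long strings. The sketch below uses `difflib.SequenceMatcher` from the standard library, which is a different measure from Levenshtein distance. The header text is taken from the sample layout above; the 0.8 threshold and the function names are assumptions, and a threshold still has to be chosen by experiment.

```python
from difflib import SequenceMatcher

# Assumed values for illustration only; the threshold is not from the post.
HEADER_TEMPLATE = "HEADER blah blah, Page"
THRESHOLD = 0.8


def similarity(a, b):
    """Return a score between 0.0 and 1.0 that is relative to string length."""
    return SequenceMatcher(None, a, b).ratio()


def looks_like_header(line):
    """Compare the start of a line with the known header text."""
    prefix = line.strip()[:len(HEADER_TEMPLATE)]
    return similarity(HEADER_TEMPLATE, prefix) >= THRESHOLD


for line in ["HEADFR b1ah blah, Page 12", "Garbage text..."]:
    print(repr(line), looks_like_header(line))
```

Because the score is a ratio, two wrong characters in a 22-character header count for much less than two wrong characters in a 3-character question marker, which is roughly the behaviour an absolute edit distance lacks.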

I don't know whether I'm looking to create a parser, a lexer, or something else. This has led me down all kinds of interesting but irrelevant paths. Guidance would be greatly appreciated. Oh, also, the text is generally from specific technical domains, so general spelling tools are not so helpful.
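
For reference, this is roughly what a lexer would mean for this layout: a pass that cuts the text into question markers, choice markers, and the text in between, leaving the assembly into objects to a later step. It is a sketch only. The token names, the `tokenize` function, and the assumption that choices are labeled A through E are all hypothetical, and the patterns are deliberately strict, so they show the token structure rather than any OCR tolerance.

```python
import re

# Strict patterns on purpose: no allowance for OCR damage here.
TOKEN_RE = re.compile(
    r"(?P<QUESTION>^\s*\d{1,3}\.\s+)"   # "1. " at the start of a line
    r"|(?P<CHOICE>(?<!\S)[A-E]\.\s+)",  # "A. " after whitespace or at line start
    re.MULTILINE,
)


def tokenize(text):
    """Yield (kind, value) pairs: QUESTION, CHOICE, or the TEXT between markers."""
    pos = 0
    for match in TOKEN_RE.finditer(text):
        between = text[pos:match.start()].strip()
        if between:
            yield ("TEXT", between)
        # Reduce a marker such as "1. " or "A. " to its bare label.
        yield (match.lastgroup, match.group().strip().rstrip("."))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        yield ("TEXT", tail)


sample = (
    "1. Question Text...\n"
    "continued until now. A. Choice text...\n"
    "adsadsf. B. Another Choice...\n"
    "2. Another Question...\n"
)
for token in tokenize(sample):
    print(token)
```

A parser in this sense would then be the small loop that walks these tokens and groups each QUESTION with the TEXT and CHOICE tokens that follow it.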

Regarding the structure of the documents, there is no clear visual pattern, like line breaks or indentation, except that "questions" usually begin a line. Crap on the document can cause characters to appear before the actual beginning of the line, which means that something along the lines of r'^[0-9]+' does not reliably work.

Though the "questions" always begin with an int, a period and a space, the OCR can substitute other characters or skip characters. This is not so much a problem with Tesseract or Cuneiform, but rather with the poor quality of the paper documents.
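
As a rough illustration of how the marker pattern could be loosened, the sketch below allows a few junk characters before the number, accepts common digit lookalikes, and tolerates a misread or missing period. Since a pattern this loose also matches ordinary words, it only accepts a candidate whose number continues the question sequence. Everything specific here is an assumption rather than something from the documents: the lookalike set, the limit of 3 junk characters, the `max_skip` parameter, and the function name `find_questions`.

```python
import re

# Assumptions for this sketch: up to 3 junk characters may precede the number,
# the number is 1-3 characters long, and the period may come out as another
# punctuation mark or be dropped entirely.
LOOKALIKES = str.maketrans("OoIl|ZzSsB", "0011122558")
QUESTION_START = re.compile(
    r"^(?P<junk>.{0,3}?)"             # stray marks before the number
    r"(?P<num>[0-9OoIl|ZzSsB]{1,3})"  # digits or their OCR lookalikes
    r"(?P<dot>[.,:;]?)\s+"            # period, possibly misread or missing
    r"(?P<text>\S.*)$"
)


def find_questions(lines, max_skip=2):
    """Yield (number, text) for lines that plausibly start the next question."""
    expected = 1
    for line in lines:
        match = QUESTION_START.match(line)
        if not match:
            continue
        number = int(match.group("num").translate(LOOKALIKES))
        # Accept only numbers that continue the sequence; this rejects ordinary
        # words such as "So" that happen to consist of lookalike characters.
        if expected <= number <= expected + max_skip:
            yield number, match.group("text")
            expected = number + 1


lines = [
    "HEADER blah blah, Page 12",
    "Garbage text...",
    "1. Question Text...",
    "continued until now. A. Choice text...",
    "adsadsf. B. Another Choice...",
    "~Z, Another Question...",
]
for number, text in find_questions(lines):
    print(number, text)
```

The `max_skip` allowance is there so that one question whose marker is damaged beyond recognition does not stall the sequence for the rest of the document; it trades a little precision for that robustness.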


Note: for the project in question, it was decided that having a human prep the OCR'd text was better than spending the time coding a solution. I'd still love good pointers, however.
