Fuzzy Regex, Text Processing, Lexical Analysis?

Posted by justinzane on Stack Overflow
Published on 2012-05-27T23:23:16Z Indexed on 2012/06/06 4:40 UTC

I'm not quite sure what terminology to search for, so my title is funky... Here is the workflow I've got:

  1. Semi-structured documents are scanned to file. The files are OCR'd to text.
  2. The text is parsed into Python objects.
  3. The objects are serialized (to SQL, JSON, whatever) for use.
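
To make steps 2 and 3 concrete, here is a minimal sketch of what the target objects and their serialization might look like. The `Question` class, its fields, and the `to_json` helper are hypothetical names chosen for illustration; the real objects may well be shaped differently.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Question:
    """One parsed question (hypothetical shape, not the actual project class)."""
    number: int
    text: str
    # Maps a choice label to its text, e.g. {"A": "Choice text..."}.
    choices: dict = field(default_factory=dict)


def to_json(questions):
    """Serialize a list of Question objects to a JSON string."""
    return json.dumps([asdict(q) for q in questions], indent=2)


if __name__ == "__main__":
    q = Question(1, "Question Text...", {"A": "Choice text...", "B": "Another Choice..."})
    print(to_json([q]))
```

Keeping the parsed result in a plain dataclass like this means the serialization target (SQL, JSON, whatever) can be swapped without touching the parsing step.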

The documents are structured like this:

HEADER blah blah, Page ###

blah

Garbage text...

1. Question Text...

continued until now. A. Choice text...

adsadsf. B. Another Choice...

2. Another Question...

I need to extract the questions and choices. The problem is that, because the text is OCR output, there are occasional strange substitutions like '2' -> 'Z', which make ordinary regular expressions useless. I've tried the Levenshtein module and it helps, but it requires knowing in advance what edit distance to expect.
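
One way to sidestep a fixed edit-distance cutoff might be to divide the distance by the length of the longer string, so that a single threshold works for short and long strings alike. The sketch below uses a pure-Python edit distance rather than the Levenshtein module so that it is self-contained; the `similarity` helper and the idea of comparing against a known template string are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Length-normalized similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Compare an OCR'd fragment with the text it was supposed to be.
    print(similarity("Page 12", "Page lZ"))
```

A relative score still needs a threshold, but that threshold no longer depends on how long the compared strings are.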

I don't know whether I'm looking to create a parser, a lexer, or something else. This has led me down all kinds of interesting but irrelevant paths. Guidance would be greatly appreciated. Oh, also, the text is generally from specific technical domains, so general spelling tools are not so helpful.

Regarding the structure of the documents, there is no clear visual pattern -- like line breaks or indentation -- with the exception of the fact that "questions" usually begin a line. Crap on the document can cause characters to appear before the actual beginning of the line, which means that something along the lines of r'^[0-9]+' does not reliably work.

Though the "questions" always begin with an int, a period and a space, the OCR can substitute other characters or skip characters. This is not so much a problem with Tesseract or CuneiForm as with the poor quality of the paper documents.
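
To illustrate the kind of tolerance that would be needed, here is a rough sketch of a more forgiving question-start pattern using only the standard `re` module. The lookalike table, the allowance of up to three junk characters, and the `parse_question_start` helper are all assumptions made for the example, not something derived from the real documents.

```python
import re

# Assumed confusion table (hypothetical; tune it to the actual OCR output).
# "B" -> "8" is deliberately left out because "B." is a valid choice label.
LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "Z": "2", "z": "2", "S": "5"}
)

QUESTION_START = re.compile(
    r"^.{0,3}?"             # up to three junk characters before the number
    r"([0-9OolIZzS]{1,3})"  # one to three digits or digit lookalikes
    r"\s?[.,]"              # a period (or a comma misread), maybe after a stray space
    r"\s+(.*)$"             # whitespace, then the start of the question text
)


def parse_question_start(line):
    """Return (number, text) if the line looks like a question start, else None."""
    match = QUESTION_START.match(line)
    if match is None:
        return None
    number = int(match.group(1).translate(LOOKALIKES))
    return number, match.group(2)


if __name__ == "__main__":
    for line in ["Z. Another Question...", ";' 1. Question Text...", "blah"]:
        print(repr(line), "->", parse_question_start(line))
```

This trades missed questions for false positives: a line beginning with "So, " would be read as question 50. Checking that the extracted numbers increase from one question to the next might filter most of those out.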


Note: for the project in question, it was decided that having a human prep the OCR'd text was better than spending the time coding a solution. I'd still love good pointers, however.

© Stack Overflow or respective owner
