How to preprocess text to do OCR error correction

Posted by eaglefarm on Stack Overflow
Published on 2010-04-28T01:18:00Z

Filed under: ocr | crc | forwarderrorcorrection | error-correction

Here is what I'm trying to accomplish: I need to get several large text files off a computer that is not networked and has no output device other than a printer. I tried printing the text and then scanning the printout with OCR to recover the text on another computer, but the OCR makes lots of errors (1 vs l, o vs 0, O vs D, etc.).

To solve this, I am thinking of writing a program to process (annotate?) the text file before printing it, so that the errors can be corrected in the text output of the OCR program. For example, to handle 1 (number one) vs l (letter L), I could change the text like this:

sample

by inserting \nnn after each character that the OCR frequently gets wrong, which gives:

sampl\108e

Then I can write another program that examines the OCR output, looks for each \nnn (where nnn is the ASCII code in decimal), checks the character before it, and fixes that character if necessary. Of course, the program will have to recognize that the \nnn itself may contain errors too, but at least it knows that nnn must be digits, so it can easily correct them.
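A minimal Python sketch of the annotate/correct pair described above, to make the idea concrete. The names `CONFUSABLE`, `DIGIT_FIX`, `annotate` and `correct` are hypothetical, the set of confusable characters is only a guess based on the errors listed in the question, and the sketch assumes the OCR reads the backslash itself reliably and that the original text contains no backslashes of its own.

```python
import re

# Assumption: characters the OCR tends to confuse; tune this to the real errors.
CONFUSABLE = set("1lI0OoD")

# Assumption: inside a \nnn tag only digits are legal, so look-alike
# letters can be mapped back to the digit they most resemble.
DIGIT_FIX = str.maketrans("lIOoD", "11000")

# One character followed by a backslash and exactly three digit-like characters.
_TAG = re.compile(r"(.)\\([0-9lIOoD]{3})", re.DOTALL)


def annotate(text):
    """Append \\nnn (3-digit decimal character code) after each confusable character."""
    return "".join(
        f"{ch}\\{ord(ch):03d}" if ch in CONFUSABLE else ch
        for ch in text
    )


def correct(ocr_text):
    """Replace each tagged character with the character its tag names, dropping the tag."""
    def fix(match):
        digits = match.group(2).translate(DIGIT_FIX)  # repair misread digits
        return chr(int(digits))
    return _TAG.sub(fix, ocr_text)


print(annotate("sample"))        # sampl\108e
print(correct("samp1\\108e"))    # sample
```

Using a fixed width of three digits keeps the tag unambiguous when the next character of the text is itself a digit. A tag whose digits are misread as other digits (for example 3 read as 8) is not caught by this sketch and would need a separate check.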

I think I would add a CRC on each line so that any line that isn't corrected perfectly can be flagged as having a problem.
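The per-line CRC could look roughly like the following Python sketch. It uses `zlib.crc32` from the standard library; the ` |` separator, the 10-digit decimal format and the function names `add_crc` and `check_crc` are assumptions chosen for illustration (decimal is used so the checksum consists only of digits, like the \nnn codes).

```python
import re
import zlib

SEP = " |"  # assumption: a separator that does not occur at the end of real lines


def add_crc(line):
    """Return the line with its CRC-32 appended as 10 decimal digits."""
    crc = zlib.crc32(line.encode("utf-8"))
    return f"{line}{SEP}{crc:010d}"


def check_crc(tagged):
    """Split off the checksum and return (text, True if the checksum matches)."""
    text, sep, digits = tagged.rpartition(SEP)
    if not sep or not re.fullmatch(r"[0-9]{10}", digits):
        return tagged, False  # checksum missing or unreadable: flag the line
    return text, zlib.crc32(text.encode("utf-8")) == int(digits)


tagged = add_crc("sample")
print(check_crc(tagged))                               # ('sample', True)
print(check_crc(tagged.replace("sample", "samp1e")))   # ('samp1e', False)
```

A CRC only detects that a line is still wrong after correction; it does not say where. Lines that fail the check would have to be fixed by hand or by trying likely substitutions until the checksum matches.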

Has anyone done anything like this? If there is an existing way of doing this, I'd rather not reinvent the wheel. Suggestions for an annotation format that would help solve this problem would be welcome too.

© Stack Overflow or respective owner
