SharePoint OCR image files indexing

Posted on Microsoft .NET Support Team See other posts from Microsoft .NET Support Team
Published on Tue, 01 Dec 2009 09:54:00 +0000 Indexed on 2010/03/16 15:31 UTC
Read the original article Hit count: 711

Filed under:

sharepoint

Introduction

This article describes how to setup indexing of the image files (including TIFF, PDF, JPEG, BMP...) using OCR technology. The indexing described below utilizes Microsoft IFilter technology and as such is not specific to SharePoint, but can be used with any product that uses Microsoft indexing: Microsoft Search, Desktop search, SQL Server search, and through the plug-ins with Google desktop search. I however use it with Microsoft Windows SharePoint Services 2003. For those other products, the registration may need to be slightly different.

Background

One of the projects I was working on required a storage of old documents scanned into PDF files. Then there was a separate team of people responsible for providing a tags for a search engine so those image documents could be found. The whole process was clumsy, labor intensive, and error prone. That was what started me on my exploration path.

OCR

The first search I fired was for the Open Source OCR products. Pretty quickly, I narrowed it down to TESSERACT (http://code.google.com/p/tesseract-ocr/). Tesseract is an orphaned brain child of HP that worked on it from 1985 to 1995. Then it was moved to the Open Source, and now if I understand it correctly, Google is working on it. With credentials like that, it's no wonder that Tesseract scores one of the highest marks on OCR recognition and accuracy. After downloading and struggling just a bit, I got Tesseract to work. The struggling part was that the home page claims that its base input format is a TIFF file. May be my TIFFs were bad, but I was able to get it to work only for BMP files.

Image files conversion

So now that I have an OCR that can convert BMP files into text, how do I get text out of the image PDF files? One more search, and I settled down on ImageMagic (http://www.imagemagick.org/). This is another wonderful Open Source utility that can convert any file into image. It did work out of the box, converting any TIFF files into bitmaps, but to get PDF files converted, it requires a GhostScript (http://mirror.cs.wisc.edu/pub/mirrors/ghost/GPL/gs864/gs864w32.exe).

Dealing with text PDFs

With that utility installed, I was cooking - I can convert any file (in particular PDF and TIFF) into bitmap, and then I can extract the text out of the bitmap. The only consideration was to somehow treat PDF files containing text differently - after all, OCR is very computation intensive and somewhat error prone even with perfect image quality and resolution. So another quick search, and I have a PDFTOTEXT (ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip) - thank God for Open Source! With these guys, I can pull text out of PDF in an eye blink. However, I would get nothing for pure image PDFs, but I already have a solution for that!

Batch process

It took another 15 minutes to setup a batch script to automate the process:

Check the file extension
If file is a PDF file
- try to extract text out of it
- if there is more than certain amount of text in the file - done!
- if there is no text, convert first page into bitmap
- run OCR on the bitmap
For any other file type, convert file into bitmap
Run OCR on the bitmap

Once you unzip the attached project, check out the bin\OCR.BAT file. It will create a temporary file in the directory where your source file is with the same name + the '.txt' extension.Continue

Developer IT