How to know if a PDF contains only images or has been OCR scanned for searching?

Posted by Bratch on Stack Overflow
Published on 2009-09-28T22:45:42Z. Indexed on 2010/04/22 18:13 UTC.

Filed under: pdf | ocr | search | acrobat

I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.

I want to build an automated process that runs OCR on all of the scanned documents with Acrobat 8 Pro, but I don't want to re-OCR files that have already been through OCR. Is there a way to tell which files contain only images and which already contain searchable text?

I'm planning to do this in C# or VB.NET, but I don't think telling the two kinds of files apart is language-dependent.
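To make the distinction concrete, here is a minimal sketch of one possible heuristic, written in Python only for brevity; the same logic should port to C# or VB.NET. It assumes that an OCRed PDF draws its (usually invisible) text with the standard text-showing operators `Tj` or `TJ` in some content stream, while an image-only PDF has content streams that merely place images. The names `has_text_layer` and `looks_like_content_stream` are hypothetical. The sketch only decodes Flate-compressed streams and does not handle encrypted files or other stream filters, so treat its answer as a hint, not a guarantee. Extracting text per page with a full PDF library would likely be more robust.

```python
import re
import zlib

# Body of every "stream ... endstream" section in the raw file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A string operand followed by a text-showing operator:
# (..) Tj, <..> Tj, or [..] TJ
TEXT_SHOW_RE = re.compile(rb"[)>]\s*Tj\b|\]\s*TJ\b")
# Printable ASCII plus tab, line feed, and carriage return.
PRINTABLE = frozenset(range(32, 127)) | {9, 10, 13}


def looks_like_content_stream(data: bytes) -> bool:
    """Assumption: content streams are mostly ASCII; image data is not."""
    sample = data[:2048]
    if not sample:
        return False
    printable = sum(1 for byte in sample if byte in PRINTABLE)
    return printable / len(sample) > 0.9


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to draw text."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Most content streams are Flate-compressed.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data; fall back to the raw bytes.
            data = raw
        if (looks_like_content_stream(data)
                and b"BT" in data
                and TEXT_SHOW_RE.search(data)):
            return True
    return False
```

Reading each file in binary mode and passing its bytes to `has_text_layer` would then split a folder into files to send to OCR (result `False`) and files to skip (result `True`). Since the check is per file, a document in which only some pages were OCRed would still be reported as having text.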

© Stack Overflow or respective owner
