Stripping Non-Text from a Scanned, OCR'd PDF

Posted by Daniel S. on Super User
Published on 2011-12-19T05:19:09Z Indexed on 2012/04/09 11:33 UTC

I have a PDF created from a scanned document, with OCR applied to recognize the text. In Acrobat, if I select text and click 'copy with formatting', I can paste the formatted text into Word, so it seems the document embeds font and color information (and possibly size) in addition to the plain text.

Is there any way to use this information to create a PDF that contains only the formatted OCR'd text, without the scanned image? Currently, my document shows only the scanned image, and the text sits on an invisible layer. I would like to produce a PDF that removes the scanned image and displays the formatted text that is currently hidden.

The following post has a section on "How can we make the invisible text visible?": "PDF has an extra blank in all words after running through Ghostscript".

However, doing this does not reproduce the correct text formatting (which is retained when pasting into Word). I would also like to remove the scanned image, so that the final PDF contains only formatted text (color, font, size) in vector fonts, and no images.
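
To make the goal concrete, here is a rough sketch of the two changes I am after, written against a single page content stream. It assumes the stream has already been decompressed into plain bytes by some other tool, that the OCR layer is hidden with the PDF text rendering mode operator `3 Tr`, and that the scan is painted with an XObject `Do` operator. The function name and the sample stream are made up for illustration. It does nothing about recovering the formatting, and a regex like this can also hit form XObjects or text inside string literals, so it is not a general solution.

```python
import re

# "3 Tr" sets text rendering mode 3 (invisible text) in a PDF content stream.
# The lookbehind avoids matching operands such as "13 Tr" or "0.3 Tr".
INVISIBLE_TR = re.compile(rb"(?<![\w.])3\s+Tr\b")

# "/Name Do" paints an XObject. Assumption: on these pages the only
# XObject painted is the scanned page image.
PAINT_XOBJECT = re.compile(rb"/[^\s/\[\]<>()]+\s+Do\b")


def reveal_text_and_drop_images(stream: bytes) -> bytes:
    """Switch invisible text to fill mode and remove XObject paint calls."""
    stream = INVISIBLE_TR.sub(b"0 Tr", stream)  # mode 0 = fill text
    stream = PAINT_XOBJECT.sub(b"", stream)     # stop painting the scan
    return stream


# Hypothetical, hand-written content stream for one page.
sample = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT 3 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET\n"
)
cleaned = reveal_text_and_drop_images(sample)
print(cleaned.decode("latin-1"))
```

Note that this only stops the image from being painted; the image object itself would still be stored in the file unless it is also removed from the page resources.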

© Super User or respective owner