Unreadable PDF files

Posted by Michal_R on Stack Overflow
Published on 2010-05-28T01:36:40Z Indexed on 2010/05/28 1:41 UTC

Hello,

I am writing my Master's thesis, an NLP system. One of its components is an extractor that pulls plain text out of PDF files. For a few PDF files the text cannot be extracted correctly. The extractor (which uses the PDFBox library) returns a string like this:

"¦xDn¦if|d+gDF"Ti&cD+lh d FÁhis~n +xd f«"d¦ffih »h"

or

"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked every file that causes this extraction problem, and in all of them the text cannot be copied and pasted from a PDF reader either (I tried Adobe Reader and Foxit Reader). The files display correctly in these readers, but when I select the content and copy it to the clipboard I get the same wrong text as described above: strings of semantically meaningless characters, or strings of digits and letters.
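To separate the affected files from the good ones automatically, a rough check on the extracted string might be enough. The following is only a minimal sketch and not part of PDFBox: the class name `GarbledTextCheck`, the method names, and the 0.5 threshold are all assumptions made for illustration. It counts how many whitespace-separated tokens consist purely of letters and flags the text when that share is low. It does not repair the extraction, it only detects suspicious output.

```java
public class GarbledTextCheck {

    // Assumed cut-off: below this share of word-like tokens the text is flagged.
    // The value is a guess and would need tuning on real documents.
    static final double MIN_WORD_RATIO = 0.5;

    // Returns the fraction of whitespace-separated tokens that contain only
    // letters once common leading/trailing punctuation is stripped.
    static double wordRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0; // empty output is treated as unusable as well
        }
        String[] tokens = trimmed.split("\\s+");
        int wordLike = 0;
        for (String token : tokens) {
            String core = token.replaceAll("^[.,;:!?()\"']+|[.,;:!?()\"']+$", "");
            if (!core.isEmpty() && core.chars().allMatch(Character::isLetter)) {
                wordLike++;
            }
        }
        return (double) wordLike / tokens.length;
    }

    // Heuristic only: text with many numbers, formulas, or code will also be flagged.
    static boolean looksGarbled(String text) {
        return wordRatio(text) < MIN_WORD_RATIO;
    }
}
```

Calling `GarbledTextCheck.looksGarbled(...)` on the string returned by the extractor would then mark files like the two examples above for manual inspection, while ordinary sentences pass.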

Could anybody help me? Thanks :)
