Unreadable PDF files

Posted by Michal_R on Stack Overflow
Published on 2010-05-28T01:36:40Z Indexed on 2010/05/28 1:41 UTC

Filed under: pdf | pdfbox

Hello,

I am writing my Master's thesis, an NLP system. One of its components is an extractor that pulls plain text out of PDF files using the PDFBox library. A few PDF files cannot be extracted correctly; for those, the extractor returns a string like this:

"¦xDn¦if|d+gDF"Ti&cD+lh d FÁhis~n +xd f«"d¦ffih »h"

or

"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked each file that causes this extraction problem, and in all of them the text also cannot be copy-pasted from a PDF reader (Adobe Reader and Foxit Reader). The files display fine in these readers, but when I select the content and copy it to the clipboard I get the same wrong text as described above: strings of meaningless characters, or strings of digits and letters.
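To at least detect the affected files automatically, a rough heuristic like the sketch below might help. It does not repair the extraction; it only flags output that does not look like ordinary words, so those files can be set aside. The class name GarbledTextCheck, its methods, and the 0.5 threshold are all my own assumptions for illustration and are not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox: flags extracted text that
// does not look like natural-language words.
final class GarbledTextCheck {

    private GarbledTextCheck() {
    }

    /**
     * Returns the fraction of whitespace-separated tokens that consist
     * only of letters, optionally followed by one punctuation mark.
     * Blank input yields 0.0.
     */
    static double wordLikeRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int wordLike = 0;
        for (String token : tokens) {
            // \p{L} matches any Unicode letter, so accented text still counts.
            if (token.matches("\\p{L}+[.,;:!?]?")) {
                wordLike++;
            }
        }
        return (double) wordLike / tokens.length;
    }

    /**
     * Treats the text as garbled when fewer than minWordLikeRatio of its
     * tokens look like words. The threshold is an assumption and would
     * need tuning on real documents.
     */
    static boolean looksGarbled(String text, double minWordLikeRatio) {
        return wordLikeRatio(text) < minWordLikeRatio;
    }
}
```

The idea would be to call looksGarbled on the string returned by the extractor, with a threshold such as 0.5, and log the file name whenever it returns true. Text with many numbers, formulas, or hyphenated words would lower the ratio too, so this is only a first filter.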

Could anybody help me? Thanks :)

© Stack Overflow or respective owner
