Preserve "long" spaces in PDFBox text extraction

Posted by Thilo on Stack Overflow
Published on 2011-01-11T10:47:44Z

I am using PDFBox to extract text from a PDF. The PDF has a simple tabular structure, and the columns are very widely spaced from each other.

This works really well, except that all kinds of horizontal space get converted into a single space character, so I cannot tell columns apart anymore (the space between words within a column looks just like the space between columns).

I appreciate that a general solution is very hard, but in this case the columns are really far apart so that having a simple differentiation between "long spaces" and "space between words" would be enough.

Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
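
To make the proportional idea concrete, here is a minimal sketch in plain Java, with no PDFBox calls. It assumes I can somehow get each text run on a line together with its left and right x coordinates in PDF points (72 points per inch), for example from the position data PDFBox tracks per character. The names `GapSpacer`, `Fragment`, `renderLine` and `pointsPerSpace` are made up for illustration and are not part of PDFBox.

```java
import java.util.List;

/** Hypothetical helper: turns horizontal gaps into a proportional number of spaces. */
final class GapSpacer {

    /** One run of text on a line, with its horizontal extent in PDF points (1 inch = 72 points). */
    static final class Fragment {
        final String text;
        final double startX;
        final double endX;

        Fragment(String text, double startX, double endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Joins the fragments of one line, assumed to be sorted left to right.
     * Each gap becomes round(gap / pointsPerSpace) spaces, but never fewer than one,
     * so wide column gaps stay visibly wider than ordinary word gaps.
     */
    static String renderLine(List<Fragment> fragments, double pointsPerSpace) {
        StringBuilder sb = new StringBuilder();
        double prevEnd = Double.NaN; // NaN marks "no previous fragment yet"
        for (Fragment f : fragments) {
            if (!Double.isNaN(prevEnd)) {
                double gap = f.startX - prevEnd;
                int spaces = (int) Math.max(1L, Math.round(gap / pointsPerSpace));
                for (int i = 0; i < spaces; i++) {
                    sb.append(' ');
                }
            }
            sb.append(f.text);
            prevEnd = f.endX;
        }
        return sb.toString();
    }
}
```

With `pointsPerSpace` set to 6, a 4-point gap between two words stays a single space, while a 60-point gap between columns becomes 10 spaces. Splitting the resulting line on runs of two or more spaces would then recover the columns, provided the word gaps inside a column are always much narrower than the column gaps.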

