Preserve "long" spaces in PDFBox text extraction

Posted by Thilo on Stack Overflow
Published on 2011-01-11T10:47:44Z Indexed on 2011/01/11 10:54 UTC

Filed under: pdf | table | whitespace | text-extraction | pdfbox

I am using PDFBox to extract text from a PDF. The PDF has a simple tabular structure, and the columns are very widely spaced from each other.

This works really well, except that every kind of horizontal space gets converted into a single space character, so I can no longer tell columns apart (a space between words within a column looks just like the space between columns).

I appreciate that a general solution is very hard, but in this case the columns are really far apart so that having a simple differentiation between "long spaces" and "space between words" would be enough.

Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
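To make the proportional idea concrete, here is a minimal sketch of the behaviour I am after. It is plain Java and not PDFBox API: `Fragment`, `render`, `spaceWidth` and `longGap` are hypothetical names, and it assumes the x position and width of each word (in points, 1/72 inch) are already known, for instance from the per-glyph position data the text stripper works with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class GapSpacing {

    /** Hypothetical stand-in for one positioned word on a line (units: points). */
    static final class Fragment {
        final String text;
        final double x;      // left edge of the word
        final double width;  // rendered width of the word

        Fragment(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the words of one line. Gaps narrower than longGap become a single
     * space; wider gaps become gap / spaceWidth spaces (at least two), so that
     * column breaks stay distinguishable from ordinary word spacing.
     */
    static String render(List<Fragment> line, double spaceWidth, double longGap) {
        List<Fragment> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.x));

        StringBuilder out = new StringBuilder();
        boolean first = true;
        double prevEnd = 0.0;
        for (Fragment f : sorted) {
            if (!first) {
                double gap = f.x - prevEnd;
                int spaces = 1;
                if (gap >= longGap) {
                    spaces = Math.max(2, (int) Math.round(gap / spaceWidth));
                }
                for (int i = 0; i < spaces; i++) {
                    out.append(' ');
                }
            }
            out.append(f.text);
            prevEnd = f.x + f.width;
            first = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed values: a space is 3 pt wide, anything over half an inch (36 pt) is a column gap.
        List<Fragment> line = Arrays.asList(
                new Fragment("John", 72, 20),
                new Fragment("Smith", 95, 25),
                new Fragment("42", 300, 10));
        System.out.println("[" + render(line, 3.0, 36.0) + "]");
    }
}
```

With the assumed numbers in `main`, "John" and "Smith" are 3 pt apart and stay separated by one space, while the 180 pt gap before "42" is rendered as 60 spaces. Replacing the run of spaces with a tab or another delimiter would cover the "something other than a single space" variant.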

© Stack Overflow or respective owner
