pdfbox - Developer IT

Preserve "long" spaces in PDFBox text extraction

- by Thilo

I am using PDFBox to extract text from a PDF. The PDF has a simple tabular structure, and the columns are very widely spaced from each other. This works really well, except that every kind of horizontal space gets converted into a single space character, so I can no longer tell columns apart (space between words within a column looks just like space between columns). I appreciate that a general solution is very hard, but in this case the columns are so far apart that a simple distinction between "long spaces" and "space between words" would be enough. Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
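
The proportional idea from the question (a gap of x units becomes y spaces) can be sketched without any PDF library. This is a minimal sketch under assumptions: `GapSpacer`, `Fragment`, and `joinWithProportionalSpaces` are hypothetical names, each fragment is assumed to be a whole word with a known left edge and width, and `spaceWidth` is the width of one ordinary space in the same unit. In PDFBox the positions could plausibly be taken from `TextPosition` objects, but that wiring depends on the PDFBox version and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Rebuilds one text line, turning wide horizontal gaps into several spaces. */
final class GapSpacer {

    /** Hypothetical holder for one word: left edge, width, and text (all in the same unit). */
    static final class Fragment {
        final double x;
        final double width;
        final String text;

        Fragment(double x, double width, String text) {
            this.x = x;
            this.width = width;
            this.text = text;
        }
    }

    /**
     * Joins fragments that are already sorted left to right.
     * A gap of about one space width yields one space; wider gaps yield
     * proportionally more spaces, so distant columns stay visibly apart.
     */
    static String joinWithProportionalSpaces(List<Fragment> line, double spaceWidth) {
        StringBuilder out = new StringBuilder();
        double previousEnd = Double.NaN; // NaN marks "no fragment seen yet"
        for (Fragment f : line) {
            if (!Double.isNaN(previousEnd)) {
                double gap = f.x - previousEnd;
                // Always emit at least one space between words, even if they overlap.
                long spaces = Math.max(1L, Math.round(gap / spaceWidth));
                for (long i = 0; i < spaces; i++) {
                    out.append(' ');
                }
            }
            out.append(f.text);
            previousEnd = f.x + f.width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> line = new ArrayList<>();
        line.add(new Fragment(10, 30, "Widget"));  // ends at 40
        line.add(new Fragment(43, 20, "blue"));    // gap of 3: an ordinary word space
        line.add(new Fragment(163, 25, "12.50"));  // gap of 100: a column boundary
        System.out.println(joinWithProportionalSpaces(line, 3.0));
    }
}
```

With output like this, a later step can split columns on runs of two or more spaces, while single spaces still mean "same column".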

PDFBox: Problem with converting pdf page into image

- by user552910

My mission is pretty simple: convert every single page of a PDF file into an image. I tried the open-source version of ICEpdf to generate the images, but it does not render them with the correct font, so I started using PDFBox instead. The code is the following:

    PDDocument document = PDDocument.load(new File("testing.pdf"));
    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    for (int i = 0; i < pages.size(); i++) {
        PDPage singlePage = pages.get(i);
        BufferedImage buffImage = convertToImage(singlePage, 8, 12);
        ImageIO.write(buffImage, "png", new File(PdfUtil.DATA_OUTPUT_DIR + (count++) + ".png"));
    }

The font looks good, but the pictures within the PDF file look faded (see the attachment). I looked into the source code but still have no clue how to fix it. Does anyone have an idea what is going on? Please help. Thanks!
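
One thing that may be worth checking is what the literal arguments 8 and 12 mean. The post does not show the signature of its `convertToImage` helper, so this is only an assumption: if the 8 ends up as a `java.awt.image.BufferedImage` image type, it selects `TYPE_USHORT_565_RGB`, a 16-bit format that cannot store every 24-bit colour exactly. The sketch below (with the hypothetical class name `ImageTypeCheck`) uses only the JDK to show the difference; whether this actually explains the faded pictures is not established by the post.

```java
import java.awt.image.BufferedImage;

/** Shows how much colour precision survives in two BufferedImage types. */
final class ImageTypeCheck {

    /** Writes one RGB value into a 1x1 image of the given type and reads it back. */
    static int roundTrip(int imageType, int rgb) {
        BufferedImage img = new BufferedImage(1, 1, imageType);
        img.setRGB(0, 0, rgb);
        return img.getRGB(0, 0) & 0xFFFFFF; // drop the alpha byte
    }

    public static void main(String[] args) {
        int color = 0x123456;
        System.out.println("Is type 8 TYPE_USHORT_565_RGB? "
                + (BufferedImage.TYPE_USHORT_565_RGB == 8));
        System.out.printf("24-bit INT_RGB reads back: %06x%n",
                roundTrip(BufferedImage.TYPE_INT_RGB, color));
        System.out.printf("16-bit 565 reads back:     %06x%n",
                roundTrip(BufferedImage.TYPE_USHORT_565_RGB, color));
    }
}
```

Passing named constants such as `BufferedImage.TYPE_INT_RGB` instead of bare integers makes this kind of mismatch easier to spot.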

PDFBox Pagebreak strange Nullpointer exception

- by schneiti

I am currently trying to print text across multiple pages. To do this, I count the number of rows, and when they reach a fixed amount a method called pageBreak is executed. After the first page break, when I try to set a font using contentStream.setFont(PDType1Font.HELVETICA, 12); I get the following error message at that setFont line:

    java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.edit.PDPageContentStream.setFont(PDPageContentStream.java:321)
        at com.xy.deu.xy.abc.db.schemavergleich.PDFDocumenter.drawBGCS(PDFDocumenter.java:781)
        at com.xy.deu.xy.abc.db.schemavergleich.PDFDocumenter.createPDFDocumentation(PDFDocumenter.java:205)
        at com.xy.deu.xy.abc.db.schemavergleich.MainClass.createPDFandOutput(MainClass.java:361)
        at com.xy.deu.xy.abc.db.schemavergleich.MainClass.start(MainClass.java:231)
        at com.xy.deu.xy.abc.db.schemavergleich.MainClass.main(MainClass.java:180)

Below is the code that is executed when the error occurs.

    ...
    // If the table gets too long for a page -> page break:
    if (currentLines > 37) {
        pageBreak(currentLinePos); // TODO Currently causing app to crash
    }
    ...
    private void pageBreak(int currentLine) throws Exception {
        contentStream.endText();
        contentStream.close();
        // Create new page
        page = new PDPage(PDPage.PAGE_SIZE_A4);
        doc.addPage( page );
        // Create a new font object selecting one of the PDF base fonts
        font = PDType1Font.HELVETICA;
        // Start a new content stream which will "hold" the to be created content
        contentStream = new PDPageContentStream(doc, page);
        currentLines = 0;
        mediabox = page.findMediaBox();
        contentStream.beginText();
        contentStream.moveTextPositionByAmount(startX, startY);
        contentStream.setFont(PDType1Font.HELVETICA, 12);
    }

Now comes the strange thing: debugging shows nothing that is not set. I'll attach a screenshot taken right at the position where the error occurs:
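
The row-counting approach described above can be separated from the drawing calls, which makes the page-splitting decision testable without a PDF library. The sketch below is only that: `LinePaginator` and `paginate` are hypothetical names, and it does not diagnose the NullPointerException in the post. Each resulting sub-list would then be drawn with its own page and content stream.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a list of text lines into pages with a fixed maximum number of lines. */
final class LinePaginator {

    /** Returns the lines grouped into consecutive pages of at most linesPerPage entries. */
    static List<List<String>> paginate(List<String> lines, int linesPerPage) {
        if (linesPerPage <= 0) {
            throw new IllegalArgumentException("linesPerPage must be positive");
        }
        List<List<String>> pages = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerPage) {
            int end = Math.min(start + linesPerPage, lines.size());
            // Copy the slice so each page is independent of the source list.
            pages.add(new ArrayList<>(lines.subList(start, end)));
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 80; i++) {
            lines.add("row " + i);
        }
        List<List<String>> pages = paginate(lines, 37);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": " + pages.get(i).size() + " lines");
        }
    }
}
```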

How to use PDFBox 1.0 in .net / C# environment using IKVM

- by Evan

I'd like to use PDFBox to generate PDF highlight files in my .NET project. PDFBox states that it can be used in .NET via IKVM (http://www.pdfbox.org/userguide/dot_net.html), but running ikvmc (latest version) on PDFBOX.1.0.0.jar to generate the DLLs produces a whole lot of NoClassDefFound warnings. How should I fix this, and what other DLLs do I need to include in my project? It seems as though file names have changed since the older documentation and articles I have read on the matter. Thanks in advance.

Why do neither PDFBox nor PDFRenderer support "additional fonts"?

- by MemoryLeak

I have a PDF that contains the 'UniCNS-UCS2-H' font. I tried both PDFBox and PDFRenderer, and both throw an exception: Unknown encoding for 'UniCNS-UCS2-H'. The font is included in the font file mingliu.ttc (a TrueType collection file; I don't know whether that matters). What can I do to make these two libraries support additional fonts?

Using Java PDFBox library to write Russian PDF

- by Brad

I am using a Java library called PDFBox to write text to a PDF. It works perfectly for English text, but when I tried to write Russian text into the PDF the letters came out looking strange. The problem seems to be the font used, but I am not sure about that, so I hope someone can guide me through this. Here are the important lines of code:

    PDTrueTypeFont font = PDTrueTypeFont.loadTTF( pdfFile, new File( "fonts/VREMACCI.TTF" ) ); // Windows Russian font imported to write the Russian text.
    font.setEncoding( new WinAnsiEncoding() ); // Define the encoding used in writing.
    // Some code here to open the PDF & define a new page.
    contentStream.drawString( "??????? ????????????" ); // Write the Russian text.

Edit on 18 November 2009

After some investigation, I am now sure it is an encoding problem, which could be solved by defining my own encoding with the helpful PDFBox class DictionaryEncoding. I am not sure how to use it, but here is what I have tried so far:

    COSDictionary cosDic = new COSDictionary();
    cosDic.setString( COSName.getPDFName("Ercyrillic"), "0420 " ); // Russian letter.
    font.setEncoding( new DictionaryEncoding( cosDic ) );

This does not work; it seems I am filling the dictionary the wrong way, because a PDF page written this way appears blank.

Thanks.
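
The encoding suspicion can be checked outside PDFBox. WinAnsiEncoding in PDF is based on Windows code page 1252, which has no Cyrillic letters, so the sketch below uses the JDK charset `windows-1252` as a stand-in for it (an approximation, not PDFBox's own class). `EncodingCheck` and `canEncode` are hypothetical names. U+0420 is the Cyrillic capital letter Er, the same code point as the "0420" used in the post.

```java
import java.nio.charset.Charset;

/** Checks whether a string can be represented in a given single-byte charset. */
final class EncodingCheck {

    /** Returns true if every character of text has a mapping in the named charset. */
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String er = "\u0420"; // CYRILLIC CAPITAL LETTER ER
        // windows-1252 stands in for WinAnsiEncoding here (assumption).
        System.out.println("windows-1252 can encode Er: " + canEncode("windows-1252", er));
        // windows-1251 is the Windows Cyrillic code page, shown for contrast.
        System.out.println("windows-1251 can encode Er: " + canEncode("windows-1251", er));
        System.out.println("windows-1252 can encode Hello: " + canEncode("windows-1252", "Hello"));
    }
}
```

If a character has no slot in the encoding attached to the font, no choice of font file alone can make it appear, which is consistent with English working and Russian not.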

not readable PDF files

- by Michal_R

Hello, I am writing a master's thesis on an NLP system. One of its components is an extractor that pulls plain text out of PDF files. There are a few PDF files whose text cannot be extracted correctly. The extractor (PDFBox library) returns a string like this: "¦xDn¦if|d+gDF"Ti&cD+lh d FÁhis~n +xd f«"d¦ffih »h" or "10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17". I checked each file that causes this extraction problem, and the text of all of them also cannot be copy-pasted from a PDF reader (Adobe Reader and Foxit Reader). Viewing them in these readers works, but after selecting the content and copying it to the clipboard I get the same wrong text as described above (strings of semantically meaningless characters, or strings of digits and letters). Could anybody help me? Thanks :)
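
Since the affected files cannot be fixed by the extractor alone, a pipeline might at least flag them automatically. The sketch below is a rough heuristic, not a PDFBox feature: it measures what share of whitespace-separated tokens look like ordinary words. `GarbledTextCheck`, `wordLikeRatio`, `looksGarbled`, and the 0.5 threshold are all assumptions made for illustration and would need tuning on real data.

```java
import java.util.regex.Pattern;

/** Heuristic check for extracted text that is probably not real language. */
final class GarbledTextCheck {

    // A word-like token: letters only, optionally joined by an apostrophe or hyphen.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:['-]\\p{L}+)*");
    // Punctuation at the start or end of a token, e.g. a trailing full stop.
    private static final Pattern EDGE_PUNCT = Pattern.compile("^\\p{P}+|\\p{P}+$");

    /** Returns the fraction of tokens (0.0 to 1.0) that look like ordinary words. */
    static double wordLikeRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int wordLike = 0;
        for (String token : tokens) {
            String core = EDGE_PUNCT.matcher(token).replaceAll("");
            if (WORD.matcher(core).matches()) {
                wordLike++;
            }
        }
        return (double) wordLike / tokens.length;
    }

    /** Flags text whose word-like share falls below minRatio. */
    static boolean looksGarbled(String text, double minRatio) {
        return wordLikeRatio(text) < minRatio;
    }

    public static void main(String[] args) {
        String good = "It is extracting a plain text from PDF files.";
        String bad = "10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17";
        System.out.println("good text garbled? " + looksGarbled(good, 0.5));
        System.out.println("bad text garbled?  " + looksGarbled(bad, 0.5));
    }
}
```

Files flagged this way could then be routed to a different path, such as OCR, instead of entering the NLP stage with meaningless text.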

PDF to Image Conversion in Java

- by Geertjan

In the past, I created a NetBeans plugin for loading images as slides into NetBeans IDE. That means you had to manually create an image from each slide first. So, this time, I took it a step further. You can choose a PDF file, which is then automatically converted to an image for each page, each of which is presented as a node that can be clicked to open the slide in the main window. As you can see, the remaining problem is font rendering. Currently I'm using PDFBox. Any alternatives that render font better? This is the createKeys method of the child factory, ideally it would be replaced by code from some other library that handles font rendering better:

    @Override
    protected boolean createKeys(List<ImageObject> list) {
        mylist = new ArrayList<ImageObject>();
        try {
            if (file != null) {
                ProgressHandle handle = ProgressHandleFactory.createHandle(
                        "Creating images from " + file.getPath());
                handle.start();
                PDDocument document = PDDocument.load(file);
                List<PDPage> pages = document.getDocumentCatalog().getAllPages();
                for (int i = 0; i < pages.size(); i++) {
                    PDPage pDPage = pages.get(i);
                    mylist.add(new ImageObject(pDPage.convertToImage(), i));
                }
                handle.finish();
            }
            list.addAll(mylist);
        } catch (IOException ex) {
            Exceptions.printStackTrace(ex);
        }
        return true;
    }

The import statements from PDFBox are as follows:

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;

Java API to create pdf with tables: any recommendations?

- by jack

I need to create a PDF containing some tables. Searching Google and Stack Overflow, the most frequently mentioned API seems to be iText, but that is under the AGPL licence and thus not desirable for my purposes. I also frequently see Apache PDFBox, but that does not seem to have native support for tables (although a slightly hacky way was posted at "Apache PDFBox Java library - Is there an API for creating tables?"). Does anyone have any recommendations?

Export PDF pages to a series of images in Java

- by dasp

I need to export the pages of an arbitrary PDF document into a series of individual images in JPEG, PNG, or a similar format. I need to do this in Java. Although I do know about iText, PDFBox and various other Java PDF libraries, I am hoping for a pointer to some working example or how-to. Thanks.

PDF Text Extraction Approach Using OCR

- by Jon

Has anybody attempted to extract text from a PDF using an OCR library and Java? What did you find to be the most reliable library for text extraction? Most of the approaches I've seen (Tesseract, GOCR) are C libraries that would require some JNI code to be written. I'm familiar with PDFBox, which is now an Apache incubator project at version 0.8.x, but its text extraction isn't always accurate. I'm looking for an alternative approach that is somewhat more reliable. I've not tried Asprise JavaPDF yet (I am in the process of trying it), but I wanted to know more about the OCR approach, if it's possible. Any help would be appreciated.

I'm getting this exception : Unresolved compilation problems

- by Stephan

I get this exception after I removed the JARs (pdfbox, bouncycastle, etc.) from my project and moved them to another folder, although I included them in the build path. At the first line, Eclipse shows the error "the constructor PDFParser(InputStream) refers to missing type InputStream", even though FileInputStream extends InputStream, and I don't know why.

    FileInputStream in = new FileInputStream(path);
    PDFParser parser = new PDFParser(in);
    PDFTextStripper textStripper = new PDFTextStripper();
    parser.parse();
    String text = textStripper.getText(new PDDocument(parser.getDocument()));

Any ideas?

    Exception in thread "AWT-EventQueue-0" java.lang.Error: Unresolved compilation problems:
        The constructor PDFParser(InputStream) refers to the missing type InputStream
        The constructor PDFTextStripper() refers to the missing type IOException
        The method parse() from the type PDFParser refers to the missing type IOException
        The method getText(PDDocument) from the type PDFTextStripper refers to the missing type IOException
        The method getDocument() from the type PDFParser refers to the missing type IOException
        The method getDocument() from the type PDFParser refers to the missing type IOException
        The method close() from the type COSDocument refers to the missing type IOException

PDFParsing & extracting the images only in iPhone application.

- by sagar

Hello everyone.

My query: I want to extract only the images from an entire PDF document (using Objective-C, for an iPhone application).

My efforts: I have gone through this link, which has details on the different operators of a PDF document ( http://mail-archives.apache.org/mod_mbox/pdfbox-commits/200912.mbox/%[email protected]%3E ). I also studied this document ( http://developer.apple.com/mac/library/documentation/GraphicsImaging/Conceptual/drawingwithquartz2d/dq_pdf_scan/dq_pdf_scan.html#//apple_ref/doc/uid/TP30001066-CH220-TPXREF101 ). I have also gone through the entire PDFReference.pdf (from the original Adobe site), which says that the following operators are used for images: q, Q, BI, EI.

I have set up the following operator table to get the image:

    myTable = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback(myTable, "q", arrayCallback2);
    CGPDFOperatorTableSetCallback(myTable, "TJ", arrayCallback);
    CGPDFOperatorTableSetCallback(myTable, "Tj", stringCallback);

I have the following arrayCallback2 function for getting the image:

    void arrayCallback2(CGPDFScannerRef inScanner, void *userInfo)
    {
        // how to extract image from this code
        // means I have tried too many different ways. following is incorrect way & not giving image
        // CGPDFStreamRef stream; // represents a sequence of bytes
        // if (CGPDFDictionaryGetStream (d, "BI", &stream)){
        //     CGPDFDataFormat t=CGPDFDataFormatJPEG2000;
        //     CFDataRef data = CGPDFStreamCopyData (stream, &t);
        // }
    }

The arrayCallback2 function above is called for the operator "q", but I don't know how to extract the image from it. In short: what is the way to extract images from PDF documents? Thanks in advance for your kind help. Sagar Kothari.

Search Results

Search found 14 results on 1 pages for 'pdfbox'.
