Is there a way to extract semantic information from a PDF? (Converting PDF to pure XHTML)
Posted by Eonil on Stack Overflow
        Published on 2010-02-05T09:46:14Z
Hi. I'm looking for a way to extract semantic, structural information (such as titles, headings, paragraphs, or lists) from a PDF, because I want to get purely structural data out of it.
Ultimately, I want to create pure XHTML from the PDF, containing only structural information and no design or layout.
I know a PDF can be created without any structural information. I'm not considering those PDFs, only well-structured ones.
I'm new to PDF, so I don't know whether it offers a regular semantic structure. If it does, I assume a PDF library will expose it. So I'd like to know whether the PDF spec includes this information and, if so, the best way to get it.
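As a rough illustration of the target described above (XHTML carrying structure only, no layout), here is a minimal Python sketch. It assumes the structure has already been extracted from the PDF by some means and is available as (role, content) pairs; that input format, the `ROLE_TO_TAG` mapping, and the `to_xhtml` function are all hypothetical and are not part of any PDF library.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from extracted structure roles to XHTML tags.
# The role names are an assumption made for this illustration.
ROLE_TO_TAG = {"title": "h1", "heading": "h2", "paragraph": "p"}


def to_xhtml(items):
    """Render (role, content) pairs as a bare XHTML fragment.

    content is a string, or a list of strings when role is "list".
    """
    parts = []
    for role, content in items:
        if role == "list":
            # Each list entry becomes one <li> inside a single <ul>.
            entries = "".join(f"<li>{escape(text)}</li>" for text in content)
            parts.append(f"<ul>{entries}</ul>")
        else:
            tag = ROLE_TO_TAG[role]
            parts.append(f"<{tag}>{escape(content)}</{tag}>")
    return "\n".join(parts)


if __name__ == "__main__":
    document = [
        ("title", "Report"),
        ("heading", "Scope"),
        ("paragraph", "Text & figures."),
        ("list", ["first", "second"]),
    ]
    print(to_xhtml(document))
```

The sketch only covers the output side; obtaining the (role, content) pairs from a PDF is the open part of the question.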
© Stack Overflow or respective owner