Parse HTML and find data in the HTML
Posted by Dan.StackOverflow on Stack Overflow
Published on 2010-04-01T04:04:03Z

Hi all. I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table:
<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>
So let's try it:
>>> import html5lib
>>> import lxml.etree
>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>
That looks good, let's see what else we have:
>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>
LOL WUT?
Seriously. I was planning on using some XPath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.
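
One likely explanation, judging from the serialized output above: every element sits in the XHTML namespace (http://www.w3.org/1999/xhtml), so an unqualified path such as //tr matches nothing. Below is a minimal sketch of a namespace-qualified query. It is not html5lib itself: it re-parses the serialized string from above with the standard library's xml.etree.ElementTree so it runs without extra packages. The prefix "h" and all variable names are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

# Serialized tree copied from the lxml.etree.tostring() output in the question.
SERIALIZED = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/>'
    '<html:body><html:table><html:tbody>'
    '<html:tr><html:td>Header</html:td></html:tr>'
    '<html:tr><html:td>Want This</html:td></html:tr>'
    '</html:tbody></html:table></html:body></html:html>'
)

# The prefix "h" is arbitrary; only the namespace URI has to match.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(SERIALIZED)

# An unqualified path matches nothing, because no element is named plain "tr".
unqualified = root.findall(".//tr")

# A namespace-qualified path reaches the cell in the second row.
cell = root.find(".//h:tr[2]/h:td", NS)
second_row_text = cell.text

print(unqualified)
print(second_row_text)
```

With the lxml tree from the question, the same mapping should work through lxml's own XPath support, for example doc.xpath('//h:tr[2]/h:td/text()', namespaces=NS). Another option that may be worth trying is passing namespaceHTMLElements=False to html5lib.parse, which asks html5lib not to namespace the elements in the first place.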
© Stack Overflow or respective owner