How to parse invalid HTML with Perl?
Posted by bodacydo on Stack Overflow
Published on 2012-07-04T21:12:41Z
Indexed on 2012/07/04 21:15 UTC
        
I maintain a database of articles with HTML formatting. Unfortunately, the editors who wrote the articles didn't know proper HTML, so they often wrote things like:
<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>
I tried using HTML::TreeBuilder to parse this HTML, but after parsing it and dumping the resulting tree, all the elements between <div class="highlight">...</div> are gone. I'm left with just <div class="highlight"></div>.
The editors have also often done things like:
<div class="article"><style>@font-face {   font-family: "Cambria"; }</style>Article starts here</div>
Parsing this with HTML::TreeBuilder again results in an empty <div class="article"></div>.
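A minimal sketch for reproducing both cases, assuming HTML::TreeBuilder is installed from CPAN. The variable names and the look_down lookup are illustrative and not taken from the original code; the script prints the full tree and the div's own contents for each fragment:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The two fragments quoted above (single-quoted, so nothing is interpolated).
my @fragments = (
    '<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>',
    '<div class="article"><style>@font-face {   font-family: "Cambria"; }</style>Article starts here</div>',
);

for my $html (@fragments) {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;    # signal end of input so the tree is finalized

    # Dump the whole tree, not just the div, so any content that ended up
    # elsewhere in the document is still visible.
    $tree->dump;

    # Locate the first div and print whatever is left inside it.
    my $div = $tree->look_down(_tag => 'div');
    my $out = $div ? $div->as_HTML : '(no div found)';
    print "$out\n";

    $tree->delete;    # release the tree explicitly
}
```

Comparing the full dump with the div's own output shows whether the content was dropped entirely or only placed somewhere else in the tree.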
Any ideas on how to approach this broken HTML and actually make sense of it?
© Stack Overflow or respective owner