How to parse invalid HTML with Perl?

Posted by bodacydo on Stack Overflow
Published on 2012-07-04T21:12:41Z Indexed on 2012/07/04 21:15 UTC

I maintain a database of articles with HTML formatting. Unfortunately, the editors who wrote the articles didn't know proper HTML, so they often wrote things like:

<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>

I tried parsing this with HTML::TreeBuilder, but when I dump the resulting tree, everything inside <div class="highlight">...</div> is gone. I'm left with just <div class="highlight"></div>.

The editors have also often done things like:

<div class="article"><style>@font-face {   font-family: "Cambria"; }</style>Article starts here</div>

Parsing this with HTML::TreeBuilder again results in an empty <div class="article"></div>.
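
For reference, here is a minimal sketch of the kind of script that exercises both fragments. It assumes default HTML::TreeBuilder settings and uses parse_content and as_HTML to parse and dump; the helper name dump_fragment is only illustrative and not part of the module. The exact output may differ depending on the installed HTML::TreeBuilder version, so none is shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (name chosen for illustration): parse one fragment
# with default HTML::TreeBuilder settings and return the serialized tree.
sub dump_fragment {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);    # parse the string and signal end of input
    # undef = default entity encoding, '  ' = indent string,
    # {}    = treat no end tags as optional, so every close tag is emitted
    my $out = $tree->as_HTML(undef, '  ', {});
    $tree->delete;                  # release the tree explicitly
    return $out;
}

# The two fragments quoted above, verbatim.
my @fragments = (
    '<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>',
    '<div class="article"><style>@font-face {   font-family: "Cambria"; }</style>Article starts here</div>',
);

print dump_fragment($_), "\n" for @fragments;
```

Running it prints the tree that HTML::TreeBuilder builds for each fragment, which makes it easy to see what happens to the contents of each div.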

Any ideas on how to approach this broken HTML and actually make sense of it?
