Parsing HTML "Visually"

Posted by Midhat on Stack Overflow
Published on 2010-06-01T21:06:24Z. Indexed on 2010/06/02 1:53 UTC

Okay, I am at a loss for how to name this question. I have some HTML files, probably written by Lord Lucifer himself, that I need to parse. They consist of many segments like this, interspersed with other HTML tags:

<p>HeadingNumber</p>
<p style="text-indent:number;margin-top:neg_num ">Heading Text</p>
<p>Body</p>

Notice that the heading number and the heading text are in separate p tags, aligned on one horizontal line by CSS. The CSS may be whatever Lucifer fancies: a mixture of indents, paddings, margins and positions.

However, that line is a single object in my business model and should be kept as such. So how do I detect whether two p elements are visually on a single line, and process them accordingly? I believe the HTML files are well formed, if that helps.
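
To make the problem concrete, here is a rough sketch of one possible heuristic, not a full answer. It assumes the styling lives in inline style attributes, as in the snippet above, and that a negative margin-top on a p means "pull me up onto the previous paragraph's line". The names ParagraphCollector and merge_visual_lines, and the sample values 40px and -18px, are made up for illustration. It uses only Python's standard library.

```python
import re
from html.parser import HTMLParser

# Assumption: a negative inline margin-top (e.g. "margin-top:-18px") means the
# paragraph is pulled up onto the same visual line as the one before it.
NEGATIVE_MARGIN_TOP = re.compile(r"margin-top\s*:\s*-\d*\.?\d+", re.IGNORECASE)


class ParagraphCollector(HTMLParser):
    """Collects [style, text] for every <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # The style attribute may be missing or valueless.
            style = dict(attrs).get("style") or ""
            self.paragraphs.append([style, ""])
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1][1] += data


def merge_visual_lines(html):
    """Joins a <p> with a negative margin-top onto the preceding <p>."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    lines = []
    for style, text in collector.paragraphs:
        text = text.strip()
        if lines and NEGATIVE_MARGIN_TOP.search(style):
            lines[-1] = f"{lines[-1]} {text}"
        else:
            lines.append(text)
    return lines


# Hypothetical sample in the shape described in the question.
sample = (
    "<p>1.2</p>"
    '<p style="text-indent:40px;margin-top:-18px ">Introduction</p>'
    "<p>Body text</p>"
)

if __name__ == "__main__":
    print(merge_visual_lines(sample))  # ['1.2 Introduction', 'Body text']
```

This ignores external stylesheets, padding and positioning entirely, so it would miss any line that is aligned by other means. If the CSS really is arbitrary, a more reliable route is probably to load the page in a real browser engine and compare each element's rendered bounding box (for example with getBoundingClientRect), treating two p elements whose vertical ranges overlap as one line.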

© Stack Overflow or respective owner
