Which layout engine for finding coordinates of HTML elements on a web page?

Posted by Mexx on Stack Overflow
Published on 2010-04-24T19:28:47Z Indexed on 2010/04/24 19:33 UTC

I am working on a web data classification task and was wondering whether I could get the coordinates of HTML elements as they would appear in a web browser, without taking into account any CSS or JavaScript referenced by the page.

I program in C++ and need results for a couple of million pages, so it has to be fast. I know there is a Microsoft COM component that renders the page in a web browser control and can then be queried for the positions of different HTML tags. That is not suitable in my case, because it first renders the whole page, which takes a lot of time.

From what I have found, there are open-source layout engines, such as WebKit and Gecko, that could probably be used for this. Each is a huge codebase, though, so I need someone to point me to the right classes or modules to look into, or to any similar work someone has already done. Also, please let me know which you think is a good choice if I want to customize the existing code to run on multiple threads to make it faster.
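
To make the multi-threading part concrete, here is a minimal sketch of the kind of setup I have in mind. Everything in it is an assumption for illustration: `Box`, `layout_page`, and `layout_all` are hypothetical names, and `layout_page` is only a placeholder for whatever layout call WebKit, Gecko, or another engine would actually provide. It also assumes the engine can be called safely from several threads at once, which would need to be verified.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical result type: one bounding box per HTML element.
struct Box {
    std::string tag;
    int x, y, width, height;
};

// Hypothetical stand-in for a real layout pass. It does NOT parse HTML;
// it returns a single fake box whose height is the input length, so that
// the threading code around it can be exercised.
std::vector<Box> layout_page(const std::string& html) {
    return { Box{"html", 0, 0, 1024, static_cast<int>(html.size())} };
}

// Lays out every page using num_threads worker threads. Workers pull the
// next unprocessed index from a shared atomic counter, and each writes
// only to its own slot of the results vector, so no mutex is needed.
std::vector<std::vector<Box>> layout_all(const std::vector<std::string>& pages,
                                         unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::vector<Box>> results(pages.size());
    std::atomic<std::size_t> next{0};

    auto worker = [&]() {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= pages.size()) break;
            results[i] = layout_page(pages[i]);
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    return results;
}
```

A call such as `layout_all(pages, std::thread::hardware_concurrency())` would use one worker per hardware thread (that function may return 0, so a fallback value would be needed). Pulling indices from a shared counter, rather than splitting the pages into fixed chunks, keeps the workers busy even when some pages take much longer than others.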

Thanks

© Stack Overflow or respective owner