Interpreting Search Results

Posted by Simon on Stack Overflow
Published on 2010-04-20T16:47:49Z

Hi all,

I am tasked with writing a program that, given a search term and the HTML source of a page representing the search results of some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure describing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known in advance whether the results page contains any of this data, or whether there are any search results at all. The goal is to feed the data structure into another program that extracts meaning.
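To make the target concrete, here is a minimal sketch of what such a data structure could look like. The names `ResultItem`, `ResultPage`, `details_url` and `has_results` are purely illustrative assumptions, not part of any existing program; every field except the position is optional because the page may not contain it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResultItem:
    position: int                      # 1-based rank within the results page
    title: Optional[str] = None        # may be missing on some pages
    details_url: Optional[str] = None  # the "details" link, if one was found


@dataclass
class ResultPage:
    search_term: str
    items: List[ResultItem] = field(default_factory=list)

    @property
    def has_results(self) -> bool:
        # An empty list covers both "no hits" and "not a results page at all".
        return bool(self.items)
```

A page with no recognizable results would simply be `ResultPage("some term")` with an empty `items` list.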

What I am looking for is not BeautifulSoup or a RegExp, but rather some clever ideas or algorithms for interpreting the HTML source. How do I find out which part of the page constitutes a single result item? How do I filter out the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.
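To illustrate the kind of idea in question, one naive baseline is to assume that result items are repeated markup: links that share the same tag path are grouped, and the group whose link texts mention the search term most often is taken as the result list. This is only a sketch under that assumption, using Python's standard `html.parser`; the names `LinkPathCollector` and `guess_result_links` are hypothetical, and the heuristic will fail on many real pages.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LinkPathCollector(HTMLParser):
    """Records, for every <a href>, the tag path leading to it and its text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.links = []      # one [path, href, text] entry per link
        self.current = None  # the link whose text is being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.current = ["/".join(self.stack), href, ""]
            self.links.append(self.current)

    def handle_endtag(self, tag):
        if tag == "a":
            self.current = None
        # Pop back to the matching open tag; tolerate unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.current is not None:
            self.current[2] += data


def guess_result_links(html, term):
    """Return (href, text) pairs of the link group that best matches term."""
    parser = LinkPathCollector()
    parser.feed(html)
    groups = defaultdict(list)
    for path, href, text in parser.links:
        groups[path].append((href, text.strip()))
    term = term.lower()

    def score(links):
        # Number of link texts in the group that mention the search term.
        return sum(term in text.lower() for _, text in links)

    best = max(groups.values(), key=score, default=[])
    return best if score(best) > 0 else []
```

The position of each result is then just its index in the returned list. Navigation links, ads and pagination usually sit on different tag paths, which is what lets this crude grouping separate them from the results.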

Thanks, Simon

© Stack Overflow or respective owner
