Interpreting Search Results

Posted by Simon on Stack Overflow
Published on 2010-04-20T16:47:49Z Indexed on 2010/04/20 16:53 UTC

Filed under: information-retrieval

Hi all,

I am tasked with writing a program that, given a search term and the HTML source of a search results page from some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), builds a data structure describing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known in advance whether the page contains any of this data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
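To make the target a bit more concrete, the data structure could look roughly like the sketch below. This is only an illustration: the names `ResultItem`, `ResultPage`, `details_url` and `has_results` are placeholders I made up, and every field except the position is optional because the page may not contain it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResultItem:
    """One search result as found on the page (hypothetical layout)."""
    position: int                      # 1-based rank within the results
    title: Optional[str] = None        # may be missing on some pages
    details_url: Optional[str] = None  # the "details" link, if any


@dataclass
class ResultPage:
    """Everything extracted from one results page for one search term."""
    search_term: str
    items: List[ResultItem] = field(default_factory=list)

    @property
    def has_results(self) -> bool:
        # An empty list covers both "no hits" and "nothing recognisable".
        return bool(self.items)


# Usage with made-up values:
page = ResultPage(search_term="red shoes")
page.items.append(
    ResultItem(position=1, title="Red shoes, size 42",
               details_url="http://example.com/item/1")
)
```

The open question is how to fill `items` from arbitrary HTML, not how to store them.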

What I am looking for is not BeautifulSoup or a regular expression but rather clever ideas or algorithms for interpreting the HTML source. How do I find out which part of the page constitutes a single result item? How do I filter out the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.

Thanks, Simon
