Not crawling the same content twice

Posted by sirrocco on Stack Overflow, 2009-11-08

I'm building a small application that will crawl sites where the content grows over time (as on Stack Overflow); the difference is that content, once created, is rarely modified.

Now, on the first pass I crawl all the pages on the site.

On subsequent passes, though, I don't want to re-crawl all of the paged content, just the latest additions.

So if the site had 500 pages on the first pass and has 501 pages on the second, I would only crawl the first and second pages. Would this be a good way to handle the situation?
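
That rule is simple enough to sketch. Here is a minimal, hypothetical example (the class and method names are mine, not from any library), assuming the listing is ordered newest-first so new items only ever appear on the lowest-numbered pages:

```java
import java.util.ArrayList;
import java.util.List;

public class DeltaCrawler {

    /** Returns the page numbers worth re-crawling on this pass. */
    static List<Integer> pagesToCrawl(int lastPageCount, int currentPageCount) {
        int newPages = currentPageCount - lastPageCount;
        List<Integer> pages = new ArrayList<>();
        // Crawl one page beyond the delta: items shift across page
        // boundaries as new content pushes older results down.
        for (int page = 1; page <= newPages + 1; page++) {
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        // 500 pages last time, 501 now -> crawl pages 1 and 2,
        // matching the example above.
        System.out.println(pagesToCrawl(500, 501)); // [1, 2]
    }
}
```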

In the end, the crawled content will end up in Lucene, as a custom search engine.
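
On the Lucene side, keying each document on its URL makes indexing idempotent, so even an accidental re-crawl doesn't duplicate content in the index. A minimal sketch, assuming a modern Lucene (5+) API; the field names url and content are placeholders:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;

public class PageIndexer {

    static void indexPage(IndexWriter writer, String url, String content)
            throws IOException {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.YES));
        // updateDocument replaces any existing document with the same url,
        // so re-indexing a page never creates duplicates.
        writer.updateDocument(new Term("url", url), doc);
    }

    public static void main(String[] args) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("crawl-index")), config)) {
            indexPage(writer, "https://example.com/Results?page=1",
                      "first page of results");
        }
    }
}
```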

So I would like to avoid crawling the same content multiple times. Any better ideas?

EDIT:

Let's say the site has a page, Results, that is accessed like so: Results?page=1, Results?page=2, etc.

I guess that keeping track of how many pages there were at the last crawl and crawling just the difference would be enough. (Maybe using a hash of each result on the page: if I start running into hashes I've already seen, I should stop.)
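
The hash idea could look something like this sketch (hypothetical names; fetching a page and parsing it into individual result strings is left out). The set of seen hashes would need to be persisted between crawls, e.g. alongside the Lucene index:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashStopCrawler {

    // In practice this set would be persisted between crawls,
    // e.g. in a file next to the Lucene index.
    private final Set<String> seenHashes = new HashSet<>();

    static String sha256(String text) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Processes one page of results. Returns false when every result on the
     * page has been seen before, signalling the paging loop to stop.
     */
    boolean processPage(List<String> results) throws NoSuchAlgorithmException {
        boolean anyNew = false;
        for (String result : results) {
            if (seenHashes.add(sha256(result))) {
                anyNew = true; // unseen content: index it and keep paging
            }
        }
        return anyNew;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        HashStopCrawler crawler = new HashStopCrawler();
        System.out.println(crawler.processPage(List.of("result A", "result B"))); // true
        System.out.println(crawler.processPage(List.of("result A", "result B"))); // false -> stop
    }
}
```

Note that stopping on the first page with no new hashes relies on a strict newest-first ordering; as a side effect, any edited result would hash differently and be picked up again.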
