Not crawling the same content twice

Posted by sirrocco on Stack Overflow, 2009-11-08

I'm building a small application that will crawl sites where the content grows over time (as on Stack Overflow); the difference is that content, once created, is rarely modified.

Now, on the first pass I crawl all the pages on the site.

On subsequent passes, though, I don't want to re-crawl all of the paged content, just the latest additions.

So if the site had 500 pages on the first pass and has 501 pages on the second, I would only crawl the first and second pages. Would this be a good way to handle the situation?
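
That rule is simple enough to sketch. Here is a minimal, hypothetical example (the class and method names are mine, not from any library), assuming the listing is ordered newest-first so new items only ever appear on the lowest-numbered pages:

```java
import java.util.ArrayList;
import java.util.List;

public class DeltaCrawler {

    /** Returns the page numbers worth re-crawling on this pass. */
    static List<Integer> pagesToCrawl(int lastPageCount, int currentPageCount) {
        int newPages = currentPageCount - lastPageCount;
        List<Integer> pages = new ArrayList<>();
        // Crawl one page beyond the delta: items shift across page
        // boundaries as new content pushes older results down.
        for (int page = 1; page <= newPages + 1; page++) {
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        // 500 pages last time, 501 now -> crawl pages 1 and 2,
        // matching the example above.
        System.out.println(pagesToCrawl(500, 501)); // [1, 2]
    }
}
```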

In the end, the crawled content will end up in Lucene, as a custom search engine.
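
On the Lucene side, keying each document on its URL makes indexing idempotent, so even an accidental re-crawl doesn't duplicate content in the index. A minimal sketch, assuming a modern Lucene (5+) API; the field names url and content are placeholders:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;

public class PageIndexer {

    static void indexPage(IndexWriter writer, String url, String content)
            throws IOException {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.YES));
        // updateDocument replaces any existing document with the same url,
        // so re-indexing a page never creates duplicates.
        writer.updateDocument(new Term("url", url), doc);
    }

    public static void main(String[] args) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("crawl-index")), config)) {
            indexPage(writer, "https://example.com/Results?page=1",
                      "first page of results");
        }
    }
}
```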

So I would like to avoid crawling the same content multiple times. Any better ideas?

EDIT:

Let's say the site has a page, Results, that is accessed like so: Results?page=1, Results?page=2, etc.

I guess that keeping track of how many pages there were at the last crawl and crawling just the difference would be enough. (Maybe using a hash of each result on the page: if I start running into hashes I've already seen, I should stop.)
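
The hash idea could look something like this sketch (hypothetical names; fetching a page and parsing it into individual result strings is left out). The set of seen hashes would need to be persisted between crawls, e.g. alongside the Lucene index:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashStopCrawler {

    // In practice this set would be persisted between crawls,
    // e.g. in a file next to the Lucene index.
    private final Set<String> seenHashes = new HashSet<>();

    static String sha256(String text) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Processes one page of results. Returns false when every result on the
     * page has been seen before, signalling the paging loop to stop.
     */
    boolean processPage(List<String> results) throws NoSuchAlgorithmException {
        boolean anyNew = false;
        for (String result : results) {
            if (seenHashes.add(sha256(result))) {
                anyNew = true; // unseen content: index it and keep paging
            }
        }
        return anyNew;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        HashStopCrawler crawler = new HashStopCrawler();
        System.out.println(crawler.processPage(List.of("result A", "result B"))); // true
        System.out.println(crawler.processPage(List.of("result A", "result B"))); // false -> stop
    }
}
```

Note that stopping on the first page with no new hashes relies on a strict newest-first ordering; as a side effect, any edited result would hash differently and be picked up again.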
