Web crawler update strategy
Posted by superb on Stack Overflow, 2010-04-05
I want to crawl useful resources (like background pictures) from certain websites. It is not a hard job, especially with the help of wonderful projects like Scrapy.
The problem is that I don't just want to crawl these sites ONE TIME. I want to keep my crawl running long-term and pick up updated resources. So I want to know: is there any good strategy for a web crawler to detect updated pages? One standard building block is sketched just below.
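One thing I already know about: HTTP conditional requests (ETag / Last-Modified). If the server supports them, a 304 response means the page is unchanged and the body doesn't even get re-downloaded. A minimal sketch using the `requests` library; the function name and return shape are just mine for illustration:

```python
# Minimal sketch of a conditional GET with the `requests` library.
# A 304 Not Modified response means the cached copy is still current.
import requests

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (content, etag, last_modified), with content=None if unchanged."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:          # unchanged: skip re-processing
        return None, etag, last_modified
    resp.raise_for_status()
    return (resp.content,
            resp.headers.get("ETag"),
            resp.headers.get("Last-Modified"))
```

This only helps for servers that send validators, though, so the crawler still needs its own scheduling strategy.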
Here's a coarse algorithm I've thought of. I divide the crawl process into rounds: in each round, the URL repository gives the crawler a certain number (say, 10000) of URLs to crawl, then the next round begins. The detailed steps are (a sketch follows the list):
- The crawler adds the start URLs to the URL repository.
- The crawler asks the URL repository for at most N URLs to crawl.
- The crawler fetches the URLs and updates certain information in the URL repository, such as the page content, the fetch time, and whether the content has changed.
- Go back to step 2.
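In Python the loop might look roughly like this. `URLRepository` is hypothetical; its method names (`add`, `next_batch`, `last_digest`, `record`) are assumptions standing in for whatever store ends up being used:

```python
# Rough sketch of the round-based crawl loop described above.
# `repo` is a hypothetical URL repository; `fetch` returns page bytes.
import hashlib
import time

def crawl_forever(repo, fetch, start_urls, batch_size=10000):
    repo.add(start_urls)                       # step 1: seed the repository
    while True:
        batch = repo.next_batch(batch_size)    # step 2: at most N URLs per round
        for url in batch:                      # step 3: fetch and record results
            content = fetch(url)
            digest = hashlib.sha256(content).hexdigest()
            changed = digest != repo.last_digest(url)
            repo.record(url, content=content, digest=digest,
                        fetched_at=time.time(), changed=changed)
        # step 4: loop back for the next round
```

Comparing content digests catches updates even when the server sends no cache validators.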
 
To flesh that out, I still need to solve the following question: how do I decide the "freshness" of a web page, i.e., the probability that the page has been updated?
Since that is an open question, hopefully it will bring some fruitful discussion here.
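As a concrete starting point for that discussion: one common heuristic is to adapt each page's revisit interval from its observed change history, so frequently changing pages get checked often and static pages rarely. The halve/double rule and the bounds below are arbitrary choices of mine, not a definitive policy:

```python
# Adaptive revisit scheduling: shrink the interval when a page changed,
# grow it when it didn't, clamped to sensible bounds. Constants are arbitrary.
MIN_INTERVAL = 3600          # 1 hour
MAX_INTERVAL = 30 * 86400    # 30 days

def next_interval(current, changed):
    """Return the next revisit interval in seconds for one page."""
    if changed:
        interval = current / 2   # page looks active: check sooner
    else:
        interval = current * 2   # page looks static: check later
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
```

More principled schedulers model page changes as a Poisson process and estimate each page's change rate from its history, which is where the research literature on page refresh policies comes in.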