Scraping paginated items from a website using scrapy

Posted by Mridang Agarwalla on Stack Overflow
Published on 2012-10-16T16:58:04Z

I'm using Scrapy to scrape items from a site, and I haven't been able to implement the scraping pattern described below. The site is a forum, and I scrape it once a day.

Each page has a table containing posts. New posts are added to the top of the table, and as more posts are added to the site, older posts are pushed onto later pages by the pagination. This is a very simple scenario, and we will assume that the order of the posts never changes.

I would like to scrape all the "new" records on this site, stopping when the last post scraped yesterday is encountered. I have configured my spider to paginate endlessly; it should stop as soon as it encounters that post.
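To make the intended stopping rule concrete, here is a minimal sketch in plain Python, independent of Scrapy and django-dynamic-scraper. It assumes each post has a stable unique ID and that the ID of the newest post from the previous run was saved somewhere. The names `collect_new_posts`, `fetch_page`, `last_seen_id`, `max_pages` and the sample `PAGES` data are all hypothetical and only illustrate the pattern.

```python
from typing import Callable, List, Optional


def collect_new_posts(
    fetch_page: Callable[[int], List[str]],
    last_seen_id: Optional[str],
    max_pages: int = 1000,
) -> List[str]:
    """Return the IDs of posts newer than last_seen_id, newest first.

    fetch_page(n) is a hypothetical callable returning the post IDs on
    page n in display order (newest first), or an empty list past the end.
    """
    new_posts: List[str] = []
    for page_number in range(1, max_pages + 1):
        post_ids = fetch_page(page_number)
        if not post_ids:
            break  # no more pages to read
        for post_id in post_ids:
            if post_id == last_seen_id:
                # Reached the marker from the previous run: stop paginating.
                return new_posts
            new_posts.append(post_id)
    return new_posts


# Made-up forum with three pages, newest post first (illustration only).
PAGES = {
    1: ["p9", "p8", "p7"],
    2: ["p6", "p5", "p4"],
    3: ["p3", "p2", "p1"],
}


def fake_fetch(page_number: int) -> List[str]:
    return PAGES.get(page_number, [])


new_ids = collect_new_posts(fake_fetch, last_seen_id="p5")
print(new_ids)  # ['p9', 'p8', 'p7', 'p6']
```

After a run, the first element of the result (if there is one) would be stored as the marker for the next day. In a Scrapy spider, the same check could live in the parse callback: if pages are followed one at a time, not yielding the next-page request once the marker is seen ends the crawl, and raising `CloseSpider` from `scrapy.exceptions` is another way to stop.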

How can I implement this?

(My Scrapy installation is integrated with my Django installation through django-dynamic-scraper.)
