Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback
Posted by zoonosis on Stack Overflow, published on 2012-09-05.
Basically, the code below scrapes the first 5 items of a table. One of the fields is another href, and following that href leads to more info which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the next bit and should return the completed item back to parse.
Running the code below only returns the info collected in parse.
If I change `return items` to `return request`, I get a completed item with all 3 "things", but I only get 1 of the rows, not all 5.
I'm sure it's something simple, I just can't see it.
class ThingSpider(BaseSpider):
    name = "thing"
    allowed_domains = ["somepage.com"]
    start_urls = [
        "http://www.somepage.com"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for x in range(1, 6):
            item = ScrapyItem()
            str_selector = '//tr[@name="row{0}"]'.format(x)
            item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
            item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
            print 'hello'
            request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
            print 'hello2'
            request.meta['item'] = item
            items.append(item)
        return items

    def parse_next_page(self, response):
        print 'stuff'
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
        return item
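To illustrate the pattern being attempted, here is a minimal plain-Python simulation of Scrapy's request/callback flow (Python 3, not real Scrapy; the `Request`, `FakeResponse`, and `run` names are invented for this sketch). It shows why each row should yield its own Request carrying the semi-populated item in `meta`, so the callback can finish each item individually instead of `parse` returning the item list directly.

```python
# Plain-Python sketch of the request/meta chaining pattern (hypothetical
# stand-ins for Scrapy's machinery, for illustration only).

class Request:
    """Stand-in for scrapy Request: a URL, a callback, and a meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    """Stand-in for scrapy Response: carries the request's meta through."""
    def __init__(self, request):
        self.meta = request.meta

def parse(response):
    # One Request per row: each carries its own semi-populated item,
    # so all 5 rows get completed, not just the last one.
    for x in range(1, 6):
        item = {'thing1': 'name%d' % x, 'thing2': '/detail/%d' % x}
        yield Request(item['thing2'], callback=parse_next_page,
                      meta={'item': item})

def parse_next_page(response):
    # Retrieve the partial item and add the field from the detail page.
    item = response.meta['item']
    item['thing3'] = 'detail'
    return item

def run(start_response):
    # Crude "engine": dispatch each yielded request to its callback.
    results = []
    for req in parse(start_response):
        results.append(req.callback(FakeResponse(req)))
    return results

items = run(None)
```

In real Scrapy the engine does what `run` does here: `parse` yields one Request per row instead of returning a list, and the framework calls `parse_next_page` with a response whose `meta` still holds the matching partial item.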