wget crawling search results of news website

Posted by kiltek on Super User
Published on 2013-11-02T23:27:13Z Indexed on 2013/11/03 3:59 UTC

I am trying to crawl the search results of a news website using wget.

The name of the website is www.voanews.com.

After I type in my search keyword and click search, the site shows the results. I can then specify a "from" date and a "to" date and hit search again.

After this the URL becomes:

http://www.voanews.com/search/?st=article&k=mykeyword&df=10%2F01%2F2013&dt=09%2F20%2F2013&ob=dt#article

and the actual content of the results is what I want to download.

To achieve this, I created the following wget command:

wget --reject=js,txt,gif,jpeg,jpg \
     --accept=html \
     --user-agent=My-Browser \
     --recursive --level=2 \
     www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt#article

Unfortunately, the crawler doesn't download the search results. It only follows the top navigation bar, which contains the "Home, USA, Africa, Asia, ..." links, and saves the articles they link to.

It seems like the crawler doesn't check the search result links at all.

What am I doing wrong, and how can I modify the wget command to download only the links in the search results list (and, of course, the pages they link to)?
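
One thing worth checking in the command above: the URL is not quoted. In a POSIX shell an unquoted "&" ends the command and runs it in the background, so wget would likely receive only www.voanews.com/search/?st=article, without the keyword or dates. That would be consistent with the crawler only following the navigation links. The sketch below quotes the URL and drops the "#article" fragment, which is never sent to the server. The count_args helper and the RUN_CRAWL switch are made up for illustration, and this may not be the only issue (the accept/reject filters might need adjusting as well).

```shell
#!/bin/sh
# The search URL from the question, quoted as one string. The trailing
# "#article" fragment is dropped: fragments are not sent to the server.
url='http://www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt'

# Helper (hypothetical, for illustration): print how many arguments it received.
count_args() { echo "$#"; }

# Quoted, the whole URL reaches the command as a single argument.
count_args "$url"

# Set RUN_CRAWL=1 (an assumed switch, not a wget feature) to start the crawl.
if [ "${RUN_CRAWL:-0}" = "1" ]; then
    wget --reject=js,txt,gif,jpeg,jpg \
         --accept=html \
         --user-agent=My-Browser \
         --recursive --level=2 \
         "$url"
fi
```

Run as is, the script only prints the argument count, so the quoting can be checked without touching the network. Running it with RUN_CRAWL=1 starts the actual crawl with the same options as the original command.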

© Super User or respective owner
