Which web crawler to use to save news articles from a website into .txt files?

Posted by brokencoding on Stack Overflow
Published on 2010-02-19T15:46:09Z Indexed on 2010/05/11 13:04 UTC

Filed under: web-crawler | web-spider | help

Hi, I am currently in dire need of news articles to test an LSI implementation (it's in a foreign language, so the usual ready-to-use packs of files aren't available).

So I need a crawler that, given a starting URL, let's say http://news.bbc.co.uk/, follows all the links it contains and saves their content into .txt files. If the output encoding could be set to UTF-8, I would be in heaven.

I have zero expertise in this area, so I beg you for some suggestions on which crawler to use for this task.
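
For reference, the behaviour described above (start at one URL, follow the links on each page, save each page's text as a UTF-8 .txt file) can be sketched with nothing but the Python standard library. This is only a rough illustration under stated assumptions, not a recommendation of a specific crawler: the names `PageParser`, `fetch` and `crawl`, the `page_0000.txt` file naming, the same-host restriction and the `max_pages` limit are all made up for this example. It does not check robots.txt or rate-limit requests, and it keeps all visible text (menus included) rather than isolating the article body.

```python
import os
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class PageParser(HTMLParser):
    """Collects visible text and href targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links, self.chunks, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1  # ignore text inside these elements
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def fetch(url):
    """Download a page and decode it (assumes UTF-8 if no charset is declared)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(start_url, out_dir, max_pages=50, fetch=fetch):
    """Breadth-first crawl restricted to the start URL's host.

    Writes one UTF-8 .txt file per page and returns the list of file paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    host = urlparse(start_url).netloc
    queue, seen, saved = deque([start_url]), {start_url}, []
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download or decode
        parser = PageParser()
        parser.feed(html)
        path = os.path.join(out_dir, "page_%04d.txt" % len(saved))
        with open(path, "w", encoding="utf-8") as f:
            f.write(url + "\n\n" + "\n".join(parser.chunks))
        saved.append(path)
        for href in parser.links:
            link = urldefrag(urljoin(url, href))[0]  # absolute URL, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return saved
```

Calling `crawl("http://news.bbc.co.uk/", "articles", max_pages=100)` would then write files such as `articles/page_0000.txt`. The `fetch` parameter is injectable so the crawl logic can be tried against canned HTML without touching the network.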

© Stack Overflow or respective owner

