Screen scraping software that will traverse pages

Posted by nilbus on Super User
Published on 2009-12-31T17:21:49Z Indexed on 2010/03/24 0:03 UTC

Filed under: web

We're creating a mashup site that pulls information from many sources all over the web. Many of these sites don't provide RSS feeds or APIs to access the information they provide. This leaves us with screen scraping as our method for collecting the data.

Many scripting tools exist for screen scraping, but each one requires you to write your scraping scripts in the language the tool itself is written in. Scrapy (Python), scrAPI (Ruby), and scrubyt (Ruby) are a few examples.

There are also web-based tools, such as Dapper, that create XML or RSS feeds from a webpage. Dapper has a beautiful web-based interface that requires no scripting skills to use. It would be a great fit if it could traverse multiple pages to gather data from hundreds of pages of results.

We need something that will scrape information from paginated web sites, much like scrubyt, but with a user interface that a non-programmer could use. We'll script up our own solution if we need to, probably using scrubyt, but if there's a better solution out there, we want to use it. Does anything like this exist?
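For reference, the hand-scripted fallback amounts to a loop like the one below. This is only a minimal Python sketch using the standard library, not what any of the tools above do internally. The markup it expects is an assumption made up for illustration: items as `<li class="result">` elements and the pagination link as `<a rel="next" href="...">`. The names `PageParser`, `fetch`, and `scrape_all_pages` are likewise hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects item text and the 'next page' link from one results page.

    Assumed (hypothetical) markup: items are <li class="result"> elements
    and the pagination link is <a rel="next" href="...">.
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self.next_href = None
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "result":
            self._in_item = True
            self.items.append("")
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current item.
        if self._in_item:
            self.items[-1] += data


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all_pages(start_url, fetch=fetch, max_pages=500):
    """Follow 'next' links from start_url and return every item found."""
    results = []
    seen = set()
    url = start_url
    # Stop when there is no next link, a page repeats, or the cap is hit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(fetch(url))
        results.extend(item.strip() for item in parser.items)
        # Resolve relative links against the current page's URL.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return results
```

Calling `scrape_all_pages("http://example.com/results?page=1")` (a placeholder URL) would walk the result pages until no next link is found. Every new site needs its own selectors written in code like this, which is exactly the part we would like a non-programmer to be able to do through a user interface.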

