Getting data from a webpage in a stable and efficient way

Posted by Mike Heremans on Programmers
Published on 2012-06-06T07:59:43Z Indexed on 2012/06/06 10:48 UTC

Filed under: data | parsing

Recently I've learned that using a regex to parse the HTML of a website to get the data you need isn't the best course of action.

So my question is simple: what, then, is the most efficient and generally stable way to get this data?

I should note that:

There are no APIs
There is no other source where I can get the data from (no databases, feeds and such)
There is no access to the source files (the data comes from public websites)
Let's say the data is normal text, displayed in a table in an HTML page

I'm currently using Python for my project, but a language-independent solution or tips would be nice.
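
To make the scenario concrete, here is a minimal sketch of what a parser-based alternative to regex could look like for text in an HTML table. It uses only `html.parser` from the Python standard library. The class name `TableExtractor`, the helper `extract_rows`, and the sample markup are all made up for illustration; this is one possible approach under those assumptions, not a definitive answer.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page fragment, not taken from any real site.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td> 9.99 </td></tr>
  <tr><td>Gadget <b>Pro</b></td><td>19.50</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Because the parser tracks tags instead of matching character patterns, inline markup inside a cell (the `<b>` above) and extra whitespace do not break the extraction. For badly malformed pages, a more lenient third-party parser such as lxml or Beautiful Soup may cope better than this sketch.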

As a side question: How would you go about it when the webpage is constructed by Ajax calls?
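
For the Ajax case, one commonly used approach (offered here as a sketch, not as the only option) is to find the request the page makes in the browser's network inspector and parse that response directly, since it is often JSON. The payload shape below, including the `items` key and the field names, is entirely hypothetical, as is the helper `rows_from_payload`; a real endpoint would need its own inspection, and the response text could be fetched with something like `urllib.request.urlopen`.

```python
import json


def rows_from_payload(payload_text, fields):
    """Turn a JSON response into table-like rows of strings.

    Assumes a hypothetical payload of the form {"items": [{...}, ...]}.
    Missing fields become empty strings.
    """
    records = json.loads(payload_text)["items"]
    return [[str(record.get(field, "")) for field in fields] for record in records]


# Made-up response body standing in for what an Ajax endpoint might return.
SAMPLE_PAYLOAD = (
    '{"items": [{"name": "Widget", "price": 9.99},'
    ' {"name": "Gadget", "price": 19.5}]}'
)

ajax_rows = rows_from_payload(SAMPLE_PAYLOAD, ["name", "price"])
for row in ajax_rows:
    print(row)
```

If the page builds its content in a way that cannot be traced to a simple request, driving a real browser and reading the rendered DOM is the usual fallback, at the cost of speed.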

© Programmers or respective owner
