Web scraping with Python

Posted by Jack on Stack Overflow
Published on 2010-03-07T18:07:24Z. Indexed on 2010/03/28 11:03 UTC

Filed under: webscraping | python | webkit | firefox

I'm trying to scrape a website whose HTML is poorly formed: closing tags are often missing, and there are no classes or ids, so it's very hard to go straight to the element you want. I've been using BeautifulSoup with some success, but every once in a while (though quite rarely) I run into a page where BeautifulSoup builds the HTML tree a bit differently from, for example, Firefox or WebKit. That's understandable, since the markup is ambiguous, but if I could get the same parse tree that Firefox or WebKit produces, I could parse things much more easily. The usual problem is something like this: the site opens a <b> tag twice, and when BeautifulSoup sees the second <b> tag it immediately closes the first, while Firefox and WebKit nest the <b> tags.
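To make the ambiguity concrete, here is a minimal sketch using only Python's standard html.parser module. It is a toy tree builder, not how BeautifulSoup, Firefox, or WebKit actually work internally; the names Node, ToyTreeBuilder, build, and the sample markup are all made up for illustration. It only shows that the same doubled <b> input yields two different trees depending on whether a repeated open tag is nested or treated as an implicit close.

```python
from html.parser import HTMLParser


class Node:
    """A minimal element node: a tag name, a parent, and ordered children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order

    def render(self):
        """Serialize the subtree so the two strategies can be compared."""
        inner = "".join(
            c.render() if isinstance(c, Node) else c for c in self.children
        )
        return f"<{self.tag}>{inner}</{self.tag}>"


class ToyTreeBuilder(HTMLParser):
    """Builds a tree from tag events using one of two repair strategies.

    nest_repeats=True  -> a repeated open tag becomes a child of the first.
    nest_repeats=False -> a repeated open tag implicitly closes the first.
    """

    def __init__(self, nest_repeats):
        super().__init__()
        self.nest_repeats = nest_repeats
        self.root = Node("root")  # hypothetical wrapper element
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if not self.nest_repeats and self.current.tag == tag:
            # Treat the repeat as an implicit close of the open element.
            self.current = self.current.parent
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def build(html, nest_repeats):
    """Parse html with the chosen strategy and return the serialized tree."""
    builder = ToyTreeBuilder(nest_repeats)
    builder.feed(html)
    builder.close()
    return builder.root.render()


# Illustrative input: <b> is opened twice but closed only once.
sample = "<p><b>one <b>two</b> three</p>"
nested = build(sample, nest_repeats=True)
flat = build(sample, nest_repeats=False)
print(nested)  # <root><p><b>one <b>two</b> three</b></p></root>
print(flat)    # <root><p><b>one </b><b>two</b> three</p></root>
```

In the nested result, " three" is still bold and "two" sits inside the first <b>. In the flat result the two <b> elements are siblings and " three" belongs to the <p>. Any selector or traversal written against one tree will miss on the other.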

Is there a web scraping library for Python (or any other language; I'm getting desperate) that can reproduce the parse tree generated by Firefox or WebKit, or at least get closer than BeautifulSoup does in cases of ambiguity?

© Stack Overflow or respective owner
