Python web scraping involving HTML tags with attributes

Posted by rohanbk on Stack Overflow
Published on 2009-09-08T02:23:25Z Indexed on 2010/05/02 7:07 UTC

Filed under: python, scraping, beautifulsoup, lxml

I'm trying to write a web scraper that parses a web page of publications and extracts the authors. The skeletal structure of the page is as follows:

    <html>
      <body>
        <div id="container">
          <div id="contents">
            <table>
              <tbody>
                <tr>
                  <td class="author">####I want whatever is located here ###</td>
                </tr>
              </tbody>
            </table>
          </div>
        </div>
      </body>
    </html>

So far I've been trying to use BeautifulSoup and lxml for this, but I'm not sure how to handle the two div tags and the td tag, because they have attributes. I'm also not sure whether I should rely on BeautifulSoup, on lxml, or on a combination of both. What should I do?

At the moment, my code looks like this:

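    # Python 2 with BeautifulSoup 3 and lxml
    import re
    import sys
    import urllib2

    import lxml
    from lxml import etree
    from lxml.cssselect import CSSSelector
    from lxml.etree import tostring
    from lxml.html.soupparser import fromstring
    from BeautifulSoup import BeautifulSoup, NavigableString

    address = 'http://www.example.com/'
    html = urllib2.urlopen(address).read()

    # Let BeautifulSoup tidy the markup before handing it to lxml.
    soup = BeautifulSoup(html)
    html = soup.prettify()

    # Swap named entities for numeric references; the trailing ';' is untouched.
    html = html.replace('&nbsp', '&#160')
    html = html.replace('&iacute', '&#237')

    # Parse the cleaned markup into an lxml element tree.
    root = fromstring(html)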

I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file.
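The two replace() calls in the snippet above each handle one named entity. As a possible generalization, here is a minimal Python 3 sketch that rewrites any HTML named entity as a numeric character reference, using the standard library's html.entities.name2codepoint table. The helper name named_to_numeric and the sample string are hypothetical. Unlike the replace() calls, this version only touches entities that end with ';', and it leaves the entities that XML already predefines alone.

```python
import re
from html.entities import name2codepoint

# Entities that XML parsers already understand; these are left as-is.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity such as &nbsp; or &iacute; (terminating ';' required).
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Rewrite HTML named entities as numeric character references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Unknown or XML-safe entity: keep the original text.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return ENTITY_RE.sub(repl, markup)


# Hypothetical sample text, not taken from the real page.
sample = "Garc&iacute;a&nbsp;M. &amp; Co."
print(named_to_numeric(sample))
```

With the sample above, &iacute; and &nbsp; come out as &#237; and &#160;, the same mappings the original code hard-codes, while &amp; is left alone.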

EDIT: I suppose I didn't make this clear, but there are multiple <td class="author"> tags on the page that I want to scrape.
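For illustration, here is a minimal Python 3 sketch of one way the attribute-based lookup could be written. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup or lxml, so it assumes the markup is well-formed, as in the skeleton above; real pages usually are not and would need a forgiving parser first. The function name extract_authors and the author names in the sample page are made up. lxml's etree API is modeled on ElementTree, so a similar findall path should carry over there.

```python
import xml.etree.ElementTree as ET

# Well-formed sample in the shape of the skeleton; author names are made up.
PAGE = """
<html>
<body>
<div id="container">
<div id="contents">
<table>
<tbody>
<tr><td class="author">A. Author</td></tr>
<tr><td class="author">B. Writer and <b>C. Coauthor</b></td></tr>
<tr><td class="title">Not an author cell</td></tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
"""


def extract_authors(markup):
    """Return the text of every <td class="author"> under div#contents."""
    root = ET.fromstring(markup.strip())
    # [@attr='value'] predicates select elements by attribute value,
    # and '//' matches descendants at any depth.
    path = ".//div[@id='container']/div[@id='contents']//td[@class='author']"
    cells = root.findall(path)
    # itertext() also collects text inside nested tags such as <b>.
    return ["".join(cell.itertext()).strip() for cell in cells]


authors = extract_authors(PAGE)
print(authors)
```

Because findall returns every match, the same call covers the case of multiple author cells. Note that the predicate compares the whole attribute string, so a cell written as class="author first" would not match this path.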


