BeautifulSoup recursive attribute

Posted by Marcos Placona on Stack Overflow
Published on 2011-01-04T20:46:30Z Indexed on 2011/01/04 21:54 UTC

Filed under: python | beautifulsoup

Hi, I'm trying to parse an XML file with BeautifulSoup, but I've hit a brick wall trying to use the "recursive" argument of findAll().

I have a pretty odd XML format, shown below:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
      <catalog>true</catalog>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
      <catalog>false</catalog>
   </book>
 </catalog>

As you can see, the catalog tag repeats inside the book tag, which causes an error when I try to do something like:

from BeautifulSoup import BeautifulStoneSoup as BSS

catalog = "catalog.xml"


def open_rss():
    f = open(catalog, 'r')
    return f.read()

def rss_parser():
    rss_contents = open_rss()
    soup = BSS(rss_contents)
    items = soup.findAll('catalog', recursive=False)

    for item in items:
        print item.title.string

rss_parser()

As you can see, I've added recursive=False to my soup.findAll call, which in theory should make it not recurse into the item found, but skip to the next one.

This doesn't seem to work, as I always get the following error:

  File "catalog.py", line 17, in rss_parser
    print item.title.string
AttributeError: 'NoneType' object has no attribute 'string'
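
For comparison, here is a minimal sketch that reads the same structure with the standard library's xml.etree.ElementTree instead of BeautifulSoup. It swaps in a different parser and is not a fix for the findAll call itself. It assumes Python 3 and a trimmed inline copy of the catalog; SAMPLE and book_titles are hypothetical names used only for illustration. With ElementTree, findall('book') on the root matches only direct children, so the catalog flag nested inside each book is never confused with the root catalog element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the catalog from the question (assumption: the real file has the same shape).
SAMPLE = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <title>XML Developer's Guide</title>
      <catalog>true</catalog>
   </book>
   <book id="bk102">
      <title>Midnight Rain</title>
      <catalog>false</catalog>
   </book>
</catalog>"""


def book_titles(xml_text):
    """Return (title, catalog flag) pairs, one per <book> under the root."""
    root = ET.fromstring(xml_text)
    # A bare tag name in findall() matches direct children only,
    # so the inner <catalog> elements are not returned here.
    return [
        (book.findtext("title"), book.findtext("catalog"))
        for book in root.findall("book")
    ]


titles = book_titles(SAMPLE)
for title, flag in titles:
    print(title, flag)
```

Iterating over the book elements rather than the catalog elements also means each item is guaranteed to be something that can carry a title.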

I'm sure I'm doing something stupid here, and would appreciate it if someone could help me solve this problem.

Changing the XML structure is not an option, and this code needs to perform well as it will potentially parse a large XML file.
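
Since the file may be large, one possible approach is to stream it rather than load it whole. The sketch below uses xml.etree.ElementTree.iterparse from the standard library, which is again a different parser from BeautifulSoup. It assumes Python 3; iter_book_titles and the inline sample are hypothetical and stand in for the real catalog.xml.

```python
import io
import xml.etree.ElementTree as ET


def iter_book_titles(source):
    """Yield each book's title while streaming through the document."""
    # An "end" event fires once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            yield elem.findtext("title")
            # Discard the book's children so they do not accumulate in memory.
            elem.clear()


# Inline stand-in for catalog.xml (assumption: same shape as the real file).
sample = io.BytesIO(b"""<?xml version="1.0"?>
<catalog>
  <book id="bk101"><title>XML Developer's Guide</title><catalog>true</catalog></book>
  <book id="bk102"><title>Midnight Rain</title><catalog>false</catalog></book>
</catalog>""")

streamed_titles = list(iter_book_titles(sample))
print(streamed_titles)
```

iterparse also accepts a filename, so passing "catalog.xml" in place of the BytesIO object should work the same way. The nested catalog elements produce their own "end" events, but the tag check skips them.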

Thanks in advance,

Marcos
