App Engine: parsing a UTF-8 encoded urlfetch response

Posted by Davidrd91 on Stack Overflow
Published on 2012-11-25T22:53:59Z Indexed on 2012/11/25 23:03 UTC

I am trying to parse an XML document from a URL using the xml.sax parser. I know there are other libraries I could use, but coming from Java, this is the one I am most familiar with, and it seems the least complicated to me.

The code I'm using to parse is as follows:

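import xml.sax
from google.appengine.api import urlfetch

parser = xml.sax.make_parser()
handler = MangaHandler()  # my own content handler (definition omitted)
parser.setContentHandler(handler)
url = urlfetch.fetch('http://www.mangapanda.com/alphabetical', allow_truncated=False, follow_redirects=False, deadline=False)
# parseString() creates its own parser internally; `parser` above is not used by this call
xml.sax.parseString(url.content, handler)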

This raises a SAXParseException (invalid token) once the parser reaches the first & sign:

SAXParseException: <unknown>:582:34: not well-formed (invalid token)
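
For what it's worth, "not well-formed (invalid token)" at an ampersand is what the underlying expat parser reports whenever a bare & appears in the input instead of an entity reference such as &amp;. That is common in HTML pages, which are often not well-formed XML. A minimal sketch, using made-up input strings rather than the actual page content and a hypothetical is_well_formed helper:

```python
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(data):
    """Hypothetical helper: True if xml.sax parses `data` (bytes) without error."""
    try:
        xml.sax.parseString(data, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# Made-up inputs for illustration; not taken from the fetched page.
bare = b"<a>Tom & Jerry</a>"         # raw ampersand: not well-formed XML
escaped = b"<a>Tom &amp; Jerry</a>"  # ampersand written as an entity reference

print(is_well_formed(bare))
print(is_well_formed(escaped))
```

If the fetched page turns out to be HTML rather than XML, the way the bytes are handed to the parser will probably not matter; the input itself is what the parser rejects.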

Because urlfetch returns the content as a string rather than a stream, I cannot use parse() (which only works with streams) and am left with parseString() instead. To see whether parsing from a stream would fix this, I tried:

parser.parse(io.StringIO(url.content).encode('utf-8'))

but this returns:

TypeError: initial_value must be unicode or None, not str
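
The TypeError arises because io.StringIO accepts only unicode text, while url.content is a byte string (and .encode() is being called on the StringIO object, not on the string). A sketch of wrapping the bytes in io.BytesIO instead, assuming a stand-in byte string in place of url.content and a hypothetical TitleHandler in place of MangaHandler:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class TitleHandler(ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.titles = []
        self._in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buf = []

    def characters(self, content):
        # characters() may be called several times per element, so buffer the pieces
        if self._in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self._in_title = False


# Stand-in for url.content: UTF-8 encoded bytes (made-up data).
content = u"<list><title>Na\u00efve</title><title>B</title></list>".encode("utf-8")

handler = TitleHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(content))  # byte stream; the parser decodes the UTF-8 itself

print(handler.titles)
```

This only changes how the bytes reach the parser. Input that is not well-formed would still raise the same SAXParseException.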

I have also tried urllib2, which, unlike urlfetch, does return a stream, but the file is too large and is automatically truncated, leaving me with missing data.

Any sort of workaround for this would be greatly appreciated, as I've spent days getting around one obstacle only to be stopped by another.

© Stack Overflow or respective owner
