Python expat can't parse an XML file with bad characters. How to work around it?
- by culebrón
I'm trying to parse an XML file with expat, and here's the line where I get a bad token exception:
<tag k="name"
v="???????????????????????????????????????????????????????????????????" />
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 610127, column 37
In hex, the symbols look like: \xd1? It seems that someone typed this string (in the Russian alphabet) and hit backspace a few times.
I set parser.returns_unicode = True, but this didn't help. The first line of the file is <?xml version="1.0" encoding="UTF-8"?>. I'm reading the file through bz2.BZ2File.
How can I parse the file?
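For reference, one possible workaround is sketched below. It assumes Python 3, and it assumes the problem is what the hex suggests: a UTF-8 lead byte (\xd1) followed by something that is not a continuation byte. The idea is to decode each chunk leniently, so invalid bytes become U+FFFD, then re-encode to UTF-8 before handing the data to expat. The function name parse_lenient, the chunk size, and the file name in the usage comment are made up for illustration.

```python
import codecs
import xml.parsers.expat


def parse_lenient(fileobj, start_element, chunk_size=64 * 1024):
    """Feed expat UTF-8 data with invalid byte sequences replaced.

    Hypothetical helper: `fileobj` is any binary file-like object
    (for example bz2.BZ2File), `start_element` is called as
    start_element(name, attrs) for every opening tag.
    """
    # The incremental decoder copes with multi-byte characters that are
    # split across chunk boundaries, and substitutes U+FFFD for bytes
    # that are not valid UTF-8 instead of raising an error.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element

    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        cleaned = decoder.decode(chunk).encode("utf-8")
        if cleaned:
            parser.Parse(cleaned, False)

    # Flush whatever the decoder still holds, then signal end of input.
    tail = decoder.decode(b"", final=True).encode("utf-8")
    parser.Parse(tail, True)


# Usage sketch (the file name is an assumption):
#
#     import bz2
#     with bz2.BZ2File("dump.osm.bz2") as f:
#         parse_lenient(f, lambda name, attrs: print(name, attrs))
```

This loses the damaged characters (they show up as U+FFFD in the attribute values), so it only helps if skipping the broken text is acceptable. Passing errors="ignore" instead would drop the bad bytes silently.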