Python minidom and UTF-8 encoded XML with hash references

Posted by Jakob Simon-Gaarde on Stack Overflow
Published on 2011-01-11T22:15:56Z Indexed on 2011/01/11 22:54 UTC

Hi

I am having some difficulty in a home project where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters such as the Danish letters "æøå".

gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special characters in raw form (i.e. the bytes C3 A6 for the character "æ"), it sends what I think are called character hash references (i.e. &#195;&#166;).

I don't completely understand why gSOAP does it this way, since I can see that it has marked the incoming payload as UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but that is beside the point (I think).

Perhaps gSOAP is simply obeying some transport rule?

When I parse the request from gSOAP in Python with xml.dom.minidom.parseString(), I get the element values as unicode objects, which is fine, but the character hash references are not decoded as UTF-8. The parser unescapes each reference into its own character but does not decode the resulting string afterwards. In the end I have a unicode object whose characters are the individual UTF-8 bytes.

For example, if the string "æble" is contained in the XML, it arrives like this in the request:

"&#195;&#166;ble"

After parsing the XML the unicode string in the DOM Text Node's data member looks like this:

u'\xc3\xa6ble'

I would expect it to look like this:

u'\xe6ble'
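
To make the behaviour concrete, here is a minimal reproduction. The payload and the element name `name` are made up for illustration (my real request is a full SOAP envelope), and I am assuming the references arrive in decimal form, one per UTF-8 byte:

```python
from xml.dom import minidom

# Hypothetical payload: "æble" with one character reference per UTF-8 byte
# (0xC3 = 195, 0xA6 = 166) instead of the raw bytes C3 A6.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><name>&#195;&#166;ble</name>'

doc = minidom.parseString(xml_bytes)
text = doc.documentElement.firstChild.data

# Each reference becomes its own code point (U+00C3, U+00A6), so the value
# is five characters long rather than the four characters of "æble".
print(repr(text))
print(len(text))
```

As far as I can tell, the parser treats each reference as a code point in its own right, so it never sees the two values as one UTF-8 sequence.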

What am I doing wrong? Should I unescape the SOAP XML before parsing it, or should I be looking somewhere else for the solution, perhaps in gSOAP?
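
One workaround I have considered is to re-decode the parsed value after the fact. This is only a sketch: `redecode_latin1_as_utf8` is a hypothetical helper name of mine, and it assumes every character in the parsed text is below U+0100 and that together they form valid UTF-8 bytes. Otherwise it raises UnicodeEncodeError or UnicodeDecodeError.

```python
def redecode_latin1_as_utf8(text):
    """Hypothetical helper: reinterpret code points 0-255 as UTF-8 bytes."""
    # Latin-1 maps U+0000..U+00FF one-to-one onto the bytes 0x00..0xFF, so
    # encoding recovers the raw bytes; decoding them as UTF-8 gives the text.
    return text.encode('latin-1').decode('utf-8')

parsed = u'\xc3\xa6ble'  # the value minidom hands back in my case
fixed = redecode_latin1_as_utf8(parsed)
print(repr(fixed))
```

This feels like treating the symptom, though. A value that was already correct, such as u'\xe6ble', would fail to decode, so it could not be applied blindly to every text node.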

Thanks in advance.

Best regards Jakob Simon-Gaarde
