Search Results

Search found 1 result on 1 page for 'dangra'.


  • Why is python decode replacing more than the invalid bytes from an encoded string?

    - by dangra
    Trying to decode an HTML page containing invalid UTF-8 gives different results in Python, Firefox and Chrome.

    The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX':

    >>> fragment = 'PREFIX\xe3\xabSUFFIX'
    >>> fragment.decode('utf-8', 'strict')
    ...
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data

    What follows is a summary of the replacement policies that Python, Firefox and Chrome use to handle decoding errors. Note how the three differ, and especially how the Python builtin removes the valid S in addition to the invalid sequence of bytes.

    by Python

    The builtin replace error handler replaces the invalid \xe3\xab plus the S from SUFFIX with a single U+FFFD:

    >>> fragment.decode('utf-8', 'replace')
    u'PREFIX\ufffdUFFIX'
    >>> print _
    PREFIX?UFFIX

    A Python implementation of the builtin replace error handler looks like:

    >>> python_replace = lambda exc: (u'\ufffd', exc.end)

    As expected, trying this gives the same result as the builtin:

    >>> import codecs
    >>> codecs.register_error('python_replace', python_replace)
    >>> fragment.decode('utf-8', 'python_replace')
    u'PREFIX\ufffdUFFIX'
    >>> print _
    PREFIX?UFFIX

    by Firefox

    Firefox replaces each invalid byte with U+FFFD:

    >>> firefox_replace = lambda exc: (u'\ufffd', exc.start+1)
    >>> codecs.register_error('firefox_replace', firefox_replace)
    >>> fragment.decode('utf-8', 'firefox_replace')
    u'PREFIX\ufffd\ufffdSUFFIX'
    >>> print _
    PREFIX??SUFFIX

    by Chrome

    Chrome replaces each invalid sequence of bytes with U+FFFD:

    >>> chrome_replace = lambda exc: (u'\ufffd', exc.end-1)
    >>> codecs.register_error('chrome_replace', chrome_replace)
    >>> fragment.decode('utf-8', 'chrome_replace')
    u'PREFIX\ufffdSUFFIX'
    >>> print _
    PREFIX?SUFFIX

    The main question is why the builtin replace error handler for str.decode removes the S from SUFFIX. Also, is there an official Unicode-recommended way of handling decoding replacements?
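
For comparison, the per-byte and per-sequence policies described in the question can be sketched as a standalone script. This is a minimal sketch, assuming a current CPython 3 interpreter rather than the Python 2 sessions shown in the question, so the fragment is a bytes literal and the results of the builtin handler may differ from the ones quoted there. The handler names per_byte_replace and per_sequence_replace are hypothetical and chosen only for illustration; they do not claim to reproduce what any browser does internally.

```python
import codecs

# The fragment from the question, written as bytes for Python 3.
fragment = b'PREFIX\xe3\xabSUFFIX'


def per_byte_replace(exc):
    """Emit one U+FFFD and resume one byte after the start of the error."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return ('\ufffd', exc.start + 1)


def per_sequence_replace(exc):
    """Emit one U+FFFD and resume where the decoder says the bad range ends."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return ('\ufffd', exc.end)


# Hypothetical handler names, registered only for this sketch.
codecs.register_error('per_byte_replace', per_byte_replace)
codecs.register_error('per_sequence_replace', per_sequence_replace)

per_byte = fragment.decode('utf-8', 'per_byte_replace')
per_sequence = fragment.decode('utf-8', 'per_sequence_replace')
builtin = fragment.decode('utf-8', 'replace')

# ascii() makes the replacement characters visible on any terminal.
print(ascii(per_byte))
print(ascii(per_sequence))
print(ascii(builtin))
```

The per-sequence handler relies entirely on the exc.end value reported by the decoder, so whether the trailing S survives depends on how the interpreter's UTF-8 decoder delimits the invalid range, which is the behavior the question is about.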
