Should UTF-16 be considered harmful?

Posted by Artyom on Programmers
Published on 2009-06-26T16:04:18Z Indexed on 2012/12/19 17:13 UTC

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"

Why do I ask this question?

How many programmers are aware of the fact that UTF-16 is actually a variable-length encoding? By this I mean that there are code points that are represented as surrogate pairs and therefore take more than one 16-bit code unit.
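
To make the point concrete, here is a minimal Python 3 sketch of how a code point above U+FFFF turns into two UTF-16 code units. The helper names `utf16_code_units` and `to_surrogate_pair` are made up for this illustration; the arithmetic is the standard surrogate-pair calculation.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# 'A' fits in one code unit; the G clef (U+1D11E) needs two.
print([hex(u) for u in utf16_code_units("A")])
print([hex(u) for u in utf16_code_units("\U0001D11E")])
print([hex(u) for u in to_surrogate_pair(0x1D11E)])
```

Any code that assumes "one code unit is one character" is counting the G clef as two characters.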

I know that lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, the Win32 APIs, the Qt GUI libraries, the ICU Unicode library, etc. Despite all that, there are lots of basic bugs in the processing of characters outside the BMP (characters that must be encoded using two UTF-16 code units).

For example, try to edit one of these characters:

  • 𝄞 (U+1D11E) MUSICAL SYMBOL G CLEF
  • 𝕥 (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T
  • 𝟶 (U+1D7F6) MATHEMATICAL MONOSPACE DIGIT ZERO
  • 𠂊 (U+2008A) Han Character

You may not see some of them, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking them up in the Unicode Character reference.

For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:

  • Opera has problems editing them (deleting one requires 2 presses of backspace)
  • Notepad can't deal with them correctly (deleting one requires 2 presses of backspace)
  • File name editing in Windows dialogs is broken (deleting one requires 2 presses of backspace)
  • All Qt 3 applications can't deal with them: they show two empty squares instead of one symbol.
  • Python encodes such characters incorrectly when they are used directly: u'X' != unicode('X', 'utf-16') on some platforms when X is a character outside of the BMP.
  • Python 2.5 unicodedata fails to get properties of such characters when Python is compiled with UTF-16 Unicode strings.
  • StackOverflow seems to remove these characters from the text if they are entered directly as Unicode characters (the characters above are shown using HTML Unicode escapes).
  • A WinForms TextBox may generate an invalid string when limited with MaxLength.
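
The "two presses of backspace" and MaxLength symptoms above can be reproduced with a small simulation. This is only a sketch in Python 3 that models a text buffer as a list of UTF-16 code units; it is not the actual code of Notepad, Opera or WinForms, and all function names here are hypothetical.

```python
def encode_units(text):
    """Model a UTF-16 text buffer as a list of 16-bit code units."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def is_well_formed(units):
    """True if every surrogate is part of a valid high+low pair."""
    i = 0
    while i < len(units):
        if is_high(units[i]):
            if i + 1 >= len(units) or not is_low(units[i + 1]):
                return False
            i += 2
        elif is_low(units[i]):
            return False
        else:
            i += 1
    return True


def naive_backspace(units):
    """Buggy: removes one code unit, which may be half a character."""
    return units[:-1]


def safe_backspace(units):
    """Removes a whole surrogate pair if the buffer ends with one."""
    if len(units) >= 2 and is_low(units[-1]) and is_high(units[-2]):
        return units[:-2]
    return units[:-1]


def naive_truncate(units, max_units):
    """Buggy: a length limit that may cut a surrogate pair in half."""
    return units[:max_units]


def safe_truncate(units, max_units):
    """Applies the limit, then drops a trailing unpaired high surrogate."""
    cut = units[:max_units]
    if cut and is_high(cut[-1]):
        cut = cut[:-1]
    return cut


buffer = encode_units("a\U0001D11E")              # 3 code units, 2 characters
print(is_well_formed(naive_backspace(buffer)))    # lone high surrogate is left behind
print(is_well_formed(safe_backspace(buffer)))
print(is_well_formed(naive_truncate(buffer, 2)))  # limit cuts the pair in half
print(is_well_formed(safe_truncate(buffer, 2)))
```

In the naive versions the first backspace (or the length limit) leaves a lone high surrogate in the buffer, which is exactly the kind of invalid string described in the list.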

It seems that such bugs are extremely easy to find in many applications that use UTF-16.

So... Do you think that UTF-16 should be considered harmful?
