Should UTF-16 be considered harmful?

Posted by Artyom on Stack Overflow
Published on 2009-06-26T16:04:18Z Indexed on 2010/03/18 1:51 UTC
Filed under: subjective

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"

Why do I ask this question?

How many programmers are aware that UTF-16 is actually a variable-length encoding? By this I mean that there are code points that are represented as surrogate pairs and so take more than one 16-bit code unit.
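To make the surrogate pair mechanism concrete, here is a minimal sketch of the UTF-16 encoding rule. The function name `utf16_code_units` is only an illustration and not part of any library; it returns the 16-bit code units as plain integers.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (as integers) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not encodable characters")

    # Code points inside the BMP fit in a single 16-bit code unit.
    if code_point < 0x10000:
        return [code_point]

    # Everything else is split into a 20-bit offset:
    # the top 10 bits go into the high surrogate, the low 10 bits into the low one.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


print([hex(u) for u in utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in utf16_code_units(0x1D11E)])  # ['0xd834', '0xdd1e']
```

So "A" takes one code unit, while U+1D11E takes two. Code that assumes one code unit per character is wrong for the second case.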

I know, lots of applications, frameworks and APIs use UTF-16: Java's String, C#'s String, the Win32 APIs, the Qt GUI libraries, the ICU Unicode library, and so on. However, even with all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that have to be encoded using two UTF-16 code units).

For example, try to edit one of these characters:

𝄞
𝕥
𝟶
𠂊

You may not see some of them, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking them up in the Unicode Character reference.
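If the fonts are missing, a quick way to check the claim is to count code units rather than look at glyphs. The snippet below is only a sketch: `utf16_unit_count` is a made-up helper name, and it relies on Python's built-in `utf-16-le` codec to do the encoding.

```python
samples = ["𝄞", "𝕥", "𝟶", "𠂊"]


def utf16_unit_count(text: str) -> int:
    """Number of 16-bit code units the text needs in UTF-16 (no BOM)."""
    return len(text.encode("utf-16-le")) // 2


for ch in samples:
    # len(ch) counts code points in Python 3; a UTF-16 based API would
    # typically report the code unit count instead.
    print(f"U+{ord(ch):04X}  code points: {len(ch)}  "
          f"UTF-16 code units: {utf16_unit_count(ch)}")
```

Each sample is a single code point but needs two code units, which is the count a UTF-16 string type would usually report as its "length".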

For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:

Opera has problems editing them.
Notepad can't deal with them correctly (deleting them, for example).
File name editing in Windows dialogs is broken.
Qt 3 applications can't deal with them at all.
Stack Overflow seems to remove these characters when they are entered directly as Unicode characters, and only seems to allow them as HTML Unicode escapes.
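I can't see the source of any of these applications, so the following is only a guess at the kind of bug involved: a "backspace" that removes one code unit instead of one code point. All the function names here are hypothetical, and the text buffer is modelled as a plain list of 16-bit integers.

```python
def encode_units(text: str) -> list[int]:
    """Model a UTF-16 text buffer as a list of 16-bit code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def naive_backspace(units: list[int]) -> list[int]:
    """Assumes one code unit per character (the suspected bug)."""
    return units[:-1]


def safe_backspace(units: list[int]) -> list[int]:
    """Removes a whole surrogate pair when the buffer ends with one."""
    if len(units) >= 2 and is_low_surrogate(units[-1]) and is_high_surrogate(units[-2]):
        return units[:-2]
    return units[:-1]


def is_well_formed(units: list[int]) -> bool:
    """True if every surrogate in the buffer is part of a proper pair."""
    i = 0
    while i < len(units):
        if is_high_surrogate(units[i]):
            if i + 1 < len(units) and is_low_surrogate(units[i + 1]):
                i += 2
                continue
            return False
        if is_low_surrogate(units[i]):
            return False
        i += 1
    return True


buffer = encode_units("a𝄞")
print(is_well_formed(naive_backspace(buffer)))  # False: a lone high surrogate is left
print(is_well_formed(safe_backspace(buffer)))   # True: only "a" remains
```

Note that `safe_backspace` still works per code point, not per user-perceived character, so combining sequences would need more than this.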

So... this was a very simple test. Do you think that UTF-16 should be considered harmful?
