Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

Posted by dan04 on Stack Overflow. Published on 2010-06-10.

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
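For concreteness, here's a quick Python sketch (just an illustration using the standard codecs, not part of the question itself) comparing how many bytes each encoding form needs for a BMP character versus one outside the BMP:

    # Encoded size, in bytes, of sample code points in each Unicode encoding form.
    samples = {
        "U+0041 (ASCII 'A')": "\u0041",
        "U+4E2D (BMP, CJK)": "\u4e2d",
        "U+1F600 (outside the BMP)": "\U0001F600",
    }

    for label, ch in samples.items():
        print(label,
              "UTF-8:", len(ch.encode("utf-8")),
              "UTF-16:", len(ch.encode("utf-16-le")),   # -le variants omit the BOM
              "UTF-32:", len(ch.encode("utf-32-le")))
    # U+1F600 takes 4 bytes in all three: UTF-8, UTF-16 (surrogate pair), UTF-32.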

There are 13 byte values (C0-C1 and F5-FF) that are never used, and there are multi-byte sequences that are never used, such as those corresponding to "overlong" encodings. If these had been available to encode characters, more characters could have been represented by 2-byte or 3-byte sequences (at the expense, of course, of a more complex implementation).
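As a quick illustration (a Python sketch of my own, relying on the strict built-in codec), these are exactly the kinds of byte values and overlong sequences that a conforming UTF-8 decoder rejects:

    # Sequences a strict UTF-8 decoder refuses -- coding space UTF-8 never uses.
    rejected = [
        b"\xc0\xaf",          # overlong encoding of U+002F '/'
        b"\xe0\x80\xaf",      # another overlong encoding of U+002F
        b"\xc1\xbf",          # overlong encoding of U+007F
        b"\xf5\x80\x80\x80",  # lead byte F5 would map beyond U+10FFFF
    ]

    for seq in rejected:
        try:
            seq.decode("utf-8")
        except UnicodeDecodeError as exc:
            print(seq.hex(), "->", exc.reason)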

Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?

By "UTF-8-like", I mean, at minimum:

  • The bytes 0x00-0x7F are reserved for ASCII characters.
  • Byte-oriented find/index functions work correctly. You can't get a false positive by matching in the middle of a character, as you can in Shift-JIS (a demonstration follows after this list).
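To make the Shift-JIS comparison concrete, here's a small sketch (my own illustration, not part of the constraint itself): the second byte of 表 in Shift-JIS is 0x5C, the ASCII backslash, so a byte-level search for "\" matches inside the character; in UTF-8 every byte of a multi-byte sequence has the high bit set, so that can't happen.

    # Shift-JIS: the trail byte of a two-byte character can be an ASCII value.
    sjis = "表".encode("shift_jis")        # b'\x95\x5c' -- trail byte 0x5C is '\'
    print(b"\\" in sjis)                   # True: false positive inside a character

    # UTF-8: every byte of a multi-byte sequence is >= 0x80, so searching for
    # an ASCII byte can never match in the middle of a character.
    utf8 = "表".encode("utf-8")            # b'\xe8\xa1\xa8'
    print(b"\\" in utf8)                   # False
    print(all(b >= 0x80 for b in utf8))    # True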


Tags: character-encoding, hypothetical