Cross-platform iteration of Unicode string

Posted by kizzx2 on Stack Overflow
Published on 2011-01-02T16:11:44Z Indexed on 2011/01/02 16:54 UTC

Filed under: c++ | unicode | cross-platform | icu

I want to iterate each character of a Unicode string, treating each surrogate pair and combining character sequence as a single unit (one grapheme).
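
To illustrate the surrogate-pair half of this requirement only, here is a minimal C++ sketch that counts code points in a UTF-16 string. The function name `count_code_points` is made up for this illustration, and the sketch does nothing about combining character sequences, so it is not a grapheme iterator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string.
// A high surrogate (U+D800..U+DBFF) immediately followed by a low
// surrogate (U+DC00..U+DFFF) is counted once. An unpaired surrogate
// is counted as one unit on its own.
std::size_t count_code_points(const std::u16string& text) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t unit = text[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < text.size()) {
            const char16_t next = text[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++i;  // skip the low surrogate; the pair is one code point
            }
        }
        ++count;
    }
    return count;
}
```

For a string such as `u"a\U0001F600b"`, which holds four UTF-16 code units, this counts three code points.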

Example

The text "नमस्ते" consists of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D and U+0947 are combining marks.
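
To make the grouping concrete, here is a rough C++ sketch that attaches each combining mark to the code point before it. It assumes the two nonspacing marks in this text are U+094D (virama) and U+0947 (vowel sign E), which is how the Unicode character database classifies them, and it hard-codes only those two. The names `is_example_combining_mark` and `group_example_clusters` are made up for this illustration. This is not a general grapheme segmenter.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: recognizes ONLY the two combining marks that
// occur in the example text. A real implementation would consult the
// Unicode character database instead of a hard-coded list.
bool is_example_combining_mark(char32_t cp) {
    return cp == 0x094D || cp == 0x0947;
}

// Groups code points so that each combining mark is attached to the
// code point sequence that precedes it.
std::vector<std::u32string> group_example_clusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (char32_t cp : text) {
        if (!clusters.empty() && is_example_combining_mark(cp)) {
            clusters.back().push_back(cp);  // extend the current cluster
        } else {
            clusters.push_back(std::u32string(1, cp));  // start a new cluster
        }
    }
    return clusters;
}
```

Applied to `U"\u0928\u092E\u0938\u094D\u0924\u0947"`, the six code points fall into four groups, which matches the count reported by the platform-specific APIs in this post.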

static void Main(string[] args)
{
    const string s = "नमस्ते";

    Console.WriteLine(s.Length); // Outputs "6"

    var l = 0;
    var e = System.Globalization.StringInfo.GetTextElementEnumerator(s);
    while(e.MoveNext()) l++;
    Console.WriteLine(l); // Outputs "4"
}

So that works in .NET. Win32 also offers CharNextW():

#include <Windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t * s = L"नमस्ते";

    std::cout << std::wstring(s).length() << std::endl; // Gives "6"

    int l = 0;
    while(CharNextW(s) != s)
    {
        s = CharNextW(s);
        ++l;
    }

    std::cout << l << std::endl; // Gives "4"

    return 0;
}

Question

Both ways I know of are specific to Microsoft. Are there portable ways to do it?

I have heard of ICU, but I couldn't quickly find anything relevant (UnicodeString(s).length() still gives 6). Pointing to the relevant function or module in ICU would be an acceptable answer.
C++ has no built-in notion of Unicode, so a lightweight cross-platform library for dealing with these issues would also be an acceptable answer.
