Find non-ASCII characters in a UTF-8 string

Posted by user10607 on Programmers
Published on 2013-11-06T15:45:04Z

I need to find the non-ASCII characters in a UTF-8 string.

My understanding: UTF-8 is a superset of ASCII in which the values 0-127 are the ASCII characters. So if a character's value in a UTF-8 string is not between 0 and 127, it is not an ASCII character, right? Please correct me if I'm wrong here.
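
To make the byte-level picture concrete: the 0-127 rule holds for individual bytes, and a character outside that range is stored as one lead byte followed by one to three continuation bytes, all of them with the high bit set. A minimal sketch of that classification, assuming the input is well-formed UTF-8 (the enum and the name classify_utf8_byte are hypothetical, not from any library):

```c
/* Hypothetical names for illustration; assumes well-formed UTF-8 input. */
enum utf8_byte_kind {
    UTF8_ASCII,        /* 0xxxxxxx: a complete ASCII character          */
    UTF8_LEAD,         /* 11xxxxxx: first byte of a multi-byte sequence */
    UTF8_CONTINUATION  /* 10xxxxxx: follow-up byte of that sequence     */
};

enum utf8_byte_kind classify_utf8_byte(unsigned char b)
{
    if (b < 0x80)
        return UTF8_ASCII;
    if ((b & 0xC0) == 0x80)
        return UTF8_CONTINUATION;
    /* Any remaining byte has its two top bits set; in valid UTF-8
       this is a lead byte in the range 0xC2-0xF4. */
    return UTF8_LEAD;
}
```

For a three-byte character stored as the bytes E2 88 9A, this gives one UTF8_LEAD followed by two UTF8_CONTINUATION results, so a per-byte test sees three non-ASCII bytes for one character.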

Based on that understanding, I have written the following code in C.

Note: I'm compiling the C code with gcc on Ubuntu.

The UTF-8 string is x√ab c

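    #include <ctype.h>   /* isascii (POSIX extension, available in glibc) */
    #include <stdio.h>

    int main(void)
    {
        size_t i;
        char arr[] = "x√ab c";

        /* print the array size, then test every element of the array */
        printf("length : %zu \n", sizeof(arr));
        for (i = 0; i < sizeof(arr); i++) {
            char ch = arr[i];
            if (isascii(ch))
                printf("Ascii character %c\n", ch);
            else
                printf("Not ascii character %c\n", ch);
        }
        return 0;
    }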

Which prints the output like:

length : 9 
Ascii character x
Not ascii character 
Not ascii character ?
Not ascii character ?
Ascii character a
Ascii character b
Ascii character  
Ascii character c
Ascii character 

To the naked eye the length of x√ab c seems to be 6, but the code reports 9. The correct answer for x√ab c is 1, i.e. it has only one non-ASCII character, but the output above reports 3 (three "Not ascii character" lines).

How can I find the non-ASCII characters in a UTF-8 string correctly?

Any guidance on the subject would be appreciated.
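
One possible approach, sketched under the assumption that the string is NUL-terminated and well-formed UTF-8: count characters rather than bytes by skipping continuation bytes (bit pattern 10xxxxxx), and count non-ASCII characters by counting only lead bytes (11xxxxxx). The function names utf8_char_count and utf8_non_ascii_count are hypothetical, and no validation of the input is attempted.

```c
#include <stddef.h>

/* Hypothetical helper: number of characters (code points) in s.
   Every byte that is not a continuation byte starts a new character. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: number of non-ASCII characters in s.
   Each multi-byte sequence has exactly one lead byte (11xxxxxx),
   so counting lead bytes counts each such character once. */
size_t utf8_non_ascii_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) == 0xC0)
            n++;
    }
    return n;
}
```

For a string written as "x\xE2\x88\x9A" "ab c" (eight bytes, with one three-byte character among the ASCII ones), utf8_char_count returns 6 and utf8_non_ascii_count returns 1. The cast to unsigned char matters because plain char may be signed, which would turn bytes above 127 into negative values.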
