Is there a list of language-only character regions for UTF-8 somewhere?

Posted by Brehtt on Stack Overflow
Published on 2010-05-17T03:15:36Z Indexed on 2010/05/17 3:20 UTC

I'm trying to analyze some UTF-8 encoded documents in a way that recognizes the characters of different languages. For my approach to work I need to ignore non-language characters, such as control characters, mathematical symbols, etc. Just trying to dissect the Latin sections of the Unicode standard has already turned up multiple regions, with characters like the division sign (U+00F7) sitting right in the middle of a range of valid Latin letters.

Is there a list somewhere that identifies these regions? Or better yet, a regex that defines the regions, or something in C# that can identify the different characters?
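To make the ask concrete, here is a minimal sketch of the kind of thing I'm after. It assumes the bytes are first decoded from UTF-8 into a .NET string, and that "language character" can be approximated by the Unicode general category "Letter". That approximation is my assumption, not something the standard defines as "language only". `LetterFilter` and its method names are hypothetical; the only library pieces used are `Encoding.UTF8`, `char.IsLetter`, `CharUnicodeInfo.GetUnicodeCategory` and the regex classes `\p{L}` and `\p{IsCyrillic}`.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class LetterFilter
{
    // \p{L} matches any character whose Unicode general category is a
    // letter (Lu, Ll, Lt, Lm, Lo), so [^\p{L}]+ is a run of non-letters.
    private static readonly Regex NonLetters = new Regex(@"[^\p{L}]+");

    // Decode UTF-8 bytes first: categories apply to characters, not bytes.
    public static string Decode(byte[] utf8Bytes) =>
        Encoding.UTF8.GetString(utf8Bytes);

    // Replace every run of non-letter characters with a single space.
    public static string LettersOnly(string text) =>
        NonLetters.Replace(text, " ").Trim();

    // Per-character version of the same test, without a regex.
    public static bool IsLanguageChar(char c) => char.IsLetter(c);

    // A named block narrows the match to one region. Blocks can contain
    // non-letters too, so the letter check is applied as well.
    public static bool IsCyrillicLetter(char c) =>
        char.IsLetter(c) && Regex.IsMatch(c.ToString(), @"\p{IsCyrillic}");
}

public static class Program
{
    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("naïve ÷ 2 = café");
        string text = LetterFilter.Decode(bytes);

        // The division sign, the digit and '=' are dropped: "naïve café"
        Console.WriteLine(LetterFilter.LettersOnly(text));

        // U+00F7 is classified as MathSymbol, not as a letter.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('÷'));
    }
}
```

One caveat with this sketch: .NET strings and regexes work on UTF-16 code units, so letters outside the Basic Multilingual Plane arrive as surrogate pairs and are not matched by `\p{L}` or by the `char` overload of `char.IsLetter`. Combining marks (category M) are also excluded here, which may or may not be what is wanted for scripts that rely on them.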

© Stack Overflow or respective owner
