Is there a list of language-only character regions for UTF-8 somewhere?

Posted by Brehtt on Stack Overflow
Published on 2010-05-17T03:15:36Z, indexed on 2010-05-17 03:20 UTC

Filed under: utf-8 | natural-language | character-encoding

I'm trying to analyze some UTF-8 encoded documents in a way that recognizes the characters of different languages. For my approach to work I need to ignore non-language characters, such as control characters, mathematical symbols, etc. Just trying to dissect the Latin sections of the Unicode standard has resulted in multiple regions, with characters like the division sign (U+00F7) sitting right in the middle of a range of valid Latin letters.

Is there a list somewhere that identifies these regions? Or better yet, a regex that defines the regions, or something in C# that can identify the different characters?
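
One possible way to avoid maintaining the regions by hand is to filter on each character's Unicode general category instead of its code point range: letters fall under the L* categories and combining marks under M*, while the division sign is classified as a math symbol (Sm). The sketch below is in Python rather than C#, purely for illustration; the function names are hypothetical, and treating combining marks as "language characters" is an assumption that may not suit every analysis. In .NET the analogous tools would be the regex classes `\p{L}` and `\p{M}`, or `char.IsLetter`.

```python
import unicodedata


def is_language_char(ch: str) -> bool:
    """Return True for letters (L*) and combining marks (M*).

    Assumption: marks are kept so that decomposed accents stay attached
    to their base letters. Drop "M" if only letters are wanted.
    """
    return unicodedata.category(ch)[0] in ("L", "M")


def language_chars(text: str) -> str:
    """Strip everything that is not a letter or combining mark."""
    return "".join(ch for ch in text if is_language_char(ch))


if __name__ == "__main__":
    # The division and multiplication signs sit between Latin letters
    # in the Latin-1 Supplement block, but their category is Sm.
    sample = "a÷b = ö × ø\t1"
    print(language_chars(sample))
```

Because the category comes from the Unicode Character Database, the same check covers Greek, Cyrillic, CJK and other scripts without extra ranges. Digits (Nd), punctuation (P*), and whitespace are dropped too, so the category set would need widening if those matter.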

© Stack Overflow or respective owner
