Getting started with character and text processing (encoding, regular expressions)
Posted by TK on Stack Overflow
        Published on 2010-05-01T02:54:28Z
I'd like to learn the foundations of encodings, characters, and text. Understanding these is important for dealing with large bodies of text, whether they are log files or source text for building collective-intelligence algorithms. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay."
I don't need to learn advanced topics right away, but I do need to know:
- Bit- and byte-level knowledge of encodings.
- Characters and alphabets not used in English.
- Multi-byte encodings. (I understand some Chinese and Japanese, and parsing them is important.)
- Regular expressions.
- Algorithms for text processing.
- Parsing natural languages.
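To make the first and third bullets concrete, here is a minimal Python sketch of what looking at an encoding at the byte and bit level might involve. The helper names `hex_bytes` and `bit_view` and the sample characters are made up for illustration; the only library call used is the standard `str.encode`.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


def bit_view(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated 8-bit strings."""
    return " ".join(f"{b:08b}" for b in text.encode(encoding))


# Arbitrary sample characters: ASCII, Latin with accent, CJK, emoji.
samples = ["A", "é", "日", "😀"]

for ch in samples:
    print(f"U+{ord(ch):04X}")
    for enc in ("utf-8", "utf-16-be"):
        encoded = hex_bytes(ch, enc)
        count = len(ch.encode(enc))
        print(f"  {enc:10} {encoded} ({count} bytes)")
    # In UTF-8 the leading bits of the first byte signal the sequence length.
    print(f"  utf-8 bits {bit_view(ch, 'utf-8')}")

# A legacy multi-byte encoding gives different bytes for the same character.
print("shift_jis ", hex_bytes("日", "shift_jis"))
```

The same character takes one to four bytes in UTF-8 depending on its code point, which is the sort of detail that "as long as I use UTF-8, I'm okay" hides.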
 
I also need an understanding of mathematics and corpus linguistics. The current and future web (semantic, intelligent, real-time) requires processing, parsing, and analyzing large amounts of text.
I'm looking for resources (maybe books?) that would get me started with some of the bullets above. (I've found many helpful discussions on regular expressions here on Stack Overflow, so you don't need to suggest resources on that topic.)
© Stack Overflow or respective owner