Getting started with character and text processing (encoding, regular expressions)

Posted by TK on Stack Overflow
Published on 2010-05-01T02:54:28Z

Filed under: text | processing | linguistics | encoding | parsing

I'd like to learn the foundations of encodings, characters, and text. Understanding these is important for dealing with large sets of text, whether they are log files or source text for building collective-intelligence algorithms. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay."

I'm not saying I need to learn advanced topics right away, but I do need to know:

Bit- and byte-level knowledge of encodings.
Characters and alphabets not used in English.
Multi-byte encodings. (I understand some Chinese and Japanese, and parsing them is important.)
Regular expressions.
Algorithms for text processing.
Parsing natural languages.
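
To make the first three items concrete, here is a minimal sketch of what "byte-level knowledge" looks like in practice. It assumes Python 3; the helper name `encoded_hex` and the sample characters are illustrative choices, not part of any library or particular corpus. It prints the bytes that a few characters occupy in UTF-8 and UTF-16 (big-endian), plus one Shift_JIS example for Japanese.

```python
# Sample characters: ASCII letter, accented Latin letter,
# a Chinese character, and a Japanese hiragana character.
samples = ["A", "é", "中", "あ"]


def encoded_hex(ch: str, encoding: str) -> str:
    """Return the bytes of `ch` in `encoding` as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode(encoding))


for ch in samples:
    print(
        f"U+{ord(ch):04X} {ch}: "
        f"utf-8=[{encoded_hex(ch, 'utf-8')}] "
        f"utf-16-be=[{encoded_hex(ch, 'utf-16-be')}]"
    )

# The same hiragana character in a legacy Japanese encoding.
print(f"あ in shift_jis: [{encoded_hex('あ', 'shift_jis')}]")
```

In UTF-8, "A" takes one byte (41), "é" takes two (c3 a9), and "中" and "あ" take three each (e4 b8 ad and e3 81 82). In UTF-16 all four take two bytes. Shift_JIS encodes "あ" as 82 a0, which is why the encoding of a file has to be known before it can be parsed.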

I also need an understanding of mathematics and corpus linguistics. The current and future web (semantic, intelligent, real-time) needs to process, parse, and analyze large amounts of text.

I'm looking for resources (maybe books?) that would get me started on some of the items above. (I've found many helpful discussions of regular expressions here on Stack Overflow, so you don't need to suggest resources on that topic.)

© Stack Overflow or respective owner
