Theory: "Lexical Encoding"

Posted by _ande_turner_ on Stack Overflow
Published on 2008-10-04T14:48:06Z Indexed on 2010/05/22 20:41 UTC

Filed under: encoding, theory, nlp, linguistics

I am using the term "Lexical Encoding" for lack of a better one.

A Word, rather than a Letter, is arguably the fundamental unit of communication. Unicode tries to assign a numeric value to each Letter of every known Alphabet; what is a Letter in one language is a Glyph in another. Unicode 5.1 currently assigns more than 100,000 values to these Glyphs. Modern English has approximately 180,000 Words in use, yet it is said that a vocabulary of about 2,000 Words is enough to converse in general terms. A "Lexical Encoding" would encode each Word rather than each Letter, and encapsulate the Words within a Sentence.

// A simplified example of a "Lexical Encoding"
final int QUERY = -1; // hypothetical constant standing for the question mark
String sentence = "How are you today?";
int[] encoded = { 93, 22, 14, 330, QUERY };

In this example, each Token in the String is encoded as an Integer. The Encoding Scheme simply assigns each Word an int value based on a generalised statistical ranking of word usage, and assigns a constant to the question mark.
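To make the scheme concrete, here is a minimal sketch of such an encoder in Java. Everything in it is an assumption made for illustration: the class name `LexicalEncoder`, the `QUERY` and `UNKNOWN` constants, and the tiny rank table, which simply reuses the illustrative numbers from the example and is not real ranking data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LexicalEncoder {
    // Hypothetical constants: negative values cannot collide with word ranks.
    static final int QUERY = -1;    // the question mark
    static final int UNKNOWN = -2;  // any word missing from the rank table

    // Toy rank table (assumed values, taken from the example in the post).
    static final Map<String, Integer> RANK = Map.of(
            "how", 93, "are", 22, "you", 14, "today", 330);

    // A token is either a run of letters/apostrophes or a single question mark.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}']+|\\?");

    static int[] encode(String sentence) {
        List<Integer> codes = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            String token = m.group();
            if (token.equals("?")) {
                codes.add(QUERY);
            } else {
                // Ranks are case-insensitive in this sketch.
                codes.add(RANK.getOrDefault(token.toLowerCase(Locale.ROOT), UNKNOWN));
            }
        }
        return codes.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // prints [93, 22, 14, 330, -1]
        System.out.println(Arrays.toString(encode("How are you today?")));
    }
}
```

Note that this only encodes Spelling: two words with the same letters always get the same number, whatever they mean in context.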

Ultimately, though, a Word has both a Spelling and a Meaning. Any "Lexical Encoding" would preserve the meaning and intent of the Sentence as a whole, and would not be language-specific. An English sentence would be encoded into "...language-neutral atomic elements of meaning ..." which could then be reconstituted into any language that has a structured Syntactic Form and Grammatical Structure.
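As a rough illustration of the "reconstitution" step, the sketch below maps language-neutral concept IDs back to words through a per-language lexicon. The class `ConceptDecoder`, the concept IDs, and the three small lexicons are all hypothetical. It is a plain word-for-word lookup and deliberately ignores syntax and grammar, which a real scheme would have to handle.

```java
import java.util.Map;
import java.util.StringJoiner;

final class ConceptDecoder {
    // Hypothetical concept IDs; the numbers mean nothing outside this sketch.
    static final int GREETING = 1;
    static final int WORLD = 2;

    // One toy lexicon per language, keyed by concept ID rather than spelling.
    static final Map<String, Map<Integer, String>> LEXICONS = Map.of(
            "en", Map.of(GREETING, "hello", WORLD, "world"),
            "de", Map.of(GREETING, "hallo", WORLD, "Welt"),
            "es", Map.of(GREETING, "hola", WORLD, "mundo"));

    static String decode(int[] concepts, String language) {
        Map<Integer, String> lexicon = LEXICONS.get(language);
        if (lexicon == null) {
            throw new IllegalArgumentException("No lexicon for " + language);
        }
        StringJoiner out = new StringJoiner(" ");
        for (int id : concepts) {
            // Concepts with no entry are rendered as a visible placeholder.
            out.add(lexicon.getOrDefault(id, "<?>"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] sentence = { GREETING, WORLD };
        System.out.println(decode(sentence, "en")); // prints hello world
        System.out.println(decode(sentence, "es")); // prints hola mundo
    }
}
```

The same int array yields a different surface form per language, which is the property the post is asking about. Word order, agreement and idiom are exactly what this lookup leaves out.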

What are other examples of "Lexical Encoding" techniques?

If you are interested in where the word-usage statistics come from:
http://www.wordcount.org
