Compose a synthetic English phrase that would contain 160 bits of recoverable information

Posted by Alexander Gladysh on Stack Overflow.
Published on 2011-01-15T05:41:39Z.

I have 160 bits of random data.

Just for fun, I want to generate a pseudo-English phrase to "store" this information in. I want to be able to recover the information from the phrase.

Note: This is not a security question. I don't care whether someone else can recover the information, or even detect that it is there at all.

Criteria for better phrases, from most important to least:

  • Short
  • Unique
  • Natural-looking

The current approach, suggested here:

Take three lists of 1024 nouns, verbs and adjectives each (picking the most popular ones). Generate a phrase by the following pattern, reading 10 bits for each word:

Noun verb adjective verb,
Noun verb adjective verb,
Noun verb adjective verb,
Noun verb adjective verb.
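
This scheme is easy to sanity-check in code. Below is a minimal sketch in Python; the word lists are placeholders (a real implementation would load 1024 common words of each class from a corpus), but the bit arithmetic is the real thing: 16 words × 10 bits = 160 bits.

```python
import secrets

# Placeholder word lists so the sketch is self-contained.
# In practice, load 1024 popular nouns, verbs and adjectives.
NOUNS = [f"noun{i}" for i in range(1024)]
VERBS = [f"verb{i}" for i in range(1024)]
ADJECTIVES = [f"adj{i}" for i in range(1024)]

# Noun-verb-adjective-verb, repeated four times = 16 words.
PATTERN = [NOUNS, VERBS, ADJECTIVES, VERBS] * 4

def encode(data: bytes) -> str:
    """Encode 160 bits (20 bytes) as 16 words of 10 bits each."""
    n = int.from_bytes(data, "big")
    words = []
    for wordlist in reversed(PATTERN):
        words.append(wordlist[n & 0x3FF])  # low 10 bits select the word
        n >>= 10
    return " ".join(reversed(words))

def decode(phrase: str) -> bytes:
    """Recover the original 20 bytes from the 16-word phrase."""
    n = 0
    for word, wordlist in zip(phrase.split(), PATTERN):
        n = (n << 10) | wordlist.index(word)
    return n.to_bytes(20, "big")

data = secrets.token_bytes(20)  # 160 random bits
phrase = encode(data)
assert decode(phrase) == data
```

The `.index()` lookup in `decode` is linear; a real decoder would build a word-to-index dict per list.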

Now, this seems to be a good approach, but the phrase is a bit too long and a bit too dull.

I have found a corpus of words here (Part of Speech Database).

After some ad-hoc filtering, I calculated that this corpus contains approximately:

  • 50690 usable adjectives
  • 123585 nouns
  • 15301 verbs

This allows me to use up to

  • 15 bits per adjective (actually 15.6, but I can't figure out how to use fractional bits)
  • 16 bits per noun
  • 13 bits per verb
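
These per-word counts are just the whole part of log2 of each corpus size, and they can be verified in a couple of lines (a quick sketch, using the corpus sizes quoted above):

```python
from math import log2

# Approximate corpus sizes after ad-hoc filtering (from the question).
SIZES = {"adjective": 50690, "noun": 123585, "verb": 15301}

for name, count in SIZES.items():
    # int() truncates: whole bits usable with a fixed-width encoding.
    print(f"{name}: {log2(count):.1f} bits, {int(log2(count))} usable")

# Whole bits per noun-verb-adjective-verb sentence.
per_sentence = sum(int(log2(SIZES[w]))
                   for w in ("noun", "verb", "adjective", "verb"))
print(per_sentence)  # 57
```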

For the noun-verb-adjective-verb pattern this gives 57 bits per "sentence" in the phrase. This means that, if I use all the words I can get from this corpus, I can generate three sentences instead of four (160 / 57 ≈ 2.8).

Noun verb adjective verb,
Noun verb adjective verb,
Noun verb adjective verb.
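
As an aside, the fractional bits need not be wasted: instead of giving each word a fixed whole number of bits, treat the entire 160-bit value as one big integer and decompose it in a mixed-radix number system whose "digits" are word indices. Three noun-verb-adjective-verb sentences then carry 3 · log2(123585 · 15301 · 50690 · 15301) ≈ 181 bits, comfortably above 160. A minimal sketch, assuming the actual word lists are loaded elsewhere (only the index computation is shown):

```python
from math import prod

# Corpus sizes from the question; one entry per word slot,
# noun-verb-adjective-verb, three sentences.
SIZES = [123585, 15301, 50690, 15301] * 3

# Capacity check: the product of the radices must cover 2**160.
assert prod(SIZES) >= 2**160

def encode_indices(data: bytes) -> list[int]:
    """Turn 160 bits into one word index per slot (mixed-radix digits)."""
    n = int.from_bytes(data, "big")
    indices = []
    for size in SIZES:
        n, idx = divmod(n, size)
        indices.append(idx)
    return indices

def decode_indices(indices: list[int]) -> bytes:
    """Recombine the mixed-radix digits back into the 20 data bytes."""
    n = 0
    for size, idx in zip(reversed(SIZES), reversed(indices)):
        n = n * size + idx
    return n.to_bytes(20, "big")
```

This is the same trick arithmetic coding uses: the phrase as a whole is a number, so no per-word rounding loss.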

Still a bit too long and dull.

Any hints on how I can improve it?

What I can see to try:

  • Try to compress my data somehow before encoding. But since the data is completely random, only some phrases would be shorter (and, I guess, not by much).

  • Improve phrase pattern, so it would look better.

  • Use several patterns, using the first word of the phrase to indicate to the decoder which pattern was used. (For example, use the last letter or even the length of the word.) Pick the pattern according to the first bytes of the data.

...My English isn't good enough to come up with better phrase patterns myself. Any suggestions?

  • Use more linguistics in the pattern. Different tenses etc.

...I guess I would need a much better word corpus than I have now for that. Any hints on where I can get a suitable one?

© Stack Overflow or respective owner
