Search Results

Search found 8 results on 1 pages for 'ngram'.

Page 1/1 | 1 

  • Ngram IDF smoothing

    - by adi92
    I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents. I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others The problem that I am running into is that some (3,4)-grams in my data which have super-high idf actually consist of component unigrams and bigrams which have really low idf.. For example, "you've never tried" has a very high idf, while each of the component unigrams have very low idf.. I need to come up with a function that can take in document frequencies of an n-gram and all its component (n-k)-grams and return a more meaningful measure of how much this phrase will distinguish the parent document from the rest. If I were dealing with probabilities, I would try interpolation or backoff models.. I am not sure what assumptions/intuitions those models leverage to perform well, and so how well they would do for IDF scores. Anybody has any better ideas?

    Read the article

  • Using Markov models to convert all caps to mixed case and related problems

    - by hippietrail
    I've been thinking about using Markov techniques to restore missing information to natural language text. Restore mixed case to text in all caps Restore accents / diacritics to languages which should have them but have been converted to plain ASCII Convert rough phonetic transcriptions back into native alphabets That seems to be in order of least difficult to most difficult. Basically the problem is resolving ambiguities based on context. I can use Wiktionary as a dictionary and Wikipedia as a corpus using n-grams and Markov chains to resolve the ambiguities. Am I on the right track? Are there already some services, libraries, or tools for this sort of thing? Examples GEORGE LOST HIS SIM CARD IN THE BUSH - George lost his SIM card in the bush tantot il rit a gorge deployee - tantôt il rit à gorge déployée

    Read the article

  • "Anagram solver" based on statistics rather than a dictionary/table?

    - by James M.
    My problem is conceptually similar to solving anagrams, except I can't just use a dictionary lookup. I am trying to find plausible words rather than real words. I have created an N-gram model (for now, N=2) based on the letters in a bunch of text. Now, given a random sequence of letters, I would like to permute them into the most likely sequence according to the transition probabilities. I thought I would need the Viterbi algorithm when I started this, but as I look deeper, the Viterbi algorithm optimizes a sequence of hidden random variables based on the observed output. I am trying to optimize the output sequence. Is there a well-known algorithm for this that I can read about? Or am I on the right track with Viterbi and I'm just not seeing how to apply it?

    Read the article

  • Simple NLP: How to use ngram to do word similarity?

    - by sadawd
    Dear Everyone, I Hear that google uses up to 7-grams for their own data. I am interested in finding words that are similar in context (i.e. cat and dog) and I was wondering how do I compute the similarity of two words on a n-gram model given that n 2. Given a sample set like this forexample: (I, love cats), (cats, loves, dogs), (dogs, hate, human) What is a good way to compare the similarity of this pair (I, cats)? Also does anyone know of anyway to do levels for NLP? like: Army-Military-Solider ?

    Read the article

  • Racket: change dotted pair to list

    - by user2963128
    I have a program that recursively calls a hashtable and prints out data from it. Unfortunately my hashtable seems to be saving data as dotted pairs so when I call the hashtable I get an error saying that there is no data for it because its tryign to search the hashtable for a dotted pair instead of a list. Is there an easy way to make the dotted pair into a regular list? IE im getting '("was" . "beginning") instead of '("was" "beginning") Is there a way to change this without re-writing how my hashtable store stuff? im using the let function to set a variable to this and then calling another function based on this variable (let ((data ( list-ref(hash-ref Ngram-table key) (random (length (hash-ref Ngram-table key)))))) is there a way to make the value stored in data just a list like this '("var1" "var2") instead of a dotted pair? edit: im getting dotted pairs because im using let to set data to the part of the hashtable's key and one of the elements in that hash.

    Read the article

  • Using hash tables in racket

    - by user2963128
    Im working on an Ngrams program and im having trouble filling out my hash table. i want to write out a recursive function that will take the words and add them to the hash table. The way its supposed to work is given the data set 1 2 3 4 5 6 7 the first entry in the hash table should have a key of [1 2] and the data should be 3. The second entry should be: [2 3] and its data should be 4 and continuing on until the end of the text file. we are given a predefined function called readword that will simply return 1 word from the text. But im not sure how to make these calls overlap each other. The calls would look something like this if the data was hard coded in. (hash-set! (list "1" "2") 3 (hash-set! (list "2" "3") 4 2 calls that ive tried look like this (hash-set! Ngram-table(list((word1) (word2)) readword in))) (hash-set! Ngram-table(append((cdr data) word1)) readword in) apparently the in after readword is supposed to tell the computer that this is and input instead of an output or something like that. How would I call this to make the data in the key of the hashtable overlap like this? And what would the recursive call look like? edit: also we are not allowed to use assigment statements in this program.

    Read the article

  • N-gram split function for string similarity comparison

    - by Michael
    As part of excersise to better understand F# which I am currently learning , I wrote function to split given string into n-grams. 1) I would like to receive feedback about my function : can this be written simpler or in more efficient way? 2) My overall goal is to write function that returns string similarity (on 0.0 .. 1.0 scale) based on n-gram similarity; Does this approach works well for short strings comparisons , or can this method reliably be used to compare large strings (like articles for example). 3) I am aware of the fact that n-gram comparisons ignore context of two strings. What method would you suggest to accomplish my goal? //s:string - target string to split into n-grams //n:int - n-gram size to split string into let ngram_split (s:string, n:int) = let ngram_count = s.Length - (s.Length % n) let ngram_list = List.init ngram_count (fun i -> if( i + n >= s.Length ) then s.Substring(i,s.Length - i) + String.init ((i + n) - s.Length) (fun i -> "#") else s.Substring(i,n) ) let ngram_array_unique = ngram_list |> Seq.ofList |> Seq.distinct |> Array.ofSeq //produce tuples of ngrams (ngram string,how much occurrences in original string) Seq.init ngram_array_unique.Length (fun i -> (ngram_array_unique.[i], ngram_list |> List.filter(fun item -> item = ngram_array_unique.[i]) |> List.length) )

    Read the article

1