Search Results

Search found 151 results on 7 pages for 'similarity'.


  • Advice on String Similarity Metrics (Java). Distance, sounds like or combo?

    - by andreas
    Hello, part of a process requires applying string similarity algorithms. The results of this process will be stored to produce, let's say, SS_Dataset. Based on this dataset, further decisions will have to be made. My questions are: Should I apply one or more string similarity algorithms to produce SS_Dataset? Are there any comparisons between algorithms that calculate the 'distance' and the 'sounds like' similarity? Does one family of algorithms produce more accurate results than the other? Does a combination give more accurate results on similarity? Can you recommend implementations that you have worked with? My implementation will include packages from the following libraries: http://www.dcs.shef.ac.uk/~sam/simmetrics.html http://jtmt.sourceforge.net/ Regards,


  • Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

    - by seanieb
    I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method I need to use has to be very simple. I want to implement a vanilla version of n-grams (where it is possible to define how many grams to use), along with a simple implementation of tf-idf and cosine similarity. Is there any program that can do this? Or should I start writing this from scratch?
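
One way the pipeline described in this question could be sketched in plain Python is shown below. This is a minimal illustration, not a reference implementation: the function names, the whitespace tokenizer, and the smoothed idf formula are all assumptions made for the example.

```python
import math
from collections import Counter

def word_ngrams(text, n=2):
    """Split text into lowercase word n-grams; n is configurable."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tfidf_vectors(docs, n=2):
    """Return one {ngram: tf-idf weight} dict per document."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    total = len(docs)
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1
        # Smoothed idf (an assumption) so that every weight stays positive.
        vectors.append({g: (tf / length) * (math.log((1 + total) / (1 + df[g])) + 1)
                        for g, tf in c.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

docs = ["the cat sat on the mat",
        "the cat sat on the rug",
        "stock prices fell sharply today"]
vecs = tfidf_vectors(docs, n=1)
score_close = cosine(vecs[0], vecs[1])   # documents sharing most words
score_far = cosine(vecs[0], vecs[2])     # documents sharing no words
```

Because all weights are non-negative, the score always falls between 0 and 1, which matches the requirement stated in the question.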


  • about cosine similarity

    - by jaskirat
    Hi, I am finding the cosine similarity between documents. I did it like this: D1 = (8, 0, 0, 1), where 8, 0, 0, 1 are the tf-idf scores of the terms t1, t2, t3, t4, and D2 = (7, 0, 0, 1). cos(theta) = (56 + 0 + 0 + 1) / sqrt(64 + 49) sqrt(1 + 1), which comes out to cos(theta) = 5. What do I conclude from this value? I don't understand what cos(theta) = 5 signifies about the similarity between them. Am I doing things right? Please reply; I would be thankful.
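
For reference, a cosine can never exceed 1, so a value of 5 points to an arithmetic slip. In the standard formula each norm is taken over one vector's own components (64 + 1 for D1, 49 + 1 for D2) rather than pairing components across the two vectors. A small sketch with the numbers from the question (the function name is just illustrative):

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|), with each norm computed per vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = (8, 0, 0, 1)
d2 = (7, 0, 0, 1)
# dot = 8*7 + 1*1 = 57, |d1| = sqrt(65), |d2| = sqrt(50)
similarity = cosine_similarity(d1, d2)   # 57 / sqrt(3250), just under 1
```

A value this close to 1 says the two vectors point in almost the same direction, i.e. the documents use the same terms in nearly the same proportions.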


  • measuring similarity between documents using jaccard coefficient

    - by jaskirat
    Hi, I am finding the similarity between documents, and to measure it I used the Jaccard coefficient. I did it like this: D1 = (8, 0, 0, 1), where 8, 0, 0, 1 are the tf-idf scores of the terms t1, t2, t3, t4, and D2 = (7, 0, 0, 0). jaccard coefficient = dotproduct(d1,d2) / |d1|+|d2|-dotproduct(d1,d2), and the answer comes out to be "-1.367931". What does it signify about the similarity between the documents? Please do reply. Thank you.
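
The extended Jaccard (Tanimoto) coefficient for real-valued vectors uses the squared norms in the denominator: dot / (|d1|^2 + |d2|^2 - dot). The reported -1.367931 is what comes out when the norms are left unsquared, since 56 / (8.0623 + 7 - 56) is about -1.3679. A short sketch with the question's numbers (the function name is illustrative):

```python
def tanimoto(a, b):
    """Extended Jaccard coefficient: dot / (|a|^2 + |b|^2 - dot)."""
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)
    sq_b = sum(y * y for y in b)
    denom = sq_a + sq_b - dot
    return dot / denom if denom else 0.0

d1 = (8, 0, 0, 1)
d2 = (7, 0, 0, 0)
# dot = 56, |d1|^2 = 65, |d2|^2 = 49, so the result is 56 / 58
coefficient = tanimoto(d1, d2)
```

With non-negative tf-idf scores the coefficient stays between 0 and 1, and 56 / 58 (about 0.97) indicates two very similar documents.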


  • Package to compare LSA, TFIDF, Cosine metrics and Language Models

    - by gouwsmeister
    Hi, I'm looking for a package (any language, really) that I can use on a corpus of 50 documents to perform interdocument similarity testing in various metrics, like tfidf, okapi, language models, lsa, etc. I want as a result a document similarity matrix, i.e. doc1 is x% similar to doc2, etc... This is for research purposes, not for production. I specifically want the doc similarity matrix as I want to correlate this with human ratings. Thank you in advance!


  • Very fast document similarity

    - by peyton
    Hello, I am trying to determine document similarity between a single document and each of a large number of documents (n ~= 1 million) as quickly as possible. More specifically, the documents I'm comparing are e-mails; they are grouped (i.e., there are folders or tags) and I'd like to determine which group is most appropriate for a new e-mail. Fast performance is critical. My a priori assumption is that the cosine similarity between term vectors is appropriate for this application; please comment on whether this is a good measure to use or not!

    I have already taken into account the following possibilities for speeding up performance:

      - Pre-normalize all the term vectors
      - Calculate a term vector for each group (n ~= 10,000) rather than each e-mail (n ~= 1,000,000); this would probably be acceptable for my application, but if you can think of a reason not to do it, let me know!

    I have a few questions:

      - If a new e-mail has a new term never before seen in any of the previous e-mails, does that mean I need to re-compute all of my term vectors? This seems expensive.
      - Is there some clever way to only consider vectors which are likely to be close to the query document?
      - Is there some way to be more frugal about the amount of memory I'm using for all these vectors?

    Thanks!
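
One common way to touch only the vectors that could match is an inverted index from terms to the groups containing them. The sketch below assumes sparse term vectors stored as plain dicts of raw term weights; all names are hypothetical. A term that appears only in the new e-mail matches no index entry and adds nothing to any score, so in this setup the stored vectors do not need to be recomputed for it.

```python
import math
from collections import defaultdict

def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_index(group_vectors):
    """Map each term to the (group, normalized weight) pairs containing it."""
    index = defaultdict(list)
    for group, vec in group_vectors.items():
        for term, weight in normalize(vec).items():
            index[term].append((group, weight))
    return index

def best_group(index, query_vec):
    """Score only groups sharing at least one term with the query."""
    scores = defaultdict(float)
    for term, q_weight in normalize(query_vec).items():
        for group, g_weight in index.get(term, ()):   # unseen terms match nothing
            scores[group] += q_weight * g_weight
    return max(scores, key=scores.get) if scores else None

groups = {"work": {"meeting": 3, "report": 2},
          "family": {"dinner": 2, "birthday": 3}}
index = build_index(groups)
choice = best_group(index, {"meeting": 1, "agenda": 1})   # "agenda" is unseen
```

Storing only non-zero entries also addresses the memory question to some extent, since each group costs space proportional to its number of distinct terms rather than to the vocabulary size.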


  • Find cosine similarity in R

    - by Derek
    I'm wondering if there is a built-in function in R that can find the cosine similarity (or cosine distance) between two arrays? Currently, I have implemented my own function, but I can't help but think that R should already come with one :) Thanks, Derek


  • 'Similarity' in Data Mining

    - by Shailesh Tainwala
    In the field of Data Mining, is there a specific sub-discipline called 'Similarity'? If yes, what does it deal with? Any examples, links, or references will be helpful. Also, being new to the field, I would like the community's opinion on how closely related Data Mining and Artificial Intelligence are. Are they synonyms, or is one a subset of the other? Thanks in advance for sharing your knowledge.


  • Ways to calculate similarity

    - by MarySheen
    Hi, I am building a community website that requires me to calculate the similarity between any two users. Each user is described by the following attributes: age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junkie) and others. Can anyone tell me how to go about this problem or point me to some resources? Thanks, Mary
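
For mixed numeric and categorical attributes like these, one simple option is to score each attribute between 0 and 1 and average the scores (the idea behind Gower-style similarity). The sketch below is only an illustration: the field names, the equal weighting, and the assumed age range of 60 years are all hypothetical choices.

```python
def user_similarity(a, b, age_range=60.0):
    """Average per-attribute similarity; each attribute scores in [0, 1]."""
    scores = []
    # Numeric attribute: 1 minus the normalized absolute difference.
    scores.append(1.0 - min(abs(a["age"] - b["age"]) / age_range, 1.0))
    # Categorical attributes: 1 if the values are equal, 0 otherwise.
    for key in ("skin_type", "hair_type", "lifestyle"):
        scores.append(1.0 if a[key] == b[key] else 0.0)
    return sum(scores) / len(scores)

mary = {"age": 30, "skin_type": "oily", "hair_type": "long", "lifestyle": "active"}
anna = {"age": 33, "skin_type": "oily", "hair_type": "short", "lifestyle": "active"}
# age: 1 - 3/60 = 0.95; skin: 1; hair: 0; lifestyle: 1 -> mean = 0.7375
score = user_similarity(mary, anna)
```

Attributes that matter more for the site could be given larger weights in the average.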


  • Lucene numDocs and docFreq on custom similarity class

    - by David A
    Hi all, I'm building an application with Lucene (I'm a noob with it) and I'm facing some problems. My application uses the Lucene 2.4.0 library with a custom similarity implementation (the jar is imported). In my app I'm calculating docFreq and numDocs manually (I'm adding up the values of all indexes and then calculating a global value in order to use it on every query), and I want to use those values in a custom similarity implementation in order to calculate a new IDF. The problem is that I don't know how to use (or send) the new docFreq and numDocs values from my app in that new similarity implementation, as I don't want to change Lucene's code apart from this extra class. Any suggestions or examples? I read the docs but I don't know how to approach this :s Thanks


  • fast similarity detection

    - by reinierpost
    I have a large collection of objects and I need to figure out the similarities between them. To be exact: given two objects I can compute their dissimilarity as a number, a metric - higher values mean less similarity and 0 means the objects have identical contents. The cost of computing this number is proportional to the size of the smaller object (each object has a given size). I need the ability to quickly find, given an object, the set of objects similar to it. To be exact: I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d, such that listing the objects in the set takes no more time than if they were in an array or linked list (and perhaps they actually are). Typically, the set will be very much smaller than the total number of objects, so it is really worthwhile to perform this computation. It's good enough if the data structure assumes a fixed d, but if it works for an arbitrary d, even better. Have you seen this problem before, or something similar to it? What is a good solution? To be exact: a straightforward solution involves computing the dissimilarities between all pairs of objects, but this is slow - O(n^2) where n is the number of objects. Is there a general solution with lower complexity?
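
If the dissimilarity really is a metric (in particular, if it satisfies the triangle inequality), distances to a fixed pivot object can rule out many candidates without calling the expensive function. The sketch below shows the idea with a single pivot; the class and method names are hypothetical, and tree structures such as VP-trees or BK-trees build on the same principle.

```python
class PivotIndex:
    """Range search that prunes candidates using one precomputed pivot distance."""

    def __init__(self, objects, distance):
        self.objects = list(objects)      # assumed non-empty
        self.distance = distance
        self.pivot = self.objects[0]
        # One expensive distance per object, computed once up front.
        self.to_pivot = [distance(o, self.pivot) for o in self.objects]

    def within(self, query, d):
        """Return all objects whose distance to query is at most d."""
        q = self.distance(query, self.pivot)
        result = []
        for obj, p in zip(self.objects, self.to_pivot):
            # Triangle inequality: distance(query, obj) >= |q - p|.
            if abs(q - p) > d:
                continue                  # cannot be within d; skip the costly call
            if self.distance(query, obj) <= d:
                result.append(obj)
        return result

# Toy example: numbers with absolute difference as the metric.
idx = PivotIndex([1, 5, 9, 20], lambda a, b: abs(a - b))
neighbours = idx.within(6, 3)
```

This still loops over every object, but the loop only compares two numbers; the expensive dissimilarity is computed just for the candidates that survive the filter.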


  • How do you efficiently implement a document similarity search system?

    - by Björn Lindqvist
    How do you implement a "similar items" system for items described by a set of tags? In my database, I have three tables, Article, ArticleTag and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article I want to find the five most similar articles to implement an "if you like this article you will like these too" system. I am familiar with cosine similarity, and using that algorithm works very well. But it is way too slow. For each article, I need to iterate over all articles, calculate the cosine similarity for the article pair and then select the five articles with the highest similarity rating. With 200k articles and 30k tags, it takes me half a minute to calculate the similar articles for a single article. So I need another algorithm that produces roughly as good results as cosine similarity but that can be run in real time and which does not require me to iterate over the whole document corpus each time. Maybe someone can suggest an off-the-shelf solution for this? Most of the search engines I looked at do not enable document similarity searching.
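
One way to avoid scanning the whole corpus is to index articles by tag and score only the articles that share at least one tag with the query article. The sketch below assumes the tag data has been loaded into a dict of sets; the function names are illustrative. For binary tag vectors the cosine similarity reduces to shared / sqrt(|A| * |B|).

```python
import math
from collections import defaultdict

def build_tag_index(article_tags):
    """article_tags: {article_id: set of tags} -> {tag: set of article_ids}."""
    index = defaultdict(set)
    for article, tags in article_tags.items():
        for tag in tags:
            index[tag].add(article)
    return index

def similar_articles(article, article_tags, index, k=5):
    """Return up to k article ids ranked by cosine similarity of tag sets."""
    tags = article_tags[article]
    shared = defaultdict(int)
    # Visit only articles that share at least one tag with the query article.
    for tag in tags:
        for other in index[tag]:
            if other != article:
                shared[other] += 1
    scored = [(count / math.sqrt(len(tags) * len(article_tags[other])), other)
              for other, count in shared.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]

article_tags = {1: {"python", "search"},
                2: {"python", "search", "lucene"},
                3: {"cooking"},
                4: {"python"}}
index = build_tag_index(article_tags)
top = similar_articles(1, article_tags, index)
```

Very common tags can still produce long candidate lists; skipping or down-weighting them is a typical refinement, though whether it is needed depends on the data.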


  • Algorithm to find a measurement of similarity between lists.

    - by Cubed
    Given that I have two lists that each contain a separate subset of a common superset, is there an algorithm to give me a similarity measurement? Example: A = { John, Mary, Kate, Peter } and B = { Peter, James, Mary, Kate }. How similar are these two lists? Note that I do not know all elements of the common superset. Update: I was unclear and I have probably used the word 'set' in a sloppy fashion. My apologies. Clarification: Order is of importance. If identical elements occupy the same position in the list, we have the highest similarity for that element. The similarity decreases the farther apart the identical elements are. The similarity is even lower if the element only exists in one of the lists. I could even add the extra dimension that lower indices are of greater value, so that a[1] == b[1] is worth more than a[9] == b[9], but that is mainly because I am curious.
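
The rules in the clarification can be turned into a score in several ways. The sketch below is one made-up scoring scheme, not an established measure: an element present in both lists contributes 1 / (1 + positional distance), an element present in only one list contributes 0, and the total is averaged over all distinct elements. It assumes neither list contains duplicates.

```python
def positional_similarity(a, b):
    """Score in [0, 1]; 1.0 means the same elements in the same positions."""
    pos_a = {item: i for i, item in enumerate(a)}
    pos_b = {item: i for i, item in enumerate(b)}
    union = set(pos_a) | set(pos_b)
    if not union:
        return 1.0
    total = 0.0
    for item in union:
        if item in pos_a and item in pos_b:
            # Same position gives 1.0; the score decays as positions drift apart.
            total += 1.0 / (1 + abs(pos_a[item] - pos_b[item]))
    return total / len(union)

A = ["John", "Mary", "Kate", "Peter"]
B = ["Peter", "James", "Mary", "Kate"]
# Mary: 1/2, Kate: 1/2, Peter: 1/4, John and James: 0 -> 1.25 / 5
score = positional_similarity(A, B)
```

Giving lower indices more weight, as suggested at the end of the question, could be done by multiplying each contribution by a factor that shrinks with the index.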


  • Document Similarity: Comparing two documents efficiently

    - by seanieb
    I have a loop that calculates the similarity between two documents. It collects all the tokens in a document and their scores, and places them in a dictionary. It then compares the dictionaries. This is what I have so far; it works, but it is super slow:

        # Doc A
        cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0],))
        doca = cursor1.fetchall()
        # convert the rows to a dictionary
        doca_dic = dict((row[0], row[1]) for row in doca)

        # Doc B
        cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0],))
        docb = cursor2.fetchall()
        # convert the rows to a dictionary
        docb_dic = dict((row[0], row[1]) for row in docb)

        # loop through each token in doca and see if one matches in docb
        for x in doca_dic:
            if x in docb_dic:
                # calculate the similarity by summing the products of the tf-idf_norm
                similarity += doca_dic[x] * docb_dic[x]

        print("similarity")
        print(similarity)

    I'm pretty new to Python, hence this mess. I need to speed it up; any help would be appreciated. Thanks.


  • Detecting similar words among n text documents

    - by javanes
    Hi, I have n documents and want to find common words that are included in these documents. For example, I want to be able to say that (n-3) documents include the word "web". Certainly I can do this with basic data structures, but there may be an efficient algorithm, or a way to handle the same word with different suffixes. Is there any algorithm for such purposes? I am unfamiliar with the data mining world. In general, is there a term used for the effort of finding similarities between different documents? If there is, it will make my research easier. Thanks.
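
Counting how many documents contain each word is usually called document frequency, and mapping words with different suffixes to one form is called stemming. The sketch below combines the two; the suffix list is a deliberately crude stand-in for a real stemmer (such as the Porter stemmer) and is only an assumption for illustration.

```python
import re
from collections import Counter

SUFFIXES = ("ing", "ed", "es", "s")   # illustrative only, not a real stemmer

def crude_stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def document_frequency(docs):
    """Return {stem: number of documents containing it}."""
    df = Counter()
    for doc in docs:
        words = re.findall(r"[a-z]+", doc.lower())
        # A set, so each stem is counted at most once per document.
        df.update({crude_stem(w) for w in words})
    return df

docs = ["Web design basics", "Designing for the web", "Cooking pasta"]
df = document_frequency(docs)   # e.g. df["web"] counts documents mentioning "web"
```

Words whose count is close to n are the ones shared by nearly all documents.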


  • cosine similarity problem

    - by jaskirat
    Hi, I have calculated the tf-idf values of the terms of document 1 and document 2. Now I don't know how to use these tf-idf values. Basically I want to find the similarity between two documents (in my case, web pages). Can anybody tell me how to implement cosine similarity and the Jaccard coefficient to find the similarity? C# code would be appreciated. Please help. Thanks


  • SEO - page similarity percentage

    - by user1360479
    Using Similar Page Checker (http://www.webconfs.com/similar-page-checker.php) you can check whether a website is similar to another one. Is there any rule of thumb for how high a percentage is accepted by Google? That is, at what point does Google consider a page too similar to another one and not index it? I have two pages within the same domain where the "how to order" information is similar, and that's why the percentage is about 70. Thanks


  • Use different Solr Similarity algo for every search

    - by snickernet
    Hi guys, is it possible in Solr 1.4 to specify which similarity class to use for every search within a single index? Let's say I have 2 types of search (keyword and brand). For keyword search, I want to use the DefaultSimilarity class. But for brand search, I want to use my CustomSimilarity class. I've been modifying the schema.xml to specify a single similarity class to use, but I now have this requirement to use 2 different similarity classes. I'll be glad to hear your thoughts on this. Thanks in advance.


  • java cosine similarity problem

    - by agazerboy
    Hi again :) I developed a Java program to calculate cosine similarity on the basis of TF*IDF. It worked very well, but there is one problem :( For example, if I have the following two matrices and I want to calculate the cosine similarity, it does not work because the rows are not the same length: doc 1: 1 2 3 4 5 6; doc 2: 1 2 3 4 5 6 7 8 5 2 4 9. If the rows and columns are the same length then my program works very well, but it does not if the rows and columns are not the same length. Any tips?
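
Cosine similarity needs both vectors to live in the same term space, so a usual approach is to key the weights by term and treat a term missing from one document as weight 0. The question concerns Java; the sketch below uses Python only to show the idea, and it assumes the TF*IDF weights are available as {term: weight} dicts (the names are hypothetical).

```python
import math

def cosine_by_term(doc_a, doc_b):
    """doc_a, doc_b: {term: weight}; terms absent from a document count as 0."""
    vocabulary = sorted(set(doc_a) | set(doc_b))
    # Align both documents on the shared vocabulary so the lengths match.
    a = [doc_a.get(term, 0.0) for term in vocabulary]
    b = [doc_b.get(term, 0.0) for term in vocabulary]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc1 = {"cat": 1.0, "dog": 2.0}
doc2 = {"cat": 1.0, "dog": 2.0, "fish": 2.0}
# dot = 5, |doc1| = sqrt(5), |doc2| = 3
result = cosine_by_term(doc1, doc2)
```

Simply padding the shorter row with zeros gives the same result only when position i refers to the same term in both documents.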


  • similarity between strings - sql server 2005

    - by csetzkorn
    Hi, I am looking for a simple way (UDF?) to establish the similarity between strings. The SOUNDEX and DIFFERENCE functions do not seem to do the job. Similarity should be based on the number of characters in common (order matters). For example: Spiruroidea sp. AM-2008 and Spiruroidea gen. sp. AM-2008 should be recognised as similar. Any pointers would be very much appreciated. Thanks. Christian
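
To illustrate the kind of measure described (characters in common, with order taken into account), here is a small Python sketch using the standard library's `difflib.SequenceMatcher`. It is not T-SQL and is not a SQL Server UDF; it only shows how such a score behaves on the example strings, and the wrapper function name is an assumption.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Ratio in [0, 1] based on matching character blocks, so order matters."""
    return SequenceMatcher(None, a, b).ratio()

score = string_similarity("Spiruroidea sp. AM-2008",
                          "Spiruroidea gen. sp. AM-2008")
unrelated = string_similarity("abc", "xyz")
```

The two species names score close to 1 because the second string is the first with a short insertion, while strings with no characters in common score 0.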


  • Best similarity metric for collaborative filtering?

    - by allclaws
    I'm trying to decide on the best similarity metric for a product recommendation system using item-based collaborative filtering. This is a shopping basket scenario where ratings are binary valued - the user has either purchased an item or not - there is no explicit rating system (eg, 5-stars). Step 1 is to compute item-to-item similarity, though I want to look at incorporating more features later on. Is the Tanimoto coefficient the best way to go for binary values? Or are there other metrics that are appropriate here? Thanks.
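
For binary purchase data the Tanimoto coefficient is the same as the Jaccard index on sets: the number of users who bought both items divided by the number who bought either. A minimal sketch of the item-to-item step, assuming baskets are available as a dict of sets (all names are hypothetical):

```python
def tanimoto_items(buyers_a, buyers_b):
    """|A intersect B| / |A union B| over the sets of users who bought each item."""
    union = buyers_a | buyers_b
    return len(buyers_a & buyers_b) / len(union) if union else 0.0

def item_similarities(baskets):
    """baskets: {user: set of items} -> {(item_i, item_j): similarity > 0}."""
    buyers = {}
    for user, items in baskets.items():
        for item in items:
            buyers.setdefault(item, set()).add(user)
    names = sorted(buyers)
    sims = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            s = tanimoto_items(buyers[a], buyers[b])
            if s > 0:
                sims[(a, b)] = s
    return sims

baskets = {"u1": {"tea", "milk"},
           "u2": {"tea", "milk", "jam"},
           "u3": {"tea"}}
sims = item_similarities(baskets)   # keys are alphabetically ordered item pairs
```

The all-pairs loop is fine for a small catalogue; for a large one, only pairs that actually co-occur in some basket would need to be scored.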


  • N-gram split function for string similarity comparison

    - by Michael
    As part of an exercise to better understand F#, which I am currently learning, I wrote a function to split a given string into n-grams.

    1) I would like to receive feedback about my function: can this be written in a simpler or more efficient way?
    2) My overall goal is to write a function that returns string similarity (on a 0.0 .. 1.0 scale) based on n-gram similarity. Does this approach work well for short string comparisons, or can this method reliably be used to compare large strings (like articles, for example)?
    3) I am aware of the fact that n-gram comparisons ignore the context of the two strings. What method would you suggest to accomplish my goal?

        //s:string - target string to split into n-grams
        //n:int - n-gram size to split string into
        let ngram_split (s:string, n:int) =
            let ngram_count = s.Length - (s.Length % n)
            let ngram_list =
                List.init ngram_count (fun i ->
                    if( i + n >= s.Length ) then
                        s.Substring(i,s.Length - i) + String.init ((i + n) - s.Length) (fun i -> "#")
                    else
                        s.Substring(i,n)
                )
            let ngram_array_unique =
                ngram_list |> Seq.ofList |> Seq.distinct |> Array.ofSeq
            //produce tuples of ngrams (ngram string,how much occurrences in original string)
            Seq.init ngram_array_unique.Length (fun i ->
                (ngram_array_unique.[i],
                 ngram_list |> List.filter(fun item -> item = ngram_array_unique.[i]) |> List.length)
            )
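
For the overall goal in point 2, one common way to get a 0.0 .. 1.0 score from n-grams is the Dice coefficient over the two n-gram multisets. The sketch below is in Python rather than F#, and it uses overlapping (sliding-window) character n-grams without padding, which is an assumption and differs from the splitting done by the function in the question.

```python
from collections import Counter

def char_ngrams(s, n=2):
    """Overlapping character n-grams, counted with multiplicity."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_similarity(a, b, n=2):
    """Dice coefficient over n-gram multisets, in the range 0.0 .. 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both strings are shorter than n; fall back to exact comparison.
        return 1.0 if a == b else 0.0
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / total

# "night" -> ni, ig, gh, ht; "nacht" -> na, ac, ch, ht; one shared bigram
score = ngram_similarity("night", "nacht")   # 2 * 1 / 8
```

The same function works on long texts, although for articles word-level n-grams are often a more natural unit than characters.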


  • similarity match

    - by csetzkorn
    Many search engines have 'did you mean' functionality. Is there a simple way to use (N)Hibernate (e.g. ICriteria) to find an entity (e.g. keyword) based on similarity? Please note that I do not mean Expression.Like or something like this. I hope this question makes sense. Thanks. Christian

