fast similarity detection

Posted by reinierpost on Stack Overflow
Published on 2009-12-11T16:07:49Z Indexed on 2010/04/08 22:53 UTC

Filed under: similarity | METRIC | time-complexity | algorithm-design

I have a large collection of objects and I need to figure out the similarities between them.

To be exact: given two objects, I can compute their dissimilarity as a number, a metric: higher values mean less similarity, and 0 means the objects have identical contents. The cost of computing this number is proportional to the size of the smaller object (each object has a given size).
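If "metric" is meant in the usual mathematical sense (an assumption; the question does not spell it out), the dissimilarity, written here as $\delta$ (my notation), would satisfy the following for all objects $a$, $b$, $c$, with $|a|$ denoting the size of $a$:

```latex
\begin{aligned}
\delta(a, b) &\ge 0 \\
\delta(a, b) &= 0 \iff a \text{ and } b \text{ have identical contents} \\
\delta(a, b) &= \delta(b, a) \\
\delta(a, c) &\le \delta(a, b) + \delta(b, c) \\
\text{cost of evaluating } \delta(a, b) &= \Theta\bigl(\min(|a|, |b|)\bigr)
\end{aligned}
```

The fourth line, the triangle inequality, is the property that lets a known distance bound an unknown one without computing it.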

I need the ability to quickly find, given an object, the set of objects similar to it.

To be exact: I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d, such that listing the objects in the set takes no more time than if they were in an array or linked list (and perhaps they actually are). Typically, the set will be very much smaller than the total number of objects, so it is really worthwhile to perform this computation. It's good enough if the data structure assumes a fixed d, but if it works for an arbitrary d, even better.
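Stated symbolically, as a sketch only: $S$ for the collection, $\delta$ for the dissimilarity and $N_d$ for the required mapping are my notation, not part of the question.

```latex
N_d(o) = \{\, x \in S : \delta(o, x) \le d \,\},
\qquad
\text{time to list } N_d(o) = O\bigl(|N_d(o)|\bigr),
\qquad
|N_d(o)| \ll |S| \text{ typically.}
```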

Have you seen this problem before, or something similar to it? What is a good solution?

To be exact: a straightforward solution involves computing the dissimilarities between all pairs of objects, but this is slow - O(n²) where n is the number of objects. Is there a general solution with lower complexity?
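For reference, the straightforward all-pairs approach could look like the sketch below, for a fixed d. The names `build_neighbor_map` and `hamming` are hypothetical, and Hamming distance on strings is only a stand-in for the real dissimilarity function, which the question leaves unspecified.

```python
from itertools import combinations


def build_neighbor_map(objects, dissimilarity, d):
    """Map each object's index to the indices of the other objects within d.

    An object is trivially within d of itself, so it is left out of its own list.
    """
    neighbors = {i: [] for i in range(len(objects))}
    # Symmetry means each unordered pair is evaluated once:
    # n * (n - 1) / 2 calls, which is still O(n^2).
    for i, j in combinations(range(len(objects)), 2):
        if dissimilarity(objects[i], objects[j]) <= d:
            neighbors[i].append(j)
            neighbors[j].append(i)
    return neighbors


def hamming(a, b):
    """Stand-in metric for the demo: Hamming distance on equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


words = ["karolin", "kathrin", "karolim", "kerstin"]
neighbor_map = build_neighbor_map(words, hamming, d=3)
print(neighbor_map[0])  # indices of the words within distance 3 of "karolin"
```

Once built, listing the objects similar to a given one is a plain list traversal, which meets the lookup requirement; the quadratic number of dissimilarity evaluations during construction is the part the question hopes to avoid.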

© Stack Overflow or respective owner
