Search Results

Search found 6 results on 1 pages for 'jaro'.

Page 1/1 | 1 

  • Optimizing Jaro-Winkler algorithm

    - by Pentium10
    I have this code for Jaro-Winkler algorithm taken from this website. I need to run 150,000 times to get distance between differences. It takes a long time, as I run on an Android mobile device. Can it be optimized more? public class Jaro { /** * gets the similarity of the two strings using Jaro distance. * * @param string1 the first input string * @param string2 the second input string * @return a value between 0-1 of the similarity */ public float getSimilarity(final String string1, final String string2) { //get half the length of the string rounded up - (this is the distance used for acceptable transpositions) final int halflen = ((Math.min(string1.length(), string2.length())) / 2) + ((Math.min(string1.length(), string2.length())) % 2); //get common characters final StringBuffer common1 = getCommonCharacters(string1, string2, halflen); final StringBuffer common2 = getCommonCharacters(string2, string1, halflen); //check for zero in common if (common1.length() == 0 || common2.length() == 0) { return 0.0f; } //check for same length common strings returning 0.0f is not the same if (common1.length() != common2.length()) { return 0.0f; } //get the number of transpositions int transpositions = 0; int n=common1.length(); for (int i = 0; i < n; i++) { if (common1.charAt(i) != common2.charAt(i)) transpositions++; } transpositions /= 2.0f; //calculate jaro metric return (common1.length() / ((float) string1.length()) + common2.length() / ((float) string2.length()) + (common1.length() - transpositions) / ((float) common1.length())) / 3.0f; } /** * returns a string buffer of characters from string1 within string2 if they are of a given * distance seperation from the position in string1. * * @param string1 * @param string2 * @param distanceSep * @return a string buffer of characters from string1 within string2 if they are of a given * distance seperation from the position in string1 */ private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) { //create a return buffer of characters final StringBuffer returnCommons = new StringBuffer(); //create a copy of string2 for processing final StringBuffer copy = new StringBuffer(string2); //iterate over string1 int n=string1.length(); int m=string2.length(); for (int i = 0; i < n; i++) { final char ch = string1.charAt(i); //set boolean for quick loop exit if found boolean foundIt = false; //compare char with range of characters to either side for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) { //check if found if (copy.charAt(j) == ch) { foundIt = true; //append character found returnCommons.append(ch); //alter copied string2 for processing copy.setCharAt(j, (char)0); } } } return returnCommons; } } I mention that in the whole process I make just instance of the script, so only once jaro= new Jaro(); If you are going to test and need examples so not break the script, you will find it here, in another thread for python optimization.

    Read the article

  • Desktop Fun: Add New Theme Packs to Windows 7

    - by Asian Angel
    One of the wonderful things about Windows 7 is the availability of new themes and with more becoming available each month there are plenty to choose from. Join us as we take a look at sampler set of the great themes that you can download for your system. For the themes shown here we have included a full-screen image and a screenshot showing the wallpapers that are available with each theme. Once you have downloaded the themes simply double click on the theme-pack file to install them. Note: The system “text size and sound schemes” will vary slightly from theme to theme. Cats Anytime Dogs in Summer Tigers Ceske jaro (Czech Spring) Brazil Lugares Coloridos Latvian Nature Srpska priroda (Serbian Nature) Bicycle Ride around Taiwan Bing’s Best Avatar Zune Characters Conclusion If you are looking for an easy way to add some beautiful variety to your Windows 7 installation then head on over to the Microsoft website…you just might find that perfect theme waiting for your computer. Links Windows 7 Themes at Microsoft Ceske jaro (Czech Spring) at Softpedia Similar Articles Productive Geek Tips Windows 7 Welcome Screen Taking Forever? Here’s the Fix (Maybe)Unofficial Windows XP Themes Created by MicrosoftSweet Black Theme for FirefoxDownload New Themes in Windows 7Sweet Black Theme for Windows XP TouchFreeze Alternative in AutoHotkey The Icy Undertow Desktop Windows Home Server – Backup to LAN The Clear & Clean Desktop Use This Bookmarklet to Easily Get Albums Use AutoHotkey to Assign a Hotkey to a Specific Window Latest Software Reviews Tinyhacker Random Tips DVDFab 6 Revo Uninstaller Pro Registry Mechanic 9 for Windows PC Tools Internet Security Suite 2010 Chitika iPad Labs Gives Live iPad Sale Stats Heaven & Hell Finder Icon Using TrueCrypt to Secure Your Data Quickly Schedule Meetings With NeedtoMeet Share Flickr Photos On Facebook Automatically Are You Blocked On Gtalk? Find out

    Read the article

  • MongoDB - how to join parent and child products by reference

    - by Jaro
    my mongo collection stores products. There are two product types: child and parent. Parent product holds array of its child as reference. Use case: use mydb; child1 = { _id: 1, name: "Child 1", is_child: true, is_parent: false, children : [] } child2 = { _id: 2, name: "Child 2", is_child: true, is_parent: false, children : [] } parent = { _id: 3, name: "Parent product", is_child: false, is_parent: true, children : [1, 2] } db.product.insert( [child1, child2, parent] ); And I'm looking for any query returning { _id: 3, name: "Parent product", is_child: false, is_parent: true, children: [ { _id: 1, name: "Child 1", is_child: true, is_parent: false, children : [] }, { _id: 2, name: "Child 2", is_child: true, is_parent: false, children : [] } ] } I'm newbie to mongodb, but I guess an usage of map-reduce could solve the problem. Can anyone advice? Thx

    Read the article

  • Python performance improvement request for winkler

    - by Martlark
    I'm a python n00b and I'd like some suggestions on how to improve the algorithm to improve the performance of this method to compute the Jaro-Winkler distance of two names. def winklerCompareP(str1, str2): """Return approximate string comparator measure (between 0.0 and 1.0) USAGE: score = winkler(str1, str2) ARGUMENTS: str1 The first string str2 The second string DESCRIPTION: As described in 'An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler and Yves Thibaudeau. Based on the 'jaro' string comparator, but modifies it according to whether the first few characters are the same or not. """ # Quick check if the strings are the same - - - - - - - - - - - - - - - - - - # jaro_winkler_marker_char = chr(1) if (str1 == str2): return 1.0 len1 = len(str1) len2 = len(str2) halflen = max(len1,len2) / 2 - 1 ass1 = '' # Characters assigned in str1 ass2 = '' # Characters assigned in str2 #ass1 = '' #ass2 = '' workstr1 = str1 workstr2 = str2 common1 = 0 # Number of common characters common2 = 0 #print "'len1', str1[i], start, end, index, ass1, workstr2, common1" # Analyse the first string - - - - - - - - - - - - - - - - - - - - - - - - - # for i in range(len1): start = max(0,i-halflen) end = min(i+halflen+1,len2) index = workstr2.find(str1[i],start,end) #print 'len1', str1[i], start, end, index, ass1, workstr2, common1 if (index > -1): # Found common character common1 += 1 #ass1 += str1[i] ass1 = ass1 + str1[i] workstr2 = workstr2[:index]+jaro_winkler_marker_char+workstr2[index+1:] #print "str1 analyse result", ass1, common1 #print "str1 analyse result", ass1, common1 # Analyse the second string - - - - - - - - - - - - - - - - - - - - - - - - - # for i in range(len2): start = max(0,i-halflen) end = min(i+halflen+1,len1) index = workstr1.find(str2[i],start,end) #print 'len2', str2[i], start, end, index, ass1, workstr1, common2 if (index > -1): # Found common character common2 += 1 #ass2 += str2[i] ass2 = ass2 + str2[i] workstr1 = workstr1[:index]+jaro_winkler_marker_char+workstr1[index+1:] if (common1 != common2): print('Winkler: Wrong common values for strings "%s" and "%s"' % \ (str1, str2) + ', common1: %i, common2: %i' % (common1, common2) + \ ', common should be the same.') common1 = float(common1+common2) / 2.0 ##### This is just a fix ##### if (common1 == 0): return 0.0 # Compute number of transpositions - - - - - - - - - - - - - - - - - - - - - # transposition = 0 for i in range(len(ass1)): if (ass1[i] != ass2[i]): transposition += 1 transposition = transposition / 2.0 # Now compute how many characters are common at beginning - - - - - - - - - - # minlen = min(len1,len2) for same in range(minlen+1): if (str1[:same] != str2[:same]): break same -= 1 if (same > 4): same = 4 common1 = float(common1) w = 1./3.*(common1 / float(len1) + common1 / float(len2) + (common1-transposition) / common1) wn = w + same*0.1 * (1.0 - w) return wn

    Read the article

  • Please help fix and optimize this query

    - by user607217
    I am working on a system to find potential duplicates in our customers table (SQL 2005). I am using the built-in SOUNDEX value that our software computes when customers are added/updated, but I also implemented the double metaphone algorithm for better matching. This is the most-nested query I have created, and I can't help but think there is a better way to do it and I'd like to learn. In the inner-most query I am joining the customer table to the metaphone table I created, then finding customers that have identical pKey (primary phonetic key). I take that, union that with customers that have matching soundex values, and then proceed to score those matches with various text similarity functions. This is currently working, but I would also like to add a union of customers whose aKey (alternate phonetic key) match. This would be identical to "QUERY A" except to substitute on (c1Akey = c2Akey) for the join. However, when I attempt to include that, I get errors when I try to execute my query. Here is the code: --Create aggregate ranking select c1Name, c2Name, nDiff, c1Addr, c2Addr, aDiff, c1SSN, c2SSN, sDiff, c1DOB, c2DOB, dDiff, nDiff+aDiff+dDiff+sDiff as Score ,(sDiff+dDiff)*1.5 + (nDiff+dDiff)*1.5 + (nDiff+sDiff)*1.5 + aDiff *.5 + nDiff *.5 as [Rank] FROM ( --Create match scores for different fields SELECT c1Name, c2Name, c1Addr, c2Addr, c1SSN, c2SSN, c1LTD, c2LTD, c1DOB, c2DOB, dbo.Jaro(c1name, c2name) AS nDiff, dbo.JaroWinkler(c1addr, c2addr) AS aDiff, CASE WHEN c1dob = '1901-01-01' OR c2dob = '1901-01-01' OR c1dob = '1800-01-01' OR c2dob = '1800-01-01' THEN .5 ELSE dbo.SmithWaterman(c1dob, c2dob) END AS dDiff, CASE WHEN c1ssn = '000-00-0000' OR c2ssn = '000-00-0000' THEN .5 ELSE dbo.Jaro(c1ssn, c2ssn) END AS sDiff FROM -- Generate list of possible matches based on multiple phonetic matching fields ( select * from -- List of similar names from pKey field of ##Metaphone table --QUERY A BEGIN (select customers.custno as c1Custno, name as c1Name, haddr as c1Addr, ssn as c1SSN, lasttripdate as c1LTD, dob as c1DOB, soundex as c1Soundex, pkey as c1Pkey, akey as c1Akey from Customers WITH (nolock) join ##Metaphone on customers.custno = ##Metaphone.custno) as c1 JOIN (select customers.custno as c2Custno, name as c2Name, haddr as c2Addr, ssn as c2SSN, lasttripdate as c2LTD, dob as c2DOB, soundex as c2Soundex, pkey as c2Pkey, akey as c2Akey from Customers with (nolock) join ##Metaphone on customers.custno = ##Metaphone.custno) as c2 on (c1Pkey = c2Pkey) and (c1Custno < c2Custno) WHERE (c1Name <> 'PARENT, GUARDIAN') and c1soundex != c2soundex --QUERY A END union --List of similar names from pregenerated SOUNDEX field (select t1.custno, t1.name, t1.haddr, t1.ssn, t1.lasttripdate, t1.dob, t1.[soundex], 0, 0, t2.custno, t2.name, t2.haddr, t2.ssn, t2.lasttripdate, t2.dob, t2.[soundex], 0, 0 from Customers t1 WITH (nolock) join customers t2 with (nolock) on t1.[soundex] = t2.[soundex] and t1.custno < t2.custno where (t1.name <> 'PARENT, GUARDIAN')) ) as a ) as b where (sDiff+dDiff)*1.5 + (nDiff+dDiff)*1.5 + (nDiff+sDiff)*1.5 + aDiff *.5 + nDiff *.5 >= 7.5 order by [rank] desc, score desc Previously, I was using joins such as on c1.pkey = c2.pkey or c1.akey = c2.akey or c1.soundex = c2.soundex but the performance was horrendous, and using unions seems to be working a lot better. Out of 103K customers, tt is currently generating a list of 8.5M potential matches (based on the phonetic codes) in 2.25 minutes, and then taking another 2 to score, rank and filter those down to about 3000. So I am happy with the performance, I just can't help but think there is a better way to structure this, and I need help adding the extra union condition. Thanks!

    Read the article

1