Search Results

Search found 10 results on 1 pages for 'jaro winkler'.

Page 1/1 | 1 

  • Optimizing Jaro-Winkler algorithm

    - by Pentium10
    I have this code for Jaro-Winkler algorithm taken from this website. I need to run 150,000 times to get distance between differences. It takes a long time, as I run on an Android mobile device. Can it be optimized more? public class Jaro { /** * gets the similarity of the two strings using Jaro distance. * * @param string1 the first input string * @param string2 the second input string * @return a value between 0-1 of the similarity */ public float getSimilarity(final String string1, final String string2) { //get half the length of the string rounded up - (this is the distance used for acceptable transpositions) final int halflen = ((Math.min(string1.length(), string2.length())) / 2) + ((Math.min(string1.length(), string2.length())) % 2); //get common characters final StringBuffer common1 = getCommonCharacters(string1, string2, halflen); final StringBuffer common2 = getCommonCharacters(string2, string1, halflen); //check for zero in common if (common1.length() == 0 || common2.length() == 0) { return 0.0f; } //check for same length common strings returning 0.0f is not the same if (common1.length() != common2.length()) { return 0.0f; } //get the number of transpositions int transpositions = 0; int n=common1.length(); for (int i = 0; i < n; i++) { if (common1.charAt(i) != common2.charAt(i)) transpositions++; } transpositions /= 2.0f; //calculate jaro metric return (common1.length() / ((float) string1.length()) + common2.length() / ((float) string2.length()) + (common1.length() - transpositions) / ((float) common1.length())) / 3.0f; } /** * returns a string buffer of characters from string1 within string2 if they are of a given * distance seperation from the position in string1. * * @param string1 * @param string2 * @param distanceSep * @return a string buffer of characters from string1 within string2 if they are of a given * distance seperation from the position in string1 */ private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) { //create a return buffer of characters final StringBuffer returnCommons = new StringBuffer(); //create a copy of string2 for processing final StringBuffer copy = new StringBuffer(string2); //iterate over string1 int n=string1.length(); int m=string2.length(); for (int i = 0; i < n; i++) { final char ch = string1.charAt(i); //set boolean for quick loop exit if found boolean foundIt = false; //compare char with range of characters to either side for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) { //check if found if (copy.charAt(j) == ch) { foundIt = true; //append character found returnCommons.append(ch); //alter copied string2 for processing copy.setCharAt(j, (char)0); } } } return returnCommons; } } I mention that in the whole process I make just instance of the script, so only once jaro= new Jaro(); If you are going to test and need examples so not break the script, you will find it here, in another thread for python optimization.

    Read the article

  • Python performance improvement request for winkler

    - by Martlark
    I'm a python n00b and I'd like some suggestions on how to improve the algorithm to improve the performance of this method to compute the Jaro-Winkler distance of two names. def winklerCompareP(str1, str2): """Return approximate string comparator measure (between 0.0 and 1.0) USAGE: score = winkler(str1, str2) ARGUMENTS: str1 The first string str2 The second string DESCRIPTION: As described in 'An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler and Yves Thibaudeau. Based on the 'jaro' string comparator, but modifies it according to whether the first few characters are the same or not. """ # Quick check if the strings are the same - - - - - - - - - - - - - - - - - - # jaro_winkler_marker_char = chr(1) if (str1 == str2): return 1.0 len1 = len(str1) len2 = len(str2) halflen = max(len1,len2) / 2 - 1 ass1 = '' # Characters assigned in str1 ass2 = '' # Characters assigned in str2 #ass1 = '' #ass2 = '' workstr1 = str1 workstr2 = str2 common1 = 0 # Number of common characters common2 = 0 #print "'len1', str1[i], start, end, index, ass1, workstr2, common1" # Analyse the first string - - - - - - - - - - - - - - - - - - - - - - - - - # for i in range(len1): start = max(0,i-halflen) end = min(i+halflen+1,len2) index = workstr2.find(str1[i],start,end) #print 'len1', str1[i], start, end, index, ass1, workstr2, common1 if (index > -1): # Found common character common1 += 1 #ass1 += str1[i] ass1 = ass1 + str1[i] workstr2 = workstr2[:index]+jaro_winkler_marker_char+workstr2[index+1:] #print "str1 analyse result", ass1, common1 #print "str1 analyse result", ass1, common1 # Analyse the second string - - - - - - - - - - - - - - - - - - - - - - - - - # for i in range(len2): start = max(0,i-halflen) end = min(i+halflen+1,len1) index = workstr1.find(str2[i],start,end) #print 'len2', str2[i], start, end, index, ass1, workstr1, common2 if (index > -1): # Found common character common2 += 1 #ass2 += str2[i] ass2 = ass2 + str2[i] workstr1 = workstr1[:index]+jaro_winkler_marker_char+workstr1[index+1:] if (common1 != common2): print('Winkler: Wrong common values for strings "%s" and "%s"' % \ (str1, str2) + ', common1: %i, common2: %i' % (common1, common2) + \ ', common should be the same.') common1 = float(common1+common2) / 2.0 ##### This is just a fix ##### if (common1 == 0): return 0.0 # Compute number of transpositions - - - - - - - - - - - - - - - - - - - - - # transposition = 0 for i in range(len(ass1)): if (ass1[i] != ass2[i]): transposition += 1 transposition = transposition / 2.0 # Now compute how many characters are common at beginning - - - - - - - - - - # minlen = min(len1,len2) for same in range(minlen+1): if (str1[:same] != str2[:same]): break same -= 1 if (same > 4): same = 4 common1 = float(common1) w = 1./3.*(common1 / float(len1) + common1 / float(len2) + (common1-transposition) / common1) wn = w + same*0.1 * (1.0 - w) return wn

    Read the article

  • New Cloud Security Book: Securing the Cloud by Vic Winkler

    - by user12608550
    It's rare that I read a technical book straight through; I usually read key chapters and save the rest for later reference. But Winkler's book, written by an accomplished and highly experienced security professional, was worth a complete read, cover to cover. Of the recently published cloud security books, such as... Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance, by Tim Mather, Subra Kumaraswamy, and Shahed Latif; O'Reilly Media Inc, 2009; Cloud Computing: Implementation, Management, and Security, by John Rittenhouse and James Ransome; CRC Press 2010; Cloud Security: A Comprehensive Guide to Secure Cloud Computing, by Ronald Krutz and Russell Vines; Wiley Publishing Inc, 2010 ...Securing the Cloud is the most useful and informative about all aspects of cloud security. Clearly, through his experience, the author has thought through many practical issues of securing large, virtualized IT installations. His Chapter 6 on Best Practices and Chapter 9 with its valuable checklists are worth the price of the book. If you are among the many new cloud computing professionals, Securing the Cloud is an essential reference for your work.

    Read the article

  • MVC: Why put the business logic in the model? What happens when I've multiple types of storage?

    - by Steffen Winkler
    I always thought that the business logic has to be in the controller and that the controller, since it is the 'middle' part, stays static and that the model/view have to be capsuled via interfaces, that way you could change the business logic without affecting anything else, program multiple Models (one for each database/type of storage) and a dozens of views (for different platforms for example). Now I read in this question that you should always put the business logic into the model and that the controller is deeply connected with the view. To me, that doesn't really make sense and implies that each time I want to have the means of supporting another database/type of storage I've to rewrite my whole model including the business logic. And if I want another view, I've to rewrite both the view and the controller. May someone explain why that is or if I went wrong somewhere? Currently, that whole thing doesn't really make sense to me.

    Read the article

  • Directory directive: AuthType None but still need an AuthProvider?

    - by Steffen Winkler
    For now I just need the server to let me download files from one specific folder (in my case I chose /opt/myFolder for that task) Distribution is Debian 6.0 *edit_start* Apache version is 2.4, according to their official documentation, the Order/Allow clauses are deprecated and should not be used anymore I'm an idiot: Apache version is 2.2. *edit_end* My directory directives in apache2.conf look like this: <IfModule dir_module> DirectoryIndex index.html index.htm index.php </IfModule> ServerRoot "/etc/apache2" DocumentRoot "/opt/myFolder" <Directory /> Options FollowSymLinks AuthType None AllowOverride None Require all denie </Directory> <Directory "/opt/myFolder/*"> Options FollowSymLinks MultiViews AllowOverride None AuthType None Require all allow </Directory> When I try to access a file inside that folder (http://myserver.de/aTestFile.zip) I get an Internal Server Error. Also Apache writes the following error into it's log: configuration error: couldn't check user. Check your authn provider!: /aTestFile.zip Why would I need an authn provider if I don't want any authentication? Also I hope someone can explain to me what kind of AuthenticationProvider I'd need for that. Everytime I search for those things I get pointed at people asking how to protect files/directories with passwords or restrict access to some IP addresses, which doesn't really help me. ok, since I've Apache version 2.2, here is the error I get when using the Order/Deny/Allow commands instead of AuthType/Require: Invalid command 'Order', perhaps misspelled or defined by a module not included in the server configuration.

    Read the article

  • Desktop Fun: Add New Theme Packs to Windows 7

    - by Asian Angel
    One of the wonderful things about Windows 7 is the availability of new themes and with more becoming available each month there are plenty to choose from. Join us as we take a look at sampler set of the great themes that you can download for your system. For the themes shown here we have included a full-screen image and a screenshot showing the wallpapers that are available with each theme. Once you have downloaded the themes simply double click on the theme-pack file to install them. Note: The system “text size and sound schemes” will vary slightly from theme to theme. Cats Anytime Dogs in Summer Tigers Ceske jaro (Czech Spring) Brazil Lugares Coloridos Latvian Nature Srpska priroda (Serbian Nature) Bicycle Ride around Taiwan Bing’s Best Avatar Zune Characters Conclusion If you are looking for an easy way to add some beautiful variety to your Windows 7 installation then head on over to the Microsoft website…you just might find that perfect theme waiting for your computer. Links Windows 7 Themes at Microsoft Ceske jaro (Czech Spring) at Softpedia Similar Articles Productive Geek Tips Windows 7 Welcome Screen Taking Forever? Here’s the Fix (Maybe)Unofficial Windows XP Themes Created by MicrosoftSweet Black Theme for FirefoxDownload New Themes in Windows 7Sweet Black Theme for Windows XP TouchFreeze Alternative in AutoHotkey The Icy Undertow Desktop Windows Home Server – Backup to LAN The Clear & Clean Desktop Use This Bookmarklet to Easily Get Albums Use AutoHotkey to Assign a Hotkey to a Specific Window Latest Software Reviews Tinyhacker Random Tips DVDFab 6 Revo Uninstaller Pro Registry Mechanic 9 for Windows PC Tools Internet Security Suite 2010 Chitika iPad Labs Gives Live iPad Sale Stats Heaven & Hell Finder Icon Using TrueCrypt to Secure Your Data Quickly Schedule Meetings With NeedtoMeet Share Flickr Photos On Facebook Automatically Are You Blocked On Gtalk? Find out

    Read the article

  • MongoDB - how to join parent and child products by reference

    - by Jaro
    my mongo collection stores products. There are two product types: child and parent. Parent product holds array of its child as reference. Use case: use mydb; child1 = { _id: 1, name: "Child 1", is_child: true, is_parent: false, children : [] } child2 = { _id: 2, name: "Child 2", is_child: true, is_parent: false, children : [] } parent = { _id: 3, name: "Parent product", is_child: false, is_parent: true, children : [1, 2] } db.product.insert( [child1, child2, parent] ); And I'm looking for any query returning { _id: 3, name: "Parent product", is_child: false, is_parent: true, children: [ { _id: 1, name: "Child 1", is_child: true, is_parent: false, children : [] }, { _id: 2, name: "Child 2", is_child: true, is_parent: false, children : [] } ] } I'm newbie to mongodb, but I guess an usage of map-reduce could solve the problem. Can anyone advice? Thx

    Read the article

  • Today's Links (6/23/2011)

    - by Bob Rhubart
    Lydia Smyers interviews Justin "Mr. OTN" Kestelyn on the Oracle ACE Program Justin Kestelyn describes the Oracle ACE program, what it means to the developer community, and how to get involved. Incremental Essbase Metadata Imports Now Possible with OBIEE 11g | Mark Rittman "So, how does this work, and how easy is it to implement?" asks Oracle ACE Director Mark Rittman, and then he dives in to find out. ORACLENERD: The Podcast Oracle ACE Chet "ORACLENERD" Justice recounts his brush with stardom on Christian Screen's The Art of Business Intelligence podcast. Bay Area Coherence Special Interest Group Next Meeting July 21, 2011 | Cristóbal Soto Soto shares information on next month's Bay Area Coherence SIG shindig. New Cloud Security Book: Securing the Cloud by Vic Winkler | Dr Cloud's Flying Software Circus "Securing the Cloud is the most useful and informative about all aspects of cloud security," says Harry "Dr. Cloud" Foxwell. Oracle MDM Maturity Model | David Butler "The model covers maturity levels around five key areas: Profiling data sources; Defining a data strategy; Defining a data consolidation plan; Data maintenance; and Data utilization," says Butler. Integrating Strategic Planning for Cloud and SOA | David Sprott "Full blown Cloud adoption implies mature and sophisticated SOA implementation and impacts many business processes," says Sprott.

    Read the article

  • Please help fix and optimize this query

    - by user607217
    I am working on a system to find potential duplicates in our customers table (SQL 2005). I am using the built-in SOUNDEX value that our software computes when customers are added/updated, but I also implemented the double metaphone algorithm for better matching. This is the most-nested query I have created, and I can't help but think there is a better way to do it and I'd like to learn. In the inner-most query I am joining the customer table to the metaphone table I created, then finding customers that have identical pKey (primary phonetic key). I take that, union that with customers that have matching soundex values, and then proceed to score those matches with various text similarity functions. This is currently working, but I would also like to add a union of customers whose aKey (alternate phonetic key) match. This would be identical to "QUERY A" except to substitute on (c1Akey = c2Akey) for the join. However, when I attempt to include that, I get errors when I try to execute my query. Here is the code: --Create aggregate ranking select c1Name, c2Name, nDiff, c1Addr, c2Addr, aDiff, c1SSN, c2SSN, sDiff, c1DOB, c2DOB, dDiff, nDiff+aDiff+dDiff+sDiff as Score ,(sDiff+dDiff)*1.5 + (nDiff+dDiff)*1.5 + (nDiff+sDiff)*1.5 + aDiff *.5 + nDiff *.5 as [Rank] FROM ( --Create match scores for different fields SELECT c1Name, c2Name, c1Addr, c2Addr, c1SSN, c2SSN, c1LTD, c2LTD, c1DOB, c2DOB, dbo.Jaro(c1name, c2name) AS nDiff, dbo.JaroWinkler(c1addr, c2addr) AS aDiff, CASE WHEN c1dob = '1901-01-01' OR c2dob = '1901-01-01' OR c1dob = '1800-01-01' OR c2dob = '1800-01-01' THEN .5 ELSE dbo.SmithWaterman(c1dob, c2dob) END AS dDiff, CASE WHEN c1ssn = '000-00-0000' OR c2ssn = '000-00-0000' THEN .5 ELSE dbo.Jaro(c1ssn, c2ssn) END AS sDiff FROM -- Generate list of possible matches based on multiple phonetic matching fields ( select * from -- List of similar names from pKey field of ##Metaphone table --QUERY A BEGIN (select customers.custno as c1Custno, name as c1Name, haddr as c1Addr, ssn as c1SSN, lasttripdate as c1LTD, dob as c1DOB, soundex as c1Soundex, pkey as c1Pkey, akey as c1Akey from Customers WITH (nolock) join ##Metaphone on customers.custno = ##Metaphone.custno) as c1 JOIN (select customers.custno as c2Custno, name as c2Name, haddr as c2Addr, ssn as c2SSN, lasttripdate as c2LTD, dob as c2DOB, soundex as c2Soundex, pkey as c2Pkey, akey as c2Akey from Customers with (nolock) join ##Metaphone on customers.custno = ##Metaphone.custno) as c2 on (c1Pkey = c2Pkey) and (c1Custno < c2Custno) WHERE (c1Name <> 'PARENT, GUARDIAN') and c1soundex != c2soundex --QUERY A END union --List of similar names from pregenerated SOUNDEX field (select t1.custno, t1.name, t1.haddr, t1.ssn, t1.lasttripdate, t1.dob, t1.[soundex], 0, 0, t2.custno, t2.name, t2.haddr, t2.ssn, t2.lasttripdate, t2.dob, t2.[soundex], 0, 0 from Customers t1 WITH (nolock) join customers t2 with (nolock) on t1.[soundex] = t2.[soundex] and t1.custno < t2.custno where (t1.name <> 'PARENT, GUARDIAN')) ) as a ) as b where (sDiff+dDiff)*1.5 + (nDiff+dDiff)*1.5 + (nDiff+sDiff)*1.5 + aDiff *.5 + nDiff *.5 >= 7.5 order by [rank] desc, score desc Previously, I was using joins such as on c1.pkey = c2.pkey or c1.akey = c2.akey or c1.soundex = c2.soundex but the performance was horrendous, and using unions seems to be working a lot better. Out of 103K customers, tt is currently generating a list of 8.5M potential matches (based on the phonetic codes) in 2.25 minutes, and then taking another 2 to score, rank and filter those down to about 3000. So I am happy with the performance, I just can't help but think there is a better way to structure this, and I need help adding the extra union condition. Thanks!

    Read the article

1