Search Results

Search found 45 results on 2 pages for 'levenshtein'.

  • Optimizing Levenshtein Distance Algorithm

    - by Matt
    I have a stored procedure that uses Levenshtein Distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein Distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the call to the Levenshtein function). The table has 1.5 million records, so even the slightest adjustment may shave off a few seconds. Right now the entire thing runs over 10 minutes. Here's the method I'm using: ALTER function dbo.Levenshtein ( @Source nvarchar(200), @Target nvarchar(200) ) RETURNS int AS BEGIN DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000) SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0 WHILE @j <= @Target_len BEGIN SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1 END WHILE @i <= @Source_len BEGIN SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1 WHILE @j <= @Target_len BEGIN SET @Dist = @Dist + 1 SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) + CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END IF @Dist > @Dist_temp BEGIN SET @Dist = @Dist_temp END SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1 IF @Dist > @Dist_temp SET @Dist = @Dist_temp BEGIN SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1 END END SELECT @Distv1 = @Distv0, @i = @i + 1 END RETURN @Dist END Anyone have any ideas? Any input is appreciated. Thanks, Matt
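
    For reference, the same two-row dynamic program reads like this in Python, with one common speed-up added: an optional cutoff that abandons the computation once every cell in the current row already exceeds the best distance found so far. This is a hedged sketch of the algorithm, not a drop-in replacement for the T-SQL above; inside SQL Server the equivalent idea is to prefilter candidate rows (for example on length difference, which is a lower bound on the distance) before calling the function at all.

        def levenshtein(source, target, max_dist=None):
            # Two-row dynamic program, equivalent to the @Distv0/@Distv1 buffers above.
            if source == target:
                return 0
            prev = list(range(len(target) + 1))
            for i, s_char in enumerate(source, 1):
                curr = [i]
                for j, t_char in enumerate(target, 1):
                    cost = 0 if s_char == t_char else 1
                    curr.append(min(prev[j] + 1,          # deletion
                                    curr[j - 1] + 1,      # insertion
                                    prev[j - 1] + cost))  # substitution
                if max_dist is not None and min(curr) > max_dist:
                    return max_dist + 1  # this row can never recover below the threshold
                prev = curr
            return prev[-1]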

    Read the article

  • Levenshtein: MySQL + PHP

    - by user317005
    $word = strtolower($_GET['term']); $lev = 0; $q = mysql_query("SELECT `term` FROM `words`"); while($r = mysql_fetch_assoc($q)) { $r['term'] = strtolower($r['term']); $lev = levenshtein($word, $r['term']); if($lev >= 0 && $lev < 5) { $word = $r['term']; } } How can I move all that into just one query? I don't want to have to query through all the terms and do the filtering in PHP.

    Read the article

  • MySQL Error: FUNCTION LEVENSHTEIN already exists

    - by kgrote
    I've got an ExpressionEngine database and I exported a couple of tables from it, then dropped those tables. When I try to re-import the tables in PHPMyAdmin, I get this error: SQL query: -- -- Database: `my_db` -- DELIMITER $$ -- -- Functions -- CREATE DEFINER=`my_username`@`%` FUNCTION `LEVENSHTEIN`(s1 VARCHAR(255), s2 VARCHAR(255)) RETURNS int(11) DETERMINISTIC BEGIN DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; DECLARE s1_char CHAR; DECLARE cv0, cv1 VARBINARY(256); SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; IF s1 = s2 THEN RETURN 0; ELSEIF s1_len = 0 THEN RETURN s2_len; ELSEIF s2_len = 0 THEN RETURN s1_len; ELSE WHILE j <= s2_len DO SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; END WHILE; WHILE i <= s1_len DO SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; WHILE j <= s2_len DO SET c = c + 1; IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; IF c > c_temp THEN SET c = [...] MySQL said: Documentation #1304 - FUNCTION LEVENSHTEIN already exists I get this error even if I drop all tables from the DB and try to import anything. The only way I can get the error to go away is to totally delete the database and re-create it. What's causing that error and how can I stop it from happening?

    Read the article

  • Best way to detect similar email addresses?

    - by Chris
    I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit. ([email protected], [email protected], [email protected], etc...). I want to find similar email addresses for evaluation. Currently I'm using a levenshtein algorithm to check each e-mail against the others in the list and report any with an edit distance of less than 2. However, this is painstakingly slow. Is there a more efficient approach? The test code I'm using now is: using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.IO; using System.Threading; namespace LevenshteinAnalyzer { class Program { const string INPUT_FILE = @"C:\Input.txt"; const string OUTPUT_FILE = @"C:\Output.txt"; static void Main(string[] args) { var inputWords = File.ReadAllLines(INPUT_FILE); var outputWords = new SortedSet<string>(); for (var i = 0; i < inputWords.Length; i++) { if (i % 100 == 0) Console.WriteLine("Processing record #" + i); var word1 = inputWords[i].ToLower(); for (var n = i + 1; n < inputWords.Length; n++) { if (i == n) continue; var word2 = inputWords[n].ToLower(); if (word1 == word2) continue; if (outputWords.Contains(word1)) continue; if (outputWords.Contains(word2)) continue; var distance = LevenshteinAlgorithm.Compute(word1, word2); if (distance <= 2) { outputWords.Add(word1); outputWords.Add(word2); } } } File.WriteAllLines(OUTPUT_FILE, outputWords.ToArray()); Console.WriteLine("Found {0} words", outputWords.Count); } } }
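
    One way to avoid comparing every address against every other is to index the list in a BK-tree, which uses the triangle inequality of the Levenshtein metric to prune most comparisons for a radius query. A rough Python sketch of the idea (the names are illustrative and it leans on the python-Levenshtein package for the distance function, so treat it as an outline rather than a C# drop-in):

        from Levenshtein import distance  # C-backed edit distance; any equivalent function works

        class BKTree(object):
            # Burkhard-Keller tree: each child edge is labelled with its distance to
            # the parent, so a query within `radius` only descends into branches whose
            # label is within `radius` of the distance at the current node.
            def __init__(self):
                self.root = None  # [word, {distance: child}]

            def add(self, word):
                if self.root is None:
                    self.root = [word, {}]
                    return
                node = self.root
                while True:
                    d = distance(word, node[0])
                    if d == 0:
                        return
                    if d in node[1]:
                        node = node[1][d]
                    else:
                        node[1][d] = [word, {}]
                        return

            def query(self, word, radius):
                found, stack = [], [self.root] if self.root else []
                while stack:
                    node = stack.pop()
                    d = distance(word, node[0])
                    if d <= radius:
                        found.append((d, node[0]))
                    for label, child in node[1].items():
                        if abs(label - d) <= radius:  # triangle-inequality prune
                            stack.append(child)
                return found

    Building the tree is one pass over the 20,000 addresses; each query(word, 2) then touches only a fraction of the nodes, although how small a fraction depends on how clustered the data is.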

    Read the article

  • Will these optimizations to my Ruby implementation of diff improve performance in a Rails app?

    - by grg-n-sox
    <tl;dr> In source version control diff patch generation, would it be worth it to use the optimizations listed at the very bottom of this writing (see <optimizations>) in my Ruby implementation of diff for making diff patches? </tl;dr> <introduction> I am programming something I have never done before and there might already be tools out there to do the exact thing I am programming but at this point I am having too much fun to care so I am still going to do it from scratch, even if there is a tool for this. So anyways, I am working on a Ruby on Rails app and need a certain feature. Basically I want each entry in a table of mine, let's say for example a table of video games, to have a stored chunk of text that represents a review or something of the sort for that table entry. However, I want this text to be both editable by any registered user and also keep track of different submissions in a version control system. The simplest solution I could think of is just to implement a solution that keeps track of the text body and the diff patch history of different versions of the text body as objects in Ruby and then serialize it, preferably in human readable form (so I'll most likely use YAML for this) for editing if needed due to corruption by a software bug or a mistake made by an admin doing some version editing. So at first I just tried to dive head first into this feature, only to find that the problem of generating a diff patch is more difficult to do efficiently than I thought. So I did some research and came across some ideas. Some I have implemented already and some I have not. However, it all pretty much revolves around the longest common subsequence problem, as you would already know if you have done anything with diff or diff-like features, and optimizing the function that solves it. Currently I have it so it truncates the compared versions of the text body from the beginning and end until non-matching lines are found. Then it solves the problem using a comparison matrix, but instead of incrementing the value stored in a cell when it finds a matching line like in most longest common subsequence algorithms I have seen examples of, I increment when I have a non-matching line so as to calculate edit distance instead of longest common subsequence. Although as far as I can tell between the two approaches, they are essentially two sides of the same coin, so either could be used to derive an answer. It then back-traces through the comparison matrix and notes when there was an incrementation and in which adjacent cell (West, Northwest, or North) to determine that line's diff entry, and assumes all other lines to be unchanged. Normally I would leave it at that, but since this is going into a Rails environment and not just some stand-alone Ruby script, I started getting worried about needing to optimize at least enough that a spammer who somehow knew how I implemented the version control system and knew my worst case scenario entry still wouldn't be able to hit the server that badly. After some searching and reading of research papers and articles on the internet, I've come across several ideas that seem decent, but all seem to have pros and cons and I am having a hard time deciding how well the pros and cons balance out in this situation. So are the ones listed here worth it? I have listed them with known pros and cons.
    </introduction> <optimizations> Chop the compared sequences into multiple chunks of subsequences by splitting where lines are unchanged, and then truncating each section of unchanged lines at the beginning and end of each section. Then solve the edit distance of each subsequence. Pro: Changes the time increase as the changed area gets bigger from a quadratic increase to something more similar to a linear increase. Con: Figuring out where to split already seems like you have to solve edit distance, except now you don't care how it is changed. Would be fine if this were solvable by a process closer to solving Hamming distance, but a single insertion would throw this off. Use a cryptographic hash function to both convert all sequence elements into integers and ensure uniqueness. Then solve the edit distance comparing the hash integers instead of the sequence elements themselves. Pro: The operation of comparing two integers is faster than the operation of comparing two strings, so a slight performance gain is received after every comparison, which can be a lot overall. Con: Using a cryptographic hash function takes time to convert all the sequence elements and may end up costing more time to do the conversion than you gain back from the integer comparisons. You could use the built-in hash function for a string but that will not guarantee uniqueness. Use lazy evaluation to only calculate the three center-most diagonals of the comparison matrix and then only calculate additional diagonals as needed. And then also use this approach to possibly remove the need on some comparisons to compare all three adjacent cells as described here. Pro: Can take an algorithm that always takes O(n * m) time and make it so that only the worst case scenario takes that time, the best case becomes practically linear, and the average case is somewhere between the two. Con: It is an algorithm I've only seen implemented in functional programming languages and I am having a difficult time comprehending how to convert this into Ruby based on how it is described at the site linked to above. Make a C module and do the hard work at the native level in C and just make a Ruby wrapper for it so Ruby can make all the calls to it that it needs. Pro: I have to imagine that evaluating something like this in C could be a LOT faster. Con: I have no idea how Rails handles apps with Ruby code that has C extensions and it hurts the portability of the app. This is an optimization for after the solving of edit distance, but the idea is to store additional combined diffs with the ones produced by each version to make a delta-tree data structure with the most recently made diff as the root node of the tree, so getting to any version takes worst case time of O(log n) instead of O(n). Pro: Would make going back to an old version a lot faster. Con: It would mean that with every new commit, the delta-tree would get a new root node, which will cost time to reorganize the delta-tree for an operation that will be carried out a lot more often than going back a version, not to mention the unlikelihood that it will be an old version. </optimizations> So are these things worth the effort?
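
    On the second optimization above: the usual trick in diff implementations is interning rather than cryptographic hashing, i.e. mapping each distinct line to a small integer id before running the edit-distance matrix. Equal lines get equal ids and unequal lines get different ids by construction, so the uniqueness worry disappears and the inner loop compares integers. A hedged sketch (in Python for brevity; the same few lines translate directly to Ruby with a Hash):

        def intern_lines(old_lines, new_lines):
            # Replace each line with a small integer id; the edit-distance DP then
            # compares ints instead of whole strings, with no hash collisions possible.
            ids = {}
            def to_ids(lines):
                out = []
                for line in lines:
                    if line not in ids:
                        ids[line] = len(ids)
                    out.append(ids[line])
                return out
            return to_ids(old_lines), to_ids(new_lines)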

    Read the article

  • Algorithm for disordered sequences of strings

    - by Kinopiko
    The Levenshtein distance gives us a way to calculate the distance between two similar strings in terms of disordered individual characters: quick brown fox quikc brown fax The Levenshtein distance = 3. What is a similar algorithm for the distance between two strings with similar subsequences? For example, in quickbrownfox brownquickfox the Levenshtein distance is 10, but this takes no account of the fact that the strings have two similar subsequences, which makes them more "similar" than completely disordered words like quickbrownfox qburiocwknfox and yet the completely disordered version has a Levenshtein distance of eight. What distance measures exist which take the length of subsequences into account, without assuming that the subsequences can be easily broken into distinct words?
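
    One family of measures that is insensitive to this kind of block reordering is q-gram similarity: compare the sets of character n-grams of the two strings instead of aligning them position by position. A small sketch using bigrams and Jaccard similarity (an illustration of the idea, not a canonical algorithm):

        def bigrams(s):
            s = s.replace(' ', '').lower()
            return set(s[i:i + 2] for i in range(len(s) - 1))

        def bigram_similarity(a, b):
            # Jaccard similarity of the bigram sets: reordering whole blocks barely
            # matters, while scrambling individual characters destroys most bigrams.
            ba, bb = bigrams(a), bigrams(b)
            if not (ba or bb):
                return 1.0
            return len(ba & bb) / float(len(ba | bb))

        # bigram_similarity('quickbrownfox', 'brownquickfox')  -> about 0.71
        # bigram_similarity('quickbrownfox', 'qburiocwknfox')  -> about 0.14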

    Read the article

  • How can I optimize retrieving lowest edit distance from a large table in SQL?

    - by Matt
    Hey, I'm having trouble optimizing this Levenshtein Distance calculation I'm doing. I need to do the following: Get the record with the minimum distance for the source string as well as a trimmed version of the source string Pick the record with the minimum distance If the min distances are equal (original vs trimmed), choose the trimmed one with the lowest distance If there are still multiple records that fall under the above two categories, pick the one with the highest frequency Here's my working version: DECLARE @Results TABLE ( ID int, [Name] nvarchar(200), Distance int, Frequency int, Trimmed bit ) INSERT INTO @Results SELECT ID, [Name], (dbo.Levenshtein(@Source, [Name])) As Distance, Frequency, 'False' As Trimmed FROM MyTable INSERT INTO @Results SELECT ID, [Name], (dbo.Levenshtein(@SourceTrimmed, [Name])) As Distance, Frequency, 'True' As Trimmed FROM MyTable SET @ResultID = (SELECT TOP 1 ID FROM @Results ORDER BY Distance, Trimmed, Frequency) SET @Result = (SELECT TOP 1 [Name] FROM @Results ORDER BY Distance, Trimmed, Frequency) SET @ResultDist = (SELECT TOP 1 Distance FROM @Results ORDER BY Distance, Trimmed, Frequency) SET @ResultTrimmed = (SELECT TOP 1 Trimmed FROM @Results ORDER BY Distance, Trimmed, Frequency) I believe what I need to do here is to.. Not dump the results into a temporary table Do only 1 select from `MyTable` Set the results right in the initial select statement. (Since select will set variables and you can set multiple variables in one select statement) I know there has to be a good implementation for this but I can't figure it out... this is as far as I got: SELECT top 1 @ResultID = ID, @Result = [Name], (dbo.Levenshtein(@Source, [Name])) As distOrig, (dbo.Levenshtein(@SourceTrimmed, [Name])) As distTrimmed, Frequency FROM MyTable WHERE /* ... yeah I'm lost */ ORDER BY distOrig, distTrimmed, Frequency Any ideas?

    Read the article

  • Comparing (similar) images with Python/PIL

    - by Attila Oláh
    I'm trying to calculate the similarity (read: Levenshtein distance) of two images, using Python 2.6 and PIL. I plan to use the python-levenshtein library for fast comparison. Main question: What is a good strategy for comparing images? My idea is something like: Convert to RGB (transparent - white) (or maybe convert to monochrome?) Scale up the smaller one to the larger one's size Convert each channel (= the only channel, if converted to monochrome) to a sequence (item value = color value of the pixel) Calculate the Levenshtein distance between the two sequences Of course, this will not handle cases like mirrored images, cropped images, etc. But for basic comparison, this should be useful. Is there a better strategy documented somewhere?
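
    A minimal sketch of the pipeline described above, assuming the python-levenshtein package is installed and deliberately ignoring the mirrored/cropped cases; the 64x64 resize target is an arbitrary choice to keep the sequences short:

        from PIL import Image
        import Levenshtein  # python-levenshtein

        def image_sequence(path, size=(64, 64)):
            # Grayscale, shrink to a common size, then flatten the pixels into a
            # string so the C-backed Levenshtein.distance can handle it.
            img = Image.open(path).convert('L').resize(size)
            return ''.join(chr(p) for p in img.getdata())

        def image_distance(path_a, path_b):
            return Levenshtein.distance(image_sequence(path_a), image_sequence(path_b))

    Whether edit distance is the right metric here is debatable, since a one-pixel shift changes almost every position; a per-pixel difference or a perceptual hash (see the image-deduplication question further down this page) is often a more forgiving baseline.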

    Read the article

  • algorithm for checking addresses for matches?

    - by user151841
    I'm working on a survey program where people will be given promotional considerations the first time they fill out a survey. In a lot of scenarios, the only way we can stop people from cheating the system and getting a promotion they don't deserve is to check street address strings against each other. I was looking at using levenshtein distance to give me a number to measure similarity, and consider those below a certain threshold a duplicate. However, if someone were looking to game the system, they could easily write "S 5th St" instead of "South Fifth Street", and levenshtein would consider those strings to be very different. So then I was thinking to convert all strings to a 'standard address form' i.e. 'South' becomes 's', 'Fifth' becomes '5th', etc. Then I was thinking this is hopeless, and too much effort to get it working robustly. Is it? I'm working with PHP/MySql, so I have the limitations inherent in that system.
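
    The normalization idea is less hopeless than it sounds if the substitution table is kept small and applied before the Levenshtein comparison. A sketch of the shape it could take (the table below is a tiny illustrative sample, not a complete USPS-style abbreviation list):

        import re

        SUBS = {
            'south': 's', 'north': 'n', 'east': 'e', 'west': 'w',
            'street': 'st', 'avenue': 'ave', 'road': 'rd', 'apartment': 'apt',
            'first': '1st', 'second': '2nd', 'third': '3rd', 'fourth': '4th', 'fifth': '5th',
        }

        def normalize(address):
            # Lowercase, split on non-alphanumerics, and canonicalize known words.
            words = re.findall(r'[a-z0-9]+', address.lower())
            return ' '.join(SUBS.get(w, w) for w in words)

        # normalize('South Fifth Street') and normalize('S 5th St') both give 's 5th st',
        # so a small edit-distance threshold on the normalized forms flags them as the same.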

    Read the article

  • Fuzzy-String Search: Find misspelled information with T-SQL

    An optimized Damerau-Levenshtein Distance (DLD) algorithm for "fuzzy" string matching in Transact-SQL 2000-2008.

    Read the article

  • Syntax for documenting JSON structure

    - by Roman A. Taycher
    So I'm trying to document the format of the JSON returned by an API I am writing against, and I'd like to know if there is any popular format for documenting JSON structure. Note I'm not trying to test or validate anything, I'm just using this for documentation. Also, some ways to add comments to non-constants (items always returned w/ the same value) would be nice. This is the not-totally-thought-out scheme I'm currently using: Plain names refer to identifiers or types. Some types have type-comment Strings that appear to be constant (always returned for that type of request) are written as "str" Constant numbers would be just the number Constant null is null Booleans are true/false for constant booleans or Boolean otherwise [a,b,c] are lists with 3 items a,b,c [... ...] is a list of repeating elements of some types/constants/patterns {a:A,b:B,c:C} and {... ...} is the same for a dictionary. example: story := [header,footer] header := {"data":realHeader,"kind":"Listing"} realHeader := {"after": null, "before": null, "children": [{"data": realRealHeader, "kind": "t3"}], "modhash": ""} footer := {"data":AlmostComments,"kind":"Listing"} AlmostComments := {"data": {"after": null, "before": null, "children": comments, "modhash": ""}, "kind": "t1"} comments := [...{"data":comment, "kind":"t1"}...] realRealHeader := {"author": string, "clicked": boolean, "created": int, "created_utc": int, "domain": "code.reddit.com", "downs": int, "hidden": boolean, "id": string-id, "is_self": boolean, "levenshtein": null, "likes": null, "media": null, "media_embed": { }, "name": string-id, "num_comments": int, "over_18": false, "permalink": string-urlLinkToStoryStartingFrom/r, "saved": false, "score": int, "selftext": string, "selftext_html": string-html, "subreddit": string-subredditname, "subreddit_id": string-id, "thumbnail": "", "title": string, "ups": int, "url": "http://code.reddit.com/" } comments := { "author": string, "body": string-body_html-wout-html, "body_html": string-html-formated, "created": int, "created_utc": int, "downs": int, "id": string-id, "levenshtein": null, "likes": null, "link_id": string-id, "name": string-id, "parent_id": string-id, "replies": AlmostComments or null, "subreddit": string-subredditname, "subreddit_id": string-id, "ups": int }

    Read the article

  • Optimizing near-duplicate value search

    - by GApple
    I'm trying to find near-duplicate values in a set of fields in order to allow an administrator to clean them up. There are two criteria that I am matching on: One string is wholly contained within the other, and is at least 1/4 of its length The strings have an edit distance less than 5% of the total length of the two strings The pseudo-PHP code:

        foreach($values as $value){
            foreach($values as $match){
                if(
                    (
                        $value['length'] < $match['length']
                        && $value['length'] * 4 > $match['length']
                        && stripos($match['value'], $value['value']) !== false
                    ) || (
                        $match['length'] < $value['length']
                        && $match['length'] * 4 > $value['length']
                        && stripos($value['value'], $match['value']) !== false
                    ) || (
                        abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
                        && 0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
                        && $match['changes'] * 20 <= ($value['length'] + $match['length'])
                    )
                ){
                    $matches[] = &$match;
                }
            }
        }

    I've tried to reduce calls to the comparatively expensive stripos and levenshtein functions where possible, which has reduced the execution time quite a bit. However, as an O(n^2) operation this just doesn't scale to the larger sets of values, and it seems that a significant amount of the processing time is spent simply iterating through the arrays. Some properties of a few sets of values being operated on:

        Total   | Strings      | # of matches per string  |          |
        Strings | With Matches | Average | Median | Max   | Time (s) |
        --------+--------------+---------+--------+-------+----------+
        844     | 413          | 1.8     | 1      | 58    | 140      |
        593     | 156          | 1.2     | 1      | 5     | 62       |
        272     | 168          | 3.2     | 2      | 26    | 10       |
        157     | 47           | 1.5     | 1      | 4     | 3.2      |
        106     | 48           | 1.8     | 1      | 8     | 1.3      |
        62      | 47           | 2.9     | 2      | 16    | 0.4      |

    Are there any other things I can do to reduce the time to check criteria, and more importantly are there any ways for me to reduce the number of criteria checks required (for example, by pre-processing the input values), since there is such low selectivity?
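
    One pre-processing step follows directly from the two criteria: sort the values by length. The containment test requires the longer string to be at most 4x the shorter, and the edit-distance test requires the lengths to be within roughly 5% of each other, so once a candidate is too long the rest of the sorted list can be skipped. A hedged sketch of that pruning:

        def candidate_pairs(values):
            # `values` is a list of dicts with 'value' and 'length' keys, as above.
            ordered = sorted(values, key=lambda v: v['length'])
            for i, a in enumerate(ordered):
                for b in ordered[i + 1:]:
                    if b['length'] > a['length'] * 4:
                        break  # every later b is even longer; neither criterion can hold
                    yield a, b  # run the stripos/levenshtein checks only on these pairs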

    Read the article

  • Convert filenames to their checksum before saving to prevent duplicates. Is it a smart thing to do?

    - by Xananax
    TL;DR: what the title says. I am developing some sort of image board in PHP. I was thinking of changing each image's filename to its checksum prior to saving it. This way, I might be able to prevent duplicates. I know this wouldn't work for two images that are the same but differ in size or level of compression or whatnot, but this method would allow for an early check. What bugs me is that I have never seen this method implemented anywhere, so I was wondering if there is a catch to it. Maybe it is just more efficient to keep the original filename and store the hash in the DB? Maybe the whole method is just not useful and my question is moot? What do you think? On a side note, I don't really get how hashes are calculated, so I was wondering, if my first question checks out, whether it would be possible to estimate how similar two images are by comparing their hashes (Levenshtein or something of the sort).
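
    The basic mechanism is cheap either way, so the real question is only where to store the digest. A sketch of the filename variant (SHA-256 here; the helper name is made up):

        import hashlib
        import os
        import shutil

        def store_image(src_path, dest_dir):
            with open(src_path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            ext = os.path.splitext(src_path)[1].lower()
            dest = os.path.join(dest_dir, digest + ext)
            if os.path.exists(dest):
                return dest, True   # byte-identical file already stored
            shutil.copy(src_path, dest)
            return dest, False

    On the side note: cryptographic hashes are designed so that nearly identical inputs produce completely unrelated digests, so comparing hashes with Levenshtein says nothing about how similar two images look; that needs a perceptual hash computed from the pixels instead.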

    Read the article

  • Algorithms for "fuzzy matching" strings

    - by Alexey Romanov
    By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit.
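
    The core of that style of matching is just an ordered-subsequence test, which can be expressed as a regular expression built from the query; ranking by "best fit" is then a heuristic layered on top. A rough sketch (the span-length scoring is one arbitrary choice among many):

        import re

        def subsequence_pattern(query):
            # 'abc' becomes a.*?b.*?c : every query character, in order,
            # with anything allowed in between.
            return re.compile('.*?'.join(re.escape(ch) for ch in query))

        def fuzzy_filter(query, candidates):
            pattern = subsequence_pattern(query)
            scored = []
            for c in candidates:
                m = pattern.search(c)
                if m:
                    # prefer tighter matches, then shorter candidates
                    scored.append((m.end() - m.start(), len(c), c))
            return [c for _, _, c in sorted(scored)]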

    Read the article

  • How do you hook a C++ compiled dll function to a sql database?

    - by Thomas
    I want to do something like: lastName SIMILARTO(lastName, 'Schwarseneger', 2) where lastName is the field in the database, 'Schwarseneger' is the value that the lastName field is being compared to, and 2 is the maximum number of characters (edit distance) that can differ between the lastName field and the entered value. I can implement the SIMILARTO function in C++ using the Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance), but how do I hook the function in a DLL to a MySQL implementation?

    Read the article

  • Removing duplicate images (deduplication) - calculating "overlap" of images

    - by jotango
    Hello, I have a ton of product images on our file system. Our code removes 100% identical images (or does not allow them to be uploaded). However our sellers often upload items pictures which are very similar, but not exactly. They could have more whitespace, a worse quality (compression), a different size etc. Is there any way I can calculate the degree of overlap between two images, to flag ones for deletion? Kind of like a Levenshtein distance between two images... Any pointers would be very cool. Thanks!
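
    A common lightweight answer is a perceptual hash rather than an edit distance: shrink each image to a tiny grayscale thumbnail, turn it into a bit string, and compare bit strings with a Hamming distance. A sketch of the simplest variant (average hash), assuming PIL/Pillow is available:

        from PIL import Image

        def average_hash(path, size=8):
            img = Image.open(path).convert('L').resize((size, size))
            pixels = list(img.getdata())
            mean = sum(pixels) / float(len(pixels))
            # one bit per pixel: brighter than the mean or not
            return ''.join('1' if p > mean else '0' for p in pixels)

        def hash_distance(a, b):
            return sum(x != y for x, y in zip(a, b))

        # a distance of a few bits out of 64 usually means "same picture, re-encoded";
        # the exact threshold is something to tune on real seller uploads.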

    Read the article

  • Be liberal in what you accept... or not?

    - by Matthieu M.
    [Disclaimer: this question is subjective, but I would prefer getting answers backed by facts and/or reflections] I think everyone knows about the Robustness Principle, usually summed up by Postel's Law: Be conservative in what you send; be liberal in what you accept. I would agree that for the design of a widespread communication protocol this may make sense (with the goal of allowing easy extension), however I have always thought that its application to HTML / CSS was a total failure, each browser implementing its own silent tweak detection / behavior, making it near impossible to obtain a consistent rendering across multiple browsers. I do notice though that the RFC of the TCP protocol deems "Silent Failure" acceptable unless otherwise specified... which is an interesting behavior, to say the least. There are other examples of the application of this principle throughout the software trade that regularly pop up because they have bitten developers; off the top of my head: JavaScript semicolon insertion, C (silent) built-in conversions (which would not be so bad if they did not truncate...) and there are tools to help implement "smart" behavior: name-matching phonetic algorithms (Double Metaphone), string distance algorithms (Levenshtein distance). However I find that this approach, while it may be helpful when dealing with non-technical users or to help users in the process of error recovery, has some drawbacks when applied to the design of library/class interfaces: it is somewhat subjective whether the algorithm guesses "right", and thus it may go against the Principle of Least Astonishment; it makes the implementation more difficult, thus more chances to introduce bugs (violation of YAGNI?); it makes the behavior more susceptible to change, as any modification of the "guess" routine may break old programs, nearly excluding refactoring possibilities... from the start! And this is what led me to the following question: When designing an interface (library, class, message), do you lean toward the robustness principle or not? I myself tend to be quite strict, using extensive input validation on my interfaces, and I was wondering if I was perhaps too strict.

    Read the article

  • Complex string matching with fuzzywuzzy

    - by That1Guy
    I'm attempting to write a process that matches obscure strings to a single 'master string' for further processing. I have a lot of data that looks something like this: Basketball Basket Ball Football BasketBallR BBall BBall - r FootB ...and so on. These need to be mapped to a master record like so: Basketball = Basket Ball, BBall Basketball - R = BasketBallR, BBall - r I also have instances of data resembling this format: Football -r FootBall - r-g/H,Q,HH These situations need to be separated into different categories before being mapped. For example FootBall - r-g/H,Q,HH should be: Football - r Football - g Football - H Football - Q Football - HH At this point, it still needs to be mapped to a master record... I've tried several different combinations of fuzzywuzzy matching methods, Levenshtein Distance measurements, regex, etc. and can't seem to find a reliable method to logically associate different naming styles of a single item with a master name. I'm throwing my hands up in desperation. Are there any existing Python resources that can help sort out my problem? Are there other options? Can anybody point out an obvious option that I might have overlooked? Basically, any suggestion, solution, resource or alternative method is greatly appreciated.
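
    One way to structure it, assuming the separators really are the hyphen/slash/comma pattern shown above, is to expand each raw entry into one candidate per suffix first and only then let fuzzywuzzy pick the master record for each candidate. The master list and helper names below are made up for illustration:

        import re
        from fuzzywuzzy import process

        MASTERS = ['Basketball', 'Basketball - R', 'Football', 'Football - r']  # hypothetical

        def expand(raw):
            # 'FootBall - r-g/H,Q,HH' -> ['FootBall - r', 'FootBall - g', 'FootBall - H', ...]
            if '-' not in raw:
                return [raw]
            base, _, suffixes = raw.partition('-')
            parts = re.split(r'[-/,]', suffixes)
            return ['%s - %s' % (base.strip(), p.strip()) for p in parts if p.strip()]

        def map_to_masters(raw):
            # process.extractOne returns the best-scoring master and its score
            return [process.extractOne(candidate, MASTERS) for candidate in expand(raw)]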

    Read the article

  • Fuzzy Regular Expressions

    - by Thomas Ahle
    In my work I have used approximate string matching algorithms such as Damerau–Levenshtein distance with great results to make my code less vulnerable to spelling mistakes. Now I need to match strings against simple regular expressions such as TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0 while T Schedule for 10. Jan should return 2. This could be done by generating all strings in the regex (in this case 100x12) and finding the best match, but that doesn't seem practical. Do you have any ideas how to do this efficiently?
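
    If Python is an option, the third-party regex module (a drop-in superset of re) has approximate matching built in: appending {e<=n} to a group allows up to n errors, which matches the behaviour described above without enumerating the 1,200 possible strings. A sketch, with the caveat that the fuzzy-count bookkeeping here is from memory:

        import regex  # the third-party 'regex' package, not the stdlib 're'

        PATTERN = regex.compile(r'(?:TV Schedule for \d\d (?:Jan|Feb|Mar)){e<=3}',
                                regex.BESTMATCH)

        def schedule_errors(s):
            m = PATTERN.fullmatch(s)
            if m is None:
                return None  # more errors than allowed
            subs, ins, dels = m.fuzzy_counts
            return subs + ins + dels

        # schedule_errors('TV Schedule for 10 Jan')  -> 0
        # schedule_errors('T Schedule for 10. Jan')  -> 2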

    Read the article

  • Fuzzy string matching algorithm in Python

    - by Mridang Agarwalla
    Hi guys, I'm trying to find some sort of a good, fuzzy string matching algorithm. Direct matching doesn't work for me; it isn't much good because unless my strings are 100% similar, the match fails. The Levenshtein method doesn't work too well for my strings as it works on a character level. I was looking for something along the lines of word-level matching, e.g. String A: The quick brown fox. String B: The quick brown fox jumped over the lazy dog. These should match as all words in string A are in string B. Now, this is an oversimplified example, but would anyone know a good, fuzzy string matching algorithm that works on a word level? Thanks in advance.
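
    For the containment case described above ("all words of A appear in B"), a set operation on the tokens already goes a long way, and the same ratio doubles as a crude score when containment is only partial. A minimal sketch:

        import re

        def tokens(s):
            return set(re.findall(r"[a-z0-9']+", s.lower()))

        def containment(a, b):
            # fraction of A's words that also occur in B: 1.0 means full containment
            ta, tb = tokens(a), tokens(b)
            if not ta:
                return 0.0
            return len(ta & tb) / float(len(ta))

        # containment('The quick brown fox.',
        #             'The quick brown fox jumped over the lazy dog.')  -> 1.0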

    Read the article

  • Compare two String with MySQL

    - by Scorpi0
    Hi, I want to compare two strings in a SQL request so I can retrieve the best match; the aim is to propose to an operator the best zip code possible. For example, in France we have integer zip codes, so I made an easy request: SELECT * FROM myTable ORDER BY abs(zip_code - 75000) This request returns the data closest to Paris first. Unfortunately, the United Kingdom has zip codes like AB421RS, so my request can't handle them. I see SQL Server has a function 'Difference': http://www.java2s.com/Code/SQLServer/String-Functions/DIFFERENCEworkoutwhenonestringsoundssimilartoanotherstring.htm But I use MySQL. Is there anyone who has a good idea to do the trick in one simple request? PS: the Levenshtein distance will not do it, as I really want to compare the strings as if they were numbers. ABCDEF has to be closer to AWXYZ than to ZBCDEF.

    Read the article

  • Writing a post search algorithm.

    - by MdaG
    I'm trying to write a free text search algorithm for finding specific posts on a wall (a similar kind of wall to the one Facebook uses). A user is supposed to be able to write some words in a search field and get hits on posts that contain the words, with the best match on top and then other posts in decreasing order according to match score. I'm using the edit distance (Levenshtein) "e(x, y) = e" to calculate the score for each post when compared to the query word "x" and post word "y" according to: score(x, y) = 2^(2 - e)(1 - min(e, |x|) / |x|) Each word in a post contributes to the total score for that specific post. This approach seems to work well when the posts are of roughly the same size, but sometimes certain large posts manage to rack up a score solely by having a lot of words in them while in practice not being relevant to the query. Am I approaching this problem in the wrong way or is there some way to normalize the score that I haven't thought of?
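
    As stated, the score is a sum over every word pair, so a long post gets many chances to collect small contributions. One hedged way to normalize is to let each query word count only its single best post word, which caps the score per query word regardless of post length; the sketch below reuses the python-Levenshtein package for e(x, y):

        from Levenshtein import distance

        def word_score(x, y):
            # score(x, y) = 2^(2 - e) * (1 - min(e, |x|) / |x|)
            e = distance(x, y)
            return (2 ** (2 - e)) * (1 - min(e, len(x)) / float(len(x)))

        def post_score(query_words, post_words):
            # best match per query word instead of a sum over all post words
            return sum(max(word_score(x, y) for y in post_words) for x in query_words)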

    Read the article
