Optimizing near-duplicate value search

Posted by GApple on Stack Overflow See other posts from Stack Overflow or by GApple
Published on 2010-05-20T22:56:50Z Indexed on 2010/05/20 23:00 UTC
Read the original article Hit count: 156

Filed under:
|
|

I'm trying to find near duplicate values in a set of fields in order to allow an administrator to clean them up.

There are two criteria that I am matching on

  1. One string is wholly contained within the other, and is at least 1/4 of its length
  2. The strings have an edit distance less than 5% of the total length of the two strings

The Pseudo-PHP code:

foreach($values as $value){
foreach($values as $match){
  if(
    (
      $value['length'] < $match['length']
      &&
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      $match['length'] < $value['length']
      &&
      $match['length'] * 4 > $value['length']
      &&
      stripos($value['value'], $match['value']) !== false
    )
    ||
    (
      abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
      &&
      $match['changes'] * 20 <= ($value['length'] + $match['length'])
      )
    ){
      $matches[] = &$match;
    }
}
}

I've tried to reduce calls to the comparatively expensive stripos and levenshtein functions where possible, which has reduced the execution time quite a bit. However, as an O(n^2) operation this just doesn't scale to the larger sets of values and it seems that a significant amount of the processing time is spent simply iterating through the arrays.

Some properties of a few sets of values being operated on

Total   | Strings      | # of matches per string |          |
Strings | With Matches | Average | Median |  Max | Time (s) |
--------+--------------+---------+--------+------+----------+
    844 |          413 |     1.8 |      1 |   58 |    140   |
    593 |          156 |     1.2 |      1 |    5 |     62   | 
    272 |          168 |     3.2 |      2 |   26 |     10   |
    157 |           47 |     1.5 |      1 |    4 |      3.2 |
    106 |           48 |     1.8 |      1 |    8 |      1.3 |
     62 |           47 |     2.9 |      2 |   16 |      0.4 |

Are there any other things I can do to reduce the time to check criteria, and more importantly are there any ways for me to reduce the number of criteria checks required (for example, by pre-processing the input values), since there is such low selectivity?

© Stack Overflow or respective owner

Related posts about php

Related posts about string-matching