Optimize Duplicate Detection
- by Dave Jarvis
Background
This is an optimization problem. Oracle Forms XML files have elements such as:
<Trigger TriggerName="name" TriggerText="SELECT * FROM DUAL" ... />
Where the TriggerText is arbitrary SQL code. Each SQL statement has been extracted into uniquely named files such as:
sql/module=DIAL_ACCESS+trigger=KEY-LISTVAL+filename=d_access.fmb.sql     
sql/module=REP_PAT_SEEN+trigger=KEY-LISTVAL+filename=rep_pat_seen.fmb.sql 
I wrote a script to generate a list of exact duplicates using a brute force approach.
Problem
There are 37,497 files to compare against each other; it takes 8 minutes to compare one file against all the others. Logically, if A = B and A = C, then there is no need to check if B = C. So the problem is: how do you eliminate the redundant comparisons?
The script will complete in approximately 208 days.
Script Source Code
The comparison script is as follows:
#!/bin/bash
echo Loading directory ...
for i in $(find sql/ -type f -name \*.sql); do
        echo Comparing $i ...
        for j in $(find sql/ -type f -name \*.sql); do
                if [ "$i" = "$j" ]; then
                        continue;
                fi
                # Case insensitive compare, ignore spaces
                diff -IEbwBaq $i $j > /dev/null
                # 0 = no difference (i.e., duplicate code)
                if [ $? = 0 ]; then
                        echo $i :: $j >> clones.txt
                fi
        done
done
Question
How would you optimize the script so that checking for cloned code is a few orders of magnitude faster?
System Constraints
Using a quad-core CPU with an SSD; trying to avoid using cloud services if possible. The system is a Windows-based machine with Cygwin installed -- algorithms or solutions in other languages are welcome.
Thank you!