How to replace pairs of strings in two files with identical IDs?

Posted by Péter Török on Stack Overflow
Published on 2010-04-20T10:49:13Z Indexed on 2010/04/20 12:03 UTC

Sorry if the title is not very intelligible; I couldn't come up with anything better. Hopefully my explanation is clear enough:

I have a pair of rather large log files with very similar content, except that some strings are different between the two. A couple of examples:

UnifiedClassLoader3@19518cc | UnifiedClassLoader3@d0357a
JBossRMIClassLoader@13c2d7f | JBossRMIClassLoader@191777e

That is, wherever the first file contains UnifiedClassLoader3@19518cc, the second contains UnifiedClassLoader3@d0357a, and so on. [Update] There are about 40 distinct pairs of such identifiers.[/Update]

I want to replace these with identical IDs so that I can spot the really important differences between the two files. That is, I want to replace all occurrences of both UnifiedClassLoader3@19518cc in file1 and UnifiedClassLoader3@d0357a in file2 with UnifiedClassLoader3@1, all occurrences of both JBossRMIClassLoader@13c2d7f in file1 and JBossRMIClassLoader@191777e in file2 with JBossRMIClassLoader@2, and so on.

Using the Cygwin shell, I have so far managed to list all the distinct identifiers occurring in one of the files with

grep -o -e 'ClassLoader[0-9]*@[0-9a-f][0-9a-f]*' file1.log | sort | uniq

However, the original order is now lost, so I don't know which ID in the other file each one pairs with. With grep -n I can get the line number, so a sort would preserve the order of appearance, but then I can't weed out the duplicate occurrences. Unfortunately, grep cannot by itself print only the first occurrence of each distinct match.
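One possible way around the ordering problem, offered only as a sketch and not something I have tried on the real logs: awk can drop repeated lines while keeping the order of first appearance. The function name unique_in_order is hypothetical, and the awk step is an addition on top of the grep command above.

```shell
# unique_in_order LOGFILE
# Print each distinct identifier once, in order of first appearance.
unique_in_order() {
    grep -o -e 'ClassLoader[0-9]*@[0-9a-f][0-9a-f]*' "$1" |
    awk '!seen[$0]++'    # print a line only the first time it is seen
}
```

If the identifiers first appear in the same order in both files, running this on each file should give two lists that pair up line by line.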

I figured I could save the list of identifiers produced by the above command into a file, then iterate over the patterns in the file with grep -n | head -n 1, concatenate the results and sort them again. The result would be something like

2 ClassLoader3@19518cc
137 ClassLoader@13c2d7f
563 ClassLoader3@1267649
...
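For concreteness, the loop described above might look roughly like this. It is a sketch only: the function name first_seen is hypothetical, and the -F option (match the ID as a fixed string) is an addition not mentioned in the plan.

```shell
# first_seen LOGFILE
# Print "LINE ID" for each distinct identifier, sorted by the line
# number of its first occurrence in LOGFILE.
first_seen() {
    log=$1
    grep -o -e 'ClassLoader[0-9]*@[0-9a-f][0-9a-f]*' "$log" | sort | uniq |
    while IFS= read -r id; do
        # grep -n prefixes each matching line with "N:"; head keeps the
        # first such line and cut extracts the number before the colon.
        line=$(grep -n -F -e "$id" "$log" | head -n 1 | cut -d: -f1)
        printf '%s %s\n' "$line" "$id"
    done | sort -n
}
```

Note that this scans the log once per identifier, which should be tolerable for about 40 identifiers but would not scale to thousands.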

Then I could (either manually or with sed itself) massage this into a sed command like

sed -e 's/ClassLoader3@19518cc/ClassLoader3@2/g' \
    -e 's/ClassLoader@13c2d7f/ClassLoader@137/g' \
    -e 's/ClassLoader3@1267649/ClassLoader3@563/g' \
    file1.log > file1_processed.log

and similarly for file2.
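The massaging step could probably be automated along these lines. This is a rough sketch under the assumption that the "LINE ID" list shown earlier arrives on standard input; make_sed_script is a hypothetical name.

```shell
# make_sed_script
# Read "LINE ID" pairs on stdin and print one sed substitution per pair,
# replacing the hash after '@' with the line number.
make_sed_script() {
    while read -r line id; do
        # ${id%%@*} is the part of the ID before the '@' (the class name).
        printf 's/%s/%s@%s/g\n' "$id" "${id%%@*}" "$line"
    done
}
```

Assuming the list were saved in a hypothetical first_seen.txt, something like `make_sed_script < first_seen.txt > file1.sed` followed by `sed -f file1.sed file1.log > file1_processed.log` should then do the replacement in one pass.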

However, before I start, I would like to verify that my plan is the simplest possible working solution to this.

Is there any flaw in this approach? Is there a simpler way?

© Stack Overflow or respective owner