Finding what makes strings unique in a list, can you improve on brute force?

Posted by Ed Guiness on Stack Overflow See other posts from Stack Overflow or by Ed Guiness
Published on 2010-05-18T17:40:39Z Indexed on 2010/05/18 18:00 UTC
Read the original article Hit count: 301

Suppose I have a list of strings where each string is

  • exactly 4 characters long and
  • unique within the list.

For each of these strings I want to identify the position of the characters within the string that make the string unique.

So for a list of three strings

abcd
abcc
bbcb

For the first string I want to identify the character in 4th position d since d does not appear in the 4th position in any other string.

For the second string I want to identify the character in 4th position c.

For the third string it I want to identify the character in 1st position b AND the character in 4th position, also b.

This could be concisely represented as

abcd -> ...d
abcc -> ...c
bbcb -> b..b

If you consider the same problem but with a list of binary numbers

0101
0011
1111

Then the result I want would be

0101 -> ..0.
0011 -> .0..
1111 -> 1...

Staying with the binary theme I can use XOR to identify which bits are unique within two binary numbers since

0101 ^ 0011 = 0110

which I can interpret as meaning that in this case the 2nd and 3rd bits (reading left to right) are unique between these two binary numbers. This technique might be a red herring unless somehow it can be extended to the larger list.

A brute-force approach would be to look at each string in turn, and for each string to iterate through vertical slices of the remainder of the strings in the list.

So for the list

abcd
abcc
bbcb

I would start with

abcd

and iterate through vertical slices of

abcc
bbcb

where these vertical slices would be

a | b | c | c
b | b | c | b

or in list form, "ab", "bb", "cc", "cb".

This would result in four comparisons

a : ab -> . (a is not unique)
b : bb -> . (b is not unique)
c : cc -> . (c is not unique)
d : cb -> d (d is unique)

or concisely

abcd -> ...d

Maybe it's wishful thinking, but I have a feeling that there should be an elegant and general solution that would apply to an arbitrarily large list of strings (or binary numbers). But if there is I haven't yet been able to see it.

I hope to use this algorithm to to derive minimal signatures from a collection of unique images (bitmaps) in order to efficiently identify those images at a future time. If future efficiency wasn't a concern I would use a simple hash of each image.

Can you improve on brute force?

© Stack Overflow or respective owner

Related posts about algorithm

Related posts about language-agnostic