How does the rsync algorithm correctly identify repeating blocks?

Posted by Kai on Stack Overflow
Published on 2010-04-01T03:10:51Z
I'm on a personal quest to learn how the rsync algorithm works. After some reading and thinking, I've come up with a situation where I think the algorithm fails. I'm trying to figure out how this is resolved in an actual implementation.

Consider this example, where A is the receiver and B is the sender.

A = abcde1234512345fghij
B = abcde12345fghij

As you can see, the only change is that a duplicate 12345 has been removed.

Now, to make this example interesting, let's choose a block size of 5 bytes (chars). Hashing the blocks on the sender's side with the weak checksum gives the following values list.

abcde|12345|fghij

abcde -> 495
12345 -> 255
fghij -> 520

values = [495, 255, 520]
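The checksum values above happen to be the plain byte sums of each block (e.g. 'a'+'b'+'c'+'d'+'e' = 97+98+99+100+101 = 495). A minimal sketch of that simplified weak checksum, which reproduces the values list; note that real rsync uses a rolling, Adler-32-style checksum with a second position-weighted component, not just a byte sum:

```python
def weak_checksum(block: bytes) -> int:
    # Simplified weak checksum for this example: the byte sum.
    # (rsync's actual weak checksum is a rolling two-part sum.)
    return sum(block) % (1 << 16)

B = b"abcde12345fghij"
BLOCK_SIZE = 5

blocks = [B[i:i + BLOCK_SIZE] for i in range(0, len(B), BLOCK_SIZE)]
values = [weak_checksum(blk) for blk in blocks]
print(values)  # [495, 255, 520]
```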

Next we scan A, checking whether each block's hash appears in the values list. If a block matches, we skip to the end of that block and continue from there; if a block doesn't match, we've found a difference. I'll step through this process.

  1. Hash the first block. Does this hash exist in the values list? abcde -> 495 (yes, so skip)
  2. Hash the second block. Does this hash exist in the values list? 12345 -> 255 (yes, so skip)
  3. Hash the third block. Does this hash exist in the values list? 12345 -> 255 (yes, so skip)
  4. Hash the fourth block. Does this hash exist in the values list? fghij -> 520 (yes, so skip)
  5. No more data, we're done.

Since every hash was found in the values list, we conclude that A and B are the same, which, in my humble opinion, isn't true.

It seems to me this will happen whenever more than one block shares the same hash. What am I missing?
