How does the rsync algorithm correctly identify repeating blocks?

Posted by Kai on Stack Overflow
Published on 2010-04-01T03:10:51Z
I'm on a personal quest to learn how the rsync algorithm works. After some reading and thinking, I've come up with a situation where I think the algorithm fails. I'm trying to figure out how this is resolved in an actual implementation.

Consider this example, where A is the receiver and B is the sender.

A = abcde1234512345fghij
B = abcde12345fghij

As you can see, the only change is that a duplicate 12345 has been removed.

Now, to make this example interesting, let's choose a block size of 5 bytes (chars). Hashing the blocks on the sender's side with the weak checksum gives the following values list.

abcde|12345|fghij

abcde -> 495
12345 -> 255
fghij -> 520

values = [495, 255, 520]
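The checksum values above happen to be the plain byte sums of each block (e.g. 'a'+'b'+'c'+'d'+'e' = 97+98+99+100+101 = 495). A minimal sketch of that simplified weak checksum, which reproduces the values list; note that real rsync uses a rolling, Adler-32-style checksum with a second position-weighted component, not just a byte sum:

```python
def weak_checksum(block: bytes) -> int:
    # Simplified weak checksum for this example: the byte sum.
    # (rsync's actual weak checksum is a rolling two-part sum.)
    return sum(block) % (1 << 16)

B = b"abcde12345fghij"
BLOCK_SIZE = 5

blocks = [B[i:i + BLOCK_SIZE] for i in range(0, len(B), BLOCK_SIZE)]
values = [weak_checksum(blk) for blk in blocks]
print(values)  # [495, 255, 520]
```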

Next we scan A, checking whether each block's hash appears in the values list. If a block matches, we skip to the end of that block and continue from there; if a block doesn't match, we've found a difference. I'll step through this process.

  1. Hash the first block. Does this hash exist in the values list? abcde -> 495 (yes, so skip)
  2. Hash the second block. Does this hash exist in the values list? 12345 -> 255 (yes, so skip)
  3. Hash the third block. Does this hash exist in the values list? 12345 -> 255 (yes, so skip)
  4. Hash the fourth block. Does this hash exist in the values list? fghij -> 520 (yes, so skip)
  5. No more data, we're done.

Since every hash was found in the values list, we conclude that A and B are the same, which, in my humble opinion, isn't true.

It seems to me this will happen whenever more than one block shares the same hash. What am I missing?
