Secure, efficient, version-preserving, filename-hiding backup implemented in this way?
- by barrycarter
I tried writing a "perfect" backup program (described below), but ran 
into problems (also below). Is there an efficient, working 
implementation of this? 
Assumptions: you're backing up from 'local', which you own and which 
has limited disk space, to 'remote', which has effectively unlimited 
disk space but belongs to someone else, so you need encryption. 
Network bandwidth is finite. 
'local' keeps a db of backed-up files w/ this data for each file: 
filename, including full path 
file's last modified time (mtime) 
sha1sum of file's unencrypted contents 
sha1sum of file's encrypted contents 
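For concreteness, here is a minimal sketch of what such a db might look like in SQLite. The table name, column names, and the open_db helper are my own placeholders, not part of any existing tool:

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the hypothetical 'local' db with one row per file version."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path       TEXT    NOT NULL,  -- filename, including full path
            mtime      INTEGER NOT NULL,  -- last modified time, epoch seconds
            plain_sha1 TEXT    NOT NULL,  -- sha1sum of unencrypted contents
            enc_sha1   TEXT    NOT NULL,  -- sha1sum of encrypted contents
            PRIMARY KEY (path, mtime)
        )""")
    # Lookups by unencrypted sha1sum are needed to spot duplicate contents.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_plain ON files (plain_sha1)")
    return conn
```

The (path, mtime) primary key is one way to keep several versions of the same file side by side.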
Given a list of files to back up (some perhaps already backed up), 
the program runs 'find' and gets the full path/mtime for each file 
(this is fairly efficient; conversely, computing the sha1sum of each 
file would NOT be efficient). 
The program discards files whose filename and mtime are in 'local' db. 
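A rough sketch of the scan-and-discard step, using os.walk as a stand-in for 'find'. The function names are made up, and 'known' is assumed to be a set of (path, mtime) pairs already loaded from 'local' db:

```python
import os
import stat

def scan(root):
    """Yield (full_path, mtime) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                yield full, int(st.st_mtime)

def needs_check(scanned, known):
    """Keep only the (path, mtime) pairs not already recorded in 'known'."""
    return [pair for pair in scanned if pair not in known]
```

Only metadata is read here; no file contents are opened at this stage.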
The program now computes the sha1sum of the (unencrypted) contents 
of each remaining file. 
If the sha1sum matches one in 'local' db, we create a special entry 
in 'local' db that points this file/mtime to the file/mtime of the 
existing entry. Effectively, we're saying "we have a backup of this 
file's contents, but under another filename, so no need to back it 
up again". 
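The duplicate check might look something like this. It is only a sketch: split_duplicates is a hypothetical helper, and 'known_sha1s' is assumed to be the set of unencrypted sha1sums already in 'local' db:

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex sha1sum of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def split_duplicates(candidates, known_sha1s):
    """Split (path, mtime) pairs into aliases and files needing upload.

    Both result lists hold (path, mtime, plain_sha1) tuples. A file is an
    alias if its contents are already backed up, or if an earlier file in
    this same batch has identical contents.
    """
    seen = set(known_sha1s)
    aliases, to_upload = [], []
    for path, mtime in candidates:
        digest = sha1_of_file(path)
        if digest in seen:
            aliases.append((path, mtime, digest))
        else:
            seen.add(digest)
            to_upload.append((path, mtime, digest))
    return aliases, to_upload
```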
For each remaining file, we encrypt the file, take the sha1sum of 
the encrypted file's contents, and rsync the encrypted file to a 
path on 'remote' derived from that sha1sum. Example: if the file's 
encrypted sha1sum was da39a3ee5e6b4b0d3255bfef95601890afd80709, 
we'd rsync it to 
/some/path/da/39/a3/da39a3ee5e6b4b0d3255bfef95601890afd80709 on 
'remote'. 
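The path layout in that example can be sketched as follows. The remote_path name and its 'base' parameter are placeholders of mine:

```python
import posixpath
import re

def remote_path(enc_sha1, base="/some/path"):
    """Map an encrypted sha1sum to base/xx/yy/zz/<full sha1sum>."""
    if not re.fullmatch(r"[0-9a-f]{40}", enc_sha1):
        raise ValueError("expected a 40-character lowercase hex sha1sum")
    # The first three byte pairs become nested directories, which keeps
    # any single directory on 'remote' from growing too large.
    return posixpath.join(
        base, enc_sha1[0:2], enc_sha1[2:4], enc_sha1[4:6], enc_sha1)
```

Since the remote name depends only on the encrypted contents, the original filename never appears on 'remote'.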
Once the rsync succeeds, we add the file to the 'local' db. 
Note that we efficiently avoid computing sha1sums and encrypting 
unless absolutely necessary. 
Note: I don't specify encryption method: this would be user's choice. 
The problems: 
We must encrypt and back up 'local' db regularly. However, 'local' 
db grows quickly and rsync'ing encrypted files is inefficient, since 
a small change in 'local' db means a big change in the encrypted 
version of 'local' db. 
We create a file on 'remote' for each file on 'local', which is 
ugly and excessive. 
We query 'local' db frequently. Even w/ indexes, these queries are 
slow, since we're often making one query for each file. Would be 
nice to speed this up by batching queries or something. 
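One possible way to batch, assuming the db is SQLite and has a 'files' table with 'path' and 'mtime' columns (both names are my assumption): load the whole scan into a temp table and let a single anti-join find the unseen entries, instead of issuing one query per file.

```python
import sqlite3

def unseen(conn, scanned):
    """Return the (path, mtime) pairs in 'scanned' with no match in 'files'.

    Uses one bulk insert plus one query, however many files were scanned.
    """
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS scan_batch (path TEXT, mtime INTEGER)")
    conn.execute("DELETE FROM scan_batch")
    conn.executemany("INSERT INTO scan_batch VALUES (?, ?)", scanned)
    return conn.execute("""
        SELECT s.path, s.mtime
        FROM scan_batch AS s
        LEFT JOIN files AS f ON f.path = s.path AND f.mtime = s.mtime
        WHERE f.path IS NULL
        ORDER BY s.path, s.mtime""").fetchall()
```

Whether this is actually faster would depend on the db size and on having an index covering (path, mtime), so it would need measuring.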
Probably other problems that I've now forgotten.