How do I get the best backup deduplication from compressed files?

One of the most common server backup scenarios is something like a "daily backup dump file." For example, you may back up a SQL database to a compressed file each day. In most cases, the data in the file will be very similar from one day to the next, with a small portion of the data changing.

This is a situation where SpiderOak provides an ideal backup solution, allowing you to store many historical states of the backup file using very little space within your SpiderOak account.

Compression can get in the way of this, however. Because of the way zip, gzip, and most other compression algorithms work, even a very small change in the uncompressed file causes a "cascade" of differences through the rest of the compressed file. So even though the uncompressed source files might be 99% identical, the compressed files can be 99% (or even 100%) different.
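The cascade effect is easy to see in a small experiment. The following Python sketch builds a made-up "dump file," overwrites 8 bytes of it, and measures how many fixed-size blocks the two versions share before and after compression. The 4096-byte block size, the make_dump helper, and the block-matching rule are assumptions made for this illustration only; they are not a description of how SpiderOak splits or matches data internally.

```python
import hashlib
import random
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this demo


def block_hashes(data):
    """Hash each fixed-size block of data."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)]


def shared_fraction(old, new):
    """Fraction of blocks in `new` that also occur somewhere in `old`."""
    old_blocks = set(block_hashes(old))
    new_blocks = block_hashes(new)
    return sum(h in old_blocks for h in new_blocks) / len(new_blocks)


def make_dump(seed, rows=20000):
    """Build a made-up text dump: one 'id,hex value' row per line."""
    rng = random.Random(seed)
    return "".join(f"{i},{rng.getrandbits(64):016x}\n"
                   for i in range(rows)).encode()


day1 = make_dump(seed=1)
day2 = day1[:100] + b"CHANGED!" + day1[108:]  # overwrite 8 bytes near the start

plain = shared_fraction(day1, day2)
packed = shared_fraction(zlib.compress(day1), zlib.compress(day2))
print(f"uncompressed blocks shared: {plain:.0%}")
print(f"compressed blocks shared:   {packed:.0%}")
```

With this input, nearly all of the uncompressed blocks should match, while the compressed versions should share few or no blocks, even though only 8 bytes of the source changed.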

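Newer versions of gzip support an option called --rsyncable, which increases the size of the compressed archive by about 1% but periodically resynchronizes the compressed output with the uncompressed input. It does this by resetting the compressor at points chosen from the input data itself, so a change in one part of the input only affects the compressed output near that part. With this option, the compressed output stays much more similar from one backup to the next, even when the uncompressed input has small changes.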
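To illustrate the idea behind an rsync-friendly compressor, here is a minimal Python sketch. It is not gzip's actual implementation: the WINDOW and DIVISOR values, the use of a CRC-32 over the preceding bytes to choose cut points, and the helper names are all assumptions made for this example. It cuts the input wherever a checksum of the last few bytes hits a target value, and resets the compressor at each cut so that every compressed segment depends only on its own input.

```python
import random
import zlib

WINDOW = 48     # input bytes that decide whether to cut here (assumed value)
DIVISOR = 2048  # average segment is roughly this many input bytes (assumed value)


def cut_points(data):
    """Offsets where the checksum of the preceding WINDOW bytes hits the target."""
    return [end for end in range(WINDOW, len(data))
            if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0]


def compress_resettable(data):
    """Compress data into a list of segments, resetting the compressor at each cut."""
    compressor = zlib.compressobj()
    segments, start = [], 0
    for end in cut_points(data) + [len(data)]:
        piece = compressor.compress(data[start:end])
        # Z_FULL_FLUSH byte-aligns the output and discards the match history,
        # so the next segment does not depend on anything before it.
        piece += compressor.flush(zlib.Z_FULL_FLUSH)
        segments.append(piece)
        start = end
    segments[-1] += compressor.flush()  # end-of-stream marker and checksum
    return segments


def make_dump(seed, rows=20000):
    """Build a made-up text dump: one 'id,hex value' row per line."""
    rng = random.Random(seed)
    return "".join(f"{i},{rng.getrandbits(64):016x}\n"
                   for i in range(rows)).encode()


day1 = make_dump(seed=1)
day2 = day1[:100] + b"CHANGED!" + day1[108:]  # overwrite 8 bytes near the start

old_segments = set(compress_resettable(day1))
new_segments = compress_resettable(day2)
reused_bytes = sum(len(s) for s in new_segments if s in old_segments)
reused = reused_bytes / sum(len(s) for s in new_segments)
print(f"compressed bytes identical to the previous day: {reused:.0%}")
```

Joining the segments gives an ordinary zlib stream that any decompressor can read. Because this sketch resets the compressor much more often than gzip does, it gives up more compression than the roughly 1% mentioned above; the point is only to show why most of the compressed output survives a small change in the input.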

Of course, SpiderOak will be able to do the very best deduplication if you simply back up the uncompressed file each time. SpiderOak compresses the data blocks anyway, so compressing the file beforehand isn't necessary.