How do I get the best backup deduplication from compressed files?


One of the most common server backup scenarios is a daily backup dump file. For example, you may back up a SQL database to a compressed file each day. In most cases the data in the file is very similar from one day to the next, with only a small portion changing.

This is a situation where SpiderOak provides an ideal backup solution, allowing you to store many historical states of the backup file using very little space within your SpiderOak account.

Because of the way zip, gzip, and most other compression formats work, even a very small change in the uncompressed file causes a "cascade" of differences throughout the compressed file. So even though two uncompressed source files might be 99% identical, their compressed versions can be 99% (or even 100%) different.
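To see this cascade in practice, here is a small, self-contained Python sketch. The "SQL dump" data is synthetic and purely illustrative: it changes a single row in a roughly 900 KB payload and then compares the zlib-compressed outputs byte for byte.

```python
import zlib

# Two versions of a synthetic "SQL dump" that differ in a single row.
rows = [b"INSERT INTO accounts VALUES (%d, 'active');\n" % i
        for i in range(20000)]
v1 = b"".join(rows)
rows[7] = b"INSERT INTO accounts VALUES (7, 'closed');\n"
v2 = b"".join(rows)

c1, c2 = zlib.compress(v1), zlib.compress(v2)

# Count how many bytes are identical at the same offset in both outputs.
same = sum(a == b for a, b in zip(c1, c2))
print(f"uncompressed: one changed row out of {len(rows)}")
print(f"compressed:   {same} of {min(len(c1), len(c2))} aligned bytes identical")
```

In a typical run, nearly all of the aligned bytes after the change point differ, even though only one row of the input changed.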

Newer versions of gzip support an option called --rsyncable. It increases the size of the compressed archive by about 1%, but it also resets the compressor's state at regular points determined by the content of the uncompressed input. As a result, a small change in the input only changes a small, localized region of the compressed output, and the rest remains identical from one backup to the next.
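For example, if your backup job produces a dump file, you could compress it with a --rsyncable-capable gzip like this. This is a minimal sketch: the filename backup.sql is hypothetical, and the flag requires gzip 1.7 or newer (or a distribution build, such as Debian's, that patched it in earlier).

```python
import subprocess

# Compress the dump with --rsyncable so small input changes stay
# localized in the .gz output. "backup.sql" is an illustrative name.
subprocess.run(
    ["gzip", "--rsyncable", "--keep", "backup.sql"],
    check=True,
)
# Writes backup.sql.gz; --keep (gzip 1.6+) leaves the original in place.
```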

Of course, SpiderOak achieves the very best deduplication if you simply back up the uncompressed file each time. SpiderOak compresses the data blocks anyway, so compressing the file beforehand isn't necessary.
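As a rough illustration of why uncompressed input deduplicates so well, here is a simplified fixed-size-block sketch in Python. This is not SpiderOak's actual algorithm, just a toy model under the assumption that any block whose hash has already been stored never needs to be stored again.

```python
import hashlib

BLOCK = 64 * 1024  # illustrative block size; not SpiderOak's actual scheme

def block_hashes(data: bytes) -> list:
    """Hash fixed-size blocks; identical blocks are stored only once."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

# The same synthetic dump as above: one changed row between two versions.
rows = [b"INSERT INTO accounts VALUES (%d, 'active');\n" % i
        for i in range(20000)]
v1 = b"".join(rows)
rows[7] = b"INSERT INTO accounts VALUES (7, 'closed');\n"
v2 = b"".join(rows)

h1, h2 = block_hashes(v1), block_hashes(v2)
shared = sum(a == b for a, b in zip(h1, h2))
print(f"blocks unchanged between versions: {shared}/{len(h1)}")
```

With the one-row change landing in a single 64 KiB block, every other block hashes identically, so only one new block's worth of data would need to be stored for the second backup.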

Not sure what deduplication is or how it works? Check out our deduplication FAQ.