FAQs

How do I get the best backup deduplication from compressed files?

One of the most common server backup scenarios is something like a "daily backup dump file." For example, you may back up a SQL database to a compressed file each day. In most cases, the data in the file will be very similar from one day to the next, with only a small portion changing.
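As a concrete illustration, here is a minimal sketch of such a daily dump job; the pg_dump command and the database name mydb are stand-ins for whatever dump tool and database you actually use:

```python
import datetime
import gzip
import subprocess

# Hypothetical daily job: dump a database and compress the result.
# "pg_dump" and "mydb" are placeholders for your own dump tool and database.
today = datetime.date.today().isoformat()
sql = subprocess.run(["pg_dump", "mydb"], capture_output=True, check=True).stdout

with gzip.open(f"mydb-{today}.sql.gz", "wb") as f:
    f.write(sql)
```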

This is a situation where SpiderOak provides an ideal backup solution, allowing you to store many historical states of the backup file using very little space within your SpiderOak account.

Because of the way zip, gzip, and most other compression algorithms work, even a very small change in the uncompressed file causes a "cascade" of differences throughout the compressed file. So even though the uncompressed source files might be 99% identical, the compressed files can be 99% (or even 100%) different.
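You can watch this cascade happen with Python's zlib module, which implements the same DEFLATE algorithm gzip uses. This illustrative sketch builds two inputs that differ in a single line, then counts how many bytes of the compressed outputs still match position-for-position:

```python
import zlib

# Build two ~1 MB text blobs that differ in exactly one line near the top.
lines = [f"record {i:06d}: status=OK\n" for i in range(40_000)]
doc_a = "".join(lines).encode()
lines[100] = "record 000100: status=FAIL\n"
doc_b = "".join(lines).encode()

ca, cb = zlib.compress(doc_a), zlib.compress(doc_b)

# Compare the compressed streams byte-for-byte.
same = sum(x == y for x, y in zip(ca, cb))
print(f"inputs differ in 1 of {len(lines)} lines")
print(f"compressed outputs: {same} of {min(len(ca), len(cb))} bytes identical")
```

Run it and you will see that only the short stretch of compressed output before the changed line survives; nearly everything after it differs, so block-level deduplication finds almost nothing to share between the two compressed files.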

Newer versions of gzip support an option called --rsyncable, which increases the size of the compressed archive by about 1% but periodically resynchronizes the compressed output with the uncompressed input. With this option, the compressed output stays largely similar even when the uncompressed input has small changes. Unfortunately, --rsyncable wasn't added to gzip in time for inclusion in Mac OS X Leopard, so the gzip shipped on that platform doesn't support it.
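Where the flag is available, using it is straightforward. The sketch below probes the installed gzip for --rsyncable support and falls back to plain compression otherwise; the probe-and-fallback logic is an assumption of this example, not something the FAQ prescribes:

```python
import subprocess

def compress_for_backup(path: str) -> None:
    # Probe the installed gzip for --rsyncable support; the flag exists in
    # newer GNU gzip releases and many distro-patched builds, but not in
    # others (for example, the gzip shipped with Mac OS X Leopard).
    probe = subprocess.run(["gzip", "--help"], capture_output=True, text=True)
    flags = ["--rsyncable"] if "--rsyncable" in probe.stdout + probe.stderr else []
    subprocess.run(["gzip", *flags, path], check=True)

compress_for_backup("mydb.sql")
```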

Of course, SpiderOak can do the very best deduplication if you simply archive the uncompressed file with each backup. SpiderOak compresses the data blocks anyway, so compressing the file beforehand isn't strictly necessary.
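In that case the daily job gets even simpler: write the dump uncompressed to a stable filename, and let SpiderOak version the file and deduplicate the unchanged blocks. As before, pg_dump and mydb are placeholders:

```python
import subprocess

# Dump uncompressed to the same filename every day; SpiderOak stores the
# historical versions and compresses/deduplicates the blocks itself.
# "pg_dump" and "mydb" are placeholders for your own dump tool and database.
with open("mydb.sql", "wb") as out:
    subprocess.run(["pg_dump", "mydb"], stdout=out, check=True)
```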
