What is deduplication?

Deduplication is a process SpiderOak uses to save our users space and, therefore, money. If you save multiple copies of the same file, only the first copy takes up its full size; every other copy takes up far less space because SpiderOak saves only the data that differs from the original file.

For example, if you add more text or a graphic to a document, SpiderOak will only save the new data (the "diff" or difference), instead of saving the entire file again. Also, if you back up a file on one computer that has already been backed up on another computer, this file will occupy no additional space in your account.
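To make the idea concrete, here is a minimal sketch of deduplication in Python. It is an illustration only, not SpiderOak's actual implementation: the fixed-size blocks, the SHA-256 hashing, and the names `BlockStore`, `add_file`, and `BLOCK_SIZE` are all assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, unrealistically small so the example is easy to follow


class BlockStore:
    """Toy content-addressed store: each unique block is kept only once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> block bytes

    def add_file(self, data: bytes):
        """Store a file; return (list of block digests, bytes newly stored)."""
        recipe = []
        new_bytes = 0
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # Only blocks never seen before consume space.
                self.blocks[digest] = block
                new_bytes += len(block)
            recipe.append(digest)
        return recipe, new_bytes

    def read_file(self, recipe):
        """Rebuild a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in recipe)


store = BlockStore()
original = b"AAAABBBBCCCC"

_, first_cost = store.add_file(original)                       # first copy: full size
_, copy_cost = store.add_file(original)                        # identical copy: nothing new
edited_recipe, edit_cost = store.add_file(original + b"DDDD")  # appended data only

print(first_cost, copy_cost, edit_cost)
```

In this sketch the first copy costs 12 bytes, the identical copy costs nothing, and the edited version costs only the 4 appended bytes. Fixed-size blocks handle appended data well but not data inserted in the middle of a file, which is one reason real systems tend to use more sophisticated schemes.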

When you upload a copy of a file that is already saved on our servers, SpiderOak performs deduplication before the upload begins, comparing the file to the data you have already saved. It then uploads only the information that differs between the two files, such as their locations, in the form of journal entries. Although SpiderOak appears to be uploading the entire file again, the upload finishes much faster and takes up very little space because only these journal entries are actually being uploaded.
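The check-before-upload step described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not SpiderOak's real protocol: the `Account` class, the `backup` function, whole-file SHA-256 hashing, and the shape of the journal entry are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Account:
    """Hypothetical model of one user's stored data."""
    stored: dict = field(default_factory=dict)   # content digest -> file content
    journal: list = field(default_factory=list)  # small metadata entries


def backup(account: Account, path: str, data: bytes) -> int:
    """Back up one file; return the number of content bytes uploaded."""
    digest = hashlib.sha256(data).hexdigest()
    uploaded = 0
    if digest not in account.stored:
        # Content not seen before in this account: upload it in full.
        account.stored[digest] = data
        uploaded = len(data)
    # A journal entry is recorded either way; it notes where the file lives.
    account.journal.append({"path": path, "sha256": digest})
    return uploaded


account = Account()
report = b"x" * 1000

first_upload = backup(account, "laptop/report.txt", report)
second_upload = backup(account, "desktop/report-copy.txt", report)

print(first_upload, second_upload, len(account.journal))
```

The second backup uploads no file content at all; only a new journal entry recording the second location is added.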

There are some files, however, which are saved or compressed in such a way that they cannot be easily deduplicated by our software, even if they are essentially the same file. For example, historical versions of .docx files will still take up a great deal of space because each time these files are saved, their entire structure changes in a way that makes it hard for SpiderOak to detect any similarities. Outlook files are another common example.
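The effect of compression on deduplication can be demonstrated with a small experiment. The sketch below assumes fixed 64-byte blocks and uses Python's `zlib` as a stand-in for whatever compression a given file format applies; the helper names `block_hashes` and `shared_fraction` are made up for this example.

```python
import hashlib
import zlib

BLOCK_SIZE = 64  # hypothetical block size for this illustration


def block_hashes(data: bytes) -> list:
    """Hash each fixed-size block of the data."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]


def shared_fraction(old: bytes, new: bytes) -> float:
    """Fraction of blocks in `new` that already exist somewhere in `old`."""
    known = set(block_hashes(old))
    new_blocks = block_hashes(new)
    return sum(h in known for h in new_blocks) / len(new_blocks)


# Two versions of a document that differ only in their first six bytes.
v1 = "".join(f"record {i}: value {i * i}\n" for i in range(2000)).encode()
v2 = b"RECORD" + v1[6:]

raw_frac = shared_fraction(v1, v2)
zip_frac = shared_fraction(zlib.compress(v1), zlib.compress(v2))

print(f"uncompressed blocks shared: {raw_frac:.1%}")
print(f"compressed blocks shared:   {zip_frac:.1%}")
```

Uncompressed, almost every block of the second version is already present in the first. After compression, a tiny edit usually ripples through the rest of the output, so the two compressed versions tend to have few or no blocks in common.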

If you're trying to deduplicate compressed files, you can read our FAQ about how to get the best compression on compressed files.

SpiderOak performs deduplication only on files stored within your own account, never across users. We explain this in more detail in our blog post Why SpiderOak doesn't deduplicate data across users and why it should worry you if we did.