Conversations about life & privacy in the digital age

What does i_ m__n __ __v_r _____ ___ ____ ____ ___c_?

We have been getting a lot of questions lately about our block level
de-duplication, how it works, and how it is applied through the SpiderOak
process. As I consider myself to be a layman, please allow me to explain this in
simpler terms – such that even I will be able to understand.

For the sake of this example, let us say you have created a document
entitled ‘Why peanut butter and jelly sandwiches are better when you place
salt & vinegar chips in the middle’. The size of this document is 10k.
After saving the initial version, you go back and make 9 additional edits.
Each time you make an edit, you save the document as a new version, thus giving
you 10 complete versions. With each version being exactly 10k, storing every
version in full takes up a total of 100k on disk (10 versions multiplied by
10k).

SpiderOak, on the other hand, works much more efficiently when storing data,
creating many wonderful benefits for the user. As you can imagine, from the
first version of ‘Why peanut butter and jelly sandwiches are better when you
place salt & vinegar chips in the middle’ to the last, only small pieces
of the document have changed. One simple example is replacing the word
‘excitable’ with the word ‘volatile’ in the third paragraph. Instead of
storing (and uploading) a whole new version of the document each time a small
change is made, SpiderOak breaks each document into blocks of data and then
only backs up (or uploads) the change or delta between the new version and the
old. Using this process, the same 10 versions of the aforementioned document
on SpiderOak amount to only 15k on disk (as opposed to 100k above).

Although the visual example below uses only two versions of a document, it
further illustrates how the SpiderOak de-duplication process works.

This process saves our users a considerable amount of space, and a user is
billed only for the de-duplicated amount. Furthermore, the upload can occur
with much greater speed because only the changed blocks of data are sent from
one version to the next. In the end, SpiderOak works extraordinarily hard to
never upload and/or store the same block of data twice – saving our users
money and time.
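
For readers who like to see an idea in code, here is a minimal sketch of block-level de-duplication. It is not SpiderOak's actual implementation: the fixed 1 KB block size, the use of SHA-256 to identify blocks, and the names BlockStore, split_blocks and save_version are all assumptions made purely for illustration.

```python
import hashlib

BLOCK_SIZE = 1024  # hypothetical fixed block size, chosen only for illustration


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into consecutive fixed-size blocks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Toy store that keeps each distinct block once, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def save_version(self, data: bytes) -> tuple[list[str], int]:
        """Store one version; return its list of block digests and the bytes newly stored."""
        manifest = []
        new_bytes = 0
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:  # only unseen blocks cost space (or upload time)
                self.blocks[digest] = block
                new_bytes += len(block)
            manifest.append(digest)
        return manifest, new_bytes

    def restore(self, manifest: list[str]) -> bytes:
        """Rebuild a version by concatenating its blocks in order."""
        return b"".join(self.blocks[d] for d in manifest)

    def stored_bytes(self) -> int:
        """Total size of all distinct blocks held in the store."""
        return sum(len(b) for b in self.blocks.values())


store = BlockStore()
# Version 1: a 10 KB document made of 10 distinct 1 KB blocks.
v1 = b"".join(bytes([65 + i]) * BLOCK_SIZE for i in range(10))
# Version 2: a small same-length edit inside the third block.
v2 = v1[:2 * BLOCK_SIZE] + b"edit" + v1[2 * BLOCK_SIZE + 4:]

m1, new1 = store.save_version(v1)
m2, new2 = store.save_version(v2)
print(new1, new2, store.stored_bytes())  # prints 10240 1024 11264
```

In this toy run the second version adds only the one changed block (1,024 bytes) rather than a second full copy, and both versions can still be restored from their block lists. Note that with fixed-size blocks an edit that changes the document's length shifts every later block, so the sketch uses a same-length edit; a real system would need a smarter way of choosing block boundaries.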

Question: So perhaps now you may better understand the title and how it
relates to de-duplication?

Answer: What does it mean to never store the same data twice?

Comments

  1. Ahh Memories says:

    Reminds me of the block level synchronization that some high-end NAS devices use, such as NetApp and EMC. Basically you've taken that type of technology and applied it to SpiderOak. Love it!

  2. Anonymous says:

    What is the smallest size block that SO uses, and how exactly does this work client-side with binary files?

  3. Alan says:

    The block size is dynamic based on file type and history. It is also not constant throughout the length of a file. (For example because some files change frequently at the head or tail.)

  4. Anonymous says:

    How does de-duplication work if the files are encrypted before leaving the client? That is, how can you compare versions of the files on the client with those files on the server if the files stored on SpiderOak's servers are encrypted? You surely don't download the server files and decrypt for comparison. So how does this work and maintain true security? Also doesn't the encryption process prevent much de-duplication since small changes in an input file can cause large changes in encrypted data when using modern encryption? Are the differences computed before or after encryption?

  5. Tobbe says:

    So, how can the de-duplication work if you can't decrypt the data at your servers? And how can SpiderOak comply with the US law if they can't decrypt the data? Isn't SpiderOak an American company?

  6. Michael says:

    I suspect that the de-duplication works by storing each file as a number of blocks, which are then encrypted. In that way, changes in each block could be detected without the ability to decrypt the file. Can a SpiderOak rep confirm by responding to these very important questions?

    @Tobbe: since the service does not encrypt the data (you do, on your machine), U.S. vs. Doe applies: if you are in the U.S., you cannot be forced to decrypt the data, and the company has no obligation to keep a backdoor (I am not a lawyer, this reflects a lay understanding).

  7. MTK says:

    I'd like to see some answers to the questions of Michael, Tobbe, Anonymous. I think these are quite existential for SpiderOak's service.

  8. RI says:

    Yes, me too. Great questions, yet no answers! Nobody from SpiderOak seems to give a damn.