Conversations about life & privacy in the digital age

What does i_ m__n __ __v_r _____ ___ ____ ____ ___c_?

We have been getting a lot of questions lately about our block-level
de-duplication: how it works and how it is applied throughout the SpiderOak
process. As I consider myself to be a layman, please allow me to explain this
in simpler terms, such that even I will be able to understand.

For the sake of this example, let us say you have created a document
entitled ‘Why peanut butter and jelly sandwiches are better when you place
salt & vinegar chips in the middle’. The size of this document is 10k.
After saving the initial version, you go back and make 9 additional edits.
Each time you make an edit, you save the document as a new version, thus giving
you 10 complete versions. With each version being exactly 10k, the
complete document takes up a total of 100k on disk (10 versions multiplied
by 10k).

SpiderOak, on the other hand, works much more efficiently when storing data
- creating many wonderful benefits for the user. As you can imagine, from the
first version of ‘Why peanut butter and jelly sandwiches are better when you
place salt & vinegar chips in the middle’ to the last, only small pieces
of the document have changed. One simple example is replacing the word
‘excitable’ with the word ‘volatile’ in the third paragraph. Instead of
storing (and uploading) a whole new version of the document each time a small
change is made, SpiderOak breaks each document into blocks of data and then
only backs up (or uploads) the change or delta between the new version and the
old. Using this process, the same 10 versions of the aforementioned document
on SpiderOak amount to only 15k on disk (as opposed to 100k above).
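
To make the idea concrete, here is a minimal Python sketch of block-level
de-duplication. It is an illustration only and not SpiderOak's actual
implementation: the fixed 1k block size, the use of SHA-256, and the names
BlockStore, split_blocks and add_version are all assumptions made for this
example.

```python
import hashlib

BLOCK_SIZE = 1024  # hypothetical block size; the post does not state the real one


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Cut data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Keeps each distinct block once, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self.blocks = {}

    def add_version(self, data: bytes):
        """Store one version; return (block digests, bytes newly stored)."""
        digests = []
        new_bytes = 0
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:  # only unseen blocks cost space
                self.blocks[digest] = block
                new_bytes += len(block)
            digests.append(digest)
        return digests, new_bytes

    def stored_bytes(self) -> int:
        """Total size of all distinct blocks held in the store."""
        return sum(len(block) for block in self.blocks.values())

    def restore(self, digests) -> bytes:
        """Rebuild a version from its list of block digests."""
        return b"".join(self.blocks[d] for d in digests)


if __name__ == "__main__":
    store = BlockStore()
    # A 10k "document" made of ten distinct 1k blocks.
    version1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
    # A small in-place edit that touches only the third block.
    edited = bytearray(version1)
    edited[2 * BLOCK_SIZE + 5] = 255
    version2 = bytes(edited)

    _, first_cost = store.add_version(version1)
    _, second_cost = store.add_version(version2)
    print(first_cost, second_cost, store.stored_bytes())
```

In this sketch the first version costs the full 10k, while the second costs
only the single 1k block that changed. Note that fixed-size blocks only catch
edits that keep the length the same; an edit that changes the length (such as
swapping ‘excitable’ for the shorter ‘volatile’) shifts every later block, so
a real system would need a smarter way of choosing block boundaries.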

Although the below visual example only uses two versions of a document, it
does further explain how the SpiderOak de-duplication process occurs.

This process saves our users a considerable amount of space as a user is
only billed for the de-duplicated amount. Furthermore, the upload can occur
with much greater speed because only the changed blocks of data are sent from
one version to the next. In the end, SpiderOak works extraordinarily hard to
never upload and/or store the same block of data twice – saving our users
money and time.

Question: So perhaps now you may better understand the title and how it
relates to de-duplication?

Answer: What does it mean to never store the same data twice?

SpiderOak command line options — much faster, much less memory

The newest released version of SpiderOak supports command line options,
including --batchmode for scheduled operation, which may be useful to command
line users and “GUI only” people alike. The command line version is
considerably faster for most tasks (3-4x by my estimation) and uses
drastically less memory (for me on OS X, an average VM size of 32 MB, peaking
at 64 MB).

This is supported in versions 1.0.3753 and newer (released today). On
Windows and OS X, an existing SpiderOak install should automatically upgrade
the next time it connects to the server. On Ubuntu or Debian, the apt upgrade
process should get the newest version.

Here’s what you can do (so far) from the command line:

Alan@Alan ~ $ /Applications/ --help

Usage: SpiderOak basic command line usage:

  -h, --help            show this help message and exit
  --print-selection     Print a list of selected and excluded backup items
  --reset-selection     Reset selection (but preserve excluded files)
                        Exclude the given file from the selection
                        Exclude the given directory from the selection
                        Include the given directory in the selection
  --force               Do in/exclusion even if the path doesn't exist
  --headless            Never start the GUI
  --batchmode           set the config option exit_when_nothing_to_do to true

Most of these are self-explanatory. --headless and
--batchmode are the ones I use most often. We’ll be adding support
for much more command line control in the future. Send mail to cmdline at if you want to suggest other options.

--headless just runs SpiderOak with no GUI at all. It just runs,
without printing anything to the console, so there’s no interactivity and no
activity indicators (except what’s written to spideroak.log). This
is suitable for use on servers or other environments where you want something
to run continuously, using as few resources as possible, without any user
interaction.
By the way, one of the benefits of a fault-tolerant application design is
that you don’t have to be nice to it. Feel free to force quit or kill (even
-9) at any time, and SpiderOak will roll back any uncommitted transactions and
resume uploading or building where it left off, without corruption, the next
time you start. If you need all the available bandwidth for your first person
shooter or Skype, or you’re just trying to make your battery last as long as
possible, just killall SpiderOak and restart it when you want backups
to resume.
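
For readers curious about what "rolling back uncommitted transactions" looks
like in practice, here is a small generic Python sketch using the standard
sqlite3 module. It says nothing about how SpiderOak stores its own state; the
table name, the function name interrupted_run and the use of SQLite are
assumptions chosen purely for illustration. Closing the connection without
committing stands in for the process being interrupted.

```python
import os
import sqlite3
import tempfile


def interrupted_run(db_path: str) -> int:
    """Commit one row, abandon a second, and return how many rows survive."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS uploaded (block TEXT)")
    conn.execute("INSERT INTO uploaded VALUES ('block-1')")
    conn.commit()  # this work is durable from here on

    # This insert opens a new transaction that is never committed.
    conn.execute("INSERT INTO uploaded VALUES ('block-2')")
    conn.close()  # stand-in for an interruption: the pending work is discarded

    # "Restart": a fresh connection sees only the committed state.
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM uploaded").fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    print(interrupted_run(path))
```

Only the committed row is visible after the "restart", which is the property
that lets an application pick up from its last consistent state.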

The next option is --batchmode (which implies --headless).
This means that SpiderOak will do all available work (i.e. scan the filesystem,
then build and upload everything in the queue, download and replay transactions
from other devices), and then exit. This is a good option for scheduled use.
You can add this to a cron job, or just run it yourself periodically whenever
you want to update your backup set.
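
As a sketch of scheduled use, a small Python wrapper like the one below could
be called from cron. The only SpiderOak-specific detail taken from this post
is the --batchmode flag; the executable path is a placeholder you must fill in
yourself, and the function names and the crontab line in the comment are
assumptions made for this example. The post does not document what exit
status SpiderOak returns, so the wrapper simply passes it through.

```python
import subprocess

# Placeholder: set this to wherever the SpiderOak executable lives on your
# machine. The real path is not given here.
SPIDEROAK_BIN = "/path/to/SpiderOak"


def batch_command(binary: str = SPIDEROAK_BIN) -> list:
    """Build the argument list for one batch-mode run."""
    return [binary, "--batchmode"]


def run_backup(binary: str = SPIDEROAK_BIN) -> int:
    """Run SpiderOak once in batch mode and return its exit status."""
    return subprocess.run(batch_command(binary)).returncode


if __name__ == "__main__":
    # Example crontab entry (hypothetical paths), running nightly at 02:00:
    #   0 2 * * * /usr/bin/python3 /path/to/run_backup.py
    raise SystemExit(run_backup())
```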

SpiderOak is also careful not to start more than one instance of itself at a
time. For example, if you schedule SpiderOak to run in --batchmode
each night, and for the first few days, SpiderOak has so much to upload that it
does not finish before the next scheduled startup time, you don’t need to worry
about coming back to find several instances running.
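
If you are wondering how a program can avoid starting twice, one common
approach on Unix-like systems is an advisory lock on a file. The Python sketch
below shows that general technique; it is not a description of how SpiderOak
itself does it, and the names AlreadyRunning and acquire_instance_lock are
made up for this example. It relies on the fcntl module, so it will not run on
Windows.

```python
import fcntl
import os


class AlreadyRunning(Exception):
    """Raised when another holder already owns the lock file."""


def acquire_instance_lock(path: str) -> int:
    """Take an exclusive, non-blocking lock on path and return the open fd.

    Keep the returned descriptor open for as long as the program runs.
    The operating system drops the lock when the process exits, even if
    it was killed, so no stale lock is left behind.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise AlreadyRunning(path)
    return fd


if __name__ == "__main__":
    import tempfile

    lock_path = os.path.join(tempfile.gettempdir(), "single_instance_demo.lock")
    try:
        lock_fd = acquire_instance_lock(lock_path)
    except AlreadyRunning:
        raise SystemExit("another instance is already running")
    print("lock acquired; doing work")
    os.close(lock_fd)  # closing the descriptor releases the lock
```

A second copy started while the first still holds the lock would hit the
AlreadyRunning branch and exit immediately, which is the behaviour you want
from an overlapping scheduled run.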

In the next major release of SpiderOak, we’re restructuring the user
interface to be at least as efficient as the command line version is now.
So, we expect the 1.5.0 series GUI versions to be several times faster than the
1.0.0 series GUI versions are today.