Conversations about life & privacy in the digital age

Why and How SpiderOak's architecture is different from other online storage services: The surprising consequences for database design of our Zero-Knowledge approach to privacy.

First, let’s consider the design of products like Mozy or SugarSync. On the server they have a database of all of your files. This database includes foldernames, filenames, last modification times, sizes, etc, all plainly readable. Maybe they encrypt the data (the contents of each file). The backup client on your computer knows which files need to be uploaded by talking to the server and querying for differences between the local file system and the remote database.

In the SpiderOak world, no such central database of your files exists. Rather, you keep your own database [1]. If you have several computers all connected to your SpiderOak account, each of them maintains a local database giving them a full view into your account-wide storage.

This client-side database is updated continuously as uploads from all computers in your account progress. Each upload is a transaction. We stuff changes into a transaction until it reaches 10 MB or 500 files [2]. The contents of the transaction are sequentially numbered data blocks (the data) and entries in sequentially numbered journals (the metadata). For each transaction, the server stores everything and passes only the metadata along to all the other devices in your account.
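As a rough illustration of the batching rule described above, here is a minimal Python sketch. It is not SpiderOak's actual client code: the `Transaction` class, the `batch_changes` function, and the reading of "10 meg" as 10 MiB are assumptions made for the example, and encryption is left out entirely.

```python
from dataclasses import dataclass, field

# Thresholds from the post; treating "10 meg" as 10 MiB is an assumption.
MAX_BYTES = 10 * 1024 * 1024
MAX_FILES = 500


@dataclass
class Transaction:
    """Hypothetical container for one upload transaction."""
    number: int
    files: list = field(default_factory=list)  # (path, size) pairs
    size: int = 0


def batch_changes(changes):
    """Group (path, size) changes into sequentially numbered transactions.

    A transaction is closed as soon as it reaches either threshold, so a
    single large file can push a transaction past MAX_BYTES.
    """
    transactions = []
    current = Transaction(number=1)
    for path, size in changes:
        current.files.append((path, size))
        current.size += size
        if current.size >= MAX_BYTES or len(current.files) >= MAX_FILES:
            transactions.append(current)
            current = Transaction(number=current.number + 1)
    if current.files:
        transactions.append(current)  # final, partially filled transaction
    return transactions
```

With 1,200 small files this yields three transactions of 500, 500, and 200 files. That matches the behavior in footnote [2], where many files progress through the upload status in unison.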

In this sense, SpiderOak is really more of a peer-to-peer application than a client-server application. The traffic all goes through central servers but that’s just a conveniently reliable medium for data-passing and storage. The servers can’t read any of it.

There are some clear benefits and challenging drawbacks of peer-to-peer database architecture.

The biggest benefit is obviously the stated goal of preserving full and complete privacy – or ‘Zero-Knowledge’ as we call it.

One drawback is that it’s harder to program. Usually complex systems are built first with central management and then evolve into a more peer-to-peer architecture for scalability reasons. Think of Napster evolving into things like Gnutella and BitTorrent. We enjoy the challenge of creating privacy-preserving software that works just as well as the alternatives. Indeed, this is one of the main reasons we started SpiderOak. However, it does mean that almost all features require more implementation work than they might if we had chosen unencrypted storage.

Another drawback is that CPU and memory use are sometimes higher. We’re working steadily to minimize this, but ultimately SpiderOak simply has more work to do than products that don’t maintain a ‘Zero-Knowledge’ orientation. Recent versions of SpiderOak use far fewer system resources than earlier ones, and this will improve dramatically again in the next few versions.

Yet another drawback is that computational work is duplicated on each client. Instead of the server updating a single database once for each change, each of your computers updates its database for all changes. (Obviously this disadvantage doesn’t apply if you only use one computer with SpiderOak.)
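To make the duplicated work concrete, here is a hedged Python sketch of each device replaying the same numbered journal entries into its own local database. The entry layout `(number, op, path, meta)`, the `"put"` and `"delete"` operations, and the `apply_journal` function are hypothetical. The real journals are encrypted and their format is not described in this post.

```python
def apply_journal(local_db, last_applied, entries):
    """Apply journal entries in numeric order, skipping ones already applied.

    local_db maps path -> metadata; entries are (number, op, path, meta)
    tuples. Returns the number of the last entry applied.
    """
    for number, op, path, meta in sorted(entries, key=lambda e: e[0]):
        if number <= last_applied:
            continue  # already replayed on this device
        if number != last_applied + 1:
            break  # gap in the sequence: wait for the missing entry
        if op == "put":
            local_db[path] = meta
        elif op == "delete":
            local_db.pop(path, None)
        last_applied = number
    return last_applied
```

Every device runs the same loop over the same entries, so each one ends up with an identical view of the account. The cost is that the work is done once per device rather than once on a server.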

That said, a surprising benefit is the effect on total service cost. You may have noticed that SpiderOak offers some of the best pricing per gigabyte for online storage available anywhere. There are other factors contributing to this, but it definitely helps that SpiderOak clients handle most of the database work. The server’s role is mostly relegated to data storage and retrieval. This lets us focus on building servers with very dense storage without the need for high speed databases and lots of system memory to run them in. (Although some of those needs reappear for servicing functions like Web-Access and SpiderOak Shares.)

For us, regardless of the advantages and drawbacks of the decisions we made, the choice has always been clear. We set out to build a backup system we ourselves felt comfortable using, which is why ‘Zero-Knowledge’ privacy was always the right path for us.

[1] If you want, you can examine this database yourself. Hint: Use the libraries from our code page to make more sense of the database structure. The database files implement a complete transactional filesystem inside a database, as well as some relational tables.

[2] If you ever wondered why the upload status shows several files uploading at the same time, with the percentage complete changing in unison for all of them, this is why. Each transaction is uploaded as a unit. The server doesn’t know which or how many of your files it contains.

SpiderOak v2.3 beta: New Command Line Interface

Recognizing that the command line interface lacked some capabilities
compared to the GUI, we’ve given it a full makeover as the central improvement
in 2.3.

We’ve also added a folder changelog feature that lets you see exactly how a
folder changed over time, and a collection of other things.

Full release notes here.

Unless something unexpected arises, this should be the last SpiderOak
release before we merge in the Sync code. So, for those who have inquired,
look for it right around the corner.

Beta download links for all platforms are below. The usual cautions about
running beta software apply, of course. If you decide to download a beta
version, please send us a note at beta at spideroak.com so we can update you
on future beta releases.

Here’s the new “--help” text for using the command line. This works
the same across Windows, Mac, and Linux.


Alan@Alan ~ $ /Applications/SpiderOak.app/Contents/MacOS/SpiderOak --help
Usage: SpiderOak basic command line usage:

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         be verbose

  Operational Modes and Commands:
    --backup=TARGET     ad hoc operation: backup whatever exists at TARGET in
                        the filesystem and exit (ignores existing backup
                        selection.)
    --restore=item      Restore a folder, file, or version.
                        Run "--restore help" for more info
    --headless          run in headless mode (without the graphical interface)
    --batchmode         like headless, but will exit when all available work
                        is done
    --scan-only         scan the filesystem for changes and report a summary
    --scan-and-build-only
                        scan the filesystem, and build all possible file
                        system changes as shelved upload transactions, and
                        exit without uploading them

  Information Commands:
    --userinfo, --user-info
                        Show user and device info
    --space             Show space usage information by category and by device
    --tree              Show the hierarchy of stored backup folders
    --tree-changelog    Show a log of how the hierarchy of stored backup
                        folders has changed over time
    --journal-changelog=folder_or_journal
                        Show the changelog of a given folder
    --shelved-x, --print-shelved-x
                        Show information about each shelved upload transaction

  Backup Selection Manipulation Commands:
    --selection, --print-selection
                        Show a list of selected and excluded backup items
    --reset-selection   Reset selection (but preserve excluded files)
    --exclude-file=EXCLUDE_FILE
                        Exclude the given file from the selection
    --exclude-dir=EXCLUDE_DIR
                        Exclude the given directory from the selection
    --include-dir=INCLUDE_DIR
                        Include the given directory in the selection
    --force             Do in/exclusion even if the path doesn't exist

  Maintenance Commands:
    --vacuum            Vacuum SpiderOak's local database (rebuilds indexs and
                        reclaims space)
    --rebuild-reference-database
                        rebuild the SpiderOak reference database (can take
                        awhile)

  Dangerous/Support Commands:
    Caution: Do not use these commands unless advised by SpiderOak
    support.  They can damage your installation if used improperly.

    --empty-garbage-bin
                        purge all deleted items on the current device
    --destroy-shelved-x
                        destroy each shelved upload transaction (not intended
                        for general use -- this will damage your break your
                        account if not used correctly)
    --apply-subscription-xact
                        apply all transactions previously received from remote
                        devices -- (not intended for general use -- this
                        normally happens automatically)

General Beta Download (autodetect OS and architecture, except Ubuntu Gutsy):
https://spideroak.com/directdownload?beta=yes

Platform Specific Downloads

Mac OS X 10.4 and 10.5 (Universal Binary for Intel and PowerPC):
https://spideroak.com/directdownload?beta=yes&platform=mac

Windows 2000, Server 2003, XP, Vista:
https://spideroak.com/directdownload?beta=yes&platform=win

Linux Ubuntu “Hardy” and “Intrepid” 32 bit:
https://spideroak.com/directdownload?beta=yes&platform=ubuntu32

Linux Ubuntu “Hardy” and “Intrepid” 64 bit:
https://spideroak.com/directdownload?beta=yes&platform=ubuntu64

Linux Ubuntu “Gutsy” 32 bit:
https://spideroak.com/directdownload?beta=yes&platform=ubuntu_old32

Linux Ubuntu “Gutsy” 64 bit:
https://spideroak.com/directdownload?beta=yes&platform=ubuntu_old64

Slackware >= 12.1 (preliminary support):
https://spideroak.com/static/main/spideroak-6218.tgz

Scary Halloween

Can you guess which one is mine?

Stop Judging Resumes: Virtuously Virtual Hiring Practices

In my own experience there’s been very little relationship between the
quality of a resume and the eventual usefulness of a developer. I’ve seen guys
with great work history, references, advanced degrees, numerous publications,
and so on, and yet their presence proved less valuable than their absence.
Meanwhile some of the most rewarding engineers I’ve worked with introduced
themselves with nothing more than a simple letter.

At a previous company I worked with in the dot-com era, we created an epic
test for long distance interviews for a Perl programmer/Linux sysadmin role.
It consisted of questions that a veteran hacker would maybe know 80 or 90% off
the top of his head, and exactly which man pages to look up for another 10%.
Cute stuff like “How can you rm a file named -rf?” and “Name
3 things you can accomplish at a GRUB prompt.” We would arrange a designated
time and email the applicant the test. They had one hour (which we would pay
them for) to return it. The test was so long and specific there was no hope of
completion if you needed Google’s help for a large portion of the answers. The
feedback from many applicants was elaborately negative.

These days our process is more to the point. If we’re considering bringing
someone on staff, we start by giving them some work to do. We find detachable
development tasks that will further the SpiderOak cause, send them a minimal
set of instructions, and let them run with it. It’s usually something
smallish, 1 – 3 days at most. As an all telecommute team, we’re already
accustomed to giving code feedback. When they’re done, they send us a bill and
we send them a review.

Sometimes we give several people the same task. The results often show an
obvious contrast of strengths and weaknesses across several applicants, and it
conserves the (sometimes scarce) resource of development tasks that don’t
require detailed knowledge of core SpiderOak source code. Sometimes we’re not
sure after the first task so we give more.

I’m sure there are big corporate HR departments who would be astonished to
learn that the best predictor of a developer’s usefulness might be an ability
to complete development tasks.

1.4x Series Builds Released: 4x faster + Share RSS feeds

Last week we released the 1.4x series builds of SpiderOak out of beta.
Download here. Existing installs on Windows and OS X should self-update, and
the newest Linux packages are in the apt repository. Most users seem to have
upgraded already.

In addition to many internal improvements, users are likely to notice a huge
responsiveness and speed boost: we estimate 4x faster than the predecessor.

There are also several new command line options, and SpiderOak now supports
backup queues larger than physical memory, so go ahead and install on machines
with hundreds of thousands of files.

A new web feature you may have noticed: each SpiderOak share room now has an
associated RSS feed which will indicate when items in a share room are changed
or updated with newer versions. So, for example, every time you add new
pictures to a shared folder, or you make changes and a new version of a
document in one of your shared folders is archived, subscribers to the RSS feed
will automatically notice without you having to do anything extra.

See the complete release notes for more information.

SpiderOak command line options — much faster, much less memory

The newest released version of SpiderOak supports --batchmode
scheduled operation, which may be useful to command line users and “GUI only”
people alike. The command line version is considerably faster for most tasks
(3-4x by my estimation), and uses drastically less memory (for me on OS X, an
average VM size of 32 MB, peaking at 64 MB).

This is supported in versions 1.0.3753 and newer (released today.) On
Windows and OS X, an existing SpiderOak install should automatically upgrade
the next time it connects to the server. On Ubuntu or Debian, the apt upgrade
process should get the newest version.

Here’s what you can do (so far) from the command line:


Alan@Alan ~ $ /Applications/SpiderOak.app/Contents/MacOS/SpiderOak --help

Usage: SpiderOak basic command line usage:

Options:
  -h, --help            show this help message and exit
  --print-selection     Print a list of selected and excluded backup items
  --reset-selection     Reset selection (but preserve excluded files)
  --exclude-file=EXCLUDE_FILE
                        Exclude the given file from the selection
  --exclude-dir=EXCLUDE_DIR
                        Exclude the given directory from the selection
  --include-dir=INCLUDE_DIR
                        Include the given directory in the selection
  --force               Do in/exclusion even if the path doesn't exist
  --headless            Never start the GUI
  --batchmode           set the config option exit_when_nothing_to_do to true

Most of these are self-explanatory. --headless and
--batchmode are the ones I use most often. We’ll be adding support
for much more command line control in the future — send mail to cmdline at
spideroak.com if you want to suggest other options.

--headless just runs SpiderOak with no GUI at all. It runs without
printing anything to the console, so there’s no interactivity and no
activity indicators (except what’s written to spideroak.log). This
is suitable for use on servers or other environments where you want something
to run continuously, using as few resources as possible, without any user
input.

By the way, one of the benefits of a fault-tolerant application design is
that you don’t have to be nice to it. Feel free to force quit or kill (even
-9) at any time, and SpiderOak will roll back any uncommitted transactions and
resume uploading or building where it left off, without corruption, the next
time you start. If you need all the available bandwidth for your first person
shooter or Skype, or you’re just trying to make your battery last as long as
possible, just killall SpiderOak and restart it when you want backups
to resume.
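The rollback behavior can be illustrated with any transactional database. The sketch below uses Python's built-in sqlite3 module purely as a stand-in. This post does not say which database engine SpiderOak uses, and the `queue` table and file names are invented for the example.

```python
import sqlite3


def demo_rollback():
    """Show that only committed work survives an interrupted transaction."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE queue (path TEXT)")
    conn.commit()

    # This change is committed, so it is durable.
    conn.execute("INSERT INTO queue VALUES ('committed.txt')")
    conn.commit()

    # This change is never committed. rollback() stands in for the recovery
    # a transactional store performs after the process is killed.
    conn.execute("INSERT INTO queue VALUES ('uncommitted.txt')")
    conn.rollback()

    rows = [row[0] for row in conn.execute("SELECT path FROM queue")]
    conn.close()
    return rows
```

Calling `demo_rollback()` returns only the committed row. The general principle is that work is either fully recorded or not recorded at all, which is why an abrupt kill can be safe.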

The next option is --batchmode (which implies --headless).
This means that SpiderOak will do all available work (i.e. scan the filesystem,
then build and upload everything in the queue, download and replay transactions
from other devices), and then exit. This is a good option for scheduled use.
You can add this to a cron job, or just run it yourself periodically whenever
you want to update your backup set.
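For scheduled use, a small wrapper like the following could be called from cron or another scheduler. It is only a sketch: the binary path is the macOS one shown in the help example above and will differ on other platforms, `batch_command` and `run_batch` are names made up here, and the meaning of SpiderOak's exit codes is not documented in this post.

```python
import subprocess

# Path taken from the macOS example in this post; adjust for your platform.
SPIDEROAK = "/Applications/SpiderOak.app/Contents/MacOS/SpiderOak"


def batch_command(binary=SPIDEROAK):
    """Build the argument list for one batch-mode run."""
    return [binary, "--batchmode"]


def run_batch(binary=SPIDEROAK):
    """Run SpiderOak until all available work is done; return its exit code."""
    return subprocess.run(batch_command(binary)).returncode
```

A nightly crontab entry invoking the binary directly with --batchmode would work just as well. The wrapper is mainly useful if you want to log the result or chain other steps after the backup.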

SpiderOak is also careful not to start more than one instance of itself at a
time. For example, if you schedule SpiderOak to run in --batchmode
each night, and for the first few days, SpiderOak has so much to upload that it
does not finish before the next scheduled startup time, you don’t need to worry
about coming back to find several instances running.
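One common way to get this kind of single-instance behavior is an atomically created lock file. The sketch below shows the general technique only. How SpiderOak itself detects a running instance is not described in this post, and `acquire_lock` and `release_lock` are hypothetical names.

```python
import os


def acquire_lock(path):
    """Atomically create the lock file; return its fd, or None if it exists."""
    try:
        return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None  # another instance already holds the lock


def release_lock(fd, path):
    """Close and remove the lock file so a later run can start."""
    os.close(fd)
    os.remove(path)
```

A second process calling `acquire_lock` on the same path gets `None` and can simply exit. Note that a plain lock file is left behind if the process is killed, so a robust implementation would also need to detect stale locks, for example by recording and checking the owner's process ID.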

In the next major release of SpiderOak, we’re restructuring the user
interface to be at least as efficient as the command line version is now.
So, we expect the 1.5.0 series GUI versions to be several times faster than the
1.0.0 series GUI versions are today.