Conversations about life & privacy in the digital age

SpiderOak DIY: A space efficient key/value store for arbitrarily large values. Now in beta.

Update: The SpiderOak DIY service has been discontinued and is being replaced by our new Nimbus.io storage service, a new work based on everything we learned from DIY and our previous internal storage projects. It is also open source, with a fancy new ZeroMQ based architecture. Please visit nimbus.io for more information and to request an invite to use that service. The information below is provided for historical purposes only.

We alpha launched DIY a few months ago to allow SpiderOak customers to directly store data on the SpiderOak storage network via https. It’s similar to Amazon S3, but tuned for large backup and archival class data, and thus much less expensive. It’s also open source, on both the server and client side.

Today DIY moves into beta, and we’ve been using it ourselves to implement new features for some time.

Basically, if you’re already using S3 as backup storage, switching to DIY will save you a great deal. You could also use the DIY code to run your own space efficient, redundant storage clusters for large data.

One of the things we’re pleased with is how comprehensible the DIY implementation is. It turns out that focusing on space efficiency and high throughput (instead of low latency for each request) allows a number of design simplifications compared to other scalable storage systems.

This is a project you can easily jump into and make progress on quickly. It’s built using zfec for parity striping, Python, gevent, and RabbitMQ, with a framework we created for quickly building small, message-oriented processes.
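If “parity striping” is unfamiliar, here is a deliberately simplified, single-parity sketch of the idea in Python. It is not zfec’s API (zfec does configurable k-of-m erasure coding and can tolerate the loss of several segments at once); it only shows how a parity segment lets you rebuild one missing piece:

    from functools import reduce

    # Toy single-parity striping (RAID-4 style), for illustration only.
    # zfec's real k-of-m coding is more general than this.

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def stripe_with_parity(data, k):
        """Split data into k equal-size segments plus one XOR parity segment."""
        seg_len = -(-len(data) // k)               # ceiling division
        padded = data.ljust(seg_len * k, b"\x00")  # pad so segments line up
        segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(k)]
        return segments, reduce(xor_bytes, segments)

    def recover_missing(segments, parity, missing_index):
        """Rebuild one lost segment by XORing the parity with the survivors."""
        survivors = [s for i, s in enumerate(segments) if i != missing_index]
        return reduce(xor_bytes, survivors, parity)

    segments, parity = stripe_with_parity(b"an arbitrarily large value", k=4)
    assert recover_missing(segments, parity, missing_index=2) == segments[2]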

Feedback from users and developers is much appreciated.

Who would win in a struggle between all the Mac OS X cats and all the Linux animals?

I just hope they’ve backed up their data; this is about to get messy.

Here’s the layered tiff if you want to play around with it, and the 1920×699 jpeg.

EDIT: here is a 2.2MB .rar package of high-res desktop backgrounds featuring the image for single and dual-screen setups, as requested – Download Link!

50 bonus gigs to the first TEN commentators to accurately identify each creature from left to right!

Update Here

So go ahead and download the SpiderOak client and join for free while you think. Drop your answer in the comment section, and if you are correct we will soon post a follow-up with the winning comments and instructions on how to get your bonus GB!

Why SpiderOak doesn’t de-duplicate data across users (and why it should worry you if we did)

One of the features of SpiderOak is that if you backup the same file
twice, on the same computer or different computers within your account, the 2nd
copy doesn’t take up any additional space. This also applies if you have
several versions of a file as it evolves over time — we only need to save the
new data blocks.

Some storage companies take this de-duplication to a second level, and do a
similar form of de-duplication across all the data from all their customers.
It’s a great deal for the company. They can sell the bytes of storage to every
user at full price while incurring zero additional cost. In some ways it’s
helpful to the user too — uploads are certainly faster when you don’t have to
transfer the data!

How does cross user data de-duplication even work?

The entire process of a server de-duplicating files that haven’t even been
uploaded to the server yet is a bit magical, and works through the properties
of cryptographic hash functions. These allow us to make something like a
fingerprint of any file. Like people, no two files should have the same
fingerprints, right? The server can just keep a database of file fingerprints
and compare any new data to these fingerprints.

So it’s possible for the server to de-duplicate and store my files, knowing
only the fingerprints. So, how does this affect my privacy at all then?

With only the knowledge of a file’s fingerprint, there’s no clear way to
reconstruct the file the fingerprint was made from. We could even use a
technique for prepending deduplicated files with some random data when making
fingerprints, so they would not match outside databases of common files and
their fingerprints.
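As a rough sketch of the mechanism (the choice of SHA-256 and every name below are ours, purely for illustration, not a description of any particular service):

    import hashlib, os

    # Illustrative only: the hash function and names here are assumptions.

    def fingerprint(path, salt=b""):
        """Hash a file, optionally prefixed with random data ("salt")."""
        h = hashlib.sha256()
        h.update(salt)                       # a secret prefix keeps fingerprints
        with open(path, "rb") as f:          # from matching public hash databases
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    class DedupIndex:
        """The server's view: fingerprints only, never file contents."""
        def __init__(self):
            self.known = set()

        def needs_upload(self, fp):
            if fp in self.known:
                return False                 # duplicate: nothing to transfer
            self.known.add(fp)
            return True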

However, imagine a scenario like this. Alice has operated the best BBQ
restaurant in Kansas City for decades. No one can match Alice’s amazing sauce.
Suddenly Mallory opens a BBQ joint right across the street, with better prices
and sauce that’s just as good! Alice is pretty sure Mallory has stolen the
recipe right off her computer! Her attorney convinces a court to issue a
subpoena to SpiderOak: Does Mallory have a copy of her recipe? “How would we
know? We have no knowledge of his data beyond its billable size.” Exasperated,
the court rewrites their subpoena, “Does Mallory’s data include a file with
matching fingerprints from the provided recipe file here in exhibit A?” If
we have a de-duplication database, this is indeed a question we can answer, and
we will be required to answer. As much as we enjoyed Alice’s BBQ, we never
wanted to support her cause by answering a 3rd party’s questions about customer
data.

Imagine more everyday scenarios: a divorce case; a patent, trademark, or
copyright dispute; a political case where a prosecutor wants to establish that
the high level defendant “had knowledge of” the topic. Establishing that they
had a document about whatever it was in their personal online storage account
might be very interesting to the attorneys. Is it a good idea for us to be
even capable of betraying our customers like that?

Bonus: Deduping via Cryptographic Fingerprints Enables The Ultimate Sin

The ultimate sin from a storage company isn’t simply losing customer data.
That’s far too straightforward a blunder to deserve much credit, really.

The ultimate sin is when a storage company accidentally presents Bob’s data
to Alice as if it were her own. At once Bob is betrayed and Alice is
frightened.
This is what can happen if Bob and Alice each have different files
that happen to have the same fingerprints.

Actually cryptographic hashes are more like DNA evidence at a crime scene
than real fingerprints — people with identical DNA markers can and do exist.
Cryptographers have invented many smart ways to reduce the likelihood of this,
but those ways tend to make the calculations more expensive and the database
larger, so some non-zero level of acceptable collision risk must be
determined. In a large enough population of data, collisions happen.
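For a sense of scale, the standard birthday-bound approximation (nothing specific to any storage service) estimates the chance of at least one collision among n random fingerprints drawn from a b-bit space:

    import math

    def collision_probability(n_files, hash_bits):
        """Birthday-bound estimate: P(any collision) among n values
        drawn uniformly from a space of 2**hash_bits fingerprints."""
        space = 2.0 ** hash_bits
        return -math.expm1(-n_files * (n_files - 1) / (2.0 * space))

    print(collision_probability(1e9, 256))   # ~4e-60: negligible
    print(collision_probability(1e9, 64))    # ~0.027: no longer negligible

A full-length modern hash keeps the risk astronomically small; the trouble starts when fingerprints are truncated or a weak hash is used, which is exactly the trade against cost and database size described above.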

This all makes for an entertaining conversation between Alice and Bob when
they meet each other this way. Hopefully they’ll tell the operators of the
storage service, which will otherwise have no way of even knowing this error
has happened. Of course, it’s still rather unlikely to happen to you…

There’s a Public Information Leak Anyone can Exploit

Any user of the system can check if a file is already contained within the
global storage set. They do this simply by adding the file to their own storage account,
and observing the network traffic that follows. If the upload completes
without transferring the content of the file, it must be in the backup
somewhere already.

For a small amount of additional work they could arrange to shut down the
uploading program as soon as they observe enough network traffic to know the
file is not a duplicate. Then they could check again later. In this way, they
could check repeatedly over time, and know when a given file enters the global
storage set.
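The inference itself is trivial. In pseudocode-flavored Python (every name below is hypothetical; nothing here describes a real client’s API):

    # Hypothetical sketch: upload(), bytes_sent() and size_of() stand in for
    # whatever client and traffic counter an observer happens to have.

    def already_in_global_storage_set(path, upload, bytes_sent, size_of):
        """If the upload finishes while sending far less traffic than the
        file's size, the content must already exist in the storage set."""
        before = bytes_sent()
        upload(path)                               # add the file to our own account
        transferred = bytes_sent() - before
        return transferred < 0.5 * size_of(path)   # mostly metadata => duplicate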

If you wanted to, you could check right now if a particular file was already
backed up by many online storage services.

How might someone be able to maliciously use this property of a global
de-duplication system to their advantage?

  • You could send a new file to someone using the storage service and know
    for sure when it had arrived in their possession
  • You could combine this with the Canary Trap method
    (http://en.wikipedia.org/wiki/Canary_trap) to expose the specific person
    who is leaking government documents or corporate trade secrets to
    journalists
  • You could determine whether your copyrighted work exists on the backup
    service, and then sue the storage service for information on users storing the
    work

There are also categories of documents that only a particular user is likely
to have.

How much space savings are we really talking about?

Surely more than a few users have the same Britney Spears mp3s and other
predictable duplicates. Across a large population, might 30%, 40%, or perhaps
even 50% of the data be redundant? (Of course there should be greater
likelihood of matches as the total population increases. This effect of
increasing de-duplication diminishes though: it is more significant as the data
set grows from 1 user to 10,000 users than from 10,000 users to 20,000 users,
and so on.)

In our early planning phase with SpiderOak, and during the first few months
while we operated privately before launch, we did a study with a population of
cooperative users who were willing to share fingerprints, anonymized as much as
was practical. Of course, our efforts suffered from obvious selection bias,
and probably numerous other drawbacks that make them unscientific. However,
even when we plotted the equations up to very large populations, we found that
the savings was unlikely to be as much as 20%. We chose to focus instead on
developing other cost advantages, such as building our own backend storage
clustering software.

What if SpiderOak suddenly decides to start doing this in the future?

We probably won’t… and if we did, it’s not possible to do so
retroactively with the data already stored. Suppose we were convinced someday;
here are some ways we might minimize the dangers:

  • We would certainly discuss it with the SpiderOak community first and incorporate the often-excellent suggestions we receive
  • It would be configurable according to each user’s preference
  • We would share some portion of the space savings with each customer
  • We would only de-duplicate on commonly shared and traded filetypes, like mp3s, where it’s most likely to be effective, and least likely to be harmful

A New Approach to Syncing Folder Deletions

One of the goals of SpiderOak sync is that it will never destroy data in a way
that cannot be retrieved, even if a sync happens wrongly.  So, as a design
goal, SpiderOak sync will never delete a file or folder that is not already
backed up.

Every time SpiderOak deletes a file, it checks that the file already exists
in the folder’s journal, and that the timestamp of the file currently on disk
matches the one in the journal (or, if the timestamps differ, that the
cryptographic fingerprints match).  This allows only the narrowest of what
programmers call “race conditions” when deleting a file because of a sync.
Here’s how the race works:

  1. SpiderOak checks that the timestamp and cryptographic fingerprint match the journal (i.e. it is backed up and could be retrieved from the backup set.)
  2. SpiderOak deletes the file

The trouble is that there is a very small time window between steps 1 and 2.
The user could potentially save new data into the file during this very small
time window.  If the user were to save new data into the file at that instant,
the two actions would be racing to completion.

Since the time window is so very small (a matter of milliseconds), this is an
acceptable risk.
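A minimal sketch of that per-file check, with an invented journal structure (this is not SpiderOak’s actual code, just the shape of the logic):

    import os

    def safe_to_delete(path, journal, fingerprint):
        """Only delete a file the journal shows is already backed up: the
        timestamps must match, or failing that, the fingerprints must."""
        entry = journal.get(path)
        if entry is None:
            return False                        # never backed up: never delete
        if os.path.getmtime(path) == entry["mtime"]:
            return True
        return fingerprint(path) == entry["fingerprint"]

    def sync_delete(path, journal, fingerprint):
        # The race: the user can still write to `path` between the check
        # and the unlink, but that window is only milliseconds wide.
        if safe_to_delete(path, journal, fingerprint):
            os.remove(path)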

Now consider the same scenario for deleting a folder.  Again, SpiderOak
makes a pass through the folder and verifies that the complete contents are
available in the journals, then it removes the folder.  The trouble now though
is that the window is much larger between step 1 and 2.  A very large folder
could take minutes for SpiderOak to scan through and verify.  It may be modified
again between the time we start scanning and the time we finish scanning, and
before the deletion begins. 

Even though SpiderOak is plugged into the OS’s system for notification of
changes to the file system, such notifications are not guaranteed to be
immediate or to happen at all (such as on a network volume.)

So there is a larger “race condition,” or opportunity for data to be saved
to the folder between step 1 and 2 in the case of a large folder.

So, SpiderOak again tries to be conservative.  Instead of deleting the
folder, it tries to rename it out of the way.  Then, later, it can verify that
nothing has changed inside the folder since it was renamed out of the
way.

Syncing deletes of folders actually most commonly fails in this renaming
step.  Sometimes it just can’t rename it.  There are some differences in how
Windows and Unix platforms handle open files in these cases, and the rename
solution tends to work well on Unix and has greater opportunity for error on
Windows.  There are also some cases in which it categorically fails — such as
trying to rename across drive letters in Windows or (in Unix) across different
file systems.

We could fix those, but I think an entirely new approach is probably
best.

Starting in the next version, instead of approaching the “delete a folder”
action as the deletion of an entire folder, it will now approach it as the
deletion of each individual item contained within the folder and all of its
subfolders recursively. We will use the same sequence as described above for
individual file deletions for each file, from the lowest subfolders on up, and prune folders when they are free
of files.
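Roughly, the new behavior looks like the sketch below (again, not the shipping code; sync_delete is the per-file check sketched earlier):

    import os

    def delete_folder_bottom_up(root, journal, fingerprint):
        """Delete backed-up files one at a time, deepest folders first,
        then prune any folder that ends up empty."""
        for dirpath, dirnames, filenames in os.walk(root, topdown=False):
            for name in filenames:
                sync_delete(os.path.join(dirpath, name), journal, fingerprint)
            try:
                os.rmdir(dirpath)    # only succeeds once the folder is empty
            except OSError:
                pass                 # something survived; leave it for the user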

This eliminates the need for the rename step and reduces the race condition
down to milliseconds for each removed file. Most importantly, it means that
the files causing problems (i.e. the files that are in use, or are changing
too fast to back up, and that SpiderOak therefore refuses to delete) will be
obvious: they will be the only files remaining.

We’ll have a beta available with this behavior soon, announced in the release notes (https://spideroak.com/release_notes).

An Erlang/OTP SSL proxy and load balancer for Python Twisted’s Perspective Broker

(If the above sounds like gibberish to you, you’re probably not a programmer
and this post won’t be very interesting.)

SpiderOak clients maintain an SSL connection to a Python Twisted Perspective
Broker service to coordinate their actions with the server and with each
other.

To load balance client connections across several Perspective Broker
processes per storage cluster, and route connections from a single public IP to
many storage nodes, we built a proxy server in Erlang. We’ve been running this
in production for several months now.

The design is simple. Erlang/OTP answers the socket, and speaks the
perspective broker protocol just long enough to learn the authentication
credentials the user is attempting to login with. The Erlang server looks up
the user’s assigned storage cluster and node. From there, it simply proxies
the connection (including replaying the authentication sequence) to a Python
Perspective Broker server. After that, it’s a byte-for-byte pass-through proxy
server.

The proxy has some added logic to handle connection affinity — multiple
devices for the same SpiderOak user are passed to the same Perspective Broker
process.
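Conceptually, the affinity rule is just “same user, same backend.” A toy Python illustration of that property (the real proxy is Erlang/OTP and takes the assignment from the user database rather than choosing one itself; the names here are invented):

    import itertools

    class AffinityRouter:
        """Toy illustration: route every connection from a user to one backend."""
        def __init__(self, backends):
            self._next = itertools.cycle(backends)   # naive choice for new users
            self._assigned = {}

        def backend_for(self, username):
            if username not in self._assigned:
                self._assigned[username] = next(self._next)
            return self._assigned[username]

    router = AffinityRouter(["pb-node1:8091", "pb-node1:8092", "pb-node2:8091"])
    assert router.backend_for("alice") == router.backend_for("alice")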

This has allowed us to consume fewer public IP addresses (one per proxy
server, instead of one for each storage node) and take advantage of multiple
processors and greater concurrency per storage machine.

Another small benefit is offloading the cost of SSL from the Python
processes. Erlang has its own native implementation of SSL (not based on
OpenSSL) which seems to operate with more grace.

This is our first production Erlang/OTP service, and it hasn’t been without
its speed bumps, but these days it’s as stable as any of our other daemons
while handling much greater concurrency and traffic.

Today we’re publishing the code (AGPL3) in case it’s useful to anyone else
(and feedback from the Erlang community is certainly welcome!) It would be
useful to anyone wishing to be able to distribute a Perspective Broker service
across many backend nodes according to user assignment, or perhaps a starting
point for implementing a Perspective Broker server in Erlang. It will likely
require some minor massaging to work with your database schema. Here’s a link to
the tarball: https://spideroak.com/dist/spideroak_ssl_proxy.tar.bz2

Announcement: We’re now selling storage à la carte via HTTPS

Update: The SpiderOak DIY service has been discontinued and is being replaced by our new Nimbus.io storage service, a new work based on everything we learned from DIY and our previous internal storage projects. It is also open source, with a fancy new ZeroMQ based architecture. Please visit nimbus.io for more information and to request an invite to use that service. The information below is provided for historical purposes only.

This is an alpha release for the SpiderOak Do-It-Yourself API for storing and accessing data directly on the SpiderOak storage network. This is similar to Amazon’s S3 and other cloud storage services, but designed specifically for the needs of long term data archival.

We’re happy that this service is open source, top to bottom (including the code we run on the storage servers.) It’s also offered at the same very affordable prices as regular SpiderOak storage.

During the alpha, this is only available to SpiderOak customers. Every SpiderOak customer can retrieve an API key and get started immediately if they wish. At the beta release (which will be soon) we’ll enable general signup, and we’ll move out of beta shortly after that.

For details on the implementation, architecture, API, the git repositories for server and client code, please visit the DIY API Project Homepage for more
information.

Update 1: Several people have asked why they don’t see a DIY API key option on their billing page. This is because the DIY API is a paid service, so it’s not available with a 2gb free SpiderOak account. Since the storage is so conveniently accessible over HTTPS, we think it would likely be abused if anyone could easily create 2gb free accounts. However, we’ve set up a $1 upgrade you can use to test DIY when you don’t already have a paid account. Just email support and we’ll give you the upgrade code to use.

Feeling disconnected? This is why.

If you’re having connection problems from the SpiderOak client, the solution is to upgrade to version 3.6.9658 or later.

… because 3 years ago when we launched SpiderOak I generated the SpiderOak SSL certificates that the SpiderOak client uses to verify the identity of the storage server. This is to protect against DNS poisoning attacks (i.e. otherwise an attacker that controlled DNS could attempt to convince your SpiderOak client to upload data to a different server.) These are not the same certificates as for the SpiderOak website.

I thought I had generated certificates valid for 10 years, but they were only valid for the default of 3 years, and thus the certificates expired and connections began failing en masse about an hour ago. Most mistakes you should only make one time, and clearly this would fall under that category.

The verification for the cert is embedded along with the new SpiderOak client. We generated new certs, and fast tracked new builds through testing and release, so please visit the direct download link and all will be well again.
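For anyone who wants to avoid the same surprise, a small monitoring sketch along these lines will warn before a certificate lapses (the host name is a placeholder, and a client that pins its own CA, as SpiderOak does, would configure verification differently):

    import socket
    import ssl
    from datetime import datetime, timezone

    def days_until_cert_expires(host, port=443):
        """Connect, read the server certificate, and report remaining validity."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port)) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]   # e.g. 'Jun  1 12:00:00 2021 GMT'
        expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                         tz=timezone.utc)
        return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

    if days_until_cert_expires("example.com") < 90:      # placeholder host
        print("Certificate expires within 90 days -- renew now!")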

Please accept my deepest apologies; I’ve requested my flogging to be scheduled tomorrow at sunrise.

Dramatic Discovery of New Interpretive Compression Algorithm

Here at SpiderOak, we’re always excited about giving back to the community. In the development of SpiderOak, we’ve contributed a number of our internal projects as open source releases. None of those projects had us quite as excited as our latest release, though. We present to you:

Click here for more details on the algorithm.

Download the source code: invertedkernsquish-0.1.0.tar.ksquish

Improve Productivity and Health by Relocating your Chair

A few months ago I started standing during my working day. The center of the
30″ display is at eye level, with the keyboard and trackball slightly above hip
level.

Motivation

I’ve read that sitting throughout the day (with your upper body supported by
leaning against the back of the chair) causes the back and abdominal muscles
which would otherwise be exerted holding your body upright to atrophy. There
seems to be some research to support this. There are even specific types of
chairs designed to enforce self-supporting posture.

I’ve experimented with many hacks to my personal workspace arrangement over
the years. Many have been dead ends, but often enough they’ve been useful.

Transition

At first I couldn’t comfortably stand all day. My feet would be sore after
3 to 4 hours, so I would stand in the morning, and transition to sitting
whenever my feet complained. Changing between different pairs of shoes helped,
and being barefoot helped but was cold during the winter. I eventually settled
most often on some good quality slippers that just keep my feet warm with
minimal padding or support. You can find slippers that look almost like
professional footwear.

Positive changes I’ve noticed

  • When I’m in a moment of thought while hacking, I’ve noticed that the
    absence of any required effort to “get up” means that I have a greater tendency
    to step away from the screen while I think. I might pace around or look out
    the window.
  • No lower back pain toward the end of the day.
  • My back is overall stronger (that’s apparent through tracking my regular
    resistance training)
  • I’m warmer (higher in the room where the warmer air is, plus the effort to
    stand does burn more calories and maintains body heat.)
  • Reduced eye strain, likely because of more frequent focusing on distant
    objects.
  • Keyboard and trackball positions are slightly more comfortable with less
    pronation. I still plan to eventually switch to a vertical keyboard (typing in
    handshake position instead of palms downward.)
  • Minor but noticeable improvements to digestion and elimination. This may
    sound a bit unusual to discuss, but it’s not surprising. Peristalsis seems to
    suffer from prolonged periods of little body motion.

Negative changes

  • I look like a weirdo with a monitor on a chair on a desk. I’m used to
    standing out, but I’ll get proper display mounts eventually.
  • Less tolerance for long periods of chair sitting. A couple times a week I
    work at a coffee shop. The first couple of hours of this are now actually more
    comfortable, but that ends sooner.
  • People on the internet will laugh at you.

SpiderOak releases lightweight filesystem change notification utilities for Windows, OS X, and Linux (GPLv3)

We’ve decided to open source our “directory watcher” utilities.

These are tiny programs that ship as part of SpiderOak. They are written in C and use native OS specific APIs for obtaining file system change information and reporting it back to the main SpiderOak program in a standardized way.

They might be useful to anyone else who needs file system change notification on multiple platforms.

You can clone the git repos here: