Conversations about life & privacy in the digital age

SpiderOak’s new Amazon S3 alternative is half the cost and open source

As 37signals famously described, in the software business we almost always create valuable byproducts. To build a privacy-respecting backup and sync service that was affordable, we also had to build a world-class, long-term archival storage system.

We had to do it. Most companies in the online backup space (including BackBlaze, Carbonite, Mozy, and SpiderOak, to name a few) have made substantial investments in creating an internal system to store data cost-effectively at massive scale. Those who haven’t, such as Dropbox and JungleDisk, are not price competitive per GB and put their efforts into competing on other factors.

Long-term archival data is different from everyday data. It’s created in bulk, generally ignored for weeks or months with only small additions and accesses, and restored in bulk (and then often in a hurried panic!).

This access pattern means that a storage system for backup data ought to be designed differently from a storage system for general data. When designed for this purpose, reliable long-term archival storage can be delivered at dramatically lower prices.

Unfortunately, the storage hardware industry does not offer great off-the-shelf solutions for reliable long term archival data storage. For example, if you consider NAS, SAN and RAID offerings across the spectrum of storage vendors, they are not appropriate for one or both of these reasons:

  1. Unreliable: They do not protect against whole machine failure. If you have enough data on enough RAID volumes, over time you will lose a few of them. RAID failures happen every day.
  2. Expensive: Pricey hardware and high power consumption. This is because you are paying for low-latency performance that does not matter in the archival data world.

Of course, #1 is solvable by making #2 worse. This is the approach of existing general-purpose redundant distributed storage systems. All offer excellent reliability and performance but require overpaying for hardware. Examples include GlusterFS, Linux DRBD, MogileFS, and more recently Riak+Luwak. All of these systems replicate data to multiple whole machines, making the combined cluster tolerant of machine failure at the cost of 3x or 4x storage overhead. Nimbus.IO takes a different approach, using parity striping instead of replication, for only 1.25x overhead.
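To make the overhead comparison concrete, here is a minimal Python sketch of parity striping. It is not Nimbus.IO's actual scheme: the post only gives the 1.25x figure, so the layout of 4 data segments plus 1 XOR parity segment, and every function name below, are assumptions chosen because they are the simplest case that yields 1.25x and survives the loss of any one segment.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_stripe(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal segments (zero-padded) and append one XOR parity segment."""
    seg_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(seg_len * k, b"\0")
    segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(k)]
    parity = reduce(xor_bytes, segments)
    return segments + [parity]


def recover_stripe(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing segment (marked None) by XOR-ing the survivors."""
    missing = [i for i, seg in enumerate(stripe) if seg is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can only rebuild one lost segment")
    rebuilt = list(stripe)
    if missing:
        survivors = [seg for seg in stripe if seg is not None]
        rebuilt[missing[0]] = reduce(xor_bytes, survivors)
    return rebuilt


def storage_overhead(data_segments: int, redundant_segments: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (data_segments + redundant_segments) / data_segments


if __name__ == "__main__":
    payload = b"long term archival data"
    stripe = make_stripe(payload, k=4)

    # Simulate losing the machine that held segment 2, then rebuild it.
    damaged: List[Optional[bytes]] = list(stripe)
    damaged[2] = None
    restored = recover_stripe(damaged)
    print(b"".join(restored[:4])[:len(payload)] == payload)

    print("parity striping (4+1):", storage_overhead(4, 1))  # 1.25
    print("3-way replication:   ", storage_overhead(1, 2))   # 3.0
```

Real systems that need to survive more than one simultaneous failure use stronger erasure codes than a single XOR parity, but the overhead is computed the same way: (data + redundancy) / data.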

Customers purchasing long-term storage don’t typically notice or care about the difference between a transfer starting in 0.006 seconds or 0.6 seconds. That’s two orders of magnitude of latency. Customers care greatly about throughput (megabytes per second of transfer speed), but latency (how long until the first byte begins moving) is not relevant the way it is if you’re serving images on a website.
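A quick back-of-the-envelope calculation shows why. The sketch below models a transfer as time-to-first-byte plus payload size divided by throughput. The two latency values come from the paragraph above; the 50 MB/s throughput, the 10 GB restore, and the 100 KB image are hypothetical numbers picked only for illustration.

```python
def transfer_seconds(size_mb: float, throughput_mb_s: float, latency_s: float) -> float:
    """Total transfer time: wait for the first byte, then stream the payload."""
    return latency_s + size_mb / throughput_mb_s


def latency_share(size_mb: float, throughput_mb_s: float, latency_s: float) -> float:
    """Fraction of the total transfer time spent waiting for the first byte."""
    return latency_s / transfer_seconds(size_mb, throughput_mb_s, latency_s)


if __name__ == "__main__":
    THROUGHPUT_MB_S = 50.0  # hypothetical sustained throughput

    workloads = (
        ("10 GB bulk restore", 10_000.0),  # hypothetical archival workload
        ("100 KB web image", 0.1),         # hypothetical latency-sensitive workload
    )
    for label, size_mb in workloads:
        for latency_s in (0.006, 0.6):  # the two latencies from the text
            total = transfer_seconds(size_mb, THROUGHPUT_MB_S, latency_s)
            share = latency_share(size_mb, THROUGHPUT_MB_S, latency_s)
            print(f"{label:20s} latency={latency_s:5.3f}s "
                  f"total={total:9.3f}s latency share={share:6.2%}")
```

Under these assumptions, the slower first byte adds well under 1% to the bulk restore, while for the small image it accounts for almost the entire request time.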

Meanwhile the added cost to support those two orders of magnitude of latency performance is huge. It impacts all three of the major cost components – bandwidth, hardware, and power consumption.

A service designed specifically for bulk, long-term, high-throughput storage can easily cost less than half as much to provide.

Since launching SpiderOak in 2007, we’ve rewritten the storage backend software four times and gone through five different major hardware revisions for the nodes in our storage clusters. Nimbus.IO is a new software architecture leveraging everything we’ve learned so far.

The Nimbus.IO online service is noteworthy in that the backend hardware and software are also open source, making it possible for people either to purchase storage from Nimbus.IO, much as they would from S3, or to run storage clusters locally on site.

If you are currently using or planning to adopt cloud storage, we hope you will give Nimbus.IO some consideration. Chances are we can eliminate 2/3 of your monthly bill.

Comments

  1. Bryan says:

    Suggestion: I'm assuming you want an email address for the invite field… you may want to specify that.

  2. Alex says:

    Wait is this $0.06/(GB*month) or $0.06/(GB*year) or something else I'm missing? If the latter I can decommission my server as soon as I get the invite…

  3. David says:

    What a great time to be competing on price.

  4. Tariq says:

    I would love Alex's reading of the price [$6 per year for 100GB], but looking at other prices it seems too much to wish for.

    In case price remains what it is will it be possible to bill in smaller units of 10 or 5 GB?

    When will you "Open source the SpiderOak client software " ? https://spideroak.com/blog/20110628114417-spideroak-looking-inward
    Till that happens we can not edit the user interface and find other use cases for this service. Give us the choice!!!

  5. willemijns says:

    > In case price remains what it is will it be possible to bill in smaller units of 10 or 5 GB?

    yep…

  6. anon says:

    $6 per 100GB of purchased storage. Transfer in is always free. Transfer out is $0.06/GB. PUT and LISTMATCH requests are $0.01 per 1000. Other requests are $0.01 per 10,000 or free.

  7. mikeage says:

    I'm eagerly awaiting this. I hope that nimbus will provide a filesystem like interface as well; currently I backup information to Amazon S3 via s3fs, and I'd love to be able to do this with nimbus as well (bonus points if it's Windows compatible (or webdav) as well); a good client with smart caching can make offsite storage for infrequently used files (but with occasional access) to be quite pleasant.

  8. John Dickinson says:

    The architecture page for nimbus is light on details. How does nimbus compare to something like Openstack swift (http://swift.openstack.org), which is the fully open source storage system behind Rackspace's Cloud Files?

  9. Slav says:

    I agree with John. Can you share a bit more details about architecture? what components are opensource and what are not?

  10. Phil says:

    Very promising guys!

  11. Krishna says:

    minor correction … it is gluster.org and not glusterfs.org

  12. Daniel @ SpiderOak says:

    @john Dickinson @Slav
    Nimbus.io will be 100% open source. That means both the software and the hardware.

  13. fak3r says:

    I'm running a 100TB GlusterFS cluster across 6 nodes, but I am interested to know how your approach differs from theirs (aside from your Arch page), and what benefits it would offer to running standalone (ie- in my own datacenter)

  14. Jack says:

    The biggest advantage of Replication Based systems is geographic redundancy is simple. LA can go down in an earthquake and your data has no downtime.

    With Parity Based storage, rebuilding is super painful. If you lose a node, you need to do crazy reads all across the planet to maintain uptime. And if a datacenter goes down in a fire, you have probably lost all your data completely.

    How do you deal with geographic redundancy of your data?

  15. Marc says:

    If you want to give SpiderOak a try, use this referral code:
    https://spideroak.com/download/referral/317a29ed47a76995ce1dc5c5441b214a
    It will give both of us an extra gigabyte of space for free.

  16. Todd says:

    Will you have a client similar to (but hopefully more lightweight) the spideroak app? So something that's cross platform and linux cli. I don't use any of the advanced features in spideroak like sync folders or shared files, I just use it for offsite backup. If I could get that for less I'm all for it.

  17. awake says:

    I'm a bit confused… Am I on some kind of non-amazon storage plan now and will it be cheaper to switch to one?

  18. Snirp says:

    Could and should I use this to serve static files (images) for my web app?

  19. carl says:

    SpiderOak is a great service. You start with 2GB, but if you open the account following a referral you get one extra gigabyte, and can get up to 50GB with referrals! Also, if you use a coupon you can get 3GB free; the one I know is "worldbackupday" (each user can use it only once).

    Here is my referral if you want to open a new account:

    https://spideroak.com/download/referral/ed59bb64dd954bdebf06667ecee3be45

  20. Daniel @ SpiderOak says:

    @todd, we are playing around with what we can build client-wise for this. At launch however we will be providing the API and storage back end for bulk data storage only.

    @snirp I see no reason why not. Except if you have exceptionally high demands on the response time.