Conversations about life & privacy in the digital age

Hello, Fedora!

Hello everyone! After fighting with KVM, Fedora Core 10 (namely bug
#475598), spots where RPM’s macro behavior doesn’t *quite* match the RPM
guide, a miscompiling GCC, and a mis-named libssl in a pear tree, we have
some hot fresh RPMs that you can use, if you like.

You can direct download FC10 RPMs of our current SpiderOak release for
32-bit and 64-bit PCs at:

While we have performed some basic testing on these, it is by no means
the sort of complete test we’ve usually done, so there may be some rough
edges. For the security-conscious, we sign these RPMs with this public key:

Version: GnuPG v1.4.6 (GNU/Linux)


Which is also available, for your convenience at

For YUM users, this package installs and owns its own yum repository
reference, which lives at /etc/yum.repos.d/spideroak.repo. This will
allow you to more easily get updated SpiderOak packages, signed with the
above key, as they become available.


Python is Python is Python…. except when it isn’t Python.

One of the largest factors to recommend dynamic interpreted
languages and runtimes is, of course, memory and object management.
However, when interfacing these to external libraries, the boundary
is crossed from a managed environment to a binary ABI environment,
with all the ‘fun’ that entails. This becomes especially interesting
when your interface is a ‘light’ wrapper that neither protects you
from shooting yourself in the foot nor insulates you
from the bugs of that binary ABI.

A while back, the excellent
valgrind tool was developed: a dynamic memory and threading
debugging tool for Linux
applications. Valgrind is an excellent tool for complicated C
and C++ programs. Because valgrind works at the OS/ABI level, however,
it can be adapted to any environment.

Here at SpiderOak we use valgrind when a debugging issue appears
to be involved with any C or C++ library we interface with; the
most frequent case is Qt. When writing an application handling
I/O in real-time from multiple sources, you end up with a sophisticated
flow of code, which makes the output of tools like valgrind difficult
to use. In Python, valgrind has been used as a tool to debug Python
itself, but not necessarily to debug Python applications, as valgrind
won’t tell you what Python code called the C or C++ or other library
code where the bug you’re hunting has appeared.

We have a patch to valgrind and a small wrapper library that lets
you recover this information. You can download this (GPLv3)
at our code page.

To use it, you will need to be able to recompile your own valgrind
executable. For us, the Ubuntu gutsy or hardy valgrind source
packages are excellent for this. In our Python support for valgrind,
we implement a ‘supplemental stack’ that a running program can use to
notify valgrind of where in your application it is, so you can
track which Python functions are involved with an issue as well as the
C/C++ library functions. In our example environment, this
information is helpful when your application involves
indirect calls (i.e. via Twisted deferreds or
pydispatch/louie signals). We distribute this supplemental stack
patch along with a CPython wrapper library which valgrind will use to
wrap Python stack frames to retrieve the information needed.

After downloading and building our valgrind-python support patches,
you can run your Python application with valgrind /path/to/python/interp/using/app.
Valgrind will then give you output corresponding to the Python stack
frames and source locations alongside the usual Valgrind stack trace.

This is not a turnkey or very stable solution. We absolutely do
not suggest running it in an untrusted environment. Make sure you’re not
running this anywhere it has the opportunity to leak data,
or a particularly nasty user might crack your box.
That said, valgrind often allows you to shave hours off your
debugging time for tracking down some problems. Now you can shave
hours off your debugging for those problems when they’re in Python,

For those new to valgrind, here’s a short example of how to use this on
Ubuntu, starting from a download of our valgrind-python-1.0.1.tar.bz2. You should
also have Misc/valgrind-python.supp
from your python source distribution. (Or use our provided link from the python

% sudo apt-get build-dep valgrind
% sudo aptitude install fakeroot python2.5-dev
% apt-get source valgrind
% tar xjf valgrind-python-1.0.1.tar.bz2
% # this is where we add our supplemental stack patch for valgrind
% cd valgrind-3.3.0/debian/patches
% cp ../../../valgrind-python-1.0.1/50_sup-stack.dpatch .
% # go ahead and edit this line in the middle of patches if you care
% echo 50_sup-stack >> patches
% cd ../..
% fakeroot ./debian/rules binary
% sudo dpkg -i ../the_valgrind_deb_you_made.deb
% cd ../valgrind-python-1.0.1
% make
% # after make finishes, you should have in the
valgrind-python dir. This is what you run with valgrind python2.5

And so…

% LD_PRELOAD=$(pwd)/ valgrind --suppressions=valgrind-python.supp ipython
[various valgrind boilerplate here]
>>> from ctypes import *
>>> class crasher(Union):
...     _fields_=[("x",c_int),("y",c_char_p)]

>>> badptr=crasher()
>>> badptr.x=2
>>> badptr.y[0] # BOOM!

==29497== Python Stack:
==29497== <stdin>:1 <module>
==29497== Invalid read of size 1
==29497== at 0x40239D8: strlen (mc_replace_strmem.c:242)
==29497== by 0x80945A9: PyString_FromString (stringobject.c:112)
==29497== by 0x47F1474: z_get (cfield.c:1341)
==29497== by 0x47ECD0D: CData_get (_ctypes.c:2315)
==29497== by 0x47F0BE9: CField_get (cfield.c:221)
==29497== by 0x808968C: PyObject_GenericGetAttr (object.c:1351)
==29497== by 0x80C7608: PyEval_EvalFrameEx (ceval.c:1990)
==29497== by 0x402773B: PyEval_EvalFrameEx (pywrap.c:62)
==29497== by 0x80CB0D6: PyEval_EvalCodeEx (ceval.c:2836)
==29497== by 0x80CB226: PyEval_EvalCode (ceval.c:494)
==29497== by 0x80EADAF: PyRun_InteractiveOneFlags (pythonrun.c:1273)
==29497== by 0x80EAFD5: PyRun_InteractiveLoopFlags (pythonrun.c:723)
==29497== Address 0x2 is not stack'd, malloc'd or (recently) free'd
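
To see why that last line faults, it helps to look at the union without dereferencing anything. The following is only an illustrative sketch: the Peek class and the pointer_after_int_write helper are made-up names, not part of ctypes or of the valgrind patch. It swaps c_char_p for c_void_p, which ctypes returns as a plain integer, so the overlapping pointer value can be read safely.

```python
import sys
from ctypes import Union, c_int, c_void_p, sizeof


class Peek(Union):
    # Same overlay idea as `crasher` above, but the pointer member is a
    # c_void_p, which ctypes returns as an integer instead of following it.
    _fields_ = [("x", c_int), ("p", c_void_p)]


def pointer_after_int_write(value):
    """Write `value` through x and return the overlapping pointer's value."""
    u = Peek()      # new ctypes instances start out zero-filled
    u.x = value     # overwrites the bytes that p shares with x
    return u.p      # an int address, or None for a NULL pointer


if __name__ == "__main__":
    addr = pointer_after_int_write(2)
    print("byte order:", sys.byteorder)
    print("union size:", sizeof(Peek))
    print("pointer value:", "NULL" if addr is None else hex(addr))
```

On a little-endian machine the pointer member reads back as 2, which is the bogus address that strlen tries to read in the valgrind trace.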

With some more configuration work, you will get valgrind
output with useful data for whichever libraries you use, and can tell
what Python usage may be triggering bugs in your non-Python libraries.
Good luck!

1.4x Series Builds Released: 4x faster + Share RSS feeds

Last week we released the 1.4x series builds of SpiderOak out of beta.
Download here. Existing installs
on Windows and OS X should self-update, and the newest Linux packages are in
the apt repository. Most users seem to have upgraded already.

In addition to many internal improvements, users are likely to notice a huge
responsiveness and speed boost: we estimate 4x faster than the predecessor.

There are also several new command line options, and SpiderOak now supports
backup queues larger than physical memory, so go ahead and install on machines
with hundreds of thousands of files.

A new web feature you may have noticed: each SpiderOak share room now has an
associated RSS feed which will indicate when items in a share room are changed
or updated with newer versions. So, for example, every time you add new
pictures to a shared folder, or you make changes and a new version of a
document in one of your shared folders is archived, subscribers to the RSS feed
will automatically notice without you having to do anything extra.

See the complete release notes
for more information.

Set Algebra and Python: Algorithmic Beauty

Having a built-in set data structure makes certain algorithms so simple to
implement. If you structure your set members in the right way, you can take
an algorithm that would be O(n^2) (or worse) when implemented by comparing
flat lists and turn it into something as beautiful
as (set1 - set2): set subtraction, which is built right into Python.

We use just such an algorithm in SpiderOak to detect moves of directories.
The code that crawls a user’s filesystem to find out what’s changed, and
therefore needs to be backed up, produces events like “directory x
deleted” and “new directory y containing files z.” Since we
don’t get actual “move” events, we need to detect them by comparing subtrees
of the user’s filesystem. If there’s a directory structure with enough of the
same files under it, we assume it’s a move.

You can see how finding similar subtrees, when not done intelligently, can
easily have O(n^2) complexity. I recently rewrote this
algorithm to utilize the simplicity and built-in optimization of Python’s set
operations. Now the code is not only clearer (and hence easier to
maintain), but the comparison now actually completes within the lifetime of
the universe.
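
As a rough illustration of the idea (not the actual SpiderOak code), here is a minimal sketch. It assumes each directory is represented as a set of (file name, content hash) pairs. That representation, the find_moves name and the 0.8 overlap threshold are all assumptions made up for this example.

```python
def find_moves(deleted, created, min_overlap=0.8):
    """Pair each deleted directory with the created one sharing most files.

    `deleted` and `created` map a directory path to a set of
    (file_name, content_hash) tuples. Both are hypothetical inputs.
    """
    moves = {}
    for old_path, old_files in deleted.items():
        if not old_files:
            continue  # an empty directory gives us nothing to match on
        best_path, best_score = None, 0.0
        for new_path, new_files in created.items():
            # One hash-based intersection replaces a nested loop over
            # two flat lists of files.
            score = len(old_files & new_files) / len(old_files)
            if score > best_score:
                best_path, best_score = new_path, score
        if best_path is not None and best_score >= min_overlap:
            moves[old_path] = best_path
    return moves


if __name__ == "__main__":
    deleted = {"/home/a/photos": {("1.jpg", "h1"), ("2.jpg", "h2")}}
    created = {
        "/home/a/pictures": {("1.jpg", "h1"), ("2.jpg", "h2"), ("3.jpg", "h3")},
        "/home/a/tmp": {("notes.txt", "h9")},
    }
    print(find_moves(deleted, created))
    # Set subtraction shows what the new directory gained after the move.
    print(created["/home/a/pictures"] - deleted["/home/a/photos"])
```

The sketch still visits each pair of candidate directories, but comparing the files inside a pair is a single set operation rather than a list-against-list scan.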

Often people don’t realize that relational databases are based on set
algebra. Programmers used to optimizing SQL queries should feel right at home
optimizing other algorithms using set operations. I think sets are a
tragically under-utilized structure, made so elegant in Python.

SpiderOak command line options — much faster, much less memory

The newest released version of SpiderOak supports --batchmode for
scheduled operation, which may be useful to command line users and “GUI only” people
alike. The command line version is considerably faster for most tasks (3-4x by
my estimation), and uses drastically less memory (for me on OS X, an average VM
size of 32 MB, peaking at 64).

This is supported in versions 1.0.3753 and newer (released today.) On
Windows and OS X, an existing SpiderOak install should automatically upgrade
the next time it connects to the server. On Ubuntu or Debian, the apt upgrade
process should get the newest version.

Here’s what you can do (so far) from the command line:

Alan@Alan ~ $ /Applications/ --help

Usage: SpiderOak basic command line usage:

  -h, --help            show this help message and exit
  --print-selection     Print a list of selected and excluded backup items
  --reset-selection     Reset selection (but preserve excluded files)
                        Exclude the given file from the selection
                        Exclude the given directory from the selection
                        Include the given directory in the selection
  --force               Do in/exclusion even if the path doesn't exist
  --headless            Never start the GUI
  --batchmode           set the config option exit_when_nothing_to_do to true

Most of these are self explanatory. --headless and
--batchmode are the ones I use most often. We’ll be adding support
for much more command line control in the future — send mail to cmdline at if you want to suggest other options.

--headless just runs SpiderOak with no GUI at all. It runs
without printing anything to the console, so there’s no interactivity or
activity indicators (except what’s written to the spideroak.log). This
is suitable for use on servers or other environments where you want something
to run continuously, using as few resources as possible, without any user
interaction.

By the way, one of the benefits of a fault-tolerant application design is
that you don’t have to be nice to it. Feel free to force quit or kill (even
-9) at any time, and SpiderOak will roll back any uncommitted transactions and
resume uploading or building where it left off, without corruption, the next
time you start. If you need all the available bandwidth for your first person
shooter or Skype, or you’re just trying to make your battery last as long as
possible, just killall SpiderOak and restart it when you want backups
to resume.

The next option is --batchmode (which implies --headless).
This means that SpiderOak will do all available work (i.e. scan the filesystem,
then build and upload everything in the queue, download and replay transactions
from other devices), and then exit. This is a good option for scheduled use.
You can add this to a cron job, or just run it yourself periodically whenever
you want to update your backup set.

SpiderOak is also careful not to start more than one instance of itself at a
time. For example, if you schedule SpiderOak to run in --batchmode
each night, and for the first few days, SpiderOak has so much to upload that it
does not finish before the next scheduled startup time, you don’t need to worry
about coming back to find several instances running.
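
If you are writing your own scheduled job and want the same guarantee, one common technique on POSIX systems is an advisory file lock. The sketch below is a generic illustration and makes no claim about how SpiderOak does it internally. The function name and lock file location are hypothetical, and fcntl is not available on Windows.

```python
import fcntl
import os
import tempfile


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(lock_path, "a")
    try:
        # LOCK_NB makes flock fail right away instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    # Hypothetical lock location; a real job would pick a per-user path.
    path = os.path.join(tempfile.gettempdir(), "example-backup.lock")
    lock = acquire_single_instance_lock(path)
    if lock is None:
        print("another instance is already running; exiting")
    else:
        print("lock acquired; doing work")
        lock.close()  # closing the file releases the lock
```

Because the kernel drops the lock when the process goes away, this approach also survives a kill -9: the next scheduled run simply acquires the lock again.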

In the next major release of SpiderOak, we’re restructuring the user
interface to be as efficient as, or more efficient than, the command line
version is now. So, we expect the 1.5.0 series GUI versions to be several
times faster than the 1.0.0 series GUI versions are today.

Challenges in compatibility

Recently in the SpiderOak application, we fixed a bug relating to
cross-platform compatibility. As some observers have mentioned, we implement
most of our system in Python.

In the Python world, there is an amazing number of support libraries and
software packages available, with differing degrees of maturity. Much software
is still only tested in a limited range of situations. Even those who approach
software testing with a comprehensive mindset may not be able to engineer, or
have the resources to test, the wide variety of environments in which their
software may be deployed. That’s not necessarily their job though. Most Open
Source licenses come with an explicit disclaimer of warranty, after all.

In our case, we ran into a surprise using the Python Cryptography Toolkit
(PCT). PCT is a lightweight implementation of common crypto primitives,
provided as a mixed Python/C library. For public key operations, PCT uses an
internal RSA object behind a generic public-key interface. The problem is, it
actually uses 2 different such objects. If there is a C bignum library
available (usually GMP), it uses that, as it provides math operations on large
numbers that are significantly faster than what standard math implementations
usually provide.

There’s just one little problem…

You’d think this kind of behavior would be handled at run-time, since both
variants of the object handle the same internal data, right? Oops. PCT has 2
different object types, and which one is created depends on whether or not a
given install of PCT has the previously mentioned math module available.
However, when you serialize that object (i.e. save it to disk), the entry is
saved tagged with the class name of the object. Unless you ensure your
serializer knows this, files saved this way (say, crypto keys) that were
created by the fastmath variant will fail to load on a platform with a
non-fastmath install of the library. To handle this in PCT, you have to
patch the module after loading it so fastmath objects don’t get created,
and tell the serializer that RSAobj and RSAobj_c (the fastmath one) are
really the same. Thankfully, the PCT developer(s) ensured that these objects
serialize their state the same way. Even when you’re thinking ahead, you can
still get surprised.
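
To make the "tell the serializer" half concrete, here is a minimal sketch using Python's pickle module. The two classes are stand-ins defined for this example only, not the real PCT classes, and AliasingUnpickler and load_key are hypothetical names. Whether your own serializer exposes an equivalent hook is something to check for yourself.

```python
import io
import pickle


class RSAobj:
    """Stand-in for the pure-Python key class (not the real PCT class)."""

    def __init__(self, n, e):
        self.n, self.e = n, e


class RSAobj_c(RSAobj):
    """Stand-in for the fastmath class: same state, different class name."""


class AliasingUnpickler(pickle.Unpickler):
    # Map the class name stored in the file to the class we want instead.
    ALIASES = {(__name__, "RSAobj_c"): (__name__, "RSAobj")}

    def find_class(self, module, name):
        module, name = self.ALIASES.get((module, name), (module, name))
        return super().find_class(module, name)


def load_key(data):
    """Unpickle `data`, treating fastmath keys as plain RSAobj instances."""
    return AliasingUnpickler(io.BytesIO(data)).load()


if __name__ == "__main__":
    blob = pickle.dumps(RSAobj_c(3233, 17))  # saved on a "fastmath" machine
    key = load_key(blob)                     # loaded without needing RSAobj_c
    print(type(key).__name__, key.n, key.e)
```

This only works because both variants keep identical state, which is the property the paragraph above credits the PCT developers for.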

Long story short, test and audit EVERYTHING, and have someone else look
too! You WILL be surprised.