Conversations about life & privacy in the digital age

Speeding up and running legacy test suites, part two

This is part two in a two-part series on Test Driven Development at
SpiderOak. In part one, I discussed ways to decrease the time it takes to run
a test suite. In part two, I discuss two ways to run a test suite that are
painful if the tests are slow, but greatly beneficial if performed often with
fast tests.

Once we have tests that run in milliseconds rather than minutes, we’ll want
to run them as often as possible. As I work, I’m constantly saving the current
file and running the tests, as is necessary when practicing test-driven
development. Rather than switching to a command prompt after each change in
order to run the tests, I just map a key in vim to do it automatically.

Whenever I start a programming session, I open the module I’m working on and
its corresponding test module in a vertical split in vim. SpiderOak has a few
runtime dependencies, and because we don’t use the system-provided Python
interpreter on Mac, I have to source a script to set up the runtime
environment. When running commands from vim, the environment is inherited, so
by sourcing the script before running vim, things work just as they would if
you invoked them from the command line directly.

$ (. /opt/so2.7/bin/env.sh; PYTHONPATH=some_path vim -O package/module.py package/test/test_module.py)

Once I’m in vim, I map a key to run the tests, modifying the mapping for
whatever module I happen to be working on.

:map ,t :w<CR>:!python -m package.test.test_module<CR>

This binds ,t to first write the file, then run python -m
package.test.test_module. Of course, this will change depending on what
you’re working on and how you invoke your tests.

Running tests on a range of git commits

In my git workflow, I sometimes find myself staging changes piecemeal, or
rebasing, reordering, or squashing commits. These kinds of actions can lead to
commits with code in a state that hasn’t been tested. To make testing these
intermediate states easier, I have adapted a script from Gary Bernhardt to
check out each commit in a given range and run a command on the result. Here’s
my adapted version of the script:

#!/bin/bash
set -e

ORIG_HEAD=$(git branch | grep '^*' | sed "s/^* //" | grep -v '^(no branch)' || true)
REV_SPEC=$1
shift

git rev-list --reverse $REV_SPEC | while read rev; do
    echo "Checking out: $(git log --oneline -1 $rev)"
    git checkout -q $rev
    find . -name "*.pyc" -exec rm {} \;
    "$@"
done
if [ $? -eq 0 ]; then
    [ -n "$ORIG_HEAD" ] && git checkout -q "$ORIG_HEAD"
fi

This keeps track of the current HEAD, checks out each revision in the
provided range, and then runs whatever command follows the range on the command
line. If all goes well, it will check out the original HEAD, to leave you back
where you started. If at any point the command exits with an error code, the
process will stop, so you can fix the problem.

For example, to run the command python test/run_all_tests.py on
every commit between origin/master and the current HEAD, you would
run:

$ ./run_command_on_git_revisions.sh origin/master.. python test/run_all_tests.py

Using the tools and techniques from this post and part one, I am able to run
the SpiderOak tests quickly, after every change. This
enables me to use a TDD approach and not be slowed down by sluggish tests. With
the confidence that a comprehensive suite of tests provides, I can make
sweeping changes to parts of the SpiderOak code without worrying if I broke
something. Moreover, if I’m unsure of a solution, I can just try something and
see if it works. Because I’m not slowed down by the tests, trying an unproven
solution is rarely too large of an investment. Plus, there’s something
satisfying about making a large test suite pass in the blink of an eye.

Speeding up and running legacy test suites, part one

This is part one in a two-part series on Test Driven Development at SpiderOak.
In part one, I discuss ways to decrease the time it takes to run a test suite.
In part two, I’ll discuss two ways to run a test suite that are painful if the
tests are slow, but greatly beneficial if performed often with fast tests.

As any experienced developer will likely say, the longer a test suite takes to
run, the less often it will be run. A test suite that is seldom run can be
worse than no test suite at all, as production code behavior diverges from that
of the tests, possibly leading to a test suite that lies to you about the
correctness of your code. A top priority, therefore, for any software
development team that believes testing is beneficial, should be to maintain
fast tests.

Over the years, SpiderOak has struggled with this. The reason, and I suspect
many test suites run slowly for similar reasons, is tests that claim to test a
“unit” but actually end up running code from many parts of the system. In the
early days of SpiderOak we worked around some of the problem by
caching, saving/restoring state using test fixtures, etc. But a much better
approach, which we’re in the process of implementing, is to make unit tests
actually test small units rather than entire systems. During the
transition, we still have the existing heavy tests to fall back on, but for
day-to-day development, small unit tests profoundly increase productivity.

There are many techniques for keeping tests small and fast, and even more for
transitioning a legacy test suite. Each code base will ultimately require its
own tricks, but I will outline a few here that we’ve adopted at SpiderOak.

Mocks

Mock objects are “stand-in” objects that replace parts of your code that are
expensive to set up or perform, such as encryption, network or disk access,
etc. Using mocks can greatly improve the running time of your tests. At
SpiderOak, we use Michael Foord’s excellent
Mock library.
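
To give a feel for the approach, here’s a minimal sketch (the names are
hypothetical, not SpiderOak code): the expensive collaborator is replaced with
a Mock that returns canned data, so the test never touches the disk.

import unittest
import mock

def fingerprint(path, read_file):
    # Toy function under test: the expensive disk read is passed in as a collaborator.
    return hash(read_file(path))

class FingerprintTest(unittest.TestCase):
    def test_fingerprint_hashes_file_contents(self):
        # Stand in for real disk access with a Mock that returns canned bytes.
        fake_read = mock.Mock(return_value=b"hello")
        result = fingerprint("/tmp/some_file", fake_read)
        fake_read.assert_called_once_with("/tmp/some_file")
        self.assertEqual(result, hash(b"hello"))

if __name__ == "__main__":
    unittest.main()

The test checks two things: that the collaborator was called exactly once with
the expected path, and that the result is derived from the canned contents,
all without any real I/O.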

One area where mocking has been particularly helpful in speeding up the legacy
tests in SpiderOak is by reducing startup time. In some cases, even if
individual tests run quickly, running the test suite can still take a long time
due to unnecessary startup costs, such as importing modules unrelated to the
code under test. To work around this, I often inject a fake module into
Python’s import system to avoid loading huge amounts of code orthogonal to what
I’m trying to test. As an example, at the top of a test module, you might see
the following:

import sys
from test.util import Bucket

# don't waste time importing the real things, since we're isolating anyway
sys.modules['foo'] = Bucket()
sys.modules['foo.bar'] = sys.modules['foo'].bar

import baz

How it works

When you import a module in Python, the interpreter first looks for it in
sys.modules. This speeds up subsequent imports of a module that has already
been imported. We can also take advantage of this fact to prevent importing of
bloated modules altogether, by sticking a lightweight fake object in there,
which will get imported instead of the real code.
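
You can see the cache at work with any standard-library module:

import sys
import json                          # first import: Python loads and caches the module
assert 'json' in sys.modules         # the loaded module now lives in the cache
assert sys.modules['json'] is json   # later imports simply return the cached object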

In the example above, foo is a bloated module that takes a long time to load,
and baz is the module under test. baz imports foo, so without this
workaround, the test would take a long time to load as it imports foo. Since
we’re writing isolated unit tests, using Mocks to replace things in foo, we
can skip importing foo for the tests altogether, saving time.

Bucket is a simple class that I use whenever I need an object on which I can
access an arbitrary path of attributes. This is perfect for fake package/module
structures, so I often use it for this purpose.

from collections import defaultdict

class Bucket(defaultdict):
    def __init__(self, *args, **kw):
        super(Bucket, self).__init__(Bucket, *args, **kw)
        self.__dict__ = self

This class allows you to access arbitrary attributes and get another Bucket
back. For example:

bucket = Bucket()
some_object = bucket.some.path.to.some_object
assert type(some_object) == Bucket

A caveat: since Python imports packages and modules recursively, you need to insert each
part of the dotted path into sys.modules for this to work. As you can see, I
have done this for foo.bar in the example from above.

sys.modules['foo'] = Bucket()
sys.modules['foo.bar'] = sys.modules['foo'].bar

Ideally, using an isolated approach to TDD with Mock objects, your project
would never evolve into a state where importing modules takes a long time, but
when working with a legacy codebase, the above approach can sometimes help your
tests run faster, which means they’ll be run more often, during the transition.

Next, part two will outline two ways to run your tests regularly. After all, a
test suite is only useful when it is actually used.

LAN Sync Overview

We’re just around the corner from releasing our much-requested LAN (Local Area Network) sync feature, which is now in the final stages of beta testing.

LAN sync enables much more efficient syncing of data between two or more computers. When SpiderOak is processing a sync between devices on the same LAN (i.e. on the same network), the devices can speak to each other locally, passing any needed data blocks directly to each other without the receiving devices having to download them from the SpiderOak servers. The originating device will still upload the changed or added files to SpiderOak to ensure the data is safely backed up and available to other computers in your SpiderOak network.

To illustrate the difference between ‘normal’ sync and LAN sync, please see the diagrams below:

Normal Sync: Device 1 uploads added or changed data to the SpiderOak servers over the Internet. Once the upload is complete and SpiderOak determines that the data is part of a sync, SpiderOak will start downloading the data to Device 2, which also occurs over the Internet.

LAN Sync: Device 1 uploads added or changed data to the SpiderOak servers over the Internet (A). Once the upload is complete and SpiderOak determines that the data is part of a sync, SpiderOak again downloads the data to Device 2, only this time the download occurs directly between devices over the LAN (B). This optimization saves you time and bandwidth.

NOTE: The LAN sync feature only works for devices that are on the same LAN. For example, a sync between a home computer and a work computer will still go through the SpiderOak servers, as they do not share the same physical connection.

However, if you carry a laptop between work and home, SpiderOak is smart enough to determine the most efficient sync method, always transferring data blocks over the LAN when possible. At work, your laptop will sync with work devices over the LAN, and home devices over the internet. At home, SpiderOak will use your home LAN to efficiently transfer data from your home devices to your laptop, and data from your work devices will be downloaded over the internet.

In the end, regardless of the sync method SpiderOak uses, you can always feel safe in the knowledge that your important data is stored – securely & privately – in the cloud.

ShareRoom Embedding

We’ve had a lot of requests for a simple code snippet you can use to add a login form for ShareRooms to your own website. Well, we’ve come up with something for you! Just copy the HTML code below and paste it into your website or blog. It couldn’t be easier! Feel free to customize the code if you’re familiar with HTML, too. The important bits are the form tag, and the share_id and room_key form fields. For future reference, we’ve added a FAQ entry.

Note: If you only want to link to a specific share room, the code below is not necessary. Simply open the share room you want to link to by using the share login form on spideroak.com, and copy the URL shown in the address bar of your web browser. Paste this link into your blog or website just as you would link to any other website.

Introducing the SpiderOak Web API

In our efforts to give back to the community, and create an environment
where our customers can use the SpiderOak service how they want to, we have
released a document describing
the SpiderOak Web API.
This is the very same API we use to implement our ShareRoom and My Login
functionality through the SpiderOak.com website.

Additionally, we have packaged up the custom JavaScript we use to implement
the tree navigation structure in our web interface. The widget is a jQuery
plugin that dynamically creates HTML list elements following the structure of
JavaScript objects. It can drill down into the tree structure by retrieving
data using XMLHttpRequests, enabling great performance for large trees. The
code is available for download on our Code Page.

We’re also offering other more advanced SpiderOak APIs, such as an API for
‘Zero-Knowledge’ remote command and control of all the devices within a SpiderOak
account, or across many accounts.
Email us for details.

We hope these things are useful, and we can’t wait to see how the community
uses the API to use SpiderOak in ways we haven’t even thought of!

IMPORTANT NOTE: Accessing your data remotely through the web is the only
instance in which your data becomes readable on our servers; however, those
machines are accessible only to a select number of SpiderOak employees. For
continued ‘Zero-Knowledge’ privacy, we recommend accessing your data only
through the SpiderOak client, which downloads the encrypted blocks and
decrypts them locally.

Update 19 May, 2010: The SpiderOak Web API now supports JSONP! Just append a query string with a callback argument to the regular API URLs, et voilà!

ShareRoom and Web Login Display Issues in Internet Explorer

We are aware of an issue with Internet Explorer 7 and above that causes the ShareRoom and My Login interface to display incorrectly (just the page header is shown, and the folders/files below are blank). We are working on fixing this for all Internet Explorer users, but in the meantime there is a simple workaround.

If you are using Internet Explorer 7 or 8, there is an option called “Compatibility Mode” that fixes the display. You can enable it by clicking the “Broken page” icon at the right-hand side of the address bar.

Other browsers such as Internet Explorer 6, Safari, Firefox, and Opera are not affected. We apologize for any inconvenience this may cause.

Set Algebra and Python: Algorithmic Beauty

Having a built-in set data structure makes certain algorithms so simple to
implement. If you structure your set members in the right way, you can take an
algorithm that would be O(n²) (or worse) when implemented by comparing flat
lists and turn it into something as beautiful as (set1 - set2): set
subtraction, which is built right into Python.

We use just such an algorithm in SpiderOak to detect moves of directories.
The code that crawls a user’s filesystem to find out what’s changed, and
therefore needs to be backed up, produces events like “directory x
deleted” and “new directory y containing files z.” Since we
don’t get actual “move” events, we need to detect them by comparing subtrees
of the user’s filesystem. If there’s a directory structure with enough of the
same files under it, we assume it’s a move.
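
As a rough sketch of the idea (hypothetical code, not the actual SpiderOak
implementation), represent each directory as a set of (relative path, content
hash) pairs and let set subtraction do the heavy lifting:

def tree_signature(files):
    # files: iterable of (relative_path, content_hash) tuples describing a tree
    return frozenset(files)

def looks_like_move(deleted_tree, new_tree, threshold=0.8):
    deleted = tree_signature(deleted_tree)
    added = tree_signature(new_tree)
    if not deleted:
        return False
    missing = deleted - added   # set subtraction: files that did not reappear
    surviving = 1.0 - len(missing) / float(len(deleted))
    return surviving >= threshold

# "photos" renamed to "pictures": the same files show up under the new directory.
old = [("a.jpg", "hash1"), ("b.jpg", "hash2"), ("c.jpg", "hash3")]
new = [("a.jpg", "hash1"), ("b.jpg", "hash2"), ("c.jpg", "hash3"), ("d.jpg", "hash4")]
assert looks_like_move(old, new)

Because set members are hashed, each subtraction is roughly linear in the size
of the trees, rather than the quadratic cost of comparing flat lists pairwise.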

You can see how finding similar subtrees, when not done intelligently, can
easily have O(n²) complexity. I recently rewrote this algorithm to utilize the
simplicity and built-in optimization of Python’s set operations. Now the code
is not only clearer (and hence easier to maintain), but the comparison
actually completes within the lifetime of the universe.

Often people don’t realize that relational databases are based on set
algebra. Programmers used to optimizing SQL queries should feel right at home
optimizing other algorithms using set operations. I think sets are a
tragically under-utilized structure, made so elegant in Python.