Conversations about life & privacy in the digital age

Set Algebra and Python: Algorithmic Beauty

Having a built-in set data-structure makes certain algorithms so simple to
implement. If you structure your set members in the right way, you can turn
an algorithm that when implemented by comparing flat lists would be
O(n2) (or worse) and turn it into something as beautiful
as (set1 - set2) — set subtraction, which is built right into
Python.

We use just such an algorithm in SpiderOak to detect moves of directories.
The code that crawls a user’s filesystem to find out what’s changed, and
therefore needs to be backed up, produces events like “directory x
deleted” and “new directory y containing files z.” Since we
don’t get actual “move” events, we need to detect them by comparing subtrees
of the user’s filesystem. If there’s a directory structure with enough of the
same files under it, we assume it’s a move.

You can see how finding similar subtrees, when not done intelligently, can
easily have O(n2) complexity. I recently rewrote this
algorithm to utilize the simplicity and built-in optimization of Python’s set
operations. Now the code is not only more clear (and hence easier to
maintain), but the comparison now actually completes within the lifetime of
the universe.

Often people don’t realize that relational databases are based on set
algebra. Programmers used to optimizing SQL queries should feel right at home
optimizing other algorithms using set operations. I think sets are a
tragically under-utilized structure, made so elegant in Python.