[PRL] Implementing sets efficiently in a functional language

Thu Oct 16 03:35:04 EDT 2003

David Herman <dherman at ccs.neu.edu> writes:

> For those of us taking Algorithms this semester, this paper might be
> interesting. Apparently there was a shorter version in JFP, but that's
> not available on the web. I've only skimmed it, but it looks like a
> nice example (aside from the atrocious spelling) of algorithms in FP,
> to compare to the imperative versions we've been learning.
>
>      Adams, Stephen. Implementing Sets Efficiently in a Functional
>      Language, 1993.
>      http://swiss.csail.mit.edu/~adams/BB/
>
> Apparently the algorithms presented in this paper were used for the
> Data.FiniteMap data structure in the GHC Haskell library as well.

You may be interested in knowing that I wrote a log-based append-only
database based on that very paper.

The principle is this:  if the database is small, you keep it all in
memory.  If you disallow mutation to database objects, then you only
need to log the creation of the object.  So to recover the database
across sessions, you simply `replay' all the object creations you
logged.  Since you never mutate an object, there is no need (except at
recovery time) to review the log, so you can run the log in
append-only mode and completely finesse all the issues of
synchronizing with the disk.

The only thing is, given a persistent object, how do you name it in
such a way that it can be retrieved in a future session?  Every object
has a serial number and you keep a map associating the serial numbers
with the object.  This also affords a mechanism to simulate mutation:
you simply replace the mapping with a new mapping.  If the mapping
is provided by an associative set, you get transactions trivially
by either discarding the old mapping on commit or discarding the
new one on abort.

But since the database is append-only, any mapping information you
write to it must be purely functional.  By implementing the database
indexing using Stephen Adams' functional set abstraction, we can
simply log the creation of new mapping nodes along with the other
data.  At each transaction, we only need record the address of
the root node at the tail-end of the database.

The database code ends up being quite small, about 3000 lines of
common lisp, but it supports nested transactions and multiple database
extents (two phase commits) and it is quite robust.

There is *one* downside:  I haven't written the garbage collector
for the log, yet.