[PRL] Implementing sets efficiently in a functional language

Thu Oct 16 10:30:37 EDT 2003

David Herman <dherman at ccs.neu.edu> writes:

> For those of us taking Algorithms this semester, this paper might be
> interesting. Apparently there was a shorter version in JFP, but that's
> not available on the web. I've only skimmed it, but it looks like a
> nice example (aside from the atrocious spelling) of algorithms in FP,
> to compare to the imperative versions we've been learning.
>
>      Adams, Stephen. Implementing Sets Efficiently in a Functional
> Language, 1993.
>      http://swiss.csail.mit.edu/~adams/BB/
>
> Apparently the algorithms presented in this paper were used for the
> Data.FiniteMap data structure in the GHC Haskell library as well.

I replied to this from home, but for some reason it didn't show
(probably my home address is not recognized)

I used the technique that Stephen Adams writes about to implement a
simple, but effective, purely functional object-oriented database.
The basic idea is that if you never allow objects to be modified, you
can dispense with a lot of concurrancy issues you'd normally encounter
in a database.  

If the database is relatively small (and these days `small' can mean
up to a gigabyte), you can keep it all in RAM.  The disk is used only
for recovery.  Since objects are never mutated, you need only record
their initial contents on the disk.  When recovering the database, you
simply play back the log and re-create the objects.  Except during
recovery, the database need only be opened in `append-only' mode.
This finesses the entire readers/writers problem: one person writes,
nobody reads.  

The difficulty, however, is how to name objects such that the name
persists across different runs of the system.  For this you need a way
to assign a unique name to each object and be able to retrieve the
object by name.  An associative set is needed, but because the
database is purely functional, the set implementation must be, too.
The functional sets that Adams describes are the solution:
modifications to the set create a new set and allocate about log(n)
nodes (which are appended to the log).  To support the illusion of
side effects, we create a new associative set based on the old set but
with one of the mapping changed to point to a new object.  This
allocates about log(n) nodes as well.

Transactions are almost embarrassingly trivial to implement.  Each
transaction keeps a copy of the most recent version of the object map.
When a transaction commits, it writes a commit record to the log file.
The commit record contains the root node of the object map.  If a
transaction aborts, it simply doesn't write a commit record.  Disk
failure therefore manifests itself as an aborted transaction (assuming
that the media hasn't been trashed, that is); you don't even need
to synchronize with the disk controller.

I used the CLOS meta-object protocol to integrate persistence into the
Common Lisp object model:  declaring a class to be `persistent' is all
the user need do.  Calling make-instance on a persistent class creates
the persistent object.  The database supports nested transactions and
two-phase commit (for multiple log extents).  It's only about 4,000
lines of Common Lisp.  In practice, it is quite robust --- I have *never*
gotten a corrupted database even when the Lisp has crashed in the middle
of a transaction.

If people are interested in this sort of thing, I'd be happy to give
a detailed review of it.  It should port to Scheme fairly easily, too.