Messaging talk, scheduled for 12 Dec 2011.

* Title & Abstract

Messaging Middleware: A semi-ad-hoc approach to Distributed Systems

Before I joined NEU last year, I had been working for about four years
on "messaging middleware", an industry response to inflexibility in
network stacks. Existing real-world approaches to group communication
are manual, arcane, and low-level. In this talk, I'll survey the state
of the art in messaging middleware, and attempt to relate a handful of
widely-used communication protocols to a general model of group
communication.

* Outline

** Introduction

Most of the talk will be very operational, with quite a lot of
low-level detail. I'll spend quite a bit of time developing an
approach which I hope will help lead to higher-level, less manual
tools for implementing communication networks. For example, you could
specify your requirements and the system could choose TCP or PGM
depending on the closest fit, and could perhaps paper over the
differences by using a "residual" protocol.

PL connection: what is a language with more general communication than
point-to-point RPC like? How are exceptions/failures manifested? What
kinds of things are natural that were unnatural before, and vice
versa?

- These questions are important because people are building systems
  using messaging in part to compensate for deficiencies in their
  other tools. It is becoming urgent to learn what the "natural
  language" of messaging systems is like. (What are the primitive
  notions? What is the relevant scoping principle?)

** What is messaging?

messaging : network :: database : filesystem/store

While the lambda calculus serves as a unifying model for programming
languages, well and widely understood, there's nothing analogous for
communications. The pi calculus is a really solid foundation, but it's
unclear whether it's really foundational for communication in the same
way that lambda is for sequential PL.
Looking at the analogy just above, it feels like the most rudimentary
store in a model of a computer system is the S in the CESK machine.
Where's the analogous model of communication?

You can always encode or model communication using state, and likewise
model state using communication. In a way they're fundamentally the
same thing. But lots of people try to use databases as their shared
medium for communication between components in a distributed system.
This seldom works out very well, because their interaction patterns
are a poor fit for the kinds of interactions databases give you.
Likewise, there are people who try to store data in their messaging
systems for long periods of time.

One good rule of thumb is something I first heard from John Day:

- if you can bound the message TTL, it's networking;
- if you can't, it's a database.

** History

The history of messaging middleware overlaps with the history of
general studies of communication, so it can be hard to come up with
good boundaries.

Brief history of messaging in industry ("enterprise"):

- 1983, idea of "software bus", Vivek Ranadivé
- 1985, Ranadivé's company, Teknekron, finds first "TIB" customer in
  Goldman Sachs
- 1990, development of IBM MQSeries begins
- 1993, MQSeries released
- 1997, MSMQ
- 2001, JMS
- 2004, AMQP development begins
- 2006, AMQP specification released

Messaging in academia:

- Long, long history of developments
- 1964, Paul Baran's "On Distributed Communications" at RAND Corp
  introduces the idea of a mesh network
- 1976, Belsnes analyses at-most-once, at-least-once, exactly-once
- 1987, "News" subsystem of ISIS, Birman and Joseph; modelled after
  net news!
- 1990, Paxos for general-purpose fault-tolerant consensus
- Lamport's whole career has involved thinking about information
  replication and coordination in distributed systems

Messaging in IETF/OSI circles:

- Mid-to-late 1970s: TCP
- 1979, Usenet created by Truscott and Ellis (UUCP-based)
- 1981, Delta-t specification published
- 1983, Usenet message interchange (RFC 850)
- 1986, RFC 977 (NNTP)

** What is messaging used for?

- decoupling
  - of producers from consumers (logically, IOW)
    - has an aspecty flavour
  - in time (deferring work)
  - in space
  - in scale
- offloading of work (for latency management)
- managing concurrency in languages poorly suited to it
- reliable storage of important messages (higher TTLs, IOW)
  - in particular, human- and corporate-timescale messaging
  - "better late than never" vs. "better never than late"
- data distribution and syndication
- load balancing
- mux/demux (e.g. HTTP routing, RPC service routing)
- simply as another option for a transport, bulk data
- various conversational patterns
  - RPC (+error)
  - scatter/gather (Amazonish)
  - state replication (for fault tolerance; also e.g. BitTorrent)

** What does it look like, concretely?

Many systems (JMS (generally), AMQP, MQSeries) include a message
broker component, in a hub-and-spokeish configuration. Others (0MQ,
TIBCO) are decentralised. There are hybrid systems that make sense,
too (e.g. PGM, my multicast-sync experiments).

** Examples of the uses of messaging

*** Decoupling producers from consumers

Authentication daemon. Task: add logging of authentication requests
and responses.

*** Decoupling in time

Queue up emails for transmission later, out of the main request path.

*** Decoupling in space and in scale

Separate logical parts of the application into separate processes,
possibly on separate machines,

- to allow the application to manage its data locality (e.g. big
  database in Europe: send the query, don't fault in the whole
  database!)
- to allow the application to scale past the capacity of a single CPU
- to allow the application's different pieces to scale differently:
  e.g. an image-resizing service, where more CPU is needed for the
  image resizing than for the web frontend, so have more workers
  processing that part of the app

*** Offloading of work for latency management

The email example from above; also any kind of batch job, such as
Ohloh's downloading and analysis of source code in the background.
Don't want to block the web UI while that's going on.

*** Managing concurrency in non-concurrent languages

Many, many languages have an awful time coping with concurrency.
Using messaging helps isolate different concurrent subsystems.
Messaging is often language-neutral, which helps polyglot solutions
work smoothly.

*** Reliable storage of important messages

Message brokers often include a reliable message queue facility, where
messages are securely held in permanent storage for later delivery.
Example: orders placed in a trading scenario. The cost of a dropped
message can be in the millions. It's crucial in situations like these
to manage the transfer of responsibility for a message properly.
Reinterpret ACKs: instead of meaning "I hear you!", they really mean
"I take responsibility for this message now!". Deduplication at the
application level is also vitally important.

*** Data distribution and syndication

RSS feeds are a form of pub/sub (polling, if we set aside
PubSubHubbub). Stock ticks from the exchange. Log events going to
alerting subsystems.

*** Load balancing

Stateful (or stochastic) routing components in the network can
distribute load to pools of effectively-identical workers. E.g. the
image-resize service.

*** Multiplexing/demultiplexing

HTTP URL routing. Turn the path of the URL into a topic, send it
through a broker. Consumers subscribe to paths they are interested in
handling.

*** Messaging as a transport

A form of email for machines. Database access. API access.
*** Conversational patterns

RPC: use temporary reply queues. Bound message lifetime.

Scatter/gather: the Amazon example. Late arrival = you missed your
chance.

Pub/sub: example as above.

State replication: use messaging for the replicas to coordinate. Can
overlay protocols such as Paxos, etc. BitTorrent is an interesting
example here, which I'll revisit below.

*** Coping with unreliable transports, low bandwidth, and/or multihoming

Even TCP isn't totally reliable! Consider using mobile data, or moving
your laptop from home to work.

Treat the medium as an "object fabric", exploiting "object factories"
to create objects which manage the location and transfer of
information. Create temporary queues to buffer messages if you have an
unreliable transport. Create mixers to cope with limited-bandwidth
connections in a multicast audio setup. The similarity to creating
temporary queues: in both cases you're pushing some of the app's
functionality into a general-purpose "mid-network" location. We're not
working exclusively at the "ends" anymore! The end-to-end principle is
valuable but not inviolable. Another way of looking at it is that
you're still honouring the end-to-end principle, but at the
*intersection* of multiple networks.

** Deriving messaging from first principles

I'm going to introduce a simple data communications problem and
progressively add detail to it until we arrive at a model general
enough to encompass many of the things people are doing with
communication networks today.

Broadly, I'll use the term "message" for an application-level piece of
information, and "packet" for a transport-level bundle of information
sent directly across some medium.

*** Two-party unidirectional communication, single message

To begin with, consider the problem of a single Sender wanting to send
a single message to a single Receiver across a perfect medium.

  -> DATA(message)

We're done! That was easy. But what if the medium isn't perfect?
*** Packet loss and duplication

What if some packets don't make it across the link? Sometimes the
sender won't care, and that's fine. But when the sender does care,
what can be done?

Easy: make the sender repeat itself. Forever. With every repetition,
the probability of successful receipt increases rapidly.

But then the receiver might process a given message twice! Easy: the
receiver remembers which messages it's seen before, and discards
duplicates that it receives. Of course, in some cases duplicates don't
matter - the data describe an idempotent action - so the receiver
doesn't need to remember a thing, and can be perfectly stateless.

*** Packet corruption

What if some packets make it across, but don't arrive containing
exactly the same data that they left with? Use a checksum. Discard
corrupted packets, reducing the problem to that of packet loss.

*** Two-party unidirectional communication, many messages

What if there are many messages to send? Simply send them all, over
and over, forever - either at the same time, or if that's not
possible, one after the other, round-robin.

*** Reclaiming state space at the sender

Eventually the sender's retransmit buffer will get full, and the
sender will run out of memory. The receiver has to help the sender out
here, by sending an acknowledgement-of-receipt.

  -> DATA(message)
  <- ACK(message)

When the sender hears an ACK for a given message, it can remove that
message from the retransmit buffer. This has an additional advantage:
instead of probabilistic knowledge of receipt, the sender now has
exact knowledge of receipt.

*** Ack loss

What if an ACK goes missing? The sender will never know that the
message was received successfully! Every time the receiver receives
some DATA, it should issue a corresponding ACK. Now the tables are
turned: the receiver has only probabilistic knowledge that the sender
has heard the ACK it issued, but at least the sender will eventually
learn that the message was successfully received.
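The repeat-forever sender and deduplicating receiver can be sketched
in a few lines of Python. This is purely my illustration over a
simulated lossy link - the function name, loss rate, and RNG seed are
all invented for the sketch:

```python
import random

def send_until_acked(message, loss_rate=0.5, rng=None):
    """At-least-once delivery over a lossy link: the sender repeats
    itself until it hears an ACK; the receiver discards duplicates."""
    rng = rng or random.Random(1)
    seen = set()        # receiver's deduplication buffer
    processed = []      # messages the receiver actually acted on
    attempts = 0
    while True:
        attempts += 1
        if rng.random() < loss_rate:
            continue                # DATA packet lost in transit; retransmit
        if message not in seen:     # deduplicate at the receiver
            seen.add(message)
            processed.append(message)
        if rng.random() < loss_rate:
            continue                # ACK lost; sender retransmits again
        return processed, attempts  # sender heard the ACK and stops

processed, attempts = send_until_acked("hello")
# processed == ["hello"] no matter how many retransmissions it took
```

Note that the receiver processes the message exactly once even when
the sender transmits it many times - the dedup buffer, not the link,
provides that guarantee.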
*** Reclaiming state space at the receiver

At this point in the development, we have a system for reclaiming both
send queue slots and bandwidth of the medium. If we're worried about
making sure the receiver doesn't process a given message more than
once, however, we're left with a problem: the receiver has to remember
every message it has ever received, so that it can detect duplicates.
The receiver will eventually run out of memory.

The solution is that the receiver is permitted to free up a slot in
its deduplication buffer only when it learns that the sender has
learned that the receiver has received the datum. The receiver knows
that the sender knows that the receiver heard the sender's message.
All the sender has to do is let the receiver know that it considers a
given transmission to be complete.

  -> DATA(message)
  <- ACK(message)
  -> DONE(message)

If a given DONE is lost, no harm is done: when the receiver is running
short on space, it can spontaneously reissue an ACK from its table,
which will cause another DONE to be sent. If the sender receives an
ACK referring to a message unknown to the sender, again, no harm is
done: the sender can issue a DONE without causing any problems.

A DONE message, then, means both

- "I once was concerned that this message hadn't been heard by you,
  but I am no longer worried about that," and
- "You don't need to send me ACKs for that message."

*** Correctness: achieved

We now have a system that can achieve exactly-once delivery of
messages from a single sender to a single receiver, and that doesn't
need infinite buffer space on the sending or receiving side to do so.
It's time to broaden our focus to questions of efficiency, and then to
systems with multiple senders and multiple receivers.

We're still imagining, by the way, that we have infinite bandwidth to
work with. We'll revisit this shortly.
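The DATA/ACK/DONE bookkeeping can be made concrete as two small state
machines. This is a sketch with hypothetical class and method names,
not any real protocol implementation - but it captures the two
recovery properties noted above (a re-ACK for a duplicate still draws
a DONE, and an ACK for an unknown message safely draws a DONE too):

```python
class Sender:
    """DATA/ACK/DONE handshake, sender side."""
    def __init__(self):
        self.retransmit = set()   # messages not yet ACKed

    def send(self, msg):
        self.retransmit.add(msg)  # keep retransmitting until ACKed

    def on_ack(self, msg):
        self.retransmit.discard(msg)  # slot reclaimed; an ACK for an
        return ("DONE", msg)          # unknown message still yields DONE

class Receiver:
    """Receiver side: the dedup buffer is freed only on DONE."""
    def __init__(self):
        self.dedup = set()
        self.processed = []

    def on_data(self, msg):
        if msg not in self.dedup:
            self.dedup.add(msg)
            self.processed.append(msg)  # exactly-once processing
        return ("ACK", msg)             # always ACK, even duplicates

    def on_done(self, msg):
        self.dedup.discard(msg)  # safe: the sender has promised not
                                 # to retransmit this message again

s, r = Sender(), Receiver()
s.send("m1")
kind, m = r.on_data("m1")   # ("ACK", "m1")
kind, m = s.on_ack(m)       # ("DONE", "m1"); retransmit buffer now empty
r.on_done(m)                # dedup buffer now empty again
```

Both buffers return to empty after the three-step exchange, which is
exactly the "no infinite buffer space" property claimed above.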
*** An aside on reliable delivery and transfer-of-responsibility

The Two Generals Problem (called by Lamport the "Chinese Generals
problem") generalises to the Byzantine Generals problem. The solution
is to accept and mitigate the uncertainty inherent in the medium.

In its original form it doesn't *quite* apply to message transmission!
The difference is that 2GP is to do with symmetric shared
understanding of some fact - say, the value of a bit. In message
transmission, the goal is to reach an asymmetric shared understanding
of the bit: as soon as the receiver knows the value of the bit, the
receiver can go ahead (whereas in 2GP the receiver must reliably learn
that the sender knows the receiver heard the sender). Infinite regress
is avoided by the asymmetry.

*** Flow control

Now, imagine that the receiver can process incoming messages at a rate
of 10 Hz, but that the sender is firing off messages at a rate of
100 Hz. Once the receive buffer fills up, nine out of every ten
messages are going to be dropped by the receiver, because it has no
space in its buffer for them. This leads to an enormous waste of
effort.

The answer is for the sender and receiver to agree on what to send
ahead of time. The sender advertises a collection of messages, and the
receiver expresses interest in receiving the messages. I've called
this expression-of-interest "subscription" here; we'll see later that
it corresponds to subscription, presence, and the traditional
window-opening systems of current internet transport protocols.

  -> AD(message)
  <- SUB(message)
  -> DATA(message)
  <- ACK(message)
  -> DONE(message)

We've avoided our receiver being inundated with DATA packets, but we
seem only to have pushed the problem back one level: we're still
sending an enormous number of AD packets.

Other approaches to flow control include agreeing on an acceptable
rate ahead of time.
In the telephony world, if a component is overloaded it will send a
"gap request" message upstream, telling the upstream system to offer
no more than a certain number of bytes or packets per second of
traffic. You can think of this as an abstraction of a timed sequence
of SUB messages.

*** Compressing ADs, SUBs, ACKs, and DONEs

So far we've been identifying messages by simply copying in the
entirety of a message when we want to refer to it. This is clearly not
practical. What realistic systems do is arrange for messages to
receive shorter names so that they can be referred to indirectly.
Let's imagine using the SHA-1 hash of a message as its name.

We end up with AD, SUB, ACK and DONE traffic proportional to the
number of "outstanding" or "uncommitted" messages (i.e. proportional
to the buffer sizes at each end), and DATA traffic proportional to the
rate of processing at the receiver.

*** Sequence numbers

Using SHA-1 hashes to identify messages isn't bad, but we can do
better. The usual way of naming messages is to introduce sequence
numbers. Sequence numbers are particularly effective because sets of
sequence numbers can be very easily compressed.

Messages such as AD, SUB, ACK and DONE now logically contain *sets* of
messages, rather than individual messages; and furthermore, they
physically contain sets of message *names*.

Periodically, the sender will issue an AD containing the set of
sequence numbers which it still wishes to make sure that some receiver
accepts. Periodically, the receiver will issue a SUB indicating which
messages it is willing to receive. When the sender notices a match
between a SUB and an AD, it can send the DATA on the assumption that
the receiver is ready for it. The receiver can ACK the full set of
sequence numbers accepted so far. Since sets of sequence numbers are
so efficiently compressible, there's no need to use DONEs to reclaim
space in the receiver's deduplication buffer.
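The claim that sequence-number sets compress so well comes down to
run-length encoding: contiguous runs collapse to (lo, hi) pairs. A
tiny sketch (the function name is mine):

```python
def to_ranges(seqnos):
    """Compress a set of sequence numbers into a sorted list of
    inclusive (lo, hi) ranges - the trick that keeps AD/SUB/ACK
    sets cheap to transmit."""
    ranges = []
    for n in sorted(seqnos):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)   # extend the current run
        else:
            ranges.append((n, n))             # start a new run
    return ranges

# An ACK for "everything up to 5, plus 8 and 9" is just two ranges:
print(to_ranges({1, 2, 3, 4, 5, 8, 9}))      # [(1, 5), (8, 9)]
```

In the common case of in-order delivery the whole ACK set is a single
range, which is why a TCP-style cumulative acknowledgement can be one
number.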
*** An aside on sequence numbers and timers

Once you get into compressed names for messages you have the
additional problem of managing the namespace. This means you have to
think about stable storage, and about crash-recovery timers for
avoiding premature reuse of identifiers.

You need a sequence number space rightsized to the delays you
anticipate. For example, let MPL = 120s, R = 900s, A = 60s, and
T = 100 gigabits per second. Then we need a power of two greater than
1.2 x 10^14 bits. This is beyond the capability of a 32-bit counter,
but a 64-bit counter will do nicely. Longer delays
(e.g. interplanetary) require larger sequence number spaces.

*** AD and DONE are inverses

AD contains all the sequence numbers the sender considers outstanding.
DONE contains all the sequence numbers the sender considers completed.
Since both sets increase monotonically, and DONE never contains any
number not previously contained in AD, we can see that DONE is
redundant: simply advertising the set of sequence numbers the sender
considers outstanding automatically informs the receiver of the
sequence numbers considered complete! This doesn't hold for kinds of
message names other than sequence numbers.

*** We've reached TCP, Delta-t, and friends

Our protocol, modulo important issues of congestion avoidance and
timer management, is at the point that TCP and Delta-t are at:
single-sender, single-receiver, reliable, flow-controlled delivery of
messages.

TCP has an implicit AD: it is, effectively, always advertising every
future message contained on the connection. Because sequence numbers
compress so well, there's no DONE. So one direction of a TCP
connection effectively involves only the messages SUB, DATA, and ACK.

*** Sparse sets in SUB and ACK

If a group of messages are related, and a receiver has a finite
buffer, and an entire group must be received at once before processing
of the group can begin, it can be important to permit a sparse set of
sequence numbers to be transmitted in a SUB packet.
Doing so gives the receiver the option of selectively completing a
group of related messages before receiving new messages outside the
group of interest.

Sparse sequence number sets in ACK messages are also useful, because
they give the sender improved information on the lossiness of the
link.

This is by-and-large a syntactic issue, however, and it works
perfectly well with

- SUB and ACK carrying single message IDs,
- SUB and ACK carrying a single message ID implying every sequence
  number less than that ID, or
- SUB and ACK carrying arbitrary sets of message IDs.

*** Garbage-collection to "reuse" bandwidth

Real links don't have infinite bandwidth. ACKs, besides letting us
reuse slots in the sender's buffer, also help us "reuse" the available
capacity of a communication link with finite bandwidth. By signalling
that subsequent retransmissions of a message would be superfluous to
the ACKing receiver, the bandwidth taken up by the (infinite repeats
of a) message can be "reused" by some other message waiting to be
sent.

*** Packet reordering or delay

Real networks reorder packets, and can sometimes delay packets for
surprisingly long periods of time. Care has to be taken when
processing AD and SUB packets that stale information isn't acted on:
otherwise confusion and message duplication may result. One way of
avoiding this problem is to attach a monotonically-increasing counter
to each AD and SUB packet. Each end remembers the highest counter
value it has seen so far, and discards packets with counter values
lower than that value.

*** Topics, Subscription and Presence

Let's shift from single-sender, single-receiver to a general broadcast
medium with multiple senders and multiple receivers. We have already
seen that messages need to be unique in order to pass the
deduplication filters at receivers.
This implies that if "Hello" as said by A is to be regarded as
different from "Hello" as said by B, then the names "A" and "B" have
to be made part of the messages themselves, somehow.

In pub/sub systems, receivers generally are permitted to subscribe to
subsets of the available messages. Let's broaden the language used by
SUB messages to include more general predicates over messages. This is
a slight shift in perspective: ADs, SUBs and ACKs are still describing
sets of messages, but they're using an unusual language for describing
those sets. The language used permits discussion of some of the
semantic features of messages other than their raw identity.

Concretely, let's imagine that each message is logically

- a topic string,
- a sequence number, unique to the medium, and
- a message body.

Then we'd want ADs to be able to describe the topics on offer as well
as the sequence numbers within each topic. We'd want SUBs to indicate
topics of interest to the receiver, as well as sequence numbers that
receiver is ready to receive.

*** We have a pub/sub system

Presto - we have a pub/sub system with reliable transfer! It's here
that the semantic nature of the medium first shows up:

- there are options for how many acks are required for a sender to
  consider a message successfully completed
- there are options for routing individual messages to individual
  subscribers: stateful routing (round-robin) becomes possible
- data distribution trees and other routing structures can be computed
- access can be controlled by the network
- security properties can be enforced by the network (such as not
  permitting certain stations to subscribe to traffic intended for
  other stations)
- you can implement the network's semantics in several ways:
  hub-and-spoke style (like AMQP, and like Racket's universe, for that
  matter!)
  or distributed style (like IP multicast, PGM, 0MQ, etc.)

(If you squint, this setup is like a potentially-reliable variant of
the IP network: the topics would be IP addresses.)

*** The key is the languages involved

Choosing different languages for describing sets of messages gives you
different network routing characteristics. The type used for AD
doesn't have to be the same as that for SUB, but they do have to be
able to be "run against" each other so that the network can route
messages appropriately. You need to be able to compute the
intersection of ADs and SUBs. For instance, if they both used regular
expressions to describe sets of strings, you'd need to be able to
compute the intersection of two regular expressions.

*** Presence and Subscription

Having run the ADs in the network against the SUBs in the network, in
order to get flow control to be smooth across the entire distributed
system, you need to send the intersection of AD and SUB upstream.

Flow control is an end-to-end concern. We see this with the sockets
API: once the send buffers are full, the socket won't accept any more.
Generally, it's painful to push this backpressure further upstream -
even in languages like Erlang, which have some crude flow control and
queueing built in to the runtime.

Propagating flow-control backpressure (a.k.a. subscription) upstream
(and, symmetrically, propagating advertisement downstream) leads
directly to presence! Subscription, presence and flow control are very
closely related; after all, what is "availability" but "willingness to
listen"? Think of how you use IM: even there, it plays a similar
(though multipurpose) role to flow control.

*** Flows and flow IDs

Connections in the TCP world generalise to "flows" in a less
connection-oriented world. A flow is a stream of messages from one
sender to one receiver, in a possibly multiple-sender,
multiple-receiver setting. Often, application-level context is
associated with a flow ID.
In TCP, for instance, the flow ID is the
sourceIP:sourcePort:destIP:destPort combination. Flow IDs relate to
presence as well: they are roughly the residual after "running" the AD
and SUB terms against each other.

"Flow ID" is a pretty vague notion, and itself generalises to the
notion of a *conversational context*, which can be N-way and serves to
scope some construct of interest to the application.

*** Separating UNSUB from ACK

While we can imagine reasons why ACKs might also want to be
generalised to being topic-oriented, it's not really required, since
ACKs are used to indicate successful receipt of specific messages
rather than topic selection. For this reason, it's an interesting idea
to separate unsubscription from ACK, giving

  -> AD(topics and sequence numbers)
  <- SUB(topics and sequence numbers)
  -> DATA(message)
  <- UNSUB(topics and sequence numbers)
  <- ACK(sequence numbers)

There's no need to include DONE here, for the reasons given earlier.
In fact, a similar argument applies to UNSUB! Having introduced it, we
can immediately dispense with it (for certain choices of message
identifier), folding its functionality into SUB. A SUB message, then,
indicates which messages the receiver is potentially interested in
ACKing at some point in the future. Care has to be taken to ensure
that the sender hears an ACK before the associated message IDs are
removed from the SUB!

*** BitTorrent

- Topic = filename
- Sequence number = chunk
- Senders and receivers share roles
- ACKs are deemphasised

*** Stock ticker

- Topic = stock
- Sequence number = not present
- Flow ID = topic (implicitly single-sender) or exchange & topic (if
  multiple markets)

This makes the flow-control/subscription hybrid into pure
subscription. Acks are not used. Data distribution trees can permit
efficient fanout.

*** PGM

PGM is a big, complex protocol for reliable (exactly-once) multicast
between many sources and many sinks, with optional packet ordering.
"PGM guarantees that a receiver in the group either receives all data
packets from transmissions and repairs, or is able to detect
unrecoverable data packet loss."

- NAKs are unicast back to the sender
- NAK Confirmations, Repair Data

*** IRC

- Topic = (nick * room name) for AD, (room name) for SUB
- Sequence number = absent
- Symmetric membership
- No acks
- Subscriptions are indicated by the NAMES in a room

*** Twitter

- Topic = username
- Sequence number = absent
- Subscription = following
- AD = implicit

*** RTP

RTCP collects statistics on packet loss and jitter. It also carries a
kind of presence!

*** JMS, AMQP

*Really* twisted, confused systems. Multiple layers squished together
in a very awkward way. On the plus side, you can usually achieve your
goals with the primitives on offer; and AMQP includes a rudimentary
"object factory" for managing locality of state in the distributed
system as a whole.

** An aside on position and ownership

You don't *have* to keep the layers syntactically separate, but you
*do* have to keep them semantically separate.

Interesting principle: don't confuse position with ownership. Not sure
where it originated, or who came up with it; I heard it first at the
AMQP F2F in Feb 2007.

Example: RTP has an M(arker) bit under application control in the
packet header. Presumably it's there to permit its examination by
middleboxes, but it's logically owned by the application and thus is
part of the payload, not the header.

Another example: TCP/IP puts the topic (IP addresses and port numbers)
in the header, separate from the body of the stream. If you view the
whole TCP/IP network as a single thing, the topic is logically part of
the message. The network enforces the subscription for you as part of
joining.

** You can't escape the fiddly details

Why am I talking about all of these low-level details of transport
protocols, like retransmission, reordering, deduplication, and flow
control?
Because when people start working with messaging middleware, they
assume that the middleware will save them from having to think about
these things - but this is in general not possible! In general, one
finds oneself painfully reimplementing deduplication and flow control,
encoding them poorly, in an unplanned and reactive way, in the
protocol used at the application level. It'd be a great boon to
distributed systems designers to have a good language they could use
in this area, one that covered all the necessary ground.

** Transport protocols generally

Watson's requirements for a transport protocol (slightly paraphrased):

- Minimum packet exchange for RPC
- High-throughput bulk data transport & streaming
- Flow control (without polling for reliable 0-window opening)
- Error control (lost, damaged, duplicate, reordered packets)
- Large, flexible namespace for transport endpoints
- Message boundary preservation
- Secure communication

The split between "network" and "transport" protocols is possibly
misleading; see Delta-t, but also the natural (?) emergence of routing
("networkish") and flows ("transportish") from generalised pub/sub
media.

Duplication, loss, and reordering: "reasonable conditions of loss and
reordering rates (5% and 0.2%, respectively)" - from Fonseca &
Crovella 2005.

Retransmission intervals: round-trip time; Bayesian estimators of
packet loss and of congestion; control theory. Packet loss estimate;
RTT estimate. Protocols proposed specifically for RTT estimation: the
RFC 862 echo protocol; NTP estimates RTT; RTCP measures RTT. Protocol
specifically for congestion detection: DCCP, the Datagram Congestion
Control Protocol, RFC 4340.

Long-distance (interplanetary, interstellar) communication: TCP
doesn't work. Does the "general model"?

** Layering in communication systems

Enrollment is frequently overlooked!

** Application-level ordering

Sequencing of packets discussing application-level data is very
different from sequencing within application-level data!
For example, control packets talking about which packets are on offer
and which have been received have ordering requirements specific to
the transport protocol, which is a completely different statement from
any statement of ordering requirements among packets carried by the
transport protocol.

Ordering constraints: where they're semantic, how do you get the
network to help you out? What about other kinds of causality flow?

** Connection management

Watson's connection management requirements:

- G1: An identifier of an information unit used for error control (or
  any other service) must not be reused while one or more copies
  (duplicates) of that unit or its ack are alive.
- G2: The error control information being transmitted between each end
  must itself be error controlled.
- G3: If the crash of an end can cause it to lose its state, then an
  appropriate crash recovery mechanism must assure the other
  requirements are met.
- O1: If no connection exists, and the receiver is willing to receive,
  no duplicate packets from a previously closed connection should
  cause a new connection to be established and duplicate data to be
  accepted, unless the operations represented by the data are known to
  be idempotent.
- O2: If a connection exists, then no packets from a previously closed
  connection should be acceptable within a current connection.
- C1: No packet from a previous connection should cause an existing
  connection to close.
- C2: A receiving side should not close until it has received all of a
  sender's possible retransmissions and can respond to them.
- C3: A sending side should not close until it has received
  acknowledgement of all that it has sent. In particular it should
  allow time for an acknowledgement of its final retransmission, if
  needed, before reporting a failure to its client program.
Watson's connection management requirements (summarised):
G1 - If an ID could be live in the net somewhere, don't reuse it
G2 - Make it possible for the receiver to know the start and end of a stream, i.e. include the open/close flags in the ack'd bits
G3 - Crashes are no excuse for failing
O1 - Martian packets shouldn't cause non-idempotent receives when closed
O2 - Martian packets shouldn't cause receives when open
C1 - Martian packets shouldn't close existing connections
C2 - Don't forget too early on the receiving end
C3 - Don't forget too early on the sending end

** Other distributed systems topics

Lamport's Paxos - a family of protocols for solving consensus problems in the face of unreliable participants. Foundation of the "state-machine approach" to distributed systems, a systematic technique for producing fault-tolerant distributed implementations of algorithms.

** Congestion control

Congestion control: again, control theory. Different strategies interact very differently, even among TCP variants. Saying something is "TCP-friendly" is a bit vague; you want to know specifically which variant congestion-control algorithm is being considered.

See also http://curvecp.org/ and its interesting decongestion system, http://curvecp.org/decongestion.html (includes good discussion of various interactions - latecomer's advantage etc.)

The "TCP-friendly" equation: maintain the arrival rate at no more than some constant over the square root of the packet loss rate. http://www.psc.edu/networking/projects/tcpfriendly/ - Mahdavi & Floyd

Rate-based flow control: "In order to implement a TCP-friendly congestion control algorithm, these applications should simply choose to send at a rate no higher than a TCP connection, operating along the same path, would achieve."
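The Mahdavi & Floyd rule of thumb quoted above can be turned into arithmetic. The sketch below uses the steady-state model of Mathis et al., rate ≈ (MSS/RTT) · C/√p with C = √(3/2); real TCP variants deviate from this, so treat it as an order-of-magnitude ceiling rather than a precise target.

```python
import math

def tcp_friendly_rate(mss_bytes, rtt_seconds, loss_rate, c=math.sqrt(3 / 2)):
    """Rough TCP-friendly throughput ceiling, in bytes per second.

    Implements the rule of thumb that the sending rate should be at most
    some constant over the square root of the packet loss rate.  The
    default constant C = sqrt(3/2) follows the Mathis et al. steady-state
    TCP model; actual TCP congestion-control variants differ."""
    if loss_rate <= 0:
        raise ValueError("the model only applies with a nonzero loss rate")
    return (mss_bytes / rtt_seconds) * (c / math.sqrt(loss_rate))
```

One consequence worth noticing: quartering the loss rate only doubles the permitted rate, which is why loss-based fairness is so punishing on long, lossy paths.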
- Mahdavi & Floyd 1997

Duality between acks and flow control (corresponding to "push" and "pull" - enumeration vs observation).

** Security considerations

Security - IPsec is a massively complex-seeming bundle of optional protocols. DJB's CurveCP seeks to radically decomplexify things, and looks reasonable to this amateur.

Authentication of messages - PKI and naming are the tricky parts. Per-message encryption - you have the problem of key distribution etc. Multi-reader, multi-writer is a bitch, especially in an open system.

Denial-of-service: memory usage attacks. Un-erasable state. Trickles. Cryptographic capabilities.

** Network continuations

Trickles:
- no server-side memory usage for the state vector
- TCP-like performance; plays nicely wrt congestion
- anycast becomes a possibility! "[eliminates] the need for route stability"
- "a form of network continuation" - see paper
- split into a transport continuation, holding the TCP TCB...
- ... and a user continuation, state from an app server (!)

** Relating specific protocols to the general model of communication

HTTP, TCP (and its lesser-known relation, Delta-t), IRC, XMPP, Twitter, AMQP, PGM (Pragmatic General Multicast), RTP, 0MQ, RSS/Atom and PubSubHubbub, BitTorrent.

Delta-t is slightly weird - bit-oriented, "reliable" 0-window opening - but otherwise quite sensible.

** Timers for discarding state

- MPL = max packet lifetime
- R = max time a sender is willing to keep retransmitting
- A = max ack delay
- T = max rate of new SN generation
- SNs range within `[0,2^n)`
- 2^n > (2*MPL + R + A)T
- Delta-t = MPL + R + A
- receive-time = 2 * Delta-t
- send-time = 3 * Delta-t

** When should a sender retransmit?

TCP:
- timers based on RTT
- duplicate acks

** Semantic media, Stateful media

Anycast and its relation to the statefulness of queues (they distribute messages round-robin); the general, possibly-stateful (!) semantics of a medium.

** Responsibility transfer and failure-resiliency

Responsibility transfer in a failure-resilient setting: send a message, and collect acks from N of M listeners; they now hold collective responsibility for the message. To take responsibility further down the chain from the N of M: send acks on to all the replicated copies. (They should all be advertising the same message ID.)

** Separating mechanism from policy

Lying in TCP - both for unreliability and for making it a message-oriented protocol (problems with firewalls etc., of course). Confuses RTT estimators, confuses congestion detection, etc. Heavily overloaded semantics. Separation of mechanism from policy. (RTP deals with multicast because it knows about asymmetry! TCP can't cope in that setting.)

** Conclusion

I've described the kinds of things messaging middleware systems do in industry, developed a sketch of a general model that covers many of the applications of networking, and related it to some common protocols in use.

Most messaging middleware is used on LANs where everything can be controlled. Few solutions "scale up" to the internet. Congestion and state management are particularly interesting issues at internet scale. Congestion control for anything but TCP-like flows is still a completely open problem - the main difficulty is scoping a flow: what counts as a flow? What's fair? We're lucky we have so much excess capacity, it seems!

Open questions I'm interested in include:
- What's the communications analogue of the S in CESK?
- What would a programming language based on this more general notion of communication be like?
- What does this view of communication suggest for the design of interfaces for communication in operating systems?
- Could there be One Transport Protocol To Rule Them All?

* Bibliography

P. Baran, "On Distributed Communications: I. Introduction to Distributed Communications Networks," 1964.
- Early work on decentralised communications mesh networks, including discussion of reliable message delivery.

P. Baran, "Reliable digital communications systems using unreliable network repeater nodes." RAND, 1960.
- Early discussion of reliable message delivery.

D. Belsnes, "Single-Message Communication," IEEE Transactions on Communications, vol. 24, no. 2, pp. 190–194, 1976.
- Two-, three-, four- and five-packet protocols for reliable single-message delivery, plus analysis of their failure modes and crash resistance.

U. Maheshwari, "Hula: An efficient protocol for reliable delivery of messages," Technical Report MIT-LCS-TR-720, MIT Laboratory for Computer Science, Jul. 1997.
- A modern protocol sharing similarities with Delta-t, but using a synchronised global clock to improve reordering and delay resilience.

N. Busi and G. Zavattaro, "Publish/subscribe vs. shared dataspace coordination infrastructures: Is it just a matter of taste?," in Proceedings Tenth IEEE International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises (WET ICE 2001), 2001.
- Encoding of pub/sub in a tuplespace formalism, and of a tuplespace in a pub/sub formalism. Interestingly, LINDA-like tuplespaces are too weak to encode pub/sub properly; you need an additional primitive. Another interesting point is the way subscriptions are encoded.

N. Fonseca and M. Crovella, "Bayesian Packet Loss Detection for TCP," in Proceedings 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM 2005), 2005, vol. 3, pp. 1826–1837.
- An off-the-shelf Google hit I found discussing Bayesian estimators for network reliability characteristics. I've only skimmed it.

R. W. Watson, "The Delta-t Transport Protocol: Features and Experience," in Proceedings 14th Conference on Local Computer Networks, 1989, pp. 399–407.
- Great discussion of transport protocols generally: characteristics they should enjoy, and approaches to achieving those goals. Some detail on Delta-t specifics, but for more you should see the 1981 protocol specification.

R. W. Watson, "Delta-t Protocol Specification," Technical Report UCID-19293, Lawrence Livermore National Laboratory, Dec. 1981.
- Full description of the Delta-t protocol, as implemented, including Pascal source code for key algorithms.

A. Shieh, A. C. Myers, and E. G. Sirer, "Trickles: A Stateless Network Stack for Improved Scalability, Resilience, and Flexibility," in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (NSDI 2005), 2005.
- Fascinating paper on a very interesting approach: using "network continuations" to reduce the per-connection memory requirements at TCP servers! Reminiscent of Racket's web-server stateless-servlets technique.

J. Mahdavi and S. Floyd, "TCP-Friendly Unicast Rate-Based Flow Control," Jan. 1997. [Online]. Available: http://www.psc.edu/networking/papers/tcp_friendly.html. [Accessed: 10-Dec-2011].
- Pithy discussion of the basic constraints required of new protocols in order that they be TCP-friendly. The state of the art here seems to have moved on a little, but I don't have a clear picture of it.

D. R. Cheriton and D. Skeen, "Understanding the limitations of causally and totally ordered communication," ACM SIGOPS Operating Systems Review, vol. 27, no. 5, pp. 44–57, Dec. 1993.
- Beginnings of a public snit about the value of transport protocols providing causal ordering primitively. Cheriton & Skeen say it's harmful; Birman & Joseph say it's helpful; both are clearly right, so it should clearly be optional in any given transport.

K. P. Birman and T. A. Joseph, "Exploiting virtual synchrony in distributed systems," ACM SIGOPS Operating Systems Review, vol. 21, no. 5, pp. 123–138, Nov. 1987.
- The original description of the ISIS system. The "news" subsystem rates a lone paragraph (section 3.9).

E. Meijer, "Subject/Observer is Dual to Iterator," in ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), FIT session, 2010.
- Makes the observation that the Observable and Enumerator interfaces from popular OO languages are categorical duals. I extended that thinking to the realm of acknowledgement- and credit-based flow control in a blog post: http://www.eighty-twenty.org/index.cgi/tech/origins-of-ack-and-flow-control-20110515.html

J. Waldo, G. Wyant, A. Wollrath, and S. Kendall, "A Note on Distributed Computing," Technical Report SMLI TR-94-29, Sun Microsystems Laboratories, Inc., Nov. 1994.
- Points out the problems inherent in papering over the latency, memory-locality, partial-failure and concurrency issues involved in remotely accessing resources. Argues for an explicit reflection of the difference between local and remote access in programming languages. This is the paper people reference to point out the issues with "RPC" (as it was meant in the 80s).