Messaging talk, scheduled for 12 Dec 2011.

* Title & Abstract

Messaging Middleware: A semi-ad-hoc approach to Distributed Systems

Before I joined NEU last year, I had been working for about four years
on "messaging middleware", an industry response to inflexibility in
network stacks. Existing real-world approaches to group communication
are manual, arcane, and low-level. In this talk, I'll survey the state
of the art in messaging middleware, and attempt to relate a handful of
widely-used communication protocols to a general model of group
communication.

* Outline

** Introduction

Most of the talk will be very operational, with quite a lot of
low-level detail. I'll spend quite a bit of time developing an
approach which I hope will help lead to higher-level, less manual
tools for implementing communication networks. For example, you could
specify your requirements and the system could choose TCP or PGM
depending on the closest fit, and could perhaps paper over the
differences by using a "residual" protocol.

PL connection: what is a language with more general communication than
point-to-point RPC like? How are exceptions/failures manifested? What
kinds of things are natural that were unnatural before, and vice
versa?

- These questions are important because people are building systems
  using messaging in part to compensate for deficiencies in their
  other tools. It is becoming urgent to learn what the "natural
  language" of messaging systems is like. (What are the primitive
  notions? What is the relevant scoping principle?)

** What is messaging?

messaging : network :: database : filesystem/store

While the lambda calculus serves as a unifying model for programming
languages, well and widely understood, there's nothing analogous for
communications. The pi calculus is a really solid foundation, but it's
unclear whether it's really foundational for communication in the same
way that lambda is for sequential PL.
Looking at the analogy just above, it feels like the most rudimentary
store in a model of a computer system is the S in the CESK machine.
Where's the analogous model of communication?

You can always encode or model communication using state, and likewise
model state using communication. In a way they're fundamentally the
same thing. But lots of people try to use databases as their shared
medium for communication between components in a distributed system.
This seldom works out very well, because their interaction patterns
are a poor fit for the kinds of interactions databases give you.
Likewise, there are people who try to store data in their messaging
systems for long periods of time.

One good rule of thumb is something I first heard from John Day:

- if you can bound the message TTL, it's networking;
- if you can't, it's a database.

** History

The history of messaging middleware overlaps with the history of
general studies of communication, so it can be hard to come up with
good boundaries.

Brief history of messaging in industry ("enterprise"):

- 1983, idea of "software bus", Vivek Ranadivé
- 1985, Ranadivé's company, Teknekron, finds first "TIB" customer in
  Goldman Sachs
- 1990, development of IBM MQSeries begins
- 1993, MQSeries released
- 1997, MSMQ
- 2001, JMS
- 2004, AMQP development begins
- 2006, AMQP specification released

Messaging in academia:

- Long, long history of developments
- 1964, Paul Baran's "On Distributed Communications" at RAND Corp
  introduces the idea of a mesh network
- 1976, Belsnes analyses at-most-once, at-least-once, exactly-once
- 1987, "News" subsystem of ISIS, Birman and Joseph; modelled after
  net news!
- 1990, Paxos for general-purpose fault-tolerant consensus
- Lamport's whole career has involved thinking about information
  replication and coordination in distributed systems

Messaging in IETF/OSI circles:

- Mid-to-late 1970s: TCP
- 1979, Usenet created by Truscott and Ellis (UUCP-based)
- 1981, Delta-t specification published
- 1983, Usenet message interchange (RFC 850)
- 1986, RFC 977 (NNTP)

** What is messaging used for?

- decoupling
  - of producers from consumers (logically, IOW)
    - has an aspecty flavour
  - in time (deferring work)
  - in space
  - in scale
- offloading of work (for latency management)
- managing concurrency in languages poorly suited to it
- reliable storage of important messages (higher TTLs, IOW)
  - in particular, human- and corporate-timescale messaging
  - "better late than never" vs. "better never than late"
- data distribution and syndication
- load balancing
- mux/demux (e.g. HTTP routing, RPC service routing)
- simply as another option for a transport, bulk data
- various conversational patterns
  - RPC (+error)
  - scatter/gather (Amazonish)
  - state replication (for fault tolerance; also e.g. BitTorrent)

** What does it look like, concretely?

Many systems (JMS (generally), AMQP, MQSeries) include a message
broker component, in a hub-and-spokeish configuration. Others (0MQ,
TIBCO) are decentralised. There are hybrid systems that make sense,
too (e.g. PGM, my multicast-sync experiments).

** Examples of the uses of messaging

*** Decoupling producers from consumers

Authentication daemon. Task: add logging of authentication requests
and responses.

*** Decoupling in time

Queue up emails for transmission later, out of the main request path.

*** Decoupling in space and in scale

Separate logical parts of the application into separate processes,
possibly on separate machines,

- to allow the application to manage its data locality (e.g. big
  database in Europe: send the query, don't fault in the whole
  database!)
- to allow the application to scale past the capacity of a single CPU
- to allow the application's different pieces to scale differently:
  e.g. an image-resizing service, where more CPU is needed for the
  image resizing than for the web frontend, so have more workers
  processing that part of the app

*** Offloading of work for latency management

The email example from above; also any kind of batch job, such as
Ohloh's downloading and analysis of source code in the background.
Don't want to block the web UI while that's going on.

*** Managing concurrency in non-concurrent languages

Many, many languages have an awful time coping with concurrency.
Using messaging helps isolate different concurrent subsystems.
Messaging is often language-neutral, which helps polyglot solutions
work smoothly.

*** Reliable storage of important messages

Message brokers often include a reliable message queue facility, where
messages are securely held in permanent storage for later delivery.
Example: orders placed in a trading scenario. The cost of a dropped
message can be in the millions. It's crucial in situations like these
to manage the transfer of responsibility for a message properly.
Reinterpret ACKs: instead of meaning "I hear you!", they really mean
"I take responsibility for this message now!". Deduplication at the
application level is also vitally important.

*** Data distribution and syndication

RSS feeds are a form of pub/sub (polling, if we set aside
PubSubHubbub). Stock ticks from the exchange. Log events going to
alerting subsystems.

*** Load balancing

Stateful (or stochastic) routing components in the network can
distribute load to pools of effectively-identical workers. E.g. the
image-resize service.

*** Multiplexing/demultiplexing

HTTP URL routing. Turn the path of the URL into a topic, send it
through a broker. Consumers subscribe to paths they are interested in
handling.

*** Messaging as a transport

A form of email for machines. Database access. API access.
*** Conversational patterns

RPC: use temporary reply queues. Bound message lifetime.

Scatter/gather: the Amazon example. Late arrival = you missed your
chance.

Pub/sub: example as above.

State replication: use messaging for the replicas to coordinate. Can
overlay protocols such as Paxos, etc. BitTorrent is an interesting
example here, which I'll revisit below.

*** Coping with unreliable transports, low bandwidth, and/or multihoming

Even TCP isn't totally reliable! Consider using mobile data, or moving
your laptop from home to work.

Treat the medium as an "object fabric", exploiting "object factories"
to create objects which manage the location and transfer of
information. Create temporary queues to buffer messages if you have an
unreliable transport. Create mixers to cope with limited-bandwidth
connections in a multicast audio setup. The similarity to creating
temporary queues: in both cases you're pushing some of the app's
functionality into a general-purpose "mid-network" location. We're not
working exclusively at the "ends" anymore! The end-to-end principle is
valuable but not inviolable. Another way of looking at it is that
you're still honouring the end-to-end principle, but at the
*intersection* of multiple networks.

** Deriving messaging from first principles

I'm going to introduce a simple data communications problem and
progressively add detail to it until we arrive at a model general
enough to encompass many of the things people are doing with
communication networks today.

Broadly, I'll use the term "message" for an application-level piece of
information, and "packet" for a transport-level bundle of information
sent directly across some medium.

*** Two-party unidirectional communication, single message

To begin with, consider the problem of a single Sender wanting to send
a single message to a single Receiver across a perfect medium.

  -> DATA(message)

We're done! That was easy. But what if the medium isn't perfect?
*** Packet loss and duplication

What if some packets don't make it across the link? Sometimes the
sender won't care, and that's fine. But when the sender does care,
what can be done?

Easy: make the sender repeat itself. Forever. With every repetition,
the probability of successful receipt increases rapidly.

But then the receiver might process a given message twice! Easy: the
receiver remembers which messages it's seen before, and discards
duplicates that it receives. Of course, in some cases duplicates don't
matter - the data describe an idempotent action - so the receiver
doesn't need to remember a thing, and can be perfectly stateless.

*** Packet corruption

What if some packets make it across, but don't arrive containing
exactly the same data that they left with? Use a checksum. Discard
corrupted packets, reducing the problem to that of packet loss.

*** Two-party unidirectional communication, many messages

What if there are many messages to send? Simply send them all, over
and over, forever - either at the same time, or if that's not
possible, one after the other, round-robin.

*** Reclaiming state space at the sender

Eventually the sender's retransmit buffer will get full, and the
sender will run out of memory. The receiver has to help the sender out
here, by sending an acknowledgement-of-receipt.

  -> DATA(message)
  <- ACK(message)

When the sender hears an ACK for a given message, it can remove that
message from the retransmit buffer. This has an additional advantage:
instead of probabilistic knowledge of receipt, the sender now has
exact knowledge of receipt.

*** Ack loss

What if an ACK goes missing? The sender will never know that the
message was received successfully! Every time the receiver receives
some DATA, it should issue a corresponding ACK. Now the tables are
turned: the receiver has only probabilistic knowledge that the sender
has heard the ACK it issued, but at least the sender will eventually
learn that the message was successfully received.
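The repeat-forever sender and deduplicating receiver can be sketched
in a few lines of Python. This is purely my illustration over a
simulated lossy link - the function name, loss rate, and RNG seed are
all invented for the sketch:

```python
import random

def send_until_acked(message, loss_rate=0.5, rng=None):
    """At-least-once delivery over a lossy link: the sender repeats
    itself until it hears an ACK; the receiver discards duplicates."""
    rng = rng or random.Random(1)
    seen = set()        # receiver's deduplication buffer
    processed = []      # messages the receiver actually acted on
    attempts = 0
    while True:
        attempts += 1
        if rng.random() < loss_rate:
            continue                # DATA packet lost in transit; retransmit
        if message not in seen:     # deduplicate at the receiver
            seen.add(message)
            processed.append(message)
        if rng.random() < loss_rate:
            continue                # ACK lost; sender retransmits again
        return processed, attempts  # sender heard the ACK and stops

processed, attempts = send_until_acked("hello")
# processed == ["hello"] no matter how many retransmissions it took
```

Note that the receiver processes the message exactly once even when
the sender transmits it many times - the dedup buffer, not the link,
provides that guarantee.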
*** Reclaiming state space at the receiver

At this point in the development, we have a system for reclaiming both
send queue slots and bandwidth of the medium. If we're worried about
making sure the receiver doesn't process a given message more than
once, however, we're left with a problem: the receiver has to remember
every message it has ever received, so that it can detect duplicates.
The receiver will eventually run out of memory.

The solution is that the receiver is permitted to free up a slot in
its deduplication buffer only when it learns that the sender has
learned that the receiver has received the datum. The receiver knows
that the sender knows that the receiver heard the sender's message.
All the sender has to do is let the receiver know that it considers a
given transmission to be complete.

  -> DATA(message)
  <- ACK(message)
  -> DONE(message)

If a given DONE is lost, no harm is done: when the receiver is running
short on space, it can spontaneously reissue an ACK from its table,
which will cause another DONE to be sent. If the sender receives an
ACK referring to a message unknown to the sender, again, no harm is
done: the sender can issue a DONE without causing any problems.

A DONE message, then, means both

- "I once was concerned that this message hadn't been heard by you,
  but I am no longer worried about that," and
- "You don't need to send me ACKs for that message."

*** Correctness: achieved

We now have a system that can achieve exactly-once delivery of
messages from a single sender to a single receiver, and that doesn't
need infinite buffer space on the sending or receiving side to do so.
It's time to broaden our focus to questions of efficiency, and then to
systems with multiple senders and multiple receivers.

We're still imagining, by the way, that we have infinite bandwidth to
work with. We'll revisit this shortly.
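The DATA/ACK/DONE bookkeeping can be made concrete as two small state
machines. This is a sketch with hypothetical class and method names,
not any real protocol implementation - but it captures the two
recovery properties noted above (a re-ACK for a duplicate still draws
a DONE, and an ACK for an unknown message safely draws a DONE too):

```python
class Sender:
    """DATA/ACK/DONE handshake, sender side."""
    def __init__(self):
        self.retransmit = set()   # messages not yet ACKed

    def send(self, msg):
        self.retransmit.add(msg)  # keep retransmitting until ACKed

    def on_ack(self, msg):
        self.retransmit.discard(msg)  # slot reclaimed; an ACK for an
        return ("DONE", msg)          # unknown message still yields DONE

class Receiver:
    """Receiver side: the dedup buffer is freed only on DONE."""
    def __init__(self):
        self.dedup = set()
        self.processed = []

    def on_data(self, msg):
        if msg not in self.dedup:
            self.dedup.add(msg)
            self.processed.append(msg)  # exactly-once processing
        return ("ACK", msg)             # always ACK, even duplicates

    def on_done(self, msg):
        self.dedup.discard(msg)  # safe: the sender has promised not
                                 # to retransmit this message again

s, r = Sender(), Receiver()
s.send("m1")
kind, m = r.on_data("m1")   # ("ACK", "m1")
kind, m = s.on_ack(m)       # ("DONE", "m1"); retransmit buffer now empty
r.on_done(m)                # dedup buffer now empty again
```

Both buffers return to empty after the three-step exchange, which is
exactly the "no infinite buffer space" property claimed above.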
*** An aside on reliable delivery and transfer-of-responsibility

The Two Generals Problem (called by Lamport the "Chinese Generals
problem") generalises to the Byzantine Generals problem. The solution
is to accept and mitigate the uncertainty inherent in the medium.

In its original form it doesn't *quite* apply to message transmission!
The difference is that 2GP is to do with symmetric shared
understanding of some fact - say, the value of a bit. In message
transmission, the goal is to reach an asymmetric shared understanding
of the bit: as soon as the receiver knows the value of the bit, the
receiver can go ahead (whereas in 2GP the receiver must reliably learn
that the sender knows the receiver heard the sender). Infinite regress
is avoided by the asymmetry.

*** Flow control

Now, imagine that the receiver can process incoming messages at a rate
of 10 Hz, but that the sender is firing off messages at a rate of
100 Hz. Once the receive buffer fills up, nine out of every ten
messages are going to be dropped by the receiver, because it has no
space in its buffer for them. This leads to an enormous waste of
effort.

The answer is for the sender and receiver to agree on what to send
ahead of time. The sender advertises a collection of messages, and the
receiver expresses interest in receiving the messages. I've called
this expression-of-interest "subscription" here; we'll see later that
it corresponds to subscription, presence, and the traditional
window-opening systems of current internet transport protocols.

  -> AD(message)
  <- SUB(message)
  -> DATA(message)
  <- ACK(message)
  -> DONE(message)

We've avoided our receiver being inundated with DATA packets, but we
seem only to have pushed the problem back one level: we're still
sending an enormous number of AD packets.

Other approaches to flow control include agreeing on an acceptable
rate ahead of time.
In the telephony world, if a component is overloaded it will send a
"gap request" message upstream, telling the upstream system to offer
no more than a certain number of bytes or packets per second of
traffic. You can think of this as an abstraction of a timed sequence
of SUB messages.

*** Compressing ADs, SUBs, ACKs, and DONEs

So far we've been identifying messages by simply copying in the
entirety of a message when we want to refer to it. This is clearly not
practical. What realistic systems do is arrange for messages to
receive shorter names so that they can be referred to indirectly.
Let's imagine using the SHA-1 hash of a message as its name.

We end up with AD, SUB, ACK and DONE traffic proportional to the
number of "outstanding" or "uncommitted" messages (i.e. proportional
to the buffer sizes at each end), and DATA traffic proportional to the
rate of processing at the receiver.

*** Sequence numbers

Using SHA-1 hashes to identify messages isn't bad, but we can do
better. The usual way of naming messages is to introduce sequence
numbers. Sequence numbers are particularly effective because sets of
sequence numbers can be very easily compressed.

Messages such as AD, SUB, ACK and DONE now logically contain *sets* of
messages, rather than individual messages; and furthermore, they
physically contain sets of message *names*.

Periodically, the sender will issue an AD containing the set of
sequence numbers which it still wishes to make sure that some receiver
accepts. Periodically, the receiver will issue a SUB indicating which
messages it is willing to receive. When the sender notices a match
between a SUB and an AD, it can send the DATA on the assumption that
the receiver is ready for it. The receiver can ACK the full set of
sequence numbers accepted so far. Since sets of sequence numbers are
so efficiently compressible, there's no need to use DONEs to reclaim
space in the receiver's deduplication buffer.
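The claim that sequence-number sets compress so well comes down to
run-length encoding: contiguous runs collapse to (lo, hi) pairs. A
tiny sketch (the function name is mine):

```python
def to_ranges(seqnos):
    """Compress a set of sequence numbers into a sorted list of
    inclusive (lo, hi) ranges - the trick that keeps AD/SUB/ACK
    sets cheap to transmit."""
    ranges = []
    for n in sorted(seqnos):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)   # extend the current run
        else:
            ranges.append((n, n))             # start a new run
    return ranges

# An ACK for "everything up to 5, plus 8 and 9" is just two ranges:
print(to_ranges({1, 2, 3, 4, 5, 8, 9}))      # [(1, 5), (8, 9)]
```

In the common case of in-order delivery the whole ACK set is a single
range, which is why a TCP-style cumulative acknowledgement can be one
number.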
*** An aside on sequence numbers and timers

Once you get into compressed names for messages you have the
additional problem of managing the namespace. This means you have to
think about stable storage, and about crash-recovery timers for
avoiding premature reuse of identifiers.

You need a sequence number space rightsized to the delays you
anticipate. For example, let MPL = 120s, R = 900s, A = 60s, and
T = 100 gigabits per second. Then we need a power of two greater than
1.2 x 10^14 bits. This is beyond the capability of a 32-bit counter,
but a 64-bit counter will do nicely. Longer delays
(e.g. interplanetary) require larger sequence number spaces.

*** AD and DONE are inverses

AD contains all the sequence numbers the sender considers outstanding.
DONE contains all the sequence numbers the sender considers completed.
Since both sets increase monotonically, and DONE never contains any
number not previously contained in AD, we can see that DONE is
redundant: simply advertising the set of sequence numbers the sender
considers outstanding automatically informs the receiver of the
sequence numbers considered complete! This doesn't hold for kinds of
message names other than sequence numbers.

*** We've reached TCP, Delta-t, and friends

Our protocol, modulo important issues of congestion avoidance and
timer management, is at the point that TCP and Delta-t are at:
single-sender, single-receiver, reliable, flow-controlled delivery of
messages.

TCP has an implicit AD: it is, effectively, always advertising every
future message contained on the connection. Because sequence numbers
compress so well, there's no DONE. So one direction of a TCP
connection effectively involves only the messages SUB, DATA, and ACK.

*** Sparse sets in SUB and ACK

If a group of messages are related, and a receiver has a finite
buffer, and an entire group must be received at once before processing
of the group can begin, it can be important to permit a sparse set of
sequence numbers to be transmitted in a SUB packet.
Doing so gives the receiver the option of selectively completing a
group of related messages before receiving new messages outside the
group of interest.

Sparse sequence number sets in ACK messages are also useful, because
they give the sender improved information on the lossiness of the
link.

This is by-and-large a syntactic issue, however, and it works
perfectly well with

- SUB and ACK carrying single message IDs,
- SUB and ACK carrying a single message ID implying every sequence
  number less than that ID, or
- SUB and ACK carrying arbitrary sets of message IDs.

*** Garbage-collection to "reuse" bandwidth

Real links don't have infinite bandwidth. ACKs, besides letting us
reuse slots in the sender's buffer, also help us "reuse" the available
capacity of a communication link with finite bandwidth. By signalling
that subsequent retransmissions of a message would be superfluous to
the ACKing receiver, the bandwidth taken up by the (infinite repeats
of a) message can be "reused" by some other message waiting to be
sent.

*** Packet reordering or delay

Real networks reorder packets, and can sometimes delay packets for
surprisingly long periods of time. Care has to be taken when
processing AD and SUB packets that stale information isn't acted on:
otherwise confusion and message duplication may result. One way of
avoiding this problem is to attach a monotonically-increasing counter
to each AD and SUB packet. Each end remembers the highest counter
value it has seen so far, and discards packets with counter values
lower than that value.

*** Topics, Subscription and Presence

Let's shift from single-sender, single-receiver to a general broadcast
medium with multiple senders and multiple receivers. We have already
seen that messages need to be unique in order to pass the
deduplication filters at receivers.
This implies that if "Hello" as said by A is to be regarded as
different from "Hello" as said by B, then the names "A" and "B" have
to be made part of the messages themselves, somehow.

In pub/sub systems, receivers generally are permitted to subscribe to
subsets of the available messages. Let's broaden the language used by
SUB messages to include more general predicates over messages. This is
a slight shift in perspective: ADs, SUBs and ACKs are still describing
sets of messages, but they're using an unusual language for describing
those sets. The language used permits discussion of some of the
semantic features of messages other than their raw identity.

Concretely, let's imagine that each message is logically

- a topic string,
- a sequence number, unique to the medium, and
- a message body.

Then we'd want ADs to be able to describe the topics on offer as well
as the sequence numbers within each topic. We'd want SUBs to indicate
topics of interest to the receiver, as well as sequence numbers that
receiver is ready to receive.

*** We have a pub/sub system

Presto - we have a pub/sub system with reliable transfer! It's here
that the semantic nature of the medium first shows up:

- there are options for how many acks are required for a sender to
  consider a message successfully completed
- there are options for routing individual messages to individual
  subscribers: stateful routing (round-robin) becomes possible
- data distribution trees and other routing structures can be computed
- access can be controlled by the network
- security properties can be enforced by the network (such as not
  permitting certain stations to subscribe to traffic intended for
  other stations)
- you can implement the network's semantics in several ways:
  hub-and-spoke style (like AMQP, and like Racket's universe, for that
  matter!)
  or distributed style (like IP multicast, PGM, 0MQ, etc.)

(If you squint, this setup is like a potentially-reliable variant of
the IP network: the topics would be IP addresses.)

*** The key is the languages involved

Choosing different languages for describing sets of messages gives you
different network routing characteristics. The type used for AD
doesn't have to be the same as that for SUB, but they do have to be
able to be "run against" each other so that the network can route
messages appropriately. You need to be able to compute the
intersection of ADs and SUBs. For instance, if they both used regular
expressions to describe sets of strings, you'd need to be able to
compute the intersection of two regular expressions.

*** Presence and Subscription

Having run the ADs in the network against the SUBs in the network, in
order to get flow control to be smooth across the entire distributed
system, you need to send the intersection of AD and SUB upstream.

Flow control is an end-to-end concern. We see this with the sockets
API: once the send buffers are full, the socket won't accept any more.
Generally, it's painful to push this backpressure further upstream -
even in languages like Erlang, which have some crude flow control and
queueing built in to the runtime.

Propagating flow-control backpressure (a.k.a. subscription) upstream
(and, symmetrically, propagating advertisement downstream) leads
directly to presence! Subscription, presence and flow control are very
closely related; after all, what is "availability" but "willingness to
listen"? Think of how you use IM: even there, it plays a similar
(though multipurpose) role to flow control.

*** Flows and flow IDs

Connections in the TCP world generalise to "flows" in a less
connection-oriented world. A flow is a stream of messages from one
sender to one receiver, in a possibly multiple-sender,
multiple-receiver setting. Often, application-level context is
associated with a flow ID.
In TCP, for instance, the flow ID is the
sourceIP:sourcePort:destIP:destPort combination. Flow IDs relate to
presence as well: they are roughly the residual after "running" the AD
and SUB terms against each other.

"Flow ID" is a pretty vague notion, and itself generalises to the
notion of a *conversational context*, which can be N-way and serves to
scope some construct of interest to the application.

*** Separating UNSUB from ACK

While we can imagine reasons why ACKs might also want to be
generalised to being topic-oriented, it's not really required, since
ACKs are used to indicate successful receipt of specific messages
rather than topic selection. For this reason, it's an interesting idea
to separate unsubscription from ACK, giving

  -> AD(topics and sequence numbers)
  <- SUB(topics and sequence numbers)
  -> DATA(message)
  <- UNSUB(topics and sequence numbers)
  <- ACK(sequence numbers)

There's no need to include DONE here, for the reasons given earlier.
In fact, a similar argument applies to UNSUB! Having introduced it, we
can immediately dispense with it (for certain choices of message
identifier), folding its functionality into SUB. A SUB message, then,
indicates which messages the receiver is potentially interested in
ACKing at some point in the future. Care has to be taken to ensure
that the sender hears an ACK before the associated message IDs are
removed from the SUB!

*** BitTorrent

- Topic = filename
- Sequence number = chunk
- Senders and receivers share roles
- ACKs are deemphasised

*** Stock ticker

- Topic = stock
- Sequence number = not present
- Flow ID = topic (implicitly single-sender) or exchange & topic (if
  multiple markets)

This makes the flow-control/subscription hybrid into pure
subscription. Acks are not used. Data distribution trees can permit
efficient fanout.

*** PGM

PGM is a big, complex protocol for reliable (exactly-once) multicast
between many sources and many sinks, with optional packet ordering.
"PGM guarantees that a receiver in the group either receives all data
packets from transmissions and repairs, or is able to detect
unrecoverable data packet loss."

- NAKs are unicast back to the sender
- NAK Confirmations, Repair Data

*** IRC

- Topic = (nick * room name) for AD, (room name) for SUB
- Sequence number = absent
- Symmetric membership
- No acks
- Subscriptions are indicated by the NAMES in a room

*** Twitter

- Topic = username
- Sequence number = absent
- Subscription = following
- AD = implicit

*** RTP

RTCP collects statistics on packet loss and jitter. It also carries a
kind of presence!

*** JMS, AMQP

*Really* twisted, confused systems. Multiple layers squished together
in a very awkward way. On the plus side, you can usually achieve your
goals with the primitives on offer; and AMQP includes a rudimentary
"object factory" for managing locality of state in the distributed
system as a whole.

** An aside on position and ownership

You don't *have* to keep the layers syntactically separate, but you
*do* have to keep them semantically separate.

Interesting principle: don't confuse position with ownership. Not sure
where it originated, or who came up with it; I heard it first at the
AMQP F2F in Feb 2007.

Example: RTP has an M(arker) bit under application control in the
packet header. Presumably it's there to permit its examination by
middleboxes, but it's logically owned by the application and thus is
part of the payload, not the header.

Another example: TCP/IP puts the topic (IP addresses and port numbers)
in the header, separate from the body of the stream. If you view the
whole TCP/IP network as a single thing, the topic is logically part of
the message. The network enforces the subscription for you as part of
joining.

** You can't escape the fiddly details

Why am I talking about all of these low-level details of transport
protocols, like retransmission, reordering, deduplication, and flow
control?
Because when people start working with messaging middleware, they
assume that the middleware will save them from having to think about
these things - but this is in general not possible! In general, one
finds oneself painfully reimplementing deduplication and flow control,
encoding them poorly, in an unplanned and reactive way, in the
protocol used at the application level. It'd be a great boon to
distributed systems designers to have a good language they could use
in this area, one that covered all the necessary ground.

** Transport protocols generally

Watson's requirements for a transport protocol (slightly paraphrased):

- Minimum packet exchange for RPC
- High-throughput bulk data transport & streaming
- Flow control (without polling for reliable 0-window opening)
- Error control (lost, damaged, duplicate, reordered packets)
- Large, flexible namespace for transport endpoints
- Message boundary preservation
- Secure communication

The split between "network" and "transport" protocols is possibly
misleading; see Delta-t, but also the natural (?) emergence of routing
("networkish") and flows ("transportish") from generalised pub/sub
media.

Duplication, loss, and reordering: "reasonable conditions of loss and
reordering rates (5% and 0.2%, respectively)" - from Fonseca &
Crovella 2005.

Retransmission intervals: round-trip time; Bayesian estimators of
packet loss and of congestion; control theory. Packet loss estimate;
RTT estimate. Protocols proposed specifically for RTT estimation: the
RFC 862 echo protocol; NTP estimates RTT; RTCP measures RTT. Protocol
specifically for congestion detection: DCCP, the Datagram Congestion
Control Protocol, RFC 4340.

Long-distance (interplanetary, interstellar) communication: TCP
doesn't work. Does the "general model"?

** Layering in communication systems

Enrollment is frequently overlooked!

** Application-level ordering

Sequencing of packets discussing application-level data is very
different from sequencing within application-level data!
For example, control packets talking about which packets are on offer
and which have been received have ordering requirements specific to
the transport protocol, which is a completely different statement from
any statement of ordering requirements among packets carried by the
transport protocol.

Ordering constraints: where they're semantic, how do you get the
network to help you out? What about other kinds of causality flow?

** Connection management

Watson's connection management requirements:

- G1: An identifier of an information unit used for error control (or
  any other service) must not be reused while one or more copies
  (duplicates) of that unit or its ack are alive.
- G2: The error control information being transmitted between each end
  must itself be error controlled.
- G3: If the crash of an end can cause it to lose its state, then an
  appropriate crash recovery mechanism must assure the other
  requirements are met.
- O1: If no connection exists, and the receiver is willing to receive,
  no duplicate packets from a previously closed connection should
  cause a new connection to be established and duplicate data to be
  accepted, unless the operations represented by the data are known to
  be idempotent.
- O2: If a connection exists, then no packets from a previously closed
  connection should be acceptable within a current connection.
- C1: No packet from a previous connection should cause an existing
  connection to close.
- C2: A receiving side should not close until it has received all of a
  sender's possible retransmissions and can respond to them.
- C3: A sending side should not close until it has received
  acknowledgement of all that it has sent. In particular it should
  allow time for an acknowledgement of its final retransmission, if
  needed, before reporting a failure to its client program.
Watson's connection management requirements (summarised):
G1 - If an ID could be live in the net somewhere, don't reuse it
G2 - Make it possible for the receiver to know the start and end of a stream, i.e. include the open/close flags in the ack'd bits
G3 - Crashes are no excuse for failing
O1 - Martian packets shouldn't cause non-idempotent receives when closed
O2 - Martian packets shouldn't cause receives when open
C1 - Martian packets shouldn't close existing connections
C2 - Don't forget too early on the receiving end
C3 - Don't forget too early on the sending end

** Other distributed systems topics

Lamport's Paxos - a family of protocols for solving consensus problems in the face of unreliable participants. Foundation of the "state-machine approach" to distributed systems, a systematic technique for producing fault-tolerant distributed implementations of algorithms.

** Congestion control

Congestion control: again, control theory. Different strategies interact very differently, even among TCP variants. Saying something is "TCP-friendly" is a bit vague; you want to know specifically which variant congestion-control algorithm is being considered.

See also http://curvecp.org/ and its interesting decongestion system, http://curvecp.org/decongestion.html (includes good discussion of various interactions - latecomer's advantage etc.)

The "TCP-friendly" equation: maintain the arrival rate at no more than some constant over the square root of the packet loss rate. http://www.psc.edu/networking/projects/tcpfriendly/ - Mahdavi & Floyd

Rate-based flow control: "In order to implement a TCP-friendly congestion control algorithm, these applications should simply choose to send at a rate no higher than a TCP connection, operating along the same path, would achieve."
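The Mahdavi & Floyd rule of thumb quoted above can be turned into arithmetic. The sketch below uses the steady-state model of Mathis et al., rate ≈ (MSS/RTT) · C/√p with C = √(3/2); real TCP variants deviate from this, so treat it as an order-of-magnitude ceiling rather than a precise target.

```python
import math

def tcp_friendly_rate(mss_bytes, rtt_seconds, loss_rate, c=math.sqrt(3 / 2)):
    """Rough TCP-friendly throughput ceiling, in bytes per second.

    Implements the rule of thumb that the sending rate should be at most
    some constant over the square root of the packet loss rate.  The
    default constant C = sqrt(3/2) follows the Mathis et al. steady-state
    TCP model; actual TCP congestion-control variants differ."""
    if loss_rate <= 0:
        raise ValueError("the model only applies with a nonzero loss rate")
    return (mss_bytes / rtt_seconds) * (c / math.sqrt(loss_rate))
```

One consequence worth noticing: quartering the loss rate only doubles the permitted rate, which is why loss-based fairness is so punishing on long, lossy paths.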
- Mahdavi & Floyd 1997

Duality between acks and flow control (corresponding to "push" and "pull" - enumeration vs observation).

** Security considerations

Security - IPsec is a massively complex-seeming bundle of optional protocols. DJB's CurveCP seeks to radically decomplexify things, and looks reasonable to this amateur.

Authentication of messages - PKI and naming are the tricky parts. Per-message encryption - you have the problem of key distribution etc. Multi-reader, multi-writer is a bitch, especially in an open system.

Denial-of-service: memory usage attacks. Un-erasable state. Trickles. Cryptographic capabilities.

** Network continuations

Trickles:
- no server-side memory usage for the state vector
- TCP-like performance; plays nicely wrt congestion
- anycast becomes a possibility! "[eliminates] the need for route stability"
- "a form of network continuation" - see paper
- split into a transport continuation, holding the TCP TCB...
- ... and a user continuation, state from an app server (!)

** Relating specific protocols to the general model of communication

HTTP, TCP (and its lesser-known relation, Delta-t), IRC, XMPP, Twitter, AMQP, PGM (Pragmatic General Multicast), RTP, 0MQ, RSS/Atom and PubSubHubbub, BitTorrent.

Delta-t is slightly weird - bit-oriented, "reliable" 0-window opening - but otherwise quite sensible.

** Timers for discarding state

- MPL = max packet lifetime
- R = max time a sender is willing to keep retransmitting
- A = max ack delay
- T = max rate of new SN generation
- SNs range within `[0,2^n)`
- 2^n > (2*MPL + R + A)T
- Delta-t = MPL + R + A
- receive-time = 2 * Delta-t
- send-time = 3 * Delta-t

** When should a sender retransmit?

TCP:
- timers based on RTT
- duplicate acks

** Semantic media, Stateful media

Anycast and its relation to the statefulness of queues (they distribute messages round-robin); the general, possibly-stateful (!) semantics of a medium.

** Responsibility transfer and failure-resiliency

Responsibility transfer in a failure-resilient setting: send a message, and collect acks from N of M listeners; they now hold collective responsibility for the message. To take responsibility further down the chain from the N of M: send acks on to all the replicated copies. (They should all be advertising the same message ID.)

** Separating mechanism from policy

Lying in TCP - both for unreliability and for making it a message-oriented protocol (problems with firewalls etc., of course). Confuses RTT estimators, confuses congestion detection, etc. Heavily overloaded semantics. Separation of mechanism from policy. (RTP deals with multicast because it knows about asymmetry! TCP can't cope in that setting.)

** Conclusion

I've described the kinds of things messaging middleware systems do in industry, developed a sketch of a general model that covers many of the applications of networking, and related it to some common protocols in use.

Most messaging middleware is used on LANs where everything can be controlled. Few solutions "scale up" to the internet. Congestion and state management are particularly interesting issues at internet scale. Congestion control for anything but TCP-like flows is still a completely open problem - the main difficulty is scoping a flow: what counts as a flow? What's fair? We're lucky we have so much excess capacity, it seems!

Open questions I'm interested in include:
- What's the communications analogue of the S in CESK?
- What would a programming language based on this more general notion of communication be like?
- What does this view of communication suggest for the design of interfaces for communication in operating systems?
- Could there be One Transport Protocol To Rule Them All?

* Bibliography

P. Baran, "On Distributed Communications: I. Introduction to Distributed Communications Networks," 1964.
- Early work on decentralised communications mesh networks, including discussion of reliable message delivery.

P. Baran, "Reliable digital communications systems using unreliable network repeater nodes." RAND, 1960.
- Early discussion of reliable message delivery.

D. Belsnes, "Single-Message Communication," IEEE Transactions on Communications, vol. 24, no. 2, pp. 190–194, 1976.
- Two-, three-, four- and five-packet protocols for reliable single-message delivery, plus analysis of their failure modes and crash resistance.

U. Maheshwari, "Hula: An efficient protocol for reliable delivery of messages," Technical Report MIT-LCS-TR-720, MIT Laboratory for Computer Science, Jul. 1997.
- A modern protocol sharing similarities with Delta-t, but using a synchronised global clock to improve reordering and delay resilience.

N. Busi and G. Zavattaro, "Publish/subscribe vs. shared dataspace coordination infrastructures: Is it just a matter of taste?," in Proceedings Tenth IEEE International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises (WET ICE 2001), 2001.
- Encoding of pub/sub in a tuplespace formalism, and of a tuplespace in a pub/sub formalism. Interestingly, LINDA-like tuplespaces are too weak to encode pub/sub properly; you need an additional primitive. Another interesting point is the way subscriptions are encoded.

N. Fonseca and M. Crovella, "Bayesian Packet Loss Detection for TCP," in Proceedings 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM 2005), 2005, vol. 3, pp. 1826–1837.
- An off-the-shelf Google hit I found discussing Bayesian estimators for network reliability characteristics. I've only skimmed it.

R. W. Watson, "The Delta-t Transport Protocol: Features and Experience," in Proceedings 14th Conference on Local Computer Networks, 1989, pp. 399–407.
- Great discussion of transport protocols generally: characteristics they should enjoy, and approaches to achieving those goals. Some detail on Delta-t specifics, but for more you should see the 1981 protocol specification.

R. W. Watson, "Delta-t Protocol Specification," Technical Report UCID-19293, Lawrence Livermore National Laboratory, Dec. 1981.
- Full description of the Delta-t protocol, as implemented, including Pascal source code for key algorithms.

A. Shieh, A. C. Myers, and E. G. Sirer, "Trickles: A Stateless Network Stack for Improved Scalability, Resilience, and Flexibility," in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (NSDI 2005), 2005.
- Fascinating paper on a very interesting approach: using "network continuations" to reduce the per-connection memory requirements at TCP servers! Reminiscent of Racket's web-server stateless-servlets technique.

J. Mahdavi and S. Floyd, "TCP-Friendly Unicast Rate-Based Flow Control," Jan. 1997. [Online]. Available: http://www.psc.edu/networking/papers/tcp_friendly.html. [Accessed: 10-Dec-2011].
- Pithy discussion of the basic constraints required of new protocols in order that they be TCP-friendly. The state of the art here seems to have moved on a little, but I don't have a clear picture of it.

D. R. Cheriton and D. Skeen, "Understanding the limitations of causally and totally ordered communication," ACM SIGOPS Operating Systems Review, vol. 27, no. 5, pp. 44–57, Dec. 1993.
- Beginnings of a public snit about the value of transport protocols providing causal ordering primitively. Cheriton & Skeen say it's harmful; Birman & Joseph say it's helpful; both are clearly right, so it should clearly be optional in any given transport.

K. P. Birman and T. A. Joseph, "Exploiting virtual synchrony in distributed systems," ACM SIGOPS Operating Systems Review, vol. 21, no. 5, pp. 123–138, Nov. 1987.
- The original description of the ISIS system. The "news" subsystem rates a lone paragraph (section 3.9).

E. Meijer, "Subject/Observer is Dual to Iterator," in ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), FIT session, 2010.
- Makes the observation that the Observable and Enumerator interfaces from popular OO languages are categorical duals. I extended that thinking to the realm of acknowledgement- and credit-based flow control in a blog post: http://www.eighty-twenty.org/index.cgi/tech/origins-of-ack-and-flow-control-20110515.html

J. Waldo, G. Wyant, A. Wollrath, and S. Kendall, "A Note on Distributed Computing," Technical Report SMLI TR-94-29, Sun Microsystems Laboratories, Inc., Nov. 1994.
- Points out the problems inherent in papering over the latency, memory-locality, partial-failure and concurrency issues involved in remotely accessing resources. Argues for an explicit reflection of the difference between local and remote access in programming languages. This is the paper people reference to point out the issues with "RPC" (as it was meant in the 80s).