To develop high-performance yet correct distributed applications, it is essential to understand how your application's needs map to consistency. This tutorial aims to clarify the crowded space of consistency models and give the audience a working understanding of it. What does a given model provide? What are the costs?
A distributed system has data scattered and replicated across nodes, separated by slow and unreliable networks. Designers face an inherent trade-off between system cost and application cost. If the system is "strongly consistent," it masks the ugly details, at the cost of constant synchronisation, and even of stalling when the network is down. Moving to a "weakly consistent" model can significantly improve cost and performance, but weakens the guarantees, shifting some of the power, and some of the responsibility, onto the application.
The strongest consistency models have three remarkable features: atomic transactions, causal order, and absence of concurrent updates. Each of these features helps to maintain a different class of application invariants, and each has a cost. Relaxing one of the features lowers the cost and weakens the corresponding invariants, mostly independently of the others. We will illustrate each of the three axes, the associated class of invariants, and the underlying mechanisms.
Consistency models that allow concurrent updates are tricky, since data replicas may diverge. To address this issue, application developers can build upon libraries of CRDTs (Conflict-free Replicated Data Types). A CRDT is a data type that encapsulates divergence control and resolution, and that ensures that data necessarily converges to a correct state, thanks to some simple mathematical properties. We will spend some time explaining the concepts, design and implementation of CRDTs.
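To make the idea concrete, here is a minimal sketch (not part of the tutorial material) of one of the simplest state-based CRDTs, a grow-only counter (G-Counter). Each replica increments only its own slot, and merging takes the per-replica maximum, so merges can be applied in any order, any number of times, and all replicas still converge to the same value:

```python
class GCounter:
    """State-based grow-only counter: a minimal CRDT sketch."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, n=1):
        # Local update touches only this replica's own slot ("write now").
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Join of two states: per-replica maximum ("propagate later").
        # This merge is commutative, associative, and idempotent.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

# Two replicas update concurrently, then exchange states in either order.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

Because the merge only ever takes maxima, delivering the same state twice, or delivering states out of order, cannot cause divergence.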
We will also show how developers can leverage the two other dimensions, atomicity and causality, to ensure correctness of their applications.
- Application developers and system developers.
The CRDT abstraction (Conflict-free Replicated Data Type) is a useful tool for building highly scalable and available applications. CRDTs combine three key intuitions: (i) encapsulate distribution inside the boundaries of a data type, so that ordinary programmers can build applications by combining CRDTs from a library; (ii) support "write now, propagate later" concurrent updates to ensure scalability, availability and performance; (iii) rely on some simple mathematical properties that ensure that data replicas converge safely. This talk presents the principles and design of some useful CRDTs. We discuss the CRDT approach in the context of some real applications of our industrial partners. We present some of the difficulties and limitations of CRDTs, in the more general perspective of consistency and scalability. We present recent extensions that address some of these limitations, in particular for maintaining application invariants.
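The "simple mathematical properties" in (iii) can be illustrated with a sketch (an illustration of the general idea, not the talk's own examples): in a state-based CRDT, replica states form a join semilattice, and merge is the join, so it is commutative, associative, and idempotent. The grow-only set, whose join is set union, shows all three:

```python
# Grow-only set (G-Set): states ordered by subset form a join semilattice,
# and merge (set union) is the join. These three properties guarantee
# convergence regardless of message order, loss recovery, or duplication.

def merge(s1, s2):
    return s1 | s2  # join = set union

r1 = {"x", "y"}  # replica 1's state
r2 = {"y", "z"}  # replica 2's state

# Commutative: exchange order between replicas does not matter.
assert merge(r1, r2) == merge(r2, r1)
# Idempotent: receiving the same state twice is harmless.
assert merge(r1, merge(r1, r2)) == merge(r1, r2)
# Associative: states can be combined along any propagation path.
assert merge(merge(r1, r2), {"w"}) == merge(r1, merge(r2, {"w"}))
```

Any merge function with these three properties yields convergence for the corresponding data type; the design work lies in finding such a function that also matches the application's intended semantics.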
Marc Shapiro does his research on distributed computer systems, data replication and consistency algorithms, and distributed garbage collection. He invented the proxy concept, which is now universal on the Internet. He has published at SOSP and OSDI, the two most prestigious venues of the area (one of the only two French papers at both venues). He was instrumental in the creation of EuroSys, the main European venue in the area. He has authored 64 international publications, 17 recognised software systems, and four patents. Dr Shapiro's research started with a PhD from Université Paul Sabatier for research performed at LAAS in Toulouse, France (1980), followed by a post-doc at MIT, and a researcher position at CMIRH. He has been a researcher at INRIA since 1984. He spent a one-year sabbatical at Cornell (1993–1994), and he led the Cambridge Distributed Systems group at Microsoft Research Cambridge (UK) from 1999 to 2005. He is currently a Senior Researcher for INRIA Paris-Rocquencourt, in the Regal group, located at LIP6.