Building a distributed system like CockroachDB that is strongly consistent and survives disasters is a big task. Beyond the required functionality, it must also be correct, performant, and stable. But four months into CockroachDB’s beta launch, we were still unable to keep a 10-node cluster running under continuous load for two solid weeks. In this talk, I’ll outline the factors that contributed to product instability, including team structure, code churn, and developer focus. I’ll then give a deep dive into both the technical and process-related fixes we undertook and our ongoing efforts to prevent recurring instability. Finally, I’ll share my thoughts on whether a period of instability in a distributed system like CockroachDB can be avoided, and how.
Spencer Kimball is the co-founder and CEO of Cockroach Labs, where he maintains a delicate balance between a love for programming distributed systems and the excitement of helping the company grow smoothly. He cut his teeth on databases during the dot-com heyday, and had a front-row seat at Google for a decade’s worth of their evolution.