Why Reliability Engineering?
Why is Reliability Engineering relevant at a company like Coinbase? Why would we want to build a Reliability Engineering team?
“Our goal is to make Coinbase the most trusted and easiest to use digital currency exchange.”
-Brian Armstrong, Co-founder & CEO
It all comes back to what our CEO Brian Armstrong said about Coinbase wanting to be the most trusted. Our goal in the cryptocurrency industry is to create an open financial system for the world — and part of that requires us to build the most trusted digital currency exchange. In order to be the most trusted exchange, we need to be the most reliable. Being reliable is a competitive advantage in our industry, while being unreliable is a serious risk to our business.
Before you get too deep into this article, please note that we’re actively hiring great Reliability Engineers, so if any of this sounds interesting to you please head over to our Senior Reliability Engineer job posting here.
What is Reliability Engineering?
The mission of the Reliability Engineering team at Coinbase is:
“Help engineers design & keep their promises in production.”
The word “promise” in our mission statement is a reference to Promise Theory, which was invented by Mark Burgess. While we use many of the principles from the Google SRE books, we found the language of promises to be more human-friendly than the term “Service Level Objective”, which is a bit jargon-y. Based on the investigations into safety and reliability by people like Sidney Dekker and companies such as Toyota (see the Toyota Way), we consider reliability to be ultimately a human challenge. For this reason we preferred to reference a concept which every human already understands: that of making and keeping promises.
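To make the idea concrete, here is a minimal sketch of how a promise might be checked against a measured indicator. It is an illustration only, not a description of Coinbase's tooling: the `Promise` class, the `availability` and `is_kept` functions, the "api-availability" name and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Promise:
    """A reliability promise: a target for one indicator (hypothetical model)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def availability(good_requests: int, total_requests: int) -> float:
    """Indicator: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic in the window, so nothing failed
    return good_requests / total_requests


def is_kept(promise: Promise, good_requests: int, total_requests: int) -> bool:
    """A promise is kept when the measured indicator meets its target."""
    return availability(good_requests, total_requests) >= promise.target


# Assumed example: a service promises that 99.9% of requests succeed.
promise = Promise(name="api-availability", target=0.999)
print(is_kept(promise, good_requests=99_950, total_requests=100_000))
print(is_kept(promise, good_requests=99_800, total_requests=100_000))
```

The indicator is what gets measured and the promise is the target it is compared against, which is the same relationship as between a Service Level Indicator and a Service Level Objective.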
Major differences between Reliability Engineering at Coinbase vs Site Reliability Engineering (SRE) at some other companies:
- We are generalist software engineers first and foremost. We focus on solving challenges by writing better software rather than adding more and more humans to push buttons. Everyone on the team is a strong software engineer, working on multiple software systems in a variety of programming languages.
- We do not have front-line pager responsibility. We are on-call for the systems that we ourselves own (e.g. the Coinbase observability stack), but we are not the first line of incident response for other teams. Service and product teams have their own pager rotations.
We like to apply the metaphor of ‘teaching a person to fish vs giving them a fish’ to how we operate — our mission is to “teach teams to fish” in terms of reliability. This is in contrast to “giving them a fish” by handling front-line pager duties on their behalf. Another way of putting it is that our goal is to up-level every engineering team at Coinbase to be self-sufficient in Reliability Engineering.
How do Reliability Engineers work?
One of the important things to realize about reliability engineering is that it is inherently cross-cutting throughout the organization. Reliability is not itself a functional silo — it is a value and a business output. Our customers are every single engineering team at Coinbase. Since we work with so many customers, we have defined different models of engagement to meet their needs:
- Advisory. This means answering questions or responding to ad-hoc requests without formal deliverables. For example, responding to “Help me monitor/scale/improve my thing” questions in Slack, or jumping into production incidents to support responders.
- Consulting. We often run structured reliability workshops and pairing sessions with other teams. In these engagements, we have a shared goal (in our case, OKR) with the team we’re consulting with — thus there is a measurable outcome. While consulting engagements are formal, they are typically part-time endeavours.
- Embedding. Sometimes teams will need full-time reliability support from our engineers, and they request that we physically sit and work with them, participating in their standups, sprint plannings, etc. This is where we use embedding. Similar to Consulting, this work has a shared goal and measurable outcome (OKR) — the difference is the reliability engineer is a temporary (typically, one calendar quarter) member of the customer team.
Beyond the various ways we engage with customers, we follow a standard “agile” software engineering process. We have a weekly planning meeting to update our Kanban board, conduct monthly retrospectives and hold daily standups. Longer-term strategy and measurements are captured in quarterly OKRs which we derive from customer feedback and internal discussion.
Introducing the Coinbase Reliability Engineering Team
The Reliability Team was founded in 2018 with one engineer (Luke Demi) and myself (Niall O’Higgins) as manager. Since then, we’ve grown to 7 engineers and shipped a lot of improvements.
In the words of folks on the team, here are some accomplishments we can speak about publicly as well as impressions and experiences from working on reliability!
After joining Coinbase in 2016, my initial efforts within the company focused on building self-service infrastructure for engineers. However, in 2017, as interest in cryptocurrency surged, Coinbase began to experience outages across our systems. Solving these types of reliability problems excited me, so I dove in head first to get to the bottom of these issues.
We were able to survive 2017, but it was clear that in order to withstand future surges and provide a reliable experience for our customers we would need to make reliability a core component of the engineering culture at Coinbase.
I find the Reliability Team exciting because we’re able both to advise teams on best practices for choosing reliability indicators (Service Level Indicators, AKA SLIs) and promises (AKA SLOs) and to build the tools that let engineers understand the performance of their systems in production.
I joined Coinbase in July 2018. Being the 3rd engineer on the Reliability Team was an amazing experience. There are so many things I love about the company and I’d like to highlight a few of them:
- An opportunity to work with / learn from smart and talented people.
- Project ownership. An engineer on the Reliability Team owns a project all the way through from design to shipping.
- Ability to contribute to Open Source.
- Learn, learn and learn. Coinbase provides so many opportunities to learn new technology. It feels like we are utilizing every spare minute to learn new things! We have Lunch & Learn sessions with guests from leading technology companies, and every engineer has an annual educational budget to go to conferences or take online classes.
- Delicious meals on site 🙂
When I first joined the Reliability Team in November 2018, I was under the impression that I would be thrown into the deep end of blockchain — drowning in Bitcoin, Ethereum, and smart contracts. Colleagues also warned me of endless firefighting and nightmarish on-call rotations. Fortunately, this was not the case.
The Reliability Team doesn’t work with blockchains directly and isn’t the first to be paged for every single incident. Each Coinbase team owns the daily operations of their specific products or services. This allows for distributed knowledge across the organization.
As a new college graduate I initially felt overwhelmed, but everyone on the team has been incredibly supportive and willing to share their knowledge. Within a month, Niall and I improved our incident management system by integrating it with JIRA. I wrote my first design document to further integrate PagerDuty with our incident management system and I am continually making incremental changes to our system.
One of the most important things I’ve learned is that working with amazing team members is priceless. The Reliability Team is a group of curious, empathetic, and intelligent individuals and there’s no other group I would rather be with for five days a week.
The most interesting part of being on the Reliability Team for me is our high-level perspective across the organization. Since we are not tasked with handling day-to-day operations of any specific Coinbase product (Coinbase.com, Coinbase Pro, Coinbase Wallet, etc), we can focus on improving teams’ ability to observe and understand their systems. This means that teams can move faster, incidents are resolved more quickly, and there’s a decentralization of knowledge across the organization.
Here are some examples of improvements that I’ve contributed to over the past year:
- Writing lightweight stats, tracing, and logging libraries for the various languages in use across the organization.
- Contributing to “paved roads” for various languages and ensuring that developers have a good starting point for new services, with sane defaults.
- Introducing new vendors (such as Datadog) to bring more dimensions of observability, unlocking new ways of monitoring systems.
- Bringing a perspective of reliability to technology choices made by teams and helping them ask the right questions.
- Contributing to our deployment tooling to integrate high level monitoring by default on all services.
- Enabling the use of gRPC across the organization through client generation in various languages and integration into our AWS architecture. See blog post “gRPC to AWS Lambda: Is it Possible?”
In addition to shared tooling, we engage with many teams across the organization by running workshops, review sessions, and office hours.
Workshops are hands-on sessions that focus on topics like observability tooling and promise construction, within the context of that team’s services or problem domain.
Review sessions happen both early in the design process for services and later when they are nearing production. These reviews do not act as a gate or “green check mark” for teams. Instead, they make sure that teams are asking the right questions, and they highlight ways that the Reliability Team can level up teams across the organization.
Office hours are open time every week for any engineer to bring problems or feedback to our team by pairing with an engineer. Topics usually include how to build effective monitors and dashboards, how to integrate tracing or metrics libraries, which database to use for a particular problem, and more.
At the end of the day, my favorite part about the Reliability Team is the diverse set of engineers we have. The breadth and depth of knowledge shared by everyone is a great support structure for tackling a problem of any scale.
I have an unusual background for an infrastructure engineer. I studied graphic design in school and worked for the first half of my career as a designer. Joining the Reliability Team was, for me, the latest step in a long, ongoing journey away from the front end. I’ve really enjoyed the new challenges I’ve faced on this team and have been pleasantly surprised at how often my experience as a designer ends up being relevant here.
My favorite part about being on the Reliability Team is being close to where the excitement is happening across the company. The greatest need for reliability expertise is often around new product launches or new-found success of some existing product. We’ve been pursuing a new model of embedding reliability engineers in other teams where their expertise is needed most. I’m personally currently embedded in the Consumer team, which is responsible for Coinbase.com and the Coinbase mobile apps. I’ve enjoyed feeling close to the front lines of product development while still focusing on infrastructure.
Another rewarding aspect of being on the Reliability Team has been turning our work into conference talks. Over the past year I had the chance to speak at MongoDB World and QCon about designing load testing strategies. I had never given a talk before, so this was a great learning opportunity for me and I ended up having a lot of fun doing it.
Working on the Reliability Team is one of the most fun positions at Coinbase because we get to be a part of so many different initiatives and projects across the company. We’ve got a great diversity of expertise on the team. I’ve never learned so much so quickly.
Reliability Engineering and the Future
In the past year, our team has helped all of Coinbase build a culture of reliability in the following ways:
- Moving the entire engineering team from a reactive stance on reliability (firefighting, etc.) to a proactive one (installing smoke detectors) with service level indicators and promises.
- Providing a world-class observability stack built on three pillars: tracing, metrics and logs.
- Designing and implementing high-performance infrastructure services.
We look forward to doing much more over the next year such as:
- Building the serverless foundation to accelerate feature development.
- Helping move to a service oriented architecture by building core infrastructure such as the service mesh.
- Leveling up every single team in terms of performance engineering, quality and incident response.
If any of this sounds interesting to you please head over to our Senior Reliability Engineer job posting here.