Writing services is “easy” and “exciting”, running services is “boring”, or so conventional wisdom goes.
There are a million talks and articles about how to build an API using the latest technologies, but much more rarely does anybody come along and excitedly tell us how to improve our on-call experience.
As someone who has been paged at 2:30am on a Sunday morning, I know the frustration of handling a page, opening the runbook, and discovering that you first need to get up to speed on a lot of things before you can even begin to address the issue.
I’ve been in situations where cascading failures triggered a flood of pages, and I had to spend the first two minutes just acknowledging them before I could start investigating, all while the Incident Commander pestered me for answers on things like “customers” and “impact”.
First off…what is a runbook?
IBM helpfully defines a runbook:

> A new set of events has occurred and a couple of skilled operators work out a new procedure to solve the issue that caused these events.
Simply put, a runbook is “If this is broken, try this; if that didn’t work, maybe try this; otherwise, call someone else”.
Services like PagerDuty allow you to link a Runbook to an alert, so that the responding engineer knows where to start.
In addition, PagerDuty provides some good principles on alerting which I can definitely recommend, and as they say:
“Provide clear steps to resolve the problem, or link to a run book. Alerts with neither of these things are useless.”
Alerts without steps to resolve a problem are very rarely useful. If you think “Oh, it won’t happen that often, and I’ll know what to do”, what happens when you’re enjoying dinner with your family, and your phone starts to ring? Or worse, you’ve left the company, and ex-colleagues have to piece it together without your knowledge?
But, how do we make good runbooks? What makes a good runbook?
I’ve seen a lot of runbooks in my time, and I have a set of things that I look for in a runbook, and I try to apply these to the ones I write…
Probably the most important runbook in any organisation is the “Preparing for On-call” guide.
Preparing for On-call
This is a document that walks you, step by step, through the things you need to do to be “on-call” for your organisation.
- How do you setup your “pager”?
- Pagerduty app on your phone?
- Are the contact numbers up-to-date?
- Can you send a test notification?
- If it’s on your phone, make sure your device isn’t silencing notifications overnight.
- Service access
- Do you need privileged access to your company’s infrastructure? Maybe you need to login to AWS or GCP or Azure.
- Are your credentials somewhere you can easily find them? Check that they haven’t expired, and log in to these services before each on-call rotation to ensure your access hasn’t lapsed.
- Network access devices
- Does your company issue you with a MiFi or other device for mobile broadband? Is its battery well charged?
- If it’s your mobile phone, do you have a sufficient data allowance? Otherwise you’ll be stuck at home or running to find bandwidth.
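Many of these access checks can be scripted and run at the start of each rotation. Here is a minimal sketch; the list of tools is a placeholder, so substitute whatever CLIs your runbooks actually rely on:

```shell
# check_tools: report whether each named CLI is installed on this machine.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "MISSING: $tool"
    fi
  done
}

# Example tools only -- list the ones your runbooks actually use.
check_tools aws kubectl psql
```

Running something like this before your rotation (alongside a test login to each cloud console) catches missing tooling and expired credentials before a page does.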
There are four stages in any incident: identifying impact, investigation, remediation, and the retrospective.
- Identifying impact
- What’s the impact of the failures described in this playbook?
- How will this affect users?
- How can we determine which users are affected?
- Is it clear how to measure the state of the service, to know when the issue might have started, and when the incident is over?
- What other services will be impacted by the failures?
The first thing we generally need to assess is impact. Is this an “incident worthy” page? Do we need to spin up our incident response process, or is this something that we’ve caught early enough that we should be able to bring it under control?
Start with links to the relevant dashboards and identify which metrics are important; this makes it easy to gauge how long the issue has been happening, what the “normal” range for the metric is, and when it has returned to “normal”.
How does it impact users? Will they see delays or errors? Will they be automatically retried? Will users have to do something manually to recover? Will customers lose data?
If an incident is opened, these are the things that the incident commander will need in order to communicate effectively with customers and stakeholders.
- Is there an index of handled scenarios, with the likely errors / messages findable in the text?
- Are there any specific errors logged that might help diagnose the issues in the runbook?
- Are the dashboards up-to-date? Are relevant metrics linked to in the runbook?
- Are dependencies of the service called out? Are you reliant on another service / external service? If they fail, how can we determine this easily?
- Do you send errors to an external service e.g. Rollbar or Sentry? Provide a link! Do responders need access? How can they get access?
- Is the source repository for the code linked to?
When we as developers get errors we don’t understand, our first port of call is to copy and paste them into Google; more often than not, someone else has had the same thing happen, and with a bit of luck there will be a pointer to the cause.
Think of your runbook as the “top-hit” for the issues that might affect your service, how can you ensure that responders find this?
By definition, distributed systems require communication with other services, either internal or external to our organisation. I know that when several of the services I was on the hook for failed, a big reason was that an upstream service was having problems. Sure, we can put in things like circuit-breakers and bulkheads, but if your only source of truth for something you need is upstream, especially outside of your organisation, then your service can be “dead in the water”. How could someone work out that an upstream is failing?
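One lightweight way to make upstream health checkable from a runbook is a small probe the responder can run. This is a sketch under assumptions: the `/healthz` path and the example URL are hypothetical, and your upstream may expose a status page instead.

```shell
# check_upstream: probe an upstream's health endpoint and report up/down.
# The "/healthz" path is an assumption -- use whatever your upstream exposes.
check_upstream() {
  if curl -fsS --max-time 5 "$1/healthz" >/dev/null 2>&1; then
    echo "up"
  else
    echo "down"
  fi
}

# Example: nothing listens on this port, so the probe reports "down".
check_upstream "http://127.0.0.1:9"
```

Linking a probe like this (or the upstream’s status page) from the runbook means the responder can rule an upstream in or out in seconds.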
The worst time to discover that you need an account to access the errors your service is encountering is when you get paged. Make these requirements explicit, ensure that everybody has access to these tools, and standardise the tooling to reduce the cost of responding to a page.
The source code repository for the service is useful. It’s unlikely that you’ll have time to read the entire source to determine how the service is failing, but being able to see where errors are raised in context is valuable; going from a traceback to the code in context helps a lot in understanding failures.
During investigation, capture graphs and logs into the incident channel or somewhere they can be found later.
Simply screengrabbing the graph you’re basing your decision on, or copy-pasting the log entry you believe to be significant, into your incident channel can help later: it helps others understand the decisions that were made, and will lead to making the playbook better.
Hopefully by now you know what’s wrong, how do you get the service back up and running?
- Do the commands assume any plugins / privileges / permissions? i.e. do they rely on non-standard tools, metrics, or access?
- Are you using any “personal” aliases in the commands that other folks may not have?
- Are the hosts mentioned in the runbook up-to-date? Are you relying on ids which can change?
- For commands that are recommended, are the dangers (if any) of executing the commands spelled out?
- What’s the escalation policy? How would the responder know when they should call in someone else to help?
Putting commands into documents can be really useful, especially during investigation and remediation. But it’s easy to use things that require privileges, or tools that may not be installed on every machine, and it’s disconcerting to copy and paste a command and get a “command not found” message.
Make sure that the hosts / service names are up-to-date. Sure, as our service adds more features and is no longer just the “biller”, it’s sensible to rename it to “biller-and-reporter” (actually, this is a good reason to prefer codenames), but if your documentation still refers to “biller”, don’t assume that everybody will know that it was renamed last year…
I’ve seen playbooks that tell the responder to execute a command, but fail to mention that executing it will have a very big impact; yes, it will make things better, and it may be “the last resort”, but at a cost.
Make sure this cost is spelled out. If a command will truncate a queue or a database, or lose data, sometimes that’s necessary, but make sure that the responder knows it will happen and can understand the tradeoffs.
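One pattern that helps is to put the prerequisite check and the warning into the runbook command itself. A minimal sketch, assuming a hypothetical `queue-admin` CLI and queue name:

```shell
# purge_billing_queue: drain the (hypothetical) billing retry queue.
purge_billing_queue() {
  # Prerequisite: the queue-admin CLI and production credentials.
  if ! command -v queue-admin >/dev/null 2>&1; then
    echo "queue-admin not installed -- see the on-call preparation guide" >&2
    return 1
  fi
  # DANGER: this permanently deletes all pending jobs on the queue.
  # Only run it if reprocessing from the upstream source is acceptable.
  queue-admin purge --queue billing-retries
}
```

This way the responder sees the cost in the same place as the command, and a missing tool fails with a pointer to the preparation guide rather than a bare “command not found”.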
Runbooks are mostly driven by failures; retrospectives should highlight when a runbook needs to be written (or updated). This means that there are often cases that have never been seen before, where despite the best efforts it’s not clear what to do next, so ensure that the responder has a clear escalation path.
During remediation, capture commands and output somewhere they can be reviewed later.
This can be harder than it looks, but, as with during investigation, the commands can form the basis for improvements to a runbook, and the output can be used to improve future investigations.
Post-incident analysis of what went right and wrong, and how things can be improved, is essential; following up on these items is crucial to avoiding burnout.
Hopefully, the graphs and logs and commands and output you gathered during the investigation and remediation phases can feed into the retrospective.
“Why did you shutdown the left-hand engine?” “Because this graph indicated that the left-hand engine was on fire!”
We don’t have flight-recorders, but these will provide the basis for the post-incident response.
At Heroku, we followed this process more-or-less, with varying degrees of success.
I’ll not go into too much detail here, there are lots of good guides to incident retrospectives (post-mortems) a couple of my favourites are:
Suffice to say, you should be looking for things that you can do, and, crucially, implementing them and taking responsibility for them. Major system rearchitectures are rarely the solution to an incident, but circuit-breakers, retries, and other resiliency techniques can go a long way to making things better. Sometimes a service is just designed in a non-resilient manner, and depending on the scale of the incident it might be the case that you need to urgently rework it; the retrospective is not the time to decide this, but it can certainly provide an impetus.
I remember during one incident, I went to add a remediation item to a Trello board, and I found one from the day before, so I assumed someone had beaten me to it.
In actual fact, the same thing had bitten another team, and contributed to another incident exactly 12 months before, but it had never been addressed…don’t let that happen to you.
Runbooks are a key part of your devops maturity. They enable more folks to respond to pages, which in turn reduces the load on developers. On-call and being paged is a stressful situation, and it can be a bit overwhelming for some folks; the principles I’ve outlined should help reduce that, and improve transparency and communication with those that matter.
I can recommend watching https://www.youtube.com/watch?v=PaF3sPpGBc8 for some excellent thinking on reducing on-call burden for engineers.
Let me also recommend Lex Neva’s http://sreweekly.com/ for a weekly dose of other people’s failures, and what they are doing about them.
Thanks must go to my former colleagues at Heroku who reviewed this, or the original principles that it’s based on, especially Yannick (@yann_ck) for suggesting “Playbook 0” about preparing for on-call.