by Andrew Turner
June 12, 2019
Reliability has always been one of the primary design considerations at PagerDuty. (We even use PagerDuty at PagerDuty!) But what do we do when the unexpected happens and something does go wrong? It’s of the utmost importance that we are prepared and can get our systems back into full working order as quickly as possible. We pride ourselves on being able to quickly resolve issues that arise and keep our systems working within their SLA. We’ve worked very hard to accomplish this, and our incident response process is where it all begins.
Our internal incident response documentation is something we’ve built up over the last few years as we’ve learned from our mistakes. It details the best practices of our process, from how to prepare new employees for on-call responsibilities to how to handle major incidents, both in preparing for them and in the follow-up work afterward. Few companies seem to talk about their internal processes for dealing with major incidents. It’s sometimes considered taboo to even mention the word “incident” in any sort of communication. We would like to change that.
To that end, we’re happy to announce that we have now open-sourced our incident response documentation for use by the community! Learn from how we prepare for incidents, handle major incidents, and train our engineers to go on-call. It is our hope that others will use the documentation as a starting point to formalize their own processes.
The PagerDuty Incident Response Documentation is a collection of best practices detailing how to efficiently deal with any major incidents that might arise, along with information on how to go on-call effectively. It provides lessons learned the hard way, along with training material for getting you up to speed quickly.
It is intended for on-call practitioners and those involved in an operational incident response process, or those wishing to enact a formal incident response process.
Incident response is something every organization needs to consider in order to deliver the best possible service to its customers. Normally, the knowledge of how to handle incidents within your company is built up over time, getting better with each incident. While tools such as PagerDuty’s Major Incidents Application can help you recover quickly, the process you follow is just as important. This documentation will help you decrease your response time for major incidents by building on the knowledge we’ve developed internally over the years.
It covers everything from preparing to go on-call, through definitions of severities and incident call etiquette, all the way to how to run a post-mortem (we even provide our post-mortem template). We also include our security incident response process.
It’s worth noting this isn’t an exact clone of our internal documentation; some information has been removed or changed, such as our phone bridge numbers, the names of internal tools and systems that are not (yet) open sourced, and images of our dashboards. We have omitted anything that is specific to PagerDuty or that we consider too proprietary to share. The bulk of the useful information is in the principles and process, rather than in the specifics of the tools we use.
The documentation is provided under the Apache License 2.0. In plain English, that means you can use and modify the documentation, both commercially and privately. However, you must include any original copyright notices and the original LICENSE file.
Whether you are a PagerDuty customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account. Feel free to fork the repository and use it as a base for your own internal documentation.
We also encourage you to raise pull requests if you have improvement suggestions.