Open-Sourcing Our Incident Response Documentation

Reliability has always been one of the primary design considerations at PagerDuty. (We even use PagerDuty at PagerDuty!) But what do we do when the unexpected happens and something does go wrong? It’s of the utmost importance that we are prepared and can get our systems back into full working order as quickly as possible. We pride ourselves on being able to quickly resolve issues that arise and keep our systems working within their SLA. We’ve worked very hard to accomplish this, and our incident response process is where it all begins.

Our internal incident response documentation is something we’ve built up over the last few years as we’ve learned from our mistakes. It details the best practices of our process, from how to prepare new employees for on-call responsibilities to how to handle major incidents, covering both the preparation beforehand and the follow-up work afterwards. Few companies seem to talk about their internal processes for dealing with major incidents. It’s sometimes considered taboo to even mention the word “incident” in any sort of communication. We would like to change that.

To that end, we’re happy to announce that we have now open-sourced our incident response documentation for use by the community! Learn from how we prepare for incidents, handle major incidents, and train our engineers to go on-call. It is our hope that others will use the documentation as a starting point to formalize their own processes.

What is it?

The PagerDuty Incident Response Documentation is a collection of best practices detailing how to efficiently deal with any major incidents that might arise, along with information on how to go on-call effectively. It provides lessons learned the hard way, along with training material for getting you up to speed quickly.

Who is it for?

It is intended for on-call practitioners and those involved in an operational incident response process, or those wishing to enact a formal incident response process.

Why do I need it?

Incident response is something every organization needs to consider in order to deliver the best possible service to their own customers. Normally, the knowledge of how to handle incidents within your company is built up over time, getting better with each incident. While tools such as PagerDuty’s Major Incidents Application can help you recover quickly, the process you follow is just as important. This documentation will help you decrease your response time for major incidents by building on the knowledge we’ve internally developed over the years.

What is covered?

It covers everything from preparing to go on-call, definitions of severities, and incident call etiquette, all the way to how to run a post-mortem (we even provide our post-mortem template). Our security incident response process is included as well.

What is missing?

It’s worth noting this isn’t an exact clone of our internal documentation; it has some information removed or changed, such as our phone bridge numbers, names of internal tools and systems which are not (yet) open sourced, and images of our dashboards. We have omitted anything that is specific to PagerDuty or that we consider too proprietary to share. The bulk of the useful information is in the principles and process, rather than in the specifics of the tools we use.

License

The documentation is provided under the Apache License 2.0. In plain English, that means you can use and modify the documentation, both commercially and privately, as long as you include any original copyright notices and the original LICENSE file.

Whether you are a PagerDuty customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account. Feel free to fork the repository and use it as a base for your own internal documentation.

We also encourage you to raise pull requests if you have improvement suggestions.

  • This is great! Love that you open sourced it 🙂

  • Thanks for sharing!

  • Erin Driggers

    This is great, thanks for sharing. I have a couple of questions about how things were implemented in the Pagerduty system. For example, on the Alerting Principles page, it lists 4 priorities – how do those priorities correspond to the Notification Urgency options when setting up a service? How do you map the different roles to a particular incident in the pagerduty application or do you do this outside of the pagerduty app?

    • Erin Driggers

      One more question – in the process it mentions using !ic page in slack – what integration method was used to accomplish this with Slack?

      • Hi Erin,

        how do those priorities correspond to the Notification Urgency options when setting up a service?

        There are a few examples further down the page that show you how we use the Notification Urgency options internally for each alerting priority level. You should be able to use that as a starting point for your own configuration.

        How do you map the different roles to a particular incident in the pagerduty application or do you do this outside of the pagerduty app?

        Only the Incident Commander, Customer Liaison, and SME roles are specifically mapped to the incident in PagerDuty. When we trigger the PagerDuty incident it is assigned to the Incident Commanders and Customer Liaisons who are on-call. The Subject Matter Experts are then added to the incident as Responders.

        The other roles, such as Scribe, are assigned by the Incident Commander at the start of the call, and are not explicitly called out in the PagerDuty application. They are, however, listed in the post-mortem, which records who held each role during the incident.

        it mentions using !ic page in slack – what integration method was used to accomplish this with Slack?

        This is a custom internal bot command we use to page our Incident Commanders. We’re hoping to open-source it soon, but we’re not quite there yet. It essentially just triggers a PagerDuty incident on a service, where our Incident Commanders are the ones on-call for that service.
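For readers who want to build something similar in the meantime, here is a minimal sketch of the "trigger an incident on a service" step. It is not the internal bot: it assumes the public PagerDuty Events API v2 is used, and the function names, the summary text, and the `chat-bot` source label are made up for illustration. The routing key would be the integration key of a service whose on-call responders are your Incident Commanders.

```python
import json
import urllib.request

# Public Events API v2 endpoint. Whether the internal bot uses this API
# is an assumption; the reply above only says it triggers an incident.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build an Events API v2 'trigger' event body (hypothetical helper)."""
    return {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",    # opens a new incident on that service
        "payload": {
            "summary": summary,       # short human-readable description
            "source": source,         # where the event came from
            "severity": severity,     # critical, error, warning or info
        },
    }


def send_event(event, url=EVENTS_API_URL, timeout=10):
    """POST the event as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A chat bot handler for a command like `!ic page` could then call `send_event(build_trigger_event(key, "IC paged from chat", "chat-bot"))`, with the key loaded from the bot's own configuration.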

        Hope this answered your questions! If you would like more information, or would like to continue the discussion, feel free to reach out to support@pagerduty.com at any time.

        Thanks,
        Rich

        • Erin Driggers

          Thank you for the response, Rich.

        • Christian Nuss

          Hey @richadams:disqus ! could you provide some information on how “status stalk” (mentioned in Scribe training) works and behaves? I like the usage you called out in the blog, and would love to see further information. Have you open sourced it possibly?

          • Hi Christian,

            The stalk command polls our internal monitoring tools every 15 seconds. If the status is different than the last one it reported, it will post the new status into our chat room. The status it posts will be an overall assessment of our severity level, and a few key metrics we use to determine our system health.

            It allows us to focus on resolving the incident, rather than having someone refreshing a dashboard every so often to see if things are returning to normal. Once the status posts as “NORMAL”, then we can be confident the incident is over and start to wind down our response.
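The poll-and-report-on-change behaviour described above can be sketched roughly as follows. This is only an illustration, not the internal plugin: `fetch_status` and `post` are hypothetical callables standing in for the monitoring query and the chat room, and the `max_polls` and `sleep` parameters exist purely so the loop can be exercised without waiting.

```python
import time


def status_changes(fetch_status, interval_seconds=15.0, max_polls=None,
                   sleep=time.sleep):
    """Poll fetch_status() and yield a status only when it differs from
    the last one yielded. The first reading is always yielded."""
    last_reported = None
    polls = 0
    while max_polls is None or polls < max_polls:
        current = fetch_status()
        polls += 1
        if current != last_reported:
            last_reported = current
            yield current
        # Wait before the next poll, unless this was the final one.
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)


def stalk(fetch_status, post, **options):
    """Post every status change (hypothetical chat-bot entry point)."""
    for status in status_changes(fetch_status, **options):
        post(status)
```

With a stubbed status source that returns the same value several times in a row, `post` is called once per change rather than once per poll, which is what keeps the chat room quiet until something actually moves.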

            The status plugin isn’t currently open-sourced, although I would definitely like to get it out some time in the future!

            Hope this helped to answer your question! Feel free to reach out to support@pagerduty.com if you have anything else we can help you with.

            Thanks,
            Rich

          • I updated our documentation yesterday to include a new section on ChatOps which explains our chat commands in more detail, along with some screenshots.