The on-call engineer has a critical role to play in incident management. They can mean the difference between an incident turning critical and one that is managed and resolved quickly.
Startups may not have many choices about who goes on call, but as an organization grows, incident management becomes more complex and the stakes rise, so a structured on-call process becomes important. Whether you’re a startup or an enterprise, you can benefit from a clear process that equips your on-call engineer to succeed. Here are a few guidelines.
In the first few minutes of an incident, the on-call engineer needs to know its severity and service impact. Based on that, they need to gauge which downstream services have been affected, who is needed to resolve the incident, and how to bring those people in quickly. This requires a working knowledge of how the system functions, so that when something breaks, they can identify the root cause and decide what to work on first.
The on-call rotation should be scheduled automatically. This way, the load is shared, the team optimizes for fairness and accountability, and everyone stays in practice at handling incidents. Larger teams may have dedicated incident managers who initiate the first response. In either case, if the on-call engineer can’t troubleshoot and fix an incident themselves, their primary goal is to loop in the resources needed to resolve it.
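As a rough illustration of an automatically scheduled rotation, the sketch below assigns one primary engineer per week in round-robin order. The function name `build_rotation`, the team names, and the weekly shift length are all assumptions made for this example; it is not how PagerDuty or any particular scheduling tool implements rotations.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Assign one primary on-call engineer per week, round-robin.

    Hypothetical helper for illustration only: a real scheduler would
    also handle time zones, overrides, and vacations.
    """
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        # Cycle through the team so the load is shared evenly.
        schedule.append((shift_start, engineers[week % len(engineers)]))
    return schedule


team = ["alice", "bob", "carol"]  # assumed team members
rotation = build_rotation(team, date(2018, 10, 1), 6)

for shift_start, engineer in rotation:
    print(shift_start.isoformat(), engineer)
```

Because the assignment depends only on the week index, every engineer takes the same number of shifts over a full cycle, which is the fairness property described above.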
You should have a secondary (and probably a tertiary) on-call engineer as backup. This ensures that nothing falls through the cracks if the first-level responder sleeps through the 3 a.m. page. It also means that there needs to be a schedule for rotating roles within the team. Set up automated rules so that the incident notification is escalated to the backup engineer if there’s no response from the primary engineer.
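The escalation rule described here can be sketched as a small lookup: given how long an incident has gone unacknowledged, decide who should be paged. The `EscalationLevel` class, the `responder_to_page` function, and the timeout values are hypothetical and chosen only for illustration; real tools expose their own configuration for this.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    responder: str
    timeout_minutes: int  # how long to wait for an acknowledgement


def responder_to_page(levels, minutes_unacknowledged):
    """Return who should currently be paged for an unacknowledged incident."""
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Every timeout has expired: keep paging the last level.
    return levels[-1].responder


# Assumed policy: timeouts are illustrative, not recommendations.
policy = [
    EscalationLevel("primary", 5),
    EscalationLevel("secondary", 10),
    EscalationLevel("tertiary", 15),
]

print(responder_to_page(policy, 0))
print(responder_to_page(policy, 7))
print(responder_to_page(policy, 20))
```

With this policy, the primary is paged for the first 5 minutes, the secondary for the next 10, and the tertiary after that.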
Since there’s a lot at stake when an incident occurs, your on-call engineer needs to be able to follow protocol as well as think on the go. He or she needs to understand how to get in touch with cross-functional stakeholders (such as customer support, marketing, and PR) so that remediation status can be communicated externally in an appropriate manner. It is also useful to hand the on-call engineer a checklist or flowchart to follow when incidents occur.
Because every minute of downtime can mean thousands of dollars lost, the on-call engineer needs to work through the following steps as quickly as possible during an incident:
Identify & Log
The first step is to identify or detect the incident and log it. Logging can help you get to the root cause of the issue quickly and provides context for a comprehensive post-mortem once the incident is resolved. Since it’s important to respond quickly, identifying and logging must be done quickly yet methodically so that you can move on to the next step.
Categorize & Prioritize
Due to the vast variety of problems that a team can encounter, it is important to categorize incidents to prevent confusion. Note the number of users affected, the “blast radius” of the issue with respect to affected services, the potential revenue impact, and so on. Prioritizing incidents can help the on-call engineer make a call on whether the incident requires the time and resources of the rest of the team. Minor, less complex incidents should be handled by the engineer alone if possible to save the entire team’s time. Non-actionable alerts should also be suppressed, to further ensure that on-call engineers can focus on what matters.
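One way to picture this triage step is a small function that suppresses non-actionable alerts, keeps minor incidents with the on-call engineer, and pulls in the team for wide-impact ones. The `Incident` fields, the `classify` function, and the thresholds below are assumptions for illustration; each team would set its own categories and cutoffs.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    users_affected: int
    services_affected: int  # a rough measure of the "blast radius"
    actionable: bool = True


def classify(incident, user_threshold=1000, service_threshold=3):
    """Decide how to route an incident (thresholds are hypothetical)."""
    if not incident.actionable:
        return "suppress"
    if (incident.users_affected >= user_threshold
            or incident.services_affected >= service_threshold):
        return "page-team"
    return "handle-alone"


incidents = [
    Incident("Disk usage at 60%", 0, 1, actionable=False),
    Incident("Slow report export", 40, 1),
    Incident("Checkout API errors", 5000, 4),
]

for incident in incidents:
    print(incident.title, "->", classify(incident))
```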
Notify the Right People
Platforms like PagerDuty, with built-in ChatOps and collaboration integrations, help recruit the relevant people and bring them together in the right place at the right time. In particular, using dedicated ChatOps channels or rooms, shared video calls and conferencing, and fixing issues in context can make a big difference in the speed of resolution and the level of business impact. When communicating with team members, describe the incident concisely to save both yourself and others time. Teams can get distracted by alert overload, so a solution like PagerDuty is important for suppressing the noise and surfacing the signal.
Troubleshooting doesn’t have to wait until the whole team is notified and present. Even while waiting for responses, it is vital that first responders like the on-call engineer be able to troubleshoot on the go. A rapid response can be a lifesaver, much as with real-life emergency services, where the first few minutes are incredibly important.
Managing and equipping on-call resources is crucial to the success of any development or operations team. Having sufficient backups and well-thought-out processes and plans in place ensures efficiency when things go south. If on-call engineers follow the basic steps outlined above, teams can spend more time creating and innovating, and less time fixing.