The on-call engineer plays a critical role in incident management. They can mean the difference between an incident turning critical and one that is managed and resolved quickly.
Startups may not have many choices about who should be on call, but as the organization grows and incident management becomes more complex, with higher stakes, it’s important to have a structured process for the on-call engineer. Whether you’re a startup or an enterprise, you can benefit from a clear process that equips your on-call engineer to succeed. Here are a few guidelines.
First response is critical
In the first few minutes of an incident, the on-call engineer needs to know its severity and service impact. Based on that, he or she needs to gauge which downstream services have been affected, who is needed to resolve the incident, and how to bring those people in quickly. This requires a working knowledge of how the system functions, so that when something breaks, the engineer can identify the root cause and decide what to work on first.
The on-call rotation should be scheduled automatically. This way, the load is shared, the team optimizes for fairness and accountability, and everyone gets practice handling incidents and doesn’t lose their touch. Larger teams may have dedicated incident managers who can initiate the first response. In either case, the primary goal of the on-call engineer is to loop in the resources needed to resolve an incident if they can’t troubleshoot and fix it themselves.
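As a rough illustration of an automatically scheduled rotation, here is a minimal Python sketch that assigns one engineer per week in round-robin order. The function names, the weekly shift length, and the example names are assumptions made for illustration, not features of any particular scheduling tool.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(engineers, start, weeks):
    """Assign one engineer per week, round-robin, beginning on `start`.

    Hypothetical helper: weekly shifts are an assumption for this sketch.
    """
    if not engineers:
        raise ValueError("need at least one engineer")
    shifts = []
    for i, name in enumerate(islice(cycle(engineers), weeks)):
        shifts.append((start + timedelta(weeks=i), name))
    return shifts


def on_call_for(shifts, day):
    """Return the engineer whose weekly shift covers `day`, or None."""
    for shift_start, name in shifts:
        if shift_start <= day < shift_start + timedelta(weeks=1):
            return name
    return None


if __name__ == "__main__":
    rotation = build_rotation(["ana", "ben", "chi"], date(2024, 1, 1), weeks=4)
    for shift_start, name in rotation:
        print(shift_start.isoformat(), name)
```

Because the order simply cycles through the list, every engineer takes a turn before anyone repeats, which is one simple way to share the load fairly.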
Have a secondary on-call engineer
You should have a secondary (and probably even tertiary, etc.) on-call engineer as backup. This ensures that nothing falls through the cracks should the first-level responder sleep through the 3am page. This also means that there needs to be a schedule for rotation of roles within the team. Set up automated rules so that the incident notification gets escalated to the backup engineer if there’s no response from the primary engineer.
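The escalation rule described above can be sketched in a few lines of Python. This is an illustrative model only: the EscalationLevel class, the who_to_page function, and the timeout values are hypothetical, and real paging platforms expose their own configuration for this.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    responder: str
    timeout_minutes: int  # how long to wait for an acknowledgement


def who_to_page(levels, minutes_since_alert, acknowledged=False):
    """Return who should currently hold the page, or None once acknowledged."""
    if not levels:
        raise ValueError("need at least one escalation level")
    if acknowledged:
        return None
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_since_alert < elapsed:
            return level.responder
    # Every level has timed out: keep paging the last responder in the chain.
    return levels[-1].responder


if __name__ == "__main__":
    # Assumed timeouts, purely for illustration.
    policy = [
        EscalationLevel("primary", 5),
        EscalationLevel("secondary", 10),
        EscalationLevel("tertiary", 15),
    ]
    for minute in (0, 5, 14, 15, 60):
        print(minute, who_to_page(policy, minute))
```

With these assumed timeouts, the page moves to the secondary engineer after 5 unacknowledged minutes and to the tertiary engineer after 15.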
Ensure your on-call engineer has the required training
Since there’s a lot at stake when an incident occurs, your on-call engineer needs to be able to follow protocol as well as think on the go. He or she needs to understand how to get in touch with different cross-functional stakeholders (from customer support, marketing, PR, etc.) so that remediation status can be communicated externally in an appropriate manner. It is also useful to hand the on-call engineer a checklist or flowchart to follow when incidents occur.
As every minute of downtime can mean thousands of dollars lost, here are the steps an on-call engineer needs to take during an incident as quickly as possible:
Identify & Log
The first step is to identify or detect the incident and start logging what you observe and what you do. Logging can help you get to the root cause of the issue quickly, and it provides context for a comprehensive post-mortem once the incident is resolved. Because a fast response matters, identification and logging must be done quickly and methodically so that you can move on to the next step.
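One lightweight way to keep such a log is a timestamped timeline that can be exported for the post-mortem. The sketch below uses only the Python standard library; the log_event and export_timeline helpers and the entry fields are hypothetical choices, not a prescribed format.

```python
import json
from datetime import datetime, timezone


def log_event(timeline, message, now=None):
    """Append a timestamped entry to the incident timeline and return it."""
    now = now or datetime.now(timezone.utc)
    entry = {"at": now.isoformat(), "message": message}
    timeline.append(entry)
    return entry


def export_timeline(timeline):
    """Serialize the timeline as one JSON object per line for the post-mortem."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in timeline)


if __name__ == "__main__":
    timeline = []
    log_event(timeline, "Alert received: checkout error rate elevated")
    log_event(timeline, "Rolled back latest deploy")
    print(export_timeline(timeline))
```

Recording entries in UTC avoids confusion when responders are spread across time zones.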
Categorize & Prioritize
Due to the vast variety of problems that a team can encounter, it is important to categorize incidents to prevent confusion. Note the number of users affected, the “blast radius” of the issue with respect to affected services, the potential revenue impact, and so on. Prioritizing incidents can help the on-call engineer make a call on whether the incident requires the time and resources of the rest of the team. Minor, less complex incidents should be handled by the engineer alone if possible to save the entire team’s time. Non-actionable alerts should also be suppressed, to further ensure that on-call engineers can focus on what matters.
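To make the categorization concrete, here is a minimal Python sketch that maps the factors above (users affected, blast radius, revenue impact) to a severity level and suppresses non-actionable alerts. The thresholds, the severity labels, and the Incident fields are all assumptions for illustration; each team would tune its own.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    users_affected: int
    services_affected: int  # rough "blast radius"
    revenue_at_risk: float  # estimated loss per hour
    actionable: bool = True


def classify(incident):
    """Return 'suppressed', 'sev1', 'sev2', or 'sev3' (hypothetical thresholds)."""
    if not incident.actionable:
        return "suppressed"
    if (incident.users_affected >= 1000
            or incident.services_affected >= 3
            or incident.revenue_at_risk >= 10_000):
        return "sev1"
    if (incident.users_affected >= 100
            or incident.services_affected >= 2
            or incident.revenue_at_risk >= 1_000):
        return "sev2"
    return "sev3"


def needs_whole_team(incident):
    """Only the most severe incidents pull in the rest of the team."""
    return classify(incident) == "sev1"


if __name__ == "__main__":
    outage = Incident("checkout down", 5000, 1, 0.0)
    print(classify(outage), needs_whole_team(outage))
```

Under these assumed thresholds, anything below the top severity stays with the on-call engineer, which matches the goal of saving the rest of the team’s time.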
Notify the Right People
Platforms like PagerDuty, with built-in ChatOps and collaboration integrations, are a best practice for recruiting the relevant people and bringing them together in the right place at the right time. In particular, using dedicated ChatOps channels or rooms, holding shared video calls, and fixing issues in context can make a big difference in the speed of resolution and the level of business impact. When communicating with team members, it’s also important to describe the incident concisely to save both yourself and others time. Teams can get distracted by alert overload, and a solution like PagerDuty is imperative to suppress the noise and surface the signal.
Troubleshooting doesn’t have to wait until the whole team is notified and present. Even while waiting for responses, it is vital that first responders like the on-call engineer be able to troubleshoot on the go. Rapid responses can be a lifesaver, much as with real-life emergency services, where the first few minutes are incredibly important.
Managing and equipping on-call resources is crucial to the success of any development or operations team. Having sufficient backups and well-thought-out processes and plans in place ensures efficiency when things go south. If on-call engineers follow the basic steps outlined above, teams can spend more time creating and innovating, and less time fixing.