by Max Timchenko
March 26, 2019
Monitoring applications and systems is one thing; knowing what to do with all the data being gathered is quite another. Most IT organizations today have deployed multiple types of monitoring systems. Much of the time, the alerts these systems generate represent minor deviations from normal operations that can largely be ignored. When an alarm does signal an impending catastrophic failure, however, most IT organizations don’t have a well-defined set of procedures in place that lets them respond quickly enough to mitigate the customer impact.
The good news is that most modern monitoring tools expose a well-defined set of application programming interfaces (APIs) that make it possible to share data with an IT incident resolution platform. This makes it easier to correlate alarms generated by multiple monitoring systems, group related symptoms, and identify the root cause of an issue, which reduces cognitive load while the IT team assesses and collaborates on the incident. It also lets the team analyze data in a central hub to help ensure that the same issue doesn’t occur again.
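To make the idea of grouping related symptoms concrete, here is a minimal sketch of one possible approach: alerts for the same service that arrive within a short time window are folded into a single incident. The `Alert` and `Incident` types, the `group_alerts` function, and the five-minute window are all assumptions made for illustration; they are not taken from any particular monitoring tool or incident resolution platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    # All fields are hypothetical; real tools expose their own schemas.
    source: str          # which monitoring system raised the alert
    service: str         # the service the alert is about
    summary: str
    timestamp: datetime


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self):
        # Time of the most recent alert attached to this incident.
        return max(a.timestamp for a in self.alerts)


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Fold alerts for the same service into one incident when each
    arrives within `window` of the previous alert for that service."""
    incidents = []
    latest_by_service = {}  # service name -> most recent incident
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_by_service.get(alert.service)
        if current is not None and alert.timestamp - current.last_seen <= window:
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            incidents.append(current)
            latest_by_service[alert.service] = current
    return incidents
```

With this rule, a latency alert from one tool and an error-rate alert from another, both for the same service and two minutes apart, land in one incident instead of paging twice. A real platform would likely correlate on more than service name and time, so treat this only as a starting point.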
In the age of the digital business, any outage or degradation in application performance correlates directly with lost revenue and customer churn. Yet the complexity of IT environments today makes such issues inevitable. In fact, a new survey of IT professionals conducted by Ipswitch, a provider of network monitoring tools, finds that a full 66% feel that increased IT complexity has made it more difficult for them to do their jobs successfully. Another 44% admit they are either not monitoring everything they want to on their networks or simply don’t know if they are.
In the complex world of IT, monitoring applications and systems is indispensable. The first challenge is turning all the data these tools collect into actionable intelligence. After that, the processes that enable IT people to actually act on that intelligence need to be embedded in the “muscle memory” of the IT organization. The tools themselves represent only one-tenth of the IT management equation. The other nine-tenths consist of the people and processes that make investing in the tools worthwhile in the first place.
Unfortunately, whenever there is an issue, most IT organizations try to gather all the affected parties in a “war room” where everyone takes turns trying to prove their innocence. This generally wastes time, pits IT staff against one another, and does little to actually solve the problem at hand. Putting an incident resolution system in place creates a set of structured processes for identifying the root cause of a problem and then resolving it as quickly as possible. In fact, most of the time the issue can be resolved without ever calling a meeting. Far less time is wasted, and far less blame is assigned, when the IT staff follows a set of procedures (for example, embedded runbooks and automated troubleshooting commands) that make it easy to access the right information to address the problem.
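As a rough illustration of what an embedded runbook could look like, the sketch below maps an alert type to an ordered list of steps and falls back to a generic procedure when no specific runbook exists. The alert types, the step wording, and the names `RUNBOOKS`, `DEFAULT_RUNBOOK`, `runbook_for`, and `format_runbook` are all hypothetical; an actual team would draw these from its own procedures.

```python
# Hypothetical alert types and steps, purely for illustration.
RUNBOOKS = {
    "disk_full": [
        "Identify the largest directories on the affected host",
        "Rotate or archive old logs",
        "Confirm free space is back above the alert threshold",
    ],
    "high_latency": [
        "Check recent deploys for the affected service",
        "Compare latency against upstream dependencies",
        "Roll back or scale out if the regression is confirmed",
    ],
}

# Generic fallback used when no specific runbook is defined.
DEFAULT_RUNBOOK = [
    "Acknowledge the alert",
    "Escalate to the service owner",
]


def runbook_for(alert_type):
    """Return the ordered steps for an alert type, or the generic fallback."""
    return RUNBOOKS.get(alert_type, DEFAULT_RUNBOOK)


def format_runbook(alert_type):
    """Render the steps as a numbered list that can be attached to an alert."""
    steps = runbook_for(alert_type)
    return "\n".join(f"{n}. {step}" for n, step in enumerate(steps, start=1))
```

Attaching the output of `format_runbook` to the alert itself means the responder sees the next steps at the moment of the page, rather than having to search for them or convene a meeting.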
Using this approach means most problems will be resolved long before the organization as a whole even realizes there was an issue. After that, it’s entirely up to the IT organization to determine how much it wants to share about what may or may not have occurred on any given day.
Data itself is only one piece of the equation because it’s passive. By adopting best-practice incident resolution, people can equip themselves with the procedures and know-how to actually use that data to fix issues rapidly, instead of running around without direction and pointing fingers. Only then is the real value of IT monitoring realized.
For tried and true best practices on incident response, be sure to check out our free trainings: