This is a guest post by Ilan Rabinovitch, Director of Product Management at Datadog.
August 24, 2017
Have you ever caught a ticket that you just couldn’t figure out? You spend hours on Google, slowly reading the entirety of Stack Overflow, while occasionally banging your face against the desk. By the fourth hour, solving the problem becomes a matter of pride. Productivity be damned! It’s times like these when a process for effective incident management can save your sanity.
Don’t get me wrong — I understand the desire to solve a problem without involving anyone else. Whether it comes from hubris, shame, or just an honest desire not to bother anyone, it happens to me all the time. Problem-solving is an unnatural obsession of mine, but when it comes to the health of my projects, I’ve found that following a pre-defined process makes everyone’s life easier.
Some problems are real, some aren’t. Not every issue is mission critical, so when a ticket hits your desk, the first step should be deciding where in the priority stack it belongs. It needs to find its place among the other bugs, chores, and stories that you and the rest of your team are managing. Write up a detailed impact report, then consult any relevant project managers to help guide your decision.
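One way to make that ordering explicit is a rough impact score. The sketch below is purely illustrative: the `Ticket` fields, the weights, and the sample backlog are all made up for the example, and a real team would tune them (or skip the numbers entirely and just talk it through).

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    title: str
    users_affected: int     # hypothetical estimate taken from the impact report
    blocks_core_flow: bool  # does it break something users depend on?
    has_workaround: bool    # can users get by until it is fixed?


def impact_score(ticket: Ticket) -> int:
    """Made-up scoring rule: a higher score means more urgent."""
    score = min(ticket.users_affected, 1000) // 10  # at most 100 points
    if ticket.blocks_core_flow:
        score += 50
    if ticket.has_workaround:
        score -= 25
    return score


def triage(tickets: list[Ticket]) -> list[Ticket]:
    """Return the tickets ordered from most to least urgent."""
    return sorted(tickets, key=impact_score, reverse=True)


# Invented backlog, only to show the ordering.
backlog = [
    Ticket("Typo on settings page", 40, blocks_core_flow=False, has_workaround=True),
    Ticket("Checkout crashes on submit", 800, blocks_core_flow=True, has_workaround=False),
    Ticket("Slow report export", 200, blocks_core_flow=False, has_workaround=True),
]

for ticket in triage(backlog):
    print(impact_score(ticket), ticket.title)
```

The score is only a conversation starter for the discussion with project managers, not a replacement for it.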
A reproducible bug is a fixable bug. Once a prioritized issue has reached the top of your queue, the next step should be compiling steps to reproduce it. Are users inadvertently triggering a crash? Maybe it’s a memory or storage issue. The important thing to remember is that, for now, all you are trying to do is understand how to replicate the problem, not fix it. Once you can reproduce it (or have learned that it isn’t easily reproducible), you know enough to move it toward a fix.
Once you are able to reproduce the issue, the next step is to identify the appropriate subject matter expert to hand it off to (hint: it may be you). Knowing who to tap might be difficult, depending on the nature of the issue, but a good rule of thumb is to ask the person who last worked on that particular feature. Regardless of who you escalate the issue to, be sure to include a thoroughly detailed report of everything you’ve learned so far. They’ll thank you for it.
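If the project lives in Git, the "who last worked on it" question can often be answered from the history. Here is a minimal sketch, assuming `git` is installed and the feature maps to a known file; the helper names and the example path are hypothetical.

```python
import subprocess


def git_log_command(path: str) -> list[str]:
    """Build a `git log` call that prints the author of the latest commit touching `path`."""
    # -1 limits output to one commit; %an and %ae are the author name and email.
    return ["git", "log", "-1", "--format=%an <%ae>", "--", path]


def last_author(path: str, repo_dir: str = ".") -> str:
    """Return 'Name <email>' for the most recent commit touching `path`.

    Returns an empty string when no commit touches it. Raises
    subprocess.CalledProcessError if git itself fails, for example when
    `repo_dir` is not a repository.
    """
    result = subprocess.run(
        git_log_command(path),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example call (hypothetical path, run from inside a repository):
#   print(last_author("src/checkout.py"))
```

The most recent committer is only a starting point: they may have touched the file for an unrelated change, so treat the name as someone to ask rather than someone to assign.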
So an issue has been run down a bit and dropped in your queue. The next step is investigating the problem. This is the point where you follow the reproduction steps, gather logs, question other subject matter experts, identify possible problems, and test your solutions. Lather, rinse, and repeat until you know exactly what is happening and why.
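When gathering logs, it usually helps to pull out each error along with the lines that led up to it. The snippet below is a small sketch of that idea; the log format, the `ERROR` marker, and the sample lines are invented for illustration.

```python
from collections import deque


def errors_with_context(lines, before=2, marker="ERROR"):
    """Yield (context, error_line) pairs.

    `context` holds up to `before` lines that came immediately ahead of
    each line containing `marker`.
    """
    recent = deque(maxlen=before)  # rolling window of the latest lines
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            yield list(recent), line
        recent.append(line)


# Invented sample; in practice you would pass an open log file instead.
sample = [
    "2017-08-24T10:00:01 INFO request started id=42",
    "2017-08-24T10:00:02 INFO loading cart id=42",
    "2017-08-24T10:00:03 WARN cart empty id=42",
    "2017-08-24T10:00:04 ERROR checkout failed id=42",
]

for context, error in errors_with_context(sample):
    print("\n".join(context))
    print(">>>", error)
```

Attaching excerpts like these to the ticket saves the next person from repeating the same search.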
At this point, you know what the problem is, how to reproduce it, and exactly what is causing it. You’ve identified the root cause and have a tested and working fix. While it’s obvious that the next step is to deploy your fix, you can’t stop there. After the problem has been resolved and everything is stable, you need to notify all affected parties that the issue has been fixed. It’s also important to disseminate the details of the solution to relevant subject matter experts and, if necessary, hold a post-mortem to ensure that everybody understands what happened and how it was resolved.
Effective incident management, like all things done correctly, relies on an established process and proper communication. While the actual steps may change from project to project, the teams that most successfully mitigate issues communicate well and have a plan in place before they ever need it.