Have you ever caught a ticket that you just couldn’t figure out? You spend hours on Google, slowly reading the entirety of Stack Overflow, while occasionally banging your face against the desk. By the fourth hour, solving the problem becomes a matter of pride. Productivity be damned! It’s times like these when a process for effective incident management can save your sanity.
Don’t get me wrong — I understand the desire to solve a problem without involving anyone else. Whether it comes from hubris, shame, or just an honest desire not to bother anyone, it happens to me all the time. Problem-solving is an unnatural obsession of mine, but when it comes to the health of my projects, I’ve found that following a pre-defined process makes everyone’s life easier.
Some problems are real, some aren’t. Not every issue is mission critical, so when a ticket hits your desk, the first step should be deciding where in your team’s priority order it belongs. It needs to find its place among the other bugs, chores, and stories that you and the rest of your team are managing. Write up a detailed impact report, and then consult any relevant project managers to help guide your decision.
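To make the prioritization step concrete, here is a minimal sketch in Python of ranking incoming tickets by impact. Everything in it is an assumption for illustration: the `Ticket` fields, the weights in `impact_score`, and the sample tickets are all hypothetical, and your team’s impact report will likely weigh different factors.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # All fields are hypothetical stand-ins for what an impact report might capture.
    title: str
    users_affected: int
    blocks_core_feature: bool
    has_workaround: bool


def impact_score(ticket: Ticket) -> int:
    """Return a rough impact score; the weights are arbitrary assumptions."""
    # Cap the user count so one huge number cannot drown out the other factors.
    score = min(ticket.users_affected, 1000) // 100
    if ticket.blocks_core_feature:
        score += 10
    if not ticket.has_workaround:
        score += 5
    return score


def triage(tickets: list[Ticket]) -> list[Ticket]:
    """Order tickets from highest to lowest impact."""
    return sorted(tickets, key=impact_score, reverse=True)


backlog = [
    Ticket("Typo on settings page", 20, False, True),
    Ticket("Checkout crashes on submit", 800, True, False),
    Ticket("Slow report export", 150, False, True),
]

ordered = triage(backlog)
for ticket in ordered:
    print(impact_score(ticket), ticket.title)
```

A score like this is only a starting point for the conversation with your project managers, not a replacement for it.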
A reproducible bug is a fixable bug. Once a prioritized issue has reached the top of your queue, the next step should be compiling steps to reproduce it. Are users inadvertently triggering a crash? Maybe it’s a memory issue or a storage issue. The important thing to remember is that, for now, all you are trying to do is understand how to replicate the problem, not fix it. Once you can reproduce it (or have learned that it isn’t easily reproducible), the work of fixing it can begin.
Once you are able to reproduce the issue, the next step is to identify the appropriate subject matter expert to hand it off to (hint: it may be you). Knowing whom to tap might be difficult, depending on the nature of the issue, but a good rule of thumb is to ask the person who last worked on that particular feature. Regardless of whom you escalate the issue to, be sure to include a thoroughly detailed report of everything you’ve learned so far. They’ll thank you for it.
So an issue has been run down a bit and dropped in your queue. The next step is investigating the problem. This is the point where you follow the reproduction steps, gather logs, question other subject matter experts, identify possible problems, and test your solutions. Lather, rinse, and repeat until you know exactly what is happening and why.
At this point, you know what the problem is, how to reproduce it, and exactly what is causing it. You’ve identified the root cause and have a tested and working fix. While it’s obvious that the next step is to deploy your fix, you can’t stop there. After the problem has been resolved and everything is stable, you need to notify all affected parties that the issue has been fixed. It’s also important to disseminate the details of the solution to relevant subject matter experts and, if necessary, hold a post-mortem to ensure that everybody understands what happened and how it was resolved.
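The process described above can be sketched as a small state machine that refuses to move an incident forward until you have written down what you learned. This is only an illustration under stated assumptions: the stage names are my own labels for the steps in this post, and the `Incident` class and its `advance` method are hypothetical, not part of any real tool.

```python
from enum import Enum


class Stage(Enum):
    # Labels are assumptions summarizing the steps described in this post.
    TRIAGE = 1
    REPRODUCE = 2
    ESCALATE = 3
    INVESTIGATE = 4
    RESOLVE = 5


class Incident:
    """Hypothetical record that walks an issue through each stage in order."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.TRIAGE
        self.notes: list[tuple[Stage, str]] = []

    def advance(self, note: str) -> None:
        """Record what was learned in the current stage, then move to the next."""
        if not note.strip():
            raise ValueError("record what you learned before moving on")
        if self.stage is Stage.RESOLVE:
            raise ValueError("incident is already in its final stage")
        self.notes.append((self.stage, note))
        self.stage = Stage(self.stage.value + 1)


incident = Incident("Checkout crashes on submit")
incident.advance("High impact: blocks purchases, no workaround")
incident.advance("Crash occurs when the cart holds more than 50 items")
print(incident.stage.name)
```

Keeping the notes attached to each stage means the detailed report you hand off, and later share in a post-mortem, is already written by the time you need it.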
Effective incident management, like all things done correctly, relies on an established process and proper communication. While the actual steps taken may change from project to project, the teams that most successfully mitigate issues communicate well and have a plan in place before they ever need it.