This post is part of our three-part series on best practices in communication during critical incidents. Read about outage communication with internal stakeholders and with customers.
You’ve just realized that something has gone critically wrong, and you can’t fix it yourself. Particularly if you work within a collaborative DevOps environment, it’s better to get by with a little help from your friends. Effectively coordinating the incident response across subject matter experts and front-line responders is a secret to operational success that differentiates top teams. So it’s important that you have an effective and efficient way to sound the alarm, and to make sure that your conversations are recorded and actionable.
The first step to effective outage communication within your incident response team is making sure the right people get involved. You should have clear processes in place for identifying subject matter experts, for contacting them, and for bringing them together in a single place. You should also have a designated team for handling external communication, and, if the outage is severe enough, you should loop them in right away so they can get a head start on alerting your customers. A system like PagerDuty can help on both counts by automatically storing on-call schedules and people’s preferred contact methods.
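To make the idea of stored on-call schedules and contact methods concrete, here is a minimal sketch in Python. It is not PagerDuty's API or data model; the `Shift` record, the `who_to_page` helper, and the responder names are all hypothetical, chosen only to illustrate looking up who to contact for a service at a given moment.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Shift:
    """Hypothetical on-call record: who covers which service, when, and how to reach them."""
    service: str
    responder: str
    contact_method: str  # e.g. "sms", "phone", "push"
    start: datetime
    end: datetime


def who_to_page(shifts, service, now):
    """Return (responder, contact_method) pairs on call for `service` at time `now`."""
    return [
        (s.responder, s.contact_method)
        for s in shifts
        if s.service == service and s.start <= now < s.end
    ]


# Illustrative schedule; names and times are made up.
shifts = [
    Shift("artemis", "dana", "sms", datetime(2018, 1, 1, 0), datetime(2018, 1, 1, 12)),
    Shift("artemis", "lee", "phone", datetime(2018, 1, 1, 12), datetime(2018, 1, 2, 0)),
    Shift("apollo", "sam", "push", datetime(2018, 1, 1, 0), datetime(2018, 1, 2, 0)),
]

on_call = who_to_page(shifts, "artemis", datetime(2018, 1, 1, 13, 30))
print(on_call)
```

The point of keeping this lookup in a system rather than in someone's head is that nobody has to remember who covers a service, or how that person prefers to be reached, while the outage is unfolding.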
Secondly, don’t underestimate the importance of documentation. During the course of the incident, the response team will uncover many pieces of information, and make quick decisions about the best way to contain the issue. Documenting in the moment ensures that nothing is forgotten or missed. Fortunately, there are tools and processes that can help here.
First, you’re going to need to talk it out. Create a fixed conference line; no one should be wasting time setting up a bridge manually for every call. Everyone on the team should know the dial-in details, or where to find them. It’s a good idea to include the details in the PagerDuty event or in the service description, so they can be easily accessed when they’re needed. It’s also a good idea to record your conversations in case you want to debug your process.
Phone calls are great for real-time conversations and discussion. But phone calls are flawed: a call only provides low-fidelity data on the outage and on how tasks are being distributed. There is no text to hold people accountable for the decisions made over the course of the conversation, so how can you track your conversations?
The answer is ChatOps. Having a simultaneous discussion in a chat client provides actionable, searchable, time-stamped data of who is doing what, and on what services. And make sure you name your services. Here at PagerDuty, our services are named after Greek deities. This way, our entire team can understand what we’re talking about when we refer to Artemis.
To make incident response even easier, you can connect your tools to your chat client. Pipe in PagerDuty incidents, and use plugins to customize and make the most of your chat service. For example, you can use a chat bot to post server updates to the chat, or you can have Datadog graphs appear in the chat window. You can also issue commands to tools from the chat, and bots can take actions or capture follow-up tasks.
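As a rough illustration of what such a bot does, the sketch below announces an incident in the channel and captures follow-up tasks typed as chat commands. Everything here is an assumption made for the example: the `IncidentBot` class, the `!todo` command syntax, and the incident fields are hypothetical and do not correspond to any particular chat client or to PagerDuty's integrations.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentBot:
    """Hypothetical chat bot: posts incidents and records follow-up tasks."""
    followups: list = field(default_factory=list)

    def announce(self, incident):
        """Format an incident (a plain dict in this sketch) as a one-line chat message."""
        return "[{severity}] {service}: {summary}".format(**incident)

    def handle(self, author, text):
        """Capture '!todo <task>' messages; ignore ordinary conversation."""
        prefix = "!todo "
        if text.startswith(prefix):
            task = text[len(prefix):].strip()
            self.followups.append((author, task))
            return f"Captured follow-up #{len(self.followups)}: {task}"
        return None


bot = IncidentBot()
print(bot.announce({
    "severity": "SEV-2",
    "service": "artemis",
    "summary": "elevated error rate",
}))
print(bot.handle("dana", "!todo add alert on queue depth"))
```

Because the follow-ups are captured in the same place as the discussion, they carry the author and the surrounding context with them instead of living in someone's memory until the incident is over.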
Save your chat transcript in a CMS or in PagerDuty notes, so that it can be cited later. This can be a great teaching instrument for post-resolution learning, and can help your team become more efficient in the future by learning from how they handled issues in the past. And that timestamped, searchable discussion that came in handy when you were solving your incident also makes it far easier to write a post-mortem.
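Here is a small sketch of why a timestamped transcript makes the post-mortem easier: it can be sorted and filtered into a timeline for a given service. The log format (`<ISO timestamp> <author>: <message>`), the sample messages, and the helper names `parse_log` and `timeline` are assumptions for illustration; a real chat export will look different.

```python
import re
from datetime import datetime

# Assumed line format: "2018-01-01T13:02:11Z dana: message text"
LINE = re.compile(r"^(\S+) (\w+): (.*)$")


def parse_log(text):
    """Parse transcript lines into (timestamp, author, message) tuples, sorted by time."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
            entries.append((ts, match.group(2), match.group(3)))
    return sorted(entries)


def timeline(entries, keyword):
    """Return formatted entries whose message mentions `keyword` (case-insensitive)."""
    return [
        f"{ts:%H:%M} {who}: {msg}"
        for ts, who, msg in entries
        if keyword.lower() in msg.lower()
    ]


# Made-up transcript, deliberately out of order.
log = """
2018-01-01T13:05:40Z lee: rolling back artemis to previous build
2018-01-01T13:02:11Z dana: error rate climbing on Artemis
2018-01-01T13:03:00Z sam: apollo looks healthy
"""

entries = parse_log(log)
for row in timeline(entries, "artemis"):
    print(row)
```

This is also where consistent service names pay off: a single keyword is enough to pull every decision about that service out of the transcript.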
One added benefit to ChatOps seems obvious, but is worth mentioning: written communication is generally higher in quality than spoken communication. Your team has more time to organize their thoughts than on a conference call or face-to-face, and they can more easily reference what other team members have said over the course of the conversation to create a clear plan of action.
Effective communication during an incident makes your life easier when you’re training new team members, too. You don’t have to reverse engineer your past experiences into a future plan of action, or a runbook. You’re writing training materials and action plans in real time, ready to use from the moment you’re done documenting and solving the incident.
How does your team communicate outages internally? Let us know in the comments section.
For further reading, check out Best Practices in Outage Communication: Customers.