Turn any signal into insight and action. See how PagerDuty Digital Operations Management Platform integrates machine data and human intelligence to improve visibility and agility across organizations.
Check out the latest capabilities we released.
Flexible schedules, escalations, & alerting
Automated, best practice incident response
Powerful context & noise reduction at scale
Quantify real-time business & technical impact
Improve with modern, prescriptive insights
Over 300 Integrations
Discover DevOps best practices with our library of webinars, whitepapers, reports, and much more.
Learn best practices and get support help with resources from our award-winning support team.
See how PagerDuty works with our live product demo — twice a week, every week.
We've created a maturity model to assist on the journey to digital operations excellence. Take our short assessment to find out where your team falls!
Interactive, simple-to-use API and technical documentation enables users to easily try updates and extend PagerDuty.
Engage with users and PagerDuty experts from our global community of 200k+ users. Become a member, connect, and share insights for success.
Get all your PagerDuty-related questions answered by exploring our in-depth support documentation and community forums.
In part 2 of our postmortem series, we dig into how to establish a culture of continuous learning, from getting leadership on board to invoking...
PagerDuty helps organizations transform their digital operations. Learn more about PagerDuty's mission and what we do.
Meet our experienced and passionate executive team.
We are risk-taking innovators dedicated to delivering amazing products and delighting customers. Join us and do the best work of your career.
With the PagerDuty Foundation, we are committed to doing our part in giving back to the community.
The ITOps world is a harsh working environment where ITOps personnel are expected to minimize the business impact of incidents at all hours of the day—regardless of the impact to themselves or their families. As more companies undergo digital transformation, the number of alerts and interruptions flowing to IT first responders will continue to increase.
This constant and growing pressure to keep business systems running around the clock is leading to higher-than-ever responder burnout, resulting in increased employee attrition and negative impact on the customer experience. In September 2018, we inspected 85,000 services to determine which monitoring systems are generating interrupt notifications (defined as SMS, voice, and push notifications) on each service.
The results? We found that services integrated with AWS have a consistently higher health score for every day through the first 7 months of 2018. On average, AWS integrated services have a higher daily health score by more than 3 points, as shown below.
We also found that AWS integrated services had:
So what is AWS doing to generate less noise and, therefore, less alert fatigue?
Short answer: We can’t definitively answer that. We can only speculate why AWS users experience better-than-average health compared to users of other DevOps tools—for example, there could be general AWS resiliency across both service offerings and instances or EC2 instance auto recovery and the highly available nature of most AWS services enable greater operations efficiency and generate less alerts. What we do know, however, is that based on our data collected from over 10,500 customers in the past decade, we have proven best practices that you can implement in order to attain measurable improvement in all three facets of operations health: people, efficiency, and maturity.
Run a Transient Notifications Analysis
One of the easiest ways to improve operations health is to run an analysis of transient notifications, which are alerts that auto close/auto resolve quickly after they’re generated.
Let’s say you’re an on-call responder who’s been awakened in the middle of the night by an SMS interrupt notification. You groggily acknowledge the event on your phone, then get out of bed and head to your laptop to begin remediation efforts. But the management system has already closed the incident, making it no longer relevant as it shows as closed (resolved). Now you’re grouchy—being woken up by an on-call alert is part of the job but being awakened for something that has already resolved itself is incredibly frustrating, especially if it occurs multiple times a night.
To help prevent such scenarios, you should run transient notification analyses to determine the number of transients that occur in under two minutes on each service. Then, depending on the percentage of transients, we add in a notification buffer of two minutes to absorb those transients while the upstream issue causing those them is being addressed. Any incident that remains open past the two-minute buffer is sent to whoever is on call. Absorbing transients in this manner increases the health of your teams, as well as the overall effectiveness of your operations by eliminating a significant source of false positives.
Humans are good at many things, but attempting to determine the scope of an incident by looking at a table of alerts gathered from a myriad of sources is not one of them.
With alert grouping, two great things happen together:
1) Alerts are automatically associated and grouped into incidents that provide much better situational awareness when compared to doing so manually, and
2) the on-call responder will receive 1 interrupt notification for an incident that includes 50 alerts as opposed to receiving 51 separate notifications for 50 alerts and 1 incident.
Having a consistent taxonomy for your Teams, Schedules, Escalation Policies, and Services is another important best practice. Why? Because properly named services can shave crucial minutes off of incident response times by giving the responder context around what’s broken—making it easier to escalate incidents, bring in more subject matter experts, and, most importantly, decrease the business impact of incidents.
Keep in mind that one of the most important aspects of achieving better operations health is to work on continuous and measurable improvement. There are numerous other best practices you can use to help our customers improve their operations.
To help mitigate the negative effects of on-call life, PagerDuty’s Operations Health Management Service (OHMS) analyzes the health of the organization through a human factors lens by pinpointing the services that are causing degradation of operations health and providing specific actionable recommendations for improving health in a measurable way. To learn more about our Operations Health Management Service, contact us today.
The ITOps world is a harsh working environment where ITOps personnel are expected to minimize the business impact of incidents at all hours of the...
A few weeks ago, I had the pleasure of attending PagerDuty Summit 2018 as Zenoss was a proud partner and sponsor for the conference. It...
600 Townsend St., #200
San Francisco, CA 94103
905 King Street West, Suite 600
Toronto, ON, M6K 3G9, Canada
1416 NW 46th St., St. 301
Seattle, WA 98107
5 Martin Place
1 Fore St,
London EC2Y 9DT
© 2009 - 2019