PagerDuty Blog


New Updates to Advanced Analytics

We’re pleased to announce improvements to our reporting capabilities that enable teams to gain even greater insight. Now, teams can optimize their monitoring by visualizing metrics such as common incidents, SLA performance, and noisy incidents.


In Alerting, Announcements, Features, ITOps & Modern Ops, Product


#HugOps in Practice: Empathy Skills for DevOps

We think we’re doing the whole DevOps thing right — new hires can deploy on day one, Travis CI is humming along, and we own the code we ship. But then something breaks, something doesn’t go according to plan, tempers flare up, and all that warm, fuzzy collaboration seems to evaporate. What’s going on? What happened to #HugOps?


In DevOps, HumanOps


PagerDuty + Opsmatic = Faster incident resolution

Opsmatic provides real-time visibility of any change to the live state of your infrastructure and intelligently alerts you before trouble begins. The recent addition of Assertions gives you a precise way to check and enforce policy across all your hosts. It’s only natural that Opsmatic has partnered with PagerDuty to ensure flawless alerting and effective incident collaboration. PagerDuty’s operations performance platform ensures that the right people on your team get alerted and can resolve incidents before they become emergencies.


In Announcements, Community, Partnerships


Lessons from Virtuoso: Three Steps You Can Take to Reduce Alert Volume by up to 94% in Three Weeks

We recently sat down with Shawn Motley, Senior DevOps Engineer at Virtuoso, to talk about his experiences with PagerDuty and the Event Enrichment Platform (EEP). Virtuoso is a travel portal for high-end clients, with over 200 employees and 8 web properties. When Virtuoso began focusing on their DevOps initiative 7 months ago, they were receiving thousands of events every 24 hours, the majority of which were noise. Learn how they reduced their alert volume by 94% in 3 weeks with PagerDuty and Event Enrichment by following 3 easy steps.


In Alerting, Partnerships


What is Operational Maturity?

Long-time PagerDuty customers Dropbox, Flipboard, and Splunk spoke about their hard-won experience, shared war stories, and discussed what they’ve learned about operations at scale. They also had advice about how what they’ve learned can be applied to other teams. We were delighted to talk with customers, partners, and the extended community about what it means to be operationally mature. Here is what was said about Operational Maturity.


In Community


Why We Didn’t Build a Native Chat Client

Transparency and collaboration are at the core of DevOps philosophy, and ChatOps is an important aspect of both. ChatOps puts an entire team or organization’s work in one place – everyone’s actions, notifications and diagnoses happen in full view. A native PagerDuty chat client would be designed for use during incidents, and wouldn’t replace the chat client you use every day. Having two different chat records, which a native chat client would encourage, runs counter to the DevOps philosophy.


In Alerting, Features, Operations Performance


The Best Metrics for Driving Cultural Change in DevOps Teams

Everyone wants to optimize their team’s performance, but coming up with a good plan for doing so isn’t always easy. That’s why operationally mature DevOps teams use metrics to gain valuable insight into their work, enhance their capacity, and drive cultural change. Here we outline the key metrics that you should be monitoring and talk about how they can influence your team’s culture and performance.


In Alerting, Best Practices & Insights, DevOps, On-Call Life, Operations Performance


Monitoring Business Metrics and Refining Outage Response

Whether your server’s CPU is pegged at 100% or someone is chopping down your rainforest, PagerDuty has no opinions on how you use our platform to trigger a response from your on-call team. But here’s one area where we do have a strong opinion: alerting on business metrics. You should do it.


In Alerting, On-Call Life, Operations Performance


Customer Perspective: Setting Up IT Operations Software for Startups

This is a guest blog post written by Anthony Gibbons, the Operations Manager at Airhead Education. Anthony gives his perspective as a startup setting up PagerDuty as their IT Operations Software: “With the advent of cloud services and companies willing to integrate with each other, it is now entirely possible for a small startup to use the same monitoring tools as industry stars such as Airbnb, Pinterest and Path… It probably took me an hour to integrate all of my services with PagerDuty.”


In Alerting, Community, ITOps & Modern Ops, On-Call Life, Operations Performance


PagerDuty Rocking Out at Velocity Santa Clara

Five (count ’em) PagerDuty engineers/product managers were chosen to speak at Velocity Santa Clara next week. BOOM! What a beautiful world we live in. But what are they going to be speaking about? We’re glad you asked.


In Announcements, Community, Events


CloudMonix and PagerDuty Join Hands for Next-Gen Cloud Monitoring

With CloudMonix’s core objective of simplifying, streamlining and automating routine or complex tasks for Cloud System Administrators and IT Professionals, we are always looking to improve the way we deliver our services. That’s why we have partnered with PagerDuty to deliver instant alerts and notifications on PagerDuty’s leading Incident Management platform.


In Announcements, Community, Partnerships


Gain Greater Context with Rich Incidents

The site is down. Alarms are going off. Before you can fix anything, you first have to understand what’s going on. And gaining context can be hard as you look across multiple systems and metrics. We’re pleased to announce Rich Incidents, a new feature for PagerDuty that helps incident responders gain additional context. Now, responders can go straight from an alert to a conference bridge, chat room, or runbook, giving them instantaneous access to each other and to any documentation they might need. Additionally, embedded graphs give more context into an incident, helping you respond faster and maintain a dependable product for your customers.


In Alerting, Announcements, Features


The Discovery of Apache ZooKeeper’s Poison Packet

ZooKeeper, for those who are unaware, is a well-known open source project which enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs lay in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.


In Alerting, Community, Operations Performance, Reliability


Boundary Integrates with PagerDuty

When it comes to monitoring the health of your IT system, the team at Boundary lives by the philosophy that every second counts. Rather than letting data slip through the cracks at five- or even one-minute intervals, Boundary provides real-time monitoring of servers, platforms and apps for IT and DevOps teams with one-second resolution. We’re excited to announce an integration with PagerDuty to help teams resolve infrastructure incidents even faster.


In Announcements, Partnerships


Report from ServiceNow Knowledge 15

Last week, we sponsored a booth and participated with all of ServiceNow’s awesome partners at Knowledge 15 in Las Vegas! ServiceNow is a powerful platform-as-a-service for IT teams. We heard success stories from customers who use PagerDuty to enhance their ServiceNow experience.


In Events


Best Practices in Outage Communication: Internal Stakeholders

When you’re in the middle of an outage, the last thing you want is people from all over the company constantly asking you when it’s going to be fixed. Your job is busy enough without having to play translator and communication whiz when you have more important things to be worried about. But at the same time, your outage affects people outside of your team. You can’t neglect communicating with internal stakeholders like your manager, or your CTO, or your CEO, or your marketing department, or your customer support team. You see where I’m going with this. So how do you keep your internal stakeholders informed in a timely, efficient fashion?


In Alerting, Best Practices & Insights


Introducing PagerDuty Integration for Threat Stack

We love PagerDuty and are big users ourselves. We love the ease of integration with our other platforms. We love the scheduling and overrides. We love the per-service escalation groups. We love the sound of our default alert setting, the sad trombone. (Though, the more we think about it, “love” isn’t the right word on that last one. That infernal trombone wakes up our team to let us know there is trouble in the Cloud. We dread that trombone.)


In Announcements, Partnerships


London Conference Wrap-Up

Last week our team went on an overseas adventure, sponsoring AWS Summit London and Puppet Camp UK. We heard over and over at AWS Summit that our international customers love our reliable multi-provider SMS, phone, push, and email alerting to over 175 countries (and growing!). Our international SMS alerts all come from local numbers in the countries we alert, so when engineers ack, they don’t incur international fees. International customers are also big fans of UTF-8 support throughout our incident pipeline, so messages in non-Western character sets render correctly.


In Alerting