Introducing User Reporting, the latest addition to PagerDuty’s Advanced Analytics suite. User Reporting helps managers and teams understand how individual team members are responding to incidents. Now managers can see how many incidents each responder has received, acknowledged, reassigned, or moved up the chain of command due to non-acknowledgement. With this information, managers can work with their teams to make sure every team member is in the right position and that workload is spread properly across the team.
Something goes wrong in your staging environment, and you start seeing “CRITICAL” or “ERROR” all over the place. Oh… I forgot to mention that it’s 3am where you live. Is it really “critical” in that moment? Well, technically it is. The environment is still busted. But do you want to fix it now? Is it urgent?
One day, Ethan, whose dad works at Altiscale, heard a sweet song. It was an infectious tune; he couldn’t get it out of his head. Over and over, he heard this song, wafting again and again from his father’s phone. What was this magnificent melody? When would it play again? The song was, technically speaking, a PagerDuty alert: a jingle by the name of “You Made the Server Cry,” recorded Barbershop Quartet-style by some of PagerDuty’s more musical employees. Five-year-old Ethan thought the song was so amazing, he found himself singing it all the time. Pretty soon, he was making up his own PagerDuty alert sounds, and came up with a ditty called, “Something’s Broken,” sung to the tune of “Frère Jacques.” His dad decided to record it and submit it to us as a custom alert sound.
Using ticket systems can be fraught with issues: a clunky workflow, mired in process, means that users can’t always move and adapt quickly. While ticketing systems are a great way to manage a ticket queue of ongoing requests, we’ve noticed that many operationally mature companies stay away from ticketing systems for their real-time incident management. Instead, they are using a more lightweight solution, like PagerDuty. A lightweight solution, with a focus on automation, allows them to be more agile, and get things done faster.
We’re pleased to announce our fourth major mobile release, which brings some significant improvements to the performance and usability of key parts of the app. With all these changes, it’s faster and easier than ever to see, investigate, and take action on problems in your system — driving down resolution time and helping your team improve your operations performance.
Having one person on-call isn’t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone’s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck on a bus or in traffic? It will happen. We present best practices for backup: one or more people waiting in the wings, ready to spring into action whenever your primary on-call is unable to respond.
Etsy occasionally runs an engineer exchange program, where they trade engineers with another tech company to give both organizations insight into what the other does differently. PagerDuty was their most recent participant, and in May, I had the pleasure of spending a week at Etsy’s office in Brooklyn. I learned from their practices, observed what they were doing well, and gained insight into their team dynamics. Etsy has an amazing culture, and I observed the customs they put into place to maintain their environment of empathy, autonomy, and learning. It was a great example of the traditions a company can foster to maintain a productive and happy work environment.
We’re pleased to announce improvements to our reporting capabilities that enable teams to gain even greater insight. Now, teams can optimize their monitoring by visualizing metrics such as common incidents, SLA performance, and noisy incidents.
We recently sat down with Shawn Motley, Senior DevOps Engineer at Virtuoso, to talk about his experiences with PagerDuty and the Event Enrichment Platform (EEP). Virtuoso is a travel portal for high-end clients, with over 200 employees and 8 web properties. When Virtuoso began focusing on their DevOps initiative 7 months ago, they were receiving thousands of events every 24 hours, the majority of which were noise. Learn how they reduced their alert volume by 94% in 3 weeks with PagerDuty and Event Enrichment by following 3 easy steps.
Transparency and collaboration are at the core of DevOps philosophy, and ChatOps is an important aspect of both. ChatOps puts an entire team or organization’s work in one place – everyone’s actions, notifications and diagnoses happen in full view. A native PagerDuty chat client would be designed for use during incidents, and wouldn’t replace the chat client you use every day. Having two different chat records, which a native chat client would encourage, runs counter to the DevOps philosophy.
Everyone wants to optimize their team’s performance, but coming up with a good plan for doing so isn’t always easy. That’s why operationally mature DevOps teams use metrics to gain valuable insight into their work, enhance their capacity, and drive cultural change. Here we outline the key metrics that you should be monitoring and talk about how they can influence your team’s culture and performance.
Whether your server’s CPU is pegged at 100% or someone is chopping down your rainforest, PagerDuty has no opinions on how you use our platform to trigger a response from your on-call team. But here’s one area where we do have a strong opinion: alerting on business metrics. You should do it.
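To make the idea of alerting on a business metric concrete, here is a minimal sketch in Python. It assumes a hypothetical metric (orders per minute) and a hypothetical rule: alert when the recent average falls below half of the baseline average. The function name, the sample numbers, and the 50% threshold are all illustrative assumptions, not anything prescribed by PagerDuty; how the alert is actually sent is left out.

```python
from statistics import mean


def should_alert(recent, baseline, drop_ratio=0.5):
    """Return True when the recent average of a business metric has
    dropped below `drop_ratio` times its baseline average.

    All names and the default threshold are hypothetical choices made
    for this illustration.
    """
    # Without data on both sides there is nothing to compare.
    if not recent or not baseline:
        return False
    return mean(recent) < drop_ratio * mean(baseline)


# Hypothetical orders-per-minute samples.
baseline_orders = [100, 110, 90]
recent_orders = [10, 12, 8]

if should_alert(recent_orders, baseline_orders):
    print("Orders per minute dropped sharply: page the on-call team")
```

A ratio against a baseline, rather than a fixed number, is one simple way to keep such a rule meaningful as the business grows; a real setup would likely also account for time of day and seasonality.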
This is a guest blog post written by Anthony Gibbons, the Operations Manager at Airhead Education. Anthony gives his perspective as a startup setting up PagerDuty as their IT Operations Software: “With the advent of cloud services and companies willing to integrate with each other, it is now entirely possible for a small startup to use the same monitoring tools as industry stars such as Airbnb, Pinterest and Path… It probably took me an hour to integrate all of my services with PagerDuty.”
The site is down. Alarms are going off. Before you can fix anything, you first have to understand what’s going on. And gaining context can be hard as you look across multiple systems and metrics. We’re pleased to announce Rich Incidents, a new feature for PagerDuty that helps incident responders gain additional context. Now, responders can go straight from an alert to a conference bridge, chat room, or runbook, giving them instantaneous access to each other and to any documentation they might need. Additionally, embedded graphs give more context into an incident, helping you respond faster and maintain a dependable product for your customers.
ZooKeeper, for those who are unaware, is a well-known open source project which enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs lay in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.
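For readers new to the term, the arithmetic behind a majority quorum can be sketched in a few lines of Python. This is only an illustration of the general majority rule (more than half of the servers must agree); the function names are hypothetical, and it does not use or model ZooKeeper's actual code.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble must contain at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(f"{n} servers: quorum of {quorum_size(n)}, "
          f"survives {tolerated_failures(n)} failure(s)")
```

Note that, under this rule, four servers tolerate no more failures than three do, which is why odd-sized clusters are the usual choice.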
When you’re in the middle of an outage, the last thing you want is people from all over the company constantly asking you when it’s going to be fixed. Your job is busy enough without having to play translator and communication whiz when you have more important things to be worried about. But at the same time, your outage affects people outside of your team. You can’t neglect communicating with internal stakeholders like your manager, or your CTO, or your CEO, or your marketing department, or your customer support team. You see where I’m going with this. So how do you keep your internal stakeholders informed in a timely, efficient fashion?
Last week our team went on an overseas adventure, sponsoring AWS Summit London and Puppet Camp UK. We heard over and over at AWS Summit that our international customers love our reliable multi-provider SMS, phone, push, and email alerting to over 175 countries (and growing!). Our international SMS alerts all come from local numbers in the countries we alert, so when engineers ack, they don’t incur international fees. International customers are also big fans of UTF-8 support throughout our incident pipeline, so messages in non-Western character sets render correctly.
Application Performance Monitoring (APM) systems like AppDynamics can provide incredibly rich information about what’s happening with your IT infrastructure, and can identify performance issues before they create big problems. However, this information is only as good as your ability to respond to it. PagerDuty can extend the capabilities of AppDynamics Alert & Respond policies to ensure incidents are noticed, responded to, and fixed quickly.