85% of teams report missing a critical incident, even though 91% of teams use at least 2 monitoring tools. Uptime isn’t just about monitoring; it’s about optimizing your team’s incident response.
Here at PagerDuty, we face some interesting architectural challenges in order to guarantee alert delivery and provide our customers with the highest level of reliability…
Operations teams are receiving more telemetry data from monitoring systems than ever before. But they are struggling to sift through this data to find what really matters – resulting in alert fatigue and missed alerts. For this reason, we’re proud to announce that long-time partner Event Enrichment HQ is joining the PagerDuty family to deliver the industry’s first integrated event management and incident resolution platform. Adding Event Enrichment HQ and its keystone product, the Event Enrichment Platform (EEP), to PagerDuty helps you quiet your noisy monitoring systems, reduce alert fatigue, and slash your incident resolution times.
Today’s customer expects everything to be fast and always on, so uptime is crucial. This creates an entirely new set of business challenges for organisations with complex IT departments and a need for more agile IT Ops. We reviewed the top trends we are learning from customers on the road and what you need to consider when transitioning from a more traditional IT organisation.
No one should need to be convinced of the value of good data. It gives you the confidence to make decisions quickly and with less risk, it allows you to measure your success, and it lets you know when you need to adjust your course. But there’s a difference between knowing the value of data and creating a culture around it. A data-driven culture is one where everyone quantifies their actions as much as possible and asks how their team is having a tangible impact on the business. It turns your entire organization into a squad of analysts. But fostering a data-driven culture isn’t always easy. Here are five steps that will help you get there.
Don’t let the hardboiled-sounding name of our latest integration scare you off, because this monitoring service is a great way to get notified when one of your mission-critical scheduled tasks suddenly sleeps with the fishes. Dead Man’s Snitch is an uptime-monitor for cron or periodic jobs like backups or batch processing, and it alerts you when your jobs don’t run so you can investigate before it becomes a problem.
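The idea behind this kind of monitor can be sketched in a few lines of Python. This is only an illustration of a dead man's switch check, not Dead Man’s Snitch’s actual API: the function name `is_overdue`, its parameters, and the grace period are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def is_overdue(last_check_in, interval, now, grace=timedelta(0)):
    """Return True when a job has missed its expected check-in window.

    Hypothetical helper: a job is overdue once more than `interval`
    plus `grace` has passed since it last reported in.
    """
    return now - last_check_in > interval + grace


# A nightly backup that last checked in at 02:00 on January 1.
last_backup = datetime(2024, 1, 1, 2, 0)
daily = timedelta(days=1)

# 23.5 hours later the job is still within its window.
on_time = is_overdue(last_backup, daily, now=datetime(2024, 1, 2, 1, 30))

# 25 hours later, even with a 30-minute grace period, it has gone quiet.
missed = is_overdue(
    last_backup, daily, now=datetime(2024, 1, 2, 3, 0), grace=timedelta(minutes=30)
)
```

The point of the pattern is that the alert is triggered by silence: the job only has to report success, and the monitor raises the alarm when that report fails to arrive.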
We received your exciting entries and narrowed them down to five stellar finalists, you all helped #pickyourpage, and now we have our alert sound contest winner.
Introducing User Reporting, the latest addition to PagerDuty’s Advanced Analytics suite. User Reporting helps managers and teams understand how individual team members are responding to incidents. Now managers can see how many incidents each responder has received, acknowledged, reassigned, or moved up the chain of command due to non-acknowledgement. With this information, managers can work with their teams to make sure every team member is in the right position and that workload is spread properly across the team.
Something goes wrong in your staging environment, and you start seeing “CRITICAL” or “ERROR” all over the place. Oh… I forgot to mention that it’s 3am where you live. Is it really “critical” in that moment? Well, technically it is. The environment is still busted. But do you want to fix it now? Is it urgent?
One day, Ethan, whose dad works at Altiscale, heard a sweet song. It was an infectious tune; he couldn’t get it out of his head. Over and over, he heard this song, wafting again and again from his father’s phone. What was this magnificent melody? When would it play again? The song was, technically speaking, a PagerDuty alert: a jingle by the name of “You Made the Server Cry,” recorded Barbershop Quartet-style by some of PagerDuty’s more musical employees. Five-year-old Ethan thought the song was so amazing, he found himself singing it all the time. Pretty soon, he was making up his own PagerDuty alert sounds, and came up with a ditty called, “Something’s Broken,” sung to the tune of “Frère Jacques.” His dad decided to record it and submit it to us as a custom alert sound.
Using ticketing systems can be fraught with issues: a clunky workflow, mired in process, means that users can’t always move and adapt quickly. While ticketing systems are a great way to manage a queue of ongoing requests, we’ve noticed that many operationally mature companies stay away from them for their real-time incident management. Instead, they are using a more lightweight solution, like PagerDuty. A lightweight solution, with a focus on automation, allows them to be more agile and get things done faster.
We’re pleased to announce our fourth major mobile release, which brings some significant improvements to the performance and usability of key parts of the app. With all these changes, it’s faster and easier than ever to see, investigate, and take action on problems in your system — driving down resolution time and helping your team improve your operations performance.
Having one person on-call isn’t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone’s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck on a bus or in traffic? It will happen. We present best practices for backup: one or more people waiting in the wings, ready to spring into action whenever your primary on-call is unable to respond.
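As a rough illustration of the backup idea, here is a minimal Python sketch of walking an escalation chain until someone acknowledges. It is not PagerDuty’s API: the function `escalate`, the `acknowledges` callback, and the responder names are hypothetical and exist only for this example.

```python
def escalate(chain, acknowledges):
    """Walk the on-call chain in order and return (responder, level).

    `chain` is an ordered list of responders, primary first.
    `acknowledges` is a hypothetical callback that returns True when the
    given responder picks up the alert. If nobody acknowledges, the
    responder is None and the level is the length of the chain.
    """
    for level, responder in enumerate(chain, start=1):
        if acknowledges(responder):
            return responder, level
    return None, len(chain)


# Example: the primary's phone battery has died, so the alert moves on.
chain = ["primary", "secondary", "manager"]
unavailable = {"primary"}
responder, level = escalate(chain, lambda person: person not in unavailable)
```

Here the alert lands with the secondary at level 2. In a real setup the callback would also carry a timeout, so that a silent primary is skipped after a fixed number of minutes.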
Etsy occasionally runs an engineer exchange program, where they trade engineers with another tech company to give both organizations insight into what the other does differently. PagerDuty was their most recent participant, and in May, I had the pleasure of spending a week at Etsy’s office in Brooklyn. I learned from their practices, observed what they were doing well, and gained insight into their team dynamics. Etsy has an amazing culture, and I observed the customs they put into place to maintain their environment of empathy, autonomy, and learning. It was a great example of the traditions a company can foster to maintain a productive and happy work environment.