PagerDuty Guarantees Uptime with Downtime Insurance

- June 30, 2015

Here at PagerDuty, reliability is our business, and we aim to prove it with our actions, not just our words. We’ve spent over six years bulletproofing our software for our customers, and we’ve operated under a strict reliability SLA the whole time. Now, we’re backing our code and infrastructure by becoming the first company to extend customers a multi-million dollar downtime insurance guarantee. We have put so many failsafes into our product that, if we have an outage, we’ll compensate you for lost revenue that occurs as a result of our downtime.

What is Operational Maturity?

- June 26, 2015

Long-time PagerDuty customers Dropbox, Flipboard, and Splunk spoke about their hard-won experience, shared war stories, and discussed what they’ve learned about operations at scale. They also had advice about how what they’ve learned can be applied to other teams. We were delighted to talk with customers, partners, and the extended community about what it means to be operationally mature. Here is what was said about Operational Maturity.

Why We Didn’t Build a Native Chat Client

- June 18, 2015

Transparency and collaboration are at the core of DevOps philosophy, and ChatOps is an important aspect of both. ChatOps puts an entire team or organization’s work in one place – everyone’s actions, notifications and diagnoses happen in full view. A native PagerDuty chat client would be designed for use during incidents, and wouldn’t replace the chat client you use every day. Having two different chat records, which a native chat client would encourage, runs counter to the DevOps philosophy.

The Best Metrics for Driving Cultural Change in DevOps Teams

- June 11, 2015

Everyone wants to optimize their team’s performance, but coming up with a good plan for doing so isn’t always easy. That’s why operationally mature DevOps teams use metrics to gain valuable insight into their work, enhance their capacity, and drive cultural change. Here we outline the key metrics that you should be monitoring and talk about how they can influence your team’s culture and performance.

Customer Perspective: Setting Up IT Operations Software for Startups

- June 2, 2015

This is a guest blog post written by Anthony Gibbons, the Operations Manager at Airhead Education. Anthony gives his perspective as a startup setting up PagerDuty as their IT Operations Software: “With the advent of cloud services and companies willing to integrate with each other, it is now entirely possible for a small startup to use the same monitoring tools as industry stars such as Airbnb, Pinterest and Path… It probably took me an hour to integrate all of my services with PagerDuty.”

CloudMonix and PagerDuty Join Hands for Next-Gen Cloud Monitoring

- May 19, 2015

CloudMonix’s core objective is to simplify, streamline and automate routine or complex tasks for Cloud System Administrators and IT Professionals, so we are always looking to improve the way we deliver our services. That’s why we have partnered with PagerDuty to deliver instant alerts and notifications on PagerDuty’s leading Incident Management platform.

Gain Greater Context with Rich Incidents

- May 13, 2015

The site is down. Alarms are going off. Before you can fix anything, you first have to understand what’s going on. And gaining context can be hard as you look across multiple systems and metrics. We’re pleased to announce Rich Incidents, a new feature for PagerDuty that helps incident responders gain additional context. Now, responders can go straight from an alert to a conference bridge, chat room, or runbook, giving them instantaneous access to each other and to any documentation they might need. Additionally, embedded graphs give more context into an incident, helping you respond faster and maintain a dependable product for your customers.

The Discovery of Apache ZooKeeper’s Poison Packet

- May 7, 2015

ZooKeeper, for those who are unaware, is a well-known open source project which enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs conspiring against us, resulting in random cluster-wide lockups. Two of those bugs lay in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.
