I recently had the privilege of spending a full day with a small group of our customers. The attendees were leaders in their development and…

Incident management is a key facet of supporting applications. When working on an application, we spend the vast majority of time on its release to…

| In Alerting, On-Call Life, Operations Performance

  It generally pays to look beyond labels, such as “incident management” (which usually means much more than receiving and responding to alerts). Consider, for…

Accidents Happen It’s a fact: well-meaning team members, in the heat of the moment – and often in the middle of the night – sometimes…

Many DevOps companies embrace risk, but fear of failing is hard-wired into most of us. Here are 3 ways to handle an emotional reaction to failure.

Democracy: the great experiment. The voice of the people leading. The end of rigid and overbearing hierarchies. These principles have been with us for over two centuries in government, but many business models still look like the British Empire. As the pace of development continues to scale and customers come to expect real-time response to their concerns, businesses with complex IT departments are transitioning to a DevOps model that gives them the agility to stay up and responsive to the voice of the people. Here we explore how fostering a DevOps culture can build a more democratic workplace and customer experience.

No one should need to be convinced the value of good data. It gives you the confidence to make decisions quickly and with less risk, it allows you to measure your success, and it lets you know when you need to adjust your course. But there’s a difference between knowing the value of data, and creating a culture around it. A data-driven culture is a culture where everyone quantifies their actions as much as possible, and asks themselves how their teams are having a tangible impact on the business. It turns your entire organization into a squad of analysts. But fostering a data-driven culture isn’t always easy. Here are five steps that will help you get there.

Too many companies take the happiness of their engineers for granted. This is a huge mistake, especially since engineers are doing important work for your company: building your product, and then keeping it up-to-date and functioning. Their morale has a direct influence on their performance, and, by extension, your product. Part of the DevOps ethos is getting engineers working together better, smarter, and happier. But why should executives care about that?

As indicated in a survey conducted by Forrester Research, a well-constructed IT Operations management system provides fast alert notification, keeps business-critical incidences from occurring at a minimum, and focuses on automation as a way of addressing issues. What we are actually seeing in the field today, however, doesn’t seem to line up with this approach. According to a recent Forrester thought leadership paper, incident resolution practices today are tactical, reactive, and harm commercial success. Listed below are some observations we are seeing with IT Organizations in the Enterprise.

Having one person on-call isn’t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone’s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck on a bus or in traffic? It will happen. We present best practices for back up. One or more people, waiting in the wings, ready to spring into action if your primary on-call is unable to perform his or her duties to the best of their abilities at any given time.

Transparency and collaboration are at the core of DevOps philosophy, and ChatOps is an important aspect of both. ChatOps puts an entire team or organization’s work in one place – everyone’s actions, notifications and diagnoses happen in full view. A native PagerDuty chat client would be designed for use during incidents, and wouldn’t replace the chat client you use every day. Having two different chat records, which a native chat client would encourage, runs counter to the DevOps philosophy.

Everyone wants to optimize their team’s performance, but coming up with a good plan for doing so isn’t always easy. That’s why operationally mature DevOps teams use metrics to gain valuable insight into their work, enhance the their capacity, and drive cultural change. Here we outline the key metrics that you should be monitoring and talk about how they can influence your team’s culture and performance.

Whether your server’s CPU is pegged at 100% or someone is chopping down your rainforest, PagerDuty has no opinions on how you use our platform to trigger a response from your on-call team. But here’s one area where we do have a strong opinion: alerting on business metrics. You should do it.

This is a guest blog post written by Anthony Gibbons, the Operations Manager at Airhead Education. Anthony gives his perspective as a startup setting up PagerDuty as their IT Operations Software: “With the advent of cloud services and companies willing to integrate with each other, it is now entirely possible for a small startup to use the same monitoring tools as industry stars such as Airbnb, Pinterest and Path… It probably took me an hour to integrate all of my services with PagerDuty.”

ZooKeeper, for those who are unaware, is a well-known open source project which enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs laid in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.

We, as IT professionals, have ever-expanding access to more accurate Ops telemetry. With this data, we have an incredible amount of visibility into what’s going on. However, more information isn’t always a good thing when it comes to alerting. You can definitely have too many alerts, and alert fatigue is a growing problem among Operations teams. More detailed telemetry isn’t bad; it’s just that much of this information is generally better suited for forensics rather than alerting. Event Enrichment and PagerDuty team up to help you battle alert fatigue.

| In Alerting, DevOps, Operations Performance

As the pace of development and business continues to scale, teams need an agile and collaborative work environment to succeed. Moving to a DevOps model is a critical part of setting your engineering teams up to succeed, but making the transition can be challenging for many companies. In this post, we share some strategies for making the transition.

| In Operations Performance

If you’ve used the term “DevOps” as a job title, you may have been making a big mistake. It sounds innocuous: After all, isn’t DevOps something that you do? If you’re a marketer, hiring manager or non-engineer at your company, it might seem like it. But nothing could be further from the truth. It’s actually a philosophy and set of practices that guides how your engineering and IT teams work. And using the term improperly doesn’t always sit well with tech teams, even if they have “DevOps Engineer” on their LinkedIn profile.