Operations Performance

Innovation, ITOps & Modern Ops, Operations Performance, Technology

The Transformers

I recently had the privilege of spending a full day with a small group of our customers. The attendees were leaders in their development and...

Rachel Obstler

3 min read

digital operations, digital transformation, microservices, NOC

Features, On-Call Life, Operations Performance, Reliability

7 Benefits of Incident Management in Supporting Applications

Incident management is a key facet of supporting applications. When working on an application, we spend the vast majority of time on its release to...

Eric Jeanes

4 min read

Alerting, On-Call Life, Operations Performance

Reducing Technical Debt With Incident Management

It generally pays to look beyond labels, such as “incident management” (which usually means much more than receiving and responding to alerts). Consider, for...

Michael Churchman

6 min read

Alerting, Announcements, Features, On-Call Life, Operations Performance

Let ‘Team Responders’ Accident-Proof Your Incident Response

Accidents Happen. It’s a fact: well-meaning team members, in the heat of the moment – and often in the middle of the night – sometimes...

Matt Fleck

3 min read

DevOps, On-Call Life, Operations Performance

3 Steps for Handling Failure with a DevOps Mindset

Many DevOps companies embrace risk, but fear of failing is hard-wired into most of us. Here are 3 ways to handle an emotional reaction to failure.

Sam Lewis

2 min read

DevOps, On-Call Life, Operations Performance

The DevOps Democracy

Democracy: the great experiment. The voice of the people leading. The end of rigid and overbearing hierarchies. These principles have been with us for over two centuries in government, but many business models still look like the British Empire. As the pace of development continues to scale and customers come to expect real-time response to their concerns, businesses with complex IT departments are transitioning to a DevOps model that gives them the agility to stay up and responsive to the voice of the people. Here we explore how fostering a DevOps culture can build a more democratic workplace and customer experience.

Sam Lewis

5 min read

Alerting, On-Call Life, Operations Performance

Five Ways to Create a Data-Driven Culture

No one should need to be convinced of the value of good data. It gives you the confidence to make decisions quickly and with less risk, it allows you to measure your success, and it lets you know when you need to adjust your course. But there’s a difference between knowing the value of data and creating a culture around it. A data-driven culture is a culture where everyone quantifies their actions as much as possible, and asks themselves how their teams are having a tangible impact on the business. It turns your entire organization into a squad of analysts. But fostering a data-driven culture isn’t always easy. Here are five steps that will help you get there.

Sam Lewis

3 min read

On-Call Life, Operations Performance

Why VPs Should Care About Engineer Burnout

Too many companies take the happiness of their engineers for granted. This is a huge mistake, especially since engineers are doing important work for your company: building your product, and then keeping it up-to-date and functioning. Their morale has a direct influence on their performance, and, by extension, your product. Part of the DevOps ethos is getting engineers working together better, smarter, and happier. But why should executives care about that?

Sam Lewis

3 min read

Community, Events, ITOps & Modern Ops, Operations Performance

Three Ways to Ramp Up Your Enterprise IT Operations Management

As indicated in a survey conducted by Forrester Research, a well-constructed IT Operations management system provides fast alert notification, keeps business-critical incidents to a minimum, and focuses on automation as a way of addressing issues. What we are actually seeing in the field today, however, doesn’t seem to line up with this approach. According to a recent Forrester thought leadership paper, incident resolution practices today are tactical, reactive, and harm commercial success. Listed below are some observations we are seeing with IT Organizations in the Enterprise.

Sam Lewis

2 min read

Alerting, Best Practices & Insights, Operations Performance

On-Call Best Practices: Page Your Manager

Having one person on-call isn't enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone's battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck on a bus or in traffic? It will happen. We present best practices for backup: one or more people, waiting in the wings, ready to spring into action if your primary on-call is unable to perform their duties at any given time.

John Laban

4 min read

Alerting, Features, Operations Performance

Why We Didn’t Build a Native Chat Client

Transparency and collaboration are at the core of DevOps philosophy, and ChatOps is an important aspect of both. ChatOps puts an entire team or organization’s work in one place – everyone’s actions, notifications and diagnoses happen in full view. A native PagerDuty chat client would be designed for use during incidents, and wouldn’t replace the chat client you use every day. Having two different chat records, which a native chat client would encourage, runs counter to the DevOps philosophy.

Sam Lewis

2 min read

Alerting, Best Practices & Insights, DevOps, On-Call Life, Operations Performance

The Best Metrics for Driving Cultural Change in DevOps Teams

Everyone wants to optimize their team’s performance, but coming up with a good plan for doing so isn’t always easy. That’s why operationally mature DevOps teams use metrics to gain valuable insight into their work, enhance their capacity, and drive cultural change. Here we outline the key metrics that you should be monitoring and talk about how they can influence your team’s culture and performance.

Sam Lewis

4 min read

Alerting, On-Call Life, Operations Performance

Monitoring Business Metrics and Refining Outage Response

Whether your server’s CPU is pegged at 100% or someone is chopping down your rainforest, PagerDuty has no opinions on how you use our platform to trigger a response from your on-call team. But here’s one area where we do have a strong opinion: alerting on business metrics. You should do it.

Dave Cliffe

4 min read

Alerting, Community, ITOps & Modern Ops, On-Call Life, Operations Performance

Customer Perspective: Setting Up IT Operations Software for Startups

This is a guest blog post written by Anthony Gibbons, the Operations Manager at Airhead Education. Anthony gives his perspective as a startup setting up PagerDuty as their IT Operations Software: "With the advent of cloud services and companies willing to integrate with each other, it is now entirely possible for a small startup to use the same monitoring tools as industry stars such as Airbnb, Pinterest and Path... It probably took me an hour to integrate all of my services with PagerDuty."

Sam Lewis

5 min read

Alerting, Community, Operations Performance, Reliability

The Discovery of Apache ZooKeeper’s Poison Packet

ZooKeeper, for those who are unaware, is a well-known open source project which enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work... until they don't. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs lay in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.

Evan Gilman

15 min read

Alerting, Announcements, Operations Performance, Partnerships

Eliminate Alert Fatigue with PagerDuty and Event Enrichment

We, as IT professionals, have ever-expanding access to more accurate Ops telemetry. With this data, we have an incredible amount of visibility into what’s going on. However, more information isn’t always a good thing when it comes to alerting. You can definitely have too many alerts, and alert fatigue is a growing problem among Operations teams. More detailed telemetry isn’t bad; it’s just that much of this information is generally better suited for forensics rather than alerting. Event Enrichment and PagerDuty team up to help you battle alert fatigue.

Julie Arsenault

4 min read

Alerting, DevOps, Operations Performance

Transitioning to a DevOps Model

As the pace of development and business continues to scale, teams need an agile and collaborative work environment to succeed. Moving to a DevOps model is a critical part of setting your engineering teams up to succeed, but making the transition can be challenging for many companies. In this post, we share some strategies for making the transition.

David Shackelford

6 min read

Operations Performance

Dev-Ops for Non-Engineers

If you’ve used the term “DevOps” as a job title, you may have been making a big mistake. It sounds innocuous: After all, isn’t DevOps something that you do? If you’re a marketer, hiring manager or non-engineer at your company, it might seem like it. But nothing could be further from the truth. It’s actually a philosophy and set of practices that guides how your engineering and IT teams work. And using the term improperly doesn’t always sit well with tech teams, even if they have “DevOps Engineer” on their LinkedIn profile.

Julie Arsenault

4 min read
