I recently joined the summer internship program at PagerDuty, and I have already had one of the most inspiring and thought-provoking experiences of my life….

Organizations need many incident commanders to provide a high level of service to their customers while avoiding on-call load. Many shy away from…

Teams that serve the business, such as Business Operations and Business Intelligence, are faced with a barrage of urgent requests and a never-ending list of business-critical…

| In Best Practices & Insights, Monitoring

Monitoring applications and systems is one thing — knowing what to do with all the data being gathered is quite another. Most IT organizations today…

Zayna Shahzad is a Software Engineer at PagerDuty on the Mobile Team. She works on the Android and iOS PagerDuty apps offered through the App…

Here at PagerDuty, our engineering teams are committed to Agile development principles that favor rapid iteration over lengthy periods of design, and favor direct communication…

Here at PagerDuty, we’re committed to helping our customers get the most out of the platform. We’ve long shared best practices and knowledge…

| In Announcements, Best Practices & Insights, Features

While a major incident is ongoing, all of your focus is on restoring service: watch the smoke, figure out where the fire is, and put…

Monitoring is pivotal to keeping your ITOps architecture proactive. In recent years, we have seen an explosion in both the number of and…

Having one person on-call isn’t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone’s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck on a bus or in traffic? It will happen. We present best practices for backup: one or more people, waiting in the wings, ready to spring into action if your primary on-call is unable to perform their duties at any given time.

Everyone wants to optimize their team’s performance, but coming up with a good plan for doing so isn’t always easy. That’s why operationally mature DevOps teams use metrics to gain valuable insight into their work, enhance their capacity, and drive cultural change. Here we outline the key metrics that you should be monitoring and talk about how they can influence your team’s culture and performance.

| In Alerting, Best Practices & Insights

When you’re in the middle of an outage, the last thing you want is people from all over the company constantly asking you when it’s going to be fixed. Your job is busy enough without having to play translator and communication whiz when you have more important things to worry about. But at the same time, your outage affects people outside of your team. You can’t neglect communicating with internal stakeholders like your manager, or your CTO, or your CEO, or your marketing department, or your customer support team. You see where I’m going with this. So how do you keep your internal stakeholders informed in a timely, efficient fashion?

| In Alerting, Best Practices & Insights

You’ve just realized that something has gone critically wrong, and you can’t fix it yourself. Particularly if you work within a collaborative DevOps environment, it’s better to get by with a little help from your friends. Effectively coordinating the incident response across subject matter experts and front-line responders is a secret to operational success that differentiates top teams. So it’s important that you have an effective and efficient way to sound the alarm, and to make sure that your conversations are recorded and actionable.

Outages are chaotic, and it can be difficult to figure out the best way to let your customers know what is going on. One of the first big decisions you’ll need to make is whether you’re going to respond only to people who inquire about the issue, or if you’re going to be more proactive and post updates publicly. Many of the leading technology companies have begun to transparently discuss outages with their customers, and there are a number of good business reasons for doing so. Regardless of your approach, here are 6 things you can do to ensure successful customer communication during outages.

Guest post by Alexis Lê-Quôc, co-founder and CTO of Datadog. Datadog is a monitoring service for IT, Operations and Development teams who want to turn…

This post is the second in our series about how you can use data to improve your IT operations. Our first post was on alert fatigue….

Since we launched on-call handoff notifications, lots of our customers have used them to be notified about their on-call responsibilities to make sure they never…

| In Best Practices & Insights, DevOps, Tech Talk

In its simplest form, website monitoring is the process of testing and verifying that end-users can actually use your service. There are several great…
