| In Announcements, Best Practices & Insights, Features

While a major incident is ongoing, all of your focus is on restoring service: watch the smoke, figure out where the fire is, and put…

Monitoring is pivotal in the sustained proactivity in your ITOps architecture. In recent years, we have seen an explosion in both the number of and…

Having one person on-call isn’t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone’s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck on a bus or in traffic? It will happen. Here we present best practices for backup: one or more people waiting in the wings, ready to spring into action if your primary on-call is unable to respond at any given time.
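As a rough illustration of the backup idea, here is a minimal sketch of an escalation chain: if the primary does not acknowledge an alert within a set window, the page moves on to the next person. The `Responder` class, the `responder_at` function, and the timeout values are all hypothetical names and numbers chosen for this example; they do not reflect any particular product's API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Responder:
    """One level of a hypothetical escalation chain."""
    name: str
    ack_timeout_min: int  # minutes to wait for an acknowledgement before escalating


def responder_at(chain: Sequence[Responder], minutes_unacked: int) -> Optional[Responder]:
    """Return who should be paged once an alert has gone unacknowledged
    for `minutes_unacked` minutes, or None if the chain is exhausted."""
    elapsed = 0
    for responder in chain:
        elapsed += responder.ack_timeout_min
        if minutes_unacked < elapsed:
            return responder
    return None


# Illustrative chain: primary first, then a backup, then a team lead.
chain = [
    Responder("primary", 15),
    Responder("backup", 15),
    Responder("team lead", 30),
]

if __name__ == "__main__":
    for minutes in (0, 15, 30, 60):
        current = responder_at(chain, minutes)
        print(minutes, current.name if current else "nobody left to page")
```

In this sketch the primary holds the alert for the first 15 minutes, the backup for the next 15, and the team lead after that. A `None` result means the whole chain failed to acknowledge, which is itself worth alerting on.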

Everyone wants to optimize their team’s performance, but coming up with a good plan for doing so isn’t always easy. That’s why operationally mature DevOps teams use metrics to gain valuable insight into their work, enhance their capacity, and drive cultural change. Here we outline the key metrics that you should be monitoring and talk about how they can influence your team’s culture and performance.

| In Alerting, Best Practices & Insights

When you’re in the middle of an outage, the last thing you want is people from all over the company constantly asking you when it’s going to be fixed. Your job is busy enough without having to play translator and communication whiz when you have more important things to worry about. But at the same time, your outage affects people outside of your team. You can’t neglect communicating with internal stakeholders like your manager, or your CTO, or your CEO, or your marketing department, or your customer support team. You see where I’m going with this. So how do you keep your internal stakeholders informed in a timely, efficient fashion?

| In Alerting, Best Practices & Insights

You’ve just realized that something has gone critically wrong, and you can’t fix it yourself. Particularly if you work within a collaborative DevOps environment, it’s better to get by with a little help from your friends. Effectively coordinating the incident response across subject matter experts and front-line responders is a secret to operational success that differentiates top teams. So it’s important that you have an effective and efficient way to sound the alarm, and to make sure that your conversations are recorded and actionable.

Outages are chaotic, and it can be difficult to figure out the best way to let your customers know what is going on. One of the first big decisions you’ll need to make is whether you’re going to respond only to people who inquire about the issue, or be more proactive and post updates publicly. Many of the leading technology companies have begun to discuss outages transparently with their customers, and there are a number of good business reasons for doing so. Regardless of your approach, here are 6 things you can do to ensure successful customer communication during outages.

Guest post by Alexis Lê-Quôc, co-founder and CTO of Datadog. Datadog is a monitoring service for IT, Operations and Development teams who want to turn…

This post is the second in our series about how you can use data to improve your IT operations. Our first post was on alert fatigue….

Since we launched on-call handoff notifications, lots of our customers have used them to be notified about their on-call responsibilities to make sure they never…

| In Best Practices & Insights, DevOps, Tech Talk

In its simplest form, website monitoring is the process of testing and verifying that end-users can actually use your service. There are several great…

Anything can happen while you’re on-call. You can experience a quiet, incident-free shift or suffer a severe outage that makes your head explode. Since you…

Many solutions offer email alerts to notify customers of an issue. Email alerts are effective if you’re in front of your inbox all day, but…

This is Part 1 in a multi-part series dealing with tips for being on-call.
