Having one person on-call isn’t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone’s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck on a bus or in traffic? It will happen. We present best practices for back up. One or more people, waiting in the wings, ready to spring into action if your primary on-call is unable to perform his or her duties to the best of their abilities at any given time.

| In Alerting, Operations Performance

Hiring software engineers is hard.  We all know this.  If you get past the problem of sourcing and landing good candidates (which is hard in

| In Reliability

This is the fourth in a series of posts on increasing overall availability of your service or system. Have you ever gotten paged, and known

| In Reliability

This is the third in a series of posts on increasing overall availability of your service or system. In the first post of this series, we

| In Reliability

Like pretty much everything else in Rails, optimistic locking is nice and easy to setup:  you simply add a “lock_version” column to your ActiveRecord model

| In Reliability

This is the second in a series of posts on increasing overall availability of your service or system. In the first post of this series,

| In Features

Tired of getting a flood of PagerDuty incidents whenever a problem occurs with one of your systems?  Do many of the incidents seem identical?  Do

| In Announcements, Features

Have you ever said to yourself: “PagerDuty is great, but I wish I could better integrate it into the custom tools I already use.” Or

| In Reliability

Today, at around 1am Pacific Time, Amazon began having major problems with some of their cloud infrastructure: specifically with their EC2, EBS, and RDS offerings. We’d like to share some statistics on the alerts we sent out – via phone or SMS – during the outage.

| In Reliability

This post is meant as a quick introduction to some concepts of system availability, so that subsequent posts in this series make sense. I’ll go over concepts like availability, SLA, mean time between failure, mean time to recovery, etc.

This is Part 1 in a multi-part series dealing with tips for being on-call.