John Laban

Articles by John (16)

Outage Post Mortem – June 3rd & 4th, 2014

On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance that led to some delayed notifications. On the 4th, the outage was more severe. In order to recover from the outage, in-flight data from the system was purged and resulted […]

Approaching the Hiring of Engineers as a Machine Learning Problem

Hiring software engineers is hard.  We all know this.  If you get past the problem of sourcing and landing good candidates (which is hard in itself), the whole issue of “is this person I’m talking to ‘good enough’ to actually work here?” is a very difficult nut to crack.  Again, we all know this.  There […]

Outage Post Mortem – March 15

As some of you know, PagerDuty suffered an outage for a total of 15 minutes this morning. We take the reliability of our systems very seriously, and are writing this to give you full disclosure on what happened, what we did wrong, what we did right, and what we’re doing to help prevent this in […]

On-call best practices: Page your manager

Having one person on-call isn’t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone’s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck on a bus or in traffic? It will happen. You need a backup! One or […]

Pressure Release Valves

This is the fourth in a series of posts on increasing overall availability of your service or system. Have you ever gotten paged, and known right away that this problem isn’t like the last 15 operations issues you’ve dealt with this week? That this problem is special, and is really, really bad? You know, that […]

A Standard Operating Procedure for when s*IT hits the fan

This is the third in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failure – MTBF – and mean time to recovery – MTTR. In our second post we went on to […]

More control over Optimistic Locking in Rails

Like pretty much everything else in Rails, optimistic locking is nice and easy to set up: you simply add a “lock_version” column to your ActiveRecord model and you’re all set. If a given Rails process is trying to update some record, and some other process sneakily manages to update that same record while the first process wasn’t […]

Availability lessons from shoe companies and ancient warlords

This is the second in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failure – MTBF – and mean time to recovery – MTTR.  Both increasing MTBF and reducing MTTR […]

Getting the most out of PagerDuty: Incident De-Duping

Tired of getting a flood of PagerDuty incidents whenever a problem occurs with one of your systems?  Do many of the incidents seem identical?  Do you spend valuable time trying to fend off the seemingly never-ending PagerDuty phone calls and SMS messages while you should be fixing the actual problem?  Then you, my friend, might […]

Velocity Contest Winners

Velocity 2011 was a blast! Thanks to everyone who came by our booth to find out more about PagerDuty, snag a t-shirt, and enter our contest.