| In Reliability

At PagerDuty, all of our computing infrastructure is automated using Chef. We push out features and changes to our Chef codebase very frequently – often…

| In Reliability

High-frequency trading accounts for 50% of US securities trading. With thousands of securities totaling millions of dollars traded every millisecond, robust and reliable computer systems…

| In Reliability

This is the first post of a multi-part series on some of the operations challenges that the team at PagerDuty is solving. At PagerDuty we…

| In Reliability

PagerDuty’s July Hack Day presented another batch of amazing projects from our staff. One project in particular has a lot of future potential to provide…

| In Reliability

We’re rolling out Webhooks on incidents, and it opens up a lot of fun new things. For background, Webhooks let you receive HTTP callbacks when interesting…

| In Reliability

As a member of PagerDuty’s realtime engineering team, one of my top concerns is designing and implementing our systems with high availability and reliability. On May 30,…

| In Reliability

We spend an enormous amount of our time on the reliability of PagerDuty and the infrastructure that hosts it. Most of this work is invisible, hidden…

| In Reliability

On January 24, 25 and 26, 2013, PagerDuty suffered several outages.  The events API, used by our customers to submit monitoring events into PagerDuty from…

| In Reliability

You’re a techie working for one of the multitude of startups that rushed to market, where the founders hastily glued a Rails app together with candy-bar wrappers and…

| In Reliability

A few weeks ago I had the privilege of speaking at Surge 2012 in Baltimore, MD. The audience was made up of those whose focus was on better…

| In Reliability

This is a guest post by Connie Quach, Sr. Product Manager, responsible for the web performance products at Neustar. In today’s competitive environment, website performance…

| In Reliability

Sometimes you just have to tinker. Experimentation, trial and error are all part and parcel of the learning experience, and the gateway to bigger and…

| In Reliability

At PagerDuty, we usually get a front seat to anything that’s wrong with the internet. Last weekend, a derecho storm took out 7% of AWS…

| In Reliability

On the evening of Friday, June 29th, Amazon Web Services (AWS) experienced a major outage at its Northern Virginia location due to a loss of…

| In Announcements, Reliability

We have some very exciting news for all of our customers who are running mission-critical systems on AWS in the US-East region: we have migrated…

| In Reliability

On Thursday, June 14, starting at 8:44pm Pacific time, PagerDuty suffered a serious outage. The application experienced 30 minutes of downtime, followed by a period…

| In Reliability

As some of you know, PagerDuty suffered an outage for a total of 15 minutes this morning. We take the reliability of our systems very…

| In Reliability

As a general rule, whatever percentage you think your test coverage is, it isn’t. Whatever amount of the known surface area you’re covering, there’s going…