Outage Post-Mortem – Jan 16, 2014
At PagerDuty we are transparent about any outage that negatively impacts our customers. We are proud of PagerDuty’s superior reliability, but occasionally we may have a snafu. To be notified of any outages that occur, we recommend following our dedicated Twitter account, @PagerDutyOps.
On January 16th at 7:40 AM PST, we had a small incident that delayed six alerts (3 email, 2 SMS, and 1 push notification). The cause was a rare race condition that prevented a small set of locks from releasing properly.
The condition arose from our efforts to minimize locking and workflow contention in order to make our services scalable. This increased latency in our Cassandra and ZooKeeper operations.
We quickly identified and fixed the issue, and followed up with regression testing. No alerts were lost during the outage, although the six alerts were significantly delayed.
We would like to apologize to those affected by the outage. We are working to reduce the possibility of these types of errors in the future.
If you have any questions, please contact email@example.com.