| In DevOps, PagerDuty Life, Tech Talk

On June 28th, 2017, we marked four years of performing “Failure Fridays” at PagerDuty. As a quick recap, Failure Fridays are a practice we conduct weekly at PagerDuty to inject faults into our production environment in a controlled way, and without customer impact. They’ve been foundational for us to verify our resiliency engineering efforts. Over […]

| In DevOps, PagerDuty Life, Reliability

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” (Principles of Chaos Engineering) Netflix, Dropbox, and Twilio are all examples of companies that perform this kind of engineering. It’s essential to have confidence in large, robust, distributed […]

| In Operations Performance, Reliability

You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution: Ditch scheduled maintenance altogether. That sounds like a bold proposition. But as Doug explained at DevOps Days Chicago, it actually makes a lot of sense. Scheduled maintenance tends to take place late […]

| In Events, Operations Performance, Reliability

How we drink our own champagne (and do monitoring at PagerDuty). We deliver over 4 million alerts each month, and companies count on us to let them know when they have outages. So, who watches the watchmen? Arup Chakrabarti, PagerDuty’s engineering manager, spoke about how we monitor our own systems at DevOps Days Chicago earlier […]

| In Operations Performance, Reliability

When something goes wrong, getting to the ‘what’ without worrying about the ‘who’ is critical for understanding failures. Two engineering managers share their strategies for running blameless postmortems. Failure is inevitable in complex systems. While it’s tempting to find a single person to blame, according to Sidney Dekker, these failures are usually the results […]

| In Partnerships, Reliability

Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business at all levels of the stack. The assumption underlying all monitoring systems is the existence of an entity that we cannot fully control. A thing we have created, like an […]

| In Reliability

On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance that led to some delayed notifications. On the 4th, the outage was more severe. In order to recover from the outage, in-flight data from the system was purged and resulted […]

| In Partnerships, Reliability

This is a guest blog post from Justin Liu of Crittercism, which provides mobile app performance management. Crittercism products monitor every aspect of mobile app performance, allowing developers and IT operations to deliver high-performing, highly reliable, highly available mobile apps. Mobile apps are now critical for all types of businesses. Whether your company builds […]

| In Partnerships, Reliability

This is a guest blog post from Erik Näslund, Director of Disrapt. Erik is a back-end developer and operations guy. He created his first game at the age of six using AMOS Professional on the Amiga. There was a period when FPGA programming and hardware were all the rage. For the last 15 years Erik […]

| In Reliability

PagerDuty engineers are obsessed with reliability. Letting down customers when they’ve been paged is the worst. With that in mind, we’re always designing and thinking of ways to maintain and build systems that maximize resiliency — including our mobile apps. After the release of redesigned mobile applications last October, we’ve been shipping new features and […]

| In Reliability

Reliability is important to us. We even inject failure into our systems every Friday to prove it. But when it comes to sending alerts, reliability goes beyond writing flawless code. We rely on several third-party carriers to deliver alerts to our customers. If an SMS isn’t delivered, you aren’t notified of an outage. We can’t stick […]

| In Reliability

On April 14th, PagerDuty suffered an outage that affected customers on both the mobile and web applications. During the outage, customers may have had issues managing their accounts, and some alerts were delayed. When these incidents occur, we make sure that we offer transparency for our customers who may have been […]

| In Reliability

In its simplest form, website monitoring is the process of testing and verifying that end users can actually use your service. There are several great SaaS applications that will ping your system to let you know if you are up and running, just in case your team needs to sprint to find a fix. Knowing […]

| In Operations Performance, Reliability

Continuous integration (CI) is a software development practice where team members frequently merge their work to decrease problems and conflicts. Each push is supported by an automated build (and test) to detect errors. By checking in with one another frequently, teams can develop software more quickly and reliably. In essence, CI is about verifying the quality […]