Turn any signal into insight and action. See how PagerDuty Digital Operations Management Platform integrates machine data and human intelligence to improve visibility and agility across organizations.
Check out the latest capabilities we released.
Flexible schedules, escalations, & alerting
Automated, best practice incident response
Powerful context & noise reduction at scale
Quantify real-time business & technical impact
Improve with modern, prescriptive insights
Over 300 Integrations
Discover DevOps best practices with our library of webinars, whitepapers, reports, and much more.
Learn best practices and get support help with resources from our award-winning support team.
See how PagerDuty works with our live product demo — twice a week, every week.
We've created a maturity model to assist on the journey to digital operations excellence. Take our short assessment to find out where your team falls!
Interactive, simple-to-use API and technical documentation enables users to easily try updates and extend PagerDuty.
Engage with users and PagerDuty experts from our global community of 200k+ users. Become a member, connect, and share insights for success.
Get all your PagerDuty-related questions answered by exploring our in-depth support documentation and community forums.
This month is a big month for PagerDuty—we turned 10 on February 18! I never imagined we’d reach this milestone, honestly. A lot of Dutonians...
PagerDuty helps organizations transform their digital operations. Learn more about PagerDuty's mission and what we do.
Meet our experienced and passionate executive team.
We are risk-taking innovators dedicated to delivering amazing products and delighting customers. Join us and do the best work of your career.
With the PagerDuty Foundation, we are committed to doing our part in giving back to the community.
According to a roundup by Gartner, the average cost of downtime for an enterprise is $5,600 per minute. While the data collected was from incredibly large companies, the cost of downtime for even small startups is no laughing matter.
Let’s assume, for the sake of simplicity, that your core product is a web app that relies solely on organic sales, totaling $1 million in revenue a year. This amounts to about $2 in lost revenue per minute. This doesn’t sound like too much in the grand scheme of things, but revenue is only a small part of your downtime costs. We also must consider wasted operating costs.
Employees’ time and productivity, too, are wasted during downtime. If, for example, you pay $500,000 a year in employee costs, that’s an additional $1 in lost revenue per minute. If you’re keeping track, we’re now at $3 in cost per minute.
That’s $180 an hour. $4,320 a day.
Adds up quickly, doesn’t it? Now we’ve accounted for employee costs and lost revenue, but what about other wasted expenses? Every unused piece of your architecture results in additional losses during downtime. Unused servers and third-party services can simply sit around while your team is working on a fix, and the fix itself could result in necessary additional (and costly) resources.
Depending on how critical your product is to your customers’ businesses, downtime could not only cost you money, but also your customers’ trust. It’s difficult to justify the cost of paying an unreliable vendor, so while one outage is easily survivable, the loss of faith in your product is compounded with every subsequent occurrence.
Ultimately, by understanding the causes of outages, you can maximize your chances of preventing them. The causes can be boiled down to a few categories — human error, third-party service outage, or a highly unpredictable “black swan” occurrence.
One of the most common causes of downtime that I’ve personally seen is human error. Regardless of if a developer committed broken code, or an administrator updated an untested package, when procedure isn’t followed or an obscure system bug isn’t accounted for, product uptime will suffer. Establishing a system of checks and balances within an organization is the best solution to this problem. Code reviews, unit tests, quality assurance, proper planning, and clear communication all go a long way in preventing downtime that is definitely avoidable.
Sometimes downtime isn’t caused internally, however. From time to time, even cloud providers like Amazon AWS go down. There is very little an organization can do when this happens (at least not without a proper plan in place). To combat this, I’m a fan of Netflix’s Chaos Monkey system. For the uninitiated, Chaos Monkey is a system whose sole job is to kill off random services within a product’s architecture. This forces the system to be self-repairing, and trains the team to handle outages effectively when they really matter. PagerDuty conducts its own Failure Fridays as well!
While occasional downtime is completely unavoidable (even Facebook goes down from time to time), how you handle and prepare for it will determine just how much of an impact it will have on your organization. Because every minute of downtime means additional costs, establishing workflows to prevent or reduce the length of an outage is crucial. Solutions like PagerDuty accelerate real-time incident resolution by notifying and getting everyone on the same page as soon as possible, and providing a platform for surfacing context to fix the issue. By aggregating all your event data and optimizing communication, it becomes far easier to identify root cause of an outage, and resolve issues efficiently and accurately.
It’s important to remember that improving communication externally is just as important as improving it internally. Communicating information about an outage to your customers early and clearly goes a long way to maintaining trust and credibility with them. Through the use of tools like StatusPage and StatusCast, as well as PagerDuty’s Stakeholder Engagement, organizations can better orchestrate the real-time business and external-facing response, and use status pages to provide valuable transparency into the health of a product. Personally, I find nothing more distrustful than an organization that remains quiet through a crisis. Their silence feels like an attempt at hiding something.
All of these solutions are great, but it’s important to understand that an indispensable part of managing unexpected downtime is to make sure there are always people on hand to fix the issue. This can be easily accomplished by establishing an on-call rotation amongst your engineers. An effective on-call rotation is a minimal investment that can help increase product reliability as well as maintain accountability, better service delivery, and improved work-life balance for your team. Without an on-call rotation, every outage turns into an “all hands” event, which is disruptive to the personal lives of every employee. On the flip side, a clearly defined on-call schedule and escalation policies means that workloads are balanced, and there is always a dedicated subject matter expert that is ready to fix an issue or drive collaboration for resolution as needed.
In the end, the best way to plan for (and mitigate) downtime is to invest in your resources and your team. Not every solution mentioned here is right for every organization, but the cost of doing nothing is significantly higher than the cost of doing something. When you have an established process for handling outages, it won’t matter if it was caused by a hacker or a power outage. You and your team will be prepared to handle it.
According to a roundup by Gartner, the average cost of downtime for an enterprise is $5,600 per minute. While the data collected was from incredibly...
600 Townsend St., #200
San Francisco, CA 94103
905 King Street West, Suite 600
Toronto, ON, M6K 3G9, Canada
1416 NW 46th St., St. 301
Seattle, WA 98107
5 Martin Place
1 Fore St,
London EC2Y 9DT
© 2009 - 2019