This is the second in a series of posts on increasing overall availability of your service or system.
In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failures (MTBF) and mean time to recovery (MTTR). Increasing MTBF and reducing MTTR are both important, but reducing MTTR is arguably easier. It doesn’t take months of engineering work and capital expenditure to see results; it can often be achieved incrementally with some additional tools, procedures, and processes.
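To make the MTTR point concrete, here is a small sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The failure and recovery figures are hypothetical numbers chosen for illustration, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: one failure every 30 days (720 hours) on average.
before = availability(720, 2.0)  # recoveries take two hours
after = availability(720, 0.5)   # recoveries take thirty minutes

print(f"availability with 2h MTTR:   {before:.4%}")
print(f"availability with 0.5h MTTR: {after:.4%}")
```

MTBF is unchanged between the two calls; only the recovery time shrinks, and availability still goes up.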
In this post, we’ll talk about things you can do today to help reduce MTTR and effectively increase your availability.
During any outage, minutes matter. Depending on the business, every additional minute of downtime could result in lost revenue, lost customer trust, or worse. To avoid wasting these precious minutes, and thereby directly increasing MTTR, a ‘bias for action’ attitude should be cultivated within your team.
What ‘bias for action’ means in an operational situation: if at any point you have a hypothesis about what is causing your outage, and you or someone else comes up with an idea that might help fix the problem, just do it. Just give it a shot.
This attitude will help prevent indecision paralysis from gripping you when you face a really bad operational problem without enough research and data to make a completely informed and well-thought-out decision on what approach to try. Making the perfect fix 2 hours into an outage is almost always less of a win than making an imperfect but helpful fix 15 minutes in. And who knows: one of the first things you try might actually fix the problem completely. There’s only one way to know for sure: just try it. An outage is not the time for being risk-averse.
An important factor in cultivating this operational attitude within your organization is to not penalize people for making mistakes or taking risks while trying to fix a large operational problem. Bouncing the entire fleet of frontends made the problem worse for a little while? Oh well; it was worth a try. We won’t try that solution as quickly next time.
Of course, while you should take risks and try things, there’s no point being stupid about it. Before truncating that database table, make sure you have a backup. Before running that update statement across 12 thousand rows, have someone double-check the SQL for you, and break it up into multiple transactions if you can. And weigh the magnitude of the problem against the fix that you’re attempting: outages are one thing, but if your systems are instead only working at reduced capacity or functionality, you might want to hold off on those radical fixes where the potential downside for messing up is very large or catastrophic.
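As a rough sketch of the "break it up into multiple transactions" advice, the example below applies an update in small batches so that a mistake or failure only affects one batch at a time. It uses Python's built-in sqlite3 module with an in-memory database; the `jobs` table, its `status` column, and the `update_in_batches` helper are all hypothetical names invented for this illustration.

```python
import sqlite3


def update_in_batches(conn, ids, batch_size=500):
    """Apply an UPDATE in small transactions instead of one huge one.

    Returns the total number of rows updated.
    """
    total = 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        # The connection context manager commits this batch on success
        # and rolls back only this batch if the statement raises.
        with conn:
            cur = conn.execute(
                f"UPDATE jobs SET status = 'pending' WHERE id IN ({placeholders})",
                batch,
            )
            total += cur.rowcount
    return total


# Hypothetical data: 12,000 rows stuck in a bad state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO jobs (id, status) VALUES (?, 'stuck')",
    [(i,) for i in range(12000)],
)
conn.commit()

updated = update_in_batches(conn, list(range(12000)))
print(f"updated {updated} rows")
```

Stopping after the first batch and checking the result before continuing is a cheap way to get the double-check the paragraph above recommends.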
So it is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss. – Sun Tzu
You and your team should be very familiar with the common problems – enemies – that your service or system faces on a day-to-day basis. I don’t mean the most catastrophic and exotic potential problems that you might face, but the real and everyday failure modes that your system has encountered in the past, and almost certainly will encounter in the future.
You know what kind of problems I’m talking about: that scaling issue that hasn’t yet been licked, that fussy legacy service that occasionally flakes out, that database that has the nasty habit of just seizing up, or those specific sets of circumstances that kick off a message storm. Whatever the failure mode, if it’s something that crops up frequently in your system, everyone on your team should know (or be trained in) how to handle it.
Yes, you’re probably working on fixing the root cause of the problem (and if not, you should be), and you’re hoping it’ll soon be a distant unhappy memory. But many of your most chronic problems can’t be easily and completely vanquished – or you’d have done it long ago – and some legacy systems are difficult to change. So you should still have documented and detailed procedures for how to face these problems, and this documentation should be easily accessible to your team during future incidents. An internal wiki is a great place for this Emergency Operations Guide.
You and your team should also be very familiar with the tools you have at your disposal to be able to understand the state of your systems.
I’ll start with the most obvious point: you need to know there is a problem in order to fix it. So you need monitoring. Lots of it.
Monitor at the host level: CPU, free memory, swap usage, disk space, disk IOPS, network I/O, or whatever host-level attributes are important for the system in question. There are plenty of great monitoring solutions available for this.
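A real monitoring product will do far more than this, but as a minimal sketch of what a host-level check boils down to, the example below samples disk usage and load average with Python's standard library and compares them to thresholds. The threshold values and the `host_alerts` and `sample_host` helpers are assumptions made for illustration; `os.getloadavg` is only available on Unix-like systems.

```python
import os
import shutil


def host_alerts(disk_used_pct, load_per_cpu, max_disk_pct=90.0, max_load=1.5):
    """Return a list of human-readable alerts for values over their thresholds."""
    alerts = []
    if disk_used_pct >= max_disk_pct:
        alerts.append(f"disk usage {disk_used_pct:.1f}% >= {max_disk_pct:.1f}%")
    if load_per_cpu >= max_load:
        alerts.append(f"load per CPU {load_per_cpu:.2f} >= {max_load:.2f}")
    return alerts


def sample_host(path="/"):
    """Sample disk usage (percent) and 1-minute load average per CPU."""
    usage = shutil.disk_usage(path)
    disk_used_pct = 100.0 * usage.used / usage.total
    load_1m = os.getloadavg()[0]  # Unix-only
    return disk_used_pct, load_1m / (os.cpu_count() or 1)


if __name__ == "__main__":
    for alert in host_alerts(*sample_host()):
        print("ALERT:", alert)
```

Keeping the threshold logic separate from the sampling makes it easy to test the alert rules without touching a real host.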
Monitor at the application level: Set up monitors to check and report on various application-level health metrics of your system, like request latency, throughput, processing delays, queue sizes, error rates, database performance, end-to-end performance, etc. Set up log scans that continuously monitor your service logs looking for bad signs. If you have an externally-facing system, set up an external monitor to check whether or not your system/website is up, accepting requests, and healthy.
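The log scan idea can be sketched in a few lines. The example below flags lines that match a set of "bad sign" patterns and decides whether there are enough of them to alert on. The patterns, the threshold, and the sample log lines are all hypothetical; a real scanner would tail a live log file and use patterns tuned to your own service.

```python
import re

# Hypothetical "bad sign" patterns; tune these to your own service's logs.
BAD_SIGNS = re.compile(r"\b(?:ERROR|FATAL|Traceback)\b")


def scan_log(lines, threshold=3):
    """Return (should_alert, matching_lines) for an iterable of log lines."""
    hits = [line.rstrip("\n") for line in lines if BAD_SIGNS.search(line)]
    return len(hits) >= threshold, hits


# Made-up sample log lines for illustration only.
sample = [
    "12:00:01 INFO request served in 12ms\n",
    "12:00:02 ERROR upstream timeout\n",
    "12:00:03 ERROR upstream timeout\n",
    "12:00:04 FATAL worker exited\n",
]

should_alert, hits = scan_log(sample)
print(should_alert, len(hits))
```

Because `scan_log` accepts any iterable of strings, the same function works on an open file handle as well as on a list.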
Know how to use your monitoring tools. These tools are awesome resources during a failure situation to quickly and visually figure out what’s going wrong, but if your team doesn’t know how to use them or their UI (or even how to log in) then they’re not going to be of much use. Add links to interesting/useful monitoring graphs and charts in your Emergency Operations Guide mentioned above, for quick access.
But these monitoring systems aren’t of much use if nobody is listening to them. You should – shameless plug! – use a system like PagerDuty to bridge the gap between the monitoring systems and your on-call staff (the people who will actually fix the problem) as well as organize on-call schedules and escalation policies. Another – more expensive – option would be something like a NOC. I won’t harp too much on this point, but there’s no reason for there to be any delay whatsoever between your monitoring systems realizing there’s a problem, and you realizing there’s a problem.
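To illustrate what an escalation policy means in practice, here is a toy sketch that picks who should be paged based on how long an alert has gone unacknowledged. This is not PagerDuty's implementation or API; the `who_to_page` helper, the policy format, and the responder names and delays are all invented for this example.

```python
def who_to_page(policy, minutes_unacknowledged):
    """Return the responder for the latest escalation level reached.

    `policy` is a list of (delay_minutes, responder) pairs sorted by delay.
    Returns None if no level has been reached yet.
    """
    current = None
    for delay_minutes, responder in policy:
        if minutes_unacknowledged >= delay_minutes:
            current = responder
    return current


# Hypothetical policy: escalate every 15 minutes without an acknowledgement.
policy = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
]

for minutes in (0, 20, 45):
    print(minutes, "->", who_to_page(policy, minutes))
```

The point of the structure is that an unacknowledged alert never just sits there: after each delay, someone else is brought in.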
I’ll follow up soon with another post detailing some more of my favorite tips.