Siloed responsibilities have wreaked havoc on team communications, making it difficult for different departments to have the full context of a situation during firefights. This has not only reduced the quality of communication across entire development teams, it has also created a serious issue that plagues many on the operations side: alert fatigue. Alert fatigue is not just an issue of unhappy team members; it also limits the software delivery chain’s ability to grow.
The great thing about DevOps is that it breaks down communication barriers and streamlines operations. DevOps teams come in two flavors: centralized teams for all applications, which are larger, but still smaller than traditional NOC environments; and decentralized teams, which entail one very small team for each application or core service.
These teams, in addition to being in charge of providing infrastructure, and sometimes the release process, have the burden of keeping production up and running, which is nerve-wracking, time-consuming, and a drag on the entire environment if not done right. No one wants to be on-call, but we do it because we know that a faster mean time to resolution (MTTR) and quick responses to issues make everyone’s upcoming days and weeks much easier, not to mention that they keep the business up and running. However, when being on-call starts to affect a team’s mood and dominate the majority of the operations team’s time, it comes with a huge risk.
Both centralized and decentralized configurations are prone to alert fatigue, each with a slight variation. For the centralized variation, it’s not just fatigue from the number of aggregated alerts across all applications; it’s also hard to know who the proper person is to address the issue, as there is a good chance that it’s not the person on-call. For decentralized setups, alert fatigue simply comes from a high volume of alerts for a small team.
The impact of alert fatigue on DevOps and ITOps teams is four-fold:
- Low morale: If the majority of your time is spent addressing issues, not only are you dealing with incidents all day and night, your time is also spent doing less interesting things. You fall into a cycle of just putting out fires, which can wear down the team’s communication and make it difficult to remain effective.
- Single point of failure: In the centralized scenario, MTTR depends on the speed at which a very limited set of on-call operations folks can respond to an issue and identify the root cause. In a decentralized scenario, the root cause may be identified faster, but there is not enough coverage to triage issues and get them resolved quickly. Moreover, since the call-down list is shorter, there is a greater risk of the issue not being addressed at all. All of this creates a bottleneck and a single point of failure for any issue that arises.
- Opportunity cost: This is the least acknowledged impact of alert fatigue: the cost to the entire team and delivery chain. When your DevOps team is overwhelmed and drained by the alerting process, they are unable to innovate and improve the delivery chain. Because they’re only able to respond, they’re unable to explore better releases and infrastructure automation processes, or to be proactive in preventing future issues. Not only does this prevent improvement, it can add to technical debt, as issues that repeat frequently are never addressed with long-term fixes.
- Slower release cadence: The longer it takes to address issues, the greater the impact on release momentum. How many times has your team postponed a release?
The easy go-to response to managing alert fatigue is to grow the ops team; however, this isn’t necessarily the best option, as this “solution” eventually counters the benefits of having a smaller DevOps team.
There are several other options to consider when combatting alert fatigue:
- Create better escalation policies: Plan. Don’t just set up a call-down list for your team. Plan and consider what the impact on your team’s resources and morale might be. A little bit of strategy here goes a long way. For example, an easy trick is to break up rotations.
- Put QA and developers on-call: This one requires the entire team to be on board, which can be very difficult, but if you add developers and QA teams to the rotation, you gain improved coverage and faster resolution times. Even if it’s in parallel with an ops team member, broader support improves visibility into production issues, which helps developers resolve application-related issues and builds the understanding needed to prevent issues in the future.
- Have detailed incident analytics: Visibility into the effectiveness of the alerting setup allows you to improve it over time and see where your current bottlenecks are. The data will also point out issues that keep repeating. Let the data guide you.
- Allocate time to stopping repeat issues: Spend time identifying issues that were resolved with a quick fix and address them properly so they don’t repeat in the future. Otherwise, the underlying problem will inevitably have to be corrected anyway, along with every subsequent incident it causes, and that is a huge weight on the operations team.
- Standardize notification rules: Don’t let on-call team members arbitrarily set up their own rules. Standardize or templatize the rules so there is consistency and accountability.
- Allow parallel alerts: In addition to the vertical call-down, there can also be horizontal alerts, where multiple team members are notified at once and can attack issues together for faster MTTR.
- Leverage the tools: Incident management tooling helps greatly in fighting alert fatigue. A great incident management solution, like PagerDuty, helps automate alerts and helps you sift through alert noise, so you are not overwhelmed with non-critical alerts. This helps you pinpoint your alerts for more effective on-call operations. Then, if things go ding in the night, you know there is a real issue.
- Write better code: Spending time on quality reduces outages. It’s so simple, yet quality is neglected all too often. Spend more time showing everyone the benefits of better-quality code, better test coverage, better systems testing, and better test automation.
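To make the incident analytics and repeat-issue suggestions above a little more concrete, here is a minimal sketch in Python. It assumes a hypothetical `Incident` record with an alert key and trigger/resolve timestamps; the field names, the helper functions, and the sample data are all illustrative assumptions, not the export format or API of PagerDuty or any other tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    alert_key: str            # identifies the alert, e.g. "checkout-api/high-latency"
    triggered_at: datetime
    resolved_at: datetime


def mean_time_to_resolution(incidents):
    """Return the average of (resolved_at - triggered_at) across incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved_at - i.triggered_at for i in incidents), timedelta(0))
    return total / len(incidents)


def repeat_offenders(incidents, min_count=2):
    """Return (alert_key, count) pairs that fired at least min_count times,
    most frequent first. These are candidates for a long-term fix."""
    counts = Counter(i.alert_key for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Made-up sample data, purely for illustration.
t0 = datetime(2024, 1, 1, 2, 0)
sample = [
    Incident("checkout-api/high-latency", t0, t0 + timedelta(minutes=30)),
    Incident("db/disk-full", t0 + timedelta(hours=1),
             t0 + timedelta(hours=1, minutes=10)),
    Incident("checkout-api/high-latency", t0 + timedelta(days=1),
             t0 + timedelta(days=1, minutes=20)),
]

print(mean_time_to_resolution(sample))  # 0:20:00
print(repeat_offenders(sample))         # [('checkout-api/high-latency', 2)]
```

Even a report this small, run weekly, may be enough to show whether MTTR is trending in the right direction and which alerts deserve time for a permanent fix rather than another quick patch.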
All of this is part of a broader strategy to optimize operations performance, and it benefits everyone. Alert fatigue is real, and it’s impacting not only your DevOps and ITOps teams’ happiness, but also the entire development team’s ability to innovate and get better at releasing code.