This is a guest post by Ilan Rabinovitch, Director of Product Management at Datadog. The convergence of rapid feature development, automation, continuous delivery, and the shifting...by Ilan Rabinovitch
August 24, 2017
It generally pays to look beyond labels, such as “incident management” (which usually means much more than receiving and responding to alerts). Consider, for example, the relationship between incidents and technical debt. It is a relationship that most software professionals probably haven’t even thought about, but it exists, and it is more than just a passing acquaintance.
Although new or recently revised code accounts for the majority of software errors, when you trace the problems caused by those changes, the trail very often leads back to old patches of code containing technical debt.
This shouldn’t really be surprising. Technical debt is, by definition, code that contains built-in problems — in design, execution, integration with the rest of the program, and, very often, a combination of these factors. Later changes to code that interacts with technical debt, either directly or indirectly, can expose or amplify those problems.
Why? Consider the conditions under which programmers are likely to add technical debt. Typically, there’s a problem that needs to be taken care of quickly, and speed matters more than taking care of the issue the right way. It may be an emergency bug fix, a change to accommodate an operating system update, new features added under a tight deadline, code from another source being patched in, or simply a quick workaround to accommodate previous technical debt. When the code is added, it’s cleaned up and debugged to the point where it doesn’t cause any errors, but it isn’t up to contemporary standards for design or coding. That’s why it’s technical debt, and not just new code.
This means that it isn’t likely to be bulletproof, and its bug fixes and error handling are likely to be improvised and patched together. It’s like building a bridge with a badly designed truss or weak girders. The problem spots may be OK at first, but with added traffic or later structural changes, the probability of failure is likely to increase. In the same way, later revisions of your software may stress the parts of your code that contain technical debt beyond their limits.
Where does incident management come in? While not all incidents require analysis and revision of source code, many of them do. The point at which code is being revised is also the most obvious time to eliminate any technical debt that it contains. Even when the incident response itself doesn’t require any changes to the software, it can result in the discovery of previously unrecognized debt, which can then be scheduled for revision. Incident management can also serve as a warning and detection system for underlying problems in software design and coding. Repeated problems involving the same block of code are a good indication of problems with the code itself.
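As a rough illustration of that last point, the sketch below counts incidents per area of the codebase and flags the areas that recur. It assumes each incident record has already been tagged with the code area involved; the `Incident` class, the `code_area` field, the sample paths, and the threshold of three are all hypothetical choices made for this example, not part of any particular incident management tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: str
    code_area: str  # hypothetical tag, e.g. a module or file path


def repeat_offenders(incidents, threshold=3):
    """Return (code_area, count) pairs with at least `threshold` incidents,
    most frequent first."""
    counts = Counter(incident.code_area for incident in incidents)
    return [(area, n) for area, n in counts.most_common() if n >= threshold]


# Illustrative data only; real records would come from your incident history.
incidents = [
    Incident("INC-1", "billing/invoice.py"),
    Incident("INC-2", "auth/session.py"),
    Incident("INC-3", "billing/invoice.py"),
    Incident("INC-4", "billing/invoice.py"),
    Incident("INC-5", "auth/session.py"),
]

flagged = repeat_offenders(incidents, threshold=3)
for area, count in flagged:
    print(f"{area}: {count} incidents, review for technical debt")
```

A raw count like this is only a starting signal. An area that is simply changed more often will also accumulate more incidents, so flagged areas still need a human review before being labeled as debt.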
If technical debt is currently (or potentially) a significant issue with your software, you may want to adopt an overall policy and a formal framework for the elimination of technical debt. A technical debt policy could cover the following general areas:
The framework for carrying out such a policy might include components such as these:
There are several points at which such a framework would benefit from being tied in with an incident management system, particularly by means of that system's API. For example, incident reports could be exported to the application used to map debt, both to correlate incidents with known problem areas and to map newly identified technical debt. Incident management tool APIs can also be used to log incidents involving technical debt and to automatically generate work orders for remedying that debt. Those tools could also be used to alert the developers responsible for handling technical debt in specified areas of the code.
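The correlation and work-order steps described above might look something like the following sketch. It works on plain in-memory data rather than calling any real incident management API: the `DebtItem` and `WorkOrder` classes, the dictionary-based debt map, and the incident fields `id` and `code_area` are all assumptions made for illustration. In practice the incidents would be fetched through whatever export or API your tool provides, and the work orders would be sent to your ticketing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    code_area: str    # hypothetical key into the debt map
    owner: str        # developer or team responsible for this debt
    description: str


@dataclass(frozen=True)
class WorkOrder:
    incident_id: str
    code_area: str
    assignee: str
    summary: str


def work_orders_for(incidents, debt_map):
    """Match incidents against known debt.

    Returns (orders, unmapped): a work order for every incident that touches
    a mapped debt item, and the IDs of incidents with no known debt, which
    are candidates for newly identified debt.
    """
    orders, unmapped = [], []
    for incident in incidents:
        item = debt_map.get(incident["code_area"])
        if item is None:
            unmapped.append(incident["id"])
            continue
        orders.append(WorkOrder(
            incident_id=incident["id"],
            code_area=item.code_area,
            assignee=item.owner,
            summary=f"Remediate debt: {item.description}",
        ))
    return orders, unmapped


# Illustrative data only.
debt_map = {
    "billing/invoice.py": DebtItem(
        "billing/invoice.py", "billing-team", "hand-rolled date parsing"),
}
incidents = [
    {"id": "INC-7", "code_area": "billing/invoice.py"},
    {"id": "INC-8", "code_area": "search/index.py"},
]

orders, unmapped = work_orders_for(incidents, debt_map)
for order in orders:
    print(f"Notify {order.assignee}: {order.summary} (from {order.incident_id})")
print("Review for unrecognized debt:", unmapped)
```

Keeping the unmatched incidents in a separate list reflects the second use the text mentions: incidents that do not correlate with known problem areas are the ones most likely to reveal debt that has not yet been mapped.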
Such a framework makes it possible to eliminate technical debt incrementally as part of a system for incident management and response, and it provides an automated method of ensuring that technical debt is dealt with. Incident management is a key aspect of the framework, providing tools for detecting debt-related problems, alerting responsible parties, and scheduling code revisions to fully eliminate technical debt. That way, the debt won't simply wind up being kicked a little farther down the road.