PagerDuty Blog

Reducing Technical Debt With Incident Management

It generally pays to look beyond labels, such as “incident management” (which usually means much more than receiving and responding to alerts). Consider, for example, the relationship between incidents and technical debt. It is a relationship that most software professionals probably haven’t even thought about, but it exists, and it is more than just a passing acquaintance.  

Although new or recently revised code accounts for the majority of software errors, when you trace the problems caused by changes in code, they will very often lead to old patches of code containing technical debt.  

This shouldn’t really be surprising. Technical debt is, by definition, code that contains built-in problems — in design, execution, integration with the rest of the program, and, very often, a combination of these factors.  Later changes to code that interacts with technical debt, either directly or indirectly, can expose or amplify those problems.  

Why? Consider the conditions under which programmers are likely to add technical debt.  Typically, there’s a problem that needs to be taken care of quickly, and speed matters more than taking care of the issue the right way.  It may be an emergency bug fix, a change to accommodate an operating system update, new features added under a tight deadline, code from another source being patched in, or simply a quick workaround to accommodate previous technical debt.  When the code is added, it’s cleaned up and debugged to the point where it doesn’t cause any errors, but it isn’t up to contemporary standards for design or coding.  That’s why it’s technical debt, and not just new code.

This means that it isn’t likely to be bulletproof, and its bug fixes and error handling are likely to be improvised and patched together. It’s like building a bridge with a badly designed truss or weak girders. The problem spots may be OK at first, but with added traffic or later structural changes, the probability of failure is likely to increase.  In the same way, later revisions of your software may stress the parts of your code that contain technical debt beyond their limits.

Incident Management and Technical Debt

Where does incident management come in?  While not all incidents require analysis and revision of source code, many of them do. The point at which code is being revised is also the most obvious time to eliminate any technical debt that it contains.  Even when the incident response itself doesn’t require any changes to the software, it can result in the discovery of previously unrecognized debt, which can then be scheduled for revision. Incident management can also serve as a warning and detection system for underlying problems in software design and coding. Repeated problems involving the same block of code are a good indication of problems with the code itself.  
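As a rough illustration of that last point, here is a minimal sketch of how repeated incidents might be tallied by code area. It assumes each incident record carries a hypothetical `code_area` field and uses an arbitrary threshold of three; neither comes from any particular incident management tool.

```python
from collections import Counter


def find_hotspots(incidents, threshold=3):
    """Return code areas that appear in at least `threshold` incidents.

    `incidents` is a list of dicts with a hypothetical "code_area" key.
    """
    counts = Counter(incident["code_area"] for incident in incidents)
    return {area: n for area, n in counts.items() if n >= threshold}


# Made-up sample data, for illustration only.
incidents = [
    {"id": 1, "code_area": "billing/invoice.py"},
    {"id": 2, "code_area": "auth/session.py"},
    {"id": 3, "code_area": "billing/invoice.py"},
    {"id": 4, "code_area": "billing/invoice.py"},
]

print(find_hotspots(incidents))  # {'billing/invoice.py': 3}
```

Areas that keep showing up in this kind of tally are good candidates for a closer look at the code itself.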

Technical Debt Policy

If technical debt is currently (or potentially) a significant issue with your software, you may want to adopt an overall policy and a formal framework for the elimination of technical debt.  A technical debt policy could cover the following general areas:

  • Identifying and mapping technical debt
  • Guidelines for identifying and remedying technical debt
  • Coding standards

A Framework for Handling Technical Debt

The framework for carrying out such a policy might include components such as these:

  • Mapping out known areas of technical debt within your source code.  Such a map would, of course, be subject to change, both as new debt is discovered, and as known debt is removed.  This would require an in-house definition of technical debt specifically designed to let all parties involved recognize it and distinguish it from acceptable variations in coding style.
  • Procedures for logging incidents involving new and known technical debt.  The log itself should cover such things as the time/date of discovery, a basic description of the debt, and the response (fix, schedule for later, leave in place), plus follow-ups.  Key parties (project managers, developers, etc.) will need to understand their responsibilities with regard to manual logging. Some logging (e.g., the initial incident alert) can probably be automated.
  • Guidelines covering when to remedy the debt as part of the incident response, and when to report the debt and schedule future remediation.  Ideally, of course, any existing code problems, including technical debt, would be handled on the spot as part of the incident response.  In practice, however, there are many situations where that simply isn’t possible.  The urgency of the incident may not leave any time to address anything other than the immediate problem.  It is important to have a formal system in place not only for logging technical debt, but also for scheduling revision of the affected code specifically to get rid of that debt.
  • A set of formal code standards with particular reference to technical debt. This involves learning how to recognize technical debt, as well as learning the standards to apply when remedying it.  Standards may need to include guidelines for handling difficult problems of code design. Since technical debt is often the result of an attempt to work around design trouble spots, any real remedy will have to address those problems in a way that is systematic and in keeping with the application’s basic design standards.
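The logging and triage pieces of this framework could be sketched roughly as follows. The field names mirror the log contents listed above (time of discovery, description, response, follow-ups), but the class names and the decision rule in `choose_response` are assumptions made for illustration; a real policy would define its own criteria.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List


class Response(Enum):
    FIX = "fix"
    SCHEDULE = "schedule for later"
    LEAVE = "leave in place"


@dataclass
class DebtLogEntry:
    """One log record for technical debt found during an incident."""
    code_area: str
    description: str
    response: Response
    discovered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    follow_ups: List[str] = field(default_factory=list)


def choose_response(incident_is_urgent, fix_is_small, worth_fixing=True):
    """Hypothetical guideline: fix on the spot only when there is time
    and the fix is small; otherwise schedule it for later."""
    if not worth_fixing:
        return Response.LEAVE
    if incident_is_urgent or not fix_is_small:
        return Response.SCHEDULE
    return Response.FIX


# Made-up example: debt found during an urgent incident gets scheduled.
entry = DebtLogEntry(
    code_area="billing/invoice.py",
    description="Hard-coded workaround left over from an emergency fix",
    response=choose_response(incident_is_urgent=True, fix_is_small=True),
)
entry.follow_ups.append("Revision scheduled for the next maintenance window")
print(entry.response.value)  # schedule for later
```

The point of the explicit `SCHEDULE` outcome is that debt spotted during an urgent incident still ends up in the log with a planned follow-up, rather than being forgotten once the alert clears.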

There are several points at which such a framework would benefit from being tied in with an incident management system, particularly by means of such a system’s API.  For example, incident reports could be exported to the application used to map debt, both to correlate incidents with known problem areas and to map newly identified technical debt.  Incident management tool APIs can also be used to log incidents involving technical debt, and automatically generate work orders for remedying that debt. Those tools could also be used to alert developers who have the responsibility of handling technical debt in specified areas of the code.
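Here is a minimal sketch of that correlation step, assuming incident reports have already been exported as plain dictionaries. The debt map structure, the field names, and the work order format are all hypothetical; they do not represent any specific tool's API.

```python
def correlate(incident_reports, debt_map):
    """Match exported incident reports against a map of known debt.

    Returns (work_orders, unmapped): work orders for incidents that
    touch known debt, and incidents in areas not yet on the map, which
    may point to newly identified debt.
    """
    work_orders, unmapped = [], []
    for report in incident_reports:
        debt = debt_map.get(report["code_area"])
        if debt is None:
            unmapped.append(report)
        else:
            work_orders.append({
                "incident_id": report["id"],
                "code_area": report["code_area"],
                "debt": debt["description"],
                "notify": debt["owner"],  # developer responsible for this area
            })
    return work_orders, unmapped


# Made-up sample data, for illustration only.
debt_map = {
    "billing/invoice.py": {
        "description": "Patched-in legacy rounding logic",
        "owner": "billing-team",
    },
}
reports = [
    {"id": 101, "code_area": "billing/invoice.py"},
    {"id": 102, "code_area": "search/index.py"},
]

work_orders, unmapped = correlate(reports, debt_map)
print(len(work_orders), len(unmapped))  # 1 1
```

In this sketch, the `notify` field stands in for the alerting step, and the unmapped incidents are the ones someone would review before adding new entries to the debt map.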

Such a framework makes it possible to incrementally eliminate technical debt as part of a system for incident management and response, and provides an automated method of ensuring that technical debt is dealt with. Incident management is a key aspect of the framework, providing tools for detecting debt-related problems, alerting responsible parties, and scheduling code revisions to fully eliminate technical debt. It ensures that the debt won’t simply wind up being kicked a little farther down the road.