This blog was co-authored by myself and Simon Darken. Once a year, PagerDuty’s SREs get together for a three-day, in-person offsite. With the team spread...by Dave Bresci
December 5, 2018
As a developer, I’m a huge fan of continuous integration. For the uninitiated, continuous integration is a software engineering practice in which code changes are tested as soon as they are committed. This enables early problem detection. It also provides immediate feedback on code quality, allowing for issues to be identified and fixed immediately.
Often mentioned in the same breath as continuous integration is continuous deployment. Continuous deployment is an extension of continuous integration in that, as soon as the unit tests pass, the code is immediately released to production. By continuously integrating and deploying code changes, developers can reduce risk and quickly adapt the product to the business’ needs. This process, while incredibly valuable to both developers and users, does have some room for improvement.
As in a production environment, integrating your team’s incident management process into your continuous integration workflow is an excellent way to improve communication and transparency around failed builds. Sending the details of failed builds to an incident management platform such as PagerDuty can provide a number of benefits that go way beyond the standard notifications many continuous integration tools provide.
One of the biggest benefits that incident management techniques can bring to continuous integration is the concept of the “premortem.” A postmortem, in project management terms, is the process of examining and identifying the elements of a project launch that were successful or unsuccessful. While this process is often done at the successful completion of a project, in terms of incident management, it involves understanding and communicating the root cause of a critical failure.
By integrating this process into the continuous integration workflow, you perform a premortem. With a premortem you are looking for problems in advance. What problem may potentially occur that may disrupt service and trigger an incident. Testing for failures and potential service disrupting incidents during the the continuous integration phase allows teams to evaluate potential threats and make changes before things even reach the end user.
Training junior developers is an important part of running a successful development team. Unfortunately, successful incident management in a production environment can be fast-paced and stressful. While the team can run a postmortem the morning after a late-night issue occurs and is mitigated, the information that can be conveyed during those postmortems might not be as easy to consume by the more novice members of the team. By pushing issues raised in the continuous integration workflow directly into your incident management process, you can provide junior developers with an opportunity to learn more about the process in a significantly less stressful environment.
I’m a big fan of code reviews. I have always found them to be an important tool for the professional growth of every member of the development team, no matter what level of experience they have. By pairing an incident management platform with the continuous integration workflow, you can use the details of failed builds as a supplement to the code review process to identify specific areas of the code that should be discussed in more detail. This is an excellent opportunity to not only improve the code quality, but also keep the importance of the test suite at the top of the team’s mind.
Incident management is often thought of as a reactive process, but by incorporating a continuous integration toolset and mindset, you can turn it into a proactive one. Responding to and mitigating incidents before they actually happen will allow you and your team to stay two steps ahead, which will serve to reduce code debt, and improve the overall stability and reliability of your product.