by George Miranda
June 20, 2019
Becoming more operationally mature isn’t something that happens overnight after implementing some new tools. At its core, it’s a concerted effort in shifting culture so that people can break down communication silos to ship better software. And while tools alone are not enough, they do provide a crucial advantage by applying automation that improves speed and accuracy, and can facilitate collaboration.
The DevOps tools that make sense for your environment may vary significantly depending on your team’s size and specific needs. There’s no right answer to how you should build your toolchain. In fact, we’ll be the first to admit that some of the best workflows and tools we learn from and try to emulate are built in-house by innovative teams from Netflix, Etsy, Dropbox, and others.
The goal of this list is simply to share a few of the most popular tools across each stage of the software delivery lifecycle, some of which we also use internally. We hope that you find it a useful reference as you evaluate the right tools for your own team.
The waterfall approach of planning out all the work and every single dependency for a release runs counter to DevOps. Instead, with an agile approach, you can reduce risk and improve visibility by building and testing software in smaller chunks and iterating on customer feedback. That’s why many teams, including our own, plan and release in sprints of around two to four weeks.
As your team is sharing ideas and planning at the start of each sprint, take into account feedback from the previous sprint’s retrospective so the team and your services are continuously improving. Tools can help you centralize learnings and ideas into a plan. To kick off the brainstorming process, you can identify your target personas, and map customer research data, feature requests, and bug reports to each of them. We like to use physical or digital sticky notes during this process to make it easier to group together common themes and construct user stories for the backlog. By mapping user stories back to specific personas, we can place more of an emphasis on the customer value delivered. After the ideation phase, we organize and prioritize the tasks within a tool that tracks project burndown and daily sprint progress.
Top tools and ones we use include: Active Collab, Pivotal Tracker, VersionOne, Jira, Trello, StoriesOnBoard
After you’ve written some code, there are a few things that need to happen before getting it into staging. Get it code reviewed and approved, merge it into the master branch in a version control repository, and run any local tests as needed.
Top tools: GitHub, Bitbucket, Gerrit, GitLab
Now it’s time to automate the execution of tasks related to building, testing, and releasing code. Before the build can get deployed, it needs to undergo a number of tests to ensure that it’s safe to push to production: unit tests, integration tests, functional tests, acceptance tests, and more. Tests are a great way to ensure that existing pieces of your codebase continue to function as expected when new changes are introduced. It’s important to have tests that run automatically whenever there’s a new pull request. This minimizes errors that escape because of manual oversight, reduces the cost of performing reliable tests, and exposes bugs earlier.
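As a minimal sketch of the kind of unit test a CI server could run on every pull request, the example below uses Python's built-in unittest module. The apply_discount function is a hypothetical stand-in for application code, and run_suite is an illustrative helper; neither comes from any of the tools listed here.

```python
import unittest


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical application code: return the price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    def test_regular_discount(self):
        self.assertEqual(apply_discount(1000, 25), 750)

    def test_zero_discount_keeps_price(self):
        self.assertEqual(apply_discount(1000, 0), 1000)

    def test_invalid_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 150)


def run_suite() -> bool:
    """Run the tests and report overall success, much as a CI job reports pass or fail."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A CI job would typically run a suite like this on each pull request and block the merge when any test fails.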
There are also a number of great open source and paid tools that do useful things once the tests are complete, like automatically picking up changes to the master branch and pulling down dependencies from a repository to build new packages.
Top tools include: Jenkins, GoCD, Maven, CruiseControl, Travis CI, CircleCI
With the advent of Docker and containers, teams can now easily provision lightweight, consistent, and disposable staging and development environments without needing to spin up new virtualized operating systems.
Containers standardize how you package your application together with its dependencies, which reduces storage overhead compared with full virtual machines, improves flexibility, and makes it easier to ship changes faster. This also enables your application to run anywhere a container runtime is available. In other words, things should behave in production the same way they did when you made the changes on your laptop.
Top tools include: Docker, Kubernetes, Mesos, Nomad
With configuration management, you can track changes to your infrastructure and maintain a single source of system configuration. Look for a tool that makes it easy to version control and make replicas of images — i.e. anything you can take a snapshot of like a system, cloud instance, or container. The goal here is to ensure standardized environments and consistent product performance. Configuration management also helps you better identify issues that resulted from changes, and simplifies autoscaling by automatically reproducing existing servers when more capacity is needed.
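To illustrate how a single source of configuration helps you identify issues that resulted from changes, here is a small sketch that compares a desired configuration with what is actually running. The config_drift function and the setting names in the usage note are hypothetical; real configuration management tools do far more than this.

```python
from typing import Any, Dict, Tuple


def config_drift(
    desired: Dict[str, Any], actual: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (desired_value, actual_value)} for every setting that differs.

    A key that is missing on one side is reported with None in its place.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift
```

For example, comparing a desired max_connections of 1024 against a server that reports 512 would return that key with both values, which gives you a concrete starting point when an issue appears after a change.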
Top tools include: Chef, Ansible, Puppet, SaltStack
Release automation tools enable you to automatically deploy to production. They should include capabilities such as automated rollbacks, copying artifacts to the host before starting the deployment, and, especially if you’re a larger organization, an agentless architecture, so that you don’t have to install agents and configure firewalls on every server instance at scale.
Note that if something passes the tests, it typically gets deployed automatically. One best practice is to perform a canary deployment first that deploys to a subset of your infrastructure, and if there are no errors, then do a fleet-wide deploy.
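The canary flow can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular release tool works: the deploy, is_healthy, and rollback callables are hypothetical hooks you would supply, and the 10% canary fraction is an arbitrary default.

```python
from typing import Callable, Sequence


def canary_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    canary_fraction: float = 0.1,
) -> bool:
    """Deploy to a canary subset first; continue fleet-wide only if it stays healthy.

    Returns True if the whole fleet was deployed, False if the canary was rolled back.
    """
    if not hosts:
        return True
    # Always pick at least one canary host.
    canary_count = max(1, int(len(hosts) * canary_fraction))
    canaries, rest = hosts[:canary_count], hosts[canary_count:]

    for host in canaries:
        deploy(host)

    if not all(is_healthy(host) for host in canaries):
        # Undo the canary and leave the rest of the fleet untouched.
        for host in canaries:
            rollback(host)
        return False

    for host in rest:
        deploy(host)
    return True
```

Passing the hooks in as arguments keeps the sketch independent of any specific deployment tool and makes the control flow easy to test with fakes.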
A lot of teams also use chat-based deployment workflows, using bots to deploy with simple commands, so everyone can easily see deployment activity and learn together.
Top tools include: Bamboo, Puppet
It can be really helpful to set up release dashboards and monitors that help you visualize high-level release progress and the status of requirements. It’s also key to understand whether services are healthy and if there are any anomalies before, during, and after a deploy. Make sure you are notified in real time of key events that take place on your continuous integration servers so you know if there’s a failed build, or know to hold or roll back a deploy.
Top tools include: Datadog, Elastic Stack, PagerDuty
Server monitoring gives you an infrastructure-level view. A lot of teams also use log aggregation to drill down into specific issues. This type of monitoring enables you to aggregate metrics (such as memory, CPU, system load averages, etc.) and understand the health of your servers so that you can take action on issues, ideally before applications — and the customers that use them — are affected.
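As a rough sketch of aggregating a metric and acting on it, the function below averages recent samples per host and returns the hosts whose average exceeds a threshold. The function name, host names, and threshold are all hypothetical; a real monitoring tool would collect the samples for you.

```python
from statistics import mean
from typing import Dict, List


def hosts_over_threshold(
    samples: Dict[str, List[float]], threshold: float
) -> Dict[str, float]:
    """Average each host's samples and return the hosts whose average exceeds threshold.

    Hosts with no samples are skipped rather than treated as healthy or unhealthy.
    """
    averages = {host: mean(values) for host, values in samples.items() if values}
    return {host: avg for host, avg in averages.items() if avg > threshold}
```

Given CPU samples for a few hosts and a threshold of 80, only the hosts averaging above 80 percent come back, which is the list you would want to act on before customers notice.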
Top tools include: Datadog, AWS CloudWatch, Splunk, Nagios, Pingdom, SolarWinds, Sensu
Application performance monitoring provides code-level visibility of the performance and availability of applications and business services, such as your website. This makes it easy to quickly understand performance metrics and meet service SLAs.
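To make the SLA arithmetic concrete, here is a small sketch that turns downtime into an availability percentage and checks it against a target. The function names and the 99.9% figure in the usage note are assumptions for illustration, not values from any specific contract or tool.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the period during which the service was available."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def meets_sla(total_minutes: float, downtime_minutes: float, sla_percent: float) -> bool:
    """True if measured availability is at or above the SLA target."""
    return availability_percent(total_minutes, downtime_minutes) >= sla_percent
```

For a 30-day month (43,200 minutes), 30 minutes of downtime still meets a 99.9% target, while 60 minutes does not.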
Top tools include: New Relic, Dynatrace, AppDynamics
Monitoring tools provide a lot of rich data, but that data isn’t useful if it isn’t routed to the right people who can take the right actions on an issue in real time. To minimize downtime, people must be notified with the right information when issues are detected, have well-defined processes around triage and prioritization, and be enabled to engage in efficient collaboration and resolution.
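The routing idea can be sketched in a few lines. This is a simplified illustration, not PagerDuty's API: the Alert class, the severity levels, the on_call mapping, and the fallback responder are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Alert:
    service: str
    severity: str  # assumed levels: "critical", "warning", or "info"
    summary: str


def route_alert(
    alert: Alert, on_call: Dict[str, List[str]], fallback: str
) -> Optional[str]:
    """Return who to notify for an alert, or None if it is not actionable."""
    if alert.severity == "info":
        # Informational events are suppressed instead of paging anyone.
        return None
    responders = on_call.get(alert.service)
    if not responders:
        # No schedule for this service, so notify the fallback responder.
        return fallback
    # The first person listed is treated as the primary on-call.
    return responders[0]
```

Even a simple mapping like this removes the need to look up a contact directory in the middle of an incident.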
Now that application and performance issues often cost thousands of dollars a minute, orchestrating the right response is highly stressful, but you can’t afford for it to be chaotic. In the middle of a fire, you don’t want to waste half an hour pulling up a contact directory and trying to figure out how to get the right people on a conference bridge.
The good news is, PagerDuty automates the end-to-end incident response process to shave time off of resolving both major customer-impacting incidents and daily operational issues. Here at PagerDuty, everyone from our engineering, support, and security teams to our executives uses our product to orchestrate a coordinated response to IT and business disruptions. We have the flexibility to manage on-call resources, suppress what’s not actionable, consolidate related context, mobilize the right people and business stakeholders, and collaborate with our preferred tools. If you can easily architect exactly what you want your wartime response to look like, you’ll have a lot more peace of mind.
Tools we use: PagerDuty, HipChat, Slack, conferencing tools
When wartime is over, incidents provide a crucial learning opportunity to understand how to improve processes and systems to be more resilient. In accordance with the CAMS pillars of DevOps (Culture, Automation, Measurement, Sharing), it’s important to understand incident response and system performance metrics, and facilitate open dialogue to share successes and failures towards the goal of continuous improvement.
Look for a solution that enables you to streamline postmortems and postmortem analysis so you can prioritize action items for what needs to be fixed. You’ll want to measure the success of a service relative to business goals and customer experience metrics, with tools that help you understand product usage and customer feedback. All of these will feed into the next sprint so that you can accurately plan and prioritize both system and feature improvements, for an even better product and happier customers.
Tools we use: PagerDuty Postmortems, Looker, Pendo, SurveyMonkey
Again, simply investing in tools will not get you from a monolithic to a microservices architecture, or magically result in teams that can perform self-service deployments many times a day. But by bolstering a culture shift with the right tools and processes, you’ll be well on your way to optimizing and continuously improving software delivery and enabling seamless collaboration and trust between everyone responsible for it.
With that, we wish you success in exploring and finding the right tools for you! And check out these resources if you’re interested in learning more about which tools we use internally to maximize inclusivity, and how to accelerate DevOps best practices.