PagerDuty Blog

Achieve Better Accountability With Full-Service Ownership

Software teams seeking to provide better products and services must focus on faster release cycles. But running reliable systems at ever-increasing speeds presents a big challenge. Software teams can have both quality and speed by adjusting the policies around ongoing service ownership. While on-call plays a large part in this model, advancement in knowledge, more resilient code, increased collaboration, and practice also mean engineers don’t have to wake up to a nightmare.

In this blog, I’ll delve into the concepts of full-service ownership, psychological safety in transformation, the ethics of accountability, and the impact of ownership on the customer experience.

What Is Full-Service Ownership?

Full-service ownership is the philosophy that engineers are responsible for the code and services they create in production. This “code it, ship it, own it” mentality means embracing the DevOps principle of no longer throwing code over the wall to operations or relying on the Site Reliability Engineering (SRE) team to ensure the reliability of services in the wild. Instead, accountability, reliability, and continuous improvement are the main objectives of full-service ownership.

Why Accountability Matters

When engineers are on call for their own creations, it puts accountability directly into the hands of that engineer/team. This is important because digital transformation has changed how people work, and how consumers consume. There is an implicit expectation in the mind of consumers that services will work.

For example, when I try to make an online purchase (almost always through my mobile device), I expect a seamless, secure, and efficient experience. When I am interrupted because a page won’t load or throws an error, I simply move on to another company that can fulfill my request. According to the PagerDuty State of Digital Operations 2017 UK report, 81.2% of consumers will exhibit this same behavior.

Empowering engineers to work on customer experience by owning the full lifecycle of their code and services gives companies a competitive advantage. In addition to benefiting the company, full-service ownership benefits the engineer because this accountability ensures high-quality work and gives them a direct line of sight into how the code/service is actually performing and impacting the customers’ day-to-day.

Reliability—Beyond SME

Services will go down; it’s inevitable. However, organizations can reduce the amount of downtime and customer impact by bringing the subject matter expert (SME), or “owner,” into the incident immediately. The SME is the engineer who created the code/service and has the intimate, technical knowledge to both respond to incidents and take corrective action to ensure their services experience fewer interruptions through continuous improvement. As the responsible party, engineers are incentivized to automate, test, and create code that is as bulletproof as possible.

Additionally, teams that adopt full-service ownership see an increase in overall knowledge. Through practices that include on-call handoffs, code reviews, daily standups, and Failure Friday exercises, individual engineers develop greater expertise around the entire codebase. New skills also include systems thinking, collaboration, and working in non-siloed environments. Teams and individuals build necessary redundancy in skills and knowledge through the practice of information sharing.

Continuous Improvement

As engineers strive to improve their product, code, and/or services continuously, a side-effect of full-service ownership is the refinement of both services and alerting. Alerts that interrupt time outside of work hours must be actionable. If team members are repeatedly being interrupted with non-actionable alerts, there is an opportunity to improve the system by analyzing the data.
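As a rough illustration of what analyzing the data could look like, the sketch below tallies alerts per rule and flags the rules that fire often but rarely need action. The record fields (`rule`, `actionable`), the rule names, and the thresholds are assumptions made up for this example; they are not a PagerDuty export format or API.

```python
from collections import Counter


def noisy_rules(alerts, min_count=3, max_actionable_ratio=0.5):
    """Return (rule, count, actionable_ratio) for rules that fire at least
    `min_count` times but needed action less than `max_actionable_ratio`
    of the time, noisiest first."""
    total = Counter()
    actionable = Counter()
    for alert in alerts:
        total[alert["rule"]] += 1
        if alert["actionable"]:
            actionable[alert["rule"]] += 1

    report = []
    for rule, count in total.items():
        ratio = actionable[rule] / count
        if count >= min_count and ratio < max_actionable_ratio:
            report.append((rule, count, ratio))

    # Lowest actionable ratio first; ties broken by highest alert count.
    report.sort(key=lambda row: (row[2], -row[1]))
    return report


# Hypothetical alert history: each record notes whether the responder
# actually had to do something.
alerts = [
    {"rule": "disk_usage_warning", "actionable": False},
    {"rule": "disk_usage_warning", "actionable": False},
    {"rule": "disk_usage_warning", "actionable": False},
    {"rule": "disk_usage_warning", "actionable": True},
    {"rule": "checkout_5xx", "actionable": True},
    {"rule": "checkout_5xx", "actionable": True},
    {"rule": "checkout_5xx", "actionable": True},
]

for rule, count, ratio in noisy_rules(alerts):
    print(f"{rule}: {count} alerts, {ratio:.0%} actionable")
```

Rules that show up in a report like this are candidates for tuning, downgrading to a non-paging notification, or removal.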

Cleaning up the monitoring system is an investment of time; however, committing to actionable alerting will make on-call better for everyone on the team and reduce alert fatigue, which will free up mental energy to focus on future releases and automation. Developers who write the code and define the alerts for that code are more likely to create actionable alerts because they’re the ones woken up at night if they don’t. Beyond actionable alerts, engineers are incentivized to produce the highest-quality code since better code equals fewer interruptions.

Additionally, on-call is not meant to be “always-on,” and full-service ownership encourages teams to build in time to go “off-call.”

Imagine you are on the operations team triaging an incident: time is of the essence and you need answers fast. Are you going to carefully run through a list of all members of the team responsible for that service? Or are you going to call the SME you know always answers their phone on a Sunday afternoon? Calling the same person every single time an incident happens places an undue burden on one individual, potentially creating a single point of failure, which can then lead to burnout. With that said, an on-call rotation serves multiple functions to help organizations continuously improve:

  1. Engineers know when they are off-call; they know their code and services are being covered so they can fully relax, reducing risk of burnout and employee attrition
  2. The burden of being the “go-to” SME is parceled out to the rest of the team on rotation
  3. Services become more reliable
  4. Team knowledge and skills increase through deeper understanding of the codebase
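To make the idea of a rotation concrete, here is a minimal sketch of a round-robin schedule in which each engineer takes one shift and is then off-call until their turn comes around again. The team names, the start date, and the one-week shift length are assumptions for illustration only; a real scheduling tool also has to handle overrides, time zones, and escalation.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, shifts, shift_days=7):
    """Assign engineers to consecutive shifts in round-robin order.

    Returns a list of (shift_start, shift_end, engineer) tuples, with
    both dates inclusive.
    """
    if not engineers:
        raise ValueError("need at least one engineer")
    people = cycle(engineers)
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days - 1)
        schedule.append((shift_start, shift_end, next(people)))
    return schedule


def on_call(schedule, day):
    """Return who is on call on `day`, or None if the day is not covered."""
    for shift_start, shift_end, engineer in schedule:
        if shift_start <= day <= shift_end:
            return engineer
    return None


# Hypothetical three-person team with weekly shifts.
team = ["ana", "bo", "chen"]
schedule = build_rotation(team, date(2024, 1, 1), shifts=6)

for shift_start, shift_end, engineer in schedule:
    print(f"{shift_start} to {shift_end}: {engineer}")
```

With a schedule like this, the operations team looks up who is on call for a given day rather than phoning whoever usually answers, and everyone else knows they are covered.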

By going beyond coding and including shipping and owning, full-service ownership reduces the chaos associated with incidents by defining roles and responsibilities, removing unnecessary layers, and, ultimately, fostering a culture of empowerment and accountability.

What has your experience been? Has being on-call helped you to become a better engineer? Do you loathe the thought of picking up a “pager”? Share your thoughts on our Community Forums! Check out our guide if you want to learn more about full-service ownership best practices.

A version of this article was published September 20, 2019, on