PagerDuty Blog

What is Operational Maturity?

PagerDuty and DevOps Thought Leaders Come Together to Answer Questions about Operational Maturity

On Wednesday night, PagerDuty hosted an event where long-time PagerDuty customers Dropbox, Flipboard, and Splunk spoke about their hard-won experience, shared war stories, and discussed what they’ve learned about operations at scale. They also offered advice on how other teams can apply those lessons. We were delighted to talk with customers, partners, and the extended community about what it means to be operationally mature. Here is what the panelists said about Operational Maturity:

What is Operational Maturity?

Andrew Fong, Infrastructure Manager at Dropbox:

Operationally mature cultures are ones that understand the tradeoffs they are making in a production environment and the impact those tradeoffs have on the business.

Joey Parsons, Head of Platform & Operations at Flipboard:

Being operationally mature, from our standpoint, means understanding the ramifications of incidents from both a business impact and an employee well-being perspective. Being on-call can be either a rewarding or a negative experience for the person responding. Having the operational tools and processes in place to be able to make smart, informed decisions for your business is key.

Sean Jacobs, Infrastructure and Datacenter Operations Lead at Splunk:

Operational Maturity at Splunk is often measured by the effectiveness of our response during a crisis. Being a big data company, we collect information on nearly every facet of our infrastructure, but having the data and having meaningful data are vastly different challenges.

Tim Armandpour, Vice President of Engineering at PagerDuty:

Operational Maturity means being part of a test-driven environment, where high-severity incidents resulting from bugs are very uncommon and measured. It also means being part of an organization where every team is part of an on-call rotation and uses the same incident management system and methodology for maximum transparency and collaboration. At an Operationally Mature company, reliability and accountability are seen as key factors for a successful business. The more mature you are, the easier it is for your business to be agile, adapt quickly, and change with the market.

What is something you do that makes you operationally mature?

Andrew Fong, Infrastructure Manager at Dropbox:

Our SEV process (Incident Response) at Dropbox used to be ad hoc with no clear owners other than Senior Engineers. Over the last year we’ve built out a process that identifies a clear owner for coordination and resolution. We built well-defined criteria and tooling so that we could support 350+ Engineers, as well as Product Management, Communications, and Legal. Also, at Dropbox, incidents can be either backend server issues or client issues. (We have desktop software!) So we needed to build a process that works for all of them.

Joey Parsons, Head of Platform & Operations at Flipboard:

Becoming mature had a lot to do with the evolution of our on-call and escalation policies. Monitoring is never done and needs to be continually revamped for both quality of business and quality of life. Bad alerting very quickly leads to employee dissatisfaction.

Sean Jacobs, Infrastructure and Datacenter Operations Lead at Splunk:

A lot of effort gets put into making our alerting and monitoring useful, and not just having a blanket approach to monitoring. Additionally, we put a lot of priority into look-backs and retroactive reviews so we can iterate and improve, versus having to react to the same issues every week.

Tim Armandpour, Vice President of Engineering at PagerDuty:

Every Friday at PagerDuty is Failure Friday, where our engineers intentionally take services offline and try to break our system, to ensure that all of our failsafes are up and running. We take reliability very seriously here, and have three active data centers so we stay online even if one of them is down. We also have a robust incident management policy, and have eliminated non-actionable alerts to the point where our on-call engineers get a few alerts per month at most.
