March 26, 2019
One of the best things about working at PagerDuty is that our customers, our users, our champions, and our buyers are all the same people. With this year’s push into major incident response, we’ve spent a lot of time talking to Network Operations Centers (NOCs) about what the future holds for them.
Every job changes with new technology. Some, like long-distance trucking, will be completely disrupted by self-driving trucks, but after all the discussions we’ve had with the best NOCs around, it looks like their evolution will be significant but manageable.
I’ve always thought of PagerDuty as improving your Mean Time To Promotion. In keeping with that, here are some of the possible futures we see for NOCs.
One of the most straightforward paths is towards becoming a Site Reliability Engineer (SRE).
If you want a job doing this, you need all the troubleshooting skills of a systems admin, layered on with a deep understanding of monitoring. The goal of an SRE is to detect glitches before they develop into problems that users can notice. And if that doesn’t work, SREs move heaven and earth to get everything back online. You’ll frequently see SRE positions at big cloud or online companies, like Amazon, Google, Heroku, and even Etsy. People get really cranky if they can’t buy things immediately, and SREs are there to make sure they can.
SREs keep the world online (ok, that’s kind of a big ask). As an SRE, you would work with a team to predict needs and build scale in a way that is fluid and invisible from the front end. Site Reliability Engineering is the art of never letting the user see you sweat, as a company. You’re working to make sure there is always enough capacity, enough uptime, enough pipe, and enough monitoring to make sure something isn’t falling apart invisibly.
Instead of firefighting, you want to be a building inspector, designing wider hallways, doors that always swing out, and multiple staircases (metaphorically). It may look heroic to jump in with a fire ax and a hose and tear down doors and fight flashovers, but it’s better to never need the heroics if you have smart policies around building materials and building sprinklers.
Historically, quality assurance (QA) at software companies has had an unfair reputation. In fact, there are lots of great companies like Microsoft where there’s a parallel track for Software Development Engineers in Test (SDET). Manual click testing long ago gave way to automated unit tests, and those have since been joined by automated click and API tests against the staging server.
Operations and QA are the formalizations of, “Eek! Things are broken.” If you have a solid QA team checking things in test before you deploy, there are far fewer surprise outages. If you have an Operations team, they design and build things mindfully, considering risk and performance, rather than simply installing and hoping things work right.
At their core, DevOps and Operations are about getting servers or containers to meet the “three R requirements”:
To me, that also sounds a lot like QA.
DevOps means if something broke and woke you up, you are empowered to write the test that ensures it never makes it to production again — you’re already the best part of QA.
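To make that concrete, here is a minimal sketch of what such a test could look like. Everything in it is hypothetical: it assumes an outage traced back to a deploy config that shipped with a database connection pool size of 0, and the function name `validate_deploy_config` and the config keys are made up for illustration rather than taken from any real system.

```python
# Hypothetical scenario: a page at 3 a.m. was traced to a deploy config
# with db_pool_size set to 0. All names here are invented for illustration.

def validate_deploy_config(config):
    """Return a list of problems found; an empty list means OK to deploy."""
    problems = []

    pool_size = config.get("db_pool_size")
    if not isinstance(pool_size, int) or pool_size < 1:
        problems.append("db_pool_size must be an integer >= 1")

    timeout = config.get("request_timeout_seconds")
    if not isinstance(timeout, (int, float)) or timeout <= 0:
        problems.append("request_timeout_seconds must be a positive number")

    return problems


def test_rejects_zero_pool_size():
    # Regression test: the exact config shape that caused the (hypothetical) outage.
    bad = {"db_pool_size": 0, "request_timeout_seconds": 30}
    assert validate_deploy_config(bad) == ["db_pool_size must be an integer >= 1"]


def test_accepts_known_good_config():
    # Guard against the check becoming so strict that it blocks healthy deploys.
    good = {"db_pool_size": 10, "request_timeout_seconds": 30}
    assert validate_deploy_config(good) == []
```

Run as part of the pipeline (for example with a test runner such as pytest), a check like this turns one bad night into a permanent gate: the same mistake fails in test instead of paging someone in production.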
As you get better at preventing downtime or outages and streamlining requests, you can scale volume more easily because you’re not responding to one-off requests. Think about the difference between manually resetting user logins and offering an automated system to do it. You may spend the same amount of time fixing user login problems, but for ten to twenty times as many users.
One of my favorite NOCs I’ve visited is at a telecommunications company in Los Angeles. It’s a classical NOC with an unconventional feel. Starting from the massive wall of dashboards, the room is arranged in rows, with each row representing a promotion in their operations org. Promotions come 6-12 months apart on average, with clear milestones, and the path can end in the back row (as a de facto SRE) or lead into other parts of the org. With so many companies lamenting how hard it is to find talent these days, I expect this will become more common.
At PagerDuty we treat our support team in much the same way: employees in our support org have gone on not only to become managers or take more technical roles inside that org, but also to join the engineering, marketing, and sales teams, and I don’t see any sign of that stopping. (Unsurprisingly, this makes it easier for us to hire great people.)
Predictions are hard, especially about the future, but it’s clear that the future of the NOC will not be humans watching screens waiting to press buttons. For many classes of always-on applications, it will still make sense to keep people ready to jump into action. The question is what to do with the other 99% of their time.
The NOC has undergone quite a bit of change in recent years and will continue to do so. Those that adapt to the changing digital landscape will position themselves for success, and we look forward to navigating that transition with you.