July 16, 2018
I’m proud and appreciative of the commitment PagerDuty makes to diversity and inclusion. I appreciate how we hire women and people of color into visible leadership roles, have safe spaces and channels for employees to discuss experiences and concerns, and have frank, open discussions about what we’re doing well and what we need to change. There’s lots to say here, and smarter and more qualified people have written about it better than I ever could.
But one great piece of our culture I think goes unnoticed is the way the team approaches internal technology: over many years, we’ve built a toolchain for getting work done that supports effective collaboration across many different technical backgrounds, locales, times of day, and working styles. I want to share some of my favorite examples and how I believe they help us create a more welcoming and inclusive culture.
When I started at PagerDuty in 2014, getting the engineering stack up and running locally was a bit of a nightmare. It was a long, 20+-step process documented in the main web repo, and having a piece of software pre-installed, or the wrong version, or Mercury being in retrograde would break the whole thing and prompt a visit to ops to get you up and running.
As a product manager with a mild grammar and spelling obsession, though, I’d run into some minor typo or visual issue on the site every few weeks that I wanted to fix, only to find that my local environment had fallen out of sync by then.
What I soon discovered, though, was that our continuous integration and deployment tooling was advanced enough that I could just make a change in a text editor, push it up to GitHub, and use our friendly chatops bot to verify my change looked and worked as expected in a staging environment. Even scarier, once our automated tests finished running, I could actually use chat to deploy the change to production!
I was petrified the first time I did this, and had a member of the operations team come over and sit with me while I typed. The last thing I wanted as a new PM on a mission-critical operations platform was to break production. But I typed in “!deploy web/master to production”, our little chatbot went and did its thing, and nothing big exploded.
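For readers curious what sits behind a command like that, here is a minimal sketch of how a chat bot might parse it. This is not PagerDuty’s actual bot: the command grammar is guessed from the single example above, and the names `DeployRequest`, `parse_deploy_command`, and `can_deploy`, along with the rule that only production requires passing tests, are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class DeployRequest(NamedTuple):
    """Hypothetical structure holding the parts of a deploy command."""
    repo: str
    branch: str
    environment: str


# Assumed grammar: "!deploy <repo>/<branch> to <environment>"
DEPLOY_PATTERN = re.compile(r"^!deploy\s+([\w.-]+)/([\w./-]+)\s+to\s+(\w+)$")


def parse_deploy_command(message: str) -> Optional[DeployRequest]:
    """Return a DeployRequest if the chat message is a deploy command, else None."""
    match = DEPLOY_PATTERN.match(message.strip())
    if match is None:
        return None
    return DeployRequest(*match.groups())


def can_deploy(request: DeployRequest, tests_passed: bool) -> bool:
    """Assumed policy: production deploys wait for green tests; other environments do not."""
    if request.environment == "production":
        return tests_passed
    return True


request = parse_deploy_command("!deploy web/master to production")
print(request)
print(can_deploy(request, tests_passed=True))
```

The point of a gate like `can_deploy` is that the safety lives in the tooling rather than in the person typing, which is what makes it reasonable to hand the command to anyone.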
Since then I’ve gotten more comfortable developing locally, and our great SRE team has made it much easier to get your local environment set up and keep it current. But I’ll always remember that first production deploy, and the experience of an organization focused on making it so safe, fast, and boring to change production software that anyone can do it.
Continuing through today, I regularly see my UX teammates, most of them with limited engineering experience, change some CSS or text here or there, cutting out the work of writing a ticket and just fixing the problem directly. It’s empowering for the designers, efficient for the team, and really helps to promote a culture where everyone feels ownership and accountability for the product experience.
When I started at PagerDuty, “analytics” was an esoteric art, taught through copy-and-pasted SQL queries, CSV exports, Salesforce reports, and lots of spreadsheet work to pull it all together. Only a few people in the company had enough context about the inner workings of the product and the structure of the business to pull their own data, which meant that data-driven decision-making was inaccessible to a large part of the organization.
A few years ago, though, our business analytics team (affectionately known as “#dataduty”) built out a great data warehouse and analytics toolchain that makes it easy for anyone, from a brand-new salesperson to an expert SQL-er, to ask questions about our product and our business.
We use Mode Analytics for lightweight work — it’s just like a desktop SQL client, but better, because you can graph, permalink, and schedule queries. Rather than share the query and require anyone interested to re-run it to see the answers, we can just share around the link and know we’re all looking at the same thing.
For heavier-weight, more persistent analysis, like executive dashboards or tracking product adoption KPIs, we use Looker. This draws from the same data warehouse, but lets you model your data using YAML, creating graphical, interactive views that teammates can explore without ever writing a line of SQL. It’s particularly nice to tie together business metrics like sales segmentation or account size with usage metrics like notification counts, to ask questions about how the behavior we see maps to our large and diverse customer base.
Like easy deployments, inclusive and thoughtful analytics tools drive a culture of empowerment. When people can easily get answers to their data questions, they feel empowered and responsible for asking more questions, checking more assumptions, and driving more of their planning and decision-making from actual data.
As a distributed team, with multiple offices, numerous remote employees, and lots of people traveling, we rely on several different tools to help us create software together, and maintain a close team feel even when we’re separated by thousands of miles. However, the same tools that make remote work possible can make co-located work more inclusive as well.
We use Slack heavily, and most teams have a mix of channels for intra- and inter-team communication. While it can get a little distracting if you’re undisciplined with notifications, we really like how it lets people in different time zones or in and out of meetings stay aware of conversations asynchronously. And while they seem playful and a little silly at first, emoji actually do a great job of conveying the tone of a statement, and of gathering reactions (up/downvotes, story point estimates) from a whole team quickly.
We love ScreenHero (now part of Slack) for pair programming, and for remote meetings, we lean on Hangouts, plus Google Slides, Google Docs, or Trello (depending on the need). These tools make sure we’re looking at the same thing, which is great, but being fluent in all of them also makes us more inclusive to different communication styles.
We try to keep meetings and discussions as open as possible to everyone in the business, and regularly see people from all over the org in engineering lunch-and-learns, Failure Fridays, and team sprint reviews. We record presentations in case people can’t make it, and also keep an open internal blog that anyone can contribute to, which is great for understanding the pulse of the company and what’s on coworkers’ minds.
Unsurprisingly, we also use our own platform to help drive our culture of inclusivity.
While we may have started out as a solution aimed at DevOps and ITOps response teams, we’ve discovered that PagerDuty provides a ton of value in coordinating real-time response across the entire business. Even though their backgrounds and day-to-day duties may look very different, departments like marketing, support, finance, and the executive team all have a role to play in incident response. It was before my time, but I heard that in the very early days of PagerDuty, finance was the key responder on an incident: our credit card had expired with a provider, and we needed to update it to keep the notifications flowing to customers! We quickly automated that process, but still have an expectation that effective incident response includes well-documented, well-rehearsed plans for getting responders on top of issues, from any area of the business.
As we’ve seen more customers interested in whole-business response to critical issues, we’ve added new capabilities to PagerDuty to support this — from stakeholder notifications and tighter chat integrations for letting interested parties know how response is progressing, to postmortem reports that make it easy for the whole business to see what happened with an outage and what we’re doing to improve.
I love our internal technology, but it’s important to keep in mind that nifty tools aren’t the cause of an inclusive culture. They’re effects, indicators of an organization that values and works toward welcoming and including a diverse set of people. Each one of the tools here started with somebody seeing an opportunity to help others in the business do something more easily, and going the extra mile to ensure that factors like department, technical background, time zone, or communication style don’t hold teammates back from accessing a piece of data or participating in a conversation.
This work is never done, but every day at PagerDuty I see people putting in the effort to advance it. Whether it’s seeing a UX designer do their first production deploy; watching new team members be greeted on Slack; or seeing support, marketing, executives, and engineers collaborate on an incident, it’s clear to me that we’re building an environment that values and supports everyone in working together to build an awesome product and deliver great service to customers.