by Dave Bresci
December 5, 2018
It’s the morning of February 28, 2017, and vast swathes of the internet are unavailable, from individual sites to services that thousands of others rely on, such as Slack, Quora, GitHub, and Trello. You probably remember the day that ‘human error’ took down so much of the internet, with broad components of AWS no longer working. This isn’t the first time an outage has brought the internet to its knees, but the sheer scale of AWS means that its impact is always felt.
Often during a crisis, we forget about the human side of dealing with a major incident — the complications involved in trying to reach people, in finding personal contact details, or in disturbing key staff in multiple time zones, pulling people out of meetings, and interrupting planned work with unplanned events.
PagerDuty is the leading digital operations management platform that exists for times like these. We help teams and organizations get ahead of and manage incidents by having their back when things go wrong, enabling teams to focus on the work they love. Our product serves as a platform for action when the unexpected happens. While we have primarily focused on Engineering and IT teams, we have been spending time thinking about what PagerDuty could mean to other areas of business that support teams and people.
In the past, enterprise systems or HR departments have been responsible for the systems that house the details of key personnel who are needed when incidents like the AWS outage happen. HR teams also need to stay informed during such times of major customer impact. Most of our world of information lives in the cloud, and when cloud-based services are unavailable, so is the critical information we need to do our jobs. Popular services we rely on daily, such as internal messaging and VoIP, could all be unavailable. During a large-scale outage of these cloud services, keeping customers and employees informed and orchestrating action and response is hard.
Modern companies typically have a handful of systems, all containing elements of people information, but people rarely have an incentive to keep their details up to date. With a system that is key to incident and crisis response, such as PagerDuty, there is a deep incentive and a daily business need for all employees to stay in close contact. It becomes frictionless to leverage that platform to get accurate people information and coordinate a response.
This is a great opportunity for my teams in HR to use a platform that is highly adopted across the business and connects us into the heartbeat of action, so we can help, not hinder. The last thing we want to do is create unnecessary noise or bureaucracy within the company during a time of fast-paced response and action.
Consider a crisis more relevant to the responsibilities of an HR department, such as environmental, physical, or digital threats to employees. We need to be able to manage the situation, access critical information, and empower the right people to respond without being hidebound by silos and workflows or slowed down by tickets. HR departments are often the first line of response in a wide variety of situations, even though it may not always seem like it. We need to communicate with legal, facilities, vendors, and cybersecurity teams to deal with critical situations using best practices and agile workflows.
Many of our customers have business models that keep no physical stock or supply chain: their digital infrastructure is their business and has a significant impact on their customer experience. Everyone from engineering to legal and marketing is invested in the success of that infrastructure and customer experience. We are all better served by using platforms that orchestrate across that infrastructure when there is real-time customer impact.
Those of us in tech-focused businesses should learn from the platforms and processes that our engineering and DevOps teams use. We could all benefit from the speed, orchestration, and interoperability these teams have between people, decisions, and systems. We can also benefit from the culture these teams have created: pushing accountability for the customer experience down to the front lines, trusting teams to coordinate a response, and learning from the highly collaborative approaches to incident response that our DevOps teams, IT teams, and customers have mastered so well. The entire business could learn from this.
It is exciting to me to find these new ways that our different functions can work together on a shared platform, leverage centralized data and machine learning to see patterns as they emerge, or use a postmortem to learn and improve from the response. It’s our duty as HR professionals to ensure that the infrastructure of people data and orchestration at our organization is as effective as the technical infrastructure and orchestration of incident response. We owe our employees the experience that we build for our customers, and as we make life better for on-call engineers, we can make life better for everyone across the digital business.