How We Ensure PagerDuty is Secure for our Customers

by Evan Gilman June 4, 2014 | 4 min read

We at PagerDuty take security very seriously. To us, security means protecting customer data and making sure that customers can securely communicate with PagerDuty. Given our focus on high availability, we pay a lot of attention to how we design and monitor our security infrastructure.

In a dynamic environment like PagerDuty’s, providing robust host and network level security is not easy. Doing so in a way that is distributed, fault-tolerant, and hosting provider-agnostic introduces even more challenges. In this series of blog posts, we wanted to highlight some of the strategies and techniques that PagerDuty’s Operations Engineering team is using to secure the PagerDuty platform and to keep customer data safe.

Here are some of the best practices we follow around security:

Establish internal standards

All of our internal security decisions are checked against a list of philosophies and conventions that we maintain. This list is not written in stone, as we update it when we find problems, but it forces us to understand where we are making trade-offs and helps us with our decision making. It also makes it easy for new engineers to quickly understand why things are set up the way they are.

Secure by default

We follow a convention of securing everything by default, which means that disabling any security service has to be done via an override or exception rule. This serves to enforce consistency across our dev, test, and production environments.

As tempting as it is to poke a hole in the local firewall or to disable SSL when connecting to MySQL, we don’t want to be making these types of security changes in our production or test environments. Setting our tools to automatically “do the right thing” keeps all of our engineers honest. Also, by having this kind of consistency, we can debug security-related issues earlier in the development cycle.
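As a rough illustration of the convention (not our actual tooling), here is a minimal Python sketch in which every setting defaults to the secure choice and can only be relaxed through an explicit, justified override. The names ConnectionSettings and settings_for, and the three settings themselves, are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConnectionSettings:
    # Hypothetical settings for illustration; every default is the secure choice.
    require_tls: bool = True
    verify_certificates: bool = True
    firewall_enabled: bool = True


def settings_for(overrides=None):
    """Return secure defaults, relaxed only by explicitly named overrides.

    `overrides` maps a setting name to a (value, reason) pair. An override
    without a written reason is rejected, so every exception is deliberate
    and leaves a trail that can be reviewed later.
    """
    settings = ConnectionSettings()
    audit_log = []
    for name, (value, reason) in (overrides or {}).items():
        if not reason.strip():
            raise ValueError(f"override for {name!r} needs a reason")
        # dataclasses.replace raises TypeError for an unknown setting name.
        settings = replace(settings, **{name: value})
        audit_log.append(f"{name}={value}: {reason}")
    return settings, audit_log
```

Calling settings_for() with no arguments yields the fully secured configuration in every environment. Anything weaker has to be asked for by name, which is what makes the exceptions easy to spot.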

Assume a hostile and flaky network

All of our infrastructure is deployed to cloud hosting providers, whose networks we cannot control. Additionally, we are deployed across multiple regions, so a good chunk of our data traffic goes over the WAN. This introduces the challenges of packet loss and high latency – as well as the possibility that intruders may try to eavesdrop on our traffic.

With this in mind, we encrypt all data in flight and always assume that our data is flowing through networks where we have little visibility.
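To make "encrypt all data in flight" concrete, here is a small sketch using Python's standard ssl module. It is an illustration under assumptions, not a description of our production setup: the function name strict_client_context and the choice of TLS 1.2 as the minimum version are ours for this example.

```python
import ssl


def strict_client_context():
    """Build a client-side TLS context that refuses unverified peers."""
    # create_default_context() already turns on certificate verification
    # and hostname checking for client connections.
    context = ssl.create_default_context()
    # Assumption for this sketch: reject anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A context like this can be passed to any client that accepts an ssl.SSLContext, so that a connection across the WAN fails outright rather than silently falling back to plaintext or an unverified peer.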

Be provider-agnostic

Security Groups, VPC, rescue consoles, and the like are all examples of provider-specific tools that we are unable to use, because we are spread across multiple hosting providers and need to avoid vendor lock-in. All of our security tooling has to be based on commonly available Linux tools or installable packages, which eliminates our dependency on provider-specific security tools and leads to better stability. We leverage Chef to do most of this work for us and have built out nearly all of our tooling on top of it.

Centralize policy management and distribute policy enforcement

Most companies approach AAA (Authentication, Authorization, and Accounting) by having a single source of truth for access control and then using that source of truth as the authorization mechanism as well. Examples of this include using an LDAP server, using a RADIUS server, or using a perimeter firewall to store network policies. Instead of relying on these single sources of truth for both policy management and enforcement, we split out the enforcement pieces and distribute them to the individual nodes in the network. Our typical pattern is that when a change is introduced into the network, the policy is updated in the single source of truth and then pushed out to all of the nodes.
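The split between central policy and local enforcement can be sketched in a few lines of Python. This is a simplified model under assumptions, not our Chef-based implementation: the Node and AllowRule types, the role names, and the addresses in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    role: str
    address: str


@dataclass(frozen=True)
class AllowRule:
    # One line of centrally managed policy: source_role may reach dest_role on port.
    source_role: str
    dest_role: str
    port: int


def inbound_rules_for(node, policy, inventory):
    """Compute the (source_address, port) pairs this node should accept.

    The policy and inventory come from the central source of truth, but
    this function runs on the node itself, so enforcement keeps working
    even when the central service is unreachable. Anything not listed is
    denied.
    """
    allowed = set()
    for rule in policy:
        if rule.dest_role != node.role:
            continue
        for peer in inventory:
            if peer.role == rule.source_role and peer.name != node.name:
                allowed.add((peer.address, rule.port))
    return sorted(allowed)
```

For example, with a policy of AllowRule("web", "db", 3306) and an inventory of two web nodes and one database node, the database node computes one inbound rule per web node and each web node computes none. Adding a third web node only requires updating the inventory and pushing it out again.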

Constantly validate

While all of the above serves to provide a robust security architecture, it’s important that we validate our security measures to ensure that they’re doing what we actually need them to do.

Some companies do quarterly penetration testing, but with our dynamic environment, that is too slow. We actively scan and monitor our infrastructure, and we alert (with PagerDuty) whenever something unexpected changes. Whether a mistake is made (e.g. an engineer accidentally opens the wrong network port on a server) or there is actual malicious behavior (e.g. someone trying to brute force an endpoint), we get alerted to the problem immediately.
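The core of this kind of continuous validation is a comparison between what the policy expects and what a scan actually observes. The Python sketch below shows only that comparison; the function name unexpected_ports and the shape of its inputs are assumptions for this example, and the scanning and alerting steps are left out.

```python
def unexpected_ports(expected, observed):
    """Return, per host, the open ports that the policy does not expect.

    Both arguments map a host name to a collection of port numbers.
    A host that appears in `observed` but not in `expected` is treated
    as having no approved ports at all.
    """
    findings = {}
    for host, ports in observed.items():
        extra = set(ports) - set(expected.get(host, ()))
        if extra:
            findings[host] = sorted(extra)
    return findings
```

A scheduled job could run a check like this after every scan and open an incident for each entry in the result, so that an accidentally opened port is noticed within minutes rather than at the next quarterly test.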

This is the first post in a series about how we manage security at PagerDuty. To continue reading this series, check out Defining and Distributing Security Protocols for Fault Tolerance.