PagerDuty Blog

Defining and Distributing Security Protocols for Fault Tolerance

This is the second post in a series about how we manage security at PagerDuty. To start from the beginning check out, How We Ensure PagerDuty is Secure for our Customers.

High Availability and Reliability are extremely important here at PagerDuty. We have spent an enormous amount of time designing and architecting our service to withstand failures across servers, datacenters, regions, providers, external dependencies and many other factors. But given that failure is inevitable, how do you build fault tolerance into our security system?

Two things to keep in mind: First, dedicated network or security hardware introduces the problem of single points of failure, where a VPN appliance that goes down can take out an entire datacenter. Second, in a security model where only the edge of the network is protected, an attacker that is able to penetrate that single layer will gain access to everything.

To keep these issues from arising, our overall approach has been to centrally manage our security policies, and push out enforcement to all nodes. Each node is responsible for looking up the policy definitions as well as deriving and enforcing a tailored ruleset.

Dynamically Calculated Local Firewalls

The first strategy that we implemented in our centralized management/distributed enforcement model were our dynamic firewalls.

We have been migrating PagerDuty to a Service Oriented Architecture over the last two years, and with that, we have the opportunity to better isolate each service and contain lateral movements. When we define a new cluster of servers in Chef, we also setup the ruleset for the firewalls that define what group this server belongs to and what groups can talk to this group. From this, each server is able to create entire IPTables chains automatically, open service ports for the appropriates sources, and drop all other traffic. This means that each time a server is added/removed, we do not need to update any policies. Instead, the nodes will detect the change, and recalculate the ruleset.

We have seen a bunch of benefits from this approach:

  • We can easily create network partitions when needed. (This is how we make sure our dev, test, and production environments cannot talk to each other.)

  • We can isolate individual servers when we need to practice attacking them.

  • We can easily figure out which servers are talking to each other because all of the inbound rules have to be defined upfront.

  • We are using simple and straightforward Linux IPTables. If there is a firewall problem, every engineer understands how to manipulate the firewall and deploy a fix.

  • There is no single-point-of-failure network device. If a single server goes down or something more catastrophic happens, the rest of the system will continue to operate in a secure fashion.

Distributed Traffic Encryption

For encrypting network traffic, there are two dominant methods that most use: Virtual Private Networks (VPN) and per app/service encryption, but we found problems with both of them.

A typical VPN implementation with dedicated gateways at each of our AWS and Linode regions would have had a number of issues:

  • Almost single point of failure. Even if you deploy multiple gateways to each region, anytime a gateway server goes away, there is either a failover involved or a reduction in capacity. This will result in connectivity issues.

  • Cost and scalability. Because we are using standard virtual machines and not dedicated networking hardware, we would have to use very large instance sizes to encrypt and decrypt traffic for the servers behind them. We were concerned with conventional VPN gateways’ ability to scale with our traffic spikes.

  • Latency. Because we already have cross-region calls being made, we want as few hops as possible when connecting to non-local servers.

Per-app/service encryption methods – like forcing MySQL to only allow SSL connections or making sure that Cassandra uses internode encryption – do have a place in our security framework. But there are problems with only using this approach:

  • It’s easy to forget. While security is part of everyone’s job at a company, many times people will forget to enable the appropriate security settings.

  • Each app/service has a slightly different way of encrypting data. While most connection libraries support SSL, it can be implemented differently each time. Moreover, this means that anytime we add a new service, we have to rethink how to handle the encryption.

To solve the above issues, we implemented a point-to-point encryption model based on IPSec in transport mode. This enables all traffic between specified nodes on the network to be encrypted, regardless of where the node is located and what other nodes it is talking to. Again, we followed our centralized policy management convention by calculating the relationships on a Chef server and then pushing them out to each node.

There have been several benefits to using point-to-point encryption instead of the traditional VPN model:

  • Decentralized encryption. Instead of relying on critical VPN gateways, each node can handle its own encryption (removing single points of failure).

  • Scalability. Since relationships are only calculated for the nodes that a single node needs to talk to (as opposed to every node), the overhead of the encryption is quite low. In our initial benchmarks, we found that performance suffered when one node had to talk to thousands of nodes, but as long as our services remain isolated and contained, this model should scale for our needs.

  • Efficiency. We are able to take advantage of the dedicated AES hardware that ships with most modern chipsets. Additionally, since the traffic is encrypted as well as compressed, we have seen only a 2-3% impact on our network throughput.

  • Within-datacenter encryption. Sending traffic over dedicated links within or across datacenters is generally secure, but recent events have raised the specter of security breakdowns in these kinds of connections. Point-to-point encryption provides a better alternative.

  • One less dependency on NAT. As more networks support IPv6 and a global address space, the private address space provided by VPNs will have to be re-done. Our point-to-point model easily supports a global address space.

  • Full End to End encryption. Switches, routers, fiber taps, neighboring hosts, the hosting providers themselves. These are all examples of potential intrusion vectors. By encrypting traffic all the way through, even if an intruder were to succeed in capturing our traffic, they would be unable to read any of it.

Role-Based Access Control

PagerDuty follows a least-privilege permissions model. Basically, this means that engineers only have access to the servers they need to get their job done. We leverage Chef, in concert with straightforward Linux users/groups, to build out our access controls.

Whenever we add a new user to our infrastructure, we have to add in the groups to which this user belongs. Whenever we add a new group of servers, we have to specify which user groups have access to these servers. With this information, we are able to build out the passwd and group files on each host. Because this is all stored in JSON config files and is in version control, it is easy to wrap access requests/approvals around our code review process.