This is the second post in a series about how we manage security at PagerDuty. To start from the beginning, check out How We Ensure PagerDuty is Secure for our Customers.
High availability and reliability are extremely important here at PagerDuty. We have spent an enormous amount of time designing and architecting our service to withstand failures across servers, datacenters, regions, providers, external dependencies and many other factors. But given that failure is inevitable, how do we build fault tolerance into our security system?
Two things to keep in mind: First, dedicated network or security hardware introduces the problem of single points of failure, where a VPN appliance that goes down can take out an entire datacenter. Second, in a security model where only the edge of the network is protected, an attacker that is able to penetrate that single layer will gain access to everything.
To keep these issues from arising, our overall approach has been to centrally manage our security policies, and push out enforcement to all nodes. Each node is responsible for looking up the policy definitions as well as deriving and enforcing a tailored ruleset.
The first strategy that we implemented in our centralized management/distributed enforcement model was dynamic firewalls.
We have been migrating PagerDuty to a Service Oriented Architecture over the last two years, and with that, we have the opportunity to better isolate each service and contain lateral movement. When we define a new cluster of servers in Chef, we also set up the firewall ruleset that defines which group each server belongs to and which groups can talk to that group. From this, each server is able to create entire IPTables chains automatically, open service ports for the appropriate sources, and drop all other traffic. This means that each time a server is added or removed, we do not need to update any policies. Instead, the nodes detect the change and recalculate the ruleset.
We have seen a bunch of benefits from this approach:
We can easily create network partitions when needed. (This is how we make sure our dev, test, and production environments cannot talk to each other.)
We can isolate individual servers when we need to practice attacking them.
We can easily figure out which servers are talking to each other because all of the inbound rules have to be defined upfront.
We are using simple and straightforward Linux IPTables. If there is a firewall problem, every engineer understands how to manipulate the firewall and deploy a fix.
There is no single-point-of-failure network device. If a single server goes down or something more catastrophic happens, the rest of the system will continue to operate in a secure fashion.
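To make the dynamic firewall idea concrete, here is a minimal sketch of how a node might derive its own inbound rules from a central policy. It is an illustration under assumptions, not our actual Chef code: the POLICY and MEMBERS structures, the group names, the addresses and the build_rules function are all hypothetical, and a real ruleset would also need rules for loopback and established connections.

```python
from typing import Dict, List

# Hypothetical central policy: the service port each group exposes and
# the groups that are allowed to connect to it.
POLICY: Dict[str, dict] = {
    "lb": {"port": 443, "allowed_from": []},
    "web": {"port": 443, "allowed_from": ["lb"]},
    "db": {"port": 3306, "allowed_from": ["web"]},
}

# Hypothetical inventory: current members of each group. In practice this
# would be looked up from the configuration management server.
MEMBERS: Dict[str, List[str]] = {
    "lb": ["10.0.0.1"],
    "web": ["10.0.1.1", "10.0.1.2"],
    "db": ["10.0.2.1"],
}


def build_rules(group: str, policy: Dict[str, dict],
                members: Dict[str, List[str]]) -> List[str]:
    """Return iptables-style INPUT rules for a node in `group`."""
    rule = policy[group]
    lines = []
    for src_group in rule["allowed_from"]:
        for ip in sorted(members.get(src_group, [])):
            # Open the service port only for members of allowed groups.
            lines.append(
                f"-A INPUT -s {ip} -p tcp --dport {rule['port']} -j ACCEPT"
            )
    # Everything not explicitly allowed is dropped.
    lines.append("-A INPUT -j DROP")
    return lines


if __name__ == "__main__":
    for line in build_rules("db", POLICY, MEMBERS):
        print(line)
```

Because the rules are recomputed from group membership, adding a server to the hypothetical "web" group changes the output for every "db" node without anyone editing the policy itself.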
For encrypting network traffic, most organizations use one of two dominant methods: Virtual Private Networks (VPNs) or per-app/service encryption. We found problems with both of them.
A typical VPN implementation with dedicated gateways at each of our AWS and Linode regions would have had a number of issues:
Near single point of failure. Even if you deploy multiple gateways to each region, any time a gateway server goes away there is either a failover or a reduction in capacity, and either one results in connectivity issues.
Cost and scalability. Because we are using standard virtual machines and not dedicated networking hardware, we would have to use very large instance sizes to encrypt and decrypt traffic for the servers behind them. We were concerned with conventional VPN gateways’ ability to scale with our traffic spikes.
Latency. Because we already have cross-region calls being made, we want as few hops as possible when connecting to non-local servers.
Per-app/service encryption methods – like forcing MySQL to only allow SSL connections or making sure that Cassandra uses internode encryption – do have a place in our security framework. But there are problems with only using this approach:
It’s easy to forget. While security is part of everyone’s job at a company, many times people will forget to enable the appropriate security settings.
Each app/service has a slightly different way of encrypting data. While most connection libraries support SSL, it can be implemented differently each time. Moreover, this means that anytime we add a new service, we have to rethink how to handle the encryption.
To solve the above issues, we implemented a point-to-point encryption model based on IPSec in transport mode. This enables all traffic between specified nodes on the network to be encrypted, regardless of where the node is located and what other nodes it is talking to. Again, we followed our centralized policy management convention by calculating the relationships on a Chef server and then pushing them out to each node.
There have been several benefits to using point-to-point encryption instead of the traditional VPN model:
Decentralized encryption. Instead of relying on critical VPN gateways, each node can handle its own encryption (removing single points of failure).
Scalability. Since relationships are only calculated for the nodes that a single node needs to talk to (as opposed to every node), the overhead of the encryption is quite low. In our initial benchmarks, we found that performance suffered when one node had to talk to thousands of nodes, but as long as our services remain isolated and contained, this model should scale for our needs.
Efficiency. We are able to take advantage of the dedicated AES hardware that ships with most modern chipsets. Additionally, since the traffic is encrypted as well as compressed, we have seen only a 2-3% impact on our network throughput.
Within-datacenter encryption. Sending traffic over dedicated links within or across datacenters is generally secure, but recent events have raised the specter of security breakdowns in these kinds of connections. Point-to-point encryption provides a better alternative.
One less dependency on NAT. As more networks support IPv6 and a global address space, the private address space provided by VPNs will have to be re-done. Our point-to-point model easily supports a global address space.
Full end-to-end encryption. Switches, routers, fiber taps, neighboring hosts and the hosting providers themselves are all potential intrusion vectors. By encrypting traffic all the way through, even if an intruder were to succeed in capturing our traffic, they would be unable to read any of it.
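The relationship calculation behind the point-to-point model can be sketched as follows. This is a simplified illustration, not the code that runs on our Chef server: the ALLOWED edges, the MEMBERS inventory and the peers_of function are hypothetical names, and the sketch stops at computing the peer list rather than emitting configuration for any particular IPSec implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical policy: (client group, server group) pairs that may talk.
ALLOWED: List[Tuple[str, str]] = [("lb", "web"), ("web", "db")]

# Hypothetical inventory of group members.
MEMBERS: Dict[str, List[str]] = {
    "lb": ["10.0.0.1"],
    "web": ["10.0.1.1", "10.0.1.2"],
    "db": ["10.0.2.1"],
}


def peers_of(node_ip: str, allowed: List[Tuple[str, str]],
             members: Dict[str, List[str]]) -> List[str]:
    """Return the peers that `node_ip` needs an encrypted relationship with."""
    group_of = {ip: g for g, ips in members.items() for ip in ips}
    mine = group_of[node_ip]
    peer_groups = set()
    for client, server in allowed:
        # A relationship is needed in both directions of an allowed edge.
        if client == mine:
            peer_groups.add(server)
        if server == mine:
            peer_groups.add(client)
    return sorted(
        ip for g in peer_groups for ip in members[g] if ip != node_ip
    )


if __name__ == "__main__":
    for ip in sorted(ip for ips in MEMBERS.values() for ip in ips):
        print(ip, "->", peers_of(ip, ALLOWED, MEMBERS))
```

Each node only carries relationships for the peers it actually talks to, which is why the overhead stays low as long as services remain isolated: in this toy inventory the database node tracks two peers instead of all three other hosts.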
PagerDuty follows a least-privilege permissions model. Basically, this means that engineers only have access to the servers they need to get their job done. We leverage Chef, in concert with straightforward Linux users/groups, to build out our access controls.
Whenever we add a new user to our infrastructure, we have to add in the groups to which this user belongs. Whenever we add a new group of servers, we have to specify which user groups have access to these servers. With this information, we are able to build out the passwd and group files on each host. Because this is all stored in JSON config files and is in version control, it is easy to wrap access requests/approvals around our code review process.
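As a rough sketch of how the JSON definitions could turn into passwd and group entries, consider the example below. The JSON layout, the user and group names, the numeric IDs, the home directory and shell choices, and the render_files function are all assumptions made for illustration; only the colon-separated passwd and group line formats are standard.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical user definitions, as they might be kept in version control.
USERS = json.loads("""
{
  "alice": {"uid": 2001, "groups": ["sre"]},
  "bob":   {"uid": 2002, "groups": ["sre", "dba"]},
  "carol": {"uid": 2003, "groups": ["dba"]}
}
""")

# Hypothetical server groups and the user groups allowed onto them.
SERVERS = json.loads("""
{
  "web": {"allowed_user_groups": ["sre"]},
  "db":  {"allowed_user_groups": ["dba"]}
}
""")

# Hypothetical numeric group IDs.
GIDS: Dict[str, int] = {"sre": 3001, "dba": 3002}


def render_files(server_group: str, users: dict, servers: dict,
                 gids: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Return (passwd lines, group lines) for hosts in `server_group`."""
    allowed = servers[server_group]["allowed_user_groups"]
    members: Dict[str, List[str]] = {g: [] for g in allowed}
    passwd = []
    for name in sorted(users):
        matching = [g for g in users[name]["groups"] if g in allowed]
        if not matching:
            continue  # least privilege: no matching group, no account
        primary = gids[matching[0]]
        passwd.append(
            f"{name}:x:{users[name]['uid']}:{primary}::/home/{name}:/bin/bash"
        )
        for g in matching:
            members[g].append(name)
    group = [f"{g}:x:{gids[g]}:{','.join(members[g])}" for g in allowed]
    return passwd, group


if __name__ == "__main__":
    passwd_lines, group_lines = render_files("db", USERS, SERVERS, GIDS)
    print("\n".join(passwd_lines + group_lines))
```

Since the inputs are plain JSON, granting access is a one-line diff (for example, adding "dba" to a user's groups) that can go through the same code review as any other change.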