We at PagerDuty take security very seriously. To us, security means protecting customer data and making sure that customers can securely communicate with PagerDuty. Given our focus on high availability, we pay a lot of attention to how we design and monitor our security infrastructure.
In a dynamic environment like PagerDuty’s, providing robust host- and network-level security is not easy. Doing so in a way that is distributed, fault-tolerant, and hosting provider-agnostic introduces even more challenges. In this series of blog posts, we want to highlight some of the strategies and techniques that PagerDuty’s Operations Engineering team uses to secure the PagerDuty platform and to keep customer data safe.
Here are some of the best practices we follow around security:
All of our internal security decisions are checked against a list of philosophies and conventions that we maintain. This list is not written in stone, as we update it when we find problems, but it forces us to understand where we are making trade-offs and helps us with our decision making. It also makes it easy for new engineers to quickly understand why things are set up the way they are.
We follow a convention of securing everything by default, which means that disabling any security service has to be done via an override or exception rule. This serves to enforce consistency across our dev, test, and production environments.
As tempting as it is to poke a hole in the local firewall or to disable SSL when connecting to MySQL, we don’t want to be making these types of security changes in our production or test environments. Setting our tools to automatically “do the right thing” keeps all of our engineers honest. Also, by having this kind of consistency, we can debug security-related issues earlier in the development cycle.
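The secure-by-default convention can be sketched in a few lines of Python. This is only an illustration, not PagerDuty's tooling: the setting names (`firewall_enabled`, `mysql_require_ssl`), the `Override` type, and the `resolve_settings` helper are all hypothetical. The idea it shows is that every environment starts from the same secure baseline, and weakening a control requires an explicit exception with a stated reason.

```python
from dataclasses import dataclass

# Hypothetical baseline: every security control starts enabled.
SECURE_DEFAULTS = {
    "firewall_enabled": True,
    "mysql_require_ssl": True,
}


@dataclass(frozen=True)
class Override:
    """An explicit exception that names its setting and justifies itself."""
    setting: str
    value: bool
    reason: str


def resolve_settings(overrides=()):
    """Return the effective settings: secure defaults plus explicit overrides."""
    settings = dict(SECURE_DEFAULTS)
    for override in overrides:
        if override.setting not in settings:
            raise KeyError(f"unknown security setting: {override.setting}")
        if not override.reason.strip():
            raise ValueError(f"override of {override.setting} needs a reason")
        settings[override.setting] = override.value
    return settings
```

Called with no arguments, `resolve_settings()` returns every control enabled, so dev, test, and production behave the same unless someone deliberately records an exception.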
All of our infrastructure is deployed to cloud hosting providers, whose networks we cannot control. Additionally, we are deployed across multiple regions, so a good chunk of our data traffic goes over the WAN. This introduces the challenges of packet loss and high latency – as well as the possibility that intruders may try to eavesdrop on our traffic.
With this in mind, we encrypt all data in flight and always assume that our data is flowing through networks where we have little visibility.
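As a rough illustration of what treating the network as untrusted means for a client, the sketch below builds a TLS context with Python's standard `ssl` module. The helper name `make_client_context` and the choice of TLS 1.2 as the minimum version are assumptions made for the example, not details from our setup.

```python
import ssl


def make_client_context(ca_file=None):
    """Build a TLS client context that rejects unverified or legacy connections."""
    # create_default_context() turns on certificate and hostname verification.
    context = ssl.create_default_context(cafile=ca_file)
    # Assumption for this sketch: TLS 1.2 is the oldest version accepted.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A context built this way can be passed to `context.wrap_socket(sock, server_hostname=...)`, so a connection to a peer with an untrusted or mismatched certificate fails instead of silently falling back to plaintext.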
Security Groups, VPC, Rescue consoles, and similar features are all examples of provider-specific tools that we are unable to use, because we are spread across multiple hosting providers and need to avoid vendor lock-in. All of our security tooling has to be based on commonly available Linux tools or installable packages, which eliminates our dependency on provider-specific security tools and leads to better stability. We leverage Chef to do most of this work for us and have built out nearly all of our tooling on top of it.
Most companies approach AAA (Authentication, Authorization, Accounting) by having a single source of truth for access control and then using that source of truth as an authorization mechanism as well. Examples of this include using an LDAP server, using a RADIUS server, or using a perimeter firewall to store network policies. Instead of relying on these single sources of truth for both policy management and enforcement, we split out and distribute the enforcement pieces to the individual nodes in the network. Our typical pattern is that when a change is introduced into the network, the policy is updated in the single source of truth and then pushed out to all of the nodes.
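The split between central policy management and per-node enforcement can be sketched as follows. The class names (`Policy`, `Node`, `PolicySource`) and the shape of a rule (a source host and a port) are hypothetical and chosen only to keep the example small; real policies would be richer and would be delivered by a tool such as Chef rather than by a direct method call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    version: int
    allowed: frozenset  # (source_host, port) pairs that may connect


# Until a policy arrives, a node allows nothing.
EMPTY_POLICY = Policy(version=0, allowed=frozenset())


class Node:
    """Holds its own copy of the policy and enforces it locally."""

    def __init__(self, name):
        self.name = name
        self.policy = EMPTY_POLICY

    def receive(self, policy):
        # Ignore stale or duplicate pushes; keep the newest version seen.
        if policy.version > self.policy.version:
            self.policy = policy

    def is_allowed(self, source, port):
        # The decision is made from the local copy, with no central lookup.
        return (source, port) in self.policy.allowed


class PolicySource:
    """Single source of truth: manages the policy and pushes it to nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.current = EMPTY_POLICY

    def update(self, allowed):
        self.current = Policy(self.current.version + 1, frozenset(allowed))
        for node in self.nodes:
            node.receive(self.current)
        return self.current
```

Because each node answers `is_allowed` from its own copy, enforcement keeps working even if the policy source is temporarily unreachable; only new changes are delayed.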
While all of the above serves to provide a robust security architecture, it’s important that we validate our security measures to ensure that they’re doing what we actually need them to do.
Some companies do quarterly penetration testing, but with our dynamic environment, that is too slow. We actively scan and monitor our infrastructure and alert (with PagerDuty) on any change that is not expected. Whether a mistake is made (e.g. an engineer accidentally opens the wrong network port on a server) or there is actual malicious behavior (e.g. someone trying to brute-force an endpoint), we get alerted to the problem immediately.
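The core of this kind of continuous validation is a comparison between what is expected and what is observed. The sketch below shows only that comparison; the function name `find_unexpected_ports` and the input format are hypothetical, and both the port scan that produces `observed` and the step that sends the alert are left out.

```python
def find_unexpected_ports(expected, observed):
    """Return {host: sorted list of ports that are open but not expected}."""
    findings = {}
    for host, open_ports in observed.items():
        # A host missing from `expected` is treated as expecting no open ports.
        extra = set(open_ports) - set(expected.get(host, ()))
        if extra:
            findings[host] = sorted(extra)
    return findings
```

For example, with `expected = {"db1": {22, 3306}}` and `observed = {"db1": {22, 3306, 8080}}`, the function returns `{"db1": [8080]}`, which is the kind of finding that would trigger an alert.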
This is the first post in a series about how we manage security at PagerDuty. To continue reading this series, check out Defining and Distributing Security Protocols for Fault Tolerance.