This is the first post of a multi-part series on some of the operations challenges that the team at PagerDuty is solving.
At PagerDuty we strive for high availability at every layer of our stack. We attain this by writing resilient software that then runs on resilient infrastructure. We take this into account when we design our infrastructure automation. We assume that pieces will fail and that we need to either replace or rebuild pieces quickly.
For this first post about our Operations Engineering team, we will cover how we automate our infrastructure using Chef, a highly extensible, Ruby-based, search-driven configuration management tool, and which practices we have learned along the way. We will cover our typical workflow and how we ensure that we can safely roll out new, resilient, and predictable infrastructure.
Before diving into the technical details, first some context about the team behind the magic. Our Operations Engineering team at PagerDuty is currently made up of four engineers. The team is responsible for a few areas: infrastructure automation, host-level security, persistence/data stores, and productivity tools. The team is made up of generalists, with each team member having one or two areas of depth. While the Operations Engineering team has its own PagerDuty on-call rotation, each engineering team at PagerDuty also participates in on-call.
We currently own 150+ servers spanning multiple cloud providers. The servers are split into multiple environments (Staging, Load Test, and Production) and multiple services (app servers, persistence servers, load balancers, and mail servers). Each of our three environments has a dedicated Chef server to prevent hosts from polluting other environments.
The Chef code base is three years old and has around 3,500 commits.
Following is the skeleton of our chef repository:
We use the standard feature branch workflow for our repo. A feature can be tactical work (spawning a new type of service), maintenance work (upgrading/patching), or strategic work (infrastructure improvements, large-scale refactoring, etc.). Feature branches are unit tested via Jenkins, which constantly watches GitHub for new changes. We then use the staging environment for integration testing: feature branches that pass the unit tests are deployed to the staging environment's Chef server.
Depending on the feature, most branches will go through a code review via a pull request. The code review is purposefully manual: we make sure that at least one other team member gives a +1 on the code. If there is a larger debate on the code, we block out time during our team meetings to discuss it. From there, the feature branch is merged and we invoke our restore script, which deletes all existing cookbooks from the Chef server and uploads all roles, environments, and cookbooks from master. Generally the restore process takes less than a minute.
We do not follow a strict deployment schedule; we prefer to deploy whenever we can. Unless it's a hot-fix, we prefer to do deployments during office hours when everyone is awake. We run chef-client once a day, throughout the week, via cron. If we need on-demand Chef execution, we use pssh or knife ssh with a controlled concurrency level.
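
To make "controlled concurrency level" concrete, here is a minimal sketch in plain Ruby of running a task across many hosts while capping how many run at once. The helper name `run_with_concurrency` and the host names are hypothetical and for illustration only; this is not how pssh or knife ssh are implemented, just the general idea of a bounded worker pool.

```ruby
require 'thread'

# Hypothetical helper: run the given block once per host, with at most
# `concurrency` hosts being processed at the same time.
# Returns a hash of host => block result.
def run_with_concurrency(hosts, concurrency, &task)
  queue = Queue.new
  hosts.each { |host| queue << host }

  results = {}
  lock = Mutex.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        host = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break           # no hosts left, this worker is done
        end
        outcome = task.call(host)
        lock.synchronize { results[host] = outcome }
      end
    end
  end

  workers.each(&:join)
  results
end

# Example with a stand-in task; a real task would run a command on the host.
statuses = run_with_concurrency(%w[app-1 app-2 app-3 app-4], 2) do |host|
  "converged #{host}"
end
```

Capping the pool size means a bad change reaches only a few hosts at a time, which leaves room to stop a rollout before it touches the whole fleet.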
All PagerDuty custom cookbooks have a spec directory which contains ChefSpec-based tests, and we recently migrated to ChefSpec 3. We use the stubbing capabilities of ChefSpec and RSpec extensively, as the vast majority of our custom recipes use search, encrypted data bags, etc. Apart from the cookbook-specific unit tests that reside inside the spec subdirectory of individual cookbooks, we have a top-level spec directory, which has functional and unit tests. Unit tests are mostly ChefSpec-based role or environment assertions, while functional tests are all LXC- and RSpec-based assertions. The functional test suite uses chef-zero to create an in-memory server, then uses our restore script and the chef restore knife plugin to emulate a staging or production server. Then we spawn an individual LXC container per role using the same bootstrap process as our production servers. Once we successfully converge a node, we assert based on the role. For example, a zookeeper functional spec will telnet locally and run ‘stats’ to see if requests can be served. This covers most of our code base, except the integration with individual cloud providers.
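
The "telnet locally and run a command" style of functional assertion can be sketched with nothing but Ruby's standard library. This is an illustrative helper, not our actual spec code: the name `probe`, the port, and the timeout are assumptions, and it assumes the service closes the connection after replying.

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper: send a one-line command to a TCP service and return
# everything it sends back. Assumes the service closes the connection once
# it has replied; otherwise the read runs until the timeout fires.
def probe(host, port, command, timeout: 2)
  Timeout.timeout(timeout) do
    TCPSocket.open(host, port) do |sock|
      sock.write("#{command}\n")
      sock.close_write # signal that we are done sending
      sock.read        # read until the server closes its side
    end
  end
end

# In a functional spec this might be used roughly as follows (the port is an
# assumption for illustration):
#   reply = probe('127.0.0.1', 2181, 'stats')
#   expect(reply).not_to be_empty
```

A check like this verifies that the converged node actually serves requests, rather than only that the expected resources appear in the run list.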
We rely heavily on community cookbooks. We try not to create cookbooks if there is a well-maintained open source alternative. Instead, we prefer to write wrapper cookbooks with a “pd” prefix that layer our customizations on top of the community cookbooks. An example is the pd-memcached cookbook, which wraps the memcached community cookbook and provides iptables rules and other PagerDuty-specific customization.
Both the community cookbooks and our PagerDuty custom cookbooks are managed by Berkshelf. All custom cookbooks (pd-*) stay inside the site-cookbooks directory in the Chef repo. We use several custom knife plugins. Two of them, chef restore and chef backup, take care of fully backing up and restoring our Chef server (nodes, clients, data bags). With these, we can easily move Chef servers from host to host. Other knife plugins are used to spawn servers, perform teardowns, and check the status of third-party services.
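
As a rough illustration of the Berkshelf setup described above, a Berksfile that pulls a community cookbook from a remote source and a wrapper cookbook from the repo might look like the fragment below. This is a configuration sketch, not our actual file: the source URL and the exact path are assumptions, and it is meant to be read by Berkshelf rather than run as a standalone Ruby script.

```ruby
# Berksfile (illustrative fragment, not PagerDuty's actual file)

# Where community cookbooks are resolved from (assumed source)
source 'https://supermarket.chef.io'

# Community cookbook, fetched from the source above
cookbook 'memcached'

# Wrapper cookbook kept in the repo; the path is an assumption based on
# the site-cookbooks layout described above
cookbook 'pd-memcached', path: 'site-cookbooks/pd-memcached'
```

In this arrangement the wrapper cookbook would also declare its dependency on the community cookbook in its own metadata, so the two are always resolved together.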
Currently, we are confident in our ability to spawn and safely tear down our infrastructure when we have the appropriate tests in place. When we initially took a TDD approach to our infrastructure, there was a steep learning curve for the team. We still run into issues when spinning up nodes across multiple providers and when configurations depend on external services over the network (e.g. hosted monitoring services, log management services), both of which introduce additional failure modes and security requirements. We have responded to these challenges by adopting aggressive memoization techniques and by introducing security testing automation tools (e.g. gauntlt) into the operations toolkit (more on this in a later post).
Key challenges remain with cross-component versioning issues and with the upfront, proactive effort needed to keep dependencies updated. Code quality issues in some community cookbooks have also hampered us. But we understand these are complex, time-bound problems, and we are part of the larger community responsible for fixing them.
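
This post does not go into what exactly gets memoized, so the following is only a generic sketch of the technique: cache the result of an expensive or flaky lookup (for example, a call to an external provider) so that it runs at most once per key. The class `ProviderInventory` and its `fetcher` block are hypothetical names for illustration.

```ruby
# Hypothetical example of memoizing an expensive lookup per key.
class ProviderInventory
  attr_reader :calls

  # `fetcher` stands in for a slow call to an external service.
  def initialize(&fetcher)
    @fetcher = fetcher
    @cache = {}
    @calls = 0
  end

  # Return the cached result for a region, fetching it only on first use.
  # Keyed on presence in the hash, so nil or empty results are cached too.
  def instances(region)
    @cache.fetch(region) do
      @calls += 1
      @cache[region] = @fetcher.call(region)
    end
  end
end

inventory = ProviderInventory.new { |region| ["node-1.#{region}", "node-2.#{region}"] }
inventory.instances('us-west') # performs the lookup
inventory.instances('us-west') # served from the cache
```

Reducing repeated calls to external services makes runs faster and also shrinks the number of points at which a network dependency can fail.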