PagerDuty Blog

A public release of our pd-feature Chef cookbook

PagerDuty uses Chef for some of its configuration management needs. While most Chef cookbooks we develop internally are not useful outside of PagerDuty’s infrastructure and workflow, sometimes we do come across a problem that seems general enough to make open-sourcing the solution meaningful.

The pd-feature cookbook solves the problem of gradually rolling out a new feature across a uniform fleet of machines while avoiding manual actions (we want all infrastructure to be controlled by and visible in source code). The cookbook allows fine-grained control of the process, supports a number of common scenarios, and makes features easy to discover. But, to justify the cookbook’s existence, let me explain why Chef does not solve this problem by itself.

Chef attributes are the usual method for controlling optional features. A common pattern defines a boolean attribute for the feature and takes different recipe paths based on that attribute’s value. In this example, the code would install Failure Friday-related tooling only in production:

in cookbooks/pd-base/attributes/default.rb:

default['pd-base']['failurefriday_enabled'] = false

in cookbooks/pd-base/recipes/failurefriday.rb:

if node['pd-base']['failurefriday_enabled']
  cookbook_file '/opt/failurefriday/reboot.sh' do
    source 'failurefriday/reboot.sh'
    owner 'root'
    group 'root'
    mode 0744
  end
end

in environments/production.rb:

default_attributes(
  'pd-base' => {
    'failurefriday_enabled' => true
  }
)

Chef can set attributes on individual environments and roles, so if a feature maps exactly onto an environment or a role, attributes are enough. However, if a shared feature should be enabled only on a particular role in a given environment, things can get tricky (setting it to true in the environment and to false in all the other roles is one none-too-pleasant way of accomplishing this).

The situation is even more complex for a uniform fleet (for example, twenty identical machines with a web-app role in the same environment, with a feature that should be enabled on two of them). Attributes at the role or environment level do not help: these machines share the same environment and role, so their attribute values are the same. A different role can be assigned to a subset of machines, but that’s a fair bit of work. And the assignment of customized roles, like other manual approaches such as editing the node state of selected machines directly or assigning Chef tags to a few nodes, is not visible anywhere in our Chef source code (thus violating our infrastructure-as-code principle). Because that configuration is not in the code, it does not get replicated when manually modified nodes are replaced. And replaced they will be, because constant, gradual churn of the fleet is a fact of life in large environments such as PagerDuty’s, usually due to hardware failure over time. The churn is guaranteed to eventually obliterate any node configuration change made by hand.

This is where pd-feature comes in. Without repeating the extensive documentation, the solution is still attribute-based, but the attribute’s value specifies the rules for application of the feature instead of being a boolean on/off switch. For example, a count:2 value answers the previous paragraph’s requirement, and if one of the selected machines gets replaced the cookbook will automatically select another one on the next run. The rules are expressed in code and are tweaked with one-liner changes to adjust the feature’s reach.
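To make the count:2 idea concrete, here is a minimal plain-Ruby sketch of one way such a rule could pick machines. It is not pd-feature's actual implementation or API: the `selected_for_feature?` helper, the hash-based ranking, and the `web-NN` node names are all assumptions made up for illustration, and in a real cookbook the fleet list would have to come from somewhere such as a Chef search.

```ruby
require 'digest'

# Hypothetical helper (not part of pd-feature): returns true when node_name
# is among the `count` nodes selected for `feature` out of `fleet`.
def selected_for_feature?(feature, node_name, fleet, count)
  # Rank nodes by a hash of "feature:name" so the ordering is stable across
  # runs and is not tied to the alphabetical order of host names.
  ranked = fleet.sort_by { |name| Digest::SHA256.hexdigest("#{feature}:#{name}") }
  ranked.first(count).include?(node_name)
end

# Twenty identical machines (names are made up for this example).
fleet = (1..20).map { |i| format('web-%02d', i) }
chosen = fleet.select { |n| selected_for_feature?('failurefriday', n, fleet, 2) }

# Simulate churn: one selected node dies and a fresh node joins the fleet.
new_fleet = fleet - [chosen.first] + ['web-21']
new_chosen = new_fleet.select { |n| selected_for_feature?('failurefriday', n, new_fleet, 2) }

puts "before churn: #{chosen.inspect}"
puts "after churn:  #{new_chosen.inspect}"
```

Because every node computes the same ranking from the same inputs, no manual bookkeeping is needed: when a selected node disappears, the surviving selected node keeps the feature and another node moves into the top two on the next run.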

A side benefit of using a unified approach to feature flags is consistency. In our Chef codebase, I can find boolean flags ending with “enable”, “enabled”, “disable”, and “disabled” with values being mostly booleans (true and false) but sometimes strings ('true' and 'false'), depending on the author, age, and inspiration of the cookbook. Mistakes were made, including by yours truly, because of this variety. Using a helper for feature flags enforces a standard behavior and, by naming convention, clearly separates feature flags from other boolean attributes.
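The string-versus-boolean mix is easy to get wrong because every Ruby string is truthy, including 'false'. The sketch below shows the pitfall and one possible normalizing helper; `flag_enabled?` is a hypothetical name chosen for this example and is not taken from pd-feature.

```ruby
# Hypothetical helper: map the flag spellings seen in the wild to a strict
# boolean, and fail loudly on anything unexpected.
def flag_enabled?(value)
  case value
  when true, 'true' then true
  when false, 'false', nil then false
  else raise ArgumentError, "unexpected feature flag value: #{value.inspect}"
  end
end

# The pitfall: a flag stored as the string 'false' still passes a plain `if`.
raw_value = 'false'
naive = raw_value ? :enabled : :disabled                    # => :enabled
strict = flag_enabled?(raw_value) ? :enabled : :disabled    # => :disabled

puts "naive check:  #{naive}"
puts "strict check: #{strict}"
```

Routing every flag through a single helper like this is one way to get the standard behavior described above, whatever spelling an older cookbook happened to use.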

I hope you will find this cookbook useful. This is just one example of general infrastructure problems PagerDuty engineers are solving in addition to developing the PagerDuty platform. If these kinds of challenges interest you, we are hiring.