DraftKings is a digital sports entertainment and gaming company that fuels the competitive spirit of sports fans. The company operates daily fantasy sports, a sportsbook, and a casino, providing fans with opportunities to put their own skin in the game by wagering on their favorite team.
The growing gaming market in the US is driving increased competition. DraftKings is working to build the best, most trusted, and most customer-centric offerings while rapidly expanding into new markets—like a non-fungible token (NFT) marketplace. Josh Engels, Head of Problem Management at DraftKings, is responsible for providing a stable environment to ensure the best fan experience. The priority is engineering resiliency by providing an incident management framework for teams across DraftKings to handle their own issues. “A lot of change occurs on the backend as we grow fast and expand into new markets. We have to make sure we’re stable and offer a great customer experience,” said Engels.
Football weekends are a critical test for the platform. Gameday sees a steady stream of daily fantasy users picking their lineups ahead of kickoff. As soon as the first touchdown happens, DraftKings sees what they’ve termed a “Gronk Spike.” Fans open and refresh their apps, often doubling platform traffic and stressing the infrastructure. To prevent lost revenue, the company needs to ensure platform availability and rock-solid stability through constant gameday chaos. “Gaming is a highly competitive market,” explained Engels. “If a customer can’t access our service, they will immediately jump to a competitor.”
During its start-up years, DraftKings relied on a few key people who knew about its infrastructure to actively monitor systems and fix problems. They were tied to their laptops, carrying them around all day and often distracted from other responsibilities. As the business expanded and the number of platform users grew, so did the number of teams and services. Engels said, “nobody wants to manually monitor 24/7 in a dashboard. We want to be alerted when we need to be notified about an incident.”
The increasing complexity of managing technology caused alert fatigue and burnout for engineers. It was difficult to find time to work on new projects—projects that would keep DraftKings in front of the competition. Engels explained, “we need to help teams understand why they’re getting alerted and where these trends are, so they can have more time to innovate.”
DraftKings teams adopted a service ownership model, with each product line responsible for writing its own code and supporting it in production. These teams operated under the Problem Management team’s incident management framework, with PagerDuty as its digital operations platform. PagerDuty provided visibility across systems and enabled DraftKings to handle incidents quickly and reduce recurring problems.
DraftKings integrated its key monitoring systems into PagerDuty and set up schedules and escalation policies. Teams no longer had to carry a computer around. Now, the right person would be notified when there was an issue, giving teams flexibility and freedom. Engels shared, “with PagerDuty, when a service has an issue, we know exactly who’s expected to resolve it and where that communication is happening. It’s allowed us to really scale the business.”
As teams deploy services, everything is tied into PagerDuty. To reduce manual, repetitive work, an infrastructure-as-code tool handles initial setup and onboarding. Whenever a new service is deployed, the tool automatically creates a corresponding service within PagerDuty and sets up the specific integrations required. This allows DraftKings to standardize service lists within PagerDuty. Engels commented, “you can look in PagerDuty and see the services we have and who owns them. This was hard to maintain at a growing company. Clarity on service ownership has been another huge benefit of PagerDuty.”
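The case study does not name the infrastructure-as-code tool or show its configuration, so the following is only a rough sketch of the underlying idea: registering a newly deployed service through PagerDuty's REST API. The service name, escalation policy ID, and API token are hypothetical placeholders, and the helper `build_create_service_request` is an illustrative name, not part of any DraftKings or PagerDuty tooling. The request is built but deliberately not sent.

```python
import json
import urllib.request

PAGERDUTY_API = "https://api.pagerduty.com"


def build_create_service_request(api_token, name, escalation_policy_id, description=""):
    """Build (but do not send) a POST /services request for the PagerDuty REST API.

    Every argument value used in this sketch is a hypothetical placeholder.
    """
    body = {
        "service": {
            "type": "service",
            "name": name,
            "description": description,
            # Ties the new service to an existing escalation policy, which
            # is what makes ownership of the service explicit.
            "escalation_policy": {
                "id": escalation_policy_id,
                "type": "escalation_policy_reference",
            },
        }
    }
    return urllib.request.Request(
        url=f"{PAGERDUTY_API}/services",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values: a deployment pipeline would supply real ones.
request = build_create_service_request(
    api_token="EXAMPLE_TOKEN",
    name="sportsbook-odds-api",
    escalation_policy_id="PXXXXXX",
    description="Created automatically at deploy time",
)
payload = json.loads(request.data)
# Sending would be: urllib.request.urlopen(request)
```

Running a step like this from the deployment pipeline, rather than by hand, is one way to keep the service list in PagerDuty in sync with what is actually running in production.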
Problem Management uses PagerDuty to drive stability, ensuring the product is available for customers. PagerDuty reports provide metrics to identify trends, for example, if there are a lot of incidents related to a particular feature. The data is used to communicate with the business—all the way up to the CTO—providing information around incident status, mean time to resolve, and SLAs. Engels explained, “metrics allow us to make decisions and drive improvements throughout the organization.”
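As an illustration of the kind of metric mentioned here, mean time to resolve can be computed per service from incident creation and resolution timestamps. This is a minimal sketch over made-up records, not PagerDuty's reporting logic: the field names (`service`, `created_at`, `resolved_at`) and the sample data are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Return the mean time to resolve per service as a timedelta.

    Incidents that are still open (resolved_at is None) are skipped.
    """
    durations = defaultdict(list)
    for incident in incidents:
        if incident["resolved_at"] is None:
            continue
        created = datetime.fromisoformat(incident["created_at"])
        resolved = datetime.fromisoformat(incident["resolved_at"])
        durations[incident["service"]].append(resolved - created)
    return {
        service: sum(times, timedelta()) / len(times)
        for service, times in durations.items()
    }


# Hypothetical incident records for illustration only.
sample_incidents = [
    {"service": "sportsbook", "created_at": "2023-01-01T13:00:00+00:00",
     "resolved_at": "2023-01-01T13:30:00+00:00"},
    {"service": "sportsbook", "created_at": "2023-01-01T14:00:00+00:00",
     "resolved_at": "2023-01-01T14:10:00+00:00"},
    {"service": "daily-fantasy", "created_at": "2023-01-01T12:00:00+00:00",
     "resolved_at": "2023-01-01T13:00:00+00:00"},
    {"service": "daily-fantasy", "created_at": "2023-01-01T15:00:00+00:00",
     "resolved_at": None},  # still open, excluded from the mean
]

mttr = mean_time_to_resolve(sample_incidents)
```

Grouping by service (or by feature) is what makes trends visible: a service whose mean keeps climbing week over week is a candidate for a closer look.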
DraftKings implemented PagerDuty response plays for major incidents: situations where too many alerts are coming in for a single person to manage, or where multiple people are receiving alerts on the same issue. For example, if Sportsbook has a major incident on football Sunday, the response play will pull in a key engineer with business expertise across the infrastructure as incident commander. The response play can also create an incident-specific video conference meeting, and responders can join the conference bridge via PagerDuty. This drives quick resolution during DraftKings’ most critical moments.
If there’s an outage, DraftKings will also use response plays to alert customers as quickly as possible. The Customer Experience team is notified and can immediately react by putting up a banner inside the app and pushing out communications on social media. This improves the fan experience by keeping customers updated on what’s going on. Alternatively, if a customer is the first to report an issue, the Customer Experience team uses PagerDuty’s email integration to create an incident and notify the right teams.
With PagerDuty, DraftKings has improved engineering resiliency and platform stability. Engineers no longer carry laptops around, and Gronk Spikes are under control, with PagerDuty orchestrating the right response every time.
Since implementing PagerDuty, DraftKings has benefited from:
DraftKings is striving to provide the best fan experience while staying competitive and grabbing as much of the betting action as possible. Engels shared, “PagerDuty helps us know about issues before customers do. DraftKings has strict uptime and service requirements, and now constantly surpasses its goals. PagerDuty has really helped make us more efficient as a company.”
DraftKings will continue to prioritize team health. The Problem Management team plans to explore PagerDuty’s Event Intelligence, including smart noise reduction, to minimize the number of alerts on-call engineers receive during an incident. With fewer interruptions, responders can focus on resolving issues even faster, saving DraftKings time and money. The company has also been investigating stakeholder communication to provide the business with status and impact information in real time and to reduce the influx of questions to engineering teams.