OwnBackup Leverages PagerDuty To Transform Customer Service Operations
Headquartered in Englewood Cliffs, New Jersey, OwnBackup is a leading SaaS data protection platform for some of the largest SaaS ecosystems in the world, including Salesforce, Microsoft Dynamics 365, and ServiceNow. Through capabilities like data security, backup and recovery, archiving, and sandbox seeding, OwnBackup empowers thousands of organizations worldwide to manage and protect the mission-critical data that drives their business. The company is ranked #38 on the Financial Times’ list of America’s fastest growing companies in 2021 and is New Jersey’s first Enterprise SaaS Unicorn. More than 5,000 customers rely on OwnBackup to safeguard their data and help them comply with industry and government data regulations. The company is renowned for delivering exceptional customer support, having received a Salesforce Appy Award for its commitment to customer success. With four times as many five-star ratings as its closest competitor, OwnBackup is the number one data backup application on the Salesforce AppExchange. Gadi Vered is the Vice President of Customer Support at OwnBackup and oversees three support centers in the US, UK and Israel. “I’m proud of the team for earning the trust of our customers,” Vered explained. “Our customers rest assured that should they lose their data, they can quickly return to business as usual and avoid costly consequences such as a loss of productivity and revenue.”
Meeting SLAs in a Fast-Growing Company
Since its founding in 2015, OwnBackup has experienced an incredible growth rate of more than 100% yearly. However, this rapidly expanding customer base—including larger contracts with enterprise customers—put OwnBackup’s customer service teams under increasing pressure to provide top-level, 24/7 support. Before using PagerDuty, customer support staff worked evenings and weekends, and manually checked emails to ensure they met service level agreements (SLAs) as specified in their customer contracts. 
The team had to be ready to resolve issues at any moment, which affected their personal lives and resulted in sleepless nights, a situation the company consciously chose to reverse. “We had to be on all the time and performing at 200% at any given moment,” said Vered. “Our customers perform most of their activity overnight or on weekends and need a guarantee that we will be there to cover them on their demanding SLAs.”
Leveraging PagerDuty to Transform Customer Service Operations
PagerDuty provided Vered’s department the comprehensive functionality to transform their Customer Support response process. “There are common tools between the customers and our team; with PagerDuty, we’re able to utilize them effectively to drive value,” said Vered. OwnBackup can trigger PagerDuty through Email, Salesforce, and Slack, giving the customer support team more control and flexibility over notifications. For instance, PagerDuty triggers midway through the expiration of an open case to ensure that the team meets the SLA with plenty of time to spare. In addition, the team can add PagerDuty triggers on Slack, so they can provide additional support to customers and demonstrate OwnBackup’s commitment to 24/7 rapid response. The team also has robust escalation policies up to the CEO, Sam Gutmann, so alerts are never missed. Teams can now respond in minutes, well below their one-hour response SLA. “With PagerDuty, our customers are assured that we’re committed to short responses in a support policy and that they will be upheld,” said Vered. “With the right people, technology, and processes in place, the sky is the limit.” With PagerDuty, OwnBackup can provide 24/7 customer support operations worldwide with flexible scheduling and easy updates during a last-minute change in the on-call rotation. 
The scheduling and reporting features also enable managers to track off-hours work by their on-call teams and provide overtime compensation, resulting in stronger staff motivation and dedication to customer success. “We leverage PagerDuty to provide exceptional customer support,” said Vered. “We use reporting for pay, as well as keeping ourselves honest on how we do, as trust through transparency is a core value for our company.”
Effectively Troubleshooting Data Loss
PagerDuty has been instrumental in helping OwnBackup quickly restore large-scale, critical data for its customers. When a large banking customer deployed a software update without fully testing it, a bug caused the system to delete a staggering number of essential records. The banking customer submitted a high-priority case to OwnBackup over the weekend, which triggered a PagerDuty notification to the on-call engineer. Within eight minutes of the case opening, the Customer Support team connected with the customer and helped them quickly identify and restore the records, averting a considerable crisis. Another customer had inadvertently changed sales opportunity records within Salesforce right before the end of the quarter. The customer frantically called OwnBackup’s support line, which is configured to trigger PagerDuty during off-hours. Within minutes, the customer service team was on Zoom with the customer and helped guide them through restoring the records. “Perpetuating success needs constant nurturing, especially as you grow and land larger contracts. For us, this is where PagerDuty became the obvious choice to integrate with,” said Vered.
Benefits of PagerDuty
By implementing PagerDuty, OwnBackup has transformed its customer service operations, with benefits including: A scalable solution that enables teams to meet SLAs with faster response times. Increased visibility with detailed reporting and incident response data. Higher motivation among customer support staff with proof-of-overtime compensation. 
Lower employee burnout and increased engagement with defined rotations and escalation policies. “PagerDuty scales with us as we grow,” shared Vered. “PagerDuty allows us to maximize our team of experts who effectively solve issues for our customers.”
Using PagerDuty within Salesforce Service Cloud
Looking to the future, Vered’s team is eager to evaluate PagerDuty’s integration with Salesforce Service Cloud announced in 2021, determined to break down further walls between his team and the Engineering teams. PagerDuty’s native work-where-you-are experience inside of Salesforce Service Cloud enables teams to view the PagerDuty status dashboard directly within Salesforce and provide the visibility needed to proactively engage with customers when issues arise. Customer service teams also have access to the PagerDuty incident command console within Salesforce, allowing them to seamlessly escalate urgent issues to the appropriate development teams natively within the app. “This looks to be an absolute game changer,” said Vered. “I’m excited about PagerDuty’s future direction and the investment in the Salesforce integration.” To learn how PagerDuty can help your team make things simple and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.
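The case study above says a page fires midway through an open case's SLA window, against a one-hour response SLA. The timing logic can be sketched as follows. This is only an illustration of the arithmetic, not OwnBackup's or PagerDuty's implementation; the function names and the `fraction` parameter are hypothetical.

```python
from datetime import datetime, timedelta


def sla_reminder_time(opened_at: datetime, sla: timedelta, fraction: float = 0.5) -> datetime:
    """Return the moment to page: `fraction` of the way through the SLA window.

    fraction=0.5 models the "midway" trigger described in the text.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    return opened_at + sla * fraction


def should_page(opened_at: datetime, sla: timedelta, now: datetime,
                responded: bool, fraction: float = 0.5) -> bool:
    """Page only if the case is still unanswered once the reminder time has passed."""
    return (not responded) and now >= sla_reminder_time(opened_at, sla, fraction)


# Hypothetical weekend case opened at 02:00 with a one-hour response SLA.
opened = datetime(2024, 1, 6, 2, 0)
one_hour = timedelta(hours=1)
reminder = sla_reminder_time(opened, one_hour)
```

With these inputs the reminder lands at 02:30, which leaves half of the SLA window for the on-call engineer to respond.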
PagerDuty Process Automation Delivers Results for ResultsCX
ResultsCX has a simple, yet critical mission: helping its customers take care of their customers. As a global contact center, ResultsCX is responsible for trillions of interactions each year—whether over email, chat, SMS or telephone. If you’ve ever called a bank or a hotel, you’ve probably called a company like ResultsCX. Call volume is directly related to the company’s revenue. Every missed interaction is not only damaging to the bottom line, but to the reputations of both ResultsCX and its customers. Jamie Vernon, Senior Vice President, IT Infrastructure and Operations, explained, “We can do the full suite of whatever any company might need to help take care of their customers. It’s my team’s job to deliver those interactions to some 20,000 people every day, ensuring we keep our applications, services and networks running around the clock.”
Interruptions, Escalations, and Repetitive Tasks
Unplanned work, such as service impacts, was a challenge for the infrastructure engineers at ResultsCX. Most instances required specialists, and it often took too much time to respond. For example, during a network failover it took up to 30 minutes to identify the right engineering specialist and get them on a conference bridge, plus another 10 minutes for them to plan next steps and ensure no further impact. These types of disruptions caused other valuable work to be pushed aside or into after hours, resulting in undue stress and strain on personal lives. “We’d have an outage we weren’t expecting, and whatever we were working on got dropped,” said Vernon. Daily maintenance tasks were also time consuming. For example, one engineer would spend two hours a day on telecom maintenance. This job was repetitive and boring, and was flagged as something that could be handed off to a junior resource. However, hiring more staff was a challenge given the tight labor market. This drove ResultsCX to evaluate alternative solutions. 
“Our leadership at the time was more likely to pay for software than it was another person,” Vernon said.
ResultsCX Looks to Automation for Greater Scale, Speed (and Sanity)
Looking to improve overall IT uptime and create a better experience for engineers, ResultsCX turned to PagerDuty Process Automation. ResultsCX already used PagerDuty Incident Response, and the team selected Process Automation for its tight integration. “PagerDuty understands how to create dependencies, define them, take in events and turn them into actual information. So, the logical next step was using Process Automation to turn that information into action,” said Vernon. Now, the team can attach a button in the mobile app to “run a diagnostic” for any technical service—for example, a network switch, firewall, or server. And, with the click of a button, the right people get detailed information about what went wrong and how to drive restoration. “PagerDuty Event Intelligence and Process Automation allow us to more quickly identify what has failed, why it failed, and which team needs to drive repair.”
R.I.T.A. Joins the Team
IT leadership made an unconventional move to drive adoption of this new technology—when introducing Process Automation, they gave their new “employee” a name: R.I.T.A., or “ResultsCX IT Automation.” “Instead of talking about scripts, we decided to look at automation as a junior employee,” Vernon said. “I couldn’t hire a $100,000/year engineer, but I could ‘hire’ a junior and train up.” R.I.T.A. got right to work with its automation directive, proving its value from basic diagnostics, informing teams about issues and driving service restoration, to automating maintenance and service tasks. Now, if there’s a network failover, anyone in a contact center can respond with the push of a button and resolve the issue in two minutes. 
Vernon explained, “Not only is it the restoration of a revenue stream, but it can be done by any resource on my team.” Daily maintenance tasks, including telecom maintenance, are another part of R.I.T.A.’s role. Now, all it takes is one click, and a report is ready for review 20 minutes later. This has freed up time for engineers to focus on more valuable and creative work. “We’ve not only lowered execution times, but we’ve democratized that ability to anyone, not just those from the specialized skill set,” Vernon said. “Having these predefined automations—200 already—built and blessed by senior engineers means tasks not only take far less time and effort, but now anyone can do it.”
Exceeding Customer Expectations
With PagerDuty, ResultsCX has seen better overall IT uptime and exceeded SLAs—enabling agents to be productive and ensuring customer satisfaction. This is a result of: Faster diagnosis and resolution. Resolution time for service impacts like network failovers has been reduced from 40 minutes to 2 minutes. Increased capacity for high value work. Automated telecom maintenance now requires one click, freeing up two hours per day. With over 200 automations, teams have significantly increased their innovation capacity. Improved business agility. Anybody on the team can run automation jobs instead of escalating to specialists. Better engineering experience. Employees spend less time dealing with unplanned work and maintenance tasks, and can focus on more meaningful projects. “Our overall IT uptime calculation is really simple: how many minutes of agent productivity did we deliver today?” Vernon said. “The industry standard is 99.95%, and R.I.T.A. has helped us exceed that. And if I can point to our uptime as a reason why customers should trust us, well now we are talking about revenue growth and expansion.”
Automating the Future
Things have gone so well that R.I.T.A. 
got a “promotion” to handle more tasks, earning the trust of its colleagues and instilling more confidence in execution. “We’re more trusting in our ability to articulate a process and document it and make it executable. We’re more confident in R.I.T.A.,” said Vernon. The team plans to automate more over time, and is exploring how to further enrich diagnostic reports, drive self-service and self-repair to support the help desk team, and even automate client onboarding. Learn more from Jamie in his Summit ‘22 presentation Cultural Adoption of IT Automation and his thought leadership blog. To find out how PagerDuty Process Automation can help you automate and delegate business and IT processes, contact your account manager or request a demo.
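Vernon describes uptime as the share of agent-productivity minutes delivered, measured against a 99.95% standard, and the case study reports failover resolution dropping from 40 minutes to 2. The relationship between those numbers can be sketched as follows. The eight-hour shift and the 500 affected agents are hypothetical inputs chosen for illustration; only the 20,000 agents, the 99.95% target, and the 40- and 2-minute figures come from the text.

```python
def uptime_percent(scheduled_minutes: float, lost_minutes: float) -> float:
    """Uptime as the percentage of scheduled agent-minutes actually delivered."""
    if scheduled_minutes <= 0:
        raise ValueError("scheduled_minutes must be positive")
    return 100.0 * (scheduled_minutes - lost_minutes) / scheduled_minutes


def downtime_budget(scheduled_minutes: float, target_percent: float = 99.95) -> float:
    """Agent-minutes that may be lost while still meeting the target."""
    return scheduled_minutes * (100.0 - target_percent) / 100.0


AGENTS = 20_000           # from the case study
SHIFT_MINUTES = 480       # assumption: one eight-hour shift per agent
AFFECTED_AGENTS = 500     # assumption: agents hit by a single network failover

scheduled = AGENTS * SHIFT_MINUTES
before = uptime_percent(scheduled, AFFECTED_AGENTS * 40)  # 40-minute failover
after = uptime_percent(scheduled, AFFECTED_AGENTS * 2)    # 2-minute failover
budget = downtime_budget(scheduled)
```

Under these assumptions a single 40-minute failover pushes the day below the 99.95% target, while the same failover resolved in 2 minutes stays comfortably within the daily budget.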
Groww Opts for PagerDuty for a Better DevOps Experience
Founded in 2017, Groww is an investment platform that enables users to invest in stocks, mutual funds, ETFs, and gold in a simple, paperless, and hassle-free manner. The FinTech is one of India’s fastest-growing investment platforms and has reached unicorn status by making investing simple and transparent for new investors. Operating under a service ownership model, the DevOps team is responsible for several mission-critical services including authentication and payment services. The team must also ensure customers can view real-time market data and place orders. During the past year, the startup reached over 30 million users and increased its engineering staff by over 65% to support the rapid growth. Aman Khare, DevOps Engineer, helps to support the platform’s infrastructure and security. “We make sure that the infrastructure is up and running. We make sure our customers have the best possible experience on our platform,” he said.
DevOps After Hours Toil
Groww had an on-call management solution in place, but it wasn’t always reliable during an incident, especially outside of normal business hours. DevOps engineers sometimes missed email and Slack notifications in the middle of the night. “We couldn’t depend on Slack for notifications late at night, and time to acknowledge was quite high,” shared Khare. Involving other responders or subject matter experts during critical incidents required manual effort for the team. More frustrating, it was possible for an entire team to receive an alert that the on-call engineer hadn’t received an email about. These situations required tracking down the right individuals, which slowed down resolution time. Further, the team didn’t have a way to suppress alerts based on certain conditions like severity. Some alerts weren’t relevant after hours and could wait to be addressed the next day. Too much noise made it challenging for engineers to focus on what mattered. 
These challenges created a difficult on-call experience for the DevOps team. It became clear that the team needed a better incident response process that could scale with the company. “We needed something that could enhance the experience for our developers,” explained Khare.
A Reliable Tool for Reliable Results
After exploring alternate options, the team selected PagerDuty as a more reliable and comprehensive DevOps solution. By leveraging some of the 700+ integrations available through PagerDuty, Groww centralized alerts coming from monitoring systems such as Google Cloud Platform, Prometheus, New Relic, and Grafana. Groww customized PagerDuty to align with how services are deployed in the company’s infrastructure, driving clarity around who should be notified of an incident, and providing context around service dependencies. PagerDuty’s flexible, dynamic notifications were an immediate win for the team, who can now receive notifications via SMS, call, or mobile app push notifications. This eliminated the need to check email and Slack after hours, and greatly improved the team’s mean time to acknowledge (MTTA). “PagerDuty gives us a call and ensures we never miss a critical issue,” said Khare. PagerDuty also makes it easy to bring in additional responders when cross-functional triage is required—for example, if the security and database teams are impacted by the incident. Acknowledging, escalating, and resolving incidents can all be done within the mobile app, empowering teams to manage incident response from anywhere. PagerDuty Event Rules provide Groww with the flexibility to suppress alerts that don’t need to wake up team members overnight such as low severity or non-actionable alerts. Reducing unnecessary noise helps the team focus and respond to important issues. 
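The suppression behavior described above, where low-severity or non-actionable alerts wait until morning, comes down to a small predicate. The sketch below is a generic illustration in plain Python and does not use PagerDuty's Event Rules syntax or API. The `Alert` class, the severity labels, and the 09:00 to 18:00 business hours are all assumptions.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Alert:
    summary: str
    severity: str            # assumed labels: "critical", "error", "warning", "info"
    actionable: bool = True


LOW_SEVERITIES = {"warning", "info"}   # assumption: what counts as "low severity"
BUSINESS_START = time(9, 0)            # assumption: business hours are 09:00-18:00
BUSINESS_END = time(18, 0)


def is_after_hours(now: time) -> bool:
    """True outside the assumed business-hours window."""
    return not (BUSINESS_START <= now < BUSINESS_END)


def should_suppress(alert: Alert, now: time) -> bool:
    """Hold back alerts that can wait until the next day; always deliver the rest."""
    if not is_after_hours(now):
        return False
    return alert.severity in LOW_SEVERITIES or not alert.actionable


disk_warning = Alert("Disk usage at 70%", "warning")
payments_down = Alert("Payment service unreachable", "critical")
```

At 02:30 the disk warning is held back while the payment outage still pages the on-call engineer; during business hours both are delivered.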
Benefits of an Improved Incident Response Process
PagerDuty quickly proved its value at Groww, laying the foundation for a better incident response process that will fuel the company’s growth while ensuring a great user experience. PagerDuty helped: Improve MTTA. PagerDuty’s dynamic notifications and customizable escalation policies make sure incidents are never missed. Improve MTTR. Notifications reach the right people faster with PagerDuty’s service-based architecture, which means the responder can resolve incidents faster. Make life easier. Mobile incident management and event rules have reduced manual efforts and provided flexibility to engineers on call. “If people don’t need to spend time debugging and we’re able to avoid downtime, they could focus on more important work. People will feel more satisfied developing new products instead of putting out fires,” said Khare.
Growing Into the Future
Having seen a quick time to value, the DevOps team is eager to find more ways to leverage PagerDuty to improve its operations. For instance, the team plans to evaluate alert analytics to better understand which issues are taking the longest to resolve. This information will help determine what system improvements will be most impactful. Also, the team is looking to use PagerDuty for stakeholder communications to provide the business with information about an incident’s scope of impact and progress toward resolution. To learn more about how PagerDuty is helping companies transform their digital operations, visit www.pagerduty.com/customers for more information and start a 14-day free trial today.
PagerDuty Enables a New Level of Incident Response for Hyland Software
In 1991, Packy Hyland Jr. convinced a Wisconsin bank it could save printing costs by storing reports on optical disks. That early innovation became OnBase, the now $5 billion global company’s flagship product—and set Hyland Software on its way to being a leading provider of data processing, storage, and management. A universal enterprise information platform, OnBase centralizes business content in one secure location. It then delivers relevant information when and wherever it’s needed – increasing productivity, delivering excellent customer service, and reducing risk. Serving over half of Fortune 100 companies, it’s critical for Hyland’s infrastructure team to ensure uptime of these cloud-based technologies, solutions and services.
Ineffective Alert Distribution Impacts Resolution Times
The infrastructure team struggled to get actionable information to the right responders. “Prior to PagerDuty, we had multiple monitoring solutions that would deliver alerts in various ways,” explained Brian Long, Observability Engineer. “We had difficulty getting the correct information to the correct team, or alerts were delivered in fixed formats that didn't necessarily give pertinent information front and center.” For example, when the team needed to be notified about various version retirements, alerts came in as a giant block of text with no formatting. The information wasn’t consumable and lacked details about which instance, the endpoint that was being retired, and what work needed to be done on it. Even experienced responders would need extra effort and time to dive in and understand the problem. In addition, triage and cross-team escalations were inconsistent and at times ineffective, resulting in slow or clunky collaboration. 
“Many of the processes that worked during the normal workday schedule, such as reaching out to those teams through Slack, weren’t reliable if those teams were off hours, or if the response was handled by a 24/7 team that then needed to escalate to a non-24/7 team,” said Long. Hyland needed to improve the user experience for engineers, as well as drive faster resolution for its nearly 20,000 customers.
How Hyland Leveraged PagerDuty for Smarter Event Routing and Enrichment
The company turned to PagerDuty Event Orchestration, a feature set within the Event Intelligence portfolio. Event Orchestration uses custom logic and rules nesting to enrich and control routing, or to trigger webhook actions based on event conditions. Event Orchestration cuts down on manual work by connecting real-time event processing with intelligent automation. “Event Orchestration allows us to set multiple service delivery rules to classify if a payload comes in with certain detailed information,” Long shared. Because Event Orchestration processes rules “top down,” the team puts specific and strict rules toward the top, and more generic rules toward the bottom as catch-all functionalities. Event Orchestration helped Hyland address the issue of poorly formatted alerts like various version retirements. Based on the metadata, the alert is intelligently delivered to the correct service. By adding Transformations and defining Custom Variables, difficult machine terms and code are translated into helpful context for responders to effectively respond to the problem. “Using custom variables, we are able to write pieces of text that make the alert information more human and easier to understand,” Long explained. “Now we know what version retirement it is, what account it’s on, and the instance or machine that requires action. 
The alert responder can then quickly mobilize, identify any additional pieces of information that don't get sent as part of the payload, and resolve the issue much faster.”
Mobilizing Teams Faster with Response Plays
Hyland also leveraged PagerDuty to assemble and mobilize cross-functional teams, escalating to additional subject matter experts when assistance is needed and further speeding up resolution times. Using Response Plays, incident actions can be run at the push of a button, which escalate directly to the appropriate team based on the pre-configured escalation policies inside of PagerDuty. The name of each Response Play is actionable, so the user knows exactly what will happen by clicking it. “All actions are tracked on the incident so the person reaching out knows what is going on,” Long said.
Benefits of Intelligent Delivery
PagerDuty has made a significant impact on Hyland’s infrastructure team, helping to ensure an always-on cloud environment for customers. The team has seen improvements that include: Reduced manual processes and toil. Event Orchestration uses a powerful decision engine to get the right information to the right responders. More meaningful notifications. Custom Variables ensure information is easy to understand for timely, accurate, and actionable triage. Faster resolution times. Response Plays help to assemble and mobilize cross-team action to tackle complex incidents. “When we looked at our problems, we saw that we had alerts that potentially needed to go to different teams, the alerts were poorly formatted, and we had hurdles and issues reaching out to other teams,” Long said. “PagerDuty solved all of that for us.” Watch Brian’s Summit ‘22 Session—Intelligent Delivery and SME Mobilization: Ensuring Effective Alert Distribution and Resolution. Learn more about how PagerDuty is helping companies transform their digital operations and start a 14-day free trial today.
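The case study describes two ideas: rules evaluated top down, with strict rules first and a catch-all last, and custom variables that turn raw payload fields into readable text. The sketch below illustrates that pattern in plain Python. It is not PagerDuty's Event Orchestration rule language or API, and the rule names, service names, and payload fields (`type`, `version`, `account`, `instance`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]   # condition evaluated against the event payload
    service: str                      # hypothetical destination service
    template: str                     # readable summary built from payload fields


# Order matters: specific rules sit at the top, the generic catch-all at the bottom.
RULES = [
    Rule(
        name="version retirement",
        matches=lambda e: e.get("type") == "version_retirement",
        service="platform-upgrades",
        template="Version retirement {version} on account {account}: "
                 "instance {instance} requires action",
    ),
    Rule(
        name="catch-all",
        matches=lambda e: True,
        service="infrastructure-triage",
        template="Unclassified event: {raw}",
    ),
]


def route(event: dict, rules: list = RULES) -> tuple:
    """Return (service, summary) from the first rule that matches, top down."""
    for rule in rules:
        if rule.matches(event):
            # Defaults keep the template safe when a payload field is missing.
            fields = {"version": "unknown", "account": "unknown",
                      "instance": "unknown", "raw": str(event), **event}
            return rule.service, rule.template.format(**fields)
    raise LookupError("no rule matched; add a catch-all rule")


service, summary = route({"type": "version_retirement", "version": "1.2",
                          "account": "acme", "instance": "vm-7"})
```

Because evaluation stops at the first match, placing the catch-all anywhere but last would shadow the specific rules beneath it, which is why the strict rules go at the top.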
Brink’s Secures Scalability with PagerDuty Process Automation
Brink’s, a global leader in cash management, secure logistics and payment solutions, helps businesses securely manage their money. The company operates over 16,000 secured trucks and serves customers in more than 100 countries. Technology is a key component of Brink’s long-term corporate strategy to drive growth and ensure exceptional customer support. In recent years, the company saw an opportunity to modernize and mature its IT ecosystem. Teams were managing workflows manually, and spending too much time and money on repetitive tasks. Deployments, migrations, and changes in the IT environment were also time consuming and often inconsistent. The company needed more scalable solutions moving forward.
Identifying Toil
The engineering team audited their processes and identified challenges that impacted productivity and overall employee satisfaction. One of these processes was distributing changes in information from one SaaS system of record to other operational systems and tools that lacked native integrations. Their as-is process required the user to manually extract data, modify it to fit the formats required by these other systems, clean up the data, encrypt it, and then manually upload it to other operational systems. “In cases where native integrations didn’t exist, it was widely agreed that doing this manually was really painful,” explained Robert Powers, IT Automation Manager. While it was a well-documented process, it involved multiple manual steps and several systems across multiple departments. There was a risk of stale data and potential for human error, and it took about 10 hours for teams to complete these jobs, adding up to 520 full-time equivalent (FTE) hours per year.
Automation to Eliminate Toil and Drive Improvements
To address these challenges, the team developed an automation practice using PagerDuty Process Automation. Their goals were to reduce toil, and facilitate simpler and faster deployments, migrations, and changes in the tech stack. 
They saw automation as a way to drive agility and scalability for IT. “We chose PagerDuty Process Automation because of its simplicity,” Powers commented. Brink’s turned to Process Automation to solve their system of record data transfer challenges. “By scheduling this workflow with Process Automation and adding in notifications, we were able to turn this into a fully-automated, hands-off process with notifications on completion,” said Powers. Ops systems are now updated in less than one day, and any risk associated with using automation is mitigated by immediately being notified of automation failures. The cost to automate was 20 FTE hours, and the team is now spending 99% less time on the previously manual integration process. “We're saving someone time that they can use on more interesting tasks.” This first project proved its value quickly and helped other teams understand the benefits of automating processes. PagerDuty Process Automation was an easy-to-use solution that eliminated pain in someone’s day-to-day life while reducing costs to the business.
Enabling Self-Service Automation
As Brink’s automation practice matured, the team focused on delegating the use of automation to other stakeholders—developing a self-service catalog of capabilities that teams could use without relying on a specialist for help. “This has significant potential to make our business much more agile,” explained Powers. Their next automation project aimed to reduce the time it took for expert engineers to provision virtual machines (VMs), and eliminate waiting time for developers who needed those VMs to deploy their software. The process involved users making requests, waiting for the request to be approved by a specialist, and once approved—for other experts to manually deploy and validate the hosts. It took about two weeks for a developer to receive their requested VMs, which was mainly spent waiting for reviews and approvals. 
By building a self-service automated workflow for building VMs, they reduced a developer’s waiting time from two weeks to 3 minutes. Powers said, "We chose PagerDuty Process Automation as the orchestrator for this workflow because it gave us the ability to manage these jobs programmatically via an API, which we could integrate with key resources on our platform. Additionally, we were able to present users with a simple and easy-to-use interface with unnecessary features turned off via the access control lists (ACLs) within Process Automation."
Benefits of a Modernized IT Ecosystem
Brink’s has successfully used automation to drive constant iterative improvements to the business and, in turn, to its customers. With PagerDuty Process Automation, Brink’s has: Realized fast time to value. By choosing an easy-to-use tool and automating well-documented processes, the team demonstrated the value of the solution and saw a fast return on investment. Eliminated toil. Teams spend 99% less time on manual tasks like loading information from their one SaaS system of record to other tools—while reducing risks of manual errors. With a self-service catalog of automation capabilities, a VM build now takes three minutes. Improved team efficiency. Individuals have time to work on more interesting and valuable tasks. Enabled agility and scalability. Automation facilitates more frequent, fresher data updates, and simpler and faster deployments, migrations, and changes in the tech stack. Reduced costs. Engineering saves over 500 FTE hours annually by automating one workflow. “Look for something that everybody agrees is really, really painful. If the business agrees that doing something is costing too much or creating too much risk, and the people doing it agree that it's tedious and painful, that's a pretty good indicator you’re looking at a prime candidate for automation,” shared Powers.
What’s Next?
Brink’s continues to iterate on and expand its automation practice. 
Their next automation initiative will deploy entire reference architectures on-demand in AWS. Learn more about how to develop an automation practice in Robert Powers’ PagerDuty Summit ‘22 presentation: Improving Operational SLAs by Orders of Magnitude with Automation at Brink's. To find out how PagerDuty Process Automation can help you automate and delegate business and IT processes, contact your account manager or request a demo.
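The figures quoted in this case study (about 10 hours per manual run, 520 FTE hours per year, 20 FTE hours to automate, and 99% less time afterwards) fit together as a simple return-on-investment calculation. The sketch below works through it. The weekly cadence is an assumption inferred from 10 × 52 = 520 and is not stated in the text.

```python
HOURS_PER_RUN = 10        # from the case study: about 10 hours per manual run
RUNS_PER_YEAR = 52        # assumption: weekly runs, since 10 * 52 = 520
AUTOMATION_COST = 20      # from the case study: FTE hours spent building the automation
TIME_REDUCTION = 0.99     # from the case study: 99% less time on the process


def annual_manual_hours(hours_per_run: float, runs_per_year: int) -> float:
    """FTE hours spent per year when the process is done by hand."""
    return hours_per_run * runs_per_year


def annual_hours_saved(manual_hours: float, reduction: float) -> float:
    """FTE hours recovered per year after automation."""
    return manual_hours * reduction


def payback_weeks(cost_hours: float, saved_per_year: float, weeks_per_year: int = 52) -> float:
    """Weeks of savings needed to recover the one-time automation effort."""
    return cost_hours / (saved_per_year / weeks_per_year)


manual = annual_manual_hours(HOURS_PER_RUN, RUNS_PER_YEAR)
saved = annual_hours_saved(manual, TIME_REDUCTION)
payback = payback_weeks(AUTOMATION_COST, saved)
```

Under the weekly-run assumption this gives roughly 515 hours saved per year, consistent with the "over 500 FTE hours annually" figure, and the 20-hour investment is recovered in about two weeks.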
IHS Markit: Centralizing Incident Management With PagerDuty & ServiceNow
In today’s digital world, organizations are constantly undergoing change. They’re moving to the cloud and rolling out DevOps at scale—all in the name of driving innovation. But moving from a monolith to microservices can lead to applications becoming increasingly distributed. When problems arise, customers don’t care how many teams and services you have, or how complex your architecture is. They only care that your services work when they need them to. To this end, bringing everything—teams, services, data—under centralized management is key. Urgent work cannot be held up by centralized ticketing tools. This is where combining IT service management tools with a digital operations platform can bridge the gap between central IT and decentralized teams. Enter PagerDuty and ServiceNow—by combining the two, responders gain access to automation to drive action without delay, enabling a real-time response in seconds while maintaining a complete history of all activities. This combination also streamlines the business response to incidents, keeping stakeholders updated. This better together approach is representative of incident response processes leveraged today in the modern enterprise stack. One such PagerDuty customer benefitting is IHS Markit.
The Culture Clash
IHS Markit provides analytics and intelligence to financial service providers, governments, and other major industries. Headquartered in London, UK, it employs 16,000 people globally. IHS Markit needed to bring together a rapidly growing number of hybrid operations to gain full visibility across the business and manage incidents from a centralized command center. The company had grown through acquisition and now offered around 700 customer-facing services and 300 internal services. Tracking for incidents at this scale was incredibly challenging, and was made harder by the conflicting requirements held by different areas of the business. 
The DevOps team wanted to remain “agile, autonomous, and awesome,” with full control over all its monitoring needs. A core requirement from DevOps was that the team did not want to raise tickets or have to log on to ServiceNow. The operations command center (OCC) team was rooted in a more traditional IT infrastructure library (ITIL) structure and based its system on ServiceNow. The team wanted better scheduling and escalation policies, but with “zero impact to the existing, mature incident management processes.” Compliance wanted to track controls and records in a common system of record in ServiceNow, particularly as IHS Markit has many products under various regulatory regimes. Management requested global oversight across all teams, whether the team was more aligned with DevOps or sat within the more traditional ITIL side. Management wanted ServiceNow to provide this visibility. IHS Markit already had PagerDuty in place, but wanted to expand its use. John Kennedy, Director of Observability at IHS Markit, explained, “We wanted to bring incident management together into one enterprise offering that was horizontal across the company and properly managed.” A Solution for Everyone To achieve this, IHS Markit integrated ServiceNow incidents with PagerDuty. IHS Markit worked with PagerDuty’s customer success team to customize the PagerDuty platform to accommodate all requirements and improve operations. This enabled the DevOps team to maintain ownership of their services within PagerDuty. For these teams, the ServiceNow integration was introduced “by stealth”—everything was tracked and recorded in ServiceNow, without them ever having to log into the platform. For the OCC team, PagerDuty’s integration with ServiceNow ensured the existing incident management process remained intact. Everyone could monitor major incidents via PagerDuty dashboards, even if they were not yet onboarded in PagerDuty. 
With one click, incident managers could quickly bring in specialized teams with diverse skills, including senior executives or product experts. This also fulfilled compliance and management’s visibility requirements, as PagerDuty gave them a single pane of glass through which they could view the entirety of the system. “All of our major incident management was now being done in PagerDuty, and if the incident occurred outside of it, then our major incident managers would sync it up with PagerDuty,” John explained. “On top of this, they’re using response plays to bring in executives to help us make quick decisions. As a result, we’re getting major benefits, especially on MTTR.” This better together approach means that central IT has visibility and access across distributed teams. This will be essential as IHS Markit continues its growth journey. What’s next for IHS Markit? Looking ahead, IHS Markit will continue to centralize visibility. “There is a huge expansion of agile and DevOps methodology across the company, which means we need to think about the next evolution of our converged model for incident management,” John said. Maintaining DevOps’ ability to be “agile and autonomous” will also be a major focus. “We need them to be able to create their own technical services, so that means thinking about the technical services in ServiceNow and whether they need to be hooked into our hierarchy there,” John explained. “Governance is important too—how we maintain the quality of the system and how that's governed centrally.” As digital transformation continues and teams are more distributed than ever, it’s key that business processes for managing urgent work can operate in real-time. 
To find out more about how PagerDuty can enhance ServiceNow and other ITSM tools for faster resolution times and enhanced coordination, check out these resources: How Your ITSM Tool & PagerDuty Make a Dynamic Duo for Real-Time Work Enhance your ITSM Solutions brief: Extend ITSM Workflows with PagerDuty And, if you’re ready to see PagerDuty in action, try us out for free for 14 days.
Better Data for Public Health: How Nexleaf and PagerDuty are Monitoring Healthcare
Having a reliable power source is something many of us take for granted. It is particularly important for healthcare facilities to have a consistent, reliable power source to ensure that vulnerable patients – specifically those who rely on electricity to sustain their lives – are not disrupted. In rural Sub-Saharan Africa, however, it’s estimated that only about 28% of hospitals have reliable electricity. With little to no data to understand how and when power outages occur, it has become increasingly challenging for the hospital staff to manage. Nexleaf Analytics is working to solve this challenge. Nexleaf creates data and technology solutions for better health outcomes in low and middle income countries. They work alongside health advocates, governments, and local communities to provide actionable data for decision-making at scale. Their mission is to ensure countries have the data they need to build lasting solutions that improve people’s health. The Case for Data and Analytics Having an unreliable power source causes a myriad of problems for rural hospitals. For example, many rely on diesel generators when they have unreliable power. Although this is the only way to ensure backup power, it’s also a costly and inefficient workaround for unstable power systems. Most of these facilities lack baseline data to track the trends in outages, which means the hospital staff play a guessing game of when there will be an outage and are constantly on high-alert. It also causes problems forecasting budgets for diesel fuel expenses. Without data showing exactly how long and costly these outages are, it's difficult for these hospitals to justify additional funds. “Our main aim was to document demand for power data and to also understand what problems and challenges exist that could be assisted by having visibility in data,” said Amos Momanyi, Medical Equipment Project Manager for Nexleaf Analytics. 
A pilot program between PagerDuty, Nexleaf and the Center for Public Health and Development was implemented to understand how and when outages occur, and to establish protocols for maintaining healthcare facilities when outages happen. The program was implemented in 15 rural hospitals in Kenya with a few goals in mind: document the demand for power data; understand the problems and challenges that could be assisted by having visibility in data; and understand how alerts and data could help resolve power outages. The Power of PagerDuty Nexleaf deployed PagerDuty and connected IoT sensors to provide notifications to hospital staff via SMS and an online application. This allowed the staff to easily understand the root cause of power outages. One hospital, for example, found that they were sharing electricity with a neighboring facility, triggering power outages at 6 a.m. They shifted energy usage to different times of day to maintain a predictable energy supply, a simple solution, but one that would have been invisible to the team without data from PagerDuty. The data from the PagerDuty platform also helped medical facilities explain their need for increased diesel and justify why they were over budget. Even better, the data helped improve the accuracy of their financial projections for the months ahead. Most importantly, real-time notifications from PagerDuty meant that biomedical engineers were not required to be at the facility to know when an outage was happening. After receiving an alert, hospital employees could act quickly to reconnect power to the facility. This eliminated the need to manually monitor their backup power, and prevented blackouts with potentially major consequences for patient outcomes. “With PagerDuty, teams could ensure that no fatalities happened because of equipment failure due to loss of power,” said Momanyi. The Future Looks Bright With the success of the pilot, hospitals and their staff found a number of solutions. 
Hospital employees worked to determine use cases for power data that could help them make effective and efficient decisions around management of power at their facilities. Facilities are keeping PagerDuty in place and Nexleaf is expanding to new facilities. For more on Nexleaf Analytics' journey, watch the full Summit '22 session here. Learn how real-time operations from PagerDuty power nonprofits, or try a 14-day free trial today.
SIRUM Uses PagerDuty to Prioritize Critical Work
SIRUM is a nonprofit technology company that drives the future of healthcare by connecting people with surplus medications. Kiah Williams, Co-Founder of SIRUM, discusses how PagerDuty helps prioritize what's most urgent, so engineers have time to build for the future.
SailPoint Secures Digital Operations Maturity with PagerDuty
SailPoint is the leader in identity security for the modern enterprise, empowering complex companies worldwide to build a security foundation grounded in identity security. Harnessing the power of AI and machine learning, SailPoint automates access, delivering only the required access to the right identities and technology at the right time. SailPoint has experienced continued growth as companies experience more—and increasingly sophisticated—cybersecurity threats. Additionally, the COVID-19 pandemic drove more people to work from home, creating new security risks for their employers. As the security landscape continues to evolve, SailPoint’s DevOps team has to innovate and find new ways of working. Omar Lopez is a DevOps Manager for SailPoint’s cloud offering. His team is responsible for everything related to observability, from metrics and logging to tracing and alerting—anything that enables SailPoint to identify and address issues before they become problems for customers. “The uptime of our products is super important to our mission here at SailPoint,” Lopez said. Meeting the Challenges of Expansion In an effort to optimize the operations of its growing DevOps team, SailPoint recently made some structural changes to the team, which included organizing smaller teams and adopting a service-based ownership model. Keeping people at the center of this cultural shift was a priority for Lopez. “The happiness of my engineers is very important to me,” Lopez said. “When I joined SailPoint, it was tough for one engineer to handle everything DevOps-related. It was also clear that we needed to improve our on-call process and make that less burdensome. Our team has really grown, and our goal is to pivot to total service ownership.” SailPoint also sought improved analytics to support smoother handoffs and reduce the burden of being on-call to improve team health among its engineers. 
“Prior to implementing total service ownership, we were challenged with having the bandwidth to properly address every single problem as our company grew, and with it, we added more people and technology,” explained Caitlin Green, DevOps Engineer. Moving to Service Ownership SailPoint was already using PagerDuty but desired to better utilize its investment by improving its operational practices, including through improved coordinated responses. SailPoint integrated PagerDuty with monitoring tool Prometheus. Prometheus sends alerts to PagerDuty, which then routes them to the service owner defined by Rulesets. “PagerDuty’s Global Rulesets mean we can route alerts directly to the right on-call engineer for a particular service, rather than it going to a triage engineer who has to figure out who they should send it to,” Lopez said. “That’s a game-changer for us.” SailPoint also integrated PagerDuty with Slack to help manage lower priority incidents, resulting in fewer interruptions, both to work and personal lives out of hours. PagerDuty has become an important part of SailPoint’s service ownership model, empowering teams to take responsibility for issues affecting their services and reducing pressure on triage teams. As SailPoint has embraced service ownership, its DevOps team saw an 85% drop in the number of incidents being directed to its team. “With PagerDuty, we’re able to redirect critical work to the right people,” Greene added. Optimizing Processes Through Automation SailPoint is enhancing workflows using automation. For example, by enabling Intelligent Alert Grouping (IAG) on AWS CloudWatch, SailPoint has reduced noise and sped up response. Previously, a database failure would fire 60+ alerts, continuously disrupting the on-call engineer. By utilizing IAG, SailPoint condenses all alerts into a single incident for the engineer to acknowledge and resolve, freeing up time to fix the problem. 
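To make the noise-reduction idea above concrete, here is a minimal sketch of folding a burst of alerts into one incident. It uses a plain time-window rule, which is an assumption chosen for illustration and not how Intelligent Alert Grouping itself works; the `Incident` class, the `group_alerts` function, the five-minute window, and the `orders-db` service name are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Incident:
    service: str
    opened_at: datetime
    last_alert_at: datetime
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Fold (service, timestamp, summary) alerts into incidents.

    An alert joins the open incident for its service when it arrives within
    `window` of that incident's most recent alert; otherwise it opens a new
    incident. This is a simple heuristic, not a vendor algorithm.
    """
    open_by_service = {}
    incidents = []
    for service, ts, summary in sorted(alerts, key=lambda a: a[1]):
        current = open_by_service.get(service)
        if current is not None and ts - current.last_alert_at <= window:
            current.alerts.append(summary)
            current.last_alert_at = ts
        else:
            current = Incident(service, ts, ts, [summary])
            open_by_service[service] = current
            incidents.append(current)
    return incidents


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 3, 0, 0)
    # Hypothetical database failure firing 60 alerts two seconds apart.
    burst = [
        ("orders-db", start + timedelta(seconds=2 * i), f"alert {i}")
        for i in range(60)
    ]
    for incident in group_alerts(burst):
        print(incident.service, len(incident.alerts))
```

Under these assumptions the on-call engineer acknowledges one incident carrying 60 alerts instead of being paged 60 times.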
SailPoint also automated how it builds monitoring into services, creating a self-service process for engineering teams. Lopez explained, “As we transition to service ownership, we are focusing on getting all our engineering teams, services, and microservices into PagerDuty. We put a lot of effort into automating that process. We built a self-service tool using Terraform that all engineering teams can leverage to create their own services, and their own rules for those services through code—without the need for DevOps.” Collaborating Seamlessly Across Engineering and Customer Service SailPoint is in the early stages of introducing its customer support team to the incident response process. By onboarding customer support to PagerDuty, SailPoint engineers can provide relevant context to service representatives. Matt Smith, a Director of DevOps, explained, “If there’s an issue, the goal is to get more proactive about reaching out to customers and letting them know that we're on it before they see it.” Benefits of PagerDuty By implementing PagerDuty, SailPoint has matured its digital operations and moved closer to its goal of service ownership, with benefits including: better visibility into systems through a single pane of glass; reduced MTTR and MTTA; less on-call fatigue and fewer off-hour interruptions, leading to improved team health; and faster triage time as Intelligent Alert Grouping (IAG) condenses multiple alerts into one. “PagerDuty has given us the tools we need to continue our journey toward service ownership,” Lopez said. “Importantly, PagerDuty has also enabled us to reduce on-call fatigue and boost the happiness of our engineers—one of our top priorities.” Matt added, “Having PagerDuty in the mix is tremendously beneficial to how we manage our on-call response. 
PagerDuty helps us disseminate responsibility to specific engineers, giving clear ownership and transparency, and enables us to track what teams are working on and which incidents are still outstanding.” Plans to Further Leverage PagerDuty’s Capabilities SailPoint continues on its path to total service ownership and is onboarding more engineering teams onto PagerDuty. The company is also looking at how it can leverage more of PagerDuty’s capabilities to mature its incident response framework. During broader incidents, it plans to use PagerDuty for better communication and coordination with cross-departmental teams including customer service, product management, and executive leadership. Find out more about SailPoint’s DevOps journey in The SailPoint Tech Blog. To learn how PagerDuty can help your team make things simple and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.
Australian Retail Giant Delivers Always-On Digital Experience using PagerDuty
With the decision to bring the development and management of its website in-house, this retail giant saw an opportunity to reinvent its technology ecosystem. The website would be the first to use a brand new API platform, and the company needed real-time visibility into the system to manage and diagnose issues. A Principal Developer shared, “The first application that was going to use new APIs on the new platform was the website. Building a successful online retail experience was critical for our strategy moving forward.” Supporting an In-House Website Launch While the engineering team was mature, more scalable processes for incident management were needed to support a changing environment. "We needed all sorts of different technologies to support this initiative on the new platform,” explained the Principal Developer. While there was logging in place, there was no easy way to alert the team about an issue. Somebody had to review and understand the logs, determine if the alert warranted a call, and figure out who to actually call. Over time, engineering’s rapid delivery made it challenging to track down the right person at the right time. The retailer needed to address these challenges to ensure a highly available website. An outage could result in lost productivity, missed revenue, and negative brand impact. After a review, the team defined several technical requirements needed to improve incident response: Gain insight into the root cause of incidents. Leverage AI capabilities that provide intelligent recommendations over time. Encourage product ownership to reduce the time it takes for an incident to reach the correct engineer, and eliminate incorrect call outs. Manage and measure MTTA and MTTR. To achieve this, the company needed a platform that could enrich the information available—link dependencies between systems, and sync information with ITSM and APM tools. 
This would inform who’s affected by an incident and what capabilities might be disrupted, and ensure critical work be sent to the right teams quickly. Integrating PagerDuty into the Ecosystem PagerDuty was selected as a scalable, easy-to-use digital operations platform. PagerDuty integrated with the retailer’s existing services, providing end-to-end visibility across the ecosystem. This allowed the team to build an orchestrated process for critical work and supported a culture of product ownership. A tight integration with ServiceNow immediately proved valuable for incident response—mapping priorities, syncing notes between the two, and closing incidents down on either platform. “It was really great to have a lot of this integration provided out of the box with very minimal work required,” shared the Principal Developer. A Jira integration was used for alerts that didn’t need to go through the formal ITSM process, for example byproducts of other issues. The team leveraged the workflows inside Jira to manage these alerts, syncing notes between the two platforms. This integration encouraged a more resilient application design, steered quality logging, and ensured quality tickets were created. Leveraging Automation and Event Intelligence PagerDuty’s ML-powered event management, Event Intelligence, helped automate incident response. Change events provided situational awareness, surfacing critical information about recent deployments and releases in the code repository. This was especially useful for Terraform projects, providing insight around an event like when, where, and who did the merge. With key integrations in place, the team built out technical services to route incidents. Engineers were empowered to take ownership over the database of technical services, tracking who owns what. Dependencies were created across these technical services, enabling the correlation of issues across APIs. 
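The dependency mapping described above can be pictured as a small graph walk. The sketch below is only an illustration under stated assumptions: the `DEPENDENTS` map, the service names, and the `impacted_services` function are hypothetical and do not reflect the retailer's actual services or PagerDuty's data model.

```python
from collections import deque

# Hypothetical map: each service -> the services that depend on it.
DEPENDENTS = {
    "inventory-api": ["product-api"],
    "product-api": ["website"],
    "payments-api": ["checkout"],
    "checkout": ["website"],
}


def impacted_services(failing, dependents=DEPENDENTS):
    """Return every service downstream of `failing`, in breadth-first order."""
    seen = {failing}  # guards against revisiting nodes if the map has cycles
    order = []
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    print(impacted_services("inventory-api"))
```

With this map, a failure in the hypothetical `inventory-api` surfaces `product-api` and then `website` as affected, which is the kind of answer a service desk needs when asking what capability might be disrupted.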
Over time, PagerDuty could determine potential contributing factors of an incident and narrow down the correct engineer. “We're seeing the benefit of an AI lens over our services and dependencies,” said the Principal Developer. “PagerDuty has helped us become a lot more confident in our services, and provided us with a source of truth from an engineering lens on a technical service and its status.” It was critical for the retailer to understand an incident’s impact on the business. Using PagerDuty’s Business Services, it was able to effectively communicate information to the right business stakeholders. Even better, owners of a business service could subscribe to alerts or view the status dashboard to be informed about what was happening and when a resolution was in place. “Using PagerDuty has enabled our service desk to know immediately what capability might be disrupted with a particular incident,” explained the Principal Developer. Benefits of PagerDuty With PagerDuty, the retailer successfully launched its new website in-house using the new API platform. With better incident response operations in place, the company is set up to deliver an amazing online retail experience to customers while keeping its own employees happy. The team has achieved: Full-stack visibility. Integration with the existing tech stack, including ServiceNow, Microsoft Teams, Jira, and Dynatrace, has centralized operations. AI is delivering meaningful associations, driving faster diagnosis of the root cause of API problems. Reduced resolution time. AIOps features remove manual processes and guesswork in incident routing. Actionable alerts are routed immediately to the right engineer. Accurate data is providing insight into incident response, helping teams learn and improve operations. Improved team health. Driven by a high level of integration and clear product ownership, engineers are receiving fewer alerts, and are confident that alerts they do receive are for them. 
Effective stakeholder communications. Status dashboards are letting teams see when there is an incident affecting a business service they care about. Omnichannel customer experience. A reinvented retail experience for customers, who can now seamlessly shop in-store or online. The Principal Developer shared, “PagerDuty is helping us understand our applications, visualize our product health, and enable a culture of ownership.” A Strong Platform for the Future There are plans for continuous improvement across the organization. The company is exploring PagerDuty Analytics, including intelligent dashboards to measure the impact of incidents on teams, and will introduce postmortems to avoid repeating mistakes. It’s also actively investigating ways to best implement more Event Intelligence features to help the team reduce noise and drive down resolution time. To further streamline operations, PagerDuty will be rolled out to other teams including corporate infrastructure and security. “Looking back on it all, we met the objectives and have a very strong platform which we can build on,” said the Principal Developer. To learn more about how PagerDuty is helping retailers deliver amazing customer experiences, visit www.pagerduty.com/customers or www.pagerduty.com/industries/retail/, and start a 14-day free trial today.
DraftKings Scores Touchdown With PagerDuty
DraftKings is a digital sports entertainment and gaming company that fuels the competitive spirit of sports fans. The company operates daily fantasy sports, a sportsbook, and a casino, providing fans with opportunities to put their own skin in the game by wagering on their favorite team. The growing gaming market in the US is driving increased competition. DraftKings is working to build the best, most trusted, and most customer-centric offerings while rapidly expanding into new markets—like a non-fungible token (NFT) marketplace. Josh Engels, Head of Problem Management at DraftKings, is responsible for providing a stable environment to ensure the best fan experience. The priority is engineering resiliency by providing an incident management framework for teams across DraftKings to handle their own issues. “A lot of change occurs on the backend as we grow fast and expand into new markets. We have to make sure we’re stable and offer a great customer experience,” said Engels. Watch Out for a Gronk Spike Football weekends are a critical test for the platform. Gameday sees a steady stream of daily fantasy users picking their lineups ahead of kickoff. As soon as the first touchdown happens, DraftKings sees what they’ve termed a “Gronk Spike.” Fans open and refresh their apps, often doubling platform traffic and stressing the infrastructure. To prevent lost revenue, the company needs to ensure platform availability and rock-solid stability through constant gameday chaos. “Gaming is a highly competitive market,” explained Engels. “If a customer can’t access our service, they will immediately jump to a competitor.” During its start-up years, DraftKings relied on a few key people who knew about its infrastructure to actively monitor systems and fix problems. They were tied to their laptops, carrying them around all day and often distracted from other responsibilities. As the business expanded and the number of platform users grew, so did the number of teams and services. 
Engels said, “Nobody wants to manually monitor 24/7 in a dashboard. We want to be alerted when we need to be notified about an incident.” The increasing complexity of managing technology caused alert fatigue and burnout for engineers. It was difficult to find time to work on new projects—projects that would keep DraftKings in front of the competition. Engels explained, “We need to help teams understand why they’re getting alerted and where these trends are, so they can have more time to innovate.” Using PagerDuty to Streamline Operations and Drive Improvements DraftKings teams adopted a service ownership model, with each product line responsible for writing their own code and supporting it in production. These teams operated under the Problem Management team’s incident management framework, with PagerDuty as its digital operations platform. PagerDuty provided visibility across systems, and enabled DraftKings to handle incidents quickly and reduce recurring problems. DraftKings integrated their key monitoring systems into PagerDuty, and set up schedules and escalation policies. Teams no longer had to carry a computer around. Now, the right person would be notified when there was an issue, providing teams flexibility and freedom. Engels shared, “With PagerDuty, when a service has an issue, we know exactly who's expected to resolve it and where that communication is happening. It’s allowed us to really scale the business.” As teams deploy services, everything is tied into PagerDuty. To reduce manual, repetitive work, an infrastructure-as-code tool is used for initial setup and onboarding. Whenever a new service is deployed, it automatically creates a service within PagerDuty and sets up the specific integrations required. This allows DraftKings to standardize service lists within PagerDuty. Engels commented, “You can look in PagerDuty and see the services we have and who owns them. This was hard to maintain at a growing company. 
Clarity on service ownership has been another huge benefit of PagerDuty.” Problem Management uses PagerDuty to drive stability, ensuring the product is available for customers. PagerDuty reports provide metrics to identify trends, for example, if there are a lot of incidents related to a particular feature. The data is used to communicate with the business—all the way up to the CTO—providing information around incident status, mean time to resolve, and SLAs. Engels explained, “Metrics allow us to make decisions and drive improvements throughout the organization.” The PagerDuty Playbook DraftKings implemented PagerDuty response plays for major incidents—situations where too many alerts are coming in for a single person to manage, or multiple people are receiving alerts on an issue. For example, if Sportsbook has a major incident on football Sunday, the response play will pull in a key engineer with business expertise across the infrastructure as incident commander. The response play can also create an incident-specific video conference meeting and responders can join the conference bridge via PagerDuty. This drives quick resolution during DraftKings’ most critical moments. If there’s an outage, DraftKings will also use response plays to alert customers as quickly as possible. The Customer Experience team is notified, and can immediately react by putting up a banner inside the app and pushing out communications on social media. This improves the fan experience by keeping them updated with what’s going on. Alternatively, if a customer is the first to report an issue, the Customer Experience team uses PagerDuty’s email integration to create an incident and notify the right teams. Benefits with PagerDuty With PagerDuty, DraftKings has improved engineering resiliency and platform stability. Engineers no longer carry laptops around and Gronk Spikes are under control with PagerDuty orchestrating the right response, every time. 
Since implementing PagerDuty, DraftKings has benefited from: Customer Experience and Loyalty. DraftKings wins fans in a competitive market by maintaining a highly available platform, and proactively communicating when issues do occur. Incident Management. Problem Management’s framework, including PagerDuty’s platform for digital operations and a service ownership model, ensures the right person is notified and can quickly resolve incidents. Innovation. Engineers can focus on growing the product lines without the distraction and burnout of actively monitoring the infrastructure. Flexible Setup. In addition to integrating with monitoring systems, teams can also integrate tools used for their specific product line, and set up schedules and policies that make sense for them. All of this is streamlined using infrastructure as code software. Data-Driven Decision Making. DraftKings uses PagerDuty reports to learn and improve operations. MTTR is now under 30 minutes, and the share of issues proactively identified by internal stakeholders is above 90%—a significant improvement. Improved Communication. PagerDuty enables communication across the company for rapid response, including engineering, customer experience, and key business stakeholders. DraftKings is striving to provide the best fan experience while staying competitive and grabbing as much of the betting action as possible. Engels shared, “PagerDuty helps us know about issues before customers do. DraftKings has strict uptime and service requirements, and now constantly surpasses its goals. PagerDuty has really helped make us more efficient as a company.” Where To Place the Next Bet? DraftKings will continue to prioritize team health. The Problem Management team plans on exploring PagerDuty’s Event Intelligence, including smart noise reduction, to minimize the number of alerts on-call engineers receive during an incident. 
By removing interruptions, responders can focus on resolving issues even faster, saving DraftKings time and money. Also, the company has been investigating stakeholder communication to provide the business status and impact information in real time, and reduce the influx of questions to engineering teams. To learn more about how PagerDuty is helping companies transform their digital operations, visit www.pagerduty.com/customers for more information and start a 14-day free trial today.
Self-Driving Car Company Moves Into the Fast Lane With PagerDuty
Disrupting the automotive industry, this start-up is building one of the world’s most advanced self-driving electric vehicles. It's manufacturing at scale, building all-electric, zero-emission, self-driving vehicles that are redefining time in transit. Since the organization is developing groundbreaking technology, its digital infrastructure is complex and constantly evolving. “In our business, we utilize a mix of digital systems hosted in different environments,” explained the Engineering Manager. “We were beginning to rigorously test our cars in the real world, and it was critical that we could monitor everything in our environment to ensure we could react quickly to any incidents.” Alert Noise and Lack of Integration Limited Incident Response Keeping track of digital services was a complicated and time-consuming process for the engineering team. Not all systems were automatically compatible and sometimes required custom-written modules to allow them to interact properly. The team also experienced high levels of alert noise around the clock. Systems sent multiple notifications for the same incident, without distinguishing between high- and low-urgency, making it difficult for them to prioritize incidents. "When you spend so much time on incident response, it can begin to feel like a full-time job. We needed a digital operations platform that not only brought all our services together, but enabled them to talk to each other regardless of where they were hosted,” said the Engineering Manager. “We were also planning to migrate services from AWS to different cloud platforms, so it was important that any solution we selected was compatible with the platforms we wanted to adopt.” PagerDuty Enables Effective Digital Operations The self-driving startup adopted PagerDuty’s digital operations management platform for its comprehensive response functionality and ecosystem of over 600 integrations. 
Using PagerDuty, the organization successfully completed its cloud migration without any disruption to services. PagerDuty provides an intuitive status dashboard that dynamically updates teams with a shared view of system health to improve awareness of operational issues in real time. “Whether it's AWS CloudWatch or a different integration, getting effective and streamlined information into PagerDuty so that we know when there's a potential issue with the service has been invaluable,” shared the Engineering Manager. “PagerDuty’s compatibility—specifically with integrations like Slack and Jira—means people outside of the engineering department, such as vehicle testers, can respond to issues raised by PagerDuty using their interface of choice.” Implementing PagerDuty also enables the team to streamline incident notifications and reduce alert noise. The platform’s machine learning algorithms helped to reduce the number of false positive alerts. For example, if a Docker container disappears, PagerDuty will wait to check if a replacement spins up before sending out an alert. “PagerDuty gives us a smarter way to manage incidents and reduce alert fatigue,” explained the Engineering Manager. “If something breaks at 3 a.m., it might not need an immediate fix. With PagerDuty, we set low-urgency notifications that come via email outside of business hours, so people can sleep through the night for minor issues that can wait.” Using PagerDuty Outside of Engineering Beyond the engineering team, this self-driving startup also uses PagerDuty Live Call Routing to handle vehicle accident reports. If one of the company’s vehicles is involved in a traffic accident, law enforcement officers can call a toll-free number and reach the organization’s on-call staff immediately. Inbound calls get routed via the same escalation policies used for critical applications and services, so anyone can immediately reach the right responder. 
In addition, the legal team uses PagerDuty for their own workflows. The platform's advanced permissions settings enable the team to keep their services, incidents, escalation policies, and on-call schedules private and separate from other teams that use PagerDuty internally. “We use PagerDuty for pretty much everything,” explained the Engineering Manager. PagerDuty’s extensive and robust functionality continually helps the automotive company improve how it resolves incidents.
Benefits of PagerDuty
Since implementing PagerDuty, the self-driving automotive company has gained greater control over incident response processes and seen several benefits, including:
• Increased efficiency by channeling all services and dashboards into one intuitive platform
• 50% reduction in alert volume
• 30% reduction in MTTA/MTTR
• Increased metrics tracking and the ability to gain insights into service performance
“PagerDuty gives us the functionality and flexibility needed to reduce toil, decrease alert fatigue, and improve collaboration between teams,” said the Engineering Manager. “PagerDuty is the most comprehensive digital operations management platform on the market; nothing else comes close.”
Using PagerDuty to Increase Business-Wide Collaboration
As the company continues to grow and evolve, PagerDuty will play an even greater role in managing the organization’s increasingly complex digital environment. PagerDuty’s flexibility and functionality have the potential to increase collaboration and efficiency across the business, and its intuitive interface has appealed to teams outside of engineering. As other internal teams expand their use of services such as Slack and Jira, they are evaluating integrating them more tightly within PagerDuty. “PagerDuty’s flexibility allows everyone to work in a way that suits them best,” the Engineering Manager explained.
“In fact, the pandemic has prompted us to explore even more of PagerDuty’s features, and we’re always finding new ways to work smarter and more efficiently.” To learn how PagerDuty can help your team make things simple and transform operations in a digital-first world, contact your account manager and try a 14-day free trial today.
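The two noise-reduction behaviors described in this story (low-urgency notifications routed to email outside business hours, and waiting to see whether a replacement Docker container spins up before alerting) can be illustrated with a small sketch. This is not PagerDuty's implementation or API. The business hours, the two-minute grace period, the "default" channel name, and every function name below are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

# Assumed business hours (hypothetical; the case study does not state them).
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def in_business_hours(when: datetime) -> bool:
    """Return True on weekdays between the assumed start and end hours."""
    is_weekday = when.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and BUSINESS_START_HOUR <= when.hour < BUSINESS_END_HOUR


def notification_channel(urgency: str, when: datetime) -> str:
    """Low-urgency issues outside business hours go to email.

    Everything else uses the responder's normal contact method, labeled
    "default" here because the case study does not say what that is.
    """
    if urgency == "low" and not in_business_hours(when):
        return "email"
    return "default"


def should_alert(missing_since: datetime, now: datetime,
                 replacement_running: bool,
                 grace: timedelta = timedelta(minutes=2)) -> bool:
    """Alert on a vanished container only if no replacement appears in time."""
    if replacement_running:
        return False
    return now - missing_since >= grace
```

Under these assumptions, a low-urgency issue at 3 a.m. is sent by email and a high-urgency one still reaches the responder through their usual channel. A container that is replaced within the grace period never raises an alert.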
SAP Revolutionizes Rapid Major Incident Response Processes With PagerDuty
SAP is the market leader in enterprise application software, with customers in over 180 countries. More than three-quarters of the world’s transaction revenue touches an SAP system. In recent years, SAP has been on a journey to digitally transform its business and move customer-facing services to the cloud. As part of his role within the Global Cloud Services (GCS) team, Mitchell Rose, Senior Program Manager, is responsible for the global uptime of these services. “SAP’s vision is to help the world run better and improve people’s lives,” he explained, “but to do this, we need to ensure that there are fewer, and less impactful, cloud outages and incidents that affect our customers.” The Global Cloud Services team’s vision is to help technology teams within SAP ensure that their cloud services and infrastructure remain always on through intelligent outage management. “This meant creating a major incident service that could scale at an SAP level, helping us to ensure the uptime of services such as Ariba, Concur, and Fieldglass,” said Rose.
Challenges Hindered Major Incident Response
The team knew that developing and rolling out such a service in an organization the size of SAP would be challenging. Many teams were using in-house tools customized for their respective technology teams; however, these tools weren’t scalable across the entire organization. Over the years, SAP’s acquisitions led to the use of different tools and processes across the organization, making collaboration and cohesion difficult. “Across teams, there were very different operating models,” explained Rose. “There were differences in operational definitions, and the word ‘priority’ had different meanings to different teams. They also had different ticketing systems, ChatOps tools, processes, and practices. To be successful, we needed a best-of-breed platform that mapped to our vision for major incident response.
This is why we adopted PagerDuty.” PagerDuty Helps Speed Up Major Incident Response SAP’s Global Cloud Services team uses PagerDuty to orchestrate their major incident response. Since adopting PagerDuty, SAP has improved its major incident handling, reducing initial response and communications times to critical incidents by 30% and resolution times by 26% in two months. “We have successfully reduced the impact and duration of major incidents,” shared Rose. “With PagerDuty, we’re able to engage the right people, on the right issues, at the right time. As a result, we’ve reduced the number of people needed to resolve major incidents by 25% in just two months.” PagerDuty has also helped improve communication between teams and stakeholders. When SAP needs to triage critical, customer-impacting incidents, such as cloud service disruptions, SAP activates “SWAT” mode, its internal critical response procedure. The SWAT team then drives internal business communications, including those responsible for customer communications. Through PagerDuty, the SWAT team has access to real-time information about the status of an incident, allowing them to keep other stakeholders—including senior management—updated. Decisions to engage SWAT mode are made more quickly as a result, helping to reduce major incident response time from hours to minutes in many cases. Driving Greater Collaboration and Ownership GCS has made PagerDuty a key part of its major incident framework so they can better collaborate with Major Incident Management (MIM) teams across SAP. Now, when a major incident occurs, the relevant team—such as the SuccessFactors or Ariba MIM team—is notified to help coordinate the best response. “PagerDuty helped us align core business and technology teams around a common operating model for major incident response,” said Rose. “By using a common framework, we have aligned on processes and criteria for severity and priority. 
We’re also driving clear responsibility for services during a major incident, which has been scaled at an SAP level.”
Benefits With PagerDuty
Since implementing PagerDuty, SAP’s Global Cloud Services team has improved operational excellence, with benefits including:
• The ability to engage the right people with the right information in real time, optimizing response to major incidents
• A 25% reduction in the number of responders needed for major incidents, within two months
• Greater cross-team collaboration and ownership of services
• Improved real-time communication with the wider business about major incidents, helping to achieve internal performance SLAs
• Reduction in impact and duration of major incidents, with response times reduced by 30% and resolution times reduced by 26%
• Seamless integration with various commercial and in-house tools
“PagerDuty has become mission critical for SAP, enabling our teams to collaborate and rapidly respond to major incidents, and helping us to continue to provide SAP customers with world-class digital services,” concluded Rose.
Future Looking
SAP’s Global Cloud Services team works hard to improve incident troubleshooting and will use PagerDuty postmortem reports, as well as past incidents, to help troubleshoot current issues. In addition, SAP wants to further automate its major incident response process by using PagerDuty to create automated runbooks aligned to key business impact metrics. To learn how PagerDuty can help your team simplify and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.
Zoom Video Communications Uses PagerDuty to Keep Video Conferencing Frictionless for Every Customer
Zoom Video Communications is a video conferencing company on a mission to make video communications frictionless for all. Eric Yuan, CEO and founder of Zoom, and Alex Guerrero, Senior Manager of SaaS Operations, dive into why their teams have adopted PagerDuty as their end-to-end incident management platform. Companies trust Zoom for their video conferencing services and, according to Yuan, “Our business counts on PagerDuty.”
Space Made Simple: How PagerDuty Enabled Loft Orbital to Achieve Incident Response Lift Off
The next great space race is on. Today, there are multiple companies competing to earn their slice of a global space industry set to be worth more than $1 trillion by 2040. However, launching a satellite into space still isn't an option for most organizations due to the prohibitive costs and complex engineering required. Now, thanks to innovative satellite-as-a-service company Loft Orbital, any organization can buy a ticket on a shared satellite and launch data capture technology into space at a fraction of the cost of doing it themselves. Launched in 2017, Loft Orbital’s satellite-as-a-service business model is shaking up the space industry. Loft Orbital flies customer payloads onboard regularly scheduled satellite launches and handles the entire mission as a service. For example, suppose a customer wanted to send a camera to space to analyze weather. Loft Orbital would add the camera onto one of its upcoming satellite launches and take care of the data collection process once the camera is in space based on customer requests. Understanding The Gravity of the Situation John Murray is a senior member of Loft Orbital’s satellite operations and software engineering team. He helps build and manage the company’s proprietary ground control software, Cockpit—an all-in-one solution for satellite operations. Engineers control everything through Cockpit, from mission planning to communications between satellites and ground stations. Cockpit is highly automated, reducing the risk of human error while making operations simpler for both Loft Orbital’s engineers and customer requests. One of Murray’s first tasks, when hired, was to implement an incident response system that could provide 24/7 support, in preparation for their first satellite launch. The system needed to scale in line with the fast-growing business and integrate with Loft Orbital’s existing tech stack, such as Grafana. Loft Orbital needed a solution with a stable API that was easy to customize. 
Being able to quickly respond to incidents was a key priority. Loft Orbital operates Low Earth Orbit satellites in a sun-synchronous orbit, so the satellites regularly pass over ground stations (very large satellite dishes used to communicate with the satellites) located at the North and South Poles. There are no ground stations in much of the Pacific Ocean, so there are times when Loft Orbital’s satellites are not in contact with the ground. This means that if an engineer misses an opportunity to correct a problem, they won’t be able to respond for at least another 45 minutes. “Urgency is key because things tend to snowball very quickly in space,” explained Murray. “Prompt responses save us time and money on a huge scale.” Another requirement was rethinking the mission command center. “Traditionally, you’d have a team of 20 or more people in a room 24/7 keeping your satellite healthy and operational. We’re hoping to only get called in when there is a problem that needs to be fixed, though we generally have at least one person on duty to perform certain manual tasks and maintenance,” explained Murray. “Automation is foundational: we need the ability to scale our operations to 5, 10, or 100 satellites rapidly without needing to scale personnel.” Murray had used PagerDuty in a previous role and was familiar with its functionality. “There was no question in our minds that PagerDuty was the best solution to solve our problems.”
A Trustworthy Solution
Loft Orbital was able to implement PagerDuty quickly and integrate it seamlessly with Cockpit and Grafana. If Grafana determines there are telemetry issues with a satellite, such as abnormal temperatures or low battery voltage, it triggers an alert in PagerDuty so that engineers can respond quickly. PagerDuty’s API also integrated easily with Cockpit, alerting engineers when performance issues arise with the software itself.
“Since adopting PagerDuty the team has acknowledged and resolved all incidents swiftly, giving us complete trust in our incident response processes,” explained Murray. Loft Orbital also benefits from PagerDuty’s simplicity; PagerDuty’s interface is so easy to use that new hires can support satellite operations within just a few hours. “PagerDuty is something that I can throw at users and don't have to worry about extensive training on how to ensure they are aware of issues so they can focus on issue resolution training,” explained Murray. “The software is simple enough that you give it to new hires and they're off to the races an hour later.” Additionally, employees can personalize alerts to suit their preferences to make sure it is notifying them as effectively as possible, which is ideal when managing a global team with differing approaches to work-life balance. PagerDuty has increased collaboration between Loft Orbital’s teams. The company doesn’t follow a traditional hierarchy, instead empowering engineers to manage services and incidents when they arise while ensuring full-team awareness by tracking issues and resolutions. When there is an issue, PagerDuty alerts the relevant expert according to set escalation policies aligned with time zones, minimizing out-of-hours disruption and downtime while documenting the problem for future reference. “PagerDuty has simplified our team's lives immensely,” explained Murray. “Previously, engineers were stuck in a position where they didn't know who the subject matter expert was, but PagerDuty helped eliminate this and now allows for seamless collaboration.” “PagerDuty is the glue that joins human monitoring to automated response and has given us the ability to scale operations rapidly,” said Murray. 
“With PagerDuty, I can walk away from my desk and live my life knowing my team has access to me in an emergency, and I have a way to look back on any issues others have addressed.” Mission Accomplished: A Culture of Rapid Incident Response PagerDuty has enabled Loft Orbital to scale its operations rapidly and provide 24/7 support for its satellites without a traditional command center or needing to hire additional staff at the same pace that its constellation grows. The company can confidently meet all customer SLAs and let them focus on what matters to them—their data or service. Looking ahead, Loft Orbital plans to roll out PagerDuty to different engineering teams. By adding more users to the platform, teams will be able to create a more solid structure of response orchestration and cross-team issue tracking and resolution. When an incident arises, everyone, regardless of their role, knows whom to escalate the alert to. Loft Orbital is building a culture in which everyone feels empowered to triage and troubleshoot incidents without worrying about missteps or causing an inconvenience. The company is also considering exploring PagerDuty’s Event Intelligence features to further increase the efficiency of its incident response process. To learn how PagerDuty can help your organization, contact your account manager or try a 14-day free trial today.
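As an illustration of the telemetry alerting described in this story, the sketch below builds the JSON body of a PagerDuty Events API v2 "trigger" event when a reading falls outside a limit. The case study does not describe how Cockpit or Grafana are actually wired to PagerDuty, so this is only one plausible shape. The thresholds, the satellite name, the routing key, and the function names are hypothetical; only the event fields (`routing_key`, `event_action`, `dedup_key`, and `payload` with `summary`, `source`, and `severity`) come from the public Events API v2. The block builds the event but does not send it.

```python
from typing import List, Optional

# Hypothetical limits; the case study names the signals but not the thresholds.
MAX_TEMP_C = 60.0
MIN_BATTERY_V = 24.0


def telemetry_problems(temp_c: float, battery_v: float) -> List[str]:
    """Return a human-readable description of each out-of-range reading."""
    problems = []
    if temp_c > MAX_TEMP_C:
        problems.append(f"temperature {temp_c:.1f} C above {MAX_TEMP_C:.1f} C")
    if battery_v < MIN_BATTERY_V:
        problems.append(f"battery {battery_v:.1f} V below {MIN_BATTERY_V:.1f} V")
    return problems


def build_trigger_event(routing_key: str, satellite: str,
                        problems: List[str]) -> Optional[dict]:
    """Build an Events API v2 trigger body, or None when nothing is wrong."""
    if not problems:
        return None
    return {
        "routing_key": routing_key,       # integration key (placeholder here)
        "event_action": "trigger",
        # Reusing one dedup_key per satellite groups repeat events together.
        "dedup_key": f"telemetry-{satellite}",
        "payload": {
            "summary": f"{satellite}: " + "; ".join(problems),
            "source": satellite,
            "severity": "critical",
        },
    }
```

A caller would serialize the returned dictionary as JSON and POST it to the Events API v2 enqueue endpoint. Keeping the builder separate from the network call makes the alerting rules easy to test on their own.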
Cambridge Cognition Uses PagerDuty to Help Save Lives
Cambridge Cognition is a UK-based, neuroscience technology company that helps healthcare organizations better quantify the cognitive health of patients by using scientifically validated digital health solutions. From complex digital health assessment tools to intuitive mobile applications for drug development, Cambridge Cognition assesses hundreds of thousands of patients in over 100 countries. Much of the company’s technology assists with measuring specific cognitive functions during clinical trials. For example, if a pharmaceutical company wants to develop a new treatment for Alzheimer’s, they need a way to measure the impact of the drug in trials, and Cambridge Cognition provides the technology to help make that possible. Chief Technology Officer Ricky Dolphin is responsible for Cambridge Cognition’s software development and product management teams, focusing primarily on providing product direction and technical strategy, as well as making sure products meet quality and regulatory standards. Working with customers in the academic, life sciences, insurance, and healthcare fields, Dolphin’s vision for Cambridge Cognition is to one day build a tool that makes it just as easy to test a patient’s cognitive function as it is to test their blood pressure. “Cambridge Cognition is at the forefront of a fairly young area of science. There is still so much to learn about the brain, and we continue to push forward to find new ways to study cognitive health,” explained Dolphin. The Need for a Reliable Solution In this highly regulated industry, Cambridge Cognition must comply with a host of regulatory requirements. For example, the Health Insurance Portability and Accountability Act (HIPAA), the EU’s General Data Protection Regulation (GDPR), and the 21 CFR Part 11 requirements all must be considered when building their state-of-the-art technology. “Building intuitive technologies for measuring cognitive health requires a lot of data,” shared Dolphin. 
“We have to make sure that the data we are using and collecting is secure, otherwise we could face major penalties and risk putting patient information in jeopardy.” Using solutions that report on how participants respond to treatments in clinical trials in real time requires high reliability and minimal downtime. During clinical trials, there might be a two-minute window to capture critical information as a drug is metabolized. Uptime is therefore critical, which is why Cambridge Cognition needed not only SMS alerting but also acknowledgement of receipt from its DevOps team and escalation of issues. Without PagerDuty’s services, Cambridge Cognition risked:
• Missing SMS messages or having slow response times to incidents, leaving some incidents backlogged and unresolved
• Uncertainty surrounding the collection of trial data without reliable testing software
“Our customers rely on us to capture time-critical information, and we want to guarantee that they will be able to do this,” explained Dolphin.
Thinking Differently With PagerDuty
Initially, Cambridge Cognition used PagerDuty for traditional alerting to notify technical responders when issues arose within its cloud environment and server infrastructure. With PagerDuty, Cambridge Cognition was able to guarantee SMS alerting reached its responders, which helped Dolphin’s team deliver against client expectations. For example, using PagerDuty’s integration with Pingdom, the engineering team is immediately notified if one of their systems goes down. The team also saw an opportunity to think differently about incident management and align their use of PagerDuty more closely with patient services.
Digital Operations for Patients in Need
Cambridge Cognition also helps companies conducting clinical trials monitor suicidal ideation by using a virtual survey (filled out by the patient) to assess if a patient is having suicidal thoughts.
Before PagerDuty, a clinician had to manually review these surveys and then intervene or escalate if they felt that a patient needed immediate attention. This process lacked timeliness and was prone to human error. This information is now fed through PagerDuty’s API to continuously monitor responses for high suicidal severity ratings. If a patient is determined to be high risk, an alert is sent from PagerDuty to the appropriate clinician to notify them that this patient needs immediate attention. If the clinician doesn’t acknowledge the alert in a timely manner, then the alert is automatically escalated until an appropriate party can respond to the incident. In addition to automating the escalation of alerts, Cambridge Cognition has seen several other benefits with PagerDuty, including: Helping Save Lives. Automating the manual survey response process led to faster response times, which could help save lives at risk. Easier Audits. Notifications and responses are audited, enabling post-incident review. Reliable Products. With PagerDuty’s help in keeping Cambridge Cognition up and running, pharmaceutical and healthcare customers are guaranteed secure data from their trials. Competitive Differentiation. Being able to escalate, acknowledge, and resolve incidents on one platform speeds up the incident response process, giving Cambridge Cognition a competitive edge. “PagerDuty delivers real value and helps us save lives.” - Ricky Dolphin, CTO, Cambridge Cognition Preparing for the Remote World COVID-19 has forced many companies to focus on digital acceleration, but this is nothing new to Cambridge Cognition. PagerDuty's API has made it easy for Cambridge Cognition to integrate its software with PagerDuty's platform, which has helped accelerate the company's digital transformation. “We knew how well PagerDuty worked with our cloud infrastructure. We thought it would be great to use with our software during clinical trials. 
Now, we’ve been able to capture metrics that we never could have, like incident response and resolution time,” shared Dolphin. Moving forward, Dolphin wants to expand the use of the company’s suicide monitoring and alerting software across more of its customers. He is also looking at how his teams might be able to use PagerDuty in a cost-effective manner for other clinical trials besides suicide prevention. To learn more about how PagerDuty is helping companies transform their digital operations, visit www.pagerduty.com/customers for more information and start a 14-day free trial today.
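The survey escalation flow described in this story (alert the appropriate clinician for a high severity rating, then escalate automatically if the alert is not acknowledged) can be sketched as a simple loop over an escalation chain. This is an illustrative model only, not Cambridge Cognition's software or PagerDuty's API. The severity threshold, the role names, and the `acknowledges` callback, which stands in for "notify this person and wait for the acknowledgement timeout", are all assumptions.

```python
from typing import Callable, Optional, Sequence

# Hypothetical cutoff on a hypothetical rating scale; the source gives neither.
HIGH_RISK_THRESHOLD = 4


def needs_alert(severity_rating: int) -> bool:
    """Return True when a survey response should raise an alert."""
    return severity_rating >= HIGH_RISK_THRESHOLD


def escalate(chain: Sequence[str],
             acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Notify each responder in order and return the first who acknowledges.

    `acknowledges(responder)` represents notifying that responder and waiting
    up to the timeout; returning False means the alert moves to the next level.
    Returns None if nobody in the chain acknowledges.
    """
    for responder in chain:
        if acknowledges(responder):
            return responder
    return None
```

In this model, a chain such as on-call clinician, backup clinician, clinical lead is walked in order until someone acknowledges. A return value of None signals that the chain was exhausted, which a real system would need to handle explicitly, for example by repeating the chain.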
FreedomPay Uses PagerDuty Process Automation As Extension of Next Level Platform Engineering
FreedomPay is a data-driven Next Level Commerce™ platform that transforms existing payment systems and processes from legacy to leading edge. As the premier choice for many of the largest companies across the globe, FreedomPay’s technology is built to deliver rock solid performance in the highly complex environment of global commerce. Business Success Puts Pressure on Operations As the FreedomPay business scaled, Chris Randall, Director, Platform Operations, and DJ DeBrakeleer, Site Reliability Engineer, found that the volume of operations and maintenance tasks was increasing to the point where it was straining their team. In addition to project work, Chris and his team found themselves frequently executing repetitive tasks like configuration and setup for new customers, validating and maintaining data feeds with customers, investigating and recycling app pools, database maintenance, and more. Adding to the burden was the push to have the Platform Operations team provide 24x7 coverage. Burnout was looming and it was clear that a new approach would have to be taken. Knowing that throwing more headcount at the problem was not the right solution, Chris and DJ decided to look for an automation solution. Automating Processes Across Systems Chris connected with a performance engineer colleague who had been using PagerDuty Process Automation for a variety of test infrastructure automation use cases. Chris and DJ’s first experiment using Process Automation was to automate running SSIS jobs outside of SQL agent using PowerShell. During this proof-of-concept, the ease of use and general utility of Process Automation became clear. The team started an internal wiki page to keep track of all the ideas they had for where PagerDuty’s capabilities could be applied. PagerDuty would also help FreedomPay improve their security posture. 
Considering the importance and sensitivity of FreedomPay’s customers’ data, maintaining strict privileges around production infrastructure for PCI compliance is paramount. Using PagerDuty to put appropriate permissions around processes made this easier.
Boosted Productivity
After their proof-of-concept, DJ and team jumped at the opportunity to create further operational efficiencies using PagerDuty with the onboarding of a large, complex client. Previously, onboarding this type of account would have required a great deal of file processing and manual work that typically fell to DJ’s team. Instead, they automated the workflows by having PagerDuty invoke the various tools and scripts needed to onboard the new client. With this success under their belt, the Platform Operations team continued to automate more procedures with PagerDuty. “By automating tasks with PagerDuty Process Automation, we have saved the equivalent of three or four dedicated people’s worth of time. And by removing the risk of human error, we have solidified the reliability of our critical processes,” shared DJ.
What’s Next?
DJ and his colleagues are looking to use PagerDuty Process Automation to automate incident response and to spread Process Automation usage to other teams who participate in production operations work. To learn how PagerDuty Process Automation can help you automate and delegate business and IT processes, contact your account manager or request a demo.
Guidewire Drives Cloud Migration and Customer Support with PagerDuty
Guidewire provides cloud services to property and casualty insurers around the globe. Its platform offers innovative services for insurers to manage underwriting and policy administration, claims management, and billing. Guidewire has several hundred customers including Allianz, MAPFRE, MetLife, and Nationwide. With its transition to cloud-based solutions, Guidewire changed its operating support model to ensure 24/7/365 service availability for its customers, while reducing costs associated with the traditional follow-the-sun support model. However, the company’s cloud migration journey has also enabled it to streamline operations across other functional areas, with PagerDuty as the driving force. Michael Nguyen manages the First Responders team, which is part of the Customer and Cloud Operations (CCO) organization that oversees customer service operations. The First Responders team manages customer support tickets and provides proactive system monitoring. As he explained, “PagerDuty came in at the very beginning of our cloud journey and it’s been quickly adopted. All of the teams within Guidewire have a different story on how they’re using PagerDuty.”
Using PagerDuty for Customer and Cloud Support
For years, Guidewire offered on-premises software to its customers. However, the shift to cloud demanded a more real-time response to customer issues, which led to the creation of the Customer and Cloud Operations (CCO) organization. “Moving to the cloud means our mission-critical services have to be available 24/7 for customers. We needed to be proactive in identifying customer issues and responding fast. We needed a comprehensive and reliable system to manage our increasingly complex digital operations,” explained Nguyen. At the start of Guidewire’s cloud journey, the CCO team utilized tools such as Sumo Logic, Sentry, and Datadog for monitoring application and infrastructure health.
In order to scale, the team brought on PagerDuty at the beginning of this transformation as a way to integrate all of their monitoring tools, design the response based on impact level, and engage the correct stakeholders. “We needed to get this information to people, to give visibility to a variety of different areas. And the first thing that came to mind was PagerDuty,” said Mike Riella, Senior Manager of Cloud Operations. Expanding PagerDuty Adoption Across the Organization Shortly thereafter, the Product Development team also adopted PagerDuty, as the shift to cloud enabled more collaboration with the CCO team and a direct line into the customer experience. “We work directly with Product Development as a lot of things can break, go wrong, or you need someone with that high level of expertise. So that's where PagerDuty has come in,” said Riella. Using PagerDuty’s escalation policies, the CCO team can directly engage the Product Development team if needed. As PagerDuty adoption grew rapidly, the Guidewire BizTech team was brought in to manage PagerDuty at the corporate level, which led to their own adoption of PagerDuty. Dejan Nedic, Director of Production Services, is part of the BizTech team supporting and monitoring internal systems and applications, with a follow-the-sun support model of technical support staff distributed globally. Similar to the CCO team, the BizTech team also integrated and centralized all of its monitoring tools into PagerDuty. When an incident arises, the appropriate support engineer is notified during their business hours, avoiding any off-hours disruption for other members of the team in different time zones. “It's impossible to organize a follow-the-sun model without the proper tooling. PagerDuty plays the pivotal role there; it provides an easy way to communicate and escalate, with one view that’s a single pane of glass,” explained Nedic. 
Most recently, the Guidewire Security Operations (SecOps) team adopted PagerDuty to ensure high-priority issues flagged by the team’s security vendor are always addressed. Rather than rely on a single phone number for the entire SecOps team, the vendor sends an email to PagerDuty, which automatically triggers SMS, phone and email notifications to the appropriate resource, with escalation policies to ensure that someone responds. Steve Kavanagh, SecOps Senior Engineer, explained, “the Security Operations team is a 24x7 operation. We have to react within a couple of minutes and that’s where we’re using PagerDuty—for mission-critical priorities.” Increased Cross-Team Collaboration To date, 17 teams within Guidewire have adopted PagerDuty, which has improved collaboration among teams so that incidents can be escalated and resolved quickly. “The average response time is 3.5 minutes, and most teams respond within two minutes,” shared Muhammad Khan, Critical Escalation Manager on the Guidewire Escalation Team that manages high-severity incidents. “Guidewire has aggressive SLAs with penalties if we don’t meet our commitments, so this extreme availability needs a lot of quick action facilitated by PagerDuty.” For high-severity incidents, Guidewire even has its senior executives on PagerDuty. The Executive Duty Officer (EDO) team is part of the escalation process where executives follow a 14-week on-call rotation to help make time-critical decisions when every second counts. In describing the EDO program, Paul Allen, Senior Manager of the Guidewire Escalation Team, explained, “The team can jump on a call to provide that executive oversight and join joint customer calls whenever those are necessary. 
That's another way PagerDuty can help us demonstrate how important these issues are to our customers and make sure that we have good decision making on the call.” Benefits of PagerDuty Digital operations management has improved as more teams within Guidewire have adopted PagerDuty, with benefits including: Streamlined incident response from seamless integration with tools such as Datadog, Sumo Logic, Sentry, Jira, and Slack Rapid identification of service issues, resulting in faster response and resolution times and lower risk of costly SLA breaches for the CCO organization and Customer Support team Better cross-team collaboration with real-time access to Product Development specialists Faster decision making on critical issues through the EDO program Less employee burnout driven by successful implementation of a follow-the-sun support model within the BizTech team Quick and automated response for mission-critical issues within the SecOps team “PagerDuty provides us with a flexible and comprehensive real-time digital operations platform,” said Nguyen. “We don’t know how we’d support our customers without it.” Future Integration and Feature Rollouts Looking forward, Guidewire plans to use more of PagerDuty’s Event Intelligence capabilities to further reduce alert noise and create more efficient escalation paths. Guidewire’s Customer Service team also wants to integrate PagerDuty into Salesforce Service Cloud so that the Customer Service team can gain greater visibility into incident data directly in Service Cloud. This will allow them to update customers about issues more easily and quickly. “We think this is going to save us 10 minutes per incident—a quarter of our error budget that we’ll get back just by adding that integration,” explained Allen. “When it comes to which platform to choose for mobilizing people when things happen, PagerDuty is the first thing anybody thinks of,” said Riella. 
“At Guidewire, we have been super happy with our partnership with PagerDuty. People are fierce users of the platform.” To learn how PagerDuty can help your team make things simple and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.
GLOBO Relies on PagerDuty to Help Bridge Communication Gaps
GLOBO is a B2B language service provider that combines innovative technology, certified linguists, and actionable data insights to deliver high-quality translation and interpreting services to businesses worldwide. With over 250 accessible languages and 24x7 support—including telephone, video, on-site interpreting, and on-demand text translation—GLOBO helps bridge the language barrier gap for businesses through its easy-to-use platform. And with the mission of assisting people in communicating in the moments that matter most, their services must be online and available to their customers around the clock. For example, many of their customers are healthcare facilities, and GLOBO operates as a conduit for doctor-to-patient communication by providing certified medical interpreters to convey sensitive information between the two parties in a timely and consistent manner. As VP of Engineering, Jonathan De Jong is mainly responsible for ensuring that GLOBO’s platform is up and running. He also manages the Custom Development, Data, Analytics, and DevOps teams, all of which play a crucial role in ensuring the reliability and delivery of critical systems and services to their customers. Continuous Digital Transformation Before the COVID-19 outbreak, GLOBO already offered a video software service to connect doctors and patients who were in the same room with interpreters. However, the service supported only two dial-ins at the same time. As the outbreak became more widespread, doctors and patients often needed to communicate from different locations. It was clear they needed an interpreting solution that could accommodate the doctor, patient, and interpreter, each dialing in separately. When COVID-19 hit, scaling up the video offering became the Engineering team’s primary focus. De Jong and his team needed to refine the software’s abilities to host a doctor, patient, and interpreter securely and in real time. 
Expanding the product to support three separate video calls from various locations around the globe was no easy task, and the team also had to scale the new and improved video software to meet the demand of customers who needed to shift to a fully remote way of working. With everyone going remote, the customer support organization started seeing a rise in customer-facing issues that required the engineering team’s assistance in a timely manner—and the engineers quickly realized they needed a solution to help manage the increased incident traffic. Along with the struggles that came with pivoting to an all-remote video offering, GLOBO experienced many other challenges, including: Slow response times due to manual escalation of alerts through email and Slack Lack of visibility between the Customer Support and Engineering organizations Alert fatigue stemming from non-actionable incident escalations during non-business hours “There was a week where, every single morning around 2 a.m., I was woken up by alerts from Amazon deciding to recycle all of our Redshift connections for our data warehouse. We needed a solution to help better determine incident severity for our teams,” said De Jong. Increasing Cross-Functional Collaboration With PagerDuty De Jong had used PagerDuty at a previous position and brought it to his new team at GLOBO. By using PagerDuty to automate escalation policies, GLOBO drastically improved workflows between departments, specifically with the Customer Support and Engineering teams. Now, when a customer-facing issue arises that customer support agents can’t resolve on their own, the problem is escalated to on-call engineers. For example, when a patient is trying to join the video software, but there is an issue connecting, customer support will first try to troubleshoot the problem. If they cannot fix it, then the customer support agent will use PagerDuty to escalate the problem to the engineering organization to resolve the issue quickly. 
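The escalation flow described above can be sketched as an ordered list of tiers, each with a time limit for acknowledging the issue before it moves to the next tier. The tier names and timeouts below are hypothetical, and the code is a plain-Python illustration rather than PagerDuty's API or GLOBO's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    responder: str
    timeout_min: int  # minutes to wait for an acknowledgement before escalating


# Hypothetical policy: support first, then on-call engineering, then a manager.
POLICY = [
    Tier("support-agent", 10),
    Tier("on-call-engineer", 15),
    Tier("engineering-manager", 30),
]


def current_responder(minutes_unacknowledged: int, policy=POLICY) -> str:
    """Return who should be notified after the given time without an acknowledgement."""
    elapsed = 0
    for tier in policy:
        elapsed += tier.timeout_min
        if minutes_unacknowledged < elapsed:
            return tier.responder
    # Past the final timeout, the last tier stays responsible.
    return policy[-1].responder


# Example: 12 minutes without an acknowledgement has passed the first tier's
# 10-minute window, so the on-call engineer is notified.
print(current_responder(12))
```

Keeping the policy as data rather than logic means each team can adjust its own tiers and timeouts without touching the routing code.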
As the move to remote work continued, more GLOBO customers turned to the improved video software. Escalation policies have allowed De Jong and the team to quickly resolve incidents coming in from the customer support team and all other endpoints; however, too much time was still being spent fixing issues that may not have been a priority, which led De Jong to PagerDuty Event Intelligence. GLOBO uses Event Intelligence as an endpoint, receiving alerts from the organization’s entire monitoring stack, including Datadog and New Relic. Using artificial intelligence and machine learning, Event Intelligence helps De Jong and his team avoid unnecessary interruptions by automatically reducing alert noise and intelligently grouping alerts into single incidents. As a result, De Jong’s Engineering team has been able to shift their focus from being reactive to customer-facing issues to being proactive and resolving issues before the customer is affected. This has been especially important in communicating sensitive information between doctors and patients through a certified virtual interpreter. Additionally, PagerDuty has helped GLOBO with: Speed to market. Automated escalation allows teams to focus on building out and scaling new products and services. Self-management. Customizable escalation policies have allowed on-call engineers to personalize how they receive alerts (emails, SMS, phone calls, or push notifications). Reducing alert fatigue. PagerDuty Event Intelligence has helped GLOBO reduce alerts by two-thirds. “Before PagerDuty, my team and I were very reactive to incidents coming in, which led to more time resolving incidents than working on supporting our products. PagerDuty has allowed us to focus on what really matters,” explains De Jong. Scaling PagerDuty for Growth Looking ahead, GLOBO plans to continue expanding its use of PagerDuty and its automation capabilities to more teams across the organization. 
For example, to help interpreters connect faster with clients, GLOBO wants to automate the process of contacting on-call interpreters during a service request. Additionally, GLOBO hopes to eventually see a reduction of up to 90% of alert noise from its continued use of PagerDuty Event Intelligence. With PagerDuty, customers can rely on GLOBO to provide translation services to their users and ensure a consistent customer experience when it matters most. “I love PagerDuty because it just works. Our clients rely on us to provide them a seamless customer experience, and we know PagerDuty is there to help support us in times that matter most,” shared De Jong. To learn more about how PagerDuty helps companies uplevel support operations and management in the U.S. and abroad, try PagerDuty today.
Global Bank Partners With PagerDuty to Migrate to the Cloud
This global banking company focuses on serving low-income customers and families whose needs are often turned away by mainstream banks. The company’s mission is to help people achieve their financial goals by providing credit card and loan services to over 2 million customers. The bank’s Senior Technology Operations Manager oversees different teams inside the organization including application support, infrastructure, and platform. These teams ensure the organization’s systems and services are always online and available around the clock. Traditionally, this company prioritized its products around in-person banking services, but with an increased demand for digital-first service offerings over the last few years, it shifted its focus towards revamping its website and mobile app to fully support a completely digital customer experience. “We want to put our technology in the hands of our customers and allow them to choose the channel they want to use while interacting with our services,” explained the Senior Technology Operations Manager. Challenges of Digital Transformation The organization’s digital transformation began in 2016 when the mobile team pivoted to focus more on digital services. Teams felt that the mobile app needed a new user interface that put the company’s services in the palms of customers' hands, rather than traveling to the bank every time they needed to use a service. Prior to 2016, a third-party vendor developed and hosted the bank’s mobile services, but since few customers were actually using the mobile app, the team instead decided to create a new one in-house. With a goal of getting this new app in the hands of customers as fast as possible, the team decided to build and host the application completely in the cloud, allowing service provisioning to go from weeks to hours and saving several months of development time. Due to the success of the mobile app, the company now hosts all new services in the cloud. 
Today, the mobile team is now responsible for monitoring and supporting these new cloud-hosted services around the clock. Previously, the company had a team dedicated to troubleshooting incidents, which mainly involved a service desk following runbooks and using other, more manual escalation processes. But due to the nature of the organization's mobile app now needing 24/7 support, the mobile team needed a new process to provide support after normal business hours. With the new mobile app, the bank realized they needed to address additional challenges, including: Manual incident response. Using a service desk, incidents needed to be monitored, acknowledged, and escalated manually. Lack of visibility. Without an incident management platform, a record of incidents was not easily accessible. Difficulty providing 24/7 support. Creating in-house digital services meant that customers now required 24/7 support from internal teams. “Our services require 24/7 support and this simply wasn’t manageable with a manual incident response process,” explained the Senior Technology Operations Manager. “Basically, we’re offering a real-time service (with regard to the mobile app), and we need to be able to support it in real time.” Automating Incident Response With PagerDuty Instead of having a centralized team dedicated to monitoring the global banking company’s services and alerting responders when incidents happened, the mobile team turned to PagerDuty to help distribute these responsibilities across the organization. Adopting PagerDuty allowed the organization to expand ownership and visibility to all employees responsible for services they built, rather than making the service desk responsible. The mobile team was able to quickly integrate the PagerDuty platform with tools such as SolarWinds for on-premises monitoring, as well as Microsoft Azure and Datadog for cloud monitoring. 
“The ambiguity that PagerDuty’s integrations offer helped us move to an automated incident management process for both our on-prem and cloud services,” explained the Senior Technology Operations Manager. After implementing PagerDuty, the mobile team immediately saw the impact of having an end-to-end digital operations platform. “We had an instance where one of our file shares was getting consumed rapidly and was going to reach capacity,” the Senior Technology Operations Manager shared. “With several services impacted, the service desk was notified about the incident by different teams, but my team had already acknowledged, escalated, and resolved the incident before it ever became customer-facing. That was the moment we saw the true value of PagerDuty.” As the mobile team shifted from relying on one centralized team to monitor, assign, and escalate issues to a more automated incident management process, other teams inside of the organization began adopting the PagerDuty platform. Now, the core infrastructure and the loan services teams have integrated PagerDuty into their incident management workflows. A Core Infrastructure Engineer shared, “It’s been difficult shifting away from a manual process, but when we saw the success that the mobile team had, we decided to implement PagerDuty into the core infrastructure team as well.” Teams using PagerDuty have seen many benefits, including: Faster response times. PagerDuty’s mobile app allows teams to triage incidents anywhere and anytime. More automation. Incidents can be automatically closed and logged into the organization’s configuration management system. Increased visibility. Teams have been able to keep a record of their incidents to analyze new data. “With PagerDuty, we can proactively resolve incidents before they become customer-facing,” explained the Senior Technology Operations Manager. What’s Next? 
Looking forward, the global bank wants to continue shifting its internal culture by embracing automated workflows across more teams. For example, using the PagerDuty mobile app, the executive team now has real-time visibility into incidents occurring throughout different teams and services, and will begin to use this information to continue fine-tuning the incident management process at the organization. Additionally, teams are evaluating other PagerDuty products to see how real-time data can provide deep contextual insights and further boost their team's productivity. “PagerDuty allows us to have that single-pane-of-glass, holistic view of the health of all our services,” shared the Senior Technology Operations Manager. To learn more about how PagerDuty is helping companies transform their digital operations, visit www.pagerduty.com/customers for more information and start a 14-day free trial today.
Loblaw Partners With PagerDuty to Implement Full-Service Ownership
Loblaw Companies Limited is Canada’s largest retailer and leader in the food and pharmacy industries, with a mission of empowering customers to “Live Life Well®.” The company provides online and brick-and-mortar marketplaces with access to groceries, apparel, health and beauty products, and financial services. Jaspal Sawhney, Senior Director of SRE at Loblaw Technology, manages a team of over 100 software developers and engineers, all of whom are expected to build tooling and capabilities for the development team, which allows them to fully own their services. With the COVID-19 pandemic driving more online and mobile commerce than ever before, reliability and customer convenience are at the forefront of Loblaw’s priorities. Loblaw’s philosophy has always been focused on customer convenience. “Wouldn’t it be great if you could go to one place and pick up birthday gifts, party favors, and get the catering as well? That’s the vision of ultimate convenience,” explained Jaspal. “As a retailer, our competition is fierce. If we don’t compete at a high level, then we may not be around in 5 to 10 years.” Looking Internally A few years ago, Loblaw was a very traditional IT shop, with siloed teams and its own data centers. This created problems around visibility and accountability when issues arose. “We basically had a lack of ownership because everything would just be getting thrown over the fence,” said Jaspal. As Loblaw grew and systems became more complex, centralized incident management was simply not scalable. Minor issues would turn into major incidents, which caused an increase in mean time to resolve (MTTR). “We would sometimes have outages that lasted longer than we would’ve liked,” explained Jaspal. “Every minute you're down is serious dollars you're losing.” The Path to Full-Service Ownership Part of Loblaw’s digital transformation efforts involved transitioning to a full-service ownership model leveraging cloud technologies—and that would take time to succeed. 
The first step was changing the roles within the technical teams at Loblaw Digital so that everyone was a developer responsible for the code they developed. “If you are in Loblaw Digital, then you are fundamentally writing software. And code runs the lifecycle of everything we have in our organization,” he explained. The next step was cloud adoption so that the entire team would have visibility into their code and could follow it through to production. “Moving to the cloud allowed for a lot more control through pipelines and gave teams more visibility, accountability, and auditing capabilities than they had before,” shared Jaspal. The teams could use a host of cloud tools to ideate, build, and test their own code, allowing for new solutions to be built in a fraction of the time it took before. Having full ownership of their code also enabled the team to better understand the business impact of their work, which created even more accountability. While the move to the cloud enabled teams to move faster, the risk of failure was also higher, and it was critical that teams could learn from failures without assigning blame. It required Loblaw Digital to change its culture and seek a system that would provide psychological safety. The final step to this transformation was bringing in a platform that could support a full-service ownership model for every person on the Loblaw Digital technical teams. Partnering with PagerDuty Loblaw adopted PagerDuty as the final puzzle piece in their digital transformation journey and shift to a full-service ownership model. PagerDuty enables Loblaw Digital to identify issues and understand if the right team member is the first responder to an incident. Rather than a centralized team scrambling to identify the root cause, all customer-facing services are now broken into individual components with ownership by the teams that wrote the code. 
With this model, the teams can push code more frequently, use the monitoring tools they want, integrate them into PagerDuty, and set their own schedules and escalation policies for the services and applications they own. Additionally, Loblaw uses PagerDuty to facilitate postmortems, now a staple of its incident management process. PagerDuty provides a record of incidents used by Loblaw to build a blameless retrospective culture that focuses on continuous improvement without pointing fingers. “Now, teams are able to dive into these retros and share details to learn from them,” explained Jaspal. PagerDuty has helped Loblaw with: MTTR. Incidents have gone from hours to under 15 minutes. Talent Retention. According to Jaspal, a culture of service ownership—driven by PagerDuty—ties directly to developers achieving mastery, autonomy, and purpose, leading to higher job satisfaction and retention. Executive Trust and Confidence. Because incidents are resolved quickly with little customer impact, executives no longer lose sleep at night and trust that systems and services will stay always on and available. Productivity. Developers are spending more time on innovation and improving the customer experience and less time putting out fires. “With PagerDuty, we have been able to embrace a full-service ownership model for developers, which has been adopted by all teams taking the SRE journey as they modernize their applications for the cloud,” shared Jaspal. Resilience During a Pandemic Loblaw was well on its digital transformation journey when the pandemic began, so it was able to manage the increase in online traffic with little disruption. “This agile, full-service ownership model was really validated when the pandemic hit,” explained Jaspal. 
“As an essential service, we needed to be able to pivot and build these new products within a matter of days.” In the first four weeks of lockdown, Loblaw Digital built new solutions for seniors, healthcare workers, and frontline workers, with a focus on making their shopping experience as convenient as possible. Full-service ownership created the autonomy and accountability for Loblaw to make decisions quickly and build as fast as possible. While other retailers took some hits during the pandemic, Loblaw Digital has been resilient and was able to pivot quickly due to its embrace of a full-service ownership model. Looking Ahead Because of Loblaw Digital’s success, the broader Loblaw enterprise has taken notice. “The key is asking, ‘How do we take what we've done within Loblaw Digital after proving some success with it and replicate that across the entire enterprise?’” shared Jaspal. With this in mind, Loblaw Digital’s site reliability engineering function has transitioned into an enterprise team, focusing on smoothly creating and scaling highly reliable software solutions for the entire organization. “When it comes to digital transformation, there really is no end state. It’s always going to be evolving, and it’s our job to continuously improve, learn, and reach digital maturity and then push the bar higher towards engineering excellence,” explained Jaspal. To learn more about how PagerDuty partners with companies to optimize cloud migration and service ownership, check out what our other customers have to say and try PagerDuty today.
HUG Relies on PagerDuty When Healthcare Incidents Arise
The Geneva University Hospital (HUG) is one of the five university hospitals in Switzerland and one of the largest hospitals in Europe. Pierryves Fournier, SRE Team Lead at HUG, explains how PagerDuty and Rundeck help automate his team's incident response process, empowering the right action when seconds matter.
Trek Medics: Deploying Emergency Responders in Underserved Communities Around the Globe
Emergency responders were under enormous pressure over the past year, especially when COVID-19 overwhelmed healthcare systems in various parts of the world. In the face of this, coordinating a rapid and efficient emergency response has never been more crucial. In many low- and middle-income countries and underserved communities, emergency responders typically coordinate the response via messaging services (e.g., WhatsApp). But scrolling through group messaging chats to coordinate and mobilize on a broader level is neither scalable nor effective, and leaves emergency responders at the mercy of outdated mobile networks and legacy infrastructure. Enter non-profit organization Trek Medics, a PagerDuty.org grantee and Impact Pricing customer. Trek Medics works to improve emergency response for at-risk and vulnerable populations through innovative mobile phone technologies. Through Trek Medics’ Beacon communications platform, responders can alert, coordinate, and track emergency response networks on any mobile phone—with or without internet. The platform is active with more than 2,000 daily users in 25 countries including Puerto Rico, England and Tanzania, and handled over 100,000 calls in 2020. Orchestrating Emergency Response at Scale The emergency services organizations that rely on Trek Medics need to be on call 24x7. Teams in these regions are typically alerted to nearby emergencies via text message or even an air siren. The Beacon platform serves the same purpose, but instead of an air siren, Beacon sends a digital signal to the responder's device. Beacon goes beyond just alerting individuals. “Coordinating a response to an alert is the second part of the equation that is just as critical as the first part. We can coordinate the appropriate response to make sure that the right people are going to the right place at the right time,” said Jason Friesen, the Founder and Executive Director at Trek Medics. 
Beacon must be available to support the mission-critical nature of 24x7 emergency response. Whether it’s a motor vehicle collision, opioid response or domestic violence emergency, prolonged downtime puts lives at risk. It’s vital that Trek Medics can spot and resolve any digital incidents in Beacon before end-users are impacted. This is where PagerDuty comes in. Getting Help to Where It's Needed Fast Before PagerDuty, Trek Medics’ approach to digital operations management was manual and time-intensive. Teams logged onto different sites and sifted through various sources to identify and resolve issues. With PagerDuty, Trek Medics integrates and centralizes alerts coming from sources such as Twilio, Slack, CloudWatch, New Relic and internal monitoring, providing teams with visibility into issues within its whole environment. “To us, PagerDuty is an internal response and early warning system,” explained Friesen. “So, there is a real parallel between what we do and what PagerDuty does. PagerDuty not only streamlines all of the back-end alerts that we could be getting on our stack, but it also helps to coordinate our own response to make sure that the right people can investigate and resolve the alert.” Prioritizing Alerts to Ensure Users Aren’t Impacted PagerDuty also provides Trek Medics with priority and severity tagging to help teams quickly sort through alerts and identify the mission-critical problems in real time. For instance, when one of Trek Medics’ two servers went down, PagerDuty alerted Trek Medics immediately. “Thanks to PagerDuty, as soon as the server went offline, our back-end developers were alerted through Slack, the mobile app, SMS and phone. This allowed them to instantly take action to resolve the problem and we avoided any tangible impact on our users,” said Friesen. 
In another case, PagerDuty flagged instances of Beacon’s mobile app crashing when a user opened a photo that was more than a month old—even before the mobile app monitoring tool sent a notification. PagerDuty enabled the team to quickly resolve the issue without impacting users. “While this was a relatively innocuous problem with the app, it’s a great example of how effective PagerDuty is at keeping us aware of what’s going on across our services at all times and making sure users are not impacted,” commented Friesen. Creating a Future-Proof Ecosystem Looking ahead, Trek Medics plans to further integrate PagerDuty with its partners. “Many of our Beacon users leverage additional services on top of the platform. We want to alert Beacon users immediately through PagerDuty when these third-party services run into issues. If a mobile network goes down, for instance, we could alert our partners and even the mobile network before they realize it’s happened. This helps the whole ecosystem become more proactive.” PagerDuty.org’s partnership model, designed to help nonprofits and mission-driven organizations accelerate their vital work, has helped Trek Medics further improve its services. “As a non-profit organization, we’re counting our pennies every day and we are dependent on the generosity of others. We have to watch our budget very closely. PagerDuty's holistic support makes it totally possible for us to work at full capacity without having to make any trade-offs or compromise any of our services.” Find out how nonprofits and B Corps can accelerate their work and reach with our Impact Pricing and COVID-19 Response Pricing.
Solarisbank Banks on PagerDuty to Keep Financial Services Online
Solarisbank is Europe’s leading Banking-as-a-Service platform that enables any business to offer their own financial services. Satyajit Ranjeev, Daria Kameneva, and Jens Hermann discuss how PagerDuty helps teams implement a “you build it, you own it” model and reduce incident response times.
Wiley Relies on PagerDuty as the World Moves Towards Digital Learning
John Wiley & Sons, Inc., commonly referred to as Wiley, is a global publishing company founded in 1807 that focuses on academic publishing and instructional materials. Sean Mack, CIO and CISO of Wiley, discusses how PagerDuty is empowering teams to own and support services 24/7/365 as digital learning becomes more prevalent.
John Lewis & Partners Taps PagerDuty to Power Always-On Retail Experiences
Founded in 1864, the John Lewis Partnership is the largest employee-owned business in the UK, with two main retail brands: John Lewis & Partners, a chain of high-end department stores, and Waitrose & Partners, a grocery retailer. For John Lewis & Partners, its online presence began in 2001 and has since grown considerably. The company moved from a monolithic ecommerce platform hosted in its data centers to a microservices-based platform on Google Cloud Platform. The move to the cloud made it easier to scale and deploy new services, helping accelerate digital innovation. As Rob Hornby, Lead Engineer and Product Owner for the John Lewis Digital Platform explained, “We now have 60 services running with more than 300 microservices on the platform and 5,000 deployments a year, a huge increase on the 10 a year we could previously deploy.” Simon Skelton, Platform & Operations Manager for johnlewis.com, has overall IT accountability for the smooth running of ecommerce operations. “We’ve done a lot over the years to tune the website to meet demand during key peaks, but our new digital platform has really helped us scale up and quickly meet the demands of customers,” he shared. Challenges of a Centralized Approach Having laid the foundations for digital innovation, John Lewis & Partners realized that its existing operations model wouldn’t scale to ensure a seamless, always-on experience for customers, which is paramount in an industry where just a few minutes of downtime can result in thousands of pounds in revenue loss. John Lewis & Partners followed a traditional operations model with a Network Operations Center (NOC) that had 24/7 eyes on glass for every alert. As its digital platform grew, so did its teams and the services they supported, to the point where it was extremely challenging for the NOC to efficiently route through the myriad of technical complexity. 
“Teams typically handed code over to Operations, but as we scaled the number of deployments and teams grew, and this became impractical,” explained Rob Hornby. Using PagerDuty for Faster Response Times and Delivery John Lewis & Partners turned to PagerDuty because the company needed a more effective incident response process that would enable the engineering teams to have full-service ownership and directly fix issues as they arose. “As we moved to the cloud and out of data centers towards a ‘build it, run it’ model, we needed to change our approach. We looked at a lot of different tools, but PagerDuty stood out and became our preferred option,” explained Skelton. PagerDuty helps John Lewis & Partners orchestrate the right response for every incident—ensuring the right people are brought together in the right place, at the right time. Moving from a centralized, resource-intensive approach to a full-service ownership model across all of the development teams has allowed John Lewis & Partners to identify and address issues more quickly. With PagerDuty, the company has managed to reduce time to acknowledge incidents from 10-15 minutes to 1-2 minutes. Delivering an Always-On Experience, Even During Peak Seasons Typically, John Lewis & Partners has three main peaks a year: Christmas, Summer Sale and Black Friday. These events require the retailer to scale swiftly while handling 10 times the usual traffic—at peak, that could be more than eight orders per second and tens of thousands of page views. With a full-service ownership model enabled by PagerDuty and a shift to the cloud, the staff and infrastructure were well prepared to deal with the increase in traffic during those peaks. So when people began shopping online in droves as a result of pandemic-related lockdowns in the UK, John Lewis & Partners was ready. “Like many retailers in the UK, we had to close all our shops at the end of March 2020, and the majority of our customer traffic immediately moved online. 
Our challenge was rapidly turning off all the website integrations with in-store services. We’ve never had to do this before, but it went remarkably well,” explained Simon Skelton.

Scaling up and meeting the demands of customers whose only interface is digital, and mitigating any disruption quickly, have never been more important. “With stores closed, it is our only means of continuing as a business, so we’ve had to remain stable while also being able to iterate quickly and make sure the website is offering the same great level of service and information as our Partners do in stores,” shared Rob Hornby. “We’ve gone through our Summer Sale, which has broken all our records, and we haven’t really broken a sweat platform-wise. We’ve been stable throughout.”

Benefits of PagerDuty

Since implementing PagerDuty, John Lewis & Partners has been able to respond quickly to continue meeting customers’ expectations, with benefits such as:

- Improved cross-team collaboration, so teams can focus more time on driving continued innovation in an ever-changing retail environment
- Significantly decreased time to resolution when issues occur: teams are able to restore service three times faster than they could previously
- Greater visibility into the operational health of its systems and services

“PagerDuty has been critical in ensuring we can rapidly respond to digital incidents so we don’t lose revenue to our competitors,” explained Simon Skelton. Additionally, even in light of the pandemic, Rob Hornby shared that the company’s platform and support model allowed the retailer to adapt almost seamlessly, since PagerDuty was already BAU (business as usual) for many of its teams. “It fits really well with a remote working model,” he said.

Future Looking

In the future, John Lewis & Partners plans to use insights from PagerDuty to help support the postmortem process after incidents.
Eager to improve how it manages incidents beyond the initial response, the team wants to better understand what went wrong in the first place, and what can be improved in the way people responded and the processes used. The retailer has also started mapping its technical services to business services to help improve cross-organizational communication during an incident, as it provides a clear view of the response to the business owners. “Over time, this may help replace our existing business email communications on incidents,” explained Rob Hornby. “In today’s climate, spikes in traffic could happen at any time, so we need to make sure we’re constantly ready to respond, which is where PagerDuty plays a key role. It has allowed us to automate our incident response processes so we can be more proactive, instead of leaving us reliant on manual and reactive processes,” concluded Simon Skelton. Hear more about John Lewis & Partners’ digital transformation journey by registering and watching this on-demand webinar: Delivering Always-On Digital Customer Experiences in Retail. To learn how PagerDuty can help your team make things simple and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.
The Trevor Project Counts on PagerDuty to Keep Their Suicide Hotline Running 24/7
The Trevor Project is a non-profit organization focused on suicide prevention efforts among lesbian, gay, bisexual, transgender, queer, and questioning youth. John Callery, Director of Technology, discusses how PagerDuty assists The Trevor Project with saving lives by keeping their suicide hotline and text services up and running 24/7.
Carrefour Bank Uses PagerDuty and Rundeck to Automatically Self-Heal Incidents
With the mission of transforming the customer experience for financial services, Carrefour Bank offers a wide portfolio of financial products created to meet and satisfy different customer needs. Learn how Carrefour Bank leverages PagerDuty and Rundeck to automatically self-heal.
Fortune 500 FinServ Company Automates With PagerDuty
The Infrastructure Engineering team at this Fortune 500 financial services company administers the dev, test, and production environments for 20,000 Linux servers distributed across many global data centers. The team has to comply with strict industry regulations, and to stay in compliance they had to follow a tedious change management process. They sent all change management actions to the due diligence team for approval before their IT services partner could execute. Changes were made manually, so the Infrastructure Engineering team had to validate the results of the service partner’s work. Further, alert noise from their monitoring tool, Sensu, was a constant distraction for engineers. It was clear that, to continue scaling operations, the Infrastructure Engineering team needed to become more efficient.

PagerDuty Process Automation was deployed to automate workflows across systems and infrastructure. Now, the service provider can call it to gather system info and execute system changes, simplifying the change management process and eliminating most of the associated labor-intensive work. Using Process Automation, the Infrastructure Engineering team can run automated jobs to validate that maintenance operations are completed correctly, giving back an hour each day to the engineers. “We found an opportunity to use PagerDuty Process Automation to improve the efficiency of working with our outsourced IT service partner,” said the Infrastructure Engineer Manager.

Additionally, standardizing and automating workflows has improved the company’s security posture. Access to the workflows in Process Automation is controlled by user roles, so login credentials and SSH keys don’t need to be shared. Process Automation helped reduce alert noise by automating remediation playbooks in Sensu for common alerts. For known cases, a workflow is executed automatically to remediate the issue without escalating to a human operator.
The Infrastructure Engineer Manager explained, “That’s fewer interruptions so they can focus on longer term projects and be available in the event a more serious problem arises.”

The Infrastructure Engineering team plans to expand their use of Process Automation. The next project is to enable better auditing of user activity. “It’s complicated to track down executions by specific users, especially when we need to do this in a rapid timeframe. PagerDuty Process Automation will be able to simplify this for us as well,” the Infrastructure Engineer Manager shared.

To learn how PagerDuty can help your team make things simple and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.
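The automated remediation described in this case study (known alerts trigger a workflow, everything else goes to a person) can be illustrated with a minimal Python sketch. It is not the company's actual setup and does not use the PagerDuty Process Automation or Sensu APIs; the alert names, the `PLAYBOOKS` table, and the remediation functions are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    check: str  # name of the monitoring check that fired (hypothetical)
    host: str   # server the alert refers to


def restart_service(alert: Alert) -> str:
    # Stand-in for a real remediation job; it only reports what it would do.
    return f"restarted service on {alert.host}"


def clear_tmp(alert: Alert) -> str:
    # Stand-in for a real cleanup job.
    return f"cleared temp files on {alert.host}"


# Known alert types mapped to remediation playbooks (assumed names).
PLAYBOOKS: Dict[str, Callable[[Alert], str]] = {
    "service_down": restart_service,
    "disk_full": clear_tmp,
}


def handle(alert: Alert, escalations: List[Alert]) -> str:
    """Run the matching playbook if one exists; otherwise queue the alert for a human."""
    playbook = PLAYBOOKS.get(alert.check)
    if playbook is None:
        escalations.append(alert)
        return "escalated"
    return playbook(alert)
```

Keeping the mapping in one table means a newly understood alert type can be automated by adding a single entry, while unknown alerts still reach an engineer.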
Vodafone Utilizes PagerDuty to Better Understand Their Real-Time Operations
Vodafone is a telecommunications company providing 4G network coverage for 18 million customers and 99% of the United Kingdom’s population. Ben Connolly, Head of Digital Engineering at Vodafone, details the challenges that his engineering teams were facing and why PagerDuty was the perfect fix. PagerDuty helps Vodafone deliver a better customer experience by allowing their teams to see the impact that they're having in real time.
SPS Commerce Chooses PagerDuty to Keep Critical Retail Supply Chain Services Online
SPS Commerce, the largest retail network, connects over 90,000 retail businesses of all sizes across the globe. Companies turn to SPS to streamline operations and support new order management models, such as the ability to ship directly to consumers. As Andy Domeier, Senior Director of Technology at SPS, explained, “Companies have very different backend systems and technical abilities, which can make collaboration complicated. Retailers and suppliers need to work together regardless of size, and we offer a variety of full-service offerings to connect these companies across our network.”

He leads a group of technology teams that include the Site Reliability Engineering (SRE), Cloud Operations, System Operations, and Continuous Improvement teams, responsible for ensuring the network is always on and working seamlessly for their customers. To support the company’s growth, in 2013 Domeier sought to streamline existing digital operations to better scale to meet the future needs of the business.

Challenges Without a Digital Operations Management Platform

At that time, Domeier’s teams faced new challenges as the SPS retail network grew. For example, the teams saw an increase in noise and clutter as they adopted new monitoring and data observation tools. When incidents arose, teams had to scramble, with little visibility because of the alert noise coming from various monitoring tools. They also had difficulty notifying the subject matter expert (SME) for each issue or affected service. SPS needed a solution to help streamline this process and a platform to help manage the entire incident lifecycle. Domeier and his teams faced challenges with:

- Collaboration. Disparate tooling across the organization complicated cross-team collaboration.
- Visibility. Teams lacked a holistic view of the health of their digital operations because of increased alert noise from numerous monitoring tools.
- Accountability. Complex infrastructure created confusion around code ownership, which led to increased time to resolution.

“We needed something that could integrate with our monitoring tools, send alerts, and act as a hub to make sure those alerts were sent to the right person,” explained Domeier.

Benefits of Implementing PagerDuty

Domeier centralized all of the monitoring tools and teams onto PagerDuty so they could have improved consistency in terms of visibility into performance. This removed friction from the incident response process and enabled SPS to maintain “organizational velocity”. Leveraging PagerDuty’s broad ecosystem of over 500 integrations, SPS connected all of its cloud monitoring tools, including Amazon CloudWatch, Grafana, LogicMonitor, Prometheus, Sentry, and Sumo Logic, to PagerDuty. Additionally, Domeier’s teams leveraged PagerDuty’s integration with Slack, so that teams could trigger, respond to, and resolve incidents, all within the chat application. As a result, SPS technology teams smoothly transitioned mission-critical services to improve how the teams monitored an ecosystem of tooling and performance solutions, and could take immediate action on incidents.

In recent years, the company adopted a full-service ownership model, where developers own their code in production. Full-service ownership enabled SPS teams to minimize downtime and maintain a consistent customer experience. “We’ve seen a positive internal cultural shift,” explained Domeier. “Before, our development teams would deliver their code to production with little transparency into their service’s health and availability. But as we architect and deploy new services, managing these services using PagerDuty has allowed development teams to see their code all the way through deployment and take ownership when incidents arise. Our Technology team is a talented and truly special group of individuals around the world!”

Furthermore, the company’s customer success teams have also started using PagerDuty.
Because the company’s platform must be always on, the customer success team can now proactively escalate customer-facing issues to engineering teams before customers are impacted. They also leverage PagerDuty to route important notifications about specific customers to Technical Account Managers, improving the quality of service SPS is able to provide.

With PagerDuty, SPS has seen several benefits, including:

- Sustained organizational velocity and consistency among teams with the ability to troubleshoot incidents using a unified platform.
- Improved operational health with visibility into incidents coming from the organization’s entire tech stack.
- Improved response and resolution times via a full-service ownership model.

“PagerDuty’s incident data is a gold mine of improvement insights,” said Domeier.

Moving to Remote Work

As the world went remote and consumers went digital in 2020, so did SPS. Using PagerDuty, the company smoothly transitioned to remote work, despite high volumes within their network. “Since the pandemic began, we’ve found that retailers needed to find ways to be more efficient, more effective, and save money,” explained Domeier. “This led to an increase in the use of our retail network, and PagerDuty has been able to help us keep organizational velocity even as we’ve moved to a remote working environment.”

What’s next for SPS Commerce?

Looking ahead, SPS plans to embed PagerDuty into its service creation process to streamline development and support teams as new products, features, and services are built. SPS is also planning to build more automation around the PagerDuty platform so developers can gain more context about new code and service deployment using PagerDuty’s change events. Domeier is also looking into other PagerDuty products like Event Intelligence and Analytics as they continue to see new operational data surface from the platform.
“PagerDuty allows my teams to focus on what’s important to us and continue to move our business forward,” explained Domeier. To learn more about how PagerDuty is helping companies transform their digital operations, visit www.pagerduty.com/customers. To see how PagerDuty could help your team approach real-time work more efficiently, start a 14-day free trial today.
PagerDuty Helps CTC Transform Operations in a Remote World
Founded in 1995, Chicago Trading Company (CTC) is a derivatives trading firm that specializes in market trading across a variety of products, services, and strategies. CTC actively trades in a broad spectrum of asset classes, including equities, interest rates, and commodities. Its trading desks are open 20 hours a day, six days a week, and the company is recognized as a leading provider of liquidity and pricing on numerous equities and derivatives exchanges around the world. Because the market fluctuates by the microsecond, CTC’s critical applications and services need to always be online and available for users at a moment’s notice to deliver a consistent customer experience, every time.

“With our services directly tied into the open market, downtime is just not an option,” explained Luke Rotta, Manager, SRE and Observability at CTC. “If we’re not in the market, we’re not participating in the opportunity, and it’s a missed opportunity.” Rotta is responsible for managing observability at CTC, as well as overseeing the SRE team that supports, automates, and improves uptime for the pre-production and production environments.

Before PagerDuty

Before implementing PagerDuty, Rotta’s team experienced several challenges, including:

- Delays in response stemming from a manual on-call directory with outdated schedules and rotations
- Difficulty communicating with on-call responders during non-business hours
- Lack of automation embedded into the response process, which led to more manual work for on-call responders
- A legacy dashboard cluttered with unactionable events and alerts, creating delays in incident acknowledgement and resolution
- Alert storms that reduced teams’ ability to understand the makeup of, and respond effectively to, incidents

With the recent push towards remote work, CTC was forced to quickly pivot operations to a digital-first model.
Additionally, heightened market volatility meant that its customers also increased the frequency of their trading, making it more important than ever that the CTC trading platform stayed up and running at all times. To help achieve this, CTC needed to rethink its incident management process while continuing to maintain and deliver a consistent customer experience. This meant Rotta’s teams needed to refocus their efforts on day-to-day operations rather than long-term projects, and all in a new, remote-first environment. “Our teams are laser-focused on making sure systems can handle the increased capacity and deliver liquidity to the marketplace to keep our customers happy,” shared Rotta.

Prioritizing Communication and Collaboration

Before going remote, most information was communicated verbally in the office. Now, with the entire company working remotely, the ability to effectively communicate and collaborate across teams is more important than ever. PagerDuty helped CTC transform its incident communication channels to be completely digital. “PagerDuty really taught us to spin up an incident remotely and allowed us to centralize our incident management process to quickly assemble teams into a single channel and make decisions directly from there.”

CTC also leverages Slack, part of PagerDuty’s ecosystem of more than 600 integrations, to improve incident communication and collaboration between teams, as well as for conducting postmortems. With the Slack integration, teams can create, respond to, and resolve PagerDuty incidents directly inside the Slack interface, which alleviates the stress of multiple communication channels and allows all necessary teams to work through the incident together. “Since all teams are remote now, we just create the incident directly in Slack. The playbook tells everybody what Zoom room to jump into, and off we go,” shared Rotta.
Improving Operational Visibility

In a digital-first environment, it’s critical for stakeholders to have total visibility into the health of their critical systems and services in real time so they can quickly orchestrate a proper response when an incident occurs. Before PagerDuty, CTC used a traditional dashboard that would alert the team about service disruptions and incidents. “We would get what we call the ‘wall of red,’ which was quite literally a screen filled with hundreds of alerts, with no sense of what’s being impacted or what’s going on in our environment,” explained Rotta.

To combat this issue, CTC implemented PagerDuty Event Intelligence to automatically group alerts together and cut down the noise for all mission-critical services and applications. “Before PagerDuty, we sometimes had 50-200 alerts coming in at once. With Event Intelligence, that number is now down to 5-10,” explained Rotta. With Event Intelligence, CTC’s response teams also have the context they need to quickly resolve an issue before it becomes widely customer-impacting. “The ability to reduce the noise and clear out alerts within the platform really frees up a lot of time for people on our SRE team to focus on higher-impact tasks,” said Rotta.

Like many companies today, CTC needs to continue scaling to keep up with customer demand and new innovations. Even though speed is table stakes at a trading firm such as CTC, running non-latency-sensitive workloads within AWS has given CTC the ability to scale quicker and reduce time to market for ideas. Many of the new services deployed to AWS follow a you-build-it, you-own-it approach and PagerDuty provides a single way to escalate, track, and measure incidents across the company regardless of who owns or supports the service.
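To make the idea of grouping alerts into incidents concrete, here is a minimal Python sketch. It is not how PagerDuty Event Intelligence works internally, which the case study does not describe; it simply assumes that alerts for the same service arriving within a fixed time window belong to one incident. The `Alert` and `Incident` types and the 300-second window are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    service: str      # affected service (hypothetical names in the usage below)
    timestamp: float  # seconds since some reference point
    summary: str


@dataclass
class Incident:
    service: str
    alerts: List[Alert] = field(default_factory=list)


def group_alerts(alerts: List[Alert], window_s: float = 300.0) -> List[Incident]:
    """Group alerts per service; a gap longer than window_s starts a new incident."""
    incidents: List[Incident] = []
    latest: Dict[str, Incident] = {}  # most recent incident for each service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if current is not None and alert.timestamp - current.alerts[-1].timestamp <= window_s:
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            latest[alert.service] = current
            incidents.append(current)
    return incidents
```

With this rule, a burst of alerts from one failing service collapses into a single incident, which is the kind of reduction Rotta describes, although the real product may group on richer signals than service and time.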
Benefits With PagerDuty

Since implementing PagerDuty, CTC has seen several benefits, including:

- Reduced alert fatigue and improved incident response with PagerDuty Event Intelligence
- Faster mean-time-to-acknowledge/mean-time-to-respond (MTTA/MTTR) across all critical systems and services
- Improved day-to-day incident management and the ability to automate the hand-off of incidents from shift to shift
- An open line of communication with senior traders on the floor that escalates incidents to on-call managers across time zones when needed
- Seamless incident management experience for 24x7 applications running on AWS

PagerDuty also helped support CTC’s business continuity strategy. “In this new, remote environment, employees can feel disconnected from what’s going on, and we’re trying to solve that with PagerDuty. Almost everyone at the company is on the PagerDuty platform, whether they’re a stakeholder or a full user,” shared Rotta.

Future Looking

CTC plans to continue expanding its use of PagerDuty across the organization. For example, the company has decided to focus more on metrics to inform future actions, so Rotta’s team is looking into Operational Reviews, as well as PagerDuty Analytics and Intelligent Dashboards, to help better understand team health and the business impact of incidents, measure SLAs, and gain the ability to seamlessly share metrics with executive leadership. “This could help drive decisions around what applications we need to invest in,” explained Rotta.

Additionally, while CTC already has all of its major business services set up in Status Dashboards, the company is looking to extend its use across the company by providing executive leadership improved visibility into the status of an incident or a service. As the PagerDuty platform grows with CTC, Rotta and his team look forward to extending the platform’s functionality across other parts of their infrastructure. “I like that it’s simple.
I don’t have to manage anything because it just does its job,” he shared. To learn how PagerDuty can help your team make things simple and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.
Global Technology Company Uses PagerDuty to Navigate Real-Time Operations
This global technology company specializes in mapping and navigational technologies for both corporations and consumers worldwide, and has expanded its product suite into a hybrid software/hardware model, offering cloud services. A Senior Project Manager who manages the Site Reliability Engineering team explained, “Focusing less on personal navigation devices and more on building upon cloud services has opened the door to product opportunities we never thought possible.”

Challenges Faced

But full-scale digital transformation is complex, and the company needed to ensure that its engineering teams were immediately notified of system outages and incidents. Before PagerDuty, the company’s engineering organization ran into several challenges, including:

- High mean-times-to-respond due to bottlenecks in incident response and management processes that used legacy, home-grown tools
- Delays in incident acknowledgement and emergency communication due to on-call staff needing to share a physical pager
- Lack of accountability and ownership in services and applications because of siloed workflows and tool sprawl across engineering teams
- Difficulty in scaling a custom-built paging tool that worked sporadically to notify individuals on call
- Incident management processes not embedded into the engineering infrastructure, which led to manual dependencies and a lack of communication to key stakeholders
- Language barriers that made it difficult for Facilities staff around the world to communicate issues in their buildings through the main, English-speaking Help Desk number

Benefits With PagerDuty

With PagerDuty’s ecosystem of over 500 integrations, the company integrated its entire tool stack into one single point of ingestion to improve visibility into the health of its infrastructure.
PagerDuty’s integrations with Slack, JIRA, AppDynamics, Prometheus, Nagios, and Terraform enabled the company’s teams to gain the visibility and actionable insights they need to proactively address incidents from a centralized platform. Since implementing PagerDuty, the company has seen many benefits, including:

- A culture of accountability that encourages full-service ownership throughout the developer organization, leading to improved code quality
- Improved cross-team coordination in terms of tool standardization and orchestrating a holistic response
- Resolution times reduced by more than 75% across major incidents and events

“PagerDuty helps us gain better insight into what’s going on within our services and adds complete visibility. Before, we had too many silos. PagerDuty helped us break them down and ultimately centralize our operations,” said the Senior Project Manager.

Localizing Support for Facilities Teams Worldwide Using Live Call Routing

Another benefit to using PagerDuty was helping the company’s distributed Facilities teams with urgent issues. Prior to PagerDuty, Facilities teams across the globe would call into an English-speaking Help Desk number, where language barriers delayed response to urgent issues such as plumbing leaks or electrical outages. By leveraging PagerDuty Live Call Routing, the company enables calls to be routed to a local number where the Facilities staff can immediately discuss urgent issues in their own language with the appropriate people on call.

Mapping for the Future

As the engineering organization continues to invest in cloud services and infrastructure automation, they plan to further embed PagerDuty into their digital environment and use Event Intelligence, which combines human behavior and machine learning data, to help broaden their understanding of incidents.
The company also plans to use PagerDuty’s integration with Terraform to automate response plays within the infrastructure as code and expand it across the greater developer organization. Additionally, the company plans to integrate PagerDuty with ServiceNow to further centralize ITSM and digital operations workflows. As the Senior Project Manager explained, “I see PagerDuty as a digital partner. The platform’s ability to create opportunities for team collaboration is invaluable to our digital operations environment.” Other teams are continuing to expand their use of PagerDuty, including the Security, NOC, Facilities, and Engineering teams. The engineering teams are also looking into PagerDuty Stakeholder Notifications as part of their emergency response processes to improve visibility and ensure all teams are adequately prepared for future events that could disrupt business as usual, like the COVID-19 pandemic. “The more teams we add into the PagerDuty environment, the more prepared we are to tackle future emergencies that could affect day-to-day operations.” To learn more about how PagerDuty can help transform your team’s digital operations, sign up for a 14-day free trial.
Tokopedia Automates Incident Response and Sees Greater Engineer Accountability With PagerDuty
Indonesian technology company Tokopedia is one of Southeast Asia’s largest marketplace businesses, with 100+ million monthly active users and 9+ million merchants on the site. Tokopedia prides itself on being more than just a marketplace, offering technology that empowers millions of merchants to participate in eCommerce. Rajesh Gopala Krishnan is Tokopedia’s AVP of Engineering Productivity and executes the platform’s shared technology and services vision. “Tokopedia’s mission is to democratize commerce through technology,” he explained. “We help small retailers to become big brands, allowing them to reach a more diverse customer base and make it easier for them to do business across Indonesia and beyond.”

‘Born digital’ in 2009, Tokopedia dedicated itself to digital transformation two years ago when its customer base expanded rapidly. Tokopedia modernized its technology stack, shifting from monolithic infrastructure to a microservices-based, multi-cloud architecture, running 350+ services.

Increasing Complexity Leads to Slower Incident Response

However, this shift to a more dynamic, scalable architecture made it difficult for Tokopedia’s in-house incident management tools to keep up with alerts and for its teams to respond effectively. This meant incident response was taking longer and kept engineering resources away from improving the customer experience and building new services for merchants and customers. Tokopedia also experienced a high volume of alert noise, making it difficult to prioritize incidents.

“Our tools were identifying incidents, but addressing them was taking too long,” explained Krishnan. “Most usually took 30 minutes to resolve because we were manually looking up who was responsible for a particular service before notifying engineers and setting up war rooms to address the issue.
We soon realized we needed a modern, automated incident response process to gain visibility into this complex environment, which is why we turned to PagerDuty.”

Automating Incident Response With PagerDuty

Since adopting PagerDuty, Tokopedia has been able to automate its incident response processes and reduce the time it takes to resolve incidents. After initially integrating PagerDuty with five services, Tokopedia saw dramatic improvements in metrics such as mean time to repair (MTTR) and decided to scale up the deployment to all 350+ services.

Additionally, PagerDuty has helped to reduce alert noise. “Instead of being bombarded with alerts, PagerDuty groups related alerts into one single incident, with all the details in one place rather than scattered across multiple tools. This not only reduces alert noise, but also helps us prioritize the most urgent incidents,” Krishnan shared.

Tokopedia’s investment in digital transformation and modern incident response also meant it was well-prepared to deal with peaks in demand following the COVID-19 pandemic in Southeast Asia. “By moving to the cloud and adopting PagerDuty, we’ve been able to gain greater control over the number of incidents we face. This was particularly crucial during the surge in online shopping we experienced during the COVID-19 outbreak and meant we could respond to incidents faster to ensure minimum disruption for sellers and shoppers.”

Closing the Accountability Gap

PagerDuty has also helped Tokopedia embrace full-service ownership and foster a culture of responsibility, something it had previously struggled to do with its in-house incident management tools. As Krishnan explained, it was often unclear who should respond to an incident when it came in. “What was missing was accountability: who is responsible for this service or application? Have they seen there is a problem and are they working to solve the problem?
We didn’t have a very clear picture of this.”

On-call engineers were also carrying additional phones for teams to reach them on when an alert came in. But even then, getting a hold of the right people was tricky because there was no centralized way to manage escalations. “With PagerDuty, we’ve been able to eliminate manual incident response processes. Instead, when an alert comes in, we are automatically routing incidents, based on our escalation policies, to whoever is responsible for a particular service,” Krishnan explained.

Benefits With PagerDuty

After implementing PagerDuty, Tokopedia has gained greater insight and control over incidents in its environment, with benefits including:

- Greater accountability among engineering teams
- Reduced alert noise
- Faster incident response times
- An increase in software updates from 10 to over 300 per day as team productivity rises through the use of automation

“Since adopting PagerDuty, our engineers have been spending less time on incident response. Instead, they’re able to focus on improving the customer experience, understanding what our merchants and customers want, and how they’re using our services,” Krishnan explained. “With PagerDuty’s support for automation, engineers are also far more productive. We’ve increased daily software deployments by 3,000%.”

Future Looking

Looking ahead, Tokopedia will continue to expand its use of PagerDuty. Part of this involves monitoring the performance of new features before deployment to identify problems before they go live in the production environment. Additionally, as Tokopedia continues to adopt automation across the software delivery cycle and build applications that can self-heal, PagerDuty will have a vital role to play in creating workflows and runbooks to prevent, diagnose, and resolve incidents without needing to escalate them to an expert.
To learn how PagerDuty can help your team make things simple and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.
Pantheon Partners With PagerDuty to Meet Customer Support SLAs
Pantheon is a WebOps (Website Operations) platform that provides the most complete and creativity-enabling platform for professional website creation. Over 300,000 websites and thousands of marketing and development teams trust Pantheon to provide a first-class experience and deliver positive business results. For example, Pantheon’s customer Patch.com constantly pushes out new content and has 50 million pageviews a month with 99.969% uptime—all through Pantheon’s platform. Sarah German, a Support Manager at Pantheon, leads a team of support agents to help ensure that Pantheon’s clients are guaranteed service around the clock. To help meet SLAs, Pantheon uses PagerDuty to ensure customer support tickets are resolved in a timely manner. “We pride ourselves in continually progressing our support function to provide the best support possible to our customers,” explained German. In Need of an On-Call Platform Before Pantheon adopted PagerDuty, emergency support tickets were received and triaged in the same manner as non-urgent tickets. The Support team did not have a mechanism to automate escalation and guarantee a response time within the SLA. Without a clear incident management platform or process, Pantheon Support experienced challenges, including: Difficulty providing 24/7 customer support Problems escalating an alert during weekends or after hours Trouble distinguishing and prioritizing between major and minor incidents within the Customer Support organization Meeting resolution time SLAs for customers “We were a small team supporting our customers 24/7. We needed a way to automatically escalate incidents to ensure we never missed an alert,” said German. Incident Management for Customer Support The customer support organization turned to PagerDuty to fix their on-call and support escalation issues. 
Now, each member of the team, including managers and directors, is categorized into specific on-call severity levels: Level 0: Basic, non-technical customer support dispatch; Level 1: Customer Support Agent; Level 2: Customer Support Manager; Level 3: Director of Customer Support. Incidents are escalated through this severity model until the issue is resolved. This process allows the customer support team to meet their SLAs surrounding support for customers. “We have SLAs around response time that are really important to uphold. PagerDuty gives our team that extra level of accountability,” explained German. PagerDuty also helped Pantheon build an automated customer support ticketing process using PagerDuty’s ecosystem of over 500 integrations. Customers can call a support line, set up by Pantheon, that uses PagerDuty to open up a Zendesk ticket with information pertaining to the issue. Then, PagerDuty sends that information directly into a dedicated Slack channel so the right team member can acknowledge the issue within minutes. Pantheon’s top levels of customers using Diamond or Platinum Support (their highest tiers of customer support) have access to a button in their Pantheon platform that was made available with the implementation of PagerDuty. Premium Support customers are guaranteed a 15-minute-or-less response time from a Pantheon customer support agent, and major incidents are typically acknowledged within two minutes. Essentially, if a customer’s site goes down, they can jump into the Pantheon platform and click this button as an emergency alert to Pantheon support. PagerDuty then automatically notifies a technical engineer, who can acknowledge and begin working on the incident within minutes. “With the confidence that PagerDuty provides, our customers and leadership team can trust in our promise to take action on incidents immediately,” explained German.
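The severity model described above (Level 0 through Level 3) can be pictured with a minimal Python sketch. This is not Pantheon's actual configuration or PagerDuty's API: the function name and the five-minute escalation interval are hypothetical, chosen only to show how an unacknowledged incident moves up the tiers named in the case study.

```python
# Tiers as listed in the case study, lowest to highest.
LEVELS = [
    "Level 0: Basic, non-technical customer support dispatch",
    "Level 1: Customer Support Agent",
    "Level 2: Customer Support Manager",
    "Level 3: Director of Customer Support",
]

# Assumed value for illustration; the case study does not state a timeout.
ESCALATE_AFTER_MINUTES = 5


def level_on_call(minutes_unacknowledged: int) -> str:
    """Return the tier that holds the incident after a given wait.

    Each full timeout interval without acknowledgement moves the incident
    up one tier; it stays at the top tier once that tier is reached.
    """
    step = minutes_unacknowledged // ESCALATE_AFTER_MINUTES
    return LEVELS[min(step, len(LEVELS) - 1)]


if __name__ == "__main__":
    for waited in (0, 5, 12, 60):
        print(waited, "min ->", level_on_call(waited))
```

The point of the sketch is that escalation is a fixed, ordered path, so no incident depends on one person noticing it.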
Pantheon has seen several improvements with PagerDuty, including: meeting and exceeding SLAs for MTTA; the ability to distinguish emergencies from regular support issues and automatically escalate incidents to resolve them quickly; increased reliability around providing 24/7 support to all customers with an end-to-end incident management platform; and visibility into platform-level issues and digital operations health. “With PagerDuty, everything happens faster. When an incident comes in, it is immediately acknowledged and assigned to a support agent and can be escalated properly, if necessary,” explained German. A Data-Driven Future for Customer Support Looking ahead, Pantheon’s customer support organization is looking to expand its use of PagerDuty with data-driven solutions like PagerDuty Event Intelligence and Analytics. With these add-ons, German’s team can bring more value to Pantheon customers, including better understanding of the types of incidents that are occurring and actively reducing noise with the use of artificial intelligence and machine learning. Pantheon’s Engineering teams are also using PagerDuty and plan to look into how they can leverage these solutions to improve their digital operations. To learn more about how PagerDuty can help your business with providing a streamlined customer support process, try PagerDuty today.
Glovo Delivers a Consistent Customer Experience With PagerDuty
Glovo is a Barcelona-based startup and the fastest-growing delivery player in Europe, Hispanic America and Africa. With food at the core of the business, Glovo delivers any product within its coverage areas at any time of day. The company currently delivers over 100M orders annually and operates in over 400 cities in 22 countries. To reach that goal, Glovo needs to ensure that, as the delivery service continues to grow, the backend infrastructure and core applications continue to scale alongside it. Joan Martinez, Engineering Manager of Infrastructure and Security at Glovo, is largely responsible for overseeing the reliability and scalability of Glovo’s infrastructure. “Our core responsibilities as part of the Engineering team are to 1) ensure critical systems are reliable and scalable and 2) support the growth of the Engineering team by providing tooling and improving the feedback loop to create more autonomy across the organization,” shared Martinez. To ensure a reliable experience for their users and customers, Glovo needed to rethink their incident management process for the entire organization. Challenges Faced When Martinez joined Glovo, PagerDuty was already implemented as its incident management platform; however, with a team of roughly 60 engineers that was growing rapidly, there was only one on-call responder responsible for the entire platform.
The team faced several challenges as a result, including: difficulty onboarding new responders to the platform; lack of ownership across critical systems and services; increased time-to-detect due to a lack of monitoring and observability across systems and services; higher mean-time-to-recover because of a one-engineer on-call rotation; and poor visibility into the health of the infrastructure for key stakeholders, such as service owners, users, partners, and executive leadership. Getting Everyone On the Same Page With PagerDuty In order to include more teams in the on-call rotation, Martinez’ team adopted principles of DevOps by integrating PagerDuty into their incident management process. This included breaking down on-call rotations by service, ensuring all teams had someone on call that was knowledgeable about a given service or application, and empowering service ownership across the Engineering organization. “PagerDuty really allowed us to adopt DevOps practices, and really build upon and improve our existing processes, rather than ripping and replacing everything,” explained Martinez. To further improve stakeholder communication and consistently deliver a perfect customer experience, Glovo uses PagerDuty Modern Incident Response across the organization. “Typically, when you work an incident, you just focus on solving it and communication is not a high priority,” explained Martinez. “But with PagerDuty, we can just automate response plays and automatically notify key stakeholders about the status throughout the course of an incident. It’s a really important benefit for our organization.” Focusing On Integrations and Team Health PagerDuty has also played an important role in helping Glovo centralize its technology stack to improve communication and collaboration across different solutions. With PagerDuty’s Slack integration, teams are able to trigger, respond to, and resolve incidents all within the Slack application.
Glovo also uses PagerDuty’s Datadog integration to help centralize the majority of its monitoring data onto the PagerDuty platform. Additionally, Martinez’ team leverages PagerDuty’s integration with Jira to automatically create a ticket when incidents are triggered. “This integration allows our team to focus solely on the incident rather than the backend work and ticket creation process,” Martinez said. Glovo also uses PagerDuty Analytics, which gives teams an in-depth look at metrics behind the incident management process and allows managers to better understand technical issues and quantify team health from a process perspective. With PagerDuty Analytics, managers now have better insight into responder health, which helps them ensure on-call engineers aren’t being overwhelmed with on-call tasks and getting burnt out. Benefits With PagerDuty By using PagerDuty for their digital operations, Glovo has seen several benefits, including: Improved stakeholder communication and increased automation into the incident response process with PagerDuty Modern Incident Response The ability to acknowledge, troubleshoot, and resolve incidents from anywhere with the PagerDuty mobile app Improved visibility across tools and solutions with PagerDuty’s ecosystem of 350+ integrations Better system reliability across mission-critical services, which improved the user and customer experience and reduced mean-time-to-resolve Increased visibility into technical areas of improvement and team health with PagerDuty Analytics “Our teams love PagerDuty because we just know it’s reliable and we can depend on it,” shared Martinez. Future Looking Glovo is laser-focused on building upon DevOps principles and continuing to expand the use of PagerDuty across its infrastructure as the company scales its services. Martinez also plans to use PagerDuty’s integration with Terraform to remove a lot of the manual work and help build response automation into the team’s existing processes. 
Additionally, the team is looking into PagerDuty Event Intelligence to see how AIOps and automation can help uplevel digital operations and incident management capabilities. Curious to learn how partnering with PagerDuty can help your company scale to meet the needs of your customers? Contact your account manager and sign up for a 14-day free trial today.
SundaySky Uses PagerDuty to Proactively Resolve Customer Support Incidents
Founded in 2006, SundaySky is a technology company that aims to transform the customer experience for Fortune 500 companies by delivering video-powered experiences at critical moments along the customer journey that engage, educate, and inspire customers. Since its inception, SundaySky has delivered billions of video-powered experiences to its customers around the world. Ran Geller, Director of Customer Support, is responsible for overseeing SundaySky’s global support operations and providing ongoing support for the video experience platform. Since the company emphasizes delivering video experiences to customers in real time, Geller’s team adopted PagerDuty to ensure that potential issues are addressed before customers are impacted. “The worst thing that can happen is if a customer tries to click on one of our client’s video experiences and the video won’t play. We do everything we can to make sure this doesn’t happen,” explained Geller. Overwhelmed by Alert Noise Before PagerDuty, SundaySky used a third-party network operations center (NOC) agency to monitor services and send email alerts to Geller’s team when issues arose. In the early days, this process was manageable, but as the company grew and the number of events increased, the email alerts began to overwhelm the customer support team, and alerts left unacknowledged could negatively impact the customer experience. For instance, alerts would come in at all hours and sometimes be duplicated, which made it difficult for Geller’s team to differentiate between new alerts and duplicate alerts stemming from the same incident. To combat this, Geller and his team created an exception list, which notified the NOC agency about the issues to ignore.
This exception list was helpful, but it also created some additional communication issues between Geller’s team and the NOC—for example, sometimes the on-call customer support agent was falsely called by the NOC team in the middle of the night due to an oversight caused by running manual procedures. It was apparent that Geller’s team needed a more comprehensive incident management solution as they continued to scale their operations along with the company’s growth. SundaySky also experienced other challenges, including: Slow incident resolution times due to the inability to separate signals from the noisy alerts and the absence of automated escalation policies Difficulty communicating with the NOC agency, which negatively impacted SundaySky’s incident management process Lack of visibility into the status of open issues “The duplication of alerts coming in at all times of the day became overwhelming. We needed to find a solution that would allow us to spend our time building tools rather than trying to respond to duplicate alerts,” shared Geller. Shifting From Reactive to Proactive To help reduce the noise, SundaySky turned to PagerDuty to streamline the incident management process. Within a month, SundaySky was able to implement PagerDuty into its customer support stack, eliminating the need for the outside NOC agency, which helped reduce operational costs for the company. Today, Geller’s team uses PagerDuty’s Zendesk integration to track customer support-related incidents. When an incident comes in through PagerDuty, a support ticket is opened in Zendesk and the incident information is filled into the ticket, helping the team track incidents specific to customer support. SundaySky’s customer support organization also takes advantage of PagerDuty’s ecosystem of 350+ integrations by utilizing the Splunk, AWS CloudWatch, and Jenkins integrations, in addition to working with PagerDuty’s API. 
“With all of our monitoring tools connected to PagerDuty, we have been able to proactively resolve 70% of incidents coming in by automatically escalating incidents to the right person to resolve in a timely manner.” - Ran Geller, Director of Customer Support. SundaySky has seen several benefits with PagerDuty, including: Faster response and turnaround times with PagerDuty’s automated alerts and escalation policies Increased transparency into the digital health of the customer support organization with the ability to track incidents directly in the PagerDuty platform The ability to quickly scale operations now that they can focus on developing better monitoring tools and only get alerted when critical issues arise Prioritization of important alerts and a reduction of noise with PagerDuty Event Rules "Because of PagerDuty, we are able to develop our monitoring tools further and scale our monitoring capabilities without fearing that it will overload the team or impact the monitoring process," explained Geller. Continuing to Scale Looking ahead, Geller and team hope to continue to increase the percentage of incidents that are proactively dealt with. Additionally, he plans to utilize PagerDuty’s bi-directional integration with Zendesk so alert information is automatically populated in both platforms. Automating this workflow and injecting PagerDuty’s data and insights will allow Geller’s team to continually move towards proactive incident management. “PagerDuty allows our teams to proactively resolve incidents without being swamped by emails, saving us time and creating a much cleaner process,” shared Geller. In addition to the customer support organization, SundaySky’s engineering team has also adopted PagerDuty to help centralize communications between the two teams. “PagerDuty has allowed our teams to gain visibility into the health of our incident management processes. 
The documentation of it all inspired me to display the PagerDuty platform on one of our monitors in the office so everyone can see exactly what needs to be done,” shared Geller. PagerDuty helps organizations scale, even as the complexity of modern customer service increases—so customer service teams can process more tickets across multiple regions, time zones, and support channels. To learn more about how PagerDuty can help your business with digital operations management, including customer service, try PagerDuty today.
Ecobee Improves Team Health and Productivity With PagerDuty
Founded in 2007, ecobee is a Canadian home automation company that builds Wi-Fi enabled thermostats for residential and commercial applications to help users maximize comfort, reduce their carbon footprint, and save money. Behind the curtains of this easy-to-use product are continuous deployments of mission-critical applications and services, a regionally distributed infrastructure, and self-healing server clusters that operate to maintain and keep the services online for their global customer base. Jordan Christensen, VP of Technology at ecobee, is responsible for the company’s platform infrastructure, including automation, self-healing, and end-to-end service delivery and availability. “My team’s overall mission is to build reliable, fault-tolerant infrastructure, and PagerDuty really is the critical platform we use to measure and monitor this reliability,” he explained. Challenges Faced Because ecobee’s premier product is responsible for temperature control in millions of residential and commercial buildings, its services need to always be online and available for users. A minor blip or application failure can lead to lost revenue—so minutes matter when it comes to getting ahead of and responding to potential incidents before they impact customers. In order to provide the best customer experience for its users, ecobee needed to approach incident management from a proactive and preventative angle. To do this, its engineering teams needed a platform that would enable real-time visibility across ecobee's entire infrastructure and services. Infrastructure as Code With Terraform Jordan’s platform team relies heavily on PagerDuty’s Terraform integration to build their PagerDuty instance into the greater infrastructure as code. By building PagerDuty into Terraform, teams can better understand the real-time health of their infrastructure and enable full visibility into on-call rotations and schedules, as it is all defined as code within the Terraform environment. 
This technique enabled teams to cut out the manual work of on-call management and create opportunities for automation in terms of maintaining on-call rotations and schedules between different applications and services. “Having PagerDuty embedded into the infrastructure as code rather than a disparate interface makes it a central piece of the infrastructure rather than hanging off as an ancillary service,” explained Jordan. This improved visibility and the ability to manipulate code within ecobee's codebase empowers his teams to truly understand the health of the infrastructure when incidents inevitably occur. With the help of this integration, the ecobee team is gradually working towards four 9s in terms of uptime and availability. The Benefits of PagerDuty With PagerDuty, ecobee is able to proactively work incidents collaboratively and be fully enabled on the context of the incident at hand. “The insights are pointed and specific, not generic,” recalls Jordan. Centralizing all of the signals from every container, server, application, and microservice in PagerDuty makes it easy for his teams to be able to diagnose issues and automatically engage the right people to remediate the issue before it impacts the customer. 
Jordan’s team has seen several benefits from PagerDuty, including: Enhanced visibility and communication between Engineering teams and other key stakeholders throughout the incident management lifecycle An automated response process, which decreases manual work and improves productivity and work-life balance Teams that are empowered to make changes for the better due to the practice of accountability and full-service ownership The creation of a safe space for junior engineers to escalate incidents and work directly with senior engineers when they feel they need guidance, contributing to improved team health A Heavy Focus on Team Health With PagerDuty implemented across the entire Engineering organization—along with other key business units and stakeholders—Jordan noted that leadership has been able to put a strong focus on team health, work-life balance, and creating opportunities for growth among junior engineers. “With PagerDuty, employees feel safe being on call because they know they can escalate issues to senior developers to provide guidance and walk through the issue to solve it,” explained Jordan. Minor incidents often turn into learning opportunities, which boosts morale and team health among the organization. “If we didn’t have PagerDuty, it would be extremely difficult to execute proper incident management and response as a company.” - Jordan Christensen, VP of Technology The Future of PagerDuty With ecobee Ecobee plans to continue its use and expansion of PagerDuty across the greater organization. Specifically, the engineering teams want to learn to better leverage PagerDuty Modern Incident Response so they can implement response plays for particular services and automate certain tasks within a response action. The teams also plan to leverage PagerDuty’s Slack integration to centralize communications and improve collaboration across teams during major incidents. 
Additionally, ecobee would like to formalize a postmortem build-out within its PagerDuty instance in order to centralize the entire incident lifecycle onto one platform. Jordan’s team is also looking to harness the full ability of the PagerDuty REST API to encourage automation and build business efficiencies across the rest of the organization. “We haven’t even begun to scratch the surface of what we can truly accomplish with PagerDuty,” explained Jordan. Interested in learning more about how PagerDuty can improve your team’s health and incident management process? Sign up for a two-week free trial today!
Mary’s Meals Serves Up PagerDuty to Maintain Its Mission
Mary’s Meals, a nonprofit organization based in Scotland, provides critical meals to more than 1 million children each day through school feeding programs in some of the world’s most under-resourced communities. Through the nonprofit, a donation of only $21 can feed one child for an entire year. Mary’s Meals manages employees and services in 19 countries across the world to feed nearly 1.7 million children daily. As IT Infrastructure Lead, Stephen Neil is responsible for leading a distributed tech team of 10 to keep all of Mary’s Meals’ digital operations up and running, which includes managing the website, applications, and all technical monitoring at a global scale. Connectivity Challenges in Remote Locations Traditionally, the organization’s program provided one vital meal daily directly to schools in poverty-stricken communities. However, with COVID-19 closing most schools across the world, Mary’s Meals had to adapt. “Within a matter of weeks, our program’s delivery model was impacted globally. We had to start looking at alternatives because the need [for child nourishment] didn’t go away,” explained Neil. Now, the organization is delivering meals directly to the communities served by the schools. With this change, Mary’s Meals faced new challenges. Many of the countries they serve lack key internet infrastructure to communicate effectively, making oversight difficult. “Being able to detect where we have communication failures in a timely manner is really critical to our operations,” Neil shared.
Mary’s Meals needed help with: Maintaining communication in areas with minimal internet access Alerting the correct team member when connectivity in remote areas fails Aggregating alerts from multiple monitoring systems in one place Using PagerDuty to Stay Connected—and More Through PagerDuty.org, Mary’s Meals was able to leverage PagerDuty's technology and benefit from Impact Pricing, which provides 10 free PagerDuty licenses to support nonprofits and other social enterprises in delivering on their critical missions. After implementing PagerDuty, Neil quickly saw that there were numerous use cases where PagerDuty could benefit Mary’s Meals. For example, after visiting an on-site location in Liberia, Neil quickly realized that the organization’s vehicle tracking software could be integrated with PagerDuty so that when a vehicle gets stranded somewhere, employees can press a panic button to alert rescue responders immediately via SMS notifications. “We can couple PagerDuty with an existing solution that we had already implemented on the ground to provide a higher degree of alerting on system events locally,” explained Neil. This enabled Mary’s Meals to protect company assets and provide more reliable security in remote locations, saving valuable resources that could be used to feed as many children as possible. Using PagerDuty, Mary’s Meals can now: Send alerts via SMS instead of relying on unstable internet connections Send alerts to the right person quickly when a connection drops or web services are faulty Integrate with monitoring tools to have all alerts in one place Looking Ahead Going forward, Mary’s Meals plans to scale PagerDuty across other departments and implement new use cases, such as using GPS tracker information from satellite phones to alert on potential personnel safety issues when traveling and to alert security officers to important field safety reports. 
Mary’s Meals is beginning to manage services in AWS and Azure and will be adding integrations with AWS CloudWatch and Azure Alerting to get application insight, log events, and autoscale notifications. To learn more about PagerDuty’s Impact Pricing and whether your nonprofit organization is eligible, check out https://www.pagerduty.com/foundation-eligibility/.
Australian Bank Supercharges Deployments and Automates Compliance Measures with PagerDuty
As this large Australian financial institution expanded offerings and added new customers, the DevOps team felt pressure to keep up. The team is responsible for supporting 7 non-production environments and 65 production applications, and receives daily changes from vendors and developers equating to over 500 deployments per month. They realized that in order to quickly push more products and maintain the excellent customer experience they are known for, they needed to expand their use of automation. The team was already using Rundeck Community in a limited capacity to execute single commands. The Platform Engineering team recognized the value they experienced from runbook automation and decided to expand their use. They upgraded to PagerDuty Process Automation to take advantage of the Enterprise Support and smart workflows. PagerDuty enabled the bank to overhaul its release processes, and strengthen its security and compliance posture. The Platform Applications Manager explained, “As a bank, it's critical that we have Enterprise support on the applications that we depend on. PagerDuty Process Automation has become more and more the lifeblood of what we do.” One-Click Deployments Through Process Automation Before implementing Process Automation, the deployment process was lengthy and largely manual. Now they have turned daily deployments and major releases into one-click operations. For daily deployments, the agile delivery teams hand application updates to DevOps to package for deployment into production. Under the bank’s change control practices and separation of duties, DevOps does not touch production. They build deployment automation in staging, test it, and pass it to the Operations team to execute— all via PagerDuty. The Platform Applications Manager shared, “The beauty of using Process Automation is that to the Operations team, the deployment is identical no matter which platform it comes from. We do 500-600 deployments per month. 
Those kinds of numbers are not possible without good quality automation.” Major releases are scheduled every 3 months, taking the bank offline for maintenance. Before PagerDuty, it took up to 40 minutes to shut down the systems by logging into servers and stopping applications one-by-one. Now, they can do this with the click of a button; the exercise takes 85% less time, and they avoid potential mistakes from manual processes. The bank has experienced significant time savings by automating deployments, saving them about 30 minutes per deployment and approximately 250 hours of engineering time per month. “PagerDuty is the control plane of the ability to meet our business deliveries,” said the Platform Applications Manager. Self-Service in a Secure Environment Process Automation has strengthened the security of the bank’s systems, as secrets such as login credentials do not need to be widely shared. Access to workflows is controlled by user roles, and all actions are logged by PagerDuty in addition to server logs. With improved access control and authentication management, DevOps implemented self-service operations. The QA team was granted self-service access to common testing tasks. Before, the QA team would require data from DevOps during the testing process. When a request surfaced, someone in DevOps would stop what they were doing, log into a service, find the data or run the scripts needed to gather that information, and send it back to the QA team. The Platform Applications Manager explained, “The safe, self-service capability in Process Automation prevented our testing team from waiting 30 minutes to 4 hours for a team member to have capacity to execute a task. I estimate that this saves us around 20-30 hours a month of unnecessary delays to our testing schedules.” Automating Compliance in a Highly Regulated Industry The proven repeatability of running workflows in PagerDuty has streamlined the change request process.
Depending on the risk, changes to production need to go through the bank’s change request board. However, if there is an existing automation in PagerDuty that has been approved and run previously, this lowers the risk rating because there's far less uncertainty. A manager is able to make these approvals faster and with more confidence, further streamlining deployments. Audits are routine in the highly regulated FinServ industry and the bank uses PagerDuty to automate parts of the process. Every runbook automation job captures the necessary information for audits into the system used to track production applications. During the yearly audit, they can pull up every production deployment, show what changes were made and by whom, what time the changes were made, and what changes it was related to. “PagerDuty has helped immensely with compliance reporting because we can demonstrate consistent processes. Audits used to take two to three weeks of manually gathering information, finding change records, and deployment records. By using Process Automation, we can provide this data in under a day,” said the Platform Applications Manager. What’s Next? The bank will continue modernizing its technology stack and is planning a major transformation from traditional data centers to the cloud. The Platform Engineering team sees Process Automation as the main mechanism for deploying applications to the cloud, accelerating their speed-to-market. To learn how PagerDuty Process Automation can help you automate and delegate business and IT processes, contact your account manager or request a demo.
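The time savings quoted in this case study can be checked with simple arithmetic. The Python sketch below uses only the figures given in the text (a 40-minute shutdown taking 85% less time, about 30 minutes saved per deployment, and 500 to 600 deployments per month); the function names are illustrative and not part of any PagerDuty tooling.

```python
def reduced_duration(minutes_before: float, percent_less: int) -> float:
    """Duration after a task takes `percent_less` percent less time."""
    return minutes_before * (100 - percent_less) / 100


def monthly_hours_saved(deployments: int, minutes_saved_each: float) -> float:
    """Total engineering hours saved per month across all deployments."""
    return deployments * minutes_saved_each / 60


if __name__ == "__main__":
    # Shutdown for a major release: up to 40 minutes before, 85% less after.
    print(reduced_duration(40, 85), "minutes")
    # About 30 minutes saved on each of 500 to 600 monthly deployments.
    print(monthly_hours_saved(500, 30), "hours at 500 deployments")
    print(monthly_hours_saved(600, 30), "hours at 600 deployments")
```

At the lower bound of 500 deployments, the result matches the approximately 250 hours per month cited in the text; at 600 deployments the same per-deployment saving would come to 300 hours.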
Claranet Partners With PagerDuty to Achieve Real-Time Operations
Founded in 1996, Claranet is an IT Service Management company that provides network, hosting, and managed application services to organizations around the world. With customer experience as the centerpiece of its company mission, Claranet helps bridge the technology gap for its customers by delivering tooling, automation, and IT services so they can focus on innovation while continuing to work on in-house development and maintenance. Andrew Rundle, a Principal Engineer at Claranet, is part of the Group Engineering team that oversees Claranet’s infrastructure and operations services, specifically around hosting within its own data centers and the public cloud. Responsibilities for his team range from deploying servers and containers to managing the application experience and DevOps processes for their customers. “Our team’s goal is to reduce our customers’ costs and help them build a more efficient operation while also introducing new technologies, products, and services,” explained Rundle. A Growing Network Brings Growing Pains Claranet went through a phase of rapid growth stemming from several business acquisitions and almost tripled its employee count over a few years. This growth led to the addition of several new IT teams to Claranet, as well as an influx of new customers, applications, and tools to support. 
This internal and external growth, coupled with incorporating new operating models into existing IT processes, created some new challenges, including: responder burnout stemming from unbalanced on-call schedules and rotations; difficulty maintaining SLAs with customers because of communication issues driven by the influx of new teams and technologies; technology sprawl from adding new teams, tools, and services to the organization; delays in acknowledging support calls, which negatively impacted MTTR and reporting capabilities; and inefficiencies due to monolithic monitoring systems, manual processes, and siloed workflows. Because of the growth in new customers, products, and services, Claranet’s Group Engineering team needed an end-to-end incident management platform to properly acknowledge, respond to, and resolve incidents before they negatively impacted both internal and external customers. “Our teams were getting calls four or five times a night during off-hours for one product. This was causing response delays, fatigue, and frustration for our team. Some of our engineers were leaving because the existing model just wasn’t sustainable,” shared Rundle. Automating the Manual Work Before PagerDuty, Rundle’s teams were using local Network Operations Center (NOC) resources to field incoming alerts, which was a manual process that relied on multiple human interactions before an incident reached the designated responder. Some of these teams and regions had centralized NOCs, while other regions took a DevOps and SRE approach to engineering operations, leading to a HybridOps model within the company. As a result, teams found it difficult to break down silos and ensure a degree of standardization and technology adoption across their monitoring stack. 
Resources were getting exhausted by the influx of calls and the local NOCs weren’t properly escalating alerts to the Group Engineering team as they came in, because they weren’t fully aware of the severity of the incidents that the alerts were associated with. “NOC teams would receive off-hour alerts and not notify our team until the following morning, which became problematic when more severe incidents within our services occurred,” shared Rundle. The reliance on manual processes and human interaction created a bottleneck in the response process and negatively impacted MTTR. With PagerDuty Live Call Routing, Rundle’s team now has the ability to create a self-service model to ensure incidents coming in are automatically sent to the right resources at the right time to respond quickly and efficiently. PagerDuty Live Call Routing at Claranet is used in two distinct ways: Internal: When incidents or events occur that monitoring systems don’t initially capture or in specific situations where teams are needed for a platform-specific incident, the right teams can be notified immediately to orchestrate a proper response. External: Some customers have a direct line of communication connected to the Claranet on-call team so they can escalate major incidents straight to the right responders when necessary. “We've essentially gotten to the point now where we don't have to rely on that human interaction anymore because of Live Call Routing. And over time, other teams across the organization have continued to adopt it because of its self-service domain,” explained Rundle. Benefits With PagerDuty Claranet has deployed PagerDuty across several globally distributed teams within the organization, including the Network, Security, and Engineering teams. 
Rundle’s team uses PagerDuty’s integration with Slack to communicate quickly and seamlessly about the response and management of incidents as they happen, while also ensuring full visibility of an incident’s current status to stakeholders, like the executive team. “Before PagerDuty, we had to individually reach out to people to ask what was going on, but with the Slack integration, we see everyone's alerts and we can actually analyze correlations across the platform,” he shared. Additionally, PagerDuty has helped improve the data management and reporting of the incident management process to key stakeholders and leadership teams. “PagerDuty helps us from the data perspective because you can actually see the data, take it to management and say, ‘Look, this is worth investing time and money,’” explained Rundle. With PagerDuty, Claranet’s regional teams have the autonomy to use the platform in a manner that best fits the existing processes of a particular team, and every regional team can leverage PagerDuty in their own original way. “PagerDuty is a simple, slick application that ultimately allows our teams to reduce their workload and really see the impact through the data we get from it,” shared Rundle. 
Claranet has seen several other benefits with PagerDuty, including: improvements in MTTR from removing manual work and adding automation to the incident response process; faster response to and resolution of incoming alerts with PagerDuty Live Call Routing; reduced operational costs and increased service availability due to new process efficiencies; greater visibility for key stakeholders into on-call performance and incident resolution with analytics and data reporting; and a central point of ingestion that aggregates all of their monitoring data through PagerDuty’s ecosystem of 350+ integrations. “Having PagerDuty as that central aggregation layer saves us time by not having to go and build monitoring system integration and cookie cut everything by service,” said Andrew Rundle, Principal Engineer, Claranet. Looking Into the Future Claranet plans to continue expanding the use of PagerDuty across different teams throughout the global organization, including the Infrastructure, Public Cloud, and Security teams across the group. “We want to be far more proactive and leverage even more automation to predict what’s really going on and reduce as much noise as we can,” shared Rundle. His team is also looking at implementing PagerDuty Event Intelligence to further their understanding of an incident’s makeup and how they can improve their response process across the organization. To learn more about how PagerDuty is helping global companies with digital operations management, try PagerDuty today.
Summit EMEA: How Vodafone Is Enabling Immutable Telemetry
In June, we were delighted to host our first-ever virtual PagerDuty Summit EMEA! Llywelyn Griffith-Swain, SRE Manager, and David Jambor, Head of Systems Engineering at Vodafone, were among our speakers. They outlined Vodafone’s approach to achieving immutable telemetry. David opened the session by defining Vodafone’s strategic goals. “Our vision is to create an engineering-driven culture,” he explained. “We want to empower development teams to be self-sufficient. Therefore, we’re putting them at the center of everything we do, but we want to challenge them: their code needs to reach production within four hours.” To do this, Vodafone is building self-service capabilities, with development teams given the power to say what tools and capabilities they need and how they want to use them. The end goal is to have observability and alerting capabilities that tell development teams what happens to code and how it behaves as it moves into production. “We’re building a lot of tooling around this,” David shared. “We’re building true continuous CI/CD, with a focus on continuous deployment that enables us to move code from a sandbox into the production environment. But this cannot be achieved without immutable infrastructure, which will enable us to provide immutable observability and alerting for development teams.” Why Is Immutable Telemetry Important? To explain how immutable observability can be defined, David gave us a great analogy using Formula 1. Imagine you’re leading the race and your tire gets a puncture, forcing you to come in for a pit stop. What do you expect your engineers to do: repair or replace the tire? You of course want them to replace it because you want to get back to the race as soon as possible. Immutability is about throwing away what is broken and replacing it quickly, instead of spending time trying to repair it. 
“Immutable infrastructure in IT really means that you shouldn’t change things if something is broken; it is much quicker to replace it with something new,” David explained. “Immutable observability leverages this approach to provide an on-demand, out-of-the-box capability to monitor and alert everything, end to end, in an immutable fashion.” How Vodafone Is Enabling Immutable Telemetry The immutable approach to telemetry would see Site Reliability Engineering (SRE) teams develop new monitoring approaches on demand. Llywelyn gave us an example where three development teams are all using a threshold error rate monitor. But what happens if one team decides it wants an anomaly detection error rate monitor? Instead of replacing the existing monitor and upsetting the other teams, the SRE team would develop the new monitor. Once ready, the development team that requested it would use the new monitor, while the others carry on using the existing monitor. Llywelyn also talked about the challenges Vodafone faced in implementing immutable telemetry. “We have 150+ developers and are following the DevOps approach, where developers need to own the code whether it’s in production or lower environments, including subsequent monitoring and alerting,” he shared. “We also need to give an immediate view of our production status to all stakeholders to enable visibility across digital.” He also explained that the solution they build needs to be in line with SRE principles of reducing toil. But because the solution will also be for the developers, it means they need to make all modules and monitors available as code and implemented via a CI pipeline, which allows developers to quickly add them as needed and also allows Vodafone to recover should an incident arise. The SRE team dreamed of a developer never having to leave the release pipeline to set up monitoring and alerting; instead, they can simply call up modules that have been built by the team itself. 
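The three-team example above, where a new monitor is published alongside the existing one rather than replacing it, can be sketched in a few lines of Python. This is an illustrative model only and not Vodafone’s implementation: the `MonitorDefinition` and `MonitorCatalog` classes, the monitor names, the team names, and the settings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorDefinition:
    """A monitor definition that cannot be modified once created (hypothetical fields)."""
    name: str
    kind: str        # for example "threshold" or "anomaly"
    settings: tuple  # (key, value) pairs; a tuple keeps the definition immutable


class MonitorCatalog:
    """Append-only catalog: monitors are added, never edited or replaced."""

    def __init__(self):
        self._monitors = {}
        self._team_choice = {}

    def publish(self, monitor):
        # Refuse to overwrite: a changed monitor must be published under a new name.
        if monitor.name in self._monitors:
            raise ValueError(f"{monitor.name!r} exists; publish a new monitor instead")
        self._monitors[monitor.name] = monitor

    def subscribe(self, team, monitor_name):
        # Each team points at exactly one published monitor.
        if monitor_name not in self._monitors:
            raise KeyError(monitor_name)
        self._team_choice[team] = monitor_name

    def monitor_for(self, team):
        return self._monitors[self._team_choice[team]]


# Hypothetical scenario: three teams share a threshold monitor.
catalog = MonitorCatalog()
catalog.publish(MonitorDefinition("error-rate-threshold", "threshold", (("max_error_rate", 0.05),)))
for team in ("team-a", "team-b", "team-c"):
    catalog.subscribe(team, "error-rate-threshold")

# team-c asks for anomaly detection: a new monitor is published, the old one is untouched.
catalog.publish(MonitorDefinition("error-rate-anomaly", "anomaly", (("sensitivity", "medium"),)))
catalog.subscribe("team-c", "error-rate-anomaly")
```

In this sketch, team-a and team-b keep the threshold monitor while team-c moves to the anomaly monitor, and any attempt to republish an existing name is rejected instead of silently changing what other teams rely on.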
In practice at Vodafone, the SRE team develops configurations for Datadog monitors and PagerDuty callouts, which can be called up in Terraform to set up monitoring and alerting. In the future, should developers want new monitors, these would be requested from the SRE team, who would develop them and make them available, and developers could then call them up through Terraform. David wrapped up the session by explaining how PagerDuty fits into SRE’s strategy. “SRE’s goal is to eliminate toil to allow time to be spent on more valuable tasks, like engineering solutions that make tomorrow a better place. Automation of tasks is vital here, and PagerDuty is the best tool for the job because it brings development teams closer to their code and empowers ownership.” Interested in watching the full session? Register today to check it out on demand (for free!), along with other customer sessions, including incident management at Form3 and how to drive operational efficiency with Auto Trader UK and Gousto.
PagerDuty Paying Dividends for Form3’s Digital Payment Platform
Your payment systems have slowed to a crawl, customers are getting impatient and abandoning their shopping carts both online and in stores, and you’re losing money every minute this problem goes on. Behind the scenes, technical responders are scrambling to resolve the issue before it impacts more customers and before even more money is lost. This experience is exactly what Form3, a UK-based, cloud-native, payment-as-a-service platform, aims to eliminate. The company provides payment technology, processing, and infrastructure to the financial services industry, enabling organizations to clear transactions faster. Because its cloud-native platform runs on AWS, Form3 can process payments faster than its customers are accustomed to; typical payment platforms that are not cloud-first are often held back by monolithic processes and legacy infrastructure. At PagerDuty’s Summit EMEA 2020, Eimear O’Connor, Chief Operations Officer of Form3, shared how her company uses the PagerDuty platform in combination with AWS cloud infrastructure to improve its operations and ensure the best possible customer experience, so that its clients’ end customers can pay seamlessly when checking out with their shopping carts and merchants can capture as much revenue as possible. Several banks, e-commerce platforms, credit card providers, fintech companies, and financial institutions all rely on Form3 to ensure payment transactions are secure, reliable, and processed quickly. The Need for Speed O’Connor’s team is responsible for maintaining the successful operation of Form3’s platform and ensuring a positive and consistent customer experience. During her PagerDuty Summit EMEA session, she described how their customers expect instantaneous access to view the status of their payments, which they need to ensure their services are up and running around the clock. 
And when incidents inevitably occur, Form3 needs the ability to orchestrate a swift response and ensure the right responders are contacted at the right time to resolve issues and reduce the impact on customers. To achieve all this, O’Connor and her team knew they needed a solution that could quickly identify a disruption or outage in any service and notify teams about what was happening. Form3 adopted a DevOps approach from its inception, building a culture of accountability and ownership with its teams and leveraging PagerDuty to empower engineers to build their own schedules, customize alerts, and gain real-time visibility into incidents. “We utilize PagerDuty to coordinate all of our intelligence, 24/7, 365 alert and incident response. We use predefined on-call scheduling and escalations based on alert types, which are then instantly directed to the right DevOps teams that can then respond in real-time,” O’Connor said. Because Form3 is growing so quickly, O’Connor also shared that implementing automation capabilities into their infrastructure and provisioning processes will help their teams scale more easily on AWS. For example, to help speed up the onboarding process for engineers, the company is utilizing PagerDuty’s Terraform integration to automatically provision the PagerDuty instance. Additionally, PagerDuty isn’t just used by engineering teams and on-call responders. By using the PagerDuty API, Form3 was able to develop an integration that helps track and report on-call hours for responders. With this integration, the company’s finance team can now easily view a report detailing any on-call hours worked and pay employees accordingly. Interested in hearing the full story? To view Form3’s session and others at Summit EMEA, register here. It’s free!
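The on-call hours report described above for Form3’s finance team comes down to a simple aggregation. The following is a minimal sketch under stated assumptions, not Form3’s integration: it assumes the shift records have already been retrieved from the scheduling system and that each record is a dictionary with hypothetical `user`, `start`, and `end` fields holding ISO 8601 timestamps. The sample data is invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def on_call_hours(shifts):
    """Sum on-call hours per responder from a list of shift records."""
    totals = defaultdict(float)
    for shift in shifts:
        start = datetime.fromisoformat(shift["start"])
        end = datetime.fromisoformat(shift["end"])
        if end < start:
            raise ValueError(f"shift for {shift['user']!r} ends before it starts")
        # Convert the shift duration from seconds to hours.
        totals[shift["user"]] += (end - start).total_seconds() / 3600
    return dict(totals)


# Hypothetical sample data: two overnight shifts and one full weekend day.
shifts = [
    {"user": "alice", "start": "2020-06-01T18:00:00+00:00", "end": "2020-06-02T08:00:00+00:00"},
    {"user": "bob", "start": "2020-06-02T18:00:00+00:00", "end": "2020-06-03T08:00:00+00:00"},
    {"user": "alice", "start": "2020-06-06T00:00:00+00:00", "end": "2020-06-07T00:00:00+00:00"},
]
report = on_call_hours(shifts)
```

Keeping the aggregation separate from the data retrieval means the same function could be reused whichever export or API call supplies the shift records.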
mTOMADY: Expanding the Reach of Healthcare to Underserved Countries
Every year, roughly 100 million people around the world are pushed into extreme poverty because of health-related expenditures. Analysis from the World Health Organization (WHO) found that more than 800 million people spend at least 10% of their household budgets on health expenses for themselves, a sick child, or other family members. For almost 100 million of these people, these expenses are high enough to push them into extreme poverty, forcing them to survive on just $1.90 or less a day. While these numbers are grim, innovation in healthcare and financial technology is opening the door to opportunities for people to gain access to healthcare and financial assistance on a global scale. Enter mTOMADY, a digital platform that got its start as a project of the NGO Doctors for Madagascar. It promotes universal health coverage and resilience against medical impoverishment, particularly in countries with underserved healthcare systems such as Madagascar. And as a PagerDuty Impact Pricing customer, mTOMADY is one of the organizations committed to helping bring essential medical care to people who need it, in the hardest-to-reach places, in the moments they need it most. Increasing Healthcare Access Through Technology Samuel Knauss, a medical doctor and digital health clinician, oversees mTOMADY’s Product team and is responsible for the organization’s development and operations of end-user software solutions, as well as strategy around the distribution of care and support for their patients and users through those solutions. “Our teams are distributed across Germany, Madagascar, and parts of Sub-Saharan Africa,” shared Knauss. With a small, distributed team operating in different workstreams, it’s important that communications are properly directed to the right people at the right time. 
“Between the team of on-the-ground agents and software developers, it is important for our teams to be able to communicate timely and effectively when incidents with our systems occur.” mTOMADY’s services have to be online and available 24x7 in extremely rural areas to serve patients, so uptime and connectivity are critical to ensuring patients have access to proper care and financial support solutions, especially during global crises. mTOMADY is built upon the foundation of five main principles: Transparency. Collecting real-time data on healthcare system usage, disease prevalence, and treatment costs enables impact to be traced down to individual health outcomes. Universal Access. Technology that works in resource-poor settings, on any mobile phone, allows patients and healthcare workers to access the platform even in isolated, rural areas. Cost Reduction. Simple-to-use technology that can be easily integrated into existing procedures reduces costs and removes administrative obstacles, improving operational efficiency. Advanced Technology. Cutting-edge pattern recognition and data analytics provide actionable insights to prevent insurance fraud and improve quality of care. Scientific Evidence. Research conducted to collect measurable, empirical evidence on how the solution works in the real world enables continuous optimization. Manual vs. Real-Time Incident Management Before PagerDuty, mTOMADY’s incident management process was completely manual. They used a spreadsheet for on-call directories, which led to delays in response and confusion around how to prioritize alerts for certain teams with different expertise. The manual directory also created a bottleneck in terms of scheduling because on-call rotations were globally distributed and cascaded across time zones. “Organizing our duty times and who’s responsible for what at what time is crucial in our field. 
If we don’t know who’s on call, we can’t provide the level of care we promise to our patients,” Knauss explained. Since the implementation of PagerDuty, mTOMADY has seen significant improvement in their on-call and incident management processes. They replaced manual on-call spreadsheets with automated rotation schedules, which let responders know precisely when they are on call and what services they are responsible for. They also leverage PagerDuty’s ecosystem of over 350 integrations to centralize tools like Kibana, Sentry, and Slack into a single point of ingestion. With these implementations, the mTOMADY team has seen significant improvements in metrics like MTTA/MTTR, as well as better alignment around who the point person is for incidents involving certain applications and services. “Timeliness is the most crucial factor in our field of work; knowing that the PagerDuty platform holds my teams accountable in response scenarios gives us the peace of mind we need to focus on our work, both on-the-ground and internal operations,” said Knauss. If mTOMADY is unable to respond to inquiries in a timely manner, it could mean that someone out there isn’t getting the care or support they need. Healthcare in the Sahara During the COVID-19 Outbreak Since the outbreak of COVID-19, mTOMADY has drastically changed the way it operates: They have canceled or delayed all non-essential meetings until further notice and have moved much of their on-the-ground work to a predominantly digital approach. However, mTOMADY is also constantly looking for ways to offer assistance. This includes working directly with rural hospitals to provide support and shift their focus to COVID-19-related response by distributing protective equipment to local and rural communities to help slow the spread of the virus. 
Looking ahead, mTOMADY wants to create a way to integrate PagerDuty into their global health emergency systems so they can detect spikes in illnesses or outbreaks in both local and rural communities. Looking Into the Future mTOMADY plans to continue expanding its use of PagerDuty into different facets of the business and go beyond just on-call management. They plan to integrate PagerDuty into their emergency response processes, including integrating with their ambulance service that works directly with rural communities. Additionally, they are looking to expand PagerDuty into their supply and equipment networks so that when there is some sort of communal shortage, they are alerted in real time and can act accordingly. To learn more about how PagerDuty is working to provide resources to healthcare companies, specifically around the COVID-19 outbreak, visit our COVID-19 resources page. Additionally, visit PagerDuty.org to learn how PagerDuty partners with organizations to deliver on their mission when moments matter, including providing Impact Pricing to organizations like mTOMADY.
ezCater Dishes Out Real-Time Operations With PagerDuty
ezCater is the leading corporate catering marketplace in the world. Since its inception in 2007 out of Boston, MA, ezCater has served over 100 million people across 22,000 cities and continues to expand its network with over 80,000 restaurants and caterers on its platform. For ezCater, providing a consistent and unparalleled customer experience is critical to the continued success and scale of the business. Jennifer Page, an Engineering Manager at ezCater, is responsible for ensuring that the company’s infrastructure remains resilient and available for their customer base. Rethinking Incident Management In order to move to a service-oriented architecture, engineering teams needed to rethink the way they approached incident management to align with their new, agile approach to digital operations. Some of the issues they were experiencing included: lack of visibility into service ownership across teams; misconfigurations due to confusing interfaces; unstructured work stemming from permission issues across distributed teams; and difficulty routing escalation policies across teams and services. Gaining Transparency and Flexibility With PagerDuty After implementing PagerDuty, the ezCater team saw immediate improvement across several areas of its incident management process. For distributed teams, the PagerDuty platform proved to be intuitive and granular in understanding and executing rotations, services, and schedules. “Our teams are now empowered to run their own on-call processes and are held accountable to their service integrations and alerting processes in real time,” explained Page. By using PagerDuty, ezCater has gained: Transparency. Teams can easily understand who is on call for each service in real time. Empowerment. From engineering leadership to individual contributors, teams can directly manage their permissions, rotations, and schedules. Flexibility. 
With PagerDuty’s ecosystem of more than 350 integrations, teams are able to easily integrate their existing tool sets and services to provide more system visibility. A Taste of What the Future Holds ezCater will continue to leverage PagerDuty as a key partner to develop processes, make improvements to infrastructure stability, and create business efficiencies throughout the rest of the organization. “PagerDuty gives us the confidence that we don’t have to manually interact with every incident, and we know that things will work when they need to,” Page explained. If you’d like to explore the possibilities of what PagerDuty can do for your team, sign up for a 14-day free trial today.
Parsley Health: Bringing Telemedicine to the Front Lines
As a child with a cancer survivor and heart disease patient for parents, I developed a psychological discomfort with visiting hospitals and the doctor’s office in my adult years. For me, the feeling of restlessness while sitting in a waiting room coupled with fear of the medically unknown can create the perfect breeding ground for stress and anxiety, and apparently, I’m not alone. But in the last few years, advancements in technology have opened the door to new opportunities that can change the way we think about doctor visits and in-person healthcare. Parsley Health, a holistic medical practice that offers online medical care and consultation in almost all 50 states, is helping lead the next evolution of healthcare. As part of PagerDuty’s commitment to helping organizations scale to meet acute and changing needs in healthcare in response to COVID-19, we are offering healthcare organizations like Parsley Health 20 free PagerDuty licenses for 6 months. Healthcare From Anywhere Parsley Health recognizes that going into the doctor’s office may not be the ideal choice for many people, either due to scheduling conflicts or fears tied to seeing a doctor, and they are one of the pioneers of using telemedicine to build strong relationships with their members and patients. Using the Parsley Health platform, members can meet with medical providers and health coaches through uninterrupted, online one-on-one sessions, either via phone calls or video calls. These sessions may include assessing health risk factors, writing prescriptions, teaching stress management techniques, and even creating comprehensive plans around dietary changes and mental health improvement. The benefits are twofold: Patients can engage in online sessions within the comfort of their own homes, and providers experience reduced levels of stress and more genuine conversation. These interactions help the providers build rapport and strengthen their relationships with patients. 
In addition to personalized one-on-one sessions, Parsley Health also offers members an online patient portal that allows patients to securely contact their doctor via messaging 365 days a year. They also offer a proprietary digital questionnaire to monitor and track patient progress over time. Presented to patients before meeting with a doctor, the questionnaire is scored according to the Parsley Symptom Index (PSI) and gives providers a better understanding of a patient’s symptoms and overall health over the past two weeks. The information is then used to make recommendations regarding a patient’s individual healthcare needs. The Impact of Telemedicine During Global Health Crises While the use of telemedicine has become more accessible and widely adopted over the last several years, the benefits and advantages have been especially illuminated during the COVID-19 outbreak. “Through telemedicine, doctors can evaluate patients’ symptoms and assess whether they need to be seen in person, preventing the unnecessary exposure of both parties to COVID-19. It also allows doctors to provide continuity of care to their patients; other medical needs do not stop in the face of a global health crisis,” shared Martín Beauchamp, Manager of Infrastructure, Security, and Data at Parsley Health, whose team is responsible for the alerting infrastructure and incident management operations that keep the platform up and running. Because of the increased accessibility of technology and the fear some at-risk individuals have of going out in public, telemedicine can play a pivotal role in maintaining the continuity of personal care during health crises. “With more people turning to telemedicine during this global pandemic, a whole new audience is learning how easy and helpful online care can be,” said Beauchamp. 
“Our hope is that people who have lived in places where there is a more limited range of medical services finally have access to the care they need.” He added, “Telemedicine has a unique dual role to play during COVID-19: protecting frontline medical workers and serving the needs of patients in the community.” For Parsley Health, the organizational effects of COVID-19 have been seen mainly through an increase in inquiries and questions from members about the virus and what preventative steps they can take to stay healthy. “With the growing uncertainty and constantly changing information around COVID-19, our providers have seen a huge uptick in online messaging with patients who have questions about symptoms, testing, and practical living,” Beauchamp explained. The Advantages of Being Cloud-Native With all of these managed services comes the need for a highly available and reliable platform that members can depend on for care, no matter the time or location. High uptime and resilient services are critical to keeping members healthy, happy, and informed. “When the right engineers can respond quickly to issues with necessary context, we are able to meet the high expectations that our members have in every interaction with Parsley Health providers,” explained Beauchamp. In order for organizations to keep pace with technology in society, they need to find a way to adapt to the new ways people consume and use products and services. Historically speaking, healthcare typically lags in innovation compared to other more cloud-forward industries. With challenges like private data, compliance, and other regulatory requirements, digital transformation is thought of as a long-term game with paced, incremental advances. Parsley Health is an innovation outlier in the healthcare industry. As a cloud-native company, technology has been embedded into their offerings since day one. 
They built a microservices-based platform within Kubernetes and orchestrate their containerized environment through Google Kubernetes Engine (GKE). On the front end, the engineering team uses tools like React and GraphQL, which enable developers to be extremely productive. With this highly available and forward-looking technology stack, Parsley Health is able to offer premium products and services to their members in real time, around the clock. But in order to keep these services stable, they needed a platform that could keep a pulse on their digital environment. Using PagerDuty, Beauchamp’s team is able to rally the right people at the right time during incidents and take a prescriptive and calculated approach to incident remediation. “PagerDuty is a crucial part of this process, because, as we receive alerts from monitoring tools, it allows us to respond with the most appropriate personnel based on the situation at hand,” shared Beauchamp. With PagerDuty’s ecosystem of over 350 integrations, his team has the ability to centralize tools like Slack and Datadog into one single point of ingestion to ensure monitoring data and incident communications are visible and easily accessible to key stakeholders during the lifecycle of an incident. “PagerDuty continues to help us keep our mean time to resolution as low as possible, rally the right resources during incidents, and provide high level visibility for a distributed team,” said Beauchamp. 
Since implementing PagerDuty, Parsley Health has seen several benefits, including: improved reliability and uptime, leading to a positive user experience for members; better visibility into an incident’s makeup, which helps distributed teams escalate the issue to the appropriate personnel quickly and efficiently; and improved MTTA/MTTR through automated escalation within PagerDuty and real-time communication through Slack. PagerDuty partners with Parsley Health to ensure their services are always online and available for their users, even when seconds matter. “Parsley Health will continue to rely on PagerDuty to give us timely alerts so that engineers can respond to operational issues. Together, we’ll keep the platform healthy so our providers can keep our members healthy.” - Martín Beauchamp, Manager, Infrastructure, Security, and Data, Parsley Health Visit PagerDuty.org to learn more about PagerDuty’s investment in Time-Critical Global Health and how we’re helping bring essential care to people who need it, in the hardest-to-reach places, in the moments they need it most. Additionally, check out our COVID-19 resource center to see how we’re working with companies and offering resources to help combat the COVID-19 outbreak.
Casumo Goes All-In on PagerDuty for Digital Operations Management
Casumo, an award-winning online casino headquartered in the archipelago of Malta, has been a disruptor in the online gaming industry since its inception in 2012. The company’s motto of “Your Favorite Online Casino” stems from its modernized “instant play” interface, mobile compatibility, and unique promotional reward system. Because of its “user first” mentality, maintaining a perfect customer experience and keeping players loyal and online is critical to Casumo’s continued success. With an online library of over 2,000 games and a system that grants over 31,000 rewards to users on a daily basis, Casumo’s mobile device and desktop platforms require around-the-clock monitoring in order to keep its robust infrastructure online and available for users. With PagerDuty, Casumo is able to deliver a consistent and secure customer experience. As an online betting and gaming company, Casumo receives private information from its users that it must protect, and PagerDuty is the safety net that assures teams at Casumo that they will be notified when a problem does occur. Finding the Right Solution Because Casumo has been rapidly growing since its launch in 2012, the engineering team wanted an end-to-end, reliable digital operations management platform that could scale and support the company’s infrastructure and keep user information protected. Casumo conducted a thorough evaluation of PagerDuty and another provider to assess performance across several criteria that were important to the engineering team in identifying and resolving incidents as quickly as possible. Select criteria included: mobile app and web functionality; ease of integration; on-call scheduling and management; visibility of similar incidents; and alert merging, escalation, and resolution. The Results Are In! The team integrated both PagerDuty and a competitor solution into the company’s monitoring tool stack, and each engineer recorded their preferences regarding the specific criteria. 
At the end of the trial, their preferences were averaged to determine the solution that had the best fit for their team, based on functionality and user experience. Across all their criteria, PagerDuty was the preferred choice. Key highlights: nearly 60% preferred the functionality and user experience of PagerDuty’s mobile app; nearly the same percentage chose PagerDuty for ease of integration; more than two-thirds preferred PagerDuty’s on-call schedule visibility; more than 80% had a better user experience viewing past incidents available with PagerDuty Event Intelligence; and more than two-thirds preferred PagerDuty’s alert merging, escalation, and resolution functionality. Casumo found PagerDuty’s solution to be very robust, working as intended so that the teams could find immediate value in using PagerDuty. What’s Next? In terms of overall user experience, functionality, and business value, it was clear that PagerDuty was the go-to option for an end-to-end, real-time incident management platform. Looking ahead, Casumo plans on expanding the use of PagerDuty across the organization, including IT, DevOps, and the customer support team. For Casumo, PagerDuty is the right choice because it enables the company to solve immediate issues while allowing it to grow and scale for the future. To learn more about how PagerDuty is helping companies with digital operations management in the U.K., U.S., and abroad, try PagerDuty today.
eToro Trades Manual Processes for Real-Time Operations With PagerDuty
eToro is a multi-asset investment company with over 12 million registered users from more than 100 countries. The company’s mission is to change the way people think about trading and investing, ultimately reducing dependency on traditional financial institutions and making trading and investing more transparent and fun. The eToro platform enables people to invest in the assets they want, from stocks and commodities to crypto assets. They can also choose how they invest: investing directly, copying another investor, or investing in a portfolio. In order to stay ahead in a highly competitive industry, it is critical that eToro’s services are always on. “We want to have the best customer experience for our users,” explained Elad Gotfrid, Director of IT Production at eToro. Manual Processes Slow Incident Response Prior to using PagerDuty, eToro relied on manual escalation processes, such as SMS notifications for its on-call teams. But with the large volume of alerts coming from eToro’s various monitoring tools, there was little context available to triage and resolve incidents quickly. The NOC wanted to automate its processes so that it could improve incident response. “We wanted to move responsibility to the developers and reduce the number of personnel involved so we could assess problems faster,” explained Gotfrid. Prior to PagerDuty, eToro faced several challenges, including: manual escalation processes that delayed response times, which could negatively impact the customer experience; teams having to work harder during the night due to the number of incidents; and a lack of visibility and context for alerts, with the risk of missing an important alert. Taking Stock of Real-Time Operations With PagerDuty eToro turned to PagerDuty to automate the incident management process, with the ultimate goal of minimizing the volume of alerts and ensuring a seamless customer experience. 
With PagerDuty, eToro can now: Respond faster, with automated escalation paths and scheduling using PagerDuty On-Call Management Mobilize the appropriate teams with Modern Incident Response Empower teams to act immediately, wherever they are, using the PagerDuty mobile app Measure MTTA, MTTR, and other KPIs to improve real-time operations “PagerDuty enabled us to achieve high SLAs for our internal clients and provide a superior customer experience for our users,” shared Gotfrid. Growing With PagerDuty eToro plans to continue its rollout of PagerDuty across various teams in the company, including the security, IT, and data teams. “The main objective now is to aggregate all incoming alerts from all teams into one centralized console through PagerDuty Visibility,” explained Gotfrid. eToro plans to adopt additional PagerDuty products, including Event Intelligence and Analytics to optimize the company’s real-time operations. “We use PagerDuty because it’s effective yet also super user-friendly,” said Gotfrid. To learn more about how PagerDuty is helping companies transform their digital operations management, try PagerDuty today.
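The MTTA and MTTR figures mentioned in this story are averages over incident timestamps. As a rough illustration only, the sketch below computes both from a list of incidents. The `Incident` record and its field names are hypothetical and are not a PagerDuty data model; the sketch also assumes that both metrics are measured from the moment the incident is triggered, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative only.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta_minutes(incidents):
    """Mean time to acknowledge, in minutes, measured from trigger."""
    return mean(
        (i.acknowledged_at - i.triggered_at).total_seconds() / 60
        for i in incidents
    )


def mttr_minutes(incidents):
    """Mean time to resolve, in minutes, measured from trigger."""
    return mean(
        (i.resolved_at - i.triggered_at).total_seconds() / 60
        for i in incidents
    )


# Two made-up incidents that start at the same time.
start = datetime(2021, 1, 1, 0, 0)
sample = [
    Incident(start, start + timedelta(minutes=2), start + timedelta(minutes=30)),
    Incident(start, start + timedelta(minutes=4), start + timedelta(minutes=60)),
]
```

For the two sample incidents, acknowledgement took 2 and 4 minutes and resolution took 30 and 60 minutes, so `mtta_minutes(sample)` is 3 minutes and `mttr_minutes(sample)` is 45 minutes.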
Adopt-a-Pet.com Keeps Their Systems Pup and Running With PagerDuty
“PagerDuty allows us to better support our mission when critical issues arise and help the animals we are determined to save.” - Shannon Cronin, CTO, Adopt-a-Pet.com Adopt-a-Pet.com is North America's largest nonprofit homeless pet adoption website that helps over 18,000 shelters, humane societies, SPCA organizations, pet rescue groups, and pet adoption agencies advertise their homeless pets to millions of adopters each month, for free. They also operate and maintain the world’s largest nonprofit, adoptable pet search engine. Users rely on Adopt-a-Pet.com at the early stages of the adoption process, prior to the moment of adoption. The organization partners with and supports users as they consider adoption options, seek the right pet for their families, and take the necessary steps to communicate with more than 18,000 animal shelter and rescue partners in the Adopt-a-Pet animal welfare network. Challenges Faced Because of its global user base, the organization’s website needs to be able to support a high volume of traffic around the clock. Shannon Cronin, Adopt-a-Pet.com’s Chief Technology Officer, oversees the technology and engineering departments that help support the organization’s mission and availability objectives. When Cronin moved into his role, his first order of business was to reconfigure the on-call incident response process and remediate the ongoing shortage of after-hours support. Additionally, at the time, many communication processes between teams were executed manually, which caused response delays and accountability issues in terms of ownership of services and applications. “If our site goes down or is experiencing other critical issues, this means pets are not getting seen or inquired about by potential adopters and that ultimately jeopardizes their chances of being adopted,” Cronin explained. 
Before PagerDuty, Cronin’s team experienced several challenges, including: Lack of processes to efficiently alert technical staff of critical issues Unclear escalation policies, which created gaps in team communication Low visibility into system and service health for key stakeholders Difficulty routing incidents based on the integration or affected service Due to the lack of communication and visibility across teams during non-business days, some service disruptions would go unacknowledged for hours or even days before someone picked up the issue. Cronin realized that meant missing Adopt-a-Pet’s SLAs and that he needed to rethink the way the organization executes incident management within its greater digital operations architecture. The Benefits of PagerDuty Since implementing PagerDuty, Cronin and his team have seen several benefits, including: Improved response times to critical issues across services and applications Full-service ownership as it relates to the technical staff and on-call teams Faster alert routing, leading to better communication and quicker acknowledgement Full visibility into on-call rotations and support staff availability “Once my team is notified by PagerDuty that there is an issue, we’re able to begin troubleshooting and resolving these issues as soon as possible, literally saving the lives of homeless pets,” explained Cronin. With PagerDuty, Cronin’s team was able to implement uniform escalation policies that routed particular teams to services they are responsible for, and even created a secondary response process. Additionally, by leveraging PagerDuty’s ecosystem of 350+ integrations, the Adopt-a-Pet.com team is able to integrate all of the necessary tools into their platform to create a single point of ingestion and gain full visibility into the environment of their application architecture and service health. 
“We can confidently say that, with the support of PagerDuty, we’re able to provide the 300,000+ adoptable pets who are on Adopt-a-Pet.com, like Cooper (pictured below with his new mom, Rylee), with the ability to be seen by more than 5 million potential pet adopters who browse our site each month,” explained Cronin.
(Fish) Farm-to-Table Produce With PagerDuty
Most of us are familiar with the traditional farms that have existed since humans learned to sow and harvest crops; these farms have provided us with food for centuries. And for a long time, due to the lack of refrigeration and other technology, humans lived near their food sources. But industrialization has also led to centralization of farming systems, with farms getting larger and farther from consumers and with distributors depending on preservatives or refrigeration to extend shelf life. For example, getting a salad, where consumers expect freshness, depends on a staggering level of geographic consolidation: 95% of all leafy greens in the US are grown in California and Arizona (and almost all of that in just two counties!). Not only does this mean your salad spends most of its shelf life on a truck, but it also accounts for why issues like E. coli outbreaks cause frequent and widespread recalls. Upward Farms (formerly Edenworks), based in Brooklyn, New York, is changing all that by providing fresh produce grown locally in a sustainable way via aquaponics, a mix of aquaculture and hydroponics. This means that in addition to growing baby greens and microgreens, which are flavor and nutrient powerhouses, the company also grows fish. And it’s the manure from these fish that provides the nutrients to the growing plants. Founded in 2013 by CEO Jason Green, Construction Manager & Systems Engineer Matt La Rosa, and CTO Ben Silverman, Upward Farms grows a variety of baby greens and microgreens in climate-controlled vertical farms (think tall industrial racking systems with layers of plants growing under LED light), and sells washed and ready-to-eat packaged salads to local stores, including Whole Foods Market. Their products use 95% less water than traditional farming, are pesticide free, and are on shelves the day after harvest (compared to the week that field-grown products spend on a truck), doubling the shelf life. 
From Waste to Water to Food Indoor farms aren’t really anything new: greenhouse growing has been a major industry in Holland for 100 years. What is new are vertical farms, where layer upon layer of plants are stacked into towers. These high-density indoor farms are the result of paradigm shifts in LED lighting, sensors and IoT, and automation. For instance, LED fixtures have historically been very expensive, and cheaper halogen or metal halide fixtures used in other growing systems generate too much heat to be used in high-density vertical farms, where they would cook the plants. However, as technology has improved and the price of LED bulbs has dropped (according to Green, the hard costs of an LED have decreased about 90% over the past 5 years, while operating costs have similarly decreased through higher efficiencies), vertical farming has become a viable solution for areas like the U.S. Northeast, which has massive populations but a climate that cannot support year-round growing like the U.S. West Coast or Mexico. Upward Farms sustainably farms fish, with no hormones or antibiotics. Water from the fish tanks is pumped through a bioreactor, where naturally occurring bacteria transform the fish waste (manure) into fertilizer. The Upward Farms team has custom-built industrial racks containing long pond-like shelves, layered into tall towers, and the fertilizer-rich water is pumped into those shelves. Floating on top of each shelf are trays filled with seeds for a variety of leafy greens, which are fully grown after 7–18 days under LED bulbs, and are then harvested, washed, and packaged to be put on store shelves the next day. A Delicate Balance Recreating an entire ecosystem indoors with equipment in close quarters is exactly as tough as it sounds. The Upward Farms team needs to constantly monitor a variety of metrics, from “vital signs” like dissolved oxygen in the fish tanks, to ambient climate conditions that determine plant quality and yield. 
With so much to monitor, the team uses the PagerDuty Events API to connect with a variety of equipment that captures environmental conditions and water chemistry in real time so they can optimize their operations. For example, if a pump fails or an aeration pipe is clogged, an oxygen deficit in the fish tanks could lead to a mortality event in which hundreds of fish die within hours. Plants are less sensitive, but if the temperature in the grow stacks rises above the ideal range, it could impact the quality and yield. In these cases and others, it’s imperative that someone is alerted immediately to resolve the issue. Upward Farms uses PagerDuty to route incidents to the appropriate teams for response. Environmental threshold alarms are directed to the farming and aquaculture teams, while equipment-related alarms are directed to the IT team. In the end, it’s grocers and their consumers who benefit, through a new level of freshness, quality, and safety in the most delicate perishables of leafy greens and fish. Interested in learning more? Michael Karlesky, Director of Software & Electronics at Upward Farms, provides more details on how Upward Farms uses PagerDuty’s Events API in the Upward Farms Implementation Guide.
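To make the routing described above concrete, here is a minimal sketch of how a sensor reading might be turned into a trigger event for the PagerDuty Events API v2 and directed to a team by alarm category. It is not Upward Farms' implementation: the routing keys are placeholders, the category names, the `check_dissolved_oxygen` helper, and the 5.0 mg/L threshold are all hypothetical. Only the event body fields (`routing_key`, `event_action`, and a `payload` with `summary`, `source`, and `severity`) and the enqueue URL follow the public Events API v2 format.

```python
import json

# Public Events API v2 endpoint that accepts event bodies via HTTP POST.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Placeholder integration keys (assumption): one per responding team.
ROUTING_KEYS = {
    "environment": "FARMING_TEAM_KEY_PLACEHOLDER",  # farming and aquaculture
    "equipment": "IT_TEAM_KEY_PLACEHOLDER",         # IT team
}


def build_event(category, source, summary, severity="critical"):
    """Build an Events API v2 trigger body routed by alarm category."""
    if category not in ROUTING_KEYS:
        raise ValueError(f"unknown alarm category: {category}")
    return {
        "routing_key": ROUTING_KEYS[category],
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # critical, error, warning, or info
        },
    }


def check_dissolved_oxygen(tank, mg_per_l, minimum=5.0):
    """Return a trigger event if oxygen is below the (hypothetical) minimum."""
    if mg_per_l >= minimum:
        return None
    summary = f"Dissolved oxygen {mg_per_l} mg/L below {minimum} mg/L in {tank}"
    return build_event("environment", tank, summary)


# Example: a low reading produces an event for the farming team.
event = check_dissolved_oxygen("tank-3", 3.2)
body = json.dumps(event)  # this JSON string would be POSTed to EVENTS_URL
```

The sketch stops short of sending anything over the network. In practice the JSON body would be posted to `EVENTS_URL` with a `Content-Type: application/json` header, and each routing key would come from the integration configured on the corresponding PagerDuty service.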
Scribd Uses PagerDuty to Support Growth of Their Massive Digital Library
Scribd is considered “the Netflix for books” with over one million titles on their audiobook platform. R. Tyler Croy, Director of Platform Engineering, explains how Event Intelligence and Modern Incident Response have helped support Scribd’s infrastructure so the company can continue to bring value to its millions of users.
Global Staffing Company Recruits PagerDuty to Manage Its Digital Operations
“Once we realized the power of PagerDuty, it expanded our use of the product to empower teams to just be better at what they do.” - System Engineering Team Manager This organization provides workforce solutions to several industries, with clients spanning from small businesses to the Fortune 500. The company fills individual positions on demand, staffs entire facilities, and manages outsourced recruiting processes and programs for a variety of clients. As part of its digital transformation, the organization is adopting a cloud-first approach to drive more efficiency and innovation throughout the organization. In order to ensure a smooth transition, it completely redesigned its monitoring and metrics infrastructure, and uses PagerDuty to funnel thousands of various business signals and metrics into a single-pane-of-glass-view of its digital operations. Legacy Processes Not Working as Well for Cloud Migration As the company was moving more of its infrastructure to AWS, legacy processes made it challenging for its engineering teams to maintain uninterrupted service across a global environment. The legacy incident management process led to: Longer times to mobilize key stakeholders during major incidents due to a manual, internal “phone book” process Slower incident response driven by the lack of visibility into the overall health of the infrastructure More unplanned work and disruption for teams on call Teams were bombarded by alert storms and signals, which made it difficult to determine if an incident required immediate attention. “We had teams working on the same incidents in silos—no one had any idea who was working on what,” explained a systems engineering manager. Leveraging Integrations to Increase Visibility and Collaboration Ensuring complete visibility and communication across teams was a top priority for the company. 
By leveraging many of PagerDuty’s 350+ integrations, including Slack, ServiceNow, Checkmk, and AWS CloudWatch, the company’s engineering teams were able to aggregate their entire monitoring suite into one centralized view within the PagerDuty platform and improve communications across teams. “Having many integrations within PagerDuty allows problems and events to quickly escalate to the right people,” explained an engineer on the team. Employing Automation to Improve Operational Efficiency With data and signals from monitoring tools flowing into PagerDuty, the engineering teams now have real-time visibility into systems health and are empowered to respond immediately. When major incidents occur, the company can now mobilize teams in under five minutes, where it had taken up to an hour previously. With PagerDuty Modern Incident Response, business stakeholders can automatically open a conference call and email key personnel immediately by running a Response Play. PagerDuty enabled the company to: Automate more of its response process and quickly engage key stakeholders during major incidents with PagerDuty Modern Incident Response Improve work-life balance for responders with PagerDuty On-Call Management Mobilize teams immediately using PagerDuty Live Call Routing with one-touch-to-join conference bridge Gain visibility into infrastructure health by leveraging PagerDuty’s ecosystem of 350+ integrations Focus on constant operational improvement through retrospectives, reporting, and analytics “Once we focused on automating the monitoring, escalation, and alerting, that’s when we started to improve as an operations team,” explained an engineering manager. Recruiting the Security Team to PagerDuty In addition to supporting the engineering teams, PagerDuty also helps the security team mitigate risk. 
Leveraging the deep AWS integrations from PagerDuty enables the operations team to notify the security team to immediately take action if one of their security servers goes down. Using the Red Canary integration, a number of events can be automatically resolved. “Without that ability, I’m not sure where the alert would really go. The ease of configuration as well as the results that we get are absolutely amazing,” explained an engineer. Working on Next Steps PagerDuty has already been adopted by other teams within the organization, including development, security, and the IT service desk teams, with more growth anticipated. Additionally, the engineering teams are evaluating PagerDuty Event Intelligence and Analytics to gain additional context into incidents across their environment. For this workforce solutions company, continuous improvement and innovation is critical to its success, and PagerDuty is a key partner as it continues to scale the business. According to an engineering manager, “PagerDuty plays a huge role in ensuring we are successful in providing the highest level of service to our customers.” Learn more about what PagerDuty can do for your organization by signing up for a free 14-day trial today.
OVO Energy Powers Digital Transformation With PagerDuty
OVO Energy is the UK’s largest independent energy technology company and supplier. Across the group, OVO serves 1.5 million customers with intelligent energy services. Founded in 2009, OVO redesigned the energy experience to be fairer, greener, and simpler for all. Today, OVO is no longer simply an energy retail business: it is a group of innovative, dynamic companies, all striving to harness technological advances with great consumer propositions to create more affordable clean energy for everyone. PagerDuty has helped OVO’s engineering teams with maintaining the applications and microservices they build, creating a more streamlined process. As a result, PagerDuty has helped OVO with: Quickly building on a consistent on-call policy as the company grows and new teams are formed. Providing a single solution with enough integrations so their autonomous engineering teams don’t have to compromise on their chosen monitoring tools. Drawing incident response data from services running on multiple platforms under different brands into one place for greater visibility. Powering Up Incident Management With PagerDuty OVO wanted to automate escalation processes so teams could take action faster, leading to greater efficiency. With PagerDuty, OVO was able to significantly reduce their MTTA and MTTR. PagerDuty also helped OVO identify the incidents that took longer to resolve so teams could take steps to continually improve to reduce the chances of recurrence. Additionally, PagerDuty’s ecosystem of 350+ integrations allowed OVO to consolidate and centralize their alerts from various monitoring tools, such as CloudWatch, Datadog, and Slack, into one centralized place. 
Today, OVO Energy is able to: Improve the work-life balance for responders through PagerDuty On-Call Management Orchestrate automated response across the organization with Modern Incident Response Execute incident response actions from anywhere, anytime with the PagerDuty mobile app Compile key outcome metrics to improve the postmortem process with PagerDuty Analytics “PagerDuty Analytics helped reduce our MTTA by 97% and our MTTR by 70%,” shared Tom Shaw, an Engineering Operations Analyst at OVO. Charging Forward With PagerDuty OVO Energy will continue to implement PagerDuty into teams across the organization, including Engineering Operations, Product Support, and feature teams. “We now have 15 engineering and support teams across retail brands using PagerDuty,” Shaw said. “The extensive range of integrations offered by PagerDuty and the ease with which they can be configured and customized are valuable to our autonomous engineering teams, who are free to explore and adopt the tools that best suit their needs,” Shaw explained. To learn more about how PagerDuty is helping companies transform their digital operations management in the UK, U.S., and abroad, try PagerDuty today.
Dropbox Uses PagerDuty to Help Scale Digital Operations
Watch how Dropbox relies on PagerDuty to give its engineers a digital operations platform that helps keep track of incidents and reduce downtime and drop-offs.
Carnival Corporation and Princess Cruises Sail Into Digital Transformation With PagerDuty
Sqills Books PagerDuty to Modernize Digital Operations Management
Sqills is a travel software company headquartered in the Netherlands. Its flagship product, S3 Passenger, is a cloud-based booking and distribution platform for the bus and rail industry, built on a microservices architecture. The company has been quickly growing worldwide due to its rapid innovation, and the platform’s high availability and reliability for its customers. “We can’t afford to wait to jump into action when our customer notices a problem. Not being able to check in passengers or sell tickets is simply not an option for our clients,” said Robin Breuker, Team Lead Software Development at Sqills. Manual Processes Delayed Incident Response Each development team is responsible for monitoring their own microservices, but with an increasingly complex infrastructure and multiple teams working to support S3 Passenger, it became more challenging to mobilize the appropriate teams when incidents arose. This was an issue because the faster Sqills could identify and resolve incidents, the better it could meet its SLAs. Yet teams used Confluence to track on-call schedules and manually alerted teams when incidents occurred, which became more difficult to manage and scale as Sqills grew. As a result, Sqills faced several challenges, including: Delayed incident response times due to manually having to find and notify the right team to fix an issue Lack of visibility into the overall health of applications driven by distributed teams using their own monitoring tools Teams being notified too often on various alerts, impacting work-life balance “If you thought something was wrong, you would have to find the right tool and check if there was a real issue. That takes time,” explained Breuker. A Smoother Ride With PagerDuty Sqills wanted a better, automated solution, and selected PagerDuty after a thorough evaluation process. “We looked at two other alternatives, but their solutions were not mature and seemed quite basic. 
In contrast, we were able to quickly configure PagerDuty the way we wanted to, with only a little bit of code,” explained Breuker. Because PagerDuty supports over 350 integrations, the development teams could easily configure and centralize all their monitoring tools onto the platform, using the PagerDuty-Terraform integration. With PagerDuty, Sqills has seen: Streamlined incident management processes using Modern Incident Response and postmortems Empowered teams to manage their own schedules and improve their work-life balance through On-Call Management Enabled responders to take action from anywhere by leveraging the PagerDuty mobile app, resulting in faster incident response Gained more visibility into the application’s health “You spend less time getting all of the right information and that helps a lot. Using PagerDuty means you spend less of your free time working,” said Breuker. The Journey Ahead With PagerDuty Looking forward, the company hopes to explore PagerDuty Analytics to gather more data to better inform decisions and help teams understand their performance. In addition, the company will be looking at how PagerDuty can help engage teams such as product support via stakeholder notifications and other features. “PagerDuty is really well designed and optimized for what we want to do. It was really easy for us to get started with PagerDuty,” said Breuker. To learn more about how PagerDuty is helping companies transform their digital operations management in the U.K., U.S., and abroad, visit www.pagerduty.com for more information.
The Sporting Group Doubles Down on Real-Time Digital Operations With PagerDuty
The Sporting Group is one of the world’s leading names in sports betting technology and trading. The company has two principal divisions—Sporting Index, the pre-eminent sports spread betting company, and Sporting Solutions, a rapidly growing B2B operation that supplies real-time pricing and sports trading capabilities to sportsbooks around the world. Sporting Group prides itself on delivering an unrivaled customer experience. Ash Patel, Head of IT Operations, leads the UK-based technology team responsible for ensuring that the Sporting Group’s entire infrastructure remains highly available for all of its customers. A Drive for More Operational Efficiency With peak traffic occurring in the evenings or on weekends due to sporting events around the world, various members of his team would be on call. “Minimizing the disruption of people called out of hours is something that we were looking for,” said Patel. With a mix of manual processes and a slew of alerts coming from monitoring tools like SumoLogic, SolarWinds, and Turbonomic, the Sporting Group wanted to improve operational efficiency even further. For instance, the company would manually create incident reports for Sporting Solutions clients when customer-impacting incidents occurred. The existing process had continuous alert traffic that did not provide full context into incidents, which led to: Lack of visibility into prioritizing events After-hours business disruption for on-call teams Time and resources spent on manually writing incident reports “We want to streamline the process and ensure the team is quickly focused on the critical issues when they occur,” said Peter Wallis, Chief Technology Officer, the Sporting Group. 
Improving Real-Time Operations By leveraging some of the 350+ integrations available through PagerDuty, Sporting Group centralized all of the alerts coming from its multiple monitoring tools, which enabled the Technology Operations team to streamline their incident response process even further. Using multiple products available on the PagerDuty platform, Sporting Group could apply machine learning and automation to proactively identify issues and notify the best teams to take action in real time via Slack, push notifications, text, or telephone. "PagerDuty allows us to capture things we were not able to measure before, providing a truer picture of what’s transpired." - Peter Wallis, The Sporting Group PagerDuty’s platform helps the Sporting Group to: Reduce alert noise and prioritize P1/P2 incidents using PagerDuty Event Intelligence Gain immediate insights into the availability and performance of business services using dashboards from PagerDuty Visibility and Status Page Improve work-life balance for on-call teams during “sporting hours” by leveraging PagerDuty On-Call Management and Live Call Routing Automate incident reporting for clients through postmortems available through PagerDuty Modern Incident Response “Everything is much more efficient, automated, sleeker, visible, and effective. PagerDuty allows us to capture things we were not able to measure before, providing a truer picture of what’s transpired,” said Wallis. Looking Ahead Other teams within Sporting Group are also using PagerDuty, including teams within customer service and risk management, as well as the senior leadership team. In the future, Sporting Group seeks to further customize its Slack integration with PagerDuty to improve collaboration, transparency, and response across the global organization. 
With a growing number of sporting events taking place around the clock, Sporting Group hopes to expand the use of PagerDuty to enable business and technology teams to respond to critical business events whenever they happen, from wherever they are. “PagerDuty is such a good fit for us because it builds upon much of the technology we already have. Because it’s so flexible, we can drive a lot of different use cases through it to maximize the business value gained from using the solution,” said Patel. Learn more about what PagerDuty can do for your organization by signing up for a free 14-day trial.
ITV Depends on PagerDuty When Broadcasting to Millions Across the UK
ITV is the biggest and most popular commercial television channel in the United Kingdom. Tom Clark, Head of Common Platform at ITV, shares how PagerDuty can easily integrate with other tools and empowers his team to identify, acknowledge, and take action to resolve issues within minutes. With 10,000 hours of content a year reaching roughly two-thirds of the United Kingdom, ITV relies on PagerDuty to drive its real-time operations to provide their viewers with the best experience.
Sportradar Leverages PagerDuty to Further Drive Efficiency for the Company
With more than 1,000 companies in over 80 countries utilizing its services, Sportradar is a global leader in understanding and leveraging sports data and digital content for clients around the world. Referred to as “the source code of sport,” Sportradar prides itself on data availability and sheer speed of delivery to its customers. “Providing a stable hosting platform and ensuring consistent application delivery are critical to driving revenue streams for the business,” said Christoffer Franzen, Sportradar’s global head of system administration.
A Need for More Operational Transparency
Franzen leads a global unit consisting of several 24/7 operational teams across multiple regions around the world, managing and operating a hybrid cloud infrastructure. However, monitoring and alerting were mostly decentralized across the departments, leading to challenges that included a lack of transparency into overall systems and application health, reactive (rather than proactive) incident management, and more steps involved in resolving issues.
"We needed a one-stop source to bring transparency into our operational duties, and PagerDuty was a solution that would centralize and standardize our monitoring across different departments." - Christoffer Franzen, Global Head of System Administration, Sportradar
Gaining Insights to Maintain the Customer Experience
For Sportradar, identifying and resolving issues as quickly as possible are core to ensuring that its customers stay on top of sporting events data as it happens. With PagerDuty, Franzen’s teams gain important context on issues in real time so they can determine faster paths to resolution and preserve the customer experience.
PagerDuty’s products enable Sportradar to surface relevant information from past incidents more easily using Similar Incidents from PagerDuty Event Intelligence; visualize the real-time, overall health of the environment using PagerDuty Visibility; orchestrate teams automatically through PagerDuty On-Call Management; and take action from anywhere by leveraging PagerDuty Mobile Incident Management. As Franzen remarked, “We’re able to have one powerful source that provides the necessary visibility and insight to respond faster and more effectively.”
Leveraging Integrations to Improve Response
PagerDuty’s ecosystem of 300+ integrations enables Sportradar to easily centralize alerts from its most heavily used monitoring tools, including Prometheus, which provides more sophisticated views so teams can be more proactive. With the Slack integration, detailed alerts go directly to the right teams so they can immediately take action. “We rely heavily on the integrations with Slack so teams can work directly within the interface and resolve incidents with fewer steps,” said Franzen.
"Sportradar has already seen the clear benefits from partnering with PagerDuty." - Christoffer Franzen, Global Head of System Administration, Sportradar
Looking Ahead
In the future, Sportradar plans to expand its use of Event Intelligence even further as the company continuously implements new hosting technologies. It will also add more customization into PagerDuty Visibility consoles for each of its engineering teams, as well as the products within its hybrid cloud infrastructure.
As the company looks toward the future, PagerDuty will be an important partner to help Sportradar scale and support its growth by providing more operational transparency, enabling faster response through the rich context available in the PagerDuty mobile app, and provisioning new services on PagerDuty for any team. “When we were looking into renewing our monitoring and improving our operations, PagerDuty was the name that consistently came up. We are very happy with the results we’ve seen so far and our decision to partner with PagerDuty,” concluded Franzen. Learn more about what PagerDuty can do for your organization by signing up for a free 14-day trial.
Metapack Counts on PagerDuty to Deliver the Best Customer Experiences
MetaPack is the leading provider of e-commerce delivery management technology to enterprise retailers and brands. In this video, several members of the tech team at MetaPack share how PagerDuty empowers teams throughout the company with data, clarity, and context so that they are able to work together to respond to issues in real time and have confidence when making complex technical decisions in the future.
Elastic Leverages PagerDuty for Visibility into System Health to Exceed Customer Expectations
Avoka Proactively Manages Its Digital Operations with PagerDuty
Avoka, a Temenos company, provides a software platform built for creating customer acquisition and onboarding solutions for financial services clients, managing the flow of transactions, and analyzing actual customer behavior. Jeremy Emmett, Global VP for Cloud Services, discusses how PagerDuty has helped Avoka’s teams move from a reactive to proactive culture by providing a digital operations management platform that allows them to act in real time.
ClassPass Reduces Manual Overhead and Improves Service Quality with PagerDuty
ClassPass offers monthly fitness memberships that provide its customers access to thousands of different classes at studios and gyms across multiple U.S. cities. Don Neufeld, VP of Engineering, shares how PagerDuty reduces the amount of manual overhead (leading to a reduction in incident count by 50%) and how PagerDuty’s postmortem capabilities help the team learn from incidents to improve their response, positively impacting the business.
Cox Automotive Improves Developer Productivity by 20% Using PagerDuty
Cox Automotive aims to transform the way the world buys, sells, and owns cars by providing insights and products to bridge the gap between consumers, manufacturers, dealers, and lenders at every stage of the automotive experience. Jason Riggins, Senior Director of Delivery Enablement, shares how PagerDuty enabled teams to mitigate business disruption, resulting in a 75% reduction in customer downtime, a 56% reduction in MTTR, a 23% decrease in overall incident count, and a 20% increase in developer productivity.
PagerDuty Fuels BlaBlaCar’s Shift Towards Real-Time Operations
BlaBlaCar is the world’s leading long-distance carpooling service, connecting car drivers with empty seats to passengers looking for a ride. With 60 million members across 22 countries and over 18 million travelers every quarter, BlaBlaCar is creating an entirely new, people-powered travel network. Julien Dehee, Head of Foundations at BlaBlaCar, leads a team of site reliability engineers (SREs) and database reliability engineers (DBREs) responsible for BlaBlaCar’s production infrastructure and the tools used by the larger engineering organization. “We provide a highly available platform. This is critical to the business because if the platform goes down, then our customers can’t book carpool trips, which leads to a direct loss of revenue,” explained Dehee.
A Bumpy Road to Service-Oriented Development
BlaBlaCar was in the midst of transitioning from a monolithic to a service-oriented architecture, and moving from traditional IT Operations to a DevOps culture that embraced a “you build it, you own it” mentality. However, BlaBlaCar had only one on-call team, with all members sharing one mobile phone, which would be handed off to the next on-call person on the weekly rotation. As a result, BlaBlaCar faced many challenges, including delayed response times driven by manual processes, alert fatigue for on-call engineers due to the high number of false positives, more time spent on unplanned work than on innovation, and time spent logging on-call hours for compensation. “Before [PagerDuty], it was pretty difficult to acknowledge and resolve issues immediately. We had to log into our previous monitoring system and find the alert,” said Dehee.
Shifting Gears to Accelerate Incident Response
With PagerDuty, development teams at BlaBlaCar own and manage their on-call schedules and notifications. The teams can easily acknowledge and respond to issues through PagerDuty’s mobile app, which enables them to immediately take action from wherever they are.
At the same time, Dehee’s team gains rich insights into the performance and health of development teams and infrastructure through PagerDuty Analytics. Using this data, his team can provide recommendations on fine-tuning alerts and on-call rotations. With PagerDuty, BlaBlaCar has achieved improved operational efficiency and platform stability, the ability to respond in real time with the appropriate resources through automation, distributed responsibility across the entire engineering organization, and automated accounting of on-call hours so that Human Resources can compensate responders.
"We’ve reduced alert fatigue, which has allowed us to focus more on the development roadmap. If you work less on production issues, you can focus more on innovation." - Julien Dehee, Head of Foundations, BlaBlaCar
Driving Operational Efficiency Through Integrations
BlaBlaCar provides its development teams with the autonomy to build and customize monitoring and alerting on their services by leveraging PagerDuty’s 300+ integrations. Dehee’s team uses the Prometheus integration for monitoring and the Slack integration for enabling real-time collaboration across the different teams. Other integrations used by BlaBlaCar teams include New Relic, PanOpta, and Jira. Through PagerDuty’s comprehensive integration ecosystem, BlaBlaCar gains real-time operational visibility and actionable insights across the entire IT infrastructure; shared ownership of production, deployment, and management throughout the entire engineering organization; and improved productivity and engagement driven by empowered development teams. “Operational efficiency and alerting platform stability are highly critical to the organization. We could have created something in-house, but overall, PagerDuty brings more value than an in-house solution,” Dehee concluded. Learn more about what PagerDuty can do for your organization by signing up for a free 14-day trial.
When Every Minute Matters
Using Data to Dismantle a Criminal Industry
Human trafficking is a $150 billion criminal industry that denies freedom to over 40 million people globally, and it happens in every country in the world. Polaris is an organization dedicated to ending human trafficking and restoring freedom to survivors. For over a decade, Polaris has operated the U.S. National Human Trafficking Hotline. This 24/7 resource connects victims and survivors to the services they need to get help and stay safe, and equips the anti-trafficking movement with tools and data to combat trafficking. The hotline also provides a way for community members to report tips about potential incidents of human trafficking. Through the hotline alone, Polaris has learned about more than 40,000 cases of human trafficking over the past decade. Given the clandestine nature of human trafficking, data on the crime has been fragmented, siloed, and incomplete. In the process of serving thousands of individuals over the years, we have learned key trends and attributes of trafficking itself. In 2017, Polaris launched a breakthrough report, “The Typology of Modern Slavery,” revealing for the first time that there are 25 distinct types of human trafficking in the U.S. alone. The report broke down the broad categories of sex trafficking and labor trafficking into the distinct business models traffickers use, from labor trafficking in agriculture, restaurants, and nail salons to sex trafficking in fake massage parlors and escort services. Today, the hotline, with its 3,000+ service provider and law enforcement partners across the country, is the hub and data engine for the U.S. anti-trafficking movement. The insights we uncover help generate data-driven strategies and tools to support the anti-trafficking movement and allied fields (like domestic violence and child protection) so that we can get upstream to prevent and disrupt trafficking at scale.
This hotline is a massive technology platform made up of many integrated systems. We receive hundreds of incoming “signals,” meaning calls, texts, chats, emails, and webforms, each day, and we always have people in the queue waiting to talk to us. Because those reaching out to us may be in life-threatening situations, it’s vital that a glitch in any of our systems is reported swiftly and routed to the person with the appropriate expertise so they can immediately address and resolve the issue. Coordination becomes mission-critical when an advocate is on the phone with someone in crisis.
Responders Taking Action in Real Time
We began our journey with PagerDuty this spring. Currently, our tech team uses PagerDuty’s Modern Incident Response to escalate urgent issues that come from the 24/7 hotline. When an incident occurs, it is escalated either through Live Call Routing or through the integration with our help desk when someone submits an urgent ticket. For example, let’s say the hotline receives a phone call from a woman working as a nanny for a family that has locked up her passport, underpaid her, forced her to work 18-hour days, and monitored her access to the outside world. One day, her traffickers leave her alone in the house for 20 minutes to run errands, and she finally sees an opportunity to reach out for help and runs to a neighbor to ask to use the phone. If, on the day the nanny was finally able to reach out, a technical glitch occurred on one of our systems that prevented a hotline advocate from viewing the list of appropriate social services in the city where she is located, then the chance for her to leave that trafficking situation could be lost. However, if we take that same scenario, and our advocate was able to reach out to the tech team immediately through PagerDuty rather than scrolling through a long protocol document to find the number of the appropriate tech team member, the story would end very differently.
With PagerDuty, our advocate would quickly reach the right person, a backup system would be deployed, and the advocate would be able to make a safety plan for the nanny and connect her with the services she needed, all before her traffickers returned home.
More to Be Done
As a result of using PagerDuty, our IT team’s incident escalation and response process has been shortened by roughly one hour for every urgent issue. These 60 minutes that we get back translate to an additional 10 people whom we are able to serve each month, and that’s a conservative estimate. We hope to continue to improve Polaris’ efficiency and effectiveness by building the PagerDuty workflow across the organization and having our non-IT teams integrate PagerDuty into their own incident response plans. Our vision is for PagerDuty to be our core mode of escalation across all functions, including (but not limited to) managing technology issues and engaging subject matter experts in real time. When someone’s safety and trust are on the line, every second matters. It takes courage and strength for a victim to reach out for help. Timeliness and effectiveness are everything in these circumstances, and those are the elements essential to building a platform here at Polaris that people can rely on. We’re in the trust business, and this trust is tethered to the speed of our responses and the stability of our systems. If you would like to contribute to our efforts, please visit our website to learn how you can help.
Nancy McGuire Choi is the chief operating officer of Polaris with over 15 years of experience as a social enterprise executive in international development, information management, and technology information. Nancy leads Polaris’ strategy and day-to-day operations, in addition to managing the organization’s data, technology, and operations teams.
Before Polaris, she served as the chief operating officer at Development Gateway, an international nonprofit that creates digital tools and services to support data-driven decisions in international development.
PagerDuty Drives Auto Trader UK’s Incident Response
Auto Trader UK is the largest digital automotive marketplace in the UK and Ireland, attracting an average of 55 million platform visits every month from consumers searching and viewing car, van, and bike advertisements from almost 14,000 UK retailers. “We’re a business that’s based on the web, so we need to make sure our shop is open 24/7,” said Ryan O’Gorman, Senior Operations Engineer at Auto Trader UK. Through the continuous evolution of its digital platforms and innovation of its data products, Auto Trader UK makes the car buying process easier for its customers. But maintaining a reliable, faultless platform while simultaneously undergoing a public cloud migration can be difficult, making it more crucial than ever for Auto Trader UK’s operations team to be responsive and proactive when issues arise. As Auto Trader UK continues its public cloud migration, PagerDuty helps provide the company with flexibility in how they manage incident response, ensuring they can immediately take action and resolve incidents the second they arise.
No Alert Lost or Left Behind
The operations team manages and monitors the infrastructure for the entire enterprise. They are the first responders for alerts regarding Auto Trader UK’s systems, engaging with product development teams to resolve issues as needed. “As soon as the developers deploy, we do a lot of the maintenance to ensure application health,” Ryan explained. “If something goes wrong, we communicate with the dev teams and provide diagnostics to help them resolve it.” One of the challenges the team faced was that email alert notifications were sometimes delayed or never received. “Occasionally email alerts would arrive 10 or 20 minutes after the incident actually started,” Ryan shared.
“Worse, sometimes we wouldn’t get the email at all which would result in a delayed response to an incident.” By leveraging the SolarWinds integration, one of more than 300 integrations available with PagerDuty, the team could receive alerts directly within PagerDuty, entirely eliminating email alerts from SolarWinds. As a result, the team mitigated the risk of delayed or lost alerts. Additionally, with PagerDuty as the main alerting and notification platform, the team can respond faster and with more confidence than ever before. “All of our monitoring ties into PagerDuty,” he explained. “We have a good idea of what the incident is, based off the context embedded in the message of that alert. When we fix it, it’ll auto-resolve.”
"We love PagerDuty because it works for us now, and it’ll work for us in the future as we deploy it to different squads." - Ryan O’Gorman, Senior Operations Engineer, Auto Trader
Driving More Accountability on the Public Cloud Migration Journey
For Auto Trader UK, using PagerDuty is one step in a larger movement to embrace a DevOps culture. “The plan is to move away from a centralized management model and instead distribute alerts to the appropriate development teams. Meaning in the future we have the option to bring them on call so they have more ownership of their product, especially when it goes live in production,” Ryan said. The shift to a more decentralized incident response model is especially relevant to Auto Trader UK’s public cloud migration. The company is moving from a traditional on-premises environment to Google Cloud for more flexibility and scalability. “As we migrate to multi-public cloud environment, primarily on Google’s Cloud, a whole new set of tools and monitoring systems will spring up, and we can integrate those with PagerDuty,” stated Ryan. Since the PagerDuty platform is so versatile, the team has the flexibility to add more development teams when the organization is ready.
“We love PagerDuty because it works for us now, and it’ll work for us in the future as we deploy it to different squads. If we decide to change the company structure in the future, PagerDuty will help facilitate that,” he said.
Improving Work-Life Balance
Because PagerDuty captures every alert, the team can now easily stay on top of incidents and respond directly from the PagerDuty mobile app. Work-life balance has improved since the team can manage their own schedules and take action without having to disturb anyone else. “If we need someone to cover an overnight shift for maintenance, we can reroute alerts automatically with PagerDuty and make that transition silently,” explained Ryan. “That’s much better than waking someone at night, just to have them turn off alerts.”
Ensuring Customers Have a Seamless Experience
Aside from mitigating downtime risk and improving team health, the team also uses PagerDuty to proactively communicate with Auto Trader UK customers during incidents. The team uses PagerDuty’s StatusPage.io integration to automatically share updates when issues arise, providing more transparency to end users. “We wanted a more intuitive way of informing our customers whenever an outage occurs,” Ryan shared. “Alerts that go to our StatusPage service in PagerDuty will automatically generate an incident on our status page with relevant information, so our customers know we are already working on the issue. And once the incident is resolved, PagerDuty will resolve the incident within our StatusPage.” So, instead of the team having to manually create StatusPage incidents as part of their investigation, PagerDuty makes end user notifications an automatic, simple process for Auto Trader UK. To learn more about what PagerDuty can do for your organization and sign up for a free trial, visit www.pagerduty.com.
Xero Leverages PagerDuty and ChatOps to Improve Incident Response and Digital Operations
Xero is a global small business platform for accountants, bookkeepers, and small businesses. Founded in 2006, the platform offers small business owners and their advisors automatic bank and credit card account feeds, invoicing, accounts payable, and standard business and management reporting. Xero has an easy-to-use, intuitive interface so that even small business owners with little bookkeeping experience can accurately account for their transactions. A comprehensive education portal, as well as award-winning customer service, further supports small business owners if they have questions. For its active community of accounting partners, Xero offers additional functionality, such as a practice manager, advisory tools, and an app marketplace. With offices in the U.S., U.K., Asia, Australia, and New Zealand, Xero has more than 1.2 million subscribers in over 180 countries who rely on its software to help run their businesses. It’s therefore very important for Xero’s platform to be dependable, a responsibility that falls on the company’s developers and site reliability engineers.
Challenges
Anthony Angell, one of the Site Reliability Engineer Team Leads, explained that when he joined the company a few years ago, Xero was already using PagerDuty to manage two schedules. The production environment was supported by Operations teams located in Auckland, New Zealand, and Denver, Colorado. However, as Xero continued to grow rapidly, it became increasingly challenging for the Operations team to scale and coordinate schedules and escalation policies across the two sites. In 2016, Xero implemented a DevOps approach incorporating Site Reliability Engineering (SRE) to manage the production environment and overhauled its incident management processes.
Rather than having the operations teams oversee the entire production environment, this new incident management framework relied on the teams that built the software to be available and on-call in the event of an incident, regardless of whether they were a developer or a QA engineer. This meant many more people and teams were added to on-call schedules, and Xero needed a way to manage and scale the on-call groups, which is where PagerDuty came in. “[PagerDuty] helped us to be able to scale the on-call groups within the business quite easily,” Angell shared. “It has also given us and the business a better support structure.”
Business Impact
With PagerDuty, the site reliability engineering team also was able to educate many other teams about incident management and how alerting works on the platform. The result? Customers are seeing quicker resolution times because the people who developed, built, and continue to maintain the code are also the first responders should something go wrong. “The ability to get a hold of our responders in a timely fashion via different methods adds a lot of business value,” said Angell. To further automate and scale the incident management process, Xero’s Site Reliability Engineering team leverages ChatOps to support hundreds of employees around the world. Xero’s homegrown chatbot, “Multivac,” is integrated into the company’s Slack account and leverages PagerDuty’s API to automate several critical activities within Xero’s incident management framework. Using Multivac, Xero can onboard a new team and on-call schedule into PagerDuty by sending a request to Xero’s GitHub repository to automatically enable the configuration. Incident managers can use Multivac to notify the right team members to initiate the incident response process within PagerDuty and create a unique Slack channel for the incident.
Users can also request status updates on recent production releases or active alerts from Multivac, which provide needed context to troubleshoot incidents more quickly. By offloading many activities to Multivac and PagerDuty, Xero has been able to respond to and resolve incidents much faster. “In a one year span, from January 2017 to January 2018, PagerDuty analytics showed us that we saw a 40 percent reduction in high-urgency alerts. Not only that, but MTTR for high-urgency alerts, the highest urgency level, is down 74 percent.”
#PeopleFirst: Improved Work-Life Balance With PagerDuty
One of Xero’s core values is “human,” which puts a big emphasis on people, and the company expanded its use of the PagerDuty platform by leveraging analytics capabilities to gain insight into team health. “The analytics insight is helpful for our managers, particularly those on other teams, because they can see from the data how many alerts their team received over a specific time period,” explained Angell. “This is useful for when we need to take a closer look at the reasons for engineer fatigue. For example, we want to know if on-call responders received an unusually high number of alerts in a short time period, as that could put them at risk of burnout.” Additionally, Angell’s favorite part about PagerDuty is how it gives teams flexibility and ownership when it comes to on-call scheduling. Instead of having one team overseeing everything like before, Xero now has a number of distributed teams empowered to manage their own on-call schedules. “We’ve educated a lot of teams around incident management and how alerting and PagerDuty works, and it’s actually given the business a better MTTR,” said Angell.
What’s Next
Xero is expanding its use of the PagerDuty Digital Operations Management platform across a broader range of users and use cases.
The company has already taken some steps to evaluate team health on their own, and they hope to have more in-depth insight into how their teams are performing by adopting PagerDuty’s Operational Health Management Service (OHMS).
IG Trades Rigid Processes for Reliability, Flexibility, and Agility With PagerDuty
Based in the UK, IG Group Holdings (IG) is a global leader in derivatives trading. In the fast-paced and heavily regulated world in which it operates, speed, accuracy, and regulatory compliance all have equal weight. For IG, trades must happen in real time as even a second of downtime can have serious consequences for its clients and their portfolios. At the same time, regulators around the world hold IG to strict standards of performance, reliability, and transparency. IG must constantly innovate and deliver new products and features to its clients or risk irrelevance in a highly competitive industry. With IG’s mix of legacy and new technologies, fostering agile development while maintaining uptime was challenging. “Reliability in the financial world is very important. If you can’t trade when you have to, you’re losing money,” explained Hamed Silatani, Head of Application Services. “People don’t like to invest their money where they can’t access it when they need to; it’s absolutely critical that we minimize downtime.”
Creating Options for Development Teams
For IG products built on older technologies, changes must be made within scheduled maintenance windows in order to minimize the risk of regulatory penalties, which means having teams own and fix their respective code on-the-fly isn’t feasible. While DevOps practices work well for IG’s newer systems and technologies, the mix of legacy and new presented numerous challenges to the development teams: namely, they could not simultaneously develop products, respond to incidents, and improve the architecture. “The technologies built 10 or 15 years ago don’t lend themselves to a DevOps ecosystem. One size doesn’t fit all,” said Silatani. As a result, IG modified its digital operations so that products built on legacy technologies could benefit from DevOps in the same way its products built on newer technologies did, with PagerDuty playing an integral role.
This new approach embedded Silatani’s Reliability Engineering, Application Support, and Tools and Monitoring teams with the development teams so that they could address incidents and discover root cause issues much faster. According to Silatani, “Reliability is now seen as more of a community practice rather than a specific team that people call for troubleshooting.”
Swapping Manual Escalations for Automated Incident Response
Prior to PagerDuty, IG followed a highly manual escalation process, where scheduling and handing off support tasks across different time zones required days of advance planning and numerous manual steps. Incidents required on-call staff to be onsite, yet there was no clear ownership of incidents established across teams. With its automated scheduling and escalation features, the PagerDuty platform empowered IG’s teams to take ownership of their applications and quickly mobilize the right teams in real time when incidents occurred. Distributed teams across the UK, Poland, and India could now be easily enlisted to help. “IG has always had very good uptime, but it was achieved by spending a lot of human time and investment,” shared Silatani. “PagerDuty has helped us get incidents to the right set of people faster than ever before, with the touch of a button so that clients can continue to use the trading platform without interruption.”
Leveraging Mobile to Improve Work-Life Balance
Increased efficiency and speed aren’t the only reasons IG relies on PagerDuty. Life for on-call teams has become much better, thanks to PagerDuty’s mobile app. Teams no longer need to be at their computers when incidents arise, no matter the time zone. “PagerDuty’s mobile features are very handy and useful, enabling us to see incidents straight away and assign tasks using a mobile device,” Silatani said.
This better work-life balance has improved team health and reduced the risk of burnout and churn. “Improved quality of life for our employees is a key benefit we’ve achieved by using PagerDuty,” Silatani said. “PagerDuty has made things a lot easier for the operational people who support our applications.” Investing in the Future IG has over 300 users on PagerDuty, with more to come. With the successful deployment of PagerDuty, Silatani is already planning to make use of the platform’s other features. To improve the signal-to-noise ratio, Silatani plans on using PagerDuty Event Intelligence to analyze and improve alerting so that his teams can focus on actionable incidents. He also wants to leverage more of PagerDuty’s Modern Incident Response capabilities; specifically, automating post-mortem reporting to make it easier for his teams to implement best practices and key learnings for future incidents. “Day in, day out, we focus and think about how we can help Dev teams do things faster, and there’s a lot more we can do with PagerDuty,” he added. To learn more about what PagerDuty can do for your organization and sign up for a free trial, visit www.pagerduty.com.
Monzo Banks on PagerDuty to Improve Customer Experience
Monzo Bank Ltd. is a digital, mobile-only bank based in the United Kingdom and has over one million customers. Founded in 2015, Monzo is built from the ground up with a modern technology stack leveraging primarily open source software and microservices to run its bank operations. The iOS and Android mobile apps are the heart of the bank, with innovative features that provide more convenience for its customers while adhering to the strict standards and regulations that govern traditional bank security and privacy practices. Because of customer expectations surrounding access and usage of the mobile banking app, managing digital operations at Monzo means ensuring that the app runs faultlessly regardless of platform. “Engineers have more pressure to take action in real time,” Christopher Evans, Platform Team Lead at Monzo, shared. “You can’t go wrong and have one second of downtime. You have even more pressure than a traditional bank.” Customer Experience Beyond Banker’s Hours Since it’s a digital bank, Monzo’s hours of operation are 24/7. “Customers are the heart of everything we do,” Evans said. “Our services must always be up and accessible to our customers.” To that end, Monzo does not have scheduled downtime, unlike many traditional banks. To protect the customer experience, Monzo has been using PagerDuty to accelerate response to customer-impacting issues. “We put a huge amount of engineering effort into making sure we don’t have scheduled maintenance periods, but we’ve had downtime because of incidents, which are inevitable,” Evans explained. “We prioritize customer-impacting systems ahead of everything else. The team’s focus is providing a bank that has a world-class level of availability, and PagerDuty is the hub for mobilizing the right people in real time to fix issues in the fastest possible way. 
That’s where the value lies.” “We know that PagerDuty will always be up and that we can rely on it.” Christopher Evans, Platform Team Lead, Monzo Visibility a Huge Asset for Improving Application Performance As Monzo has grown, its use of PagerDuty has evolved alongside it. “In the beginning, our usage of PagerDuty was minimal,” Evans shared. “The majority of our alerting was done by our on-call engineers watching Slack channels.” Since then, Monzo has re-architected its monitoring and alerting solutions for more visibility into infrastructure health and system performance. The company has moved to the open-source monitoring system Prometheus for everything, from infrastructure monitoring to application performance management (APM). With this shift, Monzo can now monitor the metrics coming out of its applications. Monzo also configured its use of PagerDuty to gain more analytical insights by leveraging the metrics collected by the platform. Using PagerDuty’s integration with Prometheus, Monzo now sends all of its alerts to the PagerDuty platform so it can better track mean-time-to-action and mean-time-to-resolution. Additionally, because more alerts are flowing through PagerDuty, the on-call team has more context to immediately take action when issues arise. Investing in Team Health When Evans first joined the platform team, the on-call rotation consisted of only four engineers across the entire business, putting them at risk for burnout. “It was a really stressful experience,” Evans explained. “People wanted to drop out of the rotation.” To expand the pool of available resources and improve work-life balance, Evans used PagerDuty to implement an on-call “shadow program,” which consisted of a primary team who were paired with new people joining the rotation. 
Though the primary on-call engineers are in charge of driving action and response, those shadowing them are subject to the same SLAs, such as being no more than 15 minutes away from a laptop and responding to all of the alerts. PagerDuty’s automated on-call management features enable Evans to create multiple schedules for the primary responders and the shadow team. “That was the gateway for training the new on-callers to hone their skills in a safe, non-stressful way,” said Evans. Monzo now has eight primary on-call resources and an additional eight resources shadowing them, bringing the total team size to 16. “The general health of on-call engineers is better,” shared Evans. “Just before I joined, three or four engineers left the rotation because they were burnt out. But since we implemented the on-call shadow program, no one has left the rotation.” Charging Ahead Into the Future In its current phase of rapid growth, Monzo will continue scaling its on-call teams with PagerDuty so that individual teams are empowered to manage their own schedules. Evans also plans to leverage PagerDuty Analytics so he has insight into noisy services that require attention, as well as how individuals are doing on call. “We want to be able to monitor how heavily affected people are by being on call for any given week,” explained Evans. In addition, once Monzo has a significant amount of data flowing through PagerDuty, Evans plans to explore PagerDuty Event Intelligence and its machine learning capabilities to further improve Monzo’s real-time operations. “We know that PagerDuty will always be up and that we can rely on it,” said Evans. “I’m instantly less stressed because I don’t have that feeling of isolation. From day one, PagerDuty has really helped.” To learn more about what PagerDuty can do for your organization and sign up for a free trial, visit www.pagerduty.com.
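The mean-time-to-action and mean-time-to-resolution tracking described above can be illustrated with a short Python sketch. It is not Monzo's or PagerDuty's implementation; the incident records, field names, and timestamps are hypothetical and exist only to show how the two averages are derived from per-incident timestamps.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names and times are illustrative only.
incidents = [
    {"triggered": "2019-03-01T02:00:00",
     "acknowledged": "2019-03-01T02:04:00",
     "resolved": "2019-03-01T02:34:00"},
    {"triggered": "2019-03-02T14:00:00",
     "acknowledged": "2019-03-02T14:02:00",
     "resolved": "2019-03-02T14:12:00"},
]


def minutes_between(start, end):
    """Return the elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mean_time_to(records, field):
    """Average minutes from trigger to the given milestone across incidents."""
    return mean(minutes_between(r["triggered"], r[field]) for r in records)


mtta = mean_time_to(incidents, "acknowledged")  # mean time to action
mttr = mean_time_to(incidents, "resolved")      # mean time to resolution
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking both figures over time shows whether changes to alerting or to the on-call rotation are shortening the path from alert to fix.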
Trustpilot Gives PagerDuty a Rave Review
Trustpilot is a leading independent review platform, free and open to all. With more than 45 million reviews of over 230,000 companies, Trustpilot gives people a place to share and discover reviews of businesses, and gives every company the tools to turn consumer feedback into business results. Its mission is to bring people and businesses closer together to create ever-improving experiences for everyone. Trustpilot reviews are seen more than 3 billion times each month by consumers worldwide. While consumers read and submit reviews at no cost, Trustpilot offers multi-tiered, subscription-based pricing for businesses to leverage additional services, like promoting their Trustpilot score in a “TrustBox” on their respective websites. Businesses leverage the TrustBox to promote their quality of products or services to customers and prospects. With this kind of scale, it is essential that the Trustpilot platform never lags or falters so that it can continue to deliver real-time information to both customers and businesses. Empowering Teams to Respond Faster Morten Reinholdt Boelskifte, Site Reliability Engineering (SRE) manager at Trustpilot, leads a team of SREs focused on website availability, online performance, incident response, and infrastructure. Any platform downtime translates into lost customer traffic and loss of revenue. As a result, Boelskifte’s team must take action immediately when critical issues arise and address them with urgency. “3 billion reviews seen a month increases traffic on our infrastructure, and so it’s even more important for us to stay on top of everything,” said Boelskifte. “Ultimately, we ensure customer happiness.” The PagerDuty platform, with its on-call management, scheduling, and automatic notification capabilities, makes it easy for Trustpilot to engage the right people at the right time when business-impacting issues arise. “On-call rotations are very easy to set up and maintain, even down to the minute. 
We can even create multiple scheduling layers to help balance the workload for our on-call resources,” said Boelskifte. In addition to Boelskifte’s team, all engineering teams at Trustpilot have adopted PagerDuty. The flexibility to personalize individual notifications empowers team members to maximize their own efficiency. “Having teams be able to actually customize on an individual basis is quite powerful,” shared Boelskifte. Driving Continuous Improvement Trustpilot is a firm believer in continuous improvement, and tracks every incident and alert to identify ways for teams to become even more efficient and responsive when future issues arise. Trustpilot also tracks the number of alerts that teams receive, in an effort to combat alert fatigue. “PagerDuty streamlines everything, addressing not just a specific alert, but the grouping of alerts. This enables us to see where we need to make changes,” said Boelskifte. By using PagerDuty Event Intelligence capabilities, Trustpilot can determine if alerts are properly categorized and acted upon. “We ask ourselves if it’s actually a critical alert (P1) or a lower priority P2 / P3 that can be addressed and resolved later, during the team’s normal working hours. We want to figure out how we can prevent this in the future to reduce the risk of burnout for teams,” he added. Ultimately, all of this helps effectively balance team workload and improve team health. Boelskifte’s team also leverages blameless postmortems, which are part of PagerDuty Modern Incident Response. Events that occur during incidents are automatically documented so that best practices and key learnings can be incorporated into their incident response workflow. “The postmortem process is already set up in PagerDuty, and it’s more than self-explanatory on how to use it. It’s a pretty great feature,” Boelskifte said. “PagerDuty will fit your business as it is right now, but it will also grow alongside you. 
Every time we produce something new, we automatically include PagerDuty in it.” Morten Reinholdt Boelskifte, SRE Manager, Trustpilot Smarter Development, Faster Innovation At Trustpilot, all Engineering teams have full stack ownership, responsible for their applications and how services are set up and monitored. PagerDuty plays a key role in keeping the development engine running -- all alerts are funneled into PagerDuty, thanks to the 300+ integrations that make it simple for Trustpilot to set up monitoring. “The PagerDuty API really helps us. It’s well-documented, easy to use, but more than that, it’s so fast and easy to incorporate it into our entire pipeline. Every time we have a new service that we want to set up, we can incorporate PagerDuty into that,” said Boelskifte. As Trustpilot continues to grow and evolve, Boelskifte anticipates that PagerDuty will be there every step of the way. With PagerDuty, Trustpilot can scale and deliver a faultless digital experience for their customers around the globe. PagerDuty empowers software engineers at Trustpilot to focus on innovation. According to Boelskifte, “We are evolving all the time. We see a lot of adoption of PagerDuty at Trustpilot, and it’s clear that PagerDuty can grow with us.” To learn more about what PagerDuty can do for your organization and sign up for a free trial, visit www.pagerduty.com.
Good Eggs Keeps Produce Fresh by Integrating PagerDuty
Good Eggs provides same-day deliveries of fresh groceries and meal kits to San Francisco Bay Area homes. Assistant Director of Operations, Tannia Hernandez, and Facility Manager, Juan Mayora, share how PagerDuty enables warehouse operations and development teams to analyze signals from refrigeration units to ensure food stays fresh for deliveries.
Centro Drives Better Business Outcomes through PagerDuty Automation Capabilities
Centro provides digital advertising and buying software that helps advertisers streamline and scale digital campaigns. Jeff Smith, Director of Production Operations, shares how PagerDuty makes Centro’s on-call process more humane, empowers teams to take action in real time, and accelerates response in order to resolve more incidents.
PagerDuty Headlines The Telegraph’s Digital Operations Management
Founded in 1855, The Telegraph is a multimedia news brand that operates a print newspaper, website, and different mobile apps for more than 25 million readers each month. To succeed in a sector wracked by massive change and consolidation, the company has continuously reinvented itself, revamping its digital operations and learning some important lessons along the way. As Head of Technology for Platforms and Engineering, Lucian Craciun oversees the teams that build, manage, and run the APIs that power The Telegraph’s digital properties. Though The Telegraph leverages microservices for its newer applications, the teams must also support legacy systems and applications. With such a complex environment to support, visibility into performance issues was very challenging, especially since separating signal from noise in the existing monitoring and alerting tools was time-consuming. Too Many Alerts Delivered Bad News for Teams The Telegraph’s website receives approximately 100 million hits a day, in addition to readers browsing its mobile apps for content. To ensure maximum uptime, The Telegraph monitors its services for performance issues and alerts if certain thresholds are breached. However, this process was not seamless. “Our monitoring system would fire the same alert every 5 minutes until it was resolved,” Craciun explained. “We had to do extra work to filter it out or aggregate the alert into one incident, which resulted in more time required to resolve the incident.” Editing Out the Noise With over 300 integrations and an easy-to-use API, setting up PagerDuty in The Telegraph’s environment was fairly simple and straightforward. The Telegraph then quickly integrated PagerDuty with monitoring and collaboration tools such as Datadog, Jenkins, Slack, and JIRA, which enabled Craciun’s teams to address some of their more challenging issues. “PagerDuty is simple to set up. We just decided which services to add and then mapped them to PagerDuty’s platform. 
Things are working out well without having to do much customization,” said Craciun. Better Time Management Because the scheduling functionality was so easy to use, The Telegraph quickly began adding more teams to PagerDuty, subsequently rolling it out to IT, software engineering, and system engineering. The previous tool used by The Telegraph did not provide enough flexibility for Craciun’s teams to make on-call scheduling modifications on-the-fly, which was a key pain point that PagerDuty easily solved. With PagerDuty, on-call responders are now empowered to manage their own schedules, with full visibility into the specific teams and people on call at any given moment. “PagerDuty has made our lives easier and improved our quality of work by giving us a simple way to create and change scheduling rotations,” Craciun shared. “PagerDuty gives us everything we want.” - Lucian Craciun, Head of Technology -- Platforms and Head of Engineering, The Telegraph Getting the Full Story with Operations Command Console In the past, Craciun’s team had to spend more time distinguishing real incidents from false positives. By providing a window into the operational health of The Telegraph’s services in real time, PagerDuty’s Operations Command Console (OCC) enables Craciun’s teams to be more proactive in resolving issues before they impact customers. Because The Telegraph utilizes many interdependent microservices, one status failure could impact several other services in short order, which can make troubleshooting very challenging. But with OCC, Craciun’s teams can see the services that are degrading with one quick glance and immediately take action. “The visibility provided by PagerDuty significantly decreases the time required to acknowledge and recover from incidents,” said Craciun. Notably, one of the benefits of increased visibility is a decrease in calls to The Telegraph’s service desk, demonstrating the success of the implementation. 
The Inside Scoop on What’s Next Since the PagerDuty deployment has been so successful, The Telegraph plans to add more users and functionality as time goes on. In particular, the company intends to delve into PagerDuty’s Stakeholder Notification feature, as well as Analytics to better understand alert, response, and resolution times. With its digital operations on track, The Telegraph is able to respond and resolve incidents faster and more efficiently. “During the wedding of Prince Harry and Meghan Markle, we had 2.5 times more traffic than we usually get. And everything worked great,” Craciun shared. To learn more about how PagerDuty is helping media companies with digital operations management in the U.K., U.S., and abroad, visit www.pagerduty.com for more information.
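The problem Craciun describes, the same alert firing every 5 minutes until resolved, comes down to grouping repeated alerts under a single incident. The Python sketch below illustrates that idea only. It is not The Telegraph's setup or PagerDuty's internal logic, and the service names, check names, and the choice of (service, check) as the grouping key are assumptions made for the example.

```python
from datetime import datetime, timedelta


def group_alerts(alerts):
    """Collapse alerts that share a (service, check) key into one incident."""
    incidents = {}
    for alert in alerts:
        key = (alert["service"], alert["check"])  # assumed grouping key
        if key not in incidents:
            incidents[key] = {"first_seen": alert["time"],
                              "last_seen": alert["time"],
                              "count": 0}
        incidents[key]["count"] += 1
        incidents[key]["last_seen"] = alert["time"]
    return incidents


# Hypothetical data: one check re-fires every 5 minutes for half an hour,
# while a second, unrelated check fires once.
start = datetime(2018, 5, 19, 12, 0)
alerts = [{"service": "content-api", "check": "latency",
           "time": start + timedelta(minutes=5 * n)} for n in range(6)]
alerts.append({"service": "search", "check": "errors", "time": start})

incidents = group_alerts(alerts)
for (service, check), info in incidents.items():
    print(service, check, info["count"], "alert(s)")
```

Seven raw alerts become two incidents, so a responder is interrupted once per underlying problem rather than once per repeat.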
Flixbus Drives Operational Efficiency With PagerDuty
Flixbus, Europe’s largest long-distance bus service founded in 2013, is a unique combination of tech startup, e-commerce platform, and sustainable transportation company. It quickly became the leading long-distance travel provider within Germany before expanding to other European countries in 2015, and to the United States in 2018. On a daily basis, more than 300,000 passengers travel to over 1,700 destinations across 28 countries. Flixbus is revolutionizing traditional bus travel by providing user-friendly features, including the Flixbus App, e-ticketing, GPS live tracking, and the automated Delay-Management System, all of which run in real time. And with so many transportation options in Europe, the developers at Flixbus are constantly delivering more features to remain competitive. “The DevOps team is probably the most important team when it comes to business continuity because they are the first line of defense when it comes to maintaining the customer experience,” said Jasper Spruytte, Engineering People Lead at Flixbus. “If something isn’t working, then we don’t have a platform to sell tickets or check in passengers.” For Spruytte, this means that his Ops and DevOps teams must have 24/7 coverage across all digital channels, with real-time visibility into system performance. But with regulations governing overtime compensation, Flixbus must manage on-call schedules and escalations efficiently and effectively to keep costs low. Accelerating Response Helps Flixbus Go the Distance Before PagerDuty, the on-call process was manual and time-consuming. Once the appropriate people were notified, the response was further delayed due to lack of visibility into application performance. “The modus operandi before PagerDuty went like this: Somebody knows something doesn’t work, people get pinged on Slack, and someone will eventually respond and fix it. This was a horrible process,” Spruytte shared. 
Flixbus envisioned a more modern approach that could empower teams to respond in real time and fix issues much faster so that its passengers could continue to travel relaxed and stress-free. “One of our board members proposed PagerDuty,” explained Spruytte. “PagerDuty has a good resume of customers, so we opted for that. We did a small pilot in the beginning and then quickly adopted it into our system.” The teams Spruytte oversees make good use of PagerDuty’s 300+ integrations and custom APIs to monitor their applications, most notably New Relic and AWS CloudWatch. Flixbus also monitors Kubernetes clusters and Adyen payment processing. After PagerDuty was deployed, mean time to respond significantly improved; from 2016 to 2017, the teams saw a 60 percent decrease in high-urgency incidents. Taking On-Call Compensation in a New Direction Beyond incident response, Flixbus will be using the PagerDuty platform to automate its revamped on-call compensation program. Typically, organizations pay overtime as a set percentage on top of salaries for their on-call resources. But the teams at Flixbus wanted more options in terms of compensation for on-call rotations, and Spruytte wanted to provide incentives for teams to respond faster. So he created a program that converted on-call compensation into points, where responders earned points for different scenarios. For instance, a person on call would automatically receive 200 points and would earn varying point levels depending on whether an incident occurred off hours during the week or on the weekend. If other people were added to the incident response, they would also earn points, thereby encouraging collaboration and resolving the incident faster. These points could then be used in a marketplace where people could choose additional paid time off, cash payouts, gift cards, or other options. Currently, the points-based system is manually tracked and updated in a spreadsheet for each on-call person. 
Since PagerDuty’s scheduling capabilities automatically track when resources are on call and actively working incidents, Spruytte plans to integrate the points-based system with PagerDuty for easier tracking and management. “With PagerDuty behind the wheel of this cultural shift, Flixbus plans to maintain its leading position in long-distance bus travel.” - Jasper Spruytte, Engineering People Lead, Flixbus The Road Ahead In addition to the teams that Spruytte manages, the internal IT and payment teams at Flixbus are also using PagerDuty, with plans to add seven more teams. With the time saved from faster incident response and improved collaboration across teams, Flixbus developers can devote more time to innovation. “PagerDuty has proven its worth already. Clarity has been improved, so people know what they need to do and how fast they have to respond before customers are impacted. The philosophy we want to have is that every team should feel responsible for ensuring the uptime or continuity of their component or product within the framework that we create.” With PagerDuty behind the wheel of this cultural shift, Flixbus plans to maintain its leading position in long-distance bus travel. To learn more about what PagerDuty can do for your organization and sign up for a free trial, visit www.pagerduty.com.
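The points-based compensation scheme described above lends itself to a small worked example in Python. Only the 200-point base award comes from the case study. The per-incident and helper values below are invented for illustration, as are the names, and the sketch does not represent Flixbus's actual spreadsheet or any PagerDuty feature.

```python
BASE_ON_CALL_POINTS = 200  # stated in the case study

# The values below are assumptions; the case study gives no figures.
INCIDENT_POINTS = {"weekday_off_hours": 50, "weekend": 100}
HELPER_POINTS = 25


def score_rotation(on_call, incidents):
    """Total the points earned by the on-call person and any added helpers."""
    points = {on_call: BASE_ON_CALL_POINTS}
    for incident in incidents:
        points[on_call] += INCIDENT_POINTS[incident["when"]]
        for helper in incident.get("helpers", []):
            points[helper] = points.get(helper, 0) + HELPER_POINTS
    return points


totals = score_rotation("alex", [
    {"when": "weekday_off_hours"},
    {"when": "weekend", "helpers": ["sam"]},
])
print(totals)
```

With these assumed values, the on-call person ends the rotation with 350 points and the helper with 25. Because a scheduling system already records who was on call and who joined each incident, those records could supply the inputs to a calculation like this one in place of a manually maintained spreadsheet.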
AEO Transforms Its Incident Management With PagerDuty Customer Success
What runs through your head when shopping for jeans online? If you’re anything like me, you’re likely contemplating how they’ll fit, whether your phone will fit in the pockets, and whether that particular style (skinny, boyfriend, jegging) is something you want to add to your wardrobe. American Eagle Outfitters (AEO), a clothing and accessories retail giant with over 1,000 stores, offers jeans in a variety of fits and washes, which can help when trying to find that perfect pair of jeans. The brand is particularly popular among college students, and the company’s midflight eagle is easily identifiable and unmistakable. In addition to its strong brick-and-mortar presence, the retailer also has a powerful online presence and a mobile app, which are both instrumental in helping it continue to grow its business. On the AEO website, shoppers can find apparel in all sizes and styles, as well as clear support and partnership for projects like It Gets Better. But while shopping, how often do you think about what runs in the background so you can easily browse, add your items to your shopping cart, and check out? Digital Transformation at AEO Unless your background is in IT, you likely don’t think twice about the up-to-date online presence and payment processing systems that are the work of Matt Kundrat and his team. Matt joined AEO in November 2012 and, in his six years with the company, has played an instrumental role in its digital transformation. I have the pleasure of being AEO’s Customer Success Manager and working with Matt in his current position as Senior Manager of Production Support. After AEO first purchased PagerDuty, Matt and I discussed some key challenges his team and AEO faced. Every alert went out to his team’s distribution list, which meant all 25 people on his team were contacted. Each person was always on call and had no easy visibility into whether someone else had taken ownership or could assist. 
Additionally, if people could not solve the issue, his team had no way to escalate it or loop in other teams. SLAs were essential to ensure issues were solved very quickly, especially if they were customer-impacting. Matt recognized that his team was not meeting their SLAs and needed a way to measure their time to acknowledge and resolve. The team was dispersed around the world and Matt needed a way to have a 24/7 schedule where each person was on call only during their working hours. He also needed an easy way they could quickly contact each other. The last large pain was the challenge of alerting business stakeholders when customer-impacting issues did occur. In addition to these pain points, AEO’s business sectors were highly segmented and had challenges with cross communication. AEO realized that, in order to maintain its strong online presence, the right people needed to be informed when issues arose. At the same time, responders needed to be able to focus on resolving said issues. The company concluded that it needed to update its incident management process. This was when Kundrat started searching for a solution that could improve his team’s and companywide communication and continue to scale with AEO while also reducing alert noise and burnout. PagerDuty was an easy choice as it allowed his teams to communicate through one centralized platform while still integrating with all the tools they were familiar with and used. Improved Work-Life Balance I helped Matt set up his account according to our best practices, train and onboard his initial customer service teams, and share PagerDuty’s growing suite of capabilities. Together, Matt and I have been able to solve each of his pain points. 
For example, instead of people being bombarded with alerts for each incident and then waiting for someone to take ownership, Matt put his team on a scheduled rotation, solving the issue of being on call 24/7 and bringing work-life balance back to his team, both locally and globally. They can now also escalate the issue to the right people in real time with the click of a button. His team quickly saw their lives improve as their manual processes and number of clicks decreased. Other teams, such as the Corporate IT and Security teams, noticed too, and I worked with Matt to onboard them as well. As Matt mentioned on a recent call, teams across the organization went from being afraid to use PagerDuty to not being able to imagine their lives without it. Additionally, Matt is fully utilizing all the capabilities PagerDuty has to offer as a digital operations platform, encompassing the full lifecycle of an incident. By using response plays, similar incidents, and our event intelligence capabilities, he is now seeing improved response times, more effective business communication, and better work-life balance within his team and the organization. Through PagerDuty, AEO has created a unified front during customer-impacting events to make sure everything is up and running. We look forward to continuing to partner with AEO and seeing the success and growth of their company with PagerDuty. Check out the entire case study: American Eagle Outfitters Taps PagerDuty to Transform Incident Management.
American Eagle Outfitters taps PagerDuty to transform incident management
American Eagle Outfitters (AEO) opened its first store in 1977 and today boasts more than 1,000 stores and 40,000 associates worldwide. Part of what keeps this successful retailer in the game is its full-throttle embrace of digital technology, including retail websites, mobile apps, and order management systems. But it’s no cakewalk to keep digital services across multiple sales channels running smoothly around the clock in both the U.S. and abroad. What’s a retailer to do? Many Tools and Fragmented Processes A couple of years ago, AEO’s leadership noted that the company’s manual incident management process was highly fragmented. AEO’s teams use a plethora of different alerting, ticketing, and monitoring applications. “Worse yet, no one was willing to give up their tools,” shared Matt Kundrat, Sr. Manager of Production Support. “It was really holding us back from true, unified incident management.” “All alerts and reported incidents were being sent to a single email box,” he continued. Staff—who had to focus on responding to alerts and incidents—often didn’t understand their role and wasted time trying to figure out who was on call and how to reach them. At the same time, there was the “jump-on-a-grenade” approach where someone would send an alert email and multiple people would respond. In short, AEO needed to modernize and standardize its approach to incident response across its entire organization. PagerDuty Makes the Grade Kundrat and other managers evaluated different vendors, but PagerDuty as an industry leader got their attention. PagerDuty supports over 200 integrations, including the various tools used by AEO’s teams. “We were able to meet the goals that we outlined for incident management, while allowing everyone to continue to utilize their own tools,” said Kundrat. Teams could continue to follow their own SLAs and escalation policies, but AEO standardized alerting and notification on PagerDuty. 
“We found that PagerDuty has the most advanced and streamlined tool on the market. And people really liked its UI the best.” When deployment of PagerDuty began late last year, AEO onboarded the digital teams first, as they had multiple monitoring tools spanning their entire digital technology landscape. “With PagerDuty, one of the biggest gains we saw was the visibility into the problems that happened,” noted Kundrat. Now, every team across the organization, approximately 200 responders, is on PagerDuty, just a few short months after rolling it out to the digital teams. “From our perspective and for our needs, PagerDuty has the most advanced and streamlined tool on the market.” - Matt Kundrat, Sr. Manager of Production Support, American Eagle Outfitters Intelligent Alert Notification Tops the List of Big-Time Benefits PagerDuty has transformed AEO’s incident management processes in several key ways, enabling the company to successfully handle everything from the smallest customer complaint to the biggest, system-wide catastrophe. Specific benefits include: Intelligent Alert Notification. PagerDuty enables AEO to get alerts into the right hands so staff can respond quickly. “The incident management piece is very key. We're able to get those alerts out of email and onto people's phones,” Kundrat explained. “It’s allowed us to implement a very controlled, manageable on-call schedule.” Now, instead of multiple people checking a single inbox, PagerDuty notifies the right teams, who can immediately start working on the incident from PagerDuty’s mobile app. Seamless Integration. The first thing AEO did with PagerDuty was integrate all of its different tools. In doing so, the company can meet its service level agreements while offering employees flexibility in how they work. 
“It didn't matter what the ticketing tool was, it didn't matter what the alerting tool was, it didn’t matter what the monitoring tool was: PagerDuty allowed us to aggregate them all together at a service level,” said Kundrat. “PagerDuty enabled the teams to still use the tools that they wanted to use.” Engagement. Using PagerDuty, AEO now addresses incidents more quickly than before. “We saw reactiveness improve,” Kundrat shared. Moreover, by providing increased visibility and analytics, PagerDuty has enabled AEO to gain valuable insights and quickly identify issues that warrant further attention. “With PagerDuty used as a notification aggregator, it allows us to better react to large issues that need immediate attention,” said Kundrat. Better Work-Life Balance. Instead of being awakened in the middle of the night by seemingly random alerts, employees are notified only of incidents that require immediate action, thanks to PagerDuty’s ability to separate the signal from the noise generated by thousands of alerts. For example, Thanksgiving is AEO’s third-biggest day of the year. “PagerDuty allowed us to manage the monitoring and responsiveness to issues from our phones, while letting us spend more quality time with our families during the holiday,” said Kundrat. “Once employees realized that PagerDuty supports work-life balance, everybody who now uses it loves it.” Improved Business Communications. AEO’s executives are very hands-on and want to stay informed during customer-impacting incidents. Kundrat’s organization can keep the business apprised of incidents while enabling the technical teams to stay focused on incident resolution. “PagerDuty allows us to get the message out quicker and keep everybody on the same page,” said Kundrat. Unified Incident Management the Result—But Only the Beginning Whereas AEO’s approach was previously fragmented, PagerDuty has helped unify incident management across the entire organization like never before. 
By using PagerDuty, AEO has been able to successfully implement a “follow-the-sun” methodology, providing 24/7 support across the globe and across teams. Building on this success, AEO plans to continue evolving its incident management strategy using PagerDuty. The customer service team uses PagerDuty, with the security team not far behind. “Different teams utilize different on-call methodologies, and we’re able to showcase that to other teams to demonstrate how PagerDuty might work for them,” said Kundrat. They’re also looking forward to leveraging more advanced Event Intelligence capabilities for reducing operational noise and better understanding patterns in their monitoring data. According to Kundrat, the importance of unified, on-call management across the organization is huge. “We're getting closer and closer to being a single team,” Kundrat noted. “PagerDuty is really helping us unify and really understand things at more of a holistic level across the technology organization.”
TechSoup’s Recipe to Help Nonprofits Succeed
Have you ever heard of TechSoup? Despite its name, TechSoup is not a company that uses the latest tech to create a recipe for tasty soup. Rather, it’s a nonprofit that gathers all the ingredients (in this case, a myriad of technology solutions) to support other nonprofits in achieving and accomplishing their missions. The organization manages donation programs through 60+ partnerships with hardware and software service providers such as Symantec, Adobe, Cisco, Microsoft, and PagerDuty. With over 200 employees worldwide, TechSoup supports more than 60 web platforms globally in a variety of different languages. It provides this service through its partner network to nearly every country in the world, and works with those partners to manage their donor programs. It also manages a lot of the back-office work and coordination tied to the donor programs. “In addition to the product or donation side of the house, we also bring nonprofits together so they share their thoughts and experience, and leverage what each has done in terms of using their technology,” shared Michael Enos, Senior Director of Community and Platform at TechSoup. Monitoring a Swirling Stew of Applications Working with companies and partners around the globe meant TechSoup needed to stay on top of a number of their own internal applications. “We monitor our main website and our back-end systems—which includes a fulfillment system and an enterprise service bus—which are all used to feed other various systems and services that are also monitored,” Enos explained. In addition to these applications, TechSoup also needed a system to monitor 300-400 isolated nodes that provide service to its entire network. With all the monitoring required, the organization needed a soup-to-nuts solution—and one that could be customized to its unique network. That’s where PagerDuty came in. 
“The fact that we could customize PagerDuty’s features was the deciding factor,” said Enos when asked why he chose the platform over other solutions. With over 1,000 servers and systems in different environments and using their own event monitoring systems, TechSoup decided to use PagerDuty to centralize the escalation of issues so that upper management could easily stay informed. “Considering the number of systems we’re supporting, there’s usually something going on at any given time, and they’re usually handled close to where the action is,” Enos explained. “We needed something that would send an issue up the escalation chain when someone’s not responding within a certain amount of time." A Soup-erb Solution for Support Staff TechSoup also needed to optimize its off-hours support staff, which included sending alerts to first responders when an issue arose—and escalating that issue if necessary. In the past, if someone couldn’t log into their computer or they got locked out of their accounts, they would call TechSoup’s help desk, and the help desk would then contact whoever was scheduled to be on call. If the phone lines went down, no one would be alerted about issues. That all changed with PagerDuty. “We were able to configure PagerDuty so that if an alert comes in and matches specific criteria, it would then create an incident and alert someone,” said Enos. “We wanted something cloud-based so that if there are any issues with our infrastructure or if our network goes down, we would still be able to get in contact with our first responders.” Taking Stock of the Current State of Affairs Today at TechSoup, PagerDuty is used by the help desk, infrastructure, community, and platform teams. “It just works; we’ve never had an incident where PagerDuty wasn’t available,” Enos shared. “We can get as granular as needed or use it for the most basic alerting. 
Additionally, the extensibility of multiple platforms—my Mac, my PC, the PagerDuty app on my phone—means I can acknowledge or resolve an issue through a number of different ways.” “We all sleep better at night now because the calls that are being picked up are the ones that are truly important,” concluded Enos.
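The escalation behavior Enos describes, where an alert that matches specific criteria opens an incident and the incident moves up the chain if nobody responds in time, can be sketched in a few lines of Python. This is a minimal illustration only: the rule format, the `matches` and `who_to_notify` helpers, the responder names, and the timeouts are all assumptions made for the example, not TechSoup's configuration or PagerDuty's API.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    responder: str        # hypothetical responder name
    timeout_minutes: int  # how long this level has to acknowledge


def matches(alert: dict, criteria: dict) -> bool:
    """Return True when every key in the criteria has the same value in the alert."""
    return all(alert.get(key) == value for key, value in criteria.items())


def who_to_notify(levels: list[EscalationLevel], minutes_unacknowledged: int) -> str:
    """Walk the chain and return whoever should be notified at this point in time."""
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Every level timed out: stay with the last responder in the chain.
    return levels[-1].responder


# Hypothetical rule and chain, for illustration only.
criteria = {"severity": "critical", "source": "fulfillment"}
chain = [
    EscalationLevel("on-call engineer", 15),
    EscalationLevel("team lead", 15),
    EscalationLevel("director", 30),
]

alert = {"severity": "critical", "source": "fulfillment", "summary": "queue stalled"}
if matches(alert, criteria):
    print(who_to_notify(chain, 0))   # first responder is notified right away
    print(who_to_notify(chain, 20))  # still unacknowledged, so the next level is notified
```

Keeping the matching rule separate from the escalation chain mirrors the two ideas in the quotes: only alerts that meet the criteria become incidents, and only unanswered incidents climb the chain.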
Leading Fintech Company Funding Circle Uses PagerDuty to Help Scale Growth
Overview As the leading small business loans platform in the U.K., U.S., Germany, and the Netherlands, no one knows the challenges of managing digital operations to deliver innovation and a great customer experience in a highly regulated industry better than Funding Circle. Despite these challenges, Funding Circle has grown at an impressive rate. With over £4.5 billion in loans facilitated to more than 45,000 businesses to date, the company is highly focused on providing the best customer experience. This translates into hiring top engineering talent and working hard to keep them engaged and focused so that they can deliver innovative features while ensuring software quality and performance. With Fast Growth Comes More Complexity Currently, Funding Circle has hundreds of engineers dispersed throughout the U.K., U.S., and Germany, and expects to expand its global engineering function by another 30 percent in 2018 alone. But with that growth comes more complexity—the larger the teams, the more challenging it becomes to coordinate resources and schedules. For Funding Circle, ensuring consistent quality for rapid product releases and sharing knowledge, accountability, and ownership of code across teams require the ability to scale and automate as much as possible. In addition to providing application infrastructure, Funding Circle’s Agile-centric engineering division is responsible for ensuring platform availability, with an eye on achieving 99.98 percent uptime (which translates to only about 17 seconds of planned downtime daily). To meet this goal, Funding Circle implemented PagerDuty. Today, the company has roughly 33 engineering teams as well as non-technical personnel using the PagerDuty platform. Measuring Performance and Leveraging Real-Time Visibility As part of Funding Circle’s focus on delivering a great customer experience, understanding how its software is performing is critical. 
The company uses PagerDuty to look at key performance indicators (KPIs) from both a development point of view (which includes the number of failed deployments and mean time to resolution) and a business point of view (which includes overall uptime and number of incidents per quarter)—and then automates the reporting of those numbers to its board of directors and other key stakeholders. Automation was a simple task: Funding Circle easily integrated the monitoring tools for its applications and infrastructure with PagerDuty, enabling the company to leverage PagerDuty’s reporting and analytics to track overall operational performance with accuracy and clarity. The engineering teams can now focus on improving code quality so that they can respond faster when potential issues arise. “Leveraging metrics within PagerDuty to accurately capture and report KPIs is a key way to establish whether we’re actually improving as a business,” shared Paul Whyte, Engineering Manager at Funding Circle UK. “We move incredibly quickly here and PagerDuty helps us clarify and balance many different priorities.” “PagerDuty is the industry standard for every company I’ve ever worked in.” - Paul Whyte, Engineering Manager, Funding Circle UK Reducing Burnout and Increasing Retention Funding Circle traditionally struggled to create accurate schedules for its global workforce. By applying PagerDuty’s easy-to-use escalation and scheduling features, the company can now quickly identify and contact the right people when an incident occurs. “People were being woken up at two or three o’clock in the morning,” Whyte shared. “By implementing PagerDuty and follow-the-sun on-call scheduling, those middle-of-the-night phone calls stopped practically overnight.” “PagerDuty has significantly improved our performance in dealing with incidents, and faster fixes mean we have happier customers,” he added. As successful as the PagerDuty deployment has been, Funding Circle has only scratched the surface of the platform’s functionality. 
The company plans to explore new visibility and incident management features, as well as PagerDuty’s integration with Slack. To learn more about how PagerDuty is helping FinTech companies with digital operations management both in the U.S. and abroad, visit www.pagerduty.com for more information.
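The relationship between an uptime target and a daily downtime budget is simple arithmetic, and a short Python sketch makes it easy to check. The function name and the two example targets are chosen here for illustration; a day is taken as exactly 86,400 seconds.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def daily_downtime_seconds(uptime_percent: float) -> float:
    """Seconds per day a service may be unavailable while still meeting the target."""
    return SECONDS_PER_DAY * (100.0 - uptime_percent) / 100.0


# Compare two example targets that differ by a single decimal place.
for target in (99.8, 99.98):
    budget = daily_downtime_seconds(target)
    print(f"{target}% uptime allows about {budget:.1f} seconds of downtime per day")
```

A target of 99.98 percent works out to roughly 17 seconds a day, while 99.8 percent allows roughly 173 seconds, so one extra "9" shrinks the budget tenfold.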
Time Efficiency Essential for Cornea Donation, Restoring Sight to the Blind
Founded in 1969 and based in Seattle, Washington, SightLife is a non-profit global health organization working to prevent and eliminate corneal blindness worldwide by 2040. Approximately 12.7 million people around the world suffer from corneal blindness, 98 percent of whom live in developing countries. SightLife’s efforts are focused in the U.S. on eye banking operations and internationally on training and capacity building, community outreach, prevention, and advocacy. It’s 2 a.m. on a Tuesday, and Dave is at the SightLife offices on night shift with several other Transplant Donor Coordinators (TDCs), when he receives a notification email: A death just occurred and the deceased could possibly be a donor. As time is of the essence in transplant donation, Dave needs to act fast. His focus is restoring sight to the blind through corneal transplants. The cornea is the transparent front part of the eye that covers the iris and pupil. It is also the only part of the eye that doesn’t contain blood vessels, which allows us to see clearly. Instead, tears from blinking bring oxygen to the corneas to help keep them healthy and prevent infections. Dave’s first step is to review medical records and other relevant information, which takes place within 5 minutes of notification. He learns that the deceased is a 68-year-old male with a rare, terminal illness. Although this medical history may not have impacted the cornea, which is resilient to many illnesses, Dave needs to be certain. He triggers a PagerDuty alert so that he can review these clinical details with the Medical Review team, who are SightLife’s eligibility experts. These individuals are designated by SightLife’s Medical Director to ensure the safety of transplant tissue for recipients. PagerDuty allows Dave to rapidly access their expertise which saves critical minutes in the recovery process. The Medical Review Manager quickly identifies the illness as harmless to the cornea and advises Dave to proceed. 
The next step is to obtain clearance from the coroner. Dave, however, has trouble reaching the coroner who is out in the field, so he uses PagerDuty to reach a partner advocate on SightLife’s Partner Relations team. Allison, from Partner Relations, works through the complexities of the death investigation to ultimately obtain release for tissue donation from the coroner. Now it’s time for Dave to contact the deceased’s next of kin. For many, this is the most difficult part of the process. In some situations, it’s necessary to obtain next-of-kin approval for donation to take place. In others, the deceased may have already registered as a donor, in which case their wish is upheld. In both scenarios, the TDC must balance sympathy and understanding for the grieving family while conveying information clearly and with a sense of urgency. In Dave’s case, the deceased was a registered donor. After a discussion with the family, Dave dispatches a procurement technician, who sets out immediately to the hospital to recover the tissue. The procurement technician arrives at the hospital and heads straight to the morgue, only to find it locked. The clock is ticking. Although corneas can be recovered up until 24 hours after time of death, the ideal recovery is within 12 hours. Needing the morgue unlocked ASAP, the tech gets in touch with Dave, the TDC, for help. Dave once again sends a notification via PagerDuty to SightLife’s Partner Relations team member, Allison. In addition to partnering with coroners, the Partner Relations team works to build relationships with hospitals, funeral homes, first responders, and other community advocates to help facilitate the donation process. Dave quickly receives an acknowledgment from Allison, assuring him of her work on this case. Fifteen minutes later, a security team member, keys in hand, is hurrying toward the procurement tech, who’s still waiting in front of the morgue. 
The tech starts the full-body examination, taking note of any signs of disease, identification marks, or obvious damage to the cornea. Once he determines the deceased meets the donation criteria, the tech starts the in situ corneal excision procedure, which is a multiple-step process. The cornea, though resilient, is delicate, and if not recovered correctly, it can’t be used in a transplant. After about an hour, the tech places the cornea and scleral rim in a preservation media, which is maintained at a temperature of 2 – 8 degrees Celsius, and transports it back to the nearest SightLife lab for further medical testing. The media preserves the tissue for up to two weeks, allowing the corneas to be best placed with recipients who have been waiting to have a cornea transplant. There are many steps and many people involved in procuring, preserving, and transplanting corneas. SightLife coordinates more than 700 donations per month across five regional U.S. geographies. These efforts support more than 48,000 corneal transplants performed each year in the United States. Utilizing PagerDuty is essential to the process when SightLife needs quick access to experts. That timely exchange helps to overcome obstacles and make it possible for individuals whose lives are vision-compromised to experience restored sight. For more information or if you would like to sign up as a donor, visit www.sightlife.org.
PagerDuty Brings Better Visibility to Consumer Comparison Site Verivox
Verivox, one of Germany’s leading comparison sites for utilities, mobile, insurance, and more, serves over eight million consumers looking to compare prices and switch service providers. With so many customers relying on Verivox to provide them with accurate information, Verivox’s website must remain stable and reliable. And with competitors snapping at its heels, 13 development teams pushing out new features weekly, and its engineering teams dispersed across the country, the company needed a better way to scale and automate its digital operations in order to mitigate downtime. In the past, Verivox relied on its site reliability engineering (SRE) team to manually review alerts and notify teams of incidents. However, the company’s alerting protocols routinely triggered invalid alerts, eating up resources and thwarting visibility into network health. Additionally, with one person on call for an entire week after business hours (including weekends), Verivox risked both burning out staff and missing meaningful alerts in the middle of the night. “By eliminating manual interactions, PagerDuty has enabled our alerting process to take a huge step forward. And we’re no longer losing track of incidents that affect production.” - Waldemar Spitschak, Head of SRE, Verivox From Manual to Automated According to Waldemar Spitschak, Head of Site Reliability Engineering, “First and foremost, we needed PagerDuty to automate alerting.” Because PagerDuty has over 200 integrations, Verivox could easily connect the PagerDuty digital operations management platform to all of its monitoring tools—like New Relic, Zabbix, and Amazon CloudWatch—across its entire hybrid production environment of databases, cloud applications, Windows and Linux servers, and more. PagerDuty automation enabled Verivox to better define and assign on-call roles. 
As a result, the company can immediately route issues to people who know how to fix them rather than force an intermediary to pick up the phone and track someone down. If the on-call team needs to add more resources to assist, they can run a response play to automatically tap the right people. “By eliminating manual interactions, PagerDuty has enabled our alerting process to take a huge step forward,” commented Spitschak. “And we’re no longer losing track of incidents that affect production.” “We’re reacting to and resolving incidents faster than ever before, which is really important since our development cycle is so short,” he added. Automation also evens out the peaks and valleys of Verivox’s seasonal workflow by standardizing the on-call process and enabling the company to better predict costs. With PagerDuty, on-call teams now deliver the same comprehensive coverage all year round, maintaining a consistent level of expertise beyond the peak Q4 time. Improved Visibility Shines a Light on Digital Operations Using PagerDuty, Verivox now has a better understanding of incidents—Spitschak’s team can see the exact number of incidents per service and how quickly they’re resolved. The data helps them determine whether the platform is performing adequately or if a particular service is impacted. With PagerDuty’s rich API functionality, Verivox can generate different reports and alert mechanisms and set automated maintenance. “We’re getting a more holistic view with PagerDuty. Before, we had to make decisions based on a gut feeling. With PagerDuty, we have a clearer picture of what’s going on in our production environment,” said Spitschak. The increased transparency also helps Verivox improve the quality of monitoring and alerts. Because Verivox removed invalid, legacy alerts from PagerDuty, its monitoring is now in a much better place than before. And fewer alerts mean Verivox handles fewer incidents. 
“In the past, our alerting system was sending 10 to 20 times more emails than the on-call person needed to act on,” Spitschak shared. “Now the ratio is more like 1:1.” Looking Ahead The company soon plans to deploy PagerDuty throughout its organization and its parent company subsidiaries. “With PagerDuty, we get a much clearer view of the health of our production environment, and we’re looking into PagerDuty’s Operations Command Console and Operational Health Management Service,” said Spitschak. While Verivox initially selected PagerDuty for its alerting features, the company is now using it to enhance other key dimensions of its digital operations management. And since getting more bang for the buck is what helps fast-growing companies like Verivox stay ahead in a competitive market, it also plans to use PagerDuty to define and measure key performance indicators. Visit www.pagerduty.com for more details on PagerDuty’s digital operations management solution or gain insights, strategies and hands-on experience at one of our many upcoming events.
Datadog Puts DevSecOps in Action With PagerDuty
Datadog provides a monitoring platform that enables teams to ensure that their cloud applications provide the best possible user experience. To help achieve this, Datadog also embraces DevOps, working in an agile manner to constantly innovate and rapidly deliver new features and enhancements to their customers. This DevOps approach means that Datadog engineers make frequent updates to the infrastructure, which can cause alarm for its security team if the engineers fail to work closely with them. To avoid miscommunication and ensure that releases have security built in, Datadog involves its security team during development. In fact, Datadog is at the forefront of a growing trend called DevSecOps. In August 2016, the company adopted PagerDuty as an integral component of its digital operations management. Today, in addition to using PagerDuty to support engineering teams with business continuity and disaster recovery (BC/DR), Datadog uses PagerDuty to notify its information security team of events that require an immediate response. Embracing a New Approach to Scale Security Datadog is an agile, operations-focused organization with hundreds of engineers distributed around the globe. In many organizations, information security teams are typically siloed from the rest of the development teams, which can often delay production releases due to validation processes during security code reviews. But Datadog knew this approach had to go. “Security wasn’t going to work if it was outside of the development organization, just trying to swoop in when things go sideways,” explained Andrew Becherer, Datadog’s Chief Security Officer. Because it views security as another aspect of quality, Datadog embeds its security operations and development functions into the organization as a whole. 
“It behooves security to use the same tools, use the same methods, and bring the same types of technologies to bear in solving the problems faced by the rest of the development organization,” Becherer shared. By extension, when it comes to vulnerability management, Datadog’s security team tracks issues in much the same way that its developer teams track issues, and security alerting and response follow a similar workflow as other teams within Datadog. “It’s uncommon that security teams at other companies are using PagerDuty in the capacity in which we're using it,” said Becherer. Validating Code Changes on Amazon Web Services (AWS) Datadog is completely cloud-based, leveraging a range of AWS services to run code. With over 15 AWS accounts to manage everything from staging to production, keeping track of authorized changes to code can become quite complex. To validate changes, Datadog leverages ChatOps, integrating Slack, Duo Security, and PagerDuty. When a developer makes a potentially dangerous change to AWS, the security team sends a Slack message to the developer to validate the action. The developer confirms the code push via Slack and through two-factor authentication from Duo. If the developer does not reply in a timely manner or does not confirm the code change, PagerDuty sends an alert to the security team to escalate the response. If a change exceeds a certain threshold of risk, then either the security team is notified immediately via PagerDuty or automated AWS configuration management logic reverts the change to a trusted state. In short, developers are constantly making changes to each AWS instance, but it’s up to the security team to determine whether or not changes are authorized across tens of billions of API calls per year on AWS. 
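The validation flow described above can be summarized as a small decision function. The Python sketch below is an illustration built on stated assumptions, not Datadog's implementation: the risk scores, the two thresholds, the reply deadline, the `Change` fields, and the action names are all hypothetical, and the real Slack, Duo, and PagerDuty calls are left out entirely.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ALLOW = "allow"
    ASK_DEVELOPER = "ask developer to confirm"
    PAGE_SECURITY = "page security team"
    AUTO_REVERT = "revert to trusted state"


@dataclass
class Change:
    risk: float                       # hypothetical score from 0.0 to 1.0
    revertible: bool = False          # can automation safely undo the change?
    confirmed: Optional[bool] = None  # None means the developer has not replied yet
    minutes_waiting: int = 0          # time since the confirmation request was sent


# Hypothetical thresholds chosen for the example.
REVIEW_RISK = 0.3
HIGH_RISK = 0.8
REPLY_DEADLINE_MINUTES = 10


def next_action(change: Change) -> Action:
    """Decide what should happen next for a change to a cloud account."""
    if change.risk >= HIGH_RISK:
        # Too risky to wait for a reply: revert automatically or page right away.
        return Action.AUTO_REVERT if change.revertible else Action.PAGE_SECURITY
    if change.risk < REVIEW_RISK:
        return Action.ALLOW
    if change.confirmed is True:
        return Action.ALLOW
    if change.confirmed is False or change.minutes_waiting >= REPLY_DEADLINE_MINUTES:
        # The developer denied the change or never answered: escalate.
        return Action.PAGE_SECURITY
    return Action.ASK_DEVELOPER


print(next_action(Change(risk=0.5)).value)                      # request sent, waiting
print(next_action(Change(risk=0.5, minutes_waiting=12)).value)  # no reply in time
print(next_action(Change(risk=0.9, revertible=True)).value)     # high risk, revertible
```

Treating the high-risk check first reflects the order in the text: changes above the risk threshold skip the confirmation step altogether.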
“It behooves security to use the same tools, use the same methods, and bring the same types of technologies to bear in solving the problems faced by the rest of the development organization.” - Andrew Becherer, Chief Security Officer, Datadog Enabling Agile Security and Development With PagerDuty Like its counterparts at other companies, Datadog’s security organization is deeply concerned about the time it takes to remediate security vulnerabilities. Rather than having to parse through audit logs to understand what happened, Datadog connects its developer teams to security as quickly as possible. By using PagerDuty, Datadog has reduced the overall time required to resolve such issues. PagerDuty also provides visibility into security incidents to developers and engineers so that they can get immediate feedback on their actions if the security team deems them risky. “Developers are trying to solve a problem and they make a change [to address that problem],” explained Becherer. “You want to provide feedback as quickly as possible in that moment because they're going to move on to something completely different [right after making that change].” For example, in one recent security event, PagerDuty quickly escalated a security issue that occurred when a Datadog sales rep was preparing to give a demo. “Because of PagerDuty, we were able to connect a security engineer with our engineers within minutes,” recounted Becherer. “That's solid gold. That's where we have to be.”
PagerDuty Helps ServiceNow Integrate Incident Management for IBM Cloud
As one of the most successful technology companies ever, IBM didn’t get to where it is today by resting on its laurels. To remain relevant, the company has had to continuously reinvent itself and search for new ways to outshine the competition. IBM Cloud and the company’s Watson cognitive computing system are both great examples of this. But how do these offerings address infrastructure incidents that threaten the customer experience? Enter PagerDuty’s new, game-changing integration with ServiceNow. Time to Shake Things Up In late 2016, IBM’s Cloud division decided that it needed a standardized tooling system to handle incident management, escalation, root cause analysis (RCA), and related procedures. The group, which owns the operational toolset for IBM’s entire cloud organization, knew its existing IM systems weren’t up to snuff. Decentralized frameworks fragmented contacts and escalation procedures, and manual escalations wasted hours of time. Likewise, siloed knowledge was often outdated and inconsistent. The IBM Cloud team wanted to automate its IM processes while further driving adoption of ServiceNow, its system of record, for incident ticketing, tracking, and documentation. At the same time, they needed a solution that would also help automate self-healing workflows and improve real-time incident response. To make this happen, all moving parts of the incident response ecosystem had to work in concert. PagerDuty and ServiceNow Deliver Flexibility and Autonomy When choosing a vendor to enhance its existing ServiceNow implementation, IBM Cloud realized that only PagerDuty fit the bill. Why? According to Travis Warner, Program Director, Cloud Service Management, IBM Cloud, “First and foremost, it’s the user configurability. 
You don’t have to have a team of administrators saying ‘go edit this group for user A or create this schedule.’ Everyone’s able to do their own thing.” He also noted that, from a tooling perspective, he can stay completely out of it and let managers handle their own teams and services. With the integration, PagerDuty and ServiceNow now offer soup-to-nuts incident management functionality to existing ServiceNow customers like IBM Cloud. Moreover, customers using both products can integrate them quickly and easily while enjoying rapid onboarding and implementation. “About 18 months ago, self-healing automation was in single digits. At the end of 2017, we were approaching 65% and we’re aiming to be at 80% this year.” - Travis Warner, Program Director, Cloud Service Management, IBM Cloud Real-Time Support Automates Workflows, Rallies the Troops For IBM Cloud, the PagerDuty + ServiceNow integration enables the company to address incidents in real time through PagerDuty and document them through ServiceNow. Once PagerDuty receives an incident alert from IBM Cloud’s monitoring tools, it supports automated workflows and incident analysis. Based on the content within the alert, PagerDuty then triggers automated self-healing or escalates the incident to the appropriate team and processes. So no more scrambling around, trying to figure out who should do what. According to Warner, “PagerDuty is really good at what it originally set out to do: get people out of bed.” Moreover, he added, “PagerDuty really helps us automate our IM processes. We love this because it reduces the number of times folks are notified and heads off issues before they impact our customers. And if the automation fails, PagerDuty then alerts the right people with an incident history and other details so they can address the issue right away.” Better yet, all IBM Cloud incident data is auto-synced between PagerDuty and ServiceNow. 
Among other benefits, this bidirectional flow of information gives employees a lot of flexibility in how they work. And as noted by Warner, PagerDuty’s “any device, anytime, and anywhere concept is really where it’s at.” IBM Cloud Heals Thyself Since the launch in November 2016, close to 4,000 IBM Cloud employees have begun using PagerDuty. With deployment well underway, the IBM Cloud team is seeing dramatic improvement in incident management. PagerDuty’s streamlined framework reduces resolution times, centralizes visibility, and automates recruitment of the right resources, creating a better work-life balance for employees. The integration of PagerDuty and ServiceNow also makes it easier for the IBM Cloud division to automate self-healing. Whereas only 5–10 percent of incidents were self-healed before deployment, this number increased to a whopping 65 percent by the end of 2017. And it doesn’t stop there: IBM Cloud is aiming for 80 percent self-healing in 2018. Full Speed Ahead The IBM Cloud group’s use of PagerDuty represents only part of its broader DevOps deployment across IBM’s cloud organization. IBM Cloud already has close to 3,000 services configured and using PagerDuty and ServiceNow. To learn more about the PagerDuty and ServiceNow integration, visit www.pagerduty.com/solutions/servicenow.
Evernote Improves the Customer Journey With PagerDuty
Imagine the frustration you feel when you’re writing something in Google Docs and you suddenly lose Internet connection. Or the panic you experience when you’re searching through your Notes app on your phone for one very particular note you typed on your computer about elephants in Djibouti so you can win the trivia game—and can’t find it. From meeting notes to random trivia tidbits, Evernote’s job is to help people create, assemble, nurture, and share information. Our unique search capabilities allow people to find information when they need it, no matter the format it was stored in—whether in a note, image, PDF, or voice recording. Our product is a cross-platform software-as-a-service application designed to enable people to organize, personalize, consume, and share thoughts from any device at any time. We currently have over 220 million people using our product globally, and that number increases daily. As the SRE Manager, I lead a team of site reliability engineers that is responsible for customer happiness, which we protect by ensuring that our product works as intended. This means minimal downtime, but if downtime does happen, we need to act fast and resolve the issue as soon as possible. This is where PagerDuty comes in: When I joined Evernote in 2012, we were using PagerDuty primarily for alerts and notifications, as well as on-call rotation scheduling. In 2016, we began a major evolution of our hosting infrastructure, which centered around migrating many workloads to Google Cloud Platform. By moving to the cloud, engineers were able to iterate and build services quicker than ever before. But with this increased agility came new challenges—namely, tracking key performance indicators that tie into our service-level objectives (SLOs), which we use internally to identify which incidents have the most negative impact on the customer journey. 
For example, our customers care about how long it takes to open, write, and sync a note across their devices, so when any one of those actions experiences an issue, my team needs to be aware immediately and resolve that incident as quickly as possible. On the other hand, if one server goes down and we have eight of them still running, we’ll still receive an alert. But if it doesn’t affect our customers’ experience (and our SLO), then it probably isn’t a big deal and we can plan to address it later on. PagerDuty helps with this by funneling all of our alerts and grouping them together so we can figure out what to prioritize, allowing us to look at things from the top of the funnel down rather than from the bottom up. Additionally, the platform’s advanced analytics capabilities give us a single source of truth for visibility into production issues. As we continue to grow, we plan to expand our use of PagerDuty within the company, specifically by using the available postmortem templates and incident response plays to further automate our incident response process.
Garrett Plasky is SRE Manager at Evernote. His team is responsible for running Evernote’s production service infrastructure. See the full case study to learn more about Evernote’s story.
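The prioritization described in this story (page someone when the customer journey is at risk, defer when redundancy absorbs the failure) can be sketched in a few lines of Python. This is only an illustrative sketch, not Evernote’s or PagerDuty’s implementation: the `Alert` fields, the `triage` function, and every threshold below are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """A simplified alert record; every field name here is hypothetical."""
    service: str
    summary: str
    healthy_replicas: int          # instances still serving traffic
    min_replicas_for_slo: int      # assumed capacity floor needed to meet the SLO
    user_facing_latency_ms: float  # latency a customer currently experiences
    slo_latency_ms: float          # assumed latency target from the SLO


def triage(alert: Alert) -> str:
    """Return 'page' when the customer journey is at risk, otherwise 'defer'."""
    slo_breached = alert.user_facing_latency_ms > alert.slo_latency_ms
    capacity_at_risk = alert.healthy_replicas < alert.min_replicas_for_slo
    return "page" if slo_breached or capacity_at_risk else "defer"


# One server down with eight still running: customers are unaffected.
one_server_down = Alert("note-sync", "1 of 9 servers down", 8, 4, 180.0, 500.0)
# Sync latency above the assumed target: customers feel it.
slow_sync = Alert("note-sync", "sync latency high", 9, 4, 900.0, 500.0)

print(triage(one_server_down))
print(triage(slow_sync))
```

The point of the sketch is that the decision is driven by customer-facing measurements rather than by the raw fact that something failed.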
Evernote Takes Note of the Customer Journey
Evernote is a cross-platform software-as-a-service application designed to help people be more productive by making it easier to take notes and manage information across web and mobile devices at all hours of the day. Today, Evernote has over 220 million users across the globe, with 80 percent of them outside of the United States. With so many people dependent on the platform, Evernote must ensure high service availability or risk having unhappy customers and subscription cancellations. PagerDuty enables Evernote’s engineers to respond quickly to minimize the customer impact of performance issues.
Understanding the Customer Journey Through Service-Level Objectives
Garrett Plasky, Evernote’s SRE Manager, leads a team made up of site reliability engineers, DevOps engineers, and system administrators who are responsible for the health of Evernote’s production service infrastructure and, ultimately, customer happiness. “In 2016, Evernote began a major evolution of its hosting infrastructure,” Plasky shared. “The update, which centered around a migration of many workloads to Google Cloud Platform, was part of an effort to democratize operations and enable engineers to move quickly, iterate, and build services.” However, with increased agility came more responsibility. Evernote engineers were now responsible not just for building services, but also for maintaining them in production. To do this effectively, they needed to track key performance indicators (KPIs), which could help them make informed decisions about how to maintain service-level objectives (SLOs) when a problem occurred with the infrastructure. “These are the types of things that we’re monitoring and alerting for more: the full user journey, aka the things our users care about,” Plasky explained. “For instance, how long does it take you to open, create, and sync a note? 
We’re reframing the way we think about what’s important and looking at things more from the top of the funnel down instead of from the bottom up.”
Developing Insights to Empower Engineers and Improve Future Response
Looking at SLOs through the lens of the customer has also provided Plasky’s team with insights to make informed, real-time decisions about complex application environments. Evernote engineers are responsible for maintaining the services that they create and have the authority to determine whether a given alert is serious enough to merit action. PagerDuty provides the data necessary to help Plasky’s team make decisions about the relevance of each incident, empowering engineers to work more effectively while still maintaining high service availability for end users. PagerDuty’s postmortem capabilities also enable Plasky and his colleagues to perform insightful, streamlined postmortems. “One challenge that we have as an Operations organization is to continue our mature and well-rounded incident response process, but also balance that with the fact that we don’t want to spend two man-days putting together a postmortem report or have a three-hour meeting discussing an issue.” By automating postmortem reporting, PagerDuty helps the team meet this challenge.
“We have different sources of data and alerting. But having them all funnel through PagerDuty has value because it makes it easy for us to see what happened, what went wrong and when.” - Garrett Plasky, SRE Manager, Evernote
Evernote and PagerDuty: Growing Together
As Evernote continues to grow and evolve, PagerDuty will be right by its side. When Plasky joined Evernote in 2012, the company was using PagerDuty only for alerting and notifications. Today, his team also uses PagerDuty for scheduling on-call rotations and is taking advantage of the platform’s advanced analytics capabilities to give them a single source of truth for visibility into production issues. 
Evernote plans to increase its use of microservices over the next year, and the company will be adding more product engineering teams as PagerDuty users so they can be responsible for running their own services before handing them over to Plasky’s team. Additional PagerDuty features and integrations also figure prominently in future plans, particularly the available postmortem templates and response plays, so Evernote can continue to automate and improve its incident response process. “We have different sources of data and alerting. But having them all funnel through PagerDuty has value because it makes it easy for us to see what happened, what went wrong and when,” Plasky shared. “PagerDuty is what wakes us up when something critical breaks, which is essential to keeping customers happy.”
Yelp Delivers Global Site Availability and Customer Satisfaction with PagerDuty
Okta Relies on PagerDuty to Remove Friction from its Digital Operations Management Processes
PagerDuty: A Horse You Can Back at William Hill Australia
*Since the time of this writing, William Hill Australia has been acquired by The Stars Group.
In the U.K., there’s the Royal Ascot. In the U.S., it’s the Kentucky Derby. But Australia takes horse racing to a whole new level: The Melbourne Cup, aka “the race that stops the nation,” is such a big event, it’s a public holiday in the state of Victoria. Imagine that: a public holiday just so people can go watch horses run in circles! That is amazing! But just how big is betting on horses for Australians? According to the most recent figures, from 2015–2016, nearly one million Australians bet approximately $3 billion AUD on horse races. Now, imagine if you were one of these Aussies, excited about this day off, and then you found out you were unable to place a bet because the systems at the sportsbook you use had just stopped working (or were working so slowly, they may as well not have been working). Not only would you be really upset, you’d likely also find another sportsbook to make your wager so you wouldn’t have to watch the horses run away with your potential winnings. For companies like William Hill Australia, one of the country’s leading betting and gaming companies, avoiding such a scenario is the exact reason why it selected PagerDuty to help ensure that its systems always remain available for its customers. To put it another way, anything less than 100 percent uptime on Melbourne Cup Day is unacceptable to the Infrastructure and Operations teams. According to them, “You just need one little blip and one minute of outage, and you’ve blown your KPI out the window.”
Better Work-Life Balance With Automation and Integrations
PagerDuty was brought on board to play a huge role in William Hill Australia’s digital transformation and cloud migration initiatives. Historically, team members had to monitor tickets flowing into their email box to manage incidents. The process was manual and relied heavily on team members making the right call, sometimes literally. 
And if they were going to wake a fellow employee at 3 a.m., they had better fully understand the context and severity of the incident so they could communicate it effectively and the on-call person could solve it as quickly as possible. PagerDuty enables William Hill Australia to automate this entire process, which has helped on-call responders to react more quickly and maintain uptime. At the same time, the technology can’t send an alert or notification every single time an alarm goes off. With all the monitoring tools in place, the volume of alerts that the team would have to sift through would be overwhelming. The technology has to be smart enough to figure out what’s signal and what’s noise. Otherwise, William Hill Australia would be in the same situation as before, with people manually figuring out the incident and determining who should be called in. So, they’ve integrated their monitoring tools (including Splunk, AppDynamics, CloudWatch, you name it) into PagerDuty. Using PagerDuty’s event intelligence and alert grouping capabilities, the teams are now confident that when PagerDuty notifies them, it truly is for an incident that needs immediate attention. Additionally, the teams are empowered to respond how they want to respond. They control their on-call schedules and how they want to be notified. Instead of someone calling them, PagerDuty automatically notifies the on-call staff via SMS, email, phone, or however else they want to be notified. If there isn’t a response within a set period of time, PagerDuty automatically escalates to the next level, based on the criteria that William Hill Australia has established. 
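The timed escalation described above can be illustrated with a small Python sketch. It is not PagerDuty’s implementation or William Hill Australia’s actual policy: the function name `responder_at`, the level names, and the five-minute timeout are all assumptions made up for this example.

```python
from typing import List, Optional


def responder_at(minutes_since_trigger: float,
                 levels: List[str],
                 timeout_minutes: float) -> Optional[str]:
    """Return who should be notified for a still-unacknowledged incident.

    Each level gets `timeout_minutes` to respond before the next level is
    notified. Returns None once every level has timed out.
    """
    if minutes_since_trigger < 0 or timeout_minutes <= 0:
        raise ValueError("elapsed time must be >= 0 and timeout must be > 0")
    index = int(minutes_since_trigger // timeout_minutes)
    return levels[index] if index < len(levels) else None


# Hypothetical three-level policy with a five-minute timeout per level.
levels = ["service-desk", "on-call-engineer", "ops-manager"]
for elapsed in (0, 5, 12, 15):
    print(elapsed, responder_at(elapsed, levels, 5))
```

A real policy would also stop escalating as soon as someone acknowledges the incident; the sketch only models the unacknowledged case.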
According to the team, they “know the issue is going to get picked up at some point in the next few minutes, rather than having to hope that somebody is watching a screen or monitoring an email queue.”
Saving on Downtime (and Costs)
Every minute of downtime can cost up to $100,000 AUD, so the ability to resolve issues before customers are aware of any problems has made a huge impact for William Hill Australia. According to their Head of Infrastructure and Operations, “I don’t want the business telling me I’ve got an issue. I want my technology to alert me.” The end result: People can successfully place bets through William Hill Australia, even during peak times, and now it’s off to the races! To learn more about how PagerDuty helps William Hill Australia stay up and running 24/7, check out the full story.
How Tyro Payments Drives More Business for Small Businesses
As PagerDuty’s customer liaison on the Marketing team, the best part about my job is being able to hear directly from our customers about what their companies do and how PagerDuty has helped them become even more successful while strengthening the customer experience. Most recently, I had the opportunity to speak with Ed Groenescheij, the Operations Team Lead at Tyro Payments, a leading independent payments provider based in Australia. Tyro processes over $10 billion AUD in transactions annually for more than 19,000 small to mid-sized businesses (merchants) across Australia. That’s a huge number to keep track of. Tyro also needs to ensure it continues to meet the commitment it has made to its customers, which is to help keep their businesses growing.
For example, imagine it’s the holiday season and you’re shopping around for last-minute gifts at your neighborhood retail store. There’s a long wait to check out, but you get in line anyway because the wait is expected for this time of year. Then the store owners inform you and the other customers that the credit card machines have suddenly gone down. With that news, one of two things happens: Either the lines become longer because the checkout process has slowed down and people choose to wait, or people leave because they don’t have cash on hand, don’t have time to wait, or both. People leaving is obviously bad for business, but having people waiting in line longer than necessary also has a negative effect: small businesses are typically very cash flow dependent, and when credit card machines go offline during peak shopping times, those businesses lose sales. Additionally, every second of downtime means more opportunity for customers to head to a competitor’s store to do their shopping. This is where Tyro comes in. 
During my conversation with Ed, I discovered a great story about how Tyro uses PagerDuty for incident management so that its platform stays up and running 24/7, helping small businesses avoid incidents like the one above. Tyro understands that cash flow is king for small businesses. It also understands that if its systems go down, the outage will heavily impact its customers’ top-line growth and profits. The lost revenue from not being able to accept credit cards reduces cash flow, which impacts the merchant’s ability to buy inventory or pay staff, vendors, utilities, rent … the list goes on. Clearly, it’s crucial that small businesses can process sales quickly and reliably to avoid cash flow issues. “If anything fails, customers can no longer accept payments,” said Ed. “It’s critical to us to ensure the platform is always up.” And much like how customers might start shopping with a competitor during peak season if they can’t use credit cards, merchants might switch to a different payment processing provider if outages occur too frequently. These are just a few big reasons why Tyro’s commitment to having a stable and smooth-running platform is essential.
To meet that commitment, Tyro uses over one hundred microservices that support critical banking operations. A failure in any of them could trigger a major customer-impacting problem, which is where PagerDuty comes in. PagerDuty provides Tyro the assurance that if performance starts to slow on its platform, the necessary stakeholders will have visibility into the issue before it becomes a widespread incident. Using PagerDuty, Tyro can provide the right people with the right information so that they can acknowledge the problem and take quick action. To learn more about how PagerDuty helps Tyro Payments stay up and running 24/7, check out the full story.
William Hill Bets On PagerDuty for Incident Management
In Australia, A$18 billion a year is spent on gambling, and William Hill Australia is one of the leading betting and gaming companies. As a digital-only business, William Hill counts its focus on and investment in IT Operations among its greatest differentiators. The team knows firsthand how important it is to deliver always-on, amazing digital experiences. To deliver on their promise of exceptional customer experience, in 2018, William Hill Australia is embracing a more agile approach, migrating new and existing services onto AWS and using PagerDuty to support their journey.
Exceeding Expectations
William Hill’s customers won’t hesitate to move their business to a competitor if applications are slow or down; they follow events globally, so services need to be 100 percent available, every hour of every day. But for high-traffic, “Tier One” events like the Melbourne Cup, the stakes are even higher: downtime can cost A$100,000 per minute, and the operations team needs to keep an eye on over 700 critical servers that make up core services and applications. “We’ve got to be available 100 percent of the time,” said Alan Alderson, Head of Infrastructure and Operations for William Hill Australia. “You just need one little blip and one minute of outage, and you’ve blown your KPI out the window.” Being immediately alerted to any issue is critical for William Hill Australia so problems can be found and fixed before customers are aware. For Alderson, it’s important that his team finds the issues before anyone else. “I don’t want the business telling me I’ve got an issue. I want my technology to alert me,” explained Alderson. 
“That way, the business has confidence in us: that we’re on it, we’re monitoring our systems properly, we know when we’ve got an issue, and we’re working to restore services as quickly as possible.”
“I have confidence that when something goes wrong, the service desk is going to get a phone call, and if that phone call isn’t answered, I know it’s going to be escalated.” - Alan Alderson, Head of Infrastructure and Operations, William Hill
PagerDuty Automates Incident Management and Increases Visibility
Since implementing PagerDuty, William Hill Australia has reduced its manual efforts around incident management and increased its confidence that the right messages are getting to the right people. William Hill Australia has numerous monitoring tools, including Splunk, AppDynamics, AWS CloudWatch, and CA Unified Infrastructure Management. Prior to PagerDuty, all alerts would be sent to the Service Desk, which would then route “critical” alerts to ServiceNow, where an on-call engineer would watch the queue and manually call out if issues were not resolved. “With PagerDuty, we are no longer relying on watching those queues in order to identify incidents and respond in a timely fashion,” said Alderson. Instead, PagerDuty correlates that data and immediately alerts engineers in the format they choose: via SMS, mobile app, phone call, or email. “[With PagerDuty], I have confidence that when something goes wrong, the service desk is going to get a phone call. And if that phone call isn’t answered, I know it’s going to be escalated,” Alderson said. “I know the issue is going to get picked up at some point in the next few minutes, rather than having to hope that somebody is watching a screen or monitoring an email queue. Alerts are now being picked up within seconds rather than minutes.”
Future Plans: Continuing Cultural Change Through PagerDuty
The newfound confidence is critical as William Hill Australia implements its cloud migration strategy. 
“Since January, we’ve been migrating our product, our infrastructure, and applications into AWS from our on-premise data centers,” explained Alderson. By combining PagerDuty with AWS CloudWatch’s system-wide visibility, William Hill can once again define how the team receives alerts and creates incidents. Alderson acknowledged that full implementation of PagerDuty will take time. “You can’t change culture overnight, but over time I want the rest of the business to see the value of PagerDuty,” he explained. “We’re becoming a more agile environment. We’re only starting out with our ‘build it, own it, fix it’ philosophy, but PagerDuty will help us mature in this space.” “PagerDuty is going to be the cornerstone of my plan for next year,” stated Alderson.
Tyro Payments Automates Microservices Incident Management with PagerDuty
Tyro Payments, Australia’s leading independent payments provider, processes over $10 billion in transactions annually for more than 19,000 small to mid-size businesses across the country. The company supports more than 200 point-of-sale integrations and promises transaction completion times of under two seconds. Keeping this promise and guaranteeing service uptime requires a robust monitoring and incident management solution, which is why Tyro has leveraged the PagerDuty platform since 2013. With PagerDuty, Tyro manages alerts and notifications for its microservices-based applications and infrastructure.
Challenges: Manual Incident Monitoring and Scheduling
Tyro’s application platform consists of over 100 microservices that support critical banking operations. A failure in any of them could trigger a major customer-impacting problem. “If anything fails, customers can no longer accept payments,” said Ed Groenescheij, Team Lead at Tyro Payments. “It’s critical to us to ensure the platform is always up.” Before adopting PagerDuty, Tyro’s operations team struggled to identify failures in a timely manner because of its heavy reliance on manual processes for managing incidents. Alerts were sent via email to on-call engineers, who had to check emails manually to stay ahead of important notifications. Escalating an alert when the on-call engineer did not respond or could not handle an incident independently also required manual intervention. If an incident affected an application and required developer support, the operations team would need to manually call out to the developers as well. All of these manual processes were time-consuming and potentially left Tyro’s customers at risk if the operations team could not resolve issues quickly.
Achieving Automation With PagerDuty
Once Tyro’s operations team adopted PagerDuty, the tedium and risk of manual incident management quickly became a thing of the past. 
“The key thing for us when we started using PagerDuty was the fact that we were able to schedule, automate, and escalate incident response immediately,” Groenescheij said. In addition, PagerDuty has facilitated better communication between the operations team and other parts of the organization by streamlining visibility into infrastructure and applications. “[Previously], if one of our infrastructure monitoring systems noticed an issue, and at the same time an issue occurred with one of our applications, the application team wouldn’t know that there was an underlying infrastructure issue,” Groenescheij explained. By allowing the team to coordinate monitoring data, PagerDuty now gives them a consolidated understanding of what’s happening in their environment. PagerDuty has also helped Tyro’s engineers to work more efficiently with less stress. Engineers now receive notifications automatically, eliminating worries about missing an important alert. “We’re now able to step back and trust that PagerDuty will wake us up when we need to,” Groenescheij said.
“When we started using PagerDuty... we were able to schedule, automate, and escalate incident responses immediately.” - Ed Groenescheij, Team Lead, Tyro Payments
Expanding Infrastructure Visibility Further With Operations Command Console
In the near future, Groenescheij and his team plan to take advantage of additional PagerDuty capabilities. These include the Operations Command Console, which will help on-call engineers track associations between incidents in order to prevent cascading service failures, which can occur when an incident with one application or resource causes disruptions for others that depend on it. Operations Command Console will also provide a consolidated interface for viewing monitoring data from all of the alerting systems that PagerDuty integrates for Tyro. In addition, Tyro expects to extend use of PagerDuty beyond its operations team to include developers as well. 
“We want to ensure that developers gain instant visibility into application issues as they happen rather than relying on the operations team to walk over and tell them about the issue,” Groenescheij said. By integrating developers centrally into the incident management process, Tyro will further automate its software delivery and management workflows. In turn, Tyro will be even better positioned to accept payments with confidence, knowing that the ITOps and developer teams are working together to respond to issues quickly using PagerDuty’s automated incident management features.
REA Group Embraces Digital Transformation with PagerDuty
REA Group Replaces Pagers with PagerDuty
REA Group Limited is a multinational digital advertising company specializing in property, operating the leading property website in Australia and prominent sites across Asia. Their purpose is to ‘change the way the world experiences property’, which they do by developing innovative products and creating a dynamic working culture that fosters inventive thinking. Millions of people around the world use REA Group’s websites to find property every day, so the platform must always be on and performing well to ensure people can search for properties at any time, from anywhere. It’s therefore mission-critical for REA Group to respond to incidents impacting platform performance before their customers notice. At the same time, REA Group cannot lose focus on operational efficiency for their software development and management team, especially in the face of rapid growth.
Challenges: Monolithic Incident Alerting and Siloed Operations
In 2014, prior to adopting PagerDuty, REA Group’s operations team relied on a monolithic, inefficient alert notification system that required engineers to carry physical pagers at all times. Because a system based on physical pagers was challenging to change and optimize, the REA team couldn’t guarantee that the right alerts were delivered to the right people, which delayed incident response times. Furthermore, on-call engineers were constantly being notified of non-critical or non-actionable alerts, especially out of hours. “It was a nightmare during night-time, a really painful process,” Javier Turegano Molina, Global Infrastructure and Architecture Manager at REA Group, said about the on-call experience in those early days. The second major challenge for the team was the siloed structure of the organization. 
The organization was composed of many different groups that were each responsible for developing distinct parts of the company’s ecosystem, but all incidents were relayed to a centralized operations team. REA focused on breaking down these silos by embracing a DevOps culture, shifting the ownership of operations towards the teams that were building and maintaining the applications. For this change to be successful, alerts needed to be delivered to the owning team directly and not sent to a separate centralized unit.
“We now have a way of sending the right alerts to the right people, and at the right time.” - Javier Turegano Molina, Global Infrastructure and Architecture Manager at REA Group
Achieving Agile Incident Management with PagerDuty
In 2014, Turegano and his team implemented PagerDuty to improve incident response time and to fully embrace the DevOps way of working. With PagerDuty, REA can streamline the way incidents are managed across its entire organization by coordinating incident responses in a tailored, agile fashion. Incident escalation policies are customized so that alerts are delivered to the right people based on the nature of the problem, including the team that owns the affected service and the engineer who is best suited to handle the issue. The teams now place great emphasis on designing their alerts to match the SLAs and on ensuring the team is not alerted without a real reason. The outcome is that every team that owns a service now has full accountability for it. This has required a critical shift in mentality, with teams now understanding that if you build it, you run it. “Being able to tune the schedules was a really great feature for us,” Turegano explained. Physical pagers have become a thing of the past. Incident notifications are now delivered through PagerDuty, allowing engineers (developers, QA engineers, systems engineers, etc.) to be notified via the phones and other devices that they already use and own. 
“Having no more physical pagers has been life-changing,” Turegano said. With the metrics that PagerDuty automatically collects, Turegano and his team have improved their operations. PagerDuty provides data that helps them determine their Mean Time to Repair (MTTR), which allows REA Group to track how its operations team’s performance during incidents evolves over time. PagerDuty also aggregates metrics from the diverse set of monitoring tools that the team already uses, such as AWS CloudWatch, Nagios, New Relic and Splunk. These aggregated metrics are invaluable for performing post-mortems after an incident in order to prevent similar issues from recurring in the future, Turegano said. REA now uses PagerDuty to power all of its digital operations. “Anything that can break will send an alert to PagerDuty, and we now have a way of sending the right alerts to the right people,” Turegano said. He added that REA has not just become more efficient in the way it handles alerts; it has undergone an entire DevOps cultural change, and PagerDuty has been a great enabler in this journey. To read more about how REA has scaled on-call, see the post on their tech blog.
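For readers unfamiliar with the metric, Mean Time to Repair is simply the average time between an incident being triggered and being resolved. The Python sketch below shows the arithmetic under that definition; the function name and the sample timestamps are invented for illustration and are not REA Group data or a PagerDuty API.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_repair(incidents: Iterable[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - triggered) over a set of resolved incidents."""
    durations = [resolved - triggered for triggered, resolved in incidents]
    if not durations:
        raise ValueError("need at least one resolved incident")
    return sum(durations, timedelta()) / len(durations)


# Invented sample data: one incident took 10 minutes, the other 30 minutes.
sample = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10)),
    (datetime(2024, 1, 2, 22, 0), datetime(2024, 1, 2, 22, 30)),
]
print(mean_time_to_repair(sample))
```

Computing the same average per month or per team is what makes it possible to see whether incident handling is improving over time.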
Bulletproof Enables Customer Cloud Migration with PagerDuty
Bulletproof, a major provider of cloud management and consultation services in Australia, is a company whose name and motto of “Mission Critical Cloud” proclaim a commitment to reliability. Delivering on this commitment to Bulletproof’s more than 700 customers requires careful management of the infrastructure and services that power the company’s operations. Some years ago, after struggling to handle incidents using ad-hoc processes and legacy technologies, Bulletproof adopted PagerDuty, which has enabled revolutionary improvement in their ability to maintain SLA response times and avoid critical service disruptions. This has driven increased customer satisfaction by enforcing compliance with entitlements. For example, where a customer’s service is entitled to be patched on a regular basis but its last patched date is out of bounds, Bulletproof’s automation uses PagerDuty to assign an engineer to remedy it.
Challenges: Inefficient Alerting and Lack of Automation
Bulletproof’s engineers manage cloud infrastructure that is spread across multiple continents. It includes both the company’s internal infrastructure and the cloud environments of customers that they support on a long-term basis. Those customers are spread across a diverse set of industries, including government, finance, manufacturing and retail. Maintaining uninterrupted service for such a large spread of critical applications across multiple cloud environments is no small task. Initially, Bulletproof’s support team relied on a primitive notification system utilizing physical pagers to deliver alerts to its engineers when incidents occurred. The alerts were delivered by a third-party company that forwarded notifications to the pagers, which engineers had to literally carry and check regularly at all times. 
Reliance on a third-party messaging company added cost, latency and inefficiency to the alert system, which offered no support for integrations with other tools like chat platforms, or automation of tedious tasks such as on-call scheduling. These processes were instead performed manually, often on an ad-hoc basis.
Improving Engineers’ Happiness and Response Time with PagerDuty
Five years ago, Bulletproof decommissioned its physical pager alerting system in favor of PagerDuty and saw immediate results. One significant benefit was PagerDuty’s scheduling automation features, which provide “the clarity to know whether someone is on task and on schedule,” said Greg Cockburn, Chief Cloud Officer at Bulletproof. This scheduling visibility enabled a better life-work balance for Bulletproof’s staff because it made it easy to identify on-call engineers during an incident and avoid disrupting off-duty employees. “A better on-call system has been a significant improvement for our team’s happiness, and PagerDuty has been a big part of that story,” Greg said. He noted as well that PagerDuty helps to automate incident escalation and handoffs between engineers, further improving the efficiency of workflows and reducing the amount of manual coordination required from engineers.
“We’ve been using PagerDuty for five years. We have had tens of thousands of incidents. We’ve never encountered a problem with the PagerDuty service. It has always just worked!” - Greg Cockburn, Chief Cloud Officer, Bulletproof
With the help of PagerDuty, Bulletproof also significantly improved its response time when handling incidents, an essential benefit for a company whose ability to maintain service agreements with customers is a crucial part of its value. “One of our key SLA targets to our customers is human response time. PagerDuty is the foundation for us to provide that,” said Greg. 
Most important of all, PagerDuty has delivered a bullet-proof incident management solution for Bulletproof’s engineers. As Greg explained, “We’ve been using PagerDuty for five years. We have had tens of thousands of incidents. We’ve never encountered a problem with the PagerDuty service. It has always just worked.” Bulletproof already leverages a range of PagerDuty features, from automated alert grouping to SLA reporting to the ChatOps integration with HipChat. Going forward, however, with offerings like DevOps Support, the company plans to take even greater advantage of the functionality that PagerDuty offers. “PagerDuty is a next-generation platform, with constant evolving features that we didn’t even know existed,” Greg said.
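The patching entitlement example mentioned earlier in this story (a service whose last patched date is out of bounds results in an engineer being assigned through PagerDuty) could look roughly like the Python sketch below. The story does not say how Bulletproof’s automation talks to PagerDuty, so everything here is an assumption: the function name, the 30-day window, and the choice to build a trigger event in the general shape used by PagerDuty’s Events API v2 (`routing_key`, `event_action`, `dedup_key`, `payload`). The sketch only constructs the event body and does not send it.

```python
from datetime import date
from typing import Optional


def patch_overdue_event(customer: str, service: str, last_patched: date,
                        today: date, max_age_days: int,
                        routing_key: str) -> Optional[dict]:
    """Build a trigger event if a service is past its patching window.

    Returns None when the service is still within its entitlement.
    The window length and all names are hypothetical.
    """
    age_days = (today - last_patched).days
    if age_days <= max_age_days:
        return None
    return {
        "routing_key": routing_key,      # placeholder integration key
        "event_action": "trigger",
        # A stable key so repeated checks do not open duplicate incidents.
        "dedup_key": f"patch-overdue:{customer}:{service}",
        "payload": {
            "summary": (f"{service} for {customer} last patched "
                        f"{age_days} days ago (limit {max_age_days})"),
            "source": service,
            "severity": "warning",
        },
    }


event = patch_overdue_event("acme", "web-01", date(2024, 1, 1),
                            date(2024, 3, 1), 30, "EXAMPLE_ROUTING_KEY")
print(event["payload"]["summary"])
```

Keeping the check separate from the delivery step makes the entitlement rule easy to test without contacting any external service.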
Sky Betting and Gaming Ups Its Game With Improved Real-Time Operations
Sky Betting and Gaming (SBG) is the leading mobile and online betting and gaming operator in the UK. Its platform processes over 44 million game transactions every week and posts over 50 million content updates every day. “Making sure customers get the best service is the ultimate goal for us,” said Rachel Watson, Head of Service Operations at SBG. With fierce competition targeting the same users, SBG must provide an outstanding customer experience by ensuring that its platform is available 24/7. Rapid Growth, Manual Processes Call for PagerDuty Scalability and availability are mission critical to SBG, as the company supports over 2 million active users and continues to grow rapidly. “If a customer wants to place a bet, they want to do it now. They don’t want to do it an hour later. By then, they would have placed their bet with somebody else,” said Watson. As SBG moved to more of a DevOps model, with engineering squads responsible for fixing the code they build, the manual incident management process did not scale. “As more squads joined the on-call rotation, we could no longer have the traditional handoffs through phone calls,” said Watson. “If we misdialed a single digit, we’d end up leaving a message for somebody who doesn’t even work for us.” Watson’s team was often unable to reach the right people in a timely manner, if at all. As a result, it would take them at least 30 minutes to mobilize the appropriate responders. SBG implemented PagerDuty to mitigate business disruption and accelerate response by automating the on-call management process. “Since using Modern Incident Response, our MTTR has decreased by 86%. The team’s morale has improved considerably as well as people’s satisfaction in their roles since PagerDuty has removed almost all manual aspects of monitoring. We’ve managed to claim back a considerable amount of time which has been reinvested in new projects as well as learning and development," shared Watson. 
“With PagerDuty, we get someone online in less than four minutes. Our average time to restore an incident now is under 30 minutes that we used to spend on manually contacting people.” - Rachel Watson, Head of Service Operations, Sky Betting and Gaming Reduced Noise Improves Visibility In addition to being able to automatically mobilize teams, SBG can now provide incident context so teams can immediately initiate a response. “Previously, when we got a major incident alert, we didn’t know what the issue was,” Watson explained. “We had a sea of red all the time because nobody had visibility into which alerts were genuinely critical.” With the PagerDuty Visibility console, the Service Operations team now has a central view of everything occurring within the IT environment, be it callouts, major incidents, or low-level alerts. As a result, the productivity and engagement of SBG squads have improved because they know the notifications received have real urgency behind them and are therefore empowered to immediately take action. “PagerDuty has helped us move away from an excess of false and redundant alarms; it allows us to focus purely on service impact and truly critical alerts.” The Service Operations team can now also identify trends and engage the engineering squads to investigate further. “If we’re seeing continual alerts, we can take them to the relevant squads and ask to look into why we are getting so many alerts,” said Watson. Integrations Power Tribe Autonomy Driving this increased visibility is PagerDuty’s comprehensive technology ecosystem of 300+ integrations, which allows the Service Operations team and engineering tribes to connect PagerDuty to several different monitoring tools. “In general, each tribe uses Prometheus, New Relic, Grafana and Nagios,” said Watson. 
“As long as they are feeding into PagerDuty, each tribe has the autonomy to choose the tools they want to use, while we simultaneously unify the incident management process.” SBG also recently went live with Jira and will be leveraging PagerDuty’s integration to automatically raise tickets within Jira when an incident occurs. “It was very separate before, where we would look at an alert, raise an incident, call out, and then raise a ticket in Jira. You can do that all now within one tool,” said Watson. For SBG, the number of manual tasks has been reduced dramatically, improving operational efficiency while ensuring its platform stays available for its users. Said Watson, “PagerDuty gets the right people engaged at the right time, all in one push of a button.” To learn more about what PagerDuty can do for your organization and sign up for a free trial, visit www.pagerduty.com.
Cloudflare Reduces Mean-Time-To-Action to Seconds with PagerDuty
As a global cloud-based performance and security solution for over 6,000,000 Internet assets, Cloudflare ensures that customer websites, applications, and APIs are secure, performant, and highly available. Because Cloudflare serves over 10% of the world’s Internet traffic, it’s imperative that its services remain online for customers at scale, while guaranteeing SLA uptime by identifying and resolving incidents long before operations are ever disrupted. Cloudflare’s Site Reliability Engineering (SRE) team, led by Michael Daly, sought an incident resolution solution that would help Cloudflare increase the stability of its operations while delivering a flawless experience for every customer. Challenges: Visibility, Communication, and Escalation Cloudflare faced three challenges before adopting PagerDuty. The first was around visibility. “We didn’t immediately know when something was broken because the engineering team did not receive automated alerts when an incident occurred,” Michael explained. The second challenge was in managing incidents. Once a problem was discovered, the engineering team relied on manual processes to address it. Engineers spent time diagnosing the cause of the problem, and if a solution required assistance from another department, SREs had to contact the right person over phone, text, or chat, a duty that became difficult if incidents occurred after working hours or on weekends. Finally, given Cloudflare’s rapid growth, from fewer than 800,000 customers in 2013 to over 6 million in 2016, it was becoming difficult for Michael’s team to separate actionable, critical incidents from the growing volume of data generated by monitoring tools. While the team refused to dispose of potentially useful information, they needed to group related symptoms in order to gain actionable insight. 
Without the assistance of dynamic event management and triage, automation, and other capabilities available from PagerDuty, Michael and his staff had to evaluate the seriousness of each incident manually, a process that was becoming too slow to best serve the exponentially growing number of customers. “Mean-time-to-action has dropped from multiple minutes to seconds.” - Michael Daly, Engineering Manager, Cloudflare Increasing Stability and Response Time with PagerDuty By adopting PagerDuty, Cloudflare resolved all of these challenges. PagerDuty ensures that Michael and his team are always notified of incidents as soon as they occur and, if an incident should be handled by a different team, PagerDuty forwards the notification automatically to save time. The Cloudflare SRE team also uses the Operations Command Console and benefits from capabilities like the highlighting of high-urgency incidents within the Major Incidents Application. As a result, with full-stack visibility into their infrastructure and pattern and anomaly detection, they no longer miss serious events. Michael explained, “When we adopted PagerDuty, we were able to take certain alerts and say to ourselves, this one is really important. We need to deal with it now.” In addition, other capabilities such as PagerDuty’s HipChat integration made it easier for Cloudflare’s SRE team to streamline communication, collaborate, automate ops-related tasks with commands, and learn together when responding to incidents. PagerDuty also eliminated the need for SREs to manually look up contact information for the right expert, as individuals, teams, or business stakeholders can be informed and recruited into an incident in just a click. With PagerDuty, they can get in touch instantly. Most importantly, PagerDuty reduced the time it takes Michael and his team to take action on incidents to a small fraction of what it was previously. 
“Mean-time-to-action has dropped from minutes to seconds,” Michael said, adding that faster response time translates to greater service reliability and better customer outcomes — which is the ultimate goal and reason why Cloudflare sought out PagerDuty in the first place. “We had several options, but we chose PagerDuty because we had to do less work to make PagerDuty work with our systems. It was very nicely formatted, the API just worked, and the output from the app was very easy to interpret.” - Michael Daly, Engineering Manager, Cloudflare
Picnic delivers better customer experiences and scales their operations with the help of PagerDuty
Picnic is the world’s fastest growing online supermarket that makes grocery shopping simple, fun, and affordable for everyone. The company launched in 2015 after building the platform in stealth-mode with thirty engineers. They then went on to win the 2015 challenger award, 2016 best start-up award, and the 2017 most innovative company award. The service is now available to more than 500,000 people in the Netherlands. The business was in search of a solution that could streamline alert management and incident resolution, by automatically triaging and appropriately notifying team members when issues across the infrastructure stack occurred. The CTO at Picnic, Daniel Gebler, is responsible for all technology, infrastructure, engineering, and operations at the company. His team supports all Picnic store and fulfillment systems, including their mobile app and the backend infrastructure which requires payment system support, human resource systems, monitoring fulfillment, and all supply chain management systems. Addressing the challenges around separating system operations from system development Before implementing PagerDuty, Picnic faced challenges around ensuring accountability and load balancing work streams between the operations and the engineering team. There was no solution in place that tied full-stack infrastructure visibility with a system of accountability as well as collaboration tools, to ensure that issues were acknowledged, prioritized, and remediated in real time. In order to keep tabs on any issues from start to finish, the teams had to be available to monitor the system at all times — a costly and time consuming challenge. Picnic leveraged Slack for internal notification purposes but it wasn’t standardized across the company’s on-call resources, and it was not a reliable way to get in touch with the necessary stakeholders. 
It was critical for them to implement a solution that their on-call teams could standardize on and which could reach them 24/7/365. “PagerDuty allows us to separate business and system operations from system development and administration, saving time and money.” - Daniel Gebler, CTO, Picnic Saving time, costs, and improved MTTA/MTTR by 500% As the need increased to implement a solution that could support on-call automation, full-stack visibility, and end-to-end incident management, Picnic selected PagerDuty as their vendor of choice. “PagerDuty has allowed us to better separate system engineering from system operations, saving us time and money,” stated Daniel Gebler, CTO at Picnic. His teams are responsible for monitoring many different systems, including Picnic’s mobile application and backend systems. They monitor everything: from the load times of their prices and product catalog, the accuracy and availability of real-time analytics, all the way to tracking logistics from the warehouse to the doorstep of their customers. If there are slow response times at any point in the supply chain lifecycle or in the mobile shopping customer experience, PagerDuty automatically notifies the appropriate on-call resource. Picnic ties seamless real-time workflows to all that monitoring data by leveraging PagerDuty, which sits as the glue between its monitoring systems (New Relic to monitor the application and store related systems, and AWS CloudWatch to monitor its cloud infrastructure) and its collaboration tools such as Slack. All those tools are integrated with PagerDuty, enabling effective prioritization of issues and faster response to incidents taking place across the infrastructure, and creating several benefits: decreased alert fatigue, improved accountability and resolution times, and more agile processes for better business and revenue outcomes. Picnic’s warehousing operations are one of the services that need to be monitored in real-time. 
The process starts when a purchase is submitted to the supplier. It’s then delivered to the warehouse, picked up in their electric vehicles, and delivered to the customer’s doorstep. If something goes wrong in the planning cycle, all warehouse operations would be massively disrupted. That’s the reason why Picnic uses PagerDuty: it allows them to take action and prevent operational issues that would affect the customer’s experience. “For us, the most important thing is that our customers get the best quality groceries, delivered for free and always on-time precise to the minute. PagerDuty helps us to identify early issues in the supply chain and allows us to take action to mitigate the issues and prevent customer impact,” said Gebler. It’s important for the teams responsible to be proactive rather than reactive to problems that happen during the incident management lifecycle. Having the ability to resolve those cases within a short amount of time allows them to ensure smooth and seamless operations and deliver better customer experience. The company is fully data-driven and has improved their mean-time-to-acknowledge and mean-time-to-resolve by 500% since they’ve implemented PagerDuty. “For us, the most important thing is that our customers get the best quality groceries, delivered for free and always on-time precise to the minute. PagerDuty helps us to identify early issues in the supply chain and allows us to take action to mitigate the issues and prevent customer impact.” - Daniel Gebler, CTO, Picnic Improving customer experience The biggest value Picnic has seen since using PagerDuty is the separation it provided between system operations and system engineering. PagerDuty gave the company a platform to separate the operational side of system incidents from the development side, which puts each incident in the hands of its rightful owner. In the end, this resulted in scaling their operations even further. 
Picnic has the ability to be more responsive due to PagerDuty’s robust incident management platform. Most importantly, the solution has helped Picnic improve customer experience and enhanced their external services.
Pantheon exceeds their 99.9% uptime SLA with the help of PagerDuty
Founded in 2010, Pantheon is the website management platform for Drupal and WordPress. More than just hosting, Pantheon’s platform includes all the tools professional developers need to build best-of-breed websites — like staging environments, version control, backups and workflows. Nick Stielau is the Director of Engineering at Pantheon and is responsible for organizing the engineering team, providing support around planning, delivering new products and features, and maintaining existing infrastructure and supporting existing functionality. As Pantheon continues to grow their customer base, implementing an incident management solution was critical to helping them manage their on-call resourcing and meet their high uptime SLA expectations. Replacing their previous alerting tool with a more scalable solution Since Pantheon implemented PagerDuty, they haven’t experienced challenges or obstacles in supporting their incident management resources. Prior to leveraging PagerDuty, they had implemented a custom-built alerting tool that didn’t scale to meet their needs as they grew their engineering and customer success teams. When issues arose, the team wanted to ensure that they had the right tools and systems in place to be able to respond to incidents 24/7, and needed a solution that was both highly reliable and that would grow with them as they scaled. “PagerDuty gives us the ability to serve our global customers 24/7 across both infrastructure and customer issues.” - Nick Stielau, Director of Engineering, Pantheon Serving customers 24/7 and surpassing uptime expectations “It was extremely nice to have PagerDuty from the start. We didn’t have to deal with the main problem areas that most companies do when managing incidents. It was one of the solutions that helped us create an operationally like-minded team,” said Nick Stielau. 
Pantheon is currently using PagerDuty for alert management, on-call automation with scheduling and escalations, real-time response orchestration, as well as reporting on system-level and operational efficiency metrics. PagerDuty has helped Pantheon operationalize and improve efficiency and collaboration across departments, especially within the engineering and customer success teams. The on-call engineers are responsible for handling and triaging alerts that come from their infrastructure monitoring stacks. Meanwhile, the customer success team is on-call for customer-based tickets and calls, and manages real-time customer communication and outage updates through their status page. Implementing PagerDuty has also allowed the company to provide a functional and positive feedback loop for both customers and their teams. “PagerDuty gives us the ability to serve our global customers 24/7 across both infrastructure and customer-facing issues,” said Stielau. PagerDuty has hundreds of self-service integrations and extensions with monitoring, ticketing, deployment, and collaboration tools, so that customers can easily customize the ideal incident resolution workflow for any environment. Pantheon in particular utilizes PagerDuty’s integrations with Slack and Sensu. With Slack, on-call engineers and support staff can immediately get notified on, acknowledge, respond to, and collaborate on incidents directly within Slack without having to toggle between tools, as well as tag the appropriate teams for additional help. Pantheon also integrated Sensu with PagerDuty to aggregate their customer support requests. PagerDuty enables the ideal real-time response orchestration, by automatically routing the issue to the right person depending on the service importance and severity of the incident, and escalating the issue to the next line of defense if it isn’t acted on. 
The PagerDuty platform helps Pantheon minimize time spent on administrative tasks, and instead frees up teams to direct their focus and energy to resolving issues effectively and innovating solutions. PagerDuty makes it possible for them to continue serving their customers 24/7. “One of our top level business KPI’s is site uptime. PagerDuty is a critical part of the system and processes which help us keep that uptime where we want it to be, resulting in exceeding our 99.9% uptime SLA,” said Stielau. “One of our top-level business KPI’s is site uptime. PagerDuty is a critical part of the system and processes which help us keep that uptime where we want it to be, resulting in us exceeding our 99.9% uptime SLA.” - Nick Stielau, Director of Engineering, Pantheon Meeting their commitment to uptime and performance “A big value Pantheon provides is committing to our customers’ success on a daily basis. PagerDuty helps us meet our commitments to uptime and performance,” said Stielau. Without having PagerDuty to support their incident prevention and resolution process, it would be difficult for the company to serve their customers efficiently and it would add frustration to those responsible for the product and customer experience. PagerDuty helps relieve the stress associated with being on-call: “there is literally always someone you can escalate an incident to. If you’re really indisposed or need the help, PagerDuty helps codify that support,” stated Stielau. To learn more about how PagerDuty can help your business, contact your account manager and try a 14-day PagerDuty trial today.
With PagerDuty, Jeppesen is able to respond more quickly to the needs of their customers, increasing overall uptime
Jeppesen delivers transformative information and optimization solutions to improve the efficiency of air operations around the globe. As the company started to grow and expand, the search for a solution that could aggregate all the alerts across their infrastructure, scale within the company, and escalate critical incidents became a priority. Pablo Castillo, Service Manager at Jeppesen, and his team members took the initiative to find a solution that would enable better operational agility for their IT teams and reliability for their environment. Overcoming challenges around on-call and incident management Jeppesen didn’t have a solution that supported on-call automation or incident management. A customer would call in to notify the company of an issue, which would then fire off an alert. Additionally, their internal phone system required manual updates to the on-call contact information. As a result, the on-call roster was never up to date, so it wasn’t a reliable way to reach someone about a problem. At times calls would get forwarded to the wrong person. “There wasn’t any proactive detection of problems or incidents. As we got bigger, implementing a solution to manage this became a requirement,” said Castillo. As Jeppesen continued to expand the company and their customer base, the performance and availability of their applications became increasingly critical. With so many moving parts, getting an incident management solution in place to effectively manage their digital operations was top of mind for Castillo, his team, and the company as a whole. “We have 100% delivery for every product thanks to the support of PagerDuty.” - Pablo Castillo, Service Manager, Jeppesen Exceeding SLA expectations and decreasing downtime Jeppesen selected PagerDuty to overcome the challenges they faced around incident management, on-call automation, and incident triage and escalation. 
Since implementing PagerDuty, the company has gained full-stack visibility of critical applications and can now aggregate and manage alerts across their infrastructure, prioritize critical incidents requiring immediate response, and stop business-impacting situations. “PagerDuty provides us with a clear timeline as to when the problem started, when it was acknowledged, and when it’s been resolved,” stated Castillo. To deliver a faster response, Jeppesen implemented a ChatOps support model with the help of PagerDuty, using the PagerDuty and Slack bi-directional workflow extension. With the click of a button, the Jeppesen team can acknowledge and resolve PagerDuty incidents from Slack. PagerDuty updates the Slack timeline, so it is always clear who is actively working on the issue, as well as when and what actions were taken. This also enables seamless collaboration and resolution on mobile. Jeppesen has different SLAs that are tied to specific applications, the most important and impactful being the ones related to tracking. One of the SLAs is that Jeppesen can’t have more than 15 to 30 minutes of downtime per month. In the event of downtime, they need to act quickly. “We have 100% delivery for every product thanks to the support of PagerDuty. We got a call from PagerDuty when one of our website applications went down. When we received notice from our customer, the problem was already resolved. You look good to a customer when something like this happens,” said Castillo. “PagerDuty enables us to deliver 24/7 website availability. With the platform, we are able to address incidents immediately, which enables our IT teams to act proactively in resolving issues.” - Pablo Castillo, Service Manager, Jeppesen 24/7 website availability and seamless digital operations management Jeppesen relies on PagerDuty to keep their site running at all times and notify the right on-call resources to take effective and immediate action whenever an incident arises. 
“PagerDuty enables us to make our site available 24/7. With the PagerDuty platform, we are able to address incidents right away which in turn allows us to act more proactively,” said Castillo. Jeppesen also recently implemented and intends to heavily use PagerDuty’s Live Call Routing capability, which enables any individual to immediately reach a live on-call engineer or leave a voicemail that is attached to an incident simply by calling a number. With PagerDuty, Jeppesen has gained the full-stack visibility and response orchestration required to manage the end-to-end digital experience for their customers, resulting in optimizing product delivery and meeting SLA expectations.
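For readers who want to relate a monthly downtime allowance such as Jeppesen’s 15 to 30 minutes to an availability percentage, the arithmetic can be sketched as follows. This is a minimal illustration only: the 30-day month and the function names are assumptions made for the example, not details from Jeppesen or PagerDuty.

```python
# Illustrative arithmetic only: converts between a monthly downtime
# allowance and an availability percentage. The 30-day month is an
# assumption; real SLAs define their own measurement window.
MINUTES_PER_DAY = 24 * 60


def availability_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Availability (%) implied by a given amount of downtime in one month."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_minutes / total_minutes)


def downtime_budget_minutes(sla_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month permitted by an availability target."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (1 - sla_pct / 100.0)


if __name__ == "__main__":
    for minutes in (15, 30):
        print(f"{minutes} min/month -> {availability_pct(minutes):.3f}% availability")
    print(f"99.9% target -> {downtime_budget_minutes(99.9):.1f} min/month")
```

Under these assumptions, 15 and 30 minutes of monthly downtime correspond to roughly 99.965% and 99.931% availability, and a 99.9% target allows about 43.2 minutes per month.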
Quartet achieved a decrease of 25% in incidents with the help of PagerDuty
Quartet develops and delivers a cloud-based platform that facilitates the communication and collaboration of medical providers and behavioral health providers for patient care. Its platform relies on advanced analytics, proven treatment programs, and modern technology to make healthcare work for providers, patients, and insurers. With a strong focus on accommodating healthcare providers 24/7 and ensuring utmost data security and privacy, it’s important to keep close tabs on their internal systems and ensure things are operating efficiently and securely. Mustafa Shabib, Head of Engineering, is responsible for building the technology services and systems at Quartet. With an increase in company growth, including the expansion of Shabib’s team, having an incident management solution in place became a top priority to ensure the platform met customers’ needs and expectations. Overcoming the challenge of resolving incidents more rapidly In the beginning, when Quartet had a smaller team of seven engineers, they started using Sumo Logic and Slack to deliver real-time IT insights. The engineers had their incident alert notifications directed to a specific channel within Slack that allowed them to receive the alerts on their mobile phones and desktops. There were no on-call rotation schedules, so when an issue arose everyone swarmed the problem at the same time. Eventually, after discussion, a single person would take action - this swarming process meant the service disruption continued resulting in increased mean-time-to-acknowledge (MTTA) and mean-time-to-resolve (MTTR). The Sumo Logic and Slack notifications didn’t provide a sense of urgency within the team. “We weren’t doing our due diligence around resolving incidents as rapidly as we could have with a different solution and process in place,” said Shabib. As the company grew, the lack of an incident management solution was taking its toll on providing the always-on platform customers and patients had come to expect. 
Implementing a solution that reduces MTTA and MTTR As the engineering team at Quartet grew, the need to deploy a solution to assist in maintaining their critical services and systems became an urgent matter. PagerDuty was carefully chosen to help the company overcome the challenges around resolving incidents quickly, while also supporting their goal of reducing MTTA, MTTR, and the overall number of incidents that take place. Quartet looked at a few other solutions, but found PagerDuty to be more mature, with the better reputation within the industry. Quartet’s entire infrastructure is built in AWS and they leverage CloudWatch for system level resource alarming and monitoring. These alarms trigger PagerDuty incidents, as do alerts from the web hosts and from their third-party cloud-based log management and analytics service, Sumo Logic. Agents running on all of their hosts push the logs to Sumo Logic, where scheduled queries run every minute and trigger PagerDuty incident alerts. Shabib noted that having a solution in place that fires off alerts and reminders until the issue is resolved helped create a sense of accountability within the team. This ultimately helped enforce the generation of high quality logs, while allowing individuals to debug those issues more rapidly as they occurred. The team also has an escalation policy that kicks into gear when the primary contact is unable to acknowledge the incidents, allowing for the secondary on-call contact to take action. “I think PagerDuty helps put ownership into the hands of the engineer. Putting them closer to the incidents, so when one occurs, the right people who actually built that software get notified and can resolve and improve the problem,” stated Shabib. 
This was much better than the “swarming technique” which could potentially place incidents into the hands of someone without the proper context or knowledge to resolve it, not to mention the inefficient process that involved the entire team when the issue could have been handled by just one individual. The company's goal is to improve their operational metrics and reduce mean-time-to-acknowledge (MTTA) and mean-time-to-resolve (MTTR). “These metrics have improved a great deal with the help of PagerDuty, resulting in a 25% drop in incidents,” said Shabib. Gathering metrics using PagerDuty’s analytics feature allows the team to follow up on past incidents and measure the operational efficiency around the incident management process. “PagerDuty is resilient and guarantees that you will know when something problematic is happening to your apps. There aren’t a lot of services out there that can offer those guarantees.” - Mustafa Shabib, Head of Engineering, Quartet Providing resilience and guaranteed delivery PagerDuty has enabled Quartet to quickly and efficiently resolve incidents and decrease the number of incidents by 25%, while also reducing MTTA and MTTR. “If we didn’t have PagerDuty, we would be failing people in a way that goes beyond just customers. It would affect people's lives negatively if we allowed these incidents to occur without resolving them or having the urgency to resolve them. It’s not just a business failing but rather an ethical failing for patients,” said Shabib.
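The escalation behavior described for Quartet, where a secondary on-call contact is paged if the primary does not acknowledge, can be sketched in a few lines. This is a hypothetical model for illustration, not PagerDuty’s API or Quartet’s configuration; the class name, function name, and 15-minute timeouts are all assumptions.

```python
# Hypothetical model of a two-level escalation policy. It does not call
# PagerDuty; names and timeouts are invented for illustration.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationLevel:
    responder: str
    timeout_minutes: int  # how long this level has to acknowledge


def current_responder(levels: List[EscalationLevel],
                      minutes_unacknowledged: int) -> Optional[str]:
    """Return who should be paged after the incident has gone
    unacknowledged for the given number of minutes, or None if every
    level has timed out."""
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    return None  # policy exhausted without an acknowledgement


if __name__ == "__main__":
    policy = [EscalationLevel("primary", 15), EscalationLevel("secondary", 15)]
    for minute in (0, 14, 15, 29, 30):
        print(minute, current_responder(policy, minute))
```

With the assumed timeouts, the primary is responsible for the first 15 minutes, the secondary for the next 15, and after that the sketch reports that the policy is exhausted; a real policy would typically repeat or notify a manager at that point.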
SendGrid Enhances Employee Productivity and Reduces Downtime with PagerDuty
SendGrid is a proven cloud-based customer communication platform that successfully delivers over 25 billion emails each month for Internet and mobile-based customers. The company is headquartered in Colorado with over 300 employees, 23 of those within the operations team and approximately 84 in the development group. Mary Moore-Simmons, Engineering Operations Manager, is in charge of managing the infrastructure at SendGrid, which includes servers and data centers, the network behind it all, virtualization stacks, and backend systems. With the high rate of emails that are sent from SendGrid, there are a multitude of incident alerts generated on a daily basis. Finding a scalable enterprise-grade solution to help streamline and simplify the manual incident alert process was a top initiative for the company. Replacing previous alerting tool and overcoming scalability challenges SendGrid receives up to two thousand incident alerts in a typical day and tens of thousands per minute during technical incidents or outages. With such a large amount, it’s important for the company to address alerts quickly and efficiently. Before making the move to PagerDuty, SendGrid used a different vendor for alerting, but realized they needed a full-scale incident management solution in place to support their high volume of incidents. “When you have a tool in place, you want it to work, especially when there is an outage; that’s when you expect it to work,” said Moore-Simmons. Faced with scalability challenges, SendGrid decided to make the move to a reliable and scalable incident management solution. “PagerDuty helps us respond faster to the alerts that we receive. 
We’re able to diagnose outages faster, which in turn improves the experience of our customers and reduces downtime as well as any associated costs.” - Mary Moore-Simmons, Engineering Operations Manager, SendGrid Accelerating MTTA and MTTR by switching to a new incident management platform SendGrid implemented PagerDuty as their new incident management solution and uses the platform for collaboration, scheduling, escalation, and reporting. When on-call, a user is able to acknowledge an incident alert, escalate the alert if needed, or resolve the issue at hand, allowing them to move directly to the next incident without any delay. The main dashboard, which reports all incidents, is another critical benefit for SendGrid. “The way PagerDuty’s incident management dashboard’s UI is designed allows you to see what’s going on and what kind of alerts you are receiving. This is super helpful for us - no more having a list of alerts moving around at all times and losing focus on them,” said Moore-Simmons. Moore-Simmons finds PagerDuty’s reporting feature to be the most important asset for her role. Reporting on metrics enables her to gather insight around the number of alerts per day, per week, per month, and per year. “We had an estimate of 78,000 alerts happen this year and the company’s goal was to reduce the number of alerts by 50% compared to 2015. So far, we are on track with this metric, thanks to the support of PagerDuty,” stated Moore-Simmons. She was also able to determine that the team’s mean-time-to-repair (MTTR) is 19 minutes, while its mean-time-to-acknowledge (MTTA) is only 2 minutes. Gathering this type of information helps both Moore-Simmons and the other engineering managers identify what’s working, what’s not, and how to fix the problem. 
The biggest benefit to SendGrid was that their operations and development teams could now resolve outages quickly and prevent them from happening again, thanks to the reliable and rapid incident notifications. Every minute of an outage costs the company thousands of dollars and results in poor customer experience and customer churn, and with fewer outages, there has been less customer churn. Moreover, the team is now more satisfied and productive after switching to PagerDuty. “We have confidence in PagerDuty and no longer have to worry about unnecessarily long outages and revenue loss.” - Mary Moore-Simmons, Engineering Operations Manager, SendGrid Enhancing employee productivity and improving scalability SendGrid can rely on PagerDuty as a trustworthy solution to support their use cases, critical alerts, and scheduling. “We have confidence in PagerDuty and no longer have to worry about unnecessarily long outages and revenue loss. Everyone on-call at SendGrid uses PagerDuty and knows the solution as an established provider,” said Moore-Simmons. Employees are happy and productive, which is important to the business. Overall, the company has seen many advantages after switching to PagerDuty, including faster resolution times for outages, increased employee productivity and happiness, and impressive bottom-line metrics that attest to the company's operational efficiency.
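The MTTA and MTTR figures Moore-Simmons cites are simple averages over incident timestamps. The following is a minimal sketch of that arithmetic, not PagerDuty's reporting implementation: the `Incident` class, its field names, and the sample data are hypothetical, and the two sample incidents were chosen only so that the averages match the 2-minute MTTA and 19-minute MTTR quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_minutes(deltas):
    """Average a series of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mtta_minutes(incidents):
    """Mean time to acknowledge: trigger -> first acknowledgement."""
    return mean_minutes(i.acknowledged_at - i.triggered_at for i in incidents)


def mttr_minutes(incidents):
    """Mean time to repair: trigger -> resolution."""
    return mean_minutes(i.resolved_at - i.triggered_at for i in incidents)


# Made-up sample data: two incidents triggered at the same moment.
start = datetime(2016, 1, 1, 9, 0)
sample = [
    Incident(start, start + timedelta(minutes=1), start + timedelta(minutes=15)),
    Incident(start, start + timedelta(minutes=3), start + timedelta(minutes=23)),
]

print(f"MTTA: {mtta_minutes(sample):.1f} min, MTTR: {mttr_minutes(sample):.1f} min")
```

Tracking both numbers separately is useful because a low MTTA with a high MTTR points at diagnosis or repair as the bottleneck, while a high MTTA points at alert routing.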
Nelnet increases uptime, boosts employee productivity, and reduces costs with PagerDuty
With a core value of focusing on the customer, Nelnet provides innovative educational services in loan servicing, payment processing, education planning, and asset management. These products and services help students and families plan, prepare, and pay for their education while making the administrative and financial processes more efficient for schools and financial organizations. They are headquartered in Lincoln, Nebraska, with more than 3,400 associates who serve customers throughout the education life cycle. Nelnet’s IT department has multiple service tiers: intake, on-call, escalation, and product owners and architects. Ryan Regnier is an IT manager at Nelnet and is responsible for the tier 2 team, which handles on-call duty, escalates issues, and responds to outages as they arise. Managing a team of that nature involves a large number of critical incident alerts and figuring out how incidents are escalated to other service tiers. For these reasons, both Ryan and the company were in search of a solution that could help simplify these processes. Overcoming manual processes to manage on-call scheduling and incident escalation Nelnet is monitoring everything from web servers that process credit card payments to network devices that are transmitting traffic to web and database servers. The organization is monitoring 35,000 events at a given time, resulting in alerts firing off at all hours of the day. Before Nelnet implemented PagerDuty, managing on-call scheduling and escalations was a challenge because of the existing manual processes. If any app went down, the Network Operations Center (NOC) team members had to manually sift through pages of spreadsheets to identify who to contact. The spreadsheets outlined what to do when there was an incident alert, who to escalate the issue to, and how to react to each individual incident. This manual process didn’t easily scale, making it difficult for teams to work efficiently, and added time to the outage. 
This hurt customers as well as Nelnet; if the core payment processing site was down, customers couldn’t make payments, resulting in loss of revenue and customer dissatisfaction. Who to contact during an incident was also an issue; even with a 24x7 NOC team, the wrong people were being contacted and at the wrong time. Not only did this create frustration, but there was also no way to automate or customize how alerts were coming through. All of these obstacles resulted in a delay of incidents being resolved, customers unable to make payments, and a decrease in productivity due to the lengthy and complex manual process. Increasing operational efficiency and reducing costs Nelnet adopted PagerDuty to help minimize the challenges around scheduling, alerting, on-call escalations, and to help lower costs. One area where they were able to reduce costs was the NOC team. With PagerDuty’s automated and reliable incident management platform, Nelnet no longer needed to pay for a 24/7 NOC environment. “Before we brought in PagerDuty, we were looking for ways to cut costs and improve our incident response management. The PagerDuty solution has proven to be the right one for Nelnet. PagerDuty makes life easy,” said Regnier. An estimated 35,000 incidents are generated through Nelnet’s monitoring tools. These incidents, generated from file transfers and external websites, including those hosted on Amazon Web Services, are sent directly to PagerDuty. The typical use case for the on-call and escalation team consists of issues that come from any of their servers or services. PagerDuty alerts those on-call about the issue within seconds. This allows the on-call contacts to figure out what the problem is, escalate the issue if needed, and resolve it. Currently, Nelnet has 80 escalation policies, which are used multiple times each day. An example of these policies being used was when a large incident arose that required help from multiple teams. 
The incident management team logged into PagerDuty to send an email alerting the appropriate people about the issue. The solution then allowed people on-call to contact those individuals rather than blasting the notice out to everyone within those teams. Those involved ended up joining the incident call except for one person who was called every 5 minutes until the escalation policy kicked in after 20 minutes. Due to the escalation feature, the backup responder was able to acknowledge the alert and help get the issue resolved. “I would encourage everyone to consider PagerDuty. The cost savings can’t be overlooked. With PagerDuty, the person on-call is conveniently alerted with each incident. There is so much flexibility with scheduling and alerting the right people, it’s a simple decision to use PagerDuty.” - Ryan Regnier, IT Manager, Nelnet PagerDuty provides Nelnet the flexibility to contact users in a number of ways, including the option to receive alerts via text or email. “PagerDuty makes my team's lives easier and provides us with more structure. When finding a replacement for someone on-call, the solution provides that person with the option of being contacted in a variety of ways,” said Regnier. Nelnet is able to get services back up and running more quickly, enabling their customers to use the services and keep the business moving. “During the day we have people on-call who can respond to a server that has gone down within minutes of it happening. Depending on the complexity or nature of the problem, we can have it back up in 10 minutes or less. We know about these alerts within seconds and can respond to them within minutes,” stated Regnier. With increased uptime and employee productivity, PagerDuty has saved Nelnet $650,000 annually. Improving uptime, agility, and employee satisfaction Before PagerDuty there was little way of tracking outages. Now, they have critical data at their fingertips. 
Any incident or triggered item from up to a year can be reviewed. “When we were evaluating PagerDuty, we found there weren’t other organizations that had such a complete product offering, or feature set, and they weren’t as easy to use,” said Regnier. PagerDuty helps Nelnet increase uptime and employee productivity, provide teams with flexibility, and ensure that incidents are always addressed.
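The escalation flow in Nelnet's example (one responder called every 5 minutes, then the policy escalating to a backup after 20 minutes) can be sketched as a small timeline model. This is an illustrative sketch only, not PagerDuty's API or implementation: the function name, parameters, and responder labels are hypothetical, and it assumes that a responder acknowledges as soon as the first notification reaches them.

```python
from typing import List, Tuple


def notification_timeline(
    levels: List[str],
    acknowledger: str,
    retry_every: int = 5,      # minutes between repeat notifications (from the Nelnet example)
    escalate_after: int = 20,  # minutes before moving to the next level (from the Nelnet example)
) -> List[Tuple[int, str]]:
    """Return (minute, responder) pairs for every notification sent.

    Each escalation level is notified repeatedly until either the
    acknowledger is reached or the escalation timeout moves the
    incident to the next level.
    """
    timeline = []
    for level, responder in enumerate(levels):
        level_start = level * escalate_after
        for minute in range(level_start, level_start + escalate_after, retry_every):
            timeline.append((minute, responder))
            if responder == acknowledger:
                # Assumption: acknowledgement happens on first contact.
                return timeline
    return timeline  # nobody acknowledged; all levels exhausted


# The primary never answers, so the backup is reached at minute 20.
timeline = notification_timeline(["primary", "backup"], acknowledger="backup")
for minute, responder in timeline:
    print(f"t+{minute:02d} min: notify {responder}")
```

Modeling the policy as ordered levels with a timeout makes the trade-off explicit: a shorter timeout reaches the backup sooner, at the cost of paging more people for incidents the primary would have handled.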
Manheim improves product agility and empowers collaborative teamwork with the help of PagerDuty
Manheim® is North America’s leading provider of vehicle remarketing services, connecting buyers and sellers to the largest wholesale used vehicle marketplace and most extensive auction network. Through its 115 physical, digital and mobile auction sales, the company helps dealer and commercial clients achieve business results by providing innovative end-to-end inventory solutions. At Manheim, the team is heavily investing in the future of software delivery, which is a strategic business driver. To move this aspect of the business forward, Manheim has adopted DevOps practices alongside traditional IT operations. Jason Riggins, director of production engineering, is in charge of providing strategic direction, leadership and oversight for a number of teams: release engineering, development operations, and site operations. These “production engineering” teams serve as a core foundation for ensuring reliable software delivery, and therefore, revenue streams. Overcoming the communication gap when critical incidents occurred Prior to having PagerDuty in place, Manheim had a manual follow-up process when incidents occurred and critical apps and services were impacted. On-call responders would have to use a Google phone number or on-premises phone system to file an incident. Manheim needed to keep up with their production engineering teams and improve their methods for recruiting the right responders when incidents occurred. They required a platform that could be standardized across all development teams. The challenge was that the company was very siloed as a result of past organizational changes. This siloed nature within IT operations was becoming more and more challenging for Manheim. “We were less agile. We had tickets built up in the queue, issues due to sheer siloed teams, and a communication gap that was created between the different units,” stated Riggins. 
“Something had to change for us to remain cutting edge.” “PagerDuty was overall a more mature product, which is why we chose them.” - Jason Riggins, Director of Production Engineering, Manheim Recruiting the right people, with the right information to reduce downtime “We did a bake off between PagerDuty and one of their competitors. One of the reasons we went with PagerDuty is because of the track record that they had already established in the industry and the proven customer base that existed. PagerDuty was overall a more mature and feature-rich platform, which is why we chose them,” stated Riggins. Manheim has since changed the way they develop software and improved their IT support and response capabilities. “A big part of these business-impacting changes is having PagerDuty because it helped us become more efficient,” stated Riggins. Currently, the organization is using PagerDuty for incident management of various services, on-call scheduling, escalation policies, event management and a customized API integration. Being able to customize notifications, scheduling and escalation policies helps Manheim recruit the right team, every time, while the event management feature enables Manheim to aggregate incidents and reduce mean time to resolution (MTTR). Implementing PagerDuty enabled Riggins and his teams to seamlessly assign workloads and incidents to the appropriate teams. The organization no longer depended on one team opening tickets when an incident occurs, and then waiting for another team to respond. “With PagerDuty, the power of managing incidents within and across teams allows them to develop their own escalation policies and become self-managed,” said Riggins. As a result of these business-impacting benefits and tangible ROI, Manheim continues to grow its implementation of PagerDuty. When Manheim started looking for a solution, they found PagerDuty helped automate the work of their after-hours teams. 
“PagerDuty enabled us to move after hours dedicated headcount to the day time which increased overall productivity,” said Riggins. That’s when the company moved to a capability team model built on a “you build it, you run it” approach. The team develops the software and supports it, and PagerDuty helps with the monitoring and alerting of the application, increasing overall availability. The Enterprise Operation Center (EOC) identifies the top critical alerts that come through with an immediate response for each DevOps team, rather than sending them every single alert. DevOps teams also add the EOC team to their escalation policies, so if a developer on call misses an alert, the EOC serves as another line of defense for the organization. Another advantage PagerDuty offers Manheim is its out-of-the-box integrations with Datadog, New Relic, and Amazon CloudWatch. “We have a suite of monitoring tools and PagerDuty allows us to be more proactive, quickly,” said Riggins. There was a time when the organization rolled out a change and their service levels dropped, queues were backed up and the response time wasn’t where they needed it to be. Manheim has now standardized on an operations stack of PagerDuty, Datadog, and New Relic. Finding a comprehensive and agile incident management solution PagerDuty enables Manheim to connect to critical IT services and generate event data with the right teams at the right time, which increases their operational agility and IT efficiency. Overall, the enterprise-grade incident management solution has improved trust and communication within the organization. “PagerDuty is going to end up growing in user count and become the standard. We rely on the stability and accuracy that we get through PagerDuty. If you ripped the solution out, we would be back to square one,” stated Riggins.
Signal Sciences addresses security anomalies quickly, keeping customer data safe with PagerDuty
Signal Sciences is making “smart security for the modern web.” The company – whose founding team ran security at ecommerce site Etsy – helps customers get visibility into security threats and provides insights to prioritize security resources to address attacks as they occur. With greater visibility and coverage, security teams are able to make informed decisions and confidently run their business’ web applications. Overcoming the challenges around timely security incident management and resolution Zane Lackey, founder and Chief Security Officer (CSO) at Signal Sciences, is the executive responsible for the organization's entire security posture. Securing customer data, protecting their next generation web platform, and internally delivering a secure IT infrastructure is his focus. The company as a whole wanted to overcome challenges not only around security incident management, but also around alerting and resolution. Issues around manually coordinating security incident management and response involved a great deal of effort and maintenance. “It wasn’t pretty! Previously at other companies, we had to build our own internal versions of PagerDuty to get alerts when critical security events were occurring,” said Lackey. “Time and resources would go towards developing and maintaining these home-grown solutions versus focusing on the security imperative. This wasn’t the best use of our time as security practitioners.” As a repeat customer, Lackey was familiar with the pitfalls and wasted resources of running an environment without PagerDuty. Shortening MTTA and MTTR to move the business forward securely and quickly PagerDuty enables Signal Sciences to orchestrate the ideal response and reduce the impact of security incidents by notifying and recruiting the right people to address system anomalies. 
Signal Sciences has instant visibility into incident status and who is on call, unlimited escalation options, and the ability to recruit additional responders from any team. “PagerDuty helps us stay on top of our security posture and resolve security incidents faster and more consistently,” said Lackey. Integrating with security monitoring and log management tools, PagerDuty provides a unified view across the entirety of security operations, with built-in triage and scheduling capabilities to ensure security teams work collaboratively to address anomalies quickly. “In a timely fashion, we want to ensure that there is a path of escalation and PagerDuty allows us to trust that nothing will slip through the cracks,” stated Lackey. “PagerDuty helps us stay on top of our security posture and resolve security incidents faster and more consistently.” - Zane Lackey, CSO, Signal Sciences PagerDuty plays a key role when escalating and recruiting people as needed to help shorten mean-time-to-acknowledge (MTTA) and mean-time-to-resolve (MTTR). PagerDuty combined with Signal Sciences’ own web application visibility and defense product has helped the company reduce its MTTR by getting subject matter experts to explore and investigate anomalies faster, which in turn keeps customer data and the IT infrastructure safe. PagerDuty allows Signal Sciences to be more proactive than reactive with investigations. In the end, the combination of Signal Sciences’ next-generation web application firewall and PagerDuty allows Signal Sciences to not only provide proactive response capabilities so their product and infrastructure are secure, but also provide the same measures to protect their customers. 
Improving operational reliability and agility Signal Sciences has experienced first hand the many benefits that implementing PagerDuty brings, including notifying the right subject matter expert or team in time, escalating issues to the right expert when needed, and most importantly, giving teams the confidence that they will receive the critical information they need when they need it. “PagerDuty allows us to move quickly. Being able to react immediately helps us move our business faster,” said Lackey.
AppDynamics Relies on PagerDuty to Automate Its Incident Response Workflows
'PagerDuty helps drive clarity during our incidents.' - Thomas Morse, Sr. Director of IT & Operations
ZENCONNECT Delivers Superior Support with PagerDuty's Proactive Incident Management Platform
With PagerDuty, ZENCONNECT routinely achieves alarm resolution in less than an hour, providing close-to-perfect customer service Any way you look at it, 2015 was a banner year for ZENCONNECT. The French network-solutions provider saw turnover (also known as total revenues) increase by 64% last year as they made the leap from niche-market to enterprise-level service provider. As they made this leap, however, they were not without early growing pains. ZENCONNECT was building rapidly on a homegrown system, in which the main support phone number rerouted to different technicians’ cells throughout the week. Not only did this result in a number of misdirected issues, but only tier 1 alerts got the attention they needed. Without a full view of their infrastructure, ZENCONNECT’s engineers were constantly playing catch up. Offering remote network management and outsourcing, event WiFi and VPN services, and network-infrastructure project management, ZENCONNECT insists on providing flexible, crystal-clear, proactive solutions to SMEs and large firms alike. As the company sought to scale its IT infrastructure to keep pace with growth, PagerDuty stood out as offering the flexibility, data-driven clarity, and proactive approach they needed to succeed. And when ZENCONNECT partnered with PagerDuty, they started succeeding in a very big way. Flexibility to Meet End-Users’ Needs The first and most obvious benefit ZENCONNECT has seen with PagerDuty is that technicians no longer have to waste time redirecting alerts not meant for them. Using PhoneDuty, a PagerDuty/Twilio integration hosted on Heroku, ZENCONNECT can automatically route clients’ calls to the proper on-call technician. Not only does this mean seamless customer service for ZENCONNECT’s clients, it means less stress for technicians trying to recharge and greater efficiency for those on call. And when they are on call, ZENCONNECT’s technicians use PagerDuty’s rich alerting to customize the way they are notified. 
This helps cut down on missed issues by routing them to multiple devices if necessary. Plus, urgency-based features make it possible to cut through the noise and focus on the most critical incidents first. With this functionality in place, says CTO Yohann Lecornet, ZENCONNECT has eliminated alert fatigue and built the blazing speed that can only be achieved in a truly flexible alerting environment: “With PagerDuty we’ve achieved response times we could only dream of before. Our agents now average 33 seconds before picking up a ticket and only 2.6 hours before responding with a solution.” System-Wide Visibility Not only does PagerDuty provide flexibility and speed from incident to incident, but that increase in efficiency means that ZENCONNECT’s engineers can get out of the weeds and get a big-picture view of their incident-response infrastructure. Which, you have to admit, is pretty crucial when you’re in the business of responding to network infrastructure incidents. For starters, PagerDuty’s dashboard allows ZENCONNECT to centralize, classify, and enrich events from all of its monitoring platforms. From there, teams can bundle actionable events to identify and resolve critical issues while providing contextual insights, including graphs, images, and runbook links, right in the incident details. The baseline clarity this event grouping and enrichment provides is ZENCONNECT’s first step in establishing fine-tuned triage protocols and cutting down on response time. From there, PagerDuty’s analytics lets ZENCONNECT visualize and analyze everything from system-wide efficiency down to individual performance. The big picture allows ZENCONNECT to identify and mitigate SLA hotspots while tracking MTTR trends so they can optimize operational protocols. The granular view allows them to evaluate agility and productivity by team and individual. 
These analytics make it possible to load-balance work streams and recognize team heroes, building morale and efficiency simultaneously. The results, says Lecornet, are indisputable: “With PagerDuty, we have managed over 5,000 alarms with a loss of 0%. We just don’t miss anything anymore.” Proactive Incident Management Having established a flexible alerting environment backed by crystal-clear analytics, ZENCONNECT is done playing catch up. Now they’re ahead of the game, confident in their ability to provide proactive solutions to their clients’ needs because PagerDuty is a proactive solution to their own. ZENCONNECT now routinely achieves alarm resolution in less than an hour, and over three-quarters of their tickets can be resolved with a single action. As with any operations metric, says Lecornet, it’s less about what their technicians see and more about what their customers don’t see: “Our customer experience is close to perfect now. We are preventing almost all issues before they affect our clients. PagerDuty has been essential to this achievement.”
Wehkamp Creates Shopping Magic Using PagerDuty
Everyone has had one of those magic shopping moments. You set out with a vague idea of what you need, but then you find that perfect, unpredictable, trend-setting thing. Wehkamp, the largest e-commerce retailer in the Netherlands, has made it a mission to surprise its customers this way. The company is constantly innovating to deliver customized yet surprising results on the cutting edge of fashion, home and garden, electronics, gaming, sports, and beauty. With 128 million visits per year, 1.7 million regular customers, and more than 7 million shipments annually, it takes the coordination and teamwork of over 200 IT staff to deliver a seamless online shopping experience 24/7/365. So when Wehkamp began to expand globally in October 2014, they came to PagerDuty looking for a critical IT ops solution that would give them the speed, reliability, and flexibility to build on their promise to innovate and surprise, even as their customer base expanded rapidly. In E-Commerce, Downtime Is Money Before partnering with PagerDuty, incident resolution times were unacceptably high for Wehkamp, especially for the fast-paced, high-volume world of e-commerce. The company averages a daily turnover of €2–2.5 million, and on action days it is even more. Just one hour of downtime during primetime already means a loss of turnover. Willem van den Broek, IT manager for financial systems, says he knew it was time to upgrade. “We needed a proactive system that would put solutions in place before an issue could arise. PagerDuty delivered.” Wehkamp had previously relied on a cumbersome SMS alert system that was inflexible and significantly hindered their performance. All incidents were being automatically sent to a single on-call engineer, essentially rendering them a dispatcher and distracting from other value-driving work. Not only that, but often it would be unclear who to give the issue to, meaning even more time was lost clarifying workflow. 
PagerDuty Drives Ownership and Action With PagerDuty, Wehkamp has seen drastically reduced downtime yield significant savings. PagerDuty’s critical operations performance software manages the on-call workflow for Wehkamp and gets the problems to the right people immediately. That’s how PagerDuty drives responsibility and ownership while slashing incident-resolution times. When Wehkamp’s engineers hear from PagerDuty, they know the issue is theirs to resolve directly. “Our engineers trust PagerDuty to bring them the issues they have the expertise to address, and that builds confidence, ownership, and team unity. PagerDuty’s reliability is like nothing else we’ve seen.” Wehkamp’s team is also enjoying greater flexibility thanks to PagerDuty’s mobile app. Now on-call engineers can step out of the office for a bite to eat or hit the gym without worrying about missing critical incidents, causing downtime. Improving quality of life is a priority that PagerDuty and Wehkamp share. “Employee well-being is a pillar of our corporate social responsibility. We are free to just live our lives now because we can trust PagerDuty to keep us up to speed wherever we go.” With the speed, reliability, and flexibility that PagerDuty has delivered, Wehkamp has eliminated costly downtime, improved operations performance, and found the freedom to focus on delivering those perfect shopping surprises to their customers.
Backcountry Delivers Exceptional Customer Experiences Using PagerDuty
Backcountry.com makes a business out of understanding people who love the outdoors. These are customers who live and breathe hiking, biking, skiing, running and numerous other sports and activities. They’re on 24/7 and they’re always ready for another adventure. This is why Backcountry’s production infrastructure needs to work—and work well—so that these customers have an adventure equipment outlet that is just as reliable as the gear they use to traverse the great outdoors. With Great Growth Comes Great Responsibility Backcountry previously relied on a network operations center (NOC) that monitored their infrastructure with people around the clock to address problems. This central authority assigned issues to engineers when outages occurred, but the process was less than perfect. “Things would break and not everyone who needed to know would be notified.” As the company grew, a bigger customer base and high-traffic consumer holidays like Black Friday strained Backcountry’s infrastructure. The team needed a better way to maintain uptime, ensuring the right engineers were assigned to the right problems as quickly as possible. “We needed more transparency into the process.” Mata had implemented PagerDuty successfully at several previous companies, so he didn’t have to look far for Backcountry’s solution. “The implementation was straight-forward and the UI was easy to use.” PagerDuty was perfect for the company’s engineering culture, which values getting things done as transparently as possible. With PagerDuty, everyone, even those outside of the engineering teams, would know the state of the systems’ health. More Resources Equal More Results Every incident at Backcountry impacts revenue — so much so that their team tracks the business impact of their outages. Because of PagerDuty, they have more stability and better system reliability, and when outages do happen, they are able to efficiently address them, maintaining revenue and customer satisfaction. 
PagerDuty changed the way Backcountry does business. Instead of just notifying engineers when issues occur, the NOC is now able to contribute even more value to the company. And it has more to work with since PagerDuty has streamlined operations by making their processes even more efficient. “It’s allowed us to transform the NOC into a value-driving operation.” Because PagerDuty aggregates all infrastructure data, the entire engineering team has full ownership of problems that occur. That means they’re enabled to fix things and fix them quickly, resulting in more reliable systems and happier customers. “PagerDuty allows us to get things done and get them done transparently. That includes visibility into response time, incidents and engagement.”
Nextdoor Keeps Its Own House In Order With PagerDuty
Local neighborhoods are the backbone of entire communities, cities and countries. They’re where we learn, work, live and play. Keeping them safe and sound is a top priority, and it’s what Nextdoor does best. Nextdoor is a private social network where local communities gather online to make their neighborhoods safer, happier and more prosperous. Nextdoor helps locals do everything from reporting break-ins to recommending local services to organizing communal events. They do it so well that over 57,000 neighborhoods in the United States rely on Nextdoor to stay connected with their communities every day. It’s not just a career for the Nextdoor team. It’s a calling. Which is why the company relies on PagerDuty to keep things in its own backyard running smoothly. Ideal Neighbors Nextdoor and PagerDuty have been great neighbors almost from the beginning. The company started using PagerDuty shortly after its public launch in 2011. At that time, the Nextdoor engineering team consisted of five people. And every one of them received a notification when something went wrong with the company’s infrastructure. That led to missed sleep, miscommunication and, worst of all, missed issues. “Sometimes, we’d only find out about issues when a customer sent us a note.” No one was fully accountable for flagged issues, and Nextdoor lacked a way to track or audit them. Different dev teams had different on-call schedules and all needed access to different systems, which further muddied the waters. When the company’s Head of Operations suggested Nextdoor take a look at PagerDuty to solve the problem, the team was all ears. A Solution That Grows Up With Nextdoor Like any good Neighborhood Watch program, PagerDuty immediately started aggregating important issues and flagging them for the right people. Response times dramatically improved. And Nextdoor’s devs knew exactly who was responsible for fixing them. 
“It helps us manage in an explicit way who is accountable for being the first line of defense.” With PagerDuty, Nextdoor now tracks mean time to acknowledge (MTTA) and mean time to resolve (MTTR), metrics that they couldn’t track before. The company does rigorous post-mortems based on the data to further improve ops. And devs know exactly how available they need to be. “As a result, response times are really good.” The best part? PagerDuty is able to grow up with Nextdoor. The engineering team alone now numbers 45. As the team expands, Nextdoor can easily model how PagerDuty will scale with the company, instead of wasting time and energy building complex systems to handle the load. And no matter how big Nextdoor grows, they can always count on PagerDuty to deliver results. “PagerDuty is really reliable. We actually do sleep now.”
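For readers unfamiliar with the two metrics Nextdoor tracks, the sketch below shows one common way to compute them. It is only an illustration: the incident timestamps are invented, and measuring MTTR from the moment an incident is triggered is an assumption, not a description of how PagerDuty or Nextdoor calculates it.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (triggered, acknowledged, resolved).
incidents = [
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 4), datetime(2024, 1, 1, 2, 34)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 2), datetime(2024, 1, 2, 14, 22)),
]


def mean_minutes(deltas):
    """Average a sequence of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


# MTTA: average time from trigger to acknowledgement.
mtta = mean_minutes(ack - trig for trig, ack, _ in incidents)
# MTTR: average time from trigger to resolution (assumed definition).
mttr = mean_minutes(res - trig for trig, _, res in incidents)

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Tracked over time, a rising MTTA usually points to a paging or scheduling problem, while a rising MTTR points to how hard the incidents themselves are to fix.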
Ping Identity: Keeping Identities Safe With PagerDuty
Organizations all over the world trust Ping Identity to keep their employees’ identities safe. Their identity and access management platform gives enterprises one-click access to any application from any device. Ping Identity’s purpose is to enable and protect identity, defend privacy and secure the Internet.
Early Ping Identity Challenges
Keeping identities secure is a difficult task. Ping Identity takes their responsibility seriously and has invested heavily in their infrastructure to make sure their solutions are highly reliable. Their world-class distributed system means that if they experience data center failures because of a hurricane on the East Coast or an earthquake on the West Coast, their customers will not experience a disruption in their service. Ping Identity’s infrastructure team is responsible for ensuring that all of Ping Identity’s services are always up and extremely fast. To keep a pulse on how their systems are functioning day-to-day, they have created a multi-layered stack architecture to monitor their systems. Using a variety of solutions, from New Relic for applications to Splunk for machine data, Ping Identity takes care of their IT systems so their customers never have to worry. However, having multiple monitoring tools presented a set of challenges. Alerts from various monitoring tools became difficult to manage and distribute. On-call responsibility was rotated weekly, but whenever someone went on-call they had to log into every monitoring system to change the alert notifications to their email. This process was cumbersome and error-prone. Also, multiple teams worked with each monitoring tool, so routing issues directly to the right person was difficult. Since there was no way to centrally manage alerts, alerts were missed or sent to the wrong person, causing delays. This inefficient alerting process extended costly downtime.
“As the team scaled, our method of changing email addresses was not going to work.” - Beau Christensen, Manager of Infrastructure Operations
To provide a highly reliable service for their customers, Ping Identity needed to add an incident management solution to their infrastructure to resolve issues more quickly.
How did PagerDuty Help?
Since extreme reliability is important to Ping Identity, they require the same reliability from the solutions they add to their infrastructure. Like their customers, they need to trust the tools that help keep their business going. Ping Identity has been relying on PagerDuty to reduce their incident response times. With dependable alert routing and escalation, the right person is reached faster, every time. There is no longer any need to go into every monitoring tool to change contact information before every on-call rotation. On-call engineers can maintain their contact information in PagerDuty and, when it’s their turn to be on-call, PagerDuty routes alerts directly to them via multiple methods. Missed alerts were one of the biggest pains that Ping Identity experienced before PagerDuty. With multiple alerting methods, every engineer can personalize the combination of alerts that works for them. Additionally, by automatically escalating missed alerts, Ping Identity never has to worry about extended outages again. Not only are the right alerts going to the right person, but PagerDuty has helped to decrease alerting noise. With alert deduplication, PagerDuty has alleviated alert fatigue by bundling alerts to prevent event storms during an outage. Each incident can be analyzed to report how well each person and team is performing. By visualizing alerts, Ping Identity can spot trends and make informed decisions about how to improve their infrastructure. “If everything is going smoothly, no one should experience an interruption of service. 
With PagerDuty, it’s easier to keep that promise.” - Beau Christensen, Manager of Infrastructure Operations PagerDuty was initially adopted within the infrastructure team and has since expanded to the support, developer and help desk teams. PagerDuty has seamlessly integrated into Ping Identity’s culture and has made it easy for cross-functional teams to collaborate on incidents. Easier collaboration combined with more effective alerting has decreased incident repair time by 100%. “PagerDuty has blossomed within Ping Identity. It has become a core piece of our infrastructure.” - Beau Christensen, Manager of Infrastructure Operations If Ping Identity services are down, people cannot do their jobs. With PagerDuty, Ping Identity can deliver a highly reliable service so its customers can always have secure access to their favorite applications.
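The alert deduplication mentioned in the Ping Identity story can be pictured with a small sketch. This is a simplified illustration of the general idea, not PagerDuty's implementation: the function name, the alert fields and the sample alerts are all assumptions made for the example.

```python
def deduplicate(alerts):
    """Bundle alerts that share a dedup key into a single incident each."""
    incidents = {}
    for alert in alerts:
        # The first alert seen for a key opens the incident; later ones only bump the count.
        incident = incidents.setdefault(
            alert["dedup_key"], {"summary": alert["summary"], "count": 0}
        )
        incident["count"] += 1
    return incidents


# Hypothetical event storm: one failing database reported by several checks.
storm = [
    {"dedup_key": "db-01/down", "summary": "db-01 is unreachable"},
    {"dedup_key": "db-01/down", "summary": "db-01 is unreachable"},
    {"dedup_key": "db-01/down", "summary": "db-01 is unreachable"},
    {"dedup_key": "web-03/latency", "summary": "web-03 latency is high"},
]

bundled = deduplicate(storm)
for key, incident in bundled.items():
    print(key, incident["count"])
```

Here four raw alerts collapse into two incidents, so the on-call engineer is paged twice instead of four times.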
Brightcove: Providing Reliable Video Solutions With PagerDuty
Brightcove is a leading global provider of cloud services for video. The company offers products that revolutionize the way organizations deliver video experiences, including Video Cloud, the market-leading online video platform, and Zencoder, a leading cloud-based media processing service and HTML5 video player technology provider. Brightcove has more than 6,300 customers in over 70 countries that rely on Brightcove cloud content services to build and operate video experiences across PCs, smartphones, tablets and connected TVs.
Early Brightcove Challenges
Three years ago, Brightcove embraced a DevOps model to give their engineers ownership over the design, production and support phases of their code. Building high-quality software faster is the goal of DevOps, but most DevOps transitions fail due to a lack of cultural and technological change to support the shift. Brightcove took the first cultural step toward this model by expanding on-call to developers. It made sense that the people who built the code would be the ones to fix issues when they occurred. Brightcove’s strong team-oriented culture allowed employees to choose their own on-call rotations. These schedules were difficult to manage as new team members were added and frequent changes were needed. Additionally, these schedules did not solve the issue of managing incident life-cycles. It was unclear who was working on an incident and what stage it was at. This caused issues to prematurely escalate to other teammates or managers. “We lacked visibility into incident ownership, which impacted our efficiency.” - Brian Sensale, Senior Engineering Manager All on-call engineers received alerts through Blackberries that were synced with their monitoring tools. Blackberries were rotated amongst the team members, but this approach was cumbersome, error-prone and limited the participation of those out of range. There was also no way to escalate an issue to another team member if it was missed. 
As their teams grew to span three global offices, they needed to figure out how to fairly share on-call responsibility. “We needed a solution that would match our flexible work environment. Exchanging a physical device did not scale.” Brightcove needed to make the logistics of on-call rotation simpler and move scheduling responsibility directly to the on-call engineers. After all, if engineers have to take on the on-call duty, they should have a say in when they are on call and how they are notified. To fully transition towards a DevOps model, they needed the technology to accompany their cultural shift.
Giving Control To On-call Engineers
Having a balanced lifestyle is important to Brightcove’s engineers. When team members want to go on vacation during a time they’re scheduled to be on-call, they work together to find substitutes. By using PagerDuty for on-call scheduling, changes can be made without any hassle. “After we horse trade on-call duties, it’s a breeze to update schedules in PagerDuty.” Missed alerts at Brightcove due to ineffective alerting via their Blackberries are not a problem anymore thanks to PagerDuty. Engineers can now control the manner in which they are notified. Some teammates live in places with bad cell coverage, so they have added their home phone to their alert policy. All on-call engineers can now choose the alert settings that are most effective for them. If the primary on-call engineer misses alerts, the secondary engineer will be alerted. With a natural flow of escalation, PagerDuty acts like a safety net for Brightcove incidents and ensures all problems will be quickly addressed. “I can’t imagine life without PagerDuty. Having multiple alerting methods and escalations are no-brainers.” To increase incident visibility, Brightcove has integrated PagerDuty with HipChat so everyone can track the lifecycle of the incident. This means there is no more scrambling to see who is taking care of an incident and if it has been resolved. 
And other teams can jump in to help if needed. Managers aren’t mistakenly alerted for low severity issues anymore and can be brought in quickly when there’s a larger issue. “We have less of a fire drill with PagerDuty. We now know if an incident is being handled and by whom. It is a stress reliever.” With PagerDuty, Brightcove has the technology to support their DevOps shift and to deliver a high quality, highly reliable service for their customers.
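The primary-to-secondary escalation flow described in the Brightcove story can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function, the responder names and the acknowledgement check are hypothetical, and a real system would wait for a timeout at each level rather than ask immediately.

```python
def escalate(policy, acknowledges):
    """Notify each escalation level in order until someone acknowledges.

    Returns the responder who acknowledged (or None) and everyone notified.
    """
    notified = []
    for responder in policy:
        notified.append(responder)
        if acknowledges(responder):
            return responder, notified
    return None, notified


# Hypothetical policy: the primary misses the alert, the secondary picks it up.
policy = ["primary", "secondary", "manager"]
owner, notified = escalate(policy, lambda responder: responder == "secondary")
print(owner, notified)
```

Because the walk stops at the first acknowledgement, the manager in this example is never paged, which mirrors the point above about managers no longer being alerted for issues that are already being handled.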