Tokopedia Automates Incident Response and Sees Greater Engineer Accountability With PagerDuty

Size: 1,001 - 5,000 Employees

Industry: Technology

Location: Jakarta, Indonesia

Customer Since: 2018

Key Integrations:

Datadog
Firebase
Grafana
New Relic
Prometheus
Scalyr
StackStorm

Indonesian technology company Tokopedia is one of Southeast Asia’s largest marketplace businesses, with 100+ million monthly active users and 9+ million merchants on the site. Tokopedia prides itself on being more than just a marketplace, offering technology that empowers millions of merchants to participate in eCommerce.

Rajesh Gopala Krishnan is Tokopedia’s AVP of Engineering Productivity and executes the platform’s shared technology and services vision. “Tokopedia’s mission is to democratize commerce through technology,” he explained. “We help small retailers to become big brands, allowing them to reach a more diverse customer base and make it easier for them to do business across Indonesia and beyond.”

‘Born digital’ in 2009, Tokopedia dedicated itself to digital transformation two years ago when its customer base expanded rapidly. Tokopedia modernized its technology stack, shifting from monolithic infrastructure to a microservices-based, multi-cloud architecture, running 350+ services.

Manual to automated. Tokopedia increased daily software deployments by 3,000%

Increasing Complexity Leads to Slower Incident Response

However, this shift to a more dynamic, scalable architecture made it difficult for Tokopedia’s in-house incident management tools to keep up with alerts and for its teams to respond effectively. This meant incident response was taking longer and kept engineering resources away from improving the customer experience and building new services for merchants and customers. Tokopedia also experienced a high volume of alert noise, making it difficult to prioritize incidents.

“Our tools were identifying incidents, but addressing them was taking too long,” explained Krishnan. “Most usually took 30 minutes to resolve because we were manually looking up who was responsible for a particular service before notifying engineers and setting up war rooms to address the issue. We soon realized we needed a modern, automated incident response process to gain visibility into this complex environment, which is why we turned to PagerDuty.”

Automating Incident Response With PagerDuty

Since adopting PagerDuty, Tokopedia has automated its incident response processes and reduced the time it takes to resolve incidents. After initially integrating PagerDuty with five services, Tokopedia saw dramatic improvements in metrics such as mean time to repair (MTTR) and decided to scale up the deployment to all 350+ services.

Additionally, PagerDuty has helped to reduce alert noise. “Instead of being bombarded with alerts, PagerDuty groups related alerts into one single incident, with all the details in one place rather than scattered across multiple tools. This not only reduces alert noise, but also helps us prioritize the most urgent incidents,” Krishnan shared.
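The grouping Krishnan describes can be illustrated with a minimal sketch. It is not PagerDuty's actual grouping logic, which the source does not detail. It assumes a simple hypothetical rule: alerts for the same service that arrive within a fixed time window of one another are folded into one incident. The names `Alert`, `Incident`, `group_alerts`, and `window_seconds` are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str      # service that raised the alert
    summary: str      # short human-readable description
    timestamp: float  # arrival time in seconds


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window_seconds=300.0):
    """Fold alerts for the same service into one incident when each alert
    arrives within `window_seconds` of the previous one (assumed rule)."""
    incidents = []
    latest = {}  # service -> (open incident, timestamp of its last alert)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        entry = latest.get(alert.service)
        if entry is not None and alert.timestamp - entry[1] <= window_seconds:
            incident = entry[0]  # still inside the window: reuse the incident
        else:
            incident = Incident(service=alert.service)  # start a new incident
            incidents.append(incident)
        incident.alerts.append(alert)
        latest[alert.service] = (incident, alert.timestamp)
    return incidents


if __name__ == "__main__":
    sample = [
        Alert("checkout", "high latency", 0),
        Alert("search", "error rate up", 30),
        Alert("checkout", "timeouts", 60),
        Alert("checkout", "5xx spike", 120),
        Alert("checkout", "high latency", 1000),
    ]
    for incident in group_alerts(sample):
        print(incident.service, len(incident.alerts))
```

In this sketch the first three checkout alerts collapse into a single incident, so a responder sees one item with all the details attached instead of three separate notifications.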

Tokopedia’s investment in digital transformation and modern incident response also meant it was well-prepared to deal with the peaks in demand that followed the onset of the COVID-19 pandemic in Southeast Asia. “By moving to the cloud and adopting PagerDuty, we’ve been able to gain greater control over the number of incidents we face. This was particularly crucial during the surge in online shopping we experienced during the COVID-19 outbreak and meant we could respond to incidents faster to ensure minimum disruption for sellers and shoppers,” Krishnan said.

PagerDuty has also helped Tokopedia embrace full-service ownership and foster a culture of responsibility, something it had previously struggled to do with its in-house incident management tools.

As Krishnan explains, it was often unclear who should respond to an incident when it came in. “What was missing was accountability—who is responsible for this service or application? Have they seen there is a problem and are they working to solve the problem? We didn’t have a very clear picture of this.”

On-call engineers also carried additional phones so that teams could reach them when an alert came in. But even then, getting hold of the right people was tricky because there was no centralized way to manage escalations. “With PagerDuty, we’ve been able to eliminate manual incident response processes. Instead, when an alert comes in, we are automatically routing incidents, based on our escalation policies, to whoever is responsible for a particular service,” Krishnan explained.
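As a rough illustration of routing by escalation policy, the sketch below maps each service to an ordered list of responders and escalates to the next level when an incident stays unacknowledged. This is an assumption-laden model, not PagerDuty's API or Tokopedia's configuration: the `EscalationPolicy` class, the `responder_for` function, the responder names, and the timeout values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    levels: tuple              # responders in escalation order (hypothetical)
    timeout_minutes: int = 15  # minutes to wait before escalating one level


def responder_for(service, minutes_unacknowledged, policies):
    """Return who should be notified for `service` after the incident has
    gone unacknowledged for the given number of minutes."""
    policy = policies.get(service)
    if policy is None:
        raise KeyError(f"no escalation policy defined for service {service!r}")
    # Each elapsed timeout moves one level up; stay at the last level after that.
    level = min(minutes_unacknowledged // policy.timeout_minutes,
                len(policy.levels) - 1)
    return policy.levels[level]


if __name__ == "__main__":
    # Illustrative policy table; the names are placeholders.
    policies = {
        "checkout": EscalationPolicy(
            levels=("checkout-primary", "checkout-secondary", "engineering-manager"),
            timeout_minutes=10,
        ),
    }
    for minutes in (0, 10, 25):
        print(minutes, responder_for("checkout", minutes, policies))
```

Keeping ownership in a single policy table is the point of the design: the lookup replaces the manual step of finding out who is responsible for a service before anyone can be notified.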

Benefits With PagerDuty

Since implementing PagerDuty, Tokopedia has gained greater insight into and control over incidents in its environment, with benefits including:

  • Greater accountability among engineering teams
  • Reduced alert noise
  • Faster incident response times
  • More frequent software deployments, up from 10 to over 300 per day, as automation raised team productivity

“Since adopting PagerDuty, our engineers have been spending less time on incident response. Instead, they’re able to focus on improving the customer experience, understanding what our merchants and customers want, and how they’re using our services,” Krishnan explained. “With PagerDuty’s support for automation, engineers are also far more productive. We’ve increased daily software deployments by 3,000%.”

Future Looking

Looking ahead, Tokopedia will continue to expand its use of PagerDuty. Part of this involves monitoring the performance of new features before deployment to identify problems before they go live in the production environment. Additionally, as Tokopedia continues to adopt automation across the software delivery cycle and build applications that can self-heal, PagerDuty will have a vital role to play in creating workflows and runbooks to prevent, diagnose, and resolve incidents without needing to escalate them to an expert.

To learn how PagerDuty can help your team make things simple and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.