Founded in 2017, Groww is an investment platform that enables users to invest in stocks, mutual funds, ETFs, and gold in a simple, paperless, and hassle-free manner. The FinTech is one of India’s fastest-growing investment platforms and has reached unicorn status by making investing simple and transparent for new investors.
Operating under a service ownership model, the DevOps team is responsible for several mission-critical services including authentication and payment services. The team must also ensure customers can view real-time market data and place orders. During the past year, the startup reached over 30 million users and increased its engineering staff by over 65% to support the rapid growth.
Aman Khare, DevOps Engineer, helps to support the platform’s infrastructure and security. “We make sure that the infrastructure is up and running. We make sure our customers have the best possible experience on our platform,” he said.
Groww had an on-call management solution in place, but it wasn’t always reliable during an incident, especially outside of normal business hours. DevOps engineers sometimes missed email and Slack notifications in the middle of the night. “We couldn’t depend on Slack for notifications late at night, and time to acknowledge was quite high,” shared Khare.
Bringing in other responders or subject matter experts during critical incidents took manual effort. More frustrating, an alert could reach the entire team without the on-call engineer ever receiving the email notification for it. These situations required tracking down the right individuals, which slowed resolution.
Further, the team didn’t have a way to suppress alerts based on certain conditions like severity. Some alerts weren’t relevant after hours and could wait to be addressed the next day. Too much noise made it challenging for engineers to focus on what mattered.
These challenges created a difficult on-call experience for the DevOps team. It became clear that the team needed a better incident response process that could scale with the company. “We needed something that could enhance the experience for our developers,” explained Khare.
After exploring alternative options, the team selected PagerDuty as a more reliable and comprehensive DevOps solution. By leveraging some of the 700+ integrations available through PagerDuty, Groww centralized alerts coming from monitoring systems such as Google Cloud Platform, Prometheus, New Relic, and Grafana. Groww customized PagerDuty to align with how services are deployed in the company’s infrastructure, which clarified who should be notified of an incident and provided context around service dependencies.
PagerDuty’s flexible, dynamic notifications were an immediate win for the team, who can now receive notifications via SMS, call, or mobile app push notifications. This eliminated the need to check email and Slack after hours, and greatly improved the team’s mean time to acknowledge (MTTA). “PagerDuty gives us a call and ensures we never miss a critical issue,” said Khare.
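For readers unfamiliar with the metric, MTTA is simply the average gap between when an incident is triggered and when a responder acknowledges it. The sketch below is illustrative only: the function name and the sample timestamps are hypothetical and are not drawn from Groww's data or from any PagerDuty API.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(incidents):
    """Return the average trigger-to-acknowledge gap as a timedelta.

    `incidents` is a list of (triggered_at, acknowledged_at) datetime pairs.
    Returns None when the list is empty, since no mean can be computed.
    """
    gaps = [acknowledged - triggered for triggered, acknowledged in incidents]
    if not gaps:
        return None
    # sum() needs a timedelta start value; dividing by an int yields a timedelta.
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical sample: one incident acknowledged in 2 minutes, one in 4.
sample = [
    (datetime(2022, 1, 1, 2, 0), datetime(2022, 1, 1, 2, 2)),
    (datetime(2022, 1, 1, 3, 0), datetime(2022, 1, 1, 3, 4)),
]
mtta = mean_time_to_acknowledge(sample)
```

A lower value means responders are being reached faster, which is the improvement the phone call and push notifications are credited with here.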
PagerDuty also makes it easy to bring in additional responders when cross-functional triage is required—for example, if the security and database teams are impacted by the incident. Acknowledging, escalating, and resolving incidents can all be done within the mobile app, empowering teams to manage incident response from anywhere.
PagerDuty Event Rules give Groww the flexibility to suppress alerts that don’t need to wake up team members overnight, such as low-severity or non-actionable alerts. Reducing unnecessary noise helps the team focus on and respond to the issues that matter.
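The kind of condition such a rule expresses can be sketched in a few lines. This is not PagerDuty's Event Rules syntax or API; it is a minimal illustration of the logic, and the severity labels, the after-hours window, and the `Alert` type are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import time

# Assumed values for illustration; not taken from Groww's configuration.
LOW_SEVERITIES = {"info", "warning"}
AFTER_HOURS_START = time(22, 0)
AFTER_HOURS_END = time(7, 0)


@dataclass
class Alert:
    summary: str
    severity: str
    actionable: bool = True


def is_after_hours(now: time) -> bool:
    # The window wraps past midnight, so it is an "or" of two ranges.
    return now >= AFTER_HOURS_START or now < AFTER_HOURS_END


def should_suppress(alert: Alert, now: time) -> bool:
    """Suppress overnight alerts that are low severity or non-actionable."""
    if not is_after_hours(now):
        return False
    return alert.severity in LOW_SEVERITIES or not alert.actionable
```

Under these assumptions, a `warning` at 02:30 would be held back until morning, while a `critical` actionable alert at the same hour would still page the on-call engineer.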
PagerDuty quickly proved its value at Groww, laying the foundation for a better incident response process that will fuel the company’s growth while ensuring a great user experience. PagerDuty helped:
“If people don’t need to spend time debugging and we’re able to avoid downtime, they could focus on more important work. People will feel more satisfied developing new products instead of putting out fires,” said Khare.
Having seen a quick time to value, the DevOps team is eager to find more ways to leverage PagerDuty to improve its operations. For instance, the team plans to evaluate alert analytics to better understand which issues are taking the longest to resolve. This information will help determine what system improvements will be most impactful. Also, the team is looking to use PagerDuty for stakeholder communications to provide the business with information about an incident’s scope of impact and progress toward resolution.