Quartet develops and delivers a cloud-based platform that facilitates communication and collaboration between medical providers and behavioral health providers for patient care. Its platform relies on advanced analytics, proven treatment programs, and modern technology to make healthcare work for providers, patients, and insurers. Because Quartet focuses on accommodating healthcare providers 24/7 and ensuring the utmost data security and privacy, it must keep close tabs on its internal systems and ensure they operate efficiently and securely. Mustafa Shabib, Head of Engineering, is responsible for building the technology services and systems at Quartet. As the company grew and Shabib’s team expanded, putting an incident management solution in place became a top priority to ensure the platform met customers’ needs and expectations.
Overcoming the challenge of resolving incidents more rapidly
In the beginning, when Quartet had a smaller team of seven engineers, they started using Sumo Logic and Slack to deliver real-time IT insights. The engineers had their incident alert notifications directed to a specific channel within Slack that allowed them to receive the alerts on their mobile phones and desktops. There were no on-call rotation schedules, so when an issue arose, everyone swarmed the problem at the same time. Eventually, after discussion, a single person would take action. While the team swarmed, the service disruption continued, which increased mean-time-to-acknowledge (MTTA) and mean-time-to-resolve (MTTR). The Sumo Logic and Slack notifications didn’t create a sense of urgency within the team. “We weren’t doing our due diligence around resolving incidents as rapidly as we could have with a different solution and process in place,” said Shabib. As the company grew, the lack of an incident management solution was taking its toll on Quartet’s ability to provide the always-on platform customers and patients had come to expect.
Implementing a solution that reduces MTTA and MTTR
As the engineering team at Quartet grew, the need to deploy a solution to assist in maintaining their critical services and systems became an urgent matter. PagerDuty was carefully chosen to help the company resolve incidents quickly while supporting its goal of reducing MTTA, MTTR, and the overall number of incidents. Quartet looked at a few other solutions but found PagerDuty to be more mature, with a better reputation within the industry.
Quartet’s entire infrastructure is built in AWS, and they leverage CloudWatch for system-level resource alarming and monitoring. These alarms are triggered through PagerDuty, the web host, and outside to their third-party cloud-based log management and analytics service, Sumo Logic. Agents running on all of their hosts push logs to Sumo Logic, and scheduled queries that run every minute trigger PagerDuty incident alerts.
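As a rough illustration of the last step in that pipeline, the sketch below builds and sends a "trigger" event to PagerDuty's public Events API v2. It is not Quartet's actual integration, which the source does not describe; the helper names `build_trigger_event` and `send_event`, the routing key, and the summary text are all hypothetical.

```python
import json
import urllib.request

# Public PagerDuty Events API v2 endpoint.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Build the JSON body for an Events API v2 'trigger' event (hypothetical helper)."""
    if severity not in {"critical", "error", "warning", "info"}:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same open alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A scheduled query that finds a problem could call `send_event(build_trigger_event(key, "High error rate in logs", "scheduled-log-query"))`, where `key` stands in for a real integration key.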
Shabib noted that having a solution in place that fires off alerts and reminders until the issue is resolved helped create a sense of accountability within the team. This ultimately helped enforce the generation of high quality logs, while allowing individuals to debug those issues more rapidly as they occurred. The team also has an escalation policy that kicks into gear when the primary contact is unable to acknowledge the incidents, allowing for the secondary on-call contact to take action.
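The escalation behavior described above can be sketched as a small model: each level gets a window to acknowledge, and the page moves to the next contact once that window passes. The contact names and the 15-minute timeouts are assumptions for illustration only; the source does not state Quartet's actual policy settings, and this is a simulation, not PagerDuty's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    contact: str
    timeout_minutes: int  # how long this level has to acknowledge


def current_contact(levels, minutes_since_trigger, acknowledged=False):
    """Return who should be paged now, or None once the incident is acknowledged."""
    if acknowledged:
        return None
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_since_trigger < elapsed:
            return level.contact
    # Every level timed out: keep paging the last contact in the chain.
    return levels[-1].contact


# Hypothetical two-level policy: primary first, then secondary.
POLICY = [
    EscalationLevel("primary-on-call", 15),
    EscalationLevel("secondary-on-call", 15),
]
```

With this policy, `current_contact(POLICY, 5)` yields the primary contact, while `current_contact(POLICY, 20)` yields the secondary because the primary's window has expired without an acknowledgment.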
“I think PagerDuty helps put ownership into the hands of the engineer. Putting them closer to the incidents, so when one occurs, the right people who actually built that software get notified and can resolve and improve the problem,” stated Shabib. This was much better than the “swarming technique,” which could place incidents into the hands of someone without the proper context or knowledge to resolve them, and which inefficiently involved the entire team when the issue could have been handled by just one individual.
The company’s goal is to improve their operational metrics and reduce mean-time-to-acknowledge (MTTA) and mean-time-to-resolve (MTTR). “These metrics have improved a great deal with the help of PagerDuty, resulting in a 25% drop in incidents,” said Shabib. Gathering metrics using PagerDuty’s analytics feature allows the team to follow up on past incidents and measure the operational efficiency around the incident management process.
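For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from incident timestamps: MTTA as the average time from trigger to acknowledgment, and MTTR as the average time from trigger to resolution. The record layout and function names are assumptions, not PagerDuty's analytics API, and the timestamps used with it are made up rather than taken from Quartet.

```python
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average minutes between two timestamps, skipping incidents missing the end time."""
    durations = [
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
        if incident.get(end_key) is not None
    ]
    return mean(durations) if durations else None


def mtta(incidents):
    """Mean time to acknowledge, in minutes."""
    return mean_minutes(incidents, "triggered_at", "acknowledged_at")


def mttr(incidents):
    """Mean time to resolve, in minutes."""
    return mean_minutes(incidents, "triggered_at", "resolved_at")
```

Each incident here is a dict holding `datetime` values under `triggered_at`, `acknowledged_at`, and `resolved_at`. Tracking both numbers over time is what lets a team tell whether slow responses come from noticing incidents late or from fixing them slowly.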
Providing resilience and guaranteed delivery
PagerDuty has enabled Quartet to quickly and efficiently resolve incidents and decrease the number of incidents by 25%, while also reducing MTTA and MTTR. “If we didn’t have PagerDuty, we would be failing people in a way that goes beyond just customers. It would affect people’s lives negatively if we allowed these incidents to occur without resolving them or having the urgency to resolve them. It’s not just a business failing but rather an ethical failing for patients,” said Shabib.
“PagerDuty is resilient and guarantees that you will know when something problematic is happening to your apps. There aren’t a lot of services out there that can offer those guarantees.”