(This blog post is inspired by the talk that I will be giving at DevOps Talks Conference Melbourne and DevOps Talks Conference Auckland. Hope to...
by Matt Stratton
March 4, 2019
We all know how important the customer service experience is. But getting customer service right is hard because it isn’t always easy to anticipate or control what customers will experience. That’s why, in order to keep your customers happy, you need to get ahead of the customer service experience by staying on top of the factors that shape it.
Your infrastructure and software tools are a big part of this equation. In this post, I’ll explain how to design systems to provide a positive customer experience from the ground up, as well as how to anticipate service disruptions and challenges so that you’re prepared to handle them before they impact customers.
Most design teams can anticipate or detect many potential user-interface problems during initial design or focus group testing, and other design problems may be uncovered during functional, performance, or canary testing. Subtler design issues that affect the customer experience, however, may become apparent only long after deployment, and even then only through customer feedback, a sometimes slow and often imprecise measure of quality.
Yet it may be possible to detect unreported or underreported design problems through analytics, provided they result in anomalous customer behavior. For example, if a specific feature or web page is accessed much less frequently than expected, that may indicate a design problem affecting the visibility or accessibility of the page or feature. Similarly, if a significant number of customers appear to be performing an unexpected sequence of actions, they could be attempting to work around a malfunctioning or non-functional feature.
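As a rough illustration of the first case, the sketch below compares observed visit counts against expected ones and flags features that fall well short. The function name, the feature names, the counts, and the 50% threshold are all assumptions made for the example; a real analytics pipeline would derive its expectations from historical data.

```python
def underused_features(observed, expected, threshold=0.5):
    """Return (feature, ratio) pairs where observed visits fall below
    threshold * expected visits, worst ratio first.

    Hypothetical helper: the 0.5 default threshold is an assumption.
    """
    flagged = []
    for feature, expected_visits in expected.items():
        visits = observed.get(feature, 0)  # a feature never visited counts as 0
        if expected_visits > 0 and visits < threshold * expected_visits:
            flagged.append((feature, visits / expected_visits))
    return sorted(flagged, key=lambda item: item[1])


# Made-up weekly visit counts, for illustration only.
expected = {"checkout": 1000, "wishlist": 400, "export": 200}
observed = {"checkout": 950, "wishlist": 390, "export": 30}

for feature, ratio in underused_features(observed, expected):
    print(f"{feature}: {ratio:.0%} of expected traffic")
```

With these numbers only the hypothetical "export" feature is flagged, which would be a prompt to check whether customers can actually find or reach it.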
The good news about design problems is that most of them can be prevented, and the ones that do escape pre-deployment detection are usually fairly easy to fix once they become apparent.
While user-interface design problems may be relatively easy to detect at an early stage, the kinds of functional and performance problems that affect the customer experience may be much more difficult to anticipate. They can, in fact, be the result of entirely unpredictable factors outside of your application and beyond your control.
They are also the problems with the greatest potential to do long-term damage in terms of customer relations. A badly designed input screen may be annoying, but a system crash or prolonged latency or downtime is more likely to drive customers into the arms of your competitors. As an absolute bottom-line necessity, you need to at least keep up with functional and performance problems like these.
Even a slight delay in response can result in a major loss of customer traffic, and regaining traffic after a significant outage may require considerable time and effort. Anything that allows you not only to keep up with such problems but to stay ahead of them can give you a major edge over your competitors. But if functional and performance problems are hard to anticipate, how can you hope to get ahead of them?
As it turns out, there are some things you can do, using a combination of monitoring, analytics, and rapid incident response. Through effective use of these resources, you can minimize the time between the initial appearance of a problem and its resolution, and in many cases, you may be able to detect and respond to functional and performance problems before they become visible to the customer. Let’s break the process down:
You need to monitor your application for both functional and performance issues, but your monitoring system should do more than that. It should be able to detect signs of slowdown and near-failure in real time and report them. If an application component is behaving anomalously, or if the program is approaching its performance limit, your monitoring system should report the issue as a potential problem.
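One simple way to picture "approaching its performance limit" is a check with a warning band below the hard limit. The sketch below is only an illustration: the function name, the 500 ms latency limit, and the 80% warning ratio are assumptions, not settings taken from any particular monitoring product.

```python
def classify(value, limit, warn_ratio=0.8):
    """Classify a metric reading against a hard limit.

    Returns "failure" at or above the limit, "warning" once the reading
    enters the band between warn_ratio * limit and the limit, else "ok".
    The 0.8 default is an assumed value for the example.
    """
    if value >= limit:
        return "failure"
    if value >= warn_ratio * limit:
        return "warning"
    return "ok"


# Hypothetical p95 latency readings in milliseconds against a 500 ms limit.
for reading in (120, 450, 510):
    print(reading, classify(reading, limit=500))
```

The "warning" state is the useful part here: it marks a potential problem while the application is still within its limit, so the issue can be reported before customers feel it.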
In order to monitor effectively, you need to know which metrics are important, and which are not. For customer experience-related issues, the important metrics are those that capture anything which may directly or indirectly affect the interaction between the customer and your application. If that sounds like it would be a pretty long list, that’s because it usually is. Most functional and performance issues do have an effect on the way that your customers experience your software.
Along with comprehensive monitoring, you need a solid system for advanced analytics. While monitoring can pick up about-to-happen or already-happening problems, in-depth analytics can give you insight into performance degradation and other issues which may eventually lead to trouble. Among other things, analytics can pick up patterns of activity which have the potential to turn into problems, such as periodic bursts of customer activity that are increasing in volume and intensity over time, and that push the system close to its performance limits.
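To make the growing-bursts example concrete, here is a minimal sketch that fits a straight line to a series of burst peaks and estimates how many more periods remain before the peaks reach a capacity limit. The helper names, the sample peaks, and the limit of 1000 requests per second are assumptions for illustration, and a linear fit is a deliberately crude stand-in for whatever model a real analytics system would use.

```python
import math


def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def periods_until_limit(peaks, limit):
    """Estimate how many more periods until burst peaks reach the limit.

    Returns None when there are fewer than two peaks or the peaks are not
    growing, and 0 when the latest peak is already at or over the limit.
    """
    if len(peaks) < 2:
        return None
    growth = slope(peaks)
    if growth <= 0:
        return None
    remaining = limit - peaks[-1]
    if remaining <= 0:
        return 0
    return math.ceil(remaining / growth)


# Made-up weekly peak request rates against an assumed limit of 1000 req/s.
weekly_peaks = [600, 650, 700, 750, 800]
print(periods_until_limit(weekly_peaks, limit=1000))
```

For the sample data the peaks grow by 50 per week, so the estimate is 4 more weeks of headroom. None of the individual peaks would trip a threshold alert, which is why this kind of pattern tends to show up in analytics rather than in monitoring.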
Analytics can also pick up long-term trends in activity that place increasing stress on the system or on individual resources. From your customers' point of view, the failure of a key resource may be as bad as the failure of the entire system. Preventing failure at the level of an individual resource or service is as important as preventing failure at the system level.
By definition, any failure, breakdown, or performance degradation which has a negative effect on customer experience is a serious incident and needs to be treated as such. An actual failure requires an immediate response. Your monitoring system should notify your incident management system, which should, in turn, be able to quickly filter out noise, provide enrichment and triage support, and send notifications to the correct response team (no more, no less) with the appropriate priority.
With a leading incident management solution, response teams can take care of the most serious incidents before they become apparent to your customers. But potential breakdowns, close-to-tolerances incidents, and trends toward increased system stress should also go through your notification system so that they can be handled before they become full-blown incidents.
Problems of this type may not require notification of your first-response teams, but your analytics or incident response system should channel them to the appropriate operational or supervisory personnel so that they can be investigated and the necessary remedial action can be taken.
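The routing described above (filter out noise, page the right team for real failures, and send near-misses to a lower-urgency channel) can be sketched in a few lines. Everything in this example is hypothetical: the `Alert` type, the service and team names, and the priority labels are invented for illustration and do not reflect PagerDuty's or any other product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str  # "critical", "warning", or "info"


# Hypothetical routing tables for the example.
ROUTES = {"checkout": "payments-oncall", "search": "search-oncall"}
DEFAULT_TEAM = "ops-triage"
PRIORITY = {"critical": "P1", "warning": "P3"}  # "info" is treated as noise
RANK = {"critical": 0, "warning": 1}            # lower rank is more severe


def route(alerts):
    """Collapse duplicate alerts and return (team, priority, check) tuples."""
    worst = {}
    for alert in alerts:
        if alert.severity not in PRIORITY:
            continue  # filter out noise
        key = (alert.service, alert.check)
        current = worst.get(key)
        # Keep only the most severe alert seen for each service/check pair.
        if current is None or RANK[alert.severity] < RANK[current.severity]:
            worst[key] = alert
    return [
        (ROUTES.get(a.service, DEFAULT_TEAM), PRIORITY[a.severity], a.check)
        for a in worst.values()
    ]


alerts = [
    Alert("checkout", "latency", "warning"),
    Alert("checkout", "latency", "critical"),  # escalation of the same check
    Alert("search", "disk", "info"),           # noise, dropped
    Alert("reports", "queue-depth", "warning"),
]
for notification in route(alerts):
    print(notification)
```

In this run the two checkout latency alerts collapse into a single P1 notification for the assumed payments on-call team, the informational alert is dropped, and the queue-depth warning goes to a triage group at low priority rather than paging anyone.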
An incident that is resolved before it comes to the attention of your customers is almost as good as an incident that never happened. And a problem that is resolved before it even becomes an incident is an incident that never happened. Effective monitoring systems, comprehensive analytics, and a top-quality incident management solution are your keys to staying consistently ahead in delivering a great customer experience.