Best Practices for Monitoring
Implementing effective, high-performing monitoring tools can have a huge impact on your business. Without the right monitoring tooling in place, your mission-critical products can experience unexpected disruptions and outages, putting the customer experience and company reputation in jeopardy. Even the smallest incident can have a ripple effect on customer loyalty, so it’s important to have the right tooling and processes in place to prevent this from happening.
Because of this, it’s important to define, implement, and execute a streamlined and effective way to monitor and maintain business-critical products and services. Concrete processes allow your teams to act on issues before they become customer-impacting.
This article lays out best practices and processes for monitoring system events. These best practices can reduce the length and impact of outages and help you prevent them, which means better business results. You’ll be able to create and implement an effective monitoring strategy in less time, without losing sleep.
In order to better understand the monitoring and alerting best practices, let’s define the foundations of these terms.
What is Monitoring and Alerting?
Monitoring
Monitoring is the process of gathering and analyzing data related to the performance of a critical system, service, or application. It helps ensure systems and services are running as intended, and it helps teams keep a pulse on the performance and availability of any internal or external application, system, or service. If a disruption or outage occurs, teams learn of it right away through the monitoring system and can start working towards a resolution.
Alerting
Alerting is the process of notifying stakeholders through various communication tools whenever the performance or status of a system, service, or application changes. Alerting can also cover events that are not system-related, such as email, updates, rotation changes, and others.
Deciding What to Monitor
So you’ve made the commitment to monitor your systems, or to monitor them better. Choosing what to monitor is a critical first step. When you pick the right events to monitor, you ensure that:
- You know about mission-critical issues before customers or your manager do
- You always have an accurate, real-time view of system performance and status
- You aren’t caught unprepared when key infrastructure fails
Those are all compelling reasons to think carefully upfront about what to monitor. But how do you actually execute? There are a number of metrics that may be important to the customer experience. To identify them, sort each available metric into one of two categories.
Work Metrics:
Work metrics are the data your service produces, such as site visits, queries, revenue, etc. These are actionable metrics, which means they are very important: they can help further the growth of your business and surface areas where you can improve.
Resource Metrics:
Resource metrics are metrics that help to produce work metrics. This could include CPU, memory usage, network, etc. These metrics are useful, but mostly in order to gauge capacity and availability within certain systems and databases.
Think of resource metrics as how much life your character has in a video game and work metrics as your character’s achievements.
As useful as it is to know how much CPU capacity or memory you have left, work metrics are what you should be monitoring. However, make sure that the work metrics you monitor are actionable. An example of an actionable work metric for a web server is how many webpages it serves without errors per second. That’s a work metric because if you’re serving zero pages, your website is not running at all: it’s down.
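As a rough illustration of the web server example, the sketch below computes error-free pages served per second from a list of request records. The record shape (a timestamp and an HTTP status code), the "status below 400" rule, and the function name are assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    timestamp: float  # seconds since the start of the window (assumed field)
    status: int       # HTTP status code (assumed field)


def successful_pages_per_second(requests: Sequence[Request], window_seconds: float) -> float:
    """Return the rate of error-free page responses over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Assumption: any status below 400 counts as a page served without errors.
    ok = sum(1 for r in requests if r.status < 400)
    return ok / window_seconds


sample = [Request(0.0, 200), Request(0.5, 200), Request(1.2, 500), Request(1.9, 200)]
rate = successful_pages_per_second(sample, window_seconds=2.0)
print(f"{rate:.1f} pages/s served without errors")
if rate == 0:
    print("No pages served: the site is effectively down")
```

A rate of zero is the actionable signal here: it maps directly to "the website is down" rather than to a resource level that may or may not matter.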
Once you have chosen the metrics you wish to monitor, rank them by urgency according to which events matter most to your business. Then, decide who should be notified for each metric. You can use PagerDuty to assign escalation policies, which alert the first line of defense (the person assigned to that metric), then the second line of defense if the first person does not acknowledge the alert, and so on.
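The ranking and escalation idea can be sketched in a few lines. This is a simplified model and not PagerDuty’s API: the `EscalationPolicy` class, the metric and responder names, and the five-minute timeout are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationPolicy:
    metric: str
    urgency: int               # 1 = most urgent (assumed convention)
    responders: List[str]      # ordered lines of defense
    timeout_minutes: int = 5   # wait before escalating (assumed value)

    def responder_at(self, minutes_unacknowledged: int) -> Optional[str]:
        """Return who should be alerted after the given wait, or None if everyone has been tried."""
        if minutes_unacknowledged < 0:
            raise ValueError("minutes_unacknowledged must not be negative")
        level = minutes_unacknowledged // self.timeout_minutes
        if level < len(self.responders):
            return self.responders[level]
        return None


policies = [
    EscalationPolicy("disk_usage", urgency=3, responders=["infra-on-call"]),
    EscalationPolicy("pages_served_ok", urgency=1,
                     responders=["web-on-call", "web-lead", "eng-manager"]),
]

# Rank metrics so the most business-critical one comes first.
ranked = sorted(policies, key=lambda p: p.urgency)
top = ranked[0]
print(top.metric, "->", top.responder_at(0))  # first line of defense
print(top.metric, "->", top.responder_at(6))  # escalated after one timeout
```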
To learn more about escalation policies, visit PagerDuty’s knowledge base.
Metrics for Mobile Apps
The metrics discussed above are for web services. Mobile apps face different concerns, such as varying OS versions or carrier latency, so they call for their own metrics.
For mobile apps, there are two key metrics that should be monitored:
Uptime
Uptime is the percentage of app loads that don’t crash. To stay competitive, uptime should be at least 99%.
Responsiveness
Responsiveness measures how quickly your app responds to and interprets commands. To satisfy your users, the app should respond in less than a second in most cases.
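Both mobile metrics reduce to simple ratios. The sketch below shows one way to compute them, assuming you already collect a count of app loads, a count of crashed loads, and a list of response times in seconds; the function names and sample numbers are illustrative only.

```python
from typing import Sequence


def crash_free_rate(total_loads: int, crashed_loads: int) -> float:
    """Percentage of app loads that did not crash."""
    if total_loads <= 0:
        raise ValueError("total_loads must be positive")
    return 100.0 * (total_loads - crashed_loads) / total_loads


def share_under(response_times_s: Sequence[float], limit_s: float = 1.0) -> float:
    """Percentage of responses faster than limit_s seconds."""
    if not response_times_s:
        raise ValueError("no response times recorded")
    fast = sum(1 for t in response_times_s if t < limit_s)
    return 100.0 * fast / len(response_times_s)


uptime = crash_free_rate(total_loads=10_000, crashed_loads=50)
responsive = share_under([0.2, 0.4, 0.9, 1.5])

print(f"uptime: {uptime:.1f}%")
print(f"responses under 1 s: {responsive:.1f}%")
if uptime < 99.0:
    print("uptime is below the 99% target")
```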
Alerts and Visibility
It is vital that teams are notified in a way that best suits their needs. Employees should be able to choose how they are alerted, whether via SMS, push notification, or phone call. Email notifications are not recommended because they are easily lost and fail to provide team visibility.
To make responding to alerts easier and improve visibility, keep all your events in one place so employees have all the information needed to mitigate an incident.
All team members should have access to this centralized location. Non-actionable alerts should be suppressed so that employees can focus on the alerts that can be mitigated. The centralized platform also supplies the analytics that help teams find solutions to incidents during postmortems.
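A minimal sketch of centralizing alerts while suppressing non-actionable ones is shown below. The `Alert` fields, the `centralize` helper, and the sample alerts are hypothetical; a real setup would pull alerts from each monitoring system rather than from in-memory lists.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Alert:
    source: str       # originating monitoring system (assumed field)
    summary: str
    actionable: bool  # whether a responder can do something about it (assumed field)


def centralize(*streams: Iterable[Alert]) -> List[Alert]:
    """Merge alerts from several systems into one list, dropping non-actionable ones."""
    merged: List[Alert] = []
    for stream in streams:
        merged.extend(a for a in stream if a.actionable)
    return merged


web_alerts = [Alert("web", "error rate above threshold", True),
              Alert("web", "nightly deploy finished", False)]
db_alerts = [Alert("db", "replica lag growing", True)]

inbox = centralize(web_alerts, db_alerts)
for alert in inbox:
    print(f"[{alert.source}] {alert.summary}")
```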
To gain visibility and insight into an event, all alerts from different systems should be in one place. PagerDuty’s 350+ native integrations give you visibility into all your alerts (to learn more, check out PagerDuty’s integrations page).
Culture
To make sure that an incident is handled smoothly and effectively, it is important to foster a culture of accountability and transparency within your organization. This is achieved by putting the customer first, so that everything a business creates and does centers on the customer: what they need, what they want, and how a product can make their life easier.
To keep monitoring and alert acknowledgment transparent, treat incident response time as a performance metric. Make sure there is an established list of best practices for on-call employees, and assign clear roles for use during an incident.
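One common way to track response time is the mean time to acknowledge. The sketch below assumes each incident is recorded as a pair of timestamps (when the alert fired and when someone acknowledged it); the function name and the sample data are illustrative.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Tuple


def mean_time_to_acknowledge(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average gap between when an alert fired and when it was acknowledged."""
    if not incidents:
        raise ValueError("no incidents to measure")
    gaps = [(acked - triggered).total_seconds() for triggered, acked in incidents]
    return timedelta(seconds=mean(gaps))


incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 4)),      # acknowledged after 4 minutes
    (datetime(2024, 1, 2, 14, 30), datetime(2024, 1, 2, 14, 32)),  # acknowledged after 2 minutes
]
mtta = mean_time_to_acknowledge(incidents)
print(f"mean time to acknowledge: {mtta}")
```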
To learn more about being on-call and some best practices to employ, check out PagerDuty’s Ops Guides.
Conclusion
Monitoring your programs and coordinating alerts is necessary, but can be difficult without the right tools. PagerDuty acts as the central nervous system for your entire monitoring stack, allowing your teams to have real-time visibility across all business-critical systems and services.
To learn more about monitoring best practices, sign up for a free 14-day trial with PagerDuty today.