In its simplest form, website monitoring is the process of testing and verifying that end users can actually use your service. There are several great SaaS applications that will ping your system and tell you whether it is up and running, so your team knows when it needs to sprint to find a fix.
Knowing that your website is down is only the first step in alerting, but it should be the last step in your monitoring chain. Ideally, you are alerted before a failure takes the entire service down. When that isn’t possible, you need to know why there’s a problem and where it is.
A quick ping to your site every 15 seconds helps you catch the issues that can take your site down soon after they start. Issues with your hosting provider, regional support, spikes in memory, or increased network traffic can all cause your site to crash.
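As a minimal sketch of such a ping, the Python below probes a URL on a fixed interval using only the standard library. The function names `is_up` and `watch`, the 5-second timeout, and the injectable `opener`, `sleep`, and `probe` parameters are assumptions made for illustration; a hosted monitoring service would normally do this from several locations for you.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts,
        # and HTTP error statuses raised by urlopen.
        return False


def watch(url, interval=15.0, checks=None, sleep=time.sleep, probe=is_up):
    """Probe the URL every `interval` seconds, yielding (timestamp, up) pairs.

    `checks=None` means run forever; a number limits the loop.
    """
    count = 0
    while checks is None or count < checks:
        yield time.time(), probe(url)
        count += 1
        if checks is None or count < checks:
            sleep(interval)
```

Calling `for stamp, up in watch("https://example.com"):` would then report the site's state every 15 seconds; what to do on a failed probe (page someone, retry, log) is left to the caller.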
To go beyond a basic ping, there are some very simple steps to get more valuable information. At PagerDuty, we have simple uptime monitoring on pagerduty.com, but we also have multiple external services pinging a simple test suite. Not only do we know that events are flowing through our system, but also that the average processing time is below a threshold and our alert volume is within a safe range.
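The kind of check described here can be sketched as a small function that takes recent measurements and reports what is out of bounds. This is only an illustration: the `Thresholds` class, the `evaluate` function, and every numeric limit are hypothetical and do not reflect PagerDuty's actual test suite or values.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Thresholds:
    # Illustrative limits only; tune them to your own service.
    max_avg_processing_seconds: float = 2.0
    min_alerts_per_minute: int = 1
    max_alerts_per_minute: int = 500


def evaluate(processing_seconds, alerts_per_minute, limits=Thresholds()):
    """Return a list of failure messages; an empty list means healthy.

    processing_seconds: per-event processing times from the sample window.
    alerts_per_minute: alert volume observed over the same window.
    """
    failures = []
    if not processing_seconds:
        failures.append("no events flowed through the system")
    elif mean(processing_seconds) > limits.max_avg_processing_seconds:
        failures.append("average processing time is above the threshold")
    if not limits.min_alerts_per_minute <= alerts_per_minute <= limits.max_alerts_per_minute:
        failures.append("alert volume is outside the safe range")
    return failures
```

An external service can call an endpoint that runs a check like this and treat any non-empty result as a failed test.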
If your monitoring tool supports it, each test can trigger alerts of different severity. When we experience heavy load due to an IaaS provider having trouble, we’ll often trigger a sev-3 alert even if no delays are reported. This wakes up an engineer in case we need one.
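One way to picture severity-based alerting is a function that maps test results to a severity level. In the sketch below only the sev-3 case for heavy load without delays comes from the text above; the `Severity` enum, the `classify` function, and the sev-1 and sev-2 rules are assumptions added to round out the example.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # assumed: most urgent, the site is unreachable
    SEV2 = 2  # assumed: service is up but events are delayed
    SEV3 = 3  # from the text: heavy load, no delays reported yet


def classify(site_down, delayed, heavy_load):
    """Return the severity to alert at, or None if nothing needs attention."""
    if site_down:
        return Severity.SEV1
    if delayed:
        return Severity.SEV2
    if heavy_load:
        return Severity.SEV3
    return None
```

Checking the most serious condition first means a single alert carries the highest applicable severity rather than one alert per failing test.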
Don’t just check that your page is responding; make sure that it’s returning the right content. If your server is returning 200 status codes but garbled text, then all of your monitoring was for nothing. Don’t forget to check that you’re returning CSS and scripts too, especially if they come through a different asset pipeline.
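A content check can be as simple as confirming the status code, the content type, and a known marker string in the body. The sketch below assumes the response has already been fetched; the function name `check_response` and the example markers are hypothetical.

```python
def check_response(status, content_type, body, expected_type, marker):
    """Return a list of problems with one fetched resource (empty means OK).

    expected_type: the content type prefix you expect, e.g. "text/html".
    marker: a string that must appear in a correct response body.
    """
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    if not content_type.startswith(expected_type):
        problems.append(f"expected {expected_type}, got {content_type or 'no content type'}")
    if marker not in body:
        problems.append(f"marker {marker!r} missing from body")
    return problems
```

Running the same function against the page (`"text/html"` with a marker such as the page title), the stylesheet (`"text/css"`), and the scripts covers assets that come through a separate pipeline. A 200 response with garbled text fails the marker check even though the status looks healthy.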
The deeper your monitoring and alerting go, the better your chance of catching problems before your customers are affected.
To create a complete picture of your service, you will need to monitor the entire stack so that you can find the root cause of an outage. This means going beyond an HTTP request or DNS check and looking behind your load balancer. It may just be a network problem that is causing your outage.
By monitoring your internal, non-customer-facing systems, you will be able to correlate metrics in order to find the root cause of your site’s outage. We recommend using a tool that lets you go beyond a simple ping to find the reason for your outage without having to guess. Is your system running slow because of increased network traffic, or is there something else going on a little deeper? It’s imperative to find the correct source behind your system’s outage so that you can prevent the same outage from happening again.
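As a rough illustration of correlating internal metrics, the sketch below ranks metrics by how far they moved from their normal range during an outage. It is a simple deviation score, not a substitute for a real monitoring tool; the function name `rank_suspects` and the metric names in the usage note are hypothetical.

```python
from statistics import mean, pstdev


def rank_suspects(baseline, outage):
    """Order metric names by how far they deviated during the outage.

    baseline, outage: dicts mapping a metric name to a list of samples.
    The score is the shift in the mean, measured in baseline standard
    deviations, so metrics with different units can be compared.
    """
    scores = {}
    for name, normal in baseline.items():
        samples = outage.get(name)
        if not samples or len(normal) < 2:
            continue  # not enough data to judge this metric
        spread = pstdev(normal) or 1e-9  # avoid dividing by zero on flat metrics
        scores[name] = abs(mean(samples) - mean(normal)) / spread
    return sorted(scores, key=scores.get, reverse=True)
```

Given a baseline where `network_mbps` hovers around 100 and `db_latency_ms` around 20, an outage window with normal traffic but latency above 200 puts `db_latency_ms` first, which suggests looking deeper than the network.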
If you’re looking to implement a solution, check out a few of our partners. You may even want to use more than one, so that redundant checks make sure you never miss an alert.