PagerDuty Blog

Real-time Log Management Alerting: Getting the Good with the Bad

Guest blog post from Trevor Parsons, Chief Scientist & Co-Founder at Logentries. Trevor has over 10 years of experience in developing monitoring and performance tools for software systems. He was formerly a Scientist at the IBM Center for Advanced Studies and holds a PhD from University College Dublin, Ireland. Chat with Trevor @trevparsons.

Log data can reveal important business activity and user events to share across your organization. Most traditional logging solutions run background jobs every 5 or 10 minutes, but that doesn’t really cut it. Consider this: if there was an emergency at home, would it be acceptable to wait 5, 10 or 15 minutes before you picked up the phone and called emergency services? In addition to identifying events as they happen, getting all the right people in the know is important. But how do you differentiate what is worthy of waking someone up in the middle of the night from something that is merely good to note? Setting correct thresholds and associating certain events with a specific alert type makes it easy to keep everyone in the know.

Here are the top 5 alerts that we find useful at Logentries to send through PagerDuty. I’ll admit the last two are a little unorthodox for an IT incident management platform, but why not share the good news with the bad?

1. Exceptions and Errors
This is a pretty obvious one, but you’d be surprised by the number of times exceptions or errors go unnoticed, especially if you do not have a well-thought-out set of logging and monitoring practices. Alerts that contain some contextual information, such as the application component involved and where the exception originated, will help you quickly discover the root cause.

What you can do: Correlate these integrated alerts with any notifications related to performance issues or resource usage to help pinpoint the exact root cause of the problem. Which exceptions to alert on depends on your application and what is important in the context of your problem. Think about this up front and configure alerts based on the exceptions and errors that are uniquely important for your application. It can help here to group different exceptions and errors using logging severity levels, so that alerts are only created for those that are particularly important.
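As a rough illustration of grouping by severity, here is a minimal Python sketch. The log line format, the `ALERT_LEVELS` policy and the `alerts_for` helper are all assumptions made for this example; they are not part of Logentries or PagerDuty.

```python
import re

# Hypothetical log format: "<LEVEL> [<component>] <message>"
LINE_RE = re.compile(r"^(?P<level>[A-Z]+) \[(?P<component>[^\]]+)\] (?P<message>.*)$")

# Assumed policy: only these severities are important enough to alert on.
ALERT_LEVELS = {"ERROR", "CRITICAL"}

def alerts_for(lines):
    """Return one alert dict per log line whose severity is in ALERT_LEVELS."""
    alerts = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match["level"] in ALERT_LEVELS:
            # Keep the context that helps find the root cause.
            alerts.append({
                "severity": match["level"],
                "component": match["component"],
                "summary": match["message"],
            })
    return alerts

# Made-up sample lines for illustration only.
sample = [
    "INFO [checkout] order placed",
    "ERROR [checkout] NullPointerException in PaymentService.charge",
    "WARNING [search] slow query",
]
for alert in alerts_for(sample):
    print(alert)
```

Only the `ERROR` line produces an alert here, and it carries the component name along with the message.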

2. Response Time
Setting alerts when performance thresholds are breached is a great way to be certain that you will be notified when your users are experiencing a slow app or website. Most log management tools allow you to work with field values, so that you can get notified on ‘response_time>50ms’. This is particularly useful when you can measure the response time from the user’s perspective. Logentries provides integrations that allow you to log directly from the user’s browser or mobile app so that you can perform real user monitoring. This lets you trigger notifications when an individual user is seeing slow page loads on a given device, browser or operating system.

What you can do: A good rule of thumb for alerting on response times is to follow the 3 response time limits outlined by Jakob Nielsen in ‘Usability Engineering’ back in 1993, which are still relevant today. In short, 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, 1.0 second is about the limit for the user’s flow of thought to stay uninterrupted, and 10 seconds is about the limit for keeping the user’s attention focused on the dialogue.
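To make the three limits concrete, here is a small Python sketch that reads a response time field from log lines and flags anything slower than a chosen limit. The `response_time=<n>ms` field format, the band labels and the 1.0 second default are assumptions for the example, not a Logentries query syntax.

```python
import re

# Assumed log field format, e.g. "path=/cart response_time=1800ms".
FIELD_RE = re.compile(r"response_time=(\d+(?:\.\d+)?)ms")

def nielsen_band(seconds):
    """Map a response time to the limit it falls under (0.1 s, 1 s, 10 s)."""
    if seconds <= 0.1:
        return "feels instantaneous"
    if seconds <= 1.0:
        return "flow of thought uninterrupted"
    if seconds <= 10.0:
        return "attention still held"
    return "attention lost"

def slow_requests(lines, limit_seconds=1.0):
    """Return (line, band) pairs for requests slower than limit_seconds."""
    slow = []
    for line in lines:
        match = FIELD_RE.search(line)
        if match is None:
            continue  # no response time field on this line
        seconds = float(match.group(1)) / 1000.0
        if seconds > limit_seconds:
            slow.append((line, nielsen_band(seconds)))
    return slow

# Made-up sample lines for illustration only.
sample = [
    "path=/home response_time=80ms",
    "path=/cart response_time=1800ms",
    "path=/report response_time=12000ms",
]
for line, band in slow_requests(sample):
    print(band, "|", line)
```

Which limit you alert on is a judgment call: the 1 second limit suits interactive pages, while the 10 second limit may be more appropriate for long-running reports.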

3. Resource Usage
Similar to setting thresholds around response time, it can also be useful to get notified if a given server is in trouble and is starting to max out on a particular resource (e.g. CPU, Network, Disk, Memory). Proactive resource usage monitoring has become particularly important for running always-on cloud services: when a server instance suddenly starts to misbehave and its CPU is maxed out, you might want to restart it, or automatically spin up another instance to replace it or to help share the load.

What you can do: One advantage of using a log management solution to analyze resource usage trends is that you can roll up the individual log entries into a resource usage dashboard to visualize trends in CPU, network, memory, etc. You can also drill back down into the individual log events and cross-correlate events related to spikes in CPU with, for example, events related to errors or exceptions, so that you can very quickly identify root causes and rectify any issues. This is generally not possible with server monitoring tools, which do not allow for such a fine-grained view or correlation with other log events relating to response times, errors or exceptions.
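The kind of cross-correlation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `correlate_spikes` helper, the 90% CPU threshold, the 30 second window and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta

def correlate_spikes(cpu_samples, error_events, cpu_threshold=90.0,
                     window=timedelta(seconds=30)):
    """For each CPU sample at or above cpu_threshold, collect the error
    messages logged within `window` before or after that sample."""
    spikes = []
    for sample_time, cpu_percent in cpu_samples:
        if cpu_percent < cpu_threshold:
            continue
        nearby = [message for event_time, message in error_events
                  if abs(event_time - sample_time) <= window]
        spikes.append((sample_time, cpu_percent, nearby))
    return spikes

# Made-up sample data for illustration only.
t0 = datetime(2024, 1, 1, 12, 0, 0)
cpu_samples = [(t0, 35.0), (t0 + timedelta(minutes=1), 97.5)]
error_events = [
    (t0 + timedelta(seconds=50), "OutOfMemoryError in report-worker"),
    (t0 + timedelta(minutes=5), "TimeoutError in mailer"),
]
for when, cpu, errors in correlate_spikes(cpu_samples, error_events):
    print(when.isoformat(), cpu, errors)
```

The single spike in the sample is paired with the error logged ten seconds earlier, while the unrelated error several minutes later is left out.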

And now for the good news events…

4. Events That Affect Your Top Line
Let’s face it – seeing those customers roll in on a new service puts a smile on everyone’s face. Alerts don’t always need to be bad news. For once, wouldn’t it be nice to be disturbed during your night out with some good news?

What you can do: At Logentries we don’t like to discriminate when it comes to alerts; we like the good, the bad and the ugly. We send alerts to our own team on all sorts of important events so that everyone knows what’s going on across the service and the business, be it good or bad. Set up customized tagging and tracking of business events such as “trial sign-ups” or “webpage visits” so you can monitor the health of your business in real time, beyond just exceptions and errors.
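As a loose sketch of what tagging business events might look like, the Python below matches log lines against a few patterns and counts the hits per tag. The tag names come from the examples above; the URL patterns, the `TAG_RULES` table and the `tag_counts` helper are assumptions for illustration and do not reflect how Logentries implements tagging.

```python
import re
from collections import Counter

# Hypothetical tagging rules: (tag name, pattern to look for in a log line).
TAG_RULES = [
    ("trial sign-up", re.compile(r"POST /signup")),
    ("webpage visit", re.compile(r"GET /(?:pricing|features)")),
]

def tag_counts(lines):
    """Count how many log lines match each tagging rule."""
    counts = Counter()
    for line in lines:
        for tag, pattern in TAG_RULES:
            if pattern.search(line):
                counts[tag] += 1
    return counts

# Made-up sample lines for illustration only.
sample = [
    "POST /signup plan=trial",
    "GET /pricing",
    "GET /features",
    "GET /healthcheck",
]
print(tag_counts(sample))
```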

5. Feature Adoption Events
Similar to the last point – it can also be useful when you release a new feature to get notified when your first 100, or 1000, customers have had a chance to play with it. You may not want to get woken up in the middle of the night for this but you can share this milestone with your company via a quieter alert method such as email.

What you can do: Take advantage of alerting thresholds in Logentries to only get notified once an event has matched a particular pattern more than a given number of times (e.g. when feature X was used over 100 times). This can be useful for a number of reasons:

  • It is simply good for team morale when you have all been slaving away over a new feature, you release it and … yes … people actually use it and like it.
  • Maybe you’d like to see what people think of this new feature and you’d like to ask them for some feedback. If you configure your system to also log an account ID or a user identifier you can always go back and ask them what they think – iterate on it – and improve – then rinse and repeat this for the next 100 people who use it.
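The threshold idea, together with logging a user identifier, can be sketched as follows. This is a minimal Python illustration, not the Logentries implementation: the `ThresholdAlert` class, the `feature=export_csv` pattern and the account IDs are all hypothetical.

```python
class ThresholdAlert:
    """Fire once when a pattern has been seen more than `threshold` times."""

    def __init__(self, pattern, threshold):
        self.pattern = pattern
        self.threshold = threshold
        self.count = 0
        self.user_ids = []  # who used the feature, for follow-up feedback
        self.fired = False

    def observe(self, line, user_id=None):
        """Record one log line; return True only when the alert fires."""
        if self.pattern not in line:
            return False
        self.count += 1
        if user_id is not None:
            self.user_ids.append(user_id)
        if self.count > self.threshold and not self.fired:
            self.fired = True
            return True
        return False

# Made-up events for illustration only; a low threshold keeps the sample short.
alert = ThresholdAlert("feature=export_csv", threshold=3)
events = [
    ("feature=export_csv", "acct-1"),
    ("feature=login", "acct-2"),
    ("feature=export_csv", "acct-3"),
    ("feature=export_csv", "acct-4"),
    ("feature=export_csv", "acct-5"),
    ("feature=export_csv", "acct-6"),
]
fired_at = [i for i, (line, uid) in enumerate(events) if alert.observe(line, uid)]
print(fired_at, alert.user_ids)
```

Firing only once keeps the milestone from turning into a stream of notifications, and the recorded identifiers give you a list of people to ask for feedback.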

With these real-time log management alerts you can increase visibility across your teams and organization. Check out the new Logentries and PagerDuty integration in your own environment today!