I’ve worked in software companies with employee numbers ranging from tens to thousands.
In the company with about ten employees, developers were responsible for all monitoring. A lot of things were never monitored because we simply didn’t invest the time. We felt the learning curve was too steep and didn’t have the budget to purchase expensive hosted monitoring. This resulted in quite a few embarrassing moments where our customers told us about issues before we discovered them.
The mid-sized company with about one hundred employees had a dedicated operations team that was responsible for monitoring. They spent quite some time setting up redundant Nagios servers and adding checks to them. It worked reasonably well, but a lot of things were still not monitored, mostly because development and operations weren’t as close as they could have been. Operations did not know the software we developed well enough to monitor the right things, and the development team’s schedule was usually too tight to walk them through it. Monitoring was usually added as an afterthought, if at all, which caused frustration on both sides.
At the large company with more than one thousand employees, developers were never involved in monitoring. I knew we had good infrastructure and server monitoring in place, but none of the webapps we built ourselves ever received any love. We made a few valiant attempts at adding monitoring, but in the end we just ended up with a mix of solutions. Whenever we asked for something the whole company could use we got a “we’ll look into it”. Apparently the operations team had trouble finding a system that was flexible enough and at the same time easy for developers to work with.
What I’ve learned from this is that the following holds true regardless of company size:
- You won’t add monitoring if it’s too costly with respect to time or money.
- Every developer writing code that goes into production needs to know how to add monitoring. It has to be dead simple to avoid the need for special training in monitoring systems with highly elaborate configuration languages.
Setting Up the Infrastructure Stage
Imagine a fictitious turn-based online strategy game called Awesome Space Battle. The players tell their troops what to do, and once every minute their orders are executed and an epic battle ensues. As you might have guessed, I’m not a game designer. This would probably be the most boring game ever, but it’ll do just fine for the purpose of this explanation.
The server infrastructure is built around two servers running on Amazon AWS behind a load balancer. The servers are simply called S1 and S2. They’re not directly accessible from the public internet since we want our players to go through our load balancer. We’re using a third-party payment provider that serves the checkout page.
It’s crucial that the site is available and responsive for the players at all times. You’ll want a tool that can monitor both website availability and performance. The measurement of performance is usually called APM (Application Performance Management), and a specific subset is called RUM (Real User Monitoring).
APM is for measuring things like time spent doing SQL queries and template rendering. It’s a great tool for finding bottlenecks in your application. It’s important to remember to measure first and optimize later since our intuition about what is going to be slow is wrong a lot of the time.
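To make "measure first, optimize later" concrete, here is a minimal sketch of the kind of timing an APM tool collects. It is not any particular product's API: the `timed` helper, the `timings` table and the labels are all hypothetical names chosen for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per labelled section.
# The labels ("sql", "template", ...) are hypothetical.
timings = defaultdict(float)


@contextmanager
def timed(label):
    """Add the time spent inside the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failures are still measured.
        timings[label] += time.perf_counter() - start


def slowest(n=3):
    """Return up to n (label, seconds) pairs, slowest first."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)[:n]
```

Wrapping the database call in `with timed("sql"):` and the rendering step in `with timed("template"):` and then looking at `slowest()` shows where the time actually goes, which is often not where intuition says it will.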
individual server and not the one of the load balancer. This is very important for diagnosing any issues as quickly as possible. Checking at the load balancer alone could generate intermittent failure notifications if only one of the two servers was down. We used AWS security groups to let the test servers connect directly to S1 and S2 based on their source IP.
Eventually we also decided to add a test for https://store.awesomespacebattle.com/ which is hosted by the payment provider. It’s sensible to have at least one check for each part that runs on a different infrastructure, since they might have completely different reasons for failing.
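The idea of one check per server plus one for the externally hosted store can be sketched roughly as follows. This is an illustration rather than the tool we used: the S1 and S2 addresses are made-up placeholders, and the `fetch` parameter exists only so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request

# S1/S2 addresses are hypothetical placeholders; only the store URL
# comes from the text above.
TARGETS = {
    "S1": "http://s1.internal.example/",
    "S2": "http://s2.internal.example/",
    "store": "https://store.awesomespacebattle.com/",
}


def http_fetch(url, timeout=5.0):
    """Return the HTTP status code for url; raises on network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check(url, fetch=http_fetch):
    """Probe one URL and return (ok, seconds_elapsed)."""
    start = time.perf_counter()
    try:
        ok = fetch(url) == 200
    except (urllib.error.URLError, OSError):
        # HTTP errors, refused connections and timeouts all count as down.
        ok = False
    return ok, time.perf_counter() - start


def failing_targets(targets, fetch=http_fetch):
    """Return the sorted names of all targets whose check failed."""
    return sorted(name for name, url in targets.items()
                  if not check(url, fetch)[0])
```

Because each server has its own entry, a failure report names S1 or S2 directly instead of showing up as an intermittent failure behind the load balancer.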
It was a great comfort knowing we’d get an alert if the website failed completely or didn’t display properly any more. However, that was not enough. Remember that Awesome Space Battle is a turn-based game and we have a periodic job which runs the bulk of the game logic. We wanted to be certain that this job ran every minute without any hiccups. Our users were also supposed to receive an email ten days before their account was up for renewal. We had found that this greatly improved the conversion rate.
Monitoring key events like these is not straightforward with traditional web application monitoring tools, so this is where we needed something else.
Event monitoring works the opposite way from traditional web application monitoring. It’s based on the server signaling that certain expected events occurred instead of having an external test server poll for status periodically. If the events do not occur as often as expected, the test fails and someone is alerted. In a sense you could think of traditional web app monitoring as “pull-based monitoring” while event monitoring is more like “push-based monitoring”.
Two event tests were added for Awesome Space Battle – “Game logic should run every minute on two servers” and “Renewal email sending should run every day on one server”.
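A push-based test like these two can be sketched in a few lines. This is a simplified model under my own assumptions, not the implementation of any particular monitoring service: the `EventTest` class and its method names are hypothetical, and a real system would likely add a grace period for late signals.

```python
import time


class EventTest:
    """Fails when fewer than expected_servers distinct servers have
    signalled the event within the last period_seconds."""

    def __init__(self, name, period_seconds, expected_servers):
        self.name = name
        self.period_seconds = period_seconds
        self.expected_servers = expected_servers
        self.events = []  # list of (timestamp, server) tuples

    def record(self, server, now=None):
        """Called when a server signals that the event happened (the push)."""
        timestamp = time.time() if now is None else now
        self.events.append((timestamp, server))

    def is_passing(self, now=None):
        """True if enough distinct servers reported within the period."""
        now = time.time() if now is None else now
        cutoff = now - self.period_seconds
        recent = {server for ts, server in self.events if ts > cutoff}
        return len(recent) >= self.expected_servers


# The two tests from the text, expressed with this hypothetical class.
game_logic = EventTest("Game logic", period_seconds=60, expected_servers=2)
renewal_email = EventTest("Renewal email", period_seconds=86400, expected_servers=1)
```

Each server calls `record` after the job finishes, and the monitoring side evaluates `is_passing` on a schedule and alerts when it returns `False`. Nothing has to poll the servers themselves.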
As you might have noticed the email sending is just being done by a single server. We simply ask the load balancer for all attached healthy servers, and the one with the lowest instance id gets to do the job.
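The selection rule itself is tiny. The sketch below assumes the list of healthy instance IDs has already been fetched from the load balancer (that call is left out on purpose) and that IDs are compared as plain strings; the function name and the example IDs are hypothetical.

```python
def should_send_renewal_emails(my_instance_id, healthy_instance_ids):
    """Return True only on the healthy instance whose ID sorts lowest."""
    if not healthy_instance_ids:
        # No healthy servers reported: nobody claims the job.
        return False
    return my_instance_id == min(healthy_instance_ids)


# Hypothetical IDs: the instance with the lowest ID gets to do the job.
healthy = ["i-0b2", "i-0a1"]
print(should_send_renewal_emails("i-0a1", healthy))
print(should_send_renewal_emails("i-0b2", healthy))
```

Every server runs the same check and reaches the same answer without talking to the others. If the chosen server becomes unhealthy it drops out of the list and the next-lowest ID takes over on the following run.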
Notifying The Right People
So now we had really nice monitoring in place and we felt pretty confident. Eventually we got an issue that no one noticed because it was a weekend and no one was on call. This is the point where we decided to integrate with a system that would give us on-call schedules and escalation policies together with all the notification choices we wanted. Be sure to choose a service which integrates well with all your monitoring and notification tools. We, for example, chose to forward any issues to our HipChat room. This ended up being a great choice because all developers in the team got to know right away when something was wrong and could deal with it immediately.
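To illustrate what an on-call schedule combined with an escalation policy boils down to, here is a toy model. It does not reflect the API of any real alerting service; the rota, the names, the `"chat-room"` target and the 15-minute threshold are all invented for the example.

```python
from datetime import datetime

# Hypothetical weekly rota: weekday number (Monday=0) -> person on call.
ROTA = {0: "alice", 1: "alice", 2: "bob", 3: "bob",
        4: "carol", 5: "carol", 6: "carol"}


def notify_targets(opened_at, now, rota, fallback="team-lead",
                   escalate_after_minutes=15):
    """Return who should have been notified about an unacknowledged issue.

    The chat room and the person on call are notified immediately; the
    fallback is added once the issue has been open long enough.
    """
    targets = ["chat-room", rota[opened_at.weekday()]]
    minutes_open = (now - opened_at).total_seconds() / 60
    if minutes_open >= escalate_after_minutes:
        targets.append(fallback)
    return targets
```

The point of the rota is that every day of the week, weekends included, maps to a named person, so an issue can no longer go unnoticed simply because of when it happened.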
When it comes to unit testing there’s a saying “If it’s not tested, it’s broken”. I would argue the same for monitoring – “If it’s not monitored, it’s down”.
I would advise everyone who isn’t monitoring their applications and infrastructure to start doing so straight away. Start small with something that’s easy to work with and flexible enough for most of your use cases. A common mistake is to try to find a single monitoring tool that can do everything you can ever imagine. Such flexibility comes at the cost of ease of use and if your tools are too hard to use you simply won’t use them.
Having tools that are easy to grasp is essential if you want your developers to become fully committed to monitoring. It makes sense to have developers be responsible for their own monitoring. Who would know what to monitor better than the person who built the software?
A full-stack developer is responsible for creating software (programming), ensuring it behaves correctly (testing) and making sure it continues to do so during its lifetime (monitoring). It’s all too common for developers to feel that their work is done once the initial programming is finished. An initial investment in monitoring will ensure you’re ready to deal with unforeseen issues down the road. No matter how good a developer you are, there’ll always be things you didn’t anticipate.