This is a guest blog post about setting up IT operations software for startups written by Anthony Gibbons, the Operations Manager at Airhead Education. Airhead Education is a UK-based company that helps schools harness the power of cloud-based learning.
I joined a small but ambitious startup called Airhead Education in February 2014 as their Operations Manager. Airhead provide an affordable, cloud-based learning environment that ‘plays well with others’, which is to say that we integrate with whichever technologies our customers wish to use.
I had spent the previous two years working as an Application Support Specialist for one of the largest firms in the financial sector. Whilst the work had been enjoyable, and the people brilliant, I had a craving to get back into an operations role, which I felt was where my true strengths lay.
At the start of 2014, Airhead were at a point where they needed to get serious about monitoring and supporting their growing infrastructure in Microsoft Azure. I was still in touch with a former colleague who was a founding employee of the company. An initial conversation over a few beers eventually progressed to a job offer that I gladly accepted.
What next? I must confess, I found the prospect of setting up our infrastructure monitoring and notification system from scratch a little daunting. As the company was a startup, I also had a relatively small budget with which to do it. In the past, I had mainly tuned and tweaked existing infrastructure monitoring tools. My initial instinct was not to waste time reinventing the wheel. At Airhead, we have a ‘cloud first’ attitude, always seeking to integrate with best-of-breed, cutting-edge technologies for our customers. I decided to carry this philosophy through to backend operations and support. I had thought that budgetary constraints might limit the quality of the tools and services I would be able to use. I was completely wrong! With the advent of cloud services and companies willing to integrate with each other, it is now entirely possible for a small startup to use the same monitoring tools as industry stars such as Airbnb, Pinterest and Path.
Within a week or so, I was up and running with Microsoft SCOM, Site24x7 for external monitoring and New Relic for application monitoring. We also set up a status page on StatusPage.io. Initially, alerts were generated and sent to our email addresses, and status updates were set manually on our status page if something went wrong. This was OK for a while, but eventually emails got missed and our status page wasn’t always updated quickly enough. We had monitoring down pretty well, but our notification solution fell well short. I wasn’t too keen on lugging a pager about again and I was even less keen on the associated costs. Then I found PagerDuty via a New Relic partner promotion. I signed up for a trial and all of my prayers were answered! PagerDuty would integrate with all of my monitoring solutions and alert the right people when things went wrong.
It probably took me an hour to integrate all of my services with PagerDuty. Very quickly, I was able to generate meaningful alerts to the iOS app that my colleague and I had installed on our existing phones. Escalation policies were flexible and easy to visualise. We went for something quite simple and effective: general alerts would go to the DevOps guys, whilst a full outage would escalate to all staff. On-call rotas were easy to configure, so we could share the pain of late-night wake-up calls. Speaking of wake-up calls, what better way to be alerted than with a sad trombone or a barbershop quartet style rendition of ‘The server’s on fire’? The push sounds for the iOS app keep getting better and better!
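To make the escalation idea concrete, here is a minimal sketch of the routing rule described above, written in plain Python rather than against PagerDuty itself. The names (`DEVOPS`, `ALL_STAFF`, `Alert`, `recipients`, `on_call`) and the people in the lists are assumptions for illustration only; in practice this logic lives in PagerDuty's escalation policies and schedules, not in our own code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical names: the post only mentions "DevOps guys" and "all staff".
DEVOPS = ["anthony", "colleague"]
ALL_STAFF = DEVOPS + ["support", "sales", "founders"]


@dataclass
class Alert:
    summary: str
    full_outage: bool = False


def recipients(alert: Alert) -> List[str]:
    """General alerts go to DevOps; a full outage escalates to everyone."""
    return ALL_STAFF if alert.full_outage else DEVOPS


def on_call(rota: List[str], week_number: int) -> str:
    """Rotate through the rota one person per week (an assumed weekly rota)."""
    return rota[week_number % len(rota)]
```

For example, `recipients(Alert("High CPU"))` returns only the DevOps list, while `recipients(Alert("Site down", full_outage=True))` returns everyone.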
After a couple of weeks of use, it was time to investigate some of the more advanced features. If an incident or outage occurred within our app, I was now confident that the right people would be notified. But what about our customers? As I mentioned previously, we use StatusPage.io for our custom status page. By integrating StatusPage.io with the PagerDuty API, we have been able to create rules that change the public status of our service when certain events are triggered from PagerDuty. This lets our customers know as soon as we do if there is a major issue affecting our platform. In addition to this, we have integrated PagerDuty with HipChat so we can quickly and easily view a summary of all alerts. This can be extremely useful when trying to understand an incident timeline.
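The kind of rule involved can be sketched as a simple lookup from an incident event to a public status. This is an illustration only: the service names, the `RULES` table and the `public_status` function are hypothetical, the real rules are configured inside the StatusPage.io integration rather than in code, and the event and status strings used here are assumed values chosen for the example.

```python
from typing import Dict, Tuple

# Hypothetical rule table: (service, incident event) -> public status.
# Service names and the exact strings are assumptions for this sketch.
RULES: Dict[Tuple[str, str], str] = {
    ("learning-platform", "trigger"): "major_outage",
    ("learning-platform", "resolve"): "operational",
    ("file-storage", "trigger"): "partial_outage",
    ("file-storage", "resolve"): "operational",
}


def public_status(service: str, event_type: str, current: str) -> str:
    """Return the new public status, or keep the current one if no rule matches."""
    return RULES.get((service, event_type), current)
```

An event with no matching rule, such as an acknowledgement, leaves the public status unchanged, so customers only see a change when something they care about actually happens.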
One of the best things about PagerDuty is the rate at which the service continues to improve and evolve. Just one of the new things I will look at this month is ‘Rich Incidents’, which gives me more context by embedding links and images in alerts. Oh, and hopefully we will get even more push alert sounds for the app. Keep them coming!
The best thing about PagerDuty is that it, like Airhead, ‘plays well with others’. It occupies an important role in operations and is happy to integrate with other fantastic cloud services. With affordable, flexible and continuously improving services such as these, it is a great time to be involved in IT operations. What was I worried about?