August 24, 2017
This is Part 1 in a multi-part series dealing with tips for being on-call.
There is only one thing worse than being woken up at 3am by PagerDuty to learn that your systems are down: waking up on your own at 8am and discovering that your systems were down for 5 hours and nobody got the alert.
This post, along with future ‘Best Practices’ posts, will include tips on how to make sure your high-severity alerts are received reliably and promptly by on-call staff. This is the first step in reducing your mean time to recovery (MTTR) when – not if – problems happen.
First, and most obviously, a cellphone is a must for an on-call shift. This is for receiving phone and SMS alerts when not at home, as well as for contacting and being contacted by others when high-severity problems occur.
Make sure to set your phone’s ringer volume to ‘high’ when your shift starts, to reduce the chance of sleeping through an alert, or missing a call or SMS in noisy situations. If your SMS ringtone has a separate volume control, make sure to crank that up too. Picking sharp or piercing phone and SMS ringtones always helps. Finally, if you bought a cellphone not exactly known for its awesome battery life, like I did long ago, then you’d be wise to keep a charger handy as well.
Another must-have piece of equipment for an on-call shift is a mobile USB broadband modem or mobile hotspot device. This, of course, is only true if your team can deal with operational issues remotely, such as with a laptop and VPN connection. If this is the case, then having one of these mobile devices allows on-call staff to connect to the internet from wherever they are, rather than having to rush home (or worse, to the office!) to fight operational fires.
These mobile hotspots or modems can be a lifesaver, both by reducing incident response times and by improving the lives of on-call staff, who can now venture further than 10-15 minutes away from home or another source of guaranteed internet connectivity. Only one device is needed per team – it can be passed around along with the primary on-call rotation if need be – and the monthly fees are quite economical for basic data plans. We recently got the new LG Verizon 4G modem for our on-call (and yes, PagerDuty has on-call too), and it seems pretty decent, but any 3G device would likely work just as well.
Users should always include multiple contact methods in their PagerDuty Contact Info in order to ensure reliable delivery of notifications. SMS is a quick, terse, and convenient notification method, but the protocol does not guarantee immediate delivery of messages, and notifications can occasionally be delayed significantly within your mobile carrier’s network. We’ve seen occasional delays of several minutes – and sometimes more – between when an SMS is sent and when it is received by a handset. On our end, we partner with multiple SMS providers to try to ensure reliable and timely delivery of our SMS notifications, but it isn’t always enough.
To that end, we strongly recommend using both SMS and phone notifications in your user contact info. Use a fairly short delay – a couple of minutes at most – between notifications, as it is quick and easy to acknowledge a notification once it is actually received. If you have a work phone line that is landline- or VoIP-based, you can also include that as a third contact method in case your cell coverage is spotty at work. If you want to kick it old-school, you can even set up PagerDuty to send email to your alphanumeric pager! (Just so long as your pager’s wireless carrier has an email-to-pager gateway; ironically this is the only method of paging that PagerDuty currently supports. Nobody has asked for more yet, but let us know if you want it.)
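To make the staggering concrete, here is a minimal sketch of the recommendation above as plain data: SMS first, then a phone call, then a work line, each a couple of minutes apart. This is not PagerDuty's API. The names `NotificationRule`, `build_rules`, and `due_methods` are hypothetical and exist only for illustration, and the two-minute gap is an assumption taken from the "couple of minutes at most" guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRule:
    """One contact method and how long after the alert fires it is tried."""
    method: str          # hypothetical labels, e.g. "sms", "phone", "work_phone"
    delay_minutes: int   # minutes after the alert is triggered


def build_rules(methods, gap_minutes=2):
    """Stagger the contact methods gap_minutes apart, starting immediately."""
    if gap_minutes < 0:
        raise ValueError("gap_minutes must be non-negative")
    return [
        NotificationRule(method, index * gap_minutes)
        for index, method in enumerate(methods)
    ]


def due_methods(rules, minutes_since_trigger, acknowledged=False):
    """Return the methods that would have been tried by now.

    Once the alert is acknowledged, no further notifications are needed.
    """
    if acknowledged:
        return []
    return [
        rule.method
        for rule in rules
        if rule.delay_minutes <= minutes_since_trigger
    ]


rules = build_rules(["sms", "phone", "work_phone"])
for rule in rules:
    print(f"{rule.method}: notify after {rule.delay_minutes} min")
```

With a short gap, a delayed SMS costs at most a couple of minutes before the phone call goes out, which is the point of listing more than one contact method.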
Future ‘Best Practices’ posts will give tips for creating robust on-call schedules and escalation policies. Stay tuned!