PagerDuty Blog

Chaos Engineering With Ana Medina

Recently, I sat down with Ana Medina of Gremlin for a PagerDuty Community AMA!

Ana is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. Previously, she worked at Uber as an engineer on the SRE and Infrastructure teams, where she specifically focused on chaos engineering and cloud computing. Catch her tweeting at @Ana_M_Medina about traveling, diversity in tech, and mental health.

You can check out the entire AMA here:

https://www.youtube.com/watch?v=Rf7CedwLnYY

If you prefer to read, however, here are some of the questions asked, along with a summary of Ana’s responses.

Q: I have been on multiple projects and [worked with] customers where Chaos Monkey is discussed or floated and there is great interest and conversation, but once it comes time to actually run it, people get scared. Objections slip out, and it becomes a game of “but what if data is corrupted, a customer is impacted, the very important person becomes upset?” Have you faced these objections and how have you overcome them? – Joel Heenan

It doesn’t have to be scary—consider the prerequisites, such as blast radius and monitoring. The first step is monitoring and observability. You can’t get started if you don’t know what your current system or service looks like right now or what it will look like once you start your experiment.

Next, consider what type of experiment you will be doing. Understand your hypothesis, as well as what it will take to stop the experiment if you discover you are about to breach an SLA. Be aware of what your abort conditions are.

Also consider your blast radius—why run this in production if you don’t know what it will do in staging or another pre-production environment? You can start this in a non-production environment, one that is safer and doesn’t touch customers. Instead of running it on 50 percent of your infrastructure, maybe just run it on three of your hosts to get a feel for what the impact might be. For example, when Ana was at Uber, they would have their SRE team embed with the service owners to walk them through what the experiment would do so there was a better understanding of the conditions and potential impacts.
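To make the "three hosts instead of 50 percent" idea concrete, here is a minimal Python sketch of capping the blast radius before an experiment starts. The function name `pick_targets`, its parameters, and the host names are hypothetical and chosen only for illustration; they do not come from Gremlin, Chaos Monkey, or any other tool.

```python
import random


def pick_targets(hosts, max_hosts=3, max_fraction=0.05, seed=None):
    """Pick a small random subset of hosts for a chaos experiment.

    The subset is capped by both an absolute count (max_hosts) and a
    fraction of the fleet (max_fraction). At least one host is chosen
    whenever the fleet is non-empty.
    """
    hosts = list(hosts)
    if not hosts:
        return []
    # Take the stricter of the two caps, but never fewer than one host.
    cap = max(1, min(max_hosts, int(len(hosts) * max_fraction)))
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(hosts, cap)


# Hypothetical staging fleet of 100 hosts: only 3 are targeted.
staging_fleet = [f"staging-host-{i}" for i in range(100)]
targets = pick_targets(staging_fleet, seed=42)
```

Starting with a hard cap like this in a pre-production environment, and only raising it once the results are understood, is one way to grow the blast radius gradually.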

A lot of the concerns are around safety, so it’s important to have a “big red button” that will stop all the experiments from running. You can have this automated through your monitoring and observability so that if things start to go wrong, you can have the experiments automatically shut down before a customer-impacting issue occurs.
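As a rough illustration of an automated "big red button," the Python sketch below injects a fault, polls a set of abort conditions, and always rolls back, whether the experiment finishes or is aborted. Everything here is an assumption made for the example: `AbortCondition`, `run_experiment`, and the metric readers are hypothetical names, and in practice the readings would come from your own monitoring system.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class AbortCondition:
    """A metric reader plus the threshold above which we stop."""
    name: str
    read_metric: Callable[[], float]
    threshold: float

    def breached(self) -> bool:
        return self.read_metric() > self.threshold


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    conditions: Iterable[AbortCondition],
    checks: int = 10,
    interval_s: float = 0.0,
) -> Optional[str]:
    """Run one experiment with automatic abort.

    Returns the name of the condition that triggered the abort, or
    None if every check passed. Rollback runs in both cases.
    """
    conditions = list(conditions)
    inject()
    try:
        for _ in range(checks):
            for cond in conditions:
                if cond.breached():
                    return cond.name  # the "big red button"
            time.sleep(interval_s)
        return None
    finally:
        rollback()  # always undo the fault, even on errors
```

A condition might be something like `AbortCondition("error_rate_pct", read_error_rate, 1.0)`, where `read_error_rate` is a hypothetical function that queries your monitoring. The threshold should sit comfortably inside the SLA so the experiment stops before customers notice.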

Finally, if you communicate the potential learnings that can come from the experiments, it really can help overcome fears and concerns.

Q: What have you found most effective to prevent burnout from being on call or on several intensive projects over a short amount of time? – Taylor Dolezal

Having a great manager and a great relationship with your manager is key. You need to be able to talk to your manager about what’s going on with you and what your workload looks like. Communicate that when you’re on call, that’s going to be your No. 1 priority and that having high-priority projects at the same time isn’t effective.

Chaos engineering can help prevent burnout because you can prepare for those 2 a.m. pages by practicing during business hours to build muscle memory for responding to outages. Even better, you may be able to detect potential issues in advance and prevent the 2 a.m. pages from happening at all.

Self-care is also a key part of preventing burnout. Make sure you eat healthy and get enough rest…and don’t fill your body with sugar, even if it seems like a good idea at the time! When Ana is on call, she makes sure to schedule time to go out with friends to disconnect from the stress of being on call. She’s on the road a lot, and to help with de-stressing, she always makes sure she travels with bath bombs!

Q: Who is your favorite person (or favorite people) within the technical community? – Taylor Dolezal

Because burnout was a major factor in previous roles, Ana’s current manager, Tammy Bütow, is one of her favorites. Amy Chen is another.

Q: What are you interested in learning at the moment? – Taylor Dolezal

These days Ana is focusing on learning about world history and American history. She’s also interested in exploring the Seeking SRE book and looking into best practices around observability.

Q: What are you most excited about in the chaos engineering space for 2019? – Taylor Dolezal

It’s going to be a year of adoption. In 2018, a lot of people were learning about the topic, onboarding into observability practices, and so on. Now that the groundwork has been done, Ana expects we will see more adoption.

Q: How can PagerDuty be used by chaos engineers? – Tammy Bütow

A chaos engineer usually has a PagerDuty account! They have probably already had experience being on call, but they could also be the engineers who test PagerDuty alerts by setting up dummy services and running chaos experiments against them. This helps train engineers on responding effectively to alerts as well.
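For readers who want to try this, here is a minimal Python sketch of building a test alert for a dummy service. It assumes the PagerDuty Events API v2 request shape (a `routing_key`, an `event_action`, and a `payload` with `summary`, `source`, and `severity`); confirm the details against the official documentation before relying on it. The function names, the dummy service name, and the placeholder routing key are all hypothetical.

```python
import json
import urllib.request

# Events API v2 endpoint, to the best of our understanding.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_test_event(routing_key, summary,
                     source="chaos-dummy-service", severity="warning"):
    """Build a trigger event for a dummy service (no network call)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,  # integration key of the dummy service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Build (but do not send) an alert for a chaos experiment drill.
event = build_test_event(
    "YOUR_DUMMY_SERVICE_ROUTING_KEY",  # placeholder, not a real key
    "Chaos drill: injected latency on dummy service",
)
```

Pointing the routing key at a dummy service keeps drills away from real on-call rotations while still exercising the full alerting path.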

Thanks for reading this AMA summary, and don’t forget to check out our other AMA videos! And if you’re wondering who our next guest on the PagerDuty Community AMA will be, head over to our Community forums for updates!