Chaos Engineering the Chaos Engineers

Even when your entire business is Chaos Engineering, fully embracing the process can present challenges. I'll share what we've learned and how we've adapted our approach to find success.

Eating your own dog food can be difficult, especially when you're too busy to slow down and eat. Although we have been proponents of Chaos Engineering for a very long time, running it internally at Gremlin hasn't always been easy. As at most companies, feature work took high priority and reliability efforts were often a distant second, especially as the company grew.

In this talk, I'll share how Gremlin experimented with our own Chaos Engineering practices, what we tried, what worked, and what didn't. I'll also discuss ways that you can keep your own Chaos Engineering practices lightweight and quick so you can strike the right balance between feature work and reliability in your own organization.

This talk focuses on our evolution from larger, traditional gamedays to smaller, more efficient ones that promote more regular practices.

"While I was working for MongoDB, I realized that Chaos Engineering was not just a fad, it was something which was meant for more than just Netflix, Google, and these large companies." This is how Jason Lee began his talk on 'Chaos Engineering the Chaos Engineers' on day 2 of Chaos Carnival 2021.

Jason continues, "Chaos Engineering is a fantastic way to validate your monitoring. If latency is injected into a system, it should be visible on the APM tools. Monitoring and observability are critical, and hence, we used Chaos Engineering to ensure the service is reliable."
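The idea of validating monitoring with injected latency can be sketched in a few lines. This is only an illustration: `handle_request`, `with_latency`, `record_latency`, and the 0.03-second threshold are hypothetical stand-ins for a real service, a real latency attack, and a real APM tool.

```python
import time
from typing import Callable, List


def handle_request() -> str:
    """Stand-in for a real service call."""
    return "ok"


def with_latency(func: Callable[[], str], delay_s: float) -> Callable[[], str]:
    """Wrap func so every call is delayed, simulating a latency attack."""
    def delayed() -> str:
        time.sleep(delay_s)
        return func()
    return delayed


def record_latency(func: Callable[[], str], samples: List[float]) -> str:
    """Call func and append its duration, as a monitoring agent might."""
    start = time.perf_counter()
    result = func()
    samples.append(time.perf_counter() - start)
    return result


samples: List[float] = []
record_latency(with_latency(handle_request, delay_s=0.05), samples)

# Assumed alert threshold: the injected 50 ms delay should push latency above it.
latency_visible = max(samples) >= 0.03
print("injected latency visible in metrics:", latency_visible)
```

If the check comes back false, the gap is in the monitoring rather than in the service, which is exactly what this kind of experiment is meant to reveal.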

BEFORE

Previously, a chaos experiment was scheduled at a time that suited the team, after which tests were picked to give the whole process an easy start. Nowadays, we first create a hypothesis that states what we expect the end result to be. If the experiment does not confirm the hypothesis, we should be ready with a plan for what to do next with the system.
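A hypothesis-first experiment can be written down as a small record before anything is run. The sketch below is one possible shape, not Gremlin's actual format; the `Experiment` class, its field names, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """A chaos experiment planned around a hypothesis (field names are illustrative)."""
    name: str
    hypothesis: str      # what we expect the end result to be
    fallback_plan: str   # what to do next if the hypothesis does not hold

    def review(self, hypothesis_held: bool) -> str:
        """Return the next step once the experiment has been run."""
        if hypothesis_held:
            return f"{self.name}: hypothesis confirmed, no follow-up needed"
        return f"{self.name}: hypothesis failed, next step: {self.fallback_plan}"


# Hypothetical example experiment.
experiment = Experiment(
    name="block-cache-dependency",
    hypothesis="Requests are still served when the cache is unreachable",
    fallback_plan="file a ticket to add a cache timeout and rerun the experiment",
)
print(experiment.review(hypothesis_held=False))
```

Writing the fallback plan next to the hypothesis means the way forward is decided before the experiment, not improvised after it.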

GAME LEVELS

Terminate the service.
Block access to 1 dependency.
Block access to all dependencies.
Terminate the host.
Degrade the environment.
Spike Traffic.
Terminate the region/cloud.
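Read as an escalation ladder, the game levels can be modeled as an ordered enumeration. This is a minimal sketch under two assumptions that the talk summary does not state explicitly: that the levels are meant to be attempted in the order listed, and that a team only moves up after passing the current level. The names `GameLevel` and `next_level` are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class GameLevel(IntEnum):
    """Game levels in the order listed above (assumed to be increasing difficulty)."""
    TERMINATE_SERVICE = 1
    BLOCK_ONE_DEPENDENCY = 2
    BLOCK_ALL_DEPENDENCIES = 3
    TERMINATE_HOST = 4
    DEGRADE_ENVIRONMENT = 5
    SPIKE_TRAFFIC = 6
    TERMINATE_REGION_OR_CLOUD = 7


def next_level(current: GameLevel, passed: bool) -> Optional[GameLevel]:
    """Advance one level after a pass; repeat the same level after a failure."""
    if not passed:
        return current
    if current is GameLevel.TERMINATE_REGION_OR_CLOUD:
        return None  # no levels left
    return GameLevel(current + 1)


print(next_level(GameLevel.TERMINATE_SERVICE, passed=True).name)
```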

When Jason joined Gremlin, it was just a startup. People were being hired on a monthly basis, and many had only heard the term Chaos Engineering without ever having practiced it. As at every other startup, the ultimate goal for Gremlin was to ship code and build the product. Jason took the deployment method he had used at Datadog and combined it with Gremlin's traditional Chaos Engineering techniques.

Whenever we talk about Chaos Engineering, we mention how to define a hypothesis, how to inject failure, and how to analyze results to make systems reliable. That's the easy part. The real challenge is how to prioritize Chaos Engineering in an organization and how to balance it with feature development.

COMPLAINTS

"One of the engineering managers came up to me and complained that the team was at a grinding halt due to recurring gamedays, where everything is shut down and all they do is analyze the results. Another engineering manager complained that the first gameday alone had generated enough tickets to fill an entire sprint, which would mean derailing the work and delaying product features," Jason mentions.

These complaints fall into three major categories:

Lack of time: Besides knowing how to conduct a chaos experiment, everyone should care about reliability as well. That requires a time commitment, which most people forget to track.
Lack of process: When the process is not clearly defined, it is difficult to project how much time an experiment might take. If a team is instead prepared with well-defined processes in the form of runbooks and documentation, it tends to solve problems much faster.
Lack of priority: When a gameday generates high-value tickets, it becomes difficult to prioritize work.

TOOLS

Donut is one of the tools used at Gremlin, originally meant to help people engage with one another. It is a great team-building device because most employees work remotely and are spread across the US. Donut is designed to pair or team people up so that they can share some common experiences. It works by grouping three less active engineering managers on Slack and giving them an opportunity to do a mini-gameday. Donut then blocks their calendars at a common time and sends an automated Zoom link, so all the planning and logistics are taken care of.
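The grouping step itself is simple to picture. The sketch below is not Donut's API or algorithm; it is a hypothetical `make_groups` helper that shuffles a list of names and splits it into groups of three, with made-up names and an assumed rule for leftover people.

```python
import random
from typing import List, Optional


def make_groups(people: List[str], size: int = 3, seed: Optional[int] = None) -> List[List[str]]:
    """Shuffle people and split them into groups of `size` for a mini-gameday."""
    rng = random.Random(seed)
    shuffled = people[:]
    rng.shuffle(shuffled)
    groups = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]
    # Assumed rule: fold a short trailing group into the previous one
    # so nobody ends up running a gameday alone.
    if len(groups) > 1 and len(groups[-1]) < size:
        leftover = groups.pop()
        groups[-1].extend(leftover)
    return groups


# Hypothetical participant names.
names = ["ana", "ben", "chi", "dev", "eli", "fay", "gus"]
for group in make_groups(names, seed=1):
    print(group)
```

Scheduling and the video link are left out here; in the setup described above, the tool handles those automatically.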

Gremlin wanted to automate the process of running a gameday, so they built an MVP and kept iterating on it. A Google Form was an easy tool: it has the appropriate fields, and participants just fill in the answers. Once everyone has filled in the form, a report can be generated for later use.
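Turning form answers into a report can be as small as the sketch below. The field names (`service`, `attack`, `hypothesis_held`, `follow_up`) and the sample rows are assumptions for illustration; the talk summary does not list the actual form fields.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical form responses, one dictionary per submitted form.
responses: List[Dict[str, str]] = [
    {"service": "api", "attack": "terminate the service",
     "hypothesis_held": "yes", "follow_up": ""},
    {"service": "api", "attack": "block access to 1 dependency",
     "hypothesis_held": "no", "follow_up": "add a timeout to the dependency client"},
]


def build_report(rows: List[Dict[str, str]]) -> str:
    """Summarize gameday responses as a plain-text report."""
    outcomes = Counter(row["hypothesis_held"] for row in rows)
    lines = [
        f"Experiments run: {len(rows)}",
        f"Hypotheses held: {outcomes['yes']}, failed: {outcomes['no']}",
    ]
    for row in rows:
        if row["follow_up"]:
            lines.append(f"Follow-up for {row['service']} ({row['attack']}): {row['follow_up']}")
    return "\n".join(lines)


report = build_report(responses)
print(report)
```

Keeping the follow-ups in the same report makes it easier to decide later which ones deserve a ticket.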

Slight is an internal documentation tool that is used to gather notes. Engineering managers decided to use it for running mini-gamedays, following the practice they already use for incident response. Iterating on the notes helps capture better information for future reference.

"Reliability does not come for free, it's an investment. But you will eventually have to be consistent with your processes and keep updating and adapting in your teams. But as you start to adapt to Chaos Engineering, remember, the experimentation is not the hard part. The hard part is everything else, so as you roll out your practices, keep these three things in mind: keep it short, create a process or automate, and balance your priorities while focusing on one thing at a time." With this, Lee concluded his talk.
