Chaos with Care
For yet another enlightening session on day 2 of Chaos Carnival 2021, Carl Chesser delivered a talk on how to create chaos with care.
Carl starts off by saying, "In this talk, we will go through the lessons learned at the corporation when we tried to introduce Chaos Engineering. One of the most common challenges we face when it comes to introducing something new is that there is uncertainty and confusion among the team. But we tackle this by safely building the practice on valuable earlier success to reduce failure. We should always have that approach of starting with the why and having a goal of what we want to achieve before applying the how."
HUMANS AND SOFTWARE SYSTEMS
While introducing new technologies, we as humans tend to improve the software systems and the dependencies that come with them. Humans can reasonably be considered part of these systems, because we exert most of the influence on them. Hence, when you look at applying the practice of chaos engineering in production, you have to consider not only the software systems but also the people involved. Taking this approach benefits production by helping you build confidence in your systems and understand them well. For example, introducing Chaos Engineering in a large organization requires a safe, beneficial and methodical approach to getting started, because it is not just the machines that are involved, it's the people as well.
ALIGN THE INTRODUCTION OF CHAOS WITH ORGANIZED EXPERIMENTS
When we introduce chaos, the common way to offer it is through organized experiments. We need to start by conveying the value of chaos, and the first part of that value is learning. The most effective way to learn is to set aside organized time for experiments, rather than randomly killing Kubernetes pods and expecting teams to react immediately while the practice is still new. Hence, it is highly advisable to plan the first experiment fully with the team, so that they can study it and gather and access the data for the ones that follow.
SHARE RISKS OF THE UNKNOWN
Our systems are often in an evolutionary model, where we are changing the system by migrating to or introducing some new technology. In these scenarios, teams should identify the known unknowns that pose reliability risks and work out how to manage them with more information gathered through experiments.
PREPARE THE EXPERIMENT
Many of us know how important it is to prepare for our chaos experiments. We need to capture these experiments in a shared and open location so that people can easily find them and use them as an inventory while building hypotheses. This also helps in identifying the prerequisites an experiment needs in order to be effective, for example improved telemetry on connection refreshes for data stores.
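One way to picture such a shared inventory is as a small, serializable record per experiment. The sketch below is only an illustration: the `ChaosExperiment` class, its field names and the `to_inventory_entry` helper are hypothetical and do not come from the talk. It assumes each experiment states a hypothesis and lists prerequisites that must be completed before the experiment is run.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ChaosExperiment:
    """One planned experiment in a shared inventory (hypothetical schema)."""
    name: str
    hypothesis: str
    environment: str
    prerequisites: list[str] = field(default_factory=list)
    completed_prerequisites: set[str] = field(default_factory=set)

    def missing_prerequisites(self) -> list[str]:
        # Keep the declared order so the list is stable for reviewers.
        return [p for p in self.prerequisites
                if p not in self.completed_prerequisites]

    def is_ready(self) -> bool:
        # An experiment is ready only when no prerequisite is outstanding.
        return not self.missing_prerequisites()


def to_inventory_entry(experiment: ChaosExperiment) -> str:
    """Serialize to JSON so the record can live in any searchable repository."""
    record = asdict(experiment)
    # Sets are not JSON serializable, so store them as a sorted list.
    record["completed_prerequisites"] = sorted(experiment.completed_prerequisites)
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    experiment = ChaosExperiment(
        name="datastore-connection-refresh",
        hypothesis="Services reconnect to the data store without user-visible errors",
        environment="dev",
        prerequisites=["telemetry on connection refresh"],
    )
    print(experiment.missing_prerequisites())
    print(to_inventory_entry(experiment))
```

Keeping the prerequisites next to the hypothesis makes it visible, before the experiment starts, which telemetry or tooling gaps still need to be closed.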
VALUE OF SPECIFIC SCENARIOS
When we talk about conveying value, another important aspect is the value of specific scenarios, which make it easier to understand your system and its dependencies. For example, if we have a database and its latency starts increasing, we can start noticing the effects elsewhere. When we have scenarios like this, we depict them with a simple box-and-arrow illustration that shows what is impacted, what the actual end-user effect is, and so on. Storytelling is more effective when you can articulate specific scenarios that help people visualize larger potential failures.
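The box-and-arrow illustration can also be expressed as a small dependency graph, which makes it possible to list everything that sits upstream of a degraded component. This is a minimal sketch under stated assumptions: the service names in `DEPENDS_ON` and the `impacted_by` function are made up for illustration and are not taken from the talk.

```python
from collections import deque

# Hypothetical box-and-arrow diagram: each key depends on the services it lists.
DEPENDS_ON = {
    "checkout-ui": ["order-api"],
    "order-api": ["orders-db", "pricing-api"],
    "pricing-api": ["orders-db"],
    "search-ui": ["search-api"],
    "search-api": [],
    "orders-db": [],
}


def impacted_by(degraded: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return every component that directly or transitively depends on `degraded`."""
    # Invert the arrows: map each dependency to the services that call it.
    dependents: dict[str, list[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(service)

    # Breadth-first walk from the degraded component towards the end user.
    seen: set[str] = set()
    queue = deque([degraded])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


if __name__ == "__main__":
    # Which boxes are affected if latency rises in the database?
    print(impacted_by("orders-db", DEPENDS_ON))
    # → ['checkout-ui', 'order-api', 'pricing-api']
```

In this made-up graph the search path does not touch the database, so it stays out of the result, which is the kind of specific statement a scenario diagram is meant to support.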
BUILD DEV/TEST MOMENTUM
In the beginning, it's usually a good idea to start in a space where you are comfortable failing, because your team needs some momentum to learn the specifics of these experiments. Experiments can sometimes have a larger impact than expected. To start with, try dev/test environments instead of jumping right into production, so that you have a safe space to fail and learn. This also helps in 'inventorying' the known unknowns while you are building out how the system is expected to work.
PROACTIVE VS REACTIVE
The systems and their newer dependencies will need to be exercised at a larger scale through game-day scenarios to gather more data. This further empowers your team: they can see that experiments are now reducing risk rather than introducing it, and that working proactively helps the team learn how to address problems. It's normal for something to come up. When that happens, identify and address it at the right time, and continue to invest in reliability.
OBSERVABILITY IS CRITICAL
Reviewing past experiments and documenting the data helps teams build observability, so that they can understand the system and get access to everything they expect to see. You will need easy access to telemetry data from all parts of the system. The goal is to be able to ask new and different questions about the system without having to change it. When you notice a gap in visibility, focus on how to rebuild your system with the improvement in a way that needs little coordination.
BUILDING PATTERNS TO LEARN IN PROD
As you get better at learning, building out the system and running experiments in safe environments, it becomes really important to figure out how to get into production and how to introduce experiments there so that you can understand all parts of your system. Find ways to contain or control traffic flows so you can safely learn and experiment in production. Consider ways to replicate or simulate traffic to ensure live traffic flows are not obstructed.
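One common way to contain traffic is to expose only a small, stable share of requests to an experiment. The sketch below illustrates that idea with deterministic hashing. It is an assumption for illustration, not the approach described in the talk, and the `in_experiment` function and the request ID format are hypothetical.

```python
import hashlib


def in_experiment(request_id: str, percent: float) -> bool:
    """Deterministically place a stable share of requests in the experiment.

    The same request_id always gets the same answer, so the affected
    population does not change between calls or between hosts.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket from 0 to 9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=1.0 selects buckets 0..99, which is about 1% of requests.
    return bucket < percent * 100


if __name__ == "__main__":
    request_ids = [f"request-{i}" for i in range(10_000)]
    selected = sum(in_experiment(request_id, 1.0) for request_id in request_ids)
    print(f"{selected} of {len(request_ids)} requests would see the experiment")
```

Because the bucket depends only on the request ID, raising the percentage widens the affected group without reshuffling it, and setting it to 0 switches the experiment off for everyone.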
UNDERSTAND AND EMBRACE NEEDED COMPLIANCE
Another part of this is to embrace compliance when you go to production. There are a lot of newer compliance-type rules that always come with production systems, and you may be presented with a list of all the reasons why you should not be able to run experiments in that environment. It is really helpful to find the subject matter experts in your organization who can help you understand these rules in more detail.
PLAN TO BE SURPRISED
All these systems have a surprise factor to them. Planning for it usually means making the environment safe for teams to fail in, or minimizing the impact. Sometimes the actual outcome differs from the hypothesis and the impact is larger than expected, for example when shadowing or replaying traffic to another system affects more of the system than the test anticipated. Teams should capture this misalignment between the projection and the result in an open and searchable repository, and plan extra time to digest what they have learned.
Concluding his talk, Carl says, "So, hopefully, these are helpful reminders of how you can reduce what risks, and also in molding people's mindset into growing in this practice. This will help your teams to start in a safe environment and eventually make them confident in implementing chaos in the production."