Chaos with Care


About the speaker

Carl Chesser

Principal Engineer, Cerner Corporation


Carl is a principal engineer supporting the service platform at Cerner Corporation, a global leader in healthcare information technology. The majority of his career has focused on evolving and scaling the service infrastructure for Millennium, Cerner's core electronic medical record platform. He is passionate about growing a positive engineering culture at Cerner and contributes by organizing hackathons and meetups and giving technical talks.

About the talk

Oftentimes, chaos engineering is viewed as a source of risk rather than a practice that can reduce risk to our systems. This talk provides a guide on how to influence people toward this approach by gradually demonstrating how it can reduce the risks that matter most to your systems.

We all care about reliability. It is often assumed that any new feature is reliable; however, we recognize that reliability is not a free property of the systems we build. Our speed to market with new features and the inevitable growth of our systems can work against the reliability we pursue. Chaos engineering is a practice that can continuously improve reliability, but it comes with a cost that must be balanced against the expectation of new features.

In this talk, we will discuss how to introduce and grow chaos engineering in your organization: how to effectively communicate this approach to leadership and related teams, strategies for starting small and expanding into more risk-averse environments, and valuable lessons learned on making it effective within your systems. These approaches include specific examples of how teams have applied chaos engineering at Cerner Corporation, a healthcare technology company. From this talk, you will be equipped with ideas on how to safely grow the chaos engineering practice within your organization.


Chaos with Care

For yet another enlightening session on day 2 of Chaos Carnival 2021, Carl Chesser delivered a talk on how to create chaos with care.


Carl starts off by saying, "In this talk, we will go through the lessons learned at Cerner when we tried to introduce chaos engineering. One of the most common challenges we face when introducing something new is uncertainty and confusion within the team. We tackle this by building the practice safely on earlier, valuable successes to reduce the chance of failure. We should always start with the why, and have a clear goal of what we want to achieve, before applying the how."


When introducing new technologies, we tend to focus on improving the software systems and the dependencies that come with them. But humans can easily be considered part of these systems as well, because most of the influence on them comes from us. Hence, when you look at applying chaos engineering in production, you have to consider not only the software systems but also the people involved. Approached this way, the practice can help you build confidence and understand your systems better. For example, introducing chaos engineering in a large organization requires a safe, beneficial, and methodical approach to getting started, because it is not just the machines that are involved; it is the people as well.


When we look at introducing chaos, the common starting point is running organized experiments. We need to convey the value of chaos engineering, and the first part of that value is learning. The most effective way to learn is to schedule an organized time for experiments, rather than randomly killing Kubernetes pods and expecting teams to react immediately in the early stages. Hence, it is highly advisable to plan the first experiment fully with the team, so they can study the results and gather and assess the data for the experiments that follow.
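To make "fully planned" concrete, a planned experiment can be written down as a small structured record before anything is run. The sketch below is my own illustration, not a schema from the talk or any specific chaos tool; all field names and example values are assumptions.

```python
from dataclasses import dataclass, field

# A minimal sketch of a planned chaos experiment record. The fields
# capture what the team agrees on before running anything: the
# hypothesis, how steady state is measured, the blast radius, and
# when to abort. All names and values here are illustrative.
@dataclass
class PlannedExperiment:
    name: str
    hypothesis: str           # what we expect to remain true
    steady_state_metric: str  # how we measure "normal"
    blast_radius: str         # which pods/services may be affected
    abort_condition: str      # when to stop the experiment early
    participants: list = field(default_factory=list)

    def runbook_summary(self) -> str:
        return (f"{self.name}: verify '{self.hypothesis}' via "
                f"{self.steady_state_metric}; abort if {self.abort_condition}")

experiment = PlannedExperiment(
    name="pod-kill-checkout",
    hypothesis="checkout latency stays under 500 ms when one pod is killed",
    steady_state_metric="p99 latency from the checkout dashboard",
    blast_radius="one replica of the checkout deployment in staging",
    abort_condition="error rate exceeds 1% for 2 minutes",
    participants=["on-call engineer", "service owner"],
)
print(experiment.runbook_summary())
```

Records like this, kept in a shared location, double as the inventory of hypotheses that later experiments can build on.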


Many of our systems are in an evolutionary state, where we are changing the system by migrating to or introducing some new technology. In these scenarios, teams should identify the known unknowns that pose reliability risks and determine how to manage them with the additional information gathered through experiments.


Many of us know how important it is to prepare for our chaos experiments. We need to capture these experiments in a shared and open location so that people can easily find them and use them as an inventory when building hypotheses. This also helps in identifying prerequisites that are necessary for an experiment to be effective; for example, improved telemetry on connection refreshes for data stores.


When we talk about conveying value, another important aspect is the use of specific scenarios, which make it easier to understand your system and its dependencies. For example, if latency starts increasing on a database, we can trace the effects outward. When we have these scenarios, we depict them with simple box-and-arrow illustrations showing what is impacted, what the downstream effects are, and what the end user actually experiences. Storytelling is more effective when you can articulate specific scenarios that help people visualize larger potential failures.
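The database-latency scenario can be made tangible with a toy model (my own illustration, not from the talk): when a single request makes several sequential database calls, a small injected delay per call multiplies into a much larger end-user effect.

```python
# Toy model of how injected database latency propagates to the end
# user. A request spends some fixed time in the application plus the
# cost of each sequential database call; injecting delay per call
# amplifies the end-to-end impact. All numbers are illustrative.
def end_user_latency_ms(app_ms: float, db_ms: float, db_calls: int,
                        injected_db_delay_ms: float = 0.0) -> float:
    """End-to-end latency for one request with sequential DB calls."""
    return app_ms + db_calls * (db_ms + injected_db_delay_ms)

baseline = end_user_latency_ms(app_ms=40, db_ms=5, db_calls=10)
with_chaos = end_user_latency_ms(app_ms=40, db_ms=5, db_calls=10,
                                 injected_db_delay_ms=20)
print(baseline)    # 90.0
print(with_chaos)  # 290.0
```

A 20 ms delay per call turns a 90 ms request into a 290 ms one, which is exactly the kind of number a box-and-arrow story can hang on.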


In the beginning, it is usually a good idea to start in a space where you are comfortable failing during these experiments, as your team will need some momentum to learn the specifics, and experiments can sometimes have a larger impact than expected. To start, try dev/test environments instead of jumping straight into production, so that you have a safe space to fail and learn. This also helps in inventorying the known unknowns as you build out how the system is expected to work.


As the system and its newer dependencies scale up, run game-day scenarios to acquire more data. This further empowers your team: you can show how the experiments are now reducing risk rather than introducing it, and this proactive posture helps the team learn how to address failures. It is normal for surprises to come up; when they do, identify and address them promptly, and continue to invest time in reliability.


Observing past experiments and documenting the data helps teams build the observability needed to understand the system and verify what they expect. You will need easy access to telemetry from all parts of the system. The goal is to be able to ask new and different questions about the system without having to change it. When you notice a gap in visibility, focus on how to close it with an improvement that requires low coordination.


As you get better at learning, building out the system, and running experiments in safe environments, it is important to figure out how to move into production and start introducing experiments there, so you can understand all parts of your system as it really runs. Find ways to contain or control traffic flows so you can safely learn and experiment in production. Consider replicating or simulating traffic to ensure live traffic flows are not obstructed.


Another part of this is, when you go to production, embracing compliance. Production systems always come with newer compliance-type rules, and you will often be handed a list of all the reasons why you should not be able to do this in this environment. It is really helpful to find the subject matter experts in your organization who can help you understand these rules in more detail.


All these systems have a surprise factor to them. Managing it usually means making the environment safe for teams to fail in, or minimizing the impact when they do. Sometimes the actual outcome differs from the hypothesis and has a larger impact than expected; for example, shadowing or replaying traffic to another system can affect far more of the system than the test intended. Teams should capture these misalignments between projection and outcome in an open, searchable repository, and plan extra time to digest the lessons.

In the end, concluding his talk, Carl says, "So, hopefully, these are helpful reminders of how you can reduce risk, and also shape people's mindset toward growing this practice. This will help your teams start in a safe environment and eventually become confident in implementing chaos in production."

May 01, 2021   |   6 MIN READ