IBM's Principles of Chaos Engineering
For yet another insightful session on day 2 of Chaos Carnival 2021, we had with us Robert Barron, AIOps, Assets, & Architecture from IBM, to speak about "IBM's Principles of Chaos Engineering".
IBM has been in the market for more than a hundred years, dealing with computers and computing devices throughout. While improvising at the NASA center after the explosion on Apollo 13, we started preparing components for such incidents in space. This preparedness is what we call Chaos Engineering. IBM has a proud legacy, with IBM computers being a central part of those chaos experiments, which helped us reach the moon.
Since then, we have come to handle around 30,000 highly regulated customers, 40,000 mission-critical applications, and 60,000 billion yearly financial transactions. IBM does a variety of things for its customers: it is a hyperscale cloud provider across IaaS, FaaS, SaaS, PaaS, CaaS, and so on, and it provides on-premise hardware and software, managed services, and solutions.
CHALLENGES
- Regulated Industries:
Some of the industries IBM deals with are highly regulated, which leads to our clients being air-gapped. Updating to a new version is not as simple as downloading it: a new container has to be delivered, and restarting a pod then picks up the new container. The limitation is that new content can be introduced into that container only every three months.
- Technical Debt:
The applications in use in today's infrastructure date back as far as 20-30 years, and our Chaos Engineering tests have to run against them. We need to understand such monolithic applications before we can proceed.
- Culture & Skills:
Because of the challenges above, the IBM team has little time to focus on building the culture and climate of the working environment. Repeating the same experiments without modifying them leaves no chance of learning anything new.
Robert beautifully quotes Haytham Elkoja, a Chaos Commander, as saying, "Everything in today's infrastructure is not 100% efficient, things tend to break, and hence, we have to improvise and plan for betterment. The applications must mitigate these four f's: fires, floods, fools, and fat-fingers."
IBM's stepping stones have always been adding resilience, reducing risks, and testing for probable failures in its systems.

IBM'S PRINCIPLES OF CHAOS ENGINEERING
- Strengthen Reliability Discipline:
Most of our clients have been using our services for a while now, so one main focus is strengthening reliability: adding a layer of Chaos Engineering and running tests to make sure the applications are resilient.
- Understanding the systems:
IBM's systems are very complex and spread across multiple technological domains, so running chaos experiments can be a bit tricky. You have to understand both the makeup of each system and the interfaces it has.
- Experiment on every component:
Once the systems are analyzed and understood, we make sure to experiment on every component to keep reliability in check.
- Strive for production:
Contractual, regulatory, complexity, or risk factors may hinder running experiments fully in production, but the aim is to get as close to production as those factors allow.
- Contain the impact:
Defining a blast radius for each chaos experiment helps us contain the impact in case something goes wrong during the experiment.
- Measure, learn, and improve:
After the experiment completes, we measure the key performance indicators and study the results to find ways to improve in forthcoming experiments.
- Increase complexity gradually:
Gradually increasing the blast radius and the number of components involved in the experiment helps the team learn and makes the process better for the systems.
- Socialize continuously:
Everyone should be kept aware of the experiments, from the most senior stakeholder executive responsible for system downtime to the people in the trenches doing the operations or development work.
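The "Contain the impact" and "Increase complexity gradually" principles can be pictured with a small Python sketch. This is only an illustration and not IBM's tooling: the `Stage` class, the `run_stages` function, the component names, and the `fake_experiment` stand-in are all hypothetical. The idea shown is that each stage widens the blast radius, and the run stops at the first stage where the steady state does not hold.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    targets: List[str]  # components inside the blast radius for this stage


def run_stages(stages: List[Stage], run_experiment: Callable[[Stage], bool]) -> List[str]:
    """Run stages in order of growing blast radius; stop at the first failure.

    run_experiment is a hypothetical callback that injects the fault into
    stage.targets and returns True if the steady state held.
    """
    completed = []
    for stage in stages:
        if not run_experiment(stage):
            # Contain the impact: never widen the blast radius after a failure.
            break
        completed.append(stage.name)
    return completed


# Illustrative plan: every stage touches more components than the previous one.
plan = [
    Stage("single pod", ["checkout-pod-1"]),
    Stage("one service", ["checkout-pod-1", "checkout-pod-2"]),
    Stage("two services", ["checkout-pod-1", "checkout-pod-2", "payments-pod-1"]),
]


def fake_experiment(stage: Stage) -> bool:
    """Stand-in for a real experiment: anything touching payments fails."""
    return not any(t.startswith("payments") for t in stage.targets)


print(run_stages(plan, fake_experiment))  # ['single pod', 'one service']
```

In a real setup the callback would call whatever chaos tool the team uses; the point of the sketch is only the ordering and the early stop.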
IBM'S CHAOS ENGINEERING METHODOLOGY
Robert goes on to briefly explain the basic steps that IBM follows to keep its chaos experiments flowing smoothly.
- Understand System End-to-End
  - Break the solution into layers, components, and their topology.
  - Understand reliability requirements.
  - Review previous outages, failures, bugs, RCAs, and postmortems.
  - Identify weak points in the system and their potential impact.
  - Understand the existing monitoring.
  - Review and maintain backlog for experiments.
- Gain Organizational Agreement
  - Inform, enable and socialize early.
  - Identify stakeholders.
  - Reconfirm strategic intent.
  - Respect SLOs/SLIs and error budgets.
  - Highlight track record of safety, success, and rigor.
  - Anticipate objections.
- Create hypotheses and plan experiments
- Enable Observability
  - Identify tools for observability.
  - Consider observability factors such as metrics, logs, and traces.
  - Capture baseline data before the experiment.
  - Set up alerts for exceptional values.
- Prepare your experiment
- Run chaos experiments
- Analyze the results to prove or disprove
- Communicate Findings and Improve
  - Grow the blast radius and repeat the experiments.
- Expand to production
  - Start with game-days in production.
  - Inject chaos engineering into the CI/CD Pipelines.
  - Add 'Auto-Pilot' mode.
  - Access processes in place.
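Several of the steps above (respecting error budgets, capturing baseline data, creating a hypothesis, and analyzing the results to prove or disprove it) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method from the talk: the function names, the latency-based hypothesis, the 20% tolerance, and the 50% minimum remaining budget are all hypothetical choices made for the example.

```python
from statistics import mean
from typing import Callable, List


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget to spend on experiments
    return 1.0 - failed / allowed_failures


def hypothesis_holds(baseline: List[float], observed: List[float], tolerance: float) -> bool:
    """Assumed hypothesis: mean latency under fault stays within `tolerance` of the baseline mean."""
    return mean(observed) <= mean(baseline) * (1.0 + tolerance)


def run_experiment(
    baseline: List[float],
    inject_fault: Callable[[], List[float]],
    slo_target: float,
    total: int,
    failed: int,
    tolerance: float = 0.2,
    min_budget: float = 0.5,
) -> str:
    # Respect SLOs and error budgets: skip when too little budget is left.
    if error_budget_remaining(slo_target, total, failed) < min_budget:
        return "skipped"
    # inject_fault is a hypothetical callback returning latencies measured under fault.
    observed = inject_fault()
    return "proved" if hypothesis_holds(baseline, observed, tolerance) else "disproved"


# Baseline latencies (ms) captured before the experiment; values are made up.
baseline = [100.0, 110.0, 90.0]

print(run_experiment(baseline, lambda: [105.0, 115.0, 110.0], 0.999, 100_000, 20))  # proved
print(run_experiment(baseline, lambda: [150.0, 160.0, 170.0], 0.999, 100_000, 20))  # disproved
print(run_experiment(baseline, lambda: [105.0, 115.0, 110.0], 0.999, 100_000, 80))  # skipped
```

With a 99.9% SLO over 100,000 requests, about 100 failures are allowed, so 20 failures leave roughly 80% of the budget and the experiment runs, while 80 failures leave roughly 20% and it is skipped.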

"If I were to summarize the whole talk into just one sentence, I would say 'To succeed, it's vital to present Chaos Engineering as a mature and rigorous engineering method.'" With that, Robert concludes his talk.