IBM's Principles of Chaos Engineering

About the speaker

Robert Barron

AIOps, Technology Assets & Architecture,

IBM

Robert is an SRE, AIOps and ChatOps evangelist who enjoys helping others solve problems even more than he enjoys solving them himself. With over 20 years of experience in IT development & operations (13+ of them with IBM), he is happiest when learning something new. Robert has spoken at global conferences such as IBM Think, SRECon & others. He publishes at flyingbarron.medium.com and his social media handle is @flyingbarron. Robert lives in Israel with his wonderful wife and two children.

About the talk

IBM has spent over a century improving the reliability of systems ranging from the largest of mainframes to the smallest of microservices. As part of cultural and organisational improvements we've codified a list of principles which define our view of Chaos Engineering.

IBM's Principles of Chaos Engineering do not replace existing principles, but adapt them and match them to the requirements we have from our clients and from our own internal services. In this session we will describe a little of the process of getting engineers from across IBM to agree on these principles (herding cats is child's play in comparison) and present the principles and lessons which we agreed upon.

Most Chaos Engineering examples come either from born-on-the-cloud companies or from companies which are significantly smaller than IBM. In this session attendees will learn about adopting Chaos Engineering in a highly complex environment, from both a technical and a cultural perspective. We believe that the principles we present will be relevant to all types of environments and organisations of all sizes.

We will also touch on IBM's Chaos Engineering method that allowed internal IBM SREs to successfully build and maintain reliable mission critical services and showcase the benefits of a well formed and measurable Chaos Engineering program for digital transformation engagements.

Transcript

For yet another insightful session on day 2 of Chaos Carnival 2021, we had with us Robert Barron of AIOps, Technology Assets & Architecture at IBM, speaking about "IBM's Principles of Chaos Engineering".

IBM has been in the market for more than a hundred years, working with computers and computing devices throughout. NASA's improvised response to the explosion on Apollo 13 shows why components must be prepared for such incidents in space, and this preparedness is what we call Chaos Engineering. IBM has a proud legacy here, with IBM computers being a central part of the chaos experiments that helped us reach the moon.
Since then, IBM has come to serve around 30,000 highly regulated customers, 40,000 mission-critical applications, and 60,000 billion yearly financial transactions. IBM does a variety of things for its customers: it is a cloud hyperscaler offering IaaS, FaaS, SaaS, PaaS, CaaS, etc., and it provides on-premises hardware and software, managed services, and solutions.

CHALLENGES

Regulated Industries:

Some of the industries IBM works with are highly regulated, which leads to many of our clients being air-gapped. Updating to a new version is not as simple as downloading it. A new container has to be brought into the environment first, and only then will restarting a pod pick up the new container. The limitation here is that new content can be brought in only once every three months.

Technical Debt:

Some of the applications used in today's infrastructure date back 20-30 years, and our Chaos Engineering tests have to run against them as well. We need to understand these monolithic applications before we can proceed.

Culture & Skills:

Because of the challenges above, the IBM team has little time left to focus on building the culture and climate of the working environment. Repeating the same experiments without modifying them leaves no chance of learning anything new.

Robert quotes Haytham Elkoja, a Chaos Commander: "Everything in today's infrastructure is not 100% efficient, things tend to break, and hence, we have to improvise and plan for betterment. The applications must mitigate these five f's: fires, floods, fools, and fat-fingers."

IBM's stepping stones have always been to add resilience, reduce risk, and test for probable failures in its systems.

IBM'S PRINCIPLES OF CHAOS ENGINEERING

Strengthen Reliability Discipline:

Most of our clients have been using our services for a long time, so one of our main focuses is strengthening their existing reliability practices: adding a layer of Chaos Engineering and running tests to make sure the applications are resilient.

Understanding the systems:

IBM's systems are very complex and are spread across multiple technological domains, so running chaos experiments can be tricky. You have to understand both the makeup of each system and the interfaces it has.

Experiment on every component:

Having analyzed and understood your systems, make sure you experiment on every component to keep reliability in check.

Strive for production:

Aim to run experiments in production, while recognizing that contractual, regulatory, complexity, or risk factors may prevent you from getting all the way there.

Contain the impact:

Defining a blast radius for each chaos experiment helps contain the impact in case something goes wrong during the experiment.

Measure, learn, and improve:

After an experiment completes, measure the key performance indicators and study the results to find ways to improve on them in forthcoming experiments.

Increase complexity gradually:

Gradually increasing the blast radius and the number of components involved in an experiment helps the team learn and improves the process for the systems.
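
As a minimal sketch of how containing the impact and widening it gradually might fit together, the Python below grows a blast radius in stages and stops at the first stage where the system looks unhealthy. The function names, the stage sizes, and the health check are hypothetical and for illustration only; they are not taken from the talk or from IBM's tooling.

```python
from typing import Callable, List


def run_staged_experiment(
    hosts: List[str],
    stages: List[float],
    inject_fault: Callable[[List[str]], None],
    is_healthy: Callable[[], bool],
) -> float:
    """Widen the blast radius stage by stage; return the largest fraction that passed.

    All names are hypothetical. `stages` holds fractions of `hosts`
    (0.1 means 10% of hosts), listed in increasing order.
    """
    largest_passed = 0.0
    for fraction in stages:
        # At least one host per stage, never more than the whole fleet.
        count = max(1, min(len(hosts), round(len(hosts) * fraction)))
        inject_fault(hosts[:count])
        if not is_healthy():
            # Contain the impact: stop widening as soon as health degrades.
            break
        largest_passed = fraction
    return largest_passed


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(20)]
    affected: List[str] = []

    def fake_fault(targets: List[str]) -> None:
        # Stand-in for a real fault injection: just record the targets.
        affected.clear()
        affected.extend(targets)

    # Pretend the system stays healthy while at most 5 hosts are affected.
    passed = run_staged_experiment(
        fleet, [0.05, 0.25, 0.5], fake_fault, lambda: len(affected) <= 5
    )
    print(passed)  # 0.25
```

In a real setting the health check would come from the monitoring already in place, and a failed stage would also trigger a rollback of the injected fault.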

Socialize continuously:

Everyone should be kept aware of the experiments, from the most senior executive stakeholder who is responsible for system downtime to the people in the trenches doing the operations or development work.

IBM'S CHAOS ENGINEERING METHODOLOGY

Robert goes on to briefly explain the basic steps that IBM follows to keep its chaos experiments running smoothly.

Understand System End-to-End

Break the solution into layers, components, and their topology.
Understand reliability requirements.
Review previous outages, failures, bugs, RCAs, and postmortems.
Identify weak points in the system and their potential impact.
Understand the existing monitoring.
Review and maintain backlog for experiments.

Gain Organizational Agreement

Inform, enable and socialize early.
Identify stakeholders.
Reconfirm strategic intent.
Respect SLOs/SLIs and error budgets.
Highlight track record of safety, success, and rigor.
Anticipate objections.
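
To make "respect SLOs/SLIs and error budgets" concrete, here is a small sketch of the usual availability arithmetic: an SLO of 99.9% over a 30-day window leaves 0.1% of that window as error budget. The function names and the rule of running an experiment only when its expected impact fits in the remaining budget are assumptions for illustration, not a policy described in the talk.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def can_run_experiment(
    slo: float,
    downtime_so_far_minutes: float,
    expected_impact_minutes: float,
    window_days: int = 30,
) -> bool:
    """Hypothetical gate: allow the experiment only if its expected impact
    fits inside the error budget that is still unspent in this window."""
    remaining = error_budget_minutes(slo, window_days) - downtime_so_far_minutes
    return expected_impact_minutes <= remaining


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))   # 43.2
    print(can_run_experiment(0.999, 30.0, 10.0))   # True: 13.2 minutes remain
    print(can_run_experiment(0.999, 40.0, 10.0))   # False: only 3.2 minutes remain
```

A gate like this gives stakeholders a simple, measurable answer to the objection that experiments put the service at risk.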

Create hypotheses and plan experiments

Enable Observability

Identify tools for observability.
Consider observability factors such as metrics, logs, and traces.
Capture baseline data before the experiment.
Set up alerts for exceptional values.
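
The last two items can be sketched in a few lines: capture a baseline for a metric before the experiment, then flag values that fall far outside it. The three-standard-deviation threshold and all names below are illustrative assumptions; a real setup would normally rely on the alerting rules of whichever observability tool is in use.

```python
from statistics import mean, stdev
from typing import List, Tuple


def capture_baseline(samples: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of pre-experiment samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples for a baseline")
    return mean(samples), stdev(samples)


def exceptional_values(
    values: List[float], base_mean: float, base_std: float, k: float = 3.0
) -> List[float]:
    """Values more than k standard deviations away from the baseline mean."""
    return [v for v in values if abs(v - base_mean) > k * base_std]


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    before = [100.0, 102.0, 98.0, 101.0, 99.0]
    during = [101.0, 104.0, 180.0]
    base_mean, base_std = capture_baseline(before)
    print(exceptional_values(during, base_mean, base_std))  # [180.0]
```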

Prepare your experiment

Run chaos experiments

Run game-day activities in pre-execution, execution, and post-execution stages.
Continual experimenting

Analyze the results to prove or disprove the hypothesis

Document the findings.
Evaluate the experiment results.
Draw conclusions.
Decide whether the test was sufficient or whether new or different tests are required.
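
One way to keep this step rigorous is to write the hypothesis down as a measurable bound before the experiment and record the verdict next to the observation afterwards. The sketch below assumes a hypothesis of the form "metric stays at or below a limit during the fault"; the class, the field names, and the follow-up wording are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Hypothesis:
    description: str
    metric: str
    max_allowed: float  # steady-state upper bound for the metric


def evaluate(hypothesis: Hypothesis, observed: float) -> Dict[str, Union[str, float]]:
    """Build a finding record stating whether the hypothesis held."""
    held = observed <= hypothesis.max_allowed
    return {
        "hypothesis": hypothesis.description,
        "metric": hypothesis.metric,
        "observed": observed,
        "verdict": "supported" if held else "disproved",
        # Illustrative next steps only; the real decision belongs to the team.
        "follow_up": "widen the experiment" if held else "fix and re-test",
    }


if __name__ == "__main__":
    h = Hypothesis(
        description="Checkout stays responsive when one database replica is lost",
        metric="p95_latency_ms",
        max_allowed=250.0,
    )
    finding = evaluate(h, observed=310.0)
    print(finding["verdict"])    # disproved
    print(finding["follow_up"])  # fix and re-test
```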

Communicate Findings and Improve

Publish and communicate results
Address the findings
Validate

Grow the blast radius and repeat the experiments

Expand to production

Start with game-days in production.
Inject chaos engineering into the CI/CD Pipelines.
Add 'Auto-Pilot' mode.
Access processes in place.

Robert concludes his talk by saying, "If I were to summarize the whole talk into just one sentence, I would say 'To succeed, it's vital to present Chaos Engineering as a mature and rigorous engineering method.'"
