Event-Driven Chaos Injection

About the speakers

Raj Babu Das

Software Engineer,

ChaosNative

Currently working at ChaosNative, Raj is busy building software. In the past, he worked on a wide range of technologies with many startups and has over three years of hands-on experience with cloud technologies such as Serverless, Kubernetes, Prometheus, and many more.

Soumya Ghosh Dastidar

Software Engineer,

ChaosNative

Soumya currently works at ChaosNative as a Chaos Engineer and contributor to Litmus. He helps design and build backend architecture. He is not only passionate about development but also curious about breaking stuff with the practice of Chaos Engineering.

About the talk

Running chaos-experiments manually or in CI/CD pipelines? We'll show you how you can configure your environment to trigger chaos-experiments automatically with changes in application state using Litmus' powerful event-driven chaos injection features and seamless Git integration.

The need for chaos engineering has grown over the years and is now well understood. According to a prediction by a CNCF TOC member, chaos engineering will be one of the important technologies in 2021. Its usage in CI/CD pipelines, implemented via 'chaos stages' in frameworks such as Spinnaker & Keptn, is further proof of Chaos Engineering's ubiquity over the earlier ops-driven, AWS game day oriented models.

With the above use cases, many tools are finding alternative ways to induce chaos into the system. Here, we will discuss event-driven chaos injection, which triggers chaos based on any specific change in the application's manifest. We will show it alongside GitOps, a practice in which changes are pushed to a Git source that houses the deployment artifacts; these are picked up by a GitOps operator (e.g., ArgoCD/WeaveFlux) that updates the Kubernetes cluster, ensuring a synced declarative source of truth for the application. Much as the application is tested in dev environments, there is undeniable value in subjecting it to controlled failure in staging/prod environments, especially upon being updated with new specifications. These experiments offer a nuanced view of stability that can trigger rollbacks or alert stakeholders to potential issues.

In this talk, Soumya Ghosh Dastidar & Raj Babu Das will demonstrate an event-driven chaos injection mechanism that works alongside standard GitOps tooling, wherein subscribing applications are automatically injected with chaos upon being updated in the cluster. The presentation will also discuss how Git is used as the source of truth for the chaos workflows themselves, extending the GitOps maxim of 'All Operations on Git' to chaos too!

Transcript

EVENT-DRIVEN CHAOS INJECTION

"They are not only passionate about development but also curious about breaking things with the practice of Chaos Engineering." This is how moderator Ajesh Baby introduced Raj Babu Das and Soumya Ghosh Dastidar, who delivered a talk on Event-Driven Chaos Injection on day 1 of Chaos Carnival 2021.

Raj kicks off the session by defining Chaos Engineering: "Chaos Engineering is how confident we are in making something in production and making it resilient in the future."

He goes on to explain some basics of working with Chaos.

SysAdmins, SREs, DevOps Engineers, etc. are the ones who practice it.
Recent practitioners of Chaos testing include Google, Slack, GitHub, etc.
Some of the tools to perform Chaos injection are Chaos Monkey, LitmusChaos, and Gremlin.

INTRODUCTION TO EVENT-DRIVEN CHAOS INJECTION:

There are several possible ways of injecting chaos into a system or application: manual injection; scheduled injection, where chaos is scheduled for a particular time; random injection, where chaos is injected at any point in time to check reliability; injection in the CI/CD pipeline, where chaos is introduced into the system at the CI stage; and event-driven injection. The speakers go on to address event-driven chaos injection in depth in their talk.

So, what exactly is Event-Driven Chaos Injection?

Addressing this very question, Raj explains, "Event-Driven Chaos Injection is a type of Chaos Injecting method where chaos is triggered based on a particular event. Events can vary from changes in configuration, changes in replicas, etc. A policy can be initiated, which will then manage the events and detect a fault on immediate change in configuration."
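The idea Raj describes can be illustrated with a minimal sketch. This is not Litmus code: the Event and ChaosPolicy names, the event kinds, and the action strings are all hypothetical, chosen only to show a policy that reacts to particular events by triggering chaos.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "config-change" or "replica-change" (hypothetical kinds)
    resource: str  # name of the resource that changed


class ChaosPolicy:
    """Hypothetical policy that maps event kinds to chaos actions."""

    def __init__(self) -> None:
        self._actions: Dict[str, List[Callable[[Event], str]]] = {}

    def on(self, kind: str, action: Callable[[Event], str]) -> None:
        """Register a chaos action to run whenever an event of this kind occurs."""
        self._actions.setdefault(kind, []).append(action)

    def handle(self, event: Event) -> List[str]:
        """Run every action registered for the event's kind; unknown kinds do nothing."""
        return [action(event) for action in self._actions.get(event.kind, [])]


policy = ChaosPolicy()
policy.on("replica-change", lambda e: f"inject pod-delete into {e.resource}")
print(policy.handle(Event("replica-change", "web-app")))
```

Events with no registered action are simply ignored, which mirrors the point that chaos is triggered only by the particular events the policy cares about.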

He continues, "Litmus web-based Portal is a part of the LitmusChaos project, which allows you to connect your remote Kubernetes Cluster and inject chaos in that cluster, and is currently used in the beta version."

The Litmus Portal is divided into two major clusters: Red and Blue. The red cluster consists of the Litmus Portal components, while the blue cluster's components are responsible for connecting to the red one.

ARCHITECTURE OF EVENT-DRIVEN CE:

Components of Red Cluster (Litmus Portal):

Web UI: Web components that allow the external user to interact with the dashboards.
Auth Server: Server responsible for managing all the authentication-related activities.
MongoDB: Used for persisting the data.
GraphQL Server: Server used to manage the API requests; it also serves as the connection point between the red and blue clusters.
GitHub Repository: An external component used for GitOps purposes.

Components of Blue Cluster:

Subscriber: First component that interacts with the GraphQL server by accepting its requests.
Litmus Operator: Required for LitmusChaos to inject chaos.
Argo Server & Workflow Controller: Used for managing Argo workflows while creating chaos. Argo workflows have a set of experiments and a chaos engine defined which, after implementation, will inject chaos on a sequential basis.
Event Tracker: Tracks resource changes in the Kubernetes cluster, namely changes to Deployments, StatefulSets, and DaemonSets. Resources are annotated with a GitOps field, which takes a boolean value, and a workflow field, which takes a string value. You have to enable GitOps in your system for resources to be tracked by the Event Tracker.
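The Event Tracker's eligibility check can be sketched roughly as follows. The annotation keys below are assumptions and should be verified against the Litmus documentation; the set of tracked kinds (Deployment, StatefulSet, DaemonSet) and the function name are likewise illustrative rather than taken from the Litmus source.

```python
from typing import Mapping, Optional

# Assumed annotation keys; check the Litmus documentation for the exact names.
GITOPS_KEY = "litmuschaos.io/gitops"
WORKFLOW_KEY = "litmuschaos.io/workflow"

# Assumed set of resource kinds that the tracker watches.
TRACKED_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}


def workflow_to_trigger(kind: str, annotations: Mapping[str, str]) -> Optional[str]:
    """Return the workflow ID to run for a changed resource, or None if untracked."""
    if kind not in TRACKED_KINDS:
        return None
    # Kubernetes annotation values are strings, so the boolean arrives as text.
    if annotations.get(GITOPS_KEY, "false").lower() != "true":
        return None
    workflow_id = annotations.get(WORKFLOW_KEY, "").strip()
    return workflow_id or None


print(workflow_to_trigger("Deployment", {GITOPS_KEY: "true", WORKFLOW_KEY: "wf-123"}))
```

A resource is ignored unless both conditions hold: GitOps is switched on for it and a workflow ID is present.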

"For example, we have a web application, named App V1. We need to upgrade App V1 to App V2 (which is being termed as an event), which will be tracked by the event tracker. This request will be sent to the GraphQL server with a particular workflow ID. The workflow is then sent to the Subscriber and the workflow is applied. Now you can check the resiliency of the web application using a PodDelete experiment on every configuration change; hence, fewer chances of failure," Raj explains.
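The flow in this example can be simulated in memory as a rough sketch. The class and method names are hypothetical stand-ins for the Event Tracker, GraphQL server, and Subscriber; no real Litmus, GraphQL, or Kubernetes API is called, and the image tags and workflow ID are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subscriber:
    """Stand-in for the Subscriber in the blue cluster."""
    applied: List[str] = field(default_factory=list)

    def apply_workflow(self, workflow_id: str) -> None:
        # A real subscriber would create the chaos workflow in the cluster;
        # here we only record that it was applied.
        self.applied.append(workflow_id)


@dataclass
class GraphQLServer:
    """Stand-in for the GraphQL server that links the two clusters."""
    subscriber: Subscriber

    def trigger_workflow(self, workflow_id: str) -> None:
        self.subscriber.apply_workflow(workflow_id)


@dataclass
class EventTracker:
    """Stand-in for the Event Tracker watching an annotated application."""
    server: GraphQLServer
    workflow_id: str

    def on_update(self, old_image: str, new_image: str) -> bool:
        """Send the workflow ID to the server only if the application changed."""
        if old_image == new_image:
            return False
        self.server.trigger_workflow(self.workflow_id)
        return True


subscriber = Subscriber()
tracker = EventTracker(GraphQLServer(subscriber), workflow_id="pod-delete-workflow")
tracker.on_update("app:v1", "app:v2")  # the upgrade from App V1 to App V2 is the event
print(subscriber.applied)
```

Keeping the tracker unaware of how the workflow is applied matches the architecture described above, where the GraphQL server is the only connection point between the two clusters.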
