Chaos Testing Red Hat OpenShift Virtualization


About the speaker

Jordi Gil

Senior Software Engineer,

Red Hat

Jordi is a Senior Software Engineer at Red Hat. For the past 5 years, he has been involved in a myriad of projects related to cloud technologies, including implementing a highly efficient distributed log aggregation solution around Prometheus for a legacy system and overseeing the development of a generic Helm Operator using the Operator SDK.

About the talk

A talk on chaos testing Red Hat OpenShift Virtualization.

Transcript

Chaos Testing Red Hat OpenShift Virtualization

"Our objective was simple, to test the maturity of the Red Hat Openshift Virtualization solution, and Litmus helped us achieve the same." This is how Jordi Gil, a senior Software Engineer in Red Hat started his talk at Chaos Carnival and will later on go to explain how Litmus has been beneficial to them.

"We wanted to anticipate how the platform would behave in certain stressful scenarios that would impact production clusters, learn from its behavior, and report issues back to the project," Gil explains.

TESTING ENVIRONMENT

Red Hat OpenShift Virtualization is a technology that allows you to deploy your virtual machines as pods on top of an OpenShift cluster. It is Red Hat's commercial solution built on the open-source project KubeVirt. To test OpenShift Virtualization, the team ran OpenShift on a single machine, using libvirt/QEMU to create the node VMs and deploying OpenShift Virtualization on top. Due to hardware constraints, they had only one OpenShift cluster available to run the tests.

This became a bottleneck as they started adding long-running tests to the suite that interacted with the nodes at the operating system level, while at the same time connecting directly over SSH to manage the VMs that host the OpenShift nodes. The team leveraged libvirt/QEMU commands, which allowed them to perform actions such as suspending or restarting a VM, and even managing its storage and network interface cards.
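
A minimal sketch of what such node-level actions might look like, driving libvirt over SSH with virsh. The hypervisor host, SSH user, domain name, and MAC address below are placeholders, not values from the talk:

```bash
#!/usr/bin/env bash
# Hypothetical example: driving the node VMs through libvirt/QEMU over SSH.
# HYPERVISOR_HOST, SSH_USER, NODE_DOMAIN and the MAC address are placeholders.
HYPERVISOR_HOST="hypervisor.example.com"
SSH_USER="root"
NODE_DOMAIN="ocp-worker-0"

# Suspend (pause) the VM that backs an OpenShift node, then resume it.
ssh "${SSH_USER}@${HYPERVISOR_HOST}" "virsh suspend ${NODE_DOMAIN}"
sleep 60
ssh "${SSH_USER}@${HYPERVISOR_HOST}" "virsh resume ${NODE_DOMAIN}"

# Hard power-off and restart the node VM.
ssh "${SSH_USER}@${HYPERVISOR_HOST}" "virsh destroy ${NODE_DOMAIN}"
ssh "${SSH_USER}@${HYPERVISOR_HOST}" "virsh start ${NODE_DOMAIN}"

# Detach a network interface card from the VM (NIC-failure scenario).
ssh "${SSH_USER}@${HYPERVISOR_HOST}" \
  "virsh detach-interface ${NODE_DOMAIN} bridge --mac 52:54:00:12:34:56"
```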

Red Hat wanted to make the testing closer to real usage, so the team built a program that constantly generated a low synthetic load during test execution, enough to put some extra load on the cluster. The test logic was developed using Bash shell scripting.
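
The talk does not show the load generator itself; a minimal Bash sketch of the idea, in which the namespace, image, and timings are illustrative assumptions, might look like this:

```bash
#!/usr/bin/env bash
# Hypothetical low, constant synthetic load loop; the namespace, image, and
# timings are illustrative assumptions, not details from the talk.
NAMESPACE="chaos-load"
oc new-project "${NAMESPACE}" 2>/dev/null || true

while true; do
  # Launch a short-lived pod that burns a little CPU, then exits.
  oc run "load-$(date +%s)" \
    --namespace="${NAMESPACE}" \
    --image=registry.access.redhat.com/ubi8/ubi-minimal \
    --restart=Never \
    --command -- sh -c 'i=0; while [ $i -lt 500000 ]; do i=$((i+1)); done'
  sleep 30
  # Remove completed pods so the load stays low and steady.
  oc delete pod --namespace="${NAMESPACE}" \
    --field-selector=status.phase=Succeeded --ignore-not-found
done
```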

The team would stand up the environment just before creating the ChaosEngine; in some cases this required interacting with the nodes to prepare them for the test. After the test was completed, they executed a cleanup script that reverted the changes made for the test.
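
Put together, a single run might be orchestrated roughly like the sketch below; the script names, manifest file, and ChaosEngine name are hypothetical, not the team's actual files:

```bash
#!/usr/bin/env bash
# Hypothetical orchestration of one test run: prepare, inject chaos, clean up.
set -euo pipefail

./prepare-nodes.sh                            # e.g. tweak node settings over SSH
oc apply -f chaosengine-node-restart.yaml     # create the ChaosEngine

# Poll until the engine reports that the experiment run has completed.
until [ "$(oc get chaosengine node-restart-chaos \
            -o jsonpath='{.status.engineStatus}')" = "completed" ]; do
  sleep 30
done

./cleanup.sh                                  # revert changes made for the test
```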

WHY LITMUSCHAOS

Gil goes on to explain his experience of choosing Litmus as their chaos injection tool: "Before we selected Litmus, the team did field research on the different solutions available at the time for cloud-native chaos testing. Litmus is open source, and at that time it already had a wide selection of existing experiments to leverage. It is a CNCF Sandbox project and it has a vibrant community of like-minded people. There are frequent releases and it is very well maintained."

USING LITMUSCHAOS

"We started using litmus back in September 2020. Our first test cases were leveraging single experiments, and early on we discovered that the pros were the best way to validate the status of experiments before, during, and after the chaos was injected. The current results optics provide a single source of test data due to our needs. Under test subjects, we developed the node-restart experiment which allowed us to restart the node VM by invoking the restore command directly on the node operating system. Shortly after the node power off was created, it reused the node-restart code." Gil adds

TYPES OF EXPERIMENTS

Gil goes on to explain, in a lucid and detailed way, the various Litmus experiments used against Red Hat OpenShift Virtualization:

  1. Pod: Pod Delete (a minimal manifest sketch for this experiment follows the list), Pod Autoscaler
  2. Node: Node Restart, Node Drain, Node Poweroff, Node Suspend
  3. Network: Pod Network Loss, Network Disruption, NIC Failure
  4. Storage: Disk IO Failure, NFS Server Restart, OCS Pod Delete
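
As an illustration of how one of these is wired up, here is a minimal, hypothetical ChaosEngine for the pod-delete experiment; the namespace, application label, and service account name are assumptions, not values from the talk:

```bash
# Hypothetical pod-delete ChaosEngine; namespace, label and service account
# are illustrative assumptions.
oc apply -f - <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: vm-workload-chaos
  namespace: target-ns
spec:
  appinfo:
    appns: target-ns
    applabel: app=my-vm-workload
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  engineState: active
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
EOF
```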

LESSONS LEARNED

  1. Optimizing execution time: Always try to find ways to reduce the time it takes for your changes to be tested.
  2. Choose the right path and language: Use a language that will allow you to develop and maintain your tests with confidence.
  3. Using probes for help: Capture your chaos test results using only the ChaosResult object.
  4. Contribute to Litmus: The LitmusChaos community has always had a welcoming approach towards changes suggested by its users.
  5. Keep your test runs free from flaky tests: Invest early in trying to break your framework so that odd test behavior surfaces before it pollutes your results.
  6. Validate test results early: A test result is not valid until it has failed.
  7. Validate the status of your bugs at runtime: Checking bug status by hand is a tedious manual task, so automate it in your code (see the sketch after this list).
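
The talk does not show how this was automated; one way to sketch the idea, assuming the bugs are tracked in Red Hat Bugzilla and using a placeholder bug ID, is to query the tracker at runtime before asserting on a known-broken behavior:

```bash
#!/usr/bin/env bash
# Hypothetical runtime bug-status check against Red Hat Bugzilla; the bug ID
# is a placeholder. If the bug is still open, skip the affected assertion.
BUG_ID="1900000"

STATUS=$(curl -s \
  "https://bugzilla.redhat.com/rest/bug/${BUG_ID}?include_fields=status" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["bugs"][0]["status"])')

if [ "${STATUS}" = "CLOSED" ] || [ "${STATUS}" = "VERIFIED" ]; then
  echo "Bug ${BUG_ID} is ${STATUS}: enforcing the strict check"
else
  echo "Bug ${BUG_ID} is still ${STATUS}: skipping the affected assertion"
fi
```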

FEEDBACK FOR LITMUS

While concluding his talk, Gil gives some valuable feedback based on his experience of using LitmusChaos: "I think that some types of experiments might benefit from injecting the same chaos intermittently. Expanding probe chaining capabilities to all types of probes would benefit the way an experiment's execution is orchestrated. Conditional probe chaining would work well if the condition applies only to the chained probe. These are some candid ways forward for the Litmus team that I would like to mention."

