Chaos on VMWare Infrastructure using Litmus

ChaosWheel

About the speaker

Delphine Joyneer

Senior Software Engineer,

Wipro

Delphine is from the Infrastructure Quality Assurance team at Wipro. She has worked extensively on developing tools that assure the infrastructure of both private and public clouds.

About the talk

To gauge the capability of a VMware environment, we introduce chaos into it based on different parameters. The chaos is introduced using Litmus, which already has a well-defined console/UI for injecting chaos.

Over three-quarters of all businesses make use of server virtualization, and the most commonly used solution for server virtualization is VMware. To check a system's capability to withstand turbulent and unexpected conditions, we need to introduce chaos and check how it responds. We are designing workflows that test the VMware environment based on different parameters such as compute, network, and storage.

Transcript

Chaos on VMWare Infrastructure using Litmus

Ankur Ghosh & Delphine Joyneer deliver an insightful session on 'Chaos on VMware infrastructure using Litmus' on the first day of Chaos Carnival 2021. They talk about Reliability as a Service (RaaS), Wipro's RaaS framework called iAssure, and more.

Ankur kickstarts this session by explaining Wipro RaaS. He defines reliability engineering as "the measures taken to improve resistance to failure over time, estimate expected life, and predict time to failure."

"Reliability as a Service (RaaS) Framework is a bunch of services that are designed to prevent failures of hardware and software components. It measures a system's resilience by analyzing the impact of systematically orchestrated chaos injections. Our RaaS framework evaluates a system with reliability scores calculated based on individual analysis of fault tolerance and performance benchmark attributes of the system," he adds.

WIPRO'S iASSURE: A RaaS FRAMEWORK

Ankur describes in detail the working of the RaaS framework in Wipro called iAssure.

"Test elements such as memory, network, storage, OS, middleware, applications, etc. are the areas where we will be applying chaos experiments. We set boundaries and define the scope in the test elements, which acts as the blast radius; after that, we identify the critical processes and activities and launch the chaos. Then we define which type of chaos experiment will take down the VM and push the test element to block any kind of request. We develop these experiments in the private cloud environment, and hence VMware Cloud is a major deal for us. The output is in the form of scores, which help us gauge the application resilience or the infrastructure resilience and identify the areas that are currently brittle, so we can fix them and make them highly available."

FEATURES OF iASSURE:

  1. Platform agnostic chaos injection.

  2. Tasks-based automation through scripts and Ansible playbooks.

  3. Historical data storage and analytics.

  4. Custom chaos is integrated through the Litmus portal, which acts as the centrifuge or the fulcrum of the whole chaos experiment. It also provides a way to integrate or onboard custom chaos experiments written for VMware.

  5. Typical chaos types such as CPU, memory, disk I/O faults, NIC faults, vCenter VM charge faults, file handle leaks, kernel panic, network packet delay, and network packet loss.

FINAL OUTCOME:

  1. Infrastructure application resilience quantification.

  2. Early prognosis of a device failure.

  3. Continuous feedback and improvement at every stage in the release cycle.

  4. SLA Adherence.

  5. Reliability attributes and indicators benchmark comparison.

KEY ATTRIBUTES USED TO PROVIDE RELIABILITY SCORE:

"Once the applications are taken down, we test the traffic, errors, latency, and saturation, and based on them we try to find whether the system is meeting the standards set by the previously established benchmark values. Hence, historical data is really important: if a server was catering to thousands of requests before it was struck by chaos, we need to keep a check on the number of queries it serves thereafter. If it is not consistent, we need to look into that to reach the right benchmark values," Ankur explains.
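The comparison against benchmark values described above can be sketched in a few lines of Python. This is only an illustration: the `Signals` type, the 10% tolerance, and the "percentage of attributes within tolerance" score are assumptions made for the example, not the scoring formula iAssure actually uses, which the talk does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical container for the four attributes named in the talk."""
    traffic_rps: float  # requests served per second
    error_rate: float   # fraction of requests that failed (0.0 to 1.0)
    latency_ms: float   # response latency in milliseconds
    saturation: float   # resource utilization (0.0 to 1.0)


def compare_to_benchmark(observed: Signals, benchmark: Signals,
                         tolerance: float = 0.10) -> dict:
    """Return, per attribute, whether the post-chaos value stays within
    the assumed tolerance of the historical benchmark value."""
    return {
        # Traffic should not drop more than `tolerance` below the benchmark.
        "traffic": observed.traffic_rps >= benchmark.traffic_rps * (1 - tolerance),
        # The other attributes should not rise more than `tolerance` above it.
        "errors": observed.error_rate <= benchmark.error_rate * (1 + tolerance),
        "latency": observed.latency_ms <= benchmark.latency_ms * (1 + tolerance),
        "saturation": observed.saturation <= benchmark.saturation * (1 + tolerance),
    }


def illustrative_score(results: dict) -> float:
    """Assumed scoring rule: percentage of attributes within tolerance."""
    return 100.0 * sum(results.values()) / len(results)


# Made-up numbers: historical benchmark vs. measurements after chaos.
benchmark = Signals(traffic_rps=1000, error_rate=0.001, latency_ms=200, saturation=0.6)
after_chaos = Signals(traffic_rps=950, error_rate=0.001, latency_ms=260, saturation=0.65)

results = compare_to_benchmark(after_chaos, benchmark)
score = illustrative_score(results)
```

With these made-up numbers, only latency falls outside the tolerance, which points at the attribute that needs attention before the next run.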

He goes on to add, "An SLA is an agreement between a provider and the client about unanimously decided parameters critical to business. For example, if a customer needs 99.99% availability for applications hosted on the infrastructure, to achieve this high availability we resolve the statement into different key attributes that we are going to monitor. Let's say an SLO of 0.5% errors is decided internally, and then we monitor the errors. Suppose the errors are observed to be at 0.1% of the requests across all the traffic that is generated or served from the server. Hence, the SLO is meeting the benchmark, along with the SLA."
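The arithmetic behind this example can be written out as a small Python sketch. The function names and the request counts are hypothetical; only the 99.99% availability, the 0.5% error SLO, and the 0.1% observed error rate come from the talk. The 30-day window used for the downtime budget is also an assumption.

```python
def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return failed_requests / total_requests


def meets_slo(observed_error_rate: float, slo_error_rate: float) -> bool:
    """The SLO is met when observed errors do not exceed the agreed limit."""
    return observed_error_rate <= slo_error_rate


def allowed_downtime_minutes(sla_availability: float, period_minutes: float) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - sla_availability) * period_minutes


# Figures from the talk: SLO of 0.5% errors, 0.1% observed.
# The request counts are made up to produce a 0.1% error rate.
observed = error_rate(total_requests=100_000, failed_requests=100)
slo_ok = meets_slo(observed, slo_error_rate=0.005)

# 99.99% availability over an assumed 30-day window.
budget = allowed_downtime_minutes(0.9999, period_minutes=30 * 24 * 60)
```

Here `observed` is 0.001 (0.1%), which is under the 0.5% limit, and the 99.99% target leaves a budget of roughly 4.32 minutes of downtime over 30 days.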

CLOSING

Ankur ends his talk with a demonstration on VMware Cloud using a very simple architecture: a load balancer with two VMs behind it, with queries routed in a round-robin way to web servers hosted on different VMs. The aim is to prove the resiliency of the application hosted on this infrastructure by taking down one VM with the help of chaos; once that is done, the application is shown to be resilient by launching queries and finding that it is still up and running through the remaining VM.

He concludes the talk by saying, "By working on the right feedback, we can achieve the desired resilience score for our applications on infrastructures."
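The behavior the demonstration relies on can be mimicked with a toy simulation in Python. This is a sketch, not the demo's actual setup: the `RoundRobinBalancer` class, the VM names, and the assumption that the load balancer skips backends it knows to be down are all hypothetical.

```python
class RoundRobinBalancer:
    """Toy round-robin load balancer that skips backends marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._index = 0

    def take_down(self, backend: str) -> None:
        """Simulate chaos: mark one backend (VM) as unavailable."""
        self.healthy.discard(backend)

    def route(self) -> str:
        """Return the next healthy backend in round-robin order."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._index]
            self._index = (self._index + 1) % len(self.backends)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends available")


# Two web servers on two VMs, as in the demo architecture (names assumed).
lb = RoundRobinBalancer(["vm-1", "vm-2"])
before_chaos = [lb.route() for _ in range(4)]

# Take one VM down, then keep sending queries.
lb.take_down("vm-1")
after_chaos = [lb.route() for _ in range(4)]
```

Before the fault, queries alternate between the two VMs; afterwards every query lands on the surviving VM, which is the outcome the demo sets out to show.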
