May 01, 2021

Cloud Native Reliability

Our journey and the path forward

Uma Mukkara

CEO, ChaosNative

uma@chaosnative.com

Exactly three years ago, we announced Litmus at KubeCon EU 2018. Some great people recognised it, and we were thrilled to see the CNCF announce details of the Chaos Engineering work group. The genesis of Litmus lies in the effort to build test strategies for another CNCF project, OpenEBS, but its growth and roadmap have been led by the interest and enthusiasm of the cloud native community itself. Seeing Chaos Engineering emerge as one of the trending technologies for 2021 was a definite encouragement for me to approach Chaos Engineering and reliability with more focus and to put more resources into our path forward. That was the beginning of ChaosNative.

ChaosNative mascot | Chaotik

Chaos Engineering has been known among some operations experts as the practice of willfully injecting faults. Guidelines on how to run chaos experiments and how to conduct game days as part of a chaos engineering strategy have been around for some time. From the cloud native perspective, however, chaos engineering can be seen as an approach to building reliability into the strategy for migrating to cloud native or operating in it.

Reliability in cloud native is important simply because of the dynamic nature of the changes that need to happen: cloud native demands dynamic changes to microservices. Some of the challenges that we wanted to solve with the Litmus project are:

  • Doing Chaos Engineering the cloud native way
  • Scaling it as per the cloud native norms
  • Lifecycle management of chaos experiments or scenarios or workflows
  • Observability of chaos with the cloud native lens

I have been referring to these in many places as the principles of Cloud Native Chaos Engineering (or CNCE). All of these base goals have to be met with teaming and collaboration among SREs or chaos users as a given. Over the last two years, we helped build these goals into the Litmus project one by one, listening to the community as contributions poured in across many areas. As a result, we think we now have a platform that allows SRE teams to build and manage chaos workflows from a single chaos control plane and target them at any resource in their multi cloud environment, with chaos observability easily integrated into any existing cloud native monitoring system.

As we worked through the building blocks of Litmus, we learnt from many large users that chaos for non-Kubernetes targets is as important as chaos for Kubernetes, primarily because of the hybrid nature of their large services. Large users often build the chaos engineering strategy for their overall business, in which only some applications are moving to a cloud native architecture. Reliability is usually considered at the level of a business service rather than an individual application microservice. We had to build this goal into Litmus as well. Litmus now runs as a cloud native application, but its chaos workflows can target non-Kubernetes applications and platforms, bare metal servers, virtual machines and public cloud infrastructure.

To put it simply for end users, Litmus is a highly scalable cloud native chaos platform for hybrid or multi cloud environments, where users can take a common approach towards reliability.

Scalable cloud native chaos Platform - LitmusChaos
Offerings from ChaosNative

The next objective for us is to see Litmus helping users and enterprises in the field achieve reliability growth. To be honest, cloud native is too dynamic for us to arrive at one exact design of Litmus workflows that helps every user implement their chaos engineering goals. Each user's environment is unique, and a single good design may not work for everyone. I believe that we are going to find ourselves consulting on and implementing Litmus chaos workflows for many users who are taking the journey to cloud native reliability through Litmus-based chaos engineering. Hence the commercial support package from ChaosNative includes some base consulting, as well as professional services to implement the required chaos workflows to some extent during the initial period. If you are looking to start chaos engineering for your cloud native initiatives, or you are specifically looking for commercial support for Litmus, contact us.

The Road Ahead for ChaosNative

I believe we are at the beginning of the evolution of cloud native chaos engineering. More chaos workflows will continue to evolve in the short term, with the chaos experiments that constitute these workflows becoming a commodity. While we will find the resources required to build Litmus into a stronger, feature-rich and community-loved project, we will also strive to learn the challenges enterprises face in implementing their chaos engineering strategies and build solutions to them into our ChaosNative Enterprise offerings, which we expect to shed light on in the near future.
