Chaos Engineering in Telco Cloud-Native Infra
For an enlightening session on day 2 of Chaos Carnival 2021, we had with us Vaibhav Chopra and Samar Siddharth from Orange to speak about "Chaos Engineering in Telco Cloud-Native Infra".
"Through this talk, I would like to explain how Telco Cloud infra works efficiently with Chaos, and I would also love to touch on the base points of why Chaos is needed in a Telco-grade cloud. As you can see, we have been swiftly shifting our technology trend from Infrastructure-as-a-Service (IaaS) to Container-as-a-Service (CaaS). The IT cloud offers no acceleration and only limited security and reliability; it was predominantly used for hosting IT and web applications. Telco-grade cloud, on the other hand, is highly secure and reliable, with ultra-low latency and high throughput. It provides network and data-plane acceleration and CPU pinning, is SDN-enabled, and is used to host OSS and BSS applications as VNFs, CNFs, and so on." Explaining this, Vaibhav kickstarts his talk.
The containerization world has boomed in the last few years. It is highly flexible and can be used in any industry and at all stages of an application's life cycle, such as proof of concept (PoC), development, testing, and production. Chaos Engineering helps answer questions such as: has DevOps adoption challenged our testing practices? Apart from CI/CD, why don't we practice continuous testing (CT)? And are SREs solely responsible for reliability and assurance?
E2E TELCO INFRA ASSURANCE
COMMON CHALLENGES IN INFRA BENCHMARKING
Telco workloads are complex due to their ultra-low latency and high throughput requirements.
CNCF hosts multiple projects in different domains such as 50+ databases, 20+ logging, 10+ service mesh, and 30+ security.
- Lack of Standardization:
Standards are still being established midway. For example, CNCF awards the Certified Kubernetes Distribution (CKD) badge after verifying the conformance of a deployment, and the Kubernetes Certified Service Provider (KCSP) badge for professional services.
CHAOS ENGINEERING PREREQUISITES FOR TELCO CLOUD
5G and edge computing:
Ultra-Reliable Low-Latency Communication (URLLC) and enhanced Mobile Broadband (eMBB) demand latencies as low as 1 millisecond and data rates of up to 10 Gbps.
Integration with hyper scalers:
Public cloud providers such as AWS, Azure, and GCP, and Kubernetes management platforms like Rancher, OpenShift, Kubermatic, etc.
From IaaS to CaaS:
Cloud-native environment tools and stacks, and the evolution towards CNFs: containerization brings benefits such as a lightweight footprint, modularity, platform agnosticism, isolation, and resource efficiency.
- E2E Automated Assurance:
Certify the cloud-native environment, identify performance bottlenecks, support customized and on-demand testing, and build automated CI/CD pipeline testing strategies for all layers.
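One way to read "automated CI/CD pipeline testing for all layers" is a pipeline stage that applies a chaos manifest and gates the build on its verdict. The sketch below uses hypothetical GitLab CI syntax; the manifest path, resource name, and namespace are placeholders, not details from the talk:

```yaml
# Hypothetical CI stage: run a chaos experiment and fail the build
# unless the experiment verdict is "Pass". All names are illustrative.
chaos-test:
  stage: test
  image: bitnami/kubectl:latest
  script:
    - kubectl apply -f chaos/pod-delete-engine.yaml
    - sleep 60   # crude wait for the experiment to complete
    - |
      verdict=$(kubectl get chaosresult nginx-chaos-pod-delete \
        -n demo -o jsonpath='{.status.experimentStatus.verdict}')
      echo "Chaos verdict: $verdict"
      test "$verdict" = "Pass"
```

A production pipeline would poll the result instead of sleeping a fixed time, but the gating idea is the same.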
NON-TANGIBLE ASPECTS OF CHAOS IN TELCO
- Revenue Assurance:
It is usually very high in a Telco-grade cloud because every data packet holds some value. Whenever there is an outage or a network disruption, the resulting revenue leakage needs to be taken care of.
- Identifying Race Conditions:
When servers misbehave or face conditions such as unusually low or high load, race conditions can surface; these can be identified and reproduced in a customized chaos testing scenario.
- Improved Monitoring:
Chaos Engineering helps here by monitoring components while testing; from what we learn during those runs, we can tune up our monitoring in Prometheus.
- Fault Tolerance:
We always need our systems to tolerate faults fully. The relevant parameters are identified while testing, and with a bit of tuning we can make the systems fault-tolerant.
- Configuration Tuning:
Session termination, retransmissions, and similar behaviors hold great importance while data traffic is ongoing. Through configuration tuning we can eventually gain a great deal, for example in CPU tuning.
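As a hedged sketch of the Prometheus tuning mentioned above, an alerting rule like the following could watch a latency SLO while a chaos experiment runs. The metric name `http_request_duration_seconds_bucket` and the thresholds are assumptions for illustration, not values from the talk:

```yaml
# Hypothetical Prometheus rule file: alert if p99 latency breaches
# the SLO while chaos is being injected. Metric names are assumed.
groups:
  - name: chaos-slo
    rules:
      - alert: P99LatencyBreach
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "p99 latency above 50ms during chaos run"
```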
Orange was looking for a feasible chaos testing tool for their containerized control plane, which hosts their private cloud, and they stumbled upon LitmusChaos, a framework designed for Kubernetes workloads. Litmus is written in Go and uses three main CRDs to execute an experiment:
- ChaosExperiment: the definition of the experiment with its default parameters.
- ChaosEngine: binds the experiment definition to the chaos target workload. If successful, the experiment is initiated by the Litmus Operator, overriding any variables specified in the manifest file.
- ChaosResult: displays basic information about the progress and, eventually, the result of the experiment.
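As a rough sketch, a minimal ChaosEngine manifest binding the generic pod-delete ChaosExperiment to a target deployment might look like the following; the namespace, labels, and service account name are placeholders, not Orange's actual configuration:

```yaml
# Hypothetical example: bind the "pod-delete" experiment to pods
# labelled app=nginx in the "demo" namespace.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: demo
spec:
  appinfo:
    appns: demo
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # service account with chaos RBAC
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # how long the chaos runs, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            # interval between successive pod deletions
            - name: CHAOS_INTERVAL
              value: "10"
```

The `env` entries here override the defaults declared in the ChaosExperiment, which is the variable-overriding behavior described above.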
As you can see, we have a ChaosHub, which contains the definitions and details of the experiments. We then add the details of the application, such as the target pod, to the ChaosEngine, which maps the two together. Once the mapping succeeds, the ChaosEngine is passed down to the Chaos Operator, and from there to the Chaos Runner, which injects the specific chaos experiment; then we wait for the result. Finally, you can compare the ChaosResult against the hypothesis you formed at the start of the test to see whether it holds.
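For illustration, the ChaosResult the operator reports might look roughly like the sketch below; the names and counts are made up for the example, and the verdict is typically Pass, Fail, or Awaited:

```yaml
# Illustrative ChaosResult as reported after a pod-delete run;
# field values here are invented for the example.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: nginx-chaos-pod-delete
  namespace: demo
status:
  experimentStatus:
    phase: Completed
    verdict: Pass        # hypothesis held: the app stayed available
  history:
    passedRuns: 1
    failedRuns: 0
```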
Firstly, because Kubernetes is a dynamic and complex system and Litmus runs its workloads within the same framework, it is an ideal fit. It offers a wide variety of generic test cases for Kubernetes workloads and infrastructure. Components interact in numerous ways, giving rise to emergent behavior, and the number of possible interactions between components increases as a deployment grows in size. Hardware failures, pod outages, and resource exhaustion threaten even the most well-managed cluster. Litmus can also trigger customized validations. Lastly, it integrates easily with our existing automation framework and has great community support.
Towards the end of the talk, Samar mentions, "We attain resiliency in Telco Cloud through E2E automation, testing the interdependencies amongst different open-source applications, monitoring our control-plane infra, and validating the different control-plane services," and then proceeds to a brief demo of their use case.