A powerful framework for network chaos experiments
For an enlightening session on day 2 of Chaos Carnival 2021, we had Andreas Krivas, Engineering Manager from Container Solutions, speak about "A powerful framework for network chaos experiments".
Andreas opens the technical demo by explaining the architecture of the target application, Hipster Shop, a demo e-commerce website built on a microservices architecture.


For the demonstration, a network chaos experiment is performed on Hipster Shop: a two-second latency is injected between the Frontend and CheckoutService microservices for one minute, using the pod-network-latency experiment from LitmusChaos.

The Frontend microservice is the user-facing web application through which users access the application's features, such as browsing and purchasing the products in the catalog. CheckoutService is the microservice responsible for facilitating payments for the products being bought.
The demo setup consists of Hipster Shop deployed in a GKE cluster running on GCP, with LitmusChaos installed on the same cluster. Andreas then gives a brief overview of the ChaosEngine manifest for the pod-network-latency experiment. He explains experiment details such as appinfo, where the target application's namespace, label, and resource type are specified; in this case they are hipster-shop, app=frontend, and deployment respectively.
He also explains the experiment tunables. DESTINATION_HOSTS restricts the chaos to specific hosts, which in this case is just the CheckoutService. NETWORK_LATENCY specifies the latency to inject, here 2000 ms, i.e. 2 s. TOTAL_CHAOS_DURATION specifies how long the chaos lasts, here 60 s, i.e. 1 minute.
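To make the shape of such a manifest concrete, here is a minimal Python sketch that assembles a ChaosEngine object with the values mentioned in the talk and prints it as JSON, which kubectl accepts as well as YAML. The field layout follows the ChaosEngine schema as commonly documented for LitmusChaos and should be checked against the installed version. The engine name, the service account name, and the host name checkoutservice are assumptions made for illustration; they are not taken from the demo.

```python
import json


def build_chaos_engine(app_ns, app_label, app_kind,
                       latency_ms, duration_s, destination_hosts=None):
    """Return a ChaosEngine manifest (as a dict) for pod-network-latency."""
    env = [
        # Latency to inject, in milliseconds.
        {"name": "NETWORK_LATENCY", "value": str(latency_ms)},
        # How long the chaos lasts, in seconds.
        {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
    ]
    if destination_hosts:
        # Restrict the latency to traffic towards these hosts only.
        env.append({"name": "DESTINATION_HOSTS",
                    "value": ",".join(destination_hosts)})
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        # Hypothetical engine name, chosen for this sketch.
        "metadata": {"name": "frontend-network-latency", "namespace": app_ns},
        "spec": {
            "engineState": "active",
            "appinfo": {"appns": app_ns,
                        "applabel": app_label,
                        "appkind": app_kind},
            # Assumed service account name; use the one created in your cluster.
            "chaosServiceAccount": "pod-network-latency-sa",
            "experiments": [
                {"name": "pod-network-latency",
                 "spec": {"components": {"env": env}}}
            ],
        },
    }


engine = build_chaos_engine(
    "hipster-shop", "app=frontend", "deployment",
    latency_ms=2000, duration_s=60,
    destination_hosts=["checkoutservice"],  # assumed service host name
)
print(json.dumps(engine, indent=2))
```

The printed JSON can be saved to a file and applied in the same way as a YAML manifest.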

Andreas then heads over to the terminal, where the application's microservices can be seen running in their respective Kubernetes pods. Once the pods are verified, the ChaosEngine manifest is applied. As soon as it is applied, the experiment pod is created and the experiment begins executing its steps, which can be traced through the experiment logs. While the chaos injection is in progress, Andreas heads over to the website and checks out a product. The checkout is observed to take place with a network latency of 2.11 seconds, slightly more than the 2 seconds injected by the chaos experiment.

It was also observed that the other microservices were unaffected and were working properly, without any significant network latency. Once the chaos duration was over, the checkout service was accessed once more and this time there was no significant network latency.
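The comparison made here, observed latency against injected latency, can be sketched in a few lines of Python. This is only an illustration: the demo read the timing from the website, not from a script. The helper names timed and overhead_over_injected are hypothetical, and fake_checkout is a stand-in for a real checkout request.

```python
import time


def timed(action):
    """Run action() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def overhead_over_injected(observed_s, injected_ms):
    """Seconds by which the observed latency exceeds the injected delay."""
    return observed_s - injected_ms / 1000.0


def fake_checkout():
    # Stand-in for a real checkout request; sleeps briefly instead of
    # calling the shop so the sketch runs without a cluster.
    time.sleep(0.05)
    return "ok"


result, elapsed = timed(fake_checkout)
print(f"checkout returned {result!r} in {elapsed:.3f} s")

# Values from the demo: 2.11 s observed, 2000 ms injected.
extra = overhead_over_injected(2.11, 2000)
print(f"latency beyond the injected delay: {extra:.2f} s")
```

Running the same measurement before, during, and after the chaos window gives the three data points described above: no significant latency, roughly the injected 2 seconds plus a small overhead, and no significant latency again.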
Andreas then makes a few changes to the ChaosEngine manifest, such as commenting out the DESTINATION_HOSTS tunable, which effectively adds the latency to every network request made from the Frontend microservice. Once the manifest is applied and the chaos injection is in progress, the website becomes unreachable because the Frontend microservice pod has gone into a Not Ready state.

As soon as the chaos duration was over, the Frontend pod went back to the ready state and the website was reachable once again.
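The second run differs from the first only in that DESTINATION_HOSTS is no longer set. As a rough sketch, the same change can be expressed in Python as dropping one entry from the experiment's env list. The helper without_tunable is hypothetical, and the small engine dict below is a trimmed stand-in for the real manifest, with checkoutservice as an assumed host name.

```python
import copy


def without_tunable(engine, experiment_name, tunable):
    """Return a copy of the engine with one env tunable removed."""
    updated = copy.deepcopy(engine)  # leave the original manifest untouched
    for experiment in updated["spec"]["experiments"]:
        if experiment["name"] == experiment_name:
            components = experiment["spec"]["components"]
            components["env"] = [
                entry for entry in components["env"]
                if entry["name"] != tunable
            ]
    return updated


# Trimmed stand-in for the manifest used in the first run.
engine = {
    "kind": "ChaosEngine",
    "spec": {
        "experiments": [{
            "name": "pod-network-latency",
            "spec": {"components": {"env": [
                {"name": "NETWORK_LATENCY", "value": "2000"},
                {"name": "TOTAL_CHAOS_DURATION", "value": "60"},
                {"name": "DESTINATION_HOSTS", "value": "checkoutservice"},
            ]}},
        }]
    },
}

broad = without_tunable(engine, "pod-network-latency", "DESTINATION_HOSTS")
remaining = [e["name"] for e in
             broad["spec"]["experiments"][0]["spec"]["components"]["env"]]
print(remaining)
```

With the host filter gone, the latency applies to all of the Frontend's outgoing traffic, which matches the much wider impact seen in the second run.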