In the kitchen: A sprinkle of fire and chaos
For yet another insightful session on day 2 of Chaos Carnival 2021, we had with us Ana Medina, Senior Chaos Engineer at Gremlin, to speak about "In the kitchen: A sprinkle of fire and chaos".
Ana starts with a simple analogy, "We all learned to cook by trying various foods, watching how a dish is made in countless YouTube videos and recipes, and finally mastering the art of cooking, not just for ourselves but for others as well, taking their valuable feedback to improve the process."
HOW DO WE LEARN?
We learn about distributed systems, navigating the cloud space, and related topics in much the same way: by consuming content through conferences, YouTube videos, and tutorials, or by building, breaking, and experimenting hands-on.
Ana returns to the analogy, "We learn that our food tastes better when we taste it at intervals as we cook and when we cook for others so that they can review our skills; if the food is not good, we can either burn it or start over. What matters most here is to take feedback as a way forward in the process."
WHAT DOES COOKING TEACH IN BUILDING CLOUD-NATIVE APPLICATIONS?
The steps for finding out whether your food is good apply equally well to cloud-native applications. Tasting while the dish is being prepared teaches us to observe gradual deployments of our applications. Cooking for others teaches us to experiment together with our team. The third and last step, burning it, means studying past failures and replicating them in experiments so that the system becomes more resilient and less likely to fail.
TODAY'S FOCUS: LEARNING
"Some of us start learning by practicing, handling the deployment of a reliable Kubernetes cluster, setting up EC2s to have cloud alarms, learning fundamentals of cloud and networking concepts, building muscle memory, etc. Our businesses, health, and safety rely on applications and systems that are not 100% efficient and will fail at a certain point. So, bringing in reliability and minimizing the risk of failures will give us the utmost customer satisfaction," Ana explains.
MEASURING THE COST OF DOWNTIME
Every time our applications break, it costs us a lot of money, and the costs are rarely fully quantifiable.
Cost = R + E + C + (B + A)
where R = Revenue Lost (During the outage)
E = Employee Productivity (During the outage)
C = Customer Chargebacks (After the outage; SLA breaches)
B = Brand Defamation (Unquantifiable)
A = Employee Attrition (Unquantifiable)
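As a rough illustration, the quantifiable part of this formula can be added up in a few lines of Python. The function name and the dollar figures below are hypothetical, chosen only to show the arithmetic; B and A are left out of the sum because the formula marks them as unquantifiable.

```python
def quantifiable_downtime_cost(revenue_lost, productivity_lost, chargebacks):
    """Return R + E + C, the part of the outage cost that can be measured.

    Brand defamation (B) and employee attrition (A) are deliberately
    excluded: the formula lists them, but they cannot be put into numbers.
    """
    return revenue_lost + productivity_lost + chargebacks


# Hypothetical figures for a single outage, in dollars.
cost = quantifiable_downtime_cost(
    revenue_lost=50_000,       # R: revenue lost during the outage
    productivity_lost=8_000,   # E: employee productivity lost during the outage
    chargebacks=12_000,        # C: customer chargebacks after SLA breaches
)
print(f"Quantifiable cost: ${cost:,}")  # prints: Quantifiable cost: $70,000
```

The real cost is higher than whatever this sum reports, since B and A still apply even though they do not appear in it.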
"When we are talking about the failures, we also need to remember that whatever we are building in the cloud is only getting complex which makes it difficult to operate. The pressure that we have of driving new technologies of infrastructure, application development architectures makes it extremely hard.
"In the cloud-native world, we need experimentation, making sure you are building reliable applications for your customers and end-users. That is the beauty of experimentation. At the end of the day, we are building a complex distributed system. There are some parts which we have to test, otherwise, we are going to suffer the outage. There are things upon which we have to build our redundancy and test the reliability." Ana states.
Ana continues, "This brings us to our favorite ingredient to build cloud-native resilient applications, which is Chaos Engineering. It is a thoughtful and planned way of experimentation designed to reveal the weaknesses in our systems."
Chaos Engineering is a practice for making your systems, processes, architecture, people, diagrams, etc. more resilient and reliable. It follows a scientific method for crafting experiments. We create a hypothesis by asking what will happen to the system when we inject a failure. The blast radius is the portion of your cloud infrastructure that the experiment targets: it can be one host, one Kubernetes cluster, or one microservice. Magnitude is the strength of the experiment you release. At the initial stage, always start with a small blast radius and a small magnitude. Abort conditions are the safeguards you put in place: conditions, visible on your application's dashboards, that tell you when a chaos experiment should be stopped.
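To make these terms concrete, here is a minimal Python sketch of how an experiment definition might be written down. It is not taken from the talk or from any chaos tool: the class name, the field names, the metric names, and the thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class ChaosExperiment:
    """Hypothetical container for the parts of a chaos experiment."""
    hypothesis: str
    blast_radius: str   # what is targeted, e.g. one host or one microservice
    magnitude: str      # how strong the injected failure is
    abort_conditions: List[Callable[[Metrics], bool]] = field(default_factory=list)

    def should_abort(self, metrics: Metrics) -> bool:
        """Return True as soon as any abort condition holds for the metrics."""
        return any(condition(metrics) for condition in self.abort_conditions)


# Assumed metric names and thresholds, for illustration only.
experiment = ChaosExperiment(
    hypothesis="Shutting down one container has no visible impact",
    blast_radius="one host",
    magnitude="one container shut down",
    abort_conditions=[
        lambda m: m.get("http_error_rate", 0.0) > 0.01,
        lambda m: m.get("p99_latency_ms", 0.0) > 1000,
    ],
)

print(experiment.should_abort({"http_error_rate": 0.0, "p99_latency_ms": 200}))
print(experiment.should_abort({"http_error_rate": 0.05, "p99_latency_ms": 200}))
```

Writing the abort conditions as plain functions over a metrics snapshot keeps them easy to review with the team before the experiment runs.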
Observe: Observing the system lets us draw an architecture and microservice diagram, which in turn helps us determine the critical path of our application.
Baseline your metrics: Establish set parameters on which service level objectives and indicators (SLOs/SLIs) can be based. This gives us an overall understanding of our systems under normal conditions.
Form Hypothesis with Abort Conditions: Studying the system helps us form a hypothesis and decide on the abort conditions, that is, when the experiment should be stopped.
Define Blast Radius and Magnitude: Defining a small blast radius and magnitude impact helps us start small and gain confidence one experiment after another.
Run Experiment: The first few times you run these experiments, you might notice results that differ from the hypothesis, which helps you learn about alternative outcomes.
Analyze results: Once the experiment is complete and documented, we can go back and study the results and observations, which helps answer whether the experiment was successful or not. If the experiment was successful, you might want to make the blast radius a bit larger and the magnitude a bit stronger.
Sharing the results: You should share the results with your leadership team and with the community. This helps in building open-source technologies and cloud-native applications.
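The loop these steps describe (start small, analyze, then widen the blast radius and magnitude after each success) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the percentage scale, the step size, and the stand-in `fake_system` are all hypothetical, and a real campaign would call an actual chaos tool where the stub is.

```python
def plan_next(blast_radius_pct, magnitude_pct, succeeded, step=10, limit=100):
    """Return the next (blast radius, magnitude) pair, or None to stop.

    After a success both values grow by `step`, capped at `limit`.
    After a failure we stop so the weakness can be fixed first.
    """
    if not succeeded:
        return None
    return (min(blast_radius_pct + step, limit), min(magnitude_pct + step, limit))


def run_campaign(run_experiment, start=(10, 10), max_rounds=5):
    """Run experiments from `start`, escalating only after each success."""
    history = []
    current = start
    for _ in range(max_rounds):
        succeeded = run_experiment(*current)
        history.append((current, succeeded))  # keep a record to share later
        current = plan_next(*current, succeeded)
        if current is None:
            break
    return history


def fake_system(blast_radius_pct, magnitude_pct):
    """Stand-in for a real experiment: this system copes up to 30% magnitude."""
    return magnitude_pct <= 30


for (radius, magnitude), ok in run_campaign(fake_system):
    outcome = "hypothesis held" if ok else "hypothesis rejected"
    print(f"blast radius {radius}%, magnitude {magnitude}%: {outcome}")
```

The recorded history doubles as the material for the last step, sharing the results.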
Experiment 1: Deploying PHP Guestbook application with Redis
A single-instance Redis Master to store guestbook entries.
Multiple replicated Redis instances to serve reads.
Multiple web frontend instances.
This experiment shuts down one of the application's Redis containers. Once a container is shut down, I would like to know which Redis replica continues serving data. The abort conditions are as follows: if we run into HTTP 400/500 errors, we stop the experiment, and if our frontend has any issues or we suffer any data loss, the experiment is halted as well. The hypothesis is that if we shut down a Redis container, the application will see no impact and keep running smoothly.
Ana then proceeds to show a brief demo and, upon completion, she states, "Our experiment has to stop because we hit our abort condition, that is, data loss. This proves that our hypothesis was wrong but we got to learn that the impact of losing the Redis primary container means we suffered data loss. This was something which we would have never learned if we never lost any data in the first place."
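The recap does not say why the data was lost, so the following Python toy model should be read as one possible mechanism, not as a description of the demo. It assumes a primary that keeps entries only in memory and replicas that simply mirror whatever the primary holds. All class and function names are hypothetical, and no real Redis is involved.

```python
class ToyGuestbook:
    """In-memory stand-in for a single primary with mirrored read replicas."""

    def __init__(self, replica_count=2):
        self.primary = []
        self.replicas = [[] for _ in range(replica_count)]

    def _sync(self):
        # Replicas copy the primary's current state, whatever it is.
        self.replicas = [list(self.primary) for _ in self.replicas]

    def write(self, entry):
        self.primary.append(entry)
        self._sync()

    def read(self, replica_index=0):
        return list(self.replicas[replica_index])

    def kill_primary_container(self):
        # Assumption: no persistence, so the restarted primary comes back
        # empty and the replicas then mirror that empty state.
        self.primary = []
        self._sync()


def data_loss_detected(before, after):
    """Abort condition: any entry seen before the attack is missing after it."""
    return any(entry not in after for entry in before)


guestbook = ToyGuestbook()
guestbook.write("hello")
guestbook.write("chaos carnival")

before = guestbook.read()
guestbook.kill_primary_container()
after = guestbook.read()

if data_loss_detected(before, after):
    print("Abort: data loss detected, hypothesis rejected")
```

Even in a toy like this, the value of the abort condition is visible: the experiment stops at the first sign of lost entries instead of continuing to run.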
CHAOS ENGINEERING ON KUBERNETES
1. Plan and test your limits:
Make sure your Kubernetes infrastructure can handle the first few sets of experiments. Ideally, the chaos injection tool attacks the CPU with a blast radius of 100% and a magnitude of 60%. Abort conditions can be HTTP 400/500 errors, the application being unresponsive, response rates increasing, etc. The hypothesis can be that when memory resources are consumed, the application will either handle the load or autoscale.
2. Graceful degradation:
This experiment applies when the responses from our API gradually degrade or a certain part of the application stops working. Here, the targeted attack is on latency, with Redis-Cart as the blast radius and a magnitude of 400ms. The abort conditions are the same as above: HTTP 400/500 errors, the application being unresponsive, response rates increasing, etc. The hypothesis states that when the caching layer experiences a latency increase, the application should keep working without any issues.
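One way an application might live up to the graceful degradation hypothesis is to give the cache a latency budget and fall back to a reduced response when the budget is exceeded. The Python sketch below simulates this without any real network calls. The function names, the 250ms budget, and the cart contents are assumptions for illustration; only the 400ms injected latency comes from the experiment described above.

```python
def make_cache_lookup(base_latency_ms, injected_latency_ms=0):
    """Build a fake cache call that reports its latency instead of sleeping."""
    def lookup():
        return base_latency_ms + injected_latency_ms, ["book", "pen"]
    return lookup


def get_cart(cache_lookup, budget_ms=250):
    """Serve the cart, degrading to an empty cart if the cache is too slow.

    A real service would enforce the budget with a client-side timeout;
    here the simulated latency is simply compared with the budget.
    """
    latency_ms, items = cache_lookup()
    if latency_ms > budget_ms:
        return {"items": [], "degraded": True}
    return {"items": items, "degraded": False}


healthy = get_cart(make_cache_lookup(base_latency_ms=5))
attacked = get_cart(make_cache_lookup(base_latency_ms=5, injected_latency_ms=400))

print(healthy)
print(attacked)
```

Under the attack the caller still gets a well-formed response flagged as degraded, which is the kind of behavior the latency experiment is meant to confirm or disprove.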
RECIPE FOR BUILDING RELIABLE APPLICATIONS
"Implement highly available applications, be thoughtful about the capacity and the footprint they are going to use, and consider failover and deployment in multiple regions and availability zones, which for some of you might mean a physical datacentre as well as the cloud. Lastly, be prepared with disaster recovery plans and exercise them often with your team. These things culminate in reliability for our customers. Reliability through experimentation can only be achieved by practice." With this, Ana concludes her talk.