Day 1 Keynote by Adrian Cockcroft
For an amazing keynote session on day 1 of Chaos Carnival 2021, we had with us Adrian Cockcroft, VP Cloud Architecture Strategist from Amazon Web Services, to speak about "Failing over without falling over".
Adrian defines Chaos Engineering in one sentence: "It is an experiment to ensure that the impact of failures is mitigated."
CHAOS ENGINEERING TREND
More than a decade ago, Netflix published its first blog post about Chaos Monkey, the first chaos testing tool. Chaos Engineering emerged at that time as a response to the move to cloud environments. Today, even though people are aware of it, it is still rarely part of regular practice. That is expected to change soon, as teams start to test their systems constantly and improve them towards resilience.
If we look back 40-50 years, resilience basically meant disaster recovery and backup data centers. Now, people have begun practicing Chaos Engineering as a means to achieve resilience, and in the future, continuous resilience with a greater presence in production is expected. We build redundancy into our systems so that if something fails, we fail over to an alternative. However, the ability to fail over is complex and hard to test, so often the whole system falls over instead.
We need to start looking for weaknesses through Chaos Engineering from the earliest stages. Remember that the last malfunction you notice might not be the only one: the last strand that breaks is not the cause of the failure. Build resilient systems like a rope, not a chain, but make sure you know how much margin you have and how frayed your system is.
A chaos architecture follows an architectural pattern with four main layers. At the bottom is the infrastructure, which nowadays is cloud-based and highly redundant. The second layer is switching, which moves work from a failed component to an alternative. This is the most poorly tested layer, which is why failures are injected into the switching mechanisms to check that they operate properly. Then come the applications, which need to keep working through timeouts, failures, and retries.
The fourth layer is people, who are an integral part of chaos testing: they are the ones who coordinate amongst themselves and fix the problems. People here does not mean just the Chaos Engineering team or the security red team; it means people from all the departments involved, even the stakeholders. AWS Fault Injection Simulator has an extra perk: it not only runs the experiments but also helps ensure that both the availability and the performance impact of a failure are mitigated.
Adrian goes on to quote Chris Pinkham: "You cannot legislate against failure, focus on fast detection and response." Notably, what Pinkham calls fast detection is essentially observability: you need to be vigilant at each stage of the experiment. Response means controlling the impact of the failure.
OBSERVABILITY AND CONTROL
The STPA (System-Theoretic Process Analysis) model is used to understand the hazards that could disrupt successful application processing. In its control diagram, sensor data flows upwards and control actions flow downwards. At the bottom is the data plane, in this case a web service that processes customer requests. A lot of code goes into managing the data plane. Above it, the control plane consists of a control algorithm, which might be an AWS autoscaler or a fraud detection algorithm, working from a model of the controlled process. It carries out actions such as rejecting certain customer requests, preventing document attacks, etc. Blocking traffic from a suspicious IP address is a typical automated control plane activity. One level above the control plane is the human operator, whose role consists of control action generation, a model of the automation, and a model of the controlled process.
Large-scale failures happen when the application is out of control. There are many possible causes, such as an automation failure, a network partition that leaves no route connecting the application to its customers, or an application that crashes or becomes corrupted so that it cannot easily be restarted.
Within a region, we have three availability zones. Each zone has its own data centers, but they keep their data synchronized so that it always exists, consistently, in all three zones. The zones are about 10-100 kilometers apart, and their failure modes are independent of each other. The application should therefore be unaffected and work perfectly well with any one zone offline. A routing service, or router control plane, manages the failovers.
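The requirement that the application keeps working with any zone offline implies a capacity margin in the surviving zones. The following is a minimal sketch of that arithmetic, not something shown in the keynote. It assumes load is spread evenly across zones and is redistributed evenly to the survivors; the function names are hypothetical.

```python
def max_safe_utilization(zones: int, zones_lost: int = 1) -> float:
    """Highest per-zone utilization (0..1) that still leaves room to absorb
    the load of `zones_lost` failed zones, assuming evenly spread load."""
    if zones <= zones_lost:
        raise ValueError("at least one zone must survive")
    return (zones - zones_lost) / zones


def utilization_after_failure(current: float, zones: int, zones_lost: int = 1) -> float:
    """Per-zone utilization of the surviving zones once the failed zones'
    load has been redistributed evenly (assumption) across them."""
    if zones <= zones_lost:
        raise ValueError("at least one zone must survive")
    return current * zones / (zones - zones_lost)


if __name__ == "__main__":
    # Three zones, one lost: each zone should normally run at or below ~67%.
    print(f"safe limit: {max_safe_utilization(3):.0%}")
    # Running at 80% per zone overloads the two survivors (value above 1.0).
    print(f"after failure: {utilization_after_failure(0.80, 3):.2f}")
```

Under these assumptions, three zones running at 60% each end up at about 90% after losing one zone, while three zones at 80% would need 120% of their capacity, which is one way the system falls over instead of failing over.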
The hazards in STPA's sensor metrics checklist are updates that are missing, zeroed, corrupted, out of order, too rapid, too infrequent, or delayed, as well as coordination problems and degradation over time. In the zone failure scenario, the routing control does not clearly inform the human operators that everything is being taken care of, and the offline zone delays and breaks other metrics with a flood of errors. Confused human controllers disagree amongst themselves about whether they need to do anything, faced with floods of errors, displays that lag reality by several minutes, and out-of-date runbooks. In the end, instead of failing over, the system falls over. The inrush of extra traffic from the failed zone, together with extra work from cross-zone request retries, causes zones A and B to struggle and triggers a complete failure of the application. Meanwhile, the routing service is also hit by a retry storm and is impacted.
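Some of the sensor hazards in that checklist can be detected mechanically. The sketch below is only an illustration under stated assumptions: the `Sample` type, the `sensor_hazards` function, and the thresholds (a gap of more than twice the expected interval counts as missing, less than half counts as rapid) are hypothetical choices, not part of STPA or of the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sent_at: float      # seconds: when the sensor produced the value
    received_at: float  # seconds: when the monitoring system saw it
    value: float


def sensor_hazards(samples: List[Sample], interval: float, max_delay: float) -> List[str]:
    """Return the checklist hazards visible in a metric stream.

    `samples` are in arrival order; `interval` is the expected time between
    updates; `max_delay` is the tolerated lag between sending and receiving.
    """
    hazards = set()
    for prev, cur in zip(samples, samples[1:]):
        gap = cur.sent_at - prev.sent_at
        if gap < 0:
            hazards.add("out of order")
        elif gap > 2 * interval:      # assumed threshold
            hazards.add("missing updates")
        elif gap < interval / 2:      # assumed threshold
            hazards.add("rapid updates")
    for sample in samples:
        if sample.value == 0:
            hazards.add("zeroed")
        if sample.received_at - sample.sent_at > max_delay:
            hazards.add("delayed updates")
    return sorted(hazards)


if __name__ == "__main__":
    stream = [
        Sample(0, 1, 5.0),
        Sample(10, 11, 5.0),
        Sample(40, 41, 0.0),   # gap of 30s and a zero value
        Sample(30, 95, 4.0),   # older sample arriving late
    ]
    print(sensor_hazards(stream, interval=10, max_delay=30))
```

A check like this only covers the per-stream hazards; coordination problems and degradation over time need comparison across sensors and longer histories.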
HOW TO FAIL OVER WITHOUT FALLING OVER?
- Alert Correlation:
Floods of alerts need to be reduced to actionable insights, for example with Amazon DevOps Guru. The observability system needs to cope with floods without failing. Run regular chaos engineering experiments with AWS Fault Injection Simulator.
- Retry storms:
Prevent work amplification. Reduce retries to zero except at subsystem entry and exit points. Reduce timeouts to drop orphaned requests. Route calls within the same zone.
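The work amplification behind a retry storm is multiplicative. Here is a minimal sketch of the worst case, assuming a simple call chain in which every call fails and each layer retries independently; `worst_case_attempts` is a hypothetical helper, not something from the talk.

```python
from typing import List


def worst_case_attempts(retries_per_layer: List[int]) -> int:
    """Attempts that reach the deepest service for a single user request
    when every call fails. Each layer makes 1 + retries attempts, and the
    factors multiply down the call chain."""
    attempts = 1
    for retries in retries_per_layer:
        if retries < 0:
            raise ValueError("retries must be non-negative")
        attempts *= 1 + retries
    return attempts


if __name__ == "__main__":
    print(worst_case_attempts([3, 3, 3]))  # every layer retries three times
    print(worst_case_attempts([3, 0, 0]))  # only the outermost layer retries
```

With three layers that each retry three times, one failing request can turn into 64 attempts at the bottom of the chain; retrying only at the outermost layer caps it at 4. That difference is the argument for removing retries from the inner layers.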
Symmetry is required in high-level automation: consistent configuration as code, and consistent instance types, services, versions, zones, and regions. The principles to keep in mind are: if it can be the same, make it identical; if it is different, make the difference clearly visible; and test your assumptions continuously. For multi-region deployments, get your AZ failover solid before attempting multi-region, use STPA to analyze multi-region-specific hazards, and follow the Well-Architected guide patterns.
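The "make differences clearly visible" principle can be tested continuously with a simple comparison of per-zone configuration. The sketch below assumes configuration is available as plain dictionaries; the `config_differences` function, the setting names, and the zone labels are all hypothetical.

```python
from typing import Any, Dict


def config_differences(zones: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map every setting that is not identical in all zones to its per-zone
    values. A setting missing from a zone is reported with the value None."""
    keys = set()
    for config in zones.values():
        keys.update(config)
    differences = {}
    for key in sorted(keys):
        values = {zone: config.get(key) for zone, config in zones.items()}
        # repr() lets unhashable values such as lists be compared as well.
        if len({repr(value) for value in values.values()}) > 1:
            differences[key] = values
    return differences


if __name__ == "__main__":
    zones = {
        "zone-a": {"instance_type": "m5.large", "version": "1.4.2"},
        "zone-b": {"instance_type": "m5.large", "version": "1.4.2"},
        "zone-c": {"instance_type": "m5.large", "version": "1.4.1"},
    }
    for setting, values in config_differences(zones).items():
        print(setting, values)
```

An empty result means the zones are symmetric for the settings that were compared; anything else is a difference that should either be removed or made deliberate and visible.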
WHY IS CHAOS ENGINEERING DIFFICULT?
It is difficult to stitch together the different tools and homemade scripts, to install the agents or libraries required to get started, to ensure safety, and to reproduce 'real-world events'. These problems can be solved by a fully managed Chaos Engineering service that is easy to get started with, makes experiments easy to share, and reproduces real-world conditions with built-in safeguards.
FULLY MANAGED CHAOS ENGINEERING SERVICE
- Easy to get started:
No need to integrate multiple tools and homemade scripts or install agents.
Use the AWS Management Console or the AWS CLI.
Use pre-existing experiment templates and get started in minutes.
Easily share it with others.
- Real-world conditions:
Run experiments in sequence or in parallel.
Target all levels of the system (host, infrastructure, network, etc.)
Real faults are injected, including at the service control plane level.
COMPONENTS OF AWS FAULT INJECTION SIMULATOR
- Actions:
Actions are the fault injection actions executed during an experiment. Actions include fault types, durations, targeted resources, timing relative to any other actions, and fault-specific parameters, such as rollback behavior or the portion of requests to throttle.
- Targets:
Targets define one or more AWS resources on which we act. Targets include resource types, resource IDs, tags and filters, and a selection mode, for example, ALL, RANDOM, etc.
- Experiment templates:
Experiment templates define an experiment and are used in the start-experiment request. Experiment templates include actions, targets, stop condition alarms, an IAM role, a description, and tags.
- Experiments:
Experiments are snapshots of the experiment template taken when it was first run, with a couple of additions. Experiments include creation and start time, status, execution ID, experiment template ID, and IAM role ARN.
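To make the components above concrete, here is a sketch of an experiment template written as a Python dictionary, with actions, targets, a stop condition alarm, an IAM role, a description, and tags. The field names follow my understanding of the AWS FIS CreateExperimentTemplate request and should be checked against the current AWS documentation before use. The ARNs, tag values, and the `undefined_targets` helper are hypothetical placeholders, and nothing here calls AWS.

```python
from typing import Any, Dict, List

# Hypothetical placeholder ARNs: replace with real ones before use.
ROLE_ARN = "arn:aws:iam::123456789012:role/example-fis-role"
ALARM_ARN = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-error-rate"

template: Dict[str, Any] = {
    "description": "Stop tagged instances in one zone for five minutes",
    "targets": {
        "zone-a-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},  # hypothetical tag
            "filters": [
                {"path": "Placement.AvailabilityZone", "values": ["us-east-1a"]}
            ],
            "selectionMode": "ALL",
        }
    },
    "actions": {
        "stop-instances": {
            "actionId": "aws:ec2:stop-instances",
            # Fault-specific parameter: restart the instances after 5 minutes.
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "zone-a-instances"},
        }
    },
    # Stop condition: halt the experiment if this CloudWatch alarm fires.
    "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": ALARM_ARN}],
    "roleArn": ROLE_ARN,
    "tags": {"owner": "resilience-team"},
}


def undefined_targets(tpl: Dict[str, Any]) -> List[str]:
    """List 'action -> target' references that name a target the template
    does not define. A local sanity check only, not part of the FIS API."""
    defined = set(tpl.get("targets", {}))
    missing = []
    for action_name, action in tpl.get("actions", {}).items():
        for target_name in action.get("targets", {}).values():
            if target_name not in defined:
                missing.append(f"{action_name} -> {target_name}")
    return missing


if __name__ == "__main__":
    print(undefined_targets(template))
```

Keeping the template as data makes it easy to review, share, and validate locally before it is submitted through the console, the CLI, or an SDK.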
"As data centers migrate to the cloud, the fragile and manual disaster recovery process has been standardized and automated. Testing failure mitigation has moved from a scary annual experience to automated continuous resilience." With this observation, Adrian ends his keynote session.