Improve Application Resiliency with AWS Fault Injection Simulator

About the speaker

Gunnar Grosch

Senior Developer Advocate, AWS

Gunnar is a Senior Developer Advocate at Amazon Web Services (AWS) based in Sweden. He has 20 years of experience in the IT industry and has previously worked as both a frontend and backend developer, as an operations engineer within cloud infrastructure, and as a technical trainer, in addition to several different management roles. With a focus on building reliable and robust serverless applications, Gunnar has been one of the driving forces in creating techniques and tools for using chaos engineering

About the talk

AWS Fault Injection Simulator is a fully managed chaos engineering service that helps you improve application resiliency by making it easy and safe to perform controlled chaos experiments. In this session, see how to use AWS Fault Injection Simulator to make applications more resilient to failure.

With the wide adoption of microservices and large-scale distributed systems, architectures have grown increasingly complex and can be hard to understand and difficult to debug and test. These systems sometimes have unpredictable behaviors caused by unforeseen turbulent events that can cause cascading and catastrophic failures. Since failures have become more and more chaotic in nature, we must turn to chaos engineering in order to reveal weaknesses before they become outages, and to gain confidence in the system's resilience to failures.

AWS Fault Injection Simulator is a fully managed chaos engineering service that helps you improve application resiliency by making it easy and safe to perform controlled chaos engineering experiments on AWS. In this session, see an overview of chaos engineering and AWS Fault Injection Simulator, and then see a demo of how to use AWS Fault Injection Simulator to make applications more resilient to failure.

The session covers the complexity of modern distributed systems, how chaos engineering can help, and an overview of AWS Fault Injection Simulator.

Transcript

Gunnar kicks off the AWS FIS demo in the AWS console with a simple EC2 instance stop experiment, showing how the experiment details are specified there.

DEMONSTRATION 1

The first step is to create the experiment template:

  • Select the IAM role for the experiment.
  • Choose the action to perform, in this case EC2 stop-instances.
  • Specify the duration of the chaos injection.
  • Specify the target EC2 instance.

Notably, no stop condition is specified, so nothing will stop the experiment automatically. With the template ready, the experiment is executed. It stops the specified EC2 instance, which is confirmed by checking the instance status in the AWS console.
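The template from this first demonstration can be sketched as a JSON document. The Python below is only an illustration of the general shape of an FIS experiment template: the ARNs, the target and action names, and the three-minute restart delay are placeholders, not values shown in the talk.

```python
import json


def build_stop_instance_template(role_arn, instance_arn, restart_after="PT3M"):
    """Return an FIS experiment template that stops one EC2 instance.

    All argument values are supplied by the caller; nothing here is taken
    from the talk itself.
    """
    return {
        "description": "Stop a single EC2 instance",
        "roleArn": role_arn,  # IAM role that FIS assumes to run the experiment
        "targets": {
            "demoInstance": {  # hypothetical target name
                "resourceType": "aws:ec2:instance",
                "resourceArns": [instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stopInstance": {  # hypothetical action name
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration after which FIS starts the instance again
                "parameters": {"startInstancesAfterDuration": restart_after},
                "targets": {"Instances": "demoInstance"},
            }
        },
        # No stop condition, as in the first demonstration
        "stopConditions": [{"source": "none"}],
    }


if __name__ == "__main__":
    template = build_stop_instance_template(
        role_arn="arn:aws:iam::111122223333:role/example-fis-role",  # placeholder
        instance_arn="arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0",  # placeholder
    )
    print(json.dumps(template, indent=2))
```

A file with this content could be passed to the AWS CLI or pasted into the console, which is roughly what the console form builds behind the scenes.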

DEMONSTRATION 2

The second demonstration extends the previous experiment with alarms. Alarms act as stop conditions: they let AWS FIS stop a running experiment automatically based on a defined set of parameters.

To start, Gunnar shows the Amazon CloudWatch dashboard, which has an alarm set for abnormal network metrics on the target EC2 instance. He then updates the experiment template created in the previous demonstration, adding the abnormal-network alarm as a stop condition. He also slightly extends the experiment duration so that its effects can be properly analyzed. The template is then saved and the experiment executed. As before, the target EC2 instance shuts down after a short while, as observed in the AWS console.

Next, the abnormal-network alarm is triggered manually using the AWS CLI, putting it into the alarm state. This causes the target EC2 instance to restart automatically, which is confirmed in the AWS console. The AWS FIS console also shows the experiment run as failed: because the alarm was triggered during the experiment, the experiment was stopped and the effects of the chaos injection were rolled back.
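As a rough sketch of the two steps in this demonstration, the Python below adds a CloudWatch alarm as a stop condition to an existing template and assembles the AWS CLI call that forces an alarm into the ALARM state. The alarm name, the ARN, and the helper function names are hypothetical; the talk does not show the exact commands.

```python
import copy
import shlex


def with_alarm_stop_condition(template, alarm_arn):
    """Return a copy of an FIS template that stops when the alarm fires."""
    updated = copy.deepcopy(template)  # leave the caller's template untouched
    updated["stopConditions"] = [
        {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
    ]
    return updated


def set_alarm_state_command(alarm_name, reason="Manually triggered for demo"):
    """Build the AWS CLI arguments that force a CloudWatch alarm into ALARM."""
    return [
        "aws", "cloudwatch", "set-alarm-state",
        "--alarm-name", alarm_name,
        "--state-value", "ALARM",
        "--state-reason", reason,
    ]


if __name__ == "__main__":
    base = {"description": "Stop instance", "stopConditions": [{"source": "none"}]}
    alarm_arn = "arn:aws:cloudwatch:us-east-1:111122223333:alarm:NetworkAbnormal"  # placeholder
    print(with_alarm_stop_condition(base, alarm_arn)["stopConditions"])
    # Print the command instead of running it; "NetworkAbnormal" is a made-up name
    print(shlex.join(set_alarm_state_command("NetworkAbnormal")))
```

Setting the state by hand only simulates the alarm; CloudWatch returns the alarm to its evaluated state on the next metric evaluation.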

DEMONSTRATION 3

The third demonstration repeats the EC2 instance stop experiment, this time with filters as well as alarms. Filters narrow down the target resources according to given conditions. Gunnar illustrates this with an instance stop JSON template file that has two conditions for targeting EC2 instances: the instance must be in the us-east-1b availability zone, and it must carry a specific tag. He applies this template using the AWS CLI, which makes it available in the AWS console. He then executes the experiment from the console, which shuts down all instances in us-east-1b that carry the specified tag. The same alarm as in the previous demonstration is assigned to this template and is again triggered manually using the AWS CLI. Once triggered, the alarm stops the running experiment partway through and the target EC2 instances are restarted.
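The target section of such a template might look like the sketch below, which combines a tag with an availability zone filter and builds the CLI call that registers a template file. The tag key and value and the file name are made up for illustration; the talk names only the us-east-1b zone.

```python
import json


def build_filtered_target(availability_zone, tag_key, tag_value):
    """Return an FIS target that selects tagged instances in one AZ."""
    return {
        "resourceType": "aws:ec2:instance",
        # Only instances carrying this tag are considered
        "resourceTags": {tag_key: tag_value},
        # The filter path follows the EC2 DescribeInstances response structure
        "filters": [
            {"path": "Placement.AvailabilityZone", "values": [availability_zone]}
        ],
        "selectionMode": "ALL",  # act on every matching instance
    }


def create_template_command(json_path):
    """Build the AWS CLI arguments that create a template from a JSON file."""
    return [
        "aws", "fis", "create-experiment-template",
        "--cli-input-json", f"file://{json_path}",
    ]


if __name__ == "__main__":
    # "Purpose=chaos-ready" is a hypothetical tag, not the one used in the talk
    target = build_filtered_target("us-east-1b", "Purpose", "chaos-ready")
    print(json.dumps({"targets": {"taggedInstances": target}}, indent=2))
    print(" ".join(create_template_command("instance-stop.json")))
```

Selecting by tag and filter rather than by instance ID means the same template keeps working as instances are replaced.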

DEMONSTRATION 4

For the next demo, Gunnar performs a CPU stress fault injection using AWS Systems Manager (SSM) documents. SSM allows a set of actions to be performed on managed EC2 instances using an SSM document that contains the script for the desired action. Gunnar walks through the experiment template file, which specifies the target instance and configures the SSM send-command action together with the SSM document used to inject the chaos. The SSM document contains a script that runs stress-ng to stress the target instance's CPU. The experiment is then executed from the CLI, which causes the target instance's CPU usage to spike from 0% to 100%, as observed by logging into the instance. As before, the specified alarm is triggered to stop the experiment during its execution, after which CPU usage falls back to 0%.
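A minimal sketch of the action block for this kind of experiment follows. It assumes the AWS-provided AWSFIS-Run-CPU-Stress document, which runs stress-ng; the talk does not say whether this document or a custom one was used, and the target name and duration are placeholders.

```python
import json


def build_cpu_stress_action(target_name, region="us-east-1", minutes=2):
    """Return an FIS action that runs a CPU stress SSM document."""
    # Assumption: the public AWSFIS-Run-CPU-Stress document is used
    document_arn = f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress"
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": document_arn,
            # FIS expects the document parameters as a JSON-encoded string
            "documentParameters": json.dumps(
                {"DurationSeconds": str(minutes * 60), "InstallDependencies": "True"}
            ),
            "duration": f"PT{minutes}M",  # ISO 8601 duration of the action
        },
        "targets": {"Instances": target_name},
    }


def start_experiment_command(template_id):
    """Build the AWS CLI arguments that start an experiment from a template."""
    return ["aws", "fis", "start-experiment", "--experiment-template-id", template_id]


if __name__ == "__main__":
    action = build_cpu_stress_action("stressTarget")  # hypothetical target name
    print(json.dumps({"actions": {"cpuStress": action}}, indent=2))
    # The template ID is whatever create-experiment-template returned
    print(" ".join(start_experiment_command("EXAMPLE_TEMPLATE_ID")))
```

The target instances need the SSM agent and an instance profile that allows Systems Manager to run commands on them.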

DEMONSTRATION 5

For the final demo, Gunnar demonstrates the inject API throttle error experiment, which injects chaos at the AWS control plane level by throttling requests made to other services. The experiment template applies the inject-api-throttle-error action to the ec2 service, throttling the DescribeInstances and DescribeVolumes operations. An experiment is created from the template and executed, which causes the AWS CLI command for describing EC2 instances to be throttled and, after failed retries, to return an error. The alarm is then triggered manually from the CLI to stop the experiment. Once the experiment has stopped, the EC2 instances can again be described using the CLI command.
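A template for this experiment could be assembled roughly as follows. The fault is applied to API calls made by a specific IAM role, so the target is a role rather than an instance. All ARNs, names, the duration, and the 100 percent throttle rate are illustrative assumptions, not values from the talk.

```python
import json


def build_throttle_template(role_arn, target_role_arn, alarm_arn, operations,
                            duration="PT2M", percentage=100):
    """Return an FIS template that throttles EC2 API calls made by one role."""
    return {
        "description": "Throttle selected EC2 API operations",
        "roleArn": role_arn,  # role that FIS itself assumes
        "targets": {
            "throttledRole": {  # hypothetical target name
                "resourceType": "aws:iam:role",
                "resourceArns": [target_role_arn],  # role whose calls are throttled
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "throttleEc2": {  # hypothetical action name
                "actionId": "aws:fis:inject-api-throttle-error",
                "parameters": {
                    "service": "ec2",
                    "operations": ",".join(operations),  # comma-separated list
                    "percentage": str(percentage),  # share of calls to throttle
                    "duration": duration,
                },
                "targets": {"Roles": "throttledRole"},
            }
        },
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
    }


if __name__ == "__main__":
    template = build_throttle_template(
        role_arn="arn:aws:iam::111122223333:role/example-fis-role",  # placeholder
        target_role_arn="arn:aws:iam::111122223333:role/example-app-role",  # placeholder
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:NetworkAbnormal",  # placeholder
        operations=["DescribeInstances", "DescribeVolumes"],
    )
    print(json.dumps(template, indent=2))
```

Only calls made with the targeted role are throttled, so the CLI session used to observe the effect has to run under that role.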
