Improve Application Resiliency with AWS Fault Injection Simulator
Gunnar kicks off the demo of AWS FIS using the AWS console for a simple EC2 instance stop experiment. He proceeds to show how the experiment details can be specified from the AWS console.
The first step is to create the experiment template:
- Select the IAM role that AWS FIS will use to run the experiment.
- Choose the action to be performed as part of the experiment, which in this case is the EC2 stop-instances action (aws:ec2:stop-instances).
- Specify the duration of the chaos injection.
- Lastly, specify the target EC2 instance.
Notably, no stop condition was specified, so nothing would halt the experiment automatically. With the template ready, the experiment was executed. It stopped the specified EC2 instance, which was confirmed by checking the instance's status in the AWS console.
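The console steps above map onto a single experiment template document. The following is a minimal sketch of what such a template might look like as a Python dictionary, assuming a hypothetical role ARN, a hypothetical instance ARN and a five-minute duration, none of which are given in the talk:

```python
import json

# Hypothetical placeholders; substitute ARNs from your own account.
ROLE_ARN = "arn:aws:iam::111122223333:role/fis-demo-role"
INSTANCE_ARN = "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0"


def build_stop_instance_template(role_arn, instance_arn, duration="PT5M"):
    """Build an experiment template that stops one EC2 instance."""
    return {
        "description": "Stop a single EC2 instance",
        "roleArn": role_arn,
        # "none" means no stop condition guards the experiment.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "demoInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": [instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration after which the instance is started again.
                "parameters": {"startInstancesAfterDuration": duration},
                "targets": {"Instances": "demoInstance"},
            }
        },
    }


template = build_stop_instance_template(ROLE_ARN, INSTANCE_ARN)
print(json.dumps(template, indent=2))
```

With boto3 installed and credentials configured, a dictionary of this shape could be passed as keyword arguments to the FIS client's create_experiment_template call instead of being entered in the console.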
The second demonstration extended the previous experiment with alarms. An Amazon CloudWatch alarm can be attached to an AWS FIS experiment as a stop condition, so that the experiment is stopped automatically when the alarm fires.
To start off, Gunnar shows the Amazon CloudWatch dashboard, which has an alarm set for abnormal network metrics on the target EC2 instance. He then updates the experiment template created in the previous demonstration, adding this alarm as a stop condition. He also slightly extends the duration of the chaos experiment so that its effects can be properly observed. The template is then saved and the experiment executed. As before, the target EC2 instance shut down after a short while, as observed from the AWS console.
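The template update described here amounts to swapping the stop condition and lengthening the duration. A minimal sketch, assuming a hypothetical alarm ARN and a cut-down template standing in for the one from the first demo:

```python
import copy

# Hypothetical alarm ARN; the talk does not name the alarm.
ALARM_ARN = "arn:aws:cloudwatch:us-east-1:111122223333:alarm:abnormal-network"

# Cut-down stand-in for the first demo's template (no stop condition).
base_template = {
    "stopConditions": [{"source": "none"}],
    "actions": {
        "stopInstance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "demoInstance"},
        }
    },
}


def with_alarm_stop_condition(template, alarm_arn, duration=None):
    """Return a copy of the template guarded by a CloudWatch alarm."""
    updated = copy.deepcopy(template)
    updated["stopConditions"] = [
        {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
    ]
    if duration is not None:
        # Extend the duration on stop-instances actions only.
        for action in updated["actions"].values():
            if action["actionId"] == "aws:ec2:stop-instances":
                action["parameters"]["startInstancesAfterDuration"] = duration
    return updated


guarded = with_alarm_stop_condition(base_template, ALARM_ARN, duration="PT10M")
print(guarded["stopConditions"])
```

The helper copies the template rather than mutating it, so the original definition stays available for comparison.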
Next, the abnormal-network alarm was triggered manually using the AWS CLI, which put it into the ALARM state. This caused the target EC2 instance to restart automatically, which was confirmed from the AWS console. Moreover, the AWS FIS console showed the experiment run as failed: because the alarm fired while the chaos experiment was in progress, the experiment was stopped and the effects of the chaos injection were rolled back.
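Forcing an alarm into the ALARM state is done with the CloudWatch set-alarm-state operation. The sketch below builds the request and renders an equivalent AWS CLI command line; the alarm name and reason text are hypothetical, since the talk does not state them:

```python
import shlex

# The three states a CloudWatch alarm can be in.
VALID_STATES = {"OK", "ALARM", "INSUFFICIENT_DATA"}


def set_alarm_state_request(alarm_name, state, reason):
    """Build the parameters for a CloudWatch SetAlarmState request."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown alarm state: {state}")
    return {"AlarmName": alarm_name, "StateValue": state, "StateReason": reason}


def as_cli_command(request):
    """Render the request as an 'aws cloudwatch set-alarm-state' command."""
    return shlex.join([
        "aws", "cloudwatch", "set-alarm-state",
        "--alarm-name", request["AlarmName"],
        "--state-value", request["StateValue"],
        "--state-reason", request["StateReason"],
    ])


# Hypothetical alarm name and reason.
request = set_alarm_state_request(
    "abnormal-network", "ALARM", "Manual trigger to stop the FIS experiment"
)
print(as_cli_command(request))
```

A state set this way is temporary: the alarm returns to its real state at the next evaluation of the underlying metric, which makes the operation convenient for rehearsing a stop condition.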
The third demonstration repeated the EC2 instance stop experiment, this time with filters as well as alarms. Filters narrow the set of target resources to those that meet given conditions. Gunnar illustrates this with an instance stop JSON template file that has two conditions for targeting EC2 instances: the instance's availability zone must be us-east-1b, and the instance must carry a specific tag. He then creates the template from this file using the AWS CLI, which makes it available in the AWS console. Next, he executes the experiment from the AWS console, which shuts down all instances in the us-east-1b zone that carry the specified tag. The same alarm as in the previous demonstration is assigned to this template as a stop condition, and it is again triggered manually using the AWS CLI. Once triggered, the alarm stops the running experiment partway through and the target EC2 instances are restarted.
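A target definition with these two conditions might look like the sketch below. The tag key and value are hypothetical, and expressing the tag through resourceTags rather than through a second filter entry is an assumption about how the demo's JSON file was written:

```python
import json


def build_filtered_target(availability_zone, tags):
    """Select every EC2 instance in one zone that carries the given tags."""
    return {
        "resourceType": "aws:ec2:instance",
        "resourceTags": dict(tags),
        "filters": [
            # Path into the instance description, matched against the values.
            {"path": "Placement.AvailabilityZone", "values": [availability_zone]},
        ],
        # Act on all matching instances rather than a count or percentage.
        "selectionMode": "ALL",
    }


# Hypothetical tag; the talk does not state the key or value used.
target = build_filtered_target("us-east-1b", {"Purpose": "chaos-ready"})
print(json.dumps({"targets": {"taggedInstances": target}}, indent=2))
```

Because the target is selected by attributes instead of by ARN, the same template keeps working as instances in the zone are replaced.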
For the next demo, Gunnar performs a CPU stress fault injection on an EC2 instance using AWS Systems Manager (SSM) documents. SSM allows a set of actions to be performed on managed EC2 instances using an SSM document that contains the script for the desired action. Gunnar explains the experiment template file, which specifies the target instance and uses the SSM send-command action (aws:ssm:send-command) together with the SSM document that injects the chaos. The SSM document contains a script that runs stress-ng to stress the target instance's CPU. The experiment was then executed from the CLI, which caused the target EC2 instance's CPU usage to spike from 0% to 100%, as observed by logging into the target instance. As before, the alarm was triggered during execution to stop the experiment, which caused the CPU usage to fall back to 0%.
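The send-command action takes the SSM document's ARN plus that document's own parameters encoded as a JSON string. A minimal sketch, assuming a hypothetical custom document ARN and a hypothetical DurationSeconds document parameter (the talk does not show the document's parameter names):

```python
import json

# Hypothetical ARN of an SSM document that runs stress-ng.
DOCUMENT_ARN = "arn:aws:ssm:us-east-1:111122223333:document/cpu-stress-demo"


def build_cpu_stress_action(document_arn, minutes, target_name):
    """Build a FIS action that runs an SSM document on the target instances."""
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": document_arn,
            # Document parameters are passed as a JSON-encoded string.
            # "DurationSeconds" is a hypothetical parameter name.
            "documentParameters": json.dumps(
                {"DurationSeconds": str(minutes * 60)}
            ),
            # How long FIS keeps the action running (ISO 8601 duration).
            "duration": f"PT{minutes}M",
        },
        "targets": {"Instances": target_name},
    }


action = build_cpu_stress_action(DOCUMENT_ARN, 2, "stressTarget")
print(json.dumps(action, indent=2))
```

Keeping the script's own run time and the action duration derived from the same value avoids the action ending while stress-ng is still running, or the reverse.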
For the final demo, Gunnar demonstrates the API throttle error experiment, which injects chaos at the AWS control plane level by throttling requests made to other services. The experiment template uses the inject-api-throttle-error action (aws:fis:inject-api-throttle-error) against the ec2 service to throttle the DescribeInstances and DescribeVolumes operations. An experiment was created from the template and executed, which caused the AWS CLI command for describing EC2 instances to be throttled and, after several failed retries, to return an error. The alarm was then triggered manually from the CLI to stop the experiment, and once the experiment had stopped, the EC2 instances could again be described using the same CLI command.
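Unlike the earlier experiments, this action targets an IAM role rather than an instance: the throttling errors are injected into requests made by that role. A minimal sketch of such a template, assuming hypothetical role and alarm ARNs, a 100% injection rate and a two-minute duration:

```python
import json

# Hypothetical ARNs; none of these are given in the talk.
FIS_ROLE_ARN = "arn:aws:iam::111122223333:role/fis-demo-role"
CALLER_ROLE_ARN = "arn:aws:iam::111122223333:role/cli-user-role"
ALARM_ARN = "arn:aws:cloudwatch:us-east-1:111122223333:alarm:abnormal-network"


def build_api_throttle_template(fis_role, caller_role, alarm_arn, operations):
    """Build a template that throttles EC2 API calls made by one IAM role."""
    return {
        "description": "Throttle selected EC2 API operations",
        "roleArn": fis_role,
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            "callerRole": {
                "resourceType": "aws:iam:role",
                "resourceArns": [caller_role],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "throttleEc2": {
                "actionId": "aws:fis:inject-api-throttle-error",
                "parameters": {
                    "service": "ec2",
                    # Operations are passed as one comma-separated string.
                    "operations": ",".join(operations),
                    "percentage": "100",  # assumed injection rate
                    "duration": "PT2M",   # assumed duration
                },
                "targets": {"Roles": "callerRole"},
            }
        },
    }


template = build_api_throttle_template(
    FIS_ROLE_ARN, CALLER_ROLE_ARN, ALARM_ARN,
    ["DescribeInstances", "DescribeVolumes"],
)
print(json.dumps(template, indent=2))
```

Only callers using the targeted role see the errors, so the blast radius can be limited to a single application or user.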