Tulp: Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents
For yet another insightful session on day 2 of Chaos Carnival 2021, we had with us Yury Nino Rao, a Chaos Engineering Advocate from ADL Digital Labs, to speak about "Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents".
Yury starts her talk by quoting the author of Incident Management for Operations, He"Incidents are not concerned about making quick decisions. Incidents are about making the best decisions in the shortest amount of time. Time is the most important commodity that exists."
INCIDENTS ARE INEVITABLE!
An incident is an occurrence, either human-caused or a natural phenomenon, that requires action by emergency services personnel to prevent or minimize loss of life or damage. Incidents are happening all over the world so the response to those incidents comes in many forms from one person working on an issue to a group of people dealing with a conference. For example, we can learn Incident Management from wildfires. In 1970, a series of devastating wildfires swept across California, destroying more than 700 homes over 775 square miles in 13 days with 13 fatalities, and resulting in more than $233 million in losses. Thousands of fighters from around the state and beyond responded, but found it very difficult to work together. Several fire service leaders created a revolutionary system by managing emergencies named Incident Management System (IMS).
Following the principles of IMS, we tried to implement them on Slack. The IM Lifecycle starts with declaring an incident, and we can use automated alerts and internal and external notifications. The next one is containment, which is about determining if they're responsible for driving a particular incident to resolution based on the incident source type, and priority in the transit of containment and remediation. Creating an idea is to create a communication channel that will act as a key to prevent the creation of silos, for example. In Analyze and learning, we have a lot of activities and this part is related to postmortem. After the incident, another part related to this framework is the preparation, to have the opportunity to prepare our engineering team.
INCIDENT MANAGEMENT PERSONAS
Incident Commander: He is the leader of the incident response. He is responsible for focusing the group on developing incident objectives providing direction and time management
Group Leader: His job function is created by the incident commander. He is another leader in this situation. Basically, he leads a group of subject matter experts.
Subject Matter Experts: They are the technical experts who are solving the incident job solution and they are responsible for supporting and working with the incident company to company.
PATTERNS IN THE MIDDLE OF CHAOS
Yury goes on to explain some of the past incidents that have taken place in the industry during the past years. She also talks about the patterns and the common factors amongst those incidents.
Out of sync times
Yury goes on to quote Laura Nolan, by saying, "Each outage is unique, but there are anti-patterns and sometimes we can use those to create defenses."
DOCUMENTING CHAOS ON POSTMORTEMS
A postmortem is a written record of an incident, its impact, the actions taken to mitigate it, the root cause, and the follow-up actions to prevent the incident. A postmortem is an artifact with a detailed description of exactly what went wrong in an incident.
BENEFITS OF WRITING POSTMORTEMS
A collaborative process that allows everyone to contribute what they learned and build resiliency.
An opportunity for rebuilding confidence in people who were closely involved in the incident.
A strategy to let partners, customers, and end-users know what happened and take steps for improvement.
Conducting a postmortem is an expensive and highly time-consuming task. Automation and using the anti-patterns identified after reviewing a compilation of classical postmortems will help in reducing the exhaustion of the process.
ML IN THE MIDDLE OF CHAOS
Machine learning is a process of building models that learn from data. Machine learning models are algorithms that learn patterns from data. Machine learning is a research field at the intersection of statistics, artificial intelligence, and computer science for extracting knowledge from data.
"Failures are an inevitable part of making software products and services, however, it is not necessary to repeat the mistakes of the past." Yury beautifully quotes.
Yury goes on to explain their TULP mechanism with a well-illustrated diagram. TULP is a machine tool that is being designed by the research and development team. Our idea is to use a crawler to build a data set with the information. You can use clustering in a specific algorithm of a class to identify and map each postmortem. With this information, we implement a machine learning pipeline that learns about this data and provides a classification model. To have a model while we are facing a new incident, we have a pre-configured template for postmortem and we can learn from the past.
"Time is the one resource that you can never get back. Time cannot be created, it can only be saved or wasted. Time plays a major role in chaos engineering, which is the discipline of experimenting with failures in production, to reveal their weaknesses and to build confidence in their resilience capability. Chaos adds complexity and confusion, but more importantly, adds unnecessary time to incident resolution, TTFR, TTR, MTTR, and putting the company's business at a bigger risk." saying this Yury ends her talk.