Tulp: Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents

Web applications living in an imperfect world, exposed to inevitable incidents. It imposes challenges for Artificial Intelligence and Chaos Engineering. Aware of this, our team designed to Tulp integrating these disciplines to Incident Managing to provide useful information for solving a new outage.

Introduction

The infrastructure required by a software system can be as complex as the software itself. Web applications living in this complex and imperfect world, in which outages are around the corner, are exposed to inevitable incidents. It imposes challenges for many fields, including novel techniques such as Artificial Intelligence and Chaos Engineering. Aware of this, our team designed a platform named Tulp that integrates the knowledge that we have regarding Incident Managing, Artificial Intelligence, and Chaos Engineering. Although every production failure is unique, Tulp should be able to learn from previous incidents in order to provide useful information for solving a new outage. Tulp classifies post mortems available on the web, with the objective of getting the most important information that could be used when we are facing a new storm. We would like to tell you about our journey, sharing our experience, what we learned, and the challenges that we identified designing and implementing Tulp.

Methodology

Present a background of Incident Management.
Analyze the anatomy of one postmortem.
Find common patterns in them.
Present the design of a tool that uses AI to build models from postmortems.
Show our learnings designing and building a POC for this.

Tulp: Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents

For yet another insightful session on day 2 of Chaos Carnival 2021, we had with us Yury Nino Rao, a Chaos Engineering Advocate from ADL Digital Labs, to speak about "Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents".

Yury starts her talk by quoting the author of Incident Management for Operations, He"Incidents are not concerned about making quick decisions. Incidents are about making the best decisions in the shortest amount of time. Time is the most important commodity that exists."

INCIDENTS ARE INEVITABLE!

An incident is an occurrence, either human-caused or a natural phenomenon, that requires action by emergency services personnel to prevent or minimize loss of life or damage. Incidents are happening all over the world so the response to those incidents comes in many forms from one person working on an issue to a group of people dealing with a conference. For example, we can learn Incident Management from wildfires. In 1970, a series of devastating wildfires swept across California, destroying more than 700 homes over 775 square miles in 13 days with 13 fatalities, and resulting in more than $233 million in losses. Thousands of fighters from around the state and beyond responded, but found it very difficult to work together. Several fire service leaders created a revolutionary system by managing emergencies named Incident Management System (IMS).

Following the principles of IMS, we tried to implement them on Slack. The IM Lifecycle starts with declaring an incident, and we can use automated alerts and internal and external notifications. The next one is containment, which is about determining if they're responsible for driving a particular incident to resolution based on the incident source type, and priority in the transit of containment and remediation. Creating an idea is to create a communication channel that will act as a key to prevent the creation of silos, for example. In Analyze and learning, we have a lot of activities and this part is related to postmortem. After the incident, another part related to this framework is the preparation, to have the opportunity to prepare our engineering team.

INCIDENT MANAGEMENT PERSONAS

Incident Commander: He is the leader of the incident response. He is responsible for focusing the group on developing incident objectives providing direction and time management
Group Leader: His job function is created by the incident commander. He is another leader in this situation. Basically, he leads a group of subject matter experts.
Subject Matter Experts: They are the technical experts who are solving the incident job solution and they are responsible for supporting and working with the incident company to company.

PATTERNS IN THE MIDDLE OF CHAOS

Yury goes on to explain some of the past incidents that have taken place in the industry during the past years. She also talks about the patterns and the common factors amongst those incidents.

Hardware issues
Configuration errors
Security vulnerabilities
Hitting limits
Conflicts
Out of sync times
Slowness
Automation interactions

Yury goes on to quote Laura Nolan, by saying, "Each outage is unique, but there are anti-patterns and sometimes we can use those to create defenses."

DOCUMENTING CHAOS ON POSTMORTEMS

A postmortem is a written record of an incident, its impact, the actions taken to mitigate it, the root cause, and the follow-up actions to prevent the incident. A postmortem is an artifact with a detailed description of exactly what went wrong in an incident.

BENEFITS OF WRITING POSTMORTEMS

A collaborative process that allows everyone to contribute what they learned and build resiliency.
An opportunity for rebuilding confidence in people who were closely involved in the incident.
A strategy to let partners, customers, and end-users know what happened and take steps for improvement.

Conducting a postmortem is an expensive and highly time-consuming task. Automation and using the anti-patterns identified after reviewing a compilation of classical postmortems will help in reducing the exhaustion of the process.

ML IN THE MIDDLE OF CHAOS

Machine learning is a process of building models that learn from data. Machine learning models are algorithms that learn patterns from data. Machine learning is a research field at the intersection of statistics, artificial intelligence, and computer science for extracting knowledge from data.

"Failures are an inevitable part of making software products and services, however, it is not necessary to repeat the mistakes of the past." Yury beautifully quotes.

TULP

Yury goes on to explain their TULP mechanism with a well-illustrated diagram. TULP is a machine tool that is being designed by the research and development team. Our idea is to use a crawler to build a data set with the information. You can use clustering in a specific algorithm of a class to identify and map each postmortem. With this information, we implement a machine learning pipeline that learns about this data and provides a classification model. To have a model while we are facing a new incident, we have a pre-configured template for postmortem and we can learn from the past.

"Time is the one resource that you can never get back. Time cannot be created, it can only be saved or wasted. Time plays a major role in chaos engineering, which is the discipline of experimenting with failures in production, to reveal their weaknesses and to build confidence in their resilience capability. Chaos adds complexity and confusion, but more importantly, adds unnecessary time to incident resolution, TTFR, TTR, MTTR, and putting the company's business at a bigger risk." saying this Yury ends her talk.

Tulp: Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents

Tulp: Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents

Videos

by Experts

Related Blog

LitmusChaos moves to CNCF Incubator

Start your reliability journey in minutes