Tulp: Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents

ChaosWheel

About the speaker

Yury Niño Roa

Yury Niño Roa

Senior Site Reliability Engineer,

ADL Digital Lab

Yury Niño is a DevOps Engineer in Aval Digital Labs. She is a Software Engineer with 6+ years of experience designing, implementing, and managing the development of software applications using agile methodologies such as scrum and kanban. She also has 2+ years of hands-on experience supporting, automating, and optimizing mission-critical deployments, leveraging configuration management, CI/CD, and DevOps processes. She is interested in solving performance, resilience, and reliability issues.

About the talk

Web applications living in an imperfect world, exposed to inevitable incidents. It imposes challenges for Artificial Intelligence and Chaos Engineering. Aware of this, our team designed to Tulp integrating these disciplines to Incident Managing to provide useful information for solving a new outage.

Introduction

The infrastructure required by a software system can be as complex as the software itself. Web applications living in this complex and imperfect world, in which outages are around the corner, are exposed to inevitable incidents. It imposes challenges for many fields, including novel techniques such as Artificial Intelligence and Chaos Engineering. Aware of this, our team designed a platform named Tulp that integrates the knowledge that we have regarding Incident Managing, Artificial Intelligence, and Chaos Engineering. Although every production failure is unique, Tulp should be able to learn from previous incidents in order to provide useful information for solving a new outage. Tulp classifies post mortems available on the web, with the objective of getting the most important information that could be used when we are facing a new storm. We would like to tell you about our journey, sharing our experience, what we learned, and the challenges that we identified designing and implementing Tulp.

Methodology

  • Present a background of Incident Management.
  • Analyze the anatomy of one postmortem.
  • Find common patterns in them.
  • Present the design of a tool that uses AI to build models from postmortems.
  • Show our learnings designing and building a POC for this.

Transcript

Tulp: Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents

For yet another insightful session on day 2 of  Chaos Carnival 2021, we had with us Yury Nino Rao, a Chaos Engineering Advocate from ADL Digital Labs, to speak about "Integrating Artificial Intelligence and Chaos Engineering to Learn from the Incidents".

Yury starts her talk by quoting the author of Incident Management for Operations, He"Incidents are not concerned about making quick decisions. Incidents are about making the best decisions in the shortest amount of time. Time is the most important commodity that exists."

INCIDENTS ARE INEVITABLE!

An incident is an occurrence, either human-caused or a natural phenomenon, that requires action by emergency services personnel to prevent or minimize loss of life or damage. Incidents are happening all over the world so the response to those incidents comes in many forms from one person working on an issue to a group of people dealing with a conference. For example, we can learn Incident Management from wildfires. In 1970, a series of devastating wildfires swept across California, destroying more than 700 homes over 775 square miles in 13 days with 13 fatalities, and resulting in more than $233 million in losses. Thousands of fighters from around the state and beyond responded, but found it very difficult to work together. Several fire service leaders created a revolutionary system by managing emergencies named Incident Management System (IMS).

Following the principles of IMS, we tried to implement them on Slack. The IM Lifecycle starts with declaring an incident, and we can use automated alerts and internal and external notifications. The next one is containment, which is about determining if they're responsible for driving a particular incident to resolution based on the incident source type, and priority in the transit of containment and remediation. Creating an idea is to create a communication channel that will act as a key to prevent the creation of silos, for example. In Analyze and learning, we have a lot of activities and this part is related to postmortem. After the incident, another part related to this framework is the preparation, to have the opportunity to prepare our engineering team.

INCIDENT MANAGEMENT PERSONAS

  1. Incident Commander: He is the leader of the incident response. He is responsible for focusing the group on developing incident objectives providing direction and time management 

  2. Group Leader: His job function is created by the incident commander. He is another leader in this situation. Basically, he leads a group of subject matter experts.

  3. Subject Matter Experts: They are the technical experts who are solving the incident job solution and they are responsible for supporting and working with the incident company to company.

PATTERNS IN THE MIDDLE OF CHAOS

Yury goes on to explain some of the past incidents that have taken place in the industry during the past years. She also talks about the patterns and the common factors amongst those incidents.

  1. Hardware issues

  2. Configuration errors

  3. Security vulnerabilities

  4. Hitting limits

  5. Conflicts

  6. Out of sync times

  7. Slowness

  8. Automation interactions

Yury goes on to quote Laura Nolan, by saying, "Each outage is unique, but there are anti-patterns and sometimes we can use those to create defenses."

DOCUMENTING CHAOS ON POSTMORTEMS

A postmortem is a written record of an incident, its impact, the actions taken to mitigate it, the root cause, and the follow-up actions to prevent the incident. A postmortem is an artifact with a detailed description of exactly what went wrong in an incident.

BENEFITS OF WRITING POSTMORTEMS

  1. A collaborative process that allows everyone to contribute what they learned and build resiliency.

  2. An opportunity for rebuilding confidence in people who were closely involved in the incident.

  3. A strategy to let partners, customers, and end-users know what happened and take steps for improvement.

Conducting a postmortem is an expensive and highly time-consuming task. Automation and using the anti-patterns identified after reviewing a compilation of classical postmortems will help in reducing the exhaustion of the process.

ML IN THE MIDDLE OF CHAOS

Machine learning is a process of building models that learn from data. Machine learning models are algorithms that learn patterns from data. Machine learning is a research field at the intersection of statistics, artificial intelligence, and computer science for extracting knowledge from data.

"Failures are an inevitable part of making software products and services, however, it is not necessary to repeat the mistakes of the past." Yury beautifully quotes.

TULP

Yury goes on to explain their TULP mechanism with a well-illustrated diagram. TULP is a machine tool that is being designed by the research and development team. Our idea is to use a crawler to build a data set with the information. You can use clustering in a specific algorithm of a class to identify and map each postmortem. With this information, we implement a machine learning pipeline that learns about this data and provides a classification model. To have a model while we are facing a new incident, we have a pre-configured template for postmortem and we can learn from the past.

"Time is the one resource that you can never get back. Time cannot be created, it can only be saved or wasted. Time plays a major role in chaos engineering, which is the discipline of experimenting with failures in production, to reveal their weaknesses and to build confidence in their resilience capability. Chaos adds complexity and confusion, but more importantly, adds unnecessary time to incident resolution, TTFR, TTR, MTTR, and putting the company's business at a bigger risk." saying this Yury ends her talk.

Putting Chaos into Continuous Delivery - how to increase the resilience of your applications
Putting Chaos into Continuous Delivery - how to increase the resilience of your applications
Eugenio Marzo
Chaos Engineering is fun!

Sign Up

for our Newsletter

Get tips, best practices and updates on chaos engineeering in cloud native.

Videos

by Experts

Checkout our videos from the latest conferences and events

Our Videos

RELATED BLOGS

Site Reliability Engineering: An Introduction

July 29, 2021   |   7 MIN READ