It won't make a noise when it breaks

About the speaker

Piyush Verma

Co-Founder & CTO, Last9

Piyush Verma is the Co-founder and CTO at Last9.io, an SRE platform that aims to reduce the toil SREs and decision-makers go through and to shorten the time it takes to make a decision. Earlier, he led SRE at TrustingSocial.com to produce 600 million credit scores a day across 4 countries.

In his past life, he built oogway.in (exit to TrustingSocial.com), datascale.io (exit to Datastax), and siminars.com.

About the talk

Systems fail, but the real failures are the ones from which we learn nothing. This talk is a tale of a few such failures that went right under our noses and what we did to prevent them. The topics covered include heterogeneous systems, unordered events, missing correlations, and human errors.

Every time there is a failure, there is a root cause analysis and a vow not to repeat the mistake. I will take some curious failures that I have dealt with in the past decade of my work with infrastructure systems and the steps we had to undertake to:

Isolate
Limit the spread
Prevent it from happening again

Failure 1

An unreplicated Consul configuration results in data loss 25 hours before a countrywide launch. It took a staggering 5 engineers and 20 hours to find a single changed line.

Failure 2

A failed distributed lock in etcd forced us to rewrite the whole storage on Consul and spend hours on migration, only to find later that it was a clock issue.

The isolation and immediate fixes above were painfully long, yet doable. The real ambition was to prevent similar incidents from repeating. I will share samples of some of our RCAs, what was missing from each of those versions, and what the resulting RCA looks like. This section touches briefly on blameless RCAs, but the real focus is the actionability of an RCA.

Failure 3

In this section, I will showcase some of the in-house frameworks and technologies (easy to replicate) that were built to turn the prevention/alert section of RCAs into lines of code rather than blurbs of text. The goal of this section is to advocate building or adopting toolchains that promise early detection and not just faster resolution.

Transcript

It won't make a noise when it breaks

For yet another insightful session on day 2 of Chaos Carnival 2021, we had with us Piyush Verma, Co-founder and CTO of Last9.io, to speak about "It won't make a noise when it breaks".

Piyush kickstarts his talk by saying, "There are many types of failures that we face in our systems, but today we will be talking about not software failures but the flavors of failures in humans, network, processes, and culture. The ambition here is to identify the root cause of a failure and eradicate it using relevant tools. The beauty of these tools is that they look pretty simple and can avoid the most massive failures in the system."

Systems fail, but the real failures are the ones from which we learn nothing. This talk is a tale of a few such failures that went right under our noses, and what we did to prevent them. The topics covered include heterogeneous systems, unordered events, missing correlations, and human errors.

OUTAGE #1

One of the most common outages looks like this: customer service reports that login is down, yet nothing wrong shows up in tracing, servers, load, or errors. The real problem here was that one of the DevOps engineers had manually altered the security groups and accidentally deleted the rule for port 443. Every component was functioning well; the fault sat in a place where nobody would have found it by looking manually.
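
A deleted firewall rule of this kind can be caught by comparing the live rules against the ports that must stay open. The following is a minimal sketch, not the team's actual tooling: the `IngressRule` shape and the `missing_ports` helper are hypothetical, and the rules are plain data rather than the output of any cloud SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """One inbound firewall rule (hypothetical shape, not a cloud SDK type)."""
    protocol: str
    from_port: int
    to_port: int


def missing_ports(rules, required_ports, protocol="tcp"):
    """Return the required ports that no rule of the given protocol covers."""
    missing = []
    for port in sorted(required_ports):
        covered = any(
            r.protocol == protocol and r.from_port <= port <= r.to_port
            for r in rules
        )
        if not covered:
            missing.append(port)
    return missing


# Snapshot after the accidental deletion: only SSH and HTTP remain.
current = [IngressRule("tcp", 22, 22), IngressRule("tcp", 80, 80)]
print(missing_ports(current, {80, 443}))  # prints [443]
```

Run on a schedule, or after every manual change, a check like this turns a silent misconfiguration into an explicit alert.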

OUTAGE #2

At 7:30 AM on a given day, 25 hours before their countrywide launch, PagerDuty goes off. Elasticsearch shows 500s on the loose. Logs start coming in at a speed of 1 Mbps, and the team starts copying them. After a few minutes, the 500s stop, and the PagerDuty alert is auto-resolved.

Five minutes later, PagerDuty goes off again and the API is unreachable. The same story repeats: they keep getting email alerts, and it's still not working. It looks like crisis mode, because they were supposed to go live in a day and things are not working. They check Rundeck, because whenever they have faced this kind of problem, they ask the team who made which changes, and that is how they find where the error lies. Even though the APMs, servers, load, and Docker all looked perfectly fine, they could not identify what exactly the error was. 20 hours later, they found that the mount command had not been run on a database shard; the shard had then rebooted, and the data was wiped.

Piyush adds, "Validation of configuration management might have been the solution for this and that's probably the lesson we learned."
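
One way to validate such configuration is to refuse to start a database until its data directory is confirmed to be a mount point. The sketch below assumes Linux `/proc/mounts`-style text as input; the helper names and the `/data/shard-3` path are hypothetical, and the summary does not describe the check the team actually adopted.

```python
def mounted_paths(mounts_text):
    """Parse /proc/mounts-style text and return the set of mount points."""
    paths = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # Field order is: device, mount point, fs type, options, ...
            paths.add(fields[1])
    return paths


def missing_mounts(mounts_text, required):
    """Return required mount points that are absent, in sorted order."""
    return sorted(set(required) - mounted_paths(mounts_text))


SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
"""

# /data/shard-3 is a hypothetical data directory for one database shard.
problems = missing_mounts(SAMPLE, ["/", "/data/shard-3"])
if problems:
    print("refusing to start database; not mounted:", problems)
```

On a real host, the text could come from reading `/proc/mounts`; keeping the parser separate from the file read makes the check easy to test.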

Asking the right questions is an important part of the analysis. How did the system work so far despite this potential for failure? If one failure has been detected, what other ones can we identify? These are some of the basic questions the team should ask after a failure.

Every time there is a failure, there is a root cause analysis and a vow not to repeat the mistake. I will take some curious failures that I have dealt with in the past decade of my work with infrastructure systems and the steps we had to undertake to:

Isolate
Limit the spread
Prevent it from happening again

Failure 1

An unreplicated Consul configuration results in data loss 25 hours before a countrywide launch. It took a staggering 5 engineers and 20 hours to find a single changed line.

Failure 2

A failed distributed lock in etcd forced us to rewrite the whole storage on Consul and spend hours on migration, only to find later that it was a clock issue.
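
The summary does not say how the clock issue broke the lock. As a generic illustration only, the sketch below shows one well-known way clocks can undermine a lease-based lock: a holder whose local clock runs behind keeps believing its lease is live after the lock service has already handed the lock to someone else. The node names, the 10-second lease, and the `simulate` helper are all assumptions made for this example.

```python
LEASE_SECONDS = 10.0


def still_holds(local_clock, lease_expiry):
    """A holder trusts its own clock to decide whether its lease is live."""
    return local_clock < lease_expiry


def simulate(true_now, skew_a):
    """Return the nodes that believe they hold the lock at true_now.

    Node A acquired the lease at true time 0. skew_a is how far A's
    clock is ahead (+) or behind (-) the lock service's clock.
    """
    a_expiry = 0.0 + LEASE_SECONDS
    holders = []
    if still_holds(true_now + skew_a, a_expiry):
        holders.append("A")
    # Once the lease has expired on the service's clock, B is granted the lock.
    if true_now >= a_expiry:
        holders.append("B")
    return holders


print(simulate(12.0, 0.0))   # clocks agree: only B holds the lock
print(simulate(12.0, -5.0))  # A is 5 s behind: both A and B think they hold it
```

Monitoring clock offset between nodes, and alerting when it exceeds a small fraction of the lease duration, is one way to surface this class of problem before it looks like a storage bug.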

"The above Isolation and immediate fixes were painfully long, yet doable.

The real ambition was to prevent similar incidents from repeating. I will share samples of some of our RCAs, what was missing from each of those versions, and what the resulting RCA looks like. This section touches briefly on blameless RCAs, but the real focus is the actionability of an RCA." Piyush elaborates.

Failure 3

In this section, Piyush showcases some of the in-house frameworks and technologies (easy to replicate) that were built to turn the prevention/alert section of RCAs into lines of code rather than blurbs of text. The goal of this section is to advocate building or adopting toolchains that promise early detection and not just faster resolution.
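
The summary does not show the in-house frameworks themselves. As a rough sketch of the idea, each RCA's prevention item can be registered as an executable check tied to the RCA it came from, so that a regression is reported against that RCA. Everything here is hypothetical: the `prevents` decorator, the RCA identifiers, and the shape of the `state` dictionary.

```python
from typing import Callable, List, Tuple

# Registry of (rca_id, description, check function).
CHECKS: List[Tuple[str, str, Callable[[dict], bool]]] = []


def prevents(rca_id, description):
    """Register a function as the executable prevention step of an RCA."""
    def register(fn):
        CHECKS.append((rca_id, description, fn))
        return fn
    return register


@prevents("RCA-001", "HTTPS ingress must stay open")
def https_open(state):
    return 443 in state.get("open_ports", [])


@prevents("RCA-002", "every shard data directory must be a mount point")
def shards_mounted(state):
    return set(state.get("shard_dirs", [])) <= set(state.get("mounts", []))


def run_checks(state):
    """Return (rca_id, description) for every check that fails."""
    return [(rid, desc) for rid, desc, fn in CHECKS if not fn(state)]


# A hypothetical snapshot of the environment, gathered elsewhere.
state = {
    "open_ports": [22, 80],
    "shard_dirs": ["/data/shard-3"],
    "mounts": ["/", "/data/shard-3"],
}
for rid, desc in run_checks(state):
    print(f"{rid} regression: {desc}")
```

Because every check carries its RCA identifier, a failing check points straight back to the incident that motivated it, which is what makes the RCA actionable rather than archival.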

At the end of his talk, Piyush mentions, "Taking feedback from our team or customers has been beneficial and we use this to understand and analyze for the better. We will be really happy to put this on an open domain where everyone can benefit from it."
