It won't make a noise when it breaks

ChaosWheel

About the speaker

Piyush Verma

Co-Founder & CTO, Last9

Piyush Verma is the Co-founder and CTO at Last9.io, an SRE platform that aims to minimize the toil SREs and decision-makers go through and to reduce the time it takes to make a decision. Earlier, he led SRE at TrustingSocial.com to produce 600 million credit scores a day across 4 countries.

In his past life, he built oogway.in (exit to TrustingSocial.com), datascale.io (exit to Datastax), and siminars.com.

About the talk

Systems fail, but the real failures are the ones from which we learn nothing. This talk is a tale of a few such failures that went right under our noses and what we did to prevent them. The topics covered range from heterogeneous systems and unordered events to missing correlations and human errors.

Every time there is a failure, there is a root cause analysis and a vow not to repeat the mistake. I will take some curious failures that I have dealt with in the past decade of my work with infrastructure systems and the steps we had to undertake to:

  1. Isolate
  2. Limit the spread
  3. Prevent from happening again

Failure 1

An un-replicated Consul configuration results in data loss 25 hours before a countrywide launch. It took a staggering 5 engineers and 20 hours to find one single line of change.

Failure 2

A failed distributed lock in etcd forced us to rewrite the whole storage on Consul, followed by hours of migration, only to find later that it was a clock issue.
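The abstract does not say how the clock issue broke the lock. As a minimal sketch of one common way this can happen, assume a lock whose lease expiry is judged against each client's own wall clock. `NaiveLeaseLock` and `simulate` are hypothetical names for illustration only, and this is not how etcd implements leases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float  # wall-clock timestamp written by the holder


class NaiveLeaseLock:
    """Hypothetical lock whose expiry is judged by each caller's own clock."""

    def __init__(self) -> None:
        self.lease: Optional[Lease] = None

    def try_acquire(self, client: str, local_now: float, ttl: float) -> bool:
        # The lease counts as free if the caller's clock says it has expired.
        if self.lease is None or local_now >= self.lease.expires_at:
            self.lease = Lease(holder=client, expires_at=local_now + ttl)
            return True
        return False


def simulate(skew_seconds: float) -> bool:
    """Return True if client B takes the lock while A's lease is still valid."""
    lock = NaiveLeaseLock()
    true_time = 1000.0
    ttl = 10.0
    # Client A acquires the lock with an accurate clock.
    assert lock.try_acquire("A", local_now=true_time, ttl=ttl)
    # Two seconds later B tries, with a clock that runs ahead by skew_seconds.
    true_time += 2.0
    return lock.try_acquire("B", local_now=true_time + skew_seconds, ttl=ttl)
```

With no skew, `simulate(0.0)` leaves the lock with A. With B's clock 30 seconds ahead, `simulate(30.0)` hands the lock to B even though only two real seconds of A's ten-second lease have passed, so both clients believe they hold it.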

The isolation and immediate fixes above were painfully long, yet doable. The real ambition was to prevent similar incidents from repeating. I will share samples of some of our RCAs, what was missing from each of those versions, and what the resultant RCA looks like. This section touches briefly upon blameless RCAs, but the real point of focus is the actionability of an RCA.

Failure 3

In this section, I will showcase some of the in-house frameworks and technologies (easy to replicate) that were built to turn the prevention/alert section of RCAs into lines of code rather than blurbs of text. The goal of this section is to advocate the need to build or adopt toolchains that promise early detection and not just faster resolution.
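The abstract does not describe the in-house frameworks themselves. As a hedged sketch of the general idea, a prevention item from an RCA can be written as an executable check instead of a paragraph. Everything below is hypothetical: the `PreventionCheck` type, the RCA ids, the metric names, and the thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Metrics = Dict[str, float]


@dataclass(frozen=True)
class PreventionCheck:
    """One prevention item from an RCA, expressed as a predicate over metrics."""
    rca_id: str
    description: str
    is_healthy: Callable[[Metrics], bool]


# Hypothetical checks loosely modelled on the two failures in the abstract.
CHECKS: Sequence[PreventionCheck] = (
    PreventionCheck(
        rca_id="RCA-hypothetical-1",
        description="Configuration store must be replicated to at least 3 nodes",
        is_healthy=lambda m: m.get("config_replicas", 0) >= 3,
    ),
    PreventionCheck(
        rca_id="RCA-hypothetical-2",
        description="Clock offset between nodes must stay under 0.5 seconds",
        # A missing metric is treated as unhealthy rather than silently passing.
        is_healthy=lambda m: abs(m.get("max_clock_offset_seconds", float("inf"))) < 0.5,
    ),
)


def failing_checks(
    metrics: Metrics, checks: Sequence[PreventionCheck] = CHECKS
) -> List[PreventionCheck]:
    """Return the checks that the given metrics snapshot violates."""
    return [check for check in checks if not check.is_healthy(metrics)]
```

Run against a periodic metrics snapshot, a non-empty result from `failing_checks` can raise an alert before the original incident repeats, which is the early-detection goal the section argues for.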

