It won't make noise when it breaks
For yet another insightful session on day 2 of Chaos Carnival 2021, we had with us Piyush Verma, Co-founder and CTO of 9.io, who spoke about "It won't make noise when it breaks".
Piyush kickstarts his talk by saying, "There are many types of failures that we face in our systems, but today we will be talking about not software failures but the flavors of failures in humans, network, processes, and culture. The ambition here is to identify the root cause of a failure and eradicate it using relevant tools. The beauty of these tools is that they look pretty simple and can avoid the most massive failures in the system."
Systems fail, but the real failures are the ones from which we learn nothing. This talk is a tale of a few such failures that went right under our noses, and what we did to prevent them. The topics covered include heterogeneous systems, unordered events, missing correlations, and human errors.
One of the most common outages a system faces looks like this: customer service reports that login is down, yet nothing wrong shows up in tracing, servers, load, or errors. In this case, the real problem was that one of the DevOps engineers had manually altered the security groups and accidentally deleted the rule for port 443. Every component was functioning well; the fault sat in a place where the usual manual checks would never have found it.
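A check like the following could catch a deleted port rule before customers do. This is only a minimal sketch, not something shown in the talk: the `IngressRule` shape and the `missing_ports` helper are hypothetical, and in practice the rules would be fetched from the cloud provider rather than hard-coded.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical, simplified view of one inbound security group rule."""
    protocol: str
    from_port: int
    to_port: int


def missing_ports(rules: Iterable[IngressRule], required: Iterable[int]) -> List[int]:
    """Return every required port that no TCP ingress rule covers."""
    rules = list(rules)
    missing = []
    for port in required:
        covered = any(
            r.protocol == "tcp" and r.from_port <= port <= r.to_port
            for r in rules
        )
        if not covered:
            missing.append(port)
    return missing


if __name__ == "__main__":
    # Sample data only: the 443 rule has been removed by hand.
    current_rules = [IngressRule("tcp", 22, 22), IngressRule("tcp", 80, 80)]
    gaps = missing_ports(current_rules, required=[80, 443])
    if gaps:
        print(f"ALERT: no ingress rule for ports {gaps}")
```

Run on a schedule, a check of this kind turns a silent configuration change into an alert, even when every server and application metric looks healthy.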
At 7:30 AM on a given day, 25 hours before their country launch, PagerDuty goes off. Elasticsearch shows 500s on the loose. Logs start coming in at a speed of 1 Mbps, and the team starts copying them. After a few minutes, the 500s stop, and the PagerDuty alert is auto-resolved.
Five minutes later, PagerDuty goes off again and the API is unreachable. The same story repeats: the email alerts keep coming, and the system is still not working. It looks like crisis mode, because they are supposed to go live in a day and things are broken. They check Rundeck because, whenever they have faced a situation like this, they ask the team who made what changes, and that is how they usually learn where the error lies. Even though the APMs, servers, load, and Docker all looked perfectly fine, they could not identify the exact error. Twenty hours later, they found that the mount command had not run on a database shard; the machine had then rebooted, and the data was wiped.
Piyush adds, "Validation of configuration management might have been the solution for this and that's probably the lesson we learned."
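One way to picture such a validation is a check that every expected data directory is actually a mount point. The sketch below is an assumption-laden illustration, not the team's tooling: the helper names and the sample paths are hypothetical, and the input is text in the format of Linux's `/proc/mounts` (device, mount point, filesystem type, options).

```python
from typing import Iterable, List, Set


def mounted_paths(mounts_text: str) -> Set[str]:
    """Collect mount points from text in /proc/mounts format.

    The second whitespace-separated field of each line is the mount point.
    Paths containing spaces are escaped in /proc/mounts and are not
    decoded by this sketch.
    """
    paths = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            paths.add(fields[1])
    return paths


def unmounted(expected: Iterable[str], mounts_text: str) -> List[str]:
    """Return the expected mount points that are not currently mounted."""
    present = mounted_paths(mounts_text)
    return [path for path in expected if path not in present]


if __name__ == "__main__":
    # Sample data only; on a Linux host this text would come from
    # open("/proc/mounts").read().
    sample = (
        "/dev/sda1 / ext4 rw,relatime 0 0\n"
        "tmpfs /run tmpfs rw,nosuid 0 0\n"
    )
    # "/var/lib/db" is a hypothetical data directory for a database shard.
    print(unmounted(["/", "/var/lib/db"], sample))
```

If the shard's data directory is not mounted, the database writes to the underlying disk instead, and the gap only surfaces after a reboot. Failing the deployment when `unmounted` returns anything makes the problem loud at the moment it is introduced.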
Asking the right questions is an important part of the analysis. How did the system work so far despite the potential for failure? If one failure has been detected, what other failures can we identify? These are some of the basic questions a team should ask after a failure.
Every time there is a failure, there is a root cause analysis and a vow not to repeat the mistake. I will take some curious failures that I have dealt with in the past decade of my work with infrastructure systems and the steps we had to undertake to:
Limit the spread
Prevent it from happening again
An unreplicated Consul configuration resulted in data loss 25 hours before a countrywide launch. It took a staggering 5 engineers and 20 hours to find one single line of change.
A failed distributed lock in etcd forced us to rewrite the whole storage on Consul, with hours of migration, only to find later that it was a clock issue.
"The above isolation and immediate fixes were painfully long, yet doable.
The real ambition was to prevent similar incidents from repeating. I will share samples of some of our RCAs, what was missing in each of those versions, and what the resultant RCA looks like. This section does touch briefly upon blameless RCA, but the real point of focus is the actionability of an RCA," Piyush elaborates.
In this part of the talk, Piyush showcases some of the in-house frameworks and technologies (easy to replicate) that were built to turn the prevention/alert section of RCAs into lines of code rather than blurbs of text. The goal is to advocate the need to build or adopt toolchains that promise early detection and not just faster resolution.
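The in-house frameworks themselves are not described here, so the following is only a hypothetical sketch of the idea: an RCA record whose prevention items are executable checks instead of prose. The `RCA` class, its `prevention` decorator, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RCA:
    """Hypothetical RCA record that carries its prevention steps as code."""
    title: str
    root_cause: str
    checks: List[Tuple[str, Callable[[], bool]]] = field(default_factory=list)

    def prevention(self, description: str):
        """Decorator that registers a check returning True when healthy."""
        def register(fn: Callable[[], bool]) -> Callable[[], bool]:
            self.checks.append((description, fn))
            return fn
        return register

    def failing(self) -> List[str]:
        """Run every registered check and return descriptions of failures."""
        return [desc for desc, fn in self.checks if not fn()]


if __name__ == "__main__":
    rca = RCA(
        title="Login outage",
        root_cause="Port 443 rule deleted during a manual security group edit",
    )

    # Sample state only; a real check would query live infrastructure.
    open_ports = {22, 80}

    @rca.prevention("Ingress rule for port 443 exists")
    def port_443_open() -> bool:
        return 443 in open_ports

    print(rca.failing())
```

Keeping the check next to the RCA it came from means the lesson keeps running after the document is filed, which is the early-detection property argued for above.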
At the end of his talk, Piyush mentions, "Taking feedback from our team or customers has been beneficial and we use this to understand and analyze for the better. We will be really happy to put this on an open domain where everyone can benefit from it."