Creating a learning culture
For yet another insightful session on day 2 of Chaos Carnival 2021, we had Amir Shaked join us to talk about "Creating a learning culture."
"Perimetrix is a SaaS company that provides solutions to protect modern web apps at scale. Behind the scenes, we run a large cloud-based microservices environment: about 300 microservices, fully dockerized, running on over 15000 cores. Like any other production environment, we too see failures all the time. Today, we are going to cover our journey through the process of change towards a healthy and supportive learning culture, taking those failures and building on them. This is the essence of chaos engineering," Amir begins his talk.
"I set a goal of rapid deployment and rapid change, so that we can provide the most adequate and up-to-date solution," Amir adds. "In a world of moving target defense, where the scope of features changes all the time because of threat actors, being able to deploy changes quickly is a major factor in providing a competitive product. Often, a good DevOps culture can be a differentiator and act as a competitive edge for your company. We want zero downtime, and an error or failure is acceptable the first time, when you are learning from it, but never twice."
Amir continues, "However, our starting point was not ideal. We saw repeating issues, with minor errors in code changes or configuration changes causing failures in production, and being too prone to incidents in the underlying cloud environment was affecting our stability. Those two factors are very concerning when we look at how we are going to grow and scale, looking 10-100 weeks ahead: what may be a minor risk today might become a catastrophic risk in the future."
A typical incident begins with a customer who notices a problem in the environment and eventually reaches out for support. The support team works with the engineering team to analyze, understand, and solve the problem. But you can expect to face similar issues again, with similar root causes, so you need to seriously consider which part of the solution can be changed to make it better. Setting aside dedicated time for analysis is the only way to make sure that root causes are found and process problems are fixed. One such lesson came from an incident in which code was deployed to production by mistake. Autoscaling had been added to handle the need to scale up, and when usage grew significantly, the autoscaler spun up Docker containers running the buggy code. And here lies the problem: we need to understand why the misunderstanding happened in the first place.
All sorts of questions arise after an incident, such as "What happened, and on what timeline?", "What was the impact?", "How much time did it take for us to identify the issue?" and "How much time did we take to fix the issue?" Answering them well requires a change in how the team works, and change takes time. The obstacles can include a lack of trust, team members blaming each other for failures, not following up on action items (AIs), communication gaps in the team, and a narrow focus on things.
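The two timing questions above can be answered mechanically once the key timestamps of an incident are written down. The following is a minimal sketch in Python, not something shown in the talk: the `IncidentTimeline` class, its field names, and the sample timestamps are all hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    """Hypothetical record of the key timestamps in one incident."""
    started: datetime   # when the failure began affecting customers
    detected: datetime  # when the team identified the issue
    resolved: datetime  # when the fix was in place

    def time_to_detect(self) -> timedelta:
        """How long the issue went unnoticed."""
        return self.detected - self.started

    def time_to_fix(self) -> timedelta:
        """How long the fix took once the issue was identified."""
        return self.resolved - self.detected


# Illustrative values only, not data from a real incident.
incident = IncidentTimeline(
    started=datetime(2021, 2, 11, 9, 0),
    detected=datetime(2021, 2, 11, 9, 25),
    resolved=datetime(2021, 2, 11, 10, 10),
)
print("time to detect:", incident.time_to_detect())
print("time to fix:", incident.time_to_fix())
```

Recording these values the same way for every incident makes it possible to compare incidents over time without relying on anyone's memory of who did what.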
These problems can be avoided by creating a safe and welcoming culture in the team, where everyone is allowed to make mistakes. Avoid assigning blame and start taking ownership of the failure; this will help you take a positive approach towards learning. Ask the WHY questions gently, because they lay the foundation of learning: you need to know why something happened and why somebody did a particular thing. Make sure you strike a balance in how deep you dive into a subject. Staying consistent and keeping calm while solving a problem will help the team analyze the data further and avoid future failures.
Amir sums up his talk by saying, "Our most notable learnings are that humans make mistakes, and that is a stepping stone for growth; that gradual rollout is not a silver bullet; and the value of running a crisis-mode process with feature flagging, adopting behavior-driven development, looking for more breaking points, and treating configuration as code."
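To make two of these learnings concrete, feature flagging for crisis mode and configuration as code, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the speaker's implementation: the flag name `new_detection_engine`, the `kill_switch` field, and the helper functions are all hypothetical.

```python
import json

# Hypothetical flag definitions. In practice this would live in a file that is
# version-controlled and reviewed like any other code change.
FLAGS_JSON = '{"new_detection_engine": {"enabled": true, "kill_switch": false}}'


def load_flags(raw: str) -> dict:
    """Parse flag configuration and fail loudly if a value is malformed."""
    flags = json.loads(raw)
    for name, spec in flags.items():
        for key in ("enabled", "kill_switch"):
            if not isinstance(spec.get(key), bool):
                raise ValueError(f"flag {name!r}: {key!r} must be a boolean")
    return flags


def is_enabled(flags: dict, name: str) -> bool:
    """A flag is on only if enabled and not killed; unknown flags are off."""
    spec = flags.get(name)
    if spec is None:
        return False
    return spec["enabled"] and not spec["kill_switch"]


flags = load_flags(FLAGS_JSON)
print(is_enabled(flags, "new_detection_engine"))

# Crisis mode: flip the kill switch instead of rolling back a deployment.
flags["new_detection_engine"]["kill_switch"] = True
print(is_enabled(flags, "new_detection_engine"))
```

Validating the configuration at load time reflects the point that configuration changes can break production just as code changes can, so they deserve the same checks.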