Creating a learning culture

About the speaker

Amir Shaked

Senior VP of R&D, PerimeterX

Amir Shaked is the Senior VP of Research and Development at PerimeterX, responsible for building a multi-zone distributed system that detects and mitigates automated attacks on websites in real time. Before that, he led several software engineering groups in the Israeli Prime Minister's Office, focusing on cybersecurity products, and worked for many years as a software engineer and security researcher.

About the talk

Building and maintaining a five 9s system isn't just about the tools and technologies. Development culture plays a big part in how you keep a system available while scaling it up and supporting more features, users, and locations.

A healthy learning culture that supports development, avoids repeating mistakes, and identifies weak points is another tool in the engineering toolbox.

In this talk, we will discuss how to create a learning culture using debriefs, what to avoid, and how to instill change in an engineering organization.

The talk will cover our path to culture change, examples, and common pitfalls to avoid.

Transcript

For yet another insightful session on day 2 of Chaos Carnival 2021, we had Amir Shaked join us to talk about "Creating a learning culture."

"PerimeterX is a SaaS company that provides solutions to protect modern web apps at scale. Behind the scenes, we run a large cloud-based microservices environment: about 300 microservices, fully dockerized, on over 15,000 cores. Like any other production environment, we see failures all the time. Today, we are going to cover our journey through the process of change towards a healthy and supportive learning culture, by taking those failures and building on them. This is the essence of chaos engineering," Amir says as he starts his talk.

DESTINATION

"Our goal is rapid deployment and rapid change, so that we can always provide the most adequate and up-to-date solution," Amir adds. "In a world of moving target defense, where the scope of features changes all the time because threat actors can deploy changes quickly, this is a major factor in providing a competitive product. Often, a good DevOps culture can be a differentiator and act as a competitive edge for your company. We want zero downtime. Errors and failures are acceptable when you are learning from them for the first time, but never twice."

STARTING POINT

Amir continues, "However, our starting point was not quite right. We saw repeating issues, with minor errors in code changes or configuration changes causing failures in production, and being too prone to incidents in the underlying cloud environment was affecting our stability. Those two factors are very concerning when we look at how we are going to grow and scale: looking 10-100 weeks ahead, what may be a minor risk today might become a catastrophic risk in the future."

OUR PROCESS

A typical incident starts with a customer complaining about their environment and reaching out for support. The support team works with the engineering team to analyze, understand, and solve the problem. If nothing else is done, similar issues with similar root causes can be expected to recur, so you need to seriously consider which part of the solution can be changed to make it better. Setting aside dedicated time for analysis is the only way to make sure the root causes are found and the process problems are fixed.

One lesson came from an incident in which code was deployed to production by mistake. Autoscaling had been added to handle the need to scale up, so when usage grew significantly, the autoscaler spun up containers running the buggy code. The real problem lies one step earlier: we need to understand why the misunderstanding that led to the deployment happened in the first place.
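The autoscaling incident can be illustrated with a toy model. This is only a sketch under stated assumptions: the talk does not say how the buggy build was picked up, so the mutable `latest` tag, the version names, and the `scale_up` helper below are all hypothetical.

```python
# Toy model of the incident described above. Every name here is hypothetical;
# the source does not describe the actual deployment mechanism.
registry = {"v1.4.0": "stable build", "v1.5.0-rc1": "buggy build"}
tags = {"latest": "v1.4.0"}  # a mutable tag pointing at a concrete version


def scale_up(image_ref: str, count: int) -> list:
    """Return the concrete versions that `count` new containers would run."""
    version = tags.get(image_ref, image_ref)  # tags are resolved at launch time
    if version not in registry:
        raise KeyError(f"unknown image version: {version}")
    return [version] * count


# A mistaken push moves the mutable tag to the buggy build.
tags["latest"] = "v1.5.0-rc1"

# Usage grows: the autoscaler resolves the tag now, so new containers get the bug.
unpinned = scale_up("latest", 3)

# Scaling from a pinned, verified version is unaffected by the mistaken push.
pinned = scale_up("v1.4.0", 3)
```

In this model the mistake stays invisible until scaling happens, which matches the point that the root cause sits in the deployment process rather than in the autoscaler.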

A debrief raises all sorts of questions, such as "What happened, and what was the timeline?", "What was the impact?", "How much time did it take us to identify the issue?", and "How much time did it take us to fix the issue?" Answering them well requires a change in how the team works, and change takes time. It can be held back by a lack of trust, team members blaming each other for failures, action items (AIs) that nobody follows up on, communication gaps in the team, or too narrow a focus.
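The debrief questions above map naturally onto a small record with a timeline. The sketch below is illustrative only: the `Debrief` class, its field names, and the example values are assumptions, not a template taken from the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Debrief:
    """Hypothetical debrief record; field names are illustrative only."""
    what_happened: str
    impact: str
    started_at: datetime   # when the failure began
    detected_at: datetime  # when the team identified the issue
    resolved_at: datetime  # when the fix was in place
    action_items: List[str] = field(default_factory=list)

    def time_to_identify(self) -> timedelta:
        """How long it took to notice the issue."""
        return self.detected_at - self.started_at

    def time_to_fix(self) -> timedelta:
        """How long it took to fix the issue once it was identified."""
        return self.resolved_at - self.detected_at


# Made-up example values, for illustration only.
example = Debrief(
    what_happened="Buggy build reached production",
    impact="Elevated error rate for some customers",
    started_at=datetime(2021, 1, 1, 10, 0),
    detected_at=datetime(2021, 1, 1, 10, 20),
    resolved_at=datetime(2021, 1, 1, 11, 5),
    action_items=["Add an approval step before production deploys"],
)
```

Keeping action items in the same record as the timeline makes it harder for them to go untracked after the meeting.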

These problems can be avoided by creating a safe and welcoming culture in the team, where everyone is allowed to make mistakes. Avoid assigning blame and start taking ownership of the failure; this encourages a positive approach towards learning. Ask the WHY questions gently, because they lay the foundation of learning: you need to know why something happened and why somebody did a particular thing. Keep a balance in how deep you dive into each subject. Staying consistent and calm while working through a problem helps the team analyze the data further and avoid future failures.

Amir sums up his talk by saying, "Our most notable learnings are that humans make mistakes and that this is a stepping stone for growth; realizing that gradual rollout is not a silver bullet; running a crisis-mode process with feature flagging; adopting behavior-driven development; looking for more breaking points; and treating configuration as code."
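Two of these learnings, feature flagging with a crisis mode and configuration as code, can be sketched together. This is a minimal sketch under assumptions: the talk does not describe PerimeterX's actual flag system, so the flag names, the `essential` attribute, and the `crisis_mode` switch are hypothetical.

```python
import json

# Hypothetical flag definitions. Treating configuration as code would mean
# keeping this content in version control and reviewing changes to it.
FLAGS_JSON = """
{
  "new_detection_rule": {"enabled": true, "essential": false},
  "request_blocking":   {"enabled": true, "essential": true}
}
"""


def load_flags(raw: str) -> dict:
    """Parse and validate flag definitions so that bad config fails early."""
    flags = json.loads(raw)
    for name, spec in flags.items():
        for key in ("enabled", "essential"):
            if not isinstance(spec.get(key), bool):
                raise ValueError(f"flag {name!r} needs a boolean {key!r}")
    return flags


def is_enabled(flags: dict, name: str, crisis_mode: bool = False) -> bool:
    """In crisis mode, every non-essential feature is switched off."""
    spec = flags[name]
    if crisis_mode and not spec["essential"]:
        return False
    return spec["enabled"]


flags = load_flags(FLAGS_JSON)
```

With these definitions, `is_enabled(flags, "new_detection_rule", crisis_mode=True)` returns `False` while `request_blocking` stays on, so a single switch can shed recent changes without a redeploy.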
