Clear the ring for Chaos Engineering at Vertrieb Deutsche Bahn! One year of sensations and attractions!

Let's talk about some experiences, learnings and failures. You're asking about our learning environment? Well, there you go: 300+ microservices, 100+ developers, 100+ GameDays, and countless experiments with various outcomes :)

Have you ever been to the circus?

As software developers, we have much more in common with circus trainers than we think. On the one hand, we have to tame and maintain a steadily growing zoo of technologies and on the other hand, the undesirable audience expects us to show more and more astonishing features within a short period of time. On top of that, we're also shooting right into the circus ring with our big CI/CD cannon. That can lead to interesting effects, because we're shooting while a show is running.

In this session you will learn about our top 5 excuses not to do Chaos Engineering and how even the most hostile environment can be used to do Chaos Engineering.

Whatever your role is, product owner or developer, we will share our experience in establishing a chaos engineering centered culture at DB Vertrieb for future ringmasters to withstand turbulent conditions in production and ultimately to satisfy your customer's expectations.

Clear the ring for Chaos Engineering at Vertrieb Deutsche Bahn

For yet another enlightening session on day 2 of Chaos Carnival 2021, we were joined by Maik Figura and Oliver Kracht from Vertrieb Deutsche Bahn, who spoke about "Clear the ring for Chaos Engineering".

Oliver started his talk by saying, "We have been using Chaos Engineering for about a year now. Our IT systems had grown historically to the point that we had to renew them because they were decades old. Support for the technology had to be discontinued because it was not cloud-ready. With our new platform, we have the possibility of rolling out stable features and changes. But all this comes at a cost: we had new teams, new technology, and new processes."

Oliver defines Chaos Engineering as "the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions in production."

Oliver continues to explain, "In the initial stages, our infrastructure was monolithic yet sturdy and durable. It was the surrounding components of the system that were much older and programmed in very old languages. With this kind of setup, it was hard for us to meet the demands of our new, modern customers. So, within three months, we revamped our system and improved the customer experience. But the business and the developers still needed more in terms of new features, faster delivery, etc., all at a modern cadence. Our services run 24x7 and in parallel to each other, and we keep updating them while they are running. In our software delivery process, we moved to cloud-ready technology. We use domain-driven design to separate our concerns on continuous deployment. Our quality assurance has changed from manual testing processes to fully automated testing. A powerful CI/CD cannon allows you to shoot features directly into production systems. KCI (Komfort Check-In) is one such new feature on our platform, and it connects to the current old monolith. This particular feature went down, and there were many complaints from customers."

Maik takes over and further explains, "Even after changing the technology, culture, responsibility, processes, etc., there was still something more that needed to change. Was it the code coverage? Was it more governance? Was it more technical acceptance tests? Was it better tracking? Was it more documentation? What we needed was some special testing to avoid such outages. Our main measure of success was to achieve fast delivery with fewer errors and with better UX - making customers happy. We introduced operational and performance tests in both parts of Deutsche Bahn, Operations and Development. We hoped that this would benefit us in production, but we were wrong. This time, we got inspired by Netflix, since we have a similar microservices infrastructure, and tried Chaos Monkey as the chaos testing tool for our Kubernetes orchestration."

Oliver quotes Russ Miles: "Thoughtful and planned experiments that reveal the weaknesses in our socio-technical systems before they appear in production."

DOING A CHAOS EXPERIMENT

Pick an aspect of the system you want to experiment on.
Plan your experiment.
Prepare an environment (traffic, monitoring, access rights, etc.).
Measure the steady-state.
Monitor your system.
Experiment.
Document your findings.
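
The steps listed above can be sketched as a small Python harness. This is only an illustration under stated assumptions, not the tooling the speakers used: `run_experiment`, `ExperimentReport`, and `FakeService` are hypothetical names, and the latency numbers are made up so that the example runs without any real infrastructure.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class ExperimentReport:
    """The documented findings of one experiment."""
    hypothesis: str
    steady_state: float  # metric measured before the fault
    under_fault: float   # same metric measured while the fault is active
    tolerance: float     # allowed deviation from the steady state

    @property
    def hypothesis_held(self) -> bool:
        # The hypothesis holds if the metric stayed within tolerance.
        return abs(self.under_fault - self.steady_state) <= self.tolerance


def run_experiment(
    hypothesis: str,
    probe: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
    samples: int = 5,
) -> ExperimentReport:
    # Measure the steady state before touching anything.
    steady_state = mean(probe() for _ in range(samples))
    # Experiment: inject the fault and keep monitoring the same metric.
    inject_fault()
    try:
        under_fault = mean(probe() for _ in range(samples))
    finally:
        # Always roll back, even if monitoring itself fails.
        rollback()
    # Document the findings.
    return ExperimentReport(hypothesis, steady_state, under_fault, tolerance)


class FakeService:
    """Hypothetical stand-in for a real service and its monitoring."""

    def __init__(self) -> None:
        self.dependency_down = False

    def latency_ms(self) -> float:
        # Made-up numbers: latency rises when the dependency is down.
        return 250.0 if self.dependency_down else 100.0

    def kill_dependency(self) -> None:
        self.dependency_down = True

    def restore_dependency(self) -> None:
        self.dependency_down = False


if __name__ == "__main__":
    service = FakeService()
    report = run_experiment(
        hypothesis="Latency stays within 50 ms when a dependency fails",
        probe=service.latency_ms,
        inject_fault=service.kill_dependency,
        rollback=service.restore_dependency,
        tolerance=50.0,
    )
    print(report)
    print("Hypothesis held:", report.hypothesis_held)
```

In a real setup, the probe would query the monitoring solution and the fault injection would be delegated to a chaos tool. The point of the sketch is the ordering: measure first, inject second, always roll back, and write down the result.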

VALUES OF CHAOS ENGINEERING

Increase cross-team communication to reduce friction.
Verify your non-functional requirements.
Find unknown technical debt.
Reduce time to recovery.
Increase time between failures.
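
The last two values are commonly tracked as mean time to recovery (MTTR) and mean time between failures (MTBF). The following is a minimal sketch of how they could be computed from an incident log. The function names and the incident timestamps are assumptions for illustration, and MTBF is taken here as the average gap between one recovery and the next failure, which is one of several definitions in use.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# One incident: (time the failure started, time the service recovered).
Incident = Tuple[datetime, datetime]


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average duration of an outage."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [recovered - failed for failed, recovered in incidents]
    return sum(durations, timedelta()) / len(durations)


def mean_time_between_failures(incidents: List[Incident]) -> timedelta:
    """Average healthy time between a recovery and the next failure."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


if __name__ == "__main__":
    # Made-up incident log for a single day.
    day = datetime(2021, 1, 1)
    log = [
        (day.replace(hour=10, minute=0), day.replace(hour=10, minute=30)),
        (day.replace(hour=14, minute=30), day.replace(hour=14, minute=40)),
        (day.replace(hour=20, minute=40), day.replace(hour=21, minute=0)),
    ]
    print("MTTR:", mean_time_to_recovery(log))
    print("MTBF:", mean_time_between_failures(log))
```

Comparing these two numbers before and after a series of GameDays is one way to check whether the experiments are paying off.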

TOOLING

Invest in a great (application performance) monitoring solution like Instana.
In the beginning, Deutsche Bahn mostly used Pumba and Chaos Monkey for Spring Boot.
At the moment, they are evaluating Chaos Engineering platforms like Steadybit.

TOP FIVE EXCUSES NOT TO START WITH CHAOS ENGINEERING

"GoLive is in a few days, you are messing up the test plan."

In the microservices world, there is a GoLive every minute or ten, especially when you are starting a new service. Chaos Engineering should be a part of the test plan or even better, shift left and integrate it in your deployment process.

"We don't have enough budget access rights."

This usually points to a gap in knowledge rather than in resources: people often do not know where to find the correct information.

"We cannot think of any experiments that are useful to us."

This is an alarming excuse, because it shows that the effort put into introducing and operating Chaos Engineering was not enough. We have to offer more coaching to explain the gains of adopting chaos experiments and push teams to reflect on their use cases.

"There is no time because right now, we are only doing features. We will take care of technical things later."

Use cases, technical features, and non-functional requirements are inherently connected. Postponing the technical work just results in accumulating technical debt.

"We would love to do Chaos Engineering but the product owner is not prioritizing this."

If the product owner does not prioritize this, it might lead to stability issues, and the practice then has to be pushed by management.

TIME TRAVEL

Oliver goes a bit back in time to acknowledge the milestones Deutsche Bahn has achieved as a team: "To date, we have done more than 83 GameDays, discovered more than 124 production-relevant weaknesses, started Red Teaming for Chaos Engineering, established unified documentation of every experiment with a single point of entry, and hosted several community meetings."

TOP FIVE LEARNINGS

People, processes, and practices are key factors:

You have to communicate with the stakeholders, get them together while experimenting and learn more about the system.

Don't just deploy the Chaos Monkey:

Do not rush. Take your time, work through game days first, and only then think about implementing a chaos testing tool in your system.

Start small, aim for production:

Start with a small blast radius and a low-magnitude impact so that you can learn from the findings and prepare better for the upcoming experiments. Keep aiming for production even while you run your experiments in the initial stages.
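
One way to keep the blast radius small is to attack only a limited share of the running instances and to widen that share step by step. The sketch below assumes a hypothetical helper `pick_targets` and made-up instance names; it is not tied to any particular chaos tool.

```python
import random
from typing import List, Sequence


def pick_targets(instances: Sequence[str], percent: int, seed: int = 0) -> List[str]:
    """Pick the share of instances that an experiment may affect.

    `percent` is the blast radius as a whole number from 1 to 100.
    At least one instance is always selected, and a fixed seed keeps
    the selection reproducible so the experiment can be repeated.
    """
    if not instances:
        raise ValueError("no instances to choose from")
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    count = max(1, len(instances) * percent // 100)
    return random.Random(seed).sample(list(instances), count)


if __name__ == "__main__":
    # Made-up instance names for illustration.
    pods = [f"pod-{i}" for i in range(20)]
    # Widen the blast radius only after the previous stage was understood.
    for stage in (5, 25, 100):
        targets = pick_targets(pods, percent=stage)
        print(f"{stage:>3}% -> {len(targets)} instance(s)")
```

Starting at a few percent keeps the first findings cheap, and the same staged loop can later be pointed at production.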

Don't be afraid to bug people, don't get discouraged:

You have to push the practice of Chaos Engineering in your organization. At first there might be some resistance, but eventually the team will realize it's for the better.

Don't get eaten by the process line:

In the case of quality assurance, make specific experiments mandatory so that the other departments also understand the need for the infrastructure.

Maik concludes the talk with some calls to action for conducting Chaos Engineering: "You can start by inviting your team to the first game day, share your findings with the other departments so that they can learn from them, and finally, be confident and do your chaos experiments."
