Chaos Engineering for Cloud Native Security

Human errors and misconfiguration-based vulnerabilities have become a major cause of data breaches and other security attacks in cloud-native infrastructure (CNI). The dynamic and complex nature of CNI and the underlying distributed systems further complicates these challenges, so novel security mechanisms are needed. Such mechanisms must be customer-centric and continuous, and must not be limited to traditional security paradigms like intrusion detection. We tackle these challenges via Risk-driven Fault Injection (RDFI), a novel application of cyber security to chaos engineering. Chaos engineering concepts (e.g. Netflix's Chaos Monkey) have become popular because they increase confidence in distributed systems by injecting non-malicious faults (essentially addressing availability concerns) through experimentation. RDFI goes further by injecting security faults that trigger security failures affecting integrity, confidentiality, and availability. Safety measures are also employed so that impacted environments can be reverted to secure states. RDFI therefore improves security and resilience in a continuous and efficient manner and extends the benefits of chaos engineering to cyber security. We have researched and implemented a proof-of-concept for RDFI that targets multi-cloud enterprise environments deployed on AWS and Google Cloud Platform.

To talk about "Chaos Engineering for Cloud-Native Security", Kennedy A Torkura from Mattermost joined us on day 2 of Chaos Carnival 2021.

"Cloud-Native Security is about securing the cloud-native infrastructure. Security in the cloud-native world is all about ensuring that every abstraction layer is secured. The outermost layer is the cloud layer, which is a public cloud infrastructure like AWS, Google Cloud, Azure, etc. Security in this layer is important because it requires you to understand the different kinds of security problems and the shared security responsibility models. The next layer is the cluster layer, which consists of orchestrators. Communities here are well aware of the Kubernetes framework and are keen on outsourcing deployment in this layer. The third layer is the container layer, which mostly concerns Docker security. The innermost layer is the code, which is built by the developers," Kennedy explains.

Nowadays, code is being pushed to production more rapidly. We should also take care of vulnerabilities even when we use open-source code. We have to ensure that every change we make is covered by a security model or by security checks integrated into the continuous integration pipeline.

Kennedy continues, "This model is about making sure that you deploy a defense-in-depth security model. It also answers questions such as why that is important and why we need to use the same models."

He then goes on to explain that recent years have seen a surge in cyberattacks, predominantly on cloud infrastructure. In a 2020 report on cloud native threats, the Aqua Security team noted that the attacks observed on their honeypots were comparatively few at first, but from the fall of 2019 they grew in rate and became much more complex. The team described these attacks as designed by organized criminals and meant to evade all kinds of detection.

WHY ARE THESE ATTACKS INCREASING?

Complexity: Complexity is the worst enemy of security. As systems gain more moving parts, they become harder to observe for attacks, and attacks become harder to eradicate. Security people need to understand the infrastructure so that they can defend it, either with existing security tools or with tools built solely for protecting their system.
Misconfiguration vulnerabilities: These have increased in the past few years. According to Gartner, 99% of the cloud incidents that happen until 2024 will be due to user faults, which are basically misconfigured resources. The reasons behind this are the knowledge gap and inadequate tooling support.
Speed vs Security: Organizations move to the cloud because it enables practices like DevOps and CI/CD pipelines. Unfortunately, these do not work well with security because of the difference in environment: security works better in traditional environments than in dynamic ones.

Kennedy goes on to quote Aaron Rinehart, the co-founder of Verica: "Security Chaos Engineering is the identification of security control failures through proactive experimentation, to build confidence in the system's ability to defend against malicious conditions in production."

DIFFERENCE BETWEEN CHAOS AND SECURITY CHAOS ENGINEERING

Chaos Engineering, at least as we see it in most tools, addresses availability problems and uses resiliency patterns to address them. Patterns like timeouts, bulkheads, and circuit breakers are being built into technologies such as microservices and Kubernetes. They try to detect availability problems early enough that they do not worsen over time and begin to affect the product, production systems, and customers.
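To make one of these resiliency patterns concrete, here is a minimal circuit breaker sketch in Python. It is only an illustration and is not taken from the talk or from any particular tool: the class name, the thresholds, and the injectable `clock` parameter are all assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker: opens after `max_failures` consecutive failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds (assumed value)
        self.clock = clock              # injectable so the example is testable
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of hammering a dependency that is down.
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: allow a single trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

A chaos experiment would typically make the wrapped dependency fail on purpose and then check that callers start receiving `CircuitOpenError` quickly rather than waiting on timeouts.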

Security Chaos Engineering also addresses availability problems, such as denial-of-service attacks, but it goes further and covers failures that affect confidentiality and integrity. It verifies security patterns or controls such as preventive controls (e.g. firewalls), detective controls (e.g. IDS), and corrective controls (e.g. incident response systems). It particularly aims to detect security blind spots.

JOURNEY INTO SECURITY CHAOS ENGINEERING

"We had a project with our academic partners that was built to defend cloud infrastructure against attacks, covering cloud identity management and cloud storage. As cloud resources do not have firewalls, we captured the states of these resources over a specific period and compared them consistently. By doing this we came up with a way to detect when things go wrong or look suspicious. This allows us to protect our systems and to learn more about them."
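The state-comparison idea described above can be sketched roughly as follows. This is not the project's actual implementation: the snapshot format (a dictionary of resource IDs mapped to attribute dictionaries), the function name `diff_states`, and the sample resources are all hypothetical.

```python
def diff_states(baseline, current):
    """Compare two snapshots shaped like {resource_id: {attribute: value}}.

    Returns a list of (resource_id, what_changed, old_value, new_value).
    """
    findings = []
    for rid in sorted(set(baseline) | set(current)):
        old, new = baseline.get(rid), current.get(rid)
        if old is None:
            findings.append((rid, "resource added", None, new))
        elif new is None:
            findings.append((rid, "resource removed", old, None))
        else:
            for key in sorted(set(old) | set(new)):
                if old.get(key) != new.get(key):
                    findings.append((rid, key, old.get(key), new.get(key)))
    return findings


# Hypothetical snapshots taken at two points in time.
baseline = {
    "bucket/logs": {"public_read": False, "encrypted": True},
    "user/ci-bot": {"policies": ["read-only"]},
}
current = {
    "bucket/logs": {"public_read": True, "encrypted": True},
    "user/ci-bot": {"policies": ["read-only"]},
    "user/unknown": {"policies": ["admin"]},
}

for finding in diff_states(baseline, current):
    print(finding)
```

In this example the comparison flags the bucket that became publicly readable and the user that appeared since the baseline, which are the kinds of changes a reviewer would want to treat as suspicious.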

"We have also researched Proactive Security Risk Analysis, Automated Threat Detection and Incident Response in Multiple Cloud Storage Systems, Security Chaos Engineering for Cloud Services, and Chaos Engineering for Security and Resiliency in Cloud Infrastructure. Apart from these, we had an opportunity to contribute to the first book on Chaos Engineering," Kennedy explains.

SCE FEEDBACK LOOP

Plan: Apply outcomes of analysis to improve security. Design and plan future security hypotheses.
Execute: Inject security faults based on a crafted hypothesis.
Monitor: Observe and monitor the execution of security perturbations. Intervene when necessary to ensure safety.
Analyze: Collect and analyze observations. Vulnerabilities can be ranked and prioritized.
Knowledge: All these processes revolve around knowledge: security insights and information, including security fault models, detected vulnerabilities, and analytical outcomes.
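The loop above can be pictured with a small Python sketch. It is a simplified illustration under stated assumptions, not the RDFI proof-of-concept: the environment is a plain dictionary, and the names `Experiment`, `Knowledge`, `run_experiments`, and the 1 to 5 risk scale are all invented for the example.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    name: str
    hypothesis: str
    inject: Callable[[dict], None]    # Execute: introduce the security fault
    detected: Callable[[dict], bool]  # Monitor: did a control notice it?
    risk: int                         # assumed scale: 1 (low) to 5 (high)


@dataclass
class Knowledge:
    findings: List[dict] = field(default_factory=list)


def run_experiments(env, experiments, knowledge):
    for exp in experiments:
        safe_state = copy.deepcopy(env)  # safety measure: remember secure state
        held = False
        try:
            exp.inject(env)
            held = exp.detected(env)
        finally:
            env.clear()
            env.update(safe_state)       # always roll back, even on errors
        knowledge.findings.append(
            {"name": exp.name, "risk": exp.risk, "hypothesis_held": held}
        )
    # Analyze: failed hypotheses are blind spots, ranked by risk for planning.
    gaps = [f for f in knowledge.findings if not f["hypothesis_held"]]
    return sorted(gaps, key=lambda f: f["risk"], reverse=True)


# Hypothetical environment and controls.
env = {"bucket_public": False, "admin_users": ["alice"]}


def make_public(state):
    state["bucket_public"] = True


def add_admin(state):
    state["admin_users"].append("mallory")


def public_bucket_alert(state):
    return state["bucket_public"]  # stands in for an existing detective control


experiments = [
    Experiment("public-bucket", "A public bucket raises an alert",
               make_public, public_bucket_alert, risk=4),
    Experiment("rogue-admin", "A new admin user raises an alert",
               add_admin, lambda state: False, risk=5),  # no control exists
]
knowledge = Knowledge()
blind_spots = run_experiments(env, experiments, knowledge)
print(blind_spots)
```

Here the second experiment exposes a blind spot because nothing detects the new admin user, and the ranked result is what the Plan step would use to design the next hypotheses.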
