Observability in Chaos Engineering
Shelby Spees, a Developer Advocate at Honeycomb.io, addressed an important question in her talk on day 2 of Chaos Carnival 2021: how much are you actually learning from your chaos experiments?
Injecting chaos at 3 pm rather than firefighting at 3 am lets us run experiments deliberately and discover our systems' dark corners on our own schedule. When a team is approached about adopting Chaos Engineering, it can be a hard sell: unpreparedness and competing priorities leave them with no time to invest in reliability.
"In software, observability is the ability to understand any state a system can get into, no matter how bizarre or novel, without deploying new code. This is not about changing your log level to debug in production or going in and adding new timers around blocks of code that might be introducing latency; that requires a deployment. Observability is the ability to answer new questions about our systems using the same data that we are already capturing," Shelby explains.
She continues, "This is why we can learn more from chaos, from both the chaos that is inherent in our systems and the one we are injecting in our planned experiments. Observability also helps us decide when and how we can experiment, so we can better account for business risks and make it easier to get leadership on board with chaos engineering if they have been hesitant about it."
HONEYCOMB.io IS AN OBSERVABILITY TOOL:
Honeycomb is a systems analytics tool centered on interacting with data. It works by ingesting your system's telemetry, that is, the data gathered about the state of your system at runtime, which can come in the form of metrics, logs, traces, and structured events. Honeycomb is built to support telemetry in the form of structured events; it also supports trace visualizations and captures all context together, which allows us to ask novel questions later on.
The big thing at Honeycomb is fast querying: the storage engine was built from the ground up to support fast queries on arbitrary fields, so users do not have to define in advance which telemetry is going to be important someday.
SERVICE LEVEL OBJECTIVES:
At Honeycomb, the quality of work is measured by SLOs. As Charity Majors puts it, an SLO acts as an API for the engineering team, meaning that SLOs are a common language between engineers and business stakeholders. They help define what success means and measure that success throughout the lifecycle of a customer. Think about events that are relevant to the business: the home page loading quickly, user-run queries returning fast results, customer data ingestion getting stored successfully, and so on. Each of these events has runtime context that we collect in our telemetry, from the response code to the user agent to individual query parameters. An SLI (Service Level Indicator) is then defined to categorize events into a good or bad bucket according to their performance.
Once we have our SLO defined, we can calculate our error budget: how many bad events are allowed in a given period. For example, if you are serving a million total requests per month and aiming for a 99.9% SLO, you have a budget of one thousand failed requests over that period.
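As a back-of-the-envelope check of that arithmetic (a minimal sketch of the concept, not Honeycomb's implementation), the error budget for a period can be computed as:

```python
def error_budget(total_events: int, slo_target: float) -> int:
    """Number of bad events allowed in a period while still meeting the SLO."""
    # The allowed failure fraction is 1 - target (e.g. 0.1% for a 99.9% SLO).
    # round() avoids floating-point artifacts like 999.9999...
    return round(total_events * (1 - slo_target))

# One million requests per month at a 99.9% SLO leaves room
# for one thousand failed requests in that month.
print(error_budget(1_000_000, 0.999))  # 1000
```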
Just like a financial budget, the error budget gives us wiggle room, which we will need in case of an emergency. Error budgets also tell us when and where not to take risks: when you have spare error budget, it's time to perform some chaos experiments. Conversely, you should not run experiments when the error budget is tight, which in this case translates to customers having faced incidents recently or engineers feeling burnt out. Under budget constraints, the time is better spent on reliability and learning instead.
INGEST LATENCY SLO AND ERROR BUDGET:
At Honeycomb, the data sent by customers is ingested and then quickly made available so that customers can observe their systems in real time. The ingest-latency SLO currently sits above 99.9%.
The budget burndown graph on the right side shows that the steeper the line, the faster we are burning through the error budget, and this allows us to alert proactively.
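The intuition behind the burndown slope can be sketched as a burn-rate calculation (the function name and thresholds here are illustrative assumptions, not Honeycomb's actual alerting logic):

```python
def burn_rate(budget_spent_fraction: float, period_elapsed_fraction: float) -> float:
    """How fast the error budget is burning relative to a steady pace.

    A value of 1.0 means the budget will last exactly the SLO period;
    values well above 1.0 mean a steep burndown line and early exhaustion.
    """
    return budget_spent_fraction / period_elapsed_fraction

# Half the budget gone only a tenth of the way into the period:
# burning 5x too fast, which is worth a proactive alert.
print(burn_rate(0.5, 0.1))  # 5.0
```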
DEFINING AN SLO:
We need to measure events that are relevant to the SLO; higher-quality data allows us to set goals more effectively.
Shelby then goes on to explain, in a very detailed and lucid manner, the code that Honeycomb uses for instrumenting its ingest path, handling faults, capturing error context, and defining the SLI with eligible and good events.
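The talk walks through Honeycomb's actual instrumentation; as a hedged illustration of the eligible/good-event idea only (the endpoint, field names like `status_code` and `duration_ms`, and the 100 ms threshold are all assumptions, not Honeycomb's code), an SLI over structured events might look like:

```python
from typing import Optional

def ingest_sli(event: dict) -> Optional[bool]:
    """Classify a structured event for a hypothetical ingest-latency SLI.

    Returns None if the event is not eligible for this SLI,
    True for a good event, and False for a bad one.
    """
    # Only ingest requests are eligible; other endpoints don't count
    # toward this SLI at all.
    if event.get("endpoint") != "/1/events":
        return None
    # A good event is a successful ingest under the latency threshold.
    return event.get("status_code") == 200 and event.get("duration_ms", float("inf")) < 100

print(ingest_sli({"endpoint": "/1/events", "status_code": 200, "duration_ms": 42}))  # True
print(ingest_sli({"endpoint": "/health", "status_code": 200, "duration_ms": 5}))     # None
```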
GETTING STARTED WITH OBSERVABILITY:
- Choose an observability tool:
Many vendors offer a free trial, so you can start anywhere: learn the edges of your existing tooling while keeping your engineers' needs in mind.
- Use OpenTelemetry:
It is an open-source instrumentation framework supported by many vendors and large open-source communities. It is approaching general availability, with auto-instrumentation for several HTTP and gRPC frameworks.
- Dip a toe in the water:
Start with auto-instrumentation, which is especially useful for tracing. Choose an app or service in the critical path and send telemetry from your development environment rather than instrumenting all of production at once. As you get more comfortable adding different types of instrumentation, deploy a canary branch to a subset of production; if that is not feasible, deploy to your staging environment or your test deployment.
- Iterating your instrumentation:
You can build on auto-instrumentation by adding custom fields in your code, and a good way to approach this is to add instrumentation wherever there is risk. Once you are comfortable with individual services, you can set up distributed tracing, which many vendors now support.
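One way to picture "adding custom fields" on top of auto-instrumentation (a hand-rolled sketch of the idea, not any particular vendor's SDK; all field names are hypothetical) is enriching each structured event with the business context where the risk lives:

```python
def build_event(base: dict, **custom_fields) -> dict:
    """Combine auto-instrumented fields with custom, risk-relevant context."""
    event = dict(base)           # e.g. method, route, duration from auto-instrumentation
    event.update(custom_fields)  # fields your team adds by hand for its own questions
    return event

# Auto-instrumentation gives you the generic HTTP fields for free;
# the custom fields are what let you ask novel questions later.
auto = {"http.method": "POST", "http.route": "/1/events", "duration_ms": 87}
event = build_event(auto, user_id="u-123", query_cost=42, feature_flag="new-ingest")
print(sorted(event))
```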
- Experiment with Observability:
As a practitioner, it is not just about forming a hypothesis for what your service should do, but also for what your telemetry should be able to tell you.
Lastly, Shelby concludes her talk by saying we should celebrate our successes in building redundancy mechanisms as well as our failures when a hypothesis goes wrong, treating each failed hypothesis as an opportunity to update our mental models and invest in reliability. The chaos engineering community and the resilience community go hand in hand, because we are all trying to learn from the complexity of the chaos that is inherent in our systems as well as the chaos we inject.