Getting Started with Observability for Chaos Engineering


About the speaker

Shelby Spees

Developer Advocate, Honeycomb


Shelby is a Developer Advocate at Honeycomb, where she helps developers better understand their services in production in order to deliver more business value. Before joining Honeycomb, Shelby worked on applications, build pipelines, infrastructure, and other reliability efforts. She's dedicated to improving access and equity in tech as well as to helping people feel good about their work. Shelby lives in Los Angeles, CA, where she enjoys doing karaoke with her pitbull, Nova.

About the talk

How much are you learning from your chaos experiments? How do you know it's safe to perform them? Learn how to instrument your systems in order to gain confidence and get the most out of chaos, whether you're injecting it on purpose or it's already there in your system. This talk explores how Honeycomb's engineering team leverages high-context telemetry data to measure service health and risk in order to determine when it's safe to perform experiments. You'll also hear the lessons we learned about our backend ingest service after performing controlled experiments. Attendees will come away with next steps for instrumenting for observability in order to get the most out of their own chaos experiments.

Topics to be covered:

Intro: Honeycomb is a data storage engine and analytics tool

  • Backend ingests structured event data and indexes on individual fields to support fast querying
  • We measure how well we're doing with SLOs (examples of SLOs for different services)
  • Error budget helps us decide when to take risks and when to invest in reliability
  • Structured events capture runtime context and benchmarking data in a format that supports trace visualizations and event-based SLOs
  • Here's what our instrumentation looks like
  • (if time) Overview of experiment on our backend ingest service and what we learned
  • Reference Graviton2 rollout and 40% cost savings
  • Outro: Here's how to get started instrumenting for observability!


Observability in Chaos Engineering

Shelby Spees, a Developer Advocate at Honeycomb, addresses the important question of how much we are learning from our chaos experiments in her talk on day 2 of Chaos Carnival 2021. Injecting chaos at 3 pm, rather than meeting it at 3 am, shortens the time it takes to run experiments and discover the dark corners of our systems. When we approach a team about adopting chaos engineering, it can be a difficult sell: unprepared teams with competing priorities often invest no time in reliability.

"In software, observability is the ability to understand any state a system can get into, no matter how bizarre or novel, without deploying new code. This is not about changing your log level to debug in production, or going in and adding new timers around blocks of code that might be introducing latency; that requires a deployment. Observability is the ability to answer new questions about our systems using the same data we are already capturing," Shelby explains.

She continues, "This is why we can learn more from chaos: both the chaos that is inherent in our systems and the chaos we inject in our planned experiments. Observability also helps us decide when and how we can experiment, so we can better account for business risks and make it easier to get leadership on board with chaos engineering if they have been hesitant about it."

Honeycomb is an observability tool

Honeycomb is a system analytics tool: it is all about interacting with data. It works by ingesting your system's telemetry, where telemetry is data gathered about the state of your system at runtime. Telemetry can come in the form of metrics, logs, traces, and structured events. Honeycomb is built to support telemetry in the form of structured events; it also supports trace visualizations and captures all context together, which allows us to ask novel questions later on.
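To make "structured events with all context together" concrete, here is a minimal sketch of what one wide event might look like. The field names are illustrative assumptions, not Honeycomb's actual schema:

```python
# A structured event: one wide record per unit of work, with all runtime
# context attached so novel questions can be asked later.
# Field names here are illustrative, not Honeycomb's actual schema.
import json

event = {
    "timestamp": "2021-01-22T15:04:05Z",
    "service.name": "ingest-api",
    "trace.trace_id": "7f2a9c",    # lets the event join a trace visualization
    "http.method": "POST",
    "http.status_code": 200,
    "duration_ms": 12.4,
    "customer.id": "team-123",
    "payload.event_count": 250,
}

# Structured events serialize naturally to JSON for transport.
print(json.dumps(event, indent=2))
```

Because every field travels with the event, you can later group or filter by any of them (say, `customer.id` crossed with `duration_ms`) without having decided that up front.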

At Honeycomb, the big thing is fast querying: the storage engine was built from the ground up to support fast querying on arbitrary fields. That way, users need not define in advance which telemetry is going to be important someday.


At Honeycomb, the quality of the work is measured with SLOs. As Charity Majors puts it, SLOs act as an API for the engineering team: they are a common language between engineers and business stakeholders. They help define what success means and measure it throughout the lifecycle of a customer. Think about events that are relevant to the business: the home page loading quickly, user-run queries returning fast results, customer data ingestion getting stored successfully, and so on. Each of these events has runtime context that we collect in our telemetry, from the response code to the user agent to individual query parameters. The SLI (Service Level Indicator) is then defined to categorize events into a good or bad bucket according to their performance.
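A sketch of how an SLI might bucket events as good or bad. The status and latency thresholds here are assumed examples, not Honeycomb's actual SLI definition:

```python
# Sketch of an SLI: classify each event as good or bad based on its
# runtime context. The 200-status / sub-100ms rule is an assumed example.
def sli_good(event: dict) -> bool:
    """Return True if this event lands in the 'good' bucket."""
    return event["status_code"] == 200 and event["duration_ms"] < 100

events = [
    {"status_code": 200, "duration_ms": 42},   # good
    {"status_code": 200, "duration_ms": 250},  # too slow: bad
    {"status_code": 500, "duration_ms": 30},   # error: bad
]

good = sum(sli_good(e) for e in events)
print(f"{good}/{len(events)} events were good")  # 1/3 events were good
```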

Once we have our SLO defined, we can calculate our error budget: how many bad events are allowed in a given period. For example, if you are serving a million total requests per month and aiming for a 99.9% SLO, you have a budget of a thousand failed requests over that period.
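The arithmetic from that example works out like this:

```python
# Error budget arithmetic: a 99.9% SLO over one million requests
# leaves a budget of 1,000 bad events for the period.
total_requests = 1_000_000
slo_target = 0.999

error_budget = round(total_requests * (1 - slo_target))
print(error_budget)  # 1000
```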

Just like a financial budget, the error budget gives us wiggle room, which we will need in case of an emergency. Error budgets also tell us when and when not to take risks: when you have error budget to spare, it's time to perform some chaos experiments. Conversely, you should not be performing experiments when the error budget is tight, which usually means customers have faced incidents recently or the engineers are feeling burnt out. When the budget is constrained, the time is better spent on reliability and learning instead.


At Honeycomb, the data sent by customers is ingested and then quickly made available, so customers can observe their systems in near real-time. Currently, the SLO compliance for ingest latency is above 99.9%.

The budget burndown graph shows how quickly the error budget is being consumed: the steeper the line, the faster we are burning through the budget, and this allows us to alert proactively.
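The burndown idea can be sketched numerically: the slope of the line is the burn rate, and projecting it forward tells you when the budget runs out. All the numbers below are made up for illustration:

```python
# Sketch of burn-rate alerting: project when the error budget will be
# exhausted from the current slope of the burndown. Numbers are illustrative.
def hours_until_exhausted(budget_remaining: float, bad_events_per_hour: float) -> float:
    """Project budget exhaustion time at the current burn rate."""
    return budget_remaining / bad_events_per_hour

budget_remaining = 600   # bad events still allowed this period
steady_burn = 5          # bad events/hour on a normal day
incident_burn = 120      # bad events/hour during an incident

print(hours_until_exhausted(budget_remaining, steady_burn))    # 120.0
print(hours_until_exhausted(budget_remaining, incident_burn))  # 5.0
```

Alerting on the projection (for example, "budget gone within 24 hours") fires before the SLO is actually violated, which is what makes the alert proactive.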


We need to measure events that are relevant to the SLO; higher-quality data allows us to set goals more effectively.

Shelby then goes on to explain, in a detailed and lucid manner, the code Honeycomb uses for instrumenting its ingest path, injecting faults, capturing error context, and defining the SLI with eligible and good events.
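The eligible-and-good pattern Shelby describes can be sketched as two predicates: one deciding whether an event counts toward the SLO at all, and one deciding whether it succeeded. The endpoint names, fields, and thresholds here are hypothetical:

```python
# Sketch of the eligible/good SLI pattern. Field names and rules are
# hypothetical, not Honeycomb's actual definitions.
def eligible(event: dict) -> bool:
    # Exclude health checks and internal traffic from the SLO entirely.
    return event.get("endpoint") == "/1/events" and not event.get("internal", False)

def good(event: dict) -> bool:
    return event["status_code"] < 500 and event["duration_ms"] < 500

events = [
    {"endpoint": "/1/events", "status_code": 200, "duration_ms": 15},  # eligible, good
    {"endpoint": "/healthz", "status_code": 200, "duration_ms": 1},    # not eligible
    {"endpoint": "/1/events", "status_code": 503, "duration_ms": 4},   # eligible, bad
]

elig = [e for e in events if eligible(e)]
ratio = sum(good(e) for e in elig) / len(elig)
print(ratio)  # 0.5
```

Splitting eligibility from goodness keeps traffic you never promised anything about (health checks, internal tooling) from diluting or inflating the SLI.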


Here's how to get started instrumenting for observability:

  1. Choose an observability tool:

Many vendors offer a free trial, so you can start anywhere. Learn the edges of your existing tooling, and keep your engineers' needs in mind as well.

  2. Use OpenTelemetry:

OpenTelemetry is an open-source instrumentation framework supported by many vendors and a large open-source community. It is approaching general availability, with auto-instrumentation for several HTTP and gRPC frameworks.

  3. Dip a toe in the water:

Start with auto-instrumentation, especially for tracing. Choose an app or service in the critical path and send data from your development environment rather than instrumenting all of production at once. As you become more comfortable adding different types of instrumentation, deploy a canary to a subset of production; if that is not feasible, deploy to your staging environment or a test deployment.
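One simple way to send only a subset of traffic through an instrumented canary is deterministic hashing on a request key. This is a sketch of the idea; in practice the routing decision usually lives in a load balancer or service mesh:

```python
# Sketch of deterministic canary routing: hash a stable request key so
# roughly canary_percent of traffic goes to the canary, and the same
# request always gets the same answer.
import hashlib

def route_to_canary(request_id: str, canary_percent: int = 5) -> bool:
    """Send approximately canary_percent of traffic to the canary."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = digest[0] % 100  # stable bucket in [0, 100)
    return bucket < canary_percent

routed = sum(route_to_canary(f"req-{i}") for i in range(10_000))
print(f"{routed} of 10000 requests routed to canary")  # roughly 5%
```

Hashing (rather than random sampling) keeps the split stable across retries, which makes canary telemetry easier to compare against the baseline.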

  4. Iterate on your instrumentation:

You can build on auto-instrumentation by adding custom fields in your code; a good way to approach this is to add instrumentation wherever there is risk. Once you are comfortable with individual services, you can set up distributed tracing, which many vendors now support.
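Adding custom fields on top of auto-instrumentation might look like this. The `add_field` helper and the field names are stand-ins for whatever your tracing library actually provides (most libraries expose an equivalent "attach attribute to the active span" call):

```python
# Sketch of enriching the current span with custom, business-relevant
# fields. `current_span` stands in for the active span's attribute map;
# the helper and field names are hypothetical.
current_span = {}

def add_field(key: str, value) -> None:
    current_span[key] = value

def handle_ingest(payload: dict) -> None:
    # Add fields where there is risk: payload size, customer, and feature
    # flags often explain anomalies seen during chaos experiments.
    add_field("customer.id", payload["customer_id"])
    add_field("payload.event_count", len(payload["events"]))
    add_field("flags.compression_enabled", payload.get("compressed", False))

handle_ingest({"customer_id": "team-123", "events": [1, 2, 3]})
print(current_span)
```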

  5. Experiment with observability:

As a practitioner, chaos experimentation is not just about forming a hypothesis for what your service should do, but also for what your telemetry should be able to tell you.

Lastly, Shelby concludes her talk by saying that we should celebrate our successes in building redundancy mechanisms, as well as our failures when a hypothesis turns out to be wrong: those are opportunities to update our mental models, learn, and invest in reliability. The chaos engineering community and the resilience community go hand in hand, because we are all trying to learn from the complexity of chaos that is inherent in our systems as well as the chaos we inject.
