
Learning OpenTelemetry

Setting Up and Operating a Modern Observability System

With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.

Austin Parker and Ted Young


Learning OpenTelemetry
by Austin Parker and Ted Young

Copyright © 2024 Austin Parker and Ted Young. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for
most titles (http://oreilly.com). For more information, contact
our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.

Acquisitions Editor: John Devins

Development Editor: Sarah Grey

Production Editor: Gregory Hyman

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Kate Dullea


November 2023: First Edition

Revision History for the Early Release


2023-04-27: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098147181 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Learning OpenTelemetry, the cover image, and related trade
dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and
do not represent the publisher’s views. While the publisher and
the authors have used good faith efforts to ensure that the
information and instructions contained in this work are
accurate, the publisher and the authors disclaim all
responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any
code samples or other technology this work contains or
describes is subject to open source licenses or the intellectual
property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.

978-1-098-14718-1
Chapter 1. The State of Modern Observability: A Brief Overview

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form
—the authors’ raw and unedited content as they write—so you
can take advantage of these technologies long before the official
release of these titles.

This will be the first chapter of the final book.

If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material
within this chapter, please reach out to the authors at
austin@ap2.io and ted@tedsuo.com.

History is not the past but a map of the past, drawn from a
particular point of view, to be useful to the modern traveller.

—Henry Glassie, US historian1

This is a book about the difficult problems inherent to large-scale, distributed computer programs, and how to apply
OpenTelemetry to help solve those problems.

Modern software engineering is obsessed with end-user experience, and end users demand blazing-fast performance.
Surveys show that users will abandon e-commerce sites that
take more than 2 seconds to load. You’ve probably spent a fair
amount of time trying to optimize and debug application
performance issues, and if you’re like us, you’ve been frustrated
by how inelegant and inefficient this process can be. There’s
either not enough data or too much, and what data there is can
be riddled with inconsistencies or unclear measurements.

Engineers are also faced with stringent uptime requirements. That means identifying and mitigating any issues before they
cause a meltdown, not just waiting for the system to fail. And it
means moving quickly from triage to mitigation. To do that, you
need data.

But you don’t need just any data: you need correlated data, data
that is already organized, ready to be analyzed by a computer
system. As you will see, data with that level of organization has
not been readily available. In fact, as systems have scaled and
become more heterogeneous, finding the data you need to
analyze an issue has become even harder. If it was once like
looking for a needle in a haystack, it’s now more like looking for
a needle in a stack of needles.

OpenTelemetry solves this problem. By turning individual logs, metrics, and traces into a coherent, unified graph of
information, OpenTelemetry sets the stage for the next
generation of observability tools. And since the software
industry is broadly adopting OpenTelemetry already, that next
generation of tools is being built as we write this.

The times, they are a-changin’


Technology comes in waves. As we write this in 2023, the field
of observability is riding its first real tsunami in at least 30
years. You’ve chosen a good time to pick up this book and gain a
new perspective!

The advent of cloud computing and cloud-native application systems has led to seismic shifts in the practice of building and
operating complex software systems. What hasn’t changed,
though, is that software runs on computers, and you need to
understand what those computers are doing in order to
understand your software. As much as the cloud has sought to abstract away the fundamental units of computing, our 1s and 0s are still bits and bytes running on real machines.
Whether you are running a program on a multiregion
Kubernetes cluster or a laptop, you will find yourself asking the
same questions, like:

“Why is it slow?”

“What is using so much RAM?”

“When did this problem start?”

“Where is the root cause?”

“How do I fix this?”

Carl Sagan said, “We have to know the past to understand the
present.”2 That certainly applies here: to see why a new
approach to observability is so important, you first need to be
familiar with traditional observability architecture and its
limitations.

This may look like a recap of rudimentary information! But the observability mess has been around for so long that most of us have
developed quite the pile of preconceptions. So even if you’re an
expert – especially if you’re an expert – it is important to have a
fresh perspective. Let’s start this journey by defining several
key terms we will use throughout this book.
Observability: Key terms to know
First of all, what is observability observing? For the purposes of
this book, we are observing distributed systems. A distributed
system is a system whose components are located on different
networked computers that communicate and coordinate their
actions by passing messages to one another.3 There are many
kinds of computer systems, but these are the ones we’re
focusing on.

At the highest level, a distributed system consists of resources and transactions:

Resources

Resources are all of the physical and logical components that make up a system. Physical components, like servers,
containers, processes, RAM, CPU, and network cards, are
all resources. Logical components, like clients,
applications, API endpoints, databases, and load
balancers, are also resources. In short, resources are
everything from which the system is actually constructed.

Transactions
Transactions are user requests that orchestrate and utilize
the resources the system needs to do work on behalf of
the user. Usually, a transaction is kicked off by a real
human, who is waiting for the task to be completed.
Booking a flight, hailing a rideshare, and loading a
webpage are examples of transactions.

How do we observe these distributed systems? We can’t, unless they emit telemetry. Telemetry is data that describes what your
system is doing. Without telemetry, your system is just a big
black box filled with mystery.

Many developers find the word telemetry confusing. It’s an overloaded term. The distinction we draw in this book, and in
systems monitoring in general, is between user telemetry and
performance telemetry:

User telemetry

User telemetry refers to data about how a user is interacting with a system through a client: button clicks,
session duration, information about the client’s host
machine, and so forth. You can use this data to
understand how users are interacting with an e-
commerce site, or the distribution of browser versions
accessing a web-based application.
Performance telemetry

Performance telemetry is not primarily used to analyze user behavior; instead it provides operators with
statistical information about the behavior and
performance of system components. Performance data
can come from different sources in a distributed system,
and offers developers a “breadcrumb trail” to follow,
connecting cause with effect.

In plainer terms, user telemetry will tell you how long someone
hovered their mouse cursor over a ‘checkout’ button in an e-
commerce application. Performance telemetry will tell you how
long it took for that checkout button to load in the first place,
and which programs and resources the system utilized along
the way.

Underneath user and performance telemetry are different types of signals. A signal is a particular form of telemetry. Event logs
are one kind of signal. System metrics are another kind of
signal. Continuous profiling is another. Each of these signals
serves a different purpose, and they are not really
interchangeable. You can’t derive all of the events that make up
a user interaction just by looking at system metrics, and you
can’t derive system load just by looking at transaction logs. We
need multiple kinds of signals to get a deep understanding of
our system as a whole.

Each signal consists of two parts: instrumentation within the programs themselves, and a transmission system for sending
the data over the network to an analysis tool, where the actual
observing occurs.

This raises an important distinction: it’s common to conflate telemetry and analysis, but it’s important to understand that the
system that emits the data and the system that analyzes the
data are separate from each other. Telemetry is the data itself.
Analysis is what you do with the data.

Finally, telemetry plus analysis equals observability. Understanding the best way to combine these two pieces into a
useful observability system is what this book is all about.

A Brief History of Telemetry

Fun fact: it’s called telemetry because the first remote diagnostic
systems transmitted data over telegraph lines. While people
often think of rockets and 1950s aerospace when they hear the
term telemetry, if that was where the practice had started, it
would have been called radiometry. Telemetry was actually first
developed to monitor power plants and public power grids –
early but important distributed systems!

Of course, computer telemetry came later. The specific history of user and performance telemetry maps to changes in software
operations, and to the ever-increasing processing power and
network bandwidth that have long driven those trends.
Understanding how computer telemetry signals came to be and
how they evolved is an important part of understanding their
current limitations.

The first and most enduring form of telemetry was logging. Logs are text-based messages meant for human consumption that describe the state of a system or service. Over time, developers and operators improved how they stored and searched these logs by creating specialized databases that were good at full-text search.

While logging did tell you about individual events and moments
within a system, understanding how that system was changing
over time required more data. A log could tell you that a file
couldn’t be written because the storage device was out of space,
but wouldn’t it be great if you could track available storage
capacity and make a change before you ran out of space?
Metrics are compact, statistical representations of system state
and resource utilization. They were perfect for the job. Adding
metrics made it possible to build alerting on data, beyond
errors and exceptions.

As the modern internet took off, systems became more complex and performance became more critical. A third form of
telemetry was added: distributed tracing. As transactions grew
to include more and more operations, and more and more
machines, localizing the source of a problem became more
critical. Instead of just looking at individual events – logs – tracing systems look at entire operations and how they combine to form transactions. Operations have a start time
and an end time. They also have a location: on which machine
did a particular operation occur? Tracking this made it possible
to localize the source of latency to a particular operation or a
machine. However, due to resource constraints, tracing systems
tended to be heavily sampled and only ended up recording a
small fraction of the total number of transactions, which
limited their usefulness beyond basic performance analysis.
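
To make the idea of an operation concrete, here is a minimal sketch using the OpenTelemetry Python API (the tracer name, span name, and attribute are invented for illustration). A span records a start timestamp when the block is entered and an end timestamp when it exits, and it carries whatever attributes you attach:

    from opentelemetry import trace

    tracer = trace.get_tracer("checkout-service")

    def book_flight(flight_id: str) -> None:
        # The span starts here and ends when the block exits, giving the
        # operation a start time, an end time, and descriptive attributes.
        with tracer.start_as_current_span("book-flight") as span:
            span.set_attribute("app.flight.id", flight_id)
            # ... call downstream services, write to the database, etc.

Without an SDK configured, these calls are no-ops; with one, the spans are exported and assembled into traces.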

The three browser tabs of observability
While there are other useful forms of telemetry, the primacy of
these three systems – logs, metrics, and tracing – led to the
concept known today as the “three pillars of observability.”4
The three pillars are a great way to describe how we currently
practice observability – but they’re actually a terrible way to
design a telemetry system!

Traditionally, each form of observability – telemetry plus analysis – was built as a completely separate, siloed system, as
described in Figure 1-1.

Figure 1-1. A “pillar” of observability

A logging system consists of logging instrumentation, a log transmission system, and a log analysis tool. A metrics system
consists of metrics instrumentation, a metrics transmission
system, and a metrics analysis tool. Same for tracing. Hence, the
three pillars described in Figure 1-2.
Figure 1-2. The “three pillars” of observability

This is basic vertical integration: each system is built to purpose, end to end. It makes sense that observability has been built this
way – it’s been evolving over time, with each piece added as it
was needed. In other words, observability is structured this way
for no better reason than historical accident. The simplest way
to implement a logging system or a metrics system is to do it in
isolation, as a standalone system.

So, while the term “three pillars” does explain the way traditional observability is architected, it is also problematic – it makes this architecture sound like a good idea! Which it isn’t. It’s cheeky, but we prefer a different turn of phrase – “the three browser tabs of observability.” Because that’s what you’re actually getting.

Emerging complications
The problem is that our systems are not composed of logging
problems or metrics problems.

They are composed of transactions and resources. When a problem occurs, these are the only two things we can modify.
Developers can change what the transactions do, and operators
can change what resources are available. That’s it.

But the devil is in the details. It’s possible for a simple, isolated
bug to be confined to a single transaction. But most production
problems emerge from the way many concurrent transactions
interact.

A big part of observing real systems involves identifying patterns of bad behavior, then extrapolating to figure out which patterns of transactions and resource consumption cause that bad behavior. That’s really difficult to do! It’s very hard
to predict how transactions and resources will end up
interacting in the real world. Tests and small-scale deployments
aren’t always useful tools for this, because the problems you
are trying to solve do not appear outside of production. These
problems are emergent side effects, and they are specific to the
way that the physical reality of your production deployment
interacts with all of the system’s real users.

This is a pickle! Clearly, your ability to solve these problems depends on the quality of the telemetry your system is emitting
in production.

The three pillars were an accident


You can definitely use metrics, logs, and traces to understand
your system. Logs and traces help you reconstruct the events
that make up a transaction, while metrics help you understand
resource usage and availability.

But useful observations do not come from looking at data in isolation. You can’t look at a single data point, or even a single
data type, and understand anything about emergent behavior.
You’ll almost never find the root cause of a problem by just
looking at logs or metrics. The clues that lead us to answers
come from finding correlations across these different data
streams. So, when investigating a problem, you tend to pivot
back and forth between logs and metrics, looking for
correlations.

This is the primary problem with the traditional “three pillars” approach: these signals are all kept in separate data silos. This makes it impossible to automatically identify correlations between changing patterns in our transaction logs and changing patterns in our metrics. Instead, you end up with three separate
browser tabs, and each one only contains a portion of what you
need.

Vertical integration makes it even worse: if you want to spot correlations across metrics, logs, and traces, you need these connections to be present in the telemetry your systems are emitting. Without unified telemetry, even if you were able to store these separate signals in the same database, you would still be missing the key identifiers that make correlations reliable
and consistent. So the three pillars are actually a bad design!
What we need is an integrated system.

A Single Braid of Data


How do you triage your systems, once you’ve noticed a
problem? By finding correlations. How do you find
correlations? There are two ways: with humans and with
computers.

Human investigation

Operators sweep through all the available data, building a mental model of the current system. Then, in their heads,
they try to identify how all the pieces might be secretly
connected. Not only is this mentally exhausting, it’s also
subject to the limitations of human memory. I mean, think
about it: we’re literally looking for correlations by using
our eyeballs to look at squiggly lines.

Computer investigation

The second way to find correlations is by using computers. Computers may not be good at forming
hypotheses and finding root causes, but they are very
good at finding correlations. That’s just statistical
mathematics.

But, again, there’s a catch: computers can only find correlations between connected pieces of data. And if your
telemetry data is siloed, unstructured, and inconsistent,
then the assistance computers can offer you will be very
limited. This is why human operators are still using their eyeballs to scan metrics while also trying to memorize every line in every config file.

Instead of three separate pillars, let’s use a new metaphor: a single braid of data. Figure 1-3 shows our favorite way of thinking about high-quality telemetry. We still have three
separate signals – there’s no conflating them – but the signals
have touch points that connect everything together into a single
graphical data structure.

Figure 1-3. Finding correlations when all the telemetry signals are braided together.

With a telemetry system like this, it’s possible for computers to walk through the graph, quickly finding distant but important
connections. Unified telemetry means it’s finally possible to
have unified analysis, which is critical to developing a deep
understanding of the emergent problems inherent to live
production systems.

Does such a telemetry system exist? It does. And it’s called OpenTelemetry.

Conclusion
This book will be your guide to learning OpenTelemetry. It is
not meant to be a replacement for OpenTelemetry
documentation, which can be found on the project’s website.
Instead, this book explains the philosophy and design of
OpenTelemetry, and offers practical guidance on how to wield it
effectively.

In Chapter 2, we explain the value proposition OpenTelemetry brings, and how your organization benefits from replacing
proprietary instrumentation with instrumentation based on
open standards.

In Chapter 3, we move to a high-level overview of the various components which make up OpenTelemetry.

In Chapter 4, we dive into instrumenting an application, including a checklist to help ensure that everything works and
the telemetry is high quality.

In Chapter 5, we discuss instrumenting OSS libraries and services, and explain why library maintainers should care
about observability.

In Chapter 6, we review the options for observing software infrastructure – cloud providers, platforms, and data services.

In Chapter 7, we go into detail on how and why to build different types of observability pipelines, using the
OpenTelemetry Collector.

In Chapter 8, we provide advice on how to deploy OpenTelemetry across your organization. Since telemetry –
especially tracing – is a cross-team issue, there are
organizational pitfalls when rolling out a new observability
system. This chapter will provide strategies and advice on how
to ensure a successful rollout.

If you are brand new to OpenTelemetry, we strongly suggest reading Chapters 2 and 3 first. After that, the chapters can be
read in any order. Feel free to skip to whichever section is most
relevant to the task you need to accomplish.
1 Henry H. Glassie, Passing the Time: Folklore and History of an Ulster Community (New York, 1982).

2 Sagan, C. E. (author and presenter). (1980) Episode 2: One Voice in the Cosmic Fugue
[Television series episode]. In Adrian Malone (Producer), Cosmos: A Personal Voyage.
Arlington, VA: Public Broadcasting Service.

3 Andrew S. Tanenbaum and Maarten van Steen, Distributed Systems: Principles and Paradigms (2002).

4 Cindy Sridharan, Distributed Systems Observability (O’Reilly, 2018).


Chapter 2. Why Use OpenTelemetry?

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form
—the authors’ raw and unedited content as they write—so you
can take advantage of these technologies long before the official
release of these titles.

This will be the second chapter of the final book.

If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material
within this chapter, please reach out to the authors at
austin@ap2.io and ted@tedsuo.com.

A map is not the actual territory.

—Alfred Korzybski

Software systems are the vital heart of the global economy. This may sound like an exaggeration, but it really isn’t: there’s almost no productive enterprise that doesn’t involve software systems at some point. Commerce, logistics, manufacturing, telecommunications, textile production -- you name it, software systems play a crucial role. The only thing more crucial than the software itself is the legions of humans tasked with these systems’ care and upkeep.

These humans -- you included -- have inherited a colossal task. Given less time, fewer resources, and greater complexity, you
must ensure that the system continues to function with as few
interruptions as possible. To aid you, you’re given some
documentation, a team of like-minded individuals, and forty
hours a week. It doesn’t take long to discover that this is,
perhaps, not quite enough.

Systems will always drift away from their documentation despite the best efforts of everyone involved. The larger the team, the more complex the organization, and the greater the cause, the further and faster the mental and physical map of a system diverges from the actual implementation. This fact
introduces significant friction to your daily work building and
operating a software system.

How should individuals, teams, and organizations combat this drift? As we discussed in Chapter 1, we need telemetry about our systems to aid our understanding. Telemetry allows us to view not only the map but also the territory -- instead of relying solely on documentation and inference from code, telemetry gives us real-world data from our production system.

This isn’t a novel conclusion -- as mentioned in the prior chapter, developers and operators have been creating telemetry for decades. There’s a broad status quo of metrics and logs that power monitoring practices for the overwhelming majority of
production software. However, this status quo often leaves
developers needing or wanting more -- more depth, more
insights, better data, deeper analysis. Getting more isn’t just a
matter of creating more data -- it requires linking data together,
layering data to patch gaps in understanding, and
understanding the needs of the many stakeholders in system
operation.

OpenTelemetry achieves these objectives, but to understand how, we first need to dive into the whys.

Production Monitoring: The Status Quo
As organizations and systems scale up in size and complexity,
the collection and analysis of telemetry signals becomes
significantly more challenging. Features that take weeks to
build often result in months of integration work to monitor and
understand. Integrating new products or teams into existing
production monitoring practices can sometimes take years, and
often results in the addition of even more tools and data
sources. Incidents and failures can take days or weeks to
resolve due to a lack of data, or an inability to identify all
contributing factors to an outage.

Why is this the case? Often, monitoring and telemetry are treated as “second-class citizens” of system operations. Features and
work are delivered as projects, where all work is carefully
scoped and completed in pre-production, then shipped as a
complete functional unit. Forget to add some metrics or traces? Well, too bad: you’re not going to get another bite at that apple (at least until something goes wrong, at which point everyone panics trying to fix it!). Another symptom of this is that developers and operators don’t have a good “first choice” of instrumentation options, or they’re forced to convert various incompatible formats into a custom or proprietary solution.
Assuming the data exists at all, developers and operators often
have to page through multiple monitoring systems, manually
correlating data across multiple browser tabs, or sharing
insights through conference calls and chat systems.
Studies have shown that end-user-facing incidents are getting
harder to detect, diagnose, and repair1 -- traditional monitoring
practices are letting down developers, operators, and
organizations alike.

Observability promises to free developers and operators from this status quo of monitoring pitfalls, but it is often perceived as marketing pablum rather than effective practice. This is
because true observability is more than collecting data and
creating some pretty dashboards. Observability is a practice
that “requires evolving the way we think about gathering the
data needed to debug effectively”2.

This evolution starts with telemetry data itself and with re-evaluating what’s important about it. To begin, let’s talk about
context.

Why Context Matters

Context is an overloaded term in the monitoring and observability space. It can refer to a very literal object in your
application, to data being passed over an RPC link, or to the
logical and linguistic meaning of the term. However, the actual
meaning is fairly consistent between these definitions -- context
is metadata that helps describe the relationship between
telemetry data.

Broadly speaking, there are two types of context that we care about, and those contexts appear in two places. The types of context are what we’ll refer to as “hard” and “soft” contexts, and
the places are in an application, or in infrastructure. An
observability system can address and support varying mixtures
of these contexts, but without them, the value of telemetry data
is significantly reduced -- or vanishes altogether.

A hard context is a unique, per-request identifier that services in a distributed application can propagate to other services that
are a part of the same request. A basic model of this would be a
single request from a web client, through a load balancer into
an API server, which calls a function in another service to read
a database and returns some computed value to the client (see
Figure 2-1). This can also be referred to as the logical context of
the request (as it maps to a single desired end-user interaction
with the system).

A soft context would be various pieces of metadata that each telemetry instrument attaches to measurements from the various services and infrastructure that handle that same request: for example, a customer identifier, the hostname of the
load balancer that served the request, or the timestamp of a
piece of telemetry data (also pictured in Figure 2-1). The key
distinction between hard and soft contexts is that a hard
context directly and explicitly links measurements that have a
causal relationship, whereas soft contexts may do so but are not
guaranteed to.

Figure 2-1. A diagram demonstrating hard and soft contexts emitted by telemetry in an
n-tier web application.
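
To make the hard context tangible, here is a minimal sketch of propagating it with the OpenTelemetry Python API and the W3C Trace Context format: the caller injects a traceparent header into the outgoing request, and the callee extracts it so that its spans join the same trace. The service names, URL, and operation names are hypothetical.

    import requests
    from opentelemetry import trace
    from opentelemetry.propagate import extract, inject

    tracer = trace.get_tracer("web-frontend")

    def fetch_inventory(item_id: str):
        # Caller: start a span and inject its context (the hard context)
        # into the outgoing HTTP headers as a 'traceparent' header.
        with tracer.start_as_current_span("GET /inventory"):
            headers: dict = {}
            inject(headers)
            return requests.get(
                f"https://inventory.example.com/items/{item_id}",
                headers=headers,
            )

    def handle_inventory_request(request_headers: dict):
        # Callee: extract the caller's context so this span becomes a
        # child in the same trace instead of starting a new one.
        ctx = extract(request_headers)
        with tracer.start_as_current_span("lookup-item", context=ctx):
            ...  # read the database, build the response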

Without contexts, the value of telemetry is significantly reduced, because you lose the ability to associate
measurements with each other. The more context you add, the
easier it becomes to interrogate your data for useful insights,
especially as you add more concurrent transactions to a
distributed system.

In a system with low levels of concurrency, soft contexts may be suitable for explaining system behavior. As complexity and
concurrency increase, however, a human operator will quickly
be overwhelmed by data points and the value of the telemetry
will drop to zero. You can see the value of soft context in
Figure 2-2, where viewing the average latency of a particular
endpoint doesn’t give a lot of helpful clues as to any underlying
problems, but adding context (a customer attribute) allows you
to quickly identify a user-facing problem.
Figure 2-2. An illustration of a time-series metric showing average latency for an API
endpoint. One graph plots average (p50) latency, the other applies a single group-by
attribute. In the comparison, the group-by graph illustrates that a single group is
suffering significantly worse p50 latency than most other groups.

The most common soft context used in monitoring is time. One tried and true method of spotting differences or correlating
cause and effect is to align multiple time windows across
several different instruments or data sources, then visually
interpret the output. Again, as complexity increases, this
method becomes less effective. Traditionally, operators are
forced to layer in additional soft contexts, ‘zooming in and out’
until they’ve found a sufficiently narrow lens to actually find
useful results in their data set.

Hard context, on the other hand, can dramatically simplify this exploratory process. A hard context allows you to associate individual telemetry measurements not only with other measurements of the same type -- for example, ensuring that individual spans within a trace are linked together -- but also with measurements from different types of instruments. For example, you can associate metrics
with traces, link logs to spans, and so forth. The presence of
hard context can dramatically reduce the time a human
operator spends investigating anomalous behavior in a system.
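
As one hedged sketch of what that linking looks like in code, the active span’s trace and span IDs can be stamped onto a log record so an analysis tool can jump from the log line straight to the surrounding trace (the logger name and message are invented; many logging integrations do this automatically):

    import logging
    from opentelemetry import trace

    logger = logging.getLogger("billing")

    def charge_card(order_id: str) -> None:
        # Pull the hard context (trace and span IDs) from the active span
        # and attach it to the log record as ordinary fields.
        ctx = trace.get_current_span().get_span_context()
        logger.info(
            "charge submitted for order %s",
            order_id,
            extra={
                "trace_id": format(ctx.trace_id, "032x"),
                "span_id": format(ctx.span_id, "016x"),
            },
        )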

To achieve observability, developers and operators must free themselves from long and painful investigatory processes.
Telemetry data must be available as a “first-class citizen” of
standard development and operation processes. It shouldn’t
just be added haphazardly in response to incidents or in ad hoc
project-based feature work. You can’t rapidly detect changes in
a complex, evolving, and dynamic distributed system without
both hard and soft context. Achieving both requires planning
and care.

Telemetry Layering
In the above discussion, you might have noticed that we’re
assuming the operator is using multiple forms of telemetry in
their investigatory workflows. This may strike you as either
obvious or confusing. “Aren’t logs just logs? Are traces logs?
How do I get metrics from traces?” Telemetry signals are
actually just specific ways of modeling system state and
behavior. There’s nothing intrinsic about any given signal that
makes it what it “is” other than what we make of it. You can
convert any signal type to any other signal type, if that’s what
makes sense for how you want to use the data.

A straightforward example of signal conversion is often implemented for log data. All software emits logs, but it’s
extremely difficult to use logs alone to alert operators about
anomalous behavior at scale. Instead, most operators employ
tools that represent log data as a time-series metric, or that use
correlation identifiers to add hard context to the logs. (This
turns the logs into a form of traces.)
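
As a rough illustration of the first conversion, here is a sketch that ‘upcycles’ raw log lines into an OpenTelemetry counter instead of shipping every line; the meter name, metric name, attribute, and match rule are all assumptions made for the example:

    from opentelemetry import metrics

    meter = metrics.get_meter("log-upcycler")
    error_counter = meter.create_counter(
        "app.log.errors",
        description="Number of ERROR lines observed in application logs",
    )

    def process_log_line(line: str) -> None:
        # Instead of storing the raw line, record a cheap, aggregable
        # data point with an attribute that can be used for grouping.
        if " ERROR " in line:
            error_counter.add(1, {"service.name": "checkout"})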

‘Upcycling’ unstructured log data into other formats can be advantageous, since it requires little work for developers. It
doesn’t force them to re-instrument their software to emit new
signals, and the cost of managing large volumes of telemetry
data usually doesn’t fall to them.
However, there are drawbacks to this approach, foremost of
which is cost. It’s expensive to store and process huge amounts
of log events. The effort is significant as well: SRE teams and
operators must spend time massaging data from disparate
sources to meet an internal standard. There’s also a timeliness
cost. All of these conversions add processing steps. Sometimes
they cause alerts to fire for behavior that’s already been
occurring in production for several minutes, and sometimes
for cases that cleared before the alert even fired. This leads
operators to ignore alerts or tune them to require longer
periods of anomalous behavior, which often results in a poor
end-user experience.

A better solution to this problem is to layer telemetry signals and use them in complementary ways rather than attempting to
turn a single ‘dense’ signal -- such as application logs -- into
other forms. You can use more tailored instruments to measure
application and system behavior at specific layers of
abstraction, link those signals through contexts, and layer your
telemetry to get the right data from these overlapping signals,
recorded and stored in appropriate, efficient ways. Such data
can answer questions about your system that you might not
even have known you had. Layering telemetry, as shown in
Figure 2-3, allows you to better understand and model your
systems.
Figure 2-3. An illustration of layered signals -- histogram metrics measuring API
latency, and exemplar traces for each bucket that focus on errors, with linked profiles
on specific failing components.

Understanding The Territory

Monitoring is a passive action. Observability is an active practice. In order to analyze the territory of a system -- which is to say, to understand how it actually works and performs in production, rather than relying on the parts you can see, like code or documentation -- you need more than just passive
dashboards and alerts based on telemetry data.

Even highly contextual and layered telemetry, by itself, is not enough to achieve observability. You need to store that data
somewhere, and you need to actively analyze it. Your ability to
effectively consume telemetry is thus limited by many factors --
storage, network bandwidth, telemetry creation overhead (how
much memory/CPU is utilized to actually create and transmit
signals), analysis cost, alert evaluation rate, and much more. To
be more blunt, your ability to understand a software system is
ultimately a cost optimization exercise. How much are you
willing to spend in order to understand your system?

This fact causes significant pain for existing monitoring practices. Developers are often restricted in the amount of
context they can provide, as the amount of metadata attached
to telemetry increases the cost of storing and querying that
telemetry. In addition, different signals will often be analyzed
multiple times for distinct purposes. As an example, HTTP
access logs are a good source of data for the performance of a
given server. They are also critical information for security
teams keeping an eye out for unauthorized access or usage of
production systems. This means that the data must be
processed multiple times, by multiple tools, for multiple ends.

In practice, this duplicative effort usually ends in a patchwork of multiple commercial solutions, each with its own
requirements and pricing structures, locking developers and
operators into specific libraries or agents to generate the
necessary data for those products to function. This poses a
challenge to the organization itself, especially when attempting
to optimize costs -- you’re at the mercy of your vendor, because
your data isn’t portable without significant investments in re-
platforming.

Developers and operators need more than vendor lock-in to achieve observability and understand their systems. Why does
this lock-in persist, though? Historically there’s been a lack of a
consensus option for framework and library authors to target
for instrumentation. When it does exist, it’s usually tightly
bound to a specific framework or technology stack, making it
unsuitable for polyglot application architectures that are
common in cloud-native systems. In addition, telemetry data is growing at an exponential rate, thanks to the scalability improvements brought by the cloud.
All of these problems -- a lack of required context, a lack of layered telemetry, significant amounts of vendor lock-in -- have a common cause. The pace of innovation has outstripped our ability to produce meaningful, modern standards around telemetry. Teams seeking to understand the territory of their systems are left wanting -- you need to control costs, to align telemetry from dozens of incompatible sources, to ensure that the telemetry is useful for a variety of purposes, and to make it accessible and useful for diagnostics -- but there has been no unifying force to make these problems solvable. As time marches on, and organizational scale and complexity grow, these problems become more painful to address as the factors that led to them embed deeper into the organizational strata.

What Do Developers and Operators Need?
How do we begin to untangle this Gordian knot? It can feel
overwhelming, and we’re all out of swords to slice through it.
Let’s begin by focusing on the ‘first responders’: the developers
and operators tasked with maintaining and caring for
applications and systems. What do they need? This section
focuses on three of those needs: built-in telemetry, clear and
consistent APIs, and integrated tools.

Built-In Telemetry

We said that telemetry is necessary, but not sufficient, for observability. It is extremely necessary, however, and it must
become a ‘built-in’ feature of cloud-native software. This means
software should be “observable by default.” It should emit all of
the appropriate metrics, logs, traces, profiles, events, and so
forth. Telemetry should be:

Well annotated

Well-annotated telemetry has consistent and clear metadata about what the software is doing, what it’s responding to, where it’s running, and even how much it costs (if it’s running on a metered instance, for example); see the sketch after this list for what such annotations can look like in code.

Emitted by dependencies

Telemetry should be emitted by the dependencies of a system. Yes, that means managed databases, API
gateways, load balancers, and other external APIs. It also
means continuous integration and deployment (CI/CD)
tools, security scanners, load and stress testers, and
orchestration tools. All of the system’s functional
components should emit as much telemetry as they can,
in a commonly understood format. That way, the system’s
operators can see into and understand the system as a
whole and as each of its parts.

Invisible

Telemetry should be mostly invisible to developers. What developers need isn’t another library or agent, but a
checkbox that they can toggle on to immediately receive a
rich stream of layered telemetry in a universal format.
Rather than only exposing this telemetry through external
agents, proprietary tools, or code-level integrations, it
should mostly just exist, waiting in the wings to be
consumed by analysis and storage tools.
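
As promised above, here is a minimal sketch of what “well annotated” can look like in practice, using OpenTelemetry resource attributes (all of the values are placeholders): the SDK stamps this metadata about what is running, and where, onto every signal it emits.

    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider

    # Describe the emitting service once; the SDK attaches these
    # attributes to every span this provider produces (and the matching
    # metric and log providers can share the same resource).
    resource = Resource.create(
        {
            "service.name": "checkout",
            "service.version": "1.4.2",
            "deployment.environment": "production",
            "cloud.region": "us-east-1",
        }
    )
    tracer_provider = TracerProvider(resource=resource)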

Clear and Consistent APIs

Even in our dream future of ubiquitous, built-in telemetry, developers will need to interact with the telemetry system itself.
There will always be things that the telemetry system cannot
intuit when creating metadata about a service. Developers will
need to be able to easily enhance and annotate both existing
and new signals within their code.
To this end, the telemetry system needs to build not only a
lingua franca for the expression of its data, but a common
lexicon of ‘nouns and verbs’ to describe how to interact with it.
OpenTelemetry provides this language through its data model
and semantic conventions, which we’ll cover later in this
chapter and in Chapter 3. We need fairly universal concepts of
different metric instruments, how they appear in code, and
how they’re used. We need to know how to add metadata that
represents concepts in our business logic (such as customer
identifiers) to traces, metrics, and logs. We need this not only to
improve our efficiency, but to make telemetry a transferable
skill. If application developers are to be responsible for creating
and interpreting telemetry data, then its concepts need to be as
universal as an object or a function.
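
In code, that shared lexicon mostly comes down to consistent attribute names. Here is a hedged sketch of annotating a span with one attribute drawn from the semantic conventions and one application-specific attribute (the values and the “app.” namespace are invented for the example):

    from opentelemetry import trace

    tracer = trace.get_tracer("checkout")

    def checkout(cart_items: list, customer_id: str) -> None:
        with tracer.start_as_current_span("checkout") as span:
            # "enduser.id" comes from the semantic conventions;
            # application-specific metadata lives under its own namespace.
            span.set_attribute("enduser.id", customer_id)
            span.set_attribute("app.cart.item_count", len(cart_items))
            # ... business logic ...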

OpenTelemetry isn’t “one-size-fits-all” -- it’s designed for you to use as much, or as little, as you need. The SDK and API are both
composable, allowing you to re-implement certain components
as needed. This makes integration and translating from existing
formats less painful and faster.

Integrating Telemetry with Tools

If creating telemetry is going to be a default part of the development process (see sidebar), then we must build it into our tooling as well. Integrated development environments, code editors, language servers, and automated testing and verification tools must all become ‘telemetry-aware’.

THE INEVITABILITY OF TELEMETRY

In this chapter, we’ve suggested that telemetry data must become a part of the development process itself, so developers
can quickly understand what their applications are doing. So
why haven’t we always done it this way?

Well, the people who used to be responsible for understanding what was going on in an application were the QA Department.
Before developers were shipping to production on a daily basis,
there was a lot less pressure to make system behavior self-
documenting and discoverable. You could spot the interesting
interactions and bugs during QA cycles. Today, though, we need
high-quality telemetry to catch them.

In fairness, even a crackerjack QA team would probably still want this kind of telemetry. It would make their jobs much
easier. 😎

Bringing the process of writing telemetry closer to the process of writing code will make it easier for developers to create high-
quality telemetry. It will also unlock a wealth of opportunities
for language designers and developer-productivity specialists to
build the next generation of toil-reducing tools. OTLP is vendor-
agnostic, allowing you to instrument your system without tying
yourself to a particular commercial product. In addition to
reducing vendor lock-in, integrating telemetry into tools will
speed the process of learning new codebases. OpenTelemetry is
supported by major cloud providers and monitoring vendors,
directly and through official contributions to the project.

Imagine using an IDE that helps you make complex code more
readable, automatically completes function definitions, and
even suggests cost-saving patterns, based on real-world
performance data from other services. Add ubiquitous
telemetry to machine learning algorithms and artificial
intelligence, and the potential is vast!

OpenTelemetry provides a rich set of tooling, beyond telemetry creation and export APIs and SDKs. The OpenTelemetry
Collector acts as a ‘Swiss Army knife’ of telemetry collection and
processing, allowing you to build custom observability
pipelines and tailor them to your needs.

With real-time telemetry integrated into the development lifecycle, you can profile changes in production versus
development to ensure that optimizations are working, or
discover potential integration snafus by exercising API changes
across all remote procedure calls (RPCs). In the future, service
endpoints can become self-documenting by showing you valid
parameters based on real data. They could even compare your
changes to existing Service Level Objectives, so you can
understand how new features or fixes might impact latency or
availability.

What Do Businesses and Organizations Need?
Solving telemetry challenges doesn’t just benefit developers and
operators. Teams, businesses, and organizations can also realize
significant benefits from modern, cloud-native telemetry. Now
that we’ve outlined some of the obvious productivity benefits of
improving speed and reliability, let’s look at what enterprises
need: data that is standardized, portable, and compatible with
their existing data.

Standardized Data Formats

Standardization makes it much easier to manage data, people, processes, and costs. It makes these tasks more reliable,
predictable, and uniform. This is as true for telemetry data as it
is for anything else. Data standardization is a constant and
continuous process at most enterprises, driven primarily by
data warehousing and Extract, Transform, Load (ETL)
processes. But there is still no uniform standard for all
important telemetry signals.

Organizations need a single, well-supported, and ubiquitous choice that supports observability with consistent and accurate
metadata and hard and soft context, and that can integrate and
interoperate with a variety of tools. OpenTelemetry Protocol
(OTLP) offers a single protocol and data format for transmitting
and exchanging telemetry data. This makes it possible to switch
between analysis tools with no lock-in, and to integrate
different data sources.
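
In practice that can be as small as the following sketch: the Python SDK exports spans over OTLP to whatever endpoint you configure (here, a Collector listening on the default gRPC port -- the endpoint value is a placeholder), and changing analysis tools means changing the destination rather than the instrumentation:

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
        OTLPSpanExporter,
    )
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Point the SDK at an OTLP endpoint; the instrumentation itself never
    # needs to know which backend ultimately stores the data.
    provider = TracerProvider()
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
    )
    trace.set_tracer_provider(provider)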

Portability

Telemetry needs to be portable between different analysis tools. Large organizations tend to employ specialists who use data for
different purposes. An operations team that focuses on security,
for example, will most likely analyze the same data very
differently than an application development team will.
Additionally, you might want to share these streams of
telemetry data between internal and external tools, platforms,
and consumers. Perhaps your organization wants to share a
subset of its telemetry with its integration partners, or with
customers who build value-added services. The organization
might also need to provide data to independent auditors for
regulatory and compliance reasons. If data will be stored in
multiple sites for varying lengths of time, portability is also
important.

OpenTelemetry is built on a unified context mechanism that allows for observability data to be passed between services and
threads. This allows you to side-channel important telemetry
attributes (like customer identifiers or database shards)
between services.
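
That side channel is OpenTelemetry’s baggage API. Here is a minimal sketch, with the key name and value invented for illustration; configured propagators carry baggage across service and thread boundaries alongside the trace context:

    from opentelemetry import baggage, context

    def handle_checkout() -> None:
        # Attach a value to the current context; it travels with the
        # request, and downstream telemetry can copy it into attributes.
        ctx = baggage.set_baggage("app.customer.id", "cust-1234")
        token = context.attach(ctx)
        try:
            ...  # call downstream services with this context active
        finally:
            context.detach(token)

    # Anywhere downstream (even in another service, after propagation):
    customer_id = baggage.get_baggage("app.customer.id")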

Compatibility with Existing Data

The unfortunate reality is that almost no enterprise will commit to a complete rewrite of its application and system telemetry, even if doing so would help. There are both practical and logistical hurdles, ranging from “not having enough time” to “not having the underlying source code to production applications.”
Therefore, organizations need telemetry systems that can be
integrated into existing systems and formats. There are
tradeoffs with this approach, however. Certain types of hard
context cannot be easily plumbed into existing systems. Older
components may simply not have the capability to emit certain
signals at all! Organizations need flexible and composable
frameworks to standardize on. Rather than end-to-end ‘magic’
solutions, they require a rich and somewhat un-opinionated
framework to build on and extend, bringing old and new
together to provide a comprehensive picture of their systems.

Defined versioning, deprecation, and stability guarantees give organizations confidence that the investments they make in OpenTelemetry are future-proof. The OpenTelemetry Semantic Conventions ensure consistent data labeling across different signals -- even those produced by different databases,
cloud providers, and frameworks.

We’ll go into depth in Chapter 3 about all of these points and more.

OpenTelemetry: A Unifying Force


The real value in OpenTelemetry is that it recognizes that
telemetry data must be a commodity for developers, and it’s
working toward this future. When we talk about “built-in”
telemetry, what we mean is that OpenTelemetry envisions a
future where software emits high-quality, rich telemetry
without developers having to do anything. It aims to create a
river of data running just under the surface, so that you can
draw from it and enrich your data as needed.

We hope we’ve convinced you by now that the real question isn’t “why OpenTelemetry?” but “why not OpenTelemetry?” Although the project is still in its early stages (it’s just over three years old as we write this), it’s the CNCF’s second most popular
project!3 OpenTelemetry is well positioned to achieve the lofty
goals and objectives we’ve set out here.

Observability is no longer a “nice-to-have” for software systems: it’s a “need-to-have.” You need to be able to collect high-quality,
unified telemetry data from your systems. OpenTelemetry is
already an indispensable part of making that possible. OTLP
provides a platform for innovation, and OpenTelemetry uses
semantic conventions to improve data quality and consistency,
reduces vendor lock-in, and provides universal APIs and SDKs
capable of embedding telemetry through the entire stack. We’re
extremely excited to share it with you in this book.
In the next chapter, we’ll give you a tour of OpenTelemetry’s
components at a code and tool level. We’ll show you what
makes up the primary signals of metrics, logs, and traces, and
cover architectural details like the API and the Collector. We’ll
also explain how you can get help from the OpenTelemetry
community or even get involved as a contributor.

1 The VOID Report for 2022 includes many interesting insights into the lack of a
relationship between incident severity and duration, leading us to conclude that the
important thing in telemetry isn’t its utility in reducing MTTR.

2 Charity Majors, Liz Fong-Jones, and George Miranda, Observability Engineering (O’Reilly, 2022), p. 8.

3 As measured by project velocity, a combination of code contributions, issues created/closed, and sentiment/usage analysis as tracked by the CNCF.
About the Authors
Austin Parker is the Head of Developer Relations at Lightstep,
and has been creating problems with computers for most of his
life. He’s a maintainer of the OpenTelemetry project, the host of
several podcasts, organizer of Deserted Island DevOps,
infrequent Twitch streamer, conference speaker, and more.
When he’s not working, you can find him posting on Twitter,
cooking, and parenting. His most recent book is Distributed
Tracing in Practice (O’Reilly).

Ted Young is one of the Co-Founders of the OpenTelemetry project. With twenty years of experience, he has built
distributed systems in a variety of environments, including
visual fx pipelines and container scheduling systems. He
currently works as Director of Developer Education at
Lightstep. He loves speaking to users, teaching OpenTelemetry,
and sharing observability best practices.
