Austin Parker, Ted Young - Learning OpenTelemetry - Setting Up and Operating A Modern Observability System-O'Reilly Media
With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
The views expressed in this work are those of the authors and
do not represent the publisher’s views. While the publisher and
the authors have used good faith efforts to ensure that the
information and instructions contained in this work are
accurate, the publisher and the authors disclaim all
responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any
code samples or other technology this work contains or
describes is subject to open source licenses or the intellectual
property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-098-14718-1
Chapter 1. The State of Modern
Observability: A Brief Overview
History is not the past but a map of the past, drawn from a
particular point of view, to be useful to the modern traveller.
But you don’t need just any data: you need correlated data, data
that is already organized, ready to be analyzed by a computer
system. As you will see, data with that level of organization has
not been readily available. In fact, as systems have scaled and
become more heterogeneous, finding the data you need to
analyze an issue has become even harder. If it was once like
looking for a needle in a haystack, it’s now more like looking for
a needle in a stack of needles.
“Why is it slow?”
Carl Sagan said, “We have to know the past to understand the
present.”2 That certainly applies here: to see why a new
approach to observability is so important, you first need to be
familiar with traditional observability architecture and its
limitations.
Resources
Transactions
Transactions are user requests that orchestrate and utilize
the resources the system needs to do work on behalf of
the user. Usually, a transaction is kicked off by a real
human, who is waiting for the task to be completed.
Booking a flight, hailing a rideshare, and loading a
webpage are examples of transactions.
User telemetry
In plainer terms, user telemetry will tell you how long someone
hovered their mouse cursor over a ‘checkout’ button in an e-
commerce application. Performance telemetry will tell you how
long it took for that checkout button to load in the first place,
and which programs and resources the system utilized along
the way.
Fun fact: it’s called telemetry because the first remote diagnostic
systems transmitted data over telegraph lines. While people
often think of rockets and 1950s aerospace when they hear the
term telemetry, if that was where the practice had started, it
would have been called radiometry. Telemetry was actually first
developed to monitor power plants and public power grids –
early but important distributed systems!
While logging did tell you about individual events and moments
within a system, understanding how that system was changing
over time required more data. A log could tell you that a file
couldn’t be written because the storage device was out of space,
but wouldn’t it be great if you could track available storage
capacity and make a change before you ran out of space?
Metrics are compact, statistical representations of system state
and resource utilization. They were perfect for the job. Adding
metrics made it possible to build alerting on data, beyond
errors and exceptions.
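
The storage example above can be sketched in a few lines of Python. This is a minimal illustration, not an OpenTelemetry API: the names CapacitySample, sample_disk, and should_alert, and the 10% threshold, are assumptions made up for this sketch. It shows the core idea of a metric: sample a numeric view of system state, then alert on a threshold before an error is ever logged.

```python
import shutil
from dataclasses import dataclass


@dataclass
class CapacitySample:
    """One metric data point: how full a storage device is (hypothetical type)."""
    total_bytes: int
    free_bytes: int

    @property
    def free_ratio(self) -> float:
        # Fraction of the device that is still available, between 0.0 and 1.0.
        return self.free_bytes / self.total_bytes


def sample_disk(path: str = ".") -> CapacitySample:
    """Take one measurement of the device that holds `path`."""
    usage = shutil.disk_usage(path)
    return CapacitySample(total_bytes=usage.total, free_bytes=usage.free)


def should_alert(sample: CapacitySample, min_free_ratio: float = 0.10) -> bool:
    """Return True when free space drops below the (assumed) threshold."""
    return sample.free_ratio < min_free_ratio


if __name__ == "__main__":
    current = sample_disk(".")
    print(f"free: {current.free_ratio:.1%}, alert: {should_alert(current)}")
```

A log line would only report the failure once the write had already failed; sampling a value like this on a schedule is what lets an alert fire while there is still time to act.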
So, while the term “three pillars” does explain the way
traditional observability is architected, it is also problematic: it
makes this architecture sound like a good idea, which it isn’t.
It’s cheeky, but I prefer a different turn of phrase: “the three
browser tabs of observability.” That’s what you’re
actually getting.
Emerging complications
The problem is that our systems are not composed of logging
problems or metrics problems.
But the devil is in the details. It’s possible for a simple, isolated
bug to be confined to a single transaction. But most production
problems emerge from the way many concurrent transactions
interact.
Human investigation
Computer investigation
Figure 1-3. Finding correlations when all the telemetry signals are braided together.
Conclusion
This book will be your guide to learning OpenTelemetry. It is
not meant to be a replacement for OpenTelemetry
documentation, which can be found on the project’s website.
Instead, this book explains the philosophy and design of
OpenTelemetry, and offers practical guidance on how to wield it
effectively.
2 Sagan, C. E. (author and presenter). (1980) Episode 2: One Voice in the Cosmic Fugue
[Television series episode]. In Adrian Malone (Producer), Cosmos: A Personal Voyage.
Arlington, VA: Public Broadcasting Service.
3 Tanenbaum, Andrew S.; Steen, Maarten van, Distributed systems: principles and
paradigms, 2002
— ALFRED KORZYBSKI
Software systems are the vital heart of the global economy. This
may sound like an exaggeration, but it really isn’t: there’s
almost no productive enterprise that doesn’t involve software
systems at some point. Commerce, logistics, manufacturing,
telecommunications, textile production: you name it, software
systems play a crucial role. The only thing more crucial than the
software itself is the legions of humans tasked with the
systems’ care and upkeep.
Figure 2-1. A diagram demonstrating hard and soft contexts emitted by telemetry in an
n-tier web application.
Telemetry Layering
In the above discussion, you might have noticed that we’re
assuming the operator is using multiple forms of telemetry in
their investigatory workflows. This may strike you as either
obvious or confusing. “Aren’t logs just logs? Are traces logs?
How do I get metrics from traces?” Telemetry signals are
actually just specific ways of modeling system state and
behavior. There’s nothing intrinsic about any given signal that
makes it what it “is” other than what we make of it. You can
convert any signal type to any other signal type, if that’s what
makes sense for how you want to use the data.
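
To make the question “How do I get metrics from traces?” concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the OpenTelemetry SDK: the Span type, its fields, and the spans_to_metrics function are hypothetical names invented for this example. It folds a list of finished spans into per-operation request counts, error counts, and mean durations.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """A simplified, hypothetical span: one timed operation in a transaction."""
    name: str
    start_ms: float
    end_ms: float
    error: bool = False


def spans_to_metrics(spans):
    """Aggregate spans into per-operation count, error count, and mean duration."""
    totals = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        bucket = totals[span.name]
        bucket["count"] += 1
        bucket["errors"] += int(span.error)
        bucket["total_ms"] += span.end_ms - span.start_ms
    return {
        name: {
            "count": bucket["count"],
            "errors": bucket["errors"],
            "mean_ms": bucket["total_ms"] / bucket["count"],
        }
        for name, bucket in totals.items()
    }


if __name__ == "__main__":
    recorded = [
        Span("checkout", 0, 100),
        Span("checkout", 10, 310, error=True),
        Span("search", 0, 50),
    ]
    print(spans_to_metrics(recorded))
```

The same data answers different questions depending on how it is modeled: the spans describe individual transactions, while the aggregated values describe how the system behaves overall.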
Built-In Telemetry
Well annotated
Emitted by dependencies
Invisible
Imagine using an IDE that helps you make complex code more
readable, automatically completes function definitions, and
even suggests cost-saving patterns, based on real-world
performance data from other services. Add ubiquitous
telemetry to machine learning algorithms and artificial
intelligence, and the potential is vast!
Portability
I hope we’ve convinced you by now that the real question isn’t
“why OpenTelemetry?” but “why not OpenTelemetry?”
Although the project is still in its early stages (it’s just over three
years old as I write this), it’s the CNCF’s second most popular
project!3 OpenTelemetry is well positioned to achieve the lofty
goals and objectives we’ve set out here.
1 The VOID Report for 2022 includes many interesting insights into the lack of a
relationship between incident severity and duration, leading us to conclude that the
important thing in telemetry isn’t its utility in reducing MTTR.