
Distributed Latency Profiling through Critical Path Tracing

CPT can provide actionable and precise latency analysis

BRIAN EATON, JEFF STEWART, JON TEDESCO, AND N. CIHAN TAS, GOOGLE, INC., USA

For complex distributed systems that include services that constantly evolve in functionality and data, keeping overall latency to a minimum is a challenging task. Critical path tracing (CPT) is a new applied mechanism for gathering critical path latency profiles in large-scale distributed applications. It is currently enabled in hundreds of Google services, providing valuable day-to-day data for latency analysis.
Fast turnaround time is an essential feature for an
online service. For services that are constantly supporting
increasing data and functionality, keeping overall
latency to a minimum is an ongoing issue. Determining
the root causes of high latency in a distributed system
is challenging. The goal is to answer a key optimization
question: Given a distributed system and workload, which
subcomponents can be optimized to reduce latency?
Low latency is an important feature for many Google
applications such as Search,4 and latency-analysis tools
play a critical role in sustaining low latency at scale.
The systems change constantly because of code and
deployment changes, as well as shifting traffic patterns.
Parallel execution is essential, both across service
boundaries and within individual services. Different slices
of traffic have different latency characteristics. Latency-
analysis tools struggle to make sense of the system in
practice.
CPT provides detailed and actionable information about
which subcomponents of a large and complex distributed
system are contributing to overall latency. This article
presents results and experiences as observed by using CPT
in a particular application, Google Search.

THE NEED FOR CRITICAL PATH TRACING


The critical path describes the ordered list of steps
that directly contribute to the slowest path of request
processing through a distributed system, such that
optimizing these steps reduces overall latency. Individual
services have many subcomponents, and CPT relies on
software frameworks15 to identify which subcomponents are
on the critical path. When one service calls another, critical
path information is propagated from the callee back to the
caller via remote procedure call (RPC) metadata. The caller
then merges critical paths from its dependencies into a
unified critical path for the entire request.
The unified critical path is logged with other request
metadata. Log analysis is used to select requests of
interest, and then critical paths from those requests are
aggregated to create critical path profiles. The tracing
process is efficient, allowing large numbers of requests to
be sampled. The resulting profiles give detailed and precise
information about the root causes of latency in distributed
systems.
The overall goal is latency profiling for distributed
systems. How can developers get actionable information
about the root causes of latency?

An example system
Consider the distributed system in figure 1, which consists
of three services, each of which has two subcomponents.
The purpose of the system is to receive a request from
the user, perform some processing, and return a response.
Arrows show the direction of requests; responses are sent back in the opposite direction.
The system divides work across many subcomponents.

FIGURE 1: A simple distributed system. A user request enters Service A (subcomponents A1 and A2), which depends on Service B (B1 and B2) and Service C (C1 and C2).
Requests first arrive at Service A, which hands request
processing off to subcomponent A1. A1 in turn relies
on subcomponents B1 and A2, which have their own
dependencies. Some subcomponents can be invoked
multiple times during request processing (e.g., A2, B2, and
C2 all call into C1).
Even though the system architecture is apparent from
the figure, the actual latency characteristics of the system
are hard to predict. For example, is A1 able to invoke A2
and B1 in parallel, or is there a data dependency such that
the call to B1 must complete before the call to A2 can
proceed? How much internal processing does A2 perform
before calling into B2? What about after receiving the
response from B2? Are any of these requests repeated?
What is the latency distribution of each processing step?
And how do the answers to all of these questions change
depending on the incoming request?
Without good answers to these questions, efforts to
improve overall system latency will be poorly targeted and
might go to waste. For example, in figure 1, to reduce the
overall latency of A1 and its downstream subcomponents,
you must know which of these subcomponents actually
impact the end-to-end system latency. Before deciding
to optimize, you need to know whether A1 → A2 actually
matters.

Analysis with RPC telemetry


RPC telemetry is commonly used for latency analysis.
Services export coarse-grain information about how many
times an RPC is made and the latency characteristics of
those RPCs. Monitoring services collect this information
and create dashboards showing system performance.
Coarse-grain slicing (e.g., time range, source, and
destination information) is commonly used to enhance RPC
telemetry-based analysis.1,9,14
RPC telemetry works well when there are a few RPC
services that are always important for latency. Monitoring
for those services can quickly identify which are causing
problems. Service owners can be identified, and they can
work to improve performance.
RPC telemetry struggles with parallelism, however.
Referring to the figure, assume that A2 does CPU-bound
work while waiting for responses from B2, C1, and C2.
Improving latency for B2, C1, and C2 will not improve
overall performance, because A2 is not actually blocked
while waiting for their responses.
Repeated RPCs can also make monitoring confusing.
Perhaps A2 is making hundreds of requests to C1. RPC
telemetry can show the average latency as very fast, but
a single slow RPC out of hundreds might be the root cause
of slowness across the entire system.7 RPC telemetry
also cannot tell whether those hundreds of requests are
happening in parallel or are serialized. This is important
information for understanding how those requests impact
latency.
Heterogeneous workloads are also a challenge for RPC
telemetry, because data points from different workloads
are typically mixed together. Information about low-
volume but important requests is lost. Problems with
diverse workloads increase as the number of important
slicing dimensions grows. If an A/B experiment shows a
latency regression, latency should be analyzed for each
arm of the experiment to identify the root cause. If a
latency problem appears in the 99th percentile (i.e., within
the top 1 percent of the requests that take the most time),
only those very slow requests should be analyzed.
This type of workload slicing is difficult with RPC
telemetry. Most RPC telemetry systems rely on
aggregating data along a few dimensions for efficiency
reasons. For example, each process might aggregate RPC
data by caller, and then data collection further aggregates
the data across processes. Collecting data across a large
number of dimensions (e.g., separately for every A/B
experiment) is usually not feasible. Dimensions of interest
for latency analysis also need to be defined in advance and
shared across all services. For example, imagine a caller
wants to analyze latency based on the client device model.
If downstream services are not aware of the device model
information, they cannot provide RPC telemetry sliced by
that dimension.
RPC telemetry struggles with identifying important
subcomponents within services. For example, both A1 and
A2 (subcomponents within Service A) make requests into
B2. Telemetry for Service A and Service B will typically
mix these requests together, even though they might have
different latency characteristics. This can make it difficult
to tell which requests need to be optimized.
The last major issue with RPC telemetry is the
streetlight effect: RPC telemetry sheds light in one
particular area of the system, so you spend time optimizing
that part of the system. Meanwhile, latency problems not
caused by RPCs are lost in the dark.

Analysis with CPU profilers


CPU profiling complements RPC telemetry well. Once
RPC telemetry has identified a problematic service, CPU
profiling can help figure out how to make that service
faster. CPU samples with function call stacks are collected
and aggregated, providing insights into expensive code
paths. Profiles are typically collected for a single service
at a time and might include the type of request in-flight and
hardware profile counters.2,10,13
CPU profiling excels at identifying specific expensive
subcomponents within services. In cases where CPU time
contributes to overall latency, CPU profiling can help
identify where to optimize. Many of the same issues that
impact RPC telemetry, however, also cause problems for
CPU profiles. Lack of information about parallelism means
that you can’t tell whether CPU-bound work is happening in
parallel with RPCs, in parallel with other CPU-bound work,
or is actually blocking request progress. Heterogeneous
workloads cause problems as well: Small but important
slices of traffic are lost in the noise. Joining CPU profiles
with information about parallelism and request metadata
from distributed tracing can address these limitations, but
deployment of that technology is not widespread.3
The streetlight effect also impacts CPU profiling: It
makes it more likely you will focus on code that uses a
lot of CPU, even if that code is not contributing to overall
system latency.

Analysis with distributed tracing


The last common tool in the latency profiling toolkit
is distributed tracing. This approach follows individual
requests through a system, collecting timing points and
additional data as those requests are processed. Traces are
aggregated and analyzed to yield application insights.5,16,17
Unlike RPC telemetry and CPU profiling, distributed
tracing handles parallelism and heterogeneous workloads
well. Information about all cross-service requests is
collected, including timing points. Visualization shows
exactly when work for each service began and ended, and
which services were running in parallel versus serial.
Most distributed tracing includes tracing for
RPC boundaries by default, but leaves out service
subcomponent information. Developers can add tracing for
subcomponents as needed.
Workload slicing to find traces for particularly
important requests is also possible, although again,
developers need to manually tag traces that are
important. Distributed tracing even allows automated
analysis to identify which services contribute to total
latency.5,18
The major obstacle to using distributed tracing for
detailed latency analysis is cost. In Google Search, a single
service might have a few or dozens of RPC dependencies
but can easily have 100 times that number of important
subcomponents. Instrumenting subcomponents by default
increases the size of traces by orders of magnitude.
Sample rates for distributed tracing systems are
another issue. System operators need to make tradeoffs
between the amount of data they collect with distributed
tracing and the overhead they impose by doing so.
As tracing becomes more detailed, it becomes more
expensive. Adaptive sampling techniques16,17 are sometimes
used to increase the sampling rate for interesting
requests. If the task is to investigate the 99th percentile latency for a 1 percent experiment, only about one request in 10,000 is relevant (1 percent of traffic times the slowest 1 percent of those requests), so finding a single relevant example requires roughly 10,000 traced requests. As shown
later in this article, 10,000 to 100,000 examples might
be necessary to gain statistical confidence about where
a regression occurred. Together, this means that it might
be necessary to gather full traces for 10^8 to 10^9 requests,
with each trace being 100 times larger than the default
expected by the distributed tracing system.

Critical path tracing


CPT is designed to fill some of the gaps in these systems.
The term critical path in project management12 refers to
the many interdependent steps that must be completed to
finish a project. In this article, critical path describes the
ordered list of steps that directly contribute to the slowest
path of request processing through a distributed system.
Aggregating these traces into critical path profiles can
identify latency bottlenecks in the overall system.
This methodology is used extensively today for latency
analysis in hundreds of Google services, including Search,
and has several benefits.
Software frameworks are used to instrument service
subcomponents automatically.15 Support has been added to
most of the commonly used frameworks in Google Search,
and developers using those frameworks get fine-grain CPT
without additional coding effort. (The impact of automatic
instrumentation on overall critical path accuracy is
discussed later.)
Framework instrumentation code automatically
identifies the critical path for request execution. Only
the critical path is retained for analysis; other trace
information is discarded. This reduces tracing cost by
orders of magnitude.
Critical path traces are logged alongside other
request and response metadata, such as A/B experiment
information and which features appeared on the Search
results page. This allows standard log-analysis techniques
to use business criteria to find traces from requests of
interest. Centralized logging provides additional cost
savings. Other distributed tracing systems log traces
separately for each machine involved in a request.
Reconstructing a complete request requires joining traces
from hundreds of machines. Logging all critical path
subcomponents together avoids the overhead of joining
the traces.
Together, these cost reductions allow detailed traces
with high sampling rates.

TRACING A REQUEST
This section describes the work needed to gather a fine-
grain critical path trace for a single request.

Critical path definition


For latency profiling, the key input used is the critical path:
the set of steps that blocked progress on producing the
response. When multiple steps proceed in parallel, the
slowest step is the only one on the critical path.
The execution of the request can be modeled as a
directed graph of named nodes (e.g., subcomponent
names). Each node in the graph does some of its own
computation. Each edge in the graph is a dependency
where a node must wait for the completion of one of
its dependencies before computation can proceed. The
critical path is the longest-duration path through the
nodes, starting at the request entry point and finishing
at the node that computes the response. The length of
the critical path is the total latency for processing the
request.
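
To make this definition concrete, the following is a rough sketch (not the production implementation) of computing the critical path over a simplified model in which each step is a node with a single duration and edges point from a node to the dependencies it must wait on. The Step and CriticalPath names are illustrative only; real traces also split a component's own time into multiple segments, as the examples below show.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A single step of request processing: a name, its own processing time, and
// the steps it must wait for before it can run.
final class Step {
  final String name;
  final long selfMillis;
  final List<Step> dependencies = new ArrayList<>();

  Step(String name, long selfMillis) {
    this.name = name;
    this.selfMillis = selfMillis;
  }
}

final class CriticalPath {
  // Returns the longest-duration path ending at the step that produces the
  // response; its total duration is the end-to-end latency of the request.
  static List<Step> compute(Step responseStep) {
    return longestPathTo(responseStep, new HashMap<>());
  }

  private static List<Step> longestPathTo(Step step, Map<Step, List<Step>> memo) {
    List<Step> cached = memo.get(step);
    if (cached != null) {
      return cached;
    }
    List<Step> slowest = List.of();
    for (Step dependency : step.dependencies) {
      List<Step> candidate = longestPathTo(dependency, memo);
      if (durationMillis(candidate) > durationMillis(slowest)) {
        slowest = candidate;
      }
    }
    List<Step> path = new ArrayList<>(slowest);
    path.add(step);  // a step runs only after its slowest dependency completes
    memo.put(step, path);
    return path;
  }

  static long durationMillis(List<Step> path) {
    return path.stream().mapToLong(s -> s.selfMillis).sum();
  }
}
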
Consider a distributed system such as that in figure
1 and suppose the subcomponents execute in parallel
according to scenarios as summarized in figure 2. Figure
2a shows an example critical path calculation where the
requests to B1 and A2 happen sequentially: A1 does some
computation, then blocks, waiting for B1 to complete. A1
then proceeds with additional computation before waiting
for A2 to complete. This example has a critical path of
{A1=5 ms, B1=20 ms, A1=8 ms, A2=2 ms}, with a total
critical path of 35 ms.
Figure 2b shows how this changes when B1 and A2
execute in parallel. The new critical path becomes {A1=5
ms, B1=20 ms, A1=8 ms}, for a total critical path of 33
ms. A2 has been eliminated from the critical path. In this
scenario, optimization efforts should focus on A1 and B1.
Parallel execution also applies when the parent node
overlaps with child nodes, as in figure 2c. In this example,
A1 sends an RPC to B1 but does not immediately block
to wait for a response. Instead, A1 continues with other
computations in parallel. In this case, whichever node finishes last is assigned to the critical path: {A1=3 ms, B1=14 ms, A1=10 ms}.

FIGURE 2: Different parallelized execution scenarios: (a) serial execution, (b) parallel execution, (c) overlapping execution

Identifying service subcomponents


Infrastructure-first distributed tracing systems typically
collect data at RPC boundaries by default and allow
developers to add trace annotations for more detailed
tracing.16,17 With CPT, you can take advantage of the
standardization provided by software frameworks to
collect more detailed information by default.15
As an example, the Dagger framework encourages
authors to write code as ProducerModules. (Dagger is a
fully static, compile-time dependency injection framework
for Java, Kotlin, and Android. It is an adaptation of an
earlier version created by Square and now maintained
by Google; https://dagger.dev.) Each ProducerModule
declares what inputs it requires, and what outputs it
produces. Dagger then coordinates the execution of
ProducerModules to process a request. Referring back
to figure 1, the following code snippet shows a Dagger
implementation of subcomponent A1:
@ProducerModule
public abstract class A1ProducerModule {
  @Produces
  @A1Output
  static A1Output runA1(@A2Output a2, @B1Output b1, @B2Output b2) {
    // ... Code reads information from a2, b1, and b2, and calculates A1Output ...
  }
}
The collection of ProducerModules creates a graph of
subcomponents that the framework executes to process
the request. For this example, the framework knows
which of A2, B1, and B2 was the last to block execution to
produce A1’s output. Since the framework is aware of the
subcomponent dependencies, it can record the critical path.
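
How a framework tracks this is not shown in the article; as a hypothetical illustration (not Dagger's actual internals), a framework could wrap each dependency's future and record completion times, so that the input finishing last is the one charged to the critical path. The NodeTracker name and its methods are invented for this sketch, which assumes dependencies are exposed as CompletableFutures.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

final class NodeTracker {
  private final Map<String, Long> completionNanos = new ConcurrentHashMap<>();

  // Wraps a dependency's future so its completion time is recorded.
  <T> CompletableFuture<T> track(String dependencyName, CompletableFuture<T> dependency) {
    return dependency.whenComplete(
        (value, error) -> completionNanos.put(dependencyName, System.nanoTime()));
  }

  // The input that completed last is the one this node actually waited on,
  // and therefore the one that belongs on the critical path.
  String lastBlockingDependency() {
    return completionNanos.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse("self");
  }
}

In this sketch, a node such as A1 would wrap a2, b1, and b2 with track() before waiting on them and ask for lastBlockingDependency() once its own output is ready.
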
For Google Search, subcomponent-level traces are
collected from several software frameworks in multiple
programming languages. Framework-level implementation
is essential for scalability, since it allows relatively small
teams of developers to provide detailed critical path
traces for code written by thousands of other people.
Since each framework instruments and reports the critical
path automatically, most developers are not aware that
critical path tracing is happening. A few services that are
not implemented using frameworks have implemented
service-specific CPT as well. Traces from these services
are coarser grained. The system degrades gracefully
when requests are made to services that do not provide
any traces. The size of the blind spot in the critical path
is reported correctly and includes which service is
responsible for that part of the critical path.
See the “Adoption” section later for metrics showing the
overall accuracy of critical path traces for Google Search.

Propagation and merging


Propagating child-node paths to parent nodes allows more
detailed views of the request critical path.
In figure 2b, the initial critical path is {A1=5 ms, B1=20
ms, A1=8 ms}. The B1 child node takes up the majority of
execution time; however, B1’s internal execution details are
lacking. To add this detail, B1 first computes its own critical
path (e.g., {B1=4 ms, B2=12 ms}) and returns that critical
path to A1.
A1 then merges the critical path from B1 into the overall
critical path. Note that the total critical path returned
from B1 is only 16 ms, but A1 observed 20 ms of execution
time in B1. This is normal and could occur as a result of
network delays, queuing delays, or simply execution
paths that are not instrumented to return critical path
information. To account for the discrepancy, A1 adds a 4-ms
“unaccounted” node to the critical path. Similarly, each
subcomponent adds a “self” node to the critical path for
denoting the latency incurred internally and not caused by
other child subcomponents. The notation S1/S2 denotes the latency incurred by S2 as the child subcomponent of S1.
The merged critical path for the request is {A1=5 ms,
B1/self=4 ms, B1/B2=12 ms, B1/unaccounted=4 ms,
A1=8 ms}.
This process happens recursively. If B2 returns a more
detailed path, the B2 entries in the overall path also
become more detailed.
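
A minimal sketch of this expansion step, using the numbers from the example above; the PathElement and PathMerger names are illustrative rather than the representation Google uses.

import java.util.ArrayList;
import java.util.List;

// One step on a critical path: a subcomponent name and the time it blocked.
record PathElement(String name, long millis) {}

final class PathMerger {
  // Replaces the caller's single entry for a child (e.g., B1 observed at
  // 20 ms) with the child's own critical path, prefixing names and adding an
  // "unaccounted" element for time the child did not explain (network,
  // queuing, uninstrumented code).
  static List<PathElement> expandChild(
      String childName, long observedMillis, List<PathElement> childPath) {
    List<PathElement> expanded = new ArrayList<>();
    long reportedMillis = 0;
    for (PathElement element : childPath) {
      expanded.add(new PathElement(childName + "/" + element.name(), element.millis()));
      reportedMillis += element.millis();
    }
    if (reportedMillis < observedMillis) {
      expanded.add(new PathElement(childName + "/unaccounted", observedMillis - reportedMillis));
    }
    return expanded;
  }
}

Calling expandChild("B1", 20, List.of(new PathElement("self", 4), new PathElement("B2", 12))) produces the {B1/self=4 ms, B1/B2=12 ms, B1/unaccounted=4 ms} portion of the merged path.
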
Propagation and merging occur both within a single
service and across RPC service boundaries. The RPC
transports support RPC request and response metadata
channels for conveying horizontal concerns, such as the
trace identifiers in Dapper.17 Services participating in the
critical path protocol use standard fields in the response
metadata to propagate their critical path to callers.
Framework-level code then merges the critical path from
RPCs into the per-service critical path.
Propagation requires error correction. Measurement
errors are inevitable in large-scale distributed systems:
Networks take time, programs stall, bugs exist or will
be added over time, and sometimes you cannot be
sure exactly what happened. The goal is to keep the
measurement error small enough that you can still draw
good conclusions from the latency profiles.

Overcounting is the most obvious problem to detect.
It occurs when the sum of the critical path elements
exceeds the observed wall time for the request by more
than a specified margin of error. This typically indicates
a bug in the framework that generated the critical path
(e.g., because of reporting multiple operations as having
happened in serial when they actually occurred in parallel).
Google’s implementation handles overcounting by allowing
it to be merged into the overall critical path. This allows
discovering which services are overcounting so they can be
fixed. See “The case of the phantom blind spot” later in this
article for an example.
Undercounting is expected. The reported critical
path for an RPC will be less than the observed wall time
because of serialization costs, network transmission
time, and scheduling delays. An “unaccounted” child node
reflects the differences.
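
A rough sketch of how these two conditions can be told apart when a reported critical path is compared with the wall time the caller observed; the margin and names are illustrative, not Google's actual thresholds.

final class PathValidator {
  private static final long MARGIN_MILLIS = 5;

  enum Verdict { OK, OVERCOUNTING, UNDERCOUNTING }

  static Verdict check(long sumOfElementsMillis, long observedWallMillis) {
    if (sumOfElementsMillis > observedWallMillis + MARGIN_MILLIS) {
      // More time reported than actually elapsed: likely a framework bug,
      // such as parallel work reported as serial.
      return Verdict.OVERCOUNTING;
    }
    if (sumOfElementsMillis < observedWallMillis) {
      // Expected: serialization, network, and scheduling time become an
      // "unaccounted" node.
      return Verdict.UNDERCOUNTING;
    }
    return Verdict.OK;
  }
}
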
Blind spots are also a common problem, especially
when analyzing new code paths that have only recently
showed up on the critical path. Blind spots look like large
blocks of time on the critical path that are all attributed
to a single subcomponent. One common reason for a blind
spot would be making an RPC to a slow service that has not
implemented CPT.

Triggering
Depending on the framework implementation, CPT can
incur significant overhead. In practice, not every request
has to be traced, so sampling is used to amortize the
cost. (The tradeoffs between overhead and precision are
discussed later in the article.) Sampling requires cross-
service coordination to avoid creating unnecessary blind
spots in the traces.
Each service can make an independent sampling
decision, and then it conveys that decision on outbound
RPCs in request metadata. As with Facebook’s Mystery
Machine,5 when downstream services see that the caller
has opted in to CPT, the downstream services enable
tracing as well. Even if the caller has not opted in to
sampling, downstream services are free to track and log
their own critical paths. Callers that have not requested
sampling will ignore the resulting traces.
Google Search also implemented a consistent sampling
mechanism that greatly increases the CPT sampling
overlap among multiple services in a distributed fashion.
This system is used when the servers are locally deciding
if CPT should be enabled for a particular request, and it
minimizes blind spots when there are disconnected call
flows (such as CPT-capable servers behind CPT-incapable
ones). This mechanism uses the Dapper17 trace identifier
as the source of distributed agreement and turns this
identifier into multiple sampling buckets. Each service then
makes a sampling decision by checking if the requests fall
in the corresponding buckets for their sampling strategy.
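
The bucketing details are not spelled out in the article; the sketch below illustrates the idea under the assumption that every service hashes the shared trace identifier into the same bucket space and samples a common prefix of buckets, so that a service sampling 0.1 percent of requests traces a subset of what a service sampling 1 percent traces.

// Consistent sampling keyed on a shared trace identifier: services that see
// the same trace ID independently arrive at the same bucket, so their
// sampled requests overlap as much as their rates allow.
final class ConsistentSampler {
  private static final int NUM_BUCKETS = 1_000;

  static boolean shouldTrace(long traceId, double samplingRate) {
    int bucket = Math.floorMod(Long.hashCode(traceId), NUM_BUCKETS);
    // Sampling a common prefix of buckets makes lower-rate services trace a
    // subset of what higher-rate services trace.
    return bucket < Math.round(samplingRate * NUM_BUCKETS);
  }
}
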
System operators can opt in to tracing for specific
requests, instead of relying on random sampling. This
is useful for cases where a human operator needs to
collect traces for debugging purposes. When identifying
a particularly slow type of request, the operator collects
many samples for that request to get data for a profile. Since
these requests are not random samples, by default they are
excluded from the aggregated analysis, described later.

Limitations of CPT
The critical path as defined here is useful for addressing
culprit-finding problems but has some limitations.
In general, any latency optimization efforts should
focus on the subcomponents that are on the critical path.
Resource contention with off-critical path subcomponents,
however, can also slow down critical path execution.
Considering figure 2b again, imagine that A2 is holding a
mutex or is running intensive computation and causing CPU
starvation in critical path code. The critical path will show
part of the resource-starvation issue: Nodes waiting on
the blocked resource will appear on the critical path. The
culprit node (A2), however, will not appear on the critical
path. Profilers focused on CPU and lock contention are
well-suited for identifying these bottlenecks.
The critical path also lacks visibility into the drag and
slack12 of the overall execution. Drag and slack are measures
of the potential change in the critical path based on
optimizing a single step. A single subcomponent that blocks
all others has large drag: Improving that one subcomponent
is likely to improve the overall execution time.
When multiple subcomponents run in parallel, they have
large slack: Even if one slows down, it probably won’t impact
the overall time. Causal profiling6 uses an experiment-driven
approach to identify headroom by automatically injecting
latency into different subcomponents to determine their
drag and slack. Quartz3 aims to identify subcomponents with
high drag via CPU profiling. The Mystery Machine5 identifies
subcomponent slack by reconstructing system-dependency
graphs via log analysis.
Streaming is an important technique for improving
latency, and, unfortunately, CPT for streaming APIs is not
well-defined. Services can return a stream of results,
allowing work to begin on the early results before later
results are ready. Bidirectional streaming, where client
and server send multiple messages back and forth, is
a more complex programming model but is also useful
for latency in some situations. The critical path must be
carefully defined for these operations: Should segments
end when the first message is returned, or the last? What
if the caller of the API is also streaming? Google Search’s
current implementation of CPT deals with this challenge
by defining the critical path segment as ending when the
last message of the stream is received, but this definition is
misleading in situations where earlier messages are more
important for latency.
As is true of most data-collection systems, CPT
provides more precise information as more data is
collected and aggregated. Aggregating data can also hide
critical details. The “Aggregation and Visualization” section
discusses the tradeoffs that occur when critical path
traces are aggregated.
Even with these caveats, CPT helps focus effort on
areas where it is most likely to make a difference.

Adoption
Prior to adoption of CPT, latency analysis in Google Search
was primarily based on log analysis and RPC telemetry.
Adoption costs were low, since RPC monitoring was
enabled by default, but granularity of the measurements
was insufficient. Log analysis could correctly identify
small changes in latency; for example, regressions of a
few milliseconds could easily be detected via A/B testing.
Identifying which service had caused the regression was
challenging, since many services operated in parallel.
Identifying which subcomponent within a service had
caused the regression was virtually impossible, since
there was no systematic collection of subcomponent
contributions to end-to-end latency.
Broad deployment of CPT throughout Google Search
has resulted in much better visibility of the root causes of
latency.
Figure 3 shows the granularity of CPT in Google Search
by comparing the granularity of individual critical path
elements with overall request latency.
About 50 percent of request latency is attributable to
subcomponents with a latency of under 10 ms. This section
of the curve shows software written in frameworks that
support fine-grain tracing. When latency regressions
occur in these subcomponents, supplemental tools such
as CPU profilers are frequently unnecessary. CPT alone is
precise enough to flag the root cause of the regression.
The right side of the granularity curve represents
latency incurred in blind spots: CPT is too coarse-grained
to immediately flag root causes. In some cases, the
latency regression has occurred in an RPC service using
a framework that has not yet implemented fine-grain
CPT. CPT is sufficient to demonstrate that the root cause
is somewhere in the given RPC service, but it cannot
automatically identify the culprit subcomponent. In other
cases, CPT has correctly identified a slow subcomponent
on the critical path. CPU profiling of these subcomponents
typically finds opportunities to reduce latency.

FIGURE 3: Critical Path Granularity (percent of critical path versus max component latency, in milliseconds)

Operational costs
Operational overhead of CPT is low enough that profile
data collection is enabled by default in many applications,
including Google Search. For Search, traces are
continuously collected on 0.1 percent of requests. Traced
requests see a 1.7 percent increase in mean latency, adding
0.002 percent to overall mean request latency. Overhead
remains low at the tail, with 99th percentile latency
overhead for traced requests being 2.0 percent.
CPU overhead is more difficult to calculate.
Frameworks were rewritten to incorporate CPT at
runtime. Those changes typically incurred less than 0.1
percent of the overall CPU cost, but framework overhead
depends on the workload. The CPU overhead was deemed
acceptable when compared with the value of the latency
profiles.
Network overhead is significant, largely because of
an overly verbose wire format. For a critical path with
N elements and an average of M subcomponents per element, the wire format uses O(N × M) memory. A
graph representation would encode the same information
at lower cost. In practice, the network overhead is
mitigated both by sampling only 0.1 percent of requests
and via compression.

AGGREGATION AND VISUALIZATION


A single critical path trace is interesting but might not
present a realistic view of overall system performance.
Single requests can be outliers. Merging multiple
profiles creates a statistically significant view of system
performance.
Individual critical paths are logged along with other
metadata about the request. Standard log-analysis
techniques are used to select a random sample of
requests, meeting whatever criteria are of interest. For
example, critical path analysis for Google Search routinely
includes slicing by A/B experiment arms, filtering to
select requests where certain UI elements appeared on
the Search results page, and filtering to select requests
from certain time ranges and geographic regions. One
productive technique is simply to select only requests that
were extremely slow. Analyzing these requests frequently
reveals systemic issues that amount to significant
optimization opportunities.
Note that unlike distributed tracing systems,17 a single log
entry contains a critical path that spans multiple services.
This is important for correctness reasons: Back-end services
can record their own critical paths, but they cannot have
a priori knowledge of whether their work will be on the
critical path of their callers. CPT implements a critical-path-
merging mechanism where the callers incorporate critical
paths from their back ends into their own. Compared with
the alternative of joining information from the back ends’
logs, this is a much simpler and faster approach.

Aggregation algorithm
The techniques Google Search uses for aggregating
critical path information are similar to the mechanisms
generally used in CPU profiling. The process creates a
profile for visualization and analysis by merging critical
paths from sampled requests into a single “average critical
path,” similar to the Mystery Machine.5 Google Search uses
pprof8 for visualization, but CPT is flexible and can be used
in conjunction with other visualization tools.
Consider figure 1 as an example system where two
requests are received. Table 1 shows how these two
requests are merged to create an average critical path.
In Request 1, A1 calls A2 and B1 in parallel, and only A2
is on the critical path. In Request 2, A2 and B1 are again
called in parallel, but B1 is slower and on the critical path.
These are aggregated into a single profile by merging
subcomponents with identical paths and then dividing by
the number of samples to keep metrics relative to average
request latency.

TABLE 1: Critical Path Aggregation

Subcomponent    Request 1    Request 2    Aggregated
A1/self         6 ms         4 ms         5 ms
A1/A2           10 ms        -            5 ms
A1/B1           -            4 ms         2 ms
Total           16 ms        8 ms         12 ms

This aggregation can be thought of as the average


critical path of the overall system. Interpretation of
the average critical path needs to bear in mind that
subcomponents are not sequential, and the average might
not reflect any real critical path that the system could
have taken. If two subcomponents are called in parallel,
it is not possible for both to be on the critical path for a
single request. Both will still likely appear on the average
critical path.
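
A minimal sketch of the aggregation shown in Table 1, assuming each sampled request's critical path has already been flattened into a map from subcomponent path to milliseconds; the names are illustrative.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ProfileAggregator {
  // Sums time per subcomponent across sampled requests, then divides by the
  // number of samples so the profile reads as an average contribution per
  // request.
  static Map<String, Double> averageCriticalPath(List<Map<String, Long>> sampledPaths) {
    Map<String, Double> aggregate = new HashMap<>();
    for (Map<String, Long> criticalPath : sampledPaths) {
      criticalPath.forEach((subcomponent, millis) ->
          aggregate.merge(subcomponent, millis.doubleValue(), Double::sum));
    }
    aggregate.replaceAll((subcomponent, totalMillis) -> totalMillis / sampledPaths.size());
    return aggregate;
  }
}

For the two requests in Table 1, the result is {A1/self=5.0 ms, A1/A2=5.0 ms, A1/B1=2.0 ms}.
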
Subcomponent time on the average critical path
reflects both how often the subcomponent is on the
critical path and how long it takes when present. The
difference becomes important when considering system
subcomponents that are usually fast but occasionally slow.
A system subcomponent that always takes 5 ms on the
critical path will show up in an aggregate profile as taking
5 ms. A system subcomponent that is only on the critical
path 1 percent of the time but takes 500 ms when it does
appear will also show up in an aggregate profile as taking
5 ms.
As with other statistical analysis, looking at the
distribution of the data in the sample pool can be helpful.
In addition to pprof format profiles, aggregation tools
collect latency histograms for critical path segments. At
Google we consult this data early on when investigating
latency problems. Optimizing something that occurs rarely
but is very slow is a different problem from optimizing
something that occurs frequently. In cases where the
latency distribution for a particular subcomponent is
highly skewed, additional sampling is helpful, to focus on
cases where the subcomponent is slow. The “Case Studies”
section later in this article discusses several incidents
where focused analysis made root causes more obvious.

Precision of total critical path


How many samples are needed before we are confident
that a critical path change of a certain size is not a result
of random variation in request latency? The central limit theorem can be used to calculate confidence intervals for various sample sizes. Table 2 shows estimates of the width of the
95th percentile confidence intervals for Google Search
profiles.
In practice, profile visualization becomes quite slow
with millions of requests in a sample. We default to looking
at samples of 100,000 requests as a compromise between
precision and usability of profiling tools, and increase
sampling beyond 100,000 requests as needed.
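
The interval widths in Table 2 shrink by roughly a factor of the square root of 10 for each tenfold increase in sample size, which is what the central-limit estimate predicts. A rough sketch of that estimate, assuming the per-request critical path totals are available as an array:

final class PrecisionEstimate {
  // Half-width of an approximate 95 percent confidence interval on the mean
  // critical path length, using the normal approximation from the central
  // limit theorem (1.96 is the 97.5th percentile of the standard normal).
  static double ciHalfWidthMillis(double[] criticalPathMillis) {
    int n = criticalPathMillis.length;
    double mean = java.util.Arrays.stream(criticalPathMillis).average().orElse(0);
    double variance = java.util.Arrays.stream(criticalPathMillis)
        .map(x -> (x - mean) * (x - mean))
        .sum() / (n - 1);
    return 1.96 * Math.sqrt(variance / n);
  }
}
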

TABLE 2: Precision of Critical Path Time

Sample Size    95th Percentile CI Width
1000           ±19.9 ms
10,000         ±6.3 ms
100,000        ±2.0 ms
1,000,000      ±0.7 ms

Individual subcomponent precision

Aggregated profiles contain thousands of subcomponents. This creates a situation where even 95th percentile confidence intervals can yield a large number of false positives: cases where a subcomponent appears to have had a statistically significant change in latency, but where the difference is actually caused by random variation. False positives waste engineering time since investigating the latency change will not find a root cause.
One way of measuring the false positive rate is to create
two profiles using identical sampling criteria, and then
compare the resulting profiles using pprof ’s difference-
base view. The difference-base view highlights large
differences between profiles. When you know in advance
that the profiles should be the same, any differences
larger than a given threshold are false positives.
To quantify the latency-profiling false positive rate
for Google Search, we repeated the pairwise comparison
experiment 100 times at various sample sizes and with
various thresholds for considering a profile difference
important enough to investigate. Table 3 shows the 95th
percentile confidence intervals for the expected number of
false positives. As an example, if the profile comparison is
based on 1,000 requests per profile, you can be 95 percent
confident that the average number of components with
false positives larger than 5 ms is between 0.2 and 0.4.
In Google Search, it’s extremely rare for profiles to
show large differences in a single subcomponent because
of random variation. An additional finding from the analysis
is that not all subcomponents are equally likely to show
random differences. Components that have larger mean
times and larger standard deviations also have larger false
positive rates. (The central limit theorem predicts this
outcome. Even though subcomponents have non-normal
latency distributions, estimates of mean latency become
more precise as standard deviation shrinks and sample
sizes grow.)
This finding has important implications for latency
investigations in Search. Identifying the root causes of
latency differences in A/B experiments is straightforward,
because two profiles with identical sampling criteria
can be compared. Any subcomponent-level differences
above 1 ms were probably caused by the experiment, not
by random variation. Root-cause analysis is much more
difficult when comparing profiles taken on different
workloads. Under those circumstances, larger sample
sizes do not make the profile differences clearer. Any large
differences between the profiles are probably a result of
the different sampling criteria, not random variation.

TABLE 3: Components with False Positives

Sample Size    False Positives > 5 ms    False Positives > 1 ms    False Positives > 0.1 ms
1000           [0.2 - 0.4]               [3.5 - 4.2]               [56.8 - 61.2]
10,000         n/a                       [0.7 - 1.1]               [15.6 - 18.0]
100,000        n/a                       n/a                       [3.0 - 3.6]

Visualization
Once profiles have been aggregated, a variety of
visualizations can be used to help engineers understand
system latency. Top-down, bottom-up, call graphs, and
flame graphs11 are all helpful. Figure 4 shows an example
CPT visualization as a flame graph, where a latency
critical path is represented as columns of boxes, and each
box represents a subcomponent. (The color of each box
is insignificant and optimized for easy viewing.) Profile
comparison is essential for diagnosing regressions, since
a comparison shows only the subcomponents with large changes. This is particularly helpful for analyzing regressions in A/B experiments.

FIGURE 4: An example CPT visualization as a flame graph

CASE STUDIES
In the following examples CPT profiles are used to debug
latency issues.

The case of the nonblocking RPC


One frequent strategy for identifying optimization
opportunities is to look for outlier requests. Some high-
latency threshold will be chosen (e.g., slow enough to be
a very bad experience for end users). Latency profiles are
then taken for requests that exceeded that threshold.
Sometimes the resulting profiles simply prove that a series
of unusual events can cause high latency. In this case,
the outlier requests weren’t outliers: They were part of a
pattern that had gone undetected for years.
One latency analysis turned up a single common query
that had latency that was several hundred milliseconds
longer than normal. The latency profile showed a blocking
call to an API that was rarely used in Search. The profile
also identified the subcomponent that was making this call,
which was sufficient to show that the behavior was caused
by a particular type of Search feature that occurred in
less than 10 percent of requests. Additional profiles were
taken, this time including all requests that showed this RPC
on the critical path.
The first identified request was actually not as much
of an outlier as originally thought: Almost all requests
that triggered this Search feature were incurring a large
latency penalty. This RPC had triggered a transient alert 18
months earlier, but it was not severe enough to require a
thorough investigation.
Contacting the team that owned the relevant code
was helpful: The original intent of using this API was
for reporting purposes, and in fact this code was never
intended to block request processing. Refactoring the code
to make an actual nonblocking RPC resolved the issue and
resulted in significant latency reductions on the affected
slice of Search traffic.

Conclusion: Putting events in context is helpful. As
discussed earlier, RPC telemetry fell short due to a lack of
support for traffic slicing. CPT’s traffic-slicing capabilities
helped identify and resolve the problem.

The case of the code that didn’t run


When investigating average Search latency, a common
pattern appeared in one particular service. Thousands
of subcomponents were occasionally blocked on a child
identified on the critical path as a framework, indicating
the critical path delay occurred in the framework rather
than application code. Even though most of these blocked
subcomponents were fast, a few were slow; cumulatively,
they represented a significant opportunity to cut latency.
Initial investigations failed to identify the root cause.
CPU profiling the framework code showed no obvious
problems. Investigating the impacted subcomponents
that presented high delay in framework code also showed
no obvious patterns, other than that they seemed slow. A
few CPU optimizations were implemented, but none had
significant latency impact.
Eventually an engineer hit on the idea that the problem
was not the framework using too much CPU; the issue
was that the framework should be using more. A few
years earlier, someone had identified that per-request
lock contention was causing significant latency problems.
Dozens of CPUs might be working on the same request in
parallel and were competing to grab locks that protected
per-request state. Experiments tested different levels
of parallelism per request, and it turned out that limiting
a single request to using no more than two CPUs at the
same time was optimal. That configuration used less CPU
because of decreased lock contention, and it reduced
Search latency.
The new discovery was that the two-CPU request rule
was optimal only as a default. Certain subcomponents had
large amounts of CPU-bound work to perform and could
benefit from additional parallelism without increasing lock
contention. A new framework feature was added, allowing
subcomponents to request that additional CPUs be
assigned to their work. The optimization worked and is now
routinely applied when subcomponents show high amounts
of framework time.
Conclusion: When optimizing for latency, latency
profiling should be preferred. CPU profiling, RPC
telemetry, and distributed tracing were all unable to
detect this particular issue, whereas latency profiling
successfully discovered the unexpected latency pattern
and motivated further investigation.

The case of the phantom blind spot


While researching blind spots in CPT, a concerning pattern
appeared. A subcomponent labeled as deferredwork had
large blind spots and was contributing significantly to
critical path time. Researching the code showed that
the deferredwork subcomponent was under active
development. It was intended to generate information
that should be added to the bottom of the Search results
page. One common optimization in Search is to generate
information that should be displayed at the top of the
results page first and then stream that information to
the client. Was the deferredwork subcomponent actually
blocking the top of the page as well?
That explanation didn’t match the data: The
deferredwork subcomponent was routinely showing up as
taking more than a second on the critical path. Routinely
adding one second to Search latency would have been
causing alerts to fire. Comparing the total request latency
to the sum of the critical path latency showed the real
issue: The critical path time was much larger than the
actual request latency. The framework team maintaining
the service with the problem investigated, and it confirmed
that the deferredwork code was triggering a critical path
tracing bug.
Conclusion: Better framework support for streaming
behavior is an area for future development.

The case of the slow home page


Monitoring showed that the frequency of very slow home-
page loads had increased. Investigation showed that
the problem had begun a few days prior, and the initial
assessment was that the problem was significant but not
severe. The proximate cause had been a configuration
change rollout. Each rollout batches together hundreds
of independent changes and runs them through a canary
process to make sure they are safe. The canary had passed,
these changes had been live for days without causing
major issues, so no rollback was performed.
Concern grew, however, as the problem was investigated.
The issue was actually more serious than the monitoring
indicated, because the problem was not random. Most
users were not impacted at all, but users on certain types
of devices in one geographic area were experiencing high
latency. The urgency of the investigation increased.
None of the changes included in the configuration
rollout seemed likely to have caused the problem, so
the root cause was hard to identify. Normally a latency
regression in a geographic area is a result of networking
issues. A Dapper trace that captured the problem,
however, showed the opposite. RPCs were fast, but one
service involved was using far too much CPU serving these
requests. The Dapper trace was too coarse-grained to
capture which subcomponents were burning CPU. CPU
profiling wasn’t helpful either: Requests triggering the
problem were rare enough that they were lost in the noise.
At this point in debugging, the problem mysteriously
vanished with another configuration rollout. Investigation
continued, now using latency profiling to try to figure
out what had happened. Two latency profiles were taken,
one capturing normal home-page requests, and another
capturing extremely slow home-page requests.
Comparing the profiles found the culprit: A
customization system responsible for short-lived region-
specific home-page changes was blocking request
processing. Reviewing the configuration changes again
showed that the rollout that triggered the problem had
included a new home-page customization and that the
scheduled end of the home-page customization had
coincided with the end of the latency problems.
The team responsible for home-page customization
took a look at the issue and quickly identified the root
cause: A “simple” configuration had been expanded by an
automated system into an unexpectedly complex set of
configuration rules. Processing those rules had required
seconds of CPU time on every request where they
triggered.
Conclusion: Ability to focus on the impacted traffic
is key to latency profiling. Home-page requests from a
specific category of devices in a specific geographic region
were lost in the noise of CPU profiles. Dapper traces
could show the issue but were unable to focus on the
subcomponents causing the problem. The ability to gather
fine-grain latency profiles focusing on just the impacted
requests made the difference.

FUTURE WORK
The approach outlined here is practical and scales well
but has limitations. One area for future work is increasing
adoption of fine-grain tracing and reducing the size of
blind spots in profiles. This would likely require increasing
adoption of frameworks-based programming models.
Joining critical path tracing with CPU profiling could be
a powerful tool. This work will be quite challenging, since
sampling CPU profilers by necessity take a sample before
request processing has completed. The critical path is not
known until the request is complete, which might be long
after the sample has been collected. Finding a way to join
data from these systems would allow CPU profilers to
focus developer attention on code that is already known to
be on the critical path.
Adding frameworks-level support for CPT in streaming
protocols would be helpful. The approach described in this
article is defined only for request/response RPC protocols
that create a directed graph of calling nodes, and only
when all the critical work for those RPCs is done between
the arrival of the request and the transmission of the
response. Streaming protocols that produce incremental
progress, or conversational protocols with multiple
requests and responses in a session, are harder to analyze.
Avoiding blind spots in profiles where streaming protocols
are used would be helpful.
CPT focuses on server-side latency analysis. Adding
support for client-side components of latency (as was
done with Facebook’s Mystery Machine5) would improve
the ability to understand which contributors to end-to-end
latency have the most impact on the end-user experience.
The frameworks integration that CPT uses to identify
service subcomponents has profiling applications
outside of latency. We have built proof-of-concept
distributed cost profilers within Google Search that allow
downstream systems to trace expensive traffic to specific
subcomponents in callers. These cost profiles are detailed
and useful but have not been generalized for use outside of
Search.
Finally, adding metadata to trace collection seems
both possible and helpful. For example, frameworks could
collect slack information for each critical path node. In
cases where a node is blocked, frameworks could collect
information about the contended resource causing
the delay. Continued research in this area should yield
additional techniques to help profile for latency.

CONCLUSIONS
In large, real-world distributed systems, existing tools
such as RPC telemetry, CPU profiling, and distributed
tracing are valuable to understand the subcomponents
of the overall system, but are insufficient to perform end-
to-end latency analyses in practice. Issues such as highly
parallel execution flows, heterogeneous workloads, and
complex execution paths within subsystems make latency
analysis difficult. In addition, these systems and workloads
change frequently.
No one team or person has a detailed understanding of
the system as a whole.
CPT in such systems addresses these challenges to
provide actionable, precise latency analysis. Integration
at the frameworks level15 provides detailed views of
application code without requiring each developer to
implement their own tracing. Common wire protocols
allow consistent mechanisms for triggering, aggregation,
and visualization. Efficient implementation allows a high
sampling rate, giving precise profiles even for relatively
rare events. Scalable and accurate fine-grain tracing has
made CPT the standard approach for distributed latency
analysis for many Google applications, including Google
Search.

Acknowledgements
This article would not be possible without a number of
contributors from Google. Fabian Gura, David German,
and numerous others in Search built the instrumentation
and tools that implement this methodology in practice in
Google Search. Steven Procter designed, implemented,
and launched the first frameworks-based CPT algorithm,
demonstrating that collecting fine-grain traces at scale
was possible. Several frameworks teams collaborated
to scale the effort to support thousands of developers.
Michael Levin’s work guided the statistical precision
analysis.
We are grateful to Jennifer Klein, Łukasz Milewski, Ali
Sheikh, Paul Haahr, and Brian Stoler for providing critical
feedback about the readability and contents of the article.
Salim Virji coordinated the publishing process. Without his
support, the critical path for publishing this article would
have been a lot longer.

References
1. Amazon Web Services. Amazon CloudWatch: Observability of your AWS resources and applications on AWS and on-premises; https://aws.amazon.com/cloudwatch/.
2. Amazon Web Services. What is Amazon CodeGuru profiler?; https://docs.aws.amazon.com/codeguru/latest/profiler-ug/what-is-codeguru-profiler.html.
3. Anderson, T. E., Lazowska, E. D. 1990. Quartz: A tool for tuning parallel program performance. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 115–125; https://dl.acm.org/doi/10.1145/98457.98518.
4. Arapakis, I., Bai, X., Cambazoglu, B. B. 2014. Impact of response latency on user behavior in web search. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, 103–112; https://dl.acm.org/doi/10.1145/2600428.2609627.
5. Chow, M., Meisner, D., Flinn, J., Peek, D., Wenisch, T. F. 2014. The Mystery Machine: End-to-end performance analysis of large-scale Internet services. In Proceedings of the 11th Usenix Symposium on Operating Systems Design and Implementation, 217–231; https://dl.acm.org/doi/10.5555/2685048.2685066.
6. Curtsinger, C., Berger, E. D. 2015. Coz: Finding code that counts with causal profiling. In Proceedings of the 25th Symposium on Operating Systems Principles, 184–197; https://dl.acm.org/doi/10.1145/2815400.2815409.
7. Dean, J., Barroso, L. A. 2013. The tail at scale. Communications of the ACM 56(2), 74–80; https://dl.acm.org/doi/10.1145/2408776.2408794.
8. GitHub. pprof; https://github.com/google/pprof.
9. Google. Cloud monitoring; https://cloud.google.com/monitoring.
10. Google. Cloud Profiler; https://cloud.google.com/profiler.
11. Gregg, B. 2016. The flame graph. Communications of the ACM 59(6), 48–57; https://dl.acm.org/doi/10.1145/2909476.
12. Kelley, J. E. 1961. Critical path planning and scheduling: Mathematical basis. Operations Research 9(3), 296–435; https://www.jstor.org/stable/167563.
13. Microsoft. Profile production applications in Azure with application insights; https://docs.microsoft.com/en-us/azure/azure-monitor/app/profiler-overview.
14. Microsoft. Azure Monitor; https://azure.microsoft.com/en-au/services/monitor/.
15. Nokleberg, C., Hawkes, B. 2021. Best practice: Application frameworks. acmqueue 18(6), 52–77; https://queue.acm.org/detail.cfm?id=3447806.
16. Pandey, M., Lew, K., Arunachalam, N., Carretto, E., Haffner, D., Ushakov, A., Katz, S., Burrell, G., Vaithilingam, R., Smith, M. 2020. Building Netflix's distributed tracing infrastructure. The Netflix Tech Blog; https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304.
17. Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., Shanbhag, C. 2010. Dapper, a large-scale distributed systems tracing infrastructure. Google Technical Report; https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf.
18. Yang, C.-Q., Miller, B. 1988. Critical path analysis for the execution of parallel and distributed programs. In Proceedings of the 8th International Conference on Distributed Computing Systems. IEEE Computer Society Press, 366–375.

Brian Eaton joined Google’s Information Security team as a
software engineer in 2007, focusing on usable security and
authentication. Beginning in 2014, he joined Web Search
to lead engineering productivity and performance teams,
developing latency profiling and the first microservices
deployments in Web Search. He now works on Search Quality
and user trust, helping users understand the origins and
context of information they find on the Web. He holds a BS
degree from the University of California, Santa Cruz, and is a
semi-professional foosball player.

Jeff Stewart worked at Google for over 17 years. For more
than 10 years, he improved Web Search, leading teams to
improve compute efficiency and end user latency. He led
the initiative to enable https on search traffic by default.
Jeff helped launch Gmail. He moved to a new employer in
2021. He earned both his bachelor’s degree in Electrical and
Computer Engineering and his master’s degree in Information
Networking from Carnegie Mellon University in Pittsburgh,
Pennsylvania. He holds a black belt in American Kenpo Karate.
He lives near Boston.

Jon Tedesco has worked at Google for 9 years in web Search
infrastructure. During that time he has led several teams to
build latency and compute analysis tools for the Search stack,
and to develop new components of Search infrastructure
focusing on performance analysis and optimization. He holds
MS and BS degrees in computer science from the University of
Illinois at Urbana-Champaign.

N. Cihan Tas is a site reliability software engineer on the
Stadia team at Google, currently focusing on the optimization
of datacenter networks. Prior to Google, he worked at
Siemens Corporate Research in the area of wireless networks,
broadly ranging from vehicle-to-vehicle networks to smart-
grid communication networks. He holds a BS degree from
Bilkent University, and MS and PhD degrees from University of
Maryland, College Park, all in computer science. N. Cihan is an
ACM Member.
Copyright © 2022 held by owner/author. Publication rights licensed to ACM.
