Abstract

Consumers, regulators, and ISPs all use client-based "speed tests" to measure network performance, both in single-user settings and in aggregate. Two prevalent speed tests, Ookla's Speedtest and Measurement Lab's Network Diagnostic Test (NDT), are often used for similar purposes, despite having significant differences in both the test design and implementation and in the infrastructure used to conduct measurements. In this paper, we present a comparative evaluation of Ookla and NDT7 (the latest version of NDT), both in controlled and wide-area settings. Our goal is to characterize when, how much, and under what circumstances these two speed tests differ, as well as what factors contribute to the differences. To study the effects of the test design, we conduct a series of controlled, in-lab experiments under a variety of network conditions and usage modes (TCP congestion control, native vs. browser client). Our results show that Ookla and NDT7 report similar speeds when the latency between the client and server is low, but that the tools diverge when path latency is high. To characterize the behavior of these tools in wide-area deployment, we collect more than 40,000 pairs of Ookla and NDT7 measurements across six months and 67 households, with a range of ISPs and speed tiers. Our analysis demonstrates various systemic issues, including high variability in NDT7 test results and systematically under-performing servers in the Ookla network.

1 Introduction

Network throughput—colloquially referred to as "speed"—is among the most well-established and widely used network performance metrics. Indeed, "speed" is used as the basis for a wide range of purposes, from network troubleshooting and diagnosis, to policy advocacy [3, 10, 28, 40] (e.g., on issues related to digital equity), to regulation and litigation [9] (e.g., on issues related to ISP advertised speed). Given the extent to which stakeholders, from consumers to regulators to ISPs, all rely on "speed", it is in some sense surprising that there are so many ways to measure it. In fact, speed tests vary in design and implementation, and we lack a rigorous understanding of how the results from different speed tests compare, as well as why (and under what circumstances) they may produce different results.

Two prevailing speed tests are Ookla's Speedtest [35] ("Ookla") and Measurement Lab's Network Diagnostic Tool ("NDT") [21]. Both of these speed tests are used extensively—Ookla and NDT report a daily average of over 10 million [31] and 6 million tests [17], respectively. Ookla and M-Lab have collectively amassed billions of speed tests [7, 31] and have compiled data sets that have become universal resources for analyzing broadband Internet performance [3, 9, 28, 40]. Unfortunately, these datasets have also been used out of context, without a clear understanding of the caveats and limitations of these tools under different circumstances and environments [32]. The misuse of data that these tools produce stems from confusion about what each tool measures and a lack of a specific, quantitative characterization of why a tool reports the speed that it does.

Although these datasets have been used continually over the past decade, our understanding of these tools and datasets has perhaps never been more important. In the United States, the recently passed Infrastructure Investment and Jobs Act commits $43.5 billion to Internet infrastructure improvements [5], prioritizing areas of the country determined to be "unserved" (<25/3) or "underserved" (<100/20) by current broadband infrastructure. The National Telecommunications and Information Administration (NTIA) [27] maintains a National Broadband Availability Map to identify these high-priority areas using, among other sources, Ookla and NDT speed test data. In addition, state and local officials across the country are currently urging consumers to participate in speed test crowd-sourcing initiatives to help establish which areas meet the federal funding criteria.

To their credit, the organizations that have released these speed test tools have issued various guidance about how public data should (and, perhaps, should not) be used, but their
recommendations are qualitative, obscure, and vague. For example, many consumers aiming to test their own Internet or contribute to crowd-sourced campaigns use search engines such as Google, which drive users to run an NDT speed test, even though NDT documentation itself concedes that the tool may not accurately measure ISP speeds [22]. Furthermore, public mailing lists show that users continually struggle to explain why different tools report different speeds [24, 25], as the tools themselves are even advertised in ways that are counter to their stated objectives. This further confuses the use of these tools and derived datasets, including the key issue of how various datasets can be applied to specific questions of public interest. A more rigorous, quantitative, and specific understanding of when and under what circumstances these tools may produce different results can increase the likelihood that these datasets are ultimately correctly applied—and that their limitations are known.

This paper conducts the first systematic, comparative study of NDT7 (the latest version of NDT) and Ookla, which has two parts: (1) in-lab measurements that allow direct comparison of both tools under controlled network conditions where "ground truth" is known; (2) paired wide-area network tests, whereby the two tests are run back to back in a deployment across 67 homes for a six-month period. We use controlled measurement to characterize how NDT7 and Ookla behave under a wide range of network conditions—specifically, varying throughput, latency, and packet loss. We also study how different transport congestion control algorithms and client types (i.e., browser vs. native client) may affect the measurements that each tool reports. Second, we compare the behavior of these two tools using data from our wide-area network deployment (encompassing 7 different ISPs), ultimately collecting more than 40,000 tests from each tool. Unique to our study is the use of paired speed tests, in which we run Ookla and NDT7 back to back. To our knowledge, this is the first comparative analysis of Ookla and NDT7 in deployment over a significant number of networks for an extended period of time.

Table 1 summarizes our main findings. We observe significant differences in Ookla and NDT7's behavior and explain the causes of these differences. Our results and suggestions should help users understand why the reported speed from Ookla and NDT7 may differ, as well as guide policymakers towards more accurate and appropriate use of public data sets based on these tools.

- The NDT7 client under-reports throughput at higher latencies, in comparison to Ookla: the Ookla client reports speeds up to 12% higher than NDT7 at 200 ms round-trip latency, and up to 56% higher at 500 ms latency (§3.2).

- The NDT7 native client we initially tested in the lab incorrectly reported upload speeds that significantly exceeded link capacity. This bug was fixed by the NDT7 team after we filed an issue (§3.4).

- Across all households in the wide-area deployment, the median fraction of paired tests for which Ookla reports a speed that is 0-5% higher than NDT7 is 73.8%. The fraction of paired tests for which Ookla reports a speed that is 5-25% higher is 13.4% (§4.2).

- For Ookla, the choice of test server can significantly affect the reported speed. Tests using certain Ookla servers systematically report speeds 10% lower than other servers (§4.3).

- NDT7 tests are more likely to under-report during peak hours: 43.4% of households observed a statistically significant decrease in NDT7-reported download speed tests during peak hours, whereas only 18.9% of these households saw the same for Ookla (§4.4).

Table 1: Main results, and where they can be found in the paper.

2 Differences in Speed Test Design

Past research has developed many different approaches to measure throughput, available bandwidth, capacity, etc. Section 5 covers the related work, providing nuanced differences in these approaches. In recent years, two predominant tools, Ookla and NDT7, have emerged as "speed tests".

Background. Both Ookla and NDT7 rely on TCP for transport and attempt to measure the capacity of the path between the client and server by sending as much traffic as possible ("saturating" the link) and computing a throughput value according to the number of bytes transferred over some window of time. These techniques measure the bottleneck link along the end-to-end path. When Internet service providers offered lower speeds to customers, the bottleneck link was very often the ISP access link; over time, however, as access speeds increased, the bottleneck link has shifted to other places along the path, including the WiFi network, interconnection links between ISPs, the server infrastructure, and even the client device. Ookla and NDT7 differ in three ways that can fundamentally affect measured throughput: (1) the way the client attempts to saturate, or flood, the link; (2) the end-to-end path between the client and server; (3) the sampling and aggregation of the throughput metrics.

Flooding mechanism. The mechanisms that Ookla and NDT7 use to (attempt to) saturate the bottleneck link along the path between client and server are quite different. Ookla adapts both the number of open TCP connections and the test length in response to changes in the measured throughput over the course of the test. NDT7 opens only a single TCP
connection, and the test itself always runs for ten seconds. We will discuss this in greater detail in Section 3.2. The latest versions of both Ookla and NDT7 use TCP WebSockets.

End-to-end path. Ookla and NDT7 both measure the throughput of an end-to-end path between the client and the server. The tools differ in where these servers are hosted. Ookla and NDT7 manage and operate their server infrastructure in very different ways: any network can operate an Ookla server, yet servers are typically selected based on client proximity (as measured by latency) and removed from the set of options if they are empirically determined to under-report throughput. On the other hand, NDT7 servers are operated as a managed infrastructure, owned and operated by a single organization (Measurement Lab). Ookla servers are sometimes "on net" (within the same ISP as the client), although that is neither a requirement nor a guarantee; on the other hand, because NDT7 servers are operated in data centers, they are typically "off net". Servers that are off net result in end-to-end paths that may traverse multiple networks, including transit networks and interconnection points that may introduce bottlenecks. Past history, for example, has shown that transit providers such as Cogent have served as bottlenecks for NDT7 tests, and even that these transit providers have prioritized test traffic in times of congestion, thereby affecting the accuracy of the test [23]. Section 4.3 discusses some of these decisions in more detail.

Sampling and aggregation. The beginning of a TCP connection has a period called "slow start", whereby the client and server transfer data at a rate that is slower than the steady-state transfer rate. Packet loss can also cause a TCP sender to significantly reduce its sending rate. Such variability, particularly at the beginning of a transfer, introduces a design choice concerning how to sample and aggregate a set of samples indicative of instantaneous sending rates over shorter windows. Speed tests can also sample the throughput at either the sender or receiver. Both NDT7 and Ookla report throughput based on the amount of data transferred in the TCP payload. NDT7 reports the average throughput over the entire test (bytes transferred / test time). The sampling methodology for Ookla's latest version is not public. The legacy HTTP-based Ookla client sampled throughput 20 times and discarded the top 2 and the bottom 25% of samples, thus effectively disregarding the TCP slow-start period [34]. While the latest version does not use the same sampling method (manually verified by inspecting traces), we do observe some form of sampling whereby it discards low-throughput samples.

3 Controlled Measurements

In this section, we explore how Ookla and NDT7's respective design and implementation affect their accuracy in controlled network settings, where we know and control the "ground truth" network conditions. We explore in particular how these two tools report throughput under a variety of network conditions, under a range of latency and packet loss conditions, as well as how the choice of TCP congestion control algorithm affects reported speed.

3.1 Method and Setup

For all of our in-lab measurements, we simplify the end-to-end path and connect the client directly to the server via an Ethernet cable. We host both the NDT7 and Ookla server daemons on the same physical server. This setup allows us to control the network conditions at the client, server, and the path in between, thus ensuring that any observed differences between Ookla and NDT7 are a result of design and implementation differences between the tests, as opposed to an artifact of changing network conditions along the end-to-end path. We use the native speed test clients for both the in-lab measurements and the wide-area measurements in the next section because the native client provides metadata (e.g., socket-level information, test UUID) that the browser version does not provide. In practice, most users run speed tests through the browser version of these respective tools, despite the fact that both tools also offer native clients. To understand potential effects of browser-based measurements, we compare speed test measurements from both the native and browser clients (Section 3.3); we find no significant difference in the results reported from each modality. This finding is a positive and somewhat surprising result, as early browser-based speed tests did have difficulty achieving more than 200 Mbps [12]. Finally, because of a bug we discovered in NDT7 (discussed in more detail in Section 3.4), for all NDT7 tests we calculate the throughput of NDT7 upload tests using the TCPInfo information (based on the throughput calculated by the server at the kernel level), as opposed to the value reported to the user, which uses AppInfo (based on the sending rate of the client at the application level).

Hardware. The server and client are run on identical System76 Meerkat Model meer6 desktop computers (Intel 11th Gen i5 @ 2.4 GHz, 16 GB DDR memory, and up to 2.5 Gbps throughput) running Ubuntu 20.04.

Software. For a measurement server, we used ndt-server version 0.20.6 and Ookla daemon build 2021-11-30.2159. Because there is no publicly available Ookla daemon, we collaborated with the developers at Ookla to obtain a custom version. For native client tests, we used ndt-client-go version 0.5.0 and
Figure 1: Download accuracy vs. link capacity [Mbps]. Shaded region represents a 95% confidence interval for n = 10 tests. The reported method shows the speeds reported by the tool. The average method is the average data transfer during the test.

Figure 2: Download accuracy vs. packet loss [%]. Shaded region represents a 95% confidence interval for n = 10 tests. The reported method shows the speeds reported by the tool. The average method is the average data transfer rate during the test.
Ookla version 3.7.12.159. For browser-based tests, we use Google Chrome version 100.0.4896.75. We used the tc netem package to set link capacity, latency, and packet loss. Both the server and client use the TCP BBR congestion control algorithm unless otherwise stated.

To facilitate our analysis, we define the accuracy as the reported speed divided by the true link capacity (i.e., the value that we configure with tc). This metric enables us to compare results across a range of network conditions, including different link capacities.

3.2 Effects of Network Conditions

We first study how accurately NDT7 and Ookla measure the link capacity under different network conditions. We are interested in quantifying the effects of both the mechanisms used to flood the network path and the sampling technique used to calculate the final speed. To do so, we show both the average throughput (calculated using traffic headers) and the final reported throughput. The average throughput computation enables us to compare results from the different approaches, even if the tools report different numbers as a result of different sampling and aggregation techniques. Having isolated these effects, we can then separately quantify the effect of the sampling technique by comparing the final reported speed and the average throughput.

Link capacity. Figure 1 shows each tool's accuracy when measuring downstream throughput under different link capacities. The results show that there is no significant difference between Ookla and NDT7, in terms of both the reported and average data transfer. Both tools can achieve 95% link saturation up through a 2 Gbps connection using only a single TCP connection. Past work [4, 39] has suggested that a speed test using only a single TCP connection cannot typically achieve a throughput that approaches link capacity, instead maxing out at around 80% of the capacity. Our results likely differ because client operating systems (a bottleneck suggested in previous work from Bauer [4]) have improved and the design of speed test tools has changed. The results in previous studies of NDT [4, 39] concern the original NDT version, which differs from the current implementation of NDT7 in several important ways. Most notably, the original NDT server used TCP Reno, which often prevented the tests from saturating the link [18]. The current NDT7 server uses TCP BBR [20]. Absent high latency or packet loss, both tools achieve similar accuracy across a wide range of link capacities. Upload accuracy under different uplink capacities, shown in Figure B.1 in the Appendix, shows the same trends, with even less variability.

Packet loss. To study how packet loss between the server and client affects the accuracy of Ookla and NDT7, we fix the uplink and downlink capacity to 100 Mbps and introduce random loss along the path between the client and server. For both NDT7 and Ookla, packet loss up to 5% has very little impact. Figure 2 shows how download accuracy varies as packet loss is induced. The ranges of reported speeds and average throughputs are within 0.3% for Ookla and 2% for NDT7. We see little accuracy degradation because both servers use TCP BBR as their congestion control algorithm, which does not use packet loss as a congestion signal.

Latency. We now study the effects of latency on speed test accuracy for a 100 Mbps link capacity. Figure 3 shows how download accuracy is affected as round-trip time (RTT) increases. Looking first at reported speed, Ookla is unaffected by the increase in RTT until round-trip latency exceeds 400 ms. Meanwhile, the median speed reported by NDT7 decreases to 90% of the link capacity at only 100 ms and further degrades to 83% at 200 ms.
TCP connections when using TCP BBR but not for TCP CUBIC.

(Figure panel (a): Accuracy vs. Latency; y-axis: Accuracy [Measured / Capacity].)

In this section, we compare the accuracy of the NDT7 browser client and the NDT7 native (command-line) client. There is
Figure 5: NDT7 accuracy on different client types (browser vs. native) for upload tests.

at the application level, while the TCPInfo messages track the amount of data transferred at the kernel level. Before fixing the bug, the native ndt-go-client used the AppInfo messages to calculate speed. This choice only affects upload results, as this method counts all bytes sent to the TCP socket, as opposed to bytes that are successfully sent across the network. Because download speed is also calculated at the client, only successfully received bytes are counted, so the calculation method does not affect the reported download speed.

Figure 6 shows how NDT7 upload accuracy varies across different link capacities when calculated using the AppInfo and TCPInfo messages. The native NDT7 client reports a throughput greater than the link capacity at link capacities under 10 Mbps. At 0.5 Mbps, the native NDT7 client reports an average speed that is 134% of the link capacity. It would thus be inappropriate to use AppInfo to measure upstream throughput. This discrepancy is of special concern for measuring upstream throughput on residential networks, where it is common for the provisioned uplink capacity to be under 10 Mbps. In light of this finding, we also verified how the NDT7 browser calculated the final speed. To our surprise, we found that the browser calculates upload speed using the TCPInfo messages, not the AppInfo messages.

Figure 6: NDT7 upload throughput calculated using TCPInfo messages and AppInfo messages across link capacities [Mbps]. Note that the y-axis begins at 0.9. Shaded region represents a 95% confidence interval for n = 10 tests.

We communicated our findings to the NDT7 team at Measurement Lab, who have already implemented a bug fix for the NDT7 native client to use the TCPInfo messages instead of the AppInfo messages to compute the speed. Because the NDT7 native client sends data in units of size up to 1 MB, the number of transferred bytes at the sender and receiver can differ by up to 1 MB. This characteristic explains why tests at lower link capacities exhibited greater relative differences than tests at higher link capacities. Although this bug is now in the process of being fixed, it has implications for analysis of past data. Since many NDT7 native client tests were conducted when speed was calculated using the AppInfo messages, we suggest that researchers who analyze past NDT7 data recalculate the upload speeds using the TCPInfo statistics available in the Measurement Lab BigQuery database [17].

Takeaway: NDT7 upload tests conducted using the native client report speeds greater than the link capacity for link capacities under 10 Mbps. This overestimation is caused by the native client reporting sender-side statistics. This inaccuracy is especially important to consider in the context of residential networks, for which upload speeds (especially for historical data) may be less than 10 Mbps.

4 Wide-Area Measurements

We complement our in-lab analysis with a longitudinal study of Ookla and NDT7 on a set of diverse, real-world residential networks, which experience dynamic network conditions in real deployment settings for each tool. We present this part of the study and the results in this section.
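The AppInfo vs. TCPInfo discrepancy from Section 3.4 can be sketched numerically. The variable names below are illustrative only and do not match the actual NDT7 message schema:

```python
# Sketch of the AppInfo vs. TCPInfo upload discrepancy (Section 3.4).
# Names are illustrative, not the real NDT7/M-Lab schema. AppInfo counts
# bytes the client wrote to the socket; TCPInfo reflects what the kernel
# actually delivered, so the two can differ by up to one buffered 1 MB write.

def speed_mbps(num_bytes: int, elapsed_s: float) -> float:
    return num_bytes * 8 / elapsed_s / 1e6

# Hypothetical 10-second upload test on a 0.5 Mbps link:
bytes_delivered = 625_000                  # actually crossed the link
bytes_buffered = 500_000                   # still queued in the socket
bytes_written = bytes_delivered + bytes_buffered

app_info_speed = speed_mbps(bytes_written, 10)    # sender-side count
tcp_info_speed = speed_mbps(bytes_delivered, 10)  # kernel-level count
print(app_info_speed)  # 0.9 -- appears to "exceed" the 0.5 Mbps link
print(tcp_info_speed)  # 0.5 -- matches the link capacity
```

Because the buffered residue is bounded (up to 1 MB), its relative contribution shrinks as link capacity grows, which is consistent with the larger overestimates the paper observes at low capacities.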
4.1 Method and Setup

We deployed Raspberry Pis (RPis) in 67 households, spanning 7 ISPs (see Table 3), from November 2021 to April 2022. All of the households are located in a major city. A focused geography enables us to measure the same subset of Ookla or NDT7 server infrastructure from multiple vantage points, thus enabling discovery of server-side issues in Section 4.3. We use Raspberry Pi 4 Model B devices (quad-core Cortex-A72 CPU, 8 GB SDRAM, and up to 1 Gbps throughput). Each RPi conducts a series of active measurements, including at least daily Ookla and NDT7 speed tests at random times of day. We eliminate any WiFi effects by connecting the RPi directly to the network router via Ethernet. We verify that the speed test results are not limited by the RPi hardware by performing in-lab tests comparing the RPis against the NUCs. We see no significant performance difference between RPis and NUCs for download tests. For upload tests, we do observe that, for link capacities of 900 Mbps and 1 Gbps, NDT7 tests conducted on the RPi are on average 5% lower than those conducted on the NUC. Ookla reports 5% lower speeds at 1 Gbps on the RPi than on the NUC. Figure B.7 shows this comparison at various link capacities. These discrepancies should not affect our conclusions at the speeds we test because (1) most ISP offerings for residential networks are significantly lower than 1 Gbps (e.g., up to 35 Mbps for Comcast DOCSIS cable), and (2) our analysis compares Ookla and NDT7 speeds relative to each other, and both tools degrade similarly on the RPi except at speeds above 900 Mbps.

Pairing Ookla and NDT7 tests. To isolate the effects of Ookla and NDT7's server infrastructures, we must hold all other test conditions constant, meaning we must control for the network conditions present along the parts of the end-to-end path that Ookla and NDT7 tests have in common. Because network load can vary over time, comparing Ookla and NDT7 test results taken from different points in time would inaccurately compare the network in two different states. As such, we conduct paired speed tests, whereby we run Ookla and NDT7 tests back to back. This approach ensures that the network conditions along the shared portion of the end-to-end path are reasonably similar during both tests.

Normalizing speed. Since we do not have the ground truth for a household network's link capacity, we define the nominal speed for a given household to be the 95th percentile speed across all speed tests of a given tool from that household. We choose the 95th percentile in accordance with past work that studied Internet performance in residential networks [39]. Using the nominal speed, we compute the normalized speed for each test i as follows:

    Ŝ_i = S_i / S_95th        (1)

where S_i is the speed reported by test i and S_95th is the 95th percentile result across all speed tests from that particular household. In addition to using the nominal speed to normalize test results, we use it when assigning households to different speed tiers.

ISP                    # of Households
Comcast                42
AT&T                   16
RCN                    3
Everywhere Wireless    2
Webpass                2
Wow                    1
University Network     1
Total                  67

Table 3: Number of households studied for a given ISP.

Speed tier changes. Over the course of this study, some users upgraded their Internet service plan, resulting in either higher download speeds, higher upload speeds, or both. We identify speed tier changes by manually inspecting each household's speed test results over time. To accommodate these changes in our study, we treat measurements taken before and after the speed tier upgrade as two distinct households. In total, we observe two instances of download speed tier change and five instances of upload speed tier change. Therefore, we end up with a different number of households for download and upload tests: 67 and 72, respectively.

Characterizing test frequency. For download tests, there are a median of 354 paired tests across households, with a minimum of 50 and a maximum of 2,429. For upload tests, there are a median of 345 paired tests, with a minimum of 50 and a maximum of 2,238 for each household. The CDF of the number of tests is shown in Figure B.8. For each household, we run Ookla and NDT7 tests at least once daily for an average of 126 days, with a minimum of 26 days and a maximum of 182. Since many of the households we study have ISP-imposed data caps, we set a limit on the monthly data consumption our tests can use. Households with higher-speed connections consume more data per test and thus run fewer tests than households with lower-speed connections. Over the course of our deployment, high-speed (> 500 Mbps download) households have a median of 263 paired tests, while low-speed (< 500 Mbps download) households have a median of 791.

4.2 Comparing Paired Speed Tests

Conducting paired tests allows us to directly compare Ookla and NDT7 results because both tests were run under
in-lab tests, both tools report similar values, especially when

(Figure 7: Speed [Mbps] per test; series: Ookla, NDT7.)
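The per-household normalization from Equation (1) in Section 4.1 can be sketched as follows. The nearest-rank percentile used here is one common convention; the paper does not state which percentile method it applies:

```python
import math

# Sketch of the per-household normalization in Equation (1): the nominal
# speed is the 95th-percentile result across one tool's tests for a
# household, and each test's speed is divided by it. The nearest-rank
# percentile is an assumption, not the paper's stated convention.

def nominal_speed(speeds: list[float]) -> float:
    """Nearest-rank 95th percentile of a household's speed tests."""
    ordered = sorted(speeds)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

def normalized_speed(speed_i: float, household_speeds: list[float]) -> float:
    """Equation (1): S-hat_i = S_i / S_95th."""
    return speed_i / nominal_speed(household_speeds)

history = [float(x) for x in range(1, 101)]  # toy household test history
print(nominal_speed(history))                # 95.0
print(normalized_speed(47.5, history))       # 0.5
```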
(Figure 8 panels: y-axis, fraction of tests per household; x-axis, δrel bins [-1, -0.25], [-0.25, -0.05], [-0.05, 0], [0, 0.05], [0.05, 0.25], [0.25, 1]; regions labeled "NDT7 Higher" and "Ookla Higher".)

Figure 8: The distribution of each paired test class, partitioned by speed tier. Each point in a box plot is the fraction of tests that fall into that class for a given household. The boxes are first and third quartiles, the line inside the box is the median, and the whiskers are the 10th and 90th percentiles.
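The paired-test classes summarized in Table 4 amount to binning the relative difference of each paired test. A small classifier sketch (note that δrel = 0 falls on the boundary between the two "Low" bins in the table; we assign it to NDT7's side here):

```python
# Sketch of the Table 4 classification: bin the relative difference
# delta_rel of a paired test; positive values mean Ookla reported the
# higher speed, negative values mean NDT7 did. delta_rel == 0 sits on
# the boundary between the two "Low" bins; we assign it to NDT7's side.

def classify(delta_rel: float) -> tuple[str, str]:
    assert -1.0 <= delta_rel <= 1.0
    who = "Ookla higher" if delta_rel > 0 else "NDT7 higher"
    magnitude = abs(delta_rel)
    if magnitude <= 0.05:
        return who, "Low"
    if magnitude <= 0.25:
        return who, "Medium"
    return who, "High"

print(classify(0.03))   # ('Ookla higher', 'Low')
print(classify(0.10))   # ('Ookla higher', 'Medium')
print(classify(-0.30))  # ('NDT7 higher', 'High')
```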
Difference Bin fraction of paired tests within [0.25, 1] (Ookla reports a higher
Low 0 < δrel ≤ 0.05 speed) is only 0.3%, while the median fraction of paired tests
Ookla reports
Medium 0.05 < δrel ≤ 0.25 within [-1, -0.25] (NDT7 reports a higher speed) is 1.2%. It
a higher speed
High 0.25 < δrel ≤ 1.0 is thus more common for a given household to have Ookla
Low −0.05 ≤ δrel < 0
NDT7 reports report a speed that is 25% lower than NDT7 as opposed to vice
Medium −0.25 ≤ δrel < −0.05
a higher speed versa. This finding aligns with trends observed for the sample
High −1.0 ≤ δrel < −0.25
household shown in Figure 7, where Ookla, while typically
Table 4: Summary of paired test classes and how to interpret them. reporting slightly higher speeds, would occasionally report a
significantly lower speed.
We then compute the fraction of paired tests that fall into We further partition households into high and low speed
each category for each household. This approach allows us to tiers. A household is classified in the high downlink speed
characterize the frequency of paired tests with low, medium, tier if the nominal download speed (defined in Equation 1)
and high difference in reported speed. We analyze the distribution for each household instead of looking at the distribution across all households for two reasons: 1) each household is unique in that the network conditions present vary between households, and 2) the number of paired tests from each household is not uniform. Figure 8 shows the distribution for each paired test class, where each point is the fraction of tests that fall into that class for a given household.

Results. Figure 8(a) shows that, in the case of downstream throughput, for most paired tests, Ookla reports a slightly higher speed than NDT7. Across all households, the median fraction of paired tests for which 0 < δrel ≤ 0.05 is 73.8%. Differences of this magnitude are likely caused by differences in the test protocol: Ookla's sampling mechanism discards low-throughput samples, while NDT7's does not. The fraction of paired tests with medium and high differences is much lower. For medium differences, the median fraction of paired tests for which 0.05 < δrel ≤ 0.25 (Ookla reports a higher speed) is 8.3%, while the median fraction of paired tests for which −0.25 ≤ δrel < −0.05 (NDT7 reports a higher speed) is 3.4%. As for high differences, the median [...]

[...] is greater than 500 Mbps; otherwise it is classified in the low downlink speed tier. As for uplink speed tiers, a household is classified as high speed if the nominal upload speed is greater than 50 Mbps and low speed otherwise. As such, a household may be in the high downlink speed tier but the low uplink speed tier. We see that households in the high downlink speed tier (>500 Mbps) have a greater fraction of paired tests with medium and high differences than households in the low speed tier (≤500 Mbps). For example, there is a median of 13.4% of paired tests for which 0.05 < δrel ≤ 0.25 among high downlink speed tier households, while the corresponding fraction for low-speed downlink households is only 4.9%. The trends are similar for paired upload tests (see Figure 8(b)): the 75th percentile fraction of paired tests within [0.25, 1] among high-speed (>50 Mbps) households is 6.4%, compared to just 1.1% for low-speed (<50 Mbps) networks.

We believe these variations could result from two factors. First, as the available throughput in real-world networks is variable, more so at high speeds, the flooding and sampling mechanisms have an impact on the reported speed. Our in-lab experiments indicate that Ookla has clear advantages over NDT7 in this respect, given its use of multiple TCP connections, adaptive test length, and sampling strategy of discarding low-throughput samples. Second, for high speed tiers, it is possible that the bottleneck link for one or both of the tests is not the ISP access link but instead a link further upstream.
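The classification used in this analysis can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's analysis code: the exact definition of δrel is assumed here to normalize the Ookla-minus-NDT7 difference by the Ookla speed (so δrel > 0 means Ookla reports a higher speed, matching the sign convention above), and all function names and sample values are hypothetical. The low/medium/high thresholds and the 500/50 Mbps speed-tier cutoffs follow the text.

```python
def delta_rel(ookla_mbps, ndt7_mbps):
    """Relative difference of a paired test; > 0 means Ookla is higher.
    The normalization by the Ookla speed is an assumption."""
    return (ookla_mbps - ndt7_mbps) / ookla_mbps

def difference_class(d):
    """Bucket a paired test by |delta_rel|: low (<=5%), medium (<=25%), high."""
    if abs(d) <= 0.05:
        return "low"
    if abs(d) <= 0.25:
        return "medium"
    return "high"

def speed_tier(nominal_down_mbps, nominal_up_mbps):
    """Tier a household by its nominal plan speeds (500/50 Mbps cutoffs)."""
    down = "high" if nominal_down_mbps > 500 else "low"
    up = "high" if nominal_up_mbps > 50 else "low"
    return down, up

def household_fractions(paired_tests):
    """Fraction of one household's paired tests in each difference class."""
    classes = [difference_class(delta_rel(o, n)) for o, n in paired_tests]
    return {c: classes.count(c) / len(classes) for c in ("low", "medium", "high")}

# Toy data: three paired (Ookla, NDT7) download results for one household.
tests = [(940.0, 910.0), (820.0, 700.0), (950.0, 945.0)]
print(household_fractions(tests))
```

Computing these fractions per household, rather than pooling all tests, mirrors the reasoning above: it prevents households that ran many tests from dominating the distribution.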
Takeaway: The average difference in reported speed is either not significant or within 5% of the average test values in up to 81% and 88% of households for download and upload tests, respectively. For individual test pairs, we observe significant differences: the median fraction of paired tests for which NDT7 reports a speed that is 0–5% lower than Ookla is 73.8%, and a speed that is 5–25% lower is 13.4%.

4.3 Effect of Server Selection

Server infrastructure. We first characterize the set of servers observed over the course of our deployment. Across all households and tests, we observe 32 unique NDT7 servers. The top 15 most used NDT7 servers service 99.6% of all NDT7 tests. In light of this, we limit our analysis to these 15 servers. Looking next at each NDT7 server hostname, we find five different Tier-1 networks, with 3 of the 15 servers co-located with each Tier-1 network (Table 6).

Table 6: NDT servers ranked by the percentage of total tests.

Rank  AS Name              % of Total Tests
1     GTT                  20.40%
2     Tata Communications  20.04%
3     Cogent               19.99%
4     Level3               19.80%
5     Zayo Bandwidth       19.50%

As for Ookla, we observe far greater diversity in where servers are placed. We observe 48 unique Ookla servers. For the remaining analysis, we characterize the top 10 most commonly used Ookla servers, which account for 87.9% of all Ookla tests conducted by our deployment fleet (Table 5). The hostnames, which we further validate with an IP lookup, indicate that 8 of the 10 servers are co-located with consumer or enterprise ISP networks, while the remaining two are co-located with a cloud provider.

Table 5: Top 10 Ookla servers ranked by the number of tests that use that server. Percent of households indicates the fraction of households for which at least one test used the given server.

Rank  AS Name        AS Type        % of Tests  % Households
1     Nitel          ISP            21.8%       100.0%
2     Puregig        ISP            13.2%       97.1%
3     Whitesky       ISP            11.5%       98.5%
4     Comcast        ISP            8.6%        97.1%
5     Windstream     ISP            7.1%        97.1%
6     Frontier       ISP            6.7%        95.6%
7     Cable One      ISP            5.9%        95.6%
8     Rural Telecom  ISP            5.6%        76.5%
9     Hivelocity     Cloud Service  4.3%        88.2%
10    [...]

We next study how often households use each server. For NDT7 tests, we find each server is used roughly equally across tests. This is expected, as NDT7 uses M-Lab's Naming Server to find the server [19]. The default policy of the naming server returns a set of nearby servers based on the client's IP geolocation; the NDT7 client then randomly selects a server from this set. As for Ookla, some servers are used disproportionately more than others. This is in line with Ookla's server selection policy: the Ookla client pings a subset of nearby test servers (selected using the client's IP geolocation) and picks the server with the minimum ping latency [33].

Performance variation across servers. We now study how the reported speed varies with the choice of server. We group all test results by server and then compute the normalized speed for each test, as defined in Equation 1. Figure 9 shows the distribution of normalized download speeds across different servers for both Ookla and NDT7. For NDT7, we find that the distribution of normalized speed is similar across servers. [...]

[Figure 9: Distribution of normalized download speeds across servers, for (a) NDT7 (GTT, Zayo, Tata Comm, Cogent, Level3) and (b) Ookla. Note that the y-axis begins at 0.4.]
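The two selection policies described above can be contrasted in a short sketch. This is an illustrative model, not either tool's actual client code: the helper names and the `ping` callback are hypothetical. The only grounded behavior is that NDT7 picks randomly from a set of nearby servers returned by M-Lab's Naming Server [19], while Ookla pings nearby candidates and keeps the lowest-latency one [33], which explains the uniform vs. skewed server usage we observe.

```python
import random

def ndt7_select(nearby_servers, rng=random):
    """NDT7-style: uniform random choice from the servers returned by
    the naming service, so load spreads roughly evenly across servers."""
    return rng.choice(nearby_servers)

def ookla_select(nearby_servers, ping):
    """Ookla-style: probe each nearby candidate and keep the one with
    minimum measured latency, so a few well-connected servers dominate."""
    return min(nearby_servers, key=ping)

# Toy latencies (ms) standing in for real ping measurements.
latency = {"nitel": 4.0, "comcast": 9.0, "hivelocity": 6.5}
servers = list(latency)

print(ookla_select(servers, ping=latency.get))  # deterministically the lowest-latency server
print(ndt7_select(servers))                     # varies from run to run
```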
[Figure 10: Distribution of normalized upload speed across servers, for (a) NDT7 (Cogent, Tata Comm, Zayo, GTT, Level3) and (b) Ookla. Note that the y-axis begins at 0.4.]

[Figure: CDF of Δ median normalized speed (best - worst server), for download and upload tests.]

[...] heterogeneity in accuracy across Ookla servers is caused by either [...] path between the client and Ookla server.

Because the number of tests per household varies, certain households with many tests may be overrepresented in the overall trends. To determine whether these trends hold for individual households, for each household, we rank Ookla servers by the median normalized speed from tests that used that server. We remove servers for which fewer than 10 tests used that server. Figure 11 shows the proportion of households for which a given server is among the three lowest-ranked servers for that household. Cable One's server is a bottom-three server for 92% and 77% of households, for download and upload tests, respectively. Similarly, Comcast's server is a bottom-three server for 58% and 73% of households, for download and upload tests, respectively.

[Figure 11: Fraction of households for which a given Ookla server ranks among the bottom three by median normalized speed, for download and upload tests.]
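The per-household ranking described above can be sketched as follows. This is a simplified sketch, assuming normalized speeds have already been computed per test; the server names are reused from Table 5 purely as toy labels, and all speed values are invented.

```python
from statistics import median

def bottom_three_servers(tests, min_tests=10):
    """Rank servers for one household by median normalized speed and
    return the three lowest-ranked. `tests` maps a server name to a list
    of normalized speeds; servers with fewer than `min_tests` results
    are dropped, mirroring the filtering described in the text."""
    eligible = {s: v for s, v in tests.items() if len(v) >= min_tests}
    ranked = sorted(eligible, key=lambda s: median(eligible[s]))
    return ranked[:3]

# Toy data: normalized download speeds per server for one household.
household = {
    "cableone": [0.62] * 12,
    "comcast":  [0.71] * 15,
    "nitel":    [0.93] * 20,
    "puregig":  [0.90] * 11,
    "whitesky": [0.88] * 4,   # fewer than 10 tests: excluded from ranking
}
print(bottom_three_servers(household))  # → ['cableone', 'comcast', 'puregig']
```

Aggregating the bottom-three membership across households, as Figure 11 does, then shows whether a server is consistently slow rather than slow only for a few heavily tested households.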
Table 7: Results of the t-test, where H0 is that the average reported speeds for tests conducted in peak and off-peak hours are equal.

Direction          Tool   # Rejecting H0  % Rejecting H0
Download (n = 53)  NDT7   23              43.4%
                   Ookla  10              18.9%
Upload (n = 55)    NDT7   7               12.7%
                   Ookla  5               9.1%

[...] increase in network congestion [39]. If true, the time of day at which a speed test is run may affect the reported speed. In this section, we analyze our deployment data to see whether this effect is similar for Ookla and NDT7. In accordance with previous work [39], we define peak hours as 7–11 p.m.

For this analysis, we use a two-sample t-test, or independent t-test, because we are comparing two independent populations. We define our null hypothesis H0 to be that, for a given household, the average reported speeds for tests conducted in peak and off-peak hours are equal. To satisfy the normality condition, we only test households for which we have at least 30 tests from both peak and off-peak hours. Enforcing this condition leaves 59 households with sufficient download tests and 61 households with sufficient upload tests. Recall that we treat households that undergo speed tier upgrades as two separate households, which explains why the number of households differs between download and upload tests. As in Section 4.2.1, we use a significance level (α) of 0.01.
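The hypothesis test described above can be sketched with the standard library alone. This is an illustrative sketch, not the paper's analysis code: the text does not state whether the pooled or unequal-variance variant was used, so this sketch uses Welch's statistic, and because each group is required to have at least 30 samples it approximates the two-sided p-value with the normal distribution rather than the exact t distribution. All data below are invented.

```python
from math import erfc, sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (unequal variances)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(va / na + vb / nb)

def reject_h0(peak, off_peak, alpha=0.01):
    """Two-sided test of H0: equal mean reported speed in peak vs.
    off-peak hours. With >= 30 samples per group (as the text requires),
    the statistic is approximated by a standard normal."""
    t = welch_t(peak, off_peak)
    p_value = erfc(abs(t) / sqrt(2))  # two-sided normal tail probability
    return p_value < alpha

# Toy speeds (Mbps): a clear peak-hour slowdown is rejected at alpha=0.01...
peak = [410 + (i % 7) for i in range(35)]
off_peak = [480 + (i % 7) for i in range(35)]
print(reject_h0(peak, off_peak))   # True: the means differ by ~70 Mbps
# ...while two nearly identical groups are not.
print(reject_h0(off_peak, [481 + (i % 7) for i in range(35)]))
```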
Table 7 summarizes the results. For download tests, we find sufficient evidence to reject H0 for NDT7 in 43.4% of households, but in only 18.9% of households for Ookla. This difference suggests that the time of day at which the speed test is conducted has a greater effect for NDT7 than for Ookla. Furthermore, the set of households for which we can reject H0 for NDT7 is a strict superset of the set of households for which we can reject H0 for Ookla. Given this fact and the fact that we run paired tests, we can attribute the discrepancy in the number of networks to the specific tool, not to the access network (which we control for). The different behavior of Ookla and NDT7 can be due to one or both of the following: (1) the client-server paths for NDT7 have more cross-traffic during peak hours than those for Ookla; (2) the test protocol, especially Ookla's use of multiple threads and its sampling heuristic, makes Ookla more resilient to cross-traffic than NDT7. For upload tests, we can only reject H0 for 12.7% and 9.1% of households for NDT7 and Ookla, respectively; this decrease may be the result of asymmetry in traffic volume.

5 Related Work

Speed test design. There are two primary ways to measure throughput: (1) packet probing and (2) flooding. Most packet-probing techniques send a series of packets and infer metrics such as available bandwidth or link capacity from the packet inter-arrival delay [11, 14–16, 37]. More recently, Ahmed et al. [1] estimate bandwidth bottlenecks by probing the network using recursive in-band packet trains. However, these techniques can be inaccurate, especially on high-speed networks, due to their sensitivity to packet loss, queuing policy, and similar factors. As a result, most commercial speed tests, including those offered by both ISPs [2, 8] and non-ISP entities [21, 29, 35], are flooding-based tools that work by saturating the bottleneck link through active measurements.

Evaluating speed tests. Feamster and Livingood [12] discuss considerations in using flooding-based tools to measure speed. They do not, however, conduct empirical experiments to characterize NDT7 and Ookla performance. Similarly, Bauer et al. [4] explain how differences in speed test design and execution contribute to differences in test results. Bauer et al.'s work differs from ours in several ways. First, both Ookla and NDT have seen major design changes in the 12 years since that study: both tools have updated their flooding and sampling mechanisms, and NDT's latest version (NDT7) uses TCP BBR instead of TCP Reno. Second, they analyze only public NDT data; they neither study Ookla and NDT in controlled lab settings nor conduct paired measurements in the wide area that allow direct comparison of Ookla and NDT, as we do. Complementary to our comparative analysis is work by Clark et al. [7] that provides recommendations on how to use aggregated NDT data, including accounting for self-selection bias and other end-user bottlenecks such as slow WiFi and outdated modems.

Residential broadband. Goga and Teixeira [13] evaluate the accuracy of various speed test tools in residential networks, yet tools have changed and speeds on residential networks have increased more than 20× since that study ten years ago. Sundaresan et al. [39] studied network access link performance in residential networks more than ten years ago. Whereas our work focuses on characterizing speed test tools, their work examined network performance differences across ISPs, looking at latency, packet loss, and jitter in addition to throughput. Canadi et al. [6] use publicly available Ookla data to analyze broadband performance in 35 metropolitan regions. Finally, the Federal Communications Commission (FCC) conducts the Measuring Broadband America (MBA) project [26], an ongoing study of fixed broadband performance in the United States. The FCC uses SamKnows
whiteboxes [38] to collect a suite of network QoS metrics, including throughput, latency, and packet loss. Because the MBA project maps broadband Internet performance across different ISPs, it uses a single speed test, a proprietary test developed by SamKnows, and does not consider Ookla or NDT7.

6 Conclusion

This paper provided an in-depth comparison of Ookla and NDT7, focusing on both test design and infrastructure. Our measurements, taken both under controlled network conditions and in the wide area using paired tests, present new insights into the differences in behavior of the two tools. In the future, we plan to extend our wide-area study to other geographies, especially rural areas, to understand tool behavior under more diverse real-world network conditions.
References

[1] Adnan Ahmed, Ricky Mok, and Zubair Shafiq. "Flowtrace: A framework for active bandwidth measurements using in-band packet trains". In: Passive and Active Network Measurement. Springer. 2020.
[2] AT&T. AT&T Internet Speed Test. url: https://www.att.com/support/speedtest/.
[3] New York State Office of the Attorney General. New York Internet Health Test. url: https://ag.ny.gov/SpeedTest.
[4] Steven Bauer, David D. Clark, and William Lehr. "Understanding broadband speed measurements". In: TPRC. 2010.
[5] Josh Boak. Biden Administration to Release $45B for Nationwide Internet. May 2022. url: https://apnews.com/article/technology-broadband-internet-643949e239296f4f5154509d4c40de73.
[6] Igor Canadi, Paul Barford, and Joel Sommers. "Revisiting broadband performance". In: Proceedings of the 2012 Internet Measurement Conference. ACM, 2012.
[7] David D. Clark and Sara Wedeman. Measurement, Meaning and Purpose: Exploring the M-Lab NDT Dataset. SSRN Scholarly Paper. Rochester, NY, Aug. 2021.
[8] Comcast. Xfinity Speed Test. url: https://speedtest.xfinity.com/.
[9] Federal Trade Commission. FTC Takes Action Against Frontier for Lying about Internet Speeds and Ripping Off Customers Who Paid High-Speed Prices for Slow Service. May 2022. url: https://www.ftc.gov/news-events/news/press-releases/2022/05/ftc-takes-action-against-frontier-lying-about-internet-speeds-ripping-customers-who-paid-high-speed.
[10] Federal Communications Commission. Measuring Broadband America Program. url: https://www.fcc.gov/general/measuring-broadband-america.
[11] Constantinos Dovrolis, Parameswaran Ramanathan, and David Moore. "What do packet dispersion techniques measure?" In: Proceedings of IEEE INFOCOM. 2001.
[12] Nick Feamster and Jason Livingood. "Measuring internet speed: current challenges and future recommendations". In: Communications of the ACM (2020).
[13] Oana Goga and Renata Teixeira. "Speed measurements of residential internet access". In: Passive and Active Network Measurement. Springer. 2012.
[14] Ningning Hu and Peter Steenkiste. "Evaluation and characterization of available bandwidth probing techniques". In: IEEE Journal on Selected Areas in Communications (2003).
[15] Manish Jain and Constantinos Dovrolis. "Pathload: A measurement tool for end-to-end available bandwidth". In: Passive and Active Measurements (PAM) Workshop. 2002.
[16] Srinivasan Keshav. "A control-theoretic approach to flow control". In: Communications Architecture & Protocols. 1991.
[17] Measurement Lab. BigQuery QuickStart. url: https://www.measurementlab.net/data/docs/bq/quickstart/.
[18] Measurement Lab. Evolution of NDT. url: https://www.measurementlab.net/blog/evolution-of-ndt/.
[19] Measurement Lab. M-Lab Name Server: Server Selection Policy. url: https://github.com/m-lab/mlab-ns.
[20] Measurement Lab. NDT (Network Diagnostic Tool). url: https://www.measurementlab.net/tests/ndt/.
[21] Measurement Lab. Speed Test by Measurement Lab. url: https://speed.measurementlab.net/.
[22] Measurement Lab. Using M-Lab Data in Broadband Advocacy and Policy. url: https://www.measurementlab.net/blog/mlab-data-policy-advocacy/.
[23] Measurement Lab Mailing List. Discussion of Cogent's prioritization of Measurement Lab's traffic. url: https://groups.google.com/a/measurementlab.net/g/discuss/c/hqvIPYkIus0/m/iJp48rPXwAoJ.
[24] Measurement Lab Mailing List. Download speed discrepancy when running (NDT) speed test. url: https://groups.google.com/a/measurementlab.net/g/discuss/c/7gbIb8a44bg/m/LS6dXalbCQAJ.
[25] Measurement Lab Mailing List. Measurement Labs (NDT) Speedtest is incorrect? url: https://groups.google.com/a/measurementlab.net/g/discuss/c/vOTs3rcbp38.
[26] Measuring Broadband America. Apr. 2022. url: https://www.fcc.gov/general/measuring-broadband-america.
[27] National Broadband Availability Map. url: https://www.ntia.doc.gov/category/national-broadband-availability-map.
[28] Battle For the Net. Internet Health Test based on Measurement Lab NDT. url: https://www.battleforthenet.com/internethealthtest/.
[29] Netflix. fast.com. url: https://fast.com/.
[30] Ookla. (Outdated) Ookla Speed Test Methodology - Legacy HTTP Client. url: https://sandboxsupport.speedtest.net/hc/en-us/articles/202972350.
[31] Ookla. About Ookla Speedtests. url: https://www.speedtest.net/about.
[32] Ookla. Case Study: Ookla and NDT Data Comparison for Upstate New York. url: https://www.ookla.com/resources/whitepapers/make-better-funding-decisions-with-accurate-broadband-network-data.
[33] Ookla. How does Ookla select a server? url: https://help.speedtest.net/hc/en-us/articles/360038679834.
[34] Ookla. Ookla Speed Test Methodology. url: https://help.speedtest.net/hc/en-us/articles/360038679354.
[35] Ookla. Speedtest by Ookla - The Global Broadband Speed Test. url: https://www.speedtest.net/.
[36] Roxy Peck, Chris Olsen, and Jay L. Devore. "Chapter 11: Comparing Two Populations or Treatments". In: Introduction to Statistics and Data Analysis. 2009.
[37] Vinay Joseph Ribeiro, Rudolf H. Riedi, Richard G. Baraniuk, Jiri Navratil, and Les Cottrell. "pathChirp: Efficient available bandwidth estimation for network paths". In: Passive and Active Measurement Workshop. 2003.
[38] SamKnows. url: https://www.samknows.com/.
[39] Srikanth Sundaresan, Walter De Donato, Nick Feamster, Renata Teixeira, Sam Crawford, and Antonio Pescapè. "Broadband internet performance: a view from the gateway". In: ACM SIGCOMM Computer Communication Review (2011).
[40] Pennsylvania State University and Measurement Lab. Broadband Availability and Access in Rural Pennsylvania. url: https://www.rural.pa.gov/publications/broadband.cfm.
[41] Xinlei Yang, Xianlong Wang, Zhenhua Li, Yunhao Liu, Feng Qian, Liangyi Gong, Rui Miao, and Tianyin Xu. "Fast and Light Bandwidth Testing for Internet Users". In: NSDI. 2021.
A Statement of Ethics

[...]

[Figure B.1: Tool performance vs. link capacity for upload tests (Ookla and NDT7, reported vs. average method). Note that the y-axis begins at 0.9. Shaded region represents a 95% confidence interval for n = 10 tests.]

B.1.2 Latency

[Figure B.3: Number of TCP connections opened during a speed test vs. one-way latency between the server and the client.]

[Figure B.5: Tool performance vs. packet loss for upload tests. Shaded region represents a 95% confidence interval for n = 10 tests.]

[Figure B.6: Tool performance on different client types (browser vs. native) for download tests. (a) Client type vs. link capacity; (b) client type vs. packet loss; (c) client type vs. latency.]

[Figure B.7: Performance vs. link capacity for Raspberry Pis and NUCs. (a) Download; (b) Upload.]

[Figure B.8: CDF of the number of paired tests across all tested networks.]