Abstract

Consumers, regulators, and ISPs all use client-based "speed tests" to measure network performance, both in single-user settings and in aggregate. Two prevalent speed tests, Ookla's Speedtest and Measurement Lab's Network Diagnostic Test (NDT), are often used for similar purposes, despite having significant differences in both the test design and implementation and in the infrastructure used to conduct measurements. In this paper, we present a comparative evaluation of Ookla and NDT7 (the latest version of NDT), both in controlled and wide-area settings. Our goal is to characterize when, how much, and under what circumstances these two speed tests differ, as well as what factors contribute to the differences. To study the effects of the test design, we conduct a series of controlled, in-lab experiments under a variety of network conditions and usage modes (TCP congestion control, native vs. browser client). Our results show that Ookla and NDT7 report similar speeds when the latency between the client and server is low, but that the tools diverge when path latency is high. To characterize the behavior of these tools in wide-area deployment, we collect more than 40,000 pairs of Ookla and NDT7 measurements across six months and 67 households, with a range of ISPs and speed tiers. Our analysis demonstrates various systemic issues, including high variability in NDT7 test results and systematically under-performing servers in the Ookla network.

1 Introduction

Network throughput—colloquially referred to as "speed"—is among the most well-established and widely used network performance metrics. Indeed, "speed" is used as the basis for a wide range of purposes, from network troubleshooting and diagnosis, to policy advocacy [3, 10, 28, 40] (e.g., on issues related to digital equity), to regulation and litigation [9] (e.g., on issues related to ISP advertised speed). Given the extent to which stakeholders, from consumers to regulators to ISPs, all rely on "speed", it is in some sense surprising that there are so many ways to measure it. In fact, speed tests vary in design and implementation, and we lack a rigorous understanding of how the results from different speed tests compare, as well as why (and under what circumstances) they may produce different results.

Two prevailing speed tests are Ookla's Speedtest [35] ("Ookla") and Measurement Lab's Network Diagnostic Tool ("NDT") [21]. Both of these speed tests are used extensively—Ookla and NDT report a daily average of over 10 million [31] and 6 million tests [17], respectively. Ookla and M-Lab have collectively amassed billions of speed tests [7, 31] and have compiled data sets that have become universal resources for analyzing broadband Internet performance [3, 9, 28, 40]. Unfortunately, these datasets have also been used out of context, without a clear understanding of the caveats and limitations of these tools under different circumstances and environments [32]. The misuse of data that these tools produce stems from confusion about what each tool measures and a lack of a specific, quantitative characterization of why a tool reports the speed that it does.

Although these datasets have been used continually over the past decade, our understanding of these tools and datasets has perhaps never been more important. In the United States, the recently passed Infrastructure Investment and Jobs Act commits $43.5 billion to Internet infrastructure improvements [5], prioritizing areas of the country determined to be "unserved" (<25/3) or "underserved" (<100/20) by current broadband infrastructure. The National Telecommunications and Information Administration (NTIA) [27] maintains a National Broadband Availability Map to identify these high-priority areas using, among other sources, Ookla and NDT speed test data. In addition, state and local officials across the country are currently urging consumers to participate in speed test crowd-sourcing initiatives to help establish which areas meet the federal funding criteria.

To their credit, the organizations that have released these speed test tools have issued various guidance about how public data should (and, perhaps, should not) be used, but their
recommendations are qualitative, obscure, and vague. For example, many consumers aiming to test their own Internet or contribute to crowd-sourced campaigns use search engines such as Google, which drive users to run an NDT speed test, even though NDT documentation itself concedes that the tool may not accurately measure ISP speeds [22]. Furthermore, public mailing lists show that users continually struggle to explain why different tools report different speeds [24, 25], as the tools themselves are even advertised in ways that are counter to their stated objectives. This further confuses the use of these tools and derived datasets, including the key issue of how various datasets can be applied to specific questions of public interest. A more rigorous, quantitative, and specific understanding of when and under what circumstances these tools may produce different results can increase the likelihood that these datasets are ultimately correctly applied—and that their limitations are known.

This paper conducts the first systematic, comparative study of NDT7 (the latest version of NDT) and Ookla, which has two parts: (1) in-lab measurements that allow direct comparison of both tools under controlled network conditions where "ground truth" is known; (2) paired wide-area network tests, whereby the two tests are run back to back in a deployment across 67 homes for a six-month period. We use controlled measurement to characterize how NDT7 and Ookla behave under a wide range of network conditions—specifically, varying throughput, latency, and packet loss. We also study how different transport congestion control algorithms and client types (i.e., browser vs. native client) may affect the measurements that each tool reports. Second, we compare the behavior of these two tools using data from our wide-area network deployment (encompassing 7 different ISPs), ultimately collecting more than 40,000 tests from each tool. Unique to our study is the use of paired speed tests, in which we run Ookla and NDT7 back to back. To our knowledge, this is the first comparative analysis of Ookla and NDT7 in deployment over a significant number of networks for an extended period of time.

Table 1 summarizes our main findings. We observe significant differences in Ookla and NDT7's behavior and explain the causes of these differences. Our results and suggestions should help users understand why the reported speed from Ookla and NDT7 may differ, as well as guide policymakers towards more accurate and appropriate use of public data sets based on these tools.

- The NDT7 client under-reports throughput at higher latencies, in comparison to Ookla: the Ookla client reports speeds up to 12% higher than NDT7 at 200 ms round-trip latency, and up to 56% higher at 500 ms latency (§3.2).

- The NDT7 native client we initially tested in the lab incorrectly reported upload speeds that significantly exceeded link capacity. This bug was fixed by the NDT7 team after we filed an issue (§3.4).

- Across all households in the wide-area deployment, the median fraction of paired tests for which Ookla reports a speed that is 0-5% higher than NDT7 is 73.8%. The fraction of paired tests for which Ookla reports a speed that is 5-25% higher is 13.4% (§4.2).

- For Ookla, the choice of test server can significantly affect the reported speed. Tests using certain Ookla servers systematically report speeds 10% lower than other servers (§4.3).

- NDT7 tests are more likely to under-report during peak hours: 43.4% of households observed a statistically significant decrease in NDT7-reported download speed tests during peak hours, whereas only 18.9% of these households saw the same for Ookla (§4.4).

Table 1: Main results, and where they can be found in the paper.

2 Differences in Speed Test Design

Past research has developed many different approaches to measure throughput, available bandwidth, capacity, etc. Section 5 covers the related work, providing nuanced differences in these approaches. In recent years, two predominant tools, Ookla and NDT7, have emerged as "speed tests".

Background. Both Ookla and NDT7 rely on TCP for transport and attempt to measure the capacity of the path between the client and server by sending as much traffic as possible ("saturating" the link) and computing a throughput value according to the number of bytes transferred over some window of time. These techniques measure the bottleneck link along the end-to-end path. When Internet service providers offered lower speeds to customers, the bottleneck link was very often the ISP access link; over time, however, as access speeds increased, the bottleneck link has shifted to other places along the path, including the WiFi network, interconnection links between ISPs, the server infrastructure, and even the client device. Ookla and NDT7 differ in three ways that can fundamentally affect measured throughput: (1) the way the client attempts to saturate, or flood, the link; (2) the end-to-end path between the client and server; (3) the sampling and aggregation of the throughput metrics.

Flooding mechanism. The mechanisms that Ookla and NDT7 use to (attempt to) saturate the bottleneck link along the path between client and server are quite different. Ookla adapts both the number of open TCP connections and the test length in response to changes in the measured throughput over the course of the test. NDT7 opens only a single TCP
connection, and the test itself always runs for ten seconds. We will discuss this in greater detail in Section 3.2. The latest versions of both Ookla and NDT7 use TCP WebSockets.

End-to-end path. Ookla and NDT7 both measure the throughput of an end-to-end path between the client and the server. The tools differ in where these servers are hosted. Ookla and NDT7 manage and operate their server infrastructure in very different ways: any network can operate an Ookla server, yet servers are typically selected based on client proximity (as measured by latency) and removed from the set of options if they are empirically determined to under-report throughput. On the other hand, NDT7 servers are operated as a managed infrastructure, owned and operated by a single organization (Measurement Lab). Ookla servers are sometimes "on net" (within the same ISP as the client), although that is neither a requirement nor a guarantee; on the other hand, because NDT7 servers are operated in data centers, they are typically "off net". Servers that are off net result in end-to-end paths that may traverse multiple networks, including transit networks and interconnection points that may introduce bottlenecks. Past history, for example, has shown that transit providers such as Cogent have served as bottlenecks for NDT7 tests, and even that these transit providers have prioritized test traffic in times of congestion, thereby affecting the accuracy of the test [23]. Section 4.3 discusses some of these decisions in more detail.

Sampling and aggregation. The beginning of a TCP connection has a period called "slow start", whereby the client and server transfer data at a rate that is slower than the steady-state transfer rate. Packet loss can also cause a TCP sender to significantly reduce its sending rate. Such variability, particularly at the beginning of a transfer, introduces a design choice concerning how to sample and aggregate a set of samples indicative of instantaneous sending rates over shorter windows. Speed tests can also sample the throughput at either the sender or receiver. Both NDT7 and Ookla report throughput based on the amount of data transferred in the TCP payload. NDT7 reports the average throughput over the entire test (bytes transferred / test time). The sampling methodology for Ookla's latest version is not public. The legacy HTTP-based Ookla client sampled throughput 20 times and discarded the top 2 and the bottom 25% of samples, thus effectively disregarding the TCP slow-start period [34]. While the latest version does not use the same sampling method (manually verified by inspecting traces), we do observe some form of sampling whereby it discards low-throughput samples.

3 Controlled Measurements

In this section, we explore how Ookla and NDT7's respective design and implementation affect their accuracy in controlled network settings, where we know and control the "ground truth" network conditions. We explore in particular how these two tools report throughput under a variety of network conditions, under a range of latency and packet loss conditions, as well as how the choice of TCP congestion control algorithm affects reported speed.

3.1 Method and Setup

For all of our in-lab measurements, we simplify the end-to-end path and connect the client directly to the server via an Ethernet cable. We host both the NDT7 and Ookla server daemons on the same physical server. This setup allows us to control the network conditions at the client, server, and the path in between, thus ensuring that any observed differences between Ookla and NDT7 are a result of design and implementation differences between the tests, as opposed to an artifact of changing network conditions along the end-to-end path. We use the native speed test clients for both the in-lab measurements and the wide-area measurements in the next section because the native client provides metadata (e.g., socket-level information, test UUID) that the browser version does not provide. In practice, most users run speed tests through the browser version of these respective tools, despite the fact that both tools also offer native clients. To understand potential effects of browser-based measurements, we compare speed test measurements from both the native and browser clients (Section 3.3); we find no significant difference in the results reported from each modality. This finding is a positive and somewhat surprising result, as early browser-based speed tests did have difficulty achieving more than 200 Mbps [12]. Finally, because of a bug we discovered in NDT7 (discussed in more detail in Section 3.4), for all NDT7 tests we calculate the throughput of NDT7 upload tests using the TCPInfo information (based on the throughput calculated by the server at the kernel level), as opposed to the value reported to the user, which uses AppInfo (based on the sending rate of the client at the application level).

Hardware. The server and client are run on identical System76 Meerkat Model meer6 desktop computers (Intel 11th Gen i5 @ 2.4 GHz, 16 GB DDR memory, and up to 2.5 Gbps throughput) running Ubuntu 20.04.

Software. For a measurement server, we used ndt-server version 0.20.6 and Ookla daemon build 2021-11-30.2159. Because there is no publicly available Ookla daemon, we collaborated with the developers at Ookla to obtain a custom version. For native client tests, we used ndt-client-go version 0.5.0 and
Figure 1: Download accuracy vs. link capacity [Mbps]. Shaded region represents a 95% confidence interval for n = 10 tests. The reported method shows the speeds reported by the tool. The average method is the average data transfer during the test.

Figure 2: Download accuracy vs. packet loss [%]. Shaded region represents a 95% confidence interval for n = 10 tests. The reported method shows the speeds reported by the tool. The average method is the average data transfer rate during the test.
Ookla version 3.7.12.159. For browser-based tests, we use Google Chrome version 100.0.4896.75. We used the tc netem package to set link capacity, latency, and packet loss. Both the server and client use the TCP BBR congestion control algorithm unless otherwise stated.

To facilitate our analysis, we define the accuracy as the reported speed divided by the true link capacity (i.e., the value that we configure with tc). This metric enables us to compare results across a range of network conditions, including different link capacities.

3.2 Effects of Network Conditions

We first study how accurately NDT7 and Ookla measure the link capacity under different network conditions. We are interested in quantifying the effects of both the mechanisms used to flood the network path and the sampling technique used to calculate the final speed. To do so, we show both the average throughput (calculated using traffic headers) and the final reported throughput. The average throughput computation enables us to compare results from the different approaches, even if the tools report different numbers as a result of different sampling and aggregation techniques. Having isolated these effects, we can then separately quantify the effect of the sampling technique by comparing the final reported speed and the average throughput.

Link capacity. Figure 1 shows each tool's accuracy when measuring downstream throughput under different link capacities. The results show that there is no significant difference between Ookla and NDT7, in terms of both the reported and average data transfer. Both tools can achieve 95% link saturation up through a 2 Gbps connection using only a single TCP connection. Past work [4, 39] has suggested that a speed test using only a single TCP connection cannot typically achieve a throughput that approaches link capacity, instead maxing out at around 80% of the capacity. Our results likely differ because client operating systems (a bottleneck suggested in previous work from Bauer [4]) have improved and the design of speed test tools has changed. The results in previous studies of NDT [4, 39] concern the original NDT version, which differs from the current implementation of NDT7 in several important ways. Most notably, the original NDT server used TCP Reno, which often prevented the tests from saturating the link [18]. The current NDT7 server uses TCP BBR [20]. Absent high latency or packet loss, both tools achieve similar accuracy across a wide range of link capacities. Upload accuracy under different uplink capacities, shown in Figure B.1 in the Appendix, shows the same trends, with even less variability.

Packet loss. To study how packet loss between the server and client affects the accuracy of Ookla and NDT7, we fix the uplink and downlink capacity to 100 Mbps and introduce random loss along the path between the client and server. For both NDT7 and Ookla, packet loss up to 5% has very little impact. Figure 2 shows how download accuracy varies as packet loss is induced. The ranges of reported speeds and average throughputs are within 0.3% for Ookla and 2% for NDT7. We see little accuracy degradation because both servers use TCP BBR as their congestion control algorithm, which does not use packet loss as a congestion signal.

Latency. We now study the effects of latency on speed test accuracy for a 100 Mbps link capacity. Figure 3 shows how download accuracy is affected as round-trip time (RTT) increases. Looking first at reported speed, Ookla is unaffected by the increase in RTT until round-trip latency exceeds 400 ms. Meanwhile, the median speed reported by NDT7 decreases to 90% of the link capacity at only 100 ms and further degrades to 83% at 200 ms.
TCP connections when using TCP BBR but not for TCP CUBIC.

(Figure panel (a): Accuracy vs. Latency; y-axis: Accuracy [Measured / Capacity].)

In this section, we compare the accuracy of the NDT7 browser client and the NDT7 native (command-line) client. There is
Figure 5: NDT7 accuracy on different client types (browser vs. native) for upload tests.

at the application level, while the TCPInfo messages track the amount of data transferred at the kernel level. Before fixing the bug, the native ndt-go-client used the AppInfo messages to calculate speed. This choice only affects upload results, as this method counts all bytes sent to the TCP socket, as opposed to bytes that are successfully sent across the network. Because download speed is also calculated at the client, only successfully received bytes are counted, so the calculation method does not affect the reported download speed.

Figure 6 shows how NDT7 upload accuracy varies across different link capacities when calculated using the AppInfo and TCPInfo messages. The native NDT7 client reports a throughput greater than the link capacity at link capacities under 10 Mbps. At 0.5 Mbps, the native NDT7 client reports an average speed that is 134% of the link capacity. It would thus be inappropriate to use AppInfo to measure upstream throughput. This discrepancy is of special concern for measuring upstream throughput on residential networks, where it is common for the provisioned uplink capacity to be under 10 Mbps. In light of this finding, we also verified how the NDT7 browser calculated the final speed. To our surprise, we found that the browser calculates upload speed using the TCPInfo messages, not the AppInfo messages.

Figure 6: NDT7 upload throughput calculated using TCPInfo messages and AppInfo messages across link capacities [Mbps]. Note that the y-axis begins at 0.9. Shaded region represents a 95% confidence interval for n = 10 tests.

We communicated our findings to the NDT7 team at Measurement Lab, who have already implemented a bug fix for the NDT7 native client to use the TCPInfo messages instead of the AppInfo messages to compute the speed. Because the NDT7 native client sends data in units of size up to 1 MB, the number of transferred bytes at the sender and receiver can differ by up to 1 MB. This characteristic explains why tests at lower link capacities exhibited greater relative differences than tests at higher link capacities. Although this bug is now in the process of being fixed, it has implications for analysis of past data. Since many NDT7 native client tests were conducted when speed was calculated using the AppInfo messages, we suggest that researchers who analyze past NDT7 data recalculate the upload speeds using the TCPInfo statistics available in the Measurement Lab BigQuery database [17].

Takeaway: NDT7 upload tests conducted using the native client report speeds greater than the link capacity for link capacities under 10 Mbps. This overestimation is caused by the native client reporting sender-side statistics. This inaccuracy is especially important to consider in the context of residential networks, for which upload speeds (especially for historical data) may be less than 10 Mbps.

4 Wide-Area Measurements

We complement our in-lab analysis with a longitudinal study of Ookla and NDT7 on a set of diverse, real-world residential networks, which experience dynamic network conditions in real deployment settings for each tool. We present this part of the study and the results in this section.
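The AppInfo vs. TCPInfo discrepancy from Section 3.4 can be sketched numerically. The variable names below are illustrative only and do not match the actual NDT7 message schema:

```python
# Sketch of the AppInfo vs. TCPInfo upload discrepancy (Section 3.4).
# Names are illustrative, not the real NDT7/M-Lab schema. AppInfo counts
# bytes the client wrote to the socket; TCPInfo reflects what the kernel
# actually delivered, so the two can differ by up to one buffered 1 MB write.

def speed_mbps(num_bytes: int, elapsed_s: float) -> float:
    return num_bytes * 8 / elapsed_s / 1e6

# Hypothetical 10-second upload test on a 0.5 Mbps link:
bytes_delivered = 625_000                  # actually crossed the link
bytes_buffered = 500_000                   # still queued in the socket
bytes_written = bytes_delivered + bytes_buffered

app_info_speed = speed_mbps(bytes_written, 10)    # sender-side count
tcp_info_speed = speed_mbps(bytes_delivered, 10)  # kernel-level count
print(app_info_speed)  # 0.9 -- appears to "exceed" the 0.5 Mbps link
print(tcp_info_speed)  # 0.5 -- matches the link capacity
```

Because the buffered residue is bounded (up to 1 MB), its relative contribution shrinks as link capacity grows, which is consistent with the larger overestimates the paper observes at low capacities.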
4.1 Method and Setup

We deployed Raspberry Pis (RPis) in 67 households, spanning 7 ISPs (see Table 3), from November 2021 to April 2022. All of the households are located in a major city. A focused geography enables us to measure the same subset of Ookla or NDT7 server infrastructure from multiple vantage points, thus enabling discovery of server-side issues in Section 4.3. We use Raspberry Pi 4 Model B devices (quad-core Cortex-A72 CPU, 8 GB SDRAM, and up to 1 Gbps throughput). Each RPi conducts a series of active measurements, including at least daily Ookla and NDT7 speed tests at random times of day. We eliminate any WiFi effects by connecting the RPi directly to the network router via Ethernet. We verify that the speed test results are not limited by the RPi hardware by performing in-lab tests comparing the RPis against the NUCs. We see no significant performance difference between RPis and NUCs for download tests. For upload tests, we do observe that, for link capacities of 900 Mbps and 1 Gbps, NDT7 tests conducted on the RPi are on average 5% lower than those conducted on the NUC. Ookla reports 5% lower speeds at 1 Gbps on the RPi than on the NUC. Figure B.7 shows this comparison at various link capacities. These discrepancies should not affect our conclusions at the speeds we test because (1) most ISP offerings for residential networks are significantly lower than 1 Gbps (e.g., up to 35 Mbps for Comcast DOCSIS cable), and (2) our analysis compares Ookla and NDT7 speeds relative to each other, and both tools degrade similarly on the RPi except at speeds above 900 Mbps.

Pairing Ookla and NDT7 tests. To isolate the effects of Ookla and NDT7's server infrastructures, we must hold all other test conditions constant, meaning we must control for the network conditions present along the parts of the end-to-end path that Ookla and NDT7 tests have in common. Because network load can vary over time, comparing Ookla and NDT7 test results taken from different points in time would inaccurately compare the network in two different states. As such, we conduct paired speed tests, whereby we run Ookla and NDT7 tests back to back. This approach ensures that the network conditions along the shared portion of the end-to-end path are reasonably similar during both tests.

Normalizing speed. Since we do not have the ground truth for a household network's link capacity, we define the nominal speed for a given household to be the 95th percentile speed across all speed tests of a given tool from that household. We choose the 95th percentile in accordance with past work that studied Internet performance in residential networks [39]. Using the nominal speed, we compute the normalized speed for each test i as follows:

    Ŝ_i = S_i / S_95th        (1)

where S_i is the speed reported by test i and S_95th is the 95th percentile result across all speed tests from that particular household. In addition to using the nominal speed to normalize test results, we use it when assigning households to different speed tiers.

ISP                    # of Households
Comcast                42
AT&T                   16
RCN                    3
Everywhere Wireless    2
Webpass                2
Wow                    1
University Network     1
Total                  67

Table 3: Number of households studied for a given ISP.

Speed tier changes. Over the course of this study, some users upgraded their Internet service plan, resulting in either higher download speeds, higher upload speeds, or both. We identify speed tier changes by manually inspecting each household's speed test results over time. To accommodate these changes in our study, we treat measurements taken before and after the speed tier upgrade as two distinct households. In total, we observe two instances of download speed tier change and five instances of upload speed tier change. Therefore, we end up with a different number of households for download and upload tests: 67 and 72, respectively.

Characterizing test frequency. For download tests, there are a median of 354 paired tests across households, with a minimum of 50 and a maximum of 2,429. For upload tests, there are a median of 345 paired tests, with a minimum of 50 and a maximum of 2,238 for each household. The CDF of the number of tests is shown in Figure B.8. For each household, we run Ookla and NDT7 tests at least once daily for an average of 126 days, with a minimum of 26 days and a maximum of 182. Since many of the households we study have ISP-imposed data caps, we set a limit on the monthly data consumption our tests can use. Households with higher-speed connections consume more data per test and thus run fewer tests than households with lower-speed connections. Over the course of our deployment, high-speed (> 500 Mbps download) households have a median of 263 paired tests, while low-speed (< 500 Mbps download) households have a median of 791.

4.2 Comparing Paired Speed Tests

Conducting paired tests allows us to directly compare Ookla and NDT7 results because both tests were run under
in-lab tests, both tools report similar values, especially when

(Figure 7: Speed [Mbps] per test; series: Ookla, NDT7.)
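The per-household normalization from Equation (1) in Section 4.1 can be sketched as follows. The nearest-rank percentile used here is one common convention; the paper does not state which percentile method it applies:

```python
import math

# Sketch of the per-household normalization in Equation (1): the nominal
# speed is the 95th-percentile result across one tool's tests for a
# household, and each test's speed is divided by it. The nearest-rank
# percentile is an assumption, not the paper's stated convention.

def nominal_speed(speeds: list[float]) -> float:
    """Nearest-rank 95th percentile of a household's speed tests."""
    ordered = sorted(speeds)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

def normalized_speed(speed_i: float, household_speeds: list[float]) -> float:
    """Equation (1): S-hat_i = S_i / S_95th."""
    return speed_i / nominal_speed(household_speeds)

history = [float(x) for x in range(1, 101)]  # toy household test history
print(nominal_speed(history))                # 95.0
print(normalized_speed(47.5, history))       # 0.5
```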
(Figure 8 panels: y-axis, fraction of tests per household; x-axis, δrel bins [-1, -0.25], [-0.25, -0.05], [-0.05, 0], [0, 0.05], [0.05, 0.25], [0.25, 1]; regions labeled "NDT7 Higher" and "Ookla Higher".)

Figure 8: The distribution of each paired test class, partitioned by speed tier. Each point in a box plot is the fraction of tests that fall into that class for a given household. The boxes are first and third quartiles, the line inside the box is the median, and the whiskers are the 10th and 90th percentiles.
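The paired-test classes summarized in Table 4 amount to binning the relative difference of each paired test. A small classifier sketch (note that δrel = 0 falls on the boundary between the two "Low" bins in the table; we assign it to NDT7's side here):

```python
# Sketch of the Table 4 classification: bin the relative difference
# delta_rel of a paired test; positive values mean Ookla reported the
# higher speed, negative values mean NDT7 did. delta_rel == 0 sits on
# the boundary between the two "Low" bins; we assign it to NDT7's side.

def classify(delta_rel: float) -> tuple[str, str]:
    assert -1.0 <= delta_rel <= 1.0
    who = "Ookla higher" if delta_rel > 0 else "NDT7 higher"
    magnitude = abs(delta_rel)
    if magnitude <= 0.05:
        return who, "Low"
    if magnitude <= 0.25:
        return who, "Medium"
    return who, "High"

print(classify(0.03))   # ('Ookla higher', 'Low')
print(classify(0.10))   # ('Ookla higher', 'Medium')
print(classify(-0.30))  # ('NDT7 higher', 'High')
```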
Difference Bin fraction of paired tests within [0.25, 1] (Ookla reports a higher
Low 0 < δrel ≤ 0.05 speed) is only 0.3%, while the median fraction of paired tests
Ookla reports
Medium 0.05 < δrel ≤ 0.25 within [-1, -0.25] (NDT7 reports a higher speed) is 1.2%. It
a higher speed
High 0.25 < δrel ≤ 1.0 is thus more common for a given household to have Ookla
Low −0.05 ≤ δrel < 0
NDT7 reports report a speed that is 25% lower than NDT7 as opposed to vice
Medium −0.25 ≤ δrel < −0.05
a higher speed versa. This finding aligns with trends observed for the sample
High −1.0 ≤ δrel < −0.25
household shown in Figure 7, where Ookla, while typically
Table 4: Summary of paired test classes and how to interpret them. reporting slightly higher speeds, would occasionally report a
significantly lower speed.
We then compute the fraction of paired tests that fall into We further partition households into high and low speed
each category for each household. This approach allows us to tiers. A household is classified in the high downlink speed
characterize the frequency of paired tests with low, medium, tier if the nominal download speed (defined in Equation 1)
and high difference in reported speed. We analyze the distribution for each household instead of looking at the distribution across all households for two reasons: 1) each household is unique in that the network conditions present vary between households, and 2) the number of paired tests from each household is not uniform. Figure 8 shows the distribution for each paired test class, where each point is the fraction of tests that fall into that class for a given household.

Results. Figure 8(a) shows that, in the case of downstream throughput, for most paired tests, Ookla reports a slightly higher speed than NDT7. Across all households, the median fraction of paired tests for which 0 < δrel ≤ 0.05 is 73.8%. Differences of this magnitude are likely caused by differences in the test protocol: Ookla's sampling mechanism discards low-throughput samples, while NDT7's does not. The fraction of paired tests with medium and high differences is much lower. For medium differences, the median fraction of paired tests for which 0.05 < δrel ≤ 0.25 (Ookla reports a higher speed) is 8.3%, while the median fraction of paired tests for which −0.25 ≤ δrel < −0.05 (NDT7 reports a higher speed) is 3.4%. As for high differences, the median [...]

[...] is greater than 500 Mbps; otherwise it is classified in the low downlink speed tier. As for uplink speed tiers, a household is classified as high speed if the nominal upload speed is greater than 50 Mbps and low speed otherwise. As such, a household may be in the high downlink speed tier but the low uplink speed tier. We see that households in the high downlink speed tier (>500 Mbps) have a greater fraction of paired tests with medium and high differences than households in the low speed tier (≤500 Mbps). For example, there is a median of 13.4% of paired tests for which 0.05 < δrel ≤ 0.25 among high downlink speed tier households, while the corresponding fraction for low-speed downlink households is only 4.9%. The trends are similar for paired upload tests (see Figure 8(b)): the 75th percentile fraction of paired tests within [0.25, 1] among high-speed (>50 Mbps) households is 6.4%, compared to just 1.1% for low-speed (<50 Mbps) networks.

We believe these variations could result from two factors. First, as the available throughput in real-world networks is variable, more so at high speeds, the flooding and sampling mechanisms have an impact on the reported speed. Our in-lab experiments indicate that Ookla has clear advantages over NDT7 in this respect, given its use of multiple TCP connections, adaptive test length, and sampling strategy of discarding low-throughput samples. Second, for high speed tiers, it is possible that the bottleneck link for one or both of the tests is not the ISP access link but instead a link further upstream.
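The classification used in this analysis can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's analysis code: the exact definition of δrel is assumed here to normalize the Ookla-minus-NDT7 difference by the Ookla speed (so δrel > 0 means Ookla reports a higher speed, matching the sign convention above), and all function names and sample values are hypothetical. The low/medium/high thresholds and the 500/50 Mbps speed-tier cutoffs follow the text.

```python
def delta_rel(ookla_mbps, ndt7_mbps):
    """Relative difference of a paired test; > 0 means Ookla is higher.
    The normalization by the Ookla speed is an assumption."""
    return (ookla_mbps - ndt7_mbps) / ookla_mbps

def difference_class(d):
    """Bucket a paired test by |delta_rel|: low (<=5%), medium (<=25%), high."""
    if abs(d) <= 0.05:
        return "low"
    if abs(d) <= 0.25:
        return "medium"
    return "high"

def speed_tier(nominal_down_mbps, nominal_up_mbps):
    """Tier a household by its nominal plan speeds (500/50 Mbps cutoffs)."""
    down = "high" if nominal_down_mbps > 500 else "low"
    up = "high" if nominal_up_mbps > 50 else "low"
    return down, up

def household_fractions(paired_tests):
    """Fraction of one household's paired tests in each difference class."""
    classes = [difference_class(delta_rel(o, n)) for o, n in paired_tests]
    return {c: classes.count(c) / len(classes) for c in ("low", "medium", "high")}

# Toy data: three paired (Ookla, NDT7) download results for one household.
tests = [(940.0, 910.0), (820.0, 700.0), (950.0, 945.0)]
print(household_fractions(tests))
```

Computing these fractions per household, rather than pooling all tests, mirrors the reasoning above: it prevents households that ran many tests from dominating the distribution.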
Takeaway: The average difference in reported speed is either not significant or within 5% of the average test values in up to 81% and 88% of households for download and upload tests, respectively. For individual test pairs, we observe significant differences: the median fraction of paired tests for which NDT7 reports a speed that is 0–5% lower than Ookla is 73.8%, and a speed that is 5–25% lower is 13.4%.

4.3 Effect of Server Selection

Server infrastructure. We first characterize the set of servers observed over the course of our deployment. Across all households and tests, we observe 32 unique NDT7 servers. The top 15 most used NDT7 servers service 99.6% of all NDT7 tests. In light of this, we limit our analysis to these 15 servers. Looking next at each NDT7 server hostname, we find five different Tier-1 networks, with 3 of the 15 servers co-located with each Tier-1 network (Table 6).

Table 6: NDT servers ranked by the percentage of total tests.

Rank  AS Name              % of Total Tests
1     GTT                  20.40%
2     Tata Communications  20.04%
3     Cogent               19.99%
4     Level3               19.80%
5     Zayo Bandwidth       19.50%

As for Ookla, we observe far greater diversity in where servers are placed. We observe 48 unique Ookla servers. For the remaining analysis, we characterize the top 10 most commonly used Ookla servers, which account for 87.9% of all Ookla tests conducted by our deployment fleet (Table 5). The hostnames, which we further validate with an IP lookup, indicate that 8 of the 10 servers are co-located with consumer or enterprise ISP networks, while the remaining two are co-located with a cloud provider.

Table 5: Top 10 Ookla servers ranked by the number of tests that use that server. Percent of households indicates the fraction of households for which at least one test used the given server.

Rank  AS Name        AS Type        % of Tests  % Households
1     Nitel          ISP            21.8%       100.0%
2     Puregig        ISP            13.2%       97.1%
3     Whitesky       ISP            11.5%       98.5%
4     Comcast        ISP            8.6%        97.1%
5     Windstream     ISP            7.1%        97.1%
6     Frontier       ISP            6.7%        95.6%
7     Cable One      ISP            5.9%        95.6%
8     Rural Telecom  ISP            5.6%        76.5%
9     Hivelocity     Cloud Service  4.3%        88.2%
10    [...]

We next study how often households use each server. For NDT7 tests, we find each server is used roughly equally across tests. This is expected, as NDT7 uses M-Lab's Naming Server to find the server [19]. The default policy of the naming server returns a set of nearby servers based on the client's IP geolocation; the NDT7 client then randomly selects a server from this set. As for Ookla, some servers are used disproportionately more than others. This is in line with Ookla's server selection policy: the Ookla client pings a subset of nearby test servers (selected using the client's IP geolocation) and picks the server with the minimum ping latency [33].

Performance variation across servers. We now study how the reported speed varies with the choice of server. We group all test results by server and then compute the normalized speed for each test, as defined in Equation 1. Figure 9 shows the distribution of normalized download speeds across different servers for both Ookla and NDT7. For NDT7, we find that the distribution of normalized speed is similar across servers. [...]

[Figure 9: Distribution of normalized download speeds across servers, for (a) NDT7 (GTT, Zayo, Tata Comm, Cogent, Level3) and (b) Ookla. Note that the y-axis begins at 0.4.]
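The two selection policies described above can be contrasted in a short sketch. This is an illustrative model, not either tool's actual client code: the helper names and the `ping` callback are hypothetical. The only grounded behavior is that NDT7 picks randomly from a set of nearby servers returned by M-Lab's Naming Server [19], while Ookla pings nearby candidates and keeps the lowest-latency one [33], which explains the uniform vs. skewed server usage we observe.

```python
import random

def ndt7_select(nearby_servers, rng=random):
    """NDT7-style: uniform random choice from the servers returned by
    the naming service, so load spreads roughly evenly across servers."""
    return rng.choice(nearby_servers)

def ookla_select(nearby_servers, ping):
    """Ookla-style: probe each nearby candidate and keep the one with
    minimum measured latency, so a few well-connected servers dominate."""
    return min(nearby_servers, key=ping)

# Toy latencies (ms) standing in for real ping measurements.
latency = {"nitel": 4.0, "comcast": 9.0, "hivelocity": 6.5}
servers = list(latency)

print(ookla_select(servers, ping=latency.get))  # deterministically the lowest-latency server
print(ndt7_select(servers))                     # varies from run to run
```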
[Figure 10: Distribution of normalized upload speed across servers, for (a) NDT7 (Cogent, Tata Comm, Zayo, GTT, Level3) and (b) Ookla. Note that the y-axis begins at 0.4.]

[Figure: CDF of Δ median normalized speed (best - worst server), for download and upload tests.]

[...] heterogeneity in accuracy across Ookla servers is caused by either [...] path between the client and Ookla server.

Because the number of tests per household varies, certain households with many tests may be overrepresented in the overall trends. To determine whether these trends hold for individual households, for each household, we rank Ookla servers by the median normalized speed from tests that used that server. We remove servers for which fewer than 10 tests used that server. Figure 11 shows the proportion of households for which a given server is among the three lowest-ranked servers for that household. Cable One's server is a bottom-three server for 92% and 77% of households, for download and upload tests, respectively. Similarly, Comcast's server is a bottom-three server for 58% and 73% of households, for download and upload tests, respectively.

[Figure 11: Fraction of households for which a given Ookla server ranks among the bottom three by median normalized speed, for download and upload tests.]
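The per-household ranking described above can be sketched as follows. This is a simplified sketch, assuming normalized speeds have already been computed per test; the server names are reused from Table 5 purely as toy labels, and all speed values are invented.

```python
from statistics import median

def bottom_three_servers(tests, min_tests=10):
    """Rank servers for one household by median normalized speed and
    return the three lowest-ranked. `tests` maps a server name to a list
    of normalized speeds; servers with fewer than `min_tests` results
    are dropped, mirroring the filtering described in the text."""
    eligible = {s: v for s, v in tests.items() if len(v) >= min_tests}
    ranked = sorted(eligible, key=lambda s: median(eligible[s]))
    return ranked[:3]

# Toy data: normalized download speeds per server for one household.
household = {
    "cableone": [0.62] * 12,
    "comcast":  [0.71] * 15,
    "nitel":    [0.93] * 20,
    "puregig":  [0.90] * 11,
    "whitesky": [0.88] * 4,   # fewer than 10 tests: excluded from ranking
}
print(bottom_three_servers(household))  # → ['cableone', 'comcast', 'puregig']
```

Aggregating the bottom-three membership across households, as Figure 11 does, then shows whether a server is consistently slow rather than slow only for a few heavily tested households.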
Table 7: Results of the t-test, where H0 is that the average reported speeds for tests conducted in peak and off-peak hours are equal.

Direction          Tool   # Rejecting H0  % Rejecting H0
Download (n = 53)  NDT7   23              43.4%
                   Ookla  10              18.9%
Upload (n = 55)    NDT7   7               12.7%
                   Ookla  5               9.1%

[...] increase in network congestion [39]. If true, the time of day at which a speed test is run may affect the reported speed. In this section, we analyze our deployment data to see whether this effect is similar for Ookla and NDT7. In accordance with previous work [39], we define peak hours as 7–11 p.m.

For this analysis, we use a two-sample t-test, or independent t-test, because we are comparing two independent populations. We define our null hypothesis H0 to be that, for a given household, the average reported speeds for tests conducted in peak and off-peak hours are equal. To satisfy the normality condition, we only test households for which we have at least 30 tests from both peak and off-peak hours. Enforcing this condition leaves 59 households with sufficient download tests and 61 households with sufficient upload tests. Recall that we treat households that undergo speed tier upgrades as two separate households, which explains why the number of households differs between download and upload tests. As in Section 4.2.1, we use a significance level (α) of 0.01.
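The hypothesis test described above can be sketched with the standard library alone. This is an illustrative sketch, not the paper's analysis code: the text does not state whether the pooled or unequal-variance variant was used, so this sketch uses Welch's statistic, and because each group is required to have at least 30 samples it approximates the two-sided p-value with the normal distribution rather than the exact t distribution. All data below are invented.

```python
from math import erfc, sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (unequal variances)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(va / na + vb / nb)

def reject_h0(peak, off_peak, alpha=0.01):
    """Two-sided test of H0: equal mean reported speed in peak vs.
    off-peak hours. With >= 30 samples per group (as the text requires),
    the statistic is approximated by a standard normal."""
    t = welch_t(peak, off_peak)
    p_value = erfc(abs(t) / sqrt(2))  # two-sided normal tail probability
    return p_value < alpha

# Toy speeds (Mbps): a clear peak-hour slowdown is rejected at alpha=0.01...
peak = [410 + (i % 7) for i in range(35)]
off_peak = [480 + (i % 7) for i in range(35)]
print(reject_h0(peak, off_peak))   # True: the means differ by ~70 Mbps
# ...while two nearly identical groups are not.
print(reject_h0(off_peak, [481 + (i % 7) for i in range(35)]))
```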
Table 7 summarizes the results. For download tests, we find sufficient evidence to reject H0 for NDT7 in 43.4% of households, but in only 18.9% of households for Ookla. This difference suggests that the time of day at which the speed test is conducted has a greater effect for NDT7 than for Ookla. Furthermore, the set of households for which we can reject H0 for NDT7 is a strict superset of the set of households for which we can reject H0 for Ookla. Given this fact and the fact that we run paired tests, we can attribute the discrepancy in the number of networks to the specific tool, not to the access network (which we control for). The different behavior of Ookla and NDT7 can be due to one or both of the following: (1) the client-server paths for NDT7 have more cross-traffic during peak hours than those for Ookla; (2) the test protocol, especially Ookla's use of multiple threads and its sampling heuristic, makes Ookla more resilient to cross-traffic than NDT7. For upload tests, we can only reject H0 for 12.7% and 9.1% of households for NDT7 and Ookla, respectively; this decrease may be the result of asymmetry in traffic volume.

5 Related Work

Speed test design. There are two primary ways to measure throughput: (1) packet probing and (2) flooding. Most packet-probing techniques send a series of packets and infer metrics such as available bandwidth or link capacity from the packet inter-arrival delay [11, 14–16, 37]. More recently, Ahmed et al. [1] estimate bandwidth bottlenecks by probing the network using recursive in-band packet trains. However, these techniques can be inaccurate, especially on high-speed networks, due to their sensitivity to packet loss, queuing policy, and similar factors. As a result, most commercial speed tests, including those offered by both ISPs [2, 8] and non-ISP entities [21, 29, 35], are flooding-based tools that work by saturating the bottleneck link through active measurements.

Evaluating speed tests. Feamster and Livingood [12] discuss considerations in using flooding-based tools to measure speed. They do not, however, conduct empirical experiments to characterize NDT7 and Ookla performance. Similarly, Bauer et al. [4] explain how differences in speed test design and execution contribute to differences in test results. Bauer et al.'s work differs from ours in several ways. First, both Ookla and NDT have seen major design changes in the 12 years since that study: both tools have updated their flooding and sampling mechanisms, and NDT's latest version (NDT7) uses TCP BBR instead of TCP Reno. Second, they analyze only public NDT data; they neither study Ookla and NDT in controlled lab settings nor conduct paired measurements in the wide area that allow direct comparison of Ookla and NDT, as we do. Complementary to our comparative analysis is work by Clark et al. [7] that provides recommendations on how to use aggregated NDT data, including accounting for self-selection bias and other end-user bottlenecks such as slow WiFi and outdated modems.

Residential broadband. Goga and Teixeira [13] evaluate the accuracy of various speed test tools in residential networks, yet tools have changed and speeds on residential networks have increased more than 20× since that study ten years ago. Sundaresan et al. [39] studied network access link performance in residential networks more than ten years ago. Whereas our work focuses on characterizing speed test tools, their work examined network performance differences across ISPs, looking at latency, packet loss, and jitter in addition to throughput. Canadi et al. [6] use publicly available Ookla data to analyze broadband performance in 35 metropolitan regions. Finally, the Federal Communications Commission (FCC) conducts the Measuring Broadband America (MBA) project [26], an ongoing study of fixed broadband performance in the United States. The FCC uses SamKnows
whiteboxes [38] to collect a suite of network QoS metrics, including throughput, latency, and packet loss. Because the MBA project maps broadband Internet performance across different ISPs, it uses a single speed test, a proprietary test developed by SamKnows, and does not consider Ookla or NDT7.

6 Conclusion

This paper provided an in-depth comparison of Ookla and NDT7, focusing on both test design and infrastructure. Our measurements, taken both under controlled network conditions and in the wide area using paired tests, present new insights into the differences in behavior of the two tools. In the future, we plan to extend our wide-area study to other geographies, especially rural areas, to understand tool behavior under more diverse real-world network conditions.
References

[1] Adnan Ahmed, Ricky Mok, and Zubair Shafiq. "Flowtrace: A framework for active bandwidth measurements using in-band packet trains". In: Passive and Active Network Measurement. Springer. 2020.
[2] AT&T. AT&T Internet Speed Test. url: https://www.att.com/support/speedtest/.
[3] New York State Office of the Attorney General. New York Internet Health Test. url: https://ag.ny.gov/SpeedTest.
[4] Steven Bauer, David D. Clark, and William Lehr. "Understanding broadband speed measurements". In: TPRC. 2010.
[5] Josh Boak. Biden Administration to Release $45B for Nationwide Internet. May 2022. url: https://apnews.com/article/technology-broadband-internet-643949e239296f4f5154509d4c40de73.
[6] Igor Canadi, Paul Barford, and Joel Sommers. "Revisiting broadband performance". In: Proceedings of the 2012 Internet Measurement Conference. ACM, 2012.
[7] David D. Clark and Sara Wedeman. Measurement, Meaning and Purpose: Exploring the M-Lab NDT Dataset. SSRN Scholarly Paper. Rochester, NY, Aug. 2021.
[8] Comcast. Xfinity Speed Test. url: https://speedtest.xfinity.com/.
[9] Federal Trade Commission. FTC Takes Action Against Frontier for Lying about Internet Speeds and Ripping Off Customers Who Paid High-Speed Prices for Slow Service. May 2022. url: https://www.ftc.gov/news-events/news/press-releases/2022/05/ftc-takes-action-against-frontier-lying-about-internet-speeds-ripping-customers-who-paid-high-speed.
[10] Federal Communications Commission. Measuring Broadband America Program. url: https://www.fcc.gov/general/measuring-broadband-america.
[11] Constantinos Dovrolis, Parameswaran Ramanathan, and David Moore. "What do packet dispersion techniques measure?" In: Proceedings of IEEE INFOCOM. 2001.
[12] Nick Feamster and Jason Livingood. "Measuring internet speed: current challenges and future recommendations". In: Communications of the ACM (2020).
[13] Oana Goga and Renata Teixeira. "Speed measurements of residential internet access". In: Passive and Active Network Measurement. Springer. 2012.
[14] Ningning Hu and Peter Steenkiste. "Evaluation and characterization of available bandwidth probing techniques". In: IEEE Journal on Selected Areas in Communications (2003).
[15] Manish Jain and Constantinos Dovrolis. "Pathload: A measurement tool for end-to-end available bandwidth". In: Passive and Active Measurements (PAM) Workshop. 2002.
[16] Srinivasan Keshav. "A control-theoretic approach to flow control". In: Communications Architecture & Protocols. 1991.
[17] Measurement Lab. BigQuery QuickStart. url: https://www.measurementlab.net/data/docs/bq/quickstart/.
[18] Measurement Lab. Evolution of NDT. url: https://www.measurementlab.net/blog/evolution-of-ndt/.
[19] Measurement Lab. M-Lab Name Server: Server Selection Policy. url: https://github.com/m-lab/mlab-ns.
[20] Measurement Lab. NDT (Network Diagnostic Tool). url: https://www.measurementlab.net/tests/ndt/.
[21] Measurement Lab. Speed Test by Measurement Lab. url: https://speed.measurementlab.net/.
[22] Measurement Lab. Using M-Lab Data in Broadband Advocacy and Policy. url: https://www.measurementlab.net/blog/mlab-data-policy-advocacy/.
[23] Measurement Lab Mailing List. Discussion of Cogent's prioritization of Measurement Lab's traffic. url: https://groups.google.com/a/measurementlab.net/g/discuss/c/hqvIPYkIus0/m/iJp48rPXwAoJ.
[24] Measurement Lab Mailing List. Download speed discrepancy when running (NDT) speed test. url: https://groups.google.com/a/measurementlab.net/g/discuss/c/7gbIb8a44bg/m/LS6dXalbCQAJ.
[25] Measurement Lab Mailing List. Measurement Labs (NDT) Speedtest is incorrect? url: https://groups.google.com/a/measurementlab.net/g/discuss/c/vOTs3rcbp38.
[26] Measuring Broadband America. Apr. 2022. url: https://www.fcc.gov/general/measuring-broadband-america.
[27] National Broadband Availability Map. url: https://www.ntia.doc.gov/category/national-broadband-availability-map.
[28] Battle For the Net. Internet Health Test based on Measurement Lab NDT. url: https://www.battleforthenet.com/internethealthtest/.
[29] Netflix. fast.com. url: https://fast.com/.
[30] Ookla. (Outdated) Ookla Speed Test Methodology - Legacy HTTP Client. url: https://sandboxsupport.speedtest.net/hc/en-us/articles/202972350.
[31] Ookla. About Ookla Speedtests. url: https://www.speedtest.net/about.
[32] Ookla. Case Study: Ookla and NDT Data Comparison for Upstate New York. url: https://www.ookla.com/resources/whitepapers/make-better-funding-decisions-with-accurate-broadband-network-data.
[33] Ookla. How does Ookla select a server? url: https://help.speedtest.net/hc/en-us/articles/360038679834.
[34] Ookla. Ookla Speed Test Methodology. url: https://help.speedtest.net/hc/en-us/articles/360038679354.
[35] Ookla. Speedtest by Ookla - The Global Broadband Speed Test. url: https://www.speedtest.net/.
[36] Roxy Peck, Chris Olsen, and Jay L. Devore. "Chapter 11: Comparing Two Populations or Treatments". In: Introduction to Statistics and Data Analysis. 2009.
[37] Vinay Joseph Ribeiro, Rudolf H. Riedi, Richard G. Baraniuk, Jiri Navratil, and Les Cottrell. "pathChirp: Efficient available bandwidth estimation for network paths". In: Passive and Active Measurement Workshop. 2003.
[38] SamKnows. url: https://www.samknows.com/.
[39] Srikanth Sundaresan, Walter De Donato, Nick Feamster, Renata Teixeira, Sam Crawford, and Antonio Pescapè. "Broadband internet performance: a view from the gateway". In: ACM SIGCOMM Computer Communication Review (2011).
[40] Pennsylvania State University and Measurement Lab. Broadband Availability and Access in Rural Pennsylvania. url: https://www.rural.pa.gov/publications/broadband.cfm.
[41] Xinlei Yang, Xianlong Wang, Zhenhua Li, Yunhao Liu, Feng Qian, Liangyi Gong, Rui Miao, and Tianyin Xu. "Fast and Light Bandwidth Testing for Internet Users". In: NSDI. 2021.
A Statement of Ethics

[...]

[Figure B.1: Tool performance vs. link capacity for upload tests (Ookla and NDT7, reported vs. average method). Note that the y-axis begins at 0.9. Shaded region represents a 95% confidence interval for n = 10 tests.]

B.1.2 Latency

[Figure B.3: Number of TCP connections opened during a speed test vs. one-way latency between the server and the client.]

[Figure B.5: Tool performance vs. packet loss for upload tests. Shaded region represents a 95% confidence interval for n = 10 tests.]

[Figure B.6: Tool performance on different client types (browser vs. native) for download tests. (a) Client type vs. link capacity; (b) client type vs. packet loss; (c) client type vs. latency.]

[Figure B.7: Performance vs. link capacity for Raspberry Pis and NUCs. (a) Download; (b) Upload.]

[Figure B.8: CDF of the number of paired tests across all tested networks.]