Professional Documents
Culture Documents
WASM Study
WASM Study
Figure 2: Overview of the phases of our methodology. Web Tranco list: top 1,000,000 websites
Own
HTTP Archive: 40,000 URLs with WebAssembly in JS code
Crawling
binary format. The latter is usually used to distribute WebAssem- 3 WebAssembly top lists
bly. Instructions are arranged into functions, and functions are Survey
Manual
Other
arranged into modules, which correspond to files. We use the terms
Figure 3: Sources from which we collect binaries.
“module” and “binary” interchangeably. Each module is divided
into multiple sections, including import and export sections that
different contexts and at different stages of deployment. Sections 3.1
declare functions shared with the environment, a code section that
to 3.4 present how we collect WebAssembly binaries from source
defines functions and their bodies, and a data section that initializes
code repositories, package managers that distribute deployed soft-
memory. Functions, types, and variables are referenced by indices.
ware, archived and live websites, and through manual search, re-
Host environments. WebAssembly runs within a host environ- spectively. Figure 3 gives on overview of the different sources we
ment, such as the browser, Node.js, or a standalone WebAssembly collect binaries from. Alongside each binary, we also collect meta-
runtime like wasmer or wasmtime. The host environment instanti- data, e.g., on which website a binary was found. All activities related
ates the WebAssembly binary and can provide imported functions to to collecting binaries were done between April and September 2020.
the module. For example, in a client-side web application, JavaScript Overall, the collection phase results in 51,148 binaries, including
code can instantiate binaries through the WebAssembly.instantiate duplicates and binaries that are not representative for real-world
and WebAssembly.Module APIs. usages of WebAssembly. Section 3.5 presents the filtering phase,
Linear memory. Each module instance has a linear memory sec- where we filter and deduplicate these binaries into a set of 8,461
tion, which can be thought of as an array of consecutive bytes. It unique binaries that serve as the basis for our study. Finally, the
stores all data handled by the module, except for locals and globals third phase analyzes the set of binaries through a combination of
(which can only be of four primitive types), including all dynami- static code analysis, analysis of metadata associated with the bina-
cally allocated data structures. The i32 datatype is used for pointers ries, and manual inspection. We present the analysis phase along
into linear memory. The linear memory has two implications. First, with its results in Section 4.
programs that want to associate non-primitive data with functions To the best of our knowledge, no prior work has gathered Web-
typically do so through a so-called “unmanaged stack”, which sim- Assembly binaries from such a diverse set of sources. As a result,
ply is a region in the linear memory used to represent the function the number of binaries we obtain is 58 times larger than the largest
stack. This stack is called “unmanaged” because it is not protected set studied so far (147 unique binaries) [24]. Our experimental re-
by the virtual machine and under full control of the program. As a sults (Section 4.2) show that the sources we consider WebAssembly
result, memory-unsafe behavior from source languages, e.g., C and binaries from complement each other, i.e., considering all of them
C++, can also affect WebAssembly binaries. Second, non-trivial ap- is crucial to obtain a representative dataset.
plications often have their own memory allocator compiled into the
binary. Both the unmanaged stack and custom memory allocators 3.1 Collecting Binaries from Repositories
may allow attackers to exploit a WebAssembly binary [18]. Our first method for collecting binaries looks into source code
Compilers, tools, and runtimes. Various compilers target Web- repositories. Even though WebAssembly is a binary format, devel-
Assembly, e.g., the Emscripten compiler for C and C++, the Rust opers often store binaries into source code repositories, e.g., to ease
compiler, the AssemblyScript compiler, or the Asterius compiler the installation of a project or to include third-party libraries. To
for Haskell. Several compilers sometimes share a common frame- gather such binaries, we clone all public repositories that are in the
work, e.g., Emscripten and the Rust compiler are based on LLVM, top 1,000 results of four queries to the GitHub search API:
while the AssemblyScript compiler and Asterius are based on Bina- • Repositories where “wasm” or “WebAssembly” is in the
ryen. Because the bare WebAssembly language does not provide repository name or description (i.e., two queries).
any built-in library, compilers often combine the compiled code • Repositories that are tagged with “WebAssembly” as one of
with a runtime environment. For example, the Emscripten runtime the used programming languages.
environment and the WebAssembly system interface (WASI) both • Repositories tagged with the topic “WebAssembly”.
offer a set of low-level system calls, e.g., to support file I/O and Overall, the queries result in 3,148 repositories, which we clone,
networking. and then search for files ending in .wasm.
3.2.1 npm Packages. The Node Package Manager (npm) distributes We search the responses stored in the HTTP Archive tables for
JavaScript code, some of which relies on WebAssembly modules. likely WebAssembly binaries and then directly download the cor-
Packages distributed via npm are typically used in server-side ap- responding files. To this end, we query two tables, from months
plications with Node.js or on the client side. To find npm packages May and June 2020, which contain information about all requests
that contain WebAssembly binaries, we gather two sets of pack- made while crawling the websites, and the corresponding responses.
ages. First, from the full registry file of npm we compute the top These tables, called summary_requests are 434.4 GB and 476.7 GB
1,000 most depended-upon packages. Second, we query npm for all large. We filter all requests in the tables by the MIME type of the re-
packages that match at least one of the keywords “wasm” and “Web- quested resource, keeping only those commonly used to serve Web-
Assembly”, which yields 2,350 packages. We install these packages Assembly, such as application/wasm and application/octet-stream,
and their transitive dependencies, and then search the resulting and where .wasm appears in the URL. These queries result in a set
7,198 packages for .wasm files. of 855 URLs. We download files from each of these URLs using wget
and keep all that start with \0asm, WebAssembly’s magic number.
3.2.2 wapm Packages. The WebAssembly Package Manager (wapm)
specializes on distributing WebAssembly code. Most of the wapm 3.3.2 Web Crawling. The HTTP Archive-guided search covers a
packages are intended to run standalone WebAssembly runtime wide range of top-level domains, but it may miss WebAssembly
engines. Unlike for npm, we can afford to analyze all 103 available binaries on websites not covered by crawling a generic list of web-
packages. We install all packages and again extract all .wasm files. sites and binaries that one cannot identify based on their MIME
type. To collect additional binaries, we also perform our own web
3.2.3 Firefox Browser Add-ons. Browser extensions, traditionally crawling. There are three components to our crawling: the seed list,
implemented in JavaScript, nowadays can also make use of Web- the crawling algorithm, and methods for detecting WebAssembly.
Assembly code. To gather binaries used in browser extensions, we Seed lists. Any kind of web crawling requires a seed list of URLs
download the top 2,500 Firefox add-ons from addons.mozilla.org, as to start from. We consider three seed lists, one generic list of popular
measured by average daily users. We then unpack the extensions’ websites and two lists targeted specifically at WebAssembly:
XPI archives, and search again for .wasm files.
• Top one million websites. As a generic set of websites to ex-
3.3 Collecting Binaries from Websites plore, we start crawling from the one million most popular
Collecting WebAssembly binaries from the web involves several websites on the Tranco list [28], a top list more resilient to
challenges. First, as the web is too big to be searched in its entirety, manipulation.
finding suitable starting points for exploring it is crucial. Second, • “WebAssembly” in JavaScript files. WebAssembly binaries on
even when visiting a WebAssembly-powered website, it is non- websites must be executed by some surrounding JavaScript
trivial to identify and collect WebAssembly binaries from it. Some code, e.g., by calling WebAssembly.instantiate. To identify
sites embed WebAssembly modules into JavaScript source code, websites with such JavaScript code, we query a table pro-
e.g., as base64-encoded strings that are decoded and instantiated vided by the HTTP Archive that stores the full bodies of
at runtime. For such sites, we must detect WebAssembly modules all HTTP responses up to some size. We search this table,
when they are executed. Other websites spawn requests for Web- which has a total size of 9.32 TB, with Google BigQuery for
Assembly modules but never execute them during our collection all JavaScript responses that contain WebAssembly and add the
process, e.g., because execution relies on specific user inputs. A URLs of the corresponding websites to our seed list, which
purely dynamic methodology would miss such binaries. results in about 40,000 URLs.
We address the first challenge, finding good websites as starting • WebAssembly top lists. As the most targeted seed list, we
points, through two techniques. On the one hand, we can build on start crawling from three hand-curated lists of notewor-
results from the HTTP Archive for finding sites known to con- thy WebAssembly-related websites. These websites cover
tain WebAssembly binaries (Section 3.3.1). On the other hand, projects using WebAssembly3 , tools and demos4 , and
for our own crawling, we systematically start from potentially WebAssembly-based games5 .
WebAssembly-related seed URLs (Section 3.3.2). To address the Crawling algorithm. Given a seed list, our crawler visits each
second challenge of detecting WebAssembly binaries during crawl- URL on the list and recursively follows links on the visited websites.
ing, we analyze all websites through a combination of static and The crawler visits each URL, with up to one retry. If the website
dynamic detection techniques. is loading successfully, the crawler waits until either the “DOM
content loaded” event is fired and all network connections have be-
3.3.1 Direct Downloads Guided by HTTP Archive. The HTTP Archive come idle, or until a 30-second timeout occurs. The crawler collects
project2 regularly crawls the web and makes the requests and re- all WebAssembly binaries loaded or executed in this time (details
sponses available. Starting from URLs obtained from the Chrome below). For each visited website, the crawler extracts more URLs to
User Experience Report , the project currently covers over 5 million explore from the href attribute of all <a>-tags on the site.
top-level domains, monthly. We here focus on websites crawled To control the amount of sites to visit, the crawler is configured
using the desktop version of Google Chrome, which we access via with two parameters: the recursion depth d, which bounds how
Google’s BigQuery database system.
3 https://madewithwebassembly.com/
4 https://github.com/mbasso/awesome-wasm
2 https://httparchive.org/ 5 https://www.webassemblygames.com/
An Empirical Study of Real-World WebAssembly Binaries WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
many links away from the seed URLs to explore, and the exploration those variants by filename (e.g., *.opt.wasm) and path (e.g.,
breadth b, which bounds how many links to follow on each explored binaries in afl_out/).
site. If a site has more than b links, the crawler picks b of them at • Test suites: On GitHub and in some npm packages, we find
random. For the first two seed lists, we set d = b = 2, i.e., the crawler binaries that are used as test inputs for WebAssembly-related
visits at most seven sites per URL in the seed list. Because the third tools, e.g., parsers, compilers, and virtual machines. One
seed list is the most focused one, we explore it more thoroughly large portion are binaries from the official specification test
with d = 7 and b = 3, and repeat the exploration with 16 separate suite, which often test only a single instruction or language
crawler instances. We chose those parameters based on preliminary construct. We identify them by typical repositories and paths
experiments, to find most binaries in a given time budget. (e.g., files in spectest/).
• Tutorial projects: Many npm projects and some GitHub repos-
Identifying WebAssembly binaries. For each website visited by
itories are instances of users following WebAssembly tutori-
the crawler, we use a combination of two techniques to identify
als for particular tool chains.6 Those binaries are small and
WebAssembly binaries on the site. Our first detection mechanism
all very similar. We identify them based on common binary
intercepts the website’s traffic using a local proxy that inspects
names (e.g., hello_world_bg.wasm) and project names (e.g.,
the headers and contents of all requests and responses. To iden-
test-wasm@0.0.1).
tify WebAssembly modules, we check if the content-type header
• Small and invalid binaries: Finally, we remove binaries that
matches application/wasm or application/octet-stream, or if the
contain ten or fewer instructions, and binaries that cannot
URL contains .wasm, and then ensure that the response payload
be validated by the reference WebAssembly binary toolkit
starts with the proper magic number. If this is the case, we store
(WABT), even with all language extensions enabled.
the loaded file as a WebAssembly binary. The key advantage of this
detection mechanism is that it detects WebAssembly modules even
if they are not executed during the crawler’s visit of the website.
4 RESULTS
The second detection mechanism tracks calls to APIs used for in- Based on our WasmBench dataset of real-world WebAssembly bi-
stantiating WebAssembly modules, as proposed in prior work [24]. naries, we address the research questions (RQs) described in the
We transparently overwrite built-in JavaScript functions, such as introduction. For each RQ, we detail the analyses performed on
WebAssembly.instantiate, and analyze its invocations. In contrast the dataset, the direct results, and then interpret those to obtain
to the first detection mechanism, this mechanism can detect Web- insights, i.e., take-away points, often with a focus on security.
Assembly binaries that occur inline in JavaScript code, if executed.
4.1 Implementation and Experimental Setup
3.4 Collecting Binaries Manually The crawler is implemented based on Puppeteer and Puppeteer Clus-
ter, two Node.js libraries for controlling instances of the Chromium
In addition to automatically collecting WebAssembly binaries, we
browser, here version 83.0.4103.0. The static analyses described in
also gather a small number of binaries manually. On the one hand,
the following are implemented in several Rust programs to stat-
we collect binaries through manual interaction with the web in daily
ically extract relevant features, such as instructions, names, etc.
browsing between April and September 2020. On the other hand,
from the binaries, complemented by Python scripts that perform
we asked WebAssembly developers on reddit.com/r/WebAssembly
the final analyses. For parsing binaries, we use the wasmparser li-
in June 2020 for binaries they are willing to share. As discussed in
brary, a project by the Bytecode Alliance. All experiments were run
the results, these two manual collection methods complement our
on an Ubuntu 18.04 machine with two Intel Xeon CPUs at 2.2 GHz
automatically collected binaries with otherwise missed examples.
running 48 threads, equipped with 256 GB of memory. The crawling
was performed in chunks of 50,000 websites, where each chunk took
3.5 Deduplication and Filtering about 7 hours to finish, with a total of about 10 days for all crawling.
After collecting binaries and associated metadata from the afore- The static analyses usually finish within several minutes for the en-
mentioned sources, we remove duplicates and filter binaries that tire dataset. Our entire dataset and the implementation are available
are not representative of real-world applications. To deduplicate for others to build on at https://github.com/sola-st/WasmBench.
binaries, we compare files based on their SHA256 hash and remove
identical files. Unless mentioned otherwise, our study focuses in 4.2 Overview of Dataset
the deduplicated dataset. In addition to deduplication, we remove Table 1 gives an overview of our dataset. For each source, the table
binaries that are non-representative of real-world applications, be- shows how many binaries we found, and how many remain after
cause they fall into at least one of the following categories. Binaries deduplication and filtering. The last column shows how many bina-
that occur multiple times, e.g., across different sources, are only ries are found only via a single source, illustrating the importance
removed if all occurrences of it were filtered out. of particular collection methods for obtaining a diverse dataset.
• Generated binary variants: Some GitHub repositories contain Sources. The largest contribution to the dataset are the GitHub
binaries generated by research tools, e.g., to fuzz-test Web- repositories and packages from npm. Given that WebAssembly
Assembly implementations, to superoptimize WebAssembly binaries on websites or in arbitrary packages are still relatively
code [4], or to perform code diversification [2]. Since these scarce, selecting repositories and packages related to WebAssembly
tools turn a single binary into many, only slightly differ-
ent variants, we remove the generated variants. We identify 6 For example, the rustwasm book: https://rustwasm.github.io/docs/book/.
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aaron Hilbig, Daniel Lehmann, and Michael Pradel
Table 1: Contribution of different sources to the dataset. Table 2: Binaries filtered out due to different criteria.
Found Binaries Removed Binaries
Source (see Figure 3) Filter
Total
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Unique Filtered Only Total and MichaelUnique
Aaron Hilbig, Daniel Lehmann, Pradel
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aaron Hilbig, Daniel Lehmann, and Michael Pradel
GitHub,
Tablesearch: wasm filtered 44,218
2: Binaries out due to21,117
different 6,830
criteria.6,641 Generated binary variants, of those: 9,279 8,048
Table 2: Binaries filtered out14due to different criteria. 1250 in CROW [2] repository 100%
NPM, top dependend-upon 8 8 3 100% 8,025 7,987
1250
binaries
Removed Binaries
binaries
NPM, search: wasm 3,488 2,452 1,163 1,036 Test suites and files, of those: 80% 26,138 5,283
binaries
Removed Binaries 1000
binaries
Filter 1000 variants of WebAssembly spec suite 80%
WAPM, all
Filter 122 113
Total 108 Unique81 60%
25,133 4,931
Total 750
Firefox add-ons, top by users 29 17 17 Unique15 Invalid WebAssembly binaries 60% 13,593 3,513
ofof
750
ofof
Generated binary variants, of those: 261 9,279 141 8,048 40%
Percentile
Number
Web, HTTP
Generated Archive
binary variants, of those: 141
9,279 54
8,048 500Small binaries: <10 instructions 40% 10,146 1,881
Percentile
Number
in CROW [2] with
repository 8,025 500
Web, crawling,
in CROW seed list:
[2] repository 2,923 432
8,025 412 7,987298
7,987 Tutorial projects, of those:
250
20% 848 633
TestHTTP
suites and files, of those: 26,138 20%
Test suitesArchive
and files, of those: 2,046 268
26,138 254 5,283128
5,283 250 hello-wasm projects 695 506
variants of WebAssembly spec suite 25,133 0%
Tranco top
variants websites
of WebAssembly 769
spec suite 167
25,133 164 4,931 86
4,931
0
0 0Filters 50 100 150 200 0% 25 27 29 211 213 215 217 219 221 223 225 227
Invalid WebAssembly
WebAssembly binaries
top lists 108 13,593
89 76 3,513 43
All 0 50Size (kilobytes)
100 150 200 25 2737,808
29 211 Size 15 17 19 14,952
213 2(bytes)
2 2 221 223 225 227
Invalid WebAssembly binaries 13,593 3,513 Size (kilobytes) Size (bytes)
Small
Manual binaries: <10 instructions 93 10,146
89 88 1,881 57
Small binaries: <10 instructions 10,146 1,881 (a) Histogram of the lower 80%
8 (65.8 (b) Cumulative distribution of
Tutorial projects, of those: 848 633 opencascade.js
(a) Histogram of MB, 22.2M
the lower 80% instructions), a WebAssembly
(b) Cumulative distributionport
of
Tutorial
All projects, of those:
Sources 51,148 23,413 848 8,461 633 of the sizes of the binaries. the binary sizes (in log2 scale).
hello-wasm projects 695 506 of
of the sizes of
an open the binaries.
source the found
C++ CAD library, binaryonsizes
npm(in and
log2 GitHub;
scale).
hello-wasm projects 695 506 Figure 4: Distribution of binary sizes.
All Filters 37,808 14,952 Figure
and finally the 4: Distribution
Clang ofcompiled
compiler, itself binary sizes.
to WebAssembly
All Filters 37,808 14,952
is an effective way of finding binaries. At the same time, the sources (46.7 MB, 12.6M instructions), found on wapm and GitHub.
where we do not query for WebAssembly specifically (i.e., most
where
of the we web do not query
crawling, thefortopWebAssembly
packages fromspecifically
npm, and Firefox (i.e., most
add- Insight 2. Complex, real-world applications with millions of lines
where we do not query for WebAssembly specifically (i.e., most Insight 2. Complex, real-world applications with millions of lines
of thestill
ons) webshow
crawling,
that the top packages
WebAssembly fromare
binaries npm, andinFirefox
found the add-
wild in of code are compiled to WebAssembly.
of the web crawling, the top packages from npm, and Firefox add- of code are compiled to WebAssembly.
ons) still projects
popular show that WebAssembly
used by millions binaries
of users. are
For found
the in the
web, we wild
also in
see
ons) still show that WebAssembly binaries are found in the wild in
popular
that all projects
our fourused by millions
sources of users.
are essential forFor the web,
finding we alsosetsee
a diverse of
popular projects used by millions of users. For the web, we also see 4.3 RQ1: Source Languages and Tools
that all our four
WebAssembly sourcesOnly
binaries. are essential
crawling for finding
the top one amillion
diverse set of
websites
that all our four sources are essential for finding a diverse set of 4.3
WebAssembly
would miss atbinaries.least 225 Only crawling
unique the top
binaries. Theonefact
million
that websites
the seed Given RQ1:
4.3 RQ1: Source Languages
Sourcegoal
WebAssembly’s Languages and
and Tools
Tools
of being a universal bytecode, we study
WebAssembly binaries. Only crawling the top one million websites
would missHTTP
at least 225 unique binaries. The fact that the seed Given
whichWebAssembly’s
languages are compiled goal of beingto it inaapractice,
universaland bytecode, we study
which toolchains
would miss at least 225 unique binaries. The fact that the much
lists from Archive and the WebAssembly top lists are seed Given WebAssembly’s goal of being universal bytecode, we study
lists fromthan
smaller HTTPthe Archive
list of theand
topthe
oneWebAssembly
million top yet
websites, liststhe
arenumber
much which
are languages
used for to are compiled to it in practice, and which toolchains
it.
lists from HTTP Archive and the WebAssembly top lists are much which languages are compiled to it in practice, and which toolchains
smaller
of found than the listare
binaries of the top oroneeven
million websites,
showsyet thea number are Analysis.
used for toIt it.
smaller than the list ofsimilar
the top one higher,
million websites, that
yet the targeted
number are used for to is non-trivial to infer from a binary which source
it.
of found
seed list binaries
for are similar
crawling is key or
to even higher,
finding showsmissed
otherwise that a binaries.
targeted language
of found binaries are similar or even higher, shows that a targeted
seed list for crawling is key to finding otherwise missed binaries. Analysis.and compiler
It is non-trivial has to produced
infer from it. We rely onwhich
a binary several com-
source
seed list for crawling is key to finding otherwise missed binaries. plementary It is non-trivial
Analysis. methods. First, weto check
infer from the a binary section,
producers which sourcewhere
language and compiler has produced it. We rely on several com-
language
some toolchainsand compiler
explicitly has produced it. Welanguage(s)
rely on several com-
Insight 1. All methods we use to collect binaries contribute in plementary methods. First,encode
we check thethesource
producers section, where
a program
Insight 1. All methods we use to collect binaries contribute in plementary
is compiled methods.
from. First, we
Second, our check
analysisthe producers
searches section,
for where
characteris-
a non-negligible way. Combining different sources and collection some toolchains explicitly encode the source language(s) a program
a non-negligible way. Combining different sources and collection some toolchains explicitly encodein
that appear thethesource language(s)
section, a program
the
techniques is crucial for obtaining a representative dataset. isticcompiled
function names
from. Second, our analysis import
searches export
for characteris-
techniques is crucial for obtaining a representative dataset. is compiled
section, or from.
the Second,
optional name oursection.
analysisFor searches
example, for_ZdaPv
characteris-
is the
tic function names that appear in the import section, the export
tic function names
name-mangled deletethatoperator
appear inofthe C++; import
or section, the export
runtime.gostring is a
Filtering. Deduplication
Filtering. Deduplicationand andfiltering
filteringnon-representative
non-representativebinariesbinaries section, or the optional name section. For example, _ZdaPv is the
Filtering. Deduplication and filtering non-representative binaries section,
Go runtime or the optional
library name section.
function. Overall,For weexample, _ZdaPv is the
identify characteristic
(Section
(Section 3.5)
3.5)significantly
significantly reduces
reduces the
thedataset
dataset (Table
(Table 2).
2).In
Inparticular,
particular, name-mangled delete operator of C++; or runtime.gostring is a
(Section 3.5) significantly reduces the dataset (Table 2). In particular, name-mangled
function names delete
for C++, operator
C, Rust, ofGo,C++; or runtime.gostring
AssemblyScript, Kotlin, is anda
one
onerepository
repositorythat
thatcontains
containsgenerated
generatedvariants
variantsof ofinput
inputbinaries
binariesis Go runtime library function. Overall, we identify characteristic
one repository that contains generated variants of input binaries isis Go runtime
FStar. Third, library
the function.
analysis searchesOverall,for we identify characteristic
characteristic strings among
important
important to
tofilter,
filter,as
asititotherwise
otherwise accounts
accounts for
forover
over 8,000
8,000 unique
unique function names for C++, C, Rust, Go, AssemblyScript, Kotlin, and
important to filter, as it otherwise accounts for over 8,000 unique function
all text names forof
sequences C++, C, Rust,
>3 ASCII Go, AssemblyScript,
characters in the data Kotlin, and
section. For
binaries.
binaries.From
Fromthe thesecond
secondcategory,
category,test testsuites,
suites,we wealso
alsoseeseethat
that FStar. Third, the analysis searches for characteristic strings among
binaries. From the second category, test suites, we also see that FStar.
example, Third,
beingthecoreanalysis
types, searches for characteristic
Result::unwrap and strings among
Option::unwrap fre-
test
test binaries
binaries are
are commonly
commonly reused
reused across
across many
many projects
projects (26,138
(26,138 all text sequences of >3 ASCII characters in the data section. For
test binaries are commonly reused across many projects (26,138 all text sequences
quently appear in of >3 messages
error ASCII characters of the in the
Rust data section.
standard library. For
We
occurrences,
occurrences,but butonly
only5,283
5,283unique
uniquebinaries),
binaries),and
andthat
thatmost
mostof ofthem
them example, being core types, Result::unwrap and Option::unwrap fre-
occurrences, but only 5,283 unique binaries), and that most of them example,
identifyappearbeing core types,
characteristic Result::unwrap andMatlab,
Option::unwrap fre-
are
arefrom
from the
the official
officialspecification
specification test
test suite.
suite.The
The last
last filter
filter removes
removes quently in error strings
messages forof C++,
the Rust,
Rust standard and COBOL.
library. We
are from the official specification test suite. The last filter removes quently
Fourth, appear
if none in
of error
the messages
above work, ofwetheanalyze
Rust standard
sibling library.
files of We
bina-
binaries
binariesfrom
fromaaafew,
few,very
verysimilar
similartutorials,
tutorials,including
includingmore morethanthan500500 identify characteristic strings for C++, Rust, Matlab, and COBOL.
binaries from few, very similar tutorials, including more than 500 identify characteristic
ries collected strings for C++, Rust, Matlab, and COBOL.
binaries
binaries from
from projects
projects called
called hello-wasm. .
hello-wasm Fourth, if nonefrom of the code repositories
above work, we and package
analyze managers.
sibling files ofSibling
bina-
binaries from projects called hello-wasm. Fourth,
file here if means
none ofathe file above
in the work,
same we analyze that
directory sibling files of
shares thebina-
file
ries collected from code repositories and package managers. Sibling
Binarysizes.
Binary Asaafirst
sizes. As firstproxy
proxyforforthe
thediversity
diversityof ofthe
thecollected
collected ries
name collected
except from
for thecode repositories
extension. We and into
take packageaccount managers.
extensions Siblingfor
Binary sizes. As a first proxy for the diversity of the collected file here means a file in the same directory that shares the file
binaries,we
binaries, welook
lookinto
intotheir
theirsizes
sizesand
andinstruction
instructioncount.
count.Figure
Figure44 file
C, here Rust,
C++, means Go, a AssemblyScript/TypeScript,
file in the same directory that the shares
WebAssemblythe file
binaries, we look into their sizes and instruction count. Figure 4 name except for the extension. We take into account extensions for
showsthe
shows thehistogram
histogramand andcumulative
cumulativedistribution
distributionof ofbinary
binarysizes
sizes name
text except(.wat
format for the extension. We take into account extensions
Finally, for
shows the histogram and cumulative distribution of binary sizes C, C++, Rust, Go,/.wast ), and several
AssemblyScript/TypeScript, smaller languages.
the WebAssembly for
ininbytes.
bytes.The Thedistribution
distributionof ofthe
thenumber
numberof ofinstructions
instructionsisissimilar
similar C,someC++, Rust,code
source Go, AssemblyScript/TypeScript,
repositories and packages with themultiple
WebAssemblyuniden-
in bytes. The distribution of the number of instructions is similar text format (.wat/.wast), and several smaller languages. Finally, for
inin shape
shape andand omitted
omitted forfor space
space reasons.
reasons. While
While there
there are
are many
many text
tifiedformat (.wat
binaries, we /.wast ), and inspect
several source
smaller languages. Finally, and
for
in shape and omitted for space reasons. While there are many some source code manually
repositories and packagescode, with build
multiplescripts,
uniden-
smallbinaries,
small binaries,there
thereisisaalong
longand
andheavy
heavytail
tailtowards
towardslarger
largersizes.
sizes. some source
binaries. code
For each repositories
of the automated and packages
methods with
above,multiple uniden-
small binaries, there is a long and heavy tail towards larger sizes. tified binaries, we manually inspect source code, buildwe manually
scripts, and
Twothirds
Two thirdsof ofthe
thebinaries
binariesare arelarger
largerthan
than20 20KBKBand
andhave
havemore
more tified
inspect binaries,
binaries we manually inspect source code,that build scripts, and
Two thirds of the binaries are larger than 20 KB and have more binaries. For eachand the automated
of the predictions to confirm
methods above, our heuristics
we manually
than8,700
than 8,700instructions.
instructions.TheThemedian
medianbinary
binaryisis37.1
37.1KB
KBlarge
largeand
andhas
has binaries.
are precise. ForForeach of thewhere
binaries automatedmultiple methods
methods above, we the
identify manually
source
than 8,700 instructions. The median binary is 37.1 KB large and has inspect binaries and the predictions to confirm that our heuristics
14,885instructions.
14,885 instructions.The Thelargest
largestbinaries
binariesareareaaWebAssembly
WebAssemblyport port inspect
language, binaries and the predictions to confirm that our heuristics
14,885 instructions. The largest binaries are a WebAssembly port are precise.we Forconfirm
binariesthat where themultiple
predictions methodsare consistent.
identify the source
of TiDB77, ,aadistributed
ofTiDB distributedSQLSQLdatabase
databasewritten
writtenin inGo,
Go,with
with75.1
75.1MBMB areResults.
precise. Figure
For binaries where multiple methods identify the We
source
of TiDB7 , a distributed SQL database written in Go, with 75.1 MB language, we confirm 5a shows
that the thepredictions
inferred source languages.
are consistent. see
and16.9M
and 16.9Minstructions,
instructions,respectively,
respectively,found
foundas asaawapm
wapmpackage;
package; language, we confirm that the predictions
that almost two thirds (64.2%) of the binaries are compiled from C, are consistent.
and 16.9M instructions, respectively, found as a wapm package;
opencascade.js88 (65.8 MB, 22.2M instructions), a WebAssembly port Results. Figure 5a shows the inferred source languages. We see
opencascade.js (65.8 MB, 22.2M instructions), a WebAssembly port
7 https://github.com/pingcap/tidb Results. Figure 5a shows the inferred source languages. We see
8 https://github.com/donalffons/opencascade.js
of an open source C++ CAD library, found on npm and GitHub; that almost two thirds (64.2%) of the binaries are compiled from C,
of an open source C++ CAD library, found on npm and GitHub; that almost two thirds (64.2%) of the binaries are compiled from C,
and finally the Clang compiler, itself compiled to WebAssembly C++, or a combination of both. Given that these are memory-unsafe
and finally the Clang compiler, itself compiled to WebAssembly C++, or a combination of both. Given that these are memory-unsafe
(46.7 MB, 12.6M instructions), found on wapm and GitHub. languages, plagued with decades of vulnerabilities [41] and that
(46.7 MB, 12.6M instructions), found on wapm and GitHub. languages, plagued with decades of vulnerabilities [41] and that
7 https://github.com/pingcap/tidb WebAssembly binaries are not automatically safe from exploita-
7 https://github.com/pingcap/tidb
8 https://github.com/donalffons/opencascade.js
WebAssembly binaries are not automatically safe from exploita-
8 https://github.com/donalffons/opencascade.js tion [18], this result is highly worrying.
tion [18], this result is highly worrying.
dataset. F
piled to WebAssembly, including languages with garbage collection tematical
and heavier runtimes (Go, Matlab). This result matches WebAssem- binaries.
bly’s goal of serving as a universal bytecode. It also means binary because
analysis will become more important, since source code is not al- language
ways available, and even if it is, implementing separate analyses also, e.g.,
for many languages is impractical. cause me
An Empirical Study of Real-World WebAssembly Binaries WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
We also analyze the tools used to produce the binaries. For 20.2% 4.4.1 Us
1250 100% C++ (3,986)
of the binaries,47.1% the producers sectionProducers explicitly mentions
Section Projects (245)
them in stack” is
Percentile of binaries
Number of binaries
unmanaged stack
1.8% Others (9)
9.5%
60%
Emscripten (1,189) 2.0% their resp
could be a stack pointer. 14.0% Finally, 9.5% of all binaries
emmalloc (11) have at least Amon
Stack pointer: 4.3% Boehm GC (61)
global 0 (4,216) 23.5%
No mutable
i32 global (1,989)
40% one candidate mutable global pointer, but0.7% it is not accessed
0.7% wee_alloc (62)
often nate, bein
49.8% enough for our(1,430)
analysis
20% dlmalloc 16.9% to consider it the stack
eosio pointer.(368)
simple_malloc Interest-
2.0% 32.7%
EOSIO, a
15.2%
No memory (167) 0%
ingly, AssemblyScript programs 1.6% are in the last category, since its bytecode
1.3% eosio malloc (2,766)
0% 20% 40% 60% 80% 100% runtime seems Go malloc
to not (139)support stack allocation.
Stack pointer: others (1,284) Percentile of binaries AssemblyScript alloc (114) Emscript
To better understand how much a binary uses the unmanaged default a
(a) Proportion of binaries (b) Cumulative distribution of pro- stack,Figure
Figure7:6b Allocators
shows how in many
binaries, multiple
of the functions canin apply.
a binary analysis o
with identified stack pointer, portion of functions in a binary access the stack pointer at least once. Consistent with Figure 6a,
and if not why. that use the unmanaged stack.
in 35% of all binaries no function uses the stack pointer because are consi
in 35% of all binaries no function uses the stack
none is present. In the median binary, already 33% of all functions pointer because assertion
Figure 6: Unmanaged stack usage in binaries. none is present. In the and median binary, already 33% of all functions
use the stack pointer, in some binaries almost every function dataset, w
that buffer overflows on the unmanaged stack can overwrite across use
usesthe
thestack pointer, stack.
unmanaged and inOn some binaries
average acrossalmost every function
all binaries with an naries), t
stack frames and even into supposedly “constant” data, this makes uses the unmanaged
unmanaged stack, 44% stack. On average
of their functions across
makeall usebinaries
of it. with an that wer
the unmanaged stack a more dangerous exploitation target than unmanaged stack, 44% of their functions make use of it. corruptio
even in native programs [18]. For this reason, we evaluate how Insight 6. Many binaries (65%) and functions in those binaries Boehm G
many binaries use it in practice. (44%) use the unmanaged stack, which attackers may abuse for in severa
Analysis. To analyze the usage of the unmanaged stack, our runtime exploitation. This result extends earlier findings [18] to a
Insight 7
static analysis performs two steps. First, it tries to identify the stack much larger and more diverse set of real-world binaries.
allocator
pointer to determine whether a binary uses an unmanaged stack at
the risk t
all. Out of all global variables, the analysis selects the one that 4.4.2
4.4.2 Statically
Statically Linked Allocators. WebAssembly’s
Linked Allocators. WebAssembly’s memory
memory orga-orga- tion to us
• has type i32, i.e., the type of all pointers, nization
nization isis very
very low-level.
low-level. Besides
Besides the
the single
single linear
linear memory
memory section,
section, memory
• is declared mutable, to exclude constants like STACK_MAX, which
which can
can be be expanded
expanded at at runtime
runtime with
with the memory.grow instruc-
the memory.grow instruc- environm
• is the most read and written global, as determined by the tion,
tion, no
no help
help with
with allocating
allocating memory
memory isis provided
provided by
by the
the language.
language.
number of global.get × global.set instructions14 , and Subdividing
Subdividing the the linear
linear memory,
memory, e.g.,
e.g., to
to avoid
avoid fragmentation
fragmentation and and
4.4.3 Im
• has at least three reads and at least three writes, to avoid to reuse space of deallocated objects, needs to be handled by an
15 https://github.com/sola-st/wasm-binary-security exploit a
false positives in small binaries. allocator that is statically linked into the binary. Especially for bi-
We manually validate that these heuristics identify the stack pointer naries on the web and for smart contract platforms, code size is an
reliably on randomly sampled binaries. If the analysis cannot iden- important consideration, so developers can choose a lightweight
tify a stack pointer, it conservatively assumes that the binary does allocator instead of the default allocator provided by the compiler.
not use an unmanaged stack. Second, once the stack pointer is Prior work has shown [18] that those smaller allocators can lack
identified, the analysis counts the number of functions in the bi- important mitigations against heap metadata corruption and yield
nary that access the stack pointer somewhere in their body. Our powerful arbitrary write primitives for an attacker. However, it
implementation builds upon the prototype analysis provided by remains unclear what allocators developers use in practice.
Lehmann et al.15 , but uses a different WebAssembly parser to also Analysis. To identify allocators in a binary, we rely on similar
handle language extensions and makes the analysis robust enough heuristics as for source language detection (Section 4.3). That is, we
to run on thousands of real-world binaries. first identify allocators by characteristic function names in binaries,
Results. Figure 6a shows that almost two thirds (65%) of all bina- if those are available. Then, we inspect frequent strings in the data
ries use the unmanaged stack. While most of them use the global section of binaries, e.g., for error messages of certain allocators.
with index 0 as their stack pointer, our heuristics to identify the Results. Figure 7 shows our results, grouped into three categories.
stack pointer are important, since 15.2% of all binaries use another In blue, we mark default allocators provided by programming lan-
global variable. For 2% of the binaries, the analysis can clearly deter- guages and compilers, in different shades of red we mark other al-
mine that they have no unmanaged stack in linear memory, simply locators that we identified, and in gray when we could not identify
because there is no linear memory at all. For 23.5% of the binaries, an allocator. In terms of default allocators, we see that 16.9% of all
there is a linear memory section, but no mutable i32 global that binaries use dlmalloc, the default allocator provided by Emscripten,
could be a stack pointer. Finally, 9.5% of all binaries have at least Clang, and the Rust compiler when targeting WebAssembly. Go and
one candidate mutable global pointer, but it is not accessed often AssemblyScript allocators are present roughly in the proportion of
enough for our analysis to consider it the stack pointer. Interest- their respective languages.
ingly, AssemblyScript programs are in the last category, since its Among the non-default allocators, two particular ones domi-
runtime seems to not support stack allocation. nate, being in 32.7% and 4.3% of our binaries. They are both from
To better understand how much a binary uses the unmanaged EOSIO, a smart contract platform that uses WebAssembly as its
stack, Figure 6b shows how many of the functions in a binary bytecode. Those contracts can be written in C++ and compiled with
access the stack pointer at least once. Consistent with Figure 6a, Emscripten. However, most of them are not using Emscripten’s
14 The product of the counts prefers globals which are similarly often read and written. default allocator. While we did not perform an in-depth security
This is true for the stack pointer, but not for other frequently accessed pointers. analysis of EOSIO malloc and simple_malloc we can attest that both
15 https://github.com/sola-st/wasm-binary-security are considerably shorter in terms of code and do not feature any
their respective languages.
at least Among the non-default allocators, two particular ones domi-
ed often nate, being in 32.7% and 4.3% of our binaries. They are both from
Interest- EOSIO, a smart contract platform that uses WebAssembly as its
since its bytecode. Those contracts can be written in C++ and compiled with
Emscripten. However, most of them are not using Emscripten’s
managed An Empirical
default Study of While
allocator. Real-World
weWebAssembly Binariesan in-depth security
did not perform WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
a binary analysis of EOSIO malloc and simple_malloc we can attest that both
gure 6a, Table 3: Imports matching potentially security-critical APIs.
are considerably
assertions shorter
that would in terms
guard of code
against and do
metadata not feature
corruption. Inany
our
because assertions Matching
unctions dataset, we also find wee_alloc (62 binaries) and emmalloc (11our
that would guard against metadata corruption. In bi- Category Patterns
dataset, we also
naries), two smallfind wee_alloc
allocators for (62
Rustbinaries) and emmalloc
and Emscripten (11 bi-
respectively, Imports Binaries %
function naries), twoalready
small allocators forvulnerable
Rust and Emscripten respectively,
with an that were found to be against heap metadata Code execution eval, exec, execve 383 160 1.9%
that were already
corruption attacksfound
[18]. to be vulnerable
Other interestingagainst
customheap metadata
allocators are emscripten_run_script
corruption
Boehm GC attacks [18]. Other interesting
(a mark-and-sweep custom and
garbage collector) allocators are
gperftools, Network access xhr, request, http, fetch 944 172 2.0%
binaries Boehm GCbinaries
in several (a mark-and-sweep
collected from garbage
Googlecollector)
domains.and gperftools, File I/O file, fd, path 7,532 1,610 19.0%
buse for in several binaries collected from Google domains. DOM interaction document, html, body, element 1,720 212 2.5%
[18] to a Dynamic linking dlopen, dlsym, dlclose 352 138 1.6%
Insight 7. WebAssembly binaries come with a variety of memory
allocators, including many custom allocators (38.6%), increasing At least one 10,468 1,797 21.2%
the risk to include vulnerable allocators. If code size is the motiva-
ry orga-
tion to use custom allocators, a more secure alternative could be a
y section, bufdelete does not. We manually inspect matches to ensure they
memory allocation or garbage collection API provided by the host
instruc- are plausible and remove benign matches otherwise.
environment [33].
anguage. Results. Table 3 shows the results of our name-based import
tion and 4.4.3 Imports
Imports of of Security-Critical
Security-Critical APIsAPIs from
from Host
Host Environment.
Environment. To To analysis. The first two columns show the category and patterns
4.4.3
exploit aa WebAssembly
exploit WebAssembly binary, binary, an
an attack
attack proceeds
proceeds in in two
two steps.
steps. The
The we match import names against. In the third column, we count
first step is compromising the state or behavior of the WebAssembly imported functions that match at least one pattern. The last two
An Empirical Study of Real-World WebAssembly Binaries
binary itself, e.g., by exploiting an unsafe allocator (Section 4.4.2) columns show the number WWW of binaries’21, April 19–23, 2021, Ljubljana, Slovenia
with at least one matching
or a buffer overflow on the unmanaged stack (Section 4.4.1). The import, and which fraction of the filtered dataset this corresponds to.
Table 3: Imports matching potentially security-critical APIs.
second step is actually performing the malicious action to the un- We see that
import, imports
and which relatedoftothe
fraction file I/O, the
filtered mostthis
dataset common category,
corresponds to.
derlying system. The only way to do so, assuming VM Matching
implementa-
Category Patterns are see
We present
that in almostrelated
imports every fifth
to filebinary.
I/O, theInterestingly,
most common evencategory,
though
tions are bug-free and host security16 is perfect (which
Imports they are not%
Binaries WebAssembly
are was originally
present in almost every fifth notbinary.
meantInterestingly,
to replace JavaScript,
even though but
[1, 36]), is to call functions imported
eval, exec, execve
into the WebAssembly binary rather for compute
WebAssembly intensive not
was originally applications,
meant to still 212 JavaScript,
replace binaries likely
but
Code
from execution
the host environment. For example, an attacker383 could 160pass1.9%
an
emscripten_run_script interactfor
rather with the DOM
compute from WebAssembly,
intensive applications, still which212attackers could
binaries likely
injected access
Network string onxhr,therequest,
unmanaged stack to an 944
http, fetch imported172 function,
2.0% use for cross-site
interact with the scripting.
DOM from InWebAssembly,
the last row, wewhich see that overall could
attackers 21.2%
e.g.,I/O
File JavaScript’s evalfile,. To
fd, estimate
path how often WebAssembly
7,532 1,610binaries
19.0% of all
use forbinaries import
cross-site at least
scripting. onelast
In the potentially
row, we seesecurity-critical API.
that overall 21.2%
use such
DOM security-critical
interaction document,host
html, APIs, we thus analyze
body, element 1,720 their 212imports.
2.5% of all binaries import at least one potentially security-critical API.
Dynamic linking dlopen, dlsym, dlclose 352 138 1.6%
Analysis. We identify imported security-critical APIs based on Insight 8. Many binaries (21.2%) import potentially dangerous
At leastimport
their one name in the WebAssembly binary. 10,468 Going 1,797 21.2%
by name APIs from their host environment, which may allow compromised
(instead of implementation) is necessary because, (1) the imple- binaries, e.g., to inject arbitrary code or to write to the file system.
first step is compromising
mentation of an imported thefunction
state or behavior of the
is supplied by WebAssembly
the host only
binary itself, e.g., by exploiting
at instantiation-time, so it is notan available
unsafe allocator (Section
when given 4.4.2)
only the
or a buffer
binary; andoverflow
(2) there on
are the
many unmanaged stack (Section
host environments, 4.4.1).
not all The
of which 4.5
4.5 RQ3:
RQ3: Cryptomining
Cryptomining
second
are using step is actuallyWASI
JavaScript. performing the malicious
for example, action tothat
defines imports the can
un- A
A study
study ofof real-world
real-world usesuses of
of WebAssembly
WebAssembly performed
performed inin early
early
derlying system. The
be implemented only way
by different to do so, assuming
standalone WebAssemblyVM implementa-
VMs in na- 2019
2019 [24]
[24] reports
reports cryptomining
cryptomining to to be
be one
one of
of the
the most
most common
common use use
tions are bug-free
tive code. We thusand identify import16names
host security is perfect
for (which
which thetheyhost
are not
im- cases
cases ofof WebAssembly
WebAssembly on on the
the web.
web. That
That study
study found
found 55.7%
55.7% of
of the
the
[1, 36]), is to call
plementation functions
is likely imported into function.
a security-critical the WebAssembly binary
E.g., the import analyzed
analyzed websites
websites to to use
use WebAssembly
WebAssembly for for cryptojacking,
cryptojacking, i.e.,
i.e., the
the
from the host environment. For example,binaries
in WebAssembly an attacker could pass
is typically boundan practice
emscripten_run_script practice ofof using
using aa website
website visitor’s
visitor’s hardware
hardware resources
resources for
for mining
mining
injected string on the unmanaged
to Emscripten-generated JavaScriptstack to ancalls
code that imported
eval. Wefunction,
match cryptocurrencies
cryptocurrencies without
without their
their consent.
consent. Identifying
Identifying and
and controlling
controlling
e.g., JavaScript’s
imports against eval . To estimate
18 patterns how
in five often WebAssembly
categories known to bebinaries
poten- cryptominers
cryptominers on on the
the web
web has
has been
been the
the focus
focus ofof several
several recent
recent pieces
pieces
use
tiallysuch security-critical
security-critical host APIs, we thus analyze their imports.
APIs: of
of work
work [15,
[15, 25,
25, 34,
34, 43].
43]. In
In this
this research
research question,
question, wewe study
study whether
whether
Analysis. We identify
• Code execution. imported
Imports security-critical
like eval APIs based on
, exec, or emscripten_run_script . cryptomining
cryptomining is is still
still an
an important
important threat
threat today.
today. We
We address
address this
this
their•import
Network name in the
access. WebAssembly
Imports containing binary. Going ,by
XHR, request httpname
, or question
question in in two
two ways.
ways. First,
First, we
we analyze
analyze those
those binaries
binaries we
we collected
collected
(insteadfetch
of implementation)
. is necessary because, (1) the imple- from
from the
the web
web for
for signs
signs ofof being
being cryptominers.
cryptominers. Second,
Second, we
we directly
directly
mentation of an
• File I/O. imported
Imports function
containing is ,supplied
file fd, or path by. the host only compare
compare the binaries gathered in earlier work with our dataset.
the binaries gathered in earlier work with our dataset.
at instantiation-time,
• DOM interaction.soImports it is not availabledocument
containing when given , htmlonly
, bodythe
, or 4.5.1 Analyzing Binaries Found on the Web. To understand the
binary;element
and (2) there 4.5.1 Analyzing Binaries Found on the Web. To understand the
could are many host
manipulate theenvironments,
DOM, which not can all ofto
lead which
XSS. prevalence of cryptomining today, we analyze all binaries collected
are using JavaScript. WASI for ,example, defines imports prevalence of cryptomining today, we analyze all binaries collected
• Dynamic linking. dlopen dlsym, and dlclose allowthat can
loading from the web, i.e., direct downloads guided by HTTP Archive and
be implemented from the web, i.e., direct downloads guided by HTTP Archive and
additionalby different
code standalone
at runtime, which WebAssembly
can lead to code VMs in na-
injection. the results of our own crawling, using VirusTotal. The VirusTotal
tiveTocode. the results of our own crawling, using VirusTotal. The VirusTotal
avoidWe thus identify
spurious matches, import names
especially forfor which
short the host
patterns likeim-
fd, API allows to upload and scan files with up to 70 independent third-
API allows to upload and scan files with up to 70 independent third-
plementation
we tokenize import is likely a security-critical
names function.
based on camel-case and E.g., the import
non-alphabet party antivirus scanners and malware detectors, and reports back
emscripten_run_script in WebAssembly binaries is typically bound party antivirus scanners and malware detectors, and reports back
characters, and then check for a pattern to occur verbatim in the the number of positive results. Among the 352 analyzed binaries,
to Emscripten-generated JavaScript code our
thatfile
callsI/O
eval . We match the number of positive results. Among the 352 analyzed binaries,
token sequence. E.g., fd_write matches category, but VirusTotal reports four files to contain malicious content. Three of
imports against 18 patterns in five categories known to be poten- VirusTotal reports four files to contain malicious content. Three of
them are likely to be the same program, as they have similar sizes
16 Host security: isolation of WebAssembly execution from the underlying host, e.g.,
tially security-critical APIs: them are likely to be the same program, as they have similar sizes
(68.8 ± 0.7kB) and the same distribution of instructions. These three
ensuring that a WebAssembly program can write only to its designated memory region. (68.8 ± 0.7kB) and the same distribution of instructions. These three
• Code execution. Imports like eval, exec, or emscripten_run_script. files are detected by 26 or more scanning tools employed by Virus-
• Network access. Imports containing XHR, request, http, or fetch. Total. One of the files is collected from http://monero.cit.net,
• File I/O. Imports containing file, fd, or path. which further supports the presumption that the binary is a cryp-
• DOM interaction. Imports containing document, html, body, or tominer, as “Monero” is the name of a common cryptocurreny.
element could manipulate the DOM, which can lead to XSS. Moreover, one of three binaries is identical to a binary we collect
• Dynamic linking. dlopen, dlsym, and dlclose allow loading addi- also from a GitHub repository called “CryptoNoter”17 , which is an
tional code at runtime, which can lead to code injection. open-source Monero cryptominer. The fourth file reported by Virus-
Total is tested positive by only one scanner. Our manual analysis
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aaron Hilbig, Daniel Lehmann, and Michael Pradel
Other
Number of domains
files are detected by 26 or more scanning tools employed by Virus- 300 Test for WebAssembly support
Total. One of the files is collected from http://monero.cit.net, which Hyphenopoly
200
further supports the presumption that the binary is a cryptominer, Long.js
with ours 300 contains 23 binaries, i.e., Test 16%forof their dataset
WebAssembly supportand 0.2% unique WebAssembly binaries found on the web.
4.6.1 Binaries Found on Multiple Websites. Out of all 476 unique
of our dataset.
200 To better understandHyphenopoly these binaries, we manually Application domain
Long.js binaries found on the# web,
Binaries Application
70 are reused across domain
at least#two Binaries
differ-
examine them and visit the corresponding websites. We find four
100 ent top-level
Games domains. Figure 25 8 shows a histogram
Online gambling of how often
2
of the 23 binaries to be suspicious. Two of them are among the files
binaries occur on multiple domains.
Text processing 11 The dataand
Barcodes follows
QR codes a long-tail dis- 2
flagged by0 VirusTotal, as discussed above. For one file from a website
0 10 20 30 40 tribution,
Visualizationi.e.,/aAnimation
few binaries11occur on Roommany websites,
planning while many
/ Furniture 2
that declares itself to binaries,
Unique be a “blockchain
orderd by number explorer”,
of occurrenceswe could not Mediabinaries
processing / Player 9 or only Blogging 2
other recur a few times once. The top-most widely
observeFigure 8: Binaries
any suspicious found
activity on multiple
when visiting the websites.
source website, Demo, e.g., binary
of a occurs 371 times, Cryptocurrency wallet 2
distributed 7 i.e., in 28% of all domains where
but also could not identify its functionality, and thus declare it to be programming language binaries. Regular expressions 1
but also could we detect WebAssembly
suspicious. Fornot
theidentify its functionality,
last suspicious and thus
binary, visiting thedeclare it to be
corresponding Wasm tutorial or test
suspicious. For the last suspicious binary, visiting the corresponding To better understand the 5most recurring Hashing
binaries, we analyze 1
website increases CPU load to 70%. The website offers a service to Chat 3 PDF viewer 1
website increases CPUand loadopenly
to 70%.advertises
The website them through a combination of automated clustering and manual
mine cryptocurrency theoffers a service
fact that one canto
mine cryptocurrency and openly advertises the fact that one can inspection. The automated clustering represents each binary as
start mining immediately in the browser. That is, the binary is an
start mining immediately in a set of byteunderstand
To better n-grams [21], the mostsummarizes
recurring thebinaries,
numberwe of analyze
n-gram
example of cryptomining butthe notbrowser. That is, the binary is an
cryptojacking.
example of cryptomining but not cryptojacking. occurrences
them through inaa combination
binary into a characteristic
of automated vector, clusteringand then clusters
and manual
In summary, we identify only four binaries from our “web” dataset binaries based
inspection. Theonautomated
the pairwise cosine similarity
clustering represents of their
eachvectors.
binary We as
In
as summary, we identify only
possible cryptominers (about four1%binaries from ourthree
of the dataset), “web”ofdataset
which athen
set inspect
of bytethe top-most
n-grams binaries
[21], in Figure
summarizes the8, using
number theof clusters
n-gram to
as possible cryptominers (about 1% of the
appeared to be inactive when visiting the website. While our dataset), three of which
analy- quickly identify variants of athe same binary.vector, Our analysis shows the
occurrences in a binary into characteristic and then clusters
appeared
sis may missto be inactive when
cryptomining visitinga risk
binaries, the website.
one could While
reduce ourthrough
analy- following to beonthe
binaries based themost widely
pairwise occurring
cosine similarity WebAssembly
of their vectors. binaries.We
sis may missanalysis
additional cryptomining
techniques binaries, a risk
beyond one provided
those could reduce via through
VirusTo- then inspectforthe
Testing top-most binaries
WebAssembly support. in Figure
At least 8, using the clusters
509 domains (38.5% to
additional
tal [24, 43], the prevalence of cryptomining seems to haveVirusTo-
analysis techniques beyond those provided via dropped quickly identifythat
of all domains variants of the same binary.
use WebAssembly) Our analysis
are serving shows the
a WebAssembly
tal [24, 43], theover
significantly prevalence
the pastofone cryptomining
to two years. seems to have
This resultdropped
is also following
binary that to tests
be thewhether
most widely occurring
the browser WebAssembly
supports WebAssembly binaries. at
significantly
confirmed byover the past
a manual one tooftwo
analysis years. Thisbinaries
WebAssembly result isfound
also
confirmed all.Testing
We found two variants of such binaries,
for WebAssembly support. At least 509 domains (38.5% both of which are
on the webby(Section
a manual analysis
4.6.2) and isofinWebAssembly
line with other binaries
reports found
[42] rather small: athat six instruction binary with a singleafunction called
on of all domains use WebAssembly) are serving WebAssembly
thatthe web (Sectionbecame
cryptomining 4.6.2) and lessisappealing
in line with afterother
one of reports [42]
the major test and antests
eightwhether
byte binary
that cryptomining binary that the that
browseronly supports
contains the WebAssembly
WebAssembly at
cryptomining scriptbecame
providers, less Coinhive,
appealingshut afterdown.
one ofVarlioglu
the major et magic
cryptomining script providers, Coinhive, shut down. Varlioglu et all. Wenumber
found two followed by the
variants of language
such binaries,version. bothWebsites
of which servingare
al. find a 99% decrease in sites using cryptomining among sites these test binaries often also binary
serve larger
al. find rather small: a six instruction with abinaries,
single functioni.e., they first
called
that hada made
99% decrease
use of it in sites The
before. using low cryptomining among sites
numbers of cryptominers test whether WebAssembly is supported, and if it is, load a more
that had made use of it before. The low numbers of cryptominers test and an eight byte binary that only contains the WebAssembly
found by our analysis confirms this trend and shows its declining complex binary. For example,
found by our analysis confirms this trend and shows its declining magic number followed by theatlanguage
least 397version.
domainsWebsites
that serve a test
serving
influence on the WebAssembly ecosystem. binarytest
alsobinaries
serve the Hyphenopoly librarybinaries,
discussedi.e., next.
influence on the WebAssembly ecosystem. these often also serve larger they first
testHyphenopoly.
whether WebAssemblyAt least 462 is supported,
domains (34.9% andof if
allitdomains
is, load thata more use
Insight 9. We find WebAssembly-based cryptominers to have complex binary. serve
WebAssembly) For example,
binariesatthat leastare
397partdomains
of thethat serve a test
Hyphenopoly.js
significantly dropped in importance compared to the results of binary alsolibrary,
JavaScript serve the Hyphenopoly
which uses WebAssemblylibrary discussed
to implement next. its core
an earlier study [24]. This finding motivates security research to functionality. Hyphenopoly is a polyfill that “hyphenates
shift the focus from malicious WebAssembly to vulnerabilities in Hyphenopoly. At least 462 domains (34.9% of all18 domainstext thatifuse
the
user agent does not support CSS-hyphenation”.
WebAssembly) serve binaries that are part of the Hyphenopoly.js Our clustering
WebAssembly binaries. identifies 24 variants of this
JavaScript library, which usesbinary, which are
WebAssembly all generated
to implement its from
core
the same underlying
functionality. Hyphenopolylibraryistoa support
polyfill thatdifferent natural languages.
“hyphenates text if the
4.6 RQ4:
4.6 RQ4: Use
Use Cases
Cases on
on the
the Web
Web long.js. At least 33118domains (25% of
user64-bit
agent integer arithmetic
does not supportinCSS-hyphenation”. Our clustering
Given the
Given the decreased
decreased prevalence
prevalence of
of cryptominers,
cryptominers, we we study
study what
what all domains
identifies 24 that use WebAssembly)
variants serve are
of this binary, which a binary that is part
all generated fromof
other use cases WebAssembly has. The following focuses on the 19 The
other use cases WebAssembly has. The following focuses on the samea underlying
long.js,
the JavaScript library
library for 64-bit integer
to support computations.
different natural languages.
web because it is the most prominent target platform
web because it is the most prominent target platform of WebAs- of WebAs- library is commonly used in video players.
sembly and
and because
because websites
websites are
are complete
complete applications
applications that
that we
we can
can 64-bit integer arithmetic in long.js. At least 331 domains (25% of
sembly Draco library
all domains that for
use3D serveAta least
data compression.
WebAssembly) 25 that
binary domains
is part(1.8%
of
manually analyze with reasonable effort. We address the
manually analyze with reasonable effort. We address the question question
of all domains that use WebAssembly) serve a binary that
long.js, a JavaScript library for 64-bit integer computations. The belongs
19 to
in two
in two ways.
ways. First,
First, we
we study
study binaries
binaries that
that occur
occur across
across multiple
multiple
library is commonly used in video players.
18 https://github.com/mnater/Hyphenopoly
websites, which helps understand libraries and other widely reused
17 https://github.com/JayWalker512/CryptoNoter 19 https://github.com/dcodeIO/long.js
components (Section 4.6.1). Second, we inspect a random sample Draco library for 3D data compression. At least 25 domains (1.8%
of 100 unique binaries, which helps understand the application of all domains that use WebAssembly) serve a binary that belongs to
domains of WebAssembly (Section 4.6.2). To capture the full picture the Draco library, which support compressing and decompressing
of WebAssembly use cases, the results are on unfiltered binaries. 3D data.20 These binaries commonly occur on websites with 3D
demos or integrated 3D assets.
4.6.1 Binaries Found on Multiple Websites. Out of all 476 unique
binaries found on the web, 70 are reused across at least two differ- Insight 10. The most widely occurring binaries on the web are dy-
ent top-level domains. Figure 8 shows a histogram of how often namic tests for WebAssembly support and JavaScript-WebAssembly
ong sites rather small: a six instruction binary with a single function called
ominers test and an eight byte binary that only contains the WebAssembly
eclining magic number followed by the language version. Websites serving
these test binaries often also serve larger binaries, i.e., they first
test whether WebAssembly is supported, and if it is, load a more
to have complex binary. For example, at least 397 domains that serve a test
esults of binary alsoStudy
An Empirical serve
of the Hyphenopoly
Real-World library
WebAssembly discussed next.
Binaries WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
earch to
ilities in Hyphenopoly.
Table At least 462
4: Application domains
domains of(34.9%
100 of all domains
randomly that use
sampled,
WebAssembly) serve binaries that are part
unique WebAssembly binaries found on the web. of the Hyphenopoly.js Analysis. We perform a static analysis to assess two name-related
JavaScript library, which uses WebAssembly to implement its core characteristics of binaries. First, the analysis checks whether a bi-
Application domain # Binaries Application domain # Binaries
functionality. Hyphenopoly is a polyfill that “hyphenates text if the nary contains a names section. While by default binaries contain
user agent does not support25CSS-hyphenation”.
Games Online gambling18 Our clustering 2 names only for imported and exported program elements, the op-
dy what Text processing 11 Barcodes and QR codes
identifies 24 variants of this binary, which are all generated 2
from tional names section maps all function indices to identifiers. Second,
s on the Visualization
the / Animation
same underlying 11 support
library to Room planning
different / Furniture
natural 2
languages. the analysis checks whether the names of imported and exported
Media processing / Player 9 Blogging 2
WebAs- At least 331 domains names are minified. Both to save space and to obfuscate the code,
64-bite.g.,
Demo, integer
of a arithmetic in 7long.js.Cryptocurrency wallet (25% 2of
t we can allprogramming
domains that use WebAssembly)Regular serve expressions
a binary that is part 1of compilers may shorten names down to single or two-letter names
language
question Wasm atutorial
long.js, JavaScript
or testlibrary for5 64-bit integer computations.19 The
Hashing 1 devoid of information. The analysis considers a WebAssembly bi-
multiple Chat is commonly used in video
library 3 PDF viewer
players. 1 nary as minified if it contains more than ten import or export names
y reused (to exclude small, potentially hand-written modules), but the aver-
m sample Draco library for 3D data compression. At least 25 domains (1.8% age length of those names is ≤ 4 (to account for some imports that
plication of all domains
the Draco that which
library, use WebAssembly) serve a binary
support compressing that belongs to
and decompressing are never minified by Emscripten, thus increasing the average).
l picture the Draco 20library, which support compressing and decompressing
3D data.20 These binaries commonly occur on websites with 3D Results. Our results show an interesting difference between the
inaries. 3D data.
demos These binaries
or integrated commonly occur on websites with 3D
3D assets. full dataset and binaries found on the web. In the full data set, many
demos or integrated 3D assets.
6 unique binaries contain a names section (19.6%) but only 4.1% of the binaries
wo differ- Insight 10. The most widely occurring binaries on the web are dy- are minified. In contrast, among all binaries found on the web, only
ow often namic tests for WebAssembly support and JavaScript-WebAssembly 13.3% contain a names section but 28.8% are minified. These results
g-tail dis- libraries that perform computation-heavy tasks.
An Empirical Study of Real-World WebAssembly Binaries
show that for a significant fraction of websites, not only minified
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
ile many JavaScript code [37], but also minified WebAssembly binaries make
An Empirical Study of Real-World WebAssembly Binaries WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
st widely 4.6.2 Manual Inspection of a Random Sample. To better understand
18 https://github.com/mnater/Hyphenopoly
it harder to understand what code is running on the client side.
ns where 4.6.2long-tail
the Manual of Inspection
binaries foundof a Random
19 https://github.com/dcodeIO/long.js or onlyTo
on a fewSample. betterwebsite,
a single understand we Insight 12. Many WebAssembly binaries on the web (28.8%)
the
4.6.2
20
also long-tail
Manual
inspect aofrandom
binariessample
Inspection
https://github.com/google/draco
found on
of a Randoma few
of 100 or only
Sample.
unique Toa single website,
betterfound
binaries understand we
on the are minified
Insight 12. and
Manydo not contain useful
WebAssembly names.
binaries on To
thehelp
websecurity
(28.8%)
also
the inspect
web.long-tail a random
of binaries
We exclude binariessample
found of 100
on a few
detected unique
onlyorvia
onlybinaries found
thea WebAssembly on
single website, topthe
we analysts understand third-party code, future work
are minified and do not contain useful names. To help on decompiling
security
web.inspect
also
lists We exclude
(Section a3.3.2)binaries
random sample
to avoid detected
of 100only
biasing via the
unique
the results WebAssembly
binaries
toward found top
on the
pre-selected and reverse
analysts engineering
understand WebAssembly
third-party is needed.
code, future work on decompiling
lists (Section
web. We exclude
application 3.3.2)
domains. toBy avoid
binaries biasing only
detected
inspecting the binaries,
the results
via thetoward pre-selected
WebAssembly
the corresponding top and reverse engineering WebAssembly is needed.
application
lists (Section
websites, anddomains.
3.3.2)
how to theBy inspecting
avoid biasing
websites usesthethe
the binaries,
results thewe
toward
binaries, corresponding
pre-selected
identify the 5 RELATED WORK
websites,ofand
application
purpose 84 how
out ofthe
domains. thewebsites
By inspecting uses
100 binaries. thethe binaries,
binaries,
Table thewe identify
the the
corresponding
4 summarizes ap- 5 WebAssembly
RELATEDinWORK
general. WebAssembly has been formally de-
purpose
websites, of
and84 out
how of
the the 100
websites binaries.
uses Table
the 4 summarizes
binaries,
plication domains that the binaries are used in. The most common we thethe
identify ap- fined [10], including a mechanized proof ofhas thebeen soundness of de-
its
WebAssembly in general. WebAssembly formally
plicationof
purpose domains
84games, that
out of thethe100binaries
forare
binaries. used in.
Table ofThe
all most
4 summarizes common
theText
ap- type system [44]. Since the initial version
domains are accounting a quarter binaries. fined [10], including a mechanized proof of of the
the language,
soundnessseveral of its
domains domains
plication are games,that accounting
the binaries forare
a quarter in.of
used and Theallmost
binaries.
are Text
common language extensions have
processing and applications in visualization animation also type system [44]. Since thebeen proposed
initial version [6, 32, 33].
of the language, several
processing
domains areand applications
games, accounting in visualization
for a and
quarter
relatively common, with 11/100 binaries each. The remaining of animation
all areText
binaries. also
list Cryptomining.
language extensions Cryptojacking, i.e., websites
have been proposed [6, 32,that
33].use the unsus-
relatively
processing common,
and with
applications 11/100
in binaries
visualization each.
and
shows the diversity of application domains WebAssembly is used The remaining
animation are list
also pecting client’s computing
Cryptomining. resources
Cryptojacking, i.e., for mining
websites thatcryptocurrencies,
use the unsus-
shows
relatively
in, thecommon,
ranging diversity
from online of application
with 11/100 of
demos domains
binaries WebAssembly
each.
programming Thelanguages,
remainingis used
list
over has beenclient’s
pecting amongcomputing
the first applications
resources for of WebAssembly [15, 25, 34].
mining cryptocurrencies,
in, ranging
shows
support the from online
fordiversity
creating of
and demos of barcodes
application
scanning programming
domains languages,
WebAssembly
to document over
is used
viewers. Several
has beentechniques
among thedetect and defend against
first applications cryptojacking
of WebAssembly [15, [14, 43].
25, 34].
support
in, for from
ranging creating and demos
online scanning of barcodes
programming to document viewers.
languages, over Section 4.5 studies how prevalent this threat is,
Several techniques detect and defend against cryptojacking [14, 43]. showing that its
support for creating and scanning barcodes to document viewers. importance has decreased over time.this Wang et al.is,[43] discussthatlimita-
Insight 11. The application domains of WebAssembly binaries Section 4.5 studies how prevalent threat showing its
on the web
Insight 11. reflect the diversity
The application of theofweb
domains itself, showing
WebAssembly that
binaries tions of VirusTotal
importance in identifying
has decreased over time.cryptomining,
Wang et al. [43] which may impact
discuss limita-
WebAssembly
on the web reflect is usedtheindiversity
a wide rangeof theof web
applications.
itself, showing that the validity
tions of our results.
of VirusTotal However,
in identifying our manual which
cryptomining, inspectionmay of bina-
impact
WebAssembly is used in a wide range of applications. ries confirms the low prevalence of cryptominers
the validity of our results. However, our manual inspection of bina- among today’s
WebAssembly
ries confirms the binaries, which is also
low prevalence supported by Varlioglu
of cryptominers among today’s et al.’s
4.7 RQ5: Minification and Names observations about the which
declineisof cryptojacking
4.7 RQ5: Minification and Names WebAssembly binaries, also supported by[42]. Varlioglu et al.’s
As a binary format with only a low-level textual representation, WebAssembly
observations about attacks. Beyond
the decline cryptomining,[42].
of cryptojacking other kinds of at-
4.7 RQ5:format
As a binary
WebAssembly Minification
binaries only aand
withcannot as Names
below-level
easilytextual representation,
inspected and under-
WebAssembly binaries cannot tacks based on WebAssembly
WebAssembly attacks. Beyond exist. Lehmann et other
cryptomining, al. showkindsthatofvul-
at-
As
stooda binary
as sourceformat
code,with
e.g.,only abe as easily
low-level
in JavaScript. The inspected
textual and under-
abilityrepresentation,
to understand
stood as source code, e.g., in JavaScript. The ability to understand nerabilities that propagate from memory-unsafe
tacks based on WebAssembly exist. Lehmann et al. show that vul- source languages
WebAssembly
a WebAssembly binaries
binarycannot
is relevant be asforeasily inspected
auditing and under-
third-party code
a WebAssembly binary isinrelevant forexample,
auditing third-party code may also be that
nerabilities exploited
propagatein WebAssembly [18]. Otherssource
from memory-unsafe reportlanguages
examples
stood
and as source
reverse code,
engineeringe.g.,malware.
JavaScript.
For The abilitywhento aunderstand
frequently
and reverse engineering malware. For of such
may alsoattacks [3, 22].
be exploited in Custom memory
WebAssembly allocators
[18]. Others report are potentially
examples
adepended-upon
WebAssembly binary
npm is
package relevant
containsforexample,
aauditing when
WebAssembly a binary,
frequently
third-party code
the
depended-upon npm package contains a WebAssembly binary, the notsuch
of hardened
attacks[5, [3,22].
22].Section
Custom4.4 showsallocators
memory that several areof the risks
potentially
and reverse
package engineering
distribution malware.
platform mayFor example,
want to checkwhenthata it
frequently
does not
package distribution platform may want to check that it does21 not reported by prior work affect a wide range
not hardened [5, 22]. Section 4.4 shows that several of the of binaries. Another
risks
depended-upon
perform malicious npm package
actions, contains
such a WebAssembly
as stealing binary, the
cryptocurrency. Be-
perform maliciousnames,
actions, such asfunctions,
stealing cryptocurrency. 21 Be- line of attack
reported are malicious
by prior work affect WebAssembly
a wide range binaries,
of binaries.e.g., to escape
Another
package
cause distribution
meaningful platform
e.g., may
for want to check that itfor
are helpful does not
under-
cause meaningful the browser
line of attacksandbox [1, 36],WebAssembly
are malicious attacks that use side channels
binaries, e.g., to[9], and
escape
code [17],names, e.g., forbinaries,
asfunctions, are helpful for extent
under-
perform malicious actions, such stealing 21 Be-
standing especially in wecryptocurrency.
study to what attacks based on speculative execution
standing
cause code
meaningful [17], especially
names, in
e.g., for binaries, we
functions,names. study to what extent
are helpful for under- the browser sandbox [1, 36], attacks that[20].
use side channels [9], and
WebAssembly binaries provide meaningful
WebAssembly binaries provideinmeaningful WebAssembly
attacks defenses. A execution
based on speculative defense against[20]. application-level at-
standing code [17], especially binaries, wenames.
study to what extent
tacks is to enforce security
WebAssembly defenses. A defense againstpolicies on untrusted WebAssembly
application-level at-
WebAssembly
Analysis. Webinaries
performprovide
a static meaningful
20 https://github.com/google/draco names.
analysis to assess two name-related
21 https://blog.npmjs.org/post/185397814280/plot-to-steal-cryptocurrency-foiled-by-
binaries
tacks through
is to enforcetaint tracking
security [8, 40].
policies onWebAssembly
untrusted WebAssembly also serves
characteristics
Analysis. Weof binaries.
perform First,analysis
a static the analysis checks
to assess whether a bi-
two name-related as a technology
the-npm binaries throughto implement
taint trackingdefenses, e.g., to sandbox
[8, 40]. WebAssembly alsolibraries
serves
nary contains aofnames
characteristics section.
binaries. First,While by default
the analysis checksbinaries contain
whether a bi- executed in a browser [26], to implement formally verifiedlibraries
cryptog-
as a technology to implement defenses, e.g., to sandbox
names only for imported and exported program elements,
nary contains a names section. While by default binaries contain the op- raphy [30],
executed in or to ensure[26],
a browser constant-time
to implement operations
formallyfor cryptographic
verified cryptog-
tional names
names section
only for maps and
imported all function
exportedindices
programto identifiers.
elements, Second,
the op- primitives
raphy [30], [45].
or to Our
ensure results call for additional
constant-time operations mitigations, e.g., to
for cryptographic
the analysis
tional checks maps
names section whether the names
all function of imported
indices and exported
to identifiers. Second, defend against
primitives [45]. vulnerabilities
Our results callpropagated
for additional frommitigations,
source languages,
e.g., to
names
the are minified.
analysis Both to save
checks whether space and
the names to obfuscate
of imported the code,
and exported and provide guidelines for developing such techniques, e.g., by
defend against vulnerabilities propagated from source languages,
names are minified. Both to save space and to obfuscate thenames
compilers may shorten names down to single or two-letter code, showing which toolchains are most commonly
and provide guidelines for developing such techniques, e.g., by used.
devoid of information.
compilers may shorten The names analysis
down considers
to single ora WebAssembly
two-letter names bi-
Studieswhich
showing of WebAssembly.
toolchains are Musch
mostetcommonly
al. [24] systematically
used. collect
nary asof
devoid minified if it contains
information. more than
The analysis ten import
considers or export names
a WebAssembly bi-
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aaron Hilbig, Daniel Lehmann, and Michael Pradel
ACKNOWLEDGMENTS
This work was supported by the European Research Council (ERC,
grant agreement 851895), and by the German Research Foundation
within the ConcSys and Perf4JS projects.
An Empirical Study of Real-World WebAssembly Binaries WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
[29] Michael Pradel and Koushik Sen. 2015. The Good, the Bad, and the Ugly: An [40] Aron Szanto, Timothy Tamm, and Artidoro Pagnoni. 2018. Taint Tracking for
Empirical Study of Implicit Type Conversions in JavaScript. In ECOOP. WebAssembly. https://arxiv.org/abs/1807.08349. arXiv:1807.08349 [cs.CR]
[30] J. Protzenko, B. Beurdouche, D. Merigoux, and K. Bhargavan. 2019. Formally [41] Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. 2013. SoK: Eternal
Verified Cryptographic Web Applications in WebAssembly. In SP. War in Memory. In 2013 IEEE Symposium on Security and Privacy (SP 2013).
[31] Gregor Richards, Andreas Gal, Brendan Eich, and Jan Vitek. 2011. Automated [42] Said Varlioglu, Bilal Gonen, Murat Ozer, and Mehmet Bastug. 2020. Is Crypto-
construction of JavaScript benchmarks. In OOPSLA. 677–694. jacking Dead after Coinhive Shutdown?. In ICICT. IEEE, 385–389.
[32] Andreas Rossberg. 2019. Multiple per-module memories for Wasm. [43] Wenhao Wang, Benjamin Ferrell, Xiaoyang Xu, Kevin W Hamlen, and Shuang
[33] Andreas Rossberg. 2019. Proposal for adding basic reference types. Hao. 2018. Seismic: Secure in-lined script monitors for interrupting cryptojacks.
[34] Jan Rüth, Torsten Zimmermann, Konrad Wolsing, and Oliver Hohlfeld. 2018. In European Symposium on Research in Computer Security. Springer, 122–142.
Digging into Browser-based Crypto Mining. In IMC. [44] Conrad Watt. 2018. Mechanising and verifying the WebAssembly specification.
[35] Marija Selakovic and Michael Pradel. 2016. Performance Issues and Optimizations In CPP 2018. 53–65.
in JavaScript: An Empirical Study. In ICSE. [45] Conrad Watt, John Renner, Natalie Popescu, Sunjay Cauligi, and Deian Stefan.
[36] Natalie Silvanovich. 2018. The Problems and Promise of WebAssembly. 2019. CT-Wasm: Type-Driven Secure Cryptography for the Web Ecosystem.
[37] Philippe Skolka, Cristian-Alexandru Staicu, and Michael Pradel. 2019. Anything POPL (2019).
to Hide? Studying Minified and Obfuscated Code in the Web. In WWW. [46] Hui Xu, Zhuangbin Chen, Mingshen Sun, and Yangfan Zhou. 2020. Memory-
[38] Sooel Son and Vitaly Shmatikov. 2013. The Postman Always Rings Twice: At- Safety Challenge Considered Solved? An Empirical Study with All Rust CVEs.
tacking and Defending postMessage in HTML5 Websites.. In NDSS. arXiv preprint arXiv:2003.03296 (2020).
[39] Cristian-Alexandru Staicu and Michael Pradel. 2018. Freezing the Web: A Study
of ReDoS Vulnerabilities in JavaScript-based Web Servers. In USENIX Sec.