arXiv:1911.00621v1 [cs.CR] 2 Nov 2019

Abstract—Fuzzing technologies have evolved at a fast pace in recent years, revealing bugs in programs with ever increasing depth and speed. Applications working with complex formats are however more difficult to take on, as inputs need to meet certain format-specific characteristics to get through the initial parsing stage and reach deeper behaviors of the program.

Unlike prior proposals based on manually written format specifications, in this paper we present a technique to automatically generate and mutate inputs for unknown chunk-based binary formats. We propose a technique to identify dependencies between input bytes and comparison instructions, and later use them to assign tags that characterize the processing logic of the program. Tags become the building block for structure-aware mutations involving chunks and fields of the input.

We show that our technique performs comparably to structure-aware fuzzing proposals that require human assistance. Our prototype implementation WEIZZ revealed 11 unknown bugs in widely used programs.

1. Introduction

Recent years have witnessed a spike of activity in the development of efficient techniques for fuzz testing, also known as fuzzing, from both academic and industrial actors. In particular, the coverage-based grey-box fuzzing (CGF) approach has proven to be very effective for finding bugs often indicative of weaknesses from a security standpoint. The availability of the AFL fuzzing framework [1] paved the way to a large body of works proposing not only more efficient implementations, but also new techniques to deal with common fuzzing roadblocks represented by magic numbers and checksum tests in programs. Nonetheless, there are still several popular application scenarios that make hard cases even for the most advanced CGF fuzzers.

CGF fuzzers operate by mutating inputs at the granularity of their bit and byte representations, deeming mutated inputs interesting when they lead execution to explore new program portions. While this approach works very well for compact and unstructured inputs [2], it can lose efficacy when dealing with highly structured inputs that must conform to some grammar or other type of specification.

Intuitively, "undisciplined" mutations make a fuzzer spend important time generating many inputs that the initial parsing stages of such a program typically reject, resulting in little to no code coverage improvement. Researchers thus have added a user-supplied specification to the picture to produce and prioritize meaningful inputs: enhanced CGF embodiments of this kind are available for both grammar-based [2], [3] and chunk-based [4] formats.

The shortcomings of this approach are readily apparent. Applications typically do not come with format specifications suitable to this end. Asking users to write one is at odds with a driving factor behind the success of CGF fuzzers, that is, they work with a minimum amount of prior knowledge [5]. Such a request can be expensive, and hardly applies to security contexts where users deal with proprietary or undocumented formats [6].

The second limitation is that testing only inputs that perfectly adhere to the specification would miss imprecisions in the implementation, while inputs that are to some degree outside it may instead exercise them [6]. In our experience such bugs are present also in applications that operate on multiple input formats and share code for their processing.

Recently Blazytko et al. [6] have shown in the context of grammar-based formats how to extend the grey-box fuzzing process to resemble the workings of a grammar-based solution without requiring the user to supply one. In a similar vein, in our work we explore automatic fuzzing of chunk-based structures found in many binary formats [4], [7].

Our approach. The main feature of our proposal can be summarized as: we attempt to learn what the input structure may look like based on how the program handles its bytes.

Nuances of this idea are present in [6], where code coverage guides grammar structure inference, and to some extent also in a few general-purpose fuzzing additions. For instance, taint tracking can reveal which input bytes take part in comparisons against magic sequences [8], [9]. Similarly, input-to-state correspondence [5] can also identify checksum fields using values observed as comparison operands.

We build on the intuition that among comparisons operating on input-derived data, we can deem some as tightly coupled to a single portion of the input structure. We also experimentally observed that the order in which a program exercises these checks can reveal structural details such as the location for the type specifier of a chunk. We tag input bytes with the most representative comparison operations for their processing, and devise techniques to infer plausible boundaries for chunks and fields. Thanks to this automatic pipeline we can apply structure-aware mutations for chunk-based grey-box fuzzing [4] without the need for a user-supplied format specification.

We start by determining candidate instructions that we can consider relevant with respect to an input byte. Instead of resorting to taint tracking, we flip bits in every input byte, execute the program, and build a dependency vector
for operands at comparison instruction sites. We analyze dependencies to identify input-to-state relationships and roadblocks, and to assign tags to input bytes. Tag assignment builds on spatial and temporal properties: we leverage the fact that a program typically uses distinct instructions in the form of comparisons to parse distinct items; to break ties we prioritize older instructions, since format validation normally happens early in the execution. Tags drive the inference process of an approximate chunk-based structure of the input, enabling subsequent structure-aware mutations.

Unlike scenarios for reconstructing specifications of input formats [7], [10], [11], experimental results suggest that, as observed for grammar-based formats [6], this inference process does not have to be fully accurate: fuzzing can deal with some amount of noise and imprecision. Another important consideration is that a specification does not prescribe what its implementation should look like. Developers can share code among functionalities, introducing subtle bugs, but also devise optimized variants of one functionality, for instance specialized on value ranges of some relevant input field. As we assign tags and attempt inference over one input instance at a time, we have a chance to identify and exercise such patterns when aiming for code coverage improvements.

The contributions of this work can be summarized as:
• a dependency identification technique embedded in the deterministic mutation stage of grey-box fuzzing;
• a tag assignment mechanism that uses dependencies to overcome roadblocks and to back a structure inference scheme for chunk-based formats;
• a prototype implementation of the approach on top of QEMU and AFL that we call WEIZZ;
• an experimental investigation where WEIZZ beats or matches a chunk-based CGF proposal that accesses a specification, and outperforms general-purpose fuzzers over the applications we considered;
• a discussion of some real-world bugs we discovered.

To facilitate analyses and extensions of our approach, we will make WEIZZ available as open source.

2. State of the Art

The last years have seen a large body of fuzzing-related works [12]. Grey-box fuzzers have gradually replaced initial black-box fuzzing proposals, where mutation decisions happen without taking into account their impact on the execution path [13]. Coverage-based grey-box fuzzing (CGF) uses lightweight instrumentation to measure code coverage, which discriminates whether mutated inputs are sufficiently "interesting" by looking at control-flow changes.

CGF is very effective in discovering bugs in real software [4], yet it may struggle in the presence of roadblocks (like checksums and magic numbers) [5], or when mutating structured grammar-based [6] or chunk-based [4] input formats. In the following we describe the workings of the popular CGF engine that WEIZZ builds upon, and recent proposals that cope with the challenges listed above. We conclude by discussing where WEIZZ fits in this landscape.

2.1. Coverage-based Fuzzing

American Fuzzy Lop (AFL) [1] is a very well-known CGF fuzzer, and several research works have built on it to improve its effectiveness. To build new inputs, AFL draws from a queue of initial user-supplied seeds and previously generated inputs, and runs two mutation stages. Both the deterministic and the havoc nondeterministic stage look for coverage improvements and crashes. AFL adds to the queue inputs that improve the coverage of the program, and reports crashing inputs to users as proofs for bugs.

Coverage. AFL adds instrumentation to a binary program to intercept when a branch is hit during the execution. To efficiently track hit counts, it uses a coverage map of 2^16 entries and associates a counter with the hash of the branch, built using its source and destination basic block addresses. AFL considers an input interesting when the execution path reaches a branch that yields a previously unseen hit count. To limit the number of monitored inputs and discard likely similar executions, hit counts undergo a normalization to power-of-two buckets before lookup.

Mutations. In the deterministic stage AFL sequentially scans the input, applying a set of mutations at each position. These include bit or byte flipping, arithmetic increments and decrements, and substitution with common constants (e.g., 0, -1, MAX_INT) or values from a user-supplied dictionary. AFL tests each mutation in isolation by executing the program on the derived input and inspecting the coverage, then it reverts the change and moves to the next transformation.

In the havoc stage AFL applies a nondeterministic sequence (or stack) of mutations before attempting execution. Mutations come in a random number (1 to 256) and consist in flips, increments, decrements, deletions, etc. at a random position in the input. AFL attempts alternate mutation sequences on the currently selected input: a power scheduler chooses the number of inputs to generate at this stage based on the program running time and the size of the coverage trace generated over the single input.

2.2. Roadblocks

In the fuzzing jargon, roadblocks are comparison patterns over the input that are intrinsically hard to overcome through blind mutations: the two most common embodiments are magic numbers, often found for instance in header fields, and checksums, typically used to verify data integrity. In the Appendix we report examples of both from the chunk parsing code of a PNG processing software (Section A.1).

Format-specific dictionaries may help with magic numbers, yet the fuzzer has to figure out where to place such values. Researchers thus came up with a few approaches to handle magic numbers and checksums in grey-box fuzzers.

Sub-instruction Profiling. While there is no obvious way to figure out how a large amount of logic can be encoded in a single comparison, one could break multi-byte comparisons into single-byte [14] (or even single-bit [15]) checks to better track progress when attempting to match constant values. LAF-INTEL [14] and COMPARECOVERAGE [16] are compiler extensions to produce binaries for
such fuzzing. HONGGFUZZ [17] implements this technique for fuzzing programs when the source is available, while AFL++ [18] can automatically transform binaries during fuzzing. STEELIX [19] resorts to static analysis to filter out likely uninteresting comparisons from profiling. Overall, sub-instruction profiling is valuable to overcome tests against magic numbers, but ineffective with checksums.

Taint Analysis. Dynamic taint analysis (DTA) [20] can track when and which input parts affect program instructions. VUZZER [8] uses DTA on binary programs to bypass branches that check magic numbers, identified as comparison instructions where one operand is input-dependent and the other is constant: VUZZER places the constant value in the input portion that propagates directly to the former operand. ANGORA [9] brings two technical improvements: it can identify magic numbers that are not contiguous in the input thanks to a multi-byte form of DTA, and uses gradient descent to mutate tainted input bytes efficiently. It however requires compiler transformations to enable such mutations.

Symbolic Execution. While effective for magic bytes, DTA techniques can at best identify checksum tests, but not provide sufficient information to address them. For this reason several proposals (e.g., [21], [22], [23], [24]) use symbolic execution [25] to identify and try to solve complex input constraints involved in generic roadblocks. TAINTSCOPE [23] proposes a methodology to identify possible checksums with DTA, patch them away for the sake of fuzzing, and later try to repair inputs with symbolic execution. The technique requires the user to provide specific seeds and is subject to false positives [21], [24].

Yet input constraints may be just too hard or long to solve, leading to timeouts without any actual progress. T-FUZZ [24] divides checks into two categories: non-critical, when they involve magic numbers, and critical, when they describe checksums, suggesting symbolic execution to deal with the first kind, and manual analysis for the second. DRILLER [21] follows a different design, using symbolic execution as an alternate strategy to explore inputs, switching to it when fuzzing exhausts a time budget obtaining no coverage improvement. QSYM [22] adopts a similar scheme, choosing concolic execution implemented via dynamic binary instrumentation [26] to trade exhaustiveness for speed.

Approximate Analyses. A recent trend is to explore solutions that approximately extract the information that DTA or symbolic execution can bring without incurring their overheads. REDQUEEN [5] builds on the observation that often input bytes flow directly, or after simple encodings (e.g., swaps for endianness), into instruction operands. The authors show how to use this input-to-state correspondence instead of DTA and symbolic execution to deal with magic bytes, multi-byte compares, and checksums. REDQUEEN in particular approximates DTA by changing input bytes with random values (colorization) to see which comparison operands change, suggesting a dependency. When both operands of a comparison change, but only for one there is an input-to-state relationship, REDQUEEN deems it a likely checksum test. It then patches the operation, and later attempts to repair the input with educated guesses consisting in mutations of the observed comparison operand values. A topological sort of patches addresses nested checksums.

In the context of concolic fuzzers, ECLIPSER [27] relaxes the path constraints over each input byte. It runs a program multiple times, mutating individual input bytes to identify the affected branches. Then it collects constraints resulting from them, considering however only linear and monotone relationships, as other constraint types would likely require a full-fledged SMT solver. ECLIPSER then selects one of the branches and flips its constraints to generate a new input, mimicking dynamic symbolic execution [25].

2.3. Format-aware Fuzzing

Classic CGF techniques lose part of their efficacy when dealing with structured input formats found in files. As mutations happen on bit-level representations of inputs, they can hardly bring the structural changes required to explore new compartments of the data processing logic of an application. Bringing format awareness into play can boost CGF in this setting. In the literature we can distinguish techniques targeting either grammar-based formats, where inputs comply with a language grammar, or chunk-based ones, where inputs follow a tree organization with data chunks resembling a C structure to form individual nodes. The latter category seems prevalent in real-world software [4].

Grammar-based Fuzzing. LANGFUZZ [28] is a black-box fuzzer that, using a Javascript grammar specification, generates valid inputs for a language interpreter by combining sample code fragments (taken, e.g., from bug reports) and test cases. NAUTILUS [3] and SUPERION [2] are recent grey-box fuzzer proposals that can effectively test language interpreters without requiring a large corpus of valid inputs or fragments, but only an ANTLR grammar file.

GRIMOIRE [6] removes the grammar specification requirement. Based on REDQUEEN, it identifies fragments from an initial set of inputs that trigger new coverage, and strips them of parts that would not cause a coverage loss. GRIMOIRE notes information for such gaps, and later attempts to recursively splice in parts seen in other positions, thus mimicking grammar combinations.

Chunk-based Fuzzing. In the context of black-box techniques, SPIKE [29] lets users describe the network protocol in use for an application to improve the chance of finding bugs. PEACH [30] generalizes this idea, applying format-aware mutations on an initial set of valid inputs using a user-defined input specification dubbed peach pit. As they are input-aware, some literature calls such black-box fuzzers smart [4]. One could argue that a smart grey-box variant would outperform them, as due to the lack of feedback (as in explored code) they do not keep mutating interesting inputs.

AFLSMART [4] validates this speculation by adding smart (or high-order) mutations to AFL that add, delete, and splice chunks in an input. Using a peach pit, AFLSMART maintains a virtual structure of the current input, represented conceptually by a tree whose internal nodes are chunks and leaves are attributes. Chunks are characterized by initial and end position in the input, a format-specific
TABLE 1: Comparison with related approaches.

Fuzzer      Binary  Magic bytes  Checksums  Chunk-based  Grammar-based  Automatic
AFL++         ✓         ✓            ✗          ✗             ✗             ✓
ANGORA        ✗         ✓            ✗          ✗             ✗             ✓
ECLIPSER      ✓         ✓            ✗          ✗             ✗             ✓
REDQUEEN      ✓         ✓            ✓          ✗             ✗             ✓
STEELIX       ✓         ✓            ✗          ✗             ✗             ✓
NAUTILUS      ✗         ✓            ✗          ✗             ✓             ✗
SUPERION      ✓         ✓            ✗          ✗             ✓             ✗
GRIMOIRE      ✓         ✓            ✓          ✗             ✓             ✓
AFLSMART      ✓         ✓            ✗          ✓             ✗             ✗
WEIZZ         ✓         ✓            ✓          ✓             ✗             ✓
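The virtual chunk structure and the smart mutations described above for AFLSMART can be sketched as follows. This is our own illustration, not AFLSMART's actual code; names such as Chunk and ctype are hypothetical.

```python
# A minimal sketch of a "virtual structure" for chunk-based smart mutations:
# a tree of chunks, each with start/end offsets in the input and a type.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:
    start: int   # inclusive offset of the chunk in the input
    end: int     # exclusive offset of the chunk in the input
    ctype: str   # format-specific chunk type (hypothetical field name)
    children: List["Chunk"] = field(default_factory=list)

def delete_chunk(data: bytearray, chunk: Chunk) -> bytearray:
    """Smart 'delete' mutation: remove a whole chunk from the input."""
    return data[:chunk.start] + data[chunk.end:]

def splice_chunk(data: bytearray, dst: Chunk, donor: bytes) -> bytearray:
    """Smart 'splice' mutation: replace a chunk's bytes with a donor chunk."""
    return data[:dst.start] + donor + data[dst.end:]
```

A structure-aware deletion removes a whole subtree worth of bytes at once, a change that byte-level havoc mutations are unlikely to produce in one step.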
3.1.3. Tag Placement. WEIZZ succinctly describes dependency information between input bytes and performed comparisons by annotating such bytes with tags. Later steps like input repairing for checksums and structural information inference build heavily on the availability of such tags.

For the b-th input byte, structure Tags[b] stores as fields:
• id: (the address of) the comparison instruction chosen as most significant among those b influences;
• ts: the timestamp of the id instruction when first met;
• parent: the comparison instruction that led WEIZZ to the most recent tag assignment prior to Tags[b];
• dependsOn: when b contains a checksum value, the comparison instruction for the innermost nested checksum that verifies the integrity of b, if any;
• flags: stores the operand affected by b, if it is I2S, or if it is part of a checksum field;
• numDeps: the number of input bytes that the operand in flags depends on.

The tag assignment process relies on spatial and temporal locality in how comparison instructions get executed. WEIZZ tries to infer structural properties of the input based on the intuition that a program typically uses distinct instructions to process structurally distinct items. We can thus tag input bytes as related when they reach one same comparison directly or through derived data. When more candidates are found, we use temporal information to prioritize instructions with a lower timestamp, as input format validation normally takes place in parsing code from the early stages of the execution. As we elaborate later, we extend this scheme to account for checksums and to reassign tags when heuristics spot a likely better candidate than the current choice.

Procedure PLACETAGS (Algorithm 5 in the Appendix) iterates over comparison sites sorted by when first met in

[Figure: tag assignment example over six consecutive input bytes, with id row B A A A A D, ts row 1 5 5 5 5 16, and start/end markers delimiting the inferred field.]

function FIELDMUTATION(Tags, I):
1  b ← pick u.a.r. from {0 ... len(I)}
2  for i ∈ {b ... len(I)-1} do
3    if Tags[i].id == 0 then continue
4    start ← GOLEFTWHILESAMETAG(Tags, i)
5    with probability Pr_I2S, or if !I2S(Tags[start]) then
6      end ← FINDFIELDEND(Tags, start, 0)
7      I ← MUTATE(I, start, end); break
8  return I
Algorithm 3: Field identification and mutation.

3.1.4. Checksum Validation and Input Repair. The last conceptual step of the surgical stage involves checksum validation and input repair with respect to checksums patched in the program during previous iterations. In the process WEIZZ also discards false positives among such checksums.

Procedure FIXCHECKSUMS (Algorithm 6 in the Appendix) filters tags marked as involved in checksum fields for which a patch is active, and processes them in the order described in the previous section. WEIZZ initially executes the patched program, obtaining a comparison table CT and the hash h of the coverage map as execution path footprint. Then it iterates over each filtered tag as follows. It first extracts the input-derived operand value of the comparison instruction (i.e., the checksum computed by the program) and where in the input the contents of the other operand (which is I2S, see Section 3.1.2) are located, then it replaces the input bytes with the expected value (lines 4-5). WEIZZ makes a new instrumented execution over the repaired input, disabling the patch. If it handled the checksum correctly, the resulting hash h′ will be identical to h, implying no execution divergence happened; otherwise the checksum is a false positive and we have to bail out.

After processing all the tags, we may still have patches left in the program for checksums not reflected by tags. WEIZZ may in fact erroneously treat a comparison instruction as a checksum due to inaccurate dependency information. WEIZZ removes all such patches, and if after that the footprint for the execution does not change, the input repair process succeeded. This happens typically with patches for branch outcomes that turn out to be already "implied" by the repaired tag-dependent checksums.

Otherwise WEIZZ tries to determine which patch not reflected by tags may individually cause a divergence: it tries patches one at a time and compares footprints to spot offending ones. This happens when WEIZZ initially finds a checksum for an input and the analysis of a different input reveals an overfitting, so we mark it as a false positive.

At the end of the process WEIZZ re-applies patches that were not a false positive: this will benefit both the second stage and future surgical iterations over other inputs. WEIZZ then also applies patches for checksums newly discovered by MARKCHECKSUMS in the current surgical stage, so when WEIZZ analyzes the input again (or similar ones from FUZZOPERAND) it will be able to overcome them.
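The field-boundary logic behind Algorithm 3 can be sketched as below. This is our own minimal model, with helper names mirroring GoLeftWhileSameTag and FindFieldEnd; the Tag class is a simplification of the Tags[b] structure described above.

```python
# Sketch: bytes tagged with the same comparison id are treated as one field,
# so field boundaries fall where the id changes between adjacent bytes.
from dataclasses import dataclass

@dataclass
class Tag:
    id: int   # comparison instruction chosen for this byte (0 = untagged)
    ts: int   # timestamp when that comparison was first met

def go_left_while_same_tag(tags, i):
    """Walk left from position i while the tag id stays the same."""
    while i > 0 and tags[i - 1].id == tags[i].id:
        i -= 1
    return i

def find_field_end(tags, start):
    """Walk right from start until the tag id changes (inclusive end)."""
    end = start
    while end + 1 < len(tags) and tags[end + 1].id == tags[start].id:
        end += 1
    return end
```

For instance, with ids [7, 7, 9, 9, 9, 3] a position inside the run of 9s resolves to the field spanning bytes 2 through 4, which a structure-aware mutation can then rewrite as a unit.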
We measured the code coverage achieved by: (a) AFLSMART with stacked mutations but without deterministic stages, as suggested by its documentation, (b) WEIZZ in its standard configuration, and (c) a variant WEIZZ† where we disable FUZZOPERANDS, checksum patching, and input repairing. WEIZZ† allows us to focus on the effects of tag-based mutations, without benefits from our roadblock bypassing techniques. We use AFL test cases as seeds, except for wavpack and ffmpeg, where we use minimal syntactically valid files [33].

Figure 5: Basic block coverage over time (5 hours). [Plot residue omitted; panels include decompress, djpeg, libpng, and ffmpeg, with series for WEIZZ, AFLSMART, and WEIZZ†.]

Figure 5 plots the median basic block coverage after 5 hours. Compared to AFLSMART, WEIZZ brings appreciably higher coverage on 3 out of 6 subjects (readelf, libpng, ffmpeg), slightly higher on wavpack, and comparable on decompress and djpeg. In discussing these results, we will first consider where WEIZZ† lies, and then discuss the benefits from the additional features of WEIZZ.

AFLSMART mutations are driven by a format specification, while we rely on how a program handles the input bytes. Hence, WEIZZ† can reveal behaviors that characterize the actual implementation and that may not necessarily be anticipated by the specification. The first implication is that WEIZZ† mutates only portions that the program has already processed in the execution, as tags derive from executed comparisons. The second is that imprecision in the inference of structural facts may actually benefit WEIZZ†. The authors of AFLSMART acknowledge the importance of relaxed specifications to expose imprecise implementations [4]: we will return to this in Section 6.

When we consider the techniques disabled in WEIZZ†, WEIZZ brings higher coverage for two reasons. I2S facts help WEIZZ replace magic bytes only in the right spots, unlike dictionaries, which can also be incomplete. This turns out important also in multi-format applications like ffmpeg. Then, checksum bypassing is crucial to generate valid inputs for programs that check data integrity, such as libpng, which computes CRC-32 values over data.

Table 3 compares WEIZZ with the best alternative for code coverage at the end of the experiment. Interestingly, WEIZZ† is a better alternative than AFLSMART for 3 subjects. We also consider the collected crashes, found only for wavpack and ffmpeg. In the first case, the three fuzzers found the same bugs. For ffmpeg, WEIZZ found a bug from a division by zero in a code section handling multiple formats: we reported the bug and its developers promptly fixed it. The I2S-related features of WEIZZ turned out very effective to generate inputs that deviate significantly from the initial seed, e.g., mutating an AVI file into an MPEG-4 one. In Section 5.2 we will back this claim with a few case studies.

5.1.2. Grey-box Fuzzers. For this experiment we analyze the bottom 8 programs from Table 2. These are popular open-source applications handling chunk-based formats that the community has heavily tested (e.g., by the OSS-Fuzz project for continuous fuzzing [34]), and were used in past evaluations of state-of-the-art fuzzers like [4], [5], [8], [19].

We tested the following fuzzers: (a) AFL 2.53b in QEMU mode, (b) AFL++ 2.54c in QEMU mode enabling the CompareCoverage feature, and (c) ECLIPSER with its latest release when we ran our tests (commit 8a00591) and its default configuration. As WEIZZ targets binary programs, we rule out fuzzers like ANGORA that require source instrumentation.
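The checksum repair step can be illustrated with PNG, whose chunks carry a CRC-32 computed over the chunk type and data (libpng is the case mentioned above). The sketch below is ours, not WEIZZ's code: it shows the effect of repairing a stale checksum field, not how WEIZZ locates it.

```python
# Sketch: recompute a PNG-style chunk CRC-32 and write the expected value
# back into the input, so that integrity validation no longer rejects it.
import struct
import zlib

def repair_png_chunk_crc(data: bytearray, chunk_off: int) -> bytearray:
    """chunk_off points at the 4-byte big-endian length field of a chunk."""
    length = struct.unpack_from(">I", data, chunk_off)[0]
    # The CRC covers the 4-byte type plus the chunk data.
    body = bytes(data[chunk_off + 4 : chunk_off + 8 + length])
    crc = zlib.crc32(body) & 0xFFFFFFFF
    # Overwrite the 4-byte CRC field that follows the data.
    struct.pack_into(">I", data, chunk_off + 8 + length, crc)
    return data
```

A mutated chunk body invalidates the stored CRC; rewriting the field with the recomputed value lets the input survive the integrity check and reach deeper code.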
We compare the fuzzers in terms of coverage over time. Consistently with previous evaluations [27], we use as initial seed a string made of the ASCII character "0" repeated 72 times. However, for libpng and tcpdump we fall back to a valid initial input, as no fuzzer (excluding WEIZZ on libpng) achieved significant coverage with artificial seeds. We also provide AFL with format-specific dictionaries to aid it with magic numbers.

[Figure 6: median basic block coverage over time per target (djpeg, libpng, objdump, mpg321, oggdec, readelf, tcpdump, gif2rgb); plot residue omitted.]

Figure 6 plots the median basic block coverage over time. WEIZZ on 5 out of 8 targets (libpng, oggdec, tcpdump, objdump, and readelf) achieves significantly higher code coverage than other fuzzers. The first three in the list above process formats with checksum fields that only WEIZZ repairs, although ECLIPSER appears to catch up over time for oggdec. The structural mutation capabilities combined with this factor may explain the gap between WEIZZ and other fuzzers. For objdump and readelf, given how REDQUEEN outperforms other fuzzers on them in its evaluation [5], we may speculate that I2S facts are boosting the work of WEIZZ.

On mpg321 and gif2rgb, AFL-based fuzzers perform very similarly, with a large margin over ECLIPSER, confirming that the AFL standard mutations can be effective on some chunk-based formats. Finally, WEIZZ leads for djpeg, but the other fuzzers are not too far from it.

TABLE 4: WEIZZ vs. best alternative for targets of Figure 6. Columns have the same meaning used in Table 3.

PROGRAM   BEST ALTERNATIVE   BB % W.R.T. BEST   CRASHES FOUND BY
djpeg     AFL                105%               none
libpng    AFL                180%               none
objdump   AFL++              120%               WEIZZ
mpg321    AFL++              100%               WEIZZ, AFL++
oggdec    ECLIPSER           108%               none
readelf   AFL++              175%               none
tcpdump   ECLIPSER           117%               none
gif2rgb   AFL                100%               none

Table 4 compares WEIZZ with the best alternative when the budget expires. Interestingly, AFL is the best alternative for 3 programs. When taking crashes into account, WEIZZ and AFL++ generated the same crashing inputs for mpg321, while only WEIZZ revealed a crash for objdump.

5.2. RQ2: Zero-day Bugs
[10] M. Höschele and A. Zeller, “Mining input grammars from [26] D. C. D’Elia, E. Coppa, S. Nicchi, F. Palmaro, and L. Cavallaro,
dynamic taints,” in Proceedings of the 31st IEEE/ACM International “Sok: Using dynamic binary instrumentation for security (and
Conference on Automated Software Engineering, ser. ASE 2016, how you may get caught red handed),” in Proceedings of the
2016, pp. 720–725. [Online]. Available: http://doi.acm.org/10.1145/ 2019 ACM Asia Conference on Computer and Communications
2970276.2970321 Security, ser. Asia CCS ’19, 2019, pp. 15–27. [Online]. Available:
http://doi.acm.org/10.1145/3321705.3329819
[11] M. Höschele and A. Zeller, “Mining input grammars with
[27] J. Choi, J. Jang, C. Han, and S. K. Cha, “Grey-box concolic testing
AUTOGRAM,” in Proceedings of the 39th International Conference
on binary code,” in Proceedings of the 41st International Conference
on Software Engineering Companion, ser. ICSE-C ’17, 2017, pp.
on Software Engineering, ser. ICSE ’19, 2019, pp. 736–747.
31–34. [Online]. Available: https://doi.org/10.1109/ICSE-C.2017.14
[Online]. Available: https://doi.org/10.1109/ICSE.2019.00082
[12] A. Zeller, R. Gopinath, M. Böhme, G. Fraser, and C. Holler, “The [28] C. Holler, K. Herzig, and A. Zeller, “Fuzzing with code fragments,”
Fuzzing Book,” https://www.fuzzingbook.org/, 2019, [Online; ac- in Proceedings of the 21st USENIX Conference on Security
cessed 10-Sep-2019]. Symposium, ser. SEC’12, 2012, pp. 38–38. [Online]. Available:
[13] A. Takanen, J. D. Demott, and C. Miller, Fuzzing for Software http://dl.acm.org/citation.cfm?id=2362793.2362831
Security Testing and Quality Assurance, 2nd ed. Norwood, MA, [29] D. Aitel, “The Advantages of Block-Based Protocol Analy-
USA: Artech House, Inc., 2018. sis for Security Testing,” https://www.immunitysec.com/downloads/
[14] “Circumventing Fuzzing Roadblocks with Compiler advantages_of_block_based_analysis.html, 2002, [Online; accessed
Transformations,” https://lafintel.wordpress.com/2016/08/15/ 10-Sep-2019].
circumventing-fuzzing-roadblocks-with-compiler-transformations/, [30] M. Eddington, “Peach fuzzing platform,” http://community.
2016, [Online; accessed 10-Sep-2019]. peachfuzzer.com/WhatIsPeach.html, [Online; accessed 10-Sep-
2019].
[15] T. Ormandy, “Making Software Dumberer,” http://taviso.decsystem.
org/making_software_dumber.pdf, 2009, [Online; accessed 10-Sep- [31] “libprotobuf-mutator,” https://github.com/google/libprotobuf-mutator,
2019]. [Online; accessed 10-Sep-2019].
[32] D. C. D’Elia, C. Demetrescu, and I. Finocchi, “Mining hot calling contexts in small space,” Software: Practice and Experience, vol. 46, pp. 1131–1152, 2016. [Online]. Available: https://doi.org/10.1002/spe.2348
[33] M. Bynens, “Smallest possible syntactically valid files of different types,” https://github.com/mathiasbynens/small, 2019, [Online; accessed 10-Sep-2019].
[34] “Google OSS-Fuzz: continuous fuzzing of open source software,” https://github.com/google/oss-fuzz, 2019, [Online; accessed 10-Sep-2019].
[35] Blackngel, “MALLOC DES-MALEFICARUM,” http://phrack.org/issues/66/10.html, 2019, [Online; accessed 10-Sep-2019].
[36] B. Yadegari and S. Debray, “Bit-level taint analysis,” in 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation (SCAM), 2014, pp. 255–264. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/SCAM.2014.43
[37] B. Yadegari, B. Johannesmeyer, B. Whitely, and S. Debray, “A generic approach to automatic deobfuscation of executable code,” in 2015 IEEE Symposium on Security and Privacy (SP), 2015, pp. 674–691. [Online]. Available: http://dx.doi.org/10.1109/SP.2015.47

A. Appendix

A.1 Roadblocks from lodepng

As an example of roadblocks found in real-world software, the following excerpt from the lodepng image encoder and decoder looks for the presence of the magic string IHDR denoting the namesake PNG chunk type and checks the integrity of its data against a CRC-32 checksum:

    if (!lodepng_chunk_type_equals(in + 8, "IHDR")) {
      /* error: it doesn't start with an IHDR chunk! */
      CERROR_RETURN_ERROR(state->error, 29);
    }
    ...
    unsigned CRC = lodepng_read32bitInt(&in[29]);
    unsigned checksum = lodepng_crc32(&in[12], 17);
    if (CRC != checksum) {
      /* invalid CRC */
      CERROR_RETURN_ERROR(state->error, 57);
    }

A.2 Additional Algorithms

Tag Placement. Algorithm 5 provides pseudocode for the PLACETAGS procedure, while we refer the reader to Section 3.1.3 for a discussion of its main steps. An interesting detail presented here is related to the condition n′ > 4 when computing the reassign flag. This results from experimental observations, with 4 being the most common field size in the formats we analyzed. Hence, WEIZZ will choose as characterizing instruction for a byte the first comparison that depends on no more than 4 bytes, ruling out possibly “shorter” later comparisons as they may be accidental.

function PLACETAGS(Tags, Deps, s, CT, CI, R, o, I):
 1  for b ∈ {0 ... len(I)-1}, op ∈ {op1, op2} do
 2    if ¬ ⋁_j Deps(s, j, op)[b] then continue
 3    n ← Σ_k (1 if ⋁_j Deps(s, j, op)[k] else 0)
 4    n′ ← Tags[b].numDeps
 5    reassign ← (¬ISCKSM(Tags[b]) ∧ n′ > 4 ∧ n < n′)
 6    if b ∉ Tags ∨ ISVALIDCKSM(CI, s) ∨ reassign then
 7      Tags[b] ← SETTAG(Tags[b], CT, s, op, CI, R, n, o)
 8  return Tags

Algorithm 5: PLACETAGS procedure.

function FIXCHECKSUMS(CI, I, Tags, o):
 1  Tags_ck ← FILTERANDSORTTAGS(Tags, o)
 2  CT, h ← RUNINSTR_ck(I)
 3  for k ∈ {0 ... |Tags_ck|-1} do
 4    v, idx ← GETCMPOPOFTAG(CT, Tags_ck[k], CI, R)
 5    I ← FIXINPUT(I, v, idx)
 6    UNPATCHPROGRAM(Tags_ck[k], CI)
 7    CT, h′ ← RUNINSTR_ck(I)
 8    if h ≠ h′ then
 9      CI ← MARKCKSMFP(CI, Tags_ck[k].id)
10      return CI, I, false
11  UNPATCHUNTAGGEDCHECKSUMS(CI, Tags_ck)
12  h′ ← RUNLIGHTINSTR_ck(I)
13  if h == h′ then return CI, I, true
14  foreach a ∈ dom(CI) | ∄ t: Tags_ck[t].id == a do
15    PATCHPROGRAM(C, a)
16    h″ ← RUNLIGHTINSTR_ck(I)
17    if h″ ≠ h′ then CI ← MARKCKSMFP(CI, a)
18    UNPATCHPROGRAM(C, a)
19  return CI, I, false

Algorithm 6: FIXCHECKSUMS procedure.

function FINDFIELDEND(Tags, k, depth):
 1  end ← GORIGHTWHILESAMETAG(Tags, k)
 2  if depth < 8 ∧ Tags[end+1].ts == Tags[k].ts+1 then
 3    end ← FINDFIELDEND(Tags, end+1, depth+1)
 4  return end

Algorithm 7: FINDFIELDEND procedure.

function GETRANDOMCHUNK(Tags):
 1  with probability Pr_chunk12 do
 2    k ← RAND(len(Tags))
 3    start ← GOLEFTWHILESAMETAG(Tags, k)
 4    end ← FINDCHUNKEND(Tags, k, 0)
 5    return (start, end)
 6  else
 7    id ← pick u.a.r. from ⋃_i {Tags[i].id}
 8    start ← 0, L ← ∅
 9    while start < |Tags| do
10      start ← SEARCHTAGRIGHT(Tags, start, id)
11      end ← FINDCHUNKEND(Tags, start)
12      L ← L ∪ {(start, end)}, start ← end + 1
13    return l ∈ L chosen u.a.r.

Algorithm 8: GETRANDOMCHUNK procedure.

Field Identification. Algorithm 7 shows how to implement the FINDFIELDEND procedure that we use to detect fields complying with one of the patterns from Figure 3, which we discussed in Section 3.2.1. To limit the recursion at line 3 for field extension, we use 8 as maximum depth (line 2).
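To make the recursion concrete, the field-extension pattern of FINDFIELDEND can be sketched in Python as follows. The Tag record (with a ts timestamp and a comparison id) and the run-skipping helper are simplified stand-ins of our own, not the actual WEIZZ data structures:

```python
from dataclasses import dataclass

MAX_DEPTH = 8  # depth bound used in Algorithm 7

@dataclass
class Tag:
    ts: int   # timestamp of the characterizing comparison
    id: int   # identifier of the comparison instruction

def go_right_while_same_tag(tags, k):
    # Last index of the run of bytes sharing the tag at position k.
    end = k
    while end + 1 < len(tags) and tags[end + 1].id == tags[k].id:
        end += 1
    return end

def find_field_end(tags, k, depth=0):
    # Extend the field while the next run carries the successor timestamp,
    # up to a maximum recursion depth of 8.
    end = go_right_while_same_tag(tags, k)
    if depth < MAX_DEPTH and end + 1 < len(tags) \
            and tags[end + 1].ts == tags[k].ts + 1:
        end = find_field_end(tags, end + 1, depth + 1)
    return end

# A 2-byte run (id 7) extended by the run tagged with the next timestamp:
tags = [Tag(ts=3, id=7), Tag(ts=3, id=7), Tag(ts=4, id=9), Tag(ts=10, id=2)]
print(find_field_end(tags, 0))  # -> 2
```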
Checksum Handling. In Algorithm 6 we report detailed pseudocode for the FIXCHECKSUMS procedure that we illustrated in its main steps throughout Section 3.1.4.

Chunk Identification. As discussed in Section 3.2.2, in some formats there might be dominant chunk types, and choosing the initial position randomly would reveal a chunk of non-dominant type with a smaller probability. WEIZZ chooses between the original and the alternative heuristic with a probability Pr_chunk12, as described in Algorithm 8.
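A minimal Python sketch of the two-heuristic selection of Algorithm 8, under the simplifying assumption that a chunk is a maximal run of bytes sharing a tag id; the value chosen for Pr_chunk12 below is hypothetical:

```python
import random

PR_CHUNK12 = 0.75  # hypothetical value for the Pr_chunk12 knob

def runs(tag_ids):
    # Maximal runs of identical tag ids; a stand-in for real chunk bounds.
    out, start = [], 0
    for i in range(1, len(tag_ids) + 1):
        if i == len(tag_ids) or tag_ids[i] != tag_ids[start]:
            out.append((start, i - 1, tag_ids[start]))
            start = i
    return out

def get_random_chunk(tag_ids):
    if random.random() < PR_CHUNK12:
        # Original heuristic: random position, expanded to its run;
        # dominant chunk types are picked more often this way.
        k = random.randrange(len(tag_ids))
        for s, e, _ in runs(tag_ids):
            if s <= k <= e:
                return (s, e)
    # Alternative heuristic: pick a tag id u.a.r., then one of its runs
    # u.a.r., giving non-dominant chunk types a fair chance of selection.
    tid = random.choice(sorted(set(tag_ids)))
    return random.choice([(s, e) for s, e, t in runs(tag_ids) if t == tid])
```

With tag_ids = [1, 1, 1, 1, 2, 2], the first heuristic returns the long chunk (0, 3) twice as often as (4, 5), while the second returns each with equal probability.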
A.3 Implementation Tuning

In the following we present additional optimizations that we integrated in our implementation.

Loop Bucketization. In addition to newly encountered edges, AFL deems a path interesting when some branch hit count changes (Section 2.1). To avoid keeping too many inputs, AFL verifies whether the counter falls in a previously unseen power-of-two interval. However, as pointed out by the authors of LAF-INTEL [14], this approach falls short in deeming progress with patterns like the following:

    for (int i = 0; i < strlen(magic); ++i)
        if (magic[i] != input[i]) return;

To mitigate this problem, in the surgical phase WEIZZ considers interesting also inputs that lead to the largest hit count observed for an edge across the entire stage.

Optimizing I2S Fuzzing. Another tuning for the surgical stage involves the FUZZOPERANDS step. WEIZZ decides to skip a comparison site when previous iterations over it were unable to generate any coverage-improving paths. In particular, the probability of executing FUZZOPERANDS on a site is initially one and decreases with the ratio between failed and total attempts.

A.4 Additional Experimental Data

Table 6 and Table 7 provide the 60%-confidence intervals for the experiments reporting the basic block code coverage discussed in Section 5.1.

TABLE 6: Basic block coverage (60% confidence intervals) after 5-hour fuzzing of the top 6 subjects of Table 2.

PROGRAMS     WEIZZ         AFLSMART     WEIZZ†
wavpack      1824-1887     1738-1813    1614-1749
readelf      7298-7370     6087-6188    6586-6731
decompress   5831-6276     6027-6569    5376-5685
djpeg        2109-2137     2214-2221    2121-2169
libpng       1620-1688     1000-1035    1188-1231
ffmpeg       15946-17885   9352-9923    14515-14885

TABLE 7: Basic block coverage (60% confidence intervals) after 24-hour fuzzing of the bottom 8 subjects of Table 2.

PROGRAMS   WEIZZ       ECLIPSER    AFL++       AFL
djpeg      612-614     492-532     561-577     581-592
libpng     1747-1804   704-711     877-901     987-989
objdump    3366-4235   2549-2648   2756-3748   2451-2723
mpg321     428-451     204-204     426-427     204-204
oggdec     369-372     332-346     236-244     211-211
readelf    7428-7603   2542-2871   4265-5424   2982-3091
tcpdump    7662-7833   6591-6720   5033-5453   4471-4576
gif2rgb    453-464     357-407     451-454     457-465
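A small Python sketch of the skip policy described under Optimizing I2S Fuzzing; the counter names and the decay formula prob = 1 - failed/total are our reading of the description above, not code from WEIZZ:

```python
import random

class CmpSiteStats:
    # Per-comparison-site counters (names are our own).
    def __init__(self):
        self.total = 0
        self.failed = 0

    def probability(self):
        # Initially 1.0; decays with the ratio of failed to total attempts.
        if self.total == 0:
            return 1.0
        return 1.0 - self.failed / self.total

    def record(self, improved_coverage):
        # Update counters after one FuzzOperands attempt on this site.
        self.total += 1
        if not improved_coverage:
            self.failed += 1

def should_fuzz_operands(stats, rng=random.random):
    # Skip the site with growing probability as failures accumulate.
    return rng() < stats.probability()
```

Sites whose comparisons never yield coverage-improving paths are thus fuzzed less and less often, while a single success keeps a site alive for further attempts.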