
MCPA: Program Analysis as Machine Learning

Marcel Böhme
Monash University, marcel.boehme@acm.org
arXiv:1911.04687v1 [cs.SE] 12 Nov 2019

ABSTRACT
Static program analysis today takes an analytical approach which is quite suitable for a well-scoped system. Data- and control-flow is taken into account. Special cases such as pointers, procedures, and undefined behavior must be handled. A program is analyzed precisely on the statement level. However, the analytical approach is ill-equipped to handle implementations of the complex, large-scale, heterogeneous software systems we see in the real world. Existing static analysis techniques that scale trade correctness (i.e., soundness or completeness) for scalability and build on strong assumptions (e.g., language-specificity). Scalable static analyses are well known to report errors that do not exist (false positives) or to fail to report errors that do exist (false negatives). How, then, do we know the degree to which the analysis outcome is correct?

In this paper, we propose an approach to scale-oblivious greybox program analysis with bounded error which applies efficient approximation schemes (FPRAS) from the foundations of machine learning: PAC learnability. Given two parameters δ and ϵ, with probability at least (1 − δ), our Monte Carlo Program Analysis (MCPA) approach produces an outcome that has an average error of at most ϵ. The parameters δ > 0 and ϵ > 0 can be chosen arbitrarily close to zero (0) such that the program analysis outcome is said to be probably-approximately correct (PAC). We demonstrate the pertinent concepts of MCPA using three applications: (ϵ, δ)-approximate quantitative analysis, (ϵ, δ)-approximate software verification, and (ϵ, δ)-approximate patch verification.

              User-Generated Executions Per Day             Accuracy ϵ
                                                        HFD           Ro3
  OSS-Fuzz   200.0 billion test inputs* [46]        3.6 · 10⁻⁶    2.3 · 10⁻¹¹
  Netflix     86.4 billion API requests [50]        5.5 · 10⁻⁶    5.3 · 10⁻¹¹
  Youtube      6.2 billion videos watched [47]      2.1 · 10⁻⁵    7.4 · 10⁻¹⁰
  Google       5.6 billion searches [47]            2.1 · 10⁻⁵    8.2 · 10⁻¹⁰
  Tinder       2.0 billion swipes [51]              3.6 · 10⁻⁵    2.3 · 10⁻⁹
  Facebook     1.5 billion user logins [48]         4.2 · 10⁻⁵    3.1 · 10⁻⁹
  Twitter    681.7 million tweets posted [47]       6.2 · 10⁻⁵    6.8 · 10⁻⁹
  Skype      253.8 million calls [47]               1.0 · 10⁻⁴    1.8 · 10⁻⁸
  Visa       150.0 million transactions [45]        1.3 · 10⁻⁴    3.1 · 10⁻⁸
  Instagram   71.1 million photos uploaded [47]     1.9 · 10⁻⁴    6.5 · 10⁻⁸
  (*) per widely-used, security-critical library, on average. 10 trillion total.

Figure 1: Daily number of executions of large software systems in 2018, and the lower (Ro3) and upper bounds (HFD) on the accuracy ϵ of an estimate µ̂ of the probability µ that a binary program property φ holds (e.g., bug is found). Specifically, µ ∈ [µ̂ − ϵ, µ̂ + ϵ] with probability δ = 0.01 (i.e., 99%-CIs).

1 INTRODUCTION
In 2018, Harman and O'Hearn launched an exciting new research agenda: the innovation of frictionless¹ program analysis techniques that thrive on industrial-scale software systems [21]. Much progress has been made. Tools like Sapienz, Infer, and Error Prone are routinely used at Facebook and Google [1, 9, 37]. Yet, many challenges remain. For instance, the developers of Infer set the clear expectation that the tool may report many false alarms, does not handle certain language features, and can only report certain types of bugs.² Such tools often trade soundness or completeness for scalability. Then, just how sound or complete is an analysis which ignores, e.g., expensive pointer analysis, reflection in Java, undefined behaviors in C, or third-party libraries for which code is unavailable?

We believe that static program analysis today resembles the analytical approach in the natural sciences. However, when analytical solutions are no longer tractable, the natural sciences turn to other approaches. Monte Carlo methods are often the only means to solve very complex systems of equations ab initio [19]. For instance, in quantum mechanics the following multi-dimensional integral gives the evolution of an atom that is driven by laser light, undergoes "quantum jumps" in a time interval (0, t), and emits exactly n photons at times t_n ≥ … ≥ t_1:

    ∫_0^t dt_n ∫_0^{t_n} dt_{n−1} ⋯ ∫_0^{t_2} dt_1 p_[0,t)(t_1, …, t_n)

where p_[0,t) is the elementary probability density function [14]. Solving this equation is tractable only using Monte Carlo (MC) integration [33]. MC integration solves an integral F = ∫_0^{t_i} dt_i f(x) by "executing" f(x) on random inputs x ∈ ℝⁿ. Let X_j ∈ [0, t_i] be the j-th of N samples from the uniform distribution; then F is estimated as ⟨F_N⟩ = (t_i/N) Σ_{j=1}^{N} f(X_j). MC integration guarantees that this estimate converges to the true value F at a rate of 1/√N. This is true no matter how many variables t_i the integral has or how the function f : ℝⁿ → ℝ is "implemented".

We argue that frictionless program analysis at the large, industrial scale requires a fundamental change of perspective. Rather than starting with a precise program analysis and attempting to carefully trade some of this precision for scale, we advocate an inherently scale-oblivious approach. In this paper, we cast scale-oblivious program analysis as a probably-approximately correct (PAC) learning problem [25], provide the probabilistic framework, and develop several fully polynomial-time randomized approximation schemes (FPRAS). Valiant [43] introduced the PAC framework to study the computational complexity of machine learning techniques. The objective of the learner is to receive a set of samples and generate a hypothesis that with high probability has a low generalization error. We say the hypothesis is probably, approximately correct, where both adverbs are formalized and quantified.

¹ Friction is a technique-specific resistance to adoption, such as developers' reluctance to adopt tools with high false positive rates.
² https://fbinfer.com/docs/limitations.html

Technical Report @ Arxiv, Submitted: 12. Nov. 2019, Melbourne, Australia. © 2019 Copyright held by the owner/author(s).
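To make the 1/√N convergence concrete, MC integration fits in a few lines of Python. This is a sketch, not part of the paper's artifact; the integrand f(x) = x² over [0, 1] and the sample size are hypothetical placeholders (the true integral is 1/3):

```python
import random

def mc_integrate(f, a, b, n, seed=0):
    """Estimate F = integral of f over [a, b] from n uniform samples:
    <F_N> = (b - a)/N * sum of f(X_j)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += f(rng.uniform(a, b))
    return (b - a) * total / n

# Hypothetical example: integral of x^2 over [0, 1] equals 1/3.
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0, 100_000)
```

The estimation error shrinks as 1/√N regardless of the dimension or "implementation" of f: halving the error requires roughly four times the samples.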
We call our technique Monte Carlo program analysis (MCPA). The learner is the program analysis. The samples (more formally, Monte Carlo trials) are distinct program executions that are potentially generated by different users of the software system under normal workload. The hypothesis is the analysis outcome. Given two parameters δ and ϵ, an MCPA produces an outcome that with probability at least (1 − δ) has an average error of at most ϵ. It is important to note that the analysis outcome is probably-approximately correct w.r.t. the distribution of Monte Carlo trials. For instance, if the executions are generated by real users, the analysis outcome is probably-approximately correct w.r.t. the software system as it is used by real users.

In contrast to existing program analysis techniques, MCPA
• allows reasoning about the entire software system under normal workload, i.e., an MCPA run while the system is used normally produces an outcome w.r.t. normal system use,
• requires only greybox access to the software system, i.e., no source code is analyzed; instead, lightweight program instrumentation allows monitoring software properties of interest,
• provides probabilistic guarantees on the correctness and accuracy of the analysis outcome, i.e., exactly quantifies the scalability-correctness tradeoff,
• can be implemented in a few lines of Python code, and
• is massively parallelizable, scale-oblivious, and can be interrupted at any time.

MCPA is a general approach that can produce outcomes with bounded error for arbitrary program analyses. This is because the PAC framework covers all of machine learning, which can learn arbitrary concepts as well. Shalev-Shwartz and Ben-David [41] provide an excellent overview of the theory of machine learning. However, in this paper we focus on (ϵ, δ)-approximations of the probability that a binary property holds for an arbitrary execution of a terminating program. This has important applications, e.g., in assessing reliability [5, 6, 16], quantifying information leaks [34], or exposing side-channels [3]. Given parameters δ > 0 and ϵ > 0, our MCPA guarantees that the reported estimate µ̂ is no more than ϵ away from the true value µ with probability at least 1 − δ, i.e., P(|µ̂ − µ| ≥ ϵ) ≤ δ. In statistical terms, MCPA guarantees the (1 − δ)-confidence interval µ̂ ± ϵ. Note that the confidence and accuracy parameters δ and ϵ can be chosen arbitrarily close to zero (0) to minimize the probability of false negatives.

To illustrate the power of MCPA, we show how often well-known, industrial-scale software systems are executed per day (Fig. 1). For instance, Netflix handles 86.4 billion API requests (8.64 · 10¹⁰) per day. MCPA guarantees that the following statements are true with probability greater than 99% w.r.t. the sampled distribution:
• No matter which binary property we check or how many properties we check simultaneously, we can guarantee a priori that the error ϵ of our estimate(s) µ̂ of µ is bounded between 5.3 · 10⁻¹¹ ≤ ϵ ≤ 5.5 · 10⁻⁶ such that µ ∈ µ̂ ± ϵ.
• If 0 out of 86.4 billion API requests expose a security flaw, then the true probability that there exists a security flaw that has not been exposed is less than 5.3 · 10⁻¹¹ (Sec. 4).
• If 1000 of 86.4 billion API requests expose a vulnerability, then the true probability µ is probabilistically guaranteed to be within µ ∈ 1.16 × 10⁻⁸ ± 1.45 × 10⁻⁹ (Sec. 5).
• If a patch was submitted, it would require at most 400 million API requests (equiv. 7 minutes) to state with confidence (p-value < 10⁻³) that the failure probability has indeed reduced, if none of those 400M executions exposes the bug after the patch is applied. MCPA provides an a-priori conditional probabilistic guarantee that the one-sided null hypothesis can be rejected at significance level α = 0.01 (Sec. 6).
• For an arbitrary program analysis, let H be a finite hypothesis class containing all possible program analysis outcomes. If MCPA chooses the outcome h ∈ H which explains the analysis target (e.g., points-to analysis) for all 5.6 billion daily Google searches, then h explains the analysis targets with a maximum error of ϵ ≤ log(|H|/δ)/(5.6 · 10⁹) for "unseen" searches [41].

On a high level, MCPA combines the advantages of static and dynamic program analysis while mitigating their individual challenges. Like a static program analysis, MCPA allows making statements over all executions of a program. However, MCPA does not take an analytical approach. A static analysis evaluates statements in the program source code, which requires strong assumptions about the programming language and its features. This poses substantial challenges for static program analysis. For instance, alias analysis is undecidable [35]. Like a dynamic program analysis, MCPA overcomes these challenges by analyzing the execution of the software system. However, MCPA does not analyze only one execution. Instead, MCPA allows making statements over all executions generated from a given (operational) distribution.

The contributions of this article are as follows:
• We discuss the opportunities of employing Monte Carlo methods as an approximate, greybox approach to large-scale quantitative program analysis with bounded error.
• We develop several fully-polynomial randomized approximation schemes (FPRAS) that guarantee that the produced estimate µ̂ of the probability µ that a program property holds is no more than ϵ away from µ with probability at least (1 − δ). We tackle an open problem in automated program repair: When is it appropriate to claim with some certainty that the bug is indeed repaired?
• We evaluate the efficiency of our MCPA algorithms probabilistically, by providing upper and lower bounds for all 0 ≤ µ ≤ 1, and empirically, in more than 10¹⁸ simulation experiments.

2 MONTE CARLO PROGRAM ANALYSIS
Monte Carlo program analysis (MCPA) allows deriving statements about properties of a software system that are probably approximately correct [42]. Given a confidence parameter 0 < δ < 1 and an accuracy parameter ϵ > 0, with probability at least (1 − δ), MCPA produces an analysis outcome that has an average error of at most ϵ.

Problem statement. In the remainder of this paper, we focus on a special case of MCPA, i.e., assessing the probability µ that a binary property φ holds (for an arbitrary number of such properties). The probability that φ does not hold (i.e., that ¬φ holds) is simply (1 − µ). Suppose, during the analysis phase, it was observed that φ holds for the proportion µ̂ of sample executions. We call these sample executions in the analysis phase Monte Carlo trials. The analysis phase can be conducted during testing or during normal usage of the software system. The analysis outcome of an MCPA is expected to hold w.r.t. the concerned executions, e.g., further executions of
the system during normal usage. Our objective is to guarantee for the concerned executions that the reported estimate µ̂ is no more than ϵ away from the true value µ with probability at least 1 − δ, i.e., P(|µ̂ − µ| ≥ ϵ) ≤ δ.

Assumptions. We adopt the classical assumptions from software testing: We assume that (i) all executions terminate, (ii) property violation is observable (e.g., an uncaught exception), and (iii) the Monte Carlo trials are representative of the concerned executions. We elaborate each assumption and its consequences in the following sections. We make no other assumptions about the program P or the property φ. For instance, P could be deterministic, probabilistic, distributed, or concurrent; written in C, Java, or Haskell; a trained neural network; or a plain old program. We could define φ to hold if an execution yields correct behavior, if a security property is satisfied, if a soft deadline is met, if an energy threshold is not exceeded, if the number of cache misses is sufficiently low, if no buffer overwrite occurs, if an assertion is not violated, et cetera.

2.1 Assumption 1: All Executions Terminate
We assume that all executions of the analyzed program P terminate. By terminating executions we mean independent units of behavior that have a beginning and an end. For instance, a user session runs from login to logout, a transaction runs from the initial handshake until the transaction is concluded, and a program method runs from the initial method call until the call is returned. Other examples are shown in Figure 1. This is a realistic assumption also in software testing.

If the assumption does not hold. As we require all (sample and concerned) executions to terminate, MCPA cannot verify temporal properties, such as those expressed in linear temporal logic (LTL). If the reader wishes to verify temporal properties over non-terminating executions, we refer to probabilistic model checking [22, 24, 28] or statistical model checking [29, 39]. If the reader wishes to verify binary properties over non-terminating executions, we suggest using probabilistic symbolic execution [18] on partial path constraints of bounded length k.

2.2 Assumption 2: Property Outcomes Are Observable
We assume that the outcome of property φ ∈ {0, 1} can be automatically observed for each execution. Simply speaking, we cannot estimate the proportion µ̂ of concerned executions for which φ holds if we cannot observe whether φ holds for any Monte Carlo trial during the analysis phase. This is a realistic assumption that also exists in software testing.

Greybox access. Some properties φ can be observed externally without additional code instrumentation. For instance, we can measure whether latency, performance, or energy thresholds are exceeded by measuring the time it takes to respond or to compute the final result, or how much energy is consumed. Other properties can be observed by injecting lightweight instrumentation directly into the program binary, e.g., using DynamoRIO or Intel Pin. Yet other properties can be made observable at compile time, causing very low runtime overhead. An example is AddressSanitizer [40], which is routinely compiled into security-critical program binaries to report (exploitable) memory-related errors, such as buffer overflows and use-after-frees.

2.3 Assumption 3: Property Outcomes Are Independent and Identically Distributed
The sequence of n MC trials is a stochastic process F = {X_m}_{m=1}^{n} where X_m ∈ F is a binomial random variable that is true if φ = 1 in the m-th trial. We assume that the property outcomes F are independent and identically distributed (IID). This is a classic assumption in testing, e.g., all test inputs are sampled independently from the same (operational) distribution [16]. More generally, throughout the analysis phase, the probability µ that φ holds is invariant; the binomial distribution over X_m is the same for all X_m ∈ F.

Testing IID. Several statistical tools are available to test whether F is IID. For instance, the turning point test is a statistical test of the independence of a series of random variables.

If the assumption does not hold. The assumption that executions are identically distributed does not hold for stateful programs, where the outcome of φ in one execution may depend on previous executions. In such cases, we suggest understanding each fine-grained execution as a transition from one state to another within a single non-terminating execution. The state transitions can then be modelled as a Markov chain and checked using tools such as probabilistic model checking [22, 24].

2.4 Assumption 4: Monte Carlo Trials Represent Concerned Executions
We assume that the Monte Carlo trials are representative of the concerned executions. In other words, the executions of the software system that were generated during the analysis phase are from the same distribution as the executions w.r.t. which the analysis outcome is expected to hold. This is a realistic assumption shared with software testing, and any empirical analysis in general.

Realistic behavior. A particular strength of MCPA compared to existing techniques is that the software system can be analyzed under normal workload to derive a probably-approximately correct analysis outcome that holds w.r.t. the software system as it is normally executed.

3 MOTIVATION: CHALLENGES OF THE ANALYTICAL APPROACH
The state-of-the-art enabling technology for quantitative program analysis is probabilistic symbolic execution (PSE), which combines symbolic execution and model counting. In a 2017 LPAR keynote [44], Visser called PSE the new hammer. Indeed, we find that PSE is an exciting, new tool in the developer's repertoire for quantitative program analysis problems, particularly if an exact probability is required. However, the analytical approach of PSE introduces several challenges for the analysis of large-scale, heterogeneous software systems which MCPA is able to overcome.

3.1 Probabilistic Symbolic Execution
Suppose property φ is satisfied for all program paths I. Conceptually, for each path i ∈ I, PSE computes the probability p_i that an input exercises i and then reports µ = Σ_{i∈I} p_i. To compute the probability p_i that an input exercises path i, PSE first uses symbolic execution to translate the source code that is executed along i into a Satisfiability Modulo Theory (SMT) formula, called path condition. The path condition π(i) is satisfied by all inputs that exercise i. A model
counting tool can then determine the number of inputs ⟦π(i)⟧ that exercise i. Given the size ⟦D⟧ of the entire input domain D, the probability p_i to exercise path i can be computed as p_i = ⟦π(i)⟧/⟦D⟧.

There are two approaches to establish the model count. Exact model counting [3, 7, 16, 18] determines the model count precisely. For instance, LattE [31] computes ⟦π(i)⟧ as the volume of a convex polytope which represents the path constraint. However, an exact count cannot be established efficiently and turns out to be intractable in practice [2, 11].³ Hence, recent research has focussed on PSE involving approximate model counting.

Approximate model counting trades accuracy for efficiency and determines ⟦π(i)⟧ approximately. Inputs are sampled from non-overlapping, axis-aligned bounding boxes that together contain all (but not only) solutions of π(i) [5, 6, 17, 32]. To determine whether a sampled input exercises i, it is plugged into π(i) and checked for satisfiability. The proportion of "hits", multiplied by the size of the bounding box and summed over all boxes, gives the approximate model count. In contrast to arbitrary polytopes, the size of an axis-aligned bounding box can be determined precisely and efficiently.

3.2 Challenges
Probabilistic symbolic execution uses an incremental encoding of the program's source code as a set of path conditions. Whether static or dynamic symbolic execution, the semantic meaning of each individual program statement is translated into an equivalent statement in the supported Satisfiability Modulo Theory (SMT). The quantitative program analysis, whether exact or approximate, is then conducted upon those path conditions. This indirection introduces several limitations.
(1) Only programs that can be represented in the available SMT theory can be analyzed precisely (e.g., bounded linear integer arithmetic [18] or strings [7]). For approximate PSE, only bounded numerical input domains are allowed. Otherwise, bounding boxes and their relative size cannot be derived. Programs that are built on 3rd-party libraries, are implemented in several languages, contain unbounded loops, involve complex data structures, or execute several threads simultaneously pose challenges for probabilistic symbolic execution.
(2) Only functional properties φ can be checked; those can be encoded within a path condition, e.g., as assertions over variable values. In contrast, non-functional properties, such as whether the response time or energy consumption exceeds some threshold value, are difficult to check.
(3) Only a lower bound on the probability µ that φ holds can be provided. PSE cannot efficiently enumerate all paths for which property φ holds. Reachability is a hard problem. Otherwise, symbolic execution would be routinely used to prove programs correct. Hence, PSE computes µ as µ′ = Σ_{i∈I′} p_i only for a reasonable subset of paths I′ ⊆ I that satisfy φ [16], which is why the reported probability µ′ ≤ µ.
(4) No confidence intervals are available for approximate PSE. While approximate PSE trades accuracy (of the analysis outcome) for efficiency (of the analysis), it does not provide any means to assess this trade-off.⁴ Approximate PSE provides only a (negatively biased) point estimate µ̂ of µ. Due to the modular nature of the composition of the estimate µ̂, it is difficult to derive a reliable (1 − δ)-confidence interval [µ̂ − ϵ, µ̂ + ϵ].
(5) To provide an analysis outcome for programs under normal workload, a usage profile must be developed that can be integrated with the path condition. PSE requires formally encoding user behavior as a usage profile [16]. However, deriving the usage profile from a sample of typical executions can introduce uncertainty in the stochastic model that must be accounted for [30].

We believe that many of these limitations can be addressed. Yet, they do demonstrate a shortcoming of the analytical approach. A static program analysis must parse the source code and interpret each program statement on its own. It cannot rely on the compiler to inject meaning into each statement. The necessary assumptions limit the applicability of the analysis. The required machinery is a substantial performance bottleneck. For PSE, we failed to find reports of experiments on programs larger than a few hundred lines of code. Notwithstanding, we are excited about recent advances in scalable program analysis where correctness is carefully traded for scalability. However, how do we quantify the error of the analysis outcome? What is the probability of a false positive/negative?

3.3 Opportunities
Monte Carlo program analysis resolves many of these limitations.
(1) Every executable program can be analyzed. Apart from termination, we make no assumptions about the software system.
(2) Every binary property φ can be checked, including non-functional properties, as long as it can be automatically observed. Moreover, the number of required trials is independent of the number of simultaneously checked properties.
(3) The maximum likelihood estimate µ̂ for the binomial proportion µ is unbiased. Unlike PSE, MCPA does not require identifying or enumerating program paths that satisfy φ.
(4) The error is bounded. The analysis results are probably approximately correct, i.e., given the parameters δ and ϵ, with probability (1 − δ), we have that µ ∈ [µ̂ − ϵ, µ̂ + ϵ].
(5) The best model of a system is the running system itself. The system can be analyzed directly during normal execution, with analysis results modulo the concerned executions.

It is interesting to note that approximate probabilistic symbolic execution employs Monte Carlo techniques upon each path condition, effectively simulating the execution of the program. Our key observation is that the compiler already imbues each statement with meaning, and program binaries can be executed concretely many orders of magnitude faster than they can be simulated.

³ Gulwani et al. [38] observe that the "exact volume determination for real polyhedra is quite expensive, and often runs out of time/memory on the constraints that are obtained from our benchmarks". Time is exponential in the number of input variables. While a simple constraint involving four variables is quantified in less than 3 minutes in some experiments, a similarly simple constraint involving eight variables is quantified in just under 24 hours [31].
⁴ We note that the sample variance σ̂² is not an indicator of estimator accuracy. Suppose µ = 0.5 such that the (true) variance σ² = 0.25/n is maximal. Due to randomness, we may take n samples where φ does not hold. Our estimate µ̂ = 0 is clearly inaccurate, yet the sample variance σ̂² = 0 gives no indication of this inaccuracy.
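The bounding-box scheme described in Section 3.1 can be sketched in a few lines of Python. This is our illustration, not the paper's implementation; the path condition (x + y ≤ 10 over the box [0, 10] × [0, 10], a region of volume 50) and the single bounding box are hypothetical placeholders, and a real PSE tool would sum over several non-overlapping boxes derived from π(i):

```python
import random

def approx_model_count(path_condition, box, n, seed=0):
    """Approximate model counting over one axis-aligned bounding box:
    sample n inputs uniformly from the box; the proportion of "hits"
    (samples satisfying the path condition) times the box volume
    estimates the model count."""
    rng = random.Random(seed)
    volume = 1.0
    for lo, hi in box:
        volume *= hi - lo
    hits = sum(1 for _ in range(n)
               if path_condition([rng.uniform(lo, hi) for lo, hi in box]))
    return volume * hits / n

# Hypothetical path condition over (x, y) in [0, 10] x [0, 10]:
# x + y <= 10 covers half of the box, i.e., a "count" (volume) of 50.
count = approx_model_count(lambda v: v[0] + v[1] <= 10.0,
                           [(0.0, 10.0), (0.0, 10.0)], 100_000)
```

Summing this estimate over all boxes, as the text describes, yields the approximate model count ⟦π(i)⟧; the box volume, unlike the volume of an arbitrary polytope, is trivial to compute.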
MCPA: Program Analysis as Machine Learning Technical Report @ Arxiv, Submitted: 12. Nov. 2019, Melbourne Australia

Algorithm 1 (ϵ, δ )-Approximate Software Verification 1.0

Input: Program P, Binary properties Φ 0.8

Input: Accuracy ϵ, Confidence δ 0.6

1: Let n = log(δ )/log(1 − ϵ) ε

2: for Each of n Monte Carlo trials do 0.4

3: Execute program P and observe property outcomes ε 0.2

4: for φ i ∈ Φ do 𝛿
5: if φ i does not hold in current execution 0.0
0.0 0.2 0.4
𝛿
0.6 0.8 1.0

6: then return “Property φ i violated for current execution P” Computed by Wolfram|Alpha Computed by Wolfram|Alpha

Wolfram Alpha LLC. 2019.


7: end for https://www.wolframalpha.com/input/?i=plot+log(delta)/log(1- epsilon)+for+delta=0..1, +epsilon=0..1 (access June 23,
2019).
8: end for
return “For all φ i ∈ Φ, we estimate the probability that φ i holds in Figure 2: The number n of Monte Carlo trials required, as
P is µ̂ L = (n+1)/(n+2) and guarantee that the true probability confidence δ and accuracy ϵ varies. The surface plot (left) is
µ i ∈ [1 − ϵ, 1] with probability at least (1 − δ ).” a 3D plot of the function. The contour plot (right) visualizes
how n changes as δ and ϵ varies.

4 (ϵ, δ )-APPROXIMATE SOFTWARE


VERIFICATION from the true value µ, and since µ̂ = 1 also on the probability that
During the analysis of large software systems, the probablity µ µ ≥ 1 − ϵ.
that a binary program property φ holds for an execution is often P(| µ̂ − µ | ≥ ϵ) ≤ (1 − ϵ)n (2)
almost or exactly one.5 For a very large number of successive Monte
Carlo trials, we might not ever observe that φ is violated. Still, Given a confidence parameter δ : 0 < δ ≤ 1, we can compute the
there is always some residual risk that ¬φ holds for an infinitesimal number n of Monte Carlo trials required to guarantee the (1 − δ )-
proportion of concerned executions, i.e., (1−µ) > 0 [4]. Hence, (ϵ, δ )- confidence interval [µ̂ − ϵ, µ̂ + ϵ] as
approximate software verification allows to specify an allowable
log(δ )
residual risk ϵ. It guarantees 0 ≤ (1 − µ) ≤ ϵ with probability (1 − δ ). n≥ (3)
If the binary program property φ holds for all of n Monte-Carlo log(1 − ϵ)
trials, our empirical estimate µ̂ of the probability µ that φ holds Analogously, given n Monte Carlo trials, with probability at least
is µ̂ = nn = 1. In this section, we provide a better point estimate 1 − δ , the absolute difference between the estimate µ̂ and the true
µ̂ L for µ in the absence of empirical evidence for the violation of value µ is at most ϵ where
φ. Now, Höffding’s inequality already provides probabilistic error 1
bounds (Sec. 5): Given the confidence parameter δ , we can compute ϵ ≤ 1−δn (4)
the accuracy ϵ of the estimate µ̂ such that µ is within µ̂ ± ϵ with
Note that for a 95%-CI (i.e., δ = 0.05), we have that ϵ ≤ 3/n giving
probability (1 − δ ). However, in this special case we can leverage a
rise to the name rule-of-three.
generalization of the rule-of-three (Ro3) to reduce the radius ϵ of
the guaranteed confidence interval by many orders of magnitude.

Point estimate. Given that the sun has risen ever since we were born n days ago, what is the probability that the sun will rise tomorrow? This riddle is known as the sunrise problem, and the solution is due to Laplace. Obviously, our plug-in estimator µ̂ = 1 is positively biased, as there is always some residual probability that the sun will fail to rise tomorrow (i.e., that ¬φ holds).⁵ If we have observed that φ holds in all of n trials, the Laplace estimate µ̂_L of µ is computed as

    µ̂_L = (n + 1)/(n + 2)    (1)

Error bounds. The rule-of-three [20] is widely used in the evaluation of medical trials. It yields the 95%-confidence interval [0, 3/n] for the probability of adverse effects, given a trial where none of the n participants experienced adverse effects. Given an accuracy parameter ϵ > 0, Hanley and Lippmann-Hand [20] provide an upper bound on the probability that the estimate µ̂ deviates more than ϵ from the true value µ.

Efficiency. Our Algorithm 1 runs in time that is polynomial in (log(1/(1 − ϵ)))⁻¹ < 1/ϵ and in log(1/δ). It is thus a fully polynomial randomized approximation scheme (FPRAS). The efficiency of Algorithm 1 is visualized in Figure 2. Reducing ϵ by an order of magnitude increases the required number of Monte Carlo trials also only by an order of magnitude. This is further demonstrated for the real-world examples shown in Figure 1.

⁵ Note that we can reduce the alternative case (where the probability µ that φ holds is almost or exactly zero) to the case that we consider here. In that case, the probability (1 − µ) that φ does not hold (i.e., ¬φ holds) is almost or exactly one.

5 (ϵ, δ)-APPROXIMATE QUANTITATIVE ANALYSIS

Quantitative program analysis is concerned with the proportion µ of executions that satisfy some property φ of interest. Quantitative analysis has important applications, e.g., in assessing software reliability [5, 6, 16], computing performance distributions [10], quantifying information leaks [34], quantifying the semantic difference across two program versions [15], or exposing side-channels [3]. We propose (ϵ, δ)-approximate quantitative analysis, which yields an estimate µ̂ that with probability at least (1 − δ) has an average error of at most ϵ. While the generalized rule of three (Ro3) provides a lower bound on the number of trials needed to compute an (ϵ, δ)-approximation of µ, Höffding's inequality provides an upper bound.
Technical Report @ Arxiv, Submitted: 12. Nov. 2019, Melbourne Australia Marcel Böhme

Algorithm 2 (ϵ, δ)-Approximate Quantitative Program Analysis
(Fully-Polynomial Randomized Approximation Scheme)
Input: Program P, Binary property φ
Input: Accuracy ϵ, Confidence δ
 1: Let n = log(2/δ)/(2ϵ²)
 2: Let m = 1
 3: repeat
 4:     Execute P, observe outcome of φ, and increment m
 5:     Let ϵ′ be the radius of the current CP interval (cf. Eq. (9))
 6: until (m == n) or (ϵ′ ≤ ϵ)
 7: Let µ̂ = X/m where X is the total frequency φ has held
return "We estimate the prob. that φ holds in P is µ̂ and guarantee that the true prob. µ ∈ µ̂ ± ϵ with prob. at least (1 − δ)."

Figure 3: The number n of Monte Carlo trials required, as confidence δ and accuracy ϵ vary. The surface plot (left) is a 3D plot of the function. The contour plot (right) visualizes how n changes as δ and ϵ vary. The bright area at the bottom demonstrates substantial growth in n as ϵ approaches 0. (Plots computed by Wolfram|Alpha: Wolfram Alpha LLC. 2019. https://www.wolframalpha.com/input/?i=plot+log(2/delta)/(2epsilon^2)+for+delta=0..1+epsilon=0..1, accessed June 23, 2019.)
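A minimal Python sketch of Algorithm 2's sampling loop, assuming a caller-supplied `execute(·)` oracle for φ. For brevity the Clopper-Pearson early exit of lines 5-6 is omitted, so this sketch always runs the full Höffding budget of Eq. (6); the target probability 0.3 below is purely illustrative:

```python
import math
import random

def mcpa_estimate(execute, epsilon: float, delta: float, rng=None):
    """Simplified sketch of Algorithm 2: estimate the probability that
    phi holds by Monte Carlo sampling, using the Hoeffding-bound trial
    budget of Eq. (6). The CP-interval early exit is omitted here."""
    rng = rng or random.Random(0)
    n = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))  # Eq. (6)
    x = sum(1 for _ in range(n) if execute(rng))  # trials where phi held
    return x / n, n

# Hypothetical program under analysis: phi holds with (unknown) prob. 0.3.
est, n = mcpa_estimate(lambda r: r.random() < 0.3, epsilon=0.01, delta=0.05)
```

With ϵ = 0.01 and δ = 0.05, the budget is n = ⌈log(40)/0.0002⌉ = 18445 trials, and the returned estimate is within ±0.01 of the true probability with 95% confidence.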
5.1 Höffding's Inequality

Höffding's inequality provides probabilistic bounds on the accuracy of an estimate µ̂ of the probability µ = P(φ) that a property φ holds for an execution of a program. The number of times φ holds is a binomial random variable X = Bin(µ, n). The probability µ that φ holds is a binomial proportion. An unbiased estimator of the binomial proportion is µ̂ = X/n, i.e., the proportion of trials in which φ was observed to hold. Given an accuracy parameter ϵ > 0, Höffding [23] provides an upper bound on the probability that the estimate µ̂ deviates more than ϵ from the true value µ:

    P(|µ̂ − µ| ≥ ϵ) ≤ 2·exp(−2nϵ²)    (5)

Given a confidence parameter δ : 0 < δ ≤ 1, we can compute the number n of sample executions required to guarantee the (1 − δ)-confidence interval [µ̂ − ϵ, µ̂ + ϵ] as

    n ≥ log(2/δ)/(2ϵ²)    (6)

Analogously, given n Monte Carlo trials, with probability at least 1 − δ, the absolute difference between the estimate µ̂ and the true value µ is at most ϵ, where

    ϵ ≤ √(log(2/δ)/(2n))    (7)

Note that the probability that the absolute difference between estimate and true value exceeds our accuracy threshold ϵ decays exponentially in the number of sample executions.

5.2 Binomial Proportion Confidence Intervals

CP interval. Clopper and Pearson [12] discuss a statistical method⁶ to compute a (1 − δ)-confidence interval CI = [p_L, p_U] for a binomial proportion µ, such that µ ∈ CI with probability at least (1 − δ) for all µ : 0 < µ < 1. The Clopper-Pearson (CP) interval is constructed by inverting the equal-tailed test based on the binomial distribution. Given the number of times X that property φ has been observed to hold in n MC trials (i.e., X : 0 ≤ X ≤ n), the confidence interval

    CI_CP = [p_L, p_U]    (8)

is found as the solution to the constraints

    Σ_{k=X}^{n} (n choose k)·p_L^k·(1 − p_L)^{n−k} = δ/2   and   Σ_{k=0}^{X} (n choose k)·p_U^k·(1 − p_U)^{n−k} = δ/2

where (n choose k) = n!/(k!(n−k)!) is the binomial coefficient; p_L = 0 when X = 0, while p_U = 1 when X = n. The radius ϵ of the CP interval is

    ϵ = (p_U − p_L)/2    (9)

Given the same number of Monte Carlo trials, the radius ϵ of the CP interval is lower-bounded by the generalized rule-of-three and upper-bounded by Höffding's inequality. The CP interval is conservative, i.e., the probability that µ ∈ CI may actually be higher than (1 − δ), particularly for probabilities µ that are close to zero or one.

Wald interval. The most widely-used confidence intervals for binomial proportions are approximate and cannot be used for (ϵ, δ)-approximate quantitative analysis. The CP interval is computationally expensive and conservative. Hence, the most widely-used CIs are based on approximating the binomial with the normal distribution. For instance, the well-known Wald interval [8] is

    CI_Wald = [µ̂ − z_{δ/2}·√(µ̂(1 − µ̂)/n), µ̂ + z_{δ/2}·√(µ̂(1 − µ̂)/n)]    (10)

where z_{δ/2} is the (1 − δ/2) quantile of the standard normal distribution. However, as we observe in our experiments, the Wald interval has poor coverage properties, i.e., for small µ, the nominal 95%-confidence interval actually contains µ only with a 10% probability.

Wilson score interval. The radius of the Wald interval is unreasonably small when the estimand µ gets close to zero or one. The Wilson score CI mitigates this issue and has become the recommended interval for binomial proportions [8]. The Wilson score interval [52] is

    CI_Wilson = [µ̂′ − ϵ′, µ̂′ + ϵ′]    (11)

where µ̂′ is the relocated center estimate

    µ̂′ = (µ̂ + z²_{δ/2}/(2n)) / (1 + z²_{δ/2}/n)    (12)

and ϵ′ is the corrected standard deviation

    ϵ′ = z_{δ/2}·√(µ̂(1 − µ̂)/n + z²_{δ/2}/(4n²)) / (1 + z²_{δ/2}/n)    (13)

⁶ A probabilistic method starts with the underlying process and makes predictions about the observations the process generates. In contrast, a statistical analysis starts with the observations and attempts to derive the parameters of the underlying random process that explain the observations. In other words, a statistical method requires observations to compute the estimates, while a probabilistic method provides a-priori bounds on such estimates (e.g., based on Ro3 or HfD).
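The three intervals can be compared numerically. The paper's experiments use the R binom package; the following is an illustrative standard-library Python sketch, with the CP limits obtained by bisecting the binomial tail constraints below Eq. (8) (the observed count 5 of 100 is a made-up example):

```python
import math
from statistics import NormalDist

def binom_tail_ge(n, x, p):
    # P[X >= x] for X ~ Bin(n, p), by direct summation (fine for modest n).
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def bisect_root(f, lo=0.0, hi=1.0, iters=80):
    # Root of an increasing function f on [lo, hi].
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, delta=0.05):
    # Equal-tailed CP interval [p_L, p_U] of Eq. (8).
    p_lo = 0.0 if x == 0 else bisect_root(lambda p: binom_tail_ge(n, x, p) - delta / 2)
    p_hi = 1.0 if x == n else bisect_root(lambda p: delta / 2 - (1 - binom_tail_ge(n, x + 1, p)))
    return p_lo, p_hi

def wald(x, n, delta=0.05):
    # Wald interval of Eq. (10).
    z = NormalDist().inv_cdf(1 - delta / 2)
    mu = x / n
    r = z * math.sqrt(mu * (1 - mu) / n)
    return mu - r, mu + r

def wilson(x, n, delta=0.05):
    # Wilson score interval of Eqs. (11)-(13).
    z = NormalDist().inv_cdf(1 - delta / 2)
    mu = x / n
    center = (mu + z * z / (2 * n)) / (1 + z * z / n)
    radius = z * math.sqrt(mu * (1 - mu) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return center - radius, center + radius

# Example: phi held in 5 of 100 Monte Carlo trials.
cp, wd, ws = clopper_pearson(5, 100), wald(5, 100), wilson(5, 100)
```

For 5 of 100 trials, the CP interval is wider at the lower end than the Wald interval, which illustrates the conservativeness discussed above.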

Algorithm 3 (ϵ, δ)-Approximate Quantitative Program Analysis
(Reduced number of Clopper-Pearson CI calculations)
Input: Program P, Binary property φ
Input: Accuracy ϵ, Confidence δ
 1: Let n = n_next = log(δ)/log(1 − ϵ)
 2: for each of n_next Monte Carlo trials do
 3:     Execute P and observe the outcome of φ
 4: end for
 5: Let µ̂ = X/n where X is the total frequency φ has held
 6: if 0 < µ̂ < 1 then
 7:     repeat
 8:         Let n_next = ⌈(z²p₀q₀ + z·√(z²p₀²q₀² + 2ϵp₀q₀) + ϵ)/(2ϵ²)⌉
            where p₀ = µ̂ and q₀ = 1 − p₀ and z = z_{δ/2}
 9:         for each of n_next Monte Carlo trials do
10:             Execute P and observe the outcome of φ
11:         end for
12:         Let n = n + n_next
13:         Let ϵ′ be the radius of the current CP interval (cf. Eq. (9))
14:         Let µ̂ = X/n where X is the total frequency φ has held
15:     until ϵ′ ≤ ϵ
16: end if
return "We estimate the prob. that φ holds in P is µ̂ and guarantee that the true prob. µ ∈ µ̂ ± ϵ with prob. at least (1 − δ)."

Evaluation. We experimentally investigate properties of the three confidence intervals for binomial proportions in Section 7. We find that only the Clopper-Pearson confidence interval guarantees that µ ∈ CI with probability at least (1 − δ) for all µ : 0 ≤ µ ≤ 1. The other two intervals cannot be used for (ϵ, δ)-approximate quantitative analysis.

5.3 Quantitative Monte Carlo Program Analysis

Algorithm 2 shows the procedure of the MCPA. Höffding's inequality provides the upper bound on the number of Monte Carlo trials required for an (ϵ, δ)-approximation of the probability µ that property φ holds in program P (Eq. (6)). The conservative Clopper-Pearson confidence interval provides an early exit condition (Line 6).

Efficiency. Algorithm 2 runs in time that is polynomial in 1/ϵ and in log(1/δ) and is thus a fully-polynomial randomized approximation scheme (FPRAS). The worst-case running time is visualized in Figure 3. Reducing δ (i.e., increasing confidence) by an order of magnitude less than doubles execution time, while reducing ϵ (i.e., increasing accuracy) by an order of magnitude increases execution time by two orders of magnitude. However, computing the Clopper-Pearson (CP) CI after each Monte Carlo trial is expensive. In our experiments, computing the CP interval 10⁵ times takes 58 seconds, on average. Computing the CP interval for each of the n = 2·10¹¹ test inputs that OSS-Fuzz generates on a good day (cf. Figure 1) would take about 3.8 years.

Optimization. Hence, Algorithm 3 predicts the number of Monte Carlo trials needed for a Clopper-Pearson interval with radius ϵ. The estimator in Line 8 was developed by Krishnamoorthy and Peng [27] and requires an initial guess p₀ of µ. The algorithm computes the first guess p₀ = µ̂_L after running the minimal number of trials required for the given confidence and accuracy parameters (Lines 1–6), as provided by the generalized rule-of-three (Line 1). Subsequent guesses are computed from the improved maximum likelihood estimate (Line 14).

6 APPROXIMATE PATCH VERIFICATION

"It turns out that detecting whether a crash is fixed or not is an interesting challenge, and one that would benefit from further scientific investigation by the research community. [..] How long should we wait, while continually observing no re-occurrence of a failure (in testing or production), before we claim that the root cause(s) have been fixed?" [1]

Algorithm 4 Approximate Patch Verification
Input: Program P_fix, Binary property φ, Confidence δ
Input: Total trials n_bug and unsuccessful trials X_bug in P_bug
 1: Let n_fix = log(δ)/log(1 − p_L) where p_L is the lower limit of the Clopper-Pearson interval (cf. Eq. (8))
 2: for each of n_fix Monte Carlo trials do
 3:     Execute program P_fix and observe outcome of φ
 4:     if φ does not hold in the current execution
 5:     then return "Property φ violated for current execution of P_fix"
 6: end for
return "The null hypothesis can be rejected at significance-level at least δ, to accept the alternative that the failure rate has decreased."

Let µ_bug and µ_fix be the probabilities to expose an error before and after the bug was fixed, respectively. Suppose we have n_bug executions of the buggy program, out of which X_bug = Bin(µ_bug, n_bug) exposed the bug. Hence, an unbiased estimator of µ_bug is µ̂_bug = X_bug/n_bug. We call an execution successful if no bug (of interest) was observed. Given a confidence parameter δ : 0 < δ < 1, we ask: how many successful executions n_fix of the fixed program (i.e., 0 = Bin(µ_fix, n_fix) = X_fix) are needed to reject the null hypothesis with statistical significance α = δ?

To reject the null hypothesis at significance-level δ, we require that there is no overlap between the two (1 − δ)-confidence intervals. For the buggy program version, we can compute the Clopper-Pearson interval CI_CP = [p_L, p_U] (cf. Sec. 5.2). Recall that the probability that a property φ holds is simply the complement of the probability that ¬φ holds. For the fixed program version, we leverage the generalized rule-of-three CI_Ro3 = [0, 1 − δ^(1/n_fix)] (cf. Eq. (4)). In order to reject the null, we seek n_fix such that

    1 − δ^(1/n_fix) < p_L    (14)

which is true for

    n_fix > log(δ)/log(1 − p_L)    (15)

where p_L is the lower limit of the Clopper-Pearson CI (cf. Eq. (8)).

Efficiency. The CP limit p_L is upper- and lower-bounded by the generalized rule-of-three and Höffding's inequality, respectively:

    µ̂_bug − √(log(2/δ)/(2n_bug)) ≤ p_L ≤ µ̂_bug − (1 − δ^(1/n_bug))    (16)

for all 0 ≤ µ_bug ≤ 1. Hence, Algorithm 4 is an FPRAS that runs in time that is polynomial in log(1/(1 − µ̂_bug))⁻¹ and in log(1/δ). The worst-case efficiency of Algorithm 4 is visualized in Figure 4. For instance, if 800k of 200 billion executions expose a bug in P_bug, then it requires fewer than 1.8 million successful executions to reject the null at significance level p < 0.001 in favor of the alternative that the probability of exposing a bug has indeed decreased for P_fix.
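The bound of Eq. (15) is straightforward to compute. A small Python sketch (the p_L and δ values below are illustrative, not taken from the paper's experiments):

```python
import math

def trials_to_verify_fix(p_lo: float, delta: float) -> int:
    # Smallest integer n_fix satisfying the strict inequality of Eq. (15):
    #   n_fix > log(delta) / log(1 - p_lo),
    # i.e. enough all-successful runs of P_fix that the rule-of-three interval
    # no longer overlaps the CP lower limit p_lo of the buggy version.
    return math.floor(math.log(delta) / math.log(1 - p_lo)) + 1

# Illustrative numbers: CP lower limit p_L = 0.004 for the buggy version,
# significance delta = 0.001.
n_fix = trials_to_verify_fix(0.004, 0.001)  # -> 1724
```

Note the strict inequality: at n_fix = 1723 the two intervals would still overlap, so `floor(·) + 1` rather than `ceil(·)` is used.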

Figure 4: The number n_fix of Monte Carlo trials required for the patched program to reject the null and conclude that the failure rate has indeed decreased, as significance-level δ and failure rate µ̂_bug for the buggy program vary (n_bug = 4e6). The contour plot shows how n_fix changes as δ and µ_bug vary. (Plot computed by Wolfram|Alpha: Wolfram Alpha LLC. 2019. https://www.wolframalpha.com/input/?i=plot+log(delta)%2Flog(1-mu%2Bsqrt(log(2%2Fdelta)%2F(2*4e6)))+where+mu=0.001..1, delta=0.001..1, accessed June 23, 2019.)

7 EXPERIMENTS

We implemented Algorithms 1–4 in 300 lines of R code. The binom package [13] was used to compute various kinds of confidence intervals. R is a statistical programming language.

7.1 Approximate Quantitative Analysis

Figure 5: The observed probability that µ ∈ CI given the nominal probability (1 − δ) = 95% as µ varies. We generated 100 thousand repetitions of 100 million trials for one thousand values of µ, i.e., ≈ 10¹⁸ Monte Carlo trials in total. (Panels: Wald, Wilson, Clopper-Pearson; y-axis: observed confidence level P(µ ∈ CI); x-axis: probability (1 − µ).)

RQ1. Can the Clopper-Pearson interval CI_CP be used for (ϵ, δ)-approximate quantitative analysis? An (ϵ, δ)-approximation µ̂ of a binomial proportion µ guarantees that µ ∈ [µ̂ − ϵ, µ̂ + ϵ] with probability at least (1 − δ). Figure 5 shows the results for 10¹⁸ simulation experiments. For various values of µ, we generated n = 10⁸ trials by sampling from µ and computed the 95% confidence interval according to the methods of Wald, Wilson, and Clopper-Pearson. We repeated each experiment 10⁵ times and measured the proportion of intervals that contain µ. This proportion gives the observed confidence level, while 95% is the nominal confidence level.

Yes. The Clopper-Pearson 95%-confidence interval CI_CP contains µ with probability at least 95% for all values of µ. Algorithms 2 and 3 provide valid (ϵ, δ)-approximations. However, both the Wald and the Wilson procedures provide 95%-CIs that contain µ with probability less than 95%, especially for µ → 1 (or µ → 0, resp.). For instance, if µ = 1 − 10⁻⁸, the Wald 95%-confidence interval CI_Wald, which is the most widely-used interval for binomial proportions, contains µ with probability P(µ ∈ CI_Wald) < 80%.

RQ2 (Efficiency). How efficient are Algorithms 2 and 3 w.r.t. the probabilistic upper and lower bounds? Here, efficiency is measured as the number of MC trials required to compute an (ϵ, δ)-approximation of the probability µ that φ holds. The upper and lower bounds for all µ : 0 ≤ µ ≤ 1 are given by Höffding's inequality (HfD) and the generalized rule-of-three (Ro3), respectively. Figure 6 shows the efficiency of both algorithms as µ varies, as well as the probabilistic bounds, for various values of ϵ and δ. Each experiment was repeated 1000 times and average values are reported.

Both algorithms approach the probabilistic lower bound (Ro3) as µ → 1 (or µ → 0, resp.). For instance, for δ = 0.01, ϵ = 0.001, at least 4603 and at most 2.6 million Monte Carlo trials are required; for µ = 10⁻⁴, Algorithm 2 requires 4809 trials, which is just 4% above the lower bound. Moreover, Algorithm 3 requires only slightly more Monte Carlo trials than Algorithm 2 but reduces the number of computed intervals substantially, from n to at most 2.

Figure 6: Required MC trials n for an (ϵ, δ)-approximation of µ. The rule-of-three (Ro3) provides the lower while Höffding's inequality (HfD) provides the upper bound for all 0 ≤ µ ≤ 1. In the first row, δ = 0.01. In the second row, ϵ = 0.01. (Panel titles: Epsilon = 0.1/0.01/0.001 and Delta = 0.1/0.01/0.001; series: Algorithm 2, Algorithm 3, Upper Bound (HfD), Lower Bound (Ro3); y-axis: Monte Carlo trials n; x-axis: probability (1 − µ).)

            BinaryTree              BinomialHeap
            v ∈ [0, 9]       v ∈ [0, 9]          v ∈ [0, 499]
            Loc  µ̂      µ       Loc  µ̂       µ        µ̂       µ
             0  0.9374  0.9375    1  0.64464  0.64451  0.74833  0.74799
             1  0.4589  0.4592    4  0.22842  0.22831  0.31051  0.31062
             2  0.5740  0.5745    6  0.60359  0.60350  0.68618  0.68599
             3  0.4591  0.4592    7  0.26947  0.26932  0.37265  0.37263
             4  0.5742  0.5745    8  0.22842  0.22831  0.31051  0.31062
             8  0.0202  0.0203    9  0.38148  0.38161  0.40648  0.40583
             9  0.0189  0.0189   10  0.30421  0.30389  0.40398  0.40416
            10  0.0347  0.0346   13  0.00767  0.00765  0.00025  0.00025
            11  0.0361  0.0361   14  0.11879  0.11863  0.00275  0.00274
            12  0.0361  0.0361   15  0.00767  0.00765  0.00025  0.00025
            13  0.1077  0.1077   16  0.05551  0.05552  0.00150  0.00149
            14  0.5742  0.5745   17  0.06640  0.06621  0.00138  0.00137
            15  0.5739  0.5745   18  0.04784  0.04787  0.00125  0.00124
                                 19  0.00454  0.00456  0.00012  0.00012
                                 20  0.00767  0.00765  0.00025  0.00024

Figure 7: Branch probabilities computed by Algorithm 3 (µ̂) and JPF-ProbSym (µ). We generated sufficient MC trials to guarantee (ϵ, δ)-approximations for all 0 ≤ µ ≤ 1 s.t. δ = ϵ = 10⁻³. That is, 3.8 · 10⁶ executions of random values v on BinaryTree (4.6 seconds) and BinomialHeap (0.5 seconds).
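The coverage collapse of the Wald interval for extreme µ (RQ1) can also be checked analytically rather than by simulation: for fixed n and µ, the exact coverage is the binomial-weighted fraction of outcomes X whose interval contains µ. A small Python sketch (the values n = 1000 and µ = 0.001 are illustrative, much smaller than the paper's 10⁸-trial experiments):

```python
import math
from statistics import NormalDist

def wald_coverage(n: int, mu: float, delta: float = 0.05) -> float:
    # Exact coverage P(mu in CI_Wald) for X ~ Bin(n, mu), with the
    # Wald interval of Eq. (10) computed from each possible outcome x.
    z = NormalDist().inv_cdf(1 - delta / 2)
    cov = 0.0
    for x in range(n + 1):
        p_hat = x / n
        radius = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - radius <= mu <= p_hat + radius:
            cov += math.comb(n, x) * mu**x * (1 - mu)**(n - x)
    return cov

cov_small = wald_coverage(1000, 0.001)  # nominal 95%, observed far below
cov_mid = wald_coverage(1000, 0.5)      # near nominal for central mu
```

The dominant failure mode is visible in the code: when X = 0, the Wald interval degenerates to [0, 0] and cannot contain any µ > 0, which alone costs roughly e^(−nµ) of coverage.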

RQ3 (Experience with PSE). Geldenhuys et al. [18] developed what is now the state-of-the-art of exact quantitative program analysis, probabilistic symbolic execution (PSE). The authors implemented PSE in the JPF-ProbSymb tool and evaluated it using the two programs BinaryTree and BinomialHeap. For both programs, Figure 7 shows the exact branch probabilities µ as computed by PSE and the corresponding (ϵ, δ)-approximations as computed by Algorithm 3. To generate a Monte Carlo trial, we randomly sampled from the same domain that PSE is given.

As we can see, our Monte Carlo program analysis (MCPA) approximates the branch probabilities µ within the specified minimum accuracy. For δ = ϵ = 0.001, our estimates µ̂ are usually within ±0.0005 of the true values µ. We would like to highlight that approximate PSE [5, 6, 17, 32] does not provide any such guarantees on the accuracy of the produced estimates. We also observed that PSE is tractable only for very small integer domains (e.g., v ∈ [0, 9] [18] or v ∈ [0, 100] [10]), while our MCPA works with arbitrarily large domains just as efficiently. In fact, unlike PSE, MCPA works for executions generated while the software is deployed and used.

The results of Alg. 3 are comparable to those produced by JPF-ProbSym, with an error ϵ that can be set arbitrarily close to zero. However, without the additional machinery of PSE, our MCPA algorithm produces the analysis result much faster. For BinomialHeap, our algorithm took half a second (0.5s) while JPF-ProbSymb took 57 seconds. For BinaryTree, our algorithm took about 5 seconds while JPF-ProbSymb took about seven minutes. In general, we expect that a sufficiently large number of Monte Carlo trials can be generated quickly for arbitrarily large programs (cf. Figure 1 on page 1), which makes MCPA effectively scale-oblivious.

Versus JPF-ProbSym, our MCPA is orders of magnitude faster.

7.2 Approximate Software Verification

Figure 8: Efficiency versus residual risk. Both plots show the number of required Monte Carlo trials n, where φ is observed to hold, to guarantee that φ always holds, with an error that exceeds ϵ with probability at most δ = 0.05. (Series: rule-of-three, Clopper-Pearson, Hoeffding's inequality; y-axis: successful Monte Carlo trials n, linear scale left and log scale right; x-axis: accuracy ϵ.)

RQ4 (Residual Risk). For a given allowable residual risk ϵ, how efficient is Alg. 1 in providing the probabilistic guarantee that no bug exists when none has been observed? Figure 8 demonstrates the efficiency of approximate software verification. Alg. 1 exhibits a performance that is nearly inversely proportional to the given allowable residual risk. For instance, it requires about 3·10⁵ trials where no error is observed to guarantee that no error exists with an error that exceeds a residual risk of ϵ = 10⁻⁵ with probability at most 5%. Decreasing ϵ by an order of magnitude to ϵ = 10⁻⁶ also increases the number of required trials only by an order of magnitude, to about 3·10⁶.

Algorithm 1 is highly efficient. It exhibits a performance that is nearly inversely proportional to the given allowable residual risk.

Moreover, the Clopper-Pearson interval CI_CP provides an upper limit that is very close to the probabilistic lower bound as given by the rule-of-three, confirming our observation in RQ2.

7.3 Approximate Patch Verification

Figure 9: Efficiency versus significance. The number of required successful executions n_fix of P_fix to reject the null at significance α = δ. The blue, dashed line shows α = 0.05. We conducted 10¹⁵ simulation experiments in total; i.e., 10³ repetitions of n_bug = 10⁶ trials on P_bug for each of 10² values of n_fix ≤ 10⁴ on P_fix, and 3 values of µ_bug. (Panel titles: µ_bug = 0.9, 0.99, 0.999; series: Approx. Patch Verif., Mann-Whitney U Test, Fisher's Exact Test; y-axis: Monte Carlo trials n_fix; x-axis: significance level α (= δ).)

RQ5 (Efficiency). How efficient is Algorithm 4 in rejecting the null hypothesis at a desired p-value compared to existing hypothesis testing methodologies? In Figure 9, we compare our MCPA to Fisher's exact test and the Mann-Whitney U test. Fisher's exact test is the standard two-sample hypothesis test for binomial proportions when one of the estimates (here µ̂_fix) is zero or one.⁷ The Mann-Whitney U test is the standard test used in the A/B testing platform at Netflix [49, 50] to assess, e.g., performance degradation between versions. The U test is a non-parametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.

Algorithm 4 is more efficient than existing techniques for high levels of confidence (e.g., p-value = α ≤ 0.05) on this task of approximate patch verification. The Mann-Whitney test that Netflix uses for A/B testing performs worst on this task; the difference to our MCPA increases as α → 0. Fisher's exact test approaches the performance of our MCPA as α → 0.

⁷ We note that the popular Chi-squared test, while easier to compute, is an approximation method that cannot be used when one of the estimates µ̂_bug, µ̂_fix is 0 or 1.
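The RQ4 numbers follow directly from the generalized rule-of-three: the number of all-successful trials needed so that the residual failure probability exceeds ϵ with probability at most δ. A quick Python check:

```python
import math

def trials_for_residual_risk(eps: float, delta: float = 0.05) -> int:
    # Smallest n with 1 - delta**(1/n) <= eps, i.e. n >= log(delta)/log(1 - eps):
    # after n trials in which phi always held, the probability that the true
    # failure rate still exceeds eps is at most delta (generalized rule-of-three).
    return math.ceil(math.log(delta) / math.log(1 - eps))

n5 = trials_for_residual_risk(1e-5)  # about 3 * 10**5
n6 = trials_for_residual_risk(1e-6)  # about 3 * 10**6, one order of magnitude more
```

For small ϵ this simplifies to n ≈ ln(1/δ)/ϵ, which is the "nearly inversely proportional" behavior observed in Figure 8.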

7.4 Threats to Validity

As in every empirical study, there are threats to the validity of the results. In terms of external validity, the assumptions of Monte Carlo program analysis (MCPA) may not be realistic. A particular advantage of simulation experiments is their generality: our results apply to arbitrary software systems of arbitrary size, as long as the stochastic process generating the Monte Carlo trials satisfies the assumptions of MCPA. Thus, whether or not the assumptions are realistic is the only threat to external validity. In Section 2, we discuss these assumptions, how realistic they are, and mitigation strategies in case they do not hold. In particular, most assumptions are shared with those of automatic software testing.

In terms of internal validity, we cannot guarantee that our implementation is correct, but we make our script available for the reader to scrutinize the implementation and reproduce our results.⁸ To minimize the probability of bugs, we make use of standard R packages, e.g., binom for the computation of the confidence intervals.

8 RELATED WORK

We are interested in the probability that a binary property⁹ φ holds for an execution of a terminating program. This problem is also tackled by probabilistic symbolic execution [18]. After a discussion of the relationship to probabilistic symbolic execution, we discuss differences to A/B testing and runtime verification techniques.

Probabilistic Symbolic Execution (PSE) computes, for each path i that satisfies the property φ, the probability p_i using symbolic execution and exact [16, 18] or approximate model counting [5, 6, 17, 32], and then computes µ = Σ_i p_i to derive the probability that φ holds. In contrast to MCPA, PSE allows to analyse non-terminating programs by fixing the maximum length of the analyzed path. However, in contrast to PSE, MCPA does not require a formal encoding of the operational distribution as usage profiles. Instead, the analysis results are directly derived from actual usage of the software system. While PSE is a static quantitative program analysis that never actually executes the program, MCPA is a dynamic quantitative analysis technique that requires no access to the source code or the heavy machinery of constraint encoding and solving. In contrast to approximate PSE [5, 6, 17, 32], MCPA provides probabilistic guarantees on the accuracy of the produced point estimate µ̂. In contrast to statistical PSE [17], MCPA does not require an exact model count for the branch probabilities of each if-conditional. Other differences to PSE are discussed in Section 3.

A/B Testing [26, 49] provides a statistical framework to evaluate the performance of two software systems (often the new version versus the old version) within quantifiable error. In contrast to A/B testing, (ϵ, δ)-approximate quantitative analysis as well as (ϵ, δ)-approximate software verification pertain to a single program. Moreover, in simulation experiments, our (ϵ, δ)-approximate patch verification outperforms Fisher's exact test and the Mann-Whitney U test, which is the two-sample hypothesis test of choice for A/B testing at Netflix [49, 50]. We also note that A/B testing is subject to the same assumptions.

Statistical/Probabilistic Model Checking. Informally speaking, (δ, ϵ)-approximate software verification is to probabilistic and statistical model checking as software verification is to classical model checking. Our MCPA techniques are neither focused on temporal properties nor concerned with a program's state space. More specifically, instead of verifying the expanding prefix of an infinite path through the program's state space, MCPA requires several distinct, terminating executions (Sec. 8). In this setting, the work on probabilistic symbolic execution [6, 16, 18] is most related.

In model checking, there is the problem of deciding whether a model S of a potentially non-terminating stochastic system satisfies a temporal logic property ϕ with a probability greater than or equal to some threshold θ: S |= P_{≥θ}(ϕ). The temporal property ϕ is specified in a probabilistic variant of linear time logic (LTL) or computation tree logic (CTL). For instance, we could check: "When a shutdown occurs, the probability of a system recovery being completed between 1 and 2 hours without further failure is greater than 0.75":

    S |= down → P_{>0.75} [¬fail U^[1,2] up]

Probabilistic model checking (PMC) [22, 24, 28] checks such properties using a (harder-than-NP) analytical approach. Hence, PMC quickly becomes intractable as the number of states increases. PMC also requires a formal model of the stochastic system S as a discrete- or continuous-time Markov chain. Statistical model checking (SMC) [29, 39] does not take an analytical but a statistical approach. SMC checks such properties by performing hypothesis testing on a number of (fixed-length) simulations of the stochastic system S. Similar to MCPA, Sen et al. [39] propose to execute the stochastic system directly. However, Sen et al. assume that a "trace" is available, i.e., S can identify and report the sequence of states which S visits and how long each state transition takes. The properties that are checked concern particular state sequences and time intervals. In contrast, the focus of our current work is on binary (rather than temporal) properties for executions of terminating (rather than non-terminating) systems. In our setting, the concept of time is rather secondary. Moreover, (δ, ϵ)-approximate software and patch verification are just two particular instantiations of MCPA.

9 CONCLUSION

We are excited about the tremendous advances that have been made in scalable program analysis, particularly in the area of separation logic. For instance, the ErrorProne static analysis tool is routinely used at the scale of Google's two-billion-line codebase [37]. The Infer tool has had substantial success at finding bugs at Facebook scale [9]. Yet, there still remain several open challenges; e.g., the developers of Infer set the clear expectation that the tool may report many false alarms, does not handle certain language features, and can only report certain types of bugs.¹⁰ We strongly believe that many of these challenges are going to be addressed in the future. However, we also feel that it is worthwhile to think about alternative approaches to scalable program analysis, perhaps approaches that do not attempt to maintain formal guarantees at an impractical cost. If we are ready to give up on soundness or completeness in favor of scalability, we should at least be able to quantify the trade-off. The probabilistic and statistical methodologies presented in this paper represent significant progress in this direction.

⁸ https://www.dropbox.com/sh/fxiz3ycvvbi51xg/AACk0UyO_nG0UlPMEH645jbXa?dl=0
⁹ Quantitative program properties, such as energy, are considered by checking whether they exceed a threshold (for an arbitrary number of thresholds).
¹⁰ https://fbinfer.com/docs/limitations.html


ACKNOWLEDGMENTS

We thank Prof. David Rosenblum for his inspiring ASE'16 keynote speech on probabilistic thinking [36]. This research was partially funded by the Australian Government through an Australian Research Council Discovery Early Career Researcher Award (DE190100046).

REFERENCES

[1] Nadia Alshahwan, Xinbo Gao, Mark Harman, Yue Jia, Ke Mao, Alexander Mols, Taijin Tei, and Ilya Zorin. 2018. Deploying Search Based Software Engineering with Sapienz at Facebook. In Search-Based Software Engineering. 3–45.
[2] Sanjeev Arora and Boaz Barak. 2009. Computational Complexity: A Modern Approach (1st ed.). Cambridge University Press, New York, NY, USA.
[3] Lucas Bang, Abdulbaki Aydin, Quoc-Sang Phan, Corina S. Păsăreanu, and Tevfik Bultan. 2016. String Analysis for Side Channels with Segmented Oracles. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). 193–204.
[4] Marcel Böhme. 2019. Assurance in Software Testing: A Roadmap. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER '19). IEEE Press, Piscataway, NJ, USA, 5–8.
[5] Mateus Borges, Antonio Filieri, Marcelo D'Amorim, and Corina S. Păsăreanu. 2015. Iterative Distribution-aware Sampling for Probabilistic Symbolic Execution. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). 866–877.
[6] Mateus Borges, Antonio Filieri, Marcelo D'Amorim, Corina S. Păsăreanu, and Willem Visser. 2014. Compositional Solution Space Quantification for Probabilistic Software Analysis. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). 123–132.
[7] Tegan Brennan, Nestan Tsiskaridze, Nicolás Rosner, Abdulbaki Aydin, and Tevfik Bultan. 2017. Constraint Normalization and Parameterized Caching for Quantitative Program Analysis. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). 535–546.
[8] Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta. 2001. Interval Estima-
[24] Joost-Pieter Katoen. 2016. The Probabilistic Model Checking Landscape. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science (LICS '16). 31–45.
[25] Michael J. Kearns and Umesh V. Vazirani. 1994. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, USA.
[26] Ron Kohavi and Roger Longbotham. 2017. Online Controlled Experiments and A/B Testing. Springer US, 922–929.
[27] K. Krishnamoorthy and Jie Peng. 2007. Some Properties of the Exact and Score Methods for Binomial Proportion and Sample Size Calculation. Communications in Statistics - Simulation and Computation 36, 6 (2007), 1171–1186.
[28] Marta Kwiatkowska, Gethin Norman, and David Parker. 2011. PRISM 4.0: Verification of Probabilistic Real-time Systems. In Proceedings of the 23rd International Conference on Computer Aided Verification (CAV'11). 585–591.
[29] Axel Legay, Benoît Delahaye, and Saddek Bensalem. 2010. Statistical Model Checking: An Overview. In Proceedings of the First International Conference on Runtime Verification (RV'10). 122–135.
[30] Yamilet R. Serrano Llerena, Marcel Böhme, Marc Brünink, Guoxin Su, and David S. Rosenblum. 2018. Verifying the Long-Run Behavior of Probabilistic System Models in the Presence of Uncertainty. In Proceedings of the 12th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE). 1–11.
[31] Jesús A. De Loera, Raymond Hemmecke, Jeremiah Tauzer, and Ruriko Yoshida. 2004. Effective lattice point counting in rational convex polytopes. Journal of Symbolic Computation 38, 4 (2004), 1273–1302.
[32] Kasper Luckow, Corina S. Păsăreanu, Matthew B. Dwyer, Antonio Filieri, and Willem Visser. 2014. Exact and Approximate Probabilistic Symbolic Execution for Nondeterministic Programs. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering (ASE '14). 575–586.
[33] Klaus Mølmer, Yvan Castin, and Jean Dalibard. 1993. Monte Carlo wave-function method in quantum optics. Journal of the Optical Society of America B 10, 3 (Mar 1993), 524–538.
[34] Quoc-Sang Phan, Pasquale Malacaria, Corina S. Păsăreanu, and Marcelo D'Amorim. 2014. Quantifying Information Leaks Using Reliability Analysis. In Proceedings of the 2014 International SPIN Symposium on Model Checking of Software (SPIN 2014). 105–108.
tion for a Binomial Proportion. Statist. Sci. 16, 2 (05 2001), 101–133. [35] G. Ramalingam. 1994. The Undecidability of Aliasing. ACM Trans. Program. Lang.
[9] Cristiano Calcagno, Dino Distefano, Peter W. O’Hearn, and Hongseok Yang. 2011. Syst. 16, 5 (Sept. 1994), 1467–1471.
Compositional Shape Analysis by Means of Bi-Abduction. J. ACM 58, 6, Article [36] D. S. Rosenblum. 2016. The power of probabilistic thinking. In 2016 31st IEEE/ACM
26 (Dec. 2011), 66 pages. International Conference on Automated Software Engineering (ASE). 3–3.
[10] Bihuan Chen, Yang Liu, and Wei Le. 2016. Generating Performance Distributions [37] Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera
via Probabilistic Symbolic Execution. In Proceedings of the 38th International Jaspan. 2018. Lessons from Building Static Analysis Tools at Google. Commun.
Conference on Software Engineering (ICSE ’16). 49–60. ACM 61, 4 (March 2018), 58–66.
[11] Dmitry Chistikov, Rayna Dimitrova, and Rupak Majumdar. 2017. Approximate [38] Sriram Sankaranarayanan, Aleksandar Chakarov, and Sumit Gulwani. 2013. Static
counting in SMT and value estimation for probabilistic programs. Acta Informat- Analysis for Probabilistic Programs: Inferring Whole Program Properties from
ica 54, 8 (01 Dec 2017), 729–764. Finitely Many Paths. In Proceedings of the 34th ACM SIGPLAN Conference on
[12] C. J. Clopper and E. S. Pearson. 1934. The Use of Confidence or Fiducial Limits Programming Language Design and Implementation (PLDI ’13). 447–458.
Illstrated in the Case of the Binomial. Biometrika 26, 4 (12 1934), 404–413. [39] Koushik Sen, Mahesh Viswanathan, and Gul Agha. 2004. Statistical Model Check-
[13] Sundar Dorai-Raj. 2015. R – Binom package. https://cran.r-project.org/web/ ing of Black-Box Probabilistic Systems. In Computer Aided Verification. 202–215.
packages/binom/binom.pdf. Accessed: 2019-06-17. [40] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitry
[14] R. Dum, P. Zoller, and H. Ritsch. 1992. Monte Carlo simulation of the atomic Vyukov. 2012. AddressSanitizer: A Fast Address Sanity Checker. In Proceed-
master equation for spontaneous emission. Physical Review A 45 (Apr 1992), ings of the 2012 USENIX Conference on Annual Technical Conference (USENIX
4879–4887. Issue 7. ATC’12). USENIX Association, Berkeley, CA, USA, 28–28.
[15] A. Filieri, C. S. Pasareanu, and G. Yang. 2015. Quantification of Software Changes [41] Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning:
through Probabilistic Symbolic Execution (N). In Proceedings of the 30th IEEE/ACM From Theory to Algorithms. Cambridge University Press, New York, NY, USA.
International Conference on Automated Software Engineering (ASE’15). 703–708. [42] Leslie Valiant. 2013. Probably Approximately Correct: Nature’s Algorithms for
[16] Antonio Filieri, Corina S. Păsăreanu, and Willem Visser. 2013. Reliability Analysis Learning and Prospering in a Complex World. Basic Books, Inc.
in Symbolic Pathfinder. In Proceedings of the 2013 International Conference on [43] L. G. Valiant. 1984. A Theory of the Learnable. Commun. ACM 27, 11 (Nov. 1984),
Software Engineering (ICSE ’13). 622–631. 1134–1142.
[17] Antonio Filieri, Corina S. Păsăreanu, Willem Visser, and Jaco Geldenhuys. 2014. [44] Willem Visser. 2017. Probabilistic Symbolic Execution: A New Hammer. https:
Statistical Symbolic Execution with Informed Sampling. In Proceedings of the 22Nd //easychair.org/smart-program/LPAR-21/LPAR-21-s3-Visser.pdf Keynote at the
ACM SIGSOFT International Symposium on Foundations of Software Engineering 21st International Conference on Logic for Programming, Artificial Intelligence
(FSE 2014). 437–448. and Reasoning (LPAR-21).
[18] Jaco Geldenhuys, Matthew B. Dwyer, and Willem Visser. 2012. Probabilistic Sym- [45] Website. 2010. Visa: Transactions per Day. https://usa.visa.com/
bolic Execution. In Proceedings of the 2012 International Symposium on Software run-your-business/small-business-tools/retail.html. Accessed: 2019-06-17.
Testing and Analysis (ISSTA 2012). 166–176. [46] Website. 2017. OSS-Fuzz: Five Months Later. https://testing.googleblog.com/
[19] Brian L Hammond, William A Lester, and Peter James Reynolds. 1994. Monte 2017/05/oss-fuzz-five-months-later-and.html. Accessed: 2017-11-13.
Carlo methods in ab initio quantum chemistry. Vol. 1. World Scientific. [47] Website. 2018. Domo: Data Never Sleeps Report 6.0. https://www.domo.com/
[20] JA Hanley and A Lippman-Hand. 1983. If nothing goes wrong, is everything all blog/data-never-sleeps-6/. Accessed: 2018-11-16.
right? Interpreting zero numerators. Journal of the American Medical Association [48] Website. 2018. Facebook: Company Info and Statistics. https://newsroom.fb.com/
249, 13 (1983), 1743–1745. company-info/. Accessed: 2018-11-16.
[21] Mark Harman and Peter O’Hearn. 2018. From Start-ups to Scale-ups: Opportuni- [49] Website. 2018. Netflix Tech Blog: Automated Canary Analysis
ties and Open Problems for Static and Dynamic Program Analysis. Keynote at at Netflix with Kayenta. https://medium.com/netflix-techblog/
the 18th IEEE International Working Conference on Source Code Analysis. automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69. Ac-
[22] T. Hérault, R. Lassaigne, F. Magniette, and S. Peyronnet. 2004. Approximate cessed: 2018-10-13.
Probabilistic Model Checking. In Proceedings of the 5th International Conference [50] Website. 2018. Netflix TechBlog: Edge Load Balancing. https://medium.com/
on Verification, Model Checking and Abstract Interpretation (VMCAI’04). 307–329. netflix-techblog/netflix-edge-load-balancing-695308b5548c. Accessed: 2018-11-
[23] Wassily Hoeffding. 1963. Probability Inequalities for Sums of Bounded Random 16.
Variables. J. Amer. Statist. Assoc. 58, 301 (1963), 13–30.
Technical Report @ Arxiv, Submitted: 12. Nov. 2019, Melbourne Australia Marcel Böhme

[51] Website. 2019. Tinder: Swipes per Day. https://www.gotinder.com/press?locale=en. Accessed: 2019-06-17.
[52] Edwin B. Wilson. 1927. Probable Inference, the Law of Succession, and Statistical Inference. J. Amer. Statist. Assoc. 22, 158 (1927), 209–212.