DECISION CRITERIA, DATA FUSION, AND PREDICTION CALIBRATION:
A BAYESIAN APPROACH
By
Roman Krzysztofowicz
University of Virginia
http://www.faculty.virginia.edu/rk/
October 2009
ABSTRACT
A novel research paradigm, dubbed “the court of miracles of hydrological modeling”, focuses on achieving scientific progress in circumstances which traditionally would be called model failures. Many of the associated modeling issues can be addressed systematically
and coherently within the mathematical framework of Bayesian forecast-decision theory. Five of
them are addressed herein: (i) choosing a criterion function for making rational decisions under uncertainty; (ii) modeling stochastic dependence between variates in order to quantify uncertainty and predict realizations; (iii) fusing data from asymmetric samples to cope with unrepresentativeness of small samples and corruptive effects of outliers; (iv) calibrating probabilistic predictions against a prior distribution; and (v) ordering predictors, or models, in terms of their informativeness for decisions. It is suggested that communication between hydrologists and deciders (planners, engineers, operators of hydrosystems) would benefit if hydrologists adopted, at least on some issues, the perspective of deciders, who must act in a timely and rational manner, and for whom hydrological estimates and predictions are inputs to rational decisions.
Key words: decision making; decision criteria; expected utility; estimation; uncertainty;
Bayesian approach.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 A New Research Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Decision-Theoretic Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2. DECISION CRITERIA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Target-Setting Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Impulse Utility Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Quadratic Difference Opportunity Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Two-Piece Linear Opportunity Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Absolute Difference Opportunity Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Median Versus Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 The Meta-Decision Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3. MODELS OF STOCHASTIC DEPENDENCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4. ATTRIBUTES OF BAYESIAN APPROACH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5. DATA FUSION AND PREDICTION CALIBRATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
6. INFORMATIVENESS OF PREDICTION FOR DECISIONS . . . . . . . . . . . . . . . . . . . . . . . . 24
6.1 Optimal Decision Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.2 Bayes Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.3 Comparison Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.4 Theory of Sufficient Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.5 Informativeness Under Gaussian Likelihood Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.5.1 Sufficiency Characteristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.5.2 Informativeness Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.6 Utilitarian Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7. CLOSURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
APPENDIX A: WEIBULL DISTRIBUTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1. INTRODUCTION
1.1 A New Research Paradigm
A hydrological model, like any mathematical model of a system (Wymore, 1977), is an intellectual construct. That the construct simplifies the reality and, therefore, fails to make perfect predictions is a given. What
remains in question, and should not be hidden, is the magnitude and the frequency of failures.
By focusing on failures rather than successes of hydrological models, the conveners (Vazken Andréassian, Charles Perrin, Eric Parent, and András Bárdossy) of a scientific workshop organized around this theme proposed an experimental research paradigm. Named with a poetic flair, after Victor Hugo, “the court of miracles of hydrological modeling” comprises (i) a collection of cases — catchments, events, and situations that revealed unexpected weaknesses or outright failures of hydrological models, and (ii) a set of objectives for scientific progress in understanding, modeling, and predicting such cases.
This comprehensive, systemic perspective is noteworthy. At any point in time, the science of
hydrology offers state-of-the-art models. While scientific progress towards greater understanding
and improved models continues in an unconstrained fashion, decisions for planning, construction,
and operation of hydrosystems must be made in a timely and rational manner. This requirement places two demands on hydrologists: (i) to provide an honest (a well-calibrated) assessment of the remaining uncertainty, a necessary condition for effective communication between hydrologists and deciders (planners, engineers, operators), and (ii) to appreciate, and if necessary to adopt, the perspective of a decider for whom hydrological estimates and predictions are but inputs to rational decisions.
1.2 Decision-Theoretic Issues
This expository article is written under the motto (again, aptly coined by the conveners):
“There are no hydrological monsters, only decision-making issues”. Indeed, the requirements of
a rational decider for information help to frame the issues for a hydrologist. We address some
fundamental issues and discuss some basic principles, derived from the Bayesian forecast-decision theory:
1. Choosing a criterion function for making rational decisions under conditions of uncertainty for the purpose of calculating estimates or setting targets. (This is a meta-decision problem: a decision about how to decide.)
2. Modeling stochastic dependence between variates for the purpose of quantifying uncertainty and predicting realizations.
3. Fusing data from asymmetric samples of stochastically dependent variates for the purpose of coping with unrepresentativeness of small samples and effects of illusory outliers in the context of estimation and prediction.
4. Calibrating probabilistic predictions against a prior distribution.
5. Ordering predictors, or models, in terms of their informativeness for decisions.
While at first glance these issues may appear disjoint, they can be addressed systematically
and coherently within the mathematical framework of decision theory — a top-down approach
whose ultimate objective is to maximize the expected utility of outcomes to stakeholders at every
step of modeling.
2. DECISION CRITERIA
2.1 Target-Setting Problem
There is a class of decision problems under uncertainty wherein the optimal value of a continuous decision variable a would be set to the realization w of the input variate W if only one knew
that realization at the time the decision must be made. This class includes problems of statistical
estimation (e.g., when a is the estimate of an unknown infiltration coefficient W ) and problems of
setting targets in the context of management, planning, and control (e.g., when a is the height of
a flood levee to be built and W is the maximum flood crest in the next 50 years; or when a is the planned daily capacity of a water delivery system and W is the daily river flow volume).
Suppose a and w are continuous variables and u is a utility function that evaluates outcomes,
such that u(a, w) denotes the utility of outcome resulting from decision a and input w. If in-
put w were known with certainty, then the optimal decision would be a = w, which means that
max_a u(a, w) = u(w, w) for every w. Oftentimes, it is convenient to transform the utility function
into an opportunity loss function l, such that l(a, w) denotes the difference between the utility of
the optimal decision and the utility of a given decision, when w is fixed (DeGroot, 1970):

l(a, w) = u(w, w) − u(a, w). (1)
It follows that the optimal decision a∗ under uncertainty about W can be obtained either by maximizing the expected utility or, equivalently, by minimizing the expected opportunity loss:

U(a∗) = max_a U(a), where U(a) = E[u(a, W)], (2)
L(a∗) = min_a L(a), where L(a) = E[l(a, W)]. (3)
Suppose furthermore that the uncertainty about the input variate W is quantified in terms of
a distribution function G such that for any realization w, one has G(w) = P(W ≤ w), where P denotes the probability; let g denote the corresponding density function.
Two questions arise. (i) Given the preferences of a decider encoded in the utility function u
(or the opportunity loss function l) and the uncertainty about the input variate W quantified by the
distribution function G (or the density function g), what is the optimal decision a∗? (ii) In the absence of well-formalized preferences over possible outcomes (which is often the case in scientific estimation and prediction problems), what form of the utility function u (or the opportunity loss function l) should be adopted?
We shall address these questions by recalling known results from decision theory for four
forms of u (or l) shown in Fig. 1 (DeGroot, 1970; Pratt et al., 1995; Krzysztofowicz, 1990).
2.2 Impulse Utility Function
Suppose decision a that perfectly estimates input w results in infinite utility, whereas decision a that mis-estimates input w results in zero utility, regardless of the direction and magnitude of mis-estimation (Fig. 1(a)). Such preferences are encoded in the impulse utility function u(a, w) = δ(w − a), where δ denotes the Dirac delta function. Each feasible decision a is now evaluated in terms of the expected utility
U(a) = E[u(a, W)]
     = ∫_{−∞}^{∞} δ(w − a) g(w) dw
     = g(a), (4)
and the optimal decision a∗ is found as the maximizer:

U(a∗) = max_a U(a) = max_a g(a); (5)
that is, the optimal decision a∗ is a point at which the density function g attains the maximum.
Such a point a∗ is known as the mode of variate W. (It may be noted that (5) is akin to the maximum likelihood criterion for parameter estimation.)
To illustrate this solution, Fig. 2(a) shows six Weibull density functions (see Appendix A for
the formula) with fixed scale parameter α and shift parameter η, and different shape parameter β
values, which result in different values of the mode a∗ (see Table 1).
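As a quick numerical companion (an illustration, not part of the original exposition), the following Python sketch computes the modes of the Weibull densities of Fig. 2; the values α = 1 and η = 0 follow the figure, and the closed-form mode derives from the Weibull density of Appendix A.

```python
# Sketch (illustrative): modes of the Weibull densities of Fig. 2, with
# alpha = 1, eta = 0, and varying shape beta. For beta > 1 the density
# peaks at eta + alpha*((beta - 1)/beta)**(1/beta); for beta <= 1 it is
# monotone decreasing, so the mode sits at eta.
def weibull_mode(alpha, beta, eta=0.0):
    if beta > 1.0:
        return eta + alpha * ((beta - 1.0) / beta) ** (1.0 / beta)
    return eta

for beta in (0.5, 1.0, 1.5, 2.0, 4.0, 6.0):
    print(f"beta = {beta}: mode a* = {weibull_mode(1.0, beta):.3f}")
```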
Should a hydrologist adopt the mode of W as the preferred estimator? The above formulation
offers a way of rationalizing such a meta-decision: If the hydrologist believes that only a single
point — the perfect estimate of the unknown w — has any positive utility (which is infinite relative
to the zero utility of every non-perfect estimate), then the answer is affirmative; otherwise, other criterion functions should be considered.
(To decide whether or not the impulse utility function is a suitable criterion for scientific
decisions, it may help to consider the problem of targeting a pistol in a duel. Here, w is the
position of the adversary on the horizontal axis, and a is the aiming point. The outcome of a = w
is the dead adversary, whereas the outcome of a ≠ w is a chance of ending up in the morgue oneself, should the adversary’s targeting turn out to be perfect. If one assigns infinite utility
to one’s own life relative to the adversary’s life, as a matter of rationality, or survival instinct, then
the impulse utility function is the exact mathematical model of this preference.)
2.3 Quadratic Difference Opportunity Loss Function
The opportunity loss is proportional to the quadratic difference between decision a and input w (Fig. 1(b)):

l(a, w) = λ(a − w)², (6)
where λ is the marginal opportunity loss from mis-estimation of the input, regardless of the direc-
tion of mis-estimation; it may be in monetary units [$/(unit of w)²] or may represent subjective
valuation.
Each feasible decision a is now evaluated in terms of the expected opportunity loss

L(a) = E[l(a, W)] = λE[(a − W)²], (7)

and the optimal decision a∗ is found as the minimizer:

L(a∗) = min_a L(a) = λ Var(W), (8)

where a∗ = E(W). That is, the optimal decision equals the mean of variate W; Table 1 shows
examples. (It may be noted that (7) is akin to the least squares criterion for parameter estimation.)
Should a hydrologist adopt the mean of W as the preferred estimator? Again, the above
formulation offers a way of rationalizing such a meta-decision. The answer is affirmative if the
hydrologist is indifferent with respect to the direction of mis-estimation and agrees that the op-
portunity loss increases quadratically with the magnitude of mis-estimation. The hydrologist may
also consider the implication of this preference: detailed quantification of uncertainty about vari-
ate W in terms of its distribution function G is irrelevant to decision making; the only relevant characteristic of G is its mean E(W).
2.4 Two-Piece Linear Opportunity Loss Function
The opportunity loss is proportional to the absolute difference between decision a and input w, with the proportionality constant depending on the sign of the difference (Fig. 1(c)):

l(a, w) = λo(a − w) if w ≤ a; l(a, w) = λu(w − a) if w > a, (9)

where λo is the marginal opportunity loss from over-estimation of the input, and λu is the marginal
opportunity loss from under-estimation of the input; each constant may be in monetary units [$/(unit of w)] or may represent subjective valuation.
Each feasible decision a is now evaluated in terms of the expected opportunity loss
L(a) = E[l(a, W)]
     = (λu + λo) a G(a) − λu a + λu E(W) − (λu + λo) ∫_{−∞}^{a} w g(w) dw, (10)
whose minimizer a∗ is such that G(a∗) = 1/[1 + λo/λu]. Because W is a continuous variate, its distribution function G has the inverse G−1, called the quantile function of W. Hence,

a∗ = G−1(1/[1 + λo/λu]). (11)
That is, the optimal decision a∗ equals the quantile of variate W corresponding to the probability 1/[1 + λo/λu]. This solution carries two implications.
First, in order to find the optimal decision a∗ , the hydrologist needs to know the distribution
function G of variate W and the ratio λo /λu of the marginal opportunity losses. Table 2 shows
the implications: for instance, when the hydrologist judges the marginal opportunity loss from
under-estimation to be 3 times as large as that from over-estimation, G(a∗) = 3/4; that is, the optimal decision is the 0.75-probability quantile (the third quartile) of W.
Second, associated with the optimal decision a∗ are the probability of over-estimation of the
input, P (W < a∗ ) = G(a∗ ), and the probability of under-estimation of the input, P (W > a∗ ) =
1 − G(a∗ ). For instance, when a∗ is the planned daily capacity of a water delivery system, and
W is the daily river flow volume, then G(a∗ ) represents the probability of shortage. Note that this
is the “optimal” probability of shortage: given the ratio λo /λu of the marginal opportunity losses
and the distribution function G of the daily river flow volume W, it is optimal to incur a shortage with probability G(a∗).
Should a hydrologist adopt a quantile of W as the preferred estimator? Again, the above
formulation offers a way of rationalizing such a meta-decision. Inasmuch as the ratio λo /λu may
take on different values for different problems and different deciders, any quantile of W under G
may be optimal in some situation. In that sense, the two-piece linear opportunity loss (9) is the
most general among the three criterion functions considered herein — general in that it allows for an asymmetry in the valuation of over-estimation and under-estimation.
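To make the quantile solution concrete, here is a short Python sketch (an illustration, not the paper’s computation) that evaluates a∗ = G−1(1/[1 + λo/λu]) for a few loss ratios; the Gaussian G with M = 29.93 and S = 9.89 is borrowed from the prior distribution of Fig. 3 purely as an assumed example.

```python
# Sketch: the optimal decision under the two-piece linear loss (9) is the
# quantile a* = G^{-1}(1/(1 + lo/lu)), eq. (11). G is assumed Gaussian here.
from scipy.stats import norm

G = norm(loc=29.93, scale=9.89)    # prior N(M, S^2) from Fig. 3 (assumed)
for ratio in (3.0, 1.0, 1.0 / 3.0):  # ratio = lambda_o / lambda_u
    p = 1.0 / (1.0 + ratio)          # optimal non-exceedance probability G(a*)
    print(f"lo/lu = {ratio:.3f}: G(a*) = {p:.2f}, a* = {G.ppf(p):.1f}")
```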
2.5 Absolute Difference Opportunity Loss Function
A special case of the two-piece linear opportunity loss function (9) arises when the marginal opportunity losses are identical: λo = λu = λ. The opportunity loss function is now symmetric (Fig. 1(d)):

l(a, w) = λ(a − w) if w ≤ a; l(a, w) = λ(w − a) if w > a. (12)

In other words, the opportunity loss is proportional to the absolute difference between decision a (the estimate) and realization w of variate W:

l(a, w) = λ|a − w|. (13)

From (12) it follows that the optimal decision a∗ satisfies G(a∗) = 1/[1 + 1] = 1/2; that is, a∗ = G−1(1/2), the median of variate W.
2.6 Median Versus Mean
When the opportunity losses from over-estimation and under-estimation are symmetric, the hydrologist might ask which of the two functions, (6) or (13), better reflects his preferences. In the absence of a firm answer, a second meta-consideration
may be the robustness of the optimal decision to outliers. As is well known, the sample estimate
of the mean E(W ) is sensitive to outliers, whereas the sample estimate of the median G−1 (1/2) is
not. This fact might favor the median. A third meta-consideration, especially when the estimate
constitutes a prediction of W , is the communication of uncertainty to users. For this purpose, the
median of W (not the mean) is the preferred estimate because it conveys at least some rudimentary
assessment of uncertainty (the 50% chance of the actual realization being either below or above the estimate).
2.7 The Meta-Decision Problem
This section has formulated a class of estimation, or target-setting, problems. Four models have been presented, each with a special criterion function and
an analytic solution for the optimal decision. The summary (Table 3) is this: Each of the four
estimators (mode, mean, median, a quantile) is optimal with respect to some criterion function.
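The four optimal estimators of Table 3 can be computed side by side; the sketch below does so in Python for a Weibull variate with β = 2, α = 1, η = 0 (the parameter choice mirrors Fig. 2 and is otherwise an assumption of this illustration).

```python
# Sketch: the four optimal estimators of Table 3 for a Weibull variate W
# with shape beta = 2, scale alpha = 1, shift eta = 0 (assumed example).
from scipy.stats import weibull_min

beta, alpha, eta = 2.0, 1.0, 0.0
W = weibull_min(c=beta, loc=eta, scale=alpha)

mode = eta + alpha * ((beta - 1.0) / beta) ** (1.0 / beta)  # impulse utility
mean = W.mean()                        # quadratic difference loss
median = W.ppf(0.5)                    # absolute difference loss
quantile = W.ppf(0.75)                 # two-piece linear loss, lo/lu = 1/3

print(f"mode {mode:.3f}, mean {mean:.3f}, "
      f"median {median:.3f}, 0.75-quantile {quantile:.3f}")
```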
First, the meta-decision problem can be framed thusly: the choice of an estimator for a
hydrological predictand is tantamount to accepting (if only implicitly) the corresponding criterion
function as representative of a hydrologist’s preferences over outcomes; and vice versa, the choice
(explicit) of the criterion function prescribes the type of estimator that is optimal.
Second, the apparent propensity of hydrologists (and scientists, in general) to prefer the mean
(as the estimator) and to deplore the bias (any deviation from the mean) finds no justification in
the theory of rational decisions as applied to real-world problems: ample empirical evidence from
cost-benefit analyses and preference-elicitation studies indicates that opportunity loss functions are rarely symmetric, let alone quadratic.
3. MODELS OF STOCHASTIC DEPENDENCE
At the heart of model building in any science, including hydrology, is the task of identifying
and representing relations between variables. When uncertainties spoil the idealized deterministic
relations, the focus should shift to representing stochastic dependence structures between vari-
ates. The most general mathematical model of stochastic dependence is a joint density function π.
Herein, we focus on the simplest situation with two continuous variates, interpreted as a predictand W and a predictor X. The two factorizations of the joint density function π (in terms of the families of the conditional density functions φ, f and the marginal density functions κ, g) are equivalent, provided they satisfy the total probability law

κ(x) = ∫_{−∞}^{∞} f(x|w) g(w) dw, (16)

and Bayes theorem

φ(w|x) = f(x|w) g(w) / κ(x). (17)

The Bayesian approach structures the modeling task as follows: first, model and estimate g and f; then derive κ and φ. Thereby a complete
and coherent model of stochastic dependence can be built, which utilizes all available data.
The objective of this exposition is to explain the Bayesian approach to model building in the
context of two problems: (i) fusing data from a large sample of variate W and from a small sample
of variate X in order to improve the model for the density function κ of variate X, and (ii) making
a probabilistic prediction in the form of a conditional density function φ(·|x) of variate W , given
realization x of variate X.
Many hydrologists seem to be unaware of how strong the case for a Bayesian approach to
modeling really is. For this reason, we shall juxtapose it with a common statistical approach
which, in essence, focuses on building directly a model of the form h(w|x). In the most general
form, h(·|x) is a conditional density function of variate W , given realization x of variate X. It will
be shown that h ≠ φ. And it will be argued that in the court of miracles, the Bayesian approach would carry the day on account of allowing the hydrologist (i) to use all available data, and (ii)
to calibrate probabilistic predictions against a specified standard. Both of these unique advantages
of the Bayesian approach should translate into a more robust model — the first line of defense against model failures.
Suppose the available data are organized into two samples. (i) The joint sample of (X, W )
consists of N realizations:

{(x(n), w(n)) : n = 1, ..., N}. (18)
(ii) The prior sample of W (which may also be called a climatic sample of W when it has been
collected over a sufficient number of years to represent the climate) consists of M realizations:

{w(m) : m = 1, ..., M}. (19)
The two samples are asymmetric, N < M, when the prior sample includes M − N additional realizations of W.
Here is a typical example (Krzysztofowicz & Watada, 1986). A long climatic record, say
from M years, exists of runoff volume w measured at the catchment outlet during the snowmelt
season, but the record of snowpack depth x measurements is recent and short, say from N years,
N < M. With the objective of predicting the seasonal snowmelt runoff volume W , given the
snowpack depth X = x on some fixed date during the winter, the hydrologist should wish not
to discard the M − N years of runoff volume measurements. Yet this he must do if he takes a common statistical approach.
3.3 Common Approach
This section reviews the simple linear regression (DeGroot, 1989) that yields a model of the
form h(w|x) and that sets the stage for the Bayesian model.
3.3.1 Model. Under the normal-linear model, the stochastic dependence between the pre-
dictand W and the predictor X is represented thusly: conditional on a realization of the predictor X = x, for any x (−∞ < x < ∞),
W = cx + d + Ξ, (20)
where c, d are parameters, and Ξ is a residual variate being stochastically independent of X, and
having a Gaussian density function with E(Ξ) = 0 and Var(Ξ) = τ². It then follows that the
conditional density function h(·|x) of W , given any realization X = x, is Gaussian with the mean
and variance
E(W |X = x) = cx + d, (21a)
Var(W|X = x) = τ². (21b)
3.3.2 Assumptions. Inasmuch as the objective is to obtain a model for the conditional density function h(·|x) of W, given X = x, the model (20) rests on assumptions about the residuals ξ(n) = w(n) − cx(n) − d: (i) the residuals plotted against the predictor realizations x(n) for all n = 1, ..., N must show no trend; (ii) the Gaussian distribution function with mean zero and variance τ² must fit well the empirical distribution function constructed from the sample {ξ(n) : n = 1, ..., N}.
3.3.3 Parameter Estimates. The parameters c, d, τ are estimated from the joint sample
(18). The estimates of c, d are usually obtained via the least squares method; the estimate of τ² for our purpose must be the sample variance (1/N) Σ ξ²(n). However, the same parameter values
can be obtained via the sample estimates of moments (DeGroot, 1989). Let mX , mW denote the
sample means and s²X, s²W denote the sample variances of X and W, respectively; let r denote the sample correlation coefficient of X and W. Then

c = r sW/sX, d = mW − c mX, (22a)
τ² = s²W (1 − r²). (22b)
These relations will prove handy in explaining the distinctions between the common approach and the Bayesian approach.
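A minimal Python sketch of the moment-based estimates (22), as reconstructed above; the arrays x and w stand for a hypothetical joint sample.

```python
import numpy as np

# Sketch: the moment-based estimates (22) of the common-approach
# parameters; x and w are hypothetical joint-sample arrays of length N.
def common_parameters(x, w):
    mX, mW = x.mean(), w.mean()
    sX, sW = x.std(), w.std()        # 1/N convention, as in Section 3.3.3
    r = np.corrcoef(x, w)[0, 1]      # sample correlation coefficient
    c = r * sW / sX                  # slope, eq. (22a)
    d = mW - c * mX                  # intercept, eq. (22a)
    tau2 = sW**2 * (1.0 - r**2)      # residual variance, eq. (22b)
    return c, d, tau2
```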
3.4 Bayesian Approach
When the joint sample is small, the estimates of the regression parameters c, d, τ may be
erroneous. Whereas Bayesian statistical theory offers methods for quantifying uncertainty about
the regression parameters, we focus on a Bayesian method for improving the estimates by ex-
tracting information from the prior sample, which contains realizations of W not included in the
joint sample (Krzysztofowicz, 1983; Krzysztofowicz & Watada, 1986; Krzysztofowicz & Reese,
1991).
3.4.1 Prior Density Function. Suppose the prior (marginal) density function g of predictand W is Gaussian with mean M and variance S²; for short,

W ∼ N(M, S²). (23)

3.4.2 Likelihood Function. The likelihood function f(·|w) is the conditional density function of predictor X, given a hypothesized realization of predictand W = w, for any x (−∞ < x < ∞). To obtain a model for f, the role of W and X in (20) is reversed: conditional on the hypothesis that the realization of predictand is W = w, for any w (−∞ < w < ∞),
X = aw + b + Θ, (24)
where a, b are parameters, and Θ is a residual variate being stochastically independent of W , and
having a Gaussian density function with E(Θ) = 0 and Var(Θ) = σ². It then follows that the
conditional density function f (·|w) of X, given any hypothesis W = w, is Gaussian with the mean
and variance
E(X|W = w) = aw + b, (25a)
Var(X|W = w) = σ². (25b)
Equation (25a) specifies the linear regression of X on W. The assumptions behind (24) parallel those behind (20), listed in Section 3.3.2.
3.4.3 Parameter Estimates. The likelihood parameters a, b, σ are estimated from the
joint sample (18) via the maximum likelihood method, which for a and b is equivalent to the
least squares method, and for σ² is equivalent to calculating the sample variance (1/N) Σ θ²(n).
However, these parameter values can also be obtained via the same sample estimates of moments
a = r sX/sW, b = mX − a mW, (26a)
σ² = a² s²W (r⁻² − 1). (26b)
3.4.4 Expected Density Function. When g and f are inserted into the total probability law
(16), one obtains the expected density function κ of predictor X. It is a Gaussian density function
under which
E(X) = aM + b, (27a)
Var(X) = a²S² + σ². (27b)
3.4.5 Posterior Density Function. When g, f, and κ are inserted into Bayes theorem (17),
one obtains the posterior density function φ(·|x) of predictand W, conditional on a realization of the predictor X = x. It is a Gaussian density function under which
E(W|X = x) = Ax + B, (28a)
Var(W|X = x) = T², (28b)
where
A = aS²/(a²S² + σ²), B = (Mσ² − abS²)/(a²S² + σ²), (29a)
T² = σ²S²/(a²S² + σ²). (29b)
In addition, the posterior quantiles of W are specified by the equation

wp = Ax + B + T Q−1(p), (30)

where Q−1 is the inverse of the standard normal distribution function, 0 < p < 1, and wp is the p-probability posterior quantile of W. From (30) it follows that the posterior median, w0.5 = Ax + B, equals the posterior mean (28a), and both are identical to the posterior mode (the posterior density function being Gaussian, hence symmetric and unimodal).
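The chain (29)–(30) is short enough to script. The following Python sketch computes the posterior parameters from the likelihood and prior parameters; the numerical values are read off the tutorial example of Fig. 4 and serve only to check the implementation against the reported A = 0.64, B = 10.34, T = 6.18.

```python
from scipy.stats import norm

# Sketch of the Bayesian Gaussian processor, eqs. (29)-(30). The numbers
# below are read off Fig. 4 and are assumptions of this illustration.
def posterior_parameters(a, b, sigma, M, S):
    den = a**2 * S**2 + sigma**2
    A = a * S**2 / den                        # eq. (29a)
    B = (M * sigma**2 - a * b * S**2) / den   # eq. (29a)
    T = (sigma**2 * S**2 / den) ** 0.5        # eq. (29b)
    return A, B, T

A, B, T = posterior_parameters(a=0.96, b=2.13, sigma=7.59, M=29.93, S=9.89)
print(A, B, T)                     # approx. 0.64, 10.34, 6.18, as in Fig. 4
w90 = A * 30.0 + B + T * norm.ppf(0.90)   # posterior 0.90-quantile, eq. (30)
```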
4. ATTRIBUTES OF BAYESIAN APPROACH
The most frequent question triggered by the Bayesian approach is this: What is the difference
between model (28) and model (21)? Framing it mathematically, the question becomes: under what conditions do the following equalities hold:

A = c, B = d, T² = τ². (31)
After inserting (26) into (29), one can obtain the right side of (22) if, and only if,
M = mW, S² = s²W. (32)
Thus the two approaches yield the same prediction, φ = h, if and only if the prior mean M and the prior variance S² of predictand W are identical with the sample mean mW and the sample variance s²W from the joint sample (18). If at least one equality in (32) is violated, then at least one
equality in (31) is violated too; hence, φ ≠ h. When M and S come from a prior sample (19),
which is larger than the joint sample (18), the equalities in (32) are unlikely to hold. Whereas the
common approach ignores (M, S), the Bayesian approach fuses it with (mW , sW ). Consequently,
the posterior parameters A, B, T do not correspond to any sample — they owe their being to the
Bayesian theory. Practical advantages of this data fusion are illustrated in Section 5.
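The necessity of (32) for (31) can be checked numerically. In the Python sketch below, hypothetical joint-sample moments are fused with a prior whose moments satisfy (32); the posterior parameters then collapse to the regression parameters, as the theory asserts.

```python
# Sketch: a numerical check that condition (32) implies (31). Hypothetical
# joint-sample moments; the prior moments are set equal to them, so the
# posterior parameters (29) must collapse to the regression parameters (22).
mW, sW, mX, sX, r = 30.0, 10.0, 28.0, 12.0, 0.8

a = r * sX / sW                      # eq. (26a)
b = mX - a * mW                      # eq. (26a)
sigma2 = sX**2 * (1.0 - r**2)        # eq. (26b), rewritten
c = r * sW / sX                      # eq. (22a)
d = mW - c * mX                      # eq. (22a)
tau2 = sW**2 * (1.0 - r**2)          # eq. (22b)

M, S2 = mW, sW**2                    # condition (32) imposed
den = a**2 * S2 + sigma2
A, B = a * S2 / den, (M * sigma2 - a * b * S2) / den
T2 = sigma2 * S2 / den
print(abs(A - c) < 1e-9, abs(B - d) < 1e-9, abs(T2 - tau2) < 1e-9)  # True x3
```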
One common misconception is not to recognize that (32) is the necessary condition for (31).
For instance, Todini (2008) claims erroneously that he “improved” our Bayesian forecasting
system (Krzysztofowicz & Kelly, 2000b) by estimating π̄(w, x) and κ̄(x) directly from a joint
sample and then finding h(w|x) = π̄(w, x)/κ̄(x). This argument misses, of course, the essence of the Bayesian approach: the fusion of information from the prior sample with information from the joint sample.
Another common misconception is to demand that probabilistic predictions, in the form
φ(·|x), output from the Bayesian processor be validated empirically on some joint sample. This is
a folly — a mental carryover from ad-hoc approaches to model building. For as noted in Section
4.1, the posterior parameters A, B, T do not correspond to any sample — they are theoretic con-
structs obtained by fusing estimates from two asymmetric samples that together contain all data
available at the time of model building. Only after one accumulates a new joint sample at least as large as the prior sample (of size M or larger) will one be able to perform a reasonable empirical validation.
5. DATA FUSION AND PREDICTION CALIBRATION
The hydrologist can exploit the advantages of the Bayesian fusion of information to solve two problems: (i) to improve the estimate of a marginal distribution function from a small sample, and (ii) to produce a well-calibrated probabilistic forecast. These two problems and the workings of the Bayesian processor are illustrated via
tutorial examples. There is a prior sample of size M = 15, and four joint samples created as
follows. (i) Sample A of size N = 10 contains realizations of W which are representative of the
realizations in the prior sample, and hence fairly representative of the prior distribution function of
W . Each of the remaining joint samples has size N = 5 and is unrepresentative. (ii) Sample L
comes from the left tail. (iii) Sample R comes from the right tail. (iv) Sample O includes the
same realizations as sample L, except one realization of X whose value was changed from 23 to a much larger value, thereby creating an (illusory) outlier in the opposite tail.
Figure 3 shows the empirical distribution function of W and the parametric distribution func-
tion G, which is Gaussian with mean M and variance S², for short N(M, S²); both are estimated from the prior sample.
Figures 4–7 show results of the common approach which estimates the regression of W on
X (graph a) and constructs the conditional distribution function H(·|x) of W (graph d); they can
be compared with results of the Bayesian approach which estimates the regression of X on W
(graph b), and derives the expected distribution function K of X (graph c), as well as the posterior distribution function Φ(·|x) of W (graph d).
Given only a marginal sample of X, one can estimate mX, s²X to obtain a distribution function K̄ of X, which in our example is N(mX, s²X). Obviously, when the sample is unrepresentative,
which is likely when it is small, K̄ is erroneous. Can it be improved in some way? Enter the Bayesian processor. If X is stochastically dependent on another variate, say W, whose distribution function G has been estimated from a large and repre-
sentative sample (the prior sample), and there exists a joint sample of X and W , however short,
then the Bayesian approach allows one to improve the initial estimate K̄ of the distribution function
of X. The improved distribution function K is N(E(X), Var(X)), where E(X) and Var(X) are specified by (27) in terms of the likelihood parameters (a, b, σ²) and the prior parameters (M, S²).
Graph (c) in Figs. 4–7 compares K with K̄. When the marginal sample of X is fairly
representative (Fig. 4), the difference between K and K̄ is minor; thus for the sake of discussion,
let us treat this K as the “true” distribution function K ∗ of X. When the marginal sample of X
is unrepresentative (Figs. 5, 6), function K̄ estimated directly from the sample follows closely the
empirical distribution function, which is located near the tail from which the sample was drawn.
Relative to K̄, function K obtained via the Bayesian processor is shifted correctly towards K ∗ ;
this type of adjustment is typical. In the example with an outlier (Fig. 7), function K̄ interpolates
between the four points of the empirical distribution function in the left tail and the one point in
the right tail. On the other hand, function K recovers, almost, the true function K ∗ .
How to explain the improvement that K offers over K̄ when the joint sample is unrepresen-
tative and has size N = 5 only? An improvement is possible because the Bayesian processor can
recognize an unrepresentative joint sample: it applies the total probability law to check the coher-
ence of estimates mW , sW from the joint sample and the estimates M, S from the prior sample.
Insertion of (26) into (27) yields

E(X) = mX + r (sX/sW)(M − mW), (33a)
Var(X) = [r²(S²/s²W − 1) + 1] s²X. (33b)
When the joint sample is representative, mW = M and s²W = S²; consequently, E(X) = mX and Var(X) = s²X. When the joint sample is unrepresentative, the Bayesian processor can recognize whether it comes from the left tail of the prior distribution function (mW < M) or the right tail (mW > M), and whether it is underdispersed (s²W < S²) or overdispersed (s²W > S²).
Then it revises the initial estimates mX, s²X to make them cohere to the prior parameters M, S², according to the strength of stochastic dependence between X and W (quantified through the
correlation coefficient r). Broadly speaking, the Bayesian processor takes the initial estimate K̄ of
the distribution function of variate X and recalibrates it against the specified distribution function G of variate W.
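A one-screen Python check of this revision, using the sample-L parameters read off Fig. 5 (assumed here, not recomputed from data): equations (27) shift the direct estimate K̄ = N(18.8, 6.11²) toward the expected distribution K = N(22.3, 6.7²) reported in graph (c).

```python
# Sketch: revision of K-bar via eqs. (27), with the sample-L parameters of
# Fig. 5 taken as given (assumed values for illustration).
a, b, sigma = 0.34, 12.13, 5.78      # likelihood parameters (sample L)
M, S = 29.93, 9.89                   # prior parameters of W (Fig. 3)

EX = a * M + b                           # eq. (27a)
SDX = (a**2 * S**2 + sigma**2) ** 0.5    # eq. (27b)
print(EX, SDX)  # approx. 22.3 and 6.7, versus mX = 18.8, sX = 6.11 of K-bar
```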
The predictions of W based on X = x obtained via the common approach and via the
Bayesian approach are compared in Figs. 4–7 in two graphs: graph (a) compares the regression
cx + d output from (21) with the posterior mean Ax + B output from (28); graph (d) compares the
conditional distribution function H(·|x) of W, which is N(cx + d, τ²), with the posterior distribution function Φ(·|x) of W, which is N(Ax + B, T²). Again, the difference is minor when the joint
sample is fairly representative, albeit small (Fig. 4). It may be drastic when the joint sample is un-
representative (Figs. 5, 6). In both cases, regression cx + d has incorrect slope whereas regression
Ax + B recovers, almost, the true slope, and only the intercept remains in error. In effect, the two
regressions form scissors: the difference between them is the smallest in the tail from which the
joint sample was drawn and the largest in the opposite tail. The last case (Fig. 7) shows how an
(illusory) outlier drawn from the opposite tail than the rest of the sample can help (by chance) the
common approach by effecting the correct slope. As a result, the difference between H(·|x) and Φ(·|x) remains small in this case.
In general, any difference between the predictions can be explained thusly: The distribution function H(·|x) reflects only the information gleaned from the joint sample alone. On the other hand, the distribution function Φ(·|x) fuses
information regarding stochastic dependence between X and W extracted from the joint sample
with information regarding natural uncertainty about W (viz, natural variability of W ) extracted
from the prior sample. The fusion of information takes the form of a revision: the prior distribution
function G is revised to the extent justified by the degree of dependence between X and W, as quantified by the likelihood function f.
The concept of calibration of probabilistic predictions comes from Bayesian theory (DeGroot & Fienberg, 1982, 1983; Vardeman & Meeden, 1983). It arises from the view-
point of a user, a decider, or an external observer, who asks the question: Can the probability of
an event, say {W ≤ w}, specified by the distribution function Φ(·|x) be taken at its face value?
Framed mathematically, with P denoting the probability from a user’s point of view, the question
is whether or not the following equality holds for every w and every x:

P(W ≤ w | X = x) = Φ(w|x). (34)
The treatment of this question is beyond the scope of this article, and also beyond the current
level of verification methods in hydrology and meteorology. (For a more extensive discussion,
see Krzysztofowicz & Sigrest, 1999, Appendix A). However, a simpler framing is easily treatable
and is this: In a large number of predictions of event {W ≤ w}, each prediction using a different
realization X = x and, therefore, assigning a different probability Φ(w|x) to the event, does
the mean of the probabilities converge, in the limit, to the relative frequency of event {W ≤ w}? When the prior distribution function G of W is known (and is estimated from all available data), the question can be answered during model building.
Definition 1 (Calibration). A probabilistic prediction in the form of a conditional density function drawn from the family {φ(·|x) : all x} is said to
be well calibrated if
E[φ(·|X)] = g. (35)
In other words, after a large number of predictions is made, the mean of the conditional density functions tends to the prior density function g. The Bayesian processor satisfies (35) by construction; it is self-calibrated.
Proof.

E[φ(w|X)] = ∫_{−∞}^{∞} φ(w|x) κ(x) dx
          = ∫_{−∞}^{∞} [f(x|w) g(w) / κ(x)] κ(x) dx
          = g(w) ∫_{−∞}^{∞} f(x|w) dx
          = g(w). (36)
It follows that probabilistic predictions in the form of the posterior distribution functions Φ(·|x) produced by the Bayesian approach (23)–(30) are guaranteed to be well calibrated against
the prior distribution function G (the calibration standard). On the other hand, probabilistic pre-
dictions in the form of the conditional distribution functions H(·|x) produced by the common
approach (20)–(22) are not well calibrated against G unless (32) holds.
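Self-calibration (35) lends itself to a Monte Carlo check. The Python sketch below simulates the Gaussian model with hypothetical parameters, forms the posterior probability Φ(w|X) for each simulated realization of X, and verifies that the mean of these probabilities approximates G(w).

```python
import numpy as np
from scipy.stats import norm

# Sketch: Monte Carlo check of self-calibration (35)-(36) for the Gaussian
# processor; all parameter values are hypothetical.
rng = np.random.default_rng(1)
M, S, a, b, sigma = 30.0, 10.0, 0.9, 2.0, 6.0

w_true = rng.normal(M, S, size=200_000)                        # W drawn from g
x = a * w_true + b + rng.normal(0.0, sigma, size=w_true.size)  # model (24)

den = a**2 * S**2 + sigma**2                                   # eqs. (29)
A, B = a * S**2 / den, (M * sigma**2 - a * b * S**2) / den
T = (sigma**2 * S**2 / den) ** 0.5

w = 35.0                                                # a fixed threshold
mean_prob = norm.cdf(w, loc=A * x + B, scale=T).mean()  # E[Phi(w|X)]
print(mean_prob, norm.cdf(w, loc=M, scale=S))           # both approx. G(w)
```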
The above conclusion contains answers to the question raised in Section 4.2. First, there is no need to validate calibration empirically on a joint sample because the Bayesian processor self-calibrates. Second, the need for validation of calibration felt by those who follow a common approach to the prediction problem is well-meant, but its result is predictable: they will
find (empirically) their predictions to be well calibrated if, and only if, (32) holds. A Bayesian
hydrologist can easily check this fact beforehand, during the model building process.
Of course, the above conclusion presumes that the models for f and h are valid in that they
satisfy the assumptions listed in Section 3.3.2 and that the model for g is valid as well. That is
why validation of each component model, g and f, on all available samples is an essential step in the model building process.
(For an extension of these principles to Bayesian forecasting systems processing outputs from deterministic hydrological models, see Krzysztofowicz & Kelly (2000b) and Krzysztofowicz (2002).)
6. INFORMATIVENESS OF PREDICTION FOR DECISIONS
Our final objective is to couple the prediction problem (Sections 3–5) with the decision prob-
lem (Section 2). Two basic questions arise: (i) How to make an optimal decision? (ii) How to
evaluate a predictor? Bayesian decision theory offers a mathematical framework for addressing
both questions.
6.1 Optimal Decision Functions
For the decision problem studied in Section 2, the uncertainty about the input variate W is now quantified in terms of the posterior density function φ(·|x) output from the Bayesian processor. Each feasible decision a is now evaluated in terms of the posterior expected utility

U(a|x) = E[u(a, W)|X = x]
       = ∫_{−∞}^{∞} u(a, w) φ(w|x) dw, (37)
and the optimal decision α∗(x), which depends on x, is that which yields the maximum posterior expected utility:

U∗(x) = U(α∗(x)|x) = max_a U(a|x). (38)
When this procedure is executed for every x, one obtains the optimal decision function α∗ .
The analytic solutions derived in Section 2 under four special criterion functions generalize
directly: each of the four estimators (Table 3) is now preceded by the adjective conditional (con-
ditional mode, conditional mean, conditional median, a conditional quantile). When φ(·|x) comes from the Gaussian processor (28),

α∗(x) = Ax + B (39)
is the conditional mode, the conditional mean, and the conditional median, whereas

α∗(x) = Ax + B + T Q−1(p) (40)

is the p-probability conditional quantile; setting p = 1/[1 + λo/λu] yields the estimator α∗(x) that is optimal under the two-piece linear opportunity loss function (9).
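The decision functions (39)–(40) reduce to a few lines of code. The sketch below uses the posterior parameters of Fig. 4 as assumed inputs; Q−1 is the standard normal quantile function.

```python
from scipy.stats import norm

# Sketch: optimal decision functions (39)-(40); the posterior parameters
# of Fig. 4 are assumed inputs.
A, B, T = 0.64, 10.34, 6.18

def optimal_decision(x, ratio=None):
    """ratio = lambda_o / lambda_u for the two-piece linear loss;
    ratio=None gives (39), optimal under the impulse, quadratic,
    and absolute-difference criteria alike."""
    if ratio is None:
        return A * x + B
    p = 1.0 / (1.0 + ratio)
    return A * x + B + T * norm.ppf(p)   # conditional quantile, eq. (40)

print(optimal_decision(30.0), optimal_decision(30.0, ratio=1.0 / 3.0))
```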
6.2 Bayes Utility
The overall evaluation, independent of the realization x of the predictor, is obtained by taking the expectation of (38). Called the integrated maximum posterior expected utility (Bayes utility, for short), it is
UX = E[U∗(X)]
   = ∫_{−∞}^{∞} U∗(x) κ(x) dx
   = ∫_{−∞}^{∞} [max_a ∫_{−∞}^{∞} u(a, w) φ(w|x) dw] κ(x) dx
   = ∫_{−∞}^{∞} [max_a ∫_{−∞}^{∞} u(a, w) f(x|w) g(w) dw] dx, (41)
where the fourth line is obtained from the third line by replacing φ(w|x) with (17).
The Bayes utility UX constitutes a measure that evaluates the performance of a prediction-
decision system. Expression (41) reveals that this evaluation depends upon the likelihood function
f which characterizes the predictor, the prior density function g which characterizes the natural
uncertainty about the predictand, and the utility function u which encodes the decider’s preferences
over the outcomes. Thus the evaluation of a particular predictor depends upon elements g and u
of the decision problem; we shall recognize this explicitly by writing UX(g, u).
6.3 Comparison Problem
Consider now two predictors (hydrological models, observing systems), say X1 and X2 , of
the same predictand W . Suppose their Bayes utilities are UX1 (g, u) and UX2 (g, u), respectively.
Having to choose between X1 and X2, a rational decider would obviously prefer the predictor with the higher Bayes utility.
Definition 2 (Informativeness). Predictor X1 is said to be more informative than predictor X2 if UX1(g, u) ≥ UX2(g, u) for every prior density function g and every utility function u.
The definition describes a situation wherein the order of the utilities remains identical for
every rational decider. Consequently, given a choice between X1 and X2 , each decider would
choose X1 . In that sense, X1 is preferred to X2 from the viewpoint of a utilitarian society. The
difficulty of ordering the predictors based on Definition 2 stems from the necessity of calculating
the utilities UX1(g, u) and UX2(g, u) for every possible prior density function g and every possible utility function u.
6.4 Theory of Sufficient Comparisons
Could an order between the utilities of predictors be established based solely on the families
of likelihood functions f1 and f2 ? The conditions under which the answer is affirmative and the
methods of establishing an order are the subjects of inquiry within the Bayesian theory of sufficient
comparisons of forecasters. The theory traces its roots to the works of Blackwell (1951, 1953)
on comparisons of experiments. Since then, the theory has been extended, operationalized, and
applied in various contexts, most extensively in signal detection theory developed by electrical
engineers. Applications to forecast systems are relatively recent. The next section presents one
operational result.
6.5 Informativeness Under Gaussian Likelihood Model
6.5.1 Sufficiency Characteristic. When the stochastic dependence between predictor X and predictand W is represented by the Gaussian likelihood model (24)–(25), the Bayesian theory of sufficient comparisons yields a simple measure:
SC = |a|/σ. (43)
Called the sufficiency characteristic of predictor X, it has four properties. (i) It is interpretable as the “signal-to-noise ratio”, as in the linear regression (25a) of X on W, the absolute value of the slope coefficient, |a|, is the measure of signal, and the standard deviation of the residual variate, σ, is the measure of noise. (ii) It is determined solely by the likelihood function, independently of the prior density function g. (iii) It is dimensional, with the unit being [unit of W]^{−1}. (iv) It orders the alternative predictors of W as follows.
Theorem 2 (Sufficient comparison via SC). Suppose each predictor Xi (i = 1, 2) of predictand W has a Gaussian likelihood model (24)–(25) with parameters (ai, σi) and sufficiency characteristic SCi = |ai|/σi such that 0 < SCi < ∞. Predictor X1 is more informative than predictor X2 if SC1 > SC2.
6.5.2 Informativeness Score. When in addition to the likelihood model being Gaussian
(24)–(25), the prior density function is Gaussian (23), the variance S² of predictand W completely
characterizes the prior uncertainty, 1/S can be interpreted as the “prior signal-to-noise ratio”, and
consequently the quotient SC/(1/S) = |a|S/σ measures the signal-to-noise ratio of a given pre-
dictor relative to the prior signal-to-noise ratio. This quotient constitutes the kernel of the measure

IS = [(SC/(1/S))^{−2} + 1]^{−1/2}
   = [σ²/(a²S²) + 1]^{−1/2}. (45)
Called the (Bayesian) informativeness score, it has four properties. (i) It is interpretable as the absolute value of Pearson’s correlation coefficient between X and W:

IS = |Cor(X, W)|, (46a)
Cor(X, W) = aS/(a²S² + σ²)^{1/2}. (46b)
The adjective Bayesian stems from the fact that S comes from the prior density function whereas
a and σ come from the likelihood function. Consequently IS does not equal the correlation
coefficient estimated from the joint sample of (X, W) unless S = sW, as explained in Section 4.1. (iii) It is dimensionless, taking values 0 < IS < 1.
(iv) It orders the alternative predictors of W consistently with SC, being its strictly increasing
transformation. (Note: in my previous writings, (45) was termed the Bayesian correlation score.)
Theorem 3 (Sufficient comparison via IS). Suppose the assumptions of Theorem 2 hold,
predictand W has a Gaussian prior density function with variance S 2 , and the informativeness
score ISi of predictor Xi is obtained from SCi and S according to (45). Predictor X1 is more informative than predictor X2 if IS1 > IS2.
Table 4 reports the values of the signal-to-noise ratio, |a|/σ, and the informativeness score,
IS, in the four cases shown in Figs. 4–7. It also reports the values of R² — the proportion of
the variation of the predictand that is explained by the predictor in the regression of the common
approach. While both measures, IS and R², imply the same order of the four predictors, their values differ: R² is gleaned from the joint sample alone, whereas IS fuses the likelihood parameters with the prior variance.
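For a numerical illustration (an assumption-based reconstruction, since Table 4’s body is not reproduced here), the sufficiency characteristic (43) and informativeness score (45) can be computed directly from the likelihood parameters shown in Figs. 4–7 and the prior standard deviation S of Fig. 3:

```python
# Sketch: SC (43) and IS (45) from the likelihood parameters of Figs. 4-7
# and the prior standard deviation of Fig. 3; values are approximate
# because the figure parameters are rounded.
S = 9.89
samples = {"A": (0.96, 7.59), "L": (0.34, 5.78),
           "R": (0.41, 6.88), "O": (1.24, 7.24)}

for name, (a, sigma) in samples.items():
    SC = abs(a) / sigma                               # eq. (43)
    IS = (sigma**2 / (a**2 * S**2) + 1.0) ** -0.5     # eq. (45)
    print(f"sample {name}: SC = {SC:.3f}, IS = {IS:.2f}")
```

Under these assumed inputs the ordering O > A > R > L emerges, consistent with the ordering implied by R².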
In summary, under the assumption that the likelihood model is Gaussian and the prior density
function is Gaussian, the theory of sufficient comparisons of predictors leads to simple statistical
measures of informativeness, |a|/σ and IS. The IS is suggested for reporting because it is dimen-
sionless and therefore comparable, in some sense, across different predictands. The key property
of this measure is its ordinal correspondence with the utility (equivalently, the economic value) of
the predictors to every rational decider. In that sense, the IS is a utilitarian measure of model
performance.
7. CLOSURE
The novel research paradigm, dubbed “the court of miracles of hydrological modeling”, fo-
cuses on achieving scientific progress in circumstances which traditionally would be called model
failures. We suggest that in such circumstances the requirements of a rational decider for infor-
mation help to frame the issues for a hydrologist. Basic modeling principles for addressing some
of these issues have been discussed herein from the viewpoint of the Bayesian forecast-decision
theory.
These modeling principles are general though they have been illustrated with specific statisti-
cal models. The meta-decision problem (Section 2) considers four special criterion functions but
is general insofar as the distribution function that quantifies uncertainty can be of any form. The
model of stochastic dependence (Section 3), the model for data fusion and prediction calibration
(Section 5), and the measures of informativeness (Section 6) come from the Bayesian Gaussian
processor. It is the simplest processor and, therefore, serves well for expository purposes (to ex-
plain advantages of the Bayesian approach when the joint samples are unrepresentative or affected
by illusory outliers). However, all the models and results discussed herein can be generalized by
employing the Bayesian Meta-Gaussian processor (Krzysztofowicz & Kelly, 2000a, 2000b) which
allows each marginal distribution function to take any form and the stochastic dependence struc-
ture to be nonlinear and heteroscedastic. Inasmuch as hydrological variates exhibit, with a few exceptions, non-Gaussian distributions and nonlinear dependence structures, applying the Meta-Gaussian processor to the issues treated in this article would be a natural extension.
Acknowledgment
This material is based upon work supported by the National Science Foundation under Grant
APPENDIX A: WEIBULL DISTRIBUTION
In the examples, the input variate W is assumed to have the sample space (η, ∞), and the
Weibull distribution with scale parameter α (α > 0), shape parameter β (β > 0), and shift parameter η (−∞ < η < ∞). The formulae for the density function g, the distribution function G,
and the quantile function G−1 , for any w ∈ (η, ∞) and any p ∈ (0, 1), are as follows:
g(w) = (β/α) ((w − η)/α)^{β−1} exp[−((w − η)/α)^β],
G(w) = 1 − exp[−((w − η)/α)^β],
G−1(p) = α [−ln(1 − p)]^{1/β} + η.
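For completeness, the three formulae translate directly into Python (a straightforward transcription of the Appendix, offered as a convenience):

```python
import math

# A direct transcription of the Appendix A formulae.
def weibull_g(w, alpha, beta, eta):      # density function g
    z = (w - eta) / alpha
    return (beta / alpha) * z ** (beta - 1.0) * math.exp(-z ** beta)

def weibull_G(w, alpha, beta, eta):      # distribution function G
    return 1.0 - math.exp(-(((w - eta) / alpha) ** beta))

def weibull_G_inv(p, alpha, beta, eta):  # quantile function G^{-1}
    return alpha * (-math.log(1.0 - p)) ** (1.0 / beta) + eta
```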
REFERENCES
Andréassian, V., Perrin, C., Parent, E. & Bárdossy, A. (2010) The Court of Miracles of Hydrology: Can Failure Stories Contribute to Hydrological Science? Hydrol. Sci. J., this issue.
Blackwell, D. (1951) Comparison of Experiments. In: Proc. Second Berkeley Symposium on Mathematical Statistics and Probability (ed. by J. Neyman), 93–102. University of California Press, Berkeley, USA.
Blackwell, D. (1953) Equivalent Comparisons of Experiments. Ann. Math. Statist. 24(2), 265–272.
DeGroot, M.H. (1970) Optimal Statistical Decisions. McGraw-Hill, New York, USA.
DeGroot, M.H. (1989) Probability and Statistics. Addison-Wesley, Reading, Massachusetts, USA.
DeGroot, M.H. & Fienberg, S.E. (1982) Assessing Probability Assessors: Calibration and
Refinement. In: Statistical Decision Theory and Related Topics, III (ed. by J.O. Berger
& S.S. Gupta), vol. 1, 291–314. Academic Press, New York, USA.
DeGroot, M.H. & Fienberg, S.E. (1983) The Comparison and Evaluation of Forecasters. The Statistician 32(1–2), 12–22.
Krzysztofowicz, R. (1983) Why Should a Forecaster and a Decision Maker Use Bayes Theorem. Water Resour. Res. 19(2), 327–336.
Krzysztofowicz, R. (1986) Expected Utility, Benefit, and Loss Criteria for Seasonal Water Supply Planning. Water Resour. Res. 22(3), 303–312.
Krzysztofowicz, R. (1987) Markovian Forecast Processes. J. Am. Statist. Assoc. 82(397), 31–37.
Krzysztofowicz, R. (2002) Bayesian System for Probabilistic River Stage Forecasting. J. Hydrol. 268(1–4), 16–40.
Krzysztofowicz, R. & Kelly, K.S. (2000a) Bayesian Improver of a Distribution. Stoch. Env. Res. Risk Assess.
Krzysztofowicz, R. & Kelly, K.S. (2000b) Hydrologic Uncertainty Processor for Probabilistic River Stage Forecasting. Water Resour. Res. 36(11), 3265–3277.
Krzysztofowicz, R. & Watada, L.M. (1986) Stochastic Model of Seasonal Runoff Forecasts. Water Resour. Res. 22(3), 296–302.
Pratt, J.W., Raiffa, H. & Schlaifer, R. (1995) Introduction to Statistical Decision Theory. The MIT Press, Cambridge, Massachusetts, USA.
Todini, E. (2008) A Model Conditional Processor to Assess Predictive Uncertainty in Flood Forecasting. Intl. J. River Basin Management 6(2), 123–137.
Vardeman, S. & Meeden, G. (1983) Calibration, Sufficiency, and Domination Considerations for Bayesian Probability Assessors. J. Am. Statist. Assoc. 78(384), 808–816.
Wymore, A.W. (1977) A Mathematical Theory of Systems Engineering — The Elements. R.E. Krieger, Huntington, New York, USA.
Table 1. Optimal decision a∗ in an estimation, or target-setting, problem, as a function of the shape parameter β of the Weibull variate W.
Table 2. Under the two-piece linear opportunity loss function, the ratio λo/λu of the marginal opportunity losses determines the probability G(a∗) = 1/[1 + λo/λu] to which the optimal decision a∗ corresponds.
Table 3. Analytic solutions to estimation, or target-setting, problems
Table 4. Measures of informativeness of predictor X for predictand W in (i) the common approach (R²) and (ii) the Bayesian approach (|a|/σ and IS), for the four joint samples.
Sample   R²   |a|/σ   IS
Fig. 1. Four criterion functions: (a) the impulse utility function u(a, w); (b) the quadratic difference opportunity loss function l(a, w); (c) the two-piece linear opportunity loss function l(a, w); and (d) the absolute difference opportunity loss function (the special case of (c) when λo = λu).
Fig. 2. The Weibull density functions (a) and the Weibull distribution functions (b) with fixed values of the scale parameter α = 1 and the shift parameter η = 0, and with varying values of the shape parameter β ∈ {0.5, 1, 1.5, 2, 4, 6}. (See Table 1 for the corresponding optimal decisions a∗.)
Fig. 3. The empirical distribution function of predictand W, P(W ≤ w), and the parametric distribution function G, which is Gaussian with mean M = 29.93 and standard deviation S = 9.89; both estimated from the prior sample.
Fig. 4. Comparison of results from the common model and the Bayesian model, given joint sample A (all realizations): (a) the regression of W on X (broken line; c = 0.73, d = 7.54, τ = 6.61), and the posterior mean of W (solid line; A = 0.64, B = 10.34, T = 6.18) derived via Bayes theorem; (b) the regression of X on W (a = 0.96, b = 2.13, σ = 7.59); (c) the distribution function K̄ of X estimated from the marginal sample (mX = 30.8, sX = 13.78), and the expected distribution function K of X (E(X) = 30.83, SD(X) = 12.15) derived via the total probability law; (d) the conditional distribution function H(·|X = 30) of W implied by the regression (graph a), and the posterior distribution function Φ(·|X = 30) of W derived via Bayes theorem.
Fig. 5. Comparison of results from the common model and the Bayesian model, given joint sample L (drawn from the left tail); c = 0.31, d = 13.80, τ = 5.50; a = 0.34, b = 12.13, σ = 5.78; A = 0.74, B = 13.34, T = 8.55; mX = 18.8, sX = 6.11; E(X) = 22.32, SD(X) = 6.69. The interpretation of graphs is the same as in Fig. 4, with x = 40. The additional dotted lines show (a) the posterior mean and (d) the posterior distribution function Φ(·|X = 40) that would have been obtained if the joint sample were representative.
Fig. 6. Comparison of results from the common model and the Bayesian model, given joint sample R (drawn from the right tail); c = 0.31, d = 26.89, τ = 6.01; a = 0.41, b = 26.43, σ = 6.88; A = 0.63, B = 5.72, T = 8.53; mX = 42.8, sX = 7.36; E(X) = 38.62, SD(X) = 7.97. The interpretation of graphs is the same as in Fig. 4, with x = 20. The additional dotted lines show (a) the posterior mean and (d) the posterior distribution function Φ(·|X = 20) that would have been obtained if the joint sample were representative.
Fig. 7. Comparison of results from the common model and the Bayesian model, given joint sample O (sample L with an illusory outlier); c = 0.64, d = 8.97, τ = 5.18; a = 1.24, b = −6.15, σ = 7.24; A = 0.60, B = 11.38, T = 5.02; mX = 24.2, sX = 15.92; E(X) = 31.08, SD(X) = 14.27. The interpretation of graphs is the same as in Fig. 4, with x = 30. The additional dotted lines show (a) the posterior mean and (d) the posterior distribution function Φ(·|X = 30) that would have been obtained if the joint sample were representative.