DECISION CRITERIA, DATA FUSION, AND PREDICTION CALIBRATION:
A BAYESIAN APPROACH
By
Roman Krzysztofowicz
University of Virginia
http://www.faculty.virginia.edu/rk/
October 2009
ABSTRACT
A novel research paradigm, dubbed “the court of miracles of hydrological modeling”, focuses on achieving scientific progress in circumstances which traditionally would be called model failures. Many of the associated modeling issues can be addressed systematically
and coherently within the mathematical framework of Bayesian forecast-decision theory. Five of
them are addressed herein: (i) choosing a criterion function for making rational decisions under uncertainty; (ii) modeling stochastic dependence between variates in order to quantify uncertainty and predict realizations; (iii) fusing data from asymmetric samples to cope with unrepresentativeness of small samples and corruptive effects of outliers; (iv) calibrating probabilistic predictions against a prior distribution; and (v) ordering predictors, or models, in terms of their informativeness for decisions. It is suggested that communication between hydrologists and deciders (planners, engineers, operators of hydrosystems) would benefit if hydrologists adopted, at least on some issues, the perspective of deciders, who must act in a timely and rational manner, and for whom hydrological estimates and predictions are inputs to rational decisions.
Key words: decision making; decision criteria; expected utility; estimation; uncertainty;
Bayesian approach.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 A New Research Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Decision-Theoretic Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2. DECISION CRITERIA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Target-Setting Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Impulse Utility Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Quadratic Difference Opportunity Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Two-Piece Linear Opportunity Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Absolute Difference Opportunity Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Median Versus Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 The Meta-Decision Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3. MODELS OF STOCHASTIC DEPENDENCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4. ATTRIBUTES OF BAYESIAN APPROACH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5. DATA FUSION AND PREDICTION CALIBRATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
6. INFORMATIVENESS OF PREDICTION FOR DECISIONS . . . . . . . . . . . . . . . . . . . . . . . . 24
6.1 Optimal Decision Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.2 Bayes Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.3 Comparison Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.4 Theory of Sufficient Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.5 Informativeness Under Gaussian Likelihood Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.5.1 Sufficiency Characteristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.5.2 Informativeness Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.6 Utilitarian Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7. CLOSURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
APPENDIX A: WEIBULL DISTRIBUTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1. INTRODUCTION
1.1 A New Research Paradigm
A hydrological model, like any mathematical model of a system (Wymore, 1977), is an intellectual construct. That the construct simplifies the reality and, therefore, fails to make perfect predictions is a given. What
remains in question, and should not be hidden, is the magnitude and the frequency of failures.
By focusing on failures rather than successes of hydrological models, the conveners (Vazken Andréassian, Charles Perrin, Eric Parent, and András Bárdossy) of a scientific workshop organized around this theme proposed an experimental research paradigm. Named with a poetic flair, after Victor Hugo, “the court of miracles of hydrological modeling” comprises (i) a collection of cases — catchments, events, and situations that revealed unexpected weaknesses or outright failures of hydrological models, and (ii) a set of objectives for scientific progress in understanding, modeling, and predicting such cases.
This comprehensive, systemic perspective is noteworthy. At any point in time, the science of
hydrology offers state-of-the-art models. While scientific progress towards greater understanding
and improved models continues in an unconstrained fashion, decisions for planning, construction,
and operation of hydrosystems must be made in a timely and rational manner. This requirement places two demands on hydrologists: (i) to provide an honest (a well-calibrated) assessment of the remaining uncertainty, a necessary condition for effective communication between hydrologists and deciders (planners, engineers, operators), and (ii) to appreciate, and if necessary to adopt, the perspective of a decider for whom hydrological estimates and predictions are but inputs to rational decisions.
1.2 Decision-Theoretic Issues
This expository article is written under the motto (again, aptly coined by the conveners):
“There are no hydrological monsters, only decision-making issues”. Indeed, the requirements of
a rational decider for information help to frame the issues for a hydrologist. We address some
fundamental issues and discuss some basic principles, derived from the Bayesian forecast-decision theory:
1. Choosing a criterion function for making rational decisions under conditions of uncertainty for the purpose of calculating estimates or setting targets. (This is a meta-decision problem: a decision about how to decide.)
2. Modeling stochastic dependence between variates for the purpose of quantifying uncertainty and predicting realizations.
3. Fusing data from asymmetric samples of stochastically dependent variates for the purpose of coping with unrepresentativeness of small samples and effects of illusory outliers in the context of estimation and prediction.
4. Calibrating probabilistic predictions against a prior distribution.
5. Ordering predictors, or models, in terms of their informativeness for decisions.
While at first glance these issues may appear disjoint, they can be addressed systematically
and coherently within the mathematical framework of decision theory — a top-down approach
whose ultimate objective is to maximize the expected utility of outcomes to stakeholders at every
step of modeling.
2. DECISION CRITERIA
2.1 Target-Setting Problem
There is a class of decision problems under uncertainty wherein the optimal value of a continuous decision variable a would be set to the realization w of the input variate W if only one knew
that realization at the time the decision must be made. This class includes problems of statistical
estimation (e.g., when a is the estimate of an unknown infiltration coefficient W ) and problems of
setting targets in the context of management, planning, and control (e.g., when a is the height of
a flood levee to be built and W is the maximum flood crest in the next 50 years; or when a is the planned daily capacity of a water delivery system and W is the daily river flow volume).
Suppose a and w are continuous variables and u is a utility function that evaluates outcomes,
such that u(a, w) denotes the utility of outcome resulting from decision a and input w. If in-
put w were known with certainty, then the optimal decision would be a = w, which means that
max_a u(a, w) = u(w, w) for every w. Oftentimes, it is convenient to transform the utility function
into an opportunity loss function l, such that l(a, w) denotes the difference between the utility of
the optimal decision and the utility of a given decision, when w is fixed (DeGroot, 1970):

l(a, w) = u(w, w) − u(a, w). (1)
It follows that the optimal decision a∗ under uncertainty about W can be obtained either by maximizing the expected utility or, equivalently, by minimizing the expected opportunity loss:

U(a∗) = max_a U(a), where U(a) = E[u(a, W)], (2)
L(a∗) = min_a L(a), where L(a) = E[l(a, W)]. (3)
Suppose furthermore that the uncertainty about the input variate W is quantified in terms of
a distribution function G such that for any realization w, one has G(w) = P(W ≤ w), where P denotes the probability; let g denote the corresponding density function.
Two questions arise. (i) Given the preferences of a decider encoded in the utility function u
(or the opportunity loss function l) and the uncertainty about the input variate W quantified by the
distribution function G (or the density function g), what is the optimal decision a∗? (ii) In the absence of well-formalized preferences over possible outcomes (which is often the case in scientific estimation and prediction problems), what form of the utility function u (or the opportunity loss function l) should be adopted?
We shall address these questions by recalling known results from decision theory for four
forms of u (or l) shown in Fig. 1 (DeGroot, 1970; Pratt et al., 1995; Krzysztofowicz, 1990).
2.2 Impulse Utility Function
Suppose decision a that perfectly estimates input w results in infinite utility, whereas decision a that mis-estimates input w results in zero utility, regardless of the direction and magnitude of mis-estimation (Fig. 1(a)). Such preferences are encoded in the impulse utility function u(a, w) = δ(w − a), where δ denotes the Dirac delta function. Each feasible decision a is now evaluated in terms of the expected utility
U(a) = E[u(a, W)]
     = ∫_{−∞}^{∞} δ(w − a) g(w) dw
     = g(a), (4)
and the optimal decision a∗ is found as the maximizer:

U(a∗) = max_a U(a) = max_a g(a); (5)
that is, the optimal decision a∗ is a point at which the density function g attains the maximum.
Such a point a∗ is known as the mode of variate W. (It may be noted that (5) is akin to the maximum likelihood criterion for parameter estimation.)
To illustrate this solution, Fig. 2(a) shows six Weibull density functions (see Appendix A for
the formula) with fixed scale parameter α and shift parameter η, and different shape parameter β
values, which result in different values of the mode a∗ (see Table 1).
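As a quick numerical companion (an illustration, not part of the original exposition), the following Python sketch computes the modes of the Weibull densities of Fig. 2; the values α = 1 and η = 0 follow the figure, and the closed-form mode derives from the Weibull density of Appendix A.

```python
# Sketch (illustrative): modes of the Weibull densities of Fig. 2, with
# alpha = 1, eta = 0, and varying shape beta. For beta > 1 the density
# peaks at eta + alpha*((beta - 1)/beta)**(1/beta); for beta <= 1 it is
# monotone decreasing, so the mode sits at eta.
def weibull_mode(alpha, beta, eta=0.0):
    if beta > 1.0:
        return eta + alpha * ((beta - 1.0) / beta) ** (1.0 / beta)
    return eta

for beta in (0.5, 1.0, 1.5, 2.0, 4.0, 6.0):
    print(f"beta = {beta}: mode a* = {weibull_mode(1.0, beta):.3f}")
```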
Should a hydrologist adopt the mode of W as the preferred estimator? The above formulation
offers a way of rationalizing such a meta-decision: If the hydrologist believes that only a single
point — the perfect estimate of the unknown w — has any positive utility (which is infinite relative
to the zero utility of every non-perfect estimate), then the answer is affirmative; otherwise, other criterion functions should be considered.
(To decide whether or not the impulse utility function is a suitable criterion for scientific
decisions, it may help to consider the problem of targeting a pistol in a duel. Here, w is the
position of the adversary on the horizontal axis, and a is the aiming point. The outcome of a = w
is the dead adversary, whereas the outcome of a ≠ w is a chance of ending up in the morgue oneself, should the adversary’s targeting turn out to be perfect. If one assigns infinite utility
to one’s own life relative to the adversary’s life, as a matter of rationality, or survival instinct, then
the impulse utility function is the exact mathematical model of this preference.)
2.3 Quadratic Difference Opportunity Loss Function
The opportunity loss is proportional to the quadratic difference between decision a and input w (Fig. 1(b)):

l(a, w) = λ(a − w)², (6)
where λ is the marginal opportunity loss from mis-estimation of the input, regardless of the direc-
tion of mis-estimation; it may be in monetary units [$/(unit of w)²] or may represent subjective
valuation.
Each feasible decision a is now evaluated in terms of the expected opportunity loss

L(a) = E[l(a, W)] = λE[(a − W)²], (7)

and the optimal decision a∗ is found as the minimizer:

L(a∗) = min_a L(a) = λ Var(W), (8)

where a∗ = E(W). That is, the optimal decision equals the mean of variate W; Table 1 shows
examples. (It may be noted that (7) is akin to the least squares criterion for parameter estimation.)
Should a hydrologist adopt the mean of W as the preferred estimator? Again, the above
formulation offers a way of rationalizing such a meta-decision. The answer is affirmative if the
hydrologist is indifferent with respect to the direction of mis-estimation and agrees that the op-
portunity loss increases quadratically with the magnitude of mis-estimation. The hydrologist may
also consider the implication of this preference: detailed quantification of uncertainty about vari-
ate W in terms of its distribution function G is irrelevant to decision making; the only relevant characteristic of G is its mean E(W).
2.4 Two-Piece Linear Opportunity Loss Function
The opportunity loss is proportional to the absolute difference between decision a and input w, with the proportionality constant depending on the sign of the difference (Fig. 1(c)):

l(a, w) = λo(a − w) if w ≤ a; l(a, w) = λu(w − a) if w > a, (9)

where λo is the marginal opportunity loss from over-estimation of the input, and λu is the marginal
opportunity loss from under-estimation of the input; each constant may be in monetary units [$/(unit of w)] or may represent subjective valuation.
Each feasible decision a is now evaluated in terms of the expected opportunity loss
L(a) = E[l(a, W)]
     = (λu + λo) a G(a) − λu a + λu E(W) − (λu + λo) ∫_{−∞}^{a} w g(w) dw, (10)
whose minimizer a∗ is such that G(a∗) = 1/[1 + λo/λu]. Because W is a continuous variate, its distribution function G has the inverse G−1, called the quantile function of W. Hence,

a∗ = G−1(1/[1 + λo/λu]). (11)
That is, the optimal decision a∗ equals the quantile of variate W corresponding to the probability 1/[1 + λo/λu]. This solution carries two implications.
First, in order to find the optimal decision a∗ , the hydrologist needs to know the distribution
function G of variate W and the ratio λo /λu of the marginal opportunity losses. Table 2 shows
the implications: for instance, when the hydrologist judges the marginal opportunity loss from
under-estimation to be 3 times as large as that from over-estimation, G(a∗) = 3/4; that is, the optimal decision is the 0.75-probability quantile (the third quartile) of W.
Second, associated with the optimal decision a∗ are the probability of over-estimation of the
input, P (W < a∗ ) = G(a∗ ), and the probability of under-estimation of the input, P (W > a∗ ) =
1 − G(a∗ ). For instance, when a∗ is the planned daily capacity of a water delivery system, and
W is the daily river flow volume, then G(a∗ ) represents the probability of shortage. Note that this
is the “optimal” probability of shortage: given the ratio λo /λu of the marginal opportunity losses
and the distribution function G of the daily river flow volume W, it is optimal to incur a shortage with probability G(a∗).
Should a hydrologist adopt a quantile of W as the preferred estimator? Again, the above
formulation offers a way of rationalizing such a meta-decision. Inasmuch as the ratio λo /λu may
take on different values for different problems and different deciders, any quantile of W under G
may be optimal in some situation. In that sense, the two-piece linear opportunity loss (9) is the
most general among the three criterion functions considered herein — general in that it allows for an asymmetry in the valuation of over-estimation and under-estimation.
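To make the quantile solution concrete, here is a short Python sketch (an illustration, not the paper’s computation) that evaluates a∗ = G−1(1/[1 + λo/λu]) for a few loss ratios; the Gaussian G with M = 29.93 and S = 9.89 is borrowed from the prior distribution of Fig. 3 purely as an assumed example.

```python
# Sketch: the optimal decision under the two-piece linear loss (9) is the
# quantile a* = G^{-1}(1/(1 + lo/lu)), eq. (11). G is assumed Gaussian here.
from scipy.stats import norm

G = norm(loc=29.93, scale=9.89)    # prior N(M, S^2) from Fig. 3 (assumed)
for ratio in (3.0, 1.0, 1.0 / 3.0):  # ratio = lambda_o / lambda_u
    p = 1.0 / (1.0 + ratio)          # optimal non-exceedance probability G(a*)
    print(f"lo/lu = {ratio:.3f}: G(a*) = {p:.2f}, a* = {G.ppf(p):.1f}")
```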
2.5 Absolute Difference Opportunity Loss Function
A special case of the two-piece linear opportunity loss function (9) arises when the marginal opportunity losses are identical: λo = λu = λ. The opportunity loss function is now symmetric (Fig. 1(d)):

l(a, w) = λ(a − w) if w ≤ a; l(a, w) = λ(w − a) if w > a. (12)

In other words, the opportunity loss is proportional to the absolute difference between decision a (the estimate) and realization w of variate W:

l(a, w) = λ|a − w|. (13)

From (12) it follows that the optimal decision a∗ satisfies G(a∗) = 1/[1 + 1] = 1/2; that is, a∗ = G−1(1/2), the median of variate W.
2.6 Median Versus Mean
When the opportunity losses from over-estimation and under-estimation are symmetric, the hydrologist might ask which of the two functions, (6) or (13), better reflects his preferences. In the absence of a firm answer, a second meta-consideration
may be the robustness of the optimal decision to outliers. As is well known, the sample estimate
of the mean E(W ) is sensitive to outliers, whereas the sample estimate of the median G−1 (1/2) is
not. This fact might favor the median. A third meta-consideration, especially when the estimate
constitutes a prediction of W , is the communication of uncertainty to users. For this purpose, the
median of W (not the mean) is the preferred estimate because it conveys at least some rudimentary
assessment of uncertainty (the 50% chance of the actual realization being either below or above the estimate).
2.7 The Meta-Decision Problem
This section has formulated a class of estimation, or target-setting, problems. Four models have been presented, each with a special criterion function and
an analytic solution for the optimal decision. The summary (Table 3) is this: Each of the four
estimators (mode, mean, median, a quantile) is optimal with respect to some criterion function.
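The four optimal estimators of Table 3 can be computed side by side; the sketch below does so in Python for a Weibull variate with β = 2, α = 1, η = 0 (the parameter choice mirrors Fig. 2 and is otherwise an assumption of this illustration).

```python
# Sketch: the four optimal estimators of Table 3 for a Weibull variate W
# with shape beta = 2, scale alpha = 1, shift eta = 0 (assumed example).
from scipy.stats import weibull_min

beta, alpha, eta = 2.0, 1.0, 0.0
W = weibull_min(c=beta, loc=eta, scale=alpha)

mode = eta + alpha * ((beta - 1.0) / beta) ** (1.0 / beta)  # impulse utility
mean = W.mean()                        # quadratic difference loss
median = W.ppf(0.5)                    # absolute difference loss
quantile = W.ppf(0.75)                 # two-piece linear loss, lo/lu = 1/3

print(f"mode {mode:.3f}, mean {mean:.3f}, "
      f"median {median:.3f}, 0.75-quantile {quantile:.3f}")
```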
First, the meta-decision problem can be framed thusly: the choice of an estimator for a
hydrological predictand is tantamount to accepting (if only implicitly) the corresponding criterion
function as representative of a hydrologist’s preferences over outcomes; and vice versa, the choice
(explicit) of the criterion function prescribes the type of estimator that is optimal.
Second, the apparent propensity of hydrologists (and scientists, in general) to prefer the mean
(as the estimator) and to deplore the bias (any deviation from the mean) finds no justification in
the theory of rational decisions as applied to real-world problems: ample empirical evidence from
cost-benefit analyses and preference-elicitation studies indicates that opportunity loss functions are rarely symmetric, let alone quadratic.
3. MODELS OF STOCHASTIC DEPENDENCE
At the heart of model building in any science, including hydrology, is the task of identifying
and representing relations between variables. When uncertainties spoil the idealized deterministic
relations, the focus should shift to representing stochastic dependence structures between vari-
ates. The most general mathematical model of stochastic dependence is a joint density function π.
Herein, we focus on the simplest situation with two continuous variates, interpreted as a predictand W and a predictor X. The two factorizations of the joint density function π (in terms of the families of the conditional density functions φ, f and the marginal density functions κ, g) are equivalent, provided they satisfy the total probability law

κ(x) = ∫_{−∞}^{∞} f(x|w) g(w) dw, (16)

and Bayes theorem

φ(w|x) = f(x|w) g(w) / κ(x). (17)

The Bayesian approach structures the modeling task as follows: first, model and estimate g and f; then derive κ and φ. Thereby a complete
and coherent model of stochastic dependence can be built, which utilizes all available data.
The objective of this exposition is to explain the Bayesian approach to model building in the
context of two problems: (i) fusing data from a large sample of variate W and from a small sample
of variate X in order to improve the model for the density function κ of variate X, and (ii) making
a probabilistic prediction in the form of a conditional density function φ(·|x) of variate W , given
realization x of variate X.
Many hydrologists seem to be unaware of how strong the case for a Bayesian approach to
modeling really is. For this reason, we shall juxtapose it with a common statistical approach
which, in essence, focuses on building directly a model of the form h(w|x). In the most general
form, h(·|x) is a conditional density function of variate W , given realization x of variate X. It will
be shown that h ≠ φ. And it will be argued that in the court of miracles, the Bayesian approach would carry the day on account of allowing the hydrologist (i) to use all available data, and (ii)
to calibrate probabilistic predictions against a specified standard. Both of these unique advantages
of the Bayesian approach should translate into a more robust model — the first line of defense against model failures.
Suppose the available data are organized into two samples. (i) The joint sample of (X, W )
consists of N realizations:

{(x(n), w(n)) : n = 1, ..., N}. (18)
(ii) The prior sample of W (which may also be called a climatic sample of W when it has been
collected over a sufficient number of years to represent the climate) consists of M realizations:

{w(m) : m = 1, ..., M}. (19)
The two samples are asymmetric, N < M, when the prior sample includes M − N additional realizations of W.
Here is a typical example (Krzysztofowicz & Watada, 1986). A long climatic record, say
from M years, exists of runoff volume w measured at the catchment outlet during the snowmelt
season, but the record of snowpack depth x measurements is recent and short, say from N years,
N < M. With the objective of predicting the seasonal snowmelt runoff volume W , given the
snowpack depth X = x on some fixed date during the winter, the hydrologist should wish not
to discard the M − N years of runoff volume measurements. Yet this he must do if he takes a common statistical approach.
3.3 Common Approach
This section reviews the simple linear regression (DeGroot, 1989) that yields a model of the
form h(w|x) and that sets the stage for the Bayesian model.
3.3.1 Model. Under the normal-linear model, the stochastic dependence between the pre-
dictand W and the predictor X is represented thusly: conditional on a realization of the predictor X = x, for any x (−∞ < x < ∞),
W = cx + d + Ξ, (20)
where c, d are parameters, and Ξ is a residual variate being stochastically independent of X, and
having a Gaussian density function with E(Ξ) = 0 and Var(Ξ) = τ². It then follows that the
conditional density function h(·|x) of W , given any realization X = x, is Gaussian with the mean
and variance
E(W |X = x) = cx + d, (21a)
Var(W|X = x) = τ². (21b)
3.3.2 Assumptions. Inasmuch as the objective is to obtain a model for the conditional density function h(·|x) of W, given X = x, the model (20) rests on assumptions about the residuals ξ(n) = w(n) − cx(n) − d: (i) the residuals plotted against the predictor realizations x(n) for all n = 1, ..., N must show no trend; (ii) the Gaussian distribution function with mean zero and variance τ² must fit well the empirical distribution function constructed from the sample {ξ(n) : n = 1, ..., N}.
3.3.3 Parameter Estimates. The parameters c, d, τ are estimated from the joint sample
(18). The estimates of c, d are usually obtained via the least squares method; the estimate of τ² for our purpose must be the sample variance (1/N) Σ ξ²(n). However, the same parameter values
can be obtained via the sample estimates of moments (DeGroot, 1989). Let mX , mW denote the
sample means and s²X, s²W denote the sample variances of X and W, respectively; let r denote the sample correlation coefficient of X and W. Then

c = r sW/sX, d = mW − c mX, (22a)
τ² = s²W (1 − r²). (22b)
These relations will prove handy in explaining the distinctions between the common approach and the Bayesian approach.
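A minimal Python sketch of the moment-based estimates (22), as reconstructed above; the arrays x and w stand for a hypothetical joint sample.

```python
import numpy as np

# Sketch: the moment-based estimates (22) of the common-approach
# parameters; x and w are hypothetical joint-sample arrays of length N.
def common_parameters(x, w):
    mX, mW = x.mean(), w.mean()
    sX, sW = x.std(), w.std()        # 1/N convention, as in Section 3.3.3
    r = np.corrcoef(x, w)[0, 1]      # sample correlation coefficient
    c = r * sW / sX                  # slope, eq. (22a)
    d = mW - c * mX                  # intercept, eq. (22a)
    tau2 = sW**2 * (1.0 - r**2)      # residual variance, eq. (22b)
    return c, d, tau2
```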
3.4 Bayesian Approach
When the joint sample is small, the estimates of the regression parameters c, d, τ may be
erroneous. Whereas Bayesian statistical theory offers methods for quantifying uncertainty about
the regression parameters, we focus on a Bayesian method for improving the estimates by ex-
tracting information from the prior sample, which contains realizations of W not included in the
joint sample (Krzysztofowicz, 1983; Krzysztofowicz & Watada, 1986; Krzysztofowicz & Reese,
1991).
3.4.1 Prior Density Function. Suppose the prior (marginal) density function g of predictand W is Gaussian with mean M and variance S²; for short,

W ∼ N(M, S²). (23)

3.4.2 Likelihood Function. The likelihood function f(·|w) is the conditional density function of predictor X, given a hypothesized realization of predictand W = w, for any x (−∞ < x < ∞). To obtain a model for f, the role of W and X in (20) is reversed: conditional on the hypothesis that the realization of predictand is W = w, for any w (−∞ < w < ∞),
X = aw + b + Θ, (24)
where a, b are parameters, and Θ is a residual variate being stochastically independent of W , and
having a Gaussian density function with E(Θ) = 0 and Var(Θ) = σ². It then follows that the
conditional density function f (·|w) of X, given any hypothesis W = w, is Gaussian with the mean
and variance
E(X|W = w) = aw + b, (25a)
Var(X|W = w) = σ². (25b)
Equation (25a) specifies the linear regression of X on W. The assumptions behind (24) parallel those behind (20), listed in Section 3.3.2.
3.4.3 Parameter Estimates. The likelihood parameters a, b, σ are estimated from the
joint sample (18) via the maximum likelihood method, which for a and b is equivalent to the
least squares method, and for σ² is equivalent to calculating the sample variance (1/N) Σ θ²(n).
However, these parameter values can also be obtained via the same sample estimates of moments
a = r sX/sW, b = mX − a mW, (26a)
σ² = a² s²W (r⁻² − 1). (26b)
3.4.4 Expected Density Function. When g and f are inserted into the total probability law
(16), one obtains the expected density function κ of predictor X. It is a Gaussian density function
under which
E(X) = aM + b, (27a)
Var(X) = a²S² + σ². (27b)
3.4.5 Posterior Density Function. When g, f, and κ are inserted into Bayes theorem (17),
one obtains the posterior density function φ(·|x) of predictand W, conditional on a realization of the predictor X = x. It is a Gaussian density function under which
E(W|X = x) = Ax + B, (28a)
Var(W|X = x) = T², (28b)
where
A = aS²/(a²S² + σ²), B = (Mσ² − abS²)/(a²S² + σ²), (29a)
T² = σ²S²/(a²S² + σ²). (29b)
In addition, the posterior quantiles of W are specified by the equation

wp = Ax + B + T Q−1(p), (30)

where Q−1 is the inverse of the standard normal distribution function, 0 < p < 1, and wp is the p-probability posterior quantile of W. From (30) it follows that the posterior median, w0.5 = Ax + B, equals the posterior mean (28a), and both are identical to the posterior mode (the posterior density function being Gaussian, hence symmetric and unimodal).
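The chain (29)–(30) is short enough to script. The following Python sketch computes the posterior parameters from the likelihood and prior parameters; the numerical values are read off the tutorial example of Fig. 4 and serve only to check the implementation against the reported A = 0.64, B = 10.34, T = 6.18.

```python
from scipy.stats import norm

# Sketch of the Bayesian Gaussian processor, eqs. (29)-(30). The numbers
# below are read off Fig. 4 and are assumptions of this illustration.
def posterior_parameters(a, b, sigma, M, S):
    den = a**2 * S**2 + sigma**2
    A = a * S**2 / den                        # eq. (29a)
    B = (M * sigma**2 - a * b * S**2) / den   # eq. (29a)
    T = (sigma**2 * S**2 / den) ** 0.5        # eq. (29b)
    return A, B, T

A, B, T = posterior_parameters(a=0.96, b=2.13, sigma=7.59, M=29.93, S=9.89)
print(A, B, T)                     # approx. 0.64, 10.34, 6.18, as in Fig. 4
w90 = A * 30.0 + B + T * norm.ppf(0.90)   # posterior 0.90-quantile, eq. (30)
```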
4. ATTRIBUTES OF BAYESIAN APPROACH
The most frequent question triggered by the Bayesian approach is this: What is the difference
between model (28) and model (21)? Framing it mathematically, the question becomes: under what conditions do the following equalities hold:

A = c, B = d, T² = τ². (31)
After inserting (26) into (29), one can obtain the right side of (22) if, and only if,
M = mW, S² = s²W. (32)
Thus the two approaches yield the same prediction, φ = h, if and only if the prior mean M and the prior variance S² of predictand W are identical with the sample mean mW and the sample variance s²W from the joint sample (18). If at least one equality in (32) is violated, then at least one
equality in (31) is violated too; hence, φ ≠ h. When M and S come from a prior sample (19),
which is larger than the joint sample (18), the equalities in (32) are unlikely to hold. Whereas the
common approach ignores (M, S), the Bayesian approach fuses it with (mW , sW ). Consequently,
the posterior parameters A, B, T do not correspond to any sample — they owe their being to the
Bayesian theory. Practical advantages of this data fusion are illustrated in Section 5.
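The necessity of (32) for (31) can be checked numerically. In the Python sketch below, hypothetical joint-sample moments are fused with a prior whose moments satisfy (32); the posterior parameters then collapse to the regression parameters, as the theory asserts.

```python
# Sketch: a numerical check that condition (32) implies (31). Hypothetical
# joint-sample moments; the prior moments are set equal to them, so the
# posterior parameters (29) must collapse to the regression parameters (22).
mW, sW, mX, sX, r = 30.0, 10.0, 28.0, 12.0, 0.8

a = r * sX / sW                      # eq. (26a)
b = mX - a * mW                      # eq. (26a)
sigma2 = sX**2 * (1.0 - r**2)        # eq. (26b), rewritten
c = r * sW / sX                      # eq. (22a)
d = mW - c * mX                      # eq. (22a)
tau2 = sW**2 * (1.0 - r**2)          # eq. (22b)

M, S2 = mW, sW**2                    # condition (32) imposed
den = a**2 * S2 + sigma2
A, B = a * S2 / den, (M * sigma2 - a * b * S2) / den
T2 = sigma2 * S2 / den
print(abs(A - c) < 1e-9, abs(B - d) < 1e-9, abs(T2 - tau2) < 1e-9)  # True x3
```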
One common misconception is not to recognize that (32) is the necessary condition for (31).
For instance, Todini (2008) claims erroneously that he “improved” our Bayesian forecasting
system (Krzysztofowicz & Kelly, 2000b) by estimating π̄(w, x) and κ̄(x) directly from a joint
sample and then finding h(w|x) = π̄(w, x)/κ̄(x). This argument misses, of course, the essence of the Bayesian approach: the fusion of information from the prior sample with information from the joint sample.
Another common misconception is to demand that probabilistic predictions, in the form
φ(·|x), output from the Bayesian processor be validated empirically on some joint sample. This is
a folly — a mental carryover from ad-hoc approaches to model building. For as noted in Section
4.1, the posterior parameters A, B, T do not correspond to any sample — they are theoretic con-
structs obtained by fusing estimates from two asymmetric samples that together contain all data
available at the time of model building. Only after one accumulates a new joint sample at least as large as the prior sample (of size M or larger) will one be able to perform a reasonable empirical validation.
5. DATA FUSION AND PREDICTION CALIBRATION
The hydrologist can exploit the advantages of the Bayesian fusion of information to solve two problems: (i) to improve the estimate of a marginal distribution function from a small sample, and (ii) to produce a well-calibrated probabilistic forecast. These two problems and the workings of the Bayesian processor are illustrated via
tutorial examples. There is a prior sample of size M = 15, and four joint samples created as
follows. (i) Sample A of size N = 10 contains realizations of W which are representative of the
realizations in the prior sample, and hence fairly representative of the prior distribution function of
W . Each of the remaining joint samples has size N = 5 and is unrepresentative. (ii) Sample L
comes from the left tail. (iii) Sample R comes from the right tail. (iv) Sample O includes the
same realizations as sample L, except one realization of X whose value was changed from 23 to a much larger value, thereby creating an (illusory) outlier in the opposite tail.
Figure 3 shows the empirical distribution function of W and the parametric distribution func-
tion G, which is Gaussian with mean M and variance S², for short N(M, S²); both are estimated from the prior sample.
Figures 4–7 show results of the common approach which estimates the regression of W on
X (graph a) and constructs the conditional distribution function H(·|x) of W (graph d); they can
be compared with results of the Bayesian approach which estimates the regression of X on W
(graph b), and derives the expected distribution function K of X (graph c), as well as the posterior distribution function Φ(·|x) of W (graph d).
Given only a marginal sample of X, one can estimate mX, s²X to obtain a distribution function K̄ of X, which in our example is N(mX, s²X). Obviously, when the sample is unrepresentative,
which is likely when it is small, K̄ is erroneous. Can it be improved in some way? Enter the Bayesian processor. If X is stochastically dependent on another variate, say W, whose distribution function G has been estimated from a large and repre-
sentative sample (the prior sample), and there exists a joint sample of X and W , however short,
then the Bayesian approach allows one to improve the initial estimate K̄ of the distribution function
of X. The improved distribution function K is N(E(X), Var(X)), where E(X) and Var(X) are specified by (27) in terms of the likelihood parameters (a, b, σ²) and the prior parameters (M, S²).
Graph (c) in Figs. 4–7 compares K with K̄. When the marginal sample of X is fairly
representative (Fig. 4), the difference between K and K̄ is minor; thus for the sake of discussion,
let us treat this K as the “true” distribution function K ∗ of X. When the marginal sample of X
is unrepresentative (Figs. 5, 6), function K̄ estimated directly from the sample follows closely the
empirical distribution function, which is located near the tail from which the sample was drawn.
Relative to K̄, function K obtained via the Bayesian processor is shifted correctly towards K ∗ ;
this type of adjustment is typical. In the example with an outlier (Fig. 7), function K̄ interpolates
between the four points of the empirical distribution function in the left tail and the one point in
the right tail. On the other hand, function K recovers, almost, the true function K ∗ .
How to explain the improvement that K offers over K̄ when the joint sample is unrepresen-
tative and has size N = 5 only? An improvement is possible because the Bayesian processor can
recognize an unrepresentative joint sample: it applies the total probability law to check the coher-
ence of estimates mW , sW from the joint sample and the estimates M, S from the prior sample.
Insertion of (26) into (27) yields

E(X) = mX + r (sX/sW)(M − mW), (33a)
Var(X) = [r²(S²/s²W − 1) + 1] s²X. (33b)
When the joint sample is representative, mW = M and s²W = S²; consequently, E(X) = mX and Var(X) = s²X. When the joint sample is unrepresentative, the Bayesian processor can recognize whether it comes from the left tail of the prior distribution function (mW < M) or the right tail (mW > M), and whether it is underdispersed (s²W < S²) or overdispersed (s²W > S²).
Then it revises the initial estimates mX, s²X to make them cohere to the prior parameters M, S², according to the strength of stochastic dependence between X and W (quantified through the
correlation coefficient r). Broadly speaking, the Bayesian processor takes the initial estimate K̄ of
the distribution function of variate X and recalibrates it against the specified distribution function G of variate W.
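A one-screen Python check of this revision, using the sample-L parameters read off Fig. 5 (assumed here, not recomputed from data): equations (27) shift the direct estimate K̄ = N(18.8, 6.11²) toward the expected distribution K = N(22.3, 6.7²) reported in graph (c).

```python
# Sketch: revision of K-bar via eqs. (27), with the sample-L parameters of
# Fig. 5 taken as given (assumed values for illustration).
a, b, sigma = 0.34, 12.13, 5.78      # likelihood parameters (sample L)
M, S = 29.93, 9.89                   # prior parameters of W (Fig. 3)

EX = a * M + b                           # eq. (27a)
SDX = (a**2 * S**2 + sigma**2) ** 0.5    # eq. (27b)
print(EX, SDX)  # approx. 22.3 and 6.7, versus mX = 18.8, sX = 6.11 of K-bar
```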
The predictions of W based on X = x obtained via the common approach and via the
Bayesian approach are compared in Figs. 4–7 in two graphs: graph (a) compares the regression
cx + d output from (21) with the posterior mean Ax + B output from (28); graph (d) compares the
conditional distribution function H(·|x) of W, which is N(cx + d, τ²), with the posterior distribution function Φ(·|x) of W, which is N(Ax + B, T²). Again, the difference is minor when the joint
sample is fairly representative, albeit small (Fig. 4). It may be drastic when the joint sample is un-
representative (Figs. 5, 6). In both cases, regression cx + d has incorrect slope whereas regression
Ax + B recovers, almost, the true slope, and only the intercept remains in error. In effect, the two
regressions form scissors: the difference between them is the smallest in the tail from which the
joint sample was drawn and the largest in the opposite tail. The last case (Fig. 7) shows how an
(illusory) outlier drawn from the opposite tail than the rest of the sample can help (by chance) the
common approach by effecting the correct slope. As a result, the difference between H(·|x) and Φ(·|x) remains small in this case.
In general, any difference between the predictions can be explained thusly: The distribution function H(·|x) reflects only the information gleaned from the joint sample alone. On the other hand, the distribution function Φ(·|x) fuses
information regarding stochastic dependence between X and W extracted from the joint sample
with information regarding natural uncertainty about W (viz, natural variability of W ) extracted
from the prior sample. The fusion of information takes the form of a revision: the prior distribution
function G is revised to the extent justified by the degree of dependence between X and W, as quantified by the likelihood function f.
The concept of calibration of probabilistic predictions comes from Bayesian theory (DeGroot & Fienberg, 1982, 1983; Vardeman & Meeden, 1983). It arises from the view-
point of a user, a decider, or an external observer, who asks the question: Can the probability of
an event, say {W ≤ w}, specified by the distribution function Φ(·|x) be taken at its face value?
Framed mathematically, with P denoting the probability from a user’s point of view, the question
is whether or not the following equality holds for every w and every x:

P(W ≤ w | X = x) = Φ(w|x). (34)
The treatment of this question is beyond the scope of this article, and also beyond the current
level of verification methods in hydrology and meteorology. (For a more extensive discussion,
see Krzysztofowicz & Sigrest, 1999, Appendix A). However, a simpler framing is easily treatable
and is this: In a large number of predictions of event {W ≤ w}, each prediction using a different
realization X = x and, therefore, assigning a different probability Φ(w|x) to the event, does
the mean of the probabilities converge, in the limit, to the relative frequency of event {W ≤ w}? When the prior distribution function G of W is known (and is estimated from all available data), the question can be answered during model building.
Definition 1 (Calibration). A probabilistic prediction in the form of a conditional density function drawn from the family {φ(·|x) : all x} is said to
be well calibrated if
E[φ(·|X)] = g. (35)
In other words, after a large number of predictions is made, the mean of the conditional density functions tends to the prior density function g. The Bayesian processor satisfies (35) by construction; it is self-calibrated.
Proof.

E[φ(w|X)] = ∫_{−∞}^{∞} φ(w|x) κ(x) dx
          = ∫_{−∞}^{∞} [f(x|w) g(w) / κ(x)] κ(x) dx
          = g(w) ∫_{−∞}^{∞} f(x|w) dx
          = g(w). (36)
It follows that probabilistic predictions in the form of the posterior distribution functions Φ(·|x) produced by the Bayesian approach (23)–(30) are guaranteed to be well calibrated against
the prior distribution function G (the calibration standard). On the other hand, probabilistic pre-
dictions in the form of the conditional distribution functions H(·|x) produced by the common
approach (20)–(22) are not well calibrated against G unless (32) holds.
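Self-calibration (35) lends itself to a Monte Carlo check. The Python sketch below simulates the Gaussian model with hypothetical parameters, forms the posterior probability Φ(w|X) for each simulated realization of X, and verifies that the mean of these probabilities approximates G(w).

```python
import numpy as np
from scipy.stats import norm

# Sketch: Monte Carlo check of self-calibration (35)-(36) for the Gaussian
# processor; all parameter values are hypothetical.
rng = np.random.default_rng(1)
M, S, a, b, sigma = 30.0, 10.0, 0.9, 2.0, 6.0

w_true = rng.normal(M, S, size=200_000)                        # W drawn from g
x = a * w_true + b + rng.normal(0.0, sigma, size=w_true.size)  # model (24)

den = a**2 * S**2 + sigma**2                                   # eqs. (29)
A, B = a * S**2 / den, (M * sigma**2 - a * b * S**2) / den
T = (sigma**2 * S**2 / den) ** 0.5

w = 35.0                                                # a fixed threshold
mean_prob = norm.cdf(w, loc=A * x + B, scale=T).mean()  # E[Phi(w|X)]
print(mean_prob, norm.cdf(w, loc=M, scale=S))           # both approx. G(w)
```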
The above conclusion contains answers to the question raised in Section 4.2. First, there is no need to validate calibration empirically on a joint sample because the Bayesian processor self-calibrates. Second, the need for validation of calibration felt by those who follow a common approach to the prediction problem is well-meant, but its result is predictable: they will
find (empirically) their predictions to be well calibrated if, and only if, (32) holds. A Bayesian
hydrologist can easily check this fact beforehand, during the model building process.
Of course, the above conclusion presumes that the models for f and h are valid in that they
satisfy the assumptions listed in Section 3.3.2 and that the model for g is valid as well. That is
why validation of each component model, g and f, on all available samples is an essential step in the model building process.
(For an extension of these principles to Bayesian forecasting systems processing outputs from deterministic hydrological models, see Krzysztofowicz & Kelly (2000b) and Krzysztofowicz (2002).)
6. INFORMATIVENESS OF PREDICTION FOR DECISIONS
Our final objective is to couple the prediction problem (Sections 3–5) with the decision prob-
lem (Section 2). Two basic questions arise: (i) How to make an optimal decision? (ii) How to
evaluate a predictor? Bayesian decision theory offers a mathematical framework for addressing
both questions.
6.1 Optimal Decision Functions
For the decision problem studied in Section 2, the uncertainty about the input variate W is now quantified in terms of the posterior density function φ(·|x) output from the Bayesian processor. Each feasible decision a is now evaluated in terms of the posterior expected utility

U(a|x) = E[u(a, W)|X = x]
       = ∫_{−∞}^{∞} u(a, w) φ(w|x) dw, (37)
and the optimal decision α∗(x), which depends on x, is that which yields the maximum posterior expected utility:

U∗(x) = U(α∗(x)|x) = max_a U(a|x). (38)
When this procedure is executed for every x, one obtains the optimal decision function α∗ .
The analytic solutions derived in Section 2 under four special criterion functions generalize
directly: each of the four estimators (Table 3) is now preceded by the adjective conditional (con-
ditional mode, conditional mean, conditional median, a conditional quantile). When φ(·|x) comes from the Gaussian processor (28),

α∗(x) = Ax + B (39)
is the conditional mode, the conditional mean, and the conditional median, whereas

α∗(x) = Ax + B + T Q−1(p) (40)

is the p-probability conditional quantile; setting p = 1/[1 + λo/λu] yields the estimator α∗(x) that is optimal under the two-piece linear opportunity loss function (9).
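The decision functions (39)–(40) reduce to a few lines of code. The sketch below uses the posterior parameters of Fig. 4 as assumed inputs; Q−1 is the standard normal quantile function.

```python
from scipy.stats import norm

# Sketch: optimal decision functions (39)-(40); the posterior parameters
# of Fig. 4 are assumed inputs.
A, B, T = 0.64, 10.34, 6.18

def optimal_decision(x, ratio=None):
    """ratio = lambda_o / lambda_u for the two-piece linear loss;
    ratio=None gives (39), optimal under the impulse, quadratic,
    and absolute-difference criteria alike."""
    if ratio is None:
        return A * x + B
    p = 1.0 / (1.0 + ratio)
    return A * x + B + T * norm.ppf(p)   # conditional quantile, eq. (40)

print(optimal_decision(30.0), optimal_decision(30.0, ratio=1.0 / 3.0))
```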
6.2 Bayes Utility
The overall evaluation, independent of the realization x of the predictor, is obtained by taking the expectation of (38). Called the integrated maximum posterior expected utility (Bayes utility, for short), it is
UX = E[U∗(X)]
   = ∫_{−∞}^{∞} U∗(x) κ(x) dx
   = ∫_{−∞}^{∞} [max_a ∫_{−∞}^{∞} u(a, w) φ(w|x) dw] κ(x) dx
   = ∫_{−∞}^{∞} [max_a ∫_{−∞}^{∞} u(a, w) f(x|w) g(w) dw] dx, (41)
where the fourth line is obtained from the third line by replacing φ(w|x) with (17).
The Bayes utility UX constitutes a measure that evaluates the performance of a prediction-
decision system. Expression (41) reveals that this evaluation depends upon the likelihood function
f which characterizes the predictor, the prior density function g which characterizes the natural
uncertainty about the predictand, and the utility function u which encodes the decider’s preferences
over the outcomes. Thus the evaluation of a particular predictor depends upon elements g and u
of the decision problem; we shall recognize this explicitly by writing UX(g, u).
6.3 Comparison Problem
Consider now two predictors (hydrological models, observing systems), say X1 and X2 , of
the same predictand W . Suppose their Bayes utilities are UX1 (g, u) and UX2 (g, u), respectively.
Having to choose between X1 and X2, a rational decider would obviously prefer the predictor with the higher Bayes utility.
Definition 2 (Informativeness). Predictor X1 is said to be more informative than predictor X2 if UX1(g, u) ≥ UX2(g, u) for every prior density function g and every utility function u.
The definition describes a situation wherein the order of the utilities remains identical for
every rational decider. Consequently, given a choice between X1 and X2 , each decider would
choose X1 . In that sense, X1 is preferred to X2 from the viewpoint of a utilitarian society. The
difficulty of ordering the predictors based on Definition 2 stems from the necessity of calculating
the utilities UX1(g, u) and UX2(g, u) for every possible prior density function g and every possible utility function u.
6.4 Theory of Sufficient Comparisons
Could an order between the utilities of predictors be established based solely on the families
of likelihood functions f1 and f2 ? The conditions under which the answer is affirmative and the
methods of establishing an order are the subjects of inquiry within the Bayesian theory of sufficient
comparisons of forecasters. The theory traces its roots to the works of Blackwell (1951, 1953)
on comparisons of experiments. Since then, the theory has been extended, operationalized, and
applied in various contexts, most extensively in signal detection theory developed by electrical
engineers. Applications to forecast systems are relatively recent. The next section presents one
operational result.
6.5 Informativeness Under Gaussian Likelihood Model
6.5.1 Sufficiency Characteristic. When the stochastic dependence between predictor X and predictand W is represented by the Gaussian likelihood model (24)–(25), the Bayesian theory of sufficient comparisons yields a simple measure:
SC = |a|/σ. (43)
Called the sufficiency characteristic of predictor X, it has four properties. (i) It is interpretable as the “signal-to-noise ratio”, as in the linear regression (25a) of X on W, the absolute value of the slope coefficient, |a|, is the measure of signal, and the standard deviation of the residual variate, σ, is the measure of noise. (ii) It is determined solely by the likelihood function, independently of the prior density function g. (iii) It is dimensional, with the unit being [unit of W]^{−1}. (iv) It orders the alternative predictors of W as follows.
Theorem 2 (Sufficient comparison via SC). Suppose each predictor Xi (i = 1, 2) of predictand W has a Gaussian likelihood model (24)–(25) with parameters (ai, σi) and sufficiency characteristic SCi = |ai|/σi such that 0 < SCi < ∞. Predictor X1 is more informative than predictor X2 if SC1 > SC2.
6.5.2 Informativeness Score. When in addition to the likelihood model being Gaussian
(24)–(25), the prior density function is Gaussian (23), the variance S² of predictand W completely
characterizes the prior uncertainty, 1/S can be interpreted as the “prior signal-to-noise ratio”, and
consequently the quotient SC/(1/S) = |a|S/σ measures the signal-to-noise ratio of a given pre-
dictor relative to the prior signal-to-noise ratio. This quotient constitutes the kernel of the measure

IS = [(SC/(1/S))^{−2} + 1]^{−1/2}
   = [σ²/(a²S²) + 1]^{−1/2}. (45)
Called the (Bayesian) informativeness score, it has four properties. (i) It is interpretable as the absolute value of Pearson’s correlation coefficient between X and W:

IS = |Cor(X, W)|, (46a)
Cor(X, W) = aS/(a²S² + σ²)^{1/2}. (46b)
The adjective Bayesian stems from the fact that S comes from the prior density function whereas
a and σ come from the likelihood function. Consequently IS does not equal the correlation
coefficient estimated from the joint sample of (X, W) unless S = sW, as explained in Section 4.1. (iii) It is dimensionless, taking values 0 < IS < 1.
(iv) It orders the alternative predictors of W consistently with SC, being its strictly increasing
transformation. (Note: in my previous writings, (45) was termed the Bayesian correlation score.)
Theorem 3 (Sufficient comparison via IS). Suppose the assumptions of Theorem 2 hold,
predictand W has a Gaussian prior density function with variance S 2 , and the informativeness
score ISi of predictor Xi is obtained from SCi and S according to (45). Predictor X1 is more informative than predictor X2 if IS1 > IS2.
Table 4 reports the values of the signal-to-noise ratio, |a|/σ, and the informativeness score,
IS, in the four cases shown in Figs. 4–7. It also reports the values of R² — the proportion of
the variation of the predictand that is explained by the predictor in the regression of the common
approach. While both measures, IS and R², imply the same order of the four predictors, their values differ: R² is gleaned from the joint sample alone, whereas IS fuses the likelihood parameters with the prior variance.
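For a numerical illustration (an assumption-based reconstruction, since Table 4’s body is not reproduced here), the sufficiency characteristic (43) and informativeness score (45) can be computed directly from the likelihood parameters shown in Figs. 4–7 and the prior standard deviation S of Fig. 3:

```python
# Sketch: SC (43) and IS (45) from the likelihood parameters of Figs. 4-7
# and the prior standard deviation of Fig. 3; values are approximate
# because the figure parameters are rounded.
S = 9.89
samples = {"A": (0.96, 7.59), "L": (0.34, 5.78),
           "R": (0.41, 6.88), "O": (1.24, 7.24)}

for name, (a, sigma) in samples.items():
    SC = abs(a) / sigma                               # eq. (43)
    IS = (sigma**2 / (a**2 * S**2) + 1.0) ** -0.5     # eq. (45)
    print(f"sample {name}: SC = {SC:.3f}, IS = {IS:.2f}")
```

Under these assumed inputs the ordering O > A > R > L emerges, consistent with the ordering implied by R².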
In summary, under the assumption that the likelihood model is Gaussian and the prior density
function is Gaussian, the theory of sufficient comparisons of predictors leads to simple statistical
measures of informativeness, |a|/σ and IS. The IS is suggested for reporting because it is dimen-
sionless and therefore comparable, in some sense, across different predictands. The key property
of this measure is its ordinal correspondence with the utility (equivalently, the economic value) of
the predictors to every rational decider. In that sense, the IS is a utilitarian measure of model
performance.
7. CLOSURE
The novel research paradigm, dubbed “the court of miracles of hydrological modeling”, fo-
cuses on achieving scientific progress in circumstances which traditionally would be called model
failures. We suggest that in such circumstances the requirements of a rational decider for infor-
mation help to frame the issues for a hydrologist. Basic modeling principles for addressing some
of these issues have been discussed herein from the viewpoint of the Bayesian forecast-decision
theory.
These modeling principles are general though they have been illustrated with specific statisti-
cal models. The meta-decision problem (Section 2) considers four special criterion functions but
is general insofar as the distribution function that quantifies uncertainty can be of any form. The
model of stochastic dependence (Section 3), the model for data fusion and prediction calibration
(Section 5), and the measures of informativeness (Section 6) come from the Bayesian Gaussian
processor. It is the simplest processor and, therefore, serves well for expository purposes (to ex-
plain advantages of the Bayesian approach when the joint samples are unrepresentative or affected
by illusory outliers). However, all the models and results discussed herein can be generalized by
employing the Bayesian Meta-Gaussian processor (Krzysztofowicz & Kelly, 2000a, 2000b) which
allows each marginal distribution function to take any form and the stochastic dependence struc-
ture to be nonlinear and heteroscedastic. Inasmuch as hydrological variates exhibit, with a few exceptions, non-Gaussian distributions and nonlinear dependence structures, applying the Meta-Gaussian processor to the issues treated in this article would be a natural extension.
Acknowledgment
This material is based upon work supported by the National Science Foundation under Grant
APPENDIX A: WEIBULL DISTRIBUTION
In the examples, the input variate W is assumed to have the sample space (η, ∞), and the
Weibull distribution with scale parameter α (α > 0), shape parameter β (β > 0), and shift parameter η (−∞ < η < ∞). The formulae for the density function g, the distribution function G,
and the quantile function G−1 , for any w ∈ (η, ∞) and any p ∈ (0, 1), are as follows:
g(w) = (β/α) ((w − η)/α)^{β−1} exp[−((w − η)/α)^β],
G(w) = 1 − exp[−((w − η)/α)^β],
G−1(p) = α [−ln(1 − p)]^{1/β} + η.
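For completeness, the three formulae translate directly into Python (a straightforward transcription of the Appendix, offered as a convenience):

```python
import math

# A direct transcription of the Appendix A formulae.
def weibull_g(w, alpha, beta, eta):      # density function g
    z = (w - eta) / alpha
    return (beta / alpha) * z ** (beta - 1.0) * math.exp(-z ** beta)

def weibull_G(w, alpha, beta, eta):      # distribution function G
    return 1.0 - math.exp(-(((w - eta) / alpha) ** beta))

def weibull_G_inv(p, alpha, beta, eta):  # quantile function G^{-1}
    return alpha * (-math.log(1.0 - p)) ** (1.0 / beta) + eta
```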
REFERENCES
Andréassian, V., Perrin, C., Parent, E. & Bárdossy, A. (2010) The Court of Miracles of Hydrology: Can Failure Stories Contribute to Hydrological Science? Hydrol. Sci. J., this issue.
Blackwell, D. (1951) Comparison of Experiments. In: Proc. Second Berkeley Symposium on Mathematical Statistics and Probability (ed. by J. Neyman), 93–102. University of California Press, Berkeley, USA.
Blackwell, D. (1953) Equivalent Comparisons of Experiments. Ann. Math. Statist. 24(2), 265–272.
DeGroot, M.H. (1970) Optimal Statistical Decisions. McGraw-Hill, New York, USA.
DeGroot, M.H. (1989) Probability and Statistics. Addison-Wesley, Reading, Massachusetts, USA.
DeGroot, M.H. & Fienberg, S.E. (1982) Assessing Probability Assessors: Calibration and
Refinement. In: Statistical Decision Theory and Related Topics, III (ed. by J.O. Berger
& S.S. Gupta), vol. 1, 291–314. Academic Press, New York, USA.
DeGroot, M.H. & Fienberg, S.E. (1983) The Comparison and Evaluation of Forecasters. The Statistician 32(1–2), 12–22.
Krzysztofowicz, R. (1983) Why Should a Forecaster and a Decision Maker Use Bayes Theorem. Water Resour. Res. 19(2), 327–336.
Krzysztofowicz, R. (1986) Expected Utility, Benefit, and Loss Criteria for Seasonal Water Supply Planning. Water Resour. Res. 22(3), 303–312.
Krzysztofowicz, R. (1987) Markovian Forecast Processes. J. Am. Statist. Assoc. 82(397), 31–37.
Krzysztofowicz, R. (2002) Bayesian System for Probabilistic River Stage Forecasting. J. Hydrol. 268(1–4), 16–40.
Krzysztofowicz, R. & Kelly, K.S. (2000a) Bayesian Improver of a Distribution. Stoch. Env. Res. Risk Assess.
Krzysztofowicz, R. & Kelly, K.S. (2000b) Hydrologic Uncertainty Processor for Probabilistic River Stage Forecasting. Water Resour. Res. 36(11), 3265–3277.
Krzysztofowicz, R. & Watada, L.M. (1986) Stochastic Model of Seasonal Runoff Forecasts. Water Resour. Res. 22(3), 296–302.
Pratt, J.W., Raiffa, H. & Schlaifer, R. (1995) Introduction to Statistical Decision Theory. The MIT Press, Cambridge, Massachusetts, USA.
Todini, E. (2008) A Model Conditional Processor to Assess Predictive Uncertainty in Flood Forecasting. Intl. J. River Basin Management 6(2), 123–137.
Vardeman, S. & Meeden, G. (1983) Calibration, Sufficiency, and Domination Considerations for Bayesian Probability Assessors. J. Am. Statist. Assoc. 78(384), 808–816.
Wymore, A.W. (1977) A Mathematical Theory of Systems Engineering — The Elements. R.E. Krieger, Huntington, New York, USA.
Table 1. Optimal decision a∗ in an estimation, or target-setting, problem, as a function of the shape parameter β of the Weibull variate W.
Table 2. Under the two-piece linear opportunity loss function, the ratio λo/λu of the marginal opportunity losses determines the probability G(a∗) = 1/[1 + λo/λu] to which the optimal decision a∗ corresponds.
Table 3. Analytic solutions to estimation, or target-setting, problems
Table 4. Measures of informativeness of predictor X for predictand W in (i) the common approach (R²) and (ii) the Bayesian approach (|a|/σ and IS), for the four joint samples.
Sample   R²   |a|/σ   IS
Fig. 1. Four criterion functions: (a) the impulse utility function u(a, w); (b) the quadratic difference opportunity loss function l(a, w); (c) the two-piece linear opportunity loss function l(a, w); and (d) the absolute difference opportunity loss function (the special case of (c) when λo = λu).
Fig. 2. The Weibull density functions (a) and the Weibull distribution functions (b) with fixed values of the scale parameter α = 1 and the shift parameter η = 0, and with varying values of the shape parameter β ∈ {0.5, 1, 1.5, 2, 4, 6}. (See Table 1 for the corresponding optimal decisions a∗.)
Fig. 3. The empirical distribution function of predictand W, P(W ≤ w), and the parametric distribution function G, which is Gaussian with mean M = 29.93 and standard deviation S = 9.89; both estimated from the prior sample.
Fig. 4. Comparison of results from the common model and the Bayesian model, given joint sample A (all realizations): (a) the regression of W on X (broken line; c = 0.73, d = 7.54, τ = 6.61), and the posterior mean of W (solid line; A = 0.64, B = 10.34, T = 6.18) derived via Bayes theorem; (b) the regression of X on W (a = 0.96, b = 2.13, σ = 7.59); (c) the distribution function K̄ of X estimated from the marginal sample (mX = 30.8, sX = 13.78), and the expected distribution function K of X (E(X) = 30.83, SD(X) = 12.15) derived via the total probability law; (d) the conditional distribution function H(·|X = 30) of W implied by the regression (graph a), and the posterior distribution function Φ(·|X = 30) of W derived via Bayes theorem.
Fig. 5. Comparison of results from the common model and the Bayesian model, given joint sample L (drawn from the left tail); c = 0.31, d = 13.80, τ = 5.50; a = 0.34, b = 12.13, σ = 5.78; A = 0.74, B = 13.34, T = 8.55; mX = 18.8, sX = 6.11; E(X) = 22.32, SD(X) = 6.69. The interpretation of graphs is the same as in Fig. 4, with x = 40. The additional dotted lines show (a) the posterior mean and (d) the posterior distribution function Φ(·|X = 40) that would have been obtained if the joint sample were representative.
Fig. 6. Comparison of results from the common model and the Bayesian model, given joint sample R (drawn from the right tail); c = 0.31, d = 26.89, τ = 6.01; a = 0.41, b = 26.43, σ = 6.88; A = 0.63, B = 5.72, T = 8.53; mX = 42.8, sX = 7.36; E(X) = 38.62, SD(X) = 7.97. The interpretation of graphs is the same as in Fig. 4, with x = 20. The additional dotted lines show (a) the posterior mean and (d) the posterior distribution function Φ(·|X = 20) that would have been obtained if the joint sample were representative.
Fig. 7. Comparison of results from the common model and the Bayesian model, given joint sample O (sample L with an illusory outlier); c = 0.64, d = 8.97, τ = 5.18; a = 1.24, b = −6.15, σ = 7.24; A = 0.60, B = 11.38, T = 5.02; mX = 24.2, sX = 15.92; E(X) = 31.08, SD(X) = 14.27. The interpretation of graphs is the same as in Fig. 4, with x = 30. The additional dotted lines show (a) the posterior mean and (d) the posterior distribution function Φ(·|X = 30) that would have been obtained if the joint sample were representative.