
DECISION CRITERIA, DATA FUSION, AND PREDICTION CALIBRATION:

A BAYESIAN APPROACH

By

Roman Krzysztofowicz

University of Virginia

P.O. Box 400747

Charlottesville, Virginia 22904-4747, USA

E-mail address: rk@virginia.edu

Research Paper RK–0901

http://www.faculty.virginia.edu/rk/

October 2009

Revised March 2010

Published in Hydrological Sciences Journal, 55 (6), 1033–1050, 2010

Special issue: The Court of Miracles of Hydrology


Abstract. The novel research paradigm, dubbed “the court of miracles of hydrological mod-

eling”, focuses on achieving scientific progress in circumstances which traditionally would be

called model failures. Many of the associated modeling issues can be addressed systematically

and coherently within the mathematical framework of Bayesian forecast-decision theory. Five of

them are addressed herein: (i) choosing a criterion function for making rational decisions under

uncertainty (a meta-decision problem); (ii) modeling stochastic dependence between variates to

quantify uncertainty and predict realizations; (iii) fusing data from asymmetric samples to cope

with unrepresentativeness of small samples and corruptive effects of outliers; (iv) calibrating

probabilistic predictions against a prior distribution; and (v) ordering predictors, or models, in

terms of their informativeness (equivalently, in terms of their economic value to a decider). It is

suggested that communication between hydrologists and deciders (planners, engineers, operators

of hydrosystems) would benefit if hydrologists adopted, at least on some issues, the perspective of

deciders, who must act in a timely and rational manner, and for whom hydrological estimates and

predictions have economic consequences.

Key words: decision making; decision criteria; expected utility; estimation; uncertainty;

distribution function; data fusion; prediction calibration; informativeness;

Bayesian approach.

TABLE OF CONTENTS

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 A New Research Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Decision-Theoretic Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2. DECISION CRITERIA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Target-Setting Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Impulse Utility Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Quadratic Difference Opportunity Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Two-Piece Linear Opportunity Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Absolute Difference Opportunity Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Median Versus Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 The Meta-Decision Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3. MODELS OF STOCHASTIC DEPENDENCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10


3.1 Joint Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Data Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Common Approach to Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.3 Parameter Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Bayesian Approach to Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4.1 Prior Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4.3 Parameter Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4.4 Expected Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4.5 Posterior Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4. ATTRIBUTES OF BAYESIAN APPROACH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16


4.1 Theoretic Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Common Misconceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5. DATA FUSION AND PREDICTION CALIBRATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


5.1 The Setting for Tutorial Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.2 An Improved Marginal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.3 An Improved Conditional Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.4 A Calibrated Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

6. INFORMATIVENESS OF PREDICTION FOR DECISIONS . . . . . . . . . . . . . . . . . . . . . . . . 24
6.1 Optimal Decision Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.2 Bayes Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.3 Comparison Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.4 Theory of Sufficient Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.5 Informativeness Under Gaussian Likelihood Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.5.1 Sufficiency Characteristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.5.2 Informativeness Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.6 Utilitarian Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

7. CLOSURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
APPENDIX A: WEIBULL DISTRIBUTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

1. INTRODUCTION

1.1 A New Research Paradigm

A hydrological model, like any mathematical model of a system (Wymore, 1977), is an in-

tellectual construct portraying a hydrologist’s understanding of a fragment of reality. That this

construct simplifies the reality and, therefore, fails to make perfect predictions is a given. What

remains in question, and should not be hidden, is the magnitude and the frequency of failures.

By focusing on failures rather than successes of hydrological models, the conveners (Vazken

Andréassian, Charles Perrin, Eric Parent, and András Bárdossy) of a scientific workshop organized

by Cemagref and ENGREF-AgroParistech in Paris, in June 2008, effectively identified a funda-

mental research paradigm. Named with a poetic flare, after Victor Hugo, “the court of miracles

of hydrological modeling”, the paradigm is defined by (i) a collection of “hydrological monsters”

— catchments, events, and situations that revealed unexpected weaknesses or outright failures of

hydrological models, and (ii) a set of objectives for scientific progress in understanding, modeling,

predicting, and deciding (Andréassian et al., 2010).

This comprehensive, systemic perspective is noteworthy. At any point in time, the science of

hydrology offers state-of-the-art models. While scientific progress towards greater understanding

and improved models continues in an unconstrained fashion, decisions for planning, construction,

and operation of hydrosystems must be made in a timely and rational manner. This require-

ment places two demands on hydrologists: (i) to provide an honest (a well-calibrated) assessment

of the predictive uncertainty, conditional on a state-of-the-art hydrological model — a necessary

condition for effective communication between hydrologists and deciders (planners, engineers, op-

erators), and (ii) to appreciate, and if necessary to adopt, the perspective of a decider for whom

hydrological estimates and predictions have economic consequences.

1.2 Decision-Theoretic Issues

This expository article is written under the motto (again, aptly coined by the conveners):

“There are no hydrological monsters, only decision-making issues”. Indeed, the requirements of

a rational decider for information help to frame the issues for a hydrologist. We address some

fundamental issues and discuss some basic principles, derived from the Bayesian forecast-decision

theory, for coping with “monsters”. These issues are as follows.

1. Choosing a criterion function for making rational decisions under conditions of uncertainty

for the purpose of calculating estimates or setting targets. (This is a meta-decision problem:

deciding how to decide.)

2. Modeling stochastic dependence between variates in terms of conditional distribution

functions for the purpose of quantifying uncertainty and predicting realizations.

3. Fusing data from asymmetric samples of stochastically dependent variates for the purpose

of coping with unrepresentativeness of small samples and effects of illusory outliers in the context

of (i) estimation of distribution functions, and (ii) quantification of predictive uncertainty.

4. Calibrating probabilistic predictions against a reference (a prior distribution function).

5. Ordering predictors, or models, in terms of their informativeness for decision making

(equivalently, in terms of their expected utility or economic value to a decider).

While at first glance these issues may appear disjoint, they can be addressed systematically

and coherently within the mathematical framework of decision theory — a top-down approach

whose ultimate objective is to maximize the expected utility of outcomes to stakeholders at every

step of modeling.

2. DECISION CRITERIA

2.1 Target-Setting Problem

There is a class of decision problems under uncertainty wherein the optimal value of a contin-

uous decision variable a would be set to the realization w of the input variate W if only one knew

that realization at the time the decision must be made. This class includes problems of statistical

estimation (e.g., when a is the estimate of an unknown infiltration coefficient W ) and problems of

setting targets in the context of management, planning, and control (e.g., when a is the height of

a flood levee to be built and W is the maximum flood crest in the next 50 years; when a is the

reservoir storage to be emptied and W is the volume of the incoming flood).

Suppose a and w are continuous variables and u is a utility function that evaluates outcomes,

such that u(a, w) denotes the utility of outcome resulting from decision a and input w. If in-

put w were known with certainty, then the optimal decision would be a = w, which means that

max_a u(a, w) = u(w, w) for every w. Oftentimes, it is convenient to transform the utility function

into an opportunity loss function l, such that l(a, w) denotes the difference between the utility of

the optimal decision and the utility of a given decision, when w is fixed (DeGroot, 1970):

l(a, w) = max_a u(a, w) − u(a, w) = u(w, w) − u(a, w).    (1)

It follows that the optimal decision a∗ under uncertainty about W can be obtained either by maxi-

mizing the expected utility or, equivalently, by minimizing the expected opportunity loss:

max_a E[u(a, W)] = max_a {E[u(W, W) − l(a, W)]} = E[u(W, W)] − min_a E[l(a, W)].    (2)

Suppose furthermore that the uncertainty about the input variate W is quantified in terms of

a distribution function G such that for any realization w, one has G(w) = P (W ≤ w), where P

stands for probability; let g denote the corresponding density function of W .

Two questions arise. (i) Given the preferences of a decider encoded in the utility function u

(or the opportunity loss function l) and the uncertainty about the input variate W quantified by the

distribution function G (or the density function g), What is the optimal decision a∗ ? (ii) In the ab-

sence of well-formalized preferences over possible outcomes (which is often the case in scientific

estimation and prediction problems), What form of the utility function u (or the opportunity loss

function l) should be adopted to arrive at a rational decision in the court of miracles?

We shall address these questions by recalling known results from decision theory for four

forms of u (or l) shown in Fig. 1 (DeGroot, 1970; Pratt et al., 1995; Krzysztofowicz, 1990).

2.2 Impulse Utility Function

Suppose decision a that perfectly estimates input w results in infinite utility, whereas decision

a that mis-estimates input w results in zero utility, regardless of the direction and magnitude of

mis-estimation (Fig. 1(a)). Mathematically,


u(a, w) = δ(w − a) = { ∞ if a = w;  0 if a ≠ w },    (3)

where δ denotes the impulse function (Dirac function).

Each feasible decision is now evaluated in terms of the expected utility

U(a) = E[u(a, W)] = ∫_{−∞}^{∞} δ(w − a) g(w) dw = g(a),    (4)

and the optimal decision a∗ is found as the maximizer:

U(a∗) = max_a U(a) = max_a g(a);    (5)

that is, the optimal decision a∗ is a point at which the density function g attains the maximum.

Such a point a∗ is known as the mode of variate W . (It may be noted that (5) is akin to the

maximum likelihood criterion for parameter estimation.)

To illustrate this solution, Fig. 2(a) shows six Weibull density functions (see Appendix A for

the formula) with fixed scale parameter α and shift parameter η, and different shape parameter β

values, which result in different values of the mode a∗ (see Table 1).
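A minimal numerical sketch of criterion (5): the optimal decision is the maximizer of the Weibull density of Appendix A, obtained here both from the analytic mode formula and by a grid search; the parameter values (α = 10, β = 2, η = 0) are assumed for illustration and are not those of Table 1.

```python
import math

def weibull_pdf(w, alpha, beta, eta):
    """Shifted Weibull density g(w) from Appendix A, defined for w > eta."""
    z = (w - eta) / alpha
    return (beta / alpha) * z ** (beta - 1.0) * math.exp(-z ** beta)

def weibull_mode(alpha, beta, eta):
    """Analytic mode; for beta <= 1 the density is maximal at the lower bound eta."""
    if beta <= 1.0:
        return eta
    return eta + alpha * ((beta - 1.0) / beta) ** (1.0 / beta)

# Illustrative parameters (assumed, not those of Table 1).
alpha, beta, eta = 10.0, 2.0, 0.0

# Grid search for the maximizer of g, i.e. the optimal decision a* under (3)-(5).
grid = [eta + 1e-6 + 0.001 * k for k in range(60000)]
a_star_grid = max(grid, key=lambda w: weibull_pdf(w, alpha, beta, eta))

print(weibull_mode(alpha, beta, eta), round(a_star_grid, 3))  # both ~7.071
```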

Should a hydrologist adopt the mode of W as the preferred estimator? The above formulation

offers a way of rationalizing such a meta-decision: If the hydrologist believes that only a single

point — the perfect estimate of the unknown w — has any positive utility (which is infinite relative

to the zero utility of every non-perfect estimate), then the answer is affirmative; otherwise other

forms of u (or l) should be considered.

(To decide whether or not the impulse utility function is a suitable criterion for scientific

decisions, it may help to consider the problem of targeting a pistol in a duel. Here, w is the

position of the adversary on the horizontal axis, and a is the aiming point. The outcome of a = w

is the dead adversary, whereas the outcome of a ≠ w is a chance of ending up in the morgue

oneself, should the adversary’s targeting turn out to be perfect. If one assigns the infinite utility

to one’s own life relative to the adversary’s life, as a matter of rationality, or survival instinct, then

the impulse utility function is the exact mathematical model of this preference.)

2.3 Quadratic Difference Opportunity Loss Function

The opportunity loss is proportional to the quadratic difference between decision a and

input w (Fig. 1(b)):

l(a, w) = λ(a − w)²,    (6)

where λ is the marginal opportunity loss from mis-estimation of the input, regardless of the direc-

tion of mis-estimation; it may be in monetary units [$/(unit of w)²] or may represent subjective

valuation.

Each feasible decision a is now evaluated in terms of the expected opportunity loss

L(a) = E[λ(a − W)²],    (7)

and the optimal decision a∗ is found as the minimizer:

L(a∗) = min_a L(a) = λ Var(W),    (8)

where a∗ = E(W ). That is, the optimal decision equals the mean of variate W ; Table 1 shows

examples. (It may be noted that (7) is akin to the least squares criterion for parameter estimation.)

Should a hydrologist adopt the mean of W as the preferred estimator? Again, the above

formulation offers a way of rationalizing such a meta-decision. The answer is affirmative if the

hydrologist is indifferent with respect to the direction of mis-estimation and agrees that the op-

portunity loss increases quadratically with the magnitude of mis-estimation. The hydrologist may

also consider the implication of this preference: detailed quantification of uncertainty about vari-

ate W in terms of its distribution function G is irrelevant to decision making; the only relevant

information about variate W is its mean.

2.4 Two-Piece Linear Opportunity Loss Function

The opportunity loss is proportional to the absolute difference between decision a and input

w and its magnitude depends on the direction of mis-estimation (Fig. 1(c)):


l(a, w) = { λo(a − w) if w ≤ a;  λu(w − a) if a ≤ w },    (9)

where λo is the marginal opportunity loss from over-estimation of the input, and λu is the marginal

opportunity loss from under-estimation of the input; each constant may be in monetary units

[$/(unit of w)] or may represent subjective valuation.

Each feasible decision a is now evaluated in terms of the expected opportunity loss

L(a) = E[l(a, W)] = (λu + λo) a G(a) − λu a + λu E(W) − (λu + λo) ∫_{−∞}^{a} w g(w) dw,    (10)

and the optimal decision a∗ is found as the minimizer:

L(a∗) = min_a L(a) = λu E(W) − (λu + λo) ∫_{−∞}^{a∗} w g(w) dw,    (11)

where a∗ is such that G(a∗ ) = 1/[1 + λo /λu ]. Because W is a continuous variate, its distribution

function G has the inverse G−1 , called the quantile function of W . Hence,

a∗ = G−1 (1/[1 + λo /λu ]). (12)

That is, the optimal decision a∗ equals the quantile of variate W corresponding to the probability

1/[1 + λo /λu ]. This solution has two important properties.

First, in order to find the optimal decision a∗ , the hydrologist needs to know the distribution

function G of variate W and the ratio λo /λu of the marginal opportunity losses. Table 2 shows

the implications: for instance, when the hydrologist judges the marginal opportunity loss from

under-estimation to be 3 times as large as that from over-estimation, G(a∗ ) = 3/4; that is, the

optimal decision equals the third quartile of W under G.

Second, associated with the optimal decision a∗ are the probability of over-estimation of the

input, P (W < a∗ ) = G(a∗ ), and the probability of under-estimation of the input, P (W > a∗ ) =

1 − G(a∗ ). For instance, when a∗ is the planned daily capacity of a water delivery system, and

W is the daily river flow volume, then G(a∗ ) represents the probability of shortage. Note that this

is the “optimal” probability of shortage: given the ratio λo /λu of the marginal opportunity losses

and the distribution function G of the daily river flow volume W , it is optimal to incur a shortage

with probability G(a∗ ).
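A minimal sketch of solution (12), using the Weibull quantile function of Appendix A as G⁻¹; the opportunity-loss ratio and the distribution parameters below are assumed for illustration only.

```python
import math

def weibull_quantile(p, alpha, beta, eta):
    """Quantile function G^{-1}(p) of the shifted Weibull distribution (Appendix A)."""
    return eta + alpha * (-math.log(1.0 - p)) ** (1.0 / beta)

def optimal_target(lam_over, lam_under, alpha, beta, eta):
    """Optimal decision (12) under the two-piece linear opportunity loss (9):
    the quantile of W at probability p = 1/(1 + lam_over/lam_under)."""
    p = 1.0 / (1.0 + lam_over / lam_under)
    return weibull_quantile(p, alpha, beta, eta), p

# Illustrative values (assumed): under-estimation judged 3 times as costly as over-estimation.
a_star, p = optimal_target(lam_over=1.0, lam_under=3.0, alpha=10.0, beta=2.0, eta=0.0)
print(p, round(a_star, 2))   # p = 0.75, i.e. the third quartile of W under G
```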

Should a hydrologist adopt a quantile of W as the preferred estimator? Again, the above

formulation offers a way of rationalizing such a meta-decision. Inasmuch as the ratio λo /λu may

take on different values for different problems and different deciders, any quantile of W under G

may be optimal in some situation. In that sense, the two-piece linear opportunity loss (9) is the

most general among the three criterion functions considered herein — general in that it allows for

an asymmetry in the valuation of opportunity losses from over-estimation and under-estimation.

2.5 Absolute Difference Opportunity Loss Function

A special case of the two-piece linear opportunity loss function (9) arises when the marginal

opportunity losses are identical: λo = λu = λ. The opportunity loss function is now symmetric

about decision a and can be written concisely as

l(a, w) = λ|a − w|. (13)

In other words, the opportunity loss is proportional to the absolute difference between decision a

(the estimate) and realization w of variate W . From (12) it follows that the optimal decision a∗

equals the median of variate W , as

a∗ = G−1 (1/2). (14)

2.6 Median Versus Mean

When the opportunity losses from over-estimation and under-estimation are symmetric, the

hydrologist might ask which of the two functions, (6) or (13), reflects better his preferences. In the

absence of a definite answer, a secondary meta-consideration, especially in the court of miracles,

may be the robustness of the optimal decision to outliers. As is well known, the sample estimate

of the mean E(W ) is sensitive to outliers, whereas the sample estimate of the median G−1 (1/2) is

not. This fact might favor the median. A third meta-consideration, especially when the estimate

constitutes a prediction of W , is the communication of uncertainty to users. For this purpose, the

median of W (not the mean) is the preferred estimate because it conveys at least some rudimentary

assessment of uncertainty (the 50% chance of the actual realization being either below or above

the estimate), which the users can grasp intuitively.

2.7 The Meta-Decision Problem

We have examined a class of decision problems known as estimation problems or as target-

setting problems. Four models have been presented, each with a special criterion function and

an analytic solution for the optimal decision. The summary (Table 3) is this: Each of the four

estimators (mode, mean, median, a quantile) is optimal with respect to some criterion function.

Two conclusions follow.

First, the meta-decision problem can be framed thusly: the choice of an estimator for a

hydrological predictand is tantamount to accepting (if only implicitly) the corresponding criterion

function as representative of a hydrologist’s preferences over outcomes; and vice versa, the explicit

choice of the criterion function prescribes the type of estimator that is optimal.

Second, the apparent propensity of hydrologists (and scientists, in general) to prefer the mean

(as the estimator) and to deplore the bias (any deviation from the mean) finds no justification in

the theory of rational decisions as applied to real-world problems: ample empirical evidence from

cost-benefit analyses and preference-elicitation studies indicates that opportunity loss functions are

asymmetric more often than not (e.g., Krzysztofowicz, 1986).

3. MODELS OF STOCHASTIC DEPENDENCE

3.1 Joint Density Function

At the heart of model building in any science, including hydrology, is the task of identifying

and representing relations between variables. When uncertainties spoil the idealized deterministic

relations, the focus should shift to representing stochastic dependence structures between vari-

ates. The most general mathematical model of stochastic dependence is a joint density function π.

Herein, we focus on the simplest situation with two continuous variates, interpreted as a predictand

W and a predictor X, so that

π(w, x) = φ(w|x)κ(x) = f(x|w)g(w). (15)

The two factorizations of the joint density function π (in terms of the families of the conditional

density functions φ, f and the marginal density functions κ, g) are equivalent, provided they satisfy

the coherency requirements: the total probability law


κ(x) = ∫_{−∞}^{∞} f(x|w) g(w) dw,    (16)

and Bayes theorem


φ(w|x) = f(x|w) g(w) / κ(x).    (17)
The theoretic formulation (16)–(17), known as a Bayesian processor, decomposes the model build-

ing task as follows: first, model and estimate g and f ; then derive κ and φ. Thereby a complete

and coherent model of stochastic dependence can be built, which utilizes all available data.
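To make the two steps concrete, the following minimal sketch evaluates the total probability law (16) and Bayes theorem (17) by numerical integration on a grid; the Gaussian forms chosen for g and f(·|w) are assumptions for illustration only.

```python
import numpy as np

# Grids for the predictand w and the predictor x (assumed ranges for illustration).
w = np.linspace(-10.0, 10.0, 801)
x_grid = np.linspace(-10.0, 10.0, 801)
dw = w[1] - w[0]

def normal_pdf(z, mean, std):
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Prior density g of W and likelihood family f(x|w) (illustrative Gaussian choices).
g = normal_pdf(w, 0.0, 2.0)
def f(x, w):                      # density of X given W = w
    return normal_pdf(x, 0.8 * w + 1.0, 1.5)

# Total probability law (16): expected density kappa of X, evaluated on the x grid.
kappa = np.array([np.sum(f(x, w) * g) * dw for x in x_grid])

# Bayes theorem (17): posterior density phi(.|x) of W for one observed realization x.
x_obs = 3.0
kappa_x = np.sum(f(x_obs, w) * g) * dw
phi = f(x_obs, w) * g / kappa_x

# Both derived functions integrate to one, as densities should.
print(np.sum(kappa) * (x_grid[1] - x_grid[0]), np.sum(phi) * dw)
```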

The objective of this exposition is to explain the Bayesian approach to model building in the

context of two problems: (i) fusing data from a large sample of variate W and from a small sample

of variate X in order to improve the model for the density function κ of variate X, and (ii) making

a probabilistic prediction in the form of a conditional density function φ(·|x) of variate W , given

realization x of variate X.

Many hydrologists seem to be unaware of how strong the case for a Bayesian approach to

modeling really is. For this reason, we shall juxtapose it with a common statistical approach

which, in essence, focuses on building directly a model of the form h(w|x). In the most general

form, h(·|x) is a conditional density function of variate W , given realization x of variate X. It will

be shown that h ≠ φ. And it will be argued that in the court of miracles, the Bayesian approach

would carry the day on account of allowing the hydrologist (i) to use all available data, and (ii)

to calibrate probabilistic predictions against a specified standard. Both of these unique advantages

of the Bayesian approach should translate into a more robust model — the first line of defense

against model failures.

3.2 Data Paradigm

Suppose the available data are organized into two samples. (i) The joint sample of (X, W )

consists of N realizations:

{(x(n), w(n)) : n = 1, ..., N}. (18)

(ii) The prior sample of W (which may also be called a climatic sample of W when it has been

collected over a sufficient number of years to represent the climate) consists of M realizations:

{w(n) : n = 1, ..., M}. (19)

The two samples are asymmetric, N < M, when the prior sample includes M − N additional

realizations {w(n) : n = N + 1, ..., M} of W for which no matching realizations of X exist.

Here is a typical example (Krzysztofowicz & Watada, 1986). A long climatic record, say

from M years, exists of runoff volume w measured at the catchment outlet during the snowmelt

season, but the record of snowpack depth x measurements is recent and short, say from N years,

N < M. With the objective of predicting the seasonal snowmelt runoff volume W , given the

snowpack depth X = x on some fixed date during the winter, the hydrologist should wish not

to discard the M − N years of runoff volume measurements. Yet this he must do if he takes a

common statistical approach.

3.3 Common Approach to Prediction

This section reviews the simple linear regression (DeGroot, 1989) that yields a model of the

form h(w|x) and that sets the stage for the Bayesian model.

3.3.1 Model. Under the normal-linear model, the stochastic dependence between the pre-

dictand W and the predictor X is represented thusly: conditional on a realization of the predictor

X = x, for any x (−∞ < x < ∞),

W = cx + d + Ξ, (20)

where c, d are parameters, and Ξ is a residual variate being stochastically independent of X, and

having a Gaussian density function with E(Ξ) = 0 and Var(Ξ) = τ 2 . It then follows that the

conditional density function h(·|x) of W , given any realization X = x, is Gaussian with the mean

and variance

E(W |X = x) = cx + d, (21a)

Var(W |X = x) = τ 2 . (21b)

Equation (21a) specifies the linear regression of W on X.

3.3.2 Assumptions. Inasmuch as the objective is to obtain a model for the conditional

density function, it is necessary to validate all three assumptions behind (20):

1. linearity — the regression of W on X must be linear; practically, the scatterplot of

the joint sample must hug the line cx + d;

2. homoscedasticity — the variance τ 2 of the residual variate Ξ must be independent

of x; practically, the scatterplot of the residuals ξ(n) = w(n) − cx(n) − d versus

predictor realizations x(n) for all n = 1, ..., N must show no trend;

3. normality — the distribution function of Ξ must be Gaussian; practically, the

Gaussian distribution function with mean zero and variance τ 2 must fit well the

empirical distribution function constructed from the sample {ξ(n) : n = 1, ..., N}.

3.3.3 Parameter Estimates. The parameters c, d, τ are estimated from the joint sample

(18). The estimates of c, d are usually obtained via the least squares method; the estimate of τ 2

for our purpose must be the sample variance: (1/N)Σξ 2 (n). However, the same parameter values

can be obtained via the sample estimates of moments (DeGroot, 1989). Let mX , mW denote the

sample means and s2X , s2W denote the sample variances of X and W , respectively; let r denote the

sample correlation coefficient between X and W . Then

c = r sW/sX,    d = mW − c mX,    (22a)

τ² = c² sX² (r⁻² − 1).    (22b)

These relations will prove handy in explaining the distinctions between the common approach and

the Bayesian approach.
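A minimal sketch of estimates (22) from sample moments (pure Python, using statistics.correlation from Python 3.10+); the joint-sample values are hypothetical, not those of the tutorial examples in Section 5.

```python
import statistics as st

# Hypothetical joint sample {(x(n), w(n))}: illustrative values only.
xs = [12.0, 15.0, 18.0, 22.0, 25.0]
ws = [30.0, 36.0, 41.0, 50.0, 55.0]

m_x, m_w = st.fmean(xs), st.fmean(ws)
s_x, s_w = st.pstdev(xs), st.pstdev(ws)          # population (1/N) moments, as in the text
r = st.correlation(xs, ws)                       # sample correlation coefficient

c = r * s_w / s_x                                # regression slope (22a)
d = m_w - c * m_x                                # regression intercept (22a)
tau2 = c ** 2 * s_x ** 2 * (r ** -2 - 1.0)       # residual variance (22b)

print(round(c, 3), round(d, 3), round(tau2, 3))
```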

3.4 Bayesian Approach to Prediction

When the joint sample is small, the estimates of the regression parameters c, d, τ may be

erroneous. Whereas Bayesian statistical theory offers methods for quantifying uncertainty about

the regression parameters, we focus on a Bayesian method for improving the estimates by ex-

tracting information from the prior sample, which contains realizations of W not included in the

joint sample (Krzysztofowicz, 1983; Krzysztofowicz & Watada, 1986; Krzysztofowicz & Reese,

1991).

3.4.1 Prior Density Function. Suppose the prior (marginal) density function g of predictand

W is Gaussian with the mean and variance

E(W ) = M, Var(W ) = S 2 , (23)

both estimated from the prior sample (19).

3.4.2 Likelihood Function. This is f(x|·), the likelihood function of W , given X = x

(−∞ < x < ∞). To obtain a model for f , the role of W and X in (20) is reversed: conditional

on the hypothesis that the realization of predictand is W = w, for any w (−∞ < w < ∞),

X = aw + b + Θ, (24)

where a, b are parameters, and Θ is a residual variate being stochastically independent of W , and

having a Gaussian density function with E(Θ) = 0 and Var(Θ) = σ 2 . It then follows that the

conditional density function f (·|w) of X, given any hypothesis W = w, is Gaussian with the mean

and variance

E(X|W = w) = aw + b, (25a)

Var(X|W = w) = σ2 . (25b)

Equation (25a) specifies the linear regression of X on W . The assumptions behind (24) parallel

the assumptions listed in Section 3.3.2 and should be validated likewise.

3.4.3 Parameter Estimates. The likelihood parameters a, b, σ are estimated from the

joint sample (18) via the maximum likelihood method, which for a and b is equivalent to the

least squares method, and for σ 2 is equivalent to calculating the sample variance: (1/N)Σθ2 (n).

However, these parameter values can also be obtained via the same sample estimates of moments

that appear in (22). To wit,

a = r sX/sW,    b = mX − a mW,    (26a)

σ² = a² sW² (r⁻² − 1).    (26b)

3.4.4 Expected Density Function. When g and f are inserted into the total probability law

(16), one obtains the expected density function κ of predictor X. It is a Gaussian density function

under which

E(X) = aM + b,    (27a)

Var(X) = a²S² + σ².    (27b)

3.4.5 Posterior Density Function. When g, f, and κ are inserted into Bayes theorem (17),

one obtains the posterior density function φ(·|x) of predictand W , conditional on a realization of

the predictor X = x. It is a Gaussian density function under which

E(W |X = x) = Ax + B, (28a)

Var(W |X = x) = T 2 , (28b)

where

A = aS²/(a²S² + σ²),    B = (Mσ² − abS²)/(a²S² + σ²),    (29a)

T² = σ²S²/(a²S² + σ²).    (29b)
In addition, the posterior quantiles of W are specified by the equation

wp = Ax + B + T Q−1 (p), (30)

where Q−1 is the inverse of the standard normal distribution function, 0 < p < 1, and wp is the

p-probability posterior quantile of W , conditional on X = x; that is P (W ≤ wp |X = x) = p. It

follows that the posterior median, w0.5 = Ax + B, equals the posterior mean (28a), and both are

equal to the posterior mode of W , as is well known.
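A minimal sketch of the Gaussian processor (23)–(30), computing the posterior parameters (29) and a posterior quantile (30); all parameter values are assumed for illustration.

```python
from statistics import NormalDist

def gaussian_processor(M, S, a, b, sigma):
    """Posterior parameters (29) of the Bayesian Gaussian processor, given the prior
    parameters (M, S) from (23) and the likelihood parameters (a, b, sigma) from (24)-(26)."""
    denom = a ** 2 * S ** 2 + sigma ** 2
    A = a * S ** 2 / denom
    B = (M * sigma ** 2 - a * b * S ** 2) / denom
    T = (sigma ** 2 * S ** 2 / denom) ** 0.5
    return A, B, T

# Illustrative parameter values (assumed).
M, S = 100.0, 20.0            # prior mean and standard deviation of W
a, b, sigma = 0.5, 10.0, 8.0  # likelihood parameters of X given W

A, B, T = gaussian_processor(M, S, a, b, sigma)

# Posterior mean/median/mode (28a) and a posterior quantile (30) for an observed x.
x = 70.0
w_median = A * x + B
w_090 = A * x + B + T * NormalDist().inv_cdf(0.90)
print(round(A, 3), round(B, 3), round(T, 3), round(w_median, 2), round(w_090, 2))
```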

4. ATTRIBUTES OF BAYESIAN APPROACH

4.1 Theoretic Relations

The most frequent question triggered by the Bayesian approach is this: What is the difference

between model (28) and model (21)? Framing it mathematically, the question becomes: Under

what conditions do the following equalities hold:

A = c, B = d, T 2 = τ 2. (31)

After inserting (26) into (29), one can obtain the right side of (22) if, and only if,

M = mW , S 2 = s2W . (32)

Thus the two approaches yield the same prediction, φ = h, if and only if the prior mean M and

the prior variance S 2 of predictand W are identical with the sample mean mW and the sample

variance s2W from the joint sample (18). If at least one equality in (32) is violated, then at least one

equality in (31) is violated too; hence, φ ≠ h. When M and S come from a prior sample (19),

which is larger than the joint sample (18), the equalities in (32) are unlikely to hold. Whereas the

common approach ignores (M, S), the Bayesian approach fuses it with (mW , sW ). Consequently,

the posterior parameters A, B, T do not correspond to any sample — they owe their being to the

Bayesian theory. Practical advantages of this data fusion are illustrated in Section 5.

4.2 Common Misconceptions

One common misconception is not to recognize that (32) is the necessary condition for (31).

For instance, Todini (2008) claims erroneously that he “improved” our Bayesian forecasting

system (Krzysztofowicz & Kelly, 2000b) by estimating π̄(w, x) and κ̄(x) directly from a joint

sample and then finding h(w|x) = π̄(w, x)/κ̄(x). This argument misses, of course, the essence of

the Bayesian theory, which is to fuse information from different sources.

Another common misconception is to demand that probabilistic predictions, in the form

φ(·|x), output from the Bayesian processor be validated empirically on some joint sample. This is

a folly — a mental carryover from ad-hoc approaches to model building. For as noted in Section

4.1, the posterior parameters A, B, T do not correspond to any sample — they are theoretic con-

structs obtained by fusing estimates from two asymmetric samples that together contain all data

available at the time of model building. Only after one accumulates a new joint sample at least

as large as the prior sample (size M or larger), will one be able to perform a reasonable empirical

validation. But, is it needed at all? We address this question in Section 5.4.

5. DATA FUSION AND PREDICTION CALIBRATION

5.1 The Setting for Tutorial Examples

The hydrologist can exploit the advantages of the Bayesian fusion of information to solve two

practical problems: improving an estimate of a distribution function, and calibrating a probabilistic

forecast. These two problems and the workings of the Bayesian processor are illustrated via

tutorial examples. There is a prior sample of size M = 15, and four joint samples created as

follows. (i) Sample A of size N = 10 contains realizations of W which are representative of the

realizations in the prior sample, and hence fairly representative of the prior distribution function of

W . Each of the remaining joint samples has size N = 5 and is unrepresentative. (ii) Sample L

comes from the left tail. (iii) Sample R comes from the right tail. (iv) Sample O includes the

same realizations as sample L, except one realization of X whose value was changed from 23 to

55 (the largest realization of X in sample A) to create an illusory effect of an outlier.

Figure 3 shows the empirical distribution function of W and the parametric distribution func-

tion G which is Gaussian with mean M and variance S 2 , for short N (M, S 2 ); both are estimated

from the prior sample.

Figures 4–7 show results of the common approach which estimates the regression of W on

X (graph a) and constructs the conditional distribution function H(·|x) of W (graph d); they can

be compared with results of the Bayesian approach which estimates the regression of X on W

(graph b), and derives the expected distribution function K of X (graph c), as well as the posterior

distribution function Φ(·|x) of W (graph d).

5.2 An Improved Marginal Distribution

Given only a marginal sample of X, one can estimate mX , s2X to obtain a distribution function

K̄ of X, which in our example is N (mX , s2X ). Obviously, when the sample is unrepresentative,

which is likely when it is small, K̄ is erroneous. Can it be improved in some way? Enter the

Bayesian approach (Krzysztofowicz & Kelly, 2000a): if variate X is stochastically dependent on

another variate, say W , whose distribution function G has been estimated from a large and repre-

sentative sample (the prior sample), and there exists a joint sample of X and W , however short,

then the Bayesian approach allows one to improve the initial estimate K̄ of the distribution function

of X. The improved distribution function K is N (E(X), Var(X)), where E(X) and Var(X) are

specified by (27) in terms of the likelihood parameters (a, b, σ 2 ) and the prior parameters (M, S 2 ).

Graph (c) in Figs. 4–7 compares K with K̄. When the marginal sample of X is fairly

representative (Fig. 4), the difference between K and K̄ is minor; thus for the sake of discussion,

let us treat this K as the “true” distribution function K ∗ of X. When the marginal sample of X

is unrepresentative (Figs. 5, 6), function K̄ estimated directly from the sample follows closely the

empirical distribution function, which is located near the tail from which the sample was drawn.

Relative to K̄, function K obtained via the Bayesian processor is shifted correctly towards K ∗ ;

this type of adjustment is typical. In the example with an outlier (Fig. 7), function K̄ interpolates

between the four points of the empirical distribution function in the left tail and the one point in

the right tail. On the other hand, function K recovers, almost, the true function K ∗ .

How to explain the improvement that K offers over K̄ when the joint sample is unrepresen-

tative and has size N = 5 only? An improvement is possible because the Bayesian processor can

recognize an unrepresentative joint sample: it applies the total probability law to check the coher-

ence of estimates mW , sW from the joint sample and the estimates M, S from the prior sample.

The mechanism becomes evident upon inserting (26) into (27):

E(X) = r (sX/sW)(M − mW) + mX,    (33a)

Var(X) = [ r² (S²/sW² − 1) + 1 ] sX².    (33b)

When the joint sample is representative, mW = M and s2W = S 2 ; consequently, E(X) = mX and

Var(X) = s2X . When the joint sample is unrepresentative, the Bayesian processor can recognize

whether it comes from the left tail of the prior distribution function (mW < M) or the right

tail (mW > M), and whether it is underdispersed (s2W < S 2 ) or overdispersed (s2W > S 2 ).

Then it revises the initial estimates mX , s2X to make them cohere with the prior parameters M,

S 2 according to the strength of stochastic dependence between X and W (quantified through the

correlation coefficient r). Broadly speaking, the Bayesian processor takes the initial estimate K̄ of

the distribution function of variate X and recalibrates it against the specified distribution function

G of the related variate W .
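A minimal sketch of the revision mechanism (33); the moment values are assumed for illustration (a joint sample drawn from the left tail of the prior distribution) and are not taken from the tutorial samples.

```python
def improved_marginal(m_x, s_x, m_w, s_w, r, M, S):
    """Bayesian estimates (33) of E(X) and Var(X): the direct estimates (m_x, s_x) from
    the joint sample are revised towards coherence with the prior parameters (M, S)."""
    mean_x = r * (s_x / s_w) * (M - m_w) + m_x
    var_x = (r ** 2 * (S ** 2 / s_w ** 2 - 1.0) + 1.0) * s_x ** 2
    return mean_x, var_x

# Assumed values for illustration: a joint sample from the left tail
# (m_w below the prior mean M, s_w below the prior standard deviation S).
mean_x, var_x = improved_marginal(m_x=20.0, s_x=5.0, m_w=80.0, s_w=12.0,
                                  r=0.8, M=100.0, S=20.0)
print(round(mean_x, 2), round(var_x, 2))   # shifted and dispersed relative to (20.0, 25.0)
```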

5.3 An Improved Conditional Distribution

The predictions of W based on X = x obtained via the common approach and via the

Bayesian approach are compared in Figs. 4–7 in two graphs: graph (a) compares the regression

cx + d output from (21) with the posterior mean Ax + B output from (28); graph (d) compares the

conditional distribution function H(·|x) of W , which is N (cx + d, τ 2 ), with the posterior distribu-

tion function Φ(·|x) of W , which is N (Ax + B, T 2 ). Again, the difference is minor when the joint

sample is fairly representative, albeit small (Fig. 4). It may be drastic when the joint sample is un-

representative (Figs. 5, 6). In both cases, regression cx + d has incorrect slope whereas regression

Ax + B recovers, almost, the true slope, and only the intercept remains in error. In effect, the two

regressions form scissors: the difference between them is the smallest in the tail from which the

joint sample was drawn and the largest in the opposite tail. The last case (Fig. 7) shows how an

(illusory) outlier, drawn from the tail opposite to that of the rest of the sample, can help (by chance) the

common approach by effecting the correct slope. As a result, the difference between H(·|x) and

Φ(·|x) is minor even though the difference between K̄ and K is large.

In general, any difference between the predictions can be explained thusly: The distribu-

tion function H(·|x) conveys a quantification of conditional uncertainty about W , given X = x,

gleaned from the joint sample alone. On the other hand, the distribution function Φ(·|x) fuses

information regarding stochastic dependence between X and W extracted from the joint sample

with information regarding natural uncertainty about W (viz, natural variability of W ) extracted

from the prior sample. The fusion of information takes the form of a revision: the prior distribution

function G is revised to the extent justified by the degree of dependence between X and W , as

encoded in the likelihood parameters (a, b, σ).

5.4 A Calibrated Prediction

The notion of calibration of probabilistic predictions is an established concept in Bayesian

theory (DeGroot & Fienberg, 1982, 1983; Vardeman & Meeden, 1983). It arises from the view-

point of a user, a decider, or an external observer, who asks the question: Can the probability of

an event, say {W ≤ w}, specified by the distribution function Φ(·|x) be taken at its face value?

Framed mathematically, with P denoting the probability from a user’s point of view, the question

is whether or not the following equality holds for every w and every x:

P (W ≤ w|Φ(·|x)) = Φ(w|x). (34)

The treatment of this question is beyond the scope of this article, and also beyond the current

level of verification methods in hydrology and meteorology. (For a more extensive discussion,

see Krzysztofowicz & Sigrest, 1999, Appendix A). However, a simpler framing is easily treatable

and is this: In a large number of predictions of event {W ≤ w}, each prediction using a different

realization X = x and, therefore, assigning a different probability Φ(w|x) to the event, does

the mean of the probabilities converge, in the limit, to the relative frequency of event {W ≤ w}?

When the prior distribution function G of W is known (and is estimated from all available

information), event {W ≤ w} occurs with probability P (W ≤ w) = G(w). Hence the following.

Definition 1 (Calibration). A system producing a probabilistic prediction of W in the

form of a conditional density function drawn from the family {φ(·|x) : all x} is said to

be well calibrated if

E[φ(·|X)] = g. (35)

In other words, after a large number of predictions is made, the mean of the conditional density

functions should, in the limit, be equal to the prior density function.

Theorem 1 (Calibration of Bayesian Processor). The Bayesian processor (16)–(17) is

self-calibrated.

Proof.

E[φ(w|X)] = ∫_{−∞}^{∞} φ(w|x) κ(x) dx = ∫_{−∞}^{∞} [f(x|w) g(w)/κ(x)] κ(x) dx = g(w) ∫_{−∞}^{∞} f(x|w) dx = g(w).    (36)
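Although Theorem 1 makes an empirical check unnecessary, a minimal Monte Carlo sketch (assumed Gaussian g and f, as in Sections 3.4.1–3.4.2, with illustrative parameter values) displays property (35): averaging the posterior density over realizations of X drawn from κ recovers the prior density.

```python
import math
import random

random.seed(1)
M, S = 0.0, 2.0               # assumed prior parameters of W
a, b, sigma = 0.8, 1.0, 1.5   # assumed likelihood parameters of X given W

def phi(w, x):
    """Posterior density (17), which in the Gaussian case is given by (28)-(29)."""
    denom = a ** 2 * S ** 2 + sigma ** 2
    mean = (a * S ** 2 * x + M * sigma ** 2 - a * b * S ** 2) / denom
    var = sigma ** 2 * S ** 2 / denom
    return math.exp(-0.5 * (w - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Draw X from kappa by simulating the chain W ~ g, then X = aW + b + residual.
w_point = 1.0                 # evaluate the densities at one fixed w
n = 200_000
avg = sum(phi(w_point, a * random.gauss(M, S) + b + random.gauss(0.0, sigma))
          for _ in range(n)) / n

g_point = math.exp(-0.5 * (w_point - M) ** 2 / S ** 2) / math.sqrt(2.0 * math.pi * S ** 2)
print(round(avg, 4), round(g_point, 4))   # agree up to Monte Carlo sampling error
```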

In conclusion, probabilistic predictions in the form of the conditional distribution functions

Φ(·|x) produced by the Bayesian approach (23)–(30) are guaranteed to be well calibrated against

the prior distribution function G (the calibration standard). On the other hand, probabilistic pre-

dictions in the form of the conditional distribution functions H(·|x) produced by the common

approach (20)–(22) are not well calibrated against G unless (32) holds.

The above conclusion contains answers to the question raised in Section 4.2. First, there is

no need to validate the calibration of Bayesian predictions because it is guaranteed — the Bayesian

processor self-calibrates. Second, the need for validation of calibration felt by those who follow

a common approach to the prediction problem is well-meant, but its result is predictable: They will

find (empirically) their predictions to be well calibrated if, and only if, (32) holds. A Bayesian

hydrologist can easily check this fact beforehand, during the model building process.

Of course, the above conclusion presumes that the models for f and h are valid in that they

satisfy the assumptions listed in Section 3.3.2 and that the model for g is valid as well. That is

why validation of each component model, g and f , on all available samples is an essential step —

a due diligence — of the Bayesian approach.

(For an extension of these principles to Bayesian forecasting systems processing outputs from

deterministic hydrologic models of any complexity, see Krzysztofowicz (1999).)

6. INFORMATIVENESS OF PREDICTION FOR DECISIONS

Our final objective is to couple the prediction problem (Sections 3–5) with the decision prob-

lem (Section 2). Two basic questions arise: (i) How to make an optimal decision? (ii) How to

evaluate a predictor? Bayesian decision theory offers a mathematical framework for addressing

both questions.

6.1 Optimal Decision Functions

For the decision problem studied in Section 2, the uncertainty about the input variate W is

now quantified in terms of the posterior density function φ(·|x) output from the Bayesian processor

(16)–(17), given realization x of predictor X. Conditional on X = x, each feasible decision is

evaluated in terms of the posterior expected utility:

U(a, x) = E[u(a, W)|X = x] = ∫_{−∞}^{∞} u(a, w) φ(w|x) dw,    (37)

and the optimal decision α∗ (x), which depends on x, is that which yields the maximum posterior

expected utility:

U∗(x) = U(α∗(x), x) = max_a U(a, x).    (38)

When this procedure is executed for every x, one obtains the optimal decision function α∗ .

The analytic solutions derived in Section 2 under four special criterion functions generalize

directly: each of the four estimators (Table 3) is now preceded by the adjective conditional (con-

ditional mode, conditional mean, conditional median, a conditional quantile). When φ(·|x) comes

from the Gaussian processor (23)–(30),

α∗ (x) = Ax + B (39)

is the conditional mode, the conditional mean, and the conditional median, whereas

α∗ (x) = Ax + B + T Q−1 (p) (40)

is the p-probability conditional quantile; setting p = 1/[1 + λo /λu ] yields the estimator α∗ (x)

which is optimal under the two-piece linear opportunity loss function.
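A minimal sketch of the optimal decision function (40) under the two-piece linear opportunity loss; the posterior parameters and marginal losses below are assumed for illustration.

```python
from statistics import NormalDist

def optimal_decision(x, A, B, T, lam_over, lam_under):
    """Optimal decision (40): the conditional quantile of W given X = x at
    probability p = 1/(1 + lam_over/lam_under)."""
    p = 1.0 / (1.0 + lam_over / lam_under)
    return A * x + B + T * NormalDist().inv_cdf(p)

# Assumed posterior parameters and marginal opportunity losses (illustrative).
A, B, T = 0.6, 40.0, 7.0
print(round(optimal_decision(x=70.0, A=A, B=B, T=T, lam_over=1.0, lam_under=3.0), 2))
```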

6.2 Bayes Utility

The overall evaluation, independent of the realization x of the predictor, is obtained by taking

the expectation of (38). Called the integrated maximum posterior expected utility (Bayes utility,

for short), it is given by

UX = E[U∗(X)]

   = ∫_{−∞}^{∞} U∗(x) κ(x) dx

   = ∫_{−∞}^{∞} [ max_a ∫_{−∞}^{∞} u(a, w) φ(w|x) dw ] κ(x) dx

   = ∫_{−∞}^{∞} [ max_a ∫_{−∞}^{∞} u(a, w) f(x|w) g(w) dw ] dx,    (41)

where the fourth line is obtained from the third line by replacing φ(w|x) with (17).

The Bayes utility UX constitutes a measure that evaluates the performance of a prediction-

decision system. Expression (41) reveals that this evaluation depends upon the likelihood function

f which characterizes the predictor, the prior density function g which characterizes the natural

uncertainty about the predictand, and the utility function u which encodes the decider’s preferences

over the outcomes. Thus the evaluation of a particular predictor depends upon elements g and u

of the decision problem; we shall recognize this explicitly by writing UX(g, u).

6.3 Comparison Problem

Consider now two predictors (hydrological models, observing systems), say X1 and X2 , of

the same predictand W . Suppose their Bayes utilities are UX1 (g, u) and UX2 (g, u), respectively.

Having to choose between X1 and X2 , a rational decider would obviously prefer the predictor with

a higher utility. Hence the following.

Definition 2 (Informativeness). Predictor X1 is said to be more informative than predictor

X2 if for every g and u,

UX1 (g, u) ≥ UX2 (g, u). (42)

The definition describes a situation wherein the order of the utilities remains identical for

every rational decider. Consequently, given a choice between X1 and X2 , each decider would

choose X1 . In that sense, X1 is preferred to X2 from the viewpoint of a utilitarian society. The

difficulty of ordering the predictors based on Definition 2 stems from the necessity of calculating

the utilities UX1 (g, u) and UX2 (g, u) for every possible prior density function g and every possible

utility function u — an insurmountable obstacle.

6.4 Theory of Sufficient Comparisons

Could an order between the utilities of predictors be established based solely on the families

of likelihood functions f1 and f2 ? The conditions under which the answer is affirmative and the

methods of establishing an order are the subjects of inquiry within the Bayesian theory of sufficient

comparisons of forecasters. The theory traces its roots to the works of Blackwell (1951, 1953)

on comparisons of experiments. Since then, the theory has been extended, operationalized, and

applied in various contexts, most extensively in signal detection theory developed by electrical

engineers. Applications to forecast systems are relatively recent. The next section presents one

operational result.

6.5 Informativeness Under Gaussian Likelihood Model

When the stochastic dependence between predictor X and predictand W is characterized by

the Gaussian likelihood model (24)–(25), the Bayesian theory of sufficient comparisons yields a

remarkably simple implementation (Krzysztofowicz, 1987, 1992).

6.5.1 Sufficiency Characteristic. Given the parameters a, b, σ of the Gaussian likelihood

model (24)–(25), define a measure

SC = |a|/σ.    (43)
Called the sufficiency characteristic of predictor X, it has four properties. (i) It is interpretable as

the “signal-to-noise ratio”: in the linear regression (25a) of X on W , the absolute value of the

slope coefficient, |a|, is the measure of signal, and the standard deviation of the residual variate,

σ, is the measure of noise. (ii) It is bounded as 0 ≤ SC ≤ ∞, with SC = 0 implying that X is

uninformative for predicting W , and SC = ∞ implying that X is a perfect predictor of W . (iii)

It is dimensional, with the unit being [unit of W ]−1 . (iv) It orders the alternative predictors of W

as follows.

Theorem 2 (Sufficient comparison via SC). Suppose each predictor Xi (i = 1, 2) of

predictand W is characterized by a Gaussian likelihood model with the sufficiency characteristic

SCi = |ai|/σi such that 0 < SCi < ∞. Predictor X1 is more informative than predictor X2 if

SC1 > SC2 . (44)

6.5.2 Informativeness Score. When in addition to the likelihood model being Gaussian

(24)–(25), the prior density function is Gaussian (23), the variance S 2 of predictand W completely

characterizes the prior uncertainty, 1/S can be interpreted as the “prior signal-to-noise ratio”, and

consequently the quotient SC/(1/S) = |a|S/σ measures the signal-to-noise ratio of a given pre-

dictor relative to the prior signal-to-noise ratio. This quotient constitutes the kernel of the measure
"µ ¶−2 #−1/2
SC
IS = +1
1/S

µ ¶−1/2
σ2
= +1 . (45)
a2 S 2
Called the (Bayesian) informativeness score, it has four properties. (i) It is interpretable as the

absolute value of a Bayesian estimator of the Pearson’s product-moment correlation coefficient

between X and W :

IS = |Cor(X, W )| , (46a)

Cor(X, W ) = (sign of a)IS. (46b)

The adjective Bayesian stems from the fact that S comes from the prior density function whereas

a and σ come from the likelihood function. Consequently IS does not equal the correlation

coefficient estimated from the joint sample of (X, W ) unless S = sW , as explained in Section

4.1. (ii) It is bounded as 0 ≤ IS ≤ 1, with IS = 0 implying that X is uninformative for

predicting W , and IS = 1 implying that X is a perfect predictor of W . (iii) It is dimensionless.

(iv) It orders the alternative predictors of W consistently with SC, being its strictly increasing

transformation. (Note: in my previous writings, (45) was termed the Bayesian correlation score.)

Theorem 3 (Sufficient comparison via IS). Suppose the assumptions of Theorem 2 hold,

predictand W has a Gaussian prior density function with variance S 2 , and the informativeness

score ISi of predictor Xi is obtained from SCi and S according to (45). Predictor X1 is more

informative than predictor X2 if

IS1 > IS2 . (47)

Table 4 reports the values of the signal-to-noise ratio, |a|/σ, and the informativeness score,

IS, in the four cases shown in Figs. 4–7. It also reports the values of R2 — the proportion of

the variation of the predictand that is explained by the predictor in the regression of the common

approach. While both measures, IS and R2 , imply the same order of the four predictors, their

numerical values are distinct.
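A minimal sketch of measures (43) and (45) used to compare two hypothetical predictors of the same predictand; all parameter values are assumed for illustration.

```python
def sufficiency_characteristic(a, sigma):
    """Signal-to-noise ratio SC = |a|/sigma of the Gaussian likelihood model (43)."""
    return abs(a) / sigma

def informativeness_score(a, sigma, S):
    """Informativeness score IS (45), given the prior standard deviation S of W."""
    return (sigma ** 2 / (a ** 2 * S ** 2) + 1.0) ** -0.5

S = 20.0                                            # assumed prior standard deviation of W
predictors = {"X1": (0.5, 6.0), "X2": (0.4, 8.0)}   # hypothetical (a, sigma) pairs

for name, (a, sigma) in predictors.items():
    print(name,
          round(sufficiency_characteristic(a, sigma), 4),
          round(informativeness_score(a, sigma, S), 4))
# X1 has the larger SC and IS, hence it is the more informative predictor (Theorems 2-3).
```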

6.6 Utilitarian Perspective

In summary, under the assumption that the likelihood model is Gaussian and the prior density

function is Gaussian, the theory of sufficient comparisons of predictors leads to simple statistical

measures of informativeness, |a|/σ and IS. The IS is suggested for reporting because it is dimen-

sionless and therefore comparable, in some sense, across different predictands. The key property

of this measure is its ordinal correspondence with the utility (equivalently, the economic value) of

the predictors to every rational decider. In that sense, the IS is a utilitarian measure of model

performance.

7. CLOSURE

The novel research paradigm, dubbed “the court of miracles of hydrological modeling”, fo-

cuses on achieving scientific progress in circumstances which traditionally would be called model

failures. We suggest that in such circumstances the requirements of a rational decider for infor-

mation help to frame the issues for a hydrologist. Basic modeling principles for addressing some

of these issues have been discussed herein from the viewpoint of the Bayesian forecast-decision

theory.

These modeling principles are general though they have been illustrated with specific statisti-

cal models. The meta-decision problem (Section 2) considers four special criterion functions but

is general insofar as the distribution function that quantifies uncertainty can be of any form. The

model of stochastic dependence (Section 3), the model for data fusion and prediction calibration

(Section 5), and the measures of informativeness (Section 6) come from the Bayesian Gaussian

processor. It is the simplest processor and, therefore, serves well for expository purposes (to ex-

plain advantages of the Bayesian approach when the joint samples are unrepresentative or affected

by illusory outliers). However, all the models and results discussed herein can be generalized by

employing the Bayesian Meta-Gaussian processor (Krzysztofowicz & Kelly, 2000a, 2000b) which

allows each marginal distribution function to take any form and the stochastic dependence struc-

ture to be nonlinear and heteroscedastic. Inasmuch as hydrological variates exhibit, with a few

exceptions, non-Gaussian, nonlinear, and heteroscedastic properties, application of the Bayesian

Meta-Gaussian processor to the issues treated in this article would be a natural extension.

Acknowledgment

This material is based upon work supported by the National Science Foundation under Grant

No. ATM-0641572, “New Statistical Techniques for Probabilistic Weather Forecasting”.

APPENDIX A: WEIBULL DISTRIBUTION

In the examples, the input variate W is assumed to have the sample space (η, ∞), and the

Weibull distribution with scale parameter α (α > 0), shape parameter β (β > 0), and shift para-

meter η (−∞ < η < ∞). The formulae for the density function g, the distribution function G,

and the quantile function G−1 , for any w ∈ (η, ∞) and any p ∈ (0, 1), are as follows:

g(w) = \frac{\beta}{\alpha} \left( \frac{w-\eta}{\alpha} \right)^{\beta-1} \exp\!\left[ -\left( \frac{w-\eta}{\alpha} \right)^{\beta} \right],

G(w) = 1 - \exp\!\left[ -\left( \frac{w-\eta}{\alpha} \right)^{\beta} \right],

G^{-1}(p) = \alpha \left[ -\ln(1-p) \right]^{1/\beta} + \eta.
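For a quick numerical check against Table 1, the following minimal sketch evaluates the quantile function above together with the closed-form mode and mean of the shifted Weibull distribution. The mode and mean formulae (mode = η for β ≤ 1, else η + α((β − 1)/β)^{1/β}; mean = η + αΓ(1 + 1/β)) are standard results assumed here, since they are not restated in the text.

```python
# Minimal sketch: reproduce the beta = 2 column of Table 1 (alpha = 1, eta = 0).
# The quantile function is the one given above; the mode and mean formulae are
# standard Weibull results assumed here rather than quoted from the text.
import math

def weibull_quantile(p, alpha=1.0, beta=1.0, eta=0.0):
    return alpha * (-math.log(1.0 - p)) ** (1.0 / beta) + eta

def weibull_mode(alpha=1.0, beta=1.0, eta=0.0):
    # mode = eta for beta <= 1, else eta + alpha * ((beta - 1)/beta)**(1/beta)
    return eta if beta <= 1.0 else eta + alpha * ((beta - 1.0) / beta) ** (1.0 / beta)

def weibull_mean(alpha=1.0, beta=1.0, eta=0.0):
    # mean = eta + alpha * Gamma(1 + 1/beta)
    return eta + alpha * math.gamma(1.0 + 1.0 / beta)

beta = 2.0
print(round(weibull_mode(beta=beta), 2),              # 0.71
      round(weibull_mean(beta=beta), 2),              # 0.89
      round(weibull_quantile(0.5, beta=beta), 2),     # 0.83 (median)
      round(weibull_quantile(0.6321, beta=beta), 2))  # 1.00
```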

REFERENCES

Andréassian, V., Perrin, C., Parent, E. & Bárdossy, A. (2010) The Court of Miracles of Hydrology:

Can Failure Stories Contribute to Hydrological Science? Hydrological Sciences Journal,

this issue.

Blackwell, D. (1951) Comparison of Experiments. In: Proc. Second Berkeley Symposium on

Mathematical Statistics and Probability (ed. by J. Neyman), 93–102. University of California

Press, Berkeley, California, USA.

Blackwell, D. (1953) Equivalent Comparisons of Experiments. Ann. Math. Statist. 24,

265–272.

DeGroot, M.H. (1970) Optimal Statistical Decisions. McGraw-Hill, New York, USA.

DeGroot, M.H. (1989) Probability and Statistics. Addison-Wesley, Reading, Massachusetts, USA.

DeGroot, M.H. & Fienberg, S.E. (1982) Assessing Probability Assessors: Calibration and

Refinement. In: Statistical Decision Theory and Related Topics, III (ed. by J.O. Berger

& S.S. Gupta), vol. 1, 291–314. Academic Press, New York, USA.

DeGroot, M.H. & Fienberg, S.E. (1983) The Comparison and Evaluation of Forecasters.

The Statistician 32, 12–22.

Krzysztofowicz, R. (1983) Why Should a Forecaster and a Decision Maker Use Bayes Theorem.

Water Resour. Res. 19(2), 327–336.

Krzysztofowicz, R. (1986) Expected Utility, Benefit, and Loss Criteria for Seasonal Water Supply

Planning. Water Resour. Res. 22(3), 303–312.

Krzysztofowicz, R. (1987) Markovian Forecast Processes. J. Am. Statist. Assoc. 82(397), 31–37.

Krzysztofowicz, R. (1990) Target-Setting Problem With Exponential Utility. IEEE Transactions

on Systems, Man, and Cybernetics 20(3), 687–688.

Krzysztofowicz, R. (1992) Bayesian Correlation Score: A Utilitarian Measure of Forecast Skill.

Monthly Weather Rev. 120(1), 208–219.

Krzysztofowicz, R. (1999) Bayesian Theory of Probabilistic Forecasting via Deterministic

Hydrologic Model. Water Resour. Res. 35(9), 2739–2750.

Krzysztofowicz, R. (2002) Bayesian System for Probabilistic River Stage Forecasting. J. Hyd.

268(1–4), 16–40.

Krzysztofowicz, R. & Kelly, K.S. (2000a) Bayesian Improver of a Distribution. Stoch. Env. Res.

and Risk Assess. 14(6), 449–470.

Krzysztofowicz, R. & Kelly, K.S. (2000b) Hydrologic Uncertainty Processor for Probabilistic

River Stage Forecasting. Water Resour. Res. 36(11), 3265–3277.

Krzysztofowicz, R. & Reese, S. (1991) Bayesian Analyses of Seasonal Runoff Forecasts.

Stoch. Hydrol. and Hydraulics 5(4), 295–322.

Krzysztofowicz, R. & Sigrest, A.A. (1999) Calibration of Probabilistic Quantitative Precipitation

Forecasts. Weather and Forecasting 14(3), 427–442.

Krzysztofowicz, R. & Watada, L.M. (1986) Stochastic Model of Seasonal Runoff Forecasts. Water

Resour. Res. 22(3), 296–302.

Pratt, J.W., Raiffa, H. & Schlaifer, R. (1995) Introduction to Statistical Decision Theory. The MIT

Press, Cambridge, Massachusetts, USA.

Todini, E. (2008) A Model Conditional Processor To Assess Predictive Uncertainty in Flood

Forecasting. Intl. J. River Basin Management 6(2), 1–15.

Vardeman, S. & Meeden, G. (1983) Calibration, Sufficiency, and Domination Considerations

for Bayesian Probability Assessors. J. Am. Statist. Assoc. 78(384), 808–816.

Wymore, A.W. (1977) A Mathematical Theory of Systems Engineering — The Elements. R.E.

Krieger Publishing Co., Huntington, New York, USA.

Table 1. Optimal decision a∗ in an estimation, or target-setting, problem

depends on the criterion function (see Table 3) and the distribution

function of the input variate W — here, the Weibull distribution

having scale parameter α = 1, shift parameter η = 0, and shape

parameter β as specified (see Fig. 2).

                                Shape parameter β
Optimal decision a∗         0.5      1      1.5     2      4      6
Mode of W                   0        0      0.48    0.71   0.93   0.97
Mean of W                   2.00     1.00   0.90    0.89   0.91   0.93
Median of W                 0.48     0.69   0.78    0.83   0.91   0.94
0.4-quantile of W           0.26     0.51   0.64    0.71   0.85   0.89
0.6321-quantile of W        1.00     1.00   1.00    1.00   1.00   1.00
0.8-quantile of W           2.59     1.61   1.37    1.27   1.13   1.08

Note: the p-quantile of W corresponds to the ratio λo/λu of the marginal opportunity
losses as follows: p = 0.4 ⇔ λo/λu = 3/2; p = 0.6321 ⇔ λo/λu = 291/500; p = 0.8 ⇔ λo/λu = 1/4.

Table 2. Under the two-piece linear opportunity loss function, the ratio

λo /λu of the marginal opportunity losses determines the optimal

decision a∗ and the associated optimal probability of

over-estimation, P (W < a∗ ) = G(a∗ ).

λo /λu 5/1 3/1 1/1 1/3 1/5

G(a∗ ) 1/6 1/4 1/2 3/4 5/6
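For convenience, the rule that generates these entries can be stated explicitly. It is the standard quantile solution for the two-piece linear opportunity loss criterion (Section 2), consistent with the footnote to Table 1, where λo and λu are assumed to denote the marginal opportunity losses of over-estimation and under-estimation, respectively:

G(a^{*}) = \frac{\lambda_u}{\lambda_u + \lambda_o}\,, \qquad \text{for example,} \quad \frac{\lambda_o}{\lambda_u} = \frac{5}{1} \;\Rightarrow\; G(a^{*}) = \frac{1}{1 + 5} = \frac{1}{6}\,.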

Table 3. Analytic solutions to estimation, or target-setting, problems

with special criterion functions; W is the variate whose

realization is being estimated.

Criterion function                         Optimal decision a∗
Impulse utility                            Mode of W
Quadratic difference opportunity loss      Mean of W
Absolute difference opportunity loss       Median of W
Two-piece linear opportunity loss          Quantile of W

Table 4. Measures of informativeness of predictor X for predictand W in (i) common

approach to prediction (regression of W on X): proportion of the explained

variation R2 ; and in (ii) Bayesian approach to prediction (regression of X

on W ): signal-to-noise ratio |a|/σ and informativeness score IS.

Sample R2 |a|/σ IS

A 0.70 0.1263 0.61

L 0.10 0.0589 0.25

R 0.13 0.0592 0.26

O 0.79 0.1718 0.74

[Fig. 1 graphs: (a) u(a, w), (b) l(a, w), (c) l(a, w), each plotted against w with the decision a marked on the axis; see caption below.]

Fig. 1. Special criterion functions for an estimation, or a target-setting, problem:

(a) impulse utility, (b) quadratic difference opportunity loss with λ = 1,

and (c) two-piece linear opportunity loss with λo = 1/2 and λu = 2

(a special case of which is absolute difference opportunity loss,

when λo = λu ).

[Fig. 2 graphs: (a) Weibull density functions g(w) and (b) Weibull distribution functions G(w), plotted for 0 ≤ w ≤ 3 with α = 1, η = 0, and β ∈ {0.5, 1, 1.5, 2, 4, 6}; see caption below.]

Fig. 2. The Weibull density functions (a) and the Weibull distribution functions (b)

with fixed values of the scale parameter α and the shift parameter η, and

with varying values of the shape parameter β. (See Table 1 for the

resultant values of the optimal decision a∗ under different criterion

functions; see Appendix A for the formulae.)

[Fig. 3 graph: P(W ≤ w) plotted against predictand w (0 to 60); parametric prior with M = 29.93, S = 9.89; see caption below.]

Fig. 3. Prior distribution function G of predictand W : empirical (circles) and

parametric N (M, S 2 ) (solid line).

[Fig. 4 graphs, sample A: (a) regression of W on X, cx + d with c = 0.73, d = 7.54, τ = 6.61, and posterior mean of W, Ax + B with A = 0.64, B = 10.34, T = 6.18; (b) regression of X on W, aw + b with a = 0.96, b = 2.13, σ = 7.59; (c) marginal distribution function K̄ of X with mx = 30.8, sx = 13.78, and expected distribution function K of X with E(X) = 30.83, SD(X) = 12.15; (d) conditional distribution functions H(w|X = 30) and Φ(w|X = 30); see caption below.]

Fig. 4. Comparison of results from the common model and the Bayesian model, given

joint sample A (all realizations): (a) the regression of W on X (broken line), and the

posterior mean of W (solid line) derived via Bayes theorem; (b) the regression

of X on W ; (c) the empirical distribution function of X (circles), the marginal

distribution function K̄ of X estimated from the marginal sample, and the expected

distribution function K of X derived via the total probability law; (d) the conditional

distribution function H(·|X = 30) of W constructed from regression of W on X

(graph a), and the posterior distribution function Φ(·|X = 30) of W derived via

Bayes theorem.
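To make the link between graphs (b) and (a) concrete, the following minimal sketch recomputes the posterior parameters A, B, T from the likelihood parameters a, b, σ of sample A and the prior moments M, S of Fig. 3. The normal-linear updating formulae are the standard conjugate Gaussian results, assumed here rather than quoted from the text; they reproduce the values annotated in Fig. 4 to within rounding of the reported inputs.

```python
# Minimal sketch (assumption: standard conjugate Gaussian updating).
# Likelihood: X | W = w ~ N(a*w + b, sigma**2); prior: W ~ N(M, S**2).
# Then the posterior is W | X = x ~ N(A*x + B, T**2) with the formulae below.
import math

def posterior_params(a, b, sigma, M, S):
    denom = a**2 * S**2 + sigma**2
    A = a * S**2 / denom
    B = (sigma**2 * M - a * b * S**2) / denom
    T = math.sqrt(sigma**2 * S**2 / denom)
    return A, B, T

# Inputs: a, b, sigma reported for joint sample A (graph b); M, S from Fig. 3.
A, B, T = posterior_params(a=0.96, b=2.13, sigma=7.59, M=29.93, S=9.89)
print(round(A, 2), round(B, 2), round(T, 2))
# -> 0.64 10.32 6.18, versus 0.64, 10.34, 6.18 annotated in Fig. 4
# (the small difference in B comes from rounding of the reported inputs).
```

Applied to the sample L parameters of Fig. 5 (a = 0.34, b = 12.13, σ = 5.78), the same function returns approximately 0.74, 13.34, 8.55, again matching the annotated values.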

[Fig. 5 graphs, sample L: (a) regression of W on X, cx + d with c = 0.31, d = 13.80, τ = 5.50, and posterior mean of W, Ax + B with A = 0.74, B = 13.34, T = 8.55; (b) regression of X on W, aw + b with a = 0.34, b = 12.13, σ = 5.78; (c) marginal distribution function K̄ of X with mx = 18.8, sx = 6.11, and expected distribution function K of X with E(X) = 22.32, SD(X) = 6.69; (d) conditional distribution functions H(w|X = 40) and Φ(w|X = 40); see caption below.]

Fig. 5. Comparison of results from the common model and the Bayesian model, given

joint sample L (drawn from the left tail). The interpretation of graphs is the

same as in Fig. 4. The additional dotted lines show (a) the posterior mean

of W , (c) the expected distribution function K of X, and (d) the posterior

distribution function Φ(·|X = 40) that would have been obtained if the joint

sample A (all realizations) were used.

[Fig. 6 graphs, sample R: (a) regression of W on X, cx + d with c = 0.31, d = 26.89, τ = 6.01, and posterior mean of W, Ax + B with A = 0.63, B = 5.72, T = 8.53; (b) regression of X on W, aw + b with a = 0.41, b = 26.43, σ = 6.88; (c) marginal distribution function K̄ of X with mx = 42.8, sx = 7.36, and expected distribution function K of X with E(X) = 38.62, SD(X) = 7.97; (d) conditional distribution functions H(w|X = 20) and Φ(w|X = 20); see caption below.]

Fig. 6. Comparison of results from the common model and the Bayesian model, given

joint sample R (drawn from the right tail). The interpretation of graphs is the

same as in Fig. 4. The additional dotted lines show (a) the posterior mean

of W , (c) the expected distribution function K of X, and (d) the posterior

distribution function Φ(·|X = 20) that would have been obtained if the joint

sample A (all realizations) were used.

[Fig. 7 graphs, sample O: (a) regression of W on X, cx + d with c = 0.64, d = 8.97, τ = 5.18, and posterior mean of W, Ax + B with A = 0.60, B = 11.38, T = 5.02; (b) regression of X on W, aw + b with a = 1.24, b = −6.15, σ = 7.24; (c) marginal distribution function K̄ of X with mx = 24.2, sx = 15.92, and expected distribution function K of X with E(X) = 31.08, SD(X) = 14.27; (d) conditional distribution functions H(w|X = 30) and Φ(w|X = 30); see caption below.]

Fig. 7. Comparison of results from the common model and the Bayesian model, given

joint sample O (containing an outlier). The interpretation of graphs is the

same as in Fig. 4. The additional dotted lines show (a) the posterior mean

of W , (c) the expected distribution function K of X, and (d) the posterior

distribution function Φ(·|X = 30) that would have been obtained if the joint

sample A (all realizations) were used.

