Decision Theory:
Concepts, Methods and Applications
(Special topics in Probabilistic Graphical Models)
FIRST COMPLETE DRAFT
November 30, 2003
Supervisor: Professor J. Rosenthal
STA4000Y
Anjali Mazumder
950116380
Part I: DECISION THEORY 
Concepts and Methods
Decision theory, as the name implies, is concerned with the process of making
decisions. Its extension to statistical decision theory includes decision making in the
presence of statistical knowledge, which sheds some light on the uncertainties involved.
The elements of decision theory are quite logical and perhaps even intuitive.
The classical approach to decision theory facilitates the use of sample information in
making inferences about the unknown quantities. Other relevant information includes
knowledge of the possible consequences, which is quantified by loss, and prior
information, which arises from sources other than the statistical investigation. The use
of Bayesian analysis in statistical decision theory is therefore natural: their unification
provides a foundational framework for building and solving decision problems. The
basic ideas of decision theory and of decision-theoretic methods lend themselves to a
variety of applications and computational and analytic advances.
This initial part of the report introduces the basic elements in (statistical) decision theory
and reviews some of the basic concepts of both frequentist statistics and Bayesian
analysis. This provides a foundational framework for developing the structure of
decision problems. The second section presents the main concepts and key methods
involved in decision theory. The last section of Part I extends this to statistical decision
theory – that is, decision problems with some statistical knowledge about the unknown
quantities. This provides a comprehensive overview of the decision theoretic framework.
Section 1: An Overview of the Decision Framework:
Concepts & Preliminaries
Decision theory is concerned with the problem of making decisions. The term statistical
decision theory pertains to decision making in the presence of statistical knowledge, by
shedding light on some of the uncertainties involved in the problem. For most of this
report, unless otherwise stated, it may be assumed that these uncertainties can be
considered to be unknown numerical quantities, denoted by θ. Decision making under
uncertainty draws on probability theory and graphical models. This report, and more
particularly this Part, focuses on the methodology and the mathematical and statistical
concepts pertinent to statistical decision theory. This initial section presents the
decisional framework and introduces the notation used to model decision problems.
Section 1.1: Rationale
A decision problem in itself is not complicated to comprehend or describe and can be
simply summarized with a few basic elements. However, before proceeding any further,
it is important to note that this report focuses on the rational decision or choice models
based upon individual rationality. Models of strategic rationality (small-group behavior)
or competitive rationality (market behavior) branch into the areas of game theory and
asset pricing theory, respectively. Thus, for the purposes of this report, these latter models have
been neglected as the interest of study is statistical decision theory based on individual
rationality.
“In a conventional rational choice model, individuals strive to satisfy their preferences for
the consequences of their actions given their beliefs about events, which are represented
by utility functions and probability distributions, and interactions among individuals are
governed by equilibrium conditions” (Nau, 2002[1]). Decision models lend themselves
to a decision making process which involves the consideration of the set of possible
actions from which one must choose, the circumstances that prevail and the consequences
that result from taking any given action. The optimal decision is to make a choice in such
a way as to make the consequences as favorable as possible.
As mentioned above, the unknown quantity θ describing the combination of “prevailing
circumstances and governing laws” is referred to as the state of nature (Lindgren, 1971).
If this state is known, it is simple to select an action according to how favorable the
consequences resulting from the various actions are under the known state. However, in many real problems
and those most pertinent to decision theory, the state of nature is not completely known.
Since these situations create ambiguity and uncertainty, the consequences and subsequent
results become complicated.
Decision problems under uncertainty involve “many diverse ingredients” – loss or gain
of money, security, satisfaction, etc. (Lindgren, 1971). Some of these “ingredients” can
be assessed while some may be unknown. Nevertheless, in order to construct a
mathematical framework in which to model decision problems, while providing a rational
basis for making decisions, a numerical scale is assumed to measure consequences.
Because monetary gain is often neither an adequate nor accurate measure of
consequences, the notion of utility is introduced to quantify preferences among various
prospects which a decision maker may be faced with.
Usually something is known about the state of nature, allowing a consideration of a set of
states as being admissible (or at least theoretically so), and thereby ruling out many that
are not. It is sometimes possible to take measurements or conduct experiments in order
to gain more information about the state. A decision process is referred to as “statistical”
when experiments of chance related to the state of nature are performed. The results of
such experiments are called data or observations. These provide a basis for the selection
of an action defined as a statistical decision rule.
To summarize, the “ingredients” of a decision problem include (a) a set of available
actions, (b) a set of admissible states of nature, and (c) a loss associated with each
combination of a state of nature and an action. When only these elements make up a
decision problem, it is referred to as the “no-data” or “without experimentation”
decision problem. However, if (d) observations from an experiment defined by the state
of nature are included with (a) to (c), then the decision problem is known as a statistical
decision problem. This initial overview of the decision framework
allows for a clear presentation of the mathematical and statistical concepts, notation and
structure involved in decision modeling.
Section 1.2 The Basic Elements
The previous section summarized the basic elements of decision problems. For brevity,
this section does not repeat the description of the two types of decision models and
simply states the mathematical structure associated with each element. It is assumed
that a decision maker can specify the following basic elements of a decision problem.
1. Action Space: A = {a}.
A single action is denoted by a, while the set of all possible actions is denoted by A.
It should be noted that the term actions is used in the decision literature instead of
decisions. However, they can be used somewhat interchangeably. Thus, a decision maker
is to select a single action a ∈ A from the space of all possible actions.
2. State Space: Θ = {θ} (or Parameter Space).
The decision process is affected by the unknown quantity θ ∈ Θ, which signifies the state
of nature. The set of all possible states of nature is denoted by Θ. Thus, a decision maker
perceives that a particular action a results in a corresponding state θ.
3. Consequence: C = {c}.
The consequence of choosing a possible action under a given state of nature may be
multidimensional and can be stated mathematically as c(a, θ) ∈ C.
4. Loss Function: l(θ, a) on Θ × A.
The objectives of a decision maker are described by a real-valued loss function l(θ, a),
which measures the loss (or negative utility) of the consequence c(a, θ).
5. Family of Experiments: E = {e}.
Typically, experiments are performed to obtain further information about each θ ∈ Θ. A
single experiment is denoted by e, while the set of all possible experiments is denoted
by E. Thus, a decision maker may select a single experiment e from a family of potential
experiments which can assist in determining the importance of possible actions or
decisions.
6. Sample Space: X = {x}.
An outcome of a potential experiment e ∈ E is denoted by x ∈ X. The importance of this
outcome was explained in (3) and hence is not repeated here. However, it should be noted
that when a statistical investigation (such as an experiment) is performed to obtain
information about θ, the subsequent observed outcome x is a random variable. The set of
all possible outcomes is the sample space, while a particular realization of X is denoted
by x. Notably, X is a subset of ℝⁿ.
7. Decision Rule: δ(x) ∈ A.
If a decision maker is to observe an outcome X = x and then choose a suitable action
δ(x) ∈ A, then the result is to use the data to minimize the loss l(δ(x), θ). Sections 2 and
3 focus on discussing the appropriate measures of minimization in decision processes.
8. Utility Evaluation: u(·, ·, ·, ·) on E × X × A × Θ.
The quantification of a decision maker’s preferences is described by a utility function
u(e, x, a, θ), which is assigned to the conduct of a particular e, a resulting observed x,
and the choice of a particular action a under a corresponding θ. The evaluation of the
utility function u takes into account the costs of an experiment as well as the
consequences of the specific action, which may be monetary and/or of other forms.
Section 1.3 Probability Measures
Statistical decision theory is based on probability theory and utility theory. Focusing on
the former, this subsection presents the elementary probability theory used in decision
processes. The probability distribution of a random variable, such as X, which is
dependent on θ, as stated above, is denoted by P_θ(E) or P_θ(X ∈ E), where E is an event.
It should also be noted that the random variable X can be assumed to be either continuous
or discrete. Although both cases are described here, the majority of this report focuses
on the discrete case. Thus, if X is continuous, then

P_θ(E) = ∫_E f(x | θ) dx = ∫_E dF_X(x | θ).

Similarly, if X is discrete, then

P_θ(E) = Σ_{x ∈ E} f(x | θ).
Although E has been used to describe a family of experiments and an event, and will be
used in the next subsection to denote expectations, the meaning of E will be clear from
the context.
For every e ∈ E (where e represents an experiment or even an event), a joint probability
measure P_{θ,x}(·, · | e), or simply P_{θ,x|e}, is assigned on what is more commonly
referred to as the possibility space. This is used to determine other probability
measures (Raiffa & Schlaifer, 2000):

(i) The marginal measure P′_θ(·) on the state space Θ. Of course, here the
assumption is that P′_θ(·) does not depend on e.

(ii) The conditional measure P_x(· | e, θ) on the sample space X for a given e and θ.

(iii) The marginal measure P_x(· | e) on the sample space X for a given e.

(iv) The conditional measure P″_θ(· | x) on the state space Θ for a given e and x. The
condition e is not stated, as x is the result of the experiment and hence the
relevant information is contained in x.
Before concluding this subsection, it is important to make two remarks. The first is to
summarize the three basic methods for assigning the above set of probability measures:
(a) if a joint probability measure is assigned to X × Θ, then the marginal and
conditional measures on Θ and X can be computed; (b) if a marginal measure (probability
distribution) is assigned to Θ and the conditional for X, then the joint can be found; and
similarly (c) if a marginal measure (probability distribution) is assigned to X and the
conditional for Θ, then the joint can be determined. These elementary “methods” or
concepts of probability have a more practical importance.

The second remark is simply to clarify that a prime on a probability measure indicates
a prior probability, whereas a double prime indicates a posterior probability. For the
most part, these notations are redundant, but at certain times they will help to keep
things clear. When the context is obvious, these superscripts will not be required. A
further discussion of priors is provided in Section 1.5.
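Method (b) of the first remark can be carried out directly for a small discrete problem. The following sketch builds the joint measure from a prior on Θ and a conditional on X, then recovers the marginal on X and the posterior on Θ; all the numbers and names are illustrative assumptions, not from this report.

```python
# Prior P'(theta) on a two-point state space Theta = {"th1", "th2"}  (assumed numbers)
prior = {"th1": 0.3, "th2": 0.7}

# Conditional P(x | theta) on the sample space X = {"x1", "x2"}
likelihood = {
    ("x1", "th1"): 0.9, ("x2", "th1"): 0.1,
    ("x1", "th2"): 0.2, ("x2", "th2"): 0.8,
}

# Method (b): the joint on X x Theta from the marginal on Theta and the conditional on X.
joint = {(x, th): likelihood[(x, th)] * prior[th] for (x, th) in likelihood}

# Marginal P(x) on X: sum the joint over the states.
marginal_x = {x: sum(p for (xx, th), p in joint.items() if xx == x)
              for x in ("x1", "x2")}

# Posterior P''(theta | x): joint divided by the marginal of the observed x.
posterior = {(th, x): joint[(x, th)] / marginal_x[x] for (x, th) in joint}
```

This is exactly the computation that Bayesian analysis (Section 1.5) performs: the double-primed posterior is obtained from the primed prior and the conditional for X.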
Section 1.4 Random Variables and Expectations
In many instances, real numbers or n-tuples (of numbers) describe the states of nature {θ}
and sample outcomes {x}. In some sections of this report a tilde sign may be used to
distinguish a random variable or function from a particular value of that function. For
example, the random variables x̃ and θ̃ may be defined by x̃(x, θ) = x and θ̃(x, θ) = θ,
respectively.
Expectations of random variables are almost always considered necessary when dealing
with decision processes such as loss functions. The expectation for a given value of θ is
defined to be

E_θ[h(X)] = ∫_X h(x) f(x | θ) dx   (continuous),

E_θ[h(X)] = Σ_{x ∈ X} h(x) f(x | θ)   (discrete).
As before, the superscripts and subscripts on the expectation operator behave in much
the same way as for the probability measure; when the context is clear, such scripts will
be minimized. Thus, with respect to points (i)-(iv) in the previous section, the notation
for the corresponding expectations is as follows:

(i) E′_θ, or E′(θ̃), is taken with respect to P′_θ.

(ii) E″_{θ|x} is taken with respect to P″_{θ|x}.

(iii) E_{z|e,θ} is taken with respect to P_{z|e,θ}.

(iv) E_{x|e} is taken with respect to P_{x|e}.
Section 1.5 Statistical Inference (Classical versus Bayesian)
Statistical inference is considered here within the decision framework. Both classical and
Bayesian perspectives are briefly presented to show the varying approaches. Classical
statistics uses the sample information to make inferences about the unknown quantity, θ.
These inferences (within decision theory) are combined with other relevant information
in order to choose an optimal decision. These other sources of relevant information
include knowledge of the possible consequences of the decisions and the prior
information mentioned previously. The former non-sample information, the consequences,
can be quantified by determining the possible loss incurred for each possible decision.
The latter, prior information, is information about θ arising from other relevant sources
such as past experience.
Bayesian analysis is the approach which “seeks to utilize prior information” (Berger,
1985). This third type of information is best described in terms of a probability
distribution. The symbol π(θ), or simply p(θ), will be used to represent a prior density of
θ. Similarly to the other definitions, in both the continuous and discrete cases the prior
probability distribution can be written as

P(θ ∈ E) = ∫_E dF^π(θ) = ∫_E π(θ) dθ   (continuous),

P(θ ∈ E) = Σ_{θ ∈ E} π(θ)   (discrete).
The uses of prior probabilities are discussed in the succeeding sections. Both
non-Bayesian and Bayesian decision theory are discussed in this report.
Section 1.6: Convex Sets
Various concepts presented throughout this report make use of the concepts of convexity
and concavity; hence, the required definitions and properties are summarized below.
Definition:
A linear combination of the form αx₁ + (1 − α)x₂ with 0 ≤ α ≤ 1 is called a convex
combination of x₁ and x₂ (Lindgren, 1971). If x₁ and x₂ are in Ω, then Ω is said to be
convex if the line segment between any two points in Ω is a subset of Ω (Berger, 1985).

This definition simply says that the set of all convex combinations of two given points is
precisely the set of points which make up the line segment joining these two points. The
values α and (1 − α), which fall between 0 and 1, may be interpreted as probabilities. A
convex combination of two points then becomes a combination of probabilities.
Definition:
If {x₁, x₂, ...} is a sequence of points in ℝᵐ, and α₁, α₂, ... are numbers with
0 ≤ α_i ≤ 1 and Σ_{i=1}^∞ α_i = 1, then Σ_{i=1}^∞ α_i x_i (when finite) is called a
convex combination. The convex hull of a set Ω is the set of all points which are convex
combinations of points in Ω (Berger, 1985).
Figure 1-1: [two sets, Ω₁ and Ω₂]
In the figure above, the set Ω₁ is convex while Ω₂ is not. So a convex set is a set
with no holes in its interior and no indentations on its exterior: e.g. an egg is convex,
while a doughnut or a banana or a golf ball is not. Formally, X is a convex set if every
line segment connecting two distinct points in X is wholly contained in X, and it is strictly
convex if the interior points of such a line segment are in the strict interior of X. Thus, a
strictly convex set has no flat sides or edges: its exterior consists only of curved surfaces
that are bowed outwards. A sphere is strictly convex, while a cube is convex but not
strictly convex (Nau, 2000).
Convex sets play a central role in the geometric representation of preferences (utility
theory) and choices (statistical decision theory), as will be shown.
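The segment-tracing property of convex combinations is easy to verify numerically. A small sketch for points in ℝ², with arbitrary illustrative coordinates:

```python
# alpha*x1 + (1-alpha)*x2, for 0 <= alpha <= 1, traces the segment joining x1 and x2.
def convex_combination(x1, x2, alpha):
    """Return the convex combination alpha*x1 + (1 - alpha)*x2 of two points."""
    return tuple(alpha * u + (1 - alpha) * v for u, v in zip(x1, x2))

x1, x2 = (0.0, 0.0), (4.0, 2.0)
assert convex_combination(x1, x2, 1.0) == x1            # alpha = 1 gives x1
assert convex_combination(x1, x2, 0.0) == x2            # alpha = 0 gives x2
assert convex_combination(x1, x2, 0.5) == (2.0, 1.0)    # the midpoint
```

Reading α and 1 − α as probabilities, this is the same operation that mixes pure actions into randomized actions in Section 2.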
Section 2: “Non-statistical” Decision Processes
Statistical decision problems include data or observations on the unknown quantity θ
(also referred to as the state of nature), which may be used to choose a better decision.
One approach to handling such problems is the non-statistical approach, which simply
excludes the data available on the state of nature. This reduces the problem to a simpler
one. The basic elements or “ingredients” of this no-data problem were presented in the
previous section.
In simple decision problems, the most important elements are the state space {θ}, the
action space {a} and the loss function l(θ, a). These aspects form the basis of the
theoretical analysis of decision making. The state space Θ and the action space A are
both assumed finite – i.e. “there are only finitely many actions [states]”. The loss
incurred is assumed to be measured in negative utility units, where utility is defined by a
function describing the prospects of a decision maker. In this instance, however
unrealistic, we assume that the loss function is known. These concepts are developed in
the following subsections within the no-data decision problem constructs.
Section 2.1 The Set of Randomized Actions
A simple way to explain the general theory of decision making is to consider a typical
coin toss. This introduces an extraneous random device which is useful in providing
decision rules that under some criteria are better than those that use only the given,
non-random actions (Lindgren, 1971). In general, a random device (such as a coin
(ordinary or biased)) is an experiment of chance having as many possible outcomes as
there are actions from which to choose; each outcome is associated with an action, and
when a particular outcome is observed, the corresponding action is taken.
The use of a random device to select an action from the set of possible actions is called a
randomized or mixed action. Choosing a randomized action from among all possible
actions amounts to selecting a random device from among all random devices that could
be used to determine the action actually taken; and further selecting a set of probabilities
for the various actions. This leads to the following definition of a randomized action.
Definition:
A randomized action, for a problem with action space consisting of actions a₁, a₂, …, a_k,
is a probability vector (p₁, p₂, …, p_k), that is, a sequence of non-negative numbers p_i
whose sum is 1.
To summarize, consider constructing an experiment of chance producing outcomes z₁,
z₂, …, z_k with probabilities p₁, p₂, …, p_k assigned, respectively. If the outcome of the
experiment performed is z₁, then action a₁ is taken. To distinguish between pure (or
original) actions a₁, a₂, …, a_k and randomized actions, probability vectors are used to
define the latter type. A degenerate probability vector can be used to represent a pure
action. For instance, the action a₂ can be equivalently written as the probability vector
(0, 1, 0, …, 0), which assigns all of the probability mass to action a₂. Thus, pure actions
constitute a subset of the randomized actions.
Section 2.2 The Loss Function
The use of a randomized action in a decision problem with a given loss function
inevitably makes the loss a random variable, for each state of nature. The most natural
expected loss to consider when making a decision involves the uncertainty in θ. Thus,
taking the expected value of the random variable l(θ, a) measures the consequence of
employing a given randomized action (when nature is in a given state). In particular, if
the loss function is l(θ, a) and the randomized action (p₁, p₂, …, p_k) is used to choose
among the actions a₁, a₂, …, a_k, the expected loss is the following weighted sum:

l(θ, a₁)p₁ + l(θ, a₂)p₂ + … + l(θ, a_k)p_k.
Notably, for the continuous case this can be written as an integral of l(θ, a) against the
mixing distribution over actions. The above definition of expected loss is explained in
detail in Chapter 2 of Berger’s Statistical Decision Theory and Bayesian Analysis (1985).
In a decision problem with m states θ₁, θ₂, …, θ_m, there are m expected losses L
corresponding to each randomized action. For the case of k actions, these are:

L₁ = E[l(θ₁, a)] = l(θ₁, a₁)p₁ + … + l(θ₁, a_k)p_k
L₂ = E[l(θ₂, a)] = l(θ₂, a₁)p₁ + … + l(θ₂, a_k)p_k
⋮
L_m = E[l(θ_m, a)] = l(θ_m, a₁)p₁ + … + l(θ_m, a_k)p_k
defined by the randomized action (p₁, p₂, …, p_k). These relations can be written in the
vector form

(L₁, L₂, …, L_m) = p₁ (l(θ₁, a₁), …, l(θ_m, a₁)) + … + p_k (l(θ₁, a_k), …, l(θ_m, a_k)),

which suggests the interpretation that the vector of losses (L₁, L₂, …, L_m) is a point in
m-dimensional space computed as a convex combination of the points
(l(θ₁, a_i), …, l(θ_m, a_i)), for i = 1, …, k. The notion of convex sets was introduced in
subsection 1.6. The latter points are the loss vectors defined by the pure actions, while
those defined by randomized actions are convex combinations of those defined by the
pure actions (Berger, 1985; Lindgren, 1971).
The closing example of this subsection shows that the set of loss points (L₁, L₂, …, L_m)
defined by all possible randomized actions (p₁, p₂, …, p_k) is a convex set in
m-dimensional space, as described above. It forms a convex polyhedron whose extreme
points are pure actions, although some pure actions may fall inside the set. The following
example demonstrates a two-dimensional case – i.e. there are two states of nature.
Example 2-1:
(Lindgren, 1971)
The following table describes a problem with five possible actions, two states of nature
and the respective loss function.

      a₁   a₂   a₃   a₄   a₅
θ₁     2    4    3    5    3
θ₂     3    0    3    2    5

The pure actions define the loss vectors (the columns in the table), and the randomized
actions define the convex set generated by these five actions, which is shown in the figure
below. Note that one action, a₃, falls within the convex polyhedron.
Figure 2-1: [the convex polyhedron of loss points (L₁, L₂) generated by actions a₁–a₅]
Recall from Section 1.6 that a convex combination of a set of points corresponds to the
“center of gravity of a system of point masses at those points”, where these masses are
proportional to the weights in the convex combination. Thus, the randomized action
(p₁, p₂, …, p_k) yields a point (L₁, L₂, …, L_m) which is the center of gravity of a system
of masses with p₁ units at a₁, p₂ units at a₂, …, and p_k units at a_k.

Interpreting randomized actions as centers of gravity helps to interpret the situation
geometrically. For instance, if only p₁ and p₂ are positive, with no mass at a₃, …, a_k,
then the center of gravity (and therefore the point (L₁, L₂)) must lie on the line segment
joining a₁ and a₂. If these actions are represented by extreme points, then such a mixture
of just these two actions lies on the edge of the convex set of points representing all
randomizations (Lindgren, 1971). Thus, for this example, the action a₃ results in the same
losses as a
certain mixture of a₁, a₂ and a₅, or as a mixture involving only a₁, a₄ and a₅, as well as
many other mixtures of a₁, a₂, a₄ and a₅. However, it could not be obtained by mixing
actions a₂, a₄ and a₅.
Section 2.3 Regret
If the state of nature is known, then the action which results in minimal loss would be
taken. Thus, if it were known that θ_i were the state of nature, the action a for which
l(θ_i, a) is smallest should be taken, and the minimum loss

m_i = min_A l(θ_i, a)

is a loss that could not be avoided even with the best decision. Suppose one takes the
action a_j, which does not produce this minimum, and then discovers that nature is indeed
in state θ_i; the decision maker would regret not having chosen the action that produces
the minimum. The amount of loss that could have been saved by knowing the state of
nature is called the regret. Regret is defined for each state θ_i and action a_j as follows:

r(θ_i, a_j) = l(θ_i, a_j) − min_A l(θ_i, a).

So, for each state of nature, subtract the minimum loss m_i from the losses involving that
state to obtain the regret.
Regret is often referred to as opportunity loss and represents a “thinking” in terms of gain
rather than loss (Lindgren, 1971). The gain resulting from taking action a when nature is
in state θ is the negative of the loss:

g(θ, a) = −l(θ, a).

Hence, the minimum loss is the negative of the maximum gain:

min_A l(θ, a) = −max_A g(θ, a),

and the regret can be re-expressed in terms of gain as follows:

r(θ_i, a_j) = l(θ_i, a_j) − min_A l(θ_i, a) = max_A g(θ_i, a) − g(θ_i, a_j).

This represents the maximum that could have been gained if the state of nature had been
known, minus the amount that actually was gained by taking action a_j.
Example 2-1 (cont’d.):
(Lindgren, 1971)
Recall the example presented in the previous subsection. The loss table has been
replicated below with the addition of another column giving the minimum loss over all
five actions. The regret table, given after it, is computed by subtracting this minimum
loss from each of the losses in the respective row.

      a₁   a₂   a₃   a₄   a₅   min_A l(θ, a)
θ₁     2    4    3    5    3        2
θ₂     3    0    3    2    5        0

      a₁   a₂   a₃   a₄   a₅
θ₁     0    2    1    3    1
θ₂     3    0    3    2    5
In the regret table there is at least one zero for each state, and the remaining entries are
positive. Geometrically, the effect is to translate the set of points so that at least one
action lies on each axis. This is demonstrated in the figure below.
Figure 2-2: [the translated convex set of regret points for actions a₁–a₅]
Notice that the whole convex set of randomized actions shifts along with the five pure
actions. This is generally the case because “the amount subtracted from each loss is
independent of the action a, upon which a probability distribution is imposed in a mixed
action” (Lindgren, 1971):

R_i = E[r(θ_i, a)] = E[l(θ_i, a)] − min_A l(θ_i, a) ≡ L_i − m_i.
Question: Would it make any difference, in studying a decision problem, if one used
regret instead of loss? In some instances, the regret may be more painful than the losses,
depending on the role of the decision maker and his or her stakes. The classical treatment
of problems in statistical decision theory usually results in assuming a loss function that
is already a regret function; however, this is not always the case, and this leads to a
discussion of the cases where it does make a difference.
Section 2.4 The Minimax Principle
Decision problems present a difficulty in determining the best decision because an action
that is best under one state of nature is not necessarily best under the other states of
nature. Although various schemes have been proposed – decision principles that lead to
the selection of one or more actions as “best” according to the principle used – none is
universally accepted (see Berger, 1985, §1.5).

Linearly ordering the available actions by assigning a “value” to each action according to
its desirability is a frequentist principle. The minimax principle places a value on each
action according to the worst that can happen with that action. For each action a, the
maximum loss over the various possible states of nature,

M(a) = max_Θ l(θ, a),

is determined and provides an ordering among the possible actions (Lindgren, 1971;
French & Insua, 2000). Taking the action a for which the maximum loss M(a) is a
minimum lends itself to the name minimax.
Berger states the same principle within the context of a decision rule. If δ ∈ ∆ is a
randomized rule, then the quantity

sup_{θ∈Θ} R(θ, δ)

represents the worst that could happen if the rule δ is used. Furthermore, the rule δ₁ is
preferred to a rule δ₂ if

sup_{θ∈Θ} R(θ, δ₁) < sup_{θ∈Θ} R(θ, δ₂).

A rule is a minimax decision rule if it minimizes sup_{θ∈Θ} R(θ, δ) among all
randomized rules in ∆.
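Over pure actions, the minimax selection M(a) = max_θ l(θ, a) followed by a minimization is a one-liner. A sketch with an assumed three-action loss table (not the table from the example):

```python
# Assumed losses: loss[a] = [l(theta_1, a), l(theta_2, a)] for each pure action a.
loss = {
    "a1": [2, 3],
    "a2": [4, 0],
    "a3": [3, 3],
}

# M(a) = max over states of l(theta, a), the worst case for each action.
M = {a: max(losses) for a, losses in loss.items()}

# The minimax action minimizes M(a); ties resolve to the first action listed.
minimax_action = min(M, key=M.get)
```

Here M is {"a1": 3, "a2": 4, "a3": 3}, so a₁ and a₃ tie and the first is returned; the same two lines applied to a regret table give the minimax regret action.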
Example 2-1 (cont’d.):
(Lindgren, 1971)
Again the loss table is repeated below, with the addition of a row stating the maximum
loss for each action. The smallest maximum loss, 3, is attained by action a₁ (and also by a₃).

               a₁   a₂   a₃   a₄   a₅
θ₁              2    4    3    5    3
θ₂              3    0    3    2    5
max_Θ l(θ, a)   3    4    3    5    5
The table of regrets is produced below and shows different results than the maximum-loss
table: the minimum maximum regret, 2, is attained by action a₂.

               a₁   a₂   a₃   a₄   a₅
θ₁              0    2    1    3    1
θ₂              3    0    3    2    5
max_Θ r(θ, a)   3    2    3    3    5
Recall the question posed at the end of the last subsection:

Question: Would it make any difference, in studying a decision problem, if
one used regret instead of loss?

A graphical approach to determining the minimax point, which is feasible when there are
two states of nature, is useful in addressing this question. For a given action a with losses
(L₁, L₂) under (θ₁, θ₂), respectively, the maximum of these losses is the first coordinate
if the point lies below the bisector of the first quadrant, while if the point lies above that
bisector, or 45˚ line, the maximum loss is the second coordinate L₂. If two points lie
above the bisector, the lower one has the smaller maximum; and if two points both lie
below it, the leftmost point has the smaller maximum. This approach can be generalized
to many states, but deciphering all the combinations becomes messier.

What the graphical approach shows is that the minimax process is tied to the location of
the origin of the coordinate system, so that moving the action points relative to the
coordinate system, as the passage from losses to regrets does, can alter the outcome of
finding the minimax point. Of course, in some instances, the minimax loss action and the
minimax regret action will not differ.
Determining the minimax action among the set of all randomized actions is generally
more complicated, because instead of choosing an action from a finite set of actions one
must choose a probability vector from an infinite set of possible probability vectors (even
if the set of actions is finite). For brevity’s sake, two cases are considered at this point:
when there are just two states of nature, and when there are just two actions. When there
are just two states of nature, a graphical solution to the problem of determining a
minimax mixed action can be carried out by representing the randomized actions in terms
of (L₁, L₂) and (R₁, R₂). This can be shown by returning to the example.
Example 2.1: (cont'd.)
(Lindgren, 1971)
Continuing with the same example and looking at the figure below, it is clear that this
point (where the bisector meets the convex polyhedron) lies on the segment joining the
points representing a1 and a2, and so represents a mixture involving only those two
actions. The point (x, y) in question is a convex combination of a1 and a2:

  (x, y) = p (2, 3) + (1 − p)(4, 0).

Moreover, it is the point on that line segment with equal coordinates:

  x = 2p + 4(1 − p) = y = 3p + 0(1 − p),

and equating these two functions of p yields p = 4/5.
Figure 2.3: [the five actions plotted in the (L1, L2) plane; the bisector L1 = L2 meets the lower boundary of their convex hull on the segment joining a1 and a2]
The minimax action therefore puts 4/5 of the total probability (1) at a1, 1/5 at a2, and
none at any other action, producing the probability vector (4/5, 1/5, 0, 0, 0). The
minimax loss, that is, the smallest maximum expected loss, is the value obtained by
setting p = 4/5, namely, 3p = 12/5. The point on the graph representing the minimax
action is thus (12/5, 12/5). Notice that this minimax expected loss is actually less than
the minimum maximum loss achieved when only pure actions are admitted, 3, as
determined in the earlier example presented above.
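The algebra above can be checked mechanically. A minimal sketch, solving x(p) = y(p) for the mixture of a1 = (2, 3) and a2 = (4, 0) in exact rational arithmetic:

```python
from fractions import Fraction

# Losses of the two pure actions on the bisector segment: a1 = (2, 3), a2 = (4, 0).
L1a, L2a = 2, 3   # losses of a1 under theta1, theta2
L1b, L2b = 4, 0   # losses of a2

# Expected losses of the mixture p*a1 + (1-p)*a2:
#   x(p) = p*L1a + (1-p)*L1b,  y(p) = p*L2a + (1-p)*L2b.
# Setting x(p) = y(p) and solving the linear equation for p:
#   p*(L1a - L1b - L2a + L2b) = L2b - L1b
p = Fraction(L2b - L1b, L1a - L1b - L2a + L2b)
value = p * L1a + (1 - p) * L1b
print(p, value)   # 4/5 12/5
```

This reproduces the minimax mixture (4/5, 1/5) and the minimax loss 12/5.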
The case of two actions is discussed only briefly, as it can be treated in a similar
fashion. The randomized actions are vectors of the form (p, 1 − p), defined by a single
variable p. The expected losses under the various states of nature can be computed as
functions of p, and from this one can determine (at least graphically) the function

  max_Θ E[l(θ, a)].

The minimum point of the latter function then defines the minimax action. See both
Berger (1985) and Lindgren (1971) for examples and further discussion.
Section 2.5 Bayes Solutions
The frequentist approach has considered the states of nature to be fixed. The Bayes
decision principle, in contrast, treats the state of nature as random. Whether or not
nature truly is random, it is rational to incorporate into the decision-making process
one's prior "hunches, convictions, or information about the state of nature and how
'likely' (whatever that means) the various states of nature are to be governing the
situation" (Lindgren, 1971). This is accomplished by weighting the states and thereby
ordering the actions, permitting the selection of a "best" action. Thus, if a large loss
can occur for a given action when nature is in a state that the decision maker feels is
highly unlikely, that extreme loss receives only a small weight from the state of nature
that would have produced it.
The role of prior probabilities was briefly introduced in Section 1 and will be expanded
upon here. In a general decision problem, a probability weight g(θ)¹ is assigned to each
state of nature θ, where these assigned weights are nonnegative and add up to 1. Such a
set of probabilities or weights is called a prior distribution for θ. Given the distribution
g(θ), the loss incurred for a given action a is a random variable, with expected value

  B(a) = Σ_i g(θ_i) l(θ_i, a).
This is referred to as the Bayes loss corresponding to action a (Lindgren, 1971). The
Bayes action is then defined to be the action a that minimizes the Bayes loss B(a). That
is, the computation of the expected loss according to a given prior distribution provides a
means of arranging or ordering the available actions on a scale (namely, B(a)) such that
the action farthest to the left on that scale is the most desirable, and is to be taken.
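As a small illustration of this ordering, the sketch below computes B(a) for the loss table of Example 2.1 under the prior g = (2/5, 3/5); that prior is chosen here purely for illustration and is not a value from the text:

```python
from fractions import Fraction as F

# Loss table from Example 2.1: rows are states theta1, theta2; columns actions a1..a5.
loss = [[2, 4, 3, 5, 3],
        [3, 0, 3, 2, 5]]

# An illustrative prior (not taken from the text): g(theta1) = 2/5, g(theta2) = 3/5.
g = [F(2, 5), F(3, 5)]

# Bayes loss B(a_j) = sum_i g(theta_i) * l(theta_i, a_j).
B = [sum(g[i] * loss[i][j] for i in range(2)) for j in range(5)]
bayes_action = min(range(5), key=lambda j: B[j])
print([str(b) for b in B])                       # ['13/5', '8/5', '3', '16/5', '21/5']
print("Bayes action: a%d" % (bayes_action + 1))  # Bayes action: a2
```

Under this prior the actions are ordered by B(a), and a2, with Bayes loss 8/5, is the Bayes action.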
When randomized actions are considered, a Bayes loss can be defined as the expectation,
with respect to a given prior distribution, of the expected loss (with respect to the
randomized action): for a randomized action p = (p_1, p_2, …, p_k), which assigns
probability p_i to action a_i, the expected loss for a given state θ is

  E[l(θ, a)] = Σ_j l(θ, a_j) p_j,

and the Bayes loss is obtained by averaging these (for the various θ's) with respect to
g(θ):
¹ This notation is used so as to defuse any confusion with the probability vectors assigned to each action.
  B(p) = Σ_i g(θ_i) E[l(θ_i, a)] = Σ_i g(θ_i) { Σ_j l(θ_i, a_j) p_j }.
Since this Bayes loss is a function of (p_1, p_2, …, p_k), or of k − 1 variables (p_k is
determined as soon as p_1, p_2, …, p_{k−1} are specified), the problem of determining the
minimum Bayes loss is that of minimizing a function of (k − 1) variables.
Example 2.2:
(Lindgren, 1971)
Consider the following loss table, where there are just two states of nature and two
actions. The prior probabilities are g(θ1) = w, g(θ2) = 1 − w, and the expected losses
are simply calculated as

  B(a1) = 0·w + 6(1 − w) = 6 − 6w,
  B(a2) = 1·w + 5(1 − w) = 5 − 4w.

          θ1     θ2      B(a)
  a1       0      6      6 − 6w
  a2       1      5      5 − 4w
  g(θ)     w    1 − w
The figure below illustrates that for any w to the left of 0.5 (the value of w for which
the Bayes losses are equal), the smaller Bayes loss is incurred by taking action a2. For
any w to the right of 0.5, the Bayes action is a1, as it yields the smaller Bayes loss.
When w = 0.5 it is irrelevant whether action a1 or a2 is taken. More generally, with
several actions, examination of such a figure shows that for a prior distribution defined
by a w in the range 0 ≤ w < w1 one action is Bayes, for w1 < w < w2 another is, and so on,
the Bayes action switching at the values of w where the corresponding lines cross.
Figure 2.4: [the lines B(a1) = 6 − 6w and B(a2) = 5 − 4w plotted over 0 ≤ w ≤ 1; they cross at w = 0.5]
Part I: Decision Theory – Concepts and Methods
19
The interpretation of small or large w is of course dependent on the problem; generally,
though, small w refers to a less favorable outcome whereas large w refers to a more
favorable one. In general, in a problem with two states of nature and g(θ1) = w,
g(θ2) = 1 − w, the Bayes losses for the various actions will be linear functions of w.
These functions are represented by straight lines, and for a given value of w the action
corresponding to the line whose ordinate at that w is smallest is the Bayes action.
Now consider randomized actions: using probabilities (p_1, p_2, …, p_k) for actions
a_1, …, a_k, respectively, produces the Bayes loss

  B(p) = Σ_i g(θ_i) { Σ_j l(θ_i, a_j) p_j } = Σ_j p_j { Σ_i g(θ_i) l(θ_i, a_j) }.
Thus, the value of B(p) is a convex combination of the values of the Bayes losses for the
various pure actions, and is at least as great as the smallest of those values (Lindgren,
1971). This means that there is no gain possible in the use of randomized actions; a pure
action can always be found which yields the minimum Bayes loss.
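This claim is easy to check numerically. The sketch below uses Bayes losses obtained from Example 2.1's loss table under an illustrative prior g = (2/5, 3/5) (not a value from the text) and verifies that no mixture improves on the best pure action:

```python
from fractions import Fraction as F

# Bayes losses of five pure actions (computed from Example 2.1's loss table
# under the illustrative prior g = (2/5, 3/5)).
B_pure = [F(13, 5), F(8, 5), F(3), F(16, 5), F(21, 5)]

def B_mixed(p):
    """Bayes loss of a randomized action: the p-weighted average of B_pure."""
    return sum(pj * bj for pj, bj in zip(p, B_pure))

# Any probability vector gives a convex combination, hence B(p) >= min(B_pure).
for p in ([F(1, 5)] * 5, [F(1, 2), F(1, 2), 0, 0, 0], [0, 1, 0, 0, 0]):
    assert B_mixed(p) >= min(B_pure)
print("no mixture beats the best pure action:", min(B_pure))  # 8/5
```

The last mixture, which puts all its mass on a2, attains the minimum exactly, illustrating that a pure action always suffices.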
The result that the Bayes actions are the same using regret as using loss is evident for
the problem involving only two states of nature; it is also true in general. Substitution
of

  r(θ_i, a_j) = l(θ_i, a_j) − min_a l(θ_i, a)
into the expression for expected regret,

  E[r(θ, a_j)] = Σ_i r(θ_i, a_j) g(θ_i),

yields

  E[r(θ, a_j)] = Σ_i l(θ_i, a_j) g(θ_i) − Σ_i g(θ_i) min_a l(θ_i, a)
               = E[l(θ, a_j)] − Σ_i g(θ_i) min_a l(θ_i, a).
Thus, E[r(θ, a_j)] differs from E[l(θ, a_j)] by a term that does not involve a_j. The
action that minimizes one must therefore minimize the other. This is illustrated in the
following example.
Example 2.3:
(Lindgren, 1971)
Consider the following decision problem with its corresponding loss table.

          a1    a2    a3
  θ1       2     5     3
  θ2       3     1     5
The regrets are

  r(θ1, a) = l(θ1, a) − 2,
  r(θ2, a) = l(θ2, a) − 1.
The expected regrets, given prior weights g(θ1) and g(θ2), are

  E[r(θ, a1)] = 2 g(θ1) + 3 g(θ2) − [2 g(θ1) + g(θ2)],
  E[r(θ, a2)] = 5 g(θ1) + g(θ2) − [2 g(θ1) + g(θ2)],
  E[r(θ, a3)] = 3 g(θ1) + 5 g(θ2) − [2 g(θ1) + g(θ2)],

so that E[r(θ, a_i)] = E[l(θ, a_i)] − [2 g(θ1) + g(θ2)].
Thus the action that minimizes E[r(θ, a_j)] also minimizes E[l(θ, a_j)], since the two
differ by the constant 2 g(θ1) + g(θ2), irrespective of the action.
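The constant-difference argument can be verified directly: for every prior, minimizing expected regret and minimizing expected loss select the same action. A sketch over a grid of priors for Example 2.3:

```python
from fractions import Fraction as F

# Loss table from Example 2.3: rows are theta1, theta2; columns a1, a2, a3.
loss = [[2, 5, 3],
        [3, 1, 5]]
# Regret: subtract each state's minimum loss (2 under theta1, 1 under theta2).
regret = [[x - min(row) for x in row] for row in loss]

def argmin_expected(table, g):
    """Index of the action minimizing sum_i g_i * table[i][j]."""
    scores = [sum(g[i] * table[i][j] for i in range(2)) for j in range(3)]
    return scores.index(min(scores))

# For any prior the two criteria pick the same action: the expected regret
# differs from the expected loss by the constant 2*g(theta1) + 1*g(theta2).
for w in range(11):
    g = [F(w, 10), F(10 - w, 10)]
    assert argmin_expected(loss, g) == argmin_expected(regret, g)
print("loss and regret give the same Bayes action for every prior tried")
```

Since ties are shifted by the same constant in both tables, even tied minima resolve to the same action index.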
Section 2.6 Dominance and Admissibility
The concepts of dominance and admissibility, introduced here for actions, have direct
counterparts for decision rules. Some examples have been encountered in which certain of
the available actions would never be used, because there are others for which the losses
are always smaller. It is important to state some definitions before discussing their
importance any further.
Definition:
An action a* (pure or randomized) is said to dominate an action a if the loss incurred by
using action a is always at least as great as that incurred by using a*:

  l(θ, a) ≥ l(θ, a*), for all θ.
Definition:
An action a* (pure or randomized) is said to dominate strictly an action a if it dominates
action a and if, in addition, there is some state of nature for which the loss inequality
is strict:

  l(θ, a) > l(θ, a*), for some θ.
Definition:
An action (pure or randomized) is said to be admissible if no other action dominates it
strictly.
The following paragraph is extracted from Lindgren (1971).
Returning to the case of two states of nature, an action is represented as a point in the
(L1, L2) plane, where L_i is the loss (or expected loss, in the case of a randomized
action) incurred when that action is taken and θ_i is the state of nature. The figure
below shows points corresponding to actions a*, a1 and a2, for which the losses are such
that a* dominates both a1 and a2. In the case of a1 the losses are greater than for a*
for both states of nature; in the case of a2 only the loss for θ1 is strictly greater than
with a*. Incidentally, the action a* would strictly dominate every action represented by
points in the shaded quadrant (including the boundaries), with the exception of the point
a* itself. (In this kind of representation, an action that dominates action a but does not
strictly dominate it would be represented by the same point as action a, since the losses
would be the same.)
Figure 2.5: [the points a*, a1 and a2 in the (L1, L2) plane; a* strictly dominates the shaded quadrant above and to its right]
If an action a* dominates an action a there is no need to leave a in the competition for a
best action. If the dominance is not strict, then the losses are the same for a as for a*, and
a can be dispensed with; if it is strict, then one may actually do worse by using a than by
using a*.
Before ending this section, we can restate this concept in terms of risk functions and
decision rules. In Chapter 5 of French and Insua (2000) it is stated that risk functions
"induce a natural ordering among decision rules: a decision rule which performs uniformly
better in terms of risk than another for each value of θ seems better overall". Let δ1(·)
and δ2(·) be two decision rules. Then δ1(·) dominates δ2(·) if

  R(θ, δ1) ≤ R(θ, δ2) for all θ,

with strict inequality for some θ. Similarly, δ1(·) and δ2(·) are equivalent if

  R(θ, δ1) = R(θ, δ2) for all θ.
Part I: Decision Theory – Concepts and Methods
22
Definition:
A decision rule (or action) is admissible if there exists no R-better decision rule, where
R refers to the risk function. A decision rule is inadmissible if there exists an R-better
decision rule.
The above definition was provided by Berger (1985). An action that is not admissible is
said to be inadmissible, and can be dispensed with because there is an action that does at
least as well under all circumstances. On the other hand, an action that is close to the
lower left boundary of a set of mixed actions (in the (L1, L2) representation) may not be
as bad as the name inadmissible would imply (Lindgren, 1971). That is, there are degrees
of inadmissibility which the terminology ignores. In statistical problems, for example,
there may be solutions that are slightly inadmissible but are preferred for some reason to
those that dominate them, because of computability, for instance.
Section 2.7 Bayesian and Classical Approaches –
Bayes versus Minimax Principle
This section explores the connections between the Bayesian and classical approaches
through their respective decision principles, namely the Bayes and minimax principles.
The latter is essentially based on the risk function R(θ, δ),² which induces a partial
ordering among the decision rules,³ leading to the concept of admissibility. Since this is
only a partial ordering, the Bayesian approach introduces a prior distribution which may
be used to weight the risk function, and orders according to Bayes risk (French and Insua,
2000). Much of this discussion centers around the concept of admissibility. From the
preceding three subsections, some possible generalities can be made: (1) Bayes solutions
are usually admissible; (2) a minimax action is a Bayes action; and (3) admissible actions
are Bayes, for some prior distribution (Lindgren, 1971). These are nonformal statements;
however, they are discussed in detail with proofs in Sections 4.8 (Bayes admissibility)
and 5.5 (minimax admissibility and comparison with Bayes) of Berger (1985). (See also
Chapter 6 of French and Insua, 2000.)
Again a geometric representation with just two states of nature is instructive, as the set
of all possible (randomized) actions is represented in the (L1, L2) plane as a convex set.
That a minimax solution is Bayes for some prior distribution is evident, for the case of
two states of nature, from this geometrical representation (Lindgren, 1971). The figures
below illustrate four cases that may arise.
² It is important to recall here that the risk function is the expected loss (as defined by the frequentist or classical approach). The risk function is further explored in Section 3.
³ It is important to note that decisions and actions are sometimes interchangeable, especially in no-data decision problems. Thus, although the notion of decision rules has been brought up in this section, it is discussed in more detail in the next section.
Figure 2.6:
(a) The minimax action occurs on the boundary of the set of actions, and is a mixture of
two pure actions; this action would be Bayes for the prior distribution (w, 1 − w) such
that −w/(1 − w) is equal to the slope of the line through the two pure actions involved.
(b) The minimax action, which is a pure action, would be Bayes for any prior distribution
such that −w/(1 − w) is a slope between the slopes of the line segments which meet in that
pure action.
(c) The prior distribution which produces the minimax action as a Bayes action corresponds
to the tangent line at L1 = L2; this kind of set of randomized actions would only occur if
there are infinitely many pure actions at the outset.
(d) The minimax action would be any that yields an (L1, L2) on the bottom edge of the
action set; these are Bayes for the prior distribution which assigns probability 1 to θ2.
Example 2.3: (cont'd.)
(Lindgren, 1971)
Revisiting this problem produces the following Bayes losses with given prior probabilities
(w, 1 − w). The loss table has been repeated for ease.
          a1    a2    a3
  θ1       2     5     3
  θ2       3     1     5

  B(a1) = 2w + 3(1 − w) = 3 − w,
  B(a2) = 5w + 1(1 − w) = 1 + 4w,
  B(a3) = 3w + 5(1 − w) = 5 − 2w.
The graphs of these are shown in Figure 2.7 below. The lowest point of intersection, and
the highest minimum, is at w = 2/5. That is, of all the prior distributions that Nature
might choose, the highest minimum Bayes loss (the loss incurred by using the Bayes action)
is achieved for the prior distribution (2/5, 3/5).⁴
Figure 2.7: [the three lines B(a1) = 3 − w, B(a2) = 1 + 4w, B(a3) = 5 − 2w over 0 ≤ w ≤ 1; their lower envelope peaks at w = 2/5]
The minimax randomized action is easily found (by the graphical procedure described in
Section 2.4) to be (4/5, 1/5, 0), with losses L1 = L2 = 13/5. It is the point on the line
through (2, 3) and (5, 1) with equal coordinates:
⁴ This is said to be a least favorable prior distribution (Lindgren, 1971).
Part I: Decision Theory – Concepts and Methods
25
  (L1 − 3)/(L1 − 2) = −2/3, or L1 = 13/5.
A prior distribution that would yield this randomized action as Bayes action is defined by
a w such that

  −w/(1 − w) = −2/3,

where −2/3 is the slope of the line through (5, 1) and (2, 3). Notice that this w, which
is w = 2/5, is precisely the w of the least favorable distribution⁵ determined above. A
graphical representation of this has not been shown here, but it can be achieved in much
the same manner described in Sections 2.4 and 2.5.
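The least favorable prior can also be found numerically, by maximizing the minimum Bayes loss over w. A sketch over a fine grid of priors, in exact arithmetic so that w = 2/5 is hit exactly:

```python
from fractions import Fraction as F

# Bayes losses from Example 2.3 as functions of w = g(theta1):
#   B(a1) = 3 - w,  B(a2) = 1 + 4w,  B(a3) = 5 - 2w.
def min_bayes_loss(w):
    return min(3 - w, 1 + 4 * w, 5 - 2 * w)

# Scan a grid of priors for the least favorable one (the w maximizing the
# minimum Bayes loss); analytically it is w = 2/5, with value 13/5.
ws = [F(k, 1000) for k in range(1001)]
w_star = max(ws, key=min_bayes_loss)
print(w_star, min_bayes_loss(w_star))   # 2/5 13/5
```

The maximized minimum Bayes loss, 13/5, equals the minimax loss found graphically above.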
The Bayes Perspective…
Mathematically, the Bayes approach simply postulates a weighting function that provides an
ordering among the actions; practically, the decision maker is incorporating into that
weighting function personal preferences about what the unknown state of nature is likely
to be. The concept of subjective probability is discussed in Part C on utility theory. In
the role of rational decision making, the inclusion of such personal preferences seems
reasonable. Non-Bayesians do, however, criticize this approach for its subjectivity. In
its defence, the Bayes solution is usually not terribly sensitive to the prior, so small
inaccuracies in its specification are not treacherous. Furthermore, Bayes solutions are
usually admissible.
The Minimax Approach…
The minimax approach, on the other hand, is not nearly so easy to defend. Berger (1985)
goes on to say that "when considered from a Bayesian viewpoint, it is clear that the
minimax approach can be unreasonable". It takes a pessimistic view, making the assumption
that the worst will happen. Although the minimax rule is frequently admissible (and Bayes
for some prior distribution), the corresponding prior distribution is least favorable. A
more significant objection is that the minimax principle, by considering sup_θ R(θ, δ),
may violate the rationality principles (Berger, 1985).
Sometimes a minimax solution can be computed by determining, among the Bayes solutions,
one for which the losses under the various states of nature are equal. That is, if
p* = (p_1*, …, p_k*) is a randomized action that is Bayes with respect to some prior g(θ)
and is such that the (expected) loss function
⁵ The notion of a least favorable distribution giving rise to the minimax action as a Bayes action is general, but not trivial to establish. This idea is discussed in more depth in sources on game theory, which are not covered in this report.
  L(θ, p*) = E[l(θ, a)] = Σ_i l(θ, a_i) p_i*

is constant in θ, then p* is minimax. The development follows (Lindgren, 1971): the
assumption that p* is Bayes for g(θ) means that for any randomized action p,

  B(p*) = min_P B(p) ≤ B(p).

But since L(θ, p*) is constant, its mean value with respect to (wrt) the weighting g(θ) is
just that constant value:

  B(p*) = E[L(θ, p*)] = L(θ, p*).

On the other hand, B(p) is the expected value of L(θ, p), which cannot exceed the maximum
value of L(θ, p), so B(p) ≤ max_θ L(θ, p), and it follows for any p that

  L(θ, p*) ≤ max_θ L(θ, p).

Similarly, the constant L(θ, p*) cannot exceed the smallest of these maxima:

  L(θ, p*) ≤ min_P max_θ L(θ, p),

where the right side of this inequality is the minimax loss. Since L(θ, p*) is constant in
θ, it is equal to its maximum value, which in turn is greater than or equal to the
smallest such maximum:

  L(θ, p*) = max_θ L(θ, p*) ≥ min_P max_θ L(θ, p).

And then, because L(θ, p*) is neither less than nor greater than the minimax loss, it must
be equal to it, and this means that p* is a minimax solution, as was asserted. This can
also be visualized easily with just two states of nature. Lindgren (1971) describes this
phenomenon in more depth, both algebraically and graphically. A similar proof can be found
in Berger (1985), and it is discussed briefly in French and Insua (2000).
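The argument can be checked on the running example: p* = (4/5, 1/5, 0) is an equalizer (its expected loss is the constant 13/5) and is Bayes for the least favorable prior (2/5, 3/5), hence minimax. A sketch:

```python
from fractions import Fraction as F

# Example 2.3: loss table and the candidate p* = (4/5, 1/5, 0).
loss = [[2, 5, 3],
        [3, 1, 5]]
p_star = [F(4, 5), F(1, 5), F(0)]

# Expected loss of p* under each state: both equal 13/5, so L(theta, p*) is
# constant in theta (an "equalizer" rule).
L = [sum(p_star[j] * loss[i][j] for j in range(3)) for i in range(2)]
print(L)   # [Fraction(13, 5), Fraction(13, 5)]

# p* is also Bayes for the least favorable prior w = 2/5: its Bayes loss
# matches the smallest Bayes loss among the pure actions.
g = [F(2, 5), F(3, 5)]
B_pure = [sum(g[i] * loss[i][j] for i in range(2)) for j in range(3)]
assert sum(p_star[j] * B_pure[j] for j in range(3)) == min(B_pure)
print("equalizer + Bayes => minimax, with value", min(B_pure))  # 13/5
```

By the theorem just proved, the two checked properties together imply that p* is minimax, with minimax loss 13/5.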
Before closing this section, I would like to make one note. Two of three key decision
principles have been discussed here. The third, the Invariance Principle, has not been
included in the theoretical discussion, but is mentioned for completeness. The Invariance
Principle, as the name implies, basically states that if two problems have identical
formal structures (sample size, parameter space, densities and loss function), then the
same decision rule should be used in each problem. An entire chapter of Berger's 1985
book, Statistical Decision Theory and Bayesian Analysis, is devoted to this principle.
Section 3: “Statistical” Decision Processes
Statistical decision problems include data, meaning "the results of making one or more
observations on some random quantity that is thought to be intimately related to the state
of nature in some decision problem" (Lindgren, 1971). The availability of such data
provides some illumination in the selection of an action, so that the state of nature is
not completely unknown. It will be shown that the use of data defines procedures which
result in an expected loss lower than what would be incurred if the data were not
available. However, even the availability of data will not completely avoid the kind of
situation encountered in the no-data case, in which there is no clear-cut criterion for
rating the various candidate procedures as a basis for choosing one of them as best.
Section 3.1 Data and the State of Nature
To obtain data for use in making decisions, an appropriate experiment of chance should be
performed – one in which the state of nature determines the generation of the data, so
that the probability distribution for the data depends on that state of nature.
The data of a given problem may consist of a single number (the value of a random
variable), or a sequence of numbers (usually resulting from performing the same experiment
repeatedly), or sometimes a result or results that are not numerical. In general, a random
variable X will be employed to refer to the data, and in the problems considered here X
will denote either a single random variable or a sequence of random variables
(X_1, …, X_n). In any case X will have certain possible "values" and a probability for
each, according to the state of nature. Thus, for each value x of the random quantity X
there is a probability

  f(x; θ) = P_θ(X = x)

assigned to that value x by the state of nature θ. (The notation P_θ(E) will mean the
probability of the event E when the state of nature is θ.)⁶
Section 3.2 Decision Functions
Section 2 introduced the concept of a decision rule within the no-data context. Its
relevance develops in this section as a procedure for using data as an aid to decision
making: a rule, or set of instructions, that assigns one of the available actions to each
possible value of the data X. Thus, when the pertinent experiment is performed and a value
of X obtained, say X = x, an action has been assigned by the rule to that value, and
⁶ It is important to realize that in order for the data to be of value in making a decision, the dependence of the probability distribution for X on the state of nature must be known. That is, f(x; θ) is assumed to be given or known.
that action is taken. A decision rule is a function a = δ(x), and is called a decision
function, or statistical decision function. Berger (1985) gives the following definition.
Definition:
A (nonrandomized) decision rule δ(x) is a function from X into A. (It is always assumed
that the functions are measurable.) If X = x is the observed value of the sample
information, then δ(x) is the action that will be taken. (Recall that for a no-data
problem, a decision rule is simply an action.) Two decision rules, δ1(x) and δ2(x), are
considered equivalent if P_θ(δ1(X) = δ2(X)) = 1 for all θ.
Question: How many distinct rules are there? If there are just k available actions
(a1, a2, …, ak), and if the data X can have one of just m possible values
(x1, x2, …, xm), then there are precisely k^m distinct decision functions that can be
specified. "Of the k^m possible decision functions, some are sensible, some are foolish;
some ignore the data, and some will use it wrongly" (Lindgren, 1971).
Example 3.1:
(Lindgren, 1971)
Consider the following table of the 8 decision rules for choosing between two actions when
the observed value is X ∈ {0, 1, 2}. The first decision rule ignores the data, as it takes
action a2 regardless; the last decision rule behaves in much the same way, always taking
a1. The second decision rule takes action a2 if X = 0 or 1, and a1 if X = 2.

  x     δ1    δ2    δ3    δ4    δ5    δ6    δ7    δ8
  0     a2    a2    a2    a1    a2    a1    a1    a1
  1     a2    a2    a1    a2    a1    a2    a1    a1
  2     a2    a1    a2    a2    a1    a1    a2    a1
Decision rules can be randomized in much the way actions are randomized – using an
extraneous random device to choose among the available rules. A randomized decision
function is then a probability distribution over the set of pure decision functions,
assigning probability p_i to decision function δ_i(x). A randomized decision rule can be
defined alternatively by attaching to each outcome x_i of the data X (from an experiment)
a probability distribution (g1, g2, …, gk) over the actions (a1, a2, …, ak), according to
which one of the actions is selected (Lindgren, 1971). This rule is equivalent to a
randomization of pure decision functions when the numbers of actions and possible values
of X are finite. The following definition, provided by Berger (1985), closes this part.
Definition:
A randomized decision rule δ(x, ·) is, for each x, a probability distribution on A, with
the interpretation that if x is observed, δ(x, A) is the probability that an action in A
will be chosen. (Again, a randomized decision rule in no-data problems is simply referred
to as a randomized action.) Nonrandomized rules can be considered a special case of
randomized rules, in that they correspond to the randomized rules which, for each x, give
the specific action chosen probability one. Let 〈δ〉 denote the randomized rule equivalent
to the nonrandomized rule δ(x), given by

  〈δ〉(x, A) = I_A(δ(x)) = 1 if δ(x) ∈ A,
                           0 if δ(x) ∉ A.
Section 3.3 The Risk Function
Section 2 first introduced the concept of the risk function as understood by the classical
approach. This section defines and explores the concept more explicitly. Following from
the previous subsection, when a given decision function δ(x) is used, the loss incurred
depends not only on the state of nature that governs the problem, but also on the value of
X that is observed. Since X is random, the loss incurred,

  l(θ, δ(X)),

is a random variable. The frequentist decision-theoretic approach evaluates, for each θ,
the expected loss that would result if the decision rule δ(x) were used repeatedly, with
varying X, in the decision problem.
Definition:
The risk function of a decision rule δ(x) is defined by

  R(θ, δ) = E[l(θ, δ(X))],

which depends on the state of nature θ and on the decision rule δ. (Notably, for a no-data
problem, R(θ, δ) = l(θ, a).)
Assuming a distribution for X defined by P_θ(X = x) = f(x; θ), the risk function would be
calculated as

  R(θ, δ) = Σ_i l(θ, δ(x_i)) f(x_i; θ).

Or, simply stated, the risk function is the weighted average of the various values of the
random loss. Notice that the dependence of the risk on the state θ arises from two facts:
the loss for a given action depends on θ, and the probability weights used in computing
the expected loss depend on θ.
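The defining formula can be turned into a few lines of code. The sketch below uses the losses of Example 2.2 together with an illustrative sampling distribution f(x; θ); the distribution's numbers are invented for the illustration and are not from the text:

```python
from fractions import Fraction as F

# Illustrative ingredients: two states, two actions (losses as in Example 2.2),
# data X in {0, 1, 2} with a made-up distribution f(x; theta) per state.
loss = {("t1", "a1"): 0, ("t1", "a2"): 1,
        ("t2", "a1"): 6, ("t2", "a2"): 5}
f = {"t1": [F(6, 10), F(3, 10), F(1, 10)],   # f(x; theta1) for x = 0, 1, 2
     "t2": [F(1, 10), F(3, 10), F(6, 10)]}   # f(x; theta2)

def risk(theta, delta):
    """R(theta, delta) = sum_x l(theta, delta(x)) * f(x; theta)."""
    return sum(loss[(theta, delta(x))] * f[theta][x] for x in (0, 1, 2))

# A rule that uses the data: take a1 when x is small, a2 when x = 2.
delta = lambda x: "a1" if x < 2 else "a2"
print(risk("t1", delta), risk("t2", delta))   # 1/10 27/5
```

The risk pair (1/10, 27/5) is precisely the point (R1, R2) that would represent this rule in the graphical treatment below.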
Thus, when the decision is based on observed results of an experiment, a decision rule
(pure or randomized) can be selected from those available, knowing the risk function but
ignorant of the true state of nature. The problem of selecting a decision rule knowing the
risk function is mathematically exactly the same as that of selecting an action in the
absence of data, knowing the loss function. The methods (minimax, Bayes) used in attacking
the no-data problem can therefore be used for the statistical problem (with data).
In a similar fashion to a loss table, a risk table can be created, with the rows
representing the possible states of nature and the columns the distinct decision rules.
The decision rules are also representable graphically as before, where (R1, R2) is
plotted, with R_i = R(θ_i, δ), for each rule δ. The plot in the figure below shows, in
addition to the pure decision rules, the set of randomizations of the pure rules, as the
convex set generated by them (French and Insua, 2000).
Figure 3.1: [the risk points (R1, R2) of the pure rules δ1, …, δ8, together with the convex set generated by them]
Several observations can be made (Lindgren, 1971) about using or neglecting the data:
• The pure decision functions δ1 and δ8, which ignore the data, give points (R1, R2) which
are exactly the same as the corresponding (L1, L2) for the no-data problem.
• The straight line joining δ1 and δ8 would consist of randomizations that are exactly
equivalent (as regards risk) to randomizations of the pure actions a1 and a2,
respectively, in the absence of data.
• The data provide risk points which clearly dominate these no-data points; in particular,
δ4, δ6 and δ7 are all better, and indeed δ4 and δ7 are admissible.
• There are several rules (δ2, δ3 and δ5) which are worse than rules that make no use of
the data, and consequently make poor use of the data.
The availability of data permits a reduction of losses if the data is used properly, with
the set of (R1, R2) points representing the various procedures pulled in toward the zero
regret point. For this to happen, the data cannot be independent of the state of nature.
If the distribution of X is independent of θ,

  P_θ(X = x) = f(x; θ) = k(x),
then the risks for a given rule δ(x) are

  R_i = E[l(θ_i, δ(X))] = Σ_j l(θ_i, δ(x_j)) k(x_j), for i = 1, …, s.

In vector form this becomes

  (R_1, …, R_s) = Σ_j k(x_j) (l(θ_1, δ(x_j)), …, l(θ_s, δ(x_j))),

which is a convex combination of the points (L_1, …, L_s), where L_i = l(θ_i, δ(x_j)),
corresponding to the m actions δ(x_1), …, δ(x_m). These losses can be achieved by the
randomization k(x_1), …, k(x_m) of the pure actions δ(x_1), …, δ(x_m), and so the data do
not extend the convex set of loss points to risk points that are any better (Lindgren,
1971).
Section 3.4 Selecting a Decision Function
Section 2.6 introduced the concepts of dominance and admissibility, which are extended to the decision function by defining them in terms of the risk function. (Although this concept was addressed in the previous section, it is discussed here in more depth within the constructs of modeling statistical decision problems.) A decision function δ* dominates a decision function δ if and only if

R(θ, δ*) ≤ R(θ, δ), for all θ.

The dominance is strict if the inequality is strict for at least one state of nature. A decision function is admissible if it is not dominated strictly by any other decision function. This was defined by Berger (1985) and stated in Section 2.6.
A decision function that involves an inadmissible action is inadmissible. If a is not admissible, it is dominated by some action a* (Lindgren, 1971):

l(θ_0, a*) < l(θ_0, a), for some θ_0.

If δ(x) assigns the action a to some possible value x_j of positive probability under θ_0, a new rule δ*(x) is defined to be identical with δ(x) except that δ*(x_j) = a*, and then

R(θ_0, δ*) = Σ_i l(θ_0, δ*(x_i)) f(x_i; θ_0) < Σ_i l(θ_0, δ(x_i)) f(x_i; θ_0) = R(θ_0, δ).
(The inequality follows because the losses in the sums are identical except for the term where i = j, in which case δ* prescribes a* instead of a, yielding a smaller loss.) For states other than θ_0, the inequalities < are replaced by ≤, and the strict dominance of δ by δ* follows.
The Minimax Approach…
The minimax principle provides a numerical measure of decision rules, namely, the maximum risk over the various states of nature:

M(δ) = max_θ R(θ, δ).

The minimax decision function is the δ that minimizes this maximum risk.
The Bayes Approach…
Assigning prior probability weights g(θ) to the various states of nature determines the average risk over the states:

B(δ) = E[R(θ, δ)] = Σ_j R(θ_j, δ) g(θ_j).

This is the Bayes risk. The preferred decision rule is the decision function δ which minimizes this Bayes risk.
Recall the question posed in the last section:
Question: Would it make any difference, in studying a decision problem, if one used regret instead of loss?
As in the no-data case, it can make a difference, at least if one considers a minimax approach. There are two ways to introduce the idea of regret: by applying it to the initial loss function, and by applying it to the risk, or expected loss. These reduce to the same thing, as shown below. The regret function was defined as
r(θ_i, a) = l(θ_i, a) − min_a l(θ_i, a),

so for each x,

r(θ, δ(x)) = l(θ, δ(x)) − min_a l(θ, a);

and the expected regret is then

E[r(θ, δ(X))] = E[l(θ, δ(X))] − min_a l(θ, a).
Now for any decision rule δ,

l(θ, δ(x)) ≥ min_a l(θ, a), for all x,

so that

E[l(θ, δ(X))] ≥ min_a l(θ, a)

and

min_δ E[l(θ, δ(X))] = min_δ R(θ, δ) ≥ min_a l(θ, a).
But if a* is the action that gives the minimum l(θ, a), and one considers the decision rule δ*(x) ≡ a*, then

min_δ R(θ, δ) ≤ R(θ, δ*) = E[l(θ, a*)] = l(θ, a*) = min_a l(θ, a) (Lindgren, 1971).

Since the minimum risk over all rules δ is neither greater than nor less than the minimum loss over all actions a, these must be equal. Therefore, the expected regret is the same as the "regretized" risk and is given as

E[r(θ, δ(X))] = R(θ, δ) − min_δ R(θ, δ).
Example 3-1 (cont'd): (Lindgren, 1971)
Recall this example with the updated table of expected regrets. Shown also are the maximum risks M(δ) and the Bayes risks B(δ) for the corresponding prior probabilities g(θ).

g(θ)    δ_1    δ_2    δ_3    δ_4    δ_5    δ_6    δ_7    δ_8
.8       1     .9     .7     .4     .6     .3     .1     0
.2       0     .5     .4     .1     .9     .6     .5     1

M(δ)     1     .9     .7     .4     .9     .6     .5     1
B(δ)    .8     .82    .64    .34    .66    .36    .18    .2
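As a check, the two criteria can be recomputed directly from the regret rows of the table; this is a minimal sketch (the array layout is illustrative, not from the source):

```python
# Regret table from Example 3-1: rows are states (with priors .8 and .2),
# columns are the rules delta_1, ..., delta_8.
regret = [
    [1.0, 0.9, 0.7, 0.4, 0.6, 0.3, 0.1, 0.0],  # state theta_1, g = .8
    [0.0, 0.5, 0.4, 0.1, 0.9, 0.6, 0.5, 1.0],  # state theta_2, g = .2
]
prior = [0.8, 0.2]

# Minimax criterion: the maximum regret over states, per rule.
M = [max(col) for col in zip(*regret)]

# Bayes criterion: the prior-weighted average regret, per rule.
B = [sum(g * r for g, r in zip(prior, col)) for col in zip(*regret)]

print(M)  # maximum risks, one per rule
print(B)  # Bayes risks, one per rule
print(min(range(8), key=lambda j: B[j]))  # index of the Bayes rule
```

Running this reproduces the M(δ) and B(δ) rows of the table, and the minimizing index picks out δ_7 as the Bayes rule.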
Suppose it is known, from graphically analyzing this problem, that the minimax mixed decision rule is a mixture of decision rules δ_4 and δ_7; its risk point can be computed as

p (.4, .1) + (1 − p) (.1, .5),

and equating the two coordinates gives

.4p + .1(1 − p) = .1p + .5(1 − p).
Thus p = 4/7, and the minimum maximum risk is computed by substituting this value of p back into the equation above, obtaining the value 19/70. The Bayes approach works in much the same way, except that the prior distribution would be utilized.
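The algebra above can be checked numerically; this sketch (variable names are illustrative) solves for the mixing weight p with exact rational arithmetic and evaluates the resulting risk:

```python
from fractions import Fraction

# Risk (regret) points of the two rules being mixed, from Example 3-1:
# delta_4 -> (.4, .1), delta_7 -> (.1, .5), written as exact fractions.
r4 = (Fraction(4, 10), Fraction(1, 10))
r7 = (Fraction(1, 10), Fraction(5, 10))

# The minimax mixture equalizes the two coordinates of p*r4 + (1-p)*r7:
# .4p + .1(1-p) = .1p + .5(1-p).  Write each side as a linear function
# of p and solve a*p + b = c*p + d.
a, b = r4[0] - r7[0], r7[0]   # first coordinate as a*p + b
c, d = r4[1] - r7[1], r7[1]   # second coordinate as c*p + d
p = (d - b) / (a - c)

risk = a * p + b              # common value of both coordinates
print(p)                      # 4/7
print(risk)                   # 19/70
```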
Section 3.5: The Posterior Distribution
For completeness, a brief description of the posterior distribution is provided here within the decision-model framework. Some of the relationships between conditional, marginal and joint probabilities were addressed in Section 1 on probability measures.
Let X represent data with possible values x_1, …, x_k, and let Θ denote the state of nature, with possible values θ_1, …, θ_m. Assume that the distributions of X under the various states of nature are given:

f(x_i | θ_j) = P(X = x_i | Θ = θ_j),

as well as the prior distribution for Θ:

g(θ_j) = P(Θ = θ_j).
From these the joint probabilities,

p(x_i, θ_j) = f(x_i | θ_j) g(θ_j),

and the marginal probabilities for X,

p(x_i) = Σ_j f(x_i | θ_j) g(θ_j),

can be constructed. The conditional probabilities for Θ given X = x_i are

h(θ_j | x_i) = p(x_i, θ_j) / p(x_i) = f(x_i | θ_j) g(θ_j) / Σ_j f(x_i | θ_j) g(θ_j).
This relation is essentially Bayes' theorem. The function h(θ_j | x_i) is the posterior probability function for Θ, corresponding to the given prior, g(θ).
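The construction above (joint, then marginal, then conditional) is mechanical; a minimal sketch, with an illustrative two-state likelihood table not taken from the source:

```python
# Posterior h(theta_j | x_i) from a prior g and a likelihood table
# f[j][i] = f(x_i | theta_j). The numbers below are illustrative only.
def posterior(f, g, i):
    joint = [f[j][i] * g[j] for j in range(len(g))]   # p(x_i, theta_j)
    marginal = sum(joint)                             # p(x_i)
    return [p / marginal for p in joint]              # h(theta_j | x_i)

f = [[0.7, 0.3],   # f(x_1 | theta_1), f(x_2 | theta_1)
     [0.2, 0.8]]   # f(x_1 | theta_2), f(x_2 | theta_2)
g = [0.5, 0.5]     # prior

print(posterior(f, g, 0))  # posterior over theta after observing x_1
```

With a uniform prior the posterior is simply the normalized likelihood column, as the formula above predicts.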
There are two notes to make before closing this section (Lindgren, 1971).
1. When the distribution of the data X is unrelated to the state of nature (that is, is the same for all states), the posterior probabilities are equal to the prior probabilities. Thus, if f(x_i | θ) ≡ k(x_i),

h(θ | x_i) = f(x_i | θ) g(θ) / k(x_i) = g(θ).
So the observation X does not alter the odds on the various states of nature from what they were before the observation.
2. If a prior belief is sufficiently strong, observations will not alter it. Thus if g(θ*) = 1 and g(θ) = 0 for any other θ, then

k(x) = Σ_i f(x | θ_i) g(θ_i) = f(x | θ*);

and then, provided f(x | θ*) ≠ 0,

h(θ | x) = f(x | θ) g(θ) / f(x | θ*) = 1, if θ = θ*, and 0 for any other θ.
Section 3.6 Successive Observations
This is adapted from Lindgren (1971) and will be discussed further in the Part on Sequential Statistical Decision Theory.
The previous subsection indicated that an observation can alter prior odds to posterior odds. A natural question is whether this updating is consistent: if a posterior distribution is used as a prior distribution with new data, is the resulting posterior distribution the same as if one had waited until all the data were at hand and used the original prior distribution to obtain a single, final posterior distribution? It is shown below that it is.
Suppose that two observations are made, (X, Y), with probability function

f(x, y | θ) = P(X = x, Y = y | Θ = θ).

Expressing conditional probabilities using the defining formulas in terms of joint probabilities gives

P(X = x, Y = y | Θ = θ) = P(X = x | Θ = θ) P(Y = y | Θ = θ and X = x),

which can be rewritten as

f(x, y | θ) = f(x | θ) f(y | x, θ).
The posterior probabilities for θ given X = x, and given a prior g(θ), are

h_1(θ | x) = f(x | θ) g(θ) / Σ_j f(x | θ_j) g(θ_j) = f(x | θ) g(θ) / k(x).
Using this as a prior probability for Θ = θ together with the new observation Y = y yields the posterior probability

h_2(θ | x, y) = f(y | x, θ) h_1(θ | x) / Σ_j f(y | x, θ_j) h_1(θ_j | x)
            = [f(y | x, θ) f(x | θ) g(θ) / k(x)] / [Σ_j f(y | x, θ_j) f(x | θ_j) g(θ_j) / k(x)]
            = f(x, y | θ) g(θ) / Σ_j f(x, y | θ_j) g(θ_j).
This is precisely the posterior probability that would be computed based on data (X, Y) and the prior g(θ).
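The consistency of sequential and batch updating can be illustrated numerically; in this sketch the likelihood values are illustrative, and X and Y are taken conditionally independent given θ, so that f(y | x, θ) = f(y | θ):

```python
# Sketch: sequential Bayes updating agrees with a single batch update.
def update(prior, likelihood):
    joint = [l * g for l, g in zip(likelihood, prior)]
    total = sum(joint)
    return [p / total for p in joint]

g = [0.5, 0.3, 0.2]    # prior over three states (illustrative)
fx = [0.6, 0.3, 0.1]   # f(x | theta_j) for the observed x
fy = [0.2, 0.5, 0.3]   # f(y | theta_j) for the observed y

h1 = update(g, fx)                                      # posterior after X alone
h2_seq = update(h1, fy)                                 # then update with Y
h2_batch = update(g, [a * b for a, b in zip(fx, fy)])   # both at once

print(h2_seq)
print(h2_batch)   # the same distribution
```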
Section 3.7 Bayes Actions from Posterior Probabilities
The following discussion is found in Chapter 4 of Lindgren (1971), Chapter 4 of Berger (1985), Chapter 6 of French and Insua (2000) and Chapter 3 of Raiffa and Schlaifer (2000).
The Bayes decision rule for a given problem, corresponding to a certain prior distribution, can be found by applying the Bayes principle to the risk table: weighting the various states of nature and choosing the decision function δ(⋅) so as to minimize the Bayes risk, B(δ). This function of δ is the expected value of the risk,

B(δ) = Σ_i R(θ_i, δ) g(θ_i).
Consider a different formalization.
Step 1: Recall the expression for R(θ, δ) as it is computed from the loss function:

R(θ, δ) = Σ_j l(θ, δ(x_j)) f(x_j | θ).

Step 2: Substituting this into B(δ) yields

B(δ) = Σ_i Σ_j { l(θ_i, δ(x_j)) f(x_j | θ_i) } g(θ_i).
Step 3: According to Bayes' theorem, the product f(x_j | θ_i) g(θ_i) can be replaced by the product of the conditional probability, in which the condition is the value of X, and the marginal probability of X = x_j:
f(x_j | θ_i) g(θ_i) = h(θ_i | x_j) p(x_j).

Here p(x_j) is calculated by summing the joint probabilities for (Θ, X) over θ (a sum which would also be relevant in determining the posterior probabilities, but is not needed for that at the moment):

p(x_j) = Σ_i P(Θ = θ_i, X = x_j) = Σ_i f(x_j | θ_i) g(θ_i).
Step 4: The indicated double summation can be calculated equally well by summing first on i and then on j:

B(δ) = Σ_j { Σ_i l(θ_i, δ(x_j)) f(x_j | θ_i) g(θ_i) }
     = Σ_j { Σ_i l(θ_i, δ(x_j)) h(θ_i | x_j) } p(x_j),

where the last expression is obtained by making the substitution from Bayes' theorem, as given above.
Note: The quantity in braces depends on the observed value x_j and on the action assigned to x_j by the decision function used:

L_h(x_j, δ) = Σ_i l(θ_i, δ(x_j)) h(θ_i | x_j).

It is the average of the losses l(θ, δ(x_j)) with respect to the posterior probability weights, and is referred to as the expected posterior loss.
Step 5: B(δ) is a weighted sum of expected posterior losses with nonnegative weights:

B(δ) = Σ_j L_h(x_j, δ) p(x_j),

and it is minimized by making each L_h as small as possible. So if, for the observed x_j, δ(x_j) is taken to be the action that minimizes the expected posterior loss L_h, the resulting rule is one that minimizes the Bayes risk and is a Bayes procedure (Lindgren, 1971).
To obtain the Bayes action given a particular X, it is therefore not necessary to go through the computation of the whole decision rule; this matters because the number of decision functions is of a higher order than the number of actions.
Mathematically, the present approach is a simple process of minimizing a function of the actions, namely, the expected posterior loss, by selecting one of those actions; the earlier approach is a more complicated process of minimizing a function of the decision rules, over all possible decision rules. In summary, the earlier approach incorporates the data into the loss function to obtain the risk function, and the original prior is then used on those risks to determine the Bayes rule. The present approach incorporates the data into the prior distribution to obtain the posterior distribution, which is used as an educated "prior" distribution on the original losses. The results are equivalent.
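The shortcut can be illustrated numerically; this sketch (the losses, likelihoods and prior are illustrative, not from the source) picks, for one observed x, the action minimizing the expected posterior loss:

```python
# Bayes action by minimizing expected posterior loss for an observed x.
# loss[i][a] = l(theta_i, a); f[i][j] = f(x_j | theta_i). Values illustrative.
loss = [[0.0, 2.0],
        [3.0, 1.0]]          # two states, two actions
f = [[0.8, 0.2],
     [0.3, 0.7]]             # likelihood of x_1, x_2 under each state
g = [0.6, 0.4]               # prior

j = 0                        # suppose x_1 was observed
joint = [f[i][j] * g[i] for i in range(2)]
h = [p / sum(joint) for p in joint]   # posterior h(theta_i | x_1)

# Expected posterior loss of each action, then the Bayes action for x_1.
epl = [sum(h[i] * loss[i][a] for i in range(2)) for a in range(2)]
bayes_action = min(range(2), key=lambda a: epl[a])
print(epl, bayes_action)
```

Only the two candidate actions are compared, rather than the four possible decision functions on the two data values.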
Part II: UTILITY THEORY 
Concepts and Methods
Decision making under uncertainty places the evaluation of the consequences of possible actions at the forefront of decision problems. There are two main paradigms for modeling human uncertainty which have been applied to decision making under uncertainty. One involves probabilistic and statistical reasoning; most common is the Bayesian perspective coupled with expected utility maximization. The other comprises rule-based deductive systems built on axiomatic foundations. Of particular interest is Savage's expected utility, which involves binary relations over functions defined on a measurable space. In contrast, von Neumann and Morgenstern's theory of expected utility maximization under risk derived a basic set of axioms from which was deduced the existence of mathematical functions with useful properties for comparing alternative states and preferences.
The first section of this part presents the basic ideas and concepts of utility theory. It also introduces utility in the context of monetary values. Sections 5 and 6 focus on the development of utility theory from axiomatic foundations. The middle section focuses on the concept of ordinal utility and then shifts to the concept of subjective probability. The last section is a comparative assessment of the various theories that were developed, demonstrating the differences and similarities in their conception.
Section 4: Utility Theory – From Axioms to Functions
An analytic study of a decision problem seems to require the assumption of a
mathematical model or structure of ordering among the various possible consequences of
taking an action in the face of a particular state of nature. In evaluating these
consequences of possible actions, two major problems are encountered. The first is that
the values of the consequences may not have any scale of measurement; and the second is
that even when there is a clear scale (usually monetary) by which consequences can be
evaluated, the scale may not reflect the “true” value to the decision maker (Berger, 1985).
In speaking about what one is faced with as the result of making a decision when nature is in a certain state, the term consequence has been used, because the situation has been brought about in part by the making of the decision. In discussing utility, it is preferable to use the term prospect, without the implication that the future history has necessarily been brought about by a decision, which happens to be the case in decision theory but is not generally so in the theory of utility.
A mathematical analysis is simpler if it is possible to put prospects not just in order, but on a numerical scale, that is, to assign a numerical measure to each prospect. With reasonable assumptions about one's preferences among the various prospects, such an assignment, or function, can be achieved. The numbers are referred to as utilities, and utility theory deals with the development of such numbers. Thus, utility is a function defined on the various prospects (or consequences) with which one is faced, measuring the relative desirability of these prospects on a numerical scale.
This section focuses on the basic axioms and general concept of utility, while the succeeding sections discuss the development of the different "types" of utility within individual rational decision making.
Section 4.1: Preference Axioms
Every mathematical structure is based on a set of axioms. Acceptance of the axioms
consolidates the acceptance of a whole string of consequences, or theorems, that follow
from the axioms. This section presents the underpinning axioms, dealing with
preferences, which imply the existence of a utility function, providing a numerical scale
in terms of which the consequences of actions are assumed to be measured.
A notion of preference, or relative desirability, among prospects is assumed. The following notation¹ of the basic binary relation represents a weak preference, taken to be "at least as desirable":

P_1 ≿ P_2 means: prospect P_1 is at least as desirable as prospect P_2.

¹ The symbols ≻ and ≿ will occasionally be used in the other direction, with the obvious meaning, namely, that the prospect on the small end of the symbol is the less desirable.
This relation is a weak order, which assumes the properties of completeness and transitivity (French and Insua, 2000). The latter trait, transitivity, is stated below as Axiom 2 and holds for the following two relations as well. This weak order can be used to define the equal desirability of two prospects:

P_1 ∼ P_2 means: prospects P_1 and P_2 are equally desirable,

which is defined to mean that both P_1 ≿ P_2 and P_2 ≿ P_1, or simply that one is indifferent between them. Strict preference is expressed in the following manner:

P_1 ≻ P_2 means: prospect P_1 is preferred over prospect P_2,

which is a condition that (by definition) exists when and only when P_1 ≿ P_2 but P_1 and P_2 are not equally desirable. The relation ∼ is reflexive, that is, P_1 ∼ P_1, if it is understood that a prospect is at least as good as itself. Moreover, by definition it is symmetric; that is, P_1 ∼ P_2 is equivalent to P_2 ∼ P_1 (Lindgren, 1971).
These relations (along with Axiom 2, transitivity) result in the following consequences:
(i) If P ∼ Q and Q ∼ R, then P ∼ R.
(ii) If P ≿ Q and Q ∼ R, then P ≿ R.
(iii) If P ≻ Q and Q ∼ R, then P ≻ R.
(iv) If P ≻ Q and Q ≻ R, then P ≻ R.
Before clearly stating the key axioms, it is necessary to address the concept of a "mixture" of prospects (Lindgren, 1971). These are summarized into three aspects.
(1) If one faces a prospect P_1 with probability p and a prospect P_2 with probability (1 − p), this is called a random prospect; it is considered to be a mixture of the prospects P_1 and P_2 and is denoted by [P_1, P_2]_p.
(2) More generally, if one faces P_1 with probability p_1, P_2 with probability p_2, …, and P_k with probability p_k, this is again a random prospect, called a mixture of P_1, …, P_k and denoted [P_1, …, P_k]_(p_1, …, p_k).
(3) Similarly, the notation for random prospects composed of infinitely many prospects, P_1 with probability p_1, P_2 with probability p_2, and so forth, where p_1 + p_2 + … = 1, is [P_1, P_2, …]_(p_1, p_2, …).
These have been summarized into the adopted axioms as follows (Lindgren, 1971; Berger, 1985):
AXIOM 1: Given any prospects P and Q, then either P ≻ Q, or P ≺ Q, or P ∼ Q.
Axiom 1 states that, given any two prospects, either one is preferred over the other, or the other over the one, or they are equally desirable.
AXIOM 2: If P ≿ Q and Q ≿ R, then P ≿ R.
Axiom 2 is the axiom of transitivity of the preference relation. Its assumption precludes a type of inconsistency in the mathematical structure that may or may not accurately describe one's preferences.
AXIOM 3: If P_1 ≻ P_2, then for any probability p and prospect P, it follows that [P_1, P]_p ≻ [P_2, P]_p.
AXIOM 3′: If P_1, P_2, … and Q_1, Q_2, … are sequences of prospects such that P_i ≻ Q_i for i = 1, 2, …, then for any set of probabilities α_1, α_2, … (Σ α_i = 1),

[P_1, P_2, …]_(α_1, α_2, …) ≻ [Q_1, Q_2, …]_(α_1, α_2, …).

Axiom 3 says that improving one of the prospects in a finite mixture improves the mixture, and Axiom 3′ extends this notion to countably infinite mixtures.
AXIOM 4: Given three prospects in the order P_1 ≻ P_2 ≻ P_3, there are mixtures [P_1, P_3]_p and [P_1, P_3]_r such that

P_1 ≻ [P_1, P_3]_p ≻ P_2 ≻ [P_1, P_3]_r ≻ P_3.

Axiom 4 says, when P_1 ≻ P_2 ≻ P_3, that there is no P_1 so wonderful that the slightest chance of encountering it instead of P_3 is better than any ordinary P_2; and that there is no P_3 so terrible that the slightest chance of encountering it instead of P_1 is worse than any ordinary P_2.
The following consequences are implied (Lindgren, 1971). For instance, if P_1 ≻ P_0, it then follows that

(v) P_1 ≻ [P_1, P_0]_p ≻ P_0, for 0 < p < 1,

since P_1 can be thought of as [P_1, P_1]_p and P_0 as [P_0, P_0]_p, so that (by Axiom 3)

P_1 = [P_1, P_1]_p ≻ [P_1, P_0]_p ≻ [P_0, P_0]_p = P_0.²
(vi) [P_1, P_0]_p ≻ [P_1, P_0]_q, if P_1 ≻ P_0 and 0 ≤ q < p ≤ 1.

This states that one mixture of two prospects is preferred over another if there is a stronger probability of the more desirable prospect in the first mixture. That is, as p increases from 0 to 1, the mixture [P_1, P_0]_p becomes increasingly more desirable. A further consequence of the axioms is the existence of a utility function, which is to follow.
Section 4.2 Coding Intermediate Prospects
The previous subsection showed that, given any two distinct prospects P_0 and P_1 with P_0 ≺ P_1, all mixtures of these prospects lie between them in order of desirability:

P_0 ≺ [P_1, P_0]_p ≺ P_1.

Moreover, the larger p, the more desirable is the mixture. Now consider an arbitrary prospect P between P_0 and P_1:

P_0 ≺ P ≺ P_1;

then there is some mixture of P_0 and P_1 equivalent to P.
THEOREM: Given any two prospects P_0 and P_1 such that P_0 ≺ P_1, and given any prospect P such that P_0 ≺ P ≺ P_1, there is a unique number p between 0 and 1 such that [P_1, P_0]_p is equivalent to P. (Lindgren, 1971)
This result means that the prospects intermediate to given prospects can be thought of as lined up on the scale of numbers from 0 to 1, each being identified with a number on that scale. Equivalent prospects are located at the same point on the scale, and the ordering of prospects in desirability corresponds to ordering on the scale, from least desirable to more desirable prospects. A utility function is now easily constructible which provides a coding of utilities giving a numerical representation of preferences, by taking the utility of a prospect to be just the number on the scale from 0 to 1. Thus, if P ∼ [P_1, P_0]_p, define u(P) = p. For all prospects between P_0 and P_1, as well as for P_0 and P_1 themselves,

u(P_0) = 0 and u(P_1) = 1.

² The equality of two prospects means that they are precisely the same prospect.
There are certain properties satisfied by the utility function³, defined for any P between given P_0 and P_1 as the value p in the mixture [P_1, P_0]_p which is equivalent to P, as follows:

Utility Property A: If P ≻ Q, then u(P) ≥ u(Q).

Property A, for any prospects P and Q between P_0 and P_1, follows from the fact that mixtures of P_0 and P_1 are ordered according to the proportion of P_1 in the mixture; thus, if P = [P_1, P_0]_p and Q = [P_1, P_0]_q, and P ≻ Q, then p > q. But u(P) = p and u(Q) = q, which yields Property A.

Utility Property B: u([P, Q]_r) = r u(P) + (1 − r) u(Q).

For Property B, for these P and Q, observe that

[P, Q]_r = [[P_1, P_0]_p, [P_1, P_0]_q]_r = [P_1, P_0]_(pr + q(1 − r)),

which implies that the utility of [P, Q]_r is

u([P, Q]_r) = pr + q(1 − r) = r u(P) + (1 − r) u(Q).
Property B can be expressed in terms of expectation. For a mixed prospect [P, Q]_r, the utility is a random variable with possible values u(P) and u(Q) and corresponding probabilities r and 1 − r; the expected utility is therefore r u(P) + (1 − r) u(Q), which is the utility assigned to the random prospect. The utility of a random prospect is thus the expected value of the utility (Berger, 1985).
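The reduction of a compound mixture used in the derivation of Property B can be checked by simulation; a minimal sketch (the values of p, q and r are illustrative), where each prospect is represented by the probability with which it ultimately delivers P_1:

```python
import random

# Compound lottery [[P1,P0]_p, [P1,P0]_q]_r: first pick a branch with
# probability r, then resolve the inner mixture. Its utility should be
# pr + q(1-r), the total weight it places on P1 (Utility Property B).
p, q, r = 0.9, 0.4, 0.25   # illustrative values
random.seed(0)

def draw():
    inner = p if random.random() < r else q
    return 1 if random.random() < inner else 0   # 1 means outcome P1

n = 200_000
estimate = sum(draw() for _ in range(n)) / n
print(estimate)               # close to the value below
print(p * r + q * (1 - r))    # 0.525
```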
Berger (1985) describes a five-step procedure for constructing a utility function, much of which is described above. He also includes in his construction the fact that any positive linear function of such a utility would also satisfy Utility Properties A and B⁴, and serve just as well as a utility function.
Defining a function v(P) on prospects between P_0 and P_1 as

v(P) = a u(P) + b,

for a > 0, has the simple effect of shifting the origin and introducing a scale factor. The utilities of P_0 and P_1 in this new scheme would be

v(P_0) = b and v(P_1) = a + b.

³ Berger, 1985.
⁴ French and Insua, 2000.
Property A for v(P) is clearly satisfied. To obtain Property B, compute

v([P, Q]_r) = a u([P, Q]_r) + b
           = a[r u(P) + (1 − r) u(Q)] + b
           = r[a u(P) + b] + (1 − r)[a u(Q) + b]
           = r v(P) + (1 − r) v(Q).

The matter of utilities for prospects that are not between two given prospects P_0 and P_1 has been neglected in this report. This concludes this subsection.
Section 4.3 Utility for Money
In many practical situations, prospects are expressible in terms of amounts of money. In such cases, it is tempting to treat the number of dollars as utility, or as proportional to utility: u(M) = M or u(M) = kM. Recall from the introduction to this section that, even discounting the possibility that such a utility function overlooks other aspects of prospects, this scale of values is not usually a totally adequate basis in terms of which to analyze decision problems involving money. For one thing, the function u(M) = M is not a bounded function of M, at least if M is allowed to range over all real numbers. The famous example of the "St. Petersburg Paradox" illustrates the difficulty (Lindgren, 1971; Berger, 1985):
Example 6-1:
You are offered, for a fee, the following random prospect. You will be given $2^N if in repeated tosses of a coin the first heads does not appear until the Nth toss. What entry fee would you be willing to pay?
If money is utility (that is, u(M) = M), you should be willing to pay any fee up to the expected value of the payoff, because your utility would not be decreased by so doing. It can be shown that the probability of obtaining heads for the first time on the Nth toss is (1/2)^N, and accepting this, one obtains as the expected payoff

E(payoff) = Σ_{N=1}^∞ 2^N · (1/2)^N = 1 + 1 + 1 + … = ∞.

(That is, the sum is not a finite number.) Despite the infinite expected payoff, it is found that people generally will not pay even a large but finite entry fee for the proposed game.
On the other hand, if it is assumed that one's utility function is

u(M) = M, if M ≤ 2^20,
u(M) = 2^20 = 1,048,576, if M > 2^20,
the expected utility of the payoff is

E[u(payoff)] = Σ_{N=1}^{20} 2^N (1/2)^N + Σ_{N=21}^∞ 2^20 (1/2)^N = 20 + (1/2 + 1/4 + …) = 21.

The game then has the same utility as $21.
Another aspect of the paradox is that no one could in fact offer the game honestly: no one has the $2^40 needed to pay out in the event of, say, 39 tails in a row before the first heads. On the other hand, if a gambling house with a capital of $1,048,576 agrees to pay out its entire amount when N = 20 or more, the expected payoff is $21, as the computation above also shows.
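The expected-utility computation with the bounded utility can be checked numerically; a minimal sketch (the truncation point for the infinite sum is an implementation choice, and the tail it drops is negligible):

```python
# St. Petersburg game: payoff 2**N when the first heads is on toss N,
# which happens with probability (1/2)**N.
def expected_utility(u, max_n=2000):
    # Truncate the infinite sum; terms beyond max_n contribute nothing
    # visible at double precision.
    return sum(u(2 ** n) * 0.5 ** n for n in range(1, max_n + 1))

cap = 2 ** 20  # bounded utility: u(M) = M up to 2**20, constant beyond
print(expected_utility(lambda m: min(m, cap)))  # approximately 21
```

With the unbounded utility u(M) = M each term contributes exactly 1, so the same truncated sum grows without bound as max_n increases, mirroring the divergence above.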
Even without the paradox arising from permitting one's capital M to have any finite value, as in the example above, most people's utility for money is not strictly linear, although it may be approximately linear over a restricted range of values of M. Figure 4-1 shows a utility function u(M) plotted against M that is something like most people's utility functions for money.
Figure 4-1: [an increasing utility function u(M) plotted against M]
The function shown above is an increasing function of M, corresponding to the assumption that the greater the monetary value of a prospect, the greater its desirability. It is unusual to encounter a situation in which a greater amount of capital is "worth" less than a smaller amount. The monotonic nature of the curve representing utility for money shows that if one is offered a favorable chance prospect involving money repeatedly, the offer should be accepted: if there is a net monetary gain each time, the decision maker's monetary assets increase, and therefore utility increases.
The expected utility, rather than the expected monetary gain, is the appropriate guide for action in a single experiment, but in a long sequence of experiments the monetary gain is relevant. In dealing with a given problem it is often convenient to focus attention on the small portion of the curve representing the function u(M), and to label the money axis in terms of the amount of money to be lost or gained rather than in terms of total assets. If M denotes an increase or decrease from an initial status M_0, u(M) is really u(M_0 + M). It may also appear as though a person's utility function is different in different situations, reflecting different points on the overall utility curve. Figure 4-1 showed that it is possible to have (locally) a utility for gain or loss that is concave down in some instances and concave up in others, as shown in the figures below.
Figure 4-2(a): This would be appropriate if the initial capital were near 0.
Figure 4-2(b): This would be appropriate if the initial capital were large.
Section 4.4 Bets, Fair and Unfair
"A bet is a random or mixed prospect [of the type encountered at the beginning of the previous section]: One starts with an initial capital M_0 and, according to the outcome of an experiment of chance, ends up either with M_L, an amount less than M_0, or with M_W, an amount greater than M_0. In the former case he loses, his capital being reduced by the amount y = M_0 − M_L, and in the latter case he wins, his capital being increased by the amount x = M_W − M_0. It is said he "puts up" y and his opponent "puts up" x, the winner taking both amounts." (Lindgren, 1971)
Definition:
The money odds in such a bet are said to be "y to x" that one wins the bet, or y/x. The probability odds in such a bet are said to be "p to 1 − p" that one wins the bet, or p/(1 − p), where p is the probability that he wins. A bet is said to be a fair bet if the probability odds and the money odds are equal.
A fair bet can be thought of in terms of the invariance principle addressed in Section 2. It is "fair" in the sense that neither side is favored. If the bet is fair, then

y/x = p/(1 − p), or y(1 − p) − px = 0.

Substitution of y = M_0 − M_L and x = M_W − M_0 yields

(M_0 − M_L)(1 − p) − (M_W − M_0)p = 0,

or

p M_W + (1 − p) M_L = M_0.
A fair bet can thus be redefined as one in which the expected capital after the bet is equal to the capital before the bet. There is neither an expected loss nor an expected gain in capital on the part of either bettor; if one player experienced a net expected loss, the bet would not be fair.
When utility is measured in terms of money, that is, when u(M) = kM, the indifference relation stands: accepting or not accepting the bet are equally desirable:

u(bet) = p u(M_W) + (1 − p) u(M_L) = k(p M_W + (1 − p) M_L) = k M_0 = u(M_0).

The equal desirability follows from the fact that taking the bet has the same utility as the initial capital, and will be readdressed in the sections concerning the development of utility theory.
Consider a general nonlinear utility function u(M) and a fair bet:

p M_W + (1 − p) M_L = M_0.

The utility of the bet is

U_B = p u(M_W) + (1 − p) u(M_L).

In vector notation,

(M_0, U_B) = p (M_W, u(M_W)) + (1 − p) (M_L, u(M_L)),
which is interpreted geometrically to mean that the point (M_0, U_B) is a convex combination of the two points (M_W, u(M_W)) and (M_L, u(M_L)). Thus, the point (M_0, U_B) lies on the line segment joining those two points, as shown in the figures below. Two cases are shown in which the person is not indifferent to the fair bet.
Figure 4-3(a): This represents a utility function and fair bet that is attractive.
Figure 4-3(b): This represents a utility function and fair bet that is unattractive.
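Whether a fair bet is attractive depends only on the curvature of u over [M_L, M_W]; a minimal sketch, where the square-root and square utilities are illustrative stand-ins for concave and convex curves (not from the source):

```python
# Fair bet: p*MW + (1-p)*ML = M0, so the expected capital is unchanged.
ML, MW, p = 64.0, 144.0, 0.5
M0 = p * MW + (1 - p) * ML             # 104.0

def bet_utility(u):
    return p * u(MW) + (1 - p) * u(ML)  # U_B, the utility of the bet

concave = lambda m: m ** 0.5   # diminishing marginal utility of money
convex = lambda m: m ** 2      # increasing marginal utility of money

print(bet_utility(concave) < concave(M0))  # True: fair bet unattractive
print(bet_utility(convex) > convex(M0))    # True: fair bet attractive
```

This is the geometric statement in numbers: the chord value U_B at M_0 lies below a concave curve and above a convex one.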
Consider a bet that is not fair, biased against the person whose interests are of concern and in terms of whose utility the situation is being studied; in such a case the expected capital after the bet is less than the initial capital:

p′ M_W + (1 − p′) M_L = M′ < M_0,

where p′ is the probability of winning, and is smaller than the p corresponding to a fair bet for the same amounts of money. The above inequality can be restated in the form

p′/(1 − p′) < (M_0 − M_L)/(M_W − M_0) = y/x,

which says that the money odds are higher than the actual probability odds.
If U′ denotes the utility in this unfair bet:

U′ = p′ u(M_W) + (1 − p′) u(M_L),
then the point (M′, U′) lies on the line segment joining the points (M_W, u(M_W)) and (M_L, u(M_L)):
(M′, U′) = p′·(M_W, u(M_W)) + (1 − p′)·(M_L, u(M_L)).
This is shown in figure 4-4, in which the initial capital M_0 is also plotted; it is greater than the expected capital after the bet, M′. Decreasing M′ amounts to decreasing p′, that is, altering the probability odds, and there is a p* corresponding to the smallest M* for which the bet would be worth taking:

p*·M_W + (1 − p*)·M_L = M*
or

p* = (M* − M_L)/(M_W − M_L).
Figure 4-4:
A case was shown above in which the utility is such that a fair bet is advantageous. In much the same manner, it can be shown that a utility may be such that a fair bet is not advantageous, and the same construction can be used to determine how unfair a bet may be. This is omitted here.
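Under any given utility function, this threshold can be computed directly from the indifference condition p*·u(M_W) + (1 − p*)·u(M_L) = u(M_0). A minimal sketch, assuming a hypothetical square-root (concave, hence risk-averse) utility, for which the bet must be better than fair before it is worth taking:

```python
import math

def indifference_probability(u, m_win, m_lose, m0):
    """Smallest winning probability p* at which the bet is worth taking,
    i.e. p*·u(m_win) + (1 - p*)·u(m_lose) = u(m0)."""
    return (u(m0) - u(m_lose)) / (u(m_win) - u(m_lose))

u = math.sqrt  # a concave utility, assumed purely for illustration

m_win, m_lose, m0 = 200.0, 50.0, 100.0
p_fair = (m0 - m_lose) / (m_win - m_lose)  # fair bet: expected capital equals m0
p_star = indifference_probability(u, m_win, m_lose, m0)

print(round(p_fair, 4), round(p_star, 4))
```

Here p* works out to √2 − 1 ≈ 0.414, above the fair-bet probability 1/3, so a fair bet is unattractive to this decision maker; a convex utility would reverse the inequality, matching the attractive case of Figure 4-3(a).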
Nau (2000) discusses the issues surrounding fair bets and utility for money under
equilibrium in depth. He characterizes those who are risk averse and those who are not in
delimitating their actions under varying conditions.
Section 5: Development of Utility Theory I – Ordinal Utility
The development of rational choice [decision] theory [5] (in its modern form) dates back fifty years to a trio of publications: von Neumann and Morgenstern's Theory of Games and Economic Behavior (1944/1947), Kenneth Arrow's Social Choice and Individual Values (1951), and Savage's Foundations of Statistics (1954) [6]. These works laid the foundation for the use of mathematical methods such as some of those discussed in Section 6. These include axiomatic methods, concepts of measurable utility and personal probability, and tools of general equilibrium theory and game theory. The focus of the previous section was to address some of these aspects, while these next two sections discuss the development of utility theory.
For brevity, this report reviews only models of individual rationality. These include ordinal utility (for choice under conditions of “certainty”), expected utility (for choice under conditions of “risk,” where probabilities for events are objectively determined), subjective expected utility and state-preference theory (for choice under conditions of “uncertainty,” where probabilities for events are subjectively determined—or perhaps undetermined), and several kinds of non-expected utility theory. The earlier presentation of the underpinning preference axioms is revisited briefly in this section, providing some background on the concept of “ordinal” utility, prior to the discussion of “cardinal” expected utility presented in the next section.
Section 5.1: Ordinal Utility
Returning to the notation for preference relations and the stated axioms, suppose that there are only two commodities – for instance, apples and bananas [7]. A pair of numbers (a, b) then represents a possible bundle of commodities (namely a apples and b bananas) that a person might possess. Now suppose that this person has preferences among such bundles of apples and bananas. If those preferences are complete, reflexive, transitive, and continuous (see Section 6.1), then there exists a utility function U which represents them, so that (a, b) ≿ (a′, b′) if and only if U(a, b) ≥ U(a′, b′). Recall the notion of indifference. Then it can be said that for any bundle (a, b) it is possible to find a unique number x such that the bundle (x, x), which has equal amounts of each commodity, is precisely indifferent to (a, b). Let that number x be defined as U(a, b). But this is not the only possible way to define a utility function representing the same preferences. For example, we could just as well define U(a, b) to be equal to 2^x, or f(x) where f is any monotonic (i.e., strictly increasing) function. This was shown in Section 4.
Definition:
[5] Several economics and operations research papers interchange between the terms choice and decision theory.
[6] Nau, 2000.
[7] This could be anything. Lindgren uses beef and chicken in his example. The idea is to be able to assign a numeric value to something which may be numerically measured in terms of preference for a particular person. Much of the theory presented here is crucial to the understanding of consumer theory as found in microeconomics.
An ordinal value function v: A → ℜ is a real-valued function which represents – or agrees with – a weak preference ≿ in the following sense:

∀ a, b ∈ A, v(a) ≥ v(b) ⇔ a ≿ b.
The above definition was provided by French and Insua (2000) and simply states that an
ordinal value function is a numerical assignment to a set of alternatives which represents
the person or decision maker’s preference ranking by a numerical ranking. Thus, the
utility function representing a decision maker’s preferences is said to be merely an
ordinal utility because only the ordering property of the utility numbers is meaningful.
Example 5-1:
For example, suppose that U(1, 1) = 1, U(2, 1) = 2, U(1, 2) = 3, and U(2, 2) = 4. In other
words, referring back to apples and bananas, one apple and one banana yield one unit of
utility, two apples and one banana yield two units of utility, etc. Then this means only
that the preference ordering is (2, 2) > (1, 2) > (2, 1) > (1, 1), i.e., two apples and two
bananas are better than one apple and two bananas which are better than two apples and
one banana which are better than one apple and one banana. However, it may not be
concluded that two apples and one banana are “twice as good” as one apple and one
banana or that one additional apple yields the same “increase in utility” regardless of
whether you have one banana or two bananas to start with.
Recall that any monotonic transformation of U carries exactly the same preference
information: if f is a monotonic function, and if V is another utility function defined by
V(a, b) = f(U(a, b)), then V encodes exactly the same preferences as U. For example, letting f(x) = 2^x, we obtain a second utility function satisfying V(1, 1) = 2, V(2, 1) = 4, V(1, 2) = 8, and V(2, 2) = 16, which provides a completely equivalent representation of
this hypothetical decision maker’s preferences.
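This equivalence can be checked mechanically; a small sketch using the numbers of this example and the transformation f(x) = 2^x:

```python
# Bundles (apples, bananas) with ordinal utilities from Example 5-1,
# and the transformed utility V = 2^U discussed in the text.
U = {(1, 1): 1, (2, 1): 2, (1, 2): 3, (2, 2): 4}
V = {bundle: 2 ** u for bundle, u in U.items()}

# Ranking the bundles by either utility gives the same preference order.
rank_U = sorted(U, key=U.get)
rank_V = sorted(V, key=V.get)
print(rank_U == rank_V)        # True: identical orderings
print([V[b] for b in rank_U])  # [2, 4, 8, 16]
```

Only the ordering survives the transformation; the spacing between the numbers carries no information, which is exactly the point of the example.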
Ordinal value functions represent preference rankings and nothing more. There is no
further information in the numbers used to represent the decision maker’s preferences,
such as strength of preference. More formally, the following theorem is given by French
and Insua (2000).
THEOREM: For any domain of objects A with a weak order ≿, there exists an ordinal value function if and only if there exists a countable order-dense subset B ⊂ A, viz. B is such that ∀ a_1, a_2 ∈ A with a_1 ≻ a_2, ∃ b ∈ B such that a_1 ≿ b ≿ a_2.
Thus, any strictly increasing transformation of an ordinal value function also agrees with
the underlying weak preferences. Conversely, any two ordinal value functions which
agree with the same preference relation are related by a strictly increasing transformation.
Thus, the representation is said to be “unique up to a strictly increasing transformation”
(French and Insua, 2000).
Section 5.2: Marginal Utility
Suppose now that the non-unique ordinal utility function of concern is differentiable—i.e., it is a smooth function of the numbers of apples and bananas. Then marginal utility (of an additional apple or banana) [8] can be defined as the partial derivative of the utility function (with respect to apples or bananas), evaluated at the point of interest. Since it was previously stated that ordinal utilities are nearly meaningless numbers, it might be plausible to believe that marginal utilities would be nearly meaningless numbers as well; however, the ratios of marginal ordinal utilities are always uniquely defined. The ratios of marginal utilities are precisely the marginal rates of substitution between the prospects that leave total utility unchanged. Marginal rates of substitution are uniquely determined by preferences and as such they are observable; hence any two utility functions that represent the same preferences must yield the same ratio of marginal utilities between any two preference groups (Nau, 2000).
For example, suppose the ratio of the marginal utilities of apples to bananas is equal to 1/2 when the decision maker already has one apple and one banana; this means that the person would indifferently trade an ε fraction of an additional banana for a 2ε fraction of an additional apple. The same ratio must then be obtained from any other utility function that is a monotonic transformation of the original one.
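This invariance is easy to verify numerically. The sketch below assumes a hypothetical utility U(a, b) = (ab)^(1/2) and a monotonic transformation V = U³, and estimates the ratio of marginal utilities by finite differences:

```python
def mrs(utility, a, b, eps=1e-6):
    """Ratio of marginal utilities (apples to bananas) via finite differences."""
    du_da = (utility(a + eps, b) - utility(a, b)) / eps
    du_db = (utility(a, b + eps) - utility(a, b)) / eps
    return du_da / du_db

U = lambda a, b: (a * b) ** 0.5   # a hypothetical ordinal utility
V = lambda a, b: U(a, b) ** 3     # a monotonic transformation of U

# Both utilities give the same marginal rate of substitution at (1, 2).
print(round(mrs(U, 1.0, 2.0), 4), round(mrs(V, 1.0, 2.0), 4))
```

For this particular U the ratio at (a, b) is b/a, so both prints come out near 2; the transformation rescales both partial derivatives by the same factor (chain rule) and the ratio cancels it out.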
Section 5.3: Indifference Curves
Section 6 illustrated the geometrical perspective of a decision maker's utility curve. To recap, geometrically, an ordinal utility function determines the shape of indifference curves in the space of possible prospects or groups of prospects. An indifference curve is a set of points representing endowments that all yield the same numerical utility—i.e., that are equally preferred. A decision maker's indifference curves in ℜ² are illustrated in figure 5-1.
Figure 5-1:
Points x and y are on the same
indifference curve, meaning that
they yield exactly the same utility
—i.e., the decision maker is precisely
indifferent between them—whereas
point z lies on a “higher” indifference
curve, so it would be preferred to either
x or y. The slope of a tangent line
to an indifference curve (such as the
dotted line through point y) is precisely the marginal rate of substitution between
prospects at that point.
[8] The marginal utility of an apple is not necessarily the increase in utility yielded by an additional whole apple—it is the increase in utility yielded by an additional ε fraction of an apple, divided by ε, where ε is infinitesimally small.
Note 1: A person wishes to climb higher (to another indifference curve) if at all
possible. Thus, if the person started at point x on the map, he or she would be indifferent
to walking over to point y, which is at the same elevation, but would prefer to climb
higher up to point z.
Note 2: While the shapes of the indifference curves are uniquely determined by
preferences, the relative heights of the utility surface above different indifference curves
are completely arbitrary.
Note 3: Another utility function that represented the same preferences would have
to yield the same indifference curves, whose tangent lines would have exactly the same
slopes at all points, but the corresponding utility surface might otherwise look very
different.
Section 5.4: Assumptions – Marginal Utility and Multi-Attributed Theory
The assumption of diminishing marginal utility was briefly addressed in Section 6. This
assumption means that if the holdings of all prospects are fixed except one, then the
marginal utility of that prospect should decrease as the “endowment” of it increases (Nau,
2000). The term nonsatiation refers to the fact that the marginal utility of every prospect
always remains positive—i.e., it approaches zero only asymptotically, so that more is
always better no matter how much you already have. (See Section 4.3.)
Decreasing marginal utility is actually not a strong enough condition for the decision maker's problem to always have a unique solution unless the utility function is also additively separable—i.e., of the form U(a, b) = u_1(a) + u_2(b). If the number of prospects is increased to some n, i.e., there exist prospects a, b, c, d, …, then the additive form of utilities refers to the consequences or alternatives being multi-attributed in ℜⁿ. [9]
A stronger assumption of strictly convex preferences is usually made, which means that if x ≿ z and y ≿ z, then αx + (1−α)y ≻ z for any number α strictly between 0 and 1, where the vectors x, y, and z denote general multidimensional prospects. In other words, if x and y are both weakly preferred to z, then everything “in between” x and y is strictly preferred to z [10]. In particular, if x and y are indifferent to each other, then αx + (1−α)y is strictly preferred to either of them—i.e., the decision maker would prefer to have an average of x and y rather than either extreme.
This property captures the intuition of diminishing marginal utility, namely that ½x provides “more than half as much utility” as x, but it does so by referring to observable preferences rather than unobservable utility. Lindgren (1971) discusses the case of arbitrary prospects in more depth in Chapter 2.3.
[9] This is explored in Section 8.
[10] See Section 6.2.
If preferences are strictly convex, then any utility function that represents those preferences is necessarily strictly quasi-concave, which means that its upper level sets are strictly convex sets, i.e., the set of all y such that U(y) ≥ U(x) is a strictly convex set. The strict convexity of the level sets ensures that the decision maker's problem has a unique solution (Nau, 2000) [11].
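A quick numerical check of this property, assuming a hypothetical strictly quasi-concave utility U(x_1, x_2) = (x_1·x_2)^(1/2):

```python
# Two bundles the decision maker is indifferent between, and their average.
U = lambda x: (x[0] * x[1]) ** 0.5

x, y = (1.0, 4.0), (4.0, 1.0)   # U(x) == U(y) == 2: same indifference curve
mid = tuple(0.5 * xi + 0.5 * yi for xi, yi in zip(x, y))  # the 50-50 average

print(U(x), U(y), U(mid))  # the average is strictly preferred to either extreme
```

Here U(mid) = 2.5 > 2, illustrating that under strictly convex preferences the average of two indifferent bundles lies on a strictly higher indifference curve.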
Section 5.5 Subjective Probability
The classical concept of probability was initially introduced (in Part A) in terms of a repeatable experiment (of chance) that (at each trial) results in one of a given set of possible outcomes. The probability assigned to each outcome is thought of as an idealization of its relative frequency of occurrence in a long sequence of trials. In the case of a chance or random variable, these outcomes are ordinary numbers; more generally they may be characterized by non-numerical descriptions. When the outcomes are thought of as prospects with corresponding probabilities, the result is termed a random prospect. However, this frequency concept does not always suffice when dealing with θ.
The theory of subjective probability was created to enable one to talk about probabilities
when the frequency viewpoint does not apply (Berger, 1985). A quantification of
“chances” involving one’s beliefs or convictions concerning the occurrences of the event,
and embodying the odds indicates that the matter of chance in this kind of experiment is
actually personal. The term subjective probability is used to refer to probability that
represents a person’s degree of belief concerning the occurrence of an event. Although it
can be assumed that the subjective probability of a given state exists, it is seldom easy to
determine that probability (Lindgren, 1971).
Nau (2000) states that there are four “best-known” concepts of probability – (1) the classical interpretation, (2) the frequentist (or objective) interpretation, (3) the logical (or necessary) interpretation, and (4) the subjective interpretation. Although objective probability and subjective probability are different in conception and interpretation, they are not different as mathematical entities, inasmuch as the same basic axioms are assumed to hold for both. Any definitions or theorems concerning objective probabilities apply equally well to subjective probabilities. In particular, the expected value of a random variable, defined earlier (in Section 6) for the case of objective probabilities, would be defined in exactly the same way if the probabilities are subjective, namely, as the weighted sum of the possible values, where the weights are the corresponding subjective probabilities.
This section discussed the role of ordinal utility. This closing subsection has introduced the topic to be discussed in the following section, which more formally develops utility theory in its three comparable forms.
[11] Nau (2000) discusses this within the constructs of consumer theory as understood in microeconomic theory. He also provides a detailed development of utility theory and equilibrium states as they formed in utility theory and impacted microeconomic theory.
Section 6: Development of Utility Theory II – Comparing Axiomatic Theories
In the last section, ordinal utility functions were discussed according to the marginalist view of modeling decision problems under conditions of certainty. This section compares the main elements of the theory of individual rational modeling under risk and uncertainty, developed in the mid-1900s: the theories of subjective probability (SP), expected utility (EU), and subjective expected utility (SEU). The primary sources of information for this section, Nau (2000) and French and Insua (2000), were stated at the beginning of Section 5. Both provide detailed accounts of the developments of all three works, along with citations for the original publications and other perspectives on these developments.
Nau discusses these three theories in terms of four attributes or concepts – “axiomatic fever”, “primitive preferences”, “concepts of probability”, and “the independence condition for preferences → cardinal utility”. Of these four concepts, the first two have been discussed thoroughly in Section 4. The third was addressed briefly in the closing subsection of Section 5. The last concept, however, has not been addressed thus far. Before beginning a comparative look at the three theories of concern, a little insight into the importance of this last topic is given first.
Section 6.1: The Independence Condition for Preferences →
Cardinal Utility
Von Neumann and Morgenstern, in their aim to establish a “new game-theoretic foundation for economics”, decided it was necessary to first axiomatize a concept of cardinal measurable utility that could serve as a “single monetary commodity” whose expected value the players would seek to maximize through their strategy choices. Recall from the previous section that decision making under conditions of certainty was modeled with ordinal utility functions. It was discussed that under such conditions, it is meaningless to say something like “the utility of y is exactly halfway between the utility of x and the utility of z.” Von Neumann and Morgenstern discovered that such a statement can be made if the decision maker has preferences not only among definite objects such as x, y, and z, but also among probability distributions over those objects. Thus, it can be said that by definition, a 50-50 gamble between x and z has a utility that is exactly halfway between the utility of x and the utility of z, and if y is indifferent to such a gamble, then it too has a utility that is exactly halfway between the utility of x and the utility of z. There are several important things to note here (Nau, 2000; French and Insua, 2000).
Note 1: Additional “psychological” data is required to construct a cardinal utility
function, namely data about preferences among probability distributions over rewards,
not just data about preferences among the rewards themselves.
Note 2: Those preferences must satisfy an independence condition, namely that x is preferred to y if and only if an α chance of x and a 1−α chance of z is preferred to an α chance of y and a 1−α chance of z, regardless of the value of z—i.e., a preference for x over y is independent of common consequences received in other events. [12]
Note 3: Once probabilities have been introduced into the definition and measurement of utilities, a cardinal meaning cannot be attached to the utility values except in a context that involves risky choices: if the utility of y is found to lie exactly halfway between the utilities of x and z, then it cannot be concluded that the increase in the measure that results from the exchange of x for y is exactly the same as the increase that results from the exchange of y for z. All that can be said is that y is indifferent to a 50-50 gamble between x and z.
While the von Neumann-Morgenstern utility function was being established, other economists had discovered that a simple behavioral restriction on preferences under certainty implies that a consumer must have an additively separable utility function, i.e., a utility function of the form U(x_1, x_2, …, x_n) = u_1(x_1) + u_2(x_2) + … + u_n(x_n), which is unique up to an increasing linear transformation and is therefore cardinally measurable (Nau, 2000).
Section 6.2: Comparative Axioms of SP, EU, and SEU
Many authors contributed to the development of axiomatic theories of rational decision making under uncertainty; however, this section concentrates on three main contributions: de Finetti's theory of subjective probability (SP), von Neumann and Morgenstern's theory of expected utility (EU), and Savage's theory of subjective expected utility (SEU). Savage's theory is essentially a merger of the other two and will be presented in conjunction with the work of Anscombe and Aumann. I have tried to be consistent with my notation; however, various authors present different aspects, and thus this is an amalgamation of work published by Nau (2000) and French and Insua (2000).
Table 6-1 (appended to the end of Section 6) presents the axioms at a glance as a comparative means and was adapted from Nau (2000). This is followed by a discussion of some of the key elements of comparison for the three theories of concern.
In all three theories, there is a primitive ordering relation, which I will denote by ≿. In SP theory, the objects of comparison are events A, B, etc., which are subsets of some grand set of possible states of nature. The ordering relation between events is one of comparative likelihood: A ≿ B means that event A is considered “at least as likely” as event B. In EU theory, the objects of comparison are probability distributions f, g, etc., over some set of possible consequences, which could consist of amounts of money or
[12] The independence condition is implicit in von Neumann and Morgenstern's axiomatization.
other tangible rewards or prizes, or they could simply be elements of some abstract set. The ordering relation is one of preference: f ≿ g means that distribution f is “at least as preferred” as distribution g. SEU theory merges the main elements of these two frameworks: the objects of comparison are “acts,” which are real or imaginary mappings from states of nature to consequences. In other words, an act is an assignment of some definite consequence to each state of nature that might obtain. The ordering relation is preference, as in EU theory.
Similarity of Axioms Across Theories
1. Assumption that the relation is complete: for any A and B, either A ≿ B or B ≿ A or both. Hence, it is always possible to determine at least a weak direction of ordering between two alternatives.
2. Assumption that the ordering relation is transitive: i.e., A ≿ B and B ≿ C implies A ≿ C.
3. Assumption that the relation has a property called independence or cancellation, which means that in determining the order of two elements (such as A ≿ B), there is some sense in which common events or consequences can be ignored on both sides.
A Closer Look at the Independence Condition Across Probability Theories
1. The independence condition for subjective probability states:

A ≿ B ⇔ A ∪ C ≿ B ∪ C

for any C that is disjoint from both A and B. Simply stated, if the same non-overlapping event is joined to two events whose likelihoods are being compared, the balance of likelihood is not “tipped”. This condition also implies:

A ∪ C ≿ B ∪ C ⇔ A ∪ C* ≿ B ∪ C*

for any C, C* that are disjoint from both A and B. Thus, a common disjoint event on both sides of a likelihood comparison can be replaced by another common disjoint event without tipping the balance.
2. The independence condition for expected utility is:

f ≿ g ⇔ αf + (1−α)h ≿ αg + (1−α)h.

The expression αf + (1−α)h is an “objective mixture” of the distributions f and h—i.e., an α chance of receiving f and a 1−α chance of receiving h, where the chances are objectively determined. Thus, if f and g are objectively mixed with the same other distribution h, in exactly the same proportions, the balance of preference is not tipped between them. A decision-tree illustration of this condition and its interpretation is given in Section 9.
The condition also implies:

αf + (1−α)h ≿ αg + (1−α)h ⇔ αf + (1−α)h* ≿ αg + (1−α)h*,

which says that if f is preferred to g when they are both mixed with a common third distribution h, then the direction of preference isn't changed if they are mixed with a different common distribution h* instead. In other words, the “common element” h can be replaced with any other common element.
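The implication can be verified numerically for discrete distributions; the prizes, utilities, and mixing weight below are illustrative assumptions:

```python
def expected_utility(dist, u):
    """Expected utility of a discrete distribution {prize: probability}."""
    return sum(p * u[prize] for prize, p in dist.items())

def mix(alpha, d1, d2):
    """Objective mixture alpha·d1 + (1 - alpha)·d2 of two distributions."""
    prizes = set(d1) | set(d2)
    return {c: alpha * d1.get(c, 0) + (1 - alpha) * d2.get(c, 0) for c in prizes}

u = {"a": 0.0, "b": 1.0, "c": 5.0}   # hypothetical prize utilities
f = {"a": 0.2, "c": 0.8}
g = {"b": 1.0}
h = {"a": 0.5, "b": 0.5}             # an arbitrary third distribution

alpha = 0.3
prefer_plain = expected_utility(f, u) >= expected_utility(g, u)
prefer_mixed = (expected_utility(mix(alpha, f, h), u)
                >= expected_utility(mix(alpha, g, h), u))
print(prefer_plain, prefer_mixed)  # mixing with h does not tip the balance
```

Because expected utility is linear in the mixture, the difference between the two mixed sides is exactly α times the difference between the unmixed sides, so the direction of preference cannot change for any h or any 0 < α < 1.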
3a. The independence condition for subjective expected utility (Savage's axiom P2, otherwise known as the “sure thing principle” as noted by Nau (2000)) states:

Af + (1−A)h ≿ Ag + (1−A)h ⇔ Af + (1−A)h* ≿ Ag + (1−A)h*,

for any non-null event A [13]. The expression Af + (1−A)h is a “subjective mixture” of f and h: it yields the consequence specified by act f in every state where event A is true, and the consequence specified by act h in all other states. Simply put, when comparing two acts which happen to agree with each other in some states (namely the states in which A is not true), different agreeing consequences can be substituted in those states without tipping the balance. This property is necessary to define the concept of a conditional preference between two acts: the relation Af + (1−A)h ≿ Ag + (1−A)h means that f is preferred to g conditional on event A.
3b. There is also a second kind of independence condition that appears in SEU theory, namely an assumption about the state-independence of preferences for particular consequences. This assumption (Savage's P3) states that if x, y, and z are constant acts (acts that yield the same consequence in every state of nature) and A is any non-null event, then:

x ≿ y ⇔ Ax + (1−A)z ≿ Ay + (1−A)z.
[13] A non-null event is an event that has non-negligible probability in the sense that at least some preferences among acts are affected by the consequences received in that event.
Thus, if the consequence x is preferred to the consequence y “for sure,” then it is also
preferred to y conditional on any event, other things being equal. This assumption is
needed to help separate utilities from probabilities—i.e., to ensure that relative utilities
for consequences don’t depend on the states in which they are received.
The important thing to note is that, in all these conditions, the common additive factors (e.g., C, h, z, etc.) can be ignored or canceled on opposite sides of the ordering relation. This property is necessary in order for the relation to be represented by a probability distribution that is additive across mutually exclusive events and/or a utility function that is linear in probability. The additivity or linearity of the preference representation, in turn, means that the direction of preference or comparative likelihood between two alternatives depends only on the difference between them.
Section 6.3: Subjective Probability versus Expected Utility theory
In his 1937 paper, de Finetti pointed out that there are actually two equivalent ways to axiomatize the theory of subjective probability. One way is in terms of a binary relation of comparative likelihood (as illustrated in the table above). The other way is in terms of the acceptance of monetary gambles. The two theories are “dual” to each other—i.e., they are really the same theory, but their working parts are merely labeled and interpreted in different ways (Nau, 2000).
Suppose that there is some set S of mutually exclusive, collectively exhaustive states of nature. Consider possible distributions of monetary wealth over those states. Thus, if f is a vector representing a wealth distribution, then f(s) is the amount of money you get if state s ∈ S occurs.
In the presence of the completeness axiom, the independence axiom also works in reverse, i.e., f ≿ g ⇔ αf + (1−α)h ≿ αg + (1−α)h for all 0 < α < 1. This is so because the completeness axiom requires the decision maker to have a definite direction of preference between f and g, and if it were not the same as the direction of preference between αf + (1−α)h and αg + (1−α)h, the independence axiom would produce a contradiction.
A0: (reflexivity) f ≿ f for all f.
A1: (completeness) for any f and g, either f ≿ g or g ≿ f or both.
A2: (transitivity) if f ≿ g and g ≿ h, then f ≿ h.
A3: (strict monotonicity) if f(s) > g(s) for all s, then f ≻ g (i.e., NOT g ≿ f).
A4: (independence) f ≿ g ⇔ αf + (1−α)h ≿ αg + (1−α)h for all h and 0 < α < 1.
This means the decision maker effectively has linear utility for money, because scaling
both f and g up or down by the same positive factor and/or increasing the decision
maker’s initial wealth by a constant amount h does not lead to a reversal of preference.
There are many results and important conclusions that follow from this line of thought, which are discussed in some depth in Nau (2000). Two of prominence are: first, that the direction of preference between any two wealth distributions depends only on the difference between them; and second, that the inequalities between different pairs of wealth distributions can be added.
Now consider the differences between wealth distributions as gambles – of which there are two types: acceptable and unacceptable. The axioms below on the set of acceptable gambles are the assumptions used by de Finetti in his gamble-based method of axiomatizing subjective probability, and (by the arguments above) they are logically equivalent to A0−A4 (Nau, 2000).
SP THEOREM: If preferences among wealth distributions over states satisfy A0-A4 (or equivalently, if acceptable gambles satisfy B0-B4), then there exists a unique subjective probability distribution π such that f ≿ g if and only if the expectation of the wealth distribution f is greater than or equal to the expectation of distribution g according to π (or, equivalently, x is an acceptable $-gamble if and only if its expectation is nonnegative according to π).
This is de Finetti’s theorem on subjective probability.
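The theorem's acceptability criterion, and the additivity property (B3) behind it, can be illustrated with a small sketch; the subjective distribution π and the gambles below are assumed purely for illustration:

```python
# Under the SP theorem, a $-gamble (a payoff vector over states) is
# acceptable iff its expectation under the subjective distribution pi
# is nonnegative.
pi = [0.5, 0.3, 0.2]                   # subjective probabilities over 3 states

def acceptable(x):
    return sum(p * xi for p, xi in zip(pi, x)) >= 0

x = [10.0, -5.0, -10.0]                # E[x] = 5 - 1.5 - 2 = 1.5: acceptable
y = [-2.0, 2.0, 2.0]                   # E[y] = -1 + 0.6 + 0.4 = 0: acceptable
z = [xi + yi for xi, yi in zip(x, y)]  # additivity (B3): the sum is acceptable

print(acceptable(x), acceptable(y), acceptable(z))
```

Linearity of the expectation is what makes B2 (scaling) and B3 (addition) hold, and coherence (B4) rules out any gamble that loses money in every state.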
Consider reversing the roles of money and probabilities. Let C be some set of consequences or prizes, and let f, g, etc., denote probability distributions over those prizes. Thus, f(c) now represents the probability of receiving prize c rather than an amount of money. The axioms are as follows:
B0: (reflexivity) 0 ∈ G.
B1: (completeness) for any x, either x ∈ G or −x ∈ G or both.
B2: (linearity) if x ∈ G, then αx ∈ G for any α > 0.
B3: (additivity) if x ∈ G and y ∈ G, then x + y ∈ G.
B4: (coherence) G contains no strictly negative vectors.
There is a small difference in Axiom 3: the strict monotonicity condition is weakened to a nontriviality condition.
The direction of preference between any two probability distributions depends only on the difference between those two distributions. Note that a difference between two probability distributions is also a kind of “gamble” in the sense that it changes your risk profile—it just does so by shifting the probabilities attached to fixed prizes rather than by shifting the amounts of money attached to events with presumably fixed subjective probabilities. The following axioms are then developed:
The continuity/closure (A5/B5) requirement is needed to rule out unbounded utilities as well as lexicographic preferences (Nau, 2000; French and Insua, 2000).
Let u denote a utility function for prizes; then f ≿ g if and only if f′u ≥ g′u, and the quantities on the left and right of this expression are just the expectations of u under the distributions f and g, respectively. The function u is subject to arbitrary positive scaling, and a constant can also be added to it without loss of generality (wlog), because this will just add the same constant to the expected utility of every probability distribution.
EU THEOREM: The results are summarized as the expected utility theorem. If
preferences among probability distributions over consequences satisfy A0A1A2A3A4
(or equivalently, acceptable pgambles satisfy B0B1B2B3B4), there exists a utility
function u, unique up to positive affine transformations, such that f = g if and only if the
expected utility of f is greater than or equal to the expected utility of g (or, equivalently, x
A0: (reflexivity) f
~
f f for all f
A1: (completeness) for any f and g, either f
~
f g or g
~
f f or both
A2: (transitivity) if f
~
f g and g
~
f h, then f
~
f h
A3: (nontriviality) there exist f* and g* such that f* f g* (i.e., NOT g*
~
f f*)
A4: (independence) f
~
f g⇔ f + (1α)h
~
f g + (1α)h where 0 < α < 1.
A5: (continuity): for fixed f, the set of g such that f
~
f g is closed, and vice versa
B0: (reflexivity) 0 ∈ G.
B1: (completeness) for any x, either x ∈ G or –x ∈ G or both.
B2: (linearity) if x ∈ G, then αx ∈ G for any α > 0.
B3: (additivity) if x ∈ G and y ∈ G, then x + y∈G.
B4: (nontriviality) there is at least one nonzero pgamble not in G.
B5: (closure): G is closed.
Part II: Utility Theory  Concepts and Methods
63
is an acceptable pgamble if and only if it yields a nonnegative change in expected utility
according to u).
This is von Neumann–Morgenstern’s theorem on expected utility, and it is
isomorphic to de Finetti’s theorem on subjective probability, merely with the roles of
probabilities and payoffs reversed (French and Insua, 2000).
The axioms of subjective probability or expected utility are satisfied if and only if (a)
your set of acceptable directions of change in (stochastic) wealth is a convex cone, (b) it
excludes at least some directions—especially those that lead to a sure loss—and (c) it is
always the same set of directions, no matter what your current distribution.
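The duality just described reduces preference comparison to a pair of inner products. A minimal numerical sketch follows; all prizes, probabilities, and utility values are hypothetical, chosen only to show the mechanics of f ≽ g iff f′u ≥ g′u.

```python
# Sketch: comparing two lotteries by expected utility (hypothetical numbers).
# f and g assign probabilities to the same three prizes; u holds the
# utility of each prize. f is weakly preferred to g iff f'u >= g'u.

def expected_utility(dist, u):
    """Inner product of a probability (or difference) vector with a utility vector."""
    return sum(p * v for p, v in zip(dist, u))

f = [0.2, 0.5, 0.3]        # lottery f over prizes c1, c2, c3
g = [0.4, 0.4, 0.2]        # lottery g over the same prizes
u = [0.0, 1.0, 2.0]        # utilities u(c1), u(c2), u(c3)

eu_f = expected_utility(f, u)   # 0.5*1 + 0.3*2 = 1.1
eu_g = expected_utility(g, u)   # 0.4*1 + 0.2*2 = 0.8
print(eu_f >= eu_g)             # True: f is weakly preferred to g

# Equivalently, the "p-gamble" x = f - g is acceptable iff it yields a
# nonnegative change in expected utility: x'u >= 0.
x = [fi - gi for fi, gi in zip(f, g)]
print(expected_utility(x, u) >= 0)   # True, and equals eu_f - eu_g
```

The second print illustrates the cone view: the acceptable p-gambles are exactly the difference vectors with nonnegative inner product against u.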
Section 6.4: Subjective Expected Utility (Anscombe–Aumann & Savage)
Savage combined the features of subjective probability and expected utility theory in
order to model situations in which the decision maker may attach their own subjective
probabilities to events and their own subjective utilities to consequences. Anscombe &
Aumann followed in much the same way but took a simpler approach, merging
Savage’s concept of an “act” (a mapping of states to consequences) with von Neumann
and Morgenstern’s concept of an objective probability distribution over consequences
(French and Insua, 2000).
Anscombe and Aumann proposed that the objects of choice should be mappings from
states to objective probability distributions over consequences. Such objects are known
as “horse lotteries” in the literature. This is supposed to conjure up an image of a horse
race in which each horse carries a different lottery ticket offering objective probabilities
of receiving various prizes. By applying the von NeumannMorgenstern axioms of
expected utility to horse lotteries, together with one more assumption, you get
probabilities and utilities together. Henceforth, let f, g, and h denote horse lotteries.
Think of them as vectors with doubly-subscripted elements, where f(c, θ) denotes the
objective probability that the horse lottery f assigns to consequence c when state θ occurs
(i.e., when horse θ wins the race). Furthermore, let x, y, and z denote horse lotteries that
are “constant” in the sense that they yield the same objective probability distribution over
consequences in every state of the world (i.e., no matter which horse wins).
Let the usual axioms A0–A5 apply to preferences among horse lotteries, together with
one additional axiom, which is essentially the same as Savage’s P3:
A6: (state-independence) x ≽ y ⇔ Ax + (1−A)z ≽ Ay + (1−A)z
for all constant x, y, z and every event A.
Here, Ax + (1−A)z denotes the horse lottery that agrees with x if A occurs and agrees with
z otherwise, where A is an event (a subset of the states of the world).
Note 1: From axioms A0–A5 and some work, one obtains the existence of an expected-utility
vector v such that f ≽ g if and only if f′v ≥ g′v. The elements of the vector v are
subscripted the same as those of f and g, with v(c, θ) representing the expected utility of
consequence c when it is received in state θ. Intuitively, v(c, θ) combines the probability
of the state with the utility of the consequence.
Note 2: Axiom A6 then implies the further restriction that the expected utilities
assigned to different consequences in state θ must be proportional to the expected
utilities assigned to the same consequences in any other state, because it requires
conditional preferences among objective probability distributions over consequences to
be the same in all states of the world. It follows that v can be decomposed as v(c, θ) =
p(θ)u(c), where p(θ) is a (unique!) nonnegative probability assigned to state θ and u(c) is
a state-independent utility assigned to consequence c.
Note 3: By convention, the utilities attached to particular consequences are
assumed to be identical—not merely proportional—in different states, in which case the
ratio of v(c, θ) to v(c, θ′) can be interpreted as the ratio of the subjective probabilities of
states θ and θ′.
Note 4: Preferences among Savage acts are represented by a unique probability
distribution p over states and a state-independent utility function u over consequences,
with act f preferred to act g if and only if the expected utility of f is greater than or equal
to the expected utility of g. This is summarized as the subjective expected utility
theorem. However, before stating the theorem, one further remark is made which
addresses a weakness in this point.
Remark: It should be noted that the uniqueness of the derived probability
distribution depends on the unstated assumption that consequences are not only similarly
ordered by preference in every state of the world (which is the content of axiom A6
above), but they also have the very same numerical utilities in every state.
SEU THEOREM: If preferences among horse lotteries satisfy A0–A6, then there
exists a unique probability distribution p and a state-independent utility function u that is
unique up to positive affine transformations such that f ≽ g if and only if the expected
utility of f is greater than or equal to the expected utility of g.
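The content of the theorem can be sketched numerically: with a decomposition v(c, θ) = p(θ)u(c), ranking two horse lotteries reduces to comparing sums of products. The two states, two prizes, and all numbers below are hypothetical.

```python
# Sketch: subjective expected utility of "horse lotteries" (hypothetical numbers).
# A horse lottery assigns, in each state, an objective distribution over prizes:
# act[state][prize] = objective probability of that prize given the state.

p = {"rain": 0.3, "shine": 0.7}            # subjective state probabilities p(theta)
u = {"umbrella": 1.0, "ice_cream": 2.0}    # state-independent prize utilities u(c)

def seu(act):
    """SEU(f) = sum over states and prizes of p(state) * f(prize | state) * u(prize)."""
    return sum(p[s] * q * u[c]
               for s, dist in act.items()
               for c, q in dist.items())

f = {"rain":  {"umbrella": 0.9, "ice_cream": 0.1},
     "shine": {"umbrella": 0.2, "ice_cream": 0.8}}
g = {"rain":  {"umbrella": 0.5, "ice_cream": 0.5},
     "shine": {"umbrella": 0.5, "ice_cream": 0.5}}   # a "constant" act

print(seu(f), seu(g))   # 1.59 1.5, so f is preferred to g
```

Here the double sum is exactly f′v with v(c, θ) = p(θ)u(c) written out element by element.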
Section 6.5: Incomplete Preferences
The completeness axiom causes some distress: it forces a preference to be expressed
between every pair of alternatives, so that for every gamble x, either x or –x must be
accepted even when neither is genuinely acceptable. The three theorems stated above,
one for each theory, are now restated with incomplete preferences.
SP THEOREM with incomplete preferences: If preferences among wealth
distributions over states satisfy A0, A2–A5, then there exists a convex set P of
subjective probability distributions such that f ≽ g if and only if the expectation of the
wealth distribution f is greater than or equal to the expectation of distribution g for every
distribution π in P (or, equivalently, x is an acceptable gamble if and only if its
expectation is nonnegative for every distribution π in P).
Similar results are obtained when the completeness axiom is dropped from EU or SEU
theory: one ends up with convex sets of utility functions or expected-utility functions.
EU THEOREM with incomplete preferences: If preferences among probability
distributions over consequences satisfy A0, A2–A5 (or, equivalently, acceptable
p-gambles satisfy B0, B2–B5), there exists a convex set U of utility functions,
unique up to positive affine transformations,[14] such that f ≽ g if and only if the expected
utility of f is greater than or equal to the expected utility of g for every utility function u
in U (or, equivalently, x is an acceptable p-gamble if and only if it yields a nonnegative
change in expected utility for every utility function u in U). See French and Insua
(2000).
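The quantifier “for every utility function u in U” is what makes these preferences incomplete: two lotteries can simply be incomparable. A sketch follows, using a hypothetical two-element set of extreme utility vectors to stand in for the convex set U (for a polytope it suffices to check the extreme points).

```python
# Sketch: preference under a convex set U of utility functions.
# f is preferred to g iff expected utility of f >= that of g for EVERY u in U;
# for a polytope U it suffices to check the (hypothetical) extreme points.

def eu(dist, u):
    return sum(p * v for p, v in zip(dist, u))

U = [[0.0, 1.0, 2.0],    # hypothetical extreme utility vectors of U
     [0.0, 1.5, 1.6]]

def preferred(f, g):
    return all(eu(f, u) >= eu(g, u) for u in U)

f = [0.1, 0.3, 0.6]
g = [0.3, 0.4, 0.3]
h = [0.0, 0.9, 0.1]

print(preferred(f, g))                    # True: every u in U ranks f above g
print(preferred(f, h), preferred(h, f))   # False False: f and h are incomparable
```

The last line shows the characteristic feature of incompleteness: neither f ≽ h nor h ≽ f holds, because the two utility functions disagree about the ranking.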
SEU THEOREM with incomplete preferences: If preferences among horse lotteries
satisfy A0, A2–A6, then there exists a convex set V of expected-utility
functions, at least one of which can be decomposed as the product of a probability
distribution p and a state-independent utility function u, such that f ≽ g if and only if the
expected utility of f is greater than or equal to the expected utility of g for every
expected-utility function v in V.
[14] Note: Theorem 3 addresses positive affine transformations in Chapter 2 of French and Insua (2000).
Table 6.1: Comparing Axioms: SP, EU & SEU
(dF = de Finetti, subjective probability; vNM = von Neumann–Morgenstern, expected utility; S = Savage, subjective expected utility. ≽ denotes the primitive relation and ≻ its strict part.)

Theory of…
- dF: subjective probability. vNM: expected utility. S: subjective expected utility.

Features
- dF: states. vNM: an abstract set closed under probability mixtures. S: states and consequences.

Objects of comparison
- dF: events A, B, C (sets of states). vNM: probability distributions f, g, h. S: “acts” f, g, h (mappings from states to consequences) and constant acts x, y, z.

Primitive relation ≽
- dF: comparative likelihood. vNM: preference. S: preference.

Reflexivity axiom
- dF: A ≽ A. vNM: f ≽ f. S: f ≽ f.

Completeness axiom
- dF: for all A, B: A ≽ B or B ≽ A or both. vNM: for all f, g: f ≽ g or g ≽ f or both. S (P1a): for all f, g: f ≽ g or g ≽ f or both.

Transitivity axiom
- dF: for all A, B, C: if A ≽ B and B ≽ C, then A ≽ C. vNM: for all f, g, h: if f ≽ g and g ≽ h, then f ≽ h. S (P1b): for all f, g, h: if f ≽ g and g ≽ h, then f ≽ h.

Independence axiom (a.k.a. substitution, cancellation, separability, sure-thing principle)
- dF: if A ∩ C = B ∩ C = ∅, then A ≽ B iff A ∪ C ≽ B ∪ C.
- vNM: if f ≽ g, then αf + (1−α)h ≽ αg + (1−α)h for all α strictly between 0 and 1; the converse is also true as a theorem, whence αf + (1−α)h ≽ αg + (1−α)h iff αf + (1−α)h* ≽ αg + (1−α)h*.
- S (P2): for all f, g, h, h*, if event A is nonnull, then Af + (1−A)h ≽ Ag + (1−A)h iff Af + (1−A)h* ≽ Ag + (1−A)h*.

State-independent utility axiom (“value can be purged of belief”)
- S (P3): for all constant acts x, y, z, if event A is nonnull, then x ≽ y iff Ax + (1−A)z ≽ Ay + (1−A)z.

Qualitative probability axiom (“belief can be discovered from preference”)
- S (P4): for all events A and B and constant acts x, y, x*, y* such that x ≻ y and x* ≻ y*: Ax + (1−A)y ≽ Bx + (1−B)y iff Ax* + (1−A)y* ≽ Bx* + (1−B)y*.

Nontriviality axiom
- dF: Ω ≻ A ≻ ∅ for any A that is “uncertain”. vNM: there exist f and g such that f ≻ g. S (P5): there exist f and g such that f ≻ g.

Continuity axiom
- dF: there exists a fair coin.
- vNM: if f ≻ g ≻ h, then αf + (1−α)h ≻ g ≻ βf + (1−β)h for some α and β strictly between 0 and 1.
- S (P6): if f ≻ g, there exists a finite partition of the set of states into events {A_i} such that (1−A_i)f + A_i h ≻ (1−A_i)g + A_i h for any event A_i in the partition and any h.

Dominance axiom
- S (P7): if f(θ) ≻ g(θ) for every state θ in event A, then Af + (1−A)h ≽ Ag + (1−A)h.

Conditional probability axiom
- dF: if A ⊆ C and B ⊆ C, then A ≽ B iff A|C ≽ B|C.
Part III: SEQUENTIAL STATISTICAL DECISION THEORY –
Introduction to Markov Decision Processes
Concepts of decision processes (such as those presented in prior sections) have assumed a single
decision-making problem. However, this is rarely the case. In most instances, decisions
are “nested in a series of related decisions” (French and Insua, 2000). This underscores the
idea that a decision is contingent upon the outcome of previous decisions made. The
formulation of sequential problems is most naturally expressed as a recursive equation,
and its solution is credited to Bellman. Subsequent developments in sequential methods
have instigated growth in sequential probability, leading to a large body of theory on
Markov decision processes. This sequential pattern of decision processes is a concept
that has wide applications to all areas of the sciences, engineering, operations research,
finance, etc. Consequently, these developments and their applicability to a wide range of
problems have propagated the use of sequential methods in many fields.
This final theoretical part of the report focuses on sequential decision processes. An
examination of sequential sampling leads to the formulations of decision rules and
stopping rules in statistical decision theory. These concepts establish the goal for
preposterior and sequential analysis. Since many problems can be formulated in Markovian
ways, the latter section focuses on Markov chains and Markov decision processes as an
extension to the developments in sequential probability. Emphasis is placed on abstract
and technical formulations of the concepts and methodology. This provides a framework
for formulating sequential decision problems.
Section 7: Preposterior and Sequential Methods
The framework for decision problems as introduced in Part I assumed a fixed sample of
observations for a single decision. As noted in that framework, decisions lead to
consequences which may subsequently lead to further decision making. A natural
extension of the framework to encompass sequential decision processes establishes new
concepts for statistical decision theory: namely, the option to experiment or make new
observations before making a decision, an analysis termed preposterior analysis.
With a goal to minimize cost, sampling is the principal issue in sequential statistical
decision theory, and it establishes the concepts of optimal stopping rules and decision rules.
This instigated the expansion of hypothesis testing to encompass sequential methods.
Section 7.1: A Framework for Sequential Decision Problems
The framework for decision problems thus far has centered on decision making either
with or without experimentation. However, the choice or the decision to experiment is
also a very important aspect of statistical decision theory. This decision-making process
is frequently termed preposterior analysis. The goal is to minimize the overall cost,
which incorporates both the decision loss as well as the cost of conducting the
experiment.
The first introduction to statistical decision problems introduced the concept of having
observations upon which a decision could be based. Those statistical procedures
assumed a fixed sample size, although it was not explicitly stated. Other strategies
suggest sequential sampling and, moreover, multistage procedures. Extending the
framework for decision problems, denote the sample space for independent random
variables X_1, X_2, … to be 𝒳_i, and define X^j = (X_1, X_2, …, X_j), where X^j is assumed to have
density f^j(x^j | θ). Now suppose that X_1, X_2, … is a sequential sample from the
density f(x | θ); then

  f^j(x^j | θ) = ∏_{i=1}^{j} f(x_i | θ).
The cost of the experiment is attributed to two factors: n, the number of observations
taken, and s, the way in which the observations are taken (Berger, 1985). Thus,
l(θ, a, n, s) denotes the overall loss when θ is the true state of nature and a is the action
taken. Note that l(θ, a, n, s) can be considered to be a sum of the general loss function
l(θ, a) and the sampling cost c(n, s). Berger (1985) states that these functions are additive
when the decision maker has a linear utility function, which can be stated as

  l(θ, a, n) = l(θ, a) + ∑_{i=1}^{n} c_i,

where l(θ, a, n) is the shortened form of the overall loss function for a given sampling
approach and c_i represents the cost of the ith observation.
Recall that the loss function can be defined as l(θ, a) = −u(θ, a). Thus, if the utility
function is nonlinear and there is a monetary gain or loss g(θ, a), then

  l(θ, a, n, s) = −u( g(θ, a) − c(n, s) ).
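Both forms of the overall loss can be sketched as follows. The quadratic decision loss, the per-observation costs, and the exponential utility are all hypothetical choices for illustration, not taken from Berger (1985).

```python
import math

# Sketch: overall loss for a sequential decision problem.
# Additive case (linear utility): l(theta, a, n) = l(theta, a) + sum of c_i.

def decision_loss(theta, a):
    return (theta - a) ** 2          # hypothetical quadratic decision loss

def overall_loss(theta, a, costs):
    """Decision loss plus the total cost of the observations taken."""
    return decision_loss(theta, a) + sum(costs)

print(overall_loss(1.0, 0.5, costs=[0.1, 0.1, 0.1]))   # 0.25 + 0.3 = 0.55

# Nonlinear-utility case: l(theta, a, n, s) = -u(g(theta, a) - c(n, s)),
# with a hypothetical concave (risk-averse) utility u and monetary payoff g.

def u(m):
    return 1.0 - math.exp(-m)

def overall_loss_nonlinear(gain, sampling_cost):
    return -u(gain - sampling_cost)

print(overall_loss_nonlinear(gain=2.0, sampling_cost=0.3))
```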
The difficulty in sequential decision problems resides in all the possible ways in which
the observations are taken. The two most common approaches are (a) the fixed sample
method and (b) the sequential sampling approach. Although the latter approach is of
primary interest, this section begins by describing the methodology of determining the
optimal fixed sample size, leaving the sequential approach for the remainder of this section.
Recall that preposterior analysis is so named because the choice is made before the
data are obtained. Thus, consider first the “nonstatistical” decision problem, that is, the
no-observation problem. Within the Bayesian context, let π = P_θ(·) and let r_n(π) be the
Bayes risk of the optimal policy with at most n stages under prior knowledge π. Then the
situation in which the decision maker has no option to make an observation and must
choose an action can be expressed as

  r_0(π) = min_{a ∈ A} E^π[ l(θ, a) ],
where the expectation over θ is taken with respect to π. Another way of expressing this is
to let δ_n^π denote a Bayes decision rule for this problem and incorporate the loss in
observing X^n = (X_1, …, X_n). The equivalent expression is then

  r_n(π) = E^π E_θ^{X^n} [ l(θ, δ_n^π(X^n), n) ].
To determine the optimal fixed sample size in a decision problem requires finding the n
which minimizes r_n(π). In simple cases, this can be resolved by treating n as continuous,
differentiating r_n(π) with respect to n, and setting the derivative equal to zero.
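As a worked instance of choosing the optimal fixed sample size, consider estimating a normal mean under squared-error loss with a N(0, τ²) prior and N(θ, σ²) observations; the posterior expected loss after n observations is σ²τ²/(σ² + nτ²), so r_n(π) = σ²τ²/(σ² + nτ²) + nc. Since n is an integer, a grid search stands in for the calculus step; the parameter values below are illustrative.

```python
# Sketch: optimal fixed sample size by minimizing r_n(pi) = Bayes decision
# risk after n observations plus total sampling cost n*c (normal/quadratic case).

sigma2, tau2, c = 1.0, 1.0, 0.01   # sampling variance, prior variance, cost/obs

def r_n(n):
    """Bayes risk of the best fixed-sample-size-n procedure."""
    posterior_risk = sigma2 * tau2 / (sigma2 + n * tau2)
    return posterior_risk + n * c

best_n = min(range(0, 200), key=r_n)
print(best_n, r_n(best_n))   # -> 9 0.19
```

With these values the continuous first-order condition gives (1 + n)² = 1/c = 100, i.e. n = 9, which the grid search confirms.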
If it is assumed that the overall loss function has the aforementioned additive structure, then
δ_n^π is the Bayes decision rule for the loss function l(θ, a). Furthermore, if the fixed sample size
loss is equal to the sequential loss, a sequential analysis of the problem is considered
cheaper than a fixed sample size analysis (Berger, 1985). The advantages of sequential
analysis are somewhat obvious, and at the forefront lies the decision maker’s choice to
gather the required amount of data needed to make a decision for a specified degree of
precision. The remainder of this section of the report focuses on the foundations,
concepts and methods of sequential analysis.
Section 7.2: Sequential Analysis – The Basics
The distinguishing feature of sequential analysis is the ability to take observations one at
a time with the decision maker or experimenter having the option of stopping the
experiment and making a decision. Keeping within the sequential decision problem
framework presented earlier, let’s assume that the cost is proportional to the number of
observations; and moreover that the loss function is increasing in n. Thus, it can then be
reiterated that if n observations are taken sequentially, at which point the decision maker
chooses to take action a ∈ A, then the loss function when θ is the true state of nature can
be expressed as l(θ, a, n).
A nonrandomized sequential decision procedure is denoted d = (φ, δ) and consists of
two components: φ, a stopping rule, and δ, a (terminal) decision rule. A stopping rule
is characterized by a family of functions of the form

  φ = { φ_0, φ_1(X_1), φ_2(X_1, X_2), … }.
In the nonrandomized case,

  φ_0 = 0 if the rule says to take the first observation,
  φ_0 = 1 if the rule says to take no observations,

and

  φ_n(X_1, …, X_n) = 0 if the rule says to take another observation, given that X_1, …, X_n are at hand,
  φ_n(X_1, …, X_n) = 1 if the rule says to take no more observations, given that X_1, …, X_n are at hand.
That is to say, φ_0 is the probability of making an immediate decision without
sampling, while φ_n is the probability (zero or one) of stopping sampling and making a
decision after X_n is observed. Thus, given the functions φ_n, one can determine when to
stop; if the rule is given in other terms, one can determine the functions φ_n from the rule.
Alternatively, the stopping rule can be characterized by the following family of functions

  τ = { τ_0, τ_1(X_1), τ_2(X_1, X_2), … },

where τ_0 = φ_0, and
  τ_n(X_1, …, X_n) = (1 − φ_0)[1 − φ_1(X_1)]···[1 − φ_{n−1}(X_1, …, X_{n−1})] φ_n(X_1, …, X_n).

Thus τ_n = 0 if the rule either stopped sampling prior to the nth observation or says to go
on after it, and τ_n = 1 if the rule calls for stopping sampling for the first time at n, given
the stream X_1, X_2, ….
Similar to the φ_n functions, given the τ_n functions one can determine when to stop; or,
given the stopping rule, one can determine the τ_n functions.
The (terminal) decision rule is simply defined as a collection of decision rules of the type
used in problems with fixed sample size (Lindgren, 1971). Thus, a terminal decision rule
consists of a series of decision functions δ_0, δ_1(X_1), δ_2(X_1, X_2), …, where δ_i(X_1, …, X_i)
is the action taken after sampling has stopped upon observing X_i. Thus, δ_0 is the action
taken when no data are available and a decision is made immediately.
Since the overall problem at hand is to determine the final sample size at which φ says to
stop and make a decision, it is convenient to discuss this issue in terms of the stopping time
N. Thus, given a sequence of observations X_1, X_2, …, the function τ_n gives the
conditional probability that precisely n observations are called for:

  P(N = n | X_1, X_2, …) = τ_n(X_1, …, X_n).
Further, the probability of stopping at the nth observation is simply the expected value of
τ_n under the given decision and stopping rule:

  P_θ(N = n) = E_θ[ τ_n(X_1, …, X_n) ]

(Lindgren, 1971).
More formally, the stopping time is the random function of X given (for a nonrandomized
sequential procedure) by

  N(X) = min{ n ≥ 0 : τ_n(X^n) = 1 }

(Berger, 1985).
Thus, {N = n} is the set of observations for which the sequential procedure stops at time
n; and hence there is nothing to prevent N = ∞, as shown in Berger (1985):

  P(N < ∞) = P(N = 0) + ∑_{n=1}^{∞} P(N = n)
           = P(N = 0) + ∑_{n=1}^{∞} ∫_{{N=n}} dF^n(x^n | θ).
But in applying sequential procedures it is desirable to have a rule for which
P(N < ∞) = 1.
Section 7.3: Bayesian Sequential Procedures
Recall from section 7.1 that a preposterior analysis includes both the decision loss and the
sampling or experiment cost when a particular action is taken for a particular state of
nature. If a sequence of observations is taken at a cost of c(n) for n observations and a
sequential procedure d = (φ, δ) is used, the loss incurred is

  l(θ, d) + c(N),

where N is the random variable whose value is the number of observations actually used
in reaching a decision. The expected loss, or risk function, is subsequently expressed as

  r(θ, d) = E[ l(θ, d) + c(N) ].

The Bayes risk of a sequential procedure d = (φ, δ) is defined to be

  r(π, d) = E^π[ r(θ, d) ].
Bayesian analysis with fixed sample size problems is straightforward; Bayesian
sequential analysis is quite cumbersome. The objective of Bayesian sequential
analysis is to find the sequential procedure which minimizes r(π, d); the minimum Bayes
risk is defined as

  r(π) = inf_d r(π, d).
Note that the idea here is that at every stage of the procedure, one should compare the
(posterior) Bayes risk of making an immediate decision with the “expected” (posterior)
Bayes risk of continuing (Berger, 1985). Now suppose that there are potentially n stages,
corresponding to observations X_1, X_2, …, X_n. If the Bayes decision rule δ^π calls for
taking all n observations in a particular application, the terminal decision is made
according to the Bayes criterion. That is, the posterior distribution is applied to the given
loss function to obtain averages over which the possible actions are ordered and the
optimum chosen (Lindgren, 1971). Assuming that l(θ, a, n) ≥ −K, it then follows that
  r(π, d) = E^π[ r(θ, d) ]
          = P(N = 0) E^π[ l(θ, δ_0, 0) ] + ∑_{n=1}^{∞} ∫_Θ ∫_{{N=n}} l(θ, δ_n(x^n), n) dF^n(x^n | θ) dπ(θ)
          = P(N = 0) E^π[ l(θ, δ_0, 0) ] + ∑_{n=1}^{∞} ∫_{{N=n}} E^{π(θ | x^n)}[ l(θ, δ_n(x^n), n) ] dF_m^n(x^n)
          = P(N = 0) r(π, δ_0, 0) + ∑_{n=1}^{∞} ∫_{{N=n}} r(π(· | x^n), δ_n, n) dF_m^n(x^n).
(Note that the subscript m refers to the marginal density.) Thus, r(π, d) is minimized if
δ_0 and δ_n are chosen to minimize the posterior expected loss. This is equivalent to the
fixed sample size issue (Berger, 1985). Subsequently, if the stopping rule τ suggests at
least n – k observations, then the problem of whether to stop (using the Bayes terminal
decision) or obtain more data is solved by comparing two conditional expected losses
(Lindgren, 1971). One of these conditional expressions is based on those n – k
observations, while the other allows further observations. This is summarized by the
following theorem found in Lindgren (1971).
If τ is a given stopping rule and π a given prior, then the Bayes risk r(π, d)
is minimized, among procedures using that τ, by the decision rule δ where δ_i
is the Bayes decision rule based on the first i observations considered as a
sample of fixed sample size i.
Thus, half the problem of determining the Bayes sequential procedure is settled:
regardless of the stopping rule used, the optimal action, once one has stopped, is simply
the Bayes action for the given observations (Berger, 1985). Furthermore, if the posterior
Bayes risk at time n is a constant, independent of the actual observations, then the
optimal stopping rule corresponds to the choice, before the experiment, of an optimal
fixed sample size n. However, this situation is rare. When the posterior Bayes risks do
depend on the observations, the problem of determining the optimal stopping time
remains, and it is the focus of the next section.
Section 7.4: Optimal Stopping Time
The difficulty of determining a Bayes sequential stopping rule resides in the fact that
there is an infinite “stream of identically distributed observations” upon which to base a
sequential test. However, this is made easier if the Bayes stopping rule can be
determined stage by stage. This simplification arises because at every stage the future
looks exactly the same (Berger, 1985). The conditional expected loss (given the
observations) is the function corresponding to the posterior probabilities at each stage.
Hence, those posterior probabilities depend on the observations (at hand) which enter the
decision process through the posterior distribution (Lindgren, 1971).
There are many situations which may arise when trying to determine the optimal stopping
rule. Before considering these situations it is important to develop the framework for
such a test. So, for a given procedure d = (τ, δ) and given losses l_a and l_b corresponding to
the erroneous acceptance and rejection of the null hypothesis, respectively, the expected
loss is

  r(θ; τ, δ) = l_b α + c E(N | H_0) ≡ R_0, if H_0 is true,
  r(θ; τ, δ) = l_a β + c E(N | H_1) ≡ R_1, if H_1 is true,
where α and β are the usual error sizes, c is the cost per observation, and N is the
number of observations used in making a decision. Then, assuming a prior probability π
assigned to H_0, the Bayes risk is

  r(π, d) = π R_0 + (1 − π) R_1,

and the Bayes procedure is that which minimizes r(π, d). Lindgren (1971) demonstrates that if
no observations are permitted by the stopping rule τ, the minimum Bayes risk is just its
value when the no-data Bayes procedure is used (as mentioned previously):
  r_0(π) = min_d r(π, d)
         = π l_b,        if π ≤ l_a / (l_a + l_b),
         = (1 − π) l_a,  if π > l_a / (l_a + l_b).
This is shown graphically in figure 7-1 below. Recall that this was shown in section 2, as
it is essentially the no-data problem or “nonstatistical” decision problem.
Figure 7-1: the no-data Bayes risk r_0(π) as a function of π.
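The piecewise no-data risk r_0(π) = min{π l_b, (1 − π) l_a} can be checked against the direct minimization; the loss values below are hypothetical.

```python
# Sketch: no-data Bayes risk for testing H0 vs H1 with prior pi on H0.
# Accepting H0 risks (1-pi)*l_a (erroneous acceptance under H1);
# rejecting risks pi*l_b (erroneous rejection under H0). Bayes takes the smaller.

l_a, l_b = 2.0, 1.0                    # hypothetical losses

def r0(pi):
    return min(pi * l_b, (1.0 - pi) * l_a)

# Piecewise form: pi*l_b below the breakpoint l_a/(l_a + l_b), else (1-pi)*l_a.
breakpoint_ = l_a / (l_a + l_b)        # = 2/3 here
print(r0(0.5))                         # 0.5: below the breakpoint, so pi*l_b
print(r0(0.9))                         # 0.2: above the breakpoint, so (1-pi)*l_a
```

The curve rises linearly from 0 at π = 0, peaks at the breakpoint, and falls back to 0 at π = 1, which is the tent shape figure 7-1 depicts.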
Now consider a set of rules, C_1, which call for obtaining at least one observation. The
minimum Bayes risk over this set of rules is denoted

  r_1(π) = min_{C_1} [ π R_0 + (1 − π) R_1 ].

Figure 7-2: the minimum Bayes risk r_1(π) over rules taking at least one observation.
The figure above represents a general graphical view of this function of π, which has the
following properties (Lindgren, 1971):
(a) r_1(π) ≥ c,
(b) r_1(0) = r_1(1) = c,
(c) the graph of r_1(π) is concave down.
The first property reflects that at least one observation must be taken, at cost c. The
second property reflects that if the prior is either 0 or 1, then no amount of sampling will
alter the distribution, so the risk is just c plus the zero of the no-data curve. The last
property simply implies that the curve r_1(π) is made up of a “family of line segments
joining pairs of points (0, R_0) and (1, R_1).” Berger (1985) and Lindgren (1971) both
establish the concavity more formally. With these properties, it is now possible to
compare the curves graphically. For instance, if the graph of r_1(π) lies above r_0(π), then
this suggests not taking any observations. This visual analysis can be carried out for the
given hypotheses.
Lindgren (1971) provides three steps for determining the appropriate Bayes sequential
procedure. This can be summarized as follows:
1. Given losses l_a and l_b, and cost c per observation, the values of C and D as shown
in figure 7-2 can be determined from the graphs of r_0(π) and r_1(π).
2. If π ≤ C, reject H_0, and if π ≥ D, do not reject H_0; otherwise sample, and
determine A and B (from C and D and π), where

  A = C(1 − π) / [(1 − C)π]  and  B = D(1 − π) / [(1 − D)π].
This is established since the condition for continuing sampling can be expressed in
terms of the likelihood ratio for the first n observations such that A < Λ_n < B.
3. After each observation X_n determine the corresponding likelihood ratio Λ_n for
X_1, …, X_n; if Λ_n ≤ A, reject H_0, and if Λ_n ≥ B, do not reject H_0. If A < Λ_n < B,
take another observation.
The drawback to this procedure is that r_1(π) is rarely known; however, given the
properties of r_1(π), it is possible to determine the form of the Bayes sequential procedure.
Section 7.5: Sequential Probability Ratio Test
The sequential probability ratio test provides one of the first applications of sequential
analysis within statistical analysis (French and Insua, 2000). It tests a simple hypothesis
H_0 against a simple alternative H_1. As constructed in the previous subsection, the
following constants are defined, where α and β are the respective error sizes:

  A = α / (1 − β),  B = (1 − α) / β.
These constants form the limits for the likelihood ratio Λ_n computed after each
observation is taken. Then if Λ_n ≤ A, the sampling stops and the null hypothesis is
rejected. If Λ_n ≥ B, the sampling stops and the hypothesis is not rejected. And
subsequently, if A < Λ_n < B, then another observation is taken.
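A runnable sketch of the test for two illustrative normal hypotheses, assuming Λ_n is the likelihood ratio of H_0 to H_1 (so small values favor H_1) and using the Wald boundary approximations A = α/(1 − β), B = (1 − α)/β.

```python
import math

# Sketch: sequential probability ratio test of H0: theta = 0 vs H1: theta = 1
# for N(theta, 1) data. Lambda_n = product of f0(x_i)/f1(x_i), so small
# values favor H1. Boundaries use the Wald approximations.

alpha, beta = 0.05, 0.05
A = alpha / (1 - beta)        # reject H0 when Lambda_n <= A
B = (1 - alpha) / beta        # accept H0 when Lambda_n >= B

def lr(x):
    """f0(x)/f1(x) for N(0,1) vs N(1,1): simplifies to exp(0.5 - x)."""
    return math.exp(0.5 - x)

def sprt(xs):
    lam = 1.0
    for n, x in enumerate(xs, start=1):
        lam *= lr(x)
        if lam <= A:
            return ("reject H0", n)
        if lam >= B:
            return ("accept H0", n)
    return ("continue sampling", len(xs))

print(sprt([1.2, 0.9, 1.5, 1.1, 0.8, 1.3]))   # -> ('reject H0', 5)
```

Data clustered near 1 drive Λ_n down past A, so the test stops and rejects H_0 after five observations without using the sixth.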
Notice that this is essentially the Bayes sequential procedure described earlier. The
essential difference is that there is no formal loss structure in this instance; however, for
any given sequential probability ratio test there exist losses and a sampling cost per
observation such that the Bayes sequential test for some prior distribution is the given
sequential probability ratio test. Thus, the analysis, both statistically and graphically,
proceeds in the same manner.
The decision theoretic approach of balancing the competing factors of expected losses
and sampling costs resides in both the Bayesian and non-Bayesian approaches, and this
makes it difficult to determine an optimal policy or rule. However, French and Insua
(2000) state that there is an intuitive structure for the optimal policy. That is to say,
if the likelihood ratio of the observations is not sufficiently small or large, then keep
sampling; otherwise, stop and take action. Even if the constant values of A and B remain
undefined, knowing the general structure of the optimal policy allows for finding possibly
good values. Berger (1985), French and Insua (2000), and Lindgren (1971) provide
further details on the development, theory and other methodological issues (constraints)
with regards to the sequential probability ratio test.
Section 8: Markov Decision Processes
The previous section provided a framework for sequential decision problems. Such
problems can be formulated in Markovian ways, which has led to a large body of theory
on Markov decision processes. Decision processes which make a sequence of
probabilistic transitions, and in which the transition probabilities depend only on the
current state of the system, are said to undergo a Markov process. If, in addition, there
are alternative actions amongst which a choice can be made, the process is referred to as
a Markov decision process. Since such processes are built upon Markov chains, they
inherit the Markovian properties, which provide favorable analytic abilities. This section
begins by introducing the underlying concepts of Markov chains and their properties, as
well as other attributes of both finite and infinite chains. It concludes with the
framework and computational aspects of Markov decision processes for determining
optimal policies.
Section 8.1: Sequential Decision Problems Formulated in
Markovian Ways
The previous section introduced a framework for sequential decision making under
uncertainty. Intervals of time separate the stages at which decisions must be made, and
the effect of a decision at any stage is to influence the transition from current to
succeeding state. If the transition from state to state is a probabilistic sequence and the
transition probability from current to succeeding state is only dependent on the current
state, then the decision process is said to have the Markovian property. This section
begins by providing the basic notation and framework for sequential decision making
processes which can be formulated in Markovian ways.
A random process when observed over time will be in different states at different times;
and this change in state is referred to as state transition. Suppose a process has N states
and its behavior over time is specified by the state the “system” is in at each stage in
time. In this report, the stage structure for the sequential problems considered (whether
they are time based or not) will be discrete. More formally, a typical random process X is
a family {X_t : t ∈ T} of random variables indexed by some set T. Further, if we define
T = {0, 1, 2, ...}, the system is a discrete-time process (Grimmett and Stirzaker, 1982). In
sequential decision problems, we may define {X_0, X_1, X_2, ...} to be a sequence of random
variables which take values in some countable set S, the state space. Thus each X_n is a
discrete random variable taking on one of N possible values, where N = |S|. It is evident
that this system is not deterministic, and hence its evolution can neither be predicted with
certainty nor controlled exactly. It is possible, however, to deduce some statistical
description of the behavior of the process (Buchanan, 1982).
Definition: The process X is a Markov chain if it satisfies the Markov
property

P(X_n = s | X_0, X_1, ..., X_{n−1}) = P(X_n = s | X_{n−1})

for all n ≥ 1 and s ∈ S. (Grimmett and Stirzaker, 1982)
For simplicity (of notation), let p_ij(n) denote the transition probability from i to j if the
transition takes place at the nth step. The Markovian assumption, namely that only the
current stage (n) and the current state (i) are relevant to the determination of the future
behavior of the process, is described as memorylessness. This property accounts for the
attractiveness of Markov chains in sequential decision problems.
Much of the theory of Markov chains is simplified by the condition that S is finite (Grimmett
and Stirzaker, 1982). A finite Markov chain is said to be time-homogeneous if for every
pair of states i and j, p_ij(n) = p_ij for all n. Such a chain is also described as having
stationary transition probabilities. The transition matrix for a Markov chain is the N × N
matrix P = [p_ij], where N is the number of states. The transition matrix for a
Markov chain is a stochastic matrix.
Theorem: P is a stochastic matrix, which is to say that
(a) P has non-negative entries: p_ij ≥ 0;
(b) P has row sums equal to one: Σ_j p_ij = 1.
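The two defining conditions of the theorem are easy to verify numerically, and a row of P can be used directly to sample one transition of the chain. The 3-state matrix below is invented for illustration.

```python
import random

def is_stochastic(P, tol=1e-9):
    """Check the two conditions of the theorem: non-negative entries
    and row sums equal to one."""
    return all(p >= 0 for row in P for p in row) and \
           all(abs(sum(row) - 1.0) < tol for row in P)

def step(P, i, rng):
    """One transition of the chain from state i, sampled from row i of P."""
    u, acc = rng.random(), 0.0
    for j, p in enumerate(P[i]):
        acc += p
        if u < acc:
            return j
    return len(P[i]) - 1   # guard against floating point round-off

# A made-up 3-state transition matrix.
P = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.0, 0.5, 0.5]]

print(is_stochastic(P))    # True: P is a valid stochastic matrix

rng = random.Random(0)
state = 0
path = [state]
for _ in range(10):
    state = step(P, state, rng)
    path.append(state)
print(path)                # one simulated trajectory of the chain
```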
This summarizes the basic elements of Markov chains. The remainder of this section
presents the favorable attributes of Markov chains which demonstrate their potential
applicability, especially in infinite stage Markov decision processes.
Section 8.2: Multistep Transition and State Probabilities
The previous subsection focused on finite stage Markov decision problems. This part
now shifts to work with infinite stage Markovian decision problems. The transition
probability matrix P describes the likelihood of transitions from state to state; and since
our interest lies in the evolution of X, it becomes important to determine the multistep
transition probability values.
Define ϕ_ij(n) = P(X_n = j | X_0 = i) for 1 ≤ i, j ≤ N and n = 0, 1, 2, ... as the n-step transition
probabilities for the Markov chain (defined by P). If the system starts in state i at time 0
and is in state j at time (n + 1), the result is achieved by a transition in n steps from state i
to some state k (for which the probability is ϕ_ik(n)) followed by a transition from k to j (for
which the probability is p_kj). Thus, given that the state at the nth step is k, the
probability of transition to j in (n + 1) steps is ϕ_ik(n) p_kj. These N routes for the
transitions from i to j through the nth state are mutually exclusive and exhaustive, hence

ϕ_ij(n + 1) = Σ_{k=1}^{N} ϕ_ik(n) p_kj

(Buchanan, 1982).
Buchanan (1982) provides the following justification:

ϕ_ij(n + 1) = P(X_{n+1} = j | X_0 = i)
  = Σ_{k=1}^{N} P(X_{n+1} = j, X_n = k | X_0 = i)
  = Σ_{k=1}^{N} P(X_{n+1} = j | X_n = k, X_0 = i) P(X_n = k | X_0 = i)   (from the definition of conditional probability)
  = Σ_{k=1}^{N} P(X_{n+1} = j | X_n = k) P(X_n = k | X_0 = i)   (by the Markovian assumption)
  = Σ_k ϕ_ik(n) p_kj.
Since multistep transition probabilities are probabilities, they must also adhere to the
regular conditions of probability. That is, they must satisfy 0 ≤ ϕ_ij(n) ≤ 1 and
Σ_{j=1}^{N} ϕ_ij(n) = 1. The next step is to formulate this in matrix form,
which in turn develops the Chapman-Kolmogorov equations (Grimmett and Stirzaker,
1982).
The equation for the multistep transition probabilities can be represented quite simply in
(stochastic) matrix form. Define Φ(n) = [ϕ_ij(n)] as the n-step transition probability
matrix for all n. From the justification above, we can rewrite the final statement as

Φ(n + 1) = Φ(n)P for n = 0, 1, ...,
Φ(0) = I,

where I is the N × N identity matrix. So,

Φ(1) = Φ(0)P = P,
Φ(2) = Φ(1)P = P^2,
Φ(3) = Φ(2)P = P^3,

and thus more generally Φ(n) = Φ(n − 1)P = P^(n−1) P = P^n. This is in fact a special case
of the Chapman-Kolmogorov equations as stated in Grimmett and Stirzaker (1982):

ϕ_ij(m + n) = Σ_k ϕ_ik(m) ϕ_kj(n).

Hence P^(m+n) = P^m P^n, and so Φ(n) is the nth power P^n.
Now define Φ = lim_{n→∞} P^n (if it exists) and call Φ the limiting multistep transition
probability matrix, or, in summary, the limiting distribution for the chain. Note that as the
number of transitions increases, the influence of the original state diminishes. This is
expected, since the Markovian property assumes no memory in the process. It is
important to keep clear that the probabilities of transitions from state to state are fully
described by P, while Φ summarizes the long-term behavior of the process. Thus, the
above result relates long-term development to short-term development, and informs us
how X_n depends on X_0.
Section 8.3: Some Classes of Markov Chains
This next subsection considers the way in which the states of a Markov chain are related
to each other. This will lead to a classification of the states. First we begin with some
basic terminology which essentially focuses on the term communication (which exists
between states).
Definition: It is said that i communicates with j, written i → j, if the
chain may ever visit j with positive probability, starting from i. That is,
i → j if ϕ_ij(m) > 0 for some m ≥ 0. It is also said that i and j
intercommunicate if i → j and j → i, in which case we write i ↔ j.
Thus, two states intercommunicate if transition is possible in either direction in some
finite number of steps. The number need not be the same for the directions of
communication (Buchanan, 1982). Grimmett and Stirzaker (1982) proceed by stating the
following theorem.
Theorem: If i ↔ j then
(a) i and j have the same period,
(b) i is transient if and only if j is transient,
(c) i is null persistent if and only if j is null persistent.
The first property, period, is defined as the greatest common divisor of the epochs at which
return is possible (Grimmett and Stirzaker, 1982). Further, a state is said to be periodic
if this divisor is greater than 1 and aperiodic if it is equal to one. The second is probably
the most important property of the relationship of communication. Buchanan (1982)
demonstrates this property by claiming that since i ↔ j and j ↔ k, there exist numbers
n_1 and n_2 such that ϕ_ij(n_1) and ϕ_jk(n_2) are both positive. From the Chapman-
Kolmogorov equation, we get

ϕ_ik(n_1 + n_2) = Σ_{l=1}^{N} ϕ_il(n_1) ϕ_lk(n_2) ≥ ϕ_ij(n_1) ϕ_jk(n_2) > 0.
This establishes the communication i ↔ k. The last property, which draws upon
the term persistent, simply claims that state i is persistent if the probability of eventual
return to i, having started from i, is 1. Subsequently, if this probability is strictly less than
1, i is transient. In any Markov chain, at least one set of states can be found in which all
members of the set communicate with one another (Buchanan, 1982).
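Communication can be decided from the transition matrix alone: i → j holds exactly when j is reachable from i along edges with positive one-step probability. A sketch, using an invented chain in which state 2 is absorbing:

```python
def reachable(P, i):
    """Set of states j with i -> j, i.e. phi_ij(m) > 0 for some m >= 0
    (a state always reaches itself with m = 0)."""
    seen, stack = {i}, [i]
    while stack:
        k = stack.pop()
        for j, p in enumerate(P[k]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def communicating_classes(P):
    """Partition the states into classes of intercommunicating states
    (i <-> j iff each reaches the other)."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        cls = {j for j in range(n) if j in reach[i] and i in reach[j]}
        classes.append(sorted(cls))
        assigned |= cls
    return classes

# States 0 and 1 intercommunicate; state 2 is absorbing, so once entered
# the chain never leaves it, making 0 and 1 transient.
P = [[0.5, 0.4, 0.1],
     [0.6, 0.3, 0.1],
     [0.0, 0.0, 1.0]]
print(communicating_classes(P))   # [[0, 1], [2]]
```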
There are many classifications of states besides the ones already mentioned. We now
define an ergodic set of states as a set in which every state communicates with every
other state of the set. Grimmett and Stirzaker (1982) define a state to be ergodic if it is
persistent, non-null, and aperiodic. Thus, an ergodic set of states is a set in which every
state can be reached from every other state of the set, and once the set is entered there
can be no transition out of it. Of course, an ergodic state is an element of an ergodic set.
Buchanan (1982) suggests considering ergodic sets to be “collective” trapping states,
whereby every Markov chain contains at least one ergodic set. It should also be noted
that there may be states which do not belong to any ergodic set; these are referred to as
transient states.
We now provide another definition or classification of states which leads to a
classification of chains.
Definition: A set C of states is called
(a) closed if p_ij = 0 for all i ∈ C, j ∉ C;
(b) irreducible if i ↔ j for all i, j ∈ C.

Once a chain has taken a value in a closed set C of states, it never leaves C. A closed set
consisting of a single state is an absorbing state. Since ↔ is an equivalence relation, an
irreducible set C is said to be aperiodic if all states in the set are aperiodic. (Grimmett and
Stirzaker, 1982)
The final remark to be made here regards defining a regular Markov chain. A Markov
chain whose states form a single ergodic set is an ergodic chain. If such a chain has all
entries of the multistep transition probability matrix Φ(n) = P^n greater than zero for
some n, then it is said to be regular. Since Φ(n + 1) = Φ(n)P, P^(n+1) will have no zero
entries if P^n has none (Buchanan, 1982). Thus, once a number of transitions is reached
at which all state-to-state transitions are possible, additional transitions cannot return
any transition probability to zero. An ergodic chain which is regular is subsequently
termed a regular Markov chain.
It is a property of regular Markov chains that the limiting multistep transition probability
matrix Φ = [ϕ_ij] has identical rows. Denote the state probability at time n by π_i(n).
Buchanan (1982) shows that the state probability vector satisfies π(n) = π(0)Φ(n) = π(0)P^n
at time n. From this, we get π = π(0)Φ by taking the limit of both sides as n tends to
infinity. Buchanan (1982) uses this to show that for a regular Markov chain the
limiting state probability vector π is independent of the initial state probability vector
π(0) and is the same as the identical rows of the limiting multistep transition probability
matrix Φ. This concludes the introduction to Markov chains and their properties.
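For a regular chain, the identical rows of Φ and the independence of π from π(0) can be observed by raising P to a high power. A sketch with an invented two-state matrix, whose limiting vector works out to (0.8, 0.2):

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def limit_matrix(P, steps=200):
    """Approximate Phi = lim P^n by repeated multiplication."""
    result = P
    for _ in range(steps):
        result = mat_mul(result, P)
    return result

P = [[0.9, 0.1],
     [0.4, 0.6]]          # regular: all entries of P itself are positive

Phi = limit_matrix(P)
print([round(x, 6) for x in Phi[0]])   # first row of the limit
print([round(x, 6) for x in Phi[1]])   # second row: identical to the first

# The limiting state probabilities are the common row, regardless of pi(0):
pi0 = [1.0, 0.0]
pi = [sum(pi0[i] * Phi[i][j] for i in range(2)) for j in range(2)]
print([round(x, 6) for x in pi])
```

Changing `pi0` to any other probability vector leaves `pi` unchanged, which is the independence from the initial distribution claimed above.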
Section 8.4: Markov Decision Processes
This section now extends the properties and methods of Markov chains to develop the
body of knowledge known as Markov decision processes. In a Markov decision process
there is a system with states labeled i = 1, ..., N. In state i there are a number of actions
available. If action k is chosen, a return r(k, i) is generated and the system goes to state j
with probability p(i, j, k). A set of actions, one for each state, constitutes a policy.
Under a given policy δ the system is a Markov process with returns and has steady state
gain g(δ). A policy which has the highest gain (or the lowest in a minimization
problem) is an optimal policy. A policy whose gain differs from the optimal by not more
than a known amount is called a tolerance optimal policy. The formulation and solution
of Markov decision problems will be illustrated by extending the decision structure.
Suppose a system has a finite number of states, θ_i ∈ Θ = {θ_1, θ_2, ..., θ_m}. The system or
process may move between states at each of an infinite series of equal stages (or time steps)
t = 1, 2, .... At each time step t, the decision maker is to select an action a_t = a_k ∈ A,
where A is finite. Depending on the action a_k chosen by the decision maker at time t and
the current state θ_t, the system evolves to its next state θ_{t+1}. Then the probabilities

p_ij^k = P(θ_{t+1} = θ_j | θ_t = θ_i, a_k)

describe the evolution of the system. Thus, it is assumed that the probabilities hold the
Markovian property and subsequently depend neither on the earlier states of the
process nor on any of the earlier actions. Recall that in a decision framework, every action
taken, together with the current state, results in a consequence. Similarly, we define
c_t(a_k, θ_i), the consequence a decision maker receives at each stage. Further, this can be
extended to show the overall “time-stream” of consequences (French and Insua, 2000).
Now that Markov decision processes have been placed within a decision theoretic
framework, we now extend this to include the concepts of utility theory. That is, we now
place things within the context of rewards rather than losses. (Note that the reward includes
the cost of taking the action.) Assume that the reward received at stage t is
u(c_t(a_k, θ_i)) = r_t(k, i). It follows that all r_t(k, i) are bounded. Since the aim is to find
the optimal policy with minimal long-term average costs, it is necessary to consider the
increase in the value of state i from stage to stage. Hastings and Mello (1978) provide the
formulation for the bounds and convergence test.
There are two common ways of evaluating the overall time-stream of rewards in Markov
decision processes. The first includes a discount factor ρ and a discounted additive
utility function such that

u(c_1(a_1, θ_1), c_2(a_2, θ_2), ...) = Σ_{t=1}^{∞} ρ^(t−1) r_t(k, i).
The other common approach is to choose a series of actions to maximize the average
reward:

u(c_1(a_1, θ_1), c_2(a_2, θ_2), ...) = lim_{T→∞} (1/T) Σ_{t=1}^{T} r_t(k, i).
In the first approach presented, the decision maker seeks a policy for choosing actions
which maximizes the total expected discounted reward. The latter approach gives the
long-run average monetary return or the expected monetary value of the policy.
Finite Markovian decision processes decide upon a policy to be applied at a (previously
known) finite number of decision epochs, with the objective of optimizing some measure
of performance. In certain instances, however, the finite stage aspect may be too limiting
a restriction, and removing it “unbounds” the criterion presented above. An infinite
Markovian decision process poses an approximation problem of its own, but owing to the
attributes of Markov chains, the infinite approximation is a feasible proposition. For a
finite stage problem, the criterion for discriminating between policies was the
maximization of the expected reward associated with a policy. In the infinite stage
problem, the cumulative expected reward from using a given policy will not be finite and
hence will increase without bound. Discounting the rewards by some factor, as stated
above, obviates this problem and consequently gives an expected discounted reward for
each policy which is finite. There are other propositions for handling infinite Markov
decision processes. Some of these alternatives are presented in Hastings and Mello (1978)
and Buchanan (1982). French and Insua (2000) also provide a discussion of the merits of
some alternative utility functions for solving Markov decision processes.
Now a decision policy (δ_1, δ_2, ...) for a Markovian decision process is a sequence of
decision rules which prescribe, for each stage t, the action the decision maker should take
if the current state of the process is θ_t (French and Insua, 2000). If the decision rule
selected does not depend on the stage t, the policy is said to be stationary. A
stationary policy applied to a Markov decision process determines the transition
probabilities from state to state and thus induces a Markov chain. If it is assumed that
this Markov chain forms a single ergodic set, then for each policy there exists a state
which can be reached from any other state under the policy. The attributes of Markov
chains can then be used to solve such decision processes. (Computational issues are
expounded in section 12.)
There are several algorithms (or classes of algorithms) which can be utilized for solving
Markov decision processes. Linear programming is one approach. However, value
iteration and policy iteration are the most common. The first of these, value iteration,
considers that if the decision maker takes action a_k, receives r_t(k, i) immediately, then
passes to state θ_j with probability p_ij^k and continues optimally, receiving the expected
total discounted reward, the resulting sequence of value functions will eventually
converge for any starting point. There are many variants of this algorithmic procedure.
French and Insua (2000) illustrate a couple of these, and Hastings and Mello (1978)
present another variant. The latter algorithmic procedure, policy iteration, has a similar
objective. It begins with an arbitrary (stationary) policy instead of an arbitrary function
as in the value iteration algorithm, evaluates the function which defines the total expected
discounted reward, and seeks an improved policy until convergence is reached. This
describes the possible approaches for solving infinite Markov decision processes. Finite
Markov decision processes are simpler and can be solved by backward dynamic
programming or backward induction, which is presented in section 13.
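A minimal sketch of value iteration for a discounted problem, in the notation of this section (r(k, i) for rewards, transition probabilities per action, ρ for the discount factor); the two-state, two-action numbers are invented for illustration:

```python
def value_iteration(states, actions, p, r, rho, tol=1e-8):
    """Value iteration for a discounted Markov decision process.
    p[k][i][j] is the transition probability from state i to j under
    action k, r[k][i] the immediate reward; returns the value function
    and a stationary optimal policy."""
    V = [0.0 for _ in states]
    while True:
        # Q[i][k]: reward now plus discounted expected value of the next state.
        Q = [[r[k][i] + rho * sum(p[k][i][j] * V[j] for j in states)
              for k in actions] for i in states]
        V_new = [max(Q[i]) for i in states]
        if max(abs(V_new[i] - V[i]) for i in states) < tol:
            policy = [max(actions, key=lambda k: Q[i][k]) for i in states]
            return V_new, policy
        V = V_new

# Invented two-state, two-action example: action 1 pays less now in state 0
# but steers the system toward the better-rewarded state 1.
states, actions = range(2), range(2)
p = [  # p[k][i][j]
    [[0.8, 0.2], [0.3, 0.7]],   # action 0
    [[0.2, 0.8], [0.1, 0.9]],   # action 1
]
r = [
    [1.0, 2.0],                 # r(0, i)
    [0.5, 2.5],                 # r(1, i)
]
V, policy = value_iteration(states, actions, p, r, rho=0.9)
print([round(v, 3) for v in V], policy)
```

Because ρ < 1, the update is a contraction, which is why the sequence of value functions converges from any starting point, as stated above.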
Part IV: Model Building for Decision Analysis
86
Part IV: MODEL BUILDING
FOR DECISION ANALYSIS:
Topics in Probabilistic Graphical Models
Methodologies for decision processes (such as those presented in prior sections) are based
upon probability and utility, or different ways of handling uncertainty. Subsequently,
certain criteria should be satisfied as a provision for guiding decision-based problems.
For instance, French and Insua list axiomatic basis, lack of counterexamples, feasibility,
robustness, “transparency to users”, and “comparability with a wider philosophy” as
criteria which should be satisfactorily determined. The first two provide the theoretical
merit; the second two, feasibility and robustness, relate to the practicality of performing
the analysis; while the latter two entail the implementation of the analysis results. These
latter parts of the report focus on the practicality and implications of decision processes
and attempt to bridge the gap between the conceptual and practical components of
decision theory. However, none of this is applicable without a decision model upon
which the criteria are built and subsequently assessed.
This part of the report focuses on building models to illustrate decision problems under
various assumptions. Emphasis is placed on probabilistic graphical models. These
probability models are extensions of the procedures and methodology presented in earlier
sections of this report. Concepts of problem structuring, parameters and attributes lay the
foundation for model building and are introduced in the description of each inception of
the model building process. Each section presented in Part IV describes a modeling
approach for a particular formulation and framework for a given decision problem.
Section 9: Graphical Representation of Decision Problems
Section 1 introduced the framework for decision problems. The components of a
decision problem provide the construction for a graphical representation. An introduction
to graphical models for describing decision processes is illustrated through the
conception of decision trees and influence diagrams. A comparative assessment of these
two formulations demonstrates the strengths and weaknesses of the approaches in both
structural and practical settings. In this section, a formulation of a general decision
problem is revisited in a concise manner in order to introduce the assumptions and
criteria required for model building.
Section 9.1: Modeling Decision Problems
Sections 1 through 3 demonstrated the key concepts and ideas pertaining to decision
modeling. In essence, decision problems consist of actions, states of nature and
consequences. Without forgoing the complexities of modeling such problems, a decision
model can be concisely formulated as
c a → ⊕θ .
That is, the interaction between an action and a state of nature leads to a consequence.
Recall that the choice of action or decision is under the control of the decision maker; but
the state which pertains is beyond his/her control. When both the action space A and the
state space are finite, the model can be presented in a decision table as shown below.
              States of Nature
Actions    θ_1    θ_2    ...    θ_n
a_1        c_11   c_12   ...    c_1n
a_2        c_21   c_22   ...    c_2n
...        ...    ...           ...
a_m        c_m1   c_m2   ...    c_mn
The simplest discussions of decision theory assume that the decision maker chooses an
action (row) in the table, whereupon nature will ‘choose’ a state (column), leading to the
consequence. This tabular representation of decision problems has very close conceptual
and historical connections with game theory (which is not explored in this report).
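Once a probability distribution over the states of nature is supplied, the decision table translates directly into an expected-consequence calculation. A sketch with hypothetical payoff numbers (larger is better):

```python
def best_action(table, prior):
    """Given a decision table (one row of consequences per action, one
    column per state of nature) and a prior over the states, return the
    index of the action with the highest expected consequence."""
    expected = [sum(p * c for p, c in zip(prior, row)) for row in table]
    best = max(range(len(table)), key=lambda a: expected[a])
    return best, expected

# Hypothetical 3-action, 2-state table of payoffs c_ij.
table = [
    [10.0, -2.0],   # a_1: high payoff under theta_1, a loss under theta_2
    [ 4.0,  3.0],   # a_2: moderate payoff in both states
    [ 1.0,  1.0],   # a_3: safe but small
]
prior = [0.4, 0.6]  # P(theta_1), P(theta_2)

best, expected = best_action(table, prior)
print(best, [round(e, 2) for e in expected])   # action index 1 has the highest expected payoff
```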
With respect to the three elements – actions, states and consequences, the terminology
may vary; however, the distinction between these elements is common. This separation
is found throughout statistical decision theory, and leads to a model of decision making in
which there is some representation of uncertainty relating to the unknown state, some
representation of preference between the possible consequences, and a mechanism for
bringing these two representations together to guide choice. This statement merely
emphasizes the adoption of the model a ⊕ θ → c.
Conceptually, the “consequences” considered thus far need not have been numerical.
That is to say, they may be thought of as descriptions of the outcomes of actions chosen
under various possible “states of nature”. Similarly, neither the actions nor the states of
nature need to be thought of quantitatively. The previously defined sets C and A can then
be taken in a general sense, without any further mathematical structure other than the
existence of suitable σ-fields required for the introduction of probability measures
(French and Insua, 2000).
The tabular representation (shown above) is replaced by a functional one when the action
and state spaces are infinite. The notation c: A × Θ → C is commonly used; however,
other authors (such as Savage) employ an equivalent but more succinct notation which
identifies the actions as maps from the state space to the consequence space: a: Θ → C.
Notably, neither notation recognizes the possibility that the set of possible states may
depend upon the action chosen (French and Insua, 2000).
Statistical decision theory, on the other hand, assumes that an observation X = x from an
observation space X is observed according to the conditional distribution P_X(· | θ). Thus,
decision problems can be formulated through D = A^X = {d | d: X → A}, which relates the
decision maker's actions to each possible observation and presents the set of all
possible choices. A choice d ∈ D then induces a probability distribution for each pair
(d, θ) ∈ D × Θ as follows:

P_C(c′ | θ) = ∫_{x : c(d(x), θ) = c′} dP_X(x | θ).

Consequently, this structure of statistical decision theory reduces to the general form of a
decision problem. Thus, three representations of decision modeling have been presented,
each shown to have the same general structure.
Section 9.2: Model Building – Fundamental and Means Objectives
Modeling a decision problem (in practicality) requires three fundamental steps, as stated
by Robert T. Clemen in Making Hard Decisions. These can be summarized concisely as
(1) identifying and structuring the values and objectives, (2) structuring the elements of
the decision situation into a logical framework, and (3) refining the precise definition of
all of the elements of the decision model. The aforementioned steps not only provide the
foundation for the decision making process but also provide the construction process for
a graphical representation of decision problems and methodology.
Much of the work presented in this report has focused on single objective problems;
however, often, multiple objectives or multiattribute problems surface which encompass
conflicting goals. For instance, the investor may wish to maximize the financial return of
an investment portfolio but also minimize the volatility of the portfolio’s value. Thus, it
is important to address the issue of value structuring which is necessary in order to
graphically display complex decision processes.
The consequence space C, thus far, has simply been stated as a set of objects which can
be defined on an interval of the real line. When considering multiple objectives, the
consequence space becomes complex and takes on a multiattributed structure. To
clarify, an attribute refers to a factor to be considered; when given a “direction” of
preference, say minimize or maximize, it is termed an objective. Thus, the problem arises
of structuring these objectives in a coherent manner.
Once a set of objectives consistent with the decision context has been established, the
objectives should be separated into means and fundamental objectives. This step
provides a value structuring process, distinguishing between those objectives that
are important because they help achieve other objectives and those that are important
because they reflect the goal: what we really want to accomplish. This process of
arranging attributes holds cognitive advantages for graphical modeling and
representation.
For example, fundamental objectives are organized into hierarchies. The upper levels in a
hierarchy represent more general objectives, and the lower levels explain or describe
important elements of the more general levels. The figure below illustrates a possible
hierarchy that might arise in the context of defining vehicle regulations. A higher level
fundamental objective might be “Maximize Safety,” where lower level fundamental
objectives might be “Minimize Loss of Life,” “Minimize Serious Injuries,” and
“Minimize Minor Injuries.” Furthermore, these can be separated in order to pertain to
particular issues for a particular group. As seen in this example, these objectives are
parsed out separately for adults and children.
Figure 9-1: A fundamental objectives hierarchy for vehicle safety. “Maximize Safety” is the top level objective; “Minimize Loss of Life,” “Minimize Serious Injuries,” and “Minimize Minor Injuries” form the lower level, each parsed out separately for adults and children.
In contrast to a hierarchy organization, means objectives are organized into networks.
For the same example, means objectives may include “Minimize accidents” and
“Maximize Use of Vehicle Safety Features”. There are several other possible means
objectives which may work together (network) in order to accomplish the maximization
of safety. That is means objectives can be connected to several objectives in order to
accomplish the task at hand.
Figure 9-2: A means objectives network for vehicle safety.
Structuring the fundamental objectives into a hierarchy or tree is crucial for a multiple
objective decision process; however, the means objectives network provides an important
basis for generating creative new alternatives. In some instances, a well measured means
objective can substitute for a fundamental objective that is more difficult to measure
(Clemen, 1996). Brownlow and Watson (1987) note that structuring attributes eases the
“cognitive overload” brought about by the complexities of a decision making process. In
reality, problems are much more complex in structure than the one presented here.
However, the graphical interface that this methodology induces still requires the building
of the model or decision problem.

There are several ways in which a decision maker can choose to build such a model.
Although much of this discussion is intuitive, there are methods or techniques which
facilitate this thinking process. French and Insua (2000) refer to “top-down” questions;
subsequently, Clemen (1996) presents Table 9-1, which summarizes four techniques for
organizing the two types of objectives, useful in constructing means objectives networks
and fundamental objectives hierarchies.
Table 9-1: Constructing Means Objectives Networks and Fundamental Objectives Hierarchies

             Fundamental Objectives               Means Objectives
To Move:     Downward in the hierarchy            Away from fundamental objectives
To Ask:      “What do you mean by that?”          “How could you achieve this?”
To Move:     Upward in the hierarchy              Towards fundamental objectives
To Ask:      “Of what more general objective      “Why is that important?”
             is this an aspect?”
[Figure 9-2: a means objectives network containing the nodes “Maximize Safety”, “Minimize Accidents”, “Maximize Driving Quality”, “Minimize Driving Under Alcohol Influence”, “Maintain Vehicle”, “Motivate Purchase of Safety Features”, “Maximize Vehicle Safety”, “Require Safety Features”, “Educate Public”, “Enforce Traffic Laws”, and “Reasonable Traffic Laws”.]
It is obvious that attributes must hold certain properties or meet certain requirements in
order to be useful. Recall, that the intention is to quantify such attributes. Hence, they
must be measurable on some scale for each consequence. French and Insua (2002)
distinguish three types of scale, noting that they may be objective or subjective measures.
The first, a natural scale, provides a direct measure such as cost. The second, a
constructed or subjective scale, defines the objective as well as indicates the impact of a
specific consequence. That is, they are created specifically within the context of the
decision problem, e.g. minimizing caregiver burden. The third scale, a proxy, uses an
attribute which is perceived to describe the objective and is measurable. Keeney and
Raiffa (1976) affirm that no two attributes should measure the same aspect, and yet attributes
should distinguish between consequences. Other practical concerns, such as decision ownership
and feasibility, although important in the practical setting, are not explored in this report.
Section 9.3: Model Building – A Graphical Approach
The process of specifying, structuring, and sorting out the means objectives from the
fundamental objectives is the initial step of building a graphical model of the decision
problem. The focus now turns on the graphical representation of the elements of a
decision problem – decisions or alternatives, uncertain events or outcomes, and
consequences. The most common graphical representation of decision problems is a
decision tree. It has been superseded, in recent years, by influence diagrams.
Decision trees and influence diagrams, both, provide a graphical approach to modeling
decision problems. This (sub)section will focus on the common elements used in such
model building processes. A later section delves into the specifics of each modeling
approach and concludes with a comparative assessment. Nonetheless, both approaches
require the elements of a decision problem. Although notation varies from source to
source, there are three types of nodes which are used both in influence diagrams and
decision trees and can be stated as follows:
• Square nodes represent decisions (or actions) and mathematically are associated
with a decision set D.
• Circular nodes represent random quantities or events which are associated with a
random outcome X.
• Diamond or rounded square nodes represent consequences (value) of the decision
process. These are represented by the set of all possible consequences C.
These shapes are generally referred to as decision nodes, chance nodes and consequence
nodes. Nodes are integrated into graphs or networks, connected by arrows/branches or
(directed) arcs. These nodes are arranged so that they coincide with the progression
of time. These are useful aids in model building. The following two sections describe a
general procedure for constructing and modifying influence diagrams and decision trees
to model the structure of a decision problem.
Section 9.4: Constructing Influence Diagrams
Influence diagrams provide a visual perspective of the decisions, uncertain events (states
of nature), and consequences and the subsequent interrelationships that exist within a
decision problem. An appropriate model ensures the probabilistic structure of a decision
problem, the timing of available information and interdependence of decisions that can be
taken and which may arise under certain states of nature, in a compact form (Marshall
and Oliver, 1995).
Model building of decision problems, as seen, can become quite complex; however,
influence diagrams hold many advantages. First, they provide a framework by which a
decision maker can reject or confirm assumptions and accurately model dependencies
through a graphical display. Secondly, complex decision problems have the innate
ability to become “messy”. Influence diagrams provide a format whereby the large
volume of information can be summarized into relevant and sufficient matters. From a
practical viewpoint, they provide alternative situations and are easily interpretable. The
use of algorithms and numerical techniques also enable an efficient and simple analysis.
The design and construction of influence diagrams are based on the aforementioned three
basic elements – decision nodes, chance (states of nature) nodes, and consequence nodes,
connected by arcs. A node at the beginning of an arc is qualified as a predecessor; and
consequently, a node at the end of an arc is referred to as a successor. The rules for using
arcs to represent relationships among the nodes are shown in figures 9-3 and 9-4. In
general, an arc can represent either relevance or sequence (Clemen, 1996). The direction
of the arc indicates its meaning within the relevance or sequence context.
Figure 9-3:
For example, the figure above illustrates an arrow pointing into a chance node, indicating
relevance. This demonstrates that the predecessor is relevant for assessing the chances
associated with the uncertain event. In the diagram above, the first cluster of nodes
shows an arrow (arc) from A to C. This suggests that the chances (states of nature)
associated with C may be different for different outcomes of A. Likewise, an arrow
pointing from a decision node to a chance node means that the chosen decision is relevant
for assessing the chances associated with the succeeding uncertain event. For instance,
[Figure 9-3, labelled “Relevance”: a first cluster with arcs A → C and B → C, and a second cluster with arcs D → F and E → F.]
the choice taken in decision B is relevant for assessing the chances associated with C’s
possible outcomes. Relevance arcs can also point into consequence or calculation nodes,
indicating that the consequence or calculation depends on the specific outcome of the
predecessor node. The second cluster of nodes shows that consequence node F depends
both on decision D and event E.
Figure 9-4:
When the decision maker has a choice to make, the choice would normally be made on
the basis of information available at the time. Arrows that point to decisions represent
information available at the time of the decision and hence represent sequence (figure 9-4).
Such an arrow indicates that the decision is made knowing the outcome of the
predecessor node. An arrow from a chance node to a decision means that from a decision
maker’s point of view, all uncertainty associated with a chance event is resolved and the
outcome is known when the decision is made. Thus, information is available to the
decision maker regarding the event’s outcome. This is the case, as illustrated in the
figure above, with chance node H and decision node I. The decision maker waits to learn
the outcome of H before making decision I. An arrow from one decision to another
decision simply means that the first decision is made before the second, such as decision
nodes G and I. Thus, the sequential ordering of decisions is shown in an influence
diagram by the path of arcs through the decision nodes.
A simple procedure for constructing an influence diagram to model the structure of a
decision problem is stated in Decision Making and Forecasting by Marshall and Oliver
(1995). These sequential steps are summarized below.
1. Create a preliminary list of decisions and random events (or quantities of interest)
whose outcomes are believed to be important in the formulation of the problem.
Identify the attributes and objectives that are to be used to measure the
consequence of the decisions and outcomes.
2. Name each random quantity and decision. Represent each random quantity with a
circular node and a decision with a square node. Draw them in order of
occurrence from left to right.
3. Identify any influences or dependencies between random quantities and decisions.
Insert directed arcs between nodes that influence one another, with the direction
corresponding to the believed natural influence.
[Figure 9-4, labelled “Sequence”: arcs from chance node H and decision node G into decision node I.]
4. Determine any conditional independencies and represent them accurately.
5. Check to see that there are no directed cycles – i.e. a connected set of arcs in a
directed path that leads out of one node and back to itself.
6. Check that any decision node that occurs before a later decision node has a
directed arc from the former to the latter. Similarly, a chance node known at a
given decision node must be known at any later decision node. This is a
requirement of the principle of coherence (Marshall and Oliver, 1995).
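The acyclicity check of step 5 lends itself to a short sketch. The Python fragment below is illustrative only (the node names and the `has_cycle` helper are not from Marshall and Oliver); it detects a directed path that leads out of a node and back to itself in a diagram stored as a list of arcs:

```python
# Minimal sketch (hypothetical node names): detect a directed cycle in an
# influence diagram stored as a list of (predecessor, successor) arcs.

def has_cycle(arcs):
    """Depth-first search; returns True if some path leads back to itself."""
    graph = {}
    for u, v in arcs:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on current path / finished
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for succ in graph[node]:
            if colour[succ] == GREY:       # back arc closes a directed cycle
                return True
            if colour[succ] == WHITE and visit(succ):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))

# Arcs G -> I and H -> I form no cycle; adding I -> G would violate step 5.
print(has_cycle([("G", "I"), ("H", "I")]))                  # False
print(has_cycle([("G", "I"), ("H", "I"), ("I", "G")]))      # True
```

A diagram that fails this check cannot be a valid influence diagram and must be restructured before analysis.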
This process is fairly simple and thus alternative diagrams can be easily drawn. Clemen
(1996) offers some other remarks on the construction of influence diagrams. First, the
nature of the arc (relevance or sequence) can be ascertained from the context of the arc
within the diagram. Secondly, as stated in step 5 of the procedure above, an influence
diagram should not contain any cycles. Furthermore, although the construction of an
influence diagram may be technically correct, there is no clear-cut supposition to suggest
that the influence diagram constructed is the only correct one. Thus, it cannot be stated
that there is a unique correct diagram but that there are many ways in which a diagram
can appropriately represent a decision problem. The representation that is the most
appropriate is the one that is requisite for the decision maker. That is, a requisite model
contains everything that the decision maker considers important in making the decision
(Raiffa and Schlaifer, 2000). Incorporating all of the important concerns¹ of the decision
is the only way to get a requisite model and adequate representation of the problem.
Section 9.5: Constructing Decision Trees
Influence diagrams are an excellent display of a decision problem’s basic structure;
however, they hide much of the detail which may be pertinent to a decision maker. To
display more of these details, a decision tree can be constructed. Yet, they hold the
drawback that the size of the problem cannot always be practically represented.
The design of a decision tree holds much of the character of an influence diagram. That
is, squares represent decisions to be made, circles represent chance nodes and
consequence nodes are found at the ends of the branches which connect the nodes. The
interpretation of decision trees follows much the same path as explained in section 9.4 on
influence diagrams.
In designing a decision tree, the options represented by branches from a decision node
must be such that the decision maker can choose only one option. However, in some
instances, combination strategies are possible. Also, each chance node must have
branches that correspond to a set of mutually exclusive and collectively exhaustive
outcomes. Mutually exclusive means that only one of the outcomes can happen.
Collectively exhaustive means that no other possibilities exist; one of the specified
outcomes has to occur. Putting these two specifications together means that when the
uncertainty is resolved, one and only one of the outcomes occurs.
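These two requirements can be checked mechanically. The sketch below (the outcome labels, probabilities, and the `check_chance_node` helper are all hypothetical) treats a chance node as a list of (outcome, probability) branches and verifies that the outcomes are distinct and that their probabilities sum to one:

```python
# Sketch (hypothetical labels and numbers): validate the branches of a
# chance node.  Outcomes must be distinct (mutually exclusive) and their
# probabilities must sum to 1 (collectively exhaustive).

def check_chance_node(branches):
    outcomes = [o for o, p in branches]
    if len(set(outcomes)) != len(outcomes):
        raise ValueError("outcomes are not distinct")
    total = sum(p for o, p in branches)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"probabilities sum to {total}, not 1")
    return True

print(check_chance_node([("theta1", 0.25), ("theta2", 0.45), ("theta3", 0.30)]))
```

A node whose probabilities sum to less than one signals a missing outcome; a sum greater than one signals overlapping outcomes.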
¹ Sensitivity analysis is helpful in determining which elements are important.
A decision tree represents all possible paths that the decision maker might follow through
time, including all possible decision alternatives and outcomes of chance events. In a
complicated decision situation with many sequential decisions or sources of uncertainty,
the number of potential paths may be very large. Each path, consequently, represents a
particular scenario that could unfold over time. It is sometimes useful to think of the
nodes as occurring in a time sequence.
Figure 9-5:
Including multiple objectives in a decision tree is straightforward; at the end of each
branch, simply list all of the relevant consequences. An easy way to accomplish this
systematically is with a consequence matrix as shown in the figure above. This is an
example of a decision tree representation of the FAA’s multiple-objective bomb detection
decision (Clemen, 1996). Each column of the matrix represents a fundamental objective,
and each row represents an alternative. Evaluating the alternatives requires “filling in the
boxes” in the matrix; each alternative must be measured on every objective. Some basic
decision trees will be described more illustratively later in this section, in conjunction
with influence diagrams.
Marshall and Oliver (1995) provide a procedure for constructing decision trees, once the
set of random events and actions have been defined together with their sets of
probabilities and respective utilities (cost/payoff). Much like (sequence) influence
diagrams, decision trees are constructed in chronological order and nodes are connected
by branches. The steps include:
1. Draw a branch for each possible action from the first (leftmost) decision node. If
it is a chance node, do the same, drawing a branch for each possible consequence.
2. Label each branch emanating from a chance node with a unique element θi ∈ Θ,
together with the corresponding probability and consequence.
3. Label each branch emanating from a decision node with its respective loss or
consequence.
4. Place the consequence values from the set on the terminal nodes at the end of
each branch.
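The four steps above can be sketched in code. In the fragment below (a hypothetical two-action, two-state problem; the probabilities and consequence values are invented for illustration), branches are drawn in chronological order and every root-to-terminal path is listed:

```python
# Sketch of steps 1-4 (illustrative two-action, two-state problem).
# Branches are drawn in chronological order: decision first, then chance,
# with probabilities and consequence values on the terminal nodes.

actions = ["a1", "a2"]
states = {"theta1": 0.7, "theta2": 0.3}        # P(theta_i), summing to 1
consequence = {("a1", "theta1"): 100, ("a1", "theta2"): -20,
               ("a2", "theta1"): 60,  ("a2", "theta2"): 10}

paths = []                                      # every root-to-terminal path
for a in actions:                               # step 1: decision branches
    for theta, p in states.items():             # step 2: chance branches
        paths.append((a, theta, p, consequence[(a, theta)]))   # steps 3-4

for a, theta, p, c in paths:
    print(f"{a} -> {theta} (p={p}): consequence {c}")
```

With m actions and n states, the loop enumerates all mn terminal nodes, which is exactly why trees grow quickly as problems gain decisions and uncertainties.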
[Figure 9-5: decision node “Best System” with five alternatives (System 1 to System 5), each row of the consequence matrix measured against the columns Detection Effectiveness, Time to Implement, Passenger Acceptance, and Cost.]
The distinction between decision and chance nodes is critical. Placing a chance event
before a decision means that the decision is made conditional on the specific chance
outcome having occurred. Conversely, if a chance node is to the right of a decision node,
the decision must be made in anticipation of the chance event. The sequence of decisions
is shown in a decision tree by order in the tree from left to right. If chance events have a
logical time sequence between decisions, they may be appropriately ordered. If no
natural sequence exists, then the order in which they appear in the decision tree is not
critical, although the order used does suggest the conditioning sequence for modeling
uncertainty (Clemen, 1996).
The key to constructing a decision tree from an influence diagram is the specification of
every possible outcome of every node: chance outcomes from each chance node, actions
from each decision node, and all possible consequences defined. Thus, a path in a
decision tree is a series of decisions and random outcomes that lead from a single initial
chance or decision node to a distinct end point or terminal node. The one fundamental
problem with decision trees, as stated in Marshall and Oliver (1995), is that the
dependencies among the branches on each path are not made as evident as they are by
influence diagrams.
Section 9.6: Comparing Problem Structuring
Decision problems can be modeled by use of graphical methods, such as
those presented above. Thus, for analysis purposes, it is beneficial to construct either a
decision tree or influence diagram and bring the structure of the decision problem to the
forefront. In this section, a general formulation of a decision problem is presented in both
graphical forms, providing the basis for a comparative assessment of these approaches.
Figure 9-6 below provides a decision tree representation of the general problem
presented in the table in section 9.1. The decision tree displays the different possible
scenarios, but does not provide a clear picture of the interrelation and influences between
the uncertainties and decisions. Thus, it is not immediately clear whether the m chance
nodes at the end of the action branches represent the same uncertainty about θ, i.e. the
probability distribution of θ is not conditioned on the choice of action. The influence
diagram in Figure 9-7 makes this clear.
Figure 9-6: The mn possible scenarios (histories) of the decision tree are represented by the mn possible paths in the tree. [The tree branches on actions a1, a2, a3, … at the decision node, then on states θ1, θ2, …, θn at each chance node, ending in consequences c11, c12, …, c1n, ….]
Figure 9-7:
The arrows indicate that the consequence
depends on the choice of action and the
unknown state; but the absence of an arrow
from the action node to the state nodes
indicates that the uncertainty about the state
is not conditioned on the choice of action.
Figures 9-8 and 9-9 provide a decision tree and an influence diagram for a statistical decision
problem. These emphasize a further difference between the two representations. From
the tree, it is clear that the observation X is made before an action is decided upon. So, in
this case, the uncertainty concerning θ differs between chance nodes, being conditional
upon the observed value of X. Within the influence diagram, it might appear that the
arrow from X to a indicates that X becomes known to the decision maker before the
action a is chosen. But if this interpretation is made for all influence arrows, it may be
assumed that the uncertainty about θ is resolved and becomes known to the decision
maker before X is observed, which it does not.
Figure 9-8:
Figure 9-9:
Generally in influence diagrams, an arrow entering a chance node indicates that the
probability distribution represented by that node is conditional on the quantity (random
variable or decision) at the other end of the arrow. An arrow entering a decision node
indicates that the quantity at the other end of the arrow is known to the decision maker at
the time of the decision. An arrow entering a value node indicates that the quantity in the
value node is partially determined by the quantity at the other end of the arrow. In
summary, a decision tree representation displays two facets of a problem particularly
clearly: contingency and temporal relationships between decisions and realizations of
random events. An influence diagram² representation hides these facets but does bring
influences (in probabilistic terms, conditionality) to the forefront. Together they bring
complementary perspectives on the issues facing a decision maker.
² An influence diagram can be used to represent the structure of a decision maker’s knowledge, i.e. beliefs,
simply by avoiding the use of decision or value nodes. Such a diagram is referred to as a belief net and its
use is most common in the field of artificial intelligence.
[Figure 9-7: nodes θ and a with arcs into the consequence c. Figure 9-9: nodes θ, X, a, and c. Figure 9-8: a decision tree branching on observations X = x1, x2, …, xq, then on actions a1, …, then on states θ1, θ2, …, θn, with consequences c11, c12, …, c1n.]
Decision trees have a difficulty in that for many problems they rapidly become very
large: too large for the eye to comprehend as one. As such they have been described as a
“bushy mess” (French and Insua, 2000). Thus, decision trees are often displayed as a
series of subtrees. Influence diagrams are a much more compact representation.
However, their advantage in this respect is, in a sense, illusory. Decision trees can
represent asymmetric decision problems, i.e. problems in which a particular choice of
action at a decision node makes available different choices of action at subsequent
decision nodes than those available after an alternative choice (Raiffa & Schlaifer, 2000).
Such asymmetric problems are rather the rule than the exception in decision analysis.
Within the classes of problems commonly considered in statistical decision theory,
asymmetry is less common. It is the asymmetry within problems that causes the decision
trees to grow into bushy messes; and it is unfortunate that, as commonly formulated,
influence diagrams cannot model asymmetric problems.
The discussion and the examples shown demonstrate, on the surface at least, that decision
trees display considerably more information than do influence diagrams. And, as
previously commented, decision trees get “messy” much faster than do influence
diagrams as decision problems become more complicated. Even one of the most
complicated decision trees may not be capable of showing all of the intricate details
contained in an influence diagram depicting the same decision problem. In practice,
influence diagrams are regarded as especially easy to understand, regardless of one’s
mathematical training (Clemen, 1996).
Both methods are worthwhile, and they complement each other well. Influence diagrams
are particularly valuable for the structuring phase of problem solving and for representing
large problems. Decision trees display the details of a problem. Yet, the ultimate
decision need not depend on the representation, as influence diagrams and decision trees
are isomorphic. That is, a properly built influence diagram can be converted into a
decision tree, and vice versa, although the conversion may not be easy. One strategy is to
start by using an influence diagram to help understand the major elements of the situation
and then convert to a decision tree to fill in details.
Influence diagrams and decision trees provide two approaches for modeling a decision.
Because the two approaches have different advantages, one may be more appropriate
than the other, depending on the modeling requirements of the particular situation. For
example, if it is important to communicate the overall structure of a model to other
people, an influence diagram may be more appropriate. Careful reflection and sensitivity
analysis on specific probability and value inputs may work better in the context of a
decision tree. Using both approaches together may prove useful; the goal, after all, is to
make sure that the model accurately represents the decision situation. Because the two
approaches have different strengths, they should be regarded as complementary
techniques rather than as competitors in the decision modeling process.
Section 10: Probabilistic Graphical Models
The axiomatic developments presented in Part II justify the subjective expected utility
representation of a rational decision maker’s preferences in a decision problem; but
ignore the structure in the state, observation or consequence spaces. This section focuses
on the process of structuring the “unknowns” into parameters and observations, thereby
facilitating the probability model building. This section begins by revisiting a decision
model framework with emphasis on the statistical elements of parameters and
observations. This leads to probability modeling and the representation of probabilistic
graphical models from a Bayesian perspective. The development of models such as
Bayesian networks (belief nets), multilevel or hierarchical models and exchangeability
are the emphasis in the later portion of this section.
Section 10.1: Subjective Expected Utility Model – Rebuilding the
Decision Framework
The modeling of a decision maker’s preferences or beliefs is justifiably represented
through the established subjective expected utility axioms. Such a model has been
described as having three essentials: the action {A}, states {Θ}, and consequence {C}
spaces. Drawing upon the scientific process of modeling, French and Insua (2000)
restate the previously defined decision model in a more general framework containing
parameters and observable variables. Thus, let us consider a model with parameters θ and
input variables α of the form M(α, θ). French and Insua (2000) proceed to develop
comments on such a modeling framework.
The model M(·, ·), used to predict a set of observed values y = M(α, θ), consists of the
input values α, which are observed, while the parameters θ contain information obtained from
previous studies. That is, the uncertainty lies within θ, whereas α is known. It is
proposed that, in decision modeling, partitioning α into (α_a, α_ā) relates back to the model
previously defined. If α_a is defined to be the set of actions a ∈ A, and α_ā is
defined to be those random quantities which are unknown, then relabeling α_ā as θ
gives M(α_a, θ). As stated in section 9, there is no correct model. Such models simply
provide a means of encoding the current assumptions and stating the understood
relationships. These models are descriptive in nature and are further taken to be
deterministic, except when allowing a propensity interpretation of probability.
The decision model presented in Part I and revisited in the previous section concedes that
the consequences of a decision are assumed to be determined by the action chosen and
the unknown state:
a ⊕ θ → c.
The decision model can be restated as
c(a, θ) = M(α_a, θ).
Based on the axiomatic developments in Part II, the subjective expected utility model
takes the following form:
Choose a* = arg max_a E_θ[u(c(a, θ))].
Thus, encoded in the model is the decision maker’s uncertainty and preferences. The
observed value X is subtly introduced into the model by redefining the parameter θ to
include the observations: θ = (θ, X). The justification for the use of conditional
probabilities is provided by the following axiom (French and Insua, 2000):
∀ R, S, T ∈ Q: (R | T) ≿ (S | T) ⇔ R ∩ T ≿ S ∩ T.
DeGroot (1970) proceeds with the following theorem, based on the axioms stated in Part
II of this report, by asserting that there exists P_θ(·), a probability distribution on Θ, such that
(a) ∀ R, S ∈ Q: R ≿ S ⇔ P_θ(R) ≥ P_θ(S);
(b) ∀ R, S, T ∈ Q with T ≻ ∅: (R | T) ≿ (S | T) ⇔ P_θ(R | T) ≥ P_θ(S | T).
DeGroot’s next step is to construct the utility function. Including the observed value
X = x, gives the subjective expected utility model to be
Choose a* = arg max_a E_θ[u(c(a, θ)) | x].
Note that the loss function is l(θ, a) = −u(c(a, θ)), where δ(x) = a is the action taken
after observing x. This brings us back to the development of the statistical decision
problem presented in Part I, within the context of the developments found in Part II. This
leads directly into the building of probability models.
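As a numerical illustration of the rule a* = arg max_a E_θ[u(c(a, θ)) | x], the sketch below (a hypothetical two-state, two-action problem; the prior, likelihood, and utilities are invented for illustration) forms the posterior by Bayes' theorem and then maximizes posterior expected utility:

```python
# Hypothetical two-state, two-action problem: choose the action that
# maximizes posterior expected utility E_theta[u(c(a, theta)) | x].

prior = {"theta1": 0.5, "theta2": 0.5}             # P(theta)
likelihood = {"theta1": 0.8, "theta2": 0.2}        # P(X = x | theta)
utility = {("a1", "theta1"): 10, ("a1", "theta2"): -5,
           ("a2", "theta1"): 2,  ("a2", "theta2"): 2}

# Posterior by Bayes' theorem: P(theta | x) proportional to P(x | theta) P(theta)
unnorm = {t: likelihood[t] * prior[t] for t in prior}
norm = sum(unnorm.values())
posterior = {t: w / norm for t, w in unnorm.items()}

# Posterior expected utility of each action, and the Bayes action a*
expected_u = {a: sum(posterior[t] * utility[(a, t)] for t in posterior)
              for a in ("a1", "a2")}
best = max(expected_u, key=expected_u.get)
print(posterior, expected_u, best)
# The observation favours theta1, so the risky action a1
# (0.8 * 10 + 0.2 * (-5) = 7) beats the safe action a2 (utility 2).
```

Without the observation, a1's expected utility would be 0.5·10 + 0.5·(−5) = 2.5, barely above a2's; the data sharpen the choice.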
Section 10.2: Building the Probability Framework
The decision model elements: A, Θ, and C (subsequently defined as the action, state, and
consequence spaces) which constitute the parameters and random and/or observed
variables of the model have been prescribed as vectors of real or integer numbers or
categorical variables. In the previous section, both the general form of the model and the
consequence model c(α, θ) = M(·, ·) are given to be deterministic, unless accentuated by a
“propensity interpretation of probability” (French and Insua, 2000). Decision making
under uncertainty is quantified or assessed through the development of probability
distributions and epitomizes the intrinsic nature of statistical decision theory.
The joint distribution of the unknown states of nature θ and the observed random variable
X is expressed through the joint distribution P_{θ,X}(θ, X). The relationship between the
deterministic model M_X(·, ·) and the ascribed joint probability distribution P_{θ,X}(θ, X)
accounts for the decision maker’s uncertainties and the confounding effects within the
modeling framework.
As opposed to the descriptive structure of decision modeling construed in earlier sections,
the prescriptive approach discussed here is indicative of the structure of a decision
maker’s thoughts or belief process (French and Insua, 2000) and resides in the
relationship between the construction of probability distributions and utility functions. A
review of the development of the form or structure of the joint probability distribution
(the Bayesian perspective) leads to the construction of probability models and an
understanding of the graphical approach.
The development of the distributions P_{θ,X}(θ, X), P_θ(θ | X), and P_X(X | θ) is intrinsic in
determining the confounding effects of both observations and parameter estimates. The
construction of the joint probability distribution elicits Bayes’ theorem. Furthermore,
Bayes’ theorem relates the joint distribution to the marginal and conditional
distributions, such that the joint probability of two events A and B is given as:
P_θ(A, B) = P_θ(A | B) P_θ(B).
Bayes’ theorem rearranges this form to demonstrate that
P_θ(A | B) = P_θ(B | A) P_θ(A) / P_θ(B).
When the events A and B are independent, denoted A ⊥ B, the symmetry of the condition
given above is immediate. This concept of conditional probability within the utility
context is introduced by DeGroot (1970) in his presentation of the subjective expected
utility axioms, as shown in section 10.1.
The concept of conditional probabilistic independence further extends the modeling
structure of a decision maker’s beliefs (French and Insua, 2000). Bayes’ theorem
tells us that “our revised belief for A”, the posterior probability P_θ(A | B), can be
obtained as the product of the prior probability P_θ(A) and the ratio P_θ(B | A) / P_θ(B),
where the numerator of the ratio is the likelihood of event A. Thus, the posterior
probability can be expressed in the following form:
P_θ(A | B) ∝ P_θ(A) × P_θ(B | A),
posterior ∝ prior × likelihood.
This process of identifying conditional independence allows for the beliefs of a decision
maker to be represented in the form of a probability distribution. Furthermore, they aid
in developing the structural form of the utilities and in the assessment of the problem.
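The identities above can be verified numerically. In the sketch below, an illustrative joint table over two binary events (the numbers are invented, and chosen so that A ⊥ B) is used to check the product rule and Bayes' theorem directly:

```python
# Check the product rule P(A, B) = P(A | B) P(B) and Bayes' theorem
# P(A | B) = P(B | A) P(A) / P(B) on an illustrative joint table for two
# binary events (numbers chosen so that A and B are independent).

joint = {(True, True): 0.12, (True, False): 0.28,    # P(A = a, B = b)
         (False, True): 0.18, (False, False): 0.42}

p_a = sum(p for (a, b), p in joint.items() if a)         # marginal P(A)
p_b = sum(p for (a, b), p in joint.items() if b)         # marginal P(B)
p_a_given_b = joint[(True, True)] / p_b                  # P(A | B)
p_b_given_a = joint[(True, True)] / p_a                  # P(B | A)

# Product rule: P(A, B) = P(A | B) P(B)
assert abs(joint[(True, True)] - p_a_given_b * p_b) < 1e-12

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12

# Independence: here P(A | B) = P(A), so A ⊥ B (and by symmetry B ⊥ A)
assert abs(p_a_given_b - p_a) < 1e-12
print(p_a, p_b, p_a_given_b)
```

Replacing the joint table with one where P(A, B) ≠ P(A)P(B) would leave the first two assertions intact (they are identities) but break the independence check.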
Section 10.3: Directed Acyclic Graphs (DAGs) and Probability
As illustrated in section 9, decision models can be represented graphically, providing
insight into the structure of the problem. It was stated that influence diagrams constitute
the formation of a network through a graphical framework. A formal description of
directed acyclic graphs (DAGs) is given in this section, offering a graphical approach to
Bayesian inference. This section begins with a description of directed graphs as stated in
graph theory and leads to their ability to illustrate the Bayesian inference process.
A directed graph consists of nodes and directed arcs, as stipulated in the section on
influence diagrams. Formally, we may consider a network with nodes i, j = 1, 2, …, n and
a subset of the possible ordered pairs (i, j) which specifies the directed arcs. A path
consists of a sequence:
i1 → i2 → i3 → … → ik.
Such a directed graph is said to be acyclic when there exists no path with i1 = ik. The
order of the nodes can be rearranged so that every directed arc i → j has i < j. Bather
(2000) gives the following proposition:
If a directed network is acyclic, then the vertices can be renumbered in such a way that i
< j for each directed arc i → j.
Bather (2000) goes on to prove this proposition which is omitted from this report. It
should also be noted that a path only exists if there is a link from one node to another;
else they are disconnected. This representation is useful when discussing causation. For
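Bather's proposition can be illustrated with a topological sort. The sketch below uses Kahn's algorithm on an illustrative four-node network (the node names and arcs are invented); the ordering it produces renumbers the vertices so that i < j for every directed arc i → j:

```python
from collections import deque

# Kahn's algorithm (illustrative four-node network): renumber the vertices
# of an acyclic directed graph so that i < j for every directed arc i -> j,
# as Bather's proposition guarantees is possible.

def topological_order(nodes, arcs):
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in arcs:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)   # the roots
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for v in successors[n]:                # removing n frees its children
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):               # leftovers imply a directed cycle
        raise ValueError("the network is not acyclic")
    return order

arcs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
order = topological_order(["A", "B", "C", "D"], arcs)
number = {n: i for i, n in enumerate(order)}   # the new numbering
assert all(number[u] < number[v] for u, v in arcs)
print(order)                                    # ['A', 'B', 'C', 'D']
```

The same routine doubles as an acyclicity test: if the graph contains a directed cycle, no node on the cycle ever reaches indegree zero and the error is raised.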
completeness purposes, a few terms in relation to graph theory are provided below.
Much of the terminology surrounding graph theory is centered on kinship relationships
such as parents, child, ancestors, etc. A node in a directed graph is said to be a root if it
has no parents and a sink if it has no children (Pearl, 1988). Every DAG has at least one
root and one sink. The term tree is used to qualify a connected DAG in which every node
has at most one parent. Similarly, if every node in a tree has at most one child, then it is
referred to as a chain. A graph is said to be complete if every pair of nodes is connected by an arc. The
figure below illustrates an example of a DAG containing four nodes.
Figure 10-1: [A DAG with four nodes A, B, C, and D.]
This mathematical structure, a directed graph, represents a causal network consisting of
variables and directed links. This causal framework suggests the way influence may run
within a causal network. Two nodes (variables) are said to be separated if “new evidence
on one of them has no impact on our belief of the other” (Jensen, 2001). There are three
types of connections between triplets of variables A, B, and C.
Figure 10-2(a): Figure 10-2(b): Figure 10-2(c):
[The three connection types between the variables A, B, and C: serial (linear), converging, and diverging, with C as the intermediate variable.]
In both the linear and diverging cases we have (A ⊥ B | C), while in the converging case
(A ⊥ B | C) does not hold.
Definition: (Jensen, 2001)
Two variables A and B in a causal network are said to be d-separated if for all trails
between A and B there is an intermediate variable V such that either
i. the connection is serial (linear) or diverging and the state of V is known; or
ii. the connection is converging and neither V nor any of V’s descendants have
received evidence.
This follows the properties of conditional independence (described in Section 10.2), as the
quantification of uncertainty in causal structures must obey the principle that whenever A
and B are d-separated, new information on one of them does not change the certainty
of the other.
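To make the converging case concrete, the following sketch (in Python, with hypothetical conditional probability tables that are not drawn from this report) enumerates a small network A → C ← B and verifies that A and B are marginally independent but become dependent once the common child C receives evidence:

```python
import itertools

# A converging connection A -> C <- B with binary variables.
# The CPTs below are illustrative, chosen only for this example.
P_A = {0: 0.6, 1: 0.4}
P_B = {0: 0.7, 1: 0.3}
P_C_given = {(a, b): {0: 0.9 if a == b else 0.2, 1: 0.1 if a == b else 0.8}
             for a in (0, 1) for b in (0, 1)}

def joint(a, b, c):
    return P_A[a] * P_B[b] * P_C_given[(a, b)][c]

def marginal(fixed):
    """Sum the joint over every variable not fixed in `fixed`."""
    total = 0.0
    for a, b, c in itertools.product((0, 1), repeat=3):
        assign = {"A": a, "B": b, "C": c}
        if all(assign[k] == v for k, v in fixed.items()):
            total += joint(a, b, c)
    return total

# Marginally, A and B are independent: P(A, B) = P(A) P(B).
p_ab = marginal({"A": 1, "B": 1})
assert abs(p_ab - marginal({"A": 1}) * marginal({"B": 1})) < 1e-12

# Conditioning on the common child C makes them dependent:
p_a_given_c = marginal({"A": 1, "C": 1}) / marginal({"C": 1})
p_a_given_bc = marginal({"A": 1, "B": 1, "C": 1}) / marginal({"B": 1, "C": 1})
print(p_a_given_c, p_a_given_bc)  # the two values differ, so A and B are dependent given C
```

This is exactly the converging ("explaining away") behaviour described in the definition above: evidence on C opens the trail between A and B.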
Figure 10.3(a): A → B, factored as P(A)P(B|A). Figure 10.3(b): the joint P(A,B). Figure 10.3(c): B → A, factored as P(B)P(A|B).
For Bayesian inference, DAGs illustrate the prior-to-posterior inference process. Each
diagram above represents the joint distribution of two random variables A and B. A
causal interpretation follows from figure 10.3(a), which describes the prior beliefs. The
posterior beliefs are represented by figure 10.3(b), eliciting an inferential process. Such
a representation forms the basis of Bayesian networks or belief nets.
Section 10.4: Directed Acyclic Graphs (DAGs) for Inference on
Bayesian Networks
The use of DAGs for Bayesian inference illustrates the prior-to-posterior inference
process demonstrated in the previous section. Bayesian networks, on the other hand,
are generally more complicated than those presented above. Robert Cowell, in his article
Introduction to Inference for Bayesian Networks, defines Bayesian networks as a
“model representation for the joint distribution of a set of variables in terms of
conditional and prior probabilities, in which the orientations of the arrows represent
influence, usually though not always of a causal nature, such that these conditional
probabilities for these particular orientations are relatively straightforward to specify
(from data or eliciting from an expert)”.
Hence, in this section, the focus falls on the inferential procedure, which involves the
calculation of marginal probabilities conditional on the observed data using Bayes'
theorem. Diagrammatically, this is equivalent to reversing one or more of the
Bayesian network arrows. This section defines a Bayesian network and shows how one
can be constructed from prior knowledge.
Let's first consider a simple causal network where A is a parent of B. Then it is simple to
calculate P(B | A). However, if C is also a parent of B, then the individual
conditional probabilities P(B | A) and P(B | C) are not sufficient to provide any
information on the interaction between A and C. Hence, a specification of P(B | A, C) is
required. This ideology can be expanded to consider n random variables. Typically, the
interest lies in looking for relationships among a large number of variables, for which the
Bayesian network is suitable.
A Bayesian network for a set of variables X = {X_1, X_2, ..., X_n} consists of a set of directed
arcs which in turn provide (a) a network structure S that encodes a set of conditional
independence assertions about variables in X, and (b) a set P of local probability
distributions associated with each variable (Heckerman, 1998; Jensen, 2001). Thus, the
variables together with the directed arcs form a directed acyclic graph. Together these
components:
i. define the joint probability distribution for X; and
ii. provide a one-to-one correspondence between S and X.
Thus, a Bayesian network is a DAG whose structure defines a set of conditional
independence properties. These properties can be found through graphical manipulations
such as those presented in section 10.3 (Pearl, 1988). It has been contended that any
uncertainty must obey the definition of d-separation (Jensen, 2001). This is to say that a
conditional probability distribution is associated with each node, where the conditioning is
on the parents of the node, P(X_i | par(X_i)). Further, d-separation can be used to read off
conditional independencies from the DAG representation of a Bayesian network.
Given a network structure S, the joint probability distribution over the set of all variables
U is given by
P(U) = ∏_i P(X_i | par(X_i)),
where par(X_i) is the parent set of X_i. Thus, the local probability distributions P are
simply the distributions corresponding to the terms in the product of the equation given
above (Heckerman, 1999). Jensen (2001) defines this as the chain rule.
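As an illustration of this factorization, the following sketch computes P(U) for a small hypothetical three-node network (the familiar rain/sprinkler/grass example; the structure and numbers are illustrative, not taken from this report):

```python
# Hypothetical network: Rain -> Sprinkler, and {Rain, Sprinkler} -> GrassWet.
parents = {"Rain": [], "Sprinkler": ["Rain"], "GrassWet": ["Rain", "Sprinkler"]}

# cpt[node][parent_values][value] = P(node = value | par(node) = parent_values)
cpt = {
    "Rain":      {(): {True: 0.2, False: 0.8}},
    "Sprinkler": {(True,):  {True: 0.01, False: 0.99},
                  (False,): {True: 0.4,  False: 0.6}},
    "GrassWet":  {(True, True):   {True: 0.99, False: 0.01},
                  (True, False):  {True: 0.8,  False: 0.2},
                  (False, True):  {True: 0.9,  False: 0.1},
                  (False, False): {True: 0.0,  False: 1.0}},
}

def joint(assignment):
    """Chain rule: P(U) = product over nodes of P(X_i | par(X_i))."""
    p = 1.0
    for node, pars in parents.items():
        key = tuple(assignment[q] for q in pars)
        p *= cpt[node][key][assignment[node]]
    return p

p = joint({"Rain": True, "Sprinkler": False, "GrassWet": True})
# 0.2 * 0.99 * 0.8 = 0.1584
```

Each factor in the product is one local distribution P(X_i | par(X_i)), exactly as in the chain rule above.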
Part IV: Model Building for Decision Analysis
106
Next let’s consider the process of building a Bayesian network. Heckerman (1999)
illustrates this process through an example. Here, we generalize his “learning by
example" approach. The first task is much like the one described in developing a
decision tree or influence diagram. That is to say, this initial task entails the more logical
structuring of the parameters and/or variables of interest. This can be characterized to
include: (1) the correct identification of the goals of modeling (i.e. prediction versus
explanation versus exploration); (2) the identification of all possible observations that
may be relevant to the problem; (3) the identification of a subset of relevant observations
to model; and (4) the classification of observations into variables having mutually
exclusive and collectively exhaustive states. As seen earlier, this is a task embedded in
decision analysis and hence is not exclusive to Bayesian modeling. This concludes the
more logical framework development in the construction of a Bayesian network.
The next step in the construction of a Bayesian network entails the more mathematical or
statistical framework. It is at this stage that a DAG encoding the "assertions of conditional
independence" is built. This approach is based on the chain rule of probability, which is
equivalent to the P(U) shown above. In general, the chain rule of probability is defined as
P(X) = ∏_{i=1}^{n} P(x_i | x_1, ..., x_{i-1}) = ∏_{i=1}^{n} P(x_i | π_i).
Then for every X_i there exists some subset Π_i ⊆ {X_1, ..., X_{i-1}} such that X_i and
{X_1, ..., X_{i-1}} \ Π_i are conditionally independent given Π_i. Thus, it is clear that the
variable sets (Π_1, ..., Π_n) correspond to the parent nodes of a Bayesian network, in
turn specifying the arcs in the network structure S.
It follows that determining the structure of the Bayesian network requires ordering the
variables and determining the most appropriate subset of variables. DAGs can always
have their nodes linearly ordered so that for each node X all of its parents precede it in
the ordering. Such an ordering is referred to as a topological ordering (Cowell, 1999).
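A topological ordering can be computed by repeatedly removing nodes with no remaining parents (Kahn's algorithm). The sketch below uses a small hypothetical edge set, not Cowell's nine-node example:

```python
from collections import deque

# Hypothetical DAG, given as adjacency lists (node -> children).
edges = {"A": ["C", "D"], "B": ["D"], "C": ["E"], "D": ["E"], "E": []}

def topological_order(edges):
    """Kahn's algorithm: repeatedly emit a node whose parents are all emitted."""
    indegree = {v: 0 for v in edges}
    for v in edges:
        for w in edges[v]:
            indegree[w] += 1
    queue = deque(sorted(v for v, d in indegree.items() if d == 0))
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in edges[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    if len(order) != len(edges):
        raise ValueError("graph has a directed cycle")
    return order

print(topological_order(edges))  # ['A', 'B', 'C', 'D', 'E']
```

The cycle check at the end reflects the proposition quoted earlier: such an ordering exists exactly when the directed network is acyclic.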
Consider the graph shown below with nine nodes. This has been taken from Cowell
(1999).
Figure 10.4: [a DAG with nine nodes A–I, from Cowell (1999)]
This example shows that (A, B, C, D, E, F, G, H, I) and (B, A, E, D, G, C, F, I, H) are two
possible topological orderings. Thus, this task (of ordering the variables) may not be the
most feasible or reasonable process to use unless done so under a more logical
framework. For instance, if the variable ordering chosen is not plausible, the resulting
network structure may fail to reveal some of the important conditional independencies, or
in the worst case n! variable orderings may need to be explored. Thus, applying the
semantics of causal relationships (as observed naturally) readily asserts the corresponding
conditional independencies.
Now, the task at hand is to compute the joint density over the set of all variables U as
defined earlier. According to the DAG, this is referred to as recursive factorization or
"the distribution being graphical over the DAG" (Cowell, 1999). It follows a similar
procedure to that illustrated above, simplifying the individual terms (or nodes) with respect to
their structural parents. Then, the final step in the construction of Bayesian networks
"simply" entails the assessment of the local probabilities, defined as P(X_i | par(X_i)) for
each i. This concludes the systematic approach to the construction of Bayesian networks.
It has been illustrated that each of these models determines a set of conditional
constraints, represented implicitly in the DAG. However, the implied criteria or
assumptions have not been discussed explicitly; hence mention is made here of what
Pearl (1988) refers to as stability. The stability condition requires that all of the
probabilistic independence relations implied by the model be invariant across
(small) perturbations to the parameters of the model. That is, the parameters of the model
should not be functionally related to each other.
Furthermore, Verma & Pearl (1990) proved that two DAGs are observationally
equivalent if they have (i) the same skeleton; and (ii) the same sets of nodal structures –
i.e. two converging arcs whose tails are not connected by an arc. Using this criterion, it
can be seen that figures 10.5(a) and (b) are probabilistically indistinguishable,
while figures 10.5(c) and (d) are not. Thus, a variety of belief nets can represent the
same conditional independencies.
Figure 10.5: (a) and (b) – [two observationally equivalent DAGs on nodes A, B, C, D, E]
So far the emphasis has been largely on the construction of a Bayesian network (from prior
knowledge, data, or a combination), and the need is usually to determine various
probabilities of interest from the model. Thus, we now focus on probabilistic inference in
Bayesian networks. Because a Bayesian network for U determines a joint probability
distribution for U (the set of all variables), in principle the Bayesian network can be used
to compute any probability of interest as described above. However, the task of refining the
structure and local probability distributions of a Bayesian network given data results in a
set of techniques for data analysis that combines prior knowledge with data to produce
improved knowledge (Heckerman, 1999). This process is referred to as "belief updating
in Bayesian networks". Jensen (2001) summarizes the mathematical process involved:
1. Use the chain rule to calculate P(U).
2. Determine P(U | θ_s, S^h) = ∏_{i=1}^{n} P(X_i | par(X_i), θ_i, S^h), where θ_i is the vector of
parameters for the distribution P(X_i | par(X_i), θ_i, S^h), θ_s is the vector of
parameters (θ_1, ..., θ_n), and S^h denotes the configuration of the network.
3. Marginalize P(U | θ_s, S^h) to P(X_i, par(X_i) | θ_s, S^h) for each X_i in U.
4. Normalize P(X_i, par(X_i) | θ_s, S^h) to obtain P(X_i | par(X_i), θ_s, S^h).
This illustrates the basic ideas for learning probabilities (and structure) within the
Bayesian or belief net framework. This section has described the construction steps in a
sequential manner; however, it is important to assert that many of these “steps” are
intermingled in practice. Both the problem of structuring and the assessments of
probability can lead to changes in the network structure. The section following presents
an extension of both probability modeling and belief nets within a hierarchical form.
Figure 10.5: (c) and (d) – [two DAGs on nodes A, B, C, D, E that are not observationally equivalent]
Section 10.5: Hierarchical Structures in Belief Nets
The natural occurrence of hierarchical structures in physical phenomena is quite
common. Moreover, these structures appear in belief net representations as
hierarchical models. Hierarchical models were introduced into statistical modeling in 1972
by Lindley and Smith (Gelman et al., 1995). Such models hold several advantages.
The simplest form of a hierarchical model can be described as a representation of beliefs
which consists of a layered structure. For instance, consider a set of observations
Y_1, Y_2, ..., Y_n, each related independently to a set of parameters θ_1, θ_2, ..., θ_n which share some
similarity or characteristic. The assumption that the θ_i have a common distribution
P_θ(· | γ) represents this "similarity". A third level suggests that γ may be represented by a
prior distribution with parameter ξ. Thus the model described is a three-level model:
Level 1: Y_i ~ P_{Y_i}(· | θ_i), i = 1, ..., n.
Level 2: θ_i ~ P_θ(· | γ), i = 1, ..., n.
Level 3: γ ~ P(· | ξ).
Thus, the concept here is simply to extend a common parameter θ to model the
distribution of many observable quantities Y_i. If the parameter has many components,
then it may be useful to specify their joint prior distribution using a common
hyperparameter, say γ (Neal, 1996), which is given its own prior. This hierarchical structure
is referred to as hierarchical modeling.
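The three-level structure above can be simulated directly. The sketch below assumes normal distributions at every level and made-up hyperparameters; the report does not fix the distributional families:

```python
import random

# Simulating the three-level hierarchy, assuming normal distributions
# at each level (an illustrative choice, not prescribed by the text).
random.seed(1)

xi = (0.0, 1.0)                      # hyperprior parameters for gamma
gamma = random.gauss(*xi)            # Level 3: gamma ~ P(. | xi)
theta = [random.gauss(gamma, 0.5)    # Level 2: theta_i ~ P(. | gamma)
         for _ in range(5)]
y = [random.gauss(t, 1.0)            # Level 1: Y_i ~ P(. | theta_i)
     for t in theta]
print(y)
```

Sampling proceeds top-down through the layers, mirroring the way belief flows through the hierarchy.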
If the θ_i are independent given γ, then
P(θ_1, ..., θ_n) = ∫ P(γ) ∏_{i=1}^{n} P(θ_i | γ) dγ.
Although γ could have been dispensed with (giving θ a direct prior), Neal
(1996) states that using a hyperparameter "may be much more intelligible". This could
be extended to the level 1 data, which would then require the decision maker to integrate
over θ.
In the hypothetical model illustrated above, the distribution P_θ(· | γ) is common to all θ_i,
while the distributions P_{Y_i}(· | θ_i) may vary. This allows for the articulation of dependence
or independence in the beliefs about the components, if the Y_i are multidimensional
(French and Insua, 2000). Further, a variety of experimental conditions can be ascribed.
A simple and naturally occurring example of a hierarchical structure is in determining the
school-to-school variability of student achievement. Suppose the response is math
achievement on a standardized test. The innate structural formation describes students to
be nested within classrooms, classrooms nested within schools, and so on. This would mean that
θ_i is the mean math score for classroom i, specified by a
normal distribution for that particular classroom. Thus, this allows us to update or
modify our belief of the math ability of students in general by adjusting the likely value
of the hyperparameter γ. Gelman et al. (1995) present the advantages of hierarchical
models in terms of statistical inference. Such strengths draw upon the ability to learn
about the hyperparameter γ as well as drawing upon some information about the
particular effects θ_i.
These three level hierarchical models are not restricted to the structural formation
illustrated above but have also been used to structure statistical inference in terms of
observational and modeling error and prior information on the parameters. Furthermore,
hierarchical models are not restricted to three levels; more may be appropriate. That
judgment is somewhat left to the decision maker.
Section 10.6: Exchangeability and Other Forms of Probability
Models
One of the most fundamental concepts embedded within subjective probability modeling is
the concept of exchangeability, introduced by De Finetti. Conditional independence,
Bayesian networks, and hierarchical models give decision analysts the tools for
exploring and constructing the form of the joint distribution through a decomposition into
its marginal and conditional components; however, the independence conditions do
not suggest the parametric form of any of the conditional or marginal distributions,
whereas exchangeability can (French and Insua, 2000).
De Finetti's "twist of hypothetical permutation" suggests that rather than "judging" the
resemblance between two groups, the decision maker should "imagine" a hypothetical
exchange of the two groups and then decide whether the observed data under the swap would
be distinguishable from the actual data (Pearl, 1999). Let A and B denote two groups,
treated and untreated, and let P_A(y_1) and P_A(y_0) denote the distributions of group A under
two hypothetical conditions, treatment and no treatment. Pearl explains that if the
interest lies in some parameter μ of the response distribution, then μ_{A1} and μ_{A0} denote the
values of the parameter in the corresponding distributions P_A(y_1) and P_A(y_0) (with μ_{B1}
and μ_{B0} defined similarly for group B). Thus, in measuring the pair (μ_{A1}, μ_{B0}), the
hypothetical swap then suggests measuring (μ_{B1}, μ_{A0}).
Pearl (1999) defines two groups to be exchangeable relative to parameter μ if the two
pairs are indistinguishable, that is if
(μ_{A1}, μ_{B0}) = (μ_{B1}, μ_{A0}).
French and Insua (2000) extend this concept and state the following proposition:
A collection of random variables is exchangeable if and only if any n-tuple, with n less
than or equal to the size of the collection, has the same distribution as any other n-tuple.
Thus, "exchangeability develops the parametric form of a distribution from relatively
weak ideas of symmetry between future observables rather than from modeling ideas
derived from scientific understanding" (French and Insua, 2000). The literature also
contends that these sets of symmetry conditions are not easily identifiable, and thus
sometimes the parametric structure of the distribution is based more on intuition.
The methods illustrated in Section 10 provide a framework for the structuring of the
functional form of the decision maker’s probability distributions. In practice, many of
the methods are mixed and matched to accomplish the ultimate goal. Hence, a decision
maker may choose to build a Bayesian network, then focus on some nodes,
borrow conditions of exchangeability, and modify the structure. Either way, there are
many choices in the structuring of decision problems, and this section has introduced some of
these concepts.
Section 11: Probabilistic Graphical Models with
Markovian Properties
The framework for constructing probabilistic graphical models as DAGs is an initial step
in modeling complex decision problems; however, the formalism of hidden components
or sequential decision-making processes has been ignored thus far. This section does not
survey sequential decision making but does strive to present some of the extensions of the
graphical models presented earlier. It focuses primarily on sequential problems which
can be formulated in Markovian ways. Such applications stem from the body of theory
presented in PART III. Thus, we begin with exact inference when encountering hidden
nodes and then briefly introduce dynamic linear modeling, which underpins many
Bayesian forecasting techniques. This extends both probability modeling and the
representation of probabilistic graphical models described in the previous section. The
development of models such as Bayesian networks (belief nets) or hierarchical models
is extended to the development of neural networks, hidden Markov models, and
dynamic linear models.
Section 11.1: Inference in Probabilistic Graphical Models
The modeling of probabilistic graphs was introduced in section 10. This section extends
such structures to discuss probabilistic inference in graphical models which hold hidden
components or nodes. The problem in such inference making resides in the computation
of the conditional probability distribution over the unobserved or hidden nodes given the
observed information.
Let H denote the set of hidden nodes and E denote the set of "evidence" or observed
nodes. Thus the goal is to calculate
P(H | E) = P(H, E) / P(E).
Further, another aim is to calculate the marginal probabilities in graphical models, in
particular P(E). Notably, this is the likelihood function, which of course has a direct
relationship with the conditional probability P(H | E). Thus, these distributions are
intertwined, and hence any inference or algorithmic approaches to computing such
probabilities coincide.
As shown in section 10, when belief updating in Bayesian networks, the joint probability
distribution P(U | θ_s, S^h) is sufficient for the calculations required above. However,
because the joint probability and corresponding calculations (based on its decomposition)
increase exponentially with the number of variables or nodes, more efficient methods are
desired (Jensen, 2001). Jordan (1999) presents an array of such methods encompassing
both exact and approximation algorithms. The junction tree algorithm, an exact approach,
is introduced through the concepts of "moralization" and "triangulation".
The focus in this part of the report has been on graphical models – more specifically on
DAGs. To recap, DAGs (directed acyclic graphs) can be numerically assessed by
determining the local conditional probabilities associated with each node; that is, these
conditional probabilities give the P(X_i | par(X_i)). To determine the joint probability
distribution of all of the N nodes, the product is taken over all nodes:
P(U) = ∏_i P(X_i | par(X_i)).
Thus, probability inference in DAGs involves the computation of conditional
probabilities under this joint distribution.
This report has neglected undirected graphical models thus far, as the introduction to this
subject area was directed through Bayesian networks in a systematic approach. However,
we introduce them here. Such models are also referred to as Markov random
fields. They are numerically assessed by "associating potentials with the cliques of the
graph" (Jordan, 1999). A potential is defined as a function on the set of configurations of
a clique that associates a positive number with each configuration, and a clique is a
subset of nodes which are fully connected and maximal. To compute the joint probability
distribution for all of the nodes, the product is taken over all the clique potentials:
P(U) = ∏_{i=1}^{M} φ_i(C_i) / ∑_U ∏_{i=1}^{M} φ_i(C_i),
where M is the total number of cliques and the normalization factor (the denominator)
sums the numerator over all configurations. Figure 11.1 is an example of an undirected
graph adapted from Jordan (1999).
Figure 11.1: [an undirected graph on six nodes X_1, ..., X_6]
The cliques are C_1 = {X_1, X_2, X_3}, C_2 = {X_3, X_4, X_5}, and C_3 = {X_4, X_5, X_6}.
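The clique-potential factorization can be checked numerically. The sketch below uses the three cliques above with binary nodes; the potential function itself is made up for the example (the report only requires it to be positive):

```python
import itertools

# The three cliques of figure 11.1, as zero-based node indices for X1..X6.
cliques = [(0, 1, 2), (2, 3, 4), (3, 4, 5)]   # C1, C2, C3

def phi(values):
    """Illustrative positive potential: reward agreement within the clique."""
    return 2.0 if len(set(values)) == 1 else 1.0

def unnormalized(x):
    """Product of clique potentials for one configuration x of all six nodes."""
    p = 1.0
    for c in cliques:
        p *= phi(tuple(x[i] for i in c))
    return p

# The normalization factor sums the numerator over all 2^6 configurations.
Z = sum(unnormalized(x) for x in itertools.product((0, 1), repeat=6))

def P(x):
    return unnormalized(x) / Z

# The resulting P sums to one over all configurations.
assert abs(sum(P(x) for x in itertools.product((0, 1), repeat=6)) - 1.0) < 1e-9
```

For graphs this small the denominator can be enumerated directly; for large graphs this sum is exactly what makes exact inference expensive.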
The junction tree algorithm is commonly applied as a means of making exact inferences
and compiles directed graphical models into undirected graphical models. Jensen (2001)
provides a complete description and the sequential steps involved in this graph-theoretic
representation of the approach. Recall that it was stated that this approach requires two
main steps, "moralization" and "triangulation". These are described next.
The first stage, moralization, translates the directed graph into an undirected graph. Both
graphs use the product of the local functions to obtain the joint probability distribution
of the nodes. The directed form, P(X_i | par(X_i)), does hold the property of being a real-valued
function over the configurations; however, these nodes are not always situated
together within a clique. Thus, the objective is to "marry" the parents of each node by
linking them with undirected edges, and then simply to drop the directions on the arcs of
the graph (Jordan, 1999). The resulting graph is called a moral graph and can be
used to represent the probability distribution on the original directed graph within the
undirected structure.
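A minimal sketch of moralization, assuming the DAG is given as a parent map (the edge set here is hypothetical, not an example from the report):

```python
from itertools import combinations

# Hypothetical DAG given as node -> list of parents; C has co-parents A and B.
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}

def moralize(parents):
    """Return the undirected edge set of the moral graph."""
    edges = set()
    for child, pars in parents.items():
        # Drop directions: keep each parent-child link as an undirected edge.
        for p in pars:
            edges.add(frozenset((p, child)))
        # "Marry" the parents: connect every pair of co-parents.
        for p, q in combinations(pars, 2):
            edges.add(frozenset((p, q)))
    return edges

moral = moralize(parents)
assert frozenset(("A", "B")) in moral   # married co-parents of C
```

After moralization, each node and its parents sit together in a fully connected subset, so every local function P(X_i | par(X_i)) lives inside some clique of the undirected graph.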
The second stage of the algorithm is referred to as triangulation. This processes the
moral graph (taking it as input) and produces an undirected graph (perhaps with
additional edges). This graph holds a property which allows recursive calculation of
probabilities (Jordan, 1999). In a triangulated graph, joint probabilities can be built up
sequentially through the graph. Figure 11.2 illustrates two graphs – (a) represents a non-triangulated
graph whereas (b) demonstrates its triangulated form. Thus, if there is a
cycle of four or more nodes without a chord, the graph is not triangulated, but it can
simply be made so by adding a chord as seen below.
Figure 11.2(a): [a four-node cycle without a chord] Figure 11.2(b): [the same cycle with a chord added]
Once a graph has been triangulated, the cliques of the graph can be arranged into a data
structure known as a junction tree. The aforementioned algorithmic structure is made
possible for trees and not networks because of the independence properties. That is, if X is
instantiated then all its neighbors are independent (Jensen, 2001). To summarize,
through graph-theoretic methods, the independence properties in a network are converted
and organized into a set of sets of variables (cliques), and a tree is constructed over the
cliques, resulting in a junction tree.
Section 11.2: Neural Networks
Neural networks are layered graphs endowed with a nonlinear "activation" function at
each node (Jordan, 1999). A neural network is characterized by its: (1) architecture – its
pattern of connections between the nodes; (2) learning algorithm – its method of
determining the weights on the connections (probability distributions); and (3) activation
function – which determines the output (Ripley, 1996). Figure 11.3 below illustrates the
layered graphical structure of a neural network.
Figure 11.3: [a layered neural network graph]
Typically, a neural network consists of a large number of processing elements (nodes), or
neurons (when speaking of artificial neural networks). Each node has an internal state,
referred to as its activity level or activation which is a function of the inputs it has
received. Thus, a node sends its activation as a signal to several other nodes; however, it
can only send one “message” at a time.
Let's first consider a typical example of a simple neural network. Actually, for
simplicity, let's consider a single output node Y which receives input (activation) from n
nodes X_1, ..., X_n. From each input node j to the output node i there is an associated
weight θ_{ij}. Then to determine the output of node i, simply sum the inputs with their
respective weights (consider multiple regression with the θ_{ij} as the regression coefficients):
y_i = θ_{i0} + θ_{i1} x_1 + θ_{i2} x_2 + ... + θ_{in} x_n.
The θ_{i0} is the "bias" parameter, which allows one to change (or bias) the output
independently of the inputs. The next step is to apply an activation function, of which a
number could be used. However, a common activation function is the logistic
sigmoid function (an S-shaped curve).
Let's consider an activation function that is bounded between 0 and 1, such as that
obtained through the logistic function:
f(y) = 1 / (1 + e^{-y}).
Thus, a binary variable X_i is associated with each node in a graphical model, and the
interpretation of the activation of the node is the probability that the associated binary
variable takes one of its two values. Thus, the corresponding logistic function is:
P(X_i = 1 | par(X_i)) = 1 / (1 + exp(−θ_{i0} − ∑_{j ∈ par(X_i)} θ_{ij} X_j)).
Thus, the θ_{ij} are the parameters associated with the edges between the parent nodes j and
node i, and θ_{i0} is the bias parameter associated with node i. This is the sigmoid belief
network introduced by Radford Neal (1992). Such treatments of neural networks hold
many advantages, including the ability to perform diagnostic calculations and to handle
missing data (Jordan, 1999).
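The activation of a single node in a sigmoid belief network can be computed directly from this formula. In the sketch below the parent set, weights, and bias are illustrative, not taken from the report:

```python
import math

# One node i of a sigmoid belief network, with two binary parents X1, X2.
theta_i0 = -1.0                       # bias for node i (illustrative)
theta_ij = {"X1": 2.0, "X2": -0.5}    # edge weights from parents j to node i

def p_node_on(parent_values):
    """P(X_i = 1 | par(X_i)) under the logistic model above."""
    s = theta_i0 + sum(theta_ij[j] * x for j, x in parent_values.items())
    return 1.0 / (1.0 + math.exp(-s))

p = p_node_on({"X1": 1, "X2": 1})
# s = -1.0 + 2.0 - 0.5 = 0.5, so p = 1 / (1 + e^{-0.5})
```

The node behaves like a logistic-regression unit whose inputs are the states of its parents, which is precisely what ties the neural network view to the Bayesian network view.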
From figure 11.3 above, it is quite clear that a node in a neural network generally has
the preceding layer as all of its parent nodes. Thus, when applying the junction tree
algorithm, the moral graph links all of the nodes in this layer, as illustrated
below. Further note that by definition (Section 11.1) the output nodes are the
"observed" or "evidence" nodes, and thus the hidden nodes are probabilistically dependent,
as are any of their ancestors (Jordan, 1999). Computationally this can be quite
intensive, as the clique size due to the triangulation procedure will grow exponentially.
Section 11.3: Hidden Markov Models
This section introduces hidden Markov models, one of the simplest kinds of
dynamic Bayesian network. As suggested by the model's name, hidden Markov models
hold Markovian properties, for which we refer to PART III. Thus, instead of regurgitating
such properties, we simply relay the directed Markovian property in graphical models.
This directed Markovian property is a conditional independence property which states
that a variable is conditionally independent of its nondescendants given its parents:
X ⊥ ND(X) | par(X),
where ND(X) denotes the nondescendants of X. From section 10, it was mentioned that
the conditional probability P(X_i | par(X_i)) did not necessarily mean that if
par(X) = π* then P(X = x) = P(x | π*); that is, the suggestion was that any other
information is irrelevant for this condition to hold. In DAGs, this information refers to
the knowledge about the node itself or any of its descendants. So, we can summarize this
by saying that having information about a nondescendant does not tell us anything more
about X, because it can neither influence nor be influenced by X, directly or
indirectly (Cowell, 1999). (Or if it can be influenced indirectly, then it can only be done
through influencing the parents, which are all known anyway.)
Figure 11.4: [a hidden Markov model: a chain of hidden state nodes X with output nodes Y]
A hidden Markov model (HMM) is a graphical model in the form of a chain, as seen in
the figure above. Applying the same property, consider a sequence of multinomial
"state" nodes X_i and assume that the conditional distribution of X_i depends only on its
immediate predecessor X_{i-1}. This forms a matrix of transition probabilities
A = P(X_i | X_{i-1}) which is time invariant. The output nodes Y_i have an associated
"emission" probability law B = P(Y_i | X_i) which is also time invariant.
In determining probabilistic inferences in such graphical models, the output nodes are
treated as evidence nodes and the states as hidden nodes. The EM (expectation-maximization)
algorithm is generally applied to update the parameters A, B, π. This two-step
iterative procedure first computes the necessary conditional probabilities and then
updates the parameters via weighted maximum likelihood. Thus, the moralization and
triangulation procedures of the junction tree algorithm are innocuous or unnecessary for such
graphical models as hidden Markov models (in this form).
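The likelihood P(E) of an observation sequence under an HMM can be computed with the standard forward recursion over the chain. The two-state A, B, and π below are illustrative numbers, not an example from the report:

```python
# A two-state HMM with binary outputs (illustrative parameters).
A  = [[0.7, 0.3],      # A[i][j] = P(X_t = j | X_{t-1} = i)
      [0.4, 0.6]]
B  = [[0.9, 0.1],      # B[i][k] = P(Y_t = k | X_t = i)
      [0.2, 0.8]]
pi = [0.5, 0.5]        # initial state distribution

def likelihood(observations):
    """Forward algorithm: sum out the hidden states one time step at a time."""
    alpha = [pi[i] * B[i][observations[0]] for i in range(2)]
    for y in observations[1:]:
        alpha = [B[j][y] * sum(alpha[i] * A[i][j] for i in range(2))
                 for j in range(2)]
    return sum(alpha)

p = likelihood([0, 0, 1])
```

Because the chain structure lets each step condition only on the previous one, the cost is linear in the sequence length rather than exponential in the number of hidden nodes.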
Figure 11.5: [a factorial HMM: several parallel state chains X with common output nodes Y]
Now consider a variation on the aforementioned hidden Markov model with a number of
chains. Such a graphical model is called a factorial HMM and is shown above. In
general, the structure is composed of m chains. Hence, the state node for the mth chain at
time i is denoted X_i^{(m)} and the corresponding transition matrix for the mth chain is denoted
A^{(m)}. Thus, the overall transition probability for the full model is computed by taking
the product across the intra-chain transition probabilities:
P(X_i | X_{i-1}) = ∏_m A^{(m)}(X_i^{(m)} | X_{i-1}^{(m)}).
In this situation, the triangulation process will result in an exponential growth of cliques
as there are multiple chains to be considered. Thus, a factorial hidden Markov model
does not hold the tractable properties held by its more general formalism.
Another variation or amalgamation of two previously described models is hidden Markov
decision trees (HMDT). As the name suggests, this is a model which is essentially a
decision tree endowed with Markovian properties (Jordan, 1999). In an HMDT, the
decisions in the decision tree are not only conditional on the current data point, but are
also conditional on the decision at the previous moment in time.
Figure 11.6: [a hidden Markov decision tree]
The figure above illustrates such a model, where the horizontal edges represent the
Markovian temporal dependence. Note that the term temporal is used as it is assumed
that the model structure does not change (Murphy, 2003). Thus, given a sequence of
input vectors U_i and corresponding output vectors Y_i, the task at hand is again to compute the
conditional probability distribution over the hidden states. This is much like the
case of the FHMM and is shown to be intractable when making inferences (numerically)
(Jordan, 1999).
Several other variations of the HMM exist, including higher-order models in which each
state depends on the previous k states rather than on the single previous state, and models
with a mixture of continuous and discrete nodes. Either way, the number of cliques for
such models grows exponentially. Thus, although these models provide a graphical
representation of complex problems, they are not easily tractable, especially where exact
inference algorithms are concerned. However, many of these can still be assessed
using approximation methods, which are discussed generally in section 11.5.
Section 11.4: Linear Dynamic Models
Another important type of dynamic Bayesian network is the class of models for decision
problems which has arisen out of the area of control of stochastic systems within the
engineering field. Essentially, the objective is to keep systems running over a number of stages along
target trajectories of states (French and Insua, 2000). They have also been developed for
the problem without controls, known as dynamic linear models (DLM). Their
importance resides in their formulation as a tool for Bayesian forecasting and sequential
procedures.
Linear dynamic models have the same topology as HMMs, but all the nodes are assumed
to have linear-Gaussian distributions. Let's consider the basic structure:

y_t | θ_t ~ N(F_t' θ_t, V_t)
θ_t | θ_{t-1} ~ N(G_t θ_{t-1}, W_t)

where y_t is scalar (data) and θ_t is the state vector. The model is defined by four
quantities {F_t, G_t, V_t, W_t} which are assumed to be known at time t: F_t is an
n-vector, G_t is an n × n matrix, V_t is a nonnegative scalar, and W_t is an n × n
symmetric and positive (semi)definite variance matrix. A more general way of stating
this model is:

Observation equation: y_t = F_t' θ_t + v_t,  v_t ~ N(0, V_t)
System equation: θ_t = G_t θ_{t-1} + w_t,  w_t ~ N(0, W_t)
Figure 117: [a DLM drawn as an HMM with Gaussian nodes]
The best way to think of a DLM is through its graphical structure, as shown above. This
graph has two noticeable features. The first is that the state vector θ_t follows a Markov
process, so that θ_k ⊥ θ_j | θ_{k-1} for all j ≤ k − 1 ≤ t. This implies that w_j ⊥ w_k. The
second is that θ_k separates y_k from everything else, so that y_k ⊥ y_j, θ_j | θ_k. This
further implies that v_k ⊥ w_k and v_k ⊥ v_j, w_j (West et al., 1997).
Let D_k = {y_1, ..., y_k} represent the data, where each y_k takes on either a numeric or a
missing value. Then the prior beliefs about the state vector can be written as:

Prior information: θ_0 | D_0 ~ N(m_0, C_0).

The DLM allows us to compute m_k and C_k from m_{k-1}, C_{k-1}, and y_k. In this way, the
information sets D_k are defined recursively and can be computed through stochastic
dynamic programming algorithms, although the computations can become daunting.
Nonetheless, these models do provide a highly effective means of modeling univariate
time series (French and Insua, 2000).
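The recursion from (m_{k-1}, C_{k-1}) and y_k to (m_k, C_k) described above can be sketched as one step of the standard DLM (Kalman) filtering update. This is an illustrative sketch under the notation of the text, treating F, G, V, W as constant; the function name `dlm_update` and the numerical values are hypothetical.

```python
import numpy as np

def dlm_update(m_prev, C_prev, y, F, G, V, W):
    """One DLM filtering step: (m_{k-1}, C_{k-1}, y_k) -> (m_k, C_k).
    F is an n-vector, G an n x n matrix, V a scalar observation variance,
    and W an n x n system variance matrix, as in the text."""
    # Prior at time k: theta_k | D_{k-1} ~ N(a, R)
    a = G @ m_prev
    R = G @ C_prev @ G.T + W
    # One-step forecast: y_k | D_{k-1} ~ N(f, Q)
    f = F @ a
    Q = F @ R @ F + V
    # Adaptive gain, then posterior moments at time k
    A = R @ F / Q
    m = a + A * (y - f)
    C = R - np.outer(A, A) * Q
    return m, C

# Hypothetical 1-dimensional state (random walk plus noise): n = 1.
m, C = dlm_update(np.array([0.0]), np.array([[1.0]]), y=1.0,
                  F=np.array([1.0]), G=np.array([[1.0]]),
                  V=1.0, W=np.array([[0.5]]))
```

Iterating this function over y_1, ..., y_k produces the recursively defined information sets D_k noted in the text.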
Section 11.5: Variational Methods for Dynamic Bayesian Networks
The junction tree algorithm and dynamic programming algorithms provide solutions for
probabilistic inference in graphical models. However, as we have seen in this section,
other approaches become crucial for many forms of dynamic Bayesian networks as
they grow in complexity. Even in cases where exact algorithms are manageable, it is still
important to consider approximation methods.
A graphical model specifies a complete joint probability distribution over all the nodes.
Given the joint probability distribution, all possible inference queries can be answered
by computing marginal distributions. Computing several marginal distributions at the same
time can be done using dynamic programming, which avoids the redundant computation
required by other marginalization procedures. However, much like the junction tree
algorithm, the cost grows exponentially with the size of the largest subset of nodes.
Approximation algorithms provide an alternate solution. Jordan (1999) provides an
overview of some of the valuable and existing approximation methods. Variational
methods and sampling methods are two approximation approaches readily used.
Variational methods generally provide bounds on probabilities of interest (Jordan,
1999). The simplest example is the mean-field approximation, which exploits the law of
large numbers to approximate large sums of random variables by their means (Murphy,
2003). This procedure requires decoupling all the nodes and updating the
variational parameters, formulating a lower bound on the likelihood. The objective of such
methods is to convert a complex problem into a simpler one.
Since the basic idea (stemming from above) is to simplify the joint probability
distribution by transforming the local probability functions, an appropriate choice can
lead to a simplified inference problem. Transforming some of the nodes may allow for
some of the original graphical structure to remain intact or introduce a new graphical
structure to which exact methods may be applied. The sequential approach makes use of
variational transformations and aims to transform the network until the resulting structure
is amenable to exact methods (Jordan, 1999; French and Insua, 2000). Other similar
approaches exist and have shown to be fruitful in accomplishing the inference task. This
has been touched on briefly through some of the examples in this section.
Another approach to the design of approximation algorithms involves sampling (Monte
Carlo) methods. Markov chain Monte Carlo (MCMC) methods are an efficient
approach, and many of these algorithms have been applied to the inference problem in
graphical models. Their simplicity of implementation and theoretical guarantees of
convergence make them advantageous; however, they can be slow to converge, and it can
be hard to diagnose their convergence (Jordan, 1999).
The general inference problem is the computation of the conditional distribution over the
hidden values given the observed information. There is a lot of literature in this area and
especially on algorithms to accomplish this task. Some of these have been briefly
discussed in this section. These various approaches to inference are not mutually
exclusive, and in many instances, an amalgamation of such methods may prove to be the
best solution for a given graphical model. Some of the more common methods are
described in more detail in Part V to illustrate analysis procedures.
Part V: Analytic & Computational Methods for Decision Problems
122
Part V: ANALYTIC &
COMPUTATIONAL METHODS
FOR DECISION PROBLEMS
The complexity of decision problems has led to the rapid development of methodological
solutions to these problems. The graphical structure of probabilistic models has also
added a more intuitive approach to the problem at hand. These developments have also
led to the advancement of computational methods to deal with the complexities of
numerical assessment which arise in these problems. The task at hand is probabilistic
inference, whether it lies in graphical models or in expected losses. Recall that French
and Insua (2000) commented on feasibility and robustness, which relate to the
practicality and implications of decision processes. Thus, computational
methods which meet these criteria are essential to decision modeling.
Earlier parts of this report introduced the framework for modeling decision problems and
introduced the decision theoretic aspects. PART IV of this report focused on building
models to illustrate decision problems with an emphasis on probabilistic graphical
models. This part of the report concludes this paper by discussing some of the
computational issues which arise in decision analysis. Computational methods including
Monte Carlo methods and dynamic programming are introduced as both feasible and
efficient approaches for the analytic assessment of decision problems. These analytic
concepts are the focus of the beginning sections. The last section is an application of the
typical methodology used in a prototypical example. This concludes the methodological
issues which arise in decision problems.
Section 12: Decision Theoretic Computations
Part I and Part IV introduced the framework for decision models and the construction of
graphical models, respectively. Both approaches introduced the computational issues
which arise. Many of these issues reside in the calculation and/or minimization of
expectations over probability distributions. Some of the methodological approaches were
introduced but never developed. This section reintroduces the problem and begins to
address algorithms which aim to overcome some of these computational constraints. A
review of analytic approaches and numerical integration is presented, but these demonstrate
limitations which are resolved through Monte Carlo methods. Although this section
concentrates on Bayesian decision theoretic computations, other statistical decision
theoretic approaches face the same computational problems; hence these methods also provide
insight into solving the computational issues which arise through the classical approach.
Section 12.1: Computational Issues in Decision Problems
The first half of this report focused on the construction and development of decision
models. Recall that the loss function l(θ, a) is one of the key attributes of decision
theory. Moreover, the process of calculating and minimizing expected losses is the key
objective. This section addresses the computational issues which arise in the modeling of
decision problems, specifically with respect to these objectives.
French and Insua (2000) state that there are two objectives which need to be solved:

min_{a ∈ A} E[l(θ, a) | x] = min_{a ∈ A} ∫ l(θ, a) p(θ | x) dθ

and

min_{a ∈ A} ∫ l(θ, a) p(θ | x, a) dθ.

The former is simply the calculation of an expectation. In the latter case, the probability
model involves dependencies on a, which occur in influence diagram and decision tree
formulations. If the optimal action is assumed known, only the explicit computation
of the posterior probability is required. Thus the solution of these decision problems
requires the computation of a posterior expectation.
In most instances, the integrals involved are difficult to evaluate analytically, and hence
more powerful integration methods need to be implemented. This section reviews
analytical and numerical approaches before introducing simulation, or Monte Carlo,
methods. The limitations of the analytic and numeric approaches highlight the need
for more computationally efficient and feasible methods. These methods are
not without their own drawbacks, however, which are also addressed. The latter part of this
section attends to combining the issues of optimization and integration. Although there is a focus on the
Bayesian decision theoretic computations, these issues are analogous to those of other
statistical decision theoretic approaches, and hence the discussion provides insightful possibilities.
Section 12.2: The Bayesian Framework
In addressing the computational issues within a Bayesian framework, it is necessary
to describe the Bayesian perspective in order to fully understand the issues which arise in
such methodology. Thus, we begin with an overview of Bayesian data analysis.
The essential characteristic of Bayesian methods is their explicit use of probability
models for quantifying uncertainty in inferences based on statistical data analysis. It can
be conceptualized by dividing it into the following three steps: (1) setting up a full
probability model; (2) conditioning on the observed data; and (3) evaluating the fit of the
model and the implications of the resulting posterior distribution (Gelman et al. 1995).
Setting up a full probability model entails the construction of a joint probability
distribution for all observable and unobservable quantities in a problem. The model
should be consistent with knowledge about the underlying scientific problem. The
second step involves calculating and interpreting the appropriate posterior distribution –
the conditional probability distribution of the unobserved quantities of ultimate interest,
given the observed data. Lastly, questions addressing “does the model fit the data, are the
substantive conclusions reasonable, and how sensitive are the results to the modeling
assumptions in the first step” (Gelman et al. 1995) should be tended to.
Contextually, it is often easier (with the Bayesian approach) to build highly complex
models when such complexity is realistic. The central feature to Bayesian inference is
the direct quantification of uncertainty which in principle allows models with multiple
parameters and complex layers to be fit (Gelman et al. 1995) with a conceptually simple
method. The purpose of this report is not to glorify Bayesian inference over the
frequentist approach as there is no need to choose exclusively between the two
perspectives nor is it without its constraints. However, it is the purpose of this section to
introduce and outline the general concept of the Bayesian paradigm. Before launching
into a discussion on computational advances in Bayesian methods, a review of some
terminology and standard notation and a formal presentation of the problem are provided.
The task in decision problems has been to make probabilistic inferences. Statistical
inference refers to the process of drawing conclusions about a population on the basis of
measurement or observations made on a sample of individuals from the population
(Everitt, 2002). From a Bayesian perspective, there is no fundamental distinction
between observables and parameters of a statistical model, as they are all considered
random quantities. Herein lies the task: the quantification of such uncertainty.
Recall that Bayesian data analysis involves (first) setting up the full (joint) probability
model. Let D denote the observed data, and θ denote model parameters and missing data.
The joint distribution is comprised of two parts: a likelihood P(D | θ) and a prior
distribution P(θ). To summarize, the likelihood P(D | θ) describes the process
which gives rise to the data D in terms of the unknown parameters θ, while the prior
distribution P(θ) expresses what is known about θ prior to observing the data.
Specifying P(D | θ) and P(θ) gives a full probability model, in which

P(D, θ) = P(D | θ) P(θ).
The second step in Bayesian data analysis applies Bayes' Theorem to
determine the distribution of θ conditional on D, after observing the data:

P(θ | D) = P(θ) P(D | θ) / ∫ P(θ) P(D | θ) dθ.
Applying Bayes' Theorem yields the posterior distribution of θ, which is the object of
all Bayesian inference (Gilks et al. 1996).
Moments, quantiles and other features of the posterior distribution can be expressed in
terms of posterior expectations of functions of θ. The posterior expectation of a function
f(θ) is

E[f(θ) | D] = ∫ f(θ) P(θ) P(D | θ) dθ / ∫ P(θ) P(D | θ) dθ.
The integration of this expression has until recently been the source of most of the
practical difficulties in Bayesian inference. The problem of calculating expectations in
high-dimensional distributions arises in both Bayesian inference and some areas of
frequentist inference (Geyer, 1995). To avoid an unnecessarily Bayesian “flavor” to this
discussion, the problem is restated in more general terms.
Let X be a vector of k random variables with distribution π(·). In Bayesian applications,
X will comprise both model parameters and missing data; in frequentist applications, it
may comprise data or random effects (Gilks et al. 1996; Draper, 1997). For Bayesians,
π(·) will be a posterior distribution (as described above), while for frequentists it will
refer to a likelihood. Either way, the task is to evaluate the expectation

E[f(X)] = ∫ f(x) π(x) dx / ∫ π(x) dx

for some function of interest f(·).
A common situation arises in which the normalizing constant (the denominator of the
above equation) is unknown. In Bayesian practice, P(θ | D) ∝ P(θ) P(D | θ); however, the
normalizing constant ∫ P(θ) P(D | θ) dθ is not easily evaluated, and this is the focus of the
next section. Here, it has been assumed that X takes values in k-dimensional space, i.e. it
is comprised of k continuous random variables. Although this problem has
been described generally, X could consist of discrete random variables (in which case the
integral would be replaced by a summation) or of a mixture of both continuous and
discrete random variables.
Section 12.3: Integration Methods
As discussed in the previous section, the difficulty and the interest in Bayesian inference
lie in evaluating expectations ∫ f(x) π(x) dx for various functions f(x). Integration is
usually simple in conjugate models – priors and posteriors are members of the same
parametric family of distributions, and thus posteriors are easily determined (French and
Insua, 2002). Even for conjugate models, there may be quantities (moments or probabilities)
for which the solution cannot be computed explicitly.
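As an added illustration of the conjugacy just described (not an example from the report), consider the familiar Beta-binomial pair, where the posterior and its moments are available in closed form; the helper name below is hypothetical.

```python
def beta_binomial_posterior(a, b, s, n):
    """Beta(a, b) prior + Binomial(n, theta) likelihood with s successes
    gives a Beta(a + s, b + n - s) posterior (conjugacy), so posterior
    moments need no numerical integration."""
    a_post, b_post = a + s, b + n - s
    post_mean = a_post / (a_post + b_post)
    return a_post, b_post, post_mean

# Uniform prior Beta(1, 1) and 7 successes in 10 trials:
# the posterior is Beta(8, 4) with mean 8/12 = 2/3.
a_post, b_post, mean = beta_binomial_posterior(1, 1, 7, 10)
```

For a non-conjugate prior, the same posterior mean would already require one of the integration methods discussed next.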
In the last decade alone, there have been a number of advances in numerical (Tierney and
Kadane, 1986) and analytic (Smith et al. 1987) approximations for such calculations,
largely due to increased computing power. Some alternatives, such as
numerical evaluation, are difficult and inaccurate in k > 20 dimensions; analytic
approximations such as the Laplace approximation are sometimes appropriate
(Gilks et al. 1996). Numerical quadrature methods for approximating integrals, such as
the common trapezoid and Simpson's rules (CSC260H, 2001), are commonly applied.
Better rules are obtained when the form of the integrand is taken into account, such as
Gaussian quadrature methods, which approximate well when the density function has a
normal kernel (French and Insua, 2002). Such procedures also extend into k dimensions.
Asymptotic methods based on asymptotic normality arguments also exist. However,
integration strategies based on these arguments are only valid if sample sizes are large
and other necessary conditions apply. These methods are efficient in problems with
specific structure and/or low dimensionality.
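As a sketch of one-dimensional quadrature of the kind mentioned above, the composite trapezoid rule can approximate both a normalizing constant and a posterior mean for an unnormalized density. The Gaussian kernel, grid, and function names here are illustrative assumptions, not taken from the report.

```python
import numpy as np

def trapezoid(f_vals, x):
    """Composite trapezoid rule on a grid x with function values f_vals."""
    return np.sum((f_vals[1:] + f_vals[:-1]) / 2 * np.diff(x))

# Unnormalized posterior: a Gaussian kernel exp(-(theta - 2)^2 / 2),
# evaluated on a grid wide enough that the tails are negligible.
x = np.linspace(-10.0, 14.0, 10001)
kernel = np.exp(-(x - 2.0) ** 2 / 2)

Z = trapezoid(kernel, x)                  # normalizing constant, near sqrt(2*pi)
post_mean = trapezoid(x * kernel, x) / Z  # posterior mean, near 2
```

The same scheme in k dimensions needs a grid of size n^k, which is exactly the dimensionality limitation noted in the text.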
In more complex problems, simulation or Monte Carlo integration is required. Monte
Carlo methods are often the method of choice for high-dimensional problems, but
it is important to note that many of these computational methods complement rather
than compete with each other (Tierney and Mira, 1999).
Monte Carlo Integration
The basic objective of the Monte Carlo approach is to draw an approximate i.i.d. sample
from π(·) which can then be used to compute sample averages as approximations to
population averages (STA4276H, 2003). Although this is an oversimplification of the
approach, it introduces the concept concisely. More formally, Monte Carlo integration
evaluates E[f(X)] by drawing samples {X_t, t = 1, ..., n} from π(·) and then approximating
E[f(X)] ≈ (1/n) Σ_{t=1}^{n} f(X_t).
So, the population mean of f(X) is estimated by a sample mean; this sample average has
some basic properties which are very important. When the samples {X_t} are
independent, the Law of Large Numbers ensures that the approximation can be made as
accurate as desired by increasing the sample size n (Gilks et al. 1996). Note that n is under
the control of the analyst; it is not the size of a fixed data sample.
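The sample-average approximation above can be sketched directly. This hypothetical example estimates E[X²] = 1 for X ~ N(0, 1) from i.i.d. draws; the function name, seed, and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, n):
    """Monte Carlo integration: approximate E[f(X)] by the sample mean
    of f over n draws from the target distribution."""
    samples = sampler(n)
    return np.mean(f(samples))

# E[X^2] = Var(X) = 1 for a standard normal; the error shrinks as
# O(1/sqrt(n)), and n is under the analyst's control.
est = mc_expectation(lambda x: x ** 2, lambda n: rng.standard_normal(n), 100_000)
```

Increasing `n` here illustrates the point made in the text: unlike a fixed data sample, the Monte Carlo sample size can be grown until the approximation is as accurate as desired.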
The problem at hand is to simulate observations from a posterior distribution, obtained
via Bayes' Rule as

P(θ | X) = P(θ) P(X | θ) / ∫ P(θ) P(X | θ) dθ.
In general, drawing samples {X_t} from π(·) = P(θ | X) is not feasible, since π(·) can be
quite complex and nonstandard. However, the samples {X_t} need not be independent; they
can be generated by any process which (loosely) draws samples throughout the support
of π(·) in the correct proportions (Gilks et al. 1996). One way this is accomplished is
through a Markov chain having π(·) as its stationary distribution.
Let X_1, X_2, ..., X_t, ... be a sequence of random variables which take outcomes from some
set S. Also, assume that the conditional distribution of X_t given all previous
observations X_{t-1}, X_{t-2}, ... depends only on the last observation. That is, if the current
value is known, then the prediction of the next value does not require the previous values
(CHL5223H, 2003). This property is known as the Markovian property, and such a
sequence of random variables is known as a Markov chain. More formally: suppose a
sequence of random variables {X_t, t = 1, ..., n} is generated such that at each
time t ≥ 1, the next state X_{t+1} is sampled from a distribution P(X_{t+1} | X_t) which depends
only on the current state of the chain, X_t. Thus, to summarize, given X_t, the next state
X_{t+1} does not depend further on the history of the chain {X_1, X_2, ..., X_{t-1}}.
An important issue in basic Markov chain theory is how the starting value of the chain,
X_1, affects X_t. Writing the distribution of X_t | X_1 as P^(t)(X_t | X_1), with the
intervening variables {X_2, X_3, ..., X_{t-1}} marginalized out, makes explicit that X_t
depends directly on X_1. Subject to some conditions, the chain will gradually 'forget' its
initial state, and P^(t)(· | X_1) will eventually converge to a unique limiting distribution
(Gilks et al. 1996). That is, X_t settles down to a limiting distribution no matter what the
starting value X_1 is.
There are some basic properties of Markov chains which should be stated for
completeness. For a finite set of states S, the relevant properties are aperiodicity and
irreducibility; for an infinite set of states, the additional property of an invariant
distribution is required (CHL5223H, 2003). Thus, for a finite Markov chain, the limiting
distribution exists if the chain is aperiodic (not cyclic) and irreducible (it is possible to
move from any state to any other), while infinite Markov chains require the additional
property of an existing invariant, or stationary, distribution. Stationarity of a distribution
simply refers to the limit as n → ∞ being π(·).
Thus, as discussed earlier, as t increases the sample points {X_t} look increasingly like
dependent samples from the stationary distribution, denoted φ(·), after a long burn-in
of m iterations, so that t = m + 1, ..., n. The output from the Markov chain is used to
estimate the expectation E[f(X)], where X has distribution φ(·). The important by-product
is that the "path averages" converge to the expected value under the limiting distribution,
where the burn-in samples are usually discarded in the calculation of the estimator:

f̄ = (1/(n − m)) Σ_{t=m+1}^{n} f(X_t) → E[f(X)] as n → ∞.

The convergence to the expectation is confirmed by the ergodic theorem, whereby the
above estimator is referred to as the ergodic average (Gilks et al. 1996). Therefore these
path averages can be used to estimate expected values of the limiting distribution.
(See Part III for a discussion on Markov chains and their properties.)
Section 12.4: Markov Chain Monte Carlo Methods
Recall that it was stated that the basic objective of the Monte Carlo approach was to draw
an approximate i.i.d. sample from π(·) which could then be used to compute sample
averages as approximations to population averages. Since direct i.i.d. sampling is rarely
possible, a sample from a similar distribution is obtained, and a Markov chain having π(·)
as its unique invariant distribution is constructed (Tierney and Mira, 1999). These are the
concepts which formulate the Markov chain Monte Carlo (MCMC) method.
As explained, Monte Carlo techniques can generate random variables having certain
discrete distributions. However, once the distribution becomes too complex, Markov chains
can be used. If a Markov chain is constructed whose stationary distribution is the same as
the desired probability distribution π(·) (the target distribution), then the Markov chain can
be run for a long time, say m iterations. The probability that the chain is in state i will then
be approximately the same as the probability that the discrete random variable equals i. The
idea behind the method is simple: a Markov chain with invariant
distribution π(·) is run, and sample path averages are used to approximate expectations
under π(·). If the chain is irreducible and the expectations exist, then the sample path
averages converge to the expectations. The problem of finding a Markov chain with
the desired stationary distribution still remains.
The ergodic average shows how a Markov chain can be used to estimate E[f(X)], where
the expectation is taken over its stationary distribution φ(·). This would seem to provide
the solution to the initial problem; however, first a Markov chain must be constructed such
that its stationary distribution φ(·) is precisely the distribution of interest π(·). There are
two widely established ideas, presented in the next two subsections – dimension reduction
by conditioning (the Gibbs sampling approach) and acceptance and rejection (the
Metropolis-Hastings algorithm).
The practical appeal of simulation methods, including MCMC, is that, given a set of
random draws from the posterior distribution, one can estimate all summary statistics
from the posterior distribution directly from the simulations. MCMC methods have been
widely successful because they allow one to draw simulations from a wide range of
distributions, including many that arise in statistical work, for which simulation methods
were previously much more difficult to implement. To apply the MCMC method in a
particular problem, a sampler has to be constructed. There are two central ideas for
building samplers, embodied in the two most common methods – the Gibbs sampler
and the Metropolis-Hastings algorithm – which are described in the following subsections.
These algorithms are not in competition with each other and in fact complement each
other. The Metropolis-within-Gibbs algorithm is identified as a conceptual blend in
which one parameter is updated at a time; this was the initially proposed algorithm and
can be automated (Gilks et al. 1996).
Gibbs Sampling
The Gibbs sampling algorithm holds the central idea of "dimension reduction by
conditioning" (Tierney and Mira, 1999) and is best described mathematically by its
transition kernel (Gilks et al. 1996). Suppose π(·) is the joint distribution of the
d-dimensional vector X = (X_1, ..., X_d). Each component is updated conditional on the
current values of all the other components:

π(X_i | X_{-i}) = π(X_i | X_1, ..., X_{i-1}, X_{i+1}, ..., X_d).

These distributions are called full conditionals. The transition kernel describes the
density of moving from one point X to another point Y (Gilks et al. 1996; Brooks et al.
1998):

K(X, Y) = ∏_{i=1}^{d} π(Y_i | {Y_j, j < i}, {X_j, j > i}).

This is a product of the conditional densities of the individual steps required to produce
one iteration of a d-dimensional Gibbs sampler. A thorough theoretical discussion is
omitted for brevity's sake.
The Gibbs sampling algorithm is quite simple in concept, and thus a simple example is
featured below. Let θ be 3-dimensional, comprising the parameters of a Bayesian model.
The algorithm follows:

1. Start with a set of initial values θ^(0).

2. Given the mth update, obtain the (m+1)th sample by:

θ_1^(m+1) ~ f(θ_1 | θ_2 = θ_2^(m), θ_3 = θ_3^(m), data)
θ_2^(m+1) ~ f(θ_2 | θ_1 = θ_1^(m+1), θ_3 = θ_3^(m), data)
θ_3^(m+1) ~ f(θ_3 | θ_1 = θ_1^(m+1), θ_2 = θ_2^(m+1), data)

3. The joint posterior is invariant; therefore:

f̄ = (1/(n − m)) Σ_{t=m+1}^{n} f(θ^(t)) → E[f(θ)] as n → ∞.
It should be noted that when this sampler was developed by Geman and Geman (1984),
its purpose was to sample from a Gibbs distribution, which is a distribution over a lattice;
since its inception, however, it has been extended to many other distributions, and thus
the name "Gibbs sampling algorithm" is a bit of a misnomer. The Gibbs sampler is a special
case of the Metropolis-Hastings algorithm, which is explained next.
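The three-step scheme above can be sketched concretely. As an added example (not from the report), consider a bivariate normal target with correlation ρ, whose full conditionals are the univariate normals X_1 | X_2 ~ N(ρX_2, 1 − ρ²) and symmetrically; the function name, seed, and settings are illustrative.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n, burn_in, rng):
    """Gibbs sampling for (X1, X2) ~ N(0, [[1, rho], [rho, 1]]).
    Each step samples one coordinate from its full conditional."""
    x1, x2 = 0.0, 0.0                  # arbitrary starting values
    sd = np.sqrt(1 - rho ** 2)
    draws = []
    for t in range(n):
        x1 = rng.normal(rho * x2, sd)  # X1 | X2 ~ N(rho * X2, 1 - rho^2)
        x2 = rng.normal(rho * x1, sd)  # X2 | X1 ~ N(rho * X1, 1 - rho^2)
        if t >= burn_in:               # discard the burn-in samples
            draws.append((x1, x2))
    return np.array(draws)

rng = np.random.default_rng(1)
draws = gibbs_bivariate_normal(rho=0.8, n=20_000, burn_in=1_000, rng=rng)
# Ergodic (path) averages approximate the target's moments, e.g. the
# sample correlation of the draws is close to 0.8.
```

The retained draws are serially correlated, as the text notes, but their path averages still converge to expectations under the target by the ergodic theorem.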
Metropolis-Hastings Algorithm
The notion that if the chain is irreducible and the expectations exist, then the sample path
averages converge to the expectations, is the basic idea proposed by Metropolis et al.
(1953) and extended by Hastings (1970). The central idea of the Metropolis-Hastings
algorithm is acceptance (of proposals) and rejection (Tierney and Mira, 1999).
This algorithm proposes a new point on the Markov chain, which is either accepted or
rejected. If the point is accepted, the Markov chain moves to the new point; if it is
rejected, the Markov chain remains in the same state. By choosing the acceptance
probability correctly, a Markov chain which has π(·) as a stationary distribution is created.
Let S be a state space with probability distribution π(·) on S. Then choose a proposal
distribution {q_ij : i, j ∈ S} with q_ij ≥ 0 and Σ_{j∈S} q_ij = 1 for each i ∈ S. Given that
X_n = i, the next state X_{n+1} is computed as the MH algorithm follows:

1. Choose Y_{n+1} = j according to the proposal distribution {q_ij : i, j ∈ S}.

2. Set

α_ij = min( π(j) q_ji / (π(i) q_ij), 1 ).
3. With probability α_ij, let X_{n+1} = Y_{n+1} = j (accept the proposal); otherwise, let
X_{n+1} = X_n = i (reject the proposal).
As desired, this creates a chain with stationary distribution π(·). The Metropolis-Hastings
algorithm can also be used for continuous random variables by using densities
and continuous proposal distributions, but this is not explored further here. It is also possible
to extend the concept to the multidimensional case. The Gibbs sampler is a special case
of the multivariate Metropolis-Hastings algorithm in which the proposals q are the full
conditionals and the acceptance probability is always 1, as explained above.
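The three steps above can be sketched for a hypothetical discrete target π on a small state space. With a symmetric neighbour proposal (q_ij = q_ji for i ≠ j), the acceptance probability reduces to min(1, π(j)/π(i)); all names and values below are illustrative additions, not from the report.

```python
import numpy as np

def metropolis_discrete(pi, n, rng):
    """Metropolis-Hastings on states {0, ..., K-1} with target pi.
    Proposal: step to a neighbour, clamped at the ends. Off-diagonal
    proposal probabilities are symmetric, so alpha = min(1, pi[j]/pi[i])."""
    K = len(pi)
    x = 0                     # arbitrary starting state
    counts = np.zeros(K)
    for _ in range(n):
        j = min(K - 1, max(0, x + rng.choice((-1, 1))))  # step 1: propose Y = j
        alpha = min(1.0, pi[j] / pi[x])                  # step 2: acceptance prob
        if rng.random() < alpha:
            x = j             # step 3: accept and move ...
        counts[x] += 1        # ... or reject and stay at the current state
    return counts / n

rng = np.random.default_rng(2)
pi = np.array([0.1, 0.2, 0.3, 0.4])
freq = metropolis_discrete(pi, 100_000, rng)
# The chain's occupation frequencies approximate the target pi.
```

Only ratios π(j)/π(i) are ever evaluated, which is why the method applies even when the normalizing constant of π(·) is unknown, as in the posterior distributions of § 12.2.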
Convergence Issues
A properly derived and implemented MCMC method will produce draws from the joint
posterior density once it has converged to its stationary distribution φ(·). As a result, a
primary concern in applying such methodology is determining that the chain has in fact
essentially converged; that is, that after an initial burn-in period (designed to remove the
dependence of the simulated chain on its starting location), all further samples may be
safely thought of as coming from the stationary distribution. This is complicated by two
factors. First, since what is produced at convergence is not a single value but a random
sample of values, the natural variability in the converged chain must somehow be
distinguished from the typically greater variability of the pre-convergence samples.
Second, since the sampled results come from a Markov chain, they will typically be
serially correlated.
Although MCMC methods have effectively revolutionized the field of Bayesian statistics over the past decade, any inference based upon MCMC output relies critically upon the assumption that the Markov chain being simulated has achieved a steady state or “converged”. Many techniques have been developed for trying to determine whether or not a particular Markov chain has converged, and the paper by Brooks and Roberts (1998) reviews some of these methods with an emphasis on the mathematics underpinning these techniques in an “attempt to summarize the current ‘state-of-play’ for convergence assessment”. These are not elaborated here for brevity.
A round table discussion held at the Joint Statistical Meetings in August of 1996 with a panel of experts in the field published parts of the discussion, in which they discuss more practical applications and the procedures they follow. Of the various diagnostics available, these experts mentioned that autocorrelation plots and trace plots are amongst the most useful and practical. The latter is simply a plot tracing the path of the parameter, while the former displays the chain’s internal (serial) correlations. These diagnostics are often used to estimate the degree of mixing in a simulation, which is the extent to which a simulated chain traverses the entire parameter space (Kass et al. 1998).
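A minimal sketch of the quantity an autocorrelation plot displays; the AR(1)-style chain and all constants below are invented for illustration.

```python
import random

def autocorr(xs, lag):
    # Sample lag-k autocorrelation of a simulated chain: the quantity an
    # autocorrelation plot displays for k = 1, 2, ...
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag)) / n
    return cov / var

# An AR(1)-style chain x_t = 0.9 x_{t-1} + noise mimics the serial
# correlation of MCMC output (all constants are invented).
random.seed(0)
x, sim = 0.0, []
for _ in range(10_000):
    x = 0.9 * x + random.gauss(0.0, 1.0)
    sim.append(x)

r1 = autocorr(sim, 1)     # high: successive draws are strongly dependent
r50 = autocorr(sim, 50)   # near zero: draws 50 apart are nearly independent
```

Slow decay of these coefficients signals poor mixing, which is what the autocorrelation plot makes visible at a glance.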
Section 12.5: Integrating Integration & Optimization Methods
To summarize, some of the aspects addressed in the introductory section on Bayesian
framework (§ 12.2) such as awkward posterior distributions and further distributional
complexity introduced by model parameters would otherwise require subtle and sophisticated analytic approximation; moreover, dimensionality problems typically put the implementation of sophisticated approximation techniques such as those introduced in § 12.3 out of reach. Markov chain Monte Carlo methods ease many of these constraints, which have otherwise limited Bayesian data analysis, while also providing general functions of model parameters and alleviating much awkward predictive inference. Thus, the problem of computing expectations or making probabilistic inferences is, in this instance, remedied.
In problems with optimization requirements, such as general loss functions, such computational effort is not sufficient. That is, decision theoretic computations require combining problems of optimization and integration. For general loss functions, the objective is to determine the alternative a* which minimizes the expected loss. There are two operations: (1) compute the expected loss; and (2) minimize the expected loss. French and Insua (2000) review several optimization methods which are useful within statistical decision theory. It is also important to note that some of the integration or sampling methods previously described also have an optimization step, for instance to compute maximum likelihood estimators. Thus, it seems natural to use Monte Carlo optimization methods to solve such problems. Adaptive methods which try to accomplish such a task have been developed; however, they are not without their limitations.
An alternative approach was introduced by Shao (1989) in the statistical decision theory literature and is referred to as sample path optimization. The strategy is as follows:
1. Select a sample θ_1, ..., θ_m ~ p(θ | x).
2. Solve the optimization problem
min_{a∈A} (1/m) ∑_{i=1}^{m} l(a, θ_i),
yielding a_m(θ).
By the law of large numbers, a_m(θ) converges to a minimum expected loss alternative a*. This approach can be extended to any form of the loss function and thus presents some alternatives to the computational problem.
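The two steps can be sketched under assumed ingredients: a Normal(1, 1) posterior, quadratic loss, and a grid of actions, so that the exact Bayes action is the posterior mean, 1.

```python
import random

# Sample path optimization sketch (after Shao, 1989). Assumptions for
# illustration: posterior p(theta | x) ~ Normal(1, 1), quadratic loss
# l(a, theta) = (a - theta)^2, and a grid A of candidate actions.
random.seed(1)
m = 5_000
thetas = [random.gauss(1.0, 1.0) for _ in range(m)]   # step 1: sample from p(theta | x)

def avg_loss(a):
    # (1/m) * sum_i l(a, theta_i), the sample-path objective
    return sum((a - th) ** 2 for th in thetas) / m

A = [i / 100 for i in range(-200, 401)]               # action grid on [-2, 4]
a_m = min(A, key=avg_loss)                            # step 2: minimise over A
# a_m approximates the minimum expected loss alternative a* = 1
```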
This section has introduced some of the main computational tools used in statistical
decision theory. These problems require both powerful integration and optimization
methods as well as algorithms that achieve both simultaneously.
Section 13: Analytic Approaches to Solving Decision Problems
Part I, Part II and Part IV introduced the framework for decision models and the construction of graphical models, respectively. The previous section emphasized some of the computational issues which arise in decision problems. However, that discussion stood apart from the sequential framework described in Part III, which is inherent in decision tree structures. This section develops the concepts introduced in earlier sections and relates them to more analytic approaches to solving the problems. The emphasis is placed on dynamic programming and its importance as a computational tool for decision problems. Sensitivity analysis is addressed last, yet it is important as a means of checking the sensitivity of the output, and the implications of the findings and possible inconsistencies.
Section 13.1: Dynamic Programming
The computational methods presented in Section 12 neglected decision problems which involve a sequential approach or which have been transformed such that they can be assessed sequentially. Sequential problems lend themselves to another extension of optimization methods. Dynamic programming is a mathematical technique which allows a sequence of interrelated decisions to be made; further, it provides a systematic procedure for determining optimal combinations of decisions.
Hillier and Lieberman (2001) present the basic features that characterize dynamic
programming problems. These include:
1. The problem can be divided into stages, with a decision required at each stage.
2. Each stage has a number of states associated with the beginning of that stage.
3. The consequence of the decision at each stage is to transform the current state to a
state associated with the beginning of the next stage.
4. The solution procedure is designed to find an optimal choice for the overall
problem.
5. Bellman’s principle (Bather, 2000): Given the current state, an optimal policy for
the remaining stages is independent of the decisions adopted in previous stages.
6. The solution begins by finding an optimal choice for the last stage.
Dynamic programming is defined by a recursive relationship. The general notation used
to describe a dynamic programming problem is summarized below:
N = number of stages
n = index for current stage
θ_n = current state for stage n
x_n = decision variable for stage n
x_n* = optimal value of x_n given θ_n
f_n(θ_n, x_n) = contribution of stages n, n + 1, ..., N to the objective function
The recursive relationship will lead to two possible forms, depending on the problem:
f_n*(θ_n) = max_{x_n} { f_n(θ_n, x_n) }
or
f_n*(θ_n) = min_{x_n} { f_n(θ_n, x_n) }
This recursive procedure induces a backward form of mathematical induction in order to
achieve the required task. This approach (as dynamic programming) was first outlined by
Bellman, and was coined such to describe the techniques which he had brought together
to study a class of optimization problems involving sequential decisions (Bather, 2000).
Sequential problems and subsequently dynamic programming problems can be classified
into two groups: deterministic and stochastic. The former approach suggests that the
state at the next stage is completely determined by the state and decision at the current
stage. The latter approach (which is probabilistic) implies that a probability distribution
will dictate what the next state will be. This probabilistic approach is more conducive to
our needs and is associated with solving decision tree diagrams.
Consider the diagram above illustrating the general structure of a stochastic dynamic programming problem. Let S denote the number of possible states at stage n + 1. The system goes to state i with probability p_i for i = 1, 2, ..., S, given state θ_n and decision x_n at stage n. At state i, the contribution of stage n to the objective function is represented by C_i. This figure can be extended in such a manner for all possible stages, and diagrammatically it has the form of a decision tree.
With the inclusion of a probability distribution dictating the consequence of the next state, the precise form of the objective function will differ slightly from the one given above. It can now be written as:
f_n(θ_n, x_n) = ∑_{i=1}^{S} p_i [C_i + f_{n+1}*(i)], with f_{n+1}*(i) = min_{x_{n+1}} f_{n+1}(i, x_{n+1}),
where the minimization is taken over all feasible values of x_{n+1}.
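This recursion can be sketched as a backward pass over the stages. The two-state, three-stage problem and all costs below are invented for illustration.

```python
# Invented 2-state, 3-stage stochastic dynamic program illustrating the
# recursion f_n(state, x) = sum_i p_i * (C_i + f*_{n+1}(i)).
N = 3
STATES = [0, 1]
ACTIONS = [0, 1]

def transition(state, x):
    # Assumed dynamics, returned as [(p_i, C_i, next state i), ...]:
    # action 1 costs more now but steers towards the cheap state 0.
    if x == 1:
        return [(0.8, 2.0, 0), (0.2, 2.0, 1)]
    cost = 1.0 if state == 0 else 3.0
    return [(0.3, cost, 0), (0.7, cost, 1)]

# Backward induction: start past the last stage and fold back to stage 1.
f_star = {s: 0.0 for s in STATES}        # no cost beyond stage N
for n in range(N, 0, -1):
    f_star = {s: min(sum(p * (c + f_star[i]) for p, c, i in transition(s, x))
                     for x in ACTIONS)
              for s in STATES}
# f_star[s] is now the minimal expected total cost starting stage 1 in state s
```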
In statistical decision theory¹, it might be said that the transition from one stage to another is controlled by a sequence of actions: given the state θ_n at stage n, the choice of action a_n determines the probability distribution of the next state. Essentially, because we are
1 Bather provides a concise introduction to both deterministic and probabilistic dynamic programming and its application to utility theory.
dealing with a Markov system, only the current state is important. The state and action variables may be discrete or continuous, and the range of possible choices of an action may vary from state to state. The essential characteristic of such modeling (in summary) is that given θ_n at stage n, the choice of action or decision x_n determines both the probability distribution of the next state θ_{n+1} and the expected cost² of the transition θ_n → θ_{n+1}.
Section 13.2: Bellman’s Principle and the Bayesian Perspective
Recall that one of the fundamental elements of decision problems is the loss function, and in dynamic programming it is this objective function we want to minimize. Now consider an additive function, which by definition is monotonic and separable. (See Part II on Utility Theory.) That is, if the first few terms are fixed, then minimizing the whole sum is equivalent to minimizing its individual terms, a property which holds when taking expectations. This expounds the aforementioned Bellman Principle of Optimality:
The optimal sequential decision policy for the problem which begins with P_θ(⋅) as the decision maker’s prior for θ and has N stages to run must have the property that if, at any stage n < N, the observations X_1 = x_1, ..., X_n = x_n have been made, then the continuation of the optimal policy must be the optimal sequential policy for the problem beginning with P_θ(⋅ | x_1, ..., x_n) as the prior and having N − n stages to run. (French and Insua, 2000)
French and Insua make it a point to emphasize the importance of the decision maker’s
prior in Bellman’s principle. Since the prior describes the state of knowledge, it also
describes the state of the decision making process.
For simplicity of notation, let π = P_θ(⋅) and let r_n(π) be the Bayes risk of the optimal policy with at most n stages left to run with knowledge π. Then the situation in which the decision maker has no option to make an observation and must choose an action can be expressed as:
r_0(π) = min_{a∈A} E_θ[l(a, θ)],
where the expectation over θ is taken with respect to π. Next consider that the decision maker is given the chance to make an observation. The decision maker has two options:
1. Take an action a ∈ A without making an observation, at an expected loss of r_0(π).
2. Make a single observation X at cost γ and choose an action a ∈ A in light of her current knowledge π(X), at an expected loss of E_X[r_0(π(X))].
Bellman’s Principle assumes that after the observation(s), the optimal Bayes action is taken (French and Insua, 2000). Accordingly, this can be extended to n observations, which then give her option (1) stated above or an extension of option (2), which would result in observations made for the remaining n − 1 stages with an expected loss of E_X[r_{n−1}(π(X))].
Thus, for n = 1, 2, ..., N,
r_n(π) = min{ r_0(π), γ + E_X[r_{n−1}(π(X))] }.
2 This refers to the expected loss function.
Note the form of this recursion formula: it is the dynamic programming or backward induction form introduced in the previous subsection. Thus, dynamic programming allows for both the calculation of r_n(π) and the characterization of the optimal policy. The following section extends the basic principles demonstrated in these earlier subsections to illustrate the analytic approach to solving decision problems as applied to utility theory, or more precisely to the maximization of utilities.
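The recursion r_n(π) = min{r_0(π), γ + E_X[r_{n−1}(π(X))]} can be sketched on a toy two-hypothesis problem; the 0–1 loss, the Bernoulli likelihood, and the cost γ are all assumptions made purely for illustration.

```python
# Toy sequential problem (all specifics assumed): theta in {0, 1} with
# prior probability pi = P(theta = 1); two terminal actions under 0-1
# loss, so r_0(pi) = min(pi, 1 - pi); one observation X ~ Bernoulli(0.8)
# if theta = 1 and Bernoulli(0.2) if theta = 0, purchasable at cost gamma.
GAMMA = 0.05

def r0(pi):
    # Bayes risk with no observations left: just act optimally now.
    return min(pi, 1.0 - pi)

def posterior(pi, x):
    # Bayes update of pi after observing X = x.
    like1 = 0.8 if x == 1 else 0.2
    like0 = 0.2 if x == 1 else 0.8
    return pi * like1 / (pi * like1 + (1.0 - pi) * like0)

def r(n, pi):
    # Bellman recursion: r_n(pi) = min{ r_0(pi), gamma + E_X[r_{n-1}(pi(X))] }
    if n == 0:
        return r0(pi)
    px1 = pi * 0.8 + (1.0 - pi) * 0.2      # marginal P(X = 1)
    cont = GAMMA + px1 * r(n - 1, posterior(pi, 1)) \
                 + (1.0 - px1) * r(n - 1, posterior(pi, 0))
    return min(r0(pi), cont)
# Observing pays when pi is near 1/2; at extreme priors r_n(pi) = r_0(pi).
```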
Section 13.3: Analysis in Extensive Form
There are two basic modes of analysis which can be utilized to determine which course of action will maximize the decision maker’s utility: the extensive form of analysis and the normal form. The two forms are mathematically equivalent and lead to identical results; each contributes its own insight into the decision problem, and each has technical advantages in certain situations. This section focuses on the extensive form, which aligns more readily with the use of decision trees.
Backwards Induction
Consider a simple decision tree whereby an initial decision leads to a chance outcome and a subsequent action results in a final state. The extensive form of analysis proceeds by working backwards from the end of the decision tree (the right side of the tree) to the initial starting point: instead of starting by asking which experiment e the decision maker should choose, the procedure starts by asking which terminal act he should choose if he had already performed a particular experiment e and observed a particular outcome x. Even at this point, with a known history (e, x), the utilities of the various possible terminal acts are uncertain because θ, which will be chosen by chance at the end, is still unknown; but this difficulty is easily resolved by treating the utility of any a for given (e, x) as a random variable u(e, x, a, θ̃)³ and applying the operator E″_{θ|x}, which takes the expected value of u(e, x, a, θ̃) with respect to the conditional measure P″_{θ|x}. Symbolically, we can compute for any given history (e, x) and any terminal act a
u*(e, x, a) ≡ E″_{θ|x} u(e, x, a, θ̃);
3 Recall the notation introduced in Section 1 for clarification. The unknown quantity is described as a random variable in this case.
this is the utility of being at the juncture (e, x, a) looking forward, before chance has made a choice of θ (Raiffa and Schlaifer, 2000).
Since the decision maker’s objective is to maximize his expected utility, he will, if faced with a given history (e, x), wish to choose the a (or one of the a’s if more than one exists) for which u*(e, x, a) is greatest; and the utility of being at the terminal position with history (e, x) and the choice of a still to make is
u*(e, x) ≡ max_a u*(e, x, a).
After u*(e, x) has been computed in this way for all possible histories (e, x), the problem of the initial choice of an experiment can be addressed. At this initial move, the utilities of the various possible experiments are uncertain only because x, which will be chosen by chance, is still unknown; and this difficulty is resolved in exactly the same way that the difficulty in choosing a given (e, x) was resolved: by putting a probability measure over chance’s moves and taking expected values. In other words, u*(e, x̃) is a random variable at the initial decision point because x̃ is a random variable, and therefore for any e we can define
u*(e) ≡ E_{x|e} u*(e, x̃)
where E_{x|e} denotes expectation with respect to the marginal measure P_{x|e}.
Again, the decision maker will wish to choose the e for which u*(e) is greatest; and therefore the utility of being at the initial decision node with the choice of e still to make is
u* ≡ max_e u*(e) = max_e E_{x|e} max_a E″_{θ|x} u(e, x̃, a, θ̃).
This procedure of working back from the outermost branches of the decision tree to the
base of the trunk is often called “backward induction”. More descriptively it could be
called a process of “averaging out and folding back”.
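“Averaging out and folding back” can be sketched directly as a recursion over a tree; the structure and all utilities below are invented for illustration.

```python
# Invented decision tree: a decision node chooses between acting now and
# experimenting first; chance then resolves, and a final decision follows.
def fold_back(node):
    kind = node["kind"]
    if kind == "leaf":
        return node["utility"]
    if kind == "chance":     # averaging out: expectation over branches
        return sum(p * fold_back(child) for p, child in node["branches"])
    return max(fold_back(child) for child in node["options"])  # folding back

def leaf(u):
    return {"kind": "leaf", "utility": u}

tree = {"kind": "decision", "options": [
    leaf(35.0),                              # terminal act taken immediately
    {"kind": "chance", "branches": [         # experiment, then decide again
        (0.6, {"kind": "decision", "options": [leaf(100.0), leaf(0.0)]}),
        (0.4, {"kind": "decision", "options": [leaf(-50.0), leaf(0.0)]}),
    ]},
]}
best = fold_back(tree)   # 0.6 * 100 + 0.4 * 0 = 60, which beats 35
```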
Section 13.4: Analysis in Normal Form
The final product of the extensive form of analysis presented in the previous section can
be thought of as a description of the optimal strategy consisting of two parts:
1. A prescription of the experiment e which should be performed,
2. A decision rule prescribing the optimal terminal act a for every possible
outcome x of the chosen e.
The whole decision rule for the optimal e can be simply determined from the part of the
analysis which determined the optimal a for every x in X; and incidentally these same
results also enable the optimal decision rule to accompany any other e in E, even though
the e in question is not itself optimal.
The normal form of analysis also has as its end product the description of an optimal strategy, and it arrives at the same optimal strategy as the extensive form, but via a different route. Instead of first determining the optimal action a for every possible outcome x, and thus implicitly defining the optimal decision rule for any e, the normal form of analysis starts by explicitly considering every possible decision rule for a given e and then choosing the optimal rule for that e. This can be done for all e ∈ E, so that an optimal e can be found as in the extensive form of analysis.
Recall the decision rules as presented in Sections 2 and 3. Mathematically, a decision rule δ for a given experiment e is a mapping which carries x in X into δ(x) in A. Given a particular strategy (e, δ) and a particular pair of values (x, θ), the decision maker’s act as prescribed by the rule will be a = δ(x) and his utility will be u(e, x, δ(x), θ); but before the experiment has been conducted and its outcome observed, u(e, x̃, δ(x̃), θ̃) is a random variable because x̃ and θ̃ are random variables.
The decision maker’s objective is therefore to choose the strategy (e, δ) which maximizes his expected utility
u*(e, δ) ≡ E_{x,θ|e} u(e, x̃, δ(x̃), θ̃).
This double expectation will actually be accomplished by iterated expectation, and the iterated expectation can be carried out in either order: here we first expect over x̃ holding θ̃ fixed and then over θ̃, using the measures P_{x|e,θ} and P′_θ.
If e and δ are given and θ̃ is held fixed, then by taking the expectation of u[e, x̃, δ(x̃), θ] with respect to P_{x|e,θ} we obtain
u*(e, δ, θ) ≡ E_{x|e,θ} u[e, x̃, δ(x̃), θ],
which will be called the conditional utility of (e, δ) for a given state θ. Next, taking the expected value over θ̃ with respect to the unconditional measure P′_θ, we obtain
u*(e, δ) = E′_θ u*(e, δ, θ̃),
which can be referred to as the unconditional utility of (e, δ).
Part V: Analytic & Computational Methods for Decision Problems
139
For any particular experiment e, the decision maker is free to choose the decision rule δ whose expected utility is greatest; and therefore it may be said that the utility of any experiment is
u*(e) ≡ max_δ u*(e, δ).
After computing the utility of every e ∈ E, the decision maker is free to choose the experiment with the greatest utility, such that
u* ≡ max_e u*(e) = max_e max_δ E′_θ E_{x|e,θ} u[e, x̃, δ(x̃), θ̃].
Section 13.5: Equivalence of the Extensive and Normal Form⁴
The extensive and normal forms of analysis will be equivalent if and only if they assign the same utility to every potential e in E, i.e. if the formula
u_n*(e) = max_δ E′_θ E_{x|e,θ} u[e, x̃, δ(x̃), θ̃],
derived as u*(e) ≡ max_δ u*(e, δ) by the normal form of analysis, agrees for all e with the formula
u_e*(e) = E_{x|e} max_a E″_{θ|x} u(e, x̃, a, θ̃),
derived as u*(e) ≡ E_{x|e} u*(e, x̃) by the extensive form. The operation E′_θ E_{x|e,θ} is equivalent to the expectation over the entire possibility space X × Θ and is therefore equivalent to E_{x|e} E″_{θ|x}. It follows that the formal result as presented above can be written as
u_n*(e) = max_δ E_{x|e} E″_{θ|x} u[e, x̃, δ(x̃), θ̃],
and it is then obvious that the best δ will be the one which for every x maximizes
E″_{θ|x} u[e, x, δ(x), θ̃].
This, however, is exactly the same thing as selecting for every x an a_x which satisfies
E″_{θ|x} u(e, x, a_x, θ̃) = max_a E″_{θ|x} u(e, x, a, θ̃)
4 See Raiffa and Schlaifer (2000) for further details on the extensive and normal forms of analysis. This has been adapted from the authors’ book on the analysis of statistical decision problems.
as stated by the extensive form of analysis. Letting δ*(x) denote the optimal decision rule selected by the normal form, it has just been shown that δ*(x) = a_x and that the formulas for u_n*(e) and u_e*(e) are equivalent.
Thus, when one must choose the best e and therefore evaluate u*(e) for all e in E, the extensive and normal forms of analysis require exactly the same inputs of information and yield exactly the same results, even though the intermediate steps in the analysis are different. If, however, e is fixed and one wishes merely to choose an optimal terminal action a, the extensive form has the merit that one has only to choose an appropriate action for the particular x which actually materializes; there is no need to find the decision rule which selects the best action for every x which might have occurred but in fact did not occur.
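The equivalence can be checked by brute force on a small finite problem: the extensive form averages out and maximizes act by act, while the normal form enumerates every decision rule δ: X → A. All numbers below are invented for illustration.

```python
from itertools import product

# Invented finite problem: theta in {0, 1} with prior (0.3, 0.7); one
# experiment with outcome x in {0, 1} and likelihood P(x | theta); two
# terminal acts with utility table u(a, theta). All values are assumed.
prior = {0: 0.3, 1: 0.7}
lik = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # (x, theta) -> P(x|theta)
u = {(0, 0): 5.0, (0, 1): -1.0, (1, 0): -2.0, (1, 1): 4.0}  # (a, theta) -> utility
X, A, THETA = [0, 1], [0, 1], [0, 1]

def u_extensive():
    # For each x, maximise the posterior expected utility over acts; using
    # joint weights prior * likelihood avoids dividing by the marginal of x.
    return sum(max(sum(prior[t] * lik[(x, t)] * u[(a, t)] for t in THETA)
                   for a in A)
               for x in X)

def u_normal():
    # Enumerate every decision rule delta: X -> A, then maximise over rules.
    return max(sum(prior[t] * lik[(x, t)] * u[(delta[x], t)]
                   for t in THETA for x in X)
               for delta in product(A, repeat=len(X)))
# The two routes agree, as the argument above requires.
```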
Section 13.6: Combination of Formal and Informal Analysis⁵
The general model stated in the overview of modeling a decision problem is often criticized on the grounds that utilities cannot be rationally assigned to the various possible (e, x, a, θ) combinations because the costs, profits, or in general the consequences of these combinations would not be certain even if θ were known. In principle, such criticisms point to nothing but an incomplete definition of the state space Θ, which can be made rich enough to include all possible values of these unknowns. The decision maker’s uncertainties about these values can then be evaluated together with his other uncertainties in the probability measure which is assigned to Θ, and the analysis of the decision problem can then proceed as before.
Consider the state θ expressed as a doublet (θ^(1), θ^(2)), so that the state space is of the form Θ = Θ^(1) × Θ^(2). Thus, θ^(1) might be the parameter of a Bernoulli process while θ^(2) might be the cost of the product (for example). In terms of the original decision tree in figure 4, a play was a 4-tuple (e, x, a, θ) and utilities were assigned directly to each (e, x, a, θ). If θ is split into (θ^(1), θ^(2)), a play is a 5-tuple (e, x, a, θ^(1), θ^(2)); utilities are assigned to each (e, x, a, θ^(1), θ^(2)) and the utility of any (e, x, a, θ^(1)) is the expected value of the random variable u(e, x, a, θ^(1), θ̃^(2)), the expectation being taken with respect to the conditional measure on Θ^(2) given the history (e, x, a, θ^(1)). This asserts the idea that it is possible to further consider splitting the tree by conducting only a partial analysis (and a partial construction of the tree) based on θ^(1) while holding θ^(2) constant.
Besides cutting the decision tree before it is logically complete, the decision maker may rationally decide not to make a complete formal analysis of even a truncated tree which he has constructed. Thus if E consists of two experiments e_1 and e_2, u*(e_1) may be formally worked out by evaluating
5 Raiffa and Schlaifer (2000), French and Insua (2000), and Hillier and Lieberman (2001).
u*(e_1) = E_{x|e_1} max_a E″_{θ|x} u(e_1, x̃, a, θ̃);
but after the formal analysis of e_1 is completed, it may be concluded without any formal analysis at all that e_2 is not as good as e_1, and e_1 adopted.
That behavior is perfectly consistent with the principle of choice described in Sections 2 and 3 and readdressed in Section 6. Before making a formal analysis of e_2, the decision maker can think of the unknown quantity u*(e_2) as a random variable ṽ. If the number v resulting from a formal analysis were known and were greater than the known number u*(e_1), then e_2 would be adopted rather than e_1, and the value of the information that ṽ = v could be measured by the difference v − u*(e_1) in the decision maker’s utility which results from this change of choice. If on the contrary v were less than u*(e_1), the decision maker would adhere to the original choice of e_1 and the information would have been worthless.
In other words, the random variable can be defined as
max{0, ṽ − u*(e_1)} = value of information regarding e_2,
and before acquiring such information at the cost of making a formal analysis of e_2, the decision maker may prefer to compute its expected value by assigning a probability measure to ṽ and then taking the expectation with respect to this measure. If the expected value is less than the cost, the decision maker will quite rationally decide to use e_1 without formally evaluating u*(e_2). Operationally one usually does not formally compute either the value or the cost of information on e_2: these are subjectively assessed. The computations could be formalized, of course, but ultimately direct subjective assessments must be used if the decision maker is to avoid an infinite regress.
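A Monte Carlo sketch of this calculation, with a purely illustrative Normal assessment of ṽ; the utility of e_1 and the cost of analysis are likewise assumed numbers.

```python
import random

# Assumed ingredients (illustrative only): the known utility u*(e1), the
# cost of formally analysing e2, and a Normal subjective assessment of v~.
random.seed(7)
u_e1 = 10.0
cost_of_analysis = 0.5
draws = [random.gauss(9.0, 2.0) for _ in range(100_000)]   # draws of v~

# Expected value of information about e2: E[ max{0, v~ - u*(e1)} ]
evi = sum(max(0.0, v - u_e1) for v in draws) / len(draws)
worth_analysing = evi > cost_of_analysis
# With these numbers the expected gain falls short of the cost, so e1 would
# be used without formally evaluating u*(e2).
```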
Before closing this section and the subject of incomplete analysis, it should be said that completely formal analysis and completely intuitive analysis are not the only possible methods of determining a utility such as u*(e_2). In many instances it is possible to make a partial analysis in order to gain some insight, but at a less prohibitive cost than a full analysis entails.
Section 13.7: Sensitivity Analysis⁶
Sensitivity analysis is an essential element of decision analysis. The principle of sensitivity analysis applies most readily to areas such as meta-analysis and cost-effectiveness analysis, but it is not confined to such approaches. This section
6
See Petitti (2000), Hillier and Lieberman (2001), and Clemen (1996).
simply describes the overall purpose of sensitivity analysis and describes one-way analysis and its expansion to n-way analysis as applied to decision analysis on a very general level.
Sensitivity analysis evaluates the stability of the conclusions of an analysis to
assumptions made in the analysis. When a conclusion is shown to be invariant to the
assumptions, confidence in the validity of the conclusions of the analysis is enhanced.
Such analysis also helps identify the most critical assumptions of the analysis (Petitti,
2000).
An implicit assumption of decision analysis is that the values of the probabilities and of the utility measure are the correct values for these variables. In one-way sensitivity analysis, the assumed value of each variable in the analysis is varied, one at a time, while the values of the other variables in the analysis remain fixed. When the assumed value of a variable affects the conclusion of the analysis, the analysis is said to be “sensitive” to that variable. When the conclusion does not change as the sensitivity analysis varies a variable across a reasonable range of values, the analysis is said to be “insensitive” to that variable.
If an analysis is sensitive to the assumed value of a variable, the likelihood that the extreme value is the true value can be assessed qualitatively, perhaps by weighing the benefit of one strategy over the other under the extreme assumption.
An extension of one-way sensitivity analysis is threshold analysis. In such a case, the value of one variable is varied until the alternative decision strategies have equal outcomes, so that there is no benefit of one alternative over the other in terms of estimated outcome. The threshold point is also called the break-even point, at which the decision is a “toss-up”. That is, neither of the alternative decision options being compared is clearly favored over the other. Threshold analysis is especially useful when the intervention is being considered for use in groups that can be defined a priori based on the values of the variable that is the subject of the threshold analysis.
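A one-way threshold analysis can be sketched as a simple sweep; the payoff functions below are invented for illustration, with the risky alternative’s expected value linear in a single probability p.

```python
# Invented payoffs: a risky alternative whose expected value is linear in
# the probability p, against a fixed "safe" alternative.
def value_risky(p):
    return p * 700.0 + (1.0 - p) * (-100.0)

def value_safe(p):
    return 90.0   # unaffected by p

# Sweep p over a grid; the first p where the risky option catches up is
# the threshold ("break-even") point where the decision is a toss-up.
threshold = next(p / 1000 for p in range(1001)
                 if value_risky(p / 1000) >= value_safe(p / 1000))
# Analytically: 700p - 100(1 - p) = 90 gives p = 190/800 = 0.2375.
```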
In two-way sensitivity analysis, the expected outcome is determined for every combination of estimates of two variables, while the values of all other variables in the analysis are held constant at baseline. It is usual to identify the pairs of values that equalize the expected outcome or expected utility of the alternatives and to present the results of the analysis graphically; it is simpler to interpret the results of a two-way sensitivity analysis with the aid of graphs.
In n-way sensitivity analysis, the expected outcome is determined for every possible combination of every reasonable value of every variable. N-way sensitivity analysis is analogous to n-way regression; it is difficult to interpret and is not discussed any further in this report.
It is usual to do one-way sensitivity analysis for each variable in the analysis. The highest and the lowest values within a reasonable range of values are first substituted for the baseline estimate in the decision tree. If substitution of the highest or the lowest
value changes the conclusions, more values within the range are substituted to determine the range of values over which the conclusion holds.
In an analysis with many probabilities, there are numerous combinations of two and three variables, and the computational burden of doing all possible two-way and three-way sensitivity analyses is large. For this reason, it is not usually feasible to conduct two-way sensitivity analysis for all possible combinations, let alone n-way. The choice of variables for multiple-way analysis requires considerable judgment, and therefore is not laid down by any hard and fast rules. However, the graphical approach applied in these instances also lends itself to the discussion of dominance considerations. In Section 3, the notion of dominant alternatives was introduced; this can be considered a type of sensitivity analysis. A graphical approach in this instance suggests when one alternative may supersede another and suggests possible implications.
Another area of emphasis for sensitivity analysis is sensitivity with respect to the prior, when applying Bayesian methodology. Such analyses are typically performed when there is little imprecision in the loss because a standard choice of inference loss, such as the quadratic, was adopted. It is usually suggested to start with this case as (a) it is the simplest and most thoroughly studied and (b) it provides many insights for more general problems. The prior is also notably one of the most difficult elements to assess.
The overview of statistical decision theory provided in this report has tried to maintain a balance between the classical and the Bayesian approach. In simple terms, the solution of a statistical decision problem proceeds by modeling a decision maker’s judgments by means of the loss function, a probability model of the observation process and a prior, and then uses these to identify a ‘good’ decision rule. This presents a variety of reasons as to why a sensitivity analysis should be conducted, that is, checking the sensitivity of the decision rule (output) with respect to the model and the decision maker’s judgments (inputs). In other words, the objective is to check the impact of the loss function, the prior and the model on the Bayes decision rule or Bayes alternative, and their posterior expected loss.
Part V: Analytic & Computational Methods for Decision Problems
144
Section 14: Prototype Example
The conceptual ideas presented thus far have laid the foundation of statistical decision
theory. The purpose of the following section is to present a prototype example which
exemplifies the key ideas and methods discussed in this report. The example has appeared
in many statistical decision theory books in various forms; it is developed here in much
the same way as the methodological sections of this report.
Section 14.1: Summary of the Problem
An oil company owns a tract of land that may contain oil. A consulting geologist has
reported to management that it is believed that there is a 1 in 4 chance of oil. Because of
this prospect, another oil company has offered to purchase the land for $90,000.
However, the landowning oil company is considering holding the land in order to drill
for oil itself. The cost of drilling is $100,000. If oil is found, the resulting expected
revenue will be $800,000, so the company’s expected profit (after deducting the cost of
drilling) will be $700,000. A loss of $100,000 will be incurred if the land is dry (no oil).
Table 14.1: Prospective profits of oil company

                      Payoff by Status of Land
Alternative           Oil           Dry
Drill for oil         $700,000      -$100,000
Sell the land         $90,000       $90,000
Chance of status      1 in 4        3 in 4
Table 14.1 summarizes the prospective profits for this landowning company. This oil
company is operating without much capital so a loss of $100,000 would be quite serious.
Deciding whether to drill or sell also hinges on the option to conduct a detailed seismic
survey of the land to obtain a better estimate of the probability of finding oil. Thus, first
we present this prototype problem of decision without experimentation and then with
experimentation and conclude with ways of refining the evaluation of the consequences
of the various possible outcomes.
Section 14.2: Summary of the Decision Modeling Framework
The proceeding sections have explained the decision modeling framework. Here we re
cap the framework before proceeding with the prototype example. Simply, statistical
decision problems refer to those problems containing data or observations on the state of
nature. We first consider those cases without that information; that is problems of
making decisions in the absence of data (or without experimentation). This lays a
framework for decision analysis, for the problem at hand, which is later extended to
incorporate data or decision problems with experimentation and various other decision
methodologies.
Framework for Decision Analysis:
1. The decision maker needs to choose one of the alternatives: a ∈ A.
2. Nature would choose one of the possible states of nature: θ ∈ Θ.
3. Each combination of an action and a state of nature results in a
payoff/decision, which is given as one of the entries in a payoff/decision table:
c(a, θ) ∈ C.
4. This payoff/decision table should be used to find an optimal action for the
decision maker according to an appropriate criterion.
An additional element needs to be incorporated into this framework – prior probabilities.
Many decision makers would generally want to incorporate additional information to
account for the relative likelihood of the possible states of nature. This information is
translated into a probability distribution in which the state of nature is treated as a
random variable, making up the prior probability distribution.
In general terms, the decision maker must choose an action from a set of possible actions.
The set contains all the feasible alternatives under consideration for how to proceed with
the problem of concern. This choice of an action must be made in the face of uncertainty,
because the outcome will be affected by random factors that are outside the control of the
decision maker. These random factors determine what situation is referred to as a
possible state of nature. For each combination of an action and a state of nature, the
decision maker knows what the resulting payoff (or consequence) would be. The payoff
is a quantitative measure of the value to the decision maker of the consequences of the
outcome. Although in many instances the payoff is a monetary gain, there are other
measures that can be used (as explained in utility theory). If the consequences of the
outcome do not become completely certain even when the state of nature is given, then the
payoff becomes an expected value (in the statistical sense) of the measure of the
consequences. The payoff table can be used to determine an optimal action according to
an appropriate criterion which suits the beliefs of the decision maker.
The decision problem faced by the oil company can now be placed within the decision
framework just presented. This can be summarized in the following decision table [7]:
Table 14.2: Formulation of the problem within the framework of decision analysis

                       State of Nature Θ
Alternative A          θ1 (Oil)        θ2 (Dry)
a1 (Drill for oil)     c11 = 700       c12 = -100
a2 (Sell the land)     c21 = 90        c22 = 90
Prior Probability      0.25            0.75
Section 14.3: Decision Making without Experimentation
[7] Payoff table and decision table can be used interchangeably.
This section illustrates the decision making process without experimentation. The
formulation of the problem has two possible actions under consideration:
drill for oil or sell the land. The possible states of nature are that the land contains oil and
that it does not. Since the consulting geologist has estimated that there is a 1 in 4
chance of oil, the prior probabilities of the two states of nature are 0.25 (for oil) and 0.75
(for no oil). Table 14.1 has been recast in Table 14.2 with the payoff units in
thousands of dollars of profit. This payoff table will be used to determine the optimal
action under the three main criteria discussed here.
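As a concrete sketch of the framework, the payoff table and prior from Table 14.2 can be encoded directly. This is a hypothetical Python encoding (the dictionary and variable names are illustrative, not part of the problem statement), with payoffs in thousands of dollars:

```python
# Payoff table from Table 14.2, in thousands of dollars (illustrative names).
payoffs = {
    "drill": {"oil": 700, "dry": -100},
    "sell":  {"oil": 90,  "dry": 90},
}
# Prior probabilities over the states of nature.
prior = {"oil": 0.25, "dry": 0.75}

# The prior must be a probability distribution over the states of nature.
assert abs(sum(prior.values()) - 1.0) < 1e-12
```

The three criteria below all operate on exactly these two ingredients: the payoff table and (for two of them) the prior.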
1. Minimax Criterion:
The minimax criterion was described in Section 2. Its rationale is that it provides the best
guarantee of the payoff that will be obtained. That is, one determines, for each action a,
the maximum loss over the various possible states of nature,

M(a) = max_{θ ∈ Θ} l(θ, a),

and this provides an ordering among the possible actions. In words, in terms of payoffs:
for each possible action, find the minimum payoff over all possible states of nature; next
find the maximum of these minimum payoffs; then choose the action whose minimum
payoff attains this maximum.
The application of this criterion to the prototype example suggests that selling the land is
the optimal action to take. Regardless of what the true state of nature turns out to be for
the problem, the payoff from selling the land cannot be less than 90, which provides the
best available guarantee. Thus, this criterion provides the pessimistic viewpoint that
regardless of which action is selected, the worst state of nature for that action is likely to
occur, so one should choose the action which provides the best payoff with its worst state
of nature.
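The worst-case comparison above can be sketched in a few lines. This is an illustrative computation on the payoff table of Table 14.2 (the names are assumptions, not from the original):

```python
# Payoff table from Table 14.2, in thousands of dollars.
payoffs = {
    "drill": {"oil": 700, "dry": -100},
    "sell":  {"oil": 90,  "dry": 90},
}

# Maximin on payoffs: for each action take the worst-case payoff over the
# states of nature, then choose the action whose worst case is best.
worst = {a: min(row.values()) for a, row in payoffs.items()}
best_action = max(worst, key=worst.get)
print(worst)        # {'drill': -100, 'sell': 90}
print(best_action)  # sell
```

The worst case of selling (90) beats the worst case of drilling (-100), reproducing the pessimistic recommendation to sell.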
2. Maximum Likelihood Criterion:
The maximum likelihood criterion focuses on the most likely state of nature. Recall a
simple general definition of the likelihood function: for the observed data x, the
function l(θ) = f(x | θ), considered as a function of θ, is called the likelihood function.
The intuitive reason for the name “maximum likelihood” is that a θ for which
f(x | θ) is large is more plausible than one for which it is small, in that x would be a more
plausible occurrence if f(x | θ) were large. In brief, the steps in this approach begin by
identifying the most likely state of nature (largest prior probability); for this state of
nature, find the action with the maximum payoff, and choose this action.
The application of this criterion to the prototype example indicates that the dry state has
the largest prior probability. In Table 14.2, in the Dry column, the sell alternative has
the maximum payoff, so the choice is to sell the land.
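The two steps of this criterion translate directly into code. A minimal sketch, again using an assumed dictionary encoding of Table 14.2:

```python
payoffs = {
    "drill": {"oil": 700, "dry": -100},
    "sell":  {"oil": 90,  "dry": 90},
}
prior = {"oil": 0.25, "dry": 0.75}

# Step 1: identify the most likely state of nature under the prior.
likely_state = max(prior, key=prior.get)
# Step 2: choose the action with the largest payoff in that state.
best_action = max(payoffs, key=lambda a: payoffs[a][likely_state])
print(likely_state, best_action)  # dry sell
```

Note that all other states of nature are ignored, which is exactly the drawback discussed below.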
Part V: Analytic & Computational Methods for Decision Problems
147
The appeal of this criterion is that the most important state of nature is the most likely one, so
the action chosen is the best one for this particularly important state of nature. Moreover,
the criterion does not rely on questionable subjective estimates of the probabilities of the
respective states of nature other than identifying the most likely state. However, the
major drawback of this criterion is that it completely ignores some relevant information:
no state of nature besides the most likely one is considered. Therefore, in a problem with
many possible states of nature, the probability of the most likely one may be quite small,
and there may be little difference between “quite likely” states of nature.
3. Bayes’ Decision Rule (Expected Monetary Value Criterion):
The Bayes decision rule was introduced in Section 2. In this instance, with probability
weight g(θ_i) assigned to each state of nature θ_i, the loss incurred for a given action a has
expected value

B(a) = Σ_i g(θ_i) l(θ_i, a).
This approach uses the best available estimates of the probabilities of the respective states
of nature (current prior probabilities) and calculates the expected value of the payoff for
each of the possible actions. Choose the action with the maximum expected payoff.
In this application of the criterion, it can be easily determined that the optimal action is to
drill. The expected payoffs can be calculated from Table 14.2 directly as follows:

E[c1] = g(θ1) l(θ1, a1) + g(θ2) l(θ2, a1) = 0.25(700) + 0.75(-100) = 100,
E[c2] = g(θ1) l(θ1, a2) + g(θ2) l(θ2, a2) = 0.25(90) + 0.75(90) = 90.
Since 100 > 90, the alternative action to drill is selected. Note that this choice differs
from the other two preceding criteria.
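The calculation above can be sketched as a prior-weighted average over the states of nature. Names are illustrative, matching the earlier assumed encoding of Table 14.2:

```python
payoffs = {
    "drill": {"oil": 700, "dry": -100},
    "sell":  {"oil": 90,  "dry": 90},
}
prior = {"oil": 0.25, "dry": 0.75}

def expected_payoff(action):
    # Bayes criterion: prior-weighted average payoff over the states of nature.
    return sum(prior[s] * payoffs[action][s] for s in prior)

scores = {a: expected_payoff(a) for a in payoffs}
best = max(scores, key=scores.get)
print(scores, best)  # {'drill': 100.0, 'sell': 90.0} drill
```

Since 100 > 90, drilling is the Bayes alternative, in agreement with the hand calculation.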
The advantage to the Bayes’ decision rule is that it incorporates all the available
information, including payoffs and the best available estimates of the probabilities of the
respective states of nature. It is sometimes argued that these estimates of the probabilities
are largely subjective and so are too shaky to be trusted. Nevertheless, under many
circumstances, past experience and current evidence enable one to develop reasonable
estimates of the probabilities. The methodology for including such information was
described in Section 3. Before applying this approach to the problem at hand, we will
consider the use of sensitivity analysis to assess the effect of possible inaccuracies in the
prior probabilities.
Section 13 briefly discussed the role and importance of sensitivity analysis in various
applications: studying the effect of some of the elements included in the mathematical
model being incorrect. The decision table in Table 14.2 representing the payoffs is the
mathematical model of concern. The prior probabilities in this model are the most
questionable and will be the focus of the sensitivity analysis to be conducted; however, a
similar approach could be applied to the payoffs given in the table.
Basic probability theory suggests that the sum of the two prior probabilities must equal 1,
so increasing one of these probabilities automatically decreases the other one by the same
amount, and vice versa. The oil company’s management team feels that the true
“chances” of having oil on the tract of land are more likely to lie between the range from
0.15 to 0.35. Thus, the corresponding prior probability of the land being dry would range
from 0.85 to 0.65, respectively.
Conducting a sensitivity analysis in this situation requires the application of the Bayes’
decision rule twice – once when the prior probability of the oil is at the lower end (0.15)
of this range and next when it is at the upper end (0.35). When the prior probability is
conjectured to be 0.15, we find
E[c1] = g(θ1) l(θ1, a1) + g(θ2) l(θ2, a1) = 0.15(700) + 0.85(-100) = 20,
E[c2] = g(θ1) l(θ1, a2) + g(θ2) l(θ2, a2) = 0.15(90) + 0.85(90) = 90,
and when the prior probability is thought to be 0.35, the Bayes’ decision rule finds that
E[c1] = g(θ1) l(θ1, a1) + g(θ2) l(θ2, a1) = 0.35(700) + 0.65(-100) = 180,
E[c2] = g(θ1) l(θ1, a2) + g(θ2) l(θ2, a2) = 0.35(90) + 0.65(90) = 90.
Thus, the decision is very sensitive to the prior probability of oil, as the expected payoff to
drill shifts from 20 to 100 to 180 when the prior probability is 0.15, 0.25, and 0.35,
respectively. If the prior probability of oil is closer to 0.15, the optimal action
would be to sell the land rather than to drill for oil, contrary to what the other prior
probabilities suggest. It would therefore be worthwhile to determine the true value of
the probability of oil more precisely.
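This sensitivity sweep is easy to automate: recompute the Bayes alternative at each conjectured prior. A sketch under the same assumed encoding of the payoff table:

```python
payoffs = {
    "drill": {"oil": 700, "dry": -100},
    "sell":  {"oil": 90,  "dry": 90},
}

def expected_payoff(action, p_oil):
    # Expected payoff when the prior probability of oil is p_oil.
    return p_oil * payoffs[action]["oil"] + (1 - p_oil) * payoffs[action]["dry"]

results = {}
for p in (0.15, 0.25, 0.35):
    e = {a: expected_payoff(a, p) for a in payoffs}
    results[p] = max(e, key=e.get)  # Bayes alternative at this prior

print(results)  # {0.15: 'sell', 0.25: 'drill', 0.35: 'drill'}
```

Only the low end of the management team's range (p = 0.15) flips the decision, confirming that the recommendation hinges on where the true prior lies within [0.15, 0.35].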
Let p = prior probability of oil. Then the expected payoff from drilling for any p is

E[c1] = 700p - 100(1 - p) = 800p - 100.
Figure 14-1 illustrates graphically how the expected payoff for each alternative changes
when the prior probability of oil changes, for the oil company’s problem of whether to
drill or sell. The point in this figure where the two lines intersect is the threshold point
where the decision shifts from one alternative (selling the land) to the other (drilling for
oil) as the prior probability increases. Algebraically, this is simple to determine, as we set
E[c1] = E[c2]:
800p - 100 = 90,
p = 190/800 = 0.2375,
where c1 is the expected payoff of the action to drill and c2 is the expected payoff of the
action to sell the land. Thus, the conclusion is to sell the land if p < 0.2375 and to drill
for oil if p > 0.2375. Because the decision for the oil company depends heavily on the
true probability of oil, serious consideration should be given to conducting a seismic
survey to estimate this probability more accurately. This is considered in the next
subsection.
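The break-even prior follows from one line of arithmetic, sketched here for completeness:

```python
# Solve E[c1] = E[c2], i.e. 800p - 100 = 90, for the break-even prior p.
p_star = (90 + 100) / 800
print(p_star)  # 0.2375
```

Since the geologist's estimate (0.25) sits only slightly above this threshold, small errors in the prior can change the optimal action.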
[Figure 14-1: expected payoff of each alternative (drill for oil, sell the land) as a
function of the prior probability of oil; the two lines cross at the threshold p = 0.2375.]

Section 14.4: Decision Making with Experimentation

Here another element is added to the decision analysis framework – posterior
probabilities. This additional ‘element’ allows testing or experimentation to be
conducted, thus improving the preliminary estimates of the probabilities of the respective
states of nature, i.e. the prior probabilities.
For this problem, an available option before making a decision is to conduct a detailed
seismic survey of the land to obtain a better estimate of the probability of oil. The cost is
$30,000. A seismic survey obtains seismic soundings that indicate whether the
geological structure is favorable to the presence of oil. The possible findings of
the survey are divided into the following two categories:
USS: unfavorable seismic soundings; oil is unlikely, and
FSS: favorable seismic soundings; oil is likely to be found.
Based on past experience, if there is oil, then the probability of unfavorable seismic
soundings (USS) is

P(USS | State = Oil) = 0.4, so P(FSS | State = Oil) = 1 - 0.4 = 0.6.

Similarly, if there is no oil (state of nature = Dry), then the probability of unfavorable
seismic soundings is estimated to be

P(USS | State = Dry) = 0.8, so P(FSS | State = Dry) = 1 - 0.8 = 0.2.
This data is used to find the posterior probabilities of the respective states of nature given
the seismic readings.
The main ‘criterion’ implemented here is Bayes’ theorem. Now suppose that before
choosing an action, an outcome from an experiment is observed, X = x. Then for each
i = 1, 2, …, n, the corresponding posterior probability is

P(θ = θ_i | X = x) = P(X = x | θ = θ_i) P(θ = θ_i) / Σ_{k=1}^{n} P(X = x | θ = θ_k) P(θ = θ_k).
Proceeding in general terms, let

n = number of possible states of nature;
P(Θ = θ_i) = prior probability that the true state of nature is state i, for i = 1, 2, …, n;
X = finding from experimentation (a random variable);
x_j = one possible value of the finding;
P(Θ = θ_i | X = x_j) = posterior probability that the true state of nature is state i,
given X = x_j, for j = 1, 2, …, n.
Thus, the question being addressed is: given P(Θ = θ_i) and P(X = x_j | Θ = θ_i), for
i = 1, 2, …, n, what is P(Θ = θ_i | X = x_j)? The answer follows from the standard
formulas of probability theory for conditional probability:
P(Θ = θ_i | X = x_j) = P(Θ = θ_i, X = x_j) / P(X = x_j),

where

P(X = x_j) = Σ_{k=1}^{n} P(Θ = θ_k, X = x_j),

and

P(Θ = θ_i, X = x_j) = P(X = x_j | Θ = θ_i) P(Θ = θ_i).
Therefore, for i =1, 2, …, n, the desired formula for the corresponding posterior
probability is the Bayes theorem as stated above.
Returning to the prototype example and applying this formula, one finds that if the
finding of the seismic survey is unfavorable, then the posterior probabilities are

P(State = Oil | Finding = USS) = P(θ1 | X = x1)
  = 0.4(0.25) / [0.4(0.25) + 0.8(0.75)] = 1/7,
P(State = Dry | Finding = USS) = 1 - P(θ1 | X = x1) = 6/7.

Note that USS = x1 and FSS = x2. Similarly, if the seismic survey is favorable, then

P(State = Oil | Finding = FSS) = P(θ1 | X = x2)
  = 0.6(0.25) / [0.6(0.25) + 0.2(0.75)] = 1/2,
P(State = Dry | Finding = FSS) = 1 - P(θ1 | X = x2) = 1/2.
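The posterior calculation above can be sketched directly from Bayes' theorem. Exact rationals are used so that the 1/7 and 6/7 of the hand calculation come out exactly (the dictionary names are illustrative):

```python
from fractions import Fraction

prior = {"oil": Fraction(1, 4), "dry": Fraction(3, 4)}
# Likelihoods P(finding | state) from the survey's track record.
likelihood = {
    "USS": {"oil": Fraction(2, 5), "dry": Fraction(4, 5)},  # 0.4, 0.8
    "FSS": {"oil": Fraction(3, 5), "dry": Fraction(1, 5)},  # 0.6, 0.2
}

def posterior(finding):
    # Bayes' theorem: joint probabilities normalized by P(X = finding).
    joint = {s: likelihood[finding][s] * prior[s] for s in prior}
    total = sum(joint.values())
    return {s: joint[s] / total for s in prior}

print(posterior("USS"))  # {'oil': Fraction(1, 7), 'dry': Fraction(6, 7)}
print(posterior("FSS"))  # {'oil': Fraction(1, 2), 'dry': Fraction(1, 2)}
```

The intermediate `total` values, 7/10 and 3/10, are exactly the marginal finding probabilities P(USS) = 0.7 and P(FSS) = 0.3 used later in the value-of-experimentation calculation.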
Probability tree diagrams can be a useful and visually appealing way of organizing the
outcomes and the corresponding probabilities. Such a diagram is presented in Figure
14-2. Reading the diagram from left to right, one can see all the probabilities –
prior probabilities to conditional probabilities to joint probabilities – leading to the
posterior probabilities. After these computations have been completed, Bayes’ decision
rule can be applied again just as before, with the posterior probabilities replacing the
prior probabilities. Again using the “payoffs” (in units of thousands of dollars) from
Table 14.2 and subtracting the cost of experimentation, we obtain the results shown
below.
Figure 14-2: Probability tree diagram

Prior          Conditional          Joint                Posterior
0.25 (Oil)     0.6 (FSS | Oil)      0.25(0.6) = 0.15     0.15/0.3 = 0.5  (Oil | FSS)
               0.4 (USS | Oil)      0.25(0.4) = 0.10     0.10/0.7 ≈ 0.14 (Oil | USS)
0.75 (Dry)     0.2 (FSS | Dry)      0.75(0.2) = 0.15     0.15/0.3 = 0.5  (Dry | FSS)
               0.8 (USS | Dry)      0.75(0.8) = 0.60     0.60/0.7 ≈ 0.86 (Dry | USS)
The expected payoffs if the finding is unfavorable seismic soundings:

E[c(a1 | X = x1)] = (1/7)(700) + (6/7)(-100) - 30 = -15.7,
E[c(a2 | X = x1)] = (1/7)(90) + (6/7)(90) - 30 = 60.

And the expected payoffs if the finding is favorable seismic soundings:

E[c(a1 | X = x2)] = (1/2)(700) + (1/2)(-100) - 30 = 270,
E[c(a2 | X = x2)] = (1/2)(90) + (1/2)(90) - 30 = 60.
Table 14.3: Optimal Policies

Finding from       Optimal    Expected Payoff             Expected Payoff
Seismic Survey     Action     Excluding Cost of Survey    Including Cost of Survey
USS                Sell       90                          60
FSS                Drill      300                         270
Since the objective is to maximize the expected payoff, these results yield the optimal
policy shown in Table 14.3. However, this analysis has not yet addressed the issue of
whether the expense of conducting a seismic survey is truly worthwhile or whether one
should choose the optimal solution without experimentation. This issue is addressed
next.
The Value of Experimentation can be determined by using two complementary
methods. The first assumes (unrealistically) that the experiment will remove all
uncertainty about what the true state of nature is, and then calculates the resulting
“improvement” in the expected payoff (ignoring the cost of the experiment). This
quantity, the expected value of perfect information, provides an upper bound on the
potential value of the experiment. Therefore, if this upper bound is less than the cost of
the experiment, then the experiment should be forgone. However, if this upper bound
exceeds the cost of the experiment, then the second method should be applied. This
method calculates the actual “improvement” in the expected payoff (ignoring the cost of
the experiment) that would result from performing the experiment. Comparing this
improvement with the cost indicates whether the experiment should be performed.
1. Expected Value of Perfect Information
Suppose that the experiment could definitely identify the true state of nature, thereby
providing “perfect” information. Whichever state of nature is identified, one naturally
chooses the action with the maximum payoff for that state. Since we do not know in
advance which state of nature will be identified, the expected payoff with perfect
information (ignoring the cost of the experiment) is calculated by weighting the
maximum payoff for each state of nature by the prior probability of that state of nature:
Expected payoff with perfect information = 0.25(700) + 0.75(90) = 242.5.
Thus, if the oil company could learn, before choosing its action, whether the land
contains oil, the expected payoff as of now (before acquiring this information) would be
$242,500 (excluding the cost of the experiment generating the information).
To evaluate whether the experiment should be conducted, we calculate the expected
value of perfect information (EVPI) [8]:

EVPI = E[payoff with perfect information] – E[payoff without experimentation].
Thus, since experimentation usually cannot provide perfect information, EVPI can
provide an upper bound on the expected value of experimentation. For this prototype
example, EVPI = 242.5 – 100 = 142.5, where the expected payoff without
experimentation was determined earlier to be 100 under Bayes’ decision rule. Since
142.5 far exceeds 30, the cost of experimentation, the seismic survey may well be
worthwhile. Thus, we carry on and implement the second method of evaluating the
potential benefit of experimentation.
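The EVPI bound can be sketched directly from the payoff table and prior (an illustrative encoding, as earlier):

```python
payoffs = {
    "drill": {"oil": 700, "dry": -100},
    "sell":  {"oil": 90,  "dry": 90},
}
prior = {"oil": 0.25, "dry": 0.75}

# Best attainable payoff in each state of nature, weighted by the prior.
ep_perfect = sum(prior[s] * max(payoffs[a][s] for a in payoffs) for s in prior)
ep_without = 100  # Bayes expected payoff without experimentation (Section 14.3)
evpi = ep_perfect - ep_without
print(ep_perfect, evpi)  # 242.5 142.5
```

Under perfect information one would drill when there is oil (700) and sell when the land is dry (90), which is where the 242.5 comes from.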
2. Expected Value of Experimentation
Now we want to determine the expected increase directly; this is referred to as the expected
value of experimentation. Calculating this quantity requires first computing the expected
payoff with experimentation (excluding the cost of experimentation). Obtaining this
latter quantity requires the previously determined posterior probabilities, the optimal
policy with experimentation, and the corresponding expected payoff (excluding the cost
of experimentation) for each possible finding from the experiment. Each of these
expected payoffs is then weighted by the probability of the corresponding finding,
that is,

Expected payoff with experimentation = Σ_j P(X = x_j) E[c | X = x_j],

where the summation is taken over all possible j and c denotes the payoff or consequence.
For the prototype example, much of the work has already been done for the right side of
this equation. The values of P(X = x_j) for the two possible findings from the seismic
survey – unfavorable or favorable – were calculated at the bottom of the probability tree
diagram in Figure 14-2 to be P(x1) = 0.7 and P(x2) = 0.3. For the optimal policy with
[8] The value of perfect information is a random variable equal to the payoff with perfect
information minus the payoff without experimentation. Thus the expected value of perfect
information is the expected value of this random variable.
experimentation, the corresponding expected payoff for each finding (excluding the cost
of the survey) can be obtained from Table 14.3 as

E[c | X = x1] = 90,
E[c | X = x2] = 300.
Then the expected payoff with experimentation is determined to be 0.7(90) + 0.3(300) =
153. As before, the expected value of experimentation (EVE) can be calculated as

EVE = 153 – 100 = 53,

which identifies the potential value of the experimentation. Since this value exceeds 30,
conducting a detailed seismic survey is shown to be worthwhile.
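The EVE calculation combines the finding probabilities from the probability tree with the optimal per-finding payoffs of Table 14.3. A sketch (names illustrative):

```python
# P(USS) = 0.7 and P(FSS) = 0.3 from the probability tree; payoffs of the
# optimal policy, excluding the survey cost, come from Table 14.3.
p_finding = {"USS": 0.7, "FSS": 0.3}
best_payoff = {"USS": 90, "FSS": 300}  # sell after USS, drill after FSS

ep_experiment = sum(p_finding[f] * best_payoff[f] for f in p_finding)
eve = ep_experiment - 100  # minus the expected payoff without experimentation
print(ep_experiment, eve)  # 153.0 53.0
print(eve > 30)            # True: EVE exceeds the survey's $30,000 cost
```

Because EVE (53) exceeds the survey cost (30), the survey is worth conducting even though it falls well short of the EVPI upper bound (142.5).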
Section 14.5: Decision Trees
Decision trees provide a useful visual display of sequential decision processes and their
subsequent consequences. They can also help organize the computational work that has
been described thus far.
This prototype problem involves a sequence of two decisions, each resulting in possible
consequences which in turn lead to further decisions and consequences. The two
decisions are:
1. Should a seismic survey be conducted before an action is chosen?
2. Which action (drill for oil or sell the land) should be chosen?
The corresponding decision tree, with the numbers and computations already filled in, is
displayed in Figure 14-3; the completed tree is shown directly to save space. Recall from
Section 9 the interpretation of the specific types of nodes. The nodes of the decision tree
are referred to as forks, and the arcs are called branches. A decision fork, represented by
a square, indicates that a decision needs to be made at that point in the process. A chance
fork, represented by a circle, indicates that a random event occurs at that point.
In this example, the first decision is represented by decision fork a. Fork b is a chance
fork representing the random event of the outcome of the seismic survey. The two
branches emanating from fork b represent the two possible outcomes of the survey.
Decision forks c and d follow the possible outcomes and lead to the consequences of the
random events at chance forks f and g. Similarly, decision fork e leads to a chance fork h,
where again the two branches correspond to the two possible states of nature.
Since the decision tree has been filled in with the information available thus far, some
interpretation can be offered before the actual analysis proceeds. The numbers under or
over the branches that are not in parentheses are the cash flows that occur at those
branches. The resulting payoff is recorded in boldface at the end of each terminal
branch. At each chance fork, the probabilities of the random events are recorded in
parentheses. At chance fork h, the probabilities are the prior probabilities, as no
seismic survey was conducted. The probabilities emanating from chance fork b are the
probabilities of the findings, while the probabilities emanating from chance forks f and g
are the posterior probabilities of the states of nature, given the finding from the seismic
survey.
Performing the Analysis
The backward induction procedure was introduced in Section 10 and is implemented here
for the prototype example. The steps involved can be summarized as follows:
1. Start at the right side of the decision tree and move left one column at a time.
For each column, perform either step 2 or 3 depending on whether the forks in
that column are chance forks or decision forks.
2. For each chance fork, calculate the expected payoff by multiplying the
expected payoff of each branch (shown in boldface to the right of the branch)
by the probability of that branch and then summing these products. Record
this expected payoff in boldface next to the chance fork, and designate this
quantity as also being the expected payoff for the branch leading to the fork.
3. For each decision fork, compare the expected payoffs of its branches and
choose the alternative whose branch has the largest expected payoff. In each
case, record the choice on the decision tree by inserting a double dash as a
barrier through each rejected branch.
To begin the procedure, consider the rightmost column of forks, namely, chance forks f,
g, and h. Applying step 2, their expected payoffs are calculated as

E[c_f] = (1/7)(670) + (6/7)(-130) = -15.7,
E[c_g] = (1/2)(670) + (1/2)(-130) = 270,
E[c_h] = (1/4)(700) + (3/4)(-100) = 100.

Note that E[c_k] denotes the expected payoff or consequence at the k-th chance fork. These
expected payoffs are placed above these forks as displayed in Figure 14.3. Next, moving
one column to the left, are the decision forks c, d, and e. The expected payoff for a
branch that leads to a chance fork now is recorded in boldface over that chance fork.
Therefore, step 3 can be applied as follows.
At fork c: E[c(a1)] = -15.7 and E[c(a2)] = 60; since 60 > -15.7, choose sell.
At fork d: E[c(a1)] = 270 and E[c(a2)] = 60; since 270 > 60, choose drill.
At fork e: E[c(a1)] = 100 and E[c(a2)] = 90; since 100 > 90, choose drill.
Next, move one column to the left again and apply step 2 to the chance fork b. Here we
find that

E[c_b] = 0.7(60) + 0.3(270) = 123.

In the final step, we consider the last decision fork a and apply step 3 as before:

E[c(a01)] = 123 and E[c(a02)] = 100; since 123 > 100, choose to do the seismic survey.
Note that a01 denotes the action to do a seismic survey and a02 denotes the action not to
conduct a seismic survey. The decision tree is now complete, with the payoffs given
on the far right side.
Figure 14-3: [Decision tree for the full problem: decision fork a (do or skip the seismic
survey), chance fork b (unfavorable or favorable finding), decision forks c, d, and e
(drill or sell), and chance forks f, g, and h (oil or dry), annotated with the cash flows,
probabilities, terminal payoffs (670, -130, 60 after the survey; 700, -100, 90 without it),
and the expected payoffs -15.7, 270, 100, and 123 computed in this section.]
The double dashes have blocked off the undesirable paths. According to Bayes’ decision
rule, follow the open paths from left to right. Thus, the optimal policy is:
• Do the seismic survey.
• If the result is unfavorable, sell the land.
• If the result is favorable, drill for oil.
• The expected payoff (including the cost of the seismic survey) is $123,000.
This is the same unique optimal solution obtained in the earlier section without the
decision tree, as shown in Table 14.3. The backward induction process on the decision
tree leads to the same policy (or policies) as the more formal process.
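The backward induction procedure itself can be sketched as a small recursion over the tree of Figure 14-3. This is an illustrative encoding, not a general-purpose implementation: a chance node is a tuple `("chance", [(prob, child), ...])`, a decision node is `("decision", {label: child, ...})`, and a leaf is a terminal payoff in thousands of dollars (net of all cash flows along the path):

```python
def rollback(node):
    """Expected payoff of a (sub)tree under optimal decisions."""
    if isinstance(node, (int, float)):
        return node  # terminal payoff
    kind, branches = node
    if kind == "chance":
        # Step 2: probability-weighted average of the branch values.
        return sum(p * rollback(child) for p, child in branches)
    # Step 3: a decision fork takes the branch with the largest expected payoff.
    return max(rollback(child) for child in branches.values())

# Branches after the survey use posteriors 1/7 and 1/2 and net payoffs 670/-130/60.
survey_branch = ("chance", [
    (0.7, ("decision", {"drill": ("chance", [(1/7, 670), (6/7, -130)]),
                        "sell": 60})),
    (0.3, ("decision", {"drill": ("chance", [(1/2, 670), (1/2, -130)]),
                        "sell": 60})),
])
no_survey_branch = ("decision", {"drill": ("chance", [(0.25, 700), (0.75, -100)]),
                                 "sell": 90})

values = {"survey": rollback(survey_branch),
          "no survey": rollback(no_survey_branch)}
best = max(values, key=values.get)
print(best, round(values[best], 1))  # survey 123.0
```

Rolling back the tree reproduces the hand computation: 123 for doing the survey versus 100 for skipping it, so the survey branch is chosen at fork a.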
Section 14.6: Utility Theory
This section focuses on models of individual rationality that can be categorized by an
individual’s traits. Thus far, when applying Bayes’ decision rule,
we have assumed that the expected payoff or consequence in monetary terms is the
appropriate measure of the consequences of taking an action. However, in many
situations this assumption is inappropriate. Fortunately, there is a way of transforming
monetary values to an appropriate scale that reflects the decision maker’s preferences.
This scale is referred to as the utility function for money.
A utility function for money describes an individual’s behavior pertaining to money. A
person who is risk-averse exhibits a decreasing marginal utility for money, whereas a
risk seeker demonstrates an increasing marginal utility for money.
The intermediate case is that of a risk-neutral individual, who prizes money at its face
value; such an individual’s utility for money is simply proportional to the amount of
money involved. It is possible to exhibit a mixture of these kinds of behavior. For
example, an individual might be essentially risk-neutral with small amounts of money,
then become a risk seeker with moderate amounts, and then turn risk-averse with large
amounts. Moreover, one’s attitude toward risk can change over time and circumstances.
The fact that different people have different utility functions for money has an important
implication for decision making in the face of uncertainty. When a utility function for
money is incorporated into a decision analysis approach to a problem, this utility function
must be constructed to fit the preferences and values of the decision maker involved. The
key to constructing the utility function for money to fit the decision maker is based on the
fundamental properties explained in Section 6. The following fundamental property summarizes, in words, the central notion: under the assumptions of utility theory, the decision maker is indifferent between two alternative courses of action whenever the two alternatives have the same expected utility.
The scale of the utility function is irrelevant. It is only the relative values of the utilities
that matter. All the utilities can be multiplied by any positive constant without affecting
which alternative course of action will have the largest expected utility.
Part V: Analytic & Computational Methods for Decision Problems
To summarize, the basic role of the utility function in decision analysis is this: when the decision maker's utility function for money is used to measure the relative worth of the various possible monetary outcomes, Bayes' decision rule replaces monetary payoffs by the corresponding utilities. The optimal action (or sequence of actions) is therefore the one which maximizes the expected utility. Although only utility functions for money are discussed in this section, utility functions can also be constructed for nonmonetary consequences of the alternative actions.
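The substitution of utilities for monetary payoffs can change which action is optimal. A minimal sketch of this point, using a hypothetical two-action, two-state problem and a hypothetical risk-averse utility (none of these numbers come from the text):

```python
# Hypothetical two-action, two-state problem (illustrative numbers only,
# not the oil example). The two states are taken to be equally likely.
p = [0.5, 0.5]
payoff = {"safe": [40, 40],        # monetary payoff in each state
          "gamble": [-100, 200]}

# An illustrative risk-averse utility: losses are felt twice as strongly.
def u(m):
    return m if m >= 0 else 2 * m

expected_money = {a: sum(pi * m for pi, m in zip(p, ms))
                  for a, ms in payoff.items()}
expected_util = {a: sum(pi * u(m) for pi, m in zip(p, ms))
                 for a, ms in payoff.items()}

# Expected money favours the gamble (50 vs. 40), but Bayes' rule applied
# to utilities picks the safe action (40 vs. 0).
best_by_money = max(expected_money, key=expected_money.get)   # "gamble"
best_by_utility = max(expected_util, key=expected_util.get)   # "safe"
```

This is exactly the sense in which Bayes' decision rule "replaces monetary payoffs by the corresponding utilities": the maximization step is unchanged, only the quantities being averaged differ.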
Returning to the prototype example, recall in the summary of the problem it was stated
that the oil company was operating without much capital, so a loss of $100,000 would be
steep. The worst-case scenario is to conduct a seismic survey and then drill at a cost of $100,000 when there is no oil. On the other hand, striking oil holds the exciting prospect of earning $700,000. To apply the decision maker's utility function for money to the problem as described earlier, it is necessary to identify the utilities for all possible monetary payoffs. These payoffs and their corresponding utilities are presented in Table 14.4 and were obtained by the following methodology.
Table 14.4: Utilities for the oil company

Monetary Payoff    Utility
      -130           -150
      -100           -105
        60             60
        90             90
       670            580
       700            600
As a starting point in constructing the utility function, it is natural to let the utility of zero money be zero, so u(0) = 0. An appropriate next step is to consider the worst and best scenarios and then address the question of what value of p would make the decision maker indifferent between the two alternatives. Suppose the decision maker's choice is p = 1/5. If u(M) denotes the utility of a monetary payoff of M, this choice of p implies that

(4/5) u(-130) + (1/5) u(700) = 0   (utility of alternative 1).

The value of either u(-130) or u(700) can be set arbitrarily to establish the scale of the utility function. Choosing u(-130) = -150, this equation yields u(700) = 600. To identify u(-100), a choice of p is made that makes the decision maker indifferent between a payoff of -130 with probability p or definitely incurring a payoff of -100. The choice is p = 0.7, so

u(-100) = p u(-130) = 0.7(-150) = -105.
To obtain u(90), a value of p is selected that makes the decision maker indifferent between a payoff of 700 with probability p or definitely obtaining a payoff of 90. The value chosen is p = 0.15, so

u(90) = p u(700) = 0.15(600) = 90.
To obtain the decision maker's utility function for money, a smooth curve was drawn through u(-130), u(-100), u(90), and u(700), as shown in Figure 14.4. The values on this curve at M = 60 and M = 670 provide the corresponding utilities, u(60) = 60 and u(670) = 580, which completes the list of utilities displayed in Table 14.4. For contrast, the dashed line drawn at 45° shows the monetary value of the payoffs used exclusively in the preceding sections. Note how u(M) essentially equals M for small values of M, and then how u(M) gradually falls below M for larger values. This is typical of a moderately risk-averse individual.
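The indifference-probability construction just described can be reproduced step by step. The sketch below follows the text's numbers exactly: u(0) = 0, the 1/5 indifference probability between -130 and 700 with the scale fixed by u(-130) = -150, then u(-100) via p = 0.7 and u(90) via p = 0.15.

```python
# Reproduce the indifference-probability elicitation from the text.
u = {0: 0.0}        # utility of zero money is set to zero
u[-130] = -150.0    # one value fixed arbitrarily to set the scale

# Indifference at p = 1/5: (4/5) u(-130) + (1/5) u(700) = u(0) = 0,
# which rearranges to u(700) = -4 u(-130).
u[700] = -4.0 * u[-130]          # 600

# Indifferent between -130 with probability 0.7 and -100 for certain.
u[-100] = 0.7 * u[-130]          # -105

# Indifferent between 700 with probability 0.15 and 90 for certain.
u[90] = 0.15 * u[700]            # 90
```

These four elicited points, together with u(60) and u(670) read off the smooth curve, reproduce the utility column of Table 14.4.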
There are other approaches to estimating u(M); however, they are not necessarily appropriate in this instance. Decision trees can also be used in a fashion identical to the analysis in the preceding section, except that utilities are substituted for monetary payoffs. For the following analysis of a drilling decision we add an extra element for the sake of variety, and close the section with a more general analytical perspective.
Section 14.7: Combination of Formal and Informal Analysis
Continuing with the same problem, the desirability of drilling depends on the amount of
oil which will be found – oil or no oil. Before making this decision, the oil company
wishes to obtain more geological and geophysical evidence by means of a seismographic
recording, which is quite expensive. It is also assumed that these recordings, if made, will give completely reliable information that one of three conditions prevails: (1) there is no subsurface structure, (2) there is an open subsurface structure, or (3) there is a closed subsurface structure. The descriptions of the four spaces A, Θ, E, and X are summarized in Table 14.5, and the possible sequences of choices are displayed in the decision tree in Figure 14.5.
Table 14.5: Possible Choices

Space   Elements   Interpretations
A       a1         Drill, do not sell location
        a2         Do not drill, sell location
Θ       θ1         Oil
        θ2         No oil
E       e0         Do not take seismic readings
        e1         Take seismic readings
X       x0         Dummy outcome of e0
        x1         e1 reveals no structure
        x2         e1 reveals open structure
        x3         e1 reveals closed structure
Assignment of Utilities
The psychological stimulus associated in this problem with an (e, x, a, θ) 4-tuple is highly complicated. Different wells entail different drilling costs, and different strikes produce different quantities and qualities of oil, which can be recovered over different periods of time and sold at different prices. Furthermore, each potential consequence of the present drilling venture may interact with future potential drilling ventures, for example through the geological information gained.
There are uncertainties surrounding any particular (e, x, a, θ) complex which must be compared formally or informally. Assume, nevertheless, that the decision maker can assign utility numbers to reflect his preferences. This relates to the idea of cutting the decision tree presented in Sections 4 and 9.
Assignment of Probabilities
The assignment of a probability measure to the probability space X × Θ bears more importance on the conditional (or "posterior") measure P''(θ | x) and the marginal measure P(x | e) than on the complementary measures P(x | e, θ) and P'(θ). Previous experience with the amounts of oil found in the three possible types of geological structure (x1 = no structure, x2 = open structure, x3 = closed structure) may make it possible to assign a nearly "objective" measure P''(θ | x) to the amount of oil which will be found given any particular experimental result, whereas it would be much less clear what measure P'(θ) should be assigned to the amount of oil in the absence of knowledge of the structure. At the same time, it will in general be much more meaningful to a geologist to assign a marginal measure P(x | e) to the various structures, and thus to the sample space X, than to assign conditional measures P(x | e, θ) to X depending on the amount of oil which will be found. The hypothetical measures P''(θ | x) and P(x | e) are shown on those branches of the decision tree in Figure 14.5 which emanate from e1; the prior probabilities P'(θ1) = 0.2 and P'(θ2) = 0.8 shown on the branches emanating from e0 were computed from them by use of the formula

P'(θ_i) = P''(θ_i | x1) P(x1 | e1) + P''(θ_i | x2) P(x2 | e1) + P''(θ_i | x3) P(x3 | e1).
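This marginalization can be sketched numerically. The text reports only the resulting priors 0.2 and 0.8, not the posterior and marginal values shown on the tree, so the inputs below are hypothetical numbers invented to be consistent with that result.

```python
# Hypothetical posterior and marginal values (the text gives only the
# resulting priors 0.2 and 0.8, so these inputs are invented to match).
post_oil = {"x1": 0.0, "x2": 0.3, "x3": 0.55}   # P''(theta1 | x_j)
marg_x = {"x1": 0.5, "x2": 0.3, "x3": 0.2}      # P(x_j | e1)

# P'(theta1) = sum_j P''(theta1 | x_j) P(x_j | e1)
prior_oil = sum(post_oil[x] * marg_x[x] for x in marg_x)
prior_no_oil = 1.0 - prior_oil
# prior_oil comes out (approximately) 0.2 and prior_no_oil 0.8,
# matching the prior probabilities shown on the branches from e0.
```

The design point is the one made in the text: the geologist assesses the "easy" measures P''(θ | x) and P(x | e), and the prior P'(θ) falls out of the formula rather than being assessed directly.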
Analysis
Since all the data required for the analysis appear in the decision tree in Figure 14.5, it can easily be verified that the optimal decision is to pay for seismographic recordings (e1) and then drill (a1) if and only if the recordings reveal an open (x2) or closed (x3) structure. It is worthwhile to pay for the recordings because the expected utility of this decision is 15.25, whereas the expected utility of the optimal act without seismic information is 0.
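The verification proceeds by the same rollback (backward induction) used earlier: average at chance nodes, maximize at decision nodes. The sketch below shows the mechanics on a tree of the same shape; all utilities and probabilities are hypothetical placeholders, not the values of Figure 14.5, so the resulting number differs from the 15.25 quoted above.

```python
# Generic rollback (backward induction) on a small decision tree.
# All utilities and probabilities are hypothetical placeholders.

def rollback(node):
    kind, content = node
    if kind == "leaf":
        return content                                  # terminal utility
    if kind == "chance":                                # expected utility
        return sum(p * rollback(child) for p, child in content)
    return max(rollback(child) for child in content)    # decision: best act

tree = ("decision", [
    ("leaf", 0.0),                 # e0: no survey, optimal act worth 0
    ("chance", [                   # e1: take seismic readings
        (0.50, ("decision", [("leaf", -10.0), ("leaf", 0.0)])),  # x1: no structure
        (0.25, ("decision", [("leaf", 25.0), ("leaf", 0.0)])),   # x2: open
        (0.25, ("decision", [("leaf", 60.0), ("leaf", 0.0)])),   # x3: closed
    ]),
])

best = rollback(tree)   # 21.25 here: the survey branch beats acting without it
```

Note the policy structure these placeholder numbers produce mirrors the text's conclusion: at x1 the "do not drill" leaf wins, at x2 and x3 the "drill" leaf wins, and the survey branch dominates e0.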
References
Bather, J. (2000). Decision Theory. Chichester: John Wiley & Sons, Inc.
Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag New York Inc.
Brooks, S.P. and Roberts, G.O. (1998). Convergence Assessment Techniques for Markov Chain Monte Carlo. Statistics and Computing, 8:319-335.
Buchanan, J.T. (1982). Discrete and Dynamic Decision Analysis. Chichester: John Wiley & Sons, Inc.
Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings Algorithm. The American Statistician, 49(4).
Clemen, R.T. (1996). Making Hard Decisions: An Introduction to Decision Analysis (2nd ed.). Pacific Grove, CA: Brooks/Cole Publishing Company.
Draper, D. and Madigan, D. (1997). The Scientific Value of Bayesian Statistical Methods and Outlook.
Everitt, B.S. (2002). The Cambridge Dictionary of Statistics (2nd ed.). Cambridge: Cambridge University Press.
French, S. & Insua, D.R. (2000). Statistical Decision Theory. London: Arnold.
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995). Bayesian Data Analysis. London: Chapman and Hall.
Gilboa, I. & Schmeidler, D. (2001). A Theory of Case-Based Decisions. UK: Cambridge University Press.
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. London: Chapman and Hall.
Goldstein, H. (2002). Multilevel Statistical Models (3rd ed.). London: Arnold Publishers.
Grimmett, G.R. & Stirzaker, D.R. (1982). Probability and Random Processes. New York: Oxford University Press.
Hastings, N.A.J. & Mello, J.M.C. (1978). Decision Networks. UK: John Wiley & Sons, Inc.
Hillier, F.S. & Lieberman, G.J. (2001). Introduction to Operations Research (7th ed.). New York: McGraw-Hill.
Jensen, F.V. (2001). Bayesian Networks and Decision Graphs. New York: Springer-Verlag New York, Inc.
Jordan, M.I. (ed.) (1999). Learning in Graphical Models. The Netherlands: Kluwer Academic Publishers.
Kass, R.E., Carlin, B.P., Gelman, A., and Neal, R.M. (1998). Statistical Practice: Markov Chain Monte Carlo in Practice: A Roundtable Discussion. The American Statistician, 52(2).
Lindgren, B.W. (1971). Elements of Decision Theory. New York: The Macmillan Company.
Luce, R.D. (2000). Utility of Gains and Losses. New Jersey: Lawrence Erlbaum Associates, Publishers.
MacDonald, I.L. & Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. Boca Raton: Chapman & Hall, Inc.
Marshall, K.T. & Oliver, R.M. (1995). Decision Making and Forecasting. New York: McGraw-Hill, Inc.
Murphy, K. (1998). http://www.ai.mit.edu/~murphyk/Bayes/bayes.html
Nau, R. (2002). Ph.D. Seminar on Choice Theory. Duke University: The Fuqua School of Business.
Neal, R. (1996). Bayesian Learning for Neural Networks. New York: Springer-Verlag New York, Inc.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. California: Morgan Kaufmann.
Pearl, J. (1999). Causality. http://bayes.cs.ucla.edu/jp_home.html
Petitti, D.B. (2000). Meta-Analysis, Decision Analysis, and Cost-Effectiveness Analysis: Methods for Quantitative Synthesis in Medicine (2nd ed.). New York: Oxford University Press, Inc.
Raiffa, H. & Schlaifer, R. (2000). Applied Statistical Decision Theory. New York: John Wiley & Sons, Inc.
Ripley, B. (1987). Stochastic Simulation. Chichester: John Wiley & Sons, Inc.
Ripley, B. (1997). Pattern Recognition and Neural Networks. Chichester: John Wiley & Sons, Inc.
Rivett, P. (1980). Model Building for Decision Analysis. Chichester: John Wiley & Sons, Inc.
Tierney, L. and Mira, A. (1999). Some Adaptive Monte Carlo Methods for Bayesian Inference. Statistics in Medicine, 18:2507-2515.
Weirich, P. (2001). Decision Space. Cambridge: Cambridge University Press.
West, D.B. (2001). Introduction to Graph Theory. New Jersey: Prentice Hall.
White, D.L. (1976). Fundamentals of Decision Theory. New York: American Elsevier Publishing Company, Inc.
Courses and Professors:
STA4276H (2003): MCMC algorithms by Professor Jeffrey Rosenthal.
CHL5223H (2002/3): Applied Bayesian Methods by Professor Michael Escobar
(meetings).
Part I: Decision Theory – Concepts and Methods
Part I: DECISION THEORY Concepts and Methods
Decision theory as the name would imply is concerned with the process of making decisions. The extension to statistical decision theory includes decision making in the presence of statistical knowledge which provides some information where there is uncertainty. The elements of decision theory are quite logical and even perhaps intuitive. The classical approach to decision theory facilitates the use of sample information in making inferences about the unknown quantities. Other relevant information includes that of the possible consequences which is quantified by loss and the prior information which arises from statistical investigation. The use of Bayesian analysis in statistical decision theory is natural. Their unification provides a foundational framework for building and solving decision problems. The basic ideas of decision theory and of decision theoretic methods lend themselves to a variety of applications and computational and analytic advances. This initial part of the report introduces the basic elements in (statistical) decision theory and reviews some of the basic concepts of both frequentist statistics and Bayesian analysis. This provides a foundational framework for developing the structure of decision problems. The second section presents the main concepts and key methods involved in decision theory. The last section of Part I extends this to statistical decision theory – that is, decision problems with some statistical knowledge about the unknown quantities. This provides a comprehensive overview of the decision theoretic framework.
1
Part I: Decision Theory – Concepts and Methods
Section 1: An Overview of the Decision Framework: Concepts & Preliminaries
Decision theory is concerned with the problem of making decisions. The term statistical decision theory pertains to decision making in the presence of statistical knowledge, by shedding light on some of the uncertainties involved in the problem. For most of this report, unless otherwise stated, it may be assumed that these uncertainties can be considered to be unknown numerical quantities, denoted by θ. Decision making under uncertainty draws on probability theory and graphical models. This report and more particularly this Part focuses on the methodology and mathematical and statistical concepts pertinent to statistical decision theory. This initial section presents the decisional framework and introduces the notation used to model decision problems.
Section 1.1: Rationale
A decision problem in itself is not complicated to comprehend or describe and can be simply summarized with a few basic elements. However, before proceeding any further, it is important to note that this report focuses on the rational decision or choice models based upon individual rationality. Models of strategic rationality (smallgroup behavior) or competitive rationality (market behavior) branch into areas of game theory and asset pricing theory, respectively. Thus for the purposes of this report, these latter models have been neglected as the interest of study is statistical decision theory based on individual rationality. “In a conventional rational choice model, individuals strive to satisfy their preferences for the consequences of their actions given their beliefs about events, which are represented by utility functions and probability distributions, and interactions among individuals are governed by equilibrium conditions” (Nau, 2002[1]). Decision models lend themselves to a decision making process which involves the consideration of the set of possible actions from which one must choose, the circumstances that prevail and the consequences that result from taking any given action. The optimal decision is to make a choice in such a way as to make the consequences as favorable as possible. As mentioned above, the uncertainty in decision making which is defined as an unknown quantity, θ, describing the combination of “prevailing circumstances and governing laws”, is referred to as the state of nature (Lindgren, 1971). If this state is unknown, it is simple to select the action according to the favorable degree of the consequences resulting from the various actions and the known state. However, in many real problems and those most pertinent to decision theory, the state of nature is not completely known. Since these situations create ambiguity and uncertainty, the consequences and subsequent results become complicated. 
Decision problems under uncertainty involve “many diverse ingredients”  loss or gain of money, security, satisfaction, etc., (Lindgren, 1971). Some of these “ingredients” can be assessed while some may be unknown. Nevertheless, in order to construct a mathematical framework in which to model decision problems, while providing a rational 2
It is sometimes possible to take measurements or conduct experiments in order to gain more information about the state. (or Parameter Space) The decision process is affected by the unknown quantity θ ∈ Θ which signifies the state of nature. 2. a decision maker perceives that a particular action a results in a corresponding state θ. The results of such experiments are called data or observations. notation and structure involved in decision modeling. then the decision problem is known as a statistical decision problem. The set of all possible states of nature is denoted by Θ. To summarize. a decision maker is to select a single action a ∈ A from a space of all possible actions. It is assumed that a decision maker can specify the following basic elements of a decision problem. 3. It should be noted that the term actions is used in decision literature instead of decisions. the “ingredients” of a decision problem include (a) a set of available actions. For brevity purposes. they can be used somewhat interchangeably. Thus. 1. Section 1. (b) a set of admissible states of nature. the decision problem is referred to as the “nodata” or “without experimentation” decision problem. The single action is denoted by an a. Usually something is known about the state of nature. allowing a consideration of a set of states as being admissible (or at least theoretically so). A decision process is referred to as “statistical” when experiments of chance related to the state of nature are performed. However. 3 . Because monetary gain is often neither an adequate nor accurate measure of consequences. State Space: Θ = {θ}. and thereby ruling out many that are not. These provide a basis for the selection of an action defined as a statistical decision rule. Action Space: A = {a}. When only these make up the elements of a decision problem. and (c) a loss associated with each combination of a state if nature and action. while the set of all possible actions is denoted as A. 
Consequence: C = {c}.2 The Basic Elements The previous section summarized the basic elements of decision problems. However. if (d) observations from an experiment defined by the state of nature are included with (a) to (c). this section will not repeat the description of the two types of decision models and simply state the mathematical structure associated with each element. This initial overview of the decision framework allows for a clear presentation of the mathematical and statistical concepts. the notion of utility is introduced to quantify preferences among various prospects which a decision maker may be faced with.Part I: Decision Theory – Concepts and Methods basis for making decisions. a numerical scale is assumed to measure consequences. Thus.
4. Decision Rule: δ ( x) ∈ A . If a decision maker is to observe an outcome X = x and then choose a suitable action δ ( x) ∈ A . An outcome of a potential experiment e ∈ E is denoted as x ∈ X . Family of Experiments: E = {e}. The quantification of a decision maker’s preferences is described by a utility function u (e. Sections 2 and 3 focus on discussing the appropriate measures of minimization in decision processes.⋅.⋅) on E × X × A × Θ . Thus. it should be noted that when a statistical investigation (such as an experiment) is performed to obtain information about θ. with a corresponding θ. The objectives of a decision maker are described as a realvalued loss function l (a. The importance of this outcome was explained in (3) and hence is not repeated here. which is 4 . Focusing on the former. such as X.θ ) .⋅. 8. choosing a particular action a.θ ) ∈ A × Θ . x. Sample Space: X = {x}. The probability distribution of a random variable.θ ) . Loss Function: l (a. then the result is to use the data to minimize the loss l (δ ( x). a.3 Probability Measures Statistical decision theory is based on probability theory and utility theory.Part I: Decision Theory – Concepts and Methods The consequence of choosing a possible action and its state of nature may be multidimensional and can be mathematically stated as c(a. Typically experiments are performed to obtain further information about each θ ∈ Θ . The evaluation of the utility function u takes into account costs of an experiment as well as consequences of the specific action which may be monetary and/or of other forms. Utility Evaluation: u (⋅.θ ) ∈ C . 7. the subsequent observed outcome x is a random variable.θ ) which is assigned to a particular conduct of e. while the set of all possible experiments is denoted as E. The set of all possible outcomes is the sample space while a particular realization of X is denoted as x. X is a subset of ℜ n . 6. Notably. 5. a resulting observed x. Section 1. 
which measures the loss (or negative utility) of the consequence c(a. However. this subsection presents the elementary probability theory used in decision processes. a decision maker may select a single experiment e from a family of potential experiments which can assist in determining the importance of possible actions or decisions.θ ) . A single experiment is denoted by an e.
Part I: Decision Theory – Concepts and Methods
dependent on θ, as stated above, is denoted as Pθ (E ) or Pθ ( X ∈ E ) where E is an event. It should also be noted that the random variable X can be assumed to be either continuous or discrete. Although, both cases are described here, the majority of this report focuses on the discrete case. Thus, if X is continuous, then
Pθ ( E ) = ∫ f ( x  θ )dx = ∫ dF X ( x  θ ) .
E E
Similarly, is X is discrete, then Pθ ( E ) = ∑ f ( x  θ ) .
x∈E
Although, E has been used to describe a family of experiments, an event and will be used in the next subsection to denote expectations, the meaning of E will be clear from the context. For ∀e ∈ E (where e represents an experiment or even an event), a joint probability measure Pθ , x (⋅,⋅  e) or simply denoted Pθ , xe is assigned and more commonly referred to as the possibility space. This is used to determine other probability measures (Raiffa & Schlaifer, 2000):
(i) (ii) (iii) (iv) The marginal measure Pθ′ (⋅) on the state space Θ. Of course, here the assumption is that Pθ′ (⋅) does no depend on e. The conditional measure Px (⋅  e, θ ) on the sample space X for a given e and θ. The marginal measure Px (⋅  e) on the sample space X for a given e. The conditional measure Pθ′′(⋅  x) on the state space Θ for a given e and x. The condition e is not stated as x is a result of the experiment and hence the relevant information is contained in x.
Before concluding this subsection, it is important to make two remarks. The first is to summarize the three basic methods for assigning the above set of probability measures. That is (a) if a joint probability measure is assigned to Θ × X , then the marginal and conditional measures on Θ and X can be computed, (b) if a marginal measure (probability distribution) is assigned to Θ and the conditional for X, then the joint can be found and similarly (c) if a marginal measure (probability distribution) is assigned to X and the conditional for Θ, then the joint can be determined. These elementary “methods” or concepts of probability have a more practical importance. The second remark is simply to clarify that the prime on the probability measure indicates a prior probability where as the double prime indicates a posterior probability. For the most part, these notations are redundant but at certain times will help to keep things clear. When it is obvious these superscripts will not required. A further discussion of priors is provided in Section 1.5.
5
Part I: Decision Theory – Concepts and Methods
Section 1.4 Random Variables and Expectations
In many instances, real numbers or ntuples (of numbers) describe the states of nature {θ) and sample outcomes {x}. In the sections of this report a tilde sign may be used to distinguish a random variable or function from a particular value of the function. For ~ ~ x x example, the random variables ~ and θ may be used to define θ (θ , x) = θ and ~ (θ , x) = x , respectively. Expectations of random variables are almost always considered necessary when dealing with decision processes such as loss functions. The expectation for a given value of θ, is defined to be
h( x) f ( x  θ ), (continuous) ∫ Eθ [h( X ) = X ∑ h( x) f ( x  θ ), (discrete) x∈X
As before, the superscripts and subscripts on the expectation operator will perform in much the same way as for the probability measure. When necessary, such scripts will be minimized when the context is clear. Thus, with respect to points (i)(iv) in the previous section, the notation for the following expectations is provided below: ~ ′ (i) Eθ or E ′(θ ) is taken with respect to Pθ′ . ′ (ii) Eθ′ x is taken with respect to Pθ′′x .  (iii) E ze ,θ is taken with respect to Pze ,θ . (iv) E xe is taken with respect to Pxe .
Section 1.5 Statistical Inference (Classical versus Bayesian)
Statistical inference is considered here within the decision framework. Both classical and Bayesian perspectives are briefly presented to show the varying approaches. Classical statistics uses the sample information to make inferences about the unknown quantity, θ. These inferences (within decision theory) are combined with other relevant information in order to choose the optimal decision. These other relevant information/sources include the knowledge of the possible consequences of the decisions and prior information which was previously mentioned. The former nonsample information, consequences, can be quantified by determining the possible loss incurred for each possible decision. The latter, prior information, is the information about θ arising from other relevant sources such as past experiences. Bayesian analysis is the approach which “seeks to utilize prior information” (Berger, 1985). This third type of information is best described in terms of a probability distribution. The symbol π (θ ) or simply p(θ ) will be used to represent a prior density of
6
Part I: Decision Theory – Concepts and Methods
θ. Similar, to the other definitions, under both the continuous and discrete cases, the prior probability distributions can be written as
π (θ )d (θ ), (continuous) ∫ P (θ ∈ E ) = ∫ dF (θ ) = E E ∑ π (θ ), (discrete) θ ∈E
π
The uses of prior probabilities are discussed in the proceeding sections. Both the nonBayes and Bayesian decision theory are discussed in this report.
Section 1.6: Convex Sets
Various concepts presented throughout this report make use of the concepts of convexity and concavity; hence, the required definitions and properties are summarized below.
Definition:
A linear combination of the form αx1 + (1 − α ) x 2 with 0 ≤ α ≤ 1 is called a convex combination of x1 and x 2 . (Lindgren, 1971) If x1 and x 2 in Ω , then it can be said that Ω is convex if the line segment between any two points in Ω is a subset of Ω . (Berger, 1985). This definition simply suggests that the set of all combinations of two given points is precisely the set of points which make up the line segments jointing these two points. It may be conceived that the values α and (1 − α ) which fall between 0 and 1 may be interpreted as probabilities. A convex combination of two points then becomes a combination of probabilities.
Definition:
If {x1 , x 2 ,...} is a sequence of points in R m , and 0 ≤ α ≤ 1 are numbers such that
∑
∞
i =1
α i = 1 , then ∑i =1α i xi (and finite) is called a convex combination. The convex hull
∞
of a set Ω is the set of all points which are convex combinations of points in Ω . (Berger, 1985).
Figure 11:
Ω1
Ω2
7
8 . for the figure above. as will be shown. Thus. Formally. 2000). while a doughnut or a banana or a golf ball is not. X is a convex set if every line segment connecting two distinct points in X is wholly contained in X. a strictly convex set has no flat sides or edges: its exterior consists only of curved surfaces that are bowed outwards. So a convex hull is a set with no holes in its interior and no indentations on its exterior—e.Part I: Decision Theory – Concepts and Methods The set Ω1 is convex while the Ω 2 is not. A sphere is strictly convex. an egg is convex. (Nau. Convex sets play a central role in the geometric representation of preferences (utility theory) and choices (statistical decision theory). while a cube is convex but not strictly convex. and it is strictly convex if the interior points of such a line segment are in the strict interior of X.g.
To summarize. however unrealistic. and further selecting a set of probabilities for the various actions. p2.…. a2. p2. z2.Part I: Decision Theory – Concepts and Methods Section 2: “Nonstatistical” Decision Processes Statistical decision problems include data or observations on the unknown quantity. each outcome is associated with an action. pk). The state space Θ and the action space A are both finite – i. θ. for a problem with action space consisting of actions a1. Choosing a randomized action from among all possible actions amounts to selecting a random device from among all random devices that could be used to determine the action actually taken. The use of a random device to select an action from the set of possible actions is called a randomized or mixed action. is a probability vector (p1. a2.…. ak and randomized actions. a ) . pk assigned. the action space {a} and the loss function l (θ .e. that is a sequence of nonnegative numbers pi whose sum is 1. a random device (such as a coin (ordinary or biased)) is an experiment of chance having as many possible outcomes as there are actions from which to choose.1 The Set of Randomized Actions A simple way to explain the general theory of decision making is to consider a typical coin toss. In general. One approach to handling such problems. “there are only finitely many actions [states]”. consider constructing an experiment of chance producing outcomes z1. this excludes the data available on the state of nature. Section 2. Simply. we assume that the loss function is known. the most important elements are the state space {θ}. These concepts will be defined in the proceeding section within the nodata decision problem constructs. is to consider the nonstatistical approach. These aspects form the theoretical analysis of decision making. For instance. nonrandom actions (Lindgren. (also referred to as the state of nature) which may be used to choose a more optimal decision. 
A singular probability vector can be used to determine a pure action. In this instance. In simple decision problems. If the outcome of the experiment performed is z1 then action a1 is taken. the action a2 can be equivalently written as the probability vector 9 . To distinguish between pure (or original) actions a1. This leads to the following definition of a random action. The basic elements or “ingredients” of this nodata problem were presented in the previous section. 1971). The loss incurred is assumed to be measured in negative utility units where utility is defined by a function defining the prospects of a decision maker. the corresponding action is taken. This is reduces the problem to a simpler one.….…. zk with probabilities p1. probability vectors are used to define the latter type. This introduces an extraneous random device which is useful in providing decision rules that under some criteria are better than those that use only the given. ak. and when a particular outcome is observed. Definition: A randomized action.…. respectively.
(0, 1, 0, …, 0), which assigns all of the probability mass to action a2. Thus the pure actions constitute a subset of the randomized actions.

Section 2.2 The Loss Function

The use of a randomized action in a decision problem with a given loss function inevitably makes the loss a random variable, since the action actually taken depends on the outcome of the random device. The loss function l(θ, a) measures the consequence of employing a given action when nature is in a given state, and the most natural quantity to consider when making a decision under the uncertainty in θ is the expected value of this random loss. In particular, if the randomized action (p1, p2, …, pk) is used to choose among the actions a1, a2, …, ak, the expected loss is the weighted sum

l(θ, a1)p1 + l(θ, a2)p2 + … + l(θ, ak)pk.

(For a continuous set of actions this sum would be written as an integral over the losses l(θ, a).) In a decision problem with m states θ1, θ2, …, θm, there are m such expected losses corresponding to each randomized action:

L1 = E[l(θ1, a)] = l(θ1, a1)p1 + … + l(θ1, ak)pk
L2 = E[l(θ2, a)] = l(θ2, a1)p1 + … + l(θ2, ak)pk
⋮
Lm = E[l(θm, a)] = l(θm, a1)p1 + … + l(θm, ak)pk.

These relations can be written in the vector form

(L1, L2, …, Lm) = p1·(l(θ1, a1), …, l(θm, a1)) + … + pk·(l(θ1, ak), …, l(θm, ak)),

which suggests the interpretation of the vector of losses (L1, L2, …, Lm) as a point in m-dimensional space, computed as a convex combination of the points (l(θ1, ai), …, l(θm, ai)), for i = 1, …, k. The latter points are the loss vectors defined by the pure actions, while those defined by randomized actions are convex combinations of those defined by the pure actions (Berger, 1985) and (Lindgren, 1971). The notion of convex sets was introduced in subsection 1.6. The above definition of expected loss is explained in detail in Chapter 2 of Berger’s Statistical Decision Theory and Bayesian Analysis (1985). The closing example of this subsection shows that the set of loss points (L1, L2, …, Lm) defined by all possible randomized actions (p1, p2, …, pk) is a convex set in m-dimensional space.

Recall from Section 1.6 that a convex combination of a set of points corresponds to the “center of gravity of a system of point masses at those points”, where these masses are proportional to the weights in the convex combination. Thus the randomized action (p1, p2, …, pk) yields a point (L1, …, Lm) which is the center of gravity of a system of masses with p1 units at a1, p2 units at a2, …, and pk units at ak. If these actions are represented by extreme points, and if only p1 and p2 are positive, with no mass at any other action, then the center of gravity (and therefore the point (L1, L2)) must lie on the line segment joining a1 and a2; such a mixture of just these two actions lies on the edge of the convex set of points representing all randomizations (Lindgren, 1971). Interpreting randomized actions as centers of gravity thus helps to interpret the set geometrically: it forms a convex polyhedron whose extreme points are pure actions, although some pure actions may fall inside the set. The following example demonstrates a two-dimensional case, i.e., two states of nature.

Example 2-1: (Lindgren, 1971) The following table describes a problem with five possible actions, two states of nature and the respective loss function.

      θ1   θ2
a1    2    3
a2    4    0
a3    3    3
a4    5    2
a5    3    3

The pure actions define the loss vectors (the rows in the table), and the randomized actions define the convex set generated by these five actions, which is found in the figure below.

Figure 2-1: [the loss points a1, …, a5 plotted in the (L1, L2) plane, together with the convex set they generate]

Note that one action, a3, falls within the convex polyhedron; for this example, the action a3 results in the same losses as a certain mixture of the other actions.
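As a computational aside (a sketch added here, not taken from Lindgren or Berger), the expected losses of a randomized action can be computed directly from Example 2-1’s loss table. The layout stores one row per state of nature and one column per action:

```python
# Loss table of Example 2-1: rows are states θ1, θ2; columns are actions a1..a5.
LOSS = [
    [2, 4, 3, 5, 3],  # losses under θ1
    [3, 0, 3, 2, 3],  # losses under θ2
]

def expected_losses(p):
    """L_i = Σ_j l(θ_i, a_j) p_j for a randomized action p = (p1, ..., pk)."""
    assert abs(sum(p) - 1) < 1e-9 and all(pj >= 0 for pj in p)
    return [sum(l * pj for l, pj in zip(row, p)) for row in LOSS]

# A pure action is the degenerate probability vector (0, 1, 0, 0, 0):
print(expected_losses([0, 1, 0, 0, 0]))      # [4, 0], the loss vector of a2
# An even mixture of a1 and a2:
print(expected_losses([0.5, 0.5, 0, 0, 0]))  # [3.0, 1.5]
```

The second call illustrates the center-of-gravity interpretation: an even mixture of a1 and a2 lands exactly at the midpoint of their loss points.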
Indeed, a3 can be obtained as a certain mixture of a1, a2, a4 and a5, as well as by many other mixtures of these actions, for instance as a mixture involving only a1, a2 and a5; it could not, however, be obtained by mixing actions a2, a4 and a5 alone.

Section 2.3 Regret

If the state of nature were known, then the action resulting in minimal loss would be taken: the action a for which l(θi, a) is smallest. Suppose, however, that one takes an action aj which does not produce this minimum, and then discovers that nature is indeed in state θi. The decision maker would regret not having chosen the action that produces the minimum, and the amount of loss that could have been saved by knowing the state of nature is called the regret. Regret is defined for each state θi and action aj as follows:

r(θi, aj) = l(θi, aj) − min_A l(θi, a).

So for each state of nature, the minimum loss mi = min_A l(θi, a) is a loss that could not be avoided even with the best decision; subtracting the minimum loss mi from the losses involving that state yields the regret. Regret is often referred to as opportunity loss and represents a “thinking” in terms of gain rather than loss (Lindgren, 1971). The gain resulting from taking action a when nature is in state θ is the negative of the loss: g(θ, a) = −l(θ, a). Hence, the minimum loss is the negative of the maximum gain,

min_A l(θ, a) = −max_A g(θ, a),

and the regret can be re-expressed in terms of gain as follows:

r(θi, aj) = max_A g(θi, a) − g(θi, aj).

This represents the maximum that could have been gained if the state of nature had been known, minus the amount that actually was gained by taking action aj.

Example 2-1: (cont’d.) (Lindgren, 1971)
Recall the example presented in the previous subsection. The loss table has been replicated below, with the addition of another column representing the minimum loss over all five actions in each state.

      θ1   θ2
a1    2    3
a2    4    0
a3    3    3
a4    5    2
a5    3    3
min_A l(θ, a):  θ1: 2,  θ2: 0

The regret table is provided below, computed by subtracting the minimum loss from each of the losses in the respective state.

      θ1   θ2
a1    0    3
a2    2    0
a3    1    3
a4    3    2
a5    1    3

There is at least one zero for each state, and the remaining entries are positive. For expected losses the same subtraction applies, Ri = E[r(θi, a)] = E[l(θi, a)] − mi = Li − mi; this holds generally because “the amount subtracted from each loss is independent of the action a, upon which a probability distribution is imposed in a mixed action” (Lindgren, 1971). Geometrically, the objective or effect is to translate the set of points so that at least one action lies on each axis. This is demonstrated in the figure below.

Figure 2-2: [the translated loss (regret) points a1, …, a5 in the (L1, L2) plane]

Notice that the whole convex set of randomized actions shifts along with the five pure actions.

Question: Would it make any difference, in studying a decision problem, if one used regret instead of loss? In some instances the regret may be more painful than the loss, depending on the role of the decision maker and his or her stakes. The classical treatment of statistical problems in statistical decision theory usually results in assuming a loss function that is already a regret function.
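The regret computation just described is mechanical, and can be sketched as follows (an illustration added here, not taken from the sources):

```python
def regret_table(loss):
    """r(θ_i, a_j) = l(θ_i, a_j) - min_a l(θ_i, a), computed state by state.

    `loss` has one row per state of nature and one column per action.
    """
    return [[l - min(row) for l in row] for row in loss]

# Loss table of Example 2-1 (rows = θ1, θ2; columns = a1..a5):
LOSS = [[2, 4, 3, 5, 3],
        [3, 0, 3, 2, 3]]
print(regret_table(LOSS))
# [[0, 2, 1, 3, 1], [3, 0, 3, 2, 3]] -- each state's row contains at least one zero,
# corresponding to the best action for that state.
```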
However, this is not always the case, which leads to a discussion of instances where the choice does make a difference.

Section 2.4 The Minimax Principle

Decision problems present a difficulty in determining the best decision because an action that is best under one state of nature is not necessarily the best under the other states of nature. Various schemes have been proposed, decision principles that lead to the selection of one or more actions as “best” according to the principle used, though none is universally accepted. Assigning “values” to each action according to its desirability is a frequentist principle; the minimax principle places a value on each action according to the worst that can happen with that action. For each action a, the maximum loss over the various possible states of nature is

M(a) = max_Θ l(θ, a).

By linearly ordering the available actions by M(a), an ordering among the possible actions is determined (Lindgren, 1971 and French & Insua, 2000). Taking the action a for which the maximum loss M(a) is a minimum lends itself to the name minimax. Berger states the same principle within the context of a decision rule: if δ ∈ Δ is a randomized rule, then the quantity sup_{θ∈Θ} R(θ, δ) represents the worst that could happen if the rule δ is used, and a rule is a minimax decision rule if it minimizes sup_{θ∈Θ} R(θ, δ) among all randomized rules in Δ. Similarly, the decision rule δ1 is preferred to a rule δ2 if

sup_{θ∈Θ} R(θ, δ1) < sup_{θ∈Θ} R(θ, δ2).

(See Berger, 1985, §1.5.)

Example 2-1: (cont’d.) (Lindgren, 1971) Again the loss table is repeated below, with the addition of a column stating the maximum loss for each action.

      θ1   θ2   max_Θ l(θ, a)
a1    2    3    3
a2    4    0    4
a3    3    3    3
a4    5    2    5
a5    3    3    3

The smallest maximum loss is determined by action a1 to be 3.
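A minimal sketch of the minimax principle over pure actions, using Example 2-1’s loss table (the function names here are ours, not Lindgren’s):

```python
def minimax_pure(loss):
    """Return (index of the pure action minimizing the maximum loss, the maxima M(a))."""
    m, k = len(loss), len(loss[0])
    maxima = [max(loss[i][j] for i in range(m)) for j in range(k)]
    return min(range(k), key=lambda j: maxima[j]), maxima

# Loss table of Example 2-1 (rows = θ1, θ2; columns = a1..a5):
LOSS = [[2, 4, 3, 5, 3],
        [3, 0, 3, 2, 3]]
best, M = minimax_pure(LOSS)
print(M)     # [3, 4, 3, 5, 3], the maximum loss M(a) of each action
print(best)  # 0, i.e. a1 attains the smallest maximum loss
```

Applying the same function to the regret table instead of the loss table reproduces the minimax-regret comparison made in the text.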
The table of regrets is produced below and shows different results than those of the maximum loss table: the minimum maximum regret is determined by action a2.

      θ1   θ2   max_Θ r(θ, a)
a1    0    3    3
a2    2    0    2
a3    1    3    3
a4    3    2    3
a5    1    3    3

Recall the question posed at the end of the last subsection: Would it make any difference, in studying a decision problem, if one used regret instead of loss? A graphical approach to determining the minimax point, which is feasible when there are two states of nature, is useful in approaching this question. Of course, in some instances the minimax loss action and the minimax regret action will not differ; however, this is not always so, because the minimax process is related to the location of the origin of the coordinate system, and moving the action points relative to the coordinate system (as the passage from loss to regret does) can alter the outcome of finding the minimax point.

When there are just two states of nature, a graphical solution to the problem of determining a minimax mixed action can be carried out by representing the randomized actions in terms of their loss points (L1, L2) or regret points (R1, R2). For a given action a with losses (L1, L2) under (θ1, θ2), the maximum of these losses is the first coordinate if the point lies below the bisector of the first quadrant (the 45˚ line), while if the point lies above that bisector, the maximum loss is the second coordinate L2. If two points lie above the bisector, the leftmost point has the smaller maximum, and if two points both lie below it, the lower one has the smaller maximum. To determine the minimax action among the set of all randomized actions is generally more complicated, because instead of choosing an action from a finite set of actions one must choose a probability vector from a set of possible probability vectors that is infinite in number (even if the set of actions is finite). For brevity’s sake, two cases are considered at this point: when there are just two states of nature, and when there are just two actions. The approach can be generalized to many states, but it becomes messier to decipher all the combinations. This can be shown by returning to the example.

Example 2-1: (cont’d.) (Lindgren, 1971) Continuing with the same example and looking at the figure below, it is clear that the minimax point (where the bisector meets the convex polyhedron) lies on the segment joining the points representing a1 and a2, and so represents a mixture involving only those two actions.

Figure 2-3: [the loss points a1, …, a5 with the 45˚ bisector meeting the lower left edge of the convex set]

The randomized actions mixing only a1 and a2 are vectors of the form (p, 1 − p, 0, 0, 0), defined by the single variable p. The point (x, y) in question is a convex combination of a1 and a2:

(x, y) = p(2, 3) + (1 − p)(4, 0).

Moreover, since the minimax point lies on the bisector, it is the point on that line segment with equal coordinates:

2p + 4(1 − p) = x = y = 3p + 0(1 − p),

and equating these two functions of p yields p = 4/5. The minimax action therefore puts 4/5 of the total probability (1) at a1, 1/5 at a2, and none at any other action, producing the probability vector (4/5, 1/5, 0, 0, 0). The minimax loss, that is, the smallest maximum expected loss, is the value obtained by setting p = 4/5, namely 12/5. The point on the graph representing the minimax action is thus (12/5, 12/5). Notice that this minimax expected loss is actually less than the minimum maximum loss of 3 achieved when only pure actions are admitted, as determined in the earlier example presented above. The case of two actions is discussed only briefly, as it can be handled in a similar fashion. The expected losses under the various states of nature can be computed as functions of p.
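The equalizing computation just carried out generalizes to any pair of actions whose loss points straddle the 45˚ bisector. The following sketch (ours, using exact rational arithmetic; it assumes the two expected-loss lines are not parallel) solves for the weight p that makes both expected losses equal:

```python
from fractions import Fraction

def equalizer_mix(a, b):
    """Weight p on action a, where a = (l(θ1,a), l(θ2,a)) and b likewise,
    making the two expected losses equal: a0*p + b0*(1-p) = a1*p + b1*(1-p)."""
    return Fraction(b[0] - b[1], (b[0] - b[1]) - (a[0] - a[1]))

# Example 2-1: mix a1 = (2, 3) with a2 = (4, 0).
p = equalizer_mix((2, 3), (4, 0))
L = p * 2 + (1 - p) * 4   # common value of the two expected losses
print(p, L)               # 4/5 12/5
```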
From these functions of p one can determine (at least graphically) the function max_Θ E[l(θ, a)]; the minimum point of the latter function then defines the minimax action. See both Berger (1985) and Lindgren (1971) for examples and further discussion.

Section 2.5 Bayes Solutions

The frequentist approach has considered the states of nature to be fixed. In contrast, the Bayes decision principle considers the concept of randomness for the state of nature. Whether or not this stance regards nature as truly random, it is nevertheless rational to incorporate into the decision making process one’s prior “hunches, convictions, or information about the state of nature, and how ‘likely’ (whatever that means) the various states of nature are to be governing the situation” (Lindgren, 1971). The role of prior probabilities was briefly introduced in Section 1 and will be expanded upon here. In a general decision problem, a probability weight g(θ)¹ is assigned to each state of nature θ, where these assigned weights are nonnegative and add up to 1. Such a set of probabilities or weights is called a prior distribution for θ.

¹ This notation is used so as to diffuse any confusion with the probability vectors assigned to each action.

Given the distribution g(θ), the computation of the expected loss according to the prior distribution provides a means of arranging or ordering the available actions on a scale (namely, B(a)), such that the action farthest to the left on that scale is the most desirable. This is accomplished by weighting the states and ordering the actions to permit the selection of a “best” action. Thus, if a large loss can occur for a given action when nature is in a state that the decision maker feels is highly unlikely, the effect of that extreme loss is diminished by the small weight on the state of nature that would have produced it. When randomized actions are considered, the expected loss for a given state θ is E[l(θ, a)] = Σj l(θ, aj) pj. For a given action a, the loss incurred is a random variable (through the random state of nature)
with expected value

B(a) = Σi g(θi) l(θi, a).

This is referred to as the Bayes loss corresponding to action a (Lindgren, 1971). The Bayes action is then defined to be the action a that minimizes the Bayes loss B(a), and it is this action that is to be taken. For randomized actions, a Bayes loss can be defined as the expectation, with respect to the given prior distribution, of the expected loss (with respect to the randomized action): for a randomized action p = (p1, p2,
…, pk), the Bayes loss is

B(p) = Σi g(θi) E[l(θi, a)] = Σi g(θi) Σj l(θi, aj) pj.

Since this Bayes loss is a function of (p1, p2, …, pk), or of k − 1 variables (pk is determined as soon as p1, p2, …, pk−1 are specified), the problem of determining the minimum Bayes loss is that of minimizing a function of (k − 1) variables.

Example 2-2: (Lindgren, 1971) Consider the following loss table, where there are just two states of nature and two actions. The prior probabilities are given by g(θ1) = w and g(θ2) = 1 − w, and the Bayes losses are simply calculated as B(a1) = 0·w + 6(1 − w) = 6 − 6w and B(a2) = 1·w + 5(1 − w) = 5 − 4w.

        θ1   θ2     B(a)
a1      0    6      6 − 6w
a2      1    5      5 − 4w
g(θ)    w    1 − w

The figure below illustrates that for any w to the left of 0.5 the smaller Bayes loss is incurred by taking action a2, while for any w to the right of 0.5 the Bayes action is a1, as it then yields the smaller Bayes loss. When w = 0.5, the value of w for which the two Bayes losses are equal, it is irrelevant whether action a1 or a2 is taken.

Figure 2-4: [the two Bayes-loss lines 6 − 6w and 5 − 4w plotted against w on [0, 1], crossing at w = 0.5]
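The Bayes ordering of Example 2-2 can be sketched numerically (an illustration added here; the loss rows are indexed by state, the columns by action):

```python
def bayes_losses(loss, prior):
    """B(a_j) = Σ_i g(θ_i) l(θ_i, a_j), one value per action."""
    m, k = len(loss), len(loss[0])
    return [sum(prior[i] * loss[i][j] for i in range(m)) for j in range(k)]

# Example 2-2: losses a1 = (0, 6), a2 = (1, 5); prior g(θ1) = w, g(θ2) = 1 - w.
LOSS = [[0, 1],
        [6, 5]]
for w in (0.25, 0.75):          # w = 0.5 would give a tie between the two actions
    B = bayes_losses(LOSS, [w, 1 - w])
    print(w, B, "Bayes action: a%d" % (B.index(min(B)) + 1))
# 0.25 [4.5, 4.0] Bayes action: a2
# 0.75 [1.5, 2.0] Bayes action: a1
```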
The interpretation of small or large w is of course dependent on the problem; in this example a small w represents a less favorable situation for the decision maker, whereas a large w represents a more favorable one. In general, in a problem with two states of nature and g(θ1) = w, g(θ2) = 1 − w, the Bayes losses for the various actions are linear functions of w. These functions are represented by straight lines, and for a given value of w the action corresponding to the line whose ordinate at that w is smallest is the Bayes action.

Now, when considering randomized actions (p1, p2, …, pk) for actions a1, …, ak, the expected losses produce

B(p) = Σi g(θi) Σj l(θi, aj) pj = Σj pj Σi g(θi) l(θi, aj).

Thus the value of B(p) is a convex combination of the values of the Bayes losses for the various pure actions, and is therefore at least as great as the smallest of those values (Lindgren, 1971). This means that there is no gain possible in the use of randomized actions: a pure action can always be found which yields the minimum Bayes loss.

The result that the Bayes actions are the same using regret as using loss is evident for the problem involving only two states of nature; it is, however, also true in general. Substitution of r(θi, aj) = l(θi, aj) − min_a l(θi, a) into the expression for expected regret,

E[r(θ, aj)] = Σi r(θi, aj) g(θi),

yields

E[r(θ, aj)] = Σi l(θi, aj) g(θi) − Σi g(θi) min_a l(θi, a) = E[l(θ, aj)] − Σi g(θi) min_a l(θi, a).

Thus E[r(θ, aj)] differs from E[l(θ, aj)] by a term that does not involve aj, and the action that minimizes one must therefore minimize the other. This is illustrated in the following example.

Example 2-3: (Lindgren, 1971) Consider the following decision problem with its corresponding loss table
and given prior weights g(θ1) and g(θ2).

      θ1   θ2
a1    2    3
a2    5    1
a3    3    5

The regrets are r(θ1, a) = l(θ1, a) − 2 and r(θ2, a) = l(θ2, a) − 1, and the expected regrets are

E[r(θ, a1)] = 2g(θ1) + 3g(θ2) − [2g(θ1) + g(θ2)]
E[r(θ, a2)] = 5g(θ1) + g(θ2) − [2g(θ1) + g(θ2)]
E[r(θ, a3)] = 3g(θ1) + 5g(θ2) − [2g(θ1) + g(θ2)],

so that E[r(θ, ai)] = E[l(θ, ai)] − [2g(θ1) + g(θ2)]. The expected regrets differ from the expected losses by a quantity not depending on the action, so the action minimizing one minimizes the other.

Section 2.6 Dominance and Admissibility

The concepts of dominance and admissibility are counterparts for comparing actions and decision rules. Some examples have been encountered in which certain of the available actions would never be used, because there are others for which the losses are always less. It is important to state some definitions before discussing their importance any further.

Definition: An action a* (pure or randomized) is said to dominate an action a if the loss incurred by using action a is always at least as great as that incurred by using a*: l(θ, a) ≥ l(θ, a*), for all θ.

Definition: An action a* (pure or randomized) is said to dominate strictly an action a if it dominates action a and if, in addition, there is some state of nature for which the loss inequality is strict: l(θ, a) > l(θ, a*), for some θ.

Definition:
An action (pure or randomized) is said to be admissible if no other action dominates it strictly.

If an action a* dominates an action a, there is no need to leave a in the competition for a best action, and a can be dispensed with, because there is an action that does at least as well under the circumstances. If the dominance is not strict, then the losses are the same for a as for a*; if it is strict, then one may actually do worse by using a than by using a*. (In this kind of representation, an action that dominates action a but does not strictly dominate it would be represented by the same point as action a, since the losses would be the same.)

Returning to the case of two states of nature, an action is represented as a point in the (L1, L2) plane, where Li is the loss (or expected loss, in the case of a randomized action) incurred when that action is taken and θi is the state of nature. The figure below shows points corresponding to actions a*, a1 and a2, for which the losses are such that a* dominates both a1 and a2. In the case of a1 the losses are greater than for a* for both states of nature; in the case of a2 only the loss for θ1 is strictly greater than with a*. Incidentally, the action a* would strictly dominate every action represented by points in the shaded quadrant (including the boundaries), with the exception of the point a* itself.

Figure 2-5: [points a*, a1 and a2 in the (L1, L2) plane, with the quadrant above and to the right of a* shaded]

Before ending this section, these concepts can be reiterated in terms of risk functions and decision rules. In Chapter 5 of French and Insua (2000), “it is stated that risk functions induce a natural ordering among decision rules: a decision rule which performs uniformly better in terms of risk than another for each value of θ seems better overall”. Let δ1(·) and δ2(·) be two decision rules. Then δ1(·) dominates δ2(·) if R(δ1, θ) ≤ R(δ2, θ), ∀θ, with strict inequality for some θ. Similarly, it can be stated that δ1(·) and δ2(·) are equivalent if R(δ1, θ) = R(δ2, θ), ∀θ. The preceding paragraph was extracted from Lindgren (1971).
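The dominance definitions translate directly into code. The following sketch (added here, not from the sources) screens the pure actions of Example 2-1 for strict domination by another pure action; note that it does not check domination by mixtures, which can eliminate further actions:

```python
def dominates(la, lb):
    """Loss vector la dominates lb if la <= lb in every state of nature."""
    return all(x <= y for x, y in zip(la, lb))

def strictly_dominates(la, lb):
    """Dominates, with strict inequality in at least one state."""
    return dominates(la, lb) and any(x < y for x, y in zip(la, lb))

def admissible_pure(loss):
    """Indices of pure actions not strictly dominated by any other pure action."""
    cols = list(zip(*loss))   # one loss vector per action
    return [j for j, c in enumerate(cols)
            if not any(strictly_dominates(cols[i], c)
                       for i in range(len(cols)) if i != j)]

# Loss table of Example 2-1 (rows = θ1, θ2; columns = a1..a5):
LOSS = [[2, 4, 3, 5, 3],
        [3, 0, 3, 2, 3]]
print(admissible_pure(LOSS))   # [0, 1]: only a1 and a2 survive this screening
```

Here a3 and a5 are strictly dominated by a1, and a4 by a2, which matches the geometric picture of interior and upper-right points being dispensable.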
(See also Chapter 6 of French and Insua (2000).)

Definition: A decision rule (or action) is admissible if there exists no R-better decision rule, where R refers to the risk function². A decision rule is inadmissible if there exists an R-better decision rule. An action that is not admissible is likewise said to be inadmissible.

The above definition was provided by Berger (1985). The risk function induces a partial ordering among the decision rules³, leading to the concept of admissibility. Since this is only a partial ordering, there are degrees of inadmissibility which the terminology ignores. In statistical problems there may be solutions that are slightly inadmissible but are preferred, for some reason, to those that dominate them, because of computability, for example. That is, an action that is close to the lower left boundary of a set of mixed actions (in the L1, L2 representation) may not be so bad as the name inadmissible would imply (Lindgren, 1971). Although the notion of decision rules has been brought up in this section, it is discussed in more detail in the next section.

² It is important to recall here that the risk function R(δ, θ) is the expected loss (as defined by the frequentist or classical approach). The risk function is further explored in Section 3.
³ It is important to note that decisions and actions are sometimes interchangeable.

Section 2.7 Bayesian and Classical Approaches – Bayes versus Minimax Principle

This section explores the connections between the Bayesian and classical approaches through their respective decision principles, namely the Bayes and minimax principles. The Bayesian approach introduces a prior distribution which may be used to weight the risk function, and it orders decision rules according to Bayes risk (French and Insua, 2000). Much of this discussion centers around the concept of admissibility, especially in the no-data decision problems. From the preceding three subsections, some possible generalities can be made: (1) Bayes solutions are usually admissible; (2) a minimax action is a Bayes action for some prior distribution; and (3) admissible actions are Bayes, for some prior distribution (Lindgren, 1971). These are nonformal statements; they are discussed in detail, with proofs, in 4.8 (Bayes admissibility) and 5.5 (Minimax admissibility and comparison with Bayes) of Berger (1985).

That a minimax solution is Bayes for some prior distribution is evident, for the case of two states of nature, from the geometrical representation in the L1–L2 plane (Lindgren, 1971). Again, a geometric representation with just two states of nature is instructive, as the set of all possible (randomized) actions is represented in the L1–L2 plane as a convex set. The figures below illustrate four cases that may arise.

Figure 2-6: [four sketches of the convex action set in the (L1, L2) plane]
(a) The minimax action occurs on the boundary of the set of actions and is a mixture of two pure actions; this action would be Bayes for the prior distribution (w, 1 − w) such that −w/(1 − w) is equal to the slope of the line through the two pure actions involved.
(b) The minimax action, which is a pure action, would be Bayes for any prior distribution such that −w/(1 − w) is a slope between the slopes of the line segments which meet in that pure action.
(c) The prior distribution which produces the minimax action as a Bayes action corresponds to the tangent line at L1 = L2; this kind of set of randomized actions would only occur if there are infinitely many pure actions at the outset.
(d) The minimax action would be any that yields an (L1, L2) on the bottom edge of the action set; these are Bayes for the prior distribution (0, 1), which assigns probability 1 to θ2.
Example 2-3: (cont’d.) (Lindgren, 1971) Revisiting the problem introduced in Section 2.5 produces the following Bayes losses, with given prior probabilities (w, 1 − w). The loss table has been repeated for ease.

      θ1   θ2
a1    2    3
a2    5    1
a3    3    5

B(a1) = 2w + 3(1 − w) = 3 − w
B(a2) = 5w + (1 − w) = 1 + 4w
B(a3) = 3w + 5(1 − w) = 5 − 2w.

The graphs of these are shown in Figure 2-7 below.

Figure 2-7: [the three Bayes-loss lines 3 − w, 1 + 4w and 5 − 2w plotted against w on [0, 1]]

The lowest point of intersection, and hence the highest minimum, is at w = 2/5. That is, of all the prior distributions that nature might choose, the highest minimum Bayes loss (the loss incurred by using the Bayes action) is achieved for the prior distribution (2/5, 3/5)⁴.

⁴ This is said to be a least favorable prior distribution (Lindgren, 1971).

The minimax randomized action is easily found (by the graphical procedure described in Section 2.4) to be (4/5, 1/5, 0), with losses L1 = L2 = 13/5. It is the point on the line through (2, 3) and (5, 1) with equal coordinates:
(L1 − 3)/(L1 − 2) = −2/3, or L1 = 13/5.

A prior distribution that would yield this randomized action as a Bayes action is defined by a w such that

−w/(1 − w) = −2/3,

where −2/3 is the slope of the line through (5, 1) and (2, 3); that is, w = 2/5. Notice that this w is precisely the w of the least favorable distribution⁵ determined above. A graphical representation of this has not been shown here, but it can be achieved in much the same manner as described in Sections 2.4 and 2.5.

⁵ The notion of a least favorable distribution giving rise to the minimax action as a Bayes action is general, but not trivial to establish. This idea is discussed more in depth in sources on game theory, which are not included in this report.

The Bayes Perspective…
Mathematically, the Bayes approach is just postulating a weighting function that provides an ordering among the actions. In the role of rational decision making, the decision maker is incorporating into that weighting function personal preferences about what that unknown state of nature is likely to be, and the inclusion of such personal preferences seems reasonable. However, non-Bayesians do criticize this approach because of its subjectivity. In defence of the approach, Berger (1985) says that, “when considered from a Bayesian viewpoint”, small inaccuracies in the specification of the prior are not treacherous; that is, practically, the method is not terribly sensitive to them. The concept of subjective probability is discussed in Part C on utility theory. Furthermore, the Bayes solutions are usually admissible.

The Minimax Approach…
The minimax approach, on the other hand, is not nearly so easy to defend. It is a pessimistic view, making the assumption that the worst will happen. Although a minimax solution is frequently admissible (and Bayes for some prior distribution), it is clear that the minimax approach can be unreasonable. A more significant objection is that the minimax principle, by evaluating every rule through sup_θ R(θ, δ), may violate the rationality principles (Berger, 1985).

Sometimes a minimax solution can be computed by determining, among the Bayes solutions, one for which the losses under the various states of nature are equal. That is, if p* = (p1*, …, pk*) is a randomized action that is Bayes with respect to some prior g(θ) and is such that the (expected) loss function
L(θ, p*) = E[l(θ, a)] = Σi l(θ, ai) pi*

is constant in θ, then p* is minimax. This can also be visualized easily with just two states of nature. The development follows (Lindgren, 1971). The assumption that p* is Bayes for g(θ) means that for any randomized action p,

B(p*) = min_P B(p) ≤ B(p).

Since L(θ, p*) is defined to be constant in θ, its mean value with respect to (wrt) the weighting g(θ) is just that constant value:

B(p*) = E[L(θ, p*)] = L(θ, p*).

On the other hand, since B(p) is the expected value of L(θ, p), it cannot exceed the maximum value of L(θ, p), and it follows for any p that

L(θ, p*) = B(p*) ≤ B(p) ≤ max_Θ L(θ, p).

Because this holds for every p, the constant L(θ, p*) cannot exceed the smallest of these maxima:

L(θ, p*) ≤ min_P max_Θ L(θ, p),

where the right side of this inequality is the minimax loss. And then, because L(θ, p*) is constant, it is equal to its own maximum value, which in turn is greater than or equal to the smallest such maximum:

L(θ, p*) = max_Θ L(θ, p*) ≥ min_P max_Θ L(θ, p).

Since L(θ, p*) is neither less than nor greater than the minimax loss, it must be equal to it, and this means that p* is a minimax solution, as was asserted. A similar proof can be found in Berger (1985) and is discussed in brief in French and Insua (2000). Lindgren (1971) has described this phenomenon in more depth, both algebraically and graphically.

Before closing this section, one note should be made. Two of three key decision principles have been discussed here, the Bayes and minimax principles. The third, the Invariance Principle, has not been included in a theoretical discussion, but for completeness it is mentioned. The Invariance Principle, as the name would imply, basically states that if two problems have identical formal structures (sample space, parameter space, densities and loss function), then the same decision rule should be used in each problem.
An entire chapter in Berger’s 1985 book Statistical Decision Theory and Bayesian Analysis is devoted to this principle. p*) cannot exceed the smallest of these maxima L(θ .
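As a numerical check on the least favorable prior of Example 2-3 (an added sketch, not in the original sources), one can scan priors (w, 1 − w) and confirm that the minimum Bayes loss, a concave piecewise-linear function of w, is maximized at w = 2/5 with value 13/5:

```python
# Loss table of Example 2-3 (rows = θ1, θ2; columns = a1, a2, a3):
LOSS = [[2, 5, 3],
        [3, 1, 5]]

def min_bayes_loss(w):
    """Bayes loss of the Bayes action under the prior (w, 1 - w)."""
    return min(w * LOSS[0][j] + (1 - w) * LOSS[1][j] for j in range(3))

# Grid search for the least favorable prior: the w maximizing min_a B(a).
best_w = max((i / 1000 for i in range(1001)), key=min_bayes_loss)
print(round(best_w, 3), round(min_bayes_loss(best_w), 3))   # 0.4 2.6
```

The value 2.6 = 13/5 is exactly the minimax loss found graphically, illustrating the equalizer argument above: the least favorable prior makes the minimum Bayes loss coincide with the minimax loss.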
Section 3: “Statistical” Decision Processes

Statistical decision problems include data, meaning “the results of making one or more observations on some random quantity that is thought to be intimately related to the state of nature in some decision problem” (Lindgren, 1971). The availability of such data provides some illumination in the selection of an action, in that the state of nature is not completely unknown. It will be shown that the use of data defines procedures which result in an expected loss that is lower than what would be incurred if the data were not available. However, even the availability of data will not completely avoid the kind of situation encountered in the no-data case, in which there is no clear-cut criterion for rating the various candidate procedures as a basis for choosing one of them as best.

Section 3.1 Data and the State of Nature

To obtain data for use in making decisions, an appropriate experiment of chance should be performed, one in which the state of nature determines the generation of the data. The data of a given problem may consist of a single number (the value of a random variable), or a sequence of numbers, usually resulting from performing the same experiment repeatedly, or sometimes a result or results that are not numerical. In general, a random variable X will be employed to refer to the data, and in the problems considered here X will denote either a single random variable or a sequence of random variables (X1, …, Xn). In any case, X will have certain possible “values” and a probability for each, according to the state of nature. That is, for each value x of the random quantity X there is a probability

f(x, θ) = Pθ(X = x)

assigned to that value x by the state of nature θ. (The notation Pθ(E) will mean the probability of the event E when the state of nature is θ.) It is important to realize that in order for the data to be of value in making a decision, the probability distribution for the data must depend on the state of nature, and
the dependence of the probability distribution for X on the state of nature must be known.1 Data and the State of Nature To obtain data for use in making decisions an appropriate experiment of chance should be performed – one in which the state of nature determines the generation of the data. an action has been assigned by the rule to that value. The availability of such data provides some illumination in the selection of an action such that the state of nature is not completely unknown.usually resulting from performing the same experiment repeatedly. However. that assigns one of the available actions to each possible value of the data X. for each value of x of the random quantity X there is a probability f ( x.θ ) is assumed to be given or known. Thus. It will be shown that the use of data will define procedures which result in an expected loss that is lower than what would be incurred if the data were not available. or set of instructions.
A decision rule is therefore a function a = δ(x), which selects one of the actions (a1,…, ak) for each possible value of the data, and is called a decision function, or statistical decision function. Berger (1985) gives the following definition.

Definition: A (nonrandomized) decision rule δ(x) is a function from X into A. If X = x is the observed value of the sample information, then δ(x) is the action that will be taken. (For a no-data problem, a decision rule is simply an action.) Two decision rules, δ1(x) and δ2(x), are considered equivalent if Pθ(δ1(X) = δ2(X)) = 1 for all θ. (It is always assumed that the functions involved are measurable.)

Question: How many distinct rules are there? If there are just k available actions (a1, a2,…, ak), and if the data X can have one of just m possible values (x1, x2,…, xm), then there are precisely k^m distinct decision functions that can be specified. "Of the k^m possible decision functions, some are sensible, some are foolish, some ignore the data, and some will use it wrongly" (Lindgren, 1971).

Example 3-1: (Lindgren, 1971) Consider the following table of the 2^3 = 8 decision rules for choosing between two actions when the observed value is X ∈ {0, 1, 2}:

   x |  δ1   δ2   δ3   δ4   δ5   δ6   δ7   δ8
   0 |  a2   a2   a2   a1   a2   a1   a1   a1
   1 |  a2   a2   a1   a2   a1   a2   a1   a1
   2 |  a2   a1   a2   a2   a1   a1   a2   a1

The first decision rule ignores the data, as it takes action a2 regardless. The second decision rule suggests taking action a2 if X = 0 or 1, and a1 if X = 2.

Decision rules can be randomized in much the way actions are randomized, using an extraneous random device to choose among the available rules. A randomized decision function may be specified by assigning probability pi to the pure decision function δi(x); it is then equivalent to a randomization of pure decision functions. Alternatively, a randomized rule can be defined by attaching to each outcome xi of the data X (from an experiment) a probability distribution (g1,…, gk) over the actions (a1, a2,…, ak) (Lindgren, 1971). The following definition provided by Berger (1985) closes this part.

Definition: A randomized decision rule δ(x, ⋅) is, for each x, a probability distribution on A, with the interpretation that if x is observed, δ(x, a) is the probability that an action in A will be chosen. (Again, a randomized decision rule in a no-data problem is simply a randomized action.)
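The count of k^m pure decision functions can be verified by enumeration; a minimal sketch for the k = 2, m = 3 case of the example above:

```python
from itertools import product

# Sketch: with k actions and m possible data values there are k**m
# pure decision functions.  Here k = 2 actions, m = 3 data values.
actions = ["a1", "a2"]
data_values = [0, 1, 2]

# Each rule is a tuple (delta(0), delta(1), delta(2)).
rules = list(product(actions, repeat=len(data_values)))
print(len(rules))   # k**m = 2**3 = 8
```

Listing `rules` reproduces the eight columns of the table, one tuple per pure decision function.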
Nonrandomized rules can be considered a special case of randomized rules, in that they correspond to the randomized rules which, for each x, choose a specific action with probability one. Let 〈δ〉 denote the randomized rule equivalent to the nonrandomized rule δ(x), given by

〈δ〉(x, A) = I_A(δ(x)) = 1 if δ(x) ∈ A, and 0 if δ(x) ∉ A.

Section 3.3 The Risk Function

Section 2 first introduced the concept of the risk function as understood by the classical approach. This section defines and explores the concept more explicitly. When the decision is based on the observed results of an experiment, the loss incurred depends not only on the state of nature that governs it, but also on the value of X that is observed. Since X is random, the loss incurred when a given decision function δ(x) is used is itself a random variable.

Definition: The risk function of a decision rule δ(x) is defined by

R(θ, δ) = E[l(θ, δ(X))],

which depends on the state of nature θ and on the decision rule δ. Assuming a distribution for X defined by Pθ(X = x) = f(x, θ), the risk function would be calculated as

R(θ, δ) = Σi l(θ, δ(xi)) f(xi, θ).

Or simply stated, the risk is the weighted average of the various values of the random loss: the expected loss if the decision rule δ(x) were used repeatedly with varying X in the decision problem. Notice that the dependence of the risk on the state θ arises from two facts: the loss for a given action depends on θ, and the probability weights used in computing the expected loss depend on θ.

The frequentist decision-theoretic approach evaluates decision rules through their risk functions: knowing the risk function but ignorant of the true state of nature, a decision rule (pure or randomized) can be selected from those available. The problem of selecting a decision rule knowing the risk function is mathematically exactly the same as that of selecting an action, in the absence of data, knowing the loss function.
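As a sketch, the computation R(θ, δ) = Σx l(θ, δ(x)) f(x, θ) can be coded directly. The distributions and losses below are hypothetical illustrations, not values taken from the text:

```python
# Sketch: computing R(theta, delta) = sum_x l(theta, delta(x)) f(x, theta).
# The distributions f(x | theta) and the 0-1 losses are hypothetical.
f = {"t1": {0: 0.6, 1: 0.3, 2: 0.1},
     "t2": {0: 0.1, 1: 0.4, 2: 0.5}}
loss = {("t1", "a1"): 0, ("t1", "a2"): 1,
        ("t2", "a1"): 1, ("t2", "a2"): 0}

def risk(theta, delta):
    """delta maps each possible data value x to an action."""
    return sum(loss[(theta, delta[x])] * f[theta][x] for x in delta)

delta4 = {0: "a1", 1: "a2", 2: "a2"}   # one pure rule, as a dict
print(risk("t1", delta4), risk("t2", delta4))
```

With these numbers the rule has risk 0.4 under the first state and 0.1 under the second, illustrating how the same rule carries a different expected loss under each state of nature.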
In a similar fashion to a loss table, a risk table can be created, with the rows representing the possible states of nature and the columns the distinct decision rules. The decision rules are also representable graphically as before: with two states of nature, each rule δ gives a point (R1, R2), where Ri = R(θi, δ). Figure 3-1 shows, in addition to the points of the pure decision rules, the set of randomizations of the pure rules, namely the convex set generated by them (French and Insua, 2000).

[Figure 3-1: The risk points (R1, R2) of the pure rules δ1,…, δ8 and the convex set generated by their randomizations.]

Several observations can be made (Lindgren, 1971) about using or neglecting the data:
• The pure decision functions δ1 and δ8, which ignore the data, give points (R1, R2) which are exactly the same as the corresponding (L1, L2) for the no-data problem.
• The straight line joining δ1 and δ8 consists of randomizations that are exactly equivalent (as regards risk) to randomizations of the pure actions a1 and a2.
• There are several rules (δ2, δ3 and δ5) which are worse than rules that make no use of the data; they make poor use of the data.
• The data provide risk points which clearly dominate these no-data points: δ4, δ6 and δ7 are all better, and indeed δ4 and δ7 are admissible.

The availability of data permits a reduction of losses if the data is used properly: the set of (R1, R2) points representing the various procedures is pulled in toward the zero-regret point. The methods (minimax, Bayes) used in attacking the no-data problem can be used for the statistical problem (with data).
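The dominance relations among risk points can be checked mechanically. In the sketch below, the eight risk points are hypothetical values chosen only to be consistent with the observations above; the admissibility filter itself is general:

```python
# Sketch: strict dominance and admissibility in terms of risk points
# (R1, R2).  The eight points are hypothetical reconstructions.
risks = {"d1": (1.0, 0.0), "d2": (0.9, 0.5), "d3": (0.7, 0.4),
         "d4": (0.4, 0.1), "d5": (0.6, 0.9), "d6": (0.3, 0.6),
         "d7": (0.1, 0.5), "d8": (0.0, 1.0)}

def dominates(r, s):
    """r strictly dominates s: no worse in every state, better in one."""
    return (all(a <= b for a, b in zip(r, s))
            and any(a < b for a, b in zip(r, s)))

admissible = sorted(name for name, r in risks.items()
                    if not any(dominates(s, r) for s in risks.values()))
print(admissible)
```

With these numbers the admissible pure rules are d1, d4, d7 and d8: the two data-ignoring rules survive as extreme points, while d2, d3, d5 and d6 are each strictly dominated.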
In order for the data to help in this way, the data cannot be independent of the state of nature. If the distribution of X is independent of θ, say Pθ(X = x) = f(x, θ) = k(x), then the risks for a given rule δ(x) are

Ri = E[l(θi, δ(X))] = Σj l(θi, δ(xj)) k(xj),   for i = 1,…, s,

which, in vector form,

(R1,…, Rs) = Σj k(xj) (l(θ1, δ(xj)),…, l(θs, δ(xj))),

is a convex combination of the loss points (L1,…, Ls) corresponding to the m actions δ(x1),…, δ(xm). These losses can therefore be achieved by the randomization k(x1),…, k(xm) of the pure actions δ(x1),…, δ(xm), and so the data does not extend the convex set of loss points to risk points that are any better (Lindgren, 1971).

Section 3.4 Selecting a Decision Function

Section 2.6 introduced the concepts of dominance and admissibility, which are extended here to decision functions by defining them in terms of the risk function. A decision function δ* dominates a decision function δ if and only if

R(θ, δ*) ≤ R(θ, δ), for all θ.

The dominance is strict if the inequality is strict for at least one state of nature. A decision function is admissible if it is not dominated strictly by any other decision function. (Although this concept was defined by Berger (1985) and stated in Section 2.6, this section discusses it more in depth within the constructs of modeling statistical decision problems.)

A decision function that involves an inadmissible action is inadmissible. For if a is not admissible, it is dominated by some action a* (Lindgren, 1971):

l(θ, a*) ≤ l(θ, a) for all θ, with l(θ0, a*) < l(θ0, a) for some θ0.

If δ(x) assigns the action a to some possible value xj of positive probability under θ0, a new rule δ*(x) is defined to be identical with δ(x) except that δ*(xj) = a*, and then

R(θ0, δ*) = Σi l(θ0, δ*(xi)) f(xi, θ0) < Σi l(θ0, δ(xi)) f(xi, θ0) = R(θ0, δ).
(The inequality follows because the losses in the two sums are identical except for the term with xj, where δ*(xj) = a* instead of a, yielding a smaller loss.) For states other than θ0 the inequalities < are replaced by ≤, and the strict dominance of δ by δ* follows.

The Minimax Approach… The minimax principle provides a numerical measure of decision rules, namely the maximum risk over the various states of nature:

M(δ) = max_θ R(θ, δ).

The minimax decision function is the δ that minimizes this maximum risk.

The Bayes Approach… Assigning prior probability weights g(θ) to the various states of nature determines the average risk, or expected loss, over the states:

B(δ) = E[R(θ, δ)] = Σj R(θj, δ) g(θj).

This is the Bayes risk. The preferred decision rule is the decision function δ which minimizes this Bayes risk.

Recall the question posed in the last section: would it make any difference, in studying a decision problem, if one used regret instead of loss? As in the no-data case, it can make a difference, at least if one considers a minimax approach. There are two ways to introduce the idea of regret: by applying it to the initial loss function, and by applying it to the risk. These come to the same thing, as shown below. The regret function was defined as

r(θ, ai) = l(θ, ai) − min_a l(θ, a),

so for each x,

r(θ, δ(x)) = l(θ, δ(x)) − min_a l(θ, a),

and the expected regret is then

E[r(θ, δ(X))] = E[l(θ, δ(X))] − min_a l(θ, a).
Now for any decision rule δ,

l(θ, δ(x)) ≥ min_a l(θ, a), for all x,

so that

E[l(θ, δ(X))] ≥ min_a l(θ, a), and hence min_δ R(θ, δ) ≥ min_a l(θ, a).

But if a* is the action that gives the minimum of l(θ, a), and one considers the decision rule δ*(x) ≡ a*, then

R(θ, δ*) = E[l(θ, δ*(X))] = l(θ, a*) = min_a l(θ, a),

so that min_δ R(θ, δ) ≤ min_a l(θ, a). Since the minimum risk over all rules δ is neither greater nor less than the minimum loss over all actions a, these must be equal:

min_δ R(θ, δ) = min_a l(θ, a).

Therefore, the expected regret is the same as the "regretized" risk:

E[r(θ, δ(X))] = R(θ, δ) − min_δ R(θ, δ) = R(θ, δ) − min_a l(θ, a).

Example 3-1: (cont'd) (Lindgren, 1971) Recall this example, with the updated table of risks (expressed as regrets). Shown also are the maximum risks M(δ) and the Bayes risks B(δ) for the prior probabilities g(θ1) = .8, g(θ2) = .2:

           δ1    δ2    δ3    δ4    δ5    δ6    δ7    δ8
 R(θ1,δ)    1    .9    .7    .4    .6    .3    .1     0
 R(θ2,δ)    0    .5    .4    .1    .9    .6    .5     1
 M(δ)       1    .9    .7    .4    .9    .6    .5     1
 B(δ)      .8   .82   .64   .34   .66   .36   .18    .2

Suppose we know from graphically analyzing this problem that the minimax mixed decision rule is a mixture of decision rules δ4 and δ7. If δ4 is used with probability p and δ7 with probability 1 − p, the two risks are equalized when

.4p + .1(1 − p) = .1p + .5(1 − p).
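The equalizing mixture can be solved exactly with rational arithmetic; the risk pairs below are those of δ4 and δ7 as tabulated in the example:

```python
from fractions import Fraction

# Sketch: equalize the two risks of a mixture of delta4 = (.4, .1)
# and delta7 = (.1, .5), taking delta4 with probability p.
r4 = (Fraction(4, 10), Fraction(1, 10))
r7 = (Fraction(1, 10), Fraction(5, 10))

# Solve r4[0] p + r7[0](1-p) = r4[1] p + r7[1](1-p) for p:
p = (r7[1] - r7[0]) / ((r4[0] - r7[0]) - (r4[1] - r7[1]))
common_risk = r4[0] * p + r7[0] * (1 - p)   # the equalized risk
print(p, common_risk)
```

The exact fractions avoid the rounding that decimal arithmetic would introduce at this step.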
Solving gives .7p = .4, so p = 4/7, and the minimum maximum risk is computed by substituting this value of p back into either side of the equation, obtaining 19/70. The Bayes approach works in much the same way, except that the prior distribution would be utilized; from the table, δ7 attains the smallest Bayes risk.

Section 3.5: The Posterior Distribution

For completeness, a brief description of the posterior distribution is provided here within the decision-model framework. Some of the concepts and relationships among conditional, marginal and joint probabilities were addressed in Section 1 on probability measures. Let X represent data with possible values x1,…, xk, and let Θ denote the state of nature, with possible values θ1,…, θm. Assume that the distributions of X under the various states of nature are given,

f(xi | θj) = P(X = xi | Θ = θj),

as well as the prior distribution for Θ,

g(θj) = P(Θ = θj).

From these, the joint probabilities for (X, Θ),

p(xi, θj) = f(xi | θj) g(θj),

and the marginal probabilities for X,

p(xi) = Σj f(xi | θj) g(θj),

can be constructed. The conditional probabilities for Θ given X = xi are then

h(θj | xi) = p(xi, θj) / p(xi) = f(xi | θj) g(θj) / Σj f(xi | θj) g(θj).

This relation is essentially Bayes' theorem. The function h(θj | xi) is the posterior probability function for Θ, corresponding to the given prior.
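A minimal sketch of this computation, turning a prior g(θ) and likelihoods f(x | θ) into a posterior; the numbers are illustrative:

```python
# Sketch of Bayes' theorem for a finite state space:
# h(theta | x) is proportional to f(x | theta) g(theta).
g = {"t1": 0.8, "t2": 0.2}                        # prior (illustrative)
f = {"t1": {0: 0.6, 1: 0.3, 2: 0.1},              # f(x | theta), hypothetical
     "t2": {0: 0.1, 1: 0.4, 2: 0.5}}

def posterior(x):
    joint = {t: f[t][x] * g[t] for t in g}        # f(x | theta) g(theta)
    px = sum(joint.values())                      # marginal p(x)
    return {t: joint[t] / px for t in joint}

h = posterior(2)
print(h)
```

Observing x = 2 here shifts weight toward the state under which that value is more likely, exactly the renormalization the displayed formula describes.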
There are two notes to make before closing this subsection (Lindgren, 1971).

1. When the distribution of the data X is unrelated to the state of nature (that is, the same for all states), the posterior probabilities are equal to the prior probabilities. Thus, if f(xi | θ) ≡ k(xi), then

h(θ | xi) = f(xi | θ) g(θ) / k(xi) = g(θ).

So the observation X does not alter the odds on the various states of nature from what they were before the observation.

2. If a prior belief is sufficiently strong, observations will not alter it. Thus, if g(θ*) = 1 and g(θ) = 0 for any other θ, then

k(x) = Σi f(x | θi) g(θi) = f(x | θ*),

and then, provided f(x | θ*) ≠ 0,

h(θ | x) = f(x | θ) g(θ) / k(x) = 1, if θ = θ*, and 0, for any other θ.

Section 3.6 Successive Observations

This subsection is adapted from Lindgren (1971) and will be discussed further in the Part on Sequential Statistical Decision Theory. The previous subsection indicated that an observation can alter prior odds to posterior odds. A question then arises: if the first posterior distribution were considered as a prior, should a new observation result in yet a new posterior distribution? It turns out that if a posterior distribution is used as a prior distribution with new data, the resulting posterior distribution is the same as if one had waited until all the data were at hand to use with the original prior distribution for a final posterior distribution.

Suppose that two observations are made, (X, Y), with probability function

f(x, y | θ) = P(X = x, Y = y | Θ = θ).

Expressing conditional probabilities using the defining formulas in terms of joint probabilities gives

f(x, y | θ) = P(X = x | Θ = θ) P(Y = y | X = x and Θ = θ) = f(x | θ) f(y | x, θ).

The posterior probabilities for θ given X = x and the prior g(θ) are

h1(θ | x) = f(x | θ) g(θ) / k(x) = f(x | θ) g(θ) / Σi f(x | θi) g(θi).
Using h1 as a prior probability for Θ = θ together with the new observation Y = y yields the posterior probability

h2(θ | x, y) = f(y | x, θ) h1(θ | x) / Σj f(y | x, θj) h1(θj | x)
            = [f(y | x, θ) f(x | θ) g(θ) / k(x)] / [Σj f(y | x, θj) f(x | θj) g(θj) / k(x)]
            = f(x, y | θ) g(θ) / Σj f(x, y | θj) g(θj).

This is precisely the posterior probability that would be computed based on the data (X, Y) and the prior g(θ).
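This agreement between step-by-step and all-at-once updating is easy to verify numerically. The sketch below assumes, for simplicity, that the two observations are conditionally independent given θ (so f(y | x, θ) = f(y | θ)); all numbers are illustrative:

```python
# Sketch: updating the posterior in two steps gives the same result
# as one update on the joint data.  For simplicity the observations
# are conditionally independent given theta: f(x, y | t) = f(x | t) f(y | t).
g = {"t1": 0.5, "t2": 0.5}                              # illustrative prior
f = {"t1": {0: 0.7, 1: 0.3}, "t2": {0: 0.2, 1: 0.8}}    # f(x | theta)

def update(prior, x):
    joint = {t: f[t][x] * prior[t] for t in prior}
    px = sum(joint.values())
    return {t: joint[t] / px for t in joint}

x, y = 1, 0
h1 = update(g, x)        # posterior after X = x
h2 = update(h1, y)       # h1 used as the prior for Y = y

# One-shot posterior from the joint probability f(x, y | t) g(t).
joint = {t: f[t][x] * f[t][y] * g[t] for t in g}
total = sum(joint.values())
batch = {t: joint[t] / total for t in joint}
print(h2, batch)
```

The two dictionaries agree (up to floating-point rounding), which is the content of the derivation above: the normalizing constant k(x) cancels between the numerator and denominator.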
Section 3.7 Bayes Actions from Posterior Probabilities

The following discussion is found in Chapter 4 of Lindgren (1971), Chapter 4 of Berger (1985), Chapter 6 of French and Insua (2000) and Chapter 3 of Raiffa and Schlaifer (2000). The Bayes decision rule for a given problem, corresponding to a certain prior distribution, can be found by applying the Bayes principle to the risk table: weighting the various states of nature and choosing the decision function δ(⋅) so as to minimize the Bayes risk B(δ), the expected value of the risk:

B(δ) = Σi R(θi, δ) g(θi).

Consider a different formalization.

Step 1: Recall the expression for R(θ, δ) as it is computed from the loss function:

R(θ, δ) = Σj l(θ, δ(xj)) f(xj | θ).

Step 2: Substituting this into B(δ) yields

B(δ) = Σi Σj {l(θi, δ(xj)) f(xj | θi)} g(θi).

Step 3: According to Bayes' theorem, the product f(xj | θi) g(θi) can be replaced by the product of the conditional probability of θi given X = xj and the marginal probability of X = xj:

f(xj | θi) g(θi) = h(θi | xj) p(xj),

where the marginal probability is calculated by summing the joint probabilities for (Θ, X) over θ (which would also be relevant in determining the posterior probabilities, but is not at the moment):

p(xj) = Σi P(Θ = θi, X = xj) = Σi f(xj | θi) g(θi).

Step 4: The indicated double summation can be calculated equally well by summing first on i and then on j:

B(δ) = Σj Σi l(θi, δ(xj)) f(xj | θi) g(θi) = Σj {Σi l(θi, δ(xj)) h(θi | xj)} p(xj),

where the last expression is obtained by making the substitution from Bayes' theorem. Note that the quantity in braces depends on the observed value xj and on the action assigned to xj by the decision function used:

Lh(δ, xj) = Σi l(θi, δ(xj)) h(θi | xj).

It is the average of the losses l(θ, δ(xj)) with respect to the posterior probability weights, and is referred to as the expected posterior loss.

Step 5: B(δ) is thus a weighted sum of expected posterior losses with nonnegative weights:

B(δ) = Σj Lh(δ, xj) p(xj).

So if, for each observed value xj, δ(xj) is taken to be the action that minimizes the expected posterior loss Lh, the resulting rule is one that minimizes the Bayes risk and is a Bayes procedure (Lindgren, 1971). To obtain the Bayes action for a particular observed X, then, it is not necessary to go through the computation of the whole decision rule, where the number of decision functions is of a higher order than the number of actions.
In summary, the earlier approach incorporates the data into the loss function to obtain the risk function; the original prior is then used on those risks to determine the Bayes rule. The present approach incorporates the data into the prior distribution to obtain the posterior distribution, which is then used as an educated "prior" distribution on the original losses. The results are equivalent, but mathematically the earlier approach is a more complicated process of minimizing a function, the Bayes risk, over all possible decision rules, whereas the present approach is a simple process of minimizing a function of the actions, namely the expected posterior loss, by selecting one of those actions.
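The equivalence can be verified by brute force on a small problem: minimizing the Bayes risk over all k^m decision functions yields the same rule as minimizing the expected posterior loss separately for each x. All numbers below are illustrative:

```python
from itertools import product

# Sketch: the rule minimizing Bayes risk over all k**m decision
# functions coincides with the rule built pointwise by minimizing
# the expected posterior loss.  Illustrative numbers throughout.
g = {"t1": 0.8, "t2": 0.2}                         # prior
f = {"t1": {0: 0.6, 1: 0.3, 2: 0.1},               # f(x | theta)
     "t2": {0: 0.1, 1: 0.4, 2: 0.5}}
loss = {("t1", "a1"): 0, ("t1", "a2"): 1,
        ("t2", "a1"): 1, ("t2", "a2"): 0}
xs, acts = [0, 1, 2], ["a1", "a2"]

def bayes_risk(rule):
    return sum(g[t] * sum(loss[(t, rule[x])] * f[t][x] for x in xs)
               for t in g)

# Earlier approach: search every one of the 2**3 = 8 decision functions.
best = min((dict(zip(xs, d)) for d in product(acts, repeat=len(xs))),
           key=bayes_risk)

# Present approach: for each x, pick the action minimizing the
# expected posterior loss  sum_t loss(t, a) h(t | x).
def posterior(x):
    joint = {t: f[t][x] * g[t] for t in g}
    px = sum(joint.values())
    return {t: joint[t] / px for t in joint}

pointwise = {x: min(acts, key=lambda a, x=x: sum(posterior(x)[t] * loss[(t, a)]
                                                 for t in g))
             for x in xs}
print(best, pointwise)
```

Both searches return the same rule; the pointwise construction examines only k actions per observed value rather than k^m whole functions.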
Part II: UTILITY THEORY
Concepts and Methods

Decision making under uncertainty places the evaluation of the consequences of possible actions at the forefront of decision problems. There are two main paradigms for modeling human uncertainty as applied to decision making under uncertainty. One involves probabilistic and statistical reasoning; most common is the Bayesian perspective coupled with expected utility maximization. The other classification is rule-based deductive systems, which rest on axiomatic foundations. von Neumann and Morgenstern's theory of expected utility maximization under risk derived a basic set of axioms from which was deduced the existence of mathematical functions with useful properties for comparing alternative states and preferences. Of particular interest, in contrast, is Savage's expected utility theory, which involves binary relations over functions defined on a measurable space.

The first section of this part presents the basic ideas and concepts of utility theory; it also introduces utility within the context of monetary values. The middle sections focus on the concept of ordinal utility and then shift to discuss the concept of subjective probability. Sections 5 and 6 focus on the development of utility theory from axiomatic foundations. The last section is a comparative assessment of the various theories that were developed, demonstrating the differences and similarities in their conception.
Section 4: Utility Theory – From Axioms to Functions

An analytic study of a decision problem seems to require the assumption of a mathematical model or structure of ordering among the various possible consequences of taking an action in the face of a particular state of nature, providing a numerical scale in terms of which the consequences of actions are assumed to be measured. In evaluating these consequences, two major problems are encountered. The first is that the values of the consequences may not have any scale of measurement; the second is that, even when there is a clear scale (usually monetary) by which consequences can be evaluated, the scale may not reflect the "true" value to the decision maker (Berger, 1985).

In speaking about what one is faced with as the result of making a decision when nature is in a certain state, the term consequence has been used, because the situation has been brought about in part by the making of the decision. In discussing utility, it is preferable to use the term prospect, without the implication that the future history has necessarily been brought upon by a decision, which happens to be the case in decision theory but is not generally so in the theory of utility.

A mathematical analysis is simpler if it is possible to put prospects not just in order but on a numerical scale, that is, to assign a numerical measure to each prospect, measuring the relative desirability of these prospects. The numbers are referred to as utilities, and utility theory deals with the development of such numbers. With reasonable assumptions about one's preferences among the various prospects, such an assignment, or function, can be achieved; thus, utility is a function defined on the various prospects (or consequences) with which one is faced. This section focuses on the basic axioms and general concept of utility, while the succeeding sections discuss the development of the different "types" of utility within individual rational decision making.

Section 4.1: Preference Axioms

Every mathematical structure is based on a set of axioms. This section presents the underpinning axioms, dealing with preferences, which imply the existence of a utility function; acceptance of the axioms consolidates the acceptance of a whole string of consequences, or theorems, that follow from them. A notion of preference, or relative desirability, among prospects is assumed. The following notation¹ of the basic binary relation represents a weak preference, taken to mean "at least as desirable":

P1 ≽ P2 means: prospect P1 is at least as desirable as prospect P2.

¹ The symbols ≽ and ≻ will occasionally be used in the other direction, with the obvious meaning that the prospect at the small end of the symbol is the less desirable.
This relation is weak, and is assumed to have the properties of completeness and transitivity (French and Insua, 2000). The weak order can be used to define the equal desirability of two prospects:

P1 ~ P2 means: prospects P1 and P2 are equally desirable,

which is defined to mean that both P1 ≽ P2 and P2 ≽ P1, that is, that one is indifferent between them. This latter relation is reflexive, P1 ~ P1, if it is understood that a prospect is at least as good as itself; by definition it is also symmetric, that is, P1 ~ P2 is equivalent to P2 ~ P1 (Lindgren, 1971).

Strict preference is expressed in the following manner:

P1 ≻ P2 means: prospect P1 is preferred over prospect P2,

which is a condition that (by definition) exists when and only when P1 ≽ P2 but P1 and P2 are not equally desirable.

Transitivity, stated below as Axiom 2, holds for the other two relations as well. Together these relations yield the following consequences:

(i) If P ~ Q and Q ~ R, then P ~ R.
(ii) If P ≻ Q and Q ≻ R, then P ≻ R.
(iii) If P ≻ Q and Q ~ R, then P ≻ R.
(iv) If P ≽ Q and Q ~ R, then P ≽ R.

Before clearly stating the key axioms, it is necessary to address the concept of a "mixture" of prospects (Lindgren, 1971):

(1) If one faces a probability p of a prospect P1 and a probability (1 − p) of a prospect P2, this is called a random prospect; it is considered to be a mixture of the prospects P1 and P2 and is denoted by [P1, P2]p.

(2) More generally, if one faces P1 with probability p1, P2 with probability p2,…, and Pk with probability pk, this is again a random prospect, called a mixture of P1, P2,…, Pk and denoted [P1,…, Pk](p1, p2,…, pk).

(3) Similarly, a random prospect composed of infinitely many prospects, P1 with probability p1, P2 with probability p2,…, where p1 + p2 + … = 1, is denoted [P1, P2,…](p1, p2,…).

These notions have been summarized into the adopted axioms as follows (Lindgren, 1971; Berger, 1985):
AXIOM 1: Given any prospects P and Q, either P ≻ Q, or P ≺ Q, or P ~ Q.

AXIOM 2: If P ≽ Q and Q ≽ R, then P ≽ R.

AXIOM 3: If P1 ≻ P2, then for any probability p (0 < p ≤ 1) and any prospect P,

[P1, P]p ≻ [P2, P]p.

AXIOM 3′: If P1, P2,… and Q1, Q2,… are sequences of prospects such that Pi ≻ Qi for i = 1, 2,…, then for any set of probabilities α1, α2,… (Σ αi = 1),

[P1, P2,…](α1, α2,…) ≻ [Q1, Q2,…](α1, α2,…).

AXIOM 4: Given three prospects in the order P1 ≻ P2 ≻ P3, there are mixtures [P1, P3]p and [P1, P3]r such that

[P1, P3]p ≻ P2 ≻ [P1, P3]r.

Axiom 1 states that, given any two prospects, either one is preferred over the other, or the other over the one, or they are equally desirable. Axiom 2 is the axiom of transitivity of the preference relation; its assumption precludes a type of inconsistency in the mathematical structure that may or may not accurately describe one's preferences. Axiom 3 says that improving one of the prospects in a finite mixture improves the mixture, and Axiom 3′ extends this notion to countably infinite mixtures. Axiom 4 says, when P1 ≻ P2 ≻ P3, that there is no P1 so wonderful that the slightest chance of encountering it instead of P3 is better than any ordinary P2, and that there is no P3 so terrible that the slightest chance of encountering it instead of P1 is worse than any ordinary P2.

The following consequences are implied (Lindgren, 1971). Since P1 can be thought of as [P1, P0]1 and, similarly, [P1, P0]0 = P0, it follows (by Axiom 3) that, for P1 ≻ P0,

(v) P1 ≻ [P1, P0]p ≻ P0, for 0 < p < 1;

(vi) [P1, P0]p ≻ [P1, P0]q, if P1 ≻ P0 and 0 ≤ q < p ≤ 1.

That is, all mixtures of two prospects lie between them in order of desirability, and one mixture of two prospects is preferred over another if there is a stronger probability of the more desirable prospect in the first mixture: as p increases from 0 to 1, the mixture [P1, P0]p becomes increasingly more desirable.

A further consequence of the axioms is the existence of a utility function, which is to follow.

THEOREM: Given any two prospects P0 and P1 such that P0 ≺ P1, and given any prospect P such that P0 ≺ P ≺ P1, there is a unique number p between 0 and 1 such that [P1, P0]p is equivalent to P. (Lindgren, 1971)

This result means that the prospects intermediate to the given prospects can be thought of as lined up on the scale of numbers from 0 to 1, each being identified with a number on that scale, and the ordering of prospects in desirability corresponds to the ordering on the scale, from least desirable to more desirable. Equivalent prospects are located at the same point on the scale.²

Section 4.2 Coding Intermediate Prospects

The previous subsection showed that, given any two distinct prospects P0 and P1 with P0 ≺ P1, and an arbitrary prospect P between them, P0 ≺ P ≺ P1, there is some mixture of P0 and P1 equivalent to P. A utility function is now easily constructible, providing a coding of utilities that gives a numerical representation of preferences, by taking the utility of a prospect to be just its number on the scale from 0 to 1. That is, if P ~ [P1, P0]p, define

u(P) = p,

and for P0 and P1 themselves,

u(P0) = 0, u(P1) = 1.

² The equality of two prospects means that they are precisely the same prospect.
There are certain properties satisfied by the utility function³, defined for any prospect between the given P0 and P1 as the value p in the mixture [P1, P0]p which is equivalent to that prospect.

Utility Property A: If P ≽ Q, then u(P) ≥ u(Q).

Property A follows from the fact that mixtures of P0 and P1 are ordered according to the proportion of P1 in the mixture: for prospects P and Q between P0 and P1, if P = [P1, P0]p and Q = [P1, P0]q and P ≽ Q, then p ≥ q. But u(P) = p and u(Q) = q, which yields Property A.

Utility Property B: u([P, Q]r) = r u(P) + (1 − r) u(Q).

For a mixed prospect [P, Q]r, observe that

[P, Q]r = [[P1, P0]p, [P1, P0]q]r = [P1, P0]pr+q(1−r),

which implies that the utility of [P, Q]r is

u([P, Q]r) = pr + q(1 − r) = r u(P) + (1 − r) u(Q).

Property B can be expressed in terms of expectation. For the prospect [P, Q]r, the utility is a random variable with possible values u(P) and u(Q) and corresponding probabilities r and 1 − r; the expected utility is therefore r u(P) + (1 − r) u(Q), which is the utility assigned to the random prospect. In short, the utility of a random prospect is the expected value of the utility (Berger, 1985). Berger (1985) describes a five-step procedure for constructing a utility function, much of which is described above. He also notes that any linear function of such a utility, with positive slope, would also satisfy Utility Properties A and B and serve just as well as a utility function. Defining a function v(P) on prospects between P0 and P1 by

v(P) = a u(P) + b, for a > 0,

has the simple effect of shifting the origin and introducing a scale factor; the utility of P0 and P1 in this new scheme would be v(P0) = b and v(P1) = a + b.

³ Berger, 1985; French and Insua, 2000.
for v(P) is satisfied. It can be shown that the probability of obtaining heads for the first time on the Nth toss is ½N. To obtain property B compute v([ P. the following random prospect.. 2N (That is. it is found that people generally will not pay even a large but finite entry for the proposed game. For one thing. for a fee. because your utility would not be decreased by so doing. Section 4. The matter of utilities for prospects that are not between two given prospects P0 and P1 has been neglected in this report. 45 . the sum is not a finite number. = ∞. This concludes this subsection. 1971) and (Berger. The famous example of the “St..Concepts and Methods Property A. You will be given $2N if in repeated tosses of a coin the first heads does not appear until the Nth toss.3 Utility for Money In many practical situations.) Despite the infinite expected payoff. and accepting this one obtains as the expected payoff: E ( payoff ) = ∑ 2 N • 1 ∞ 1 = 1 + 1 + 1 + . Q] r ) = au ([ P. Recall the introduction to this section which addressed that discounting the possibility that such a utility function overlooks other aspects of prospects addresses the issue that this scale of values is not usually a totally adequate basis in terms of which to analyze decision problems involving money. at least if M is allowed to range over all real numbers. In such cases. or as proportional to utility: u ( M ) = M or u ( M ) = kM .048. What entry fee would you be willing to pay? If money is utility. Q] r ) + b = a[ru ( P) + (1 − r )u (Q)] + b = r[au ( P) = b] + (1 − r )[au (Q) + b] = rv( P) + (1 − r )v(Q). if it is assumed that one’s utility function is 2M if M ≤ 20. Petersburg Paradox” illustrates the difficulty (Lindgren. (that is u ( M ) = M ). the function u ( M ) = M is not a bounded function of M.Part II: Utility Theory . it is tempting to treat the number of dollars as utility. On the other hand. u ( M ) = 20 2 = 1. 1985): Example 61: You are offered.576 if M > 20. 
you should be willing to pay any fee up to the expected value of the payoff; since that expected value is infinite, the offer should be accepted no matter how large the fee. On the other hand, suppose it is assumed that one's utility function for money is bounded: u(M) = M if M ≤ 2^20, and u(M) = 2^20 = 1,048,576 if M > 2^20.
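The two computations can be checked numerically; this is only a sketch, with the cap 2^20 taken from the bounded utility function assumed above:

```python
# Numerical check of the St. Petersburg computation: the expected payoff
# diverges, while the expected *utility* under the bounded u converges to 21.
def u(m, cap=2**20):
    # bounded utility assumed in the text: u(M) = M up to 2**20, then flat
    return min(m, cap)

# Expected utility: sum over N of u(2^N) * (1/2)^N, truncated far in the tail.
eu = sum(u(2**N) * 0.5**N for N in range(1, 200))
assert abs(eu - 21.0) < 1e-9   # 20 from the first 20 terms, plus 1/2 + 1/4 + ... = 1

# The raw expected payoff, by contrast, grows without bound: every term is 1.
partial = sum(2**N * 0.5**N for N in range(1, 101))
assert partial == 100.0
```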
With this bounded utility function, the expected utility of the payoff, as in the example above, is

E[u(payoff)] = Σ (N = 1 to 20) 2^N · (1/2)^N + Σ (N = 21 to ∞) 2^20 · (1/2)^N = 20 + 1/2 + 1/4 + ... = 21.

The game then has the same utility as $21. Similarly, as the computation above also shows, the expected payoff is $21 if a gambling house with a capital of $1,048,576 agrees to pay out the entire amount in the event of N = 20 or more. Another aspect of the paradox is that no one could in fact offer the game honestly — no one has $2^40 to pay out in the event of 39 tails in a row before the first heads.

Even without the paradox arising from permitting one's capital M to take any finite value, most people's utility for money is not strictly linear, although it may be approximately linear over a restricted range of values of M. Figure 4-1 shows a utility function u(M) plotted against M that is something like most people's utility functions for money.

[Figure 4-1: an increasing utility curve u(M) plotted against money M.]

The function shown is an increasing function of M, corresponding to the assumption that the greater the monetary value of a prospect, the greater its desirability. It is unusual to encounter a situation in which a greater amount of capital is "worth" less than a smaller amount. The monotonic nature of the curve also bears on repeated play: the expected utility is the appropriate guide for action in a single experiment, but in a long sequence of experiments the monetary gain is relevant, for if there is a net monetary gain each time, the decision maker's monetary assets increase — and therefore utility increases. Thus, if one is offered a favorable chance prospect involving money repeatedly, the long run should be judged by monetary gain rather than by expected utility alone. In dealing with a given problem it is often convenient to focus attention on the
small portion of the curve representing the function u(M) near the initial capital. If M denotes an increase or decrease from an initial status M0, then u(M) is really u(M + M0); one can label the money axis in terms of the amount of money to be lost or gained, rather than in terms of total assets. It may then appear as though a person's utility function is different in different situations, these being reflected as different points on the overall utility curve. Figure 4-1 showed that it is possible to have (locally) a utility for gain or loss that is concave down in some instances and concave up in others, as shown in the figures below.

[Figure 4-2(a): utility as a function of gain or loss; this would be appropriate if the initial capital was near 0.]

[Figure 4-2(b): utility as a function of gain or loss; this would be appropriate if the initial capital was large.]

Section 4.4 Bets, Fair and Unfair

"A bet is a random or mixed prospect [of the type encountered at the beginning of the previous section]: one starts with an initial capital M0 and, according to the outcome of an experiment of chance, ends up either with ML, an amount less than M0, or with MW, an amount greater than M0. In the former case he loses, his capital being reduced by the amount y = M0 − ML; in the latter case he wins, his capital being increased by the amount x = MW − M0." (Lindgren, 1971)

Definition: The probability odds in such a bet are said to be "p to 1 − p" (or p/(1 − p)) that one wins the bet, where p is the probability that he wins. The money odds are said to be "y to x" (or y/x): it is said that he "puts up" y and his opponent "puts up" x, the winner taking both amounts. A bet is said to be a fair bet if the probability odds and the money odds are equal.
A fair bet can be thought of in terms of the invariance principle addressed in Section 2. If the bet is fair, then

y/x = p/(1 − p), or y(1 − p) − px = 0.

Substitution of y = M0 − ML and x = MW − M0 yields (M0 − ML)(1 − p) − (MW − M0)p = 0, or

pMW + (1 − p)ML = M0.

A fair bet can thus be redefined as one in which the expected capital after the bet is equal to the capital before the bet. It is "fair" in the sense that neither side is favored: there is no expected loss nor gain in capital on the part of either bettor; if one player experienced a net expected loss, the bet would not be fair.

When utility is measured in terms of money, that is, when u(M) = kM, the indifference relation stands — accepting or not accepting the bet are equally desirable:

u(bet) = pu(MW) + (1 − p)u(ML) = k(pMW + (1 − p)ML) = kM0 = u(M0).

The equal desirability follows from the fact that taking the bet has the same utility as the initial capital; this will be readdressed in the sections concerning the development of utility theory.

Consider, however, a general nonlinear utility function u(M) and a fair bet: M0 = pMW + (1 − p)ML. The utility of the bet is

UB = pu(MW) + (1 − p)u(ML).

In vector notation,

(M0, UB) = p (MW, u(MW)) + (1 − p) (ML, u(ML)),

which is interpreted geometrically to mean that the point (M0, UB) is a convex combination of the two points (MW, u(MW)) and (ML, u(ML)): it lies on the line segment joining those two points. Two cases in which the person is not indifferent to the fair bet are shown in the figures below.
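A minimal numerical sketch of these relations, using an assumed concave utility u(M) = log M for the nonlinear case (any concave choice would serve):

```python
import math

# A fair bet satisfies p*MW + (1 - p)*ML = M0.  With a concave utility
# (here u(M) = log M, an assumption for illustration), the utility of the
# fair bet U_B = p*u(MW) + (1 - p)*u(ML) falls below u(M0).
ML, MW, M0 = 50.0, 200.0, 100.0
p = (M0 - ML) / (MW - ML)          # the probability that makes the bet fair
assert abs(p * MW + (1 - p) * ML - M0) < 1e-12

u = math.log
UB = p * u(MW) + (1 - p) * u(ML)   # the point (M0, UB) lies on the chord
assert UB < u(M0)                  # this fair bet is unattractive to the bettor

# With linear utility u(M) = k*M, the bettor is exactly indifferent:
k = 3.0
assert abs(p * k * MW + (1 - p) * k * ML - k * M0) < 1e-9
```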
[Figure 4-3(a): a utility function and fair bet that is attractive: UB lies above u(M0) at M0, with u(ML), u(MW) marked at ML and MW.]

[Figure 4-3(b): a utility function and fair bet that is unattractive: UB lies below u(M0) at M0.]

Consider now a bet that is not fair, biased against the person whose interests are of concern and in terms of whose utility the situation is being studied; that is, the probability of winning, p′, is smaller than the p corresponding to a fair bet for the same amounts of money. In such a case, the expected capital after the bet is less than the initial capital:

p′MW + (1 − p′)ML = M′ < M0.

The above inequality can be restated in the form

p′/(1 − p′) < (M0 − ML)/(MW − M0) = y/x,

which says that the money odds are higher than the actual probability odds; this can be used to determine how unfair the bet is. If U′ denotes the utility in this unfair bet:

U′ = p′u(MW) + (1 − p′)u(ML),
then the point (M′, U′) lies on the line segment joining the points (MW, u(MW)) and (ML, u(ML)):

(M′, U′) = p′ (MW, u(MW)) + (1 − p′) (ML, u(ML)).

This is shown in Figure 4-4, in which is also plotted M0, the initial capital, which is greater than the expected capital after the bet, M′. Decreasing M′ amounts to decreasing p′ — that is, altering the probability odds — and there is a p* corresponding to the smallest M* for which the bet would be worth taking:

p*MW + (1 − p*)ML = M*, or p* = (M* − ML)/(MW − ML).

[Figure 4-4: the utility curve u(M), with ML, M*, M′, M0, and MW marked on the money axis and U′ and u(M0) on the utility axis.]

It was shown above that there are cases in which the utility is such that a fair bet is advantageous. In much the same manner, it can be shown that a utility may be such that a fair bet is not advantageous; this is omitted here. Nau (2000) discusses the issues surrounding fair bets and utility for money under equilibrium in depth. He characterizes those who are risk averse and those who are not, delimiting their actions under varying conditions.
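The threshold p* can be sketched numerically; the convex utility u(M) = M^2 below is an assumption chosen so that, as in Figure 4-4, M* falls below M0 and even somewhat unfair bets remain worth taking:

```python
# Sketch of the threshold p* above, using an assumed convex (risk-seeking)
# utility u(M) = M**2, so that the bet is acceptable at worse-than-fair odds.
ML, MW, M0 = 50.0, 200.0, 100.0
u = lambda M: M**2

# Indifference at the threshold: p* * u(MW) + (1 - p*) * u(ML) = u(M0)
p_star = (u(M0) - u(ML)) / (u(MW) - u(ML))
M_star = p_star * MW + (1 - p_star) * ML      # expected capital at the threshold
p_fair = (M0 - ML) / (MW - ML)                # probability odds of a fair bet

assert abs(p_star * u(MW) + (1 - p_star) * u(ML) - u(M0)) < 1e-9
assert M_star < M0            # accepts bets whose expected capital is below M0
assert p_star < p_fair        # i.e. worse-than-fair odds are still acceptable
# p* recovered via the text's formula p* = (M* - ML)/(MW - ML):
assert abs(p_star - (M_star - ML) / (MW - ML)) < 1e-12
```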
b). reflexive. Much of this theory presented here is crucial to the understanding of consumer theory as found in microeconomics. Recall the notion of indifference. transitive. apples and bananas7. These include axiomatic methods. b) = (a′. Kenneth Arrow’s Social Choice and Individual Values (1951). Section 5. this report only reviews models of individual rationality. and Savage’s Foundations of Statistics (1954)6. The prior presentation of the underpinning preference axioms is revisited briefly in this section.1: Ordinal Utility Returning to the notation for preference relations and the stated axioms. 6 Nau.” where probabilities for events are objectively determined). 7 This could be anything. But this is not the only possible way to define a utility function representing the same preferences. For brevity purposes. which has equal amounts of every preference. providing some background on the concept of “ordinal” utility. b′) if and only if U(a. The idea is to be able to assign a numeric value to something which may be numerically measured in terms of preference for a particular person. strictly increasing) function. b) it is possible to find a unique number x such that the bundle (x. If those preferences are complete. and several kinds of nonexpected utility theory. x). concepts of measurable utility and personal probability. so that (a. is precisely indifferent to (a. Let that number x be defined as U(a. b) to be equal to 2x or f(x) where f is any monotonic (i.” where probabilities for events are subjectively determined— or perhaps undetermined). expected utility (for choice under conditions of “risk. and continuous (see Section 6. 5 51 . This was shown in Section 4. and tools of general equilibrium theory and game theory. then there exists a utility function U which represents them. Then it can be said that for any group or bundle (a. b′). we could just as well define U(a. suppose that there are only two preferences – for instance. b) = U(a′. 2000. 
prior to the discussions of "cardinal" expected utility presented in the next section.

Section 5.1: Ordinal Utility

Returning to the notation for preference relations and the stated axioms, suppose that there are only two commodities of interest — for instance, apples and bananas7. A pair of numbers (a, b) then represents a possible bundle of commodities (namely a apples and b bananas) that a person might possess. Now suppose that this person has preferences among such bundles of apples and bananas. If those preferences are complete, reflexive, transitive, and continuous (see Section 6.1), then there exists a utility function U which represents them, so that (a, b) ≽ (a′, b′) if and only if U(a, b) ≥ U(a′, b′). The idea is to be able to assign a numeric value, measuring preference for a particular person, to something which may not itself be numerically measurable. Recall the notion of indifference: for any bundle (a, b) it is possible to find a unique number x such that the bundle (x, x), which has equal amounts of every commodity, is precisely indifferent to (a, b). Let that number x be defined as U(a, b). But this is not the only possible way to define a utility function representing the same preferences: we could just as well define U(a, b) to be equal to 2x, or f(x) where f is any monotonic (i.e., strictly increasing) function. This was shown in Section 4.

7 This could be anything; Lindgren uses beef and chicken in his example.
The utility function representing a decision maker's preferences is said to be merely an ordinal utility, because only the ordering property of the utility numbers is meaningful.

Definition: An ordinal value function v : A → ℜ is a real-valued function which represents — or agrees with — a weak preference ≽ in the following sense: for all a, b ∈ A, v(a) ≥ v(b) ⇔ a ≽ b.

Recall that any monotonic transformation of U carries exactly the same preference information: if f is a monotonic function, and V is another utility function defined by V(a, b) = f(U(a, b)), then V encodes exactly the same preferences as U. Thus, any strictly increasing transformation of an ordinal value function also agrees with the underlying weak preferences; conversely, any two ordinal value functions which agree with the same preference relation are related by a strictly increasing transformation. In other words, the representation is said to be "unique up to a strictly increasing transformation" (French and Insua, 2000). More formally, the following theorem is given by French and Insua (2000).

THEOREM: For any domain of objects A with a weak order ≽, there exists an ordinal value function if and only if there exists a countable order-dense subset B ⊂ A, i.e., a B such that for all a1, a2 ∈ A with a1 ≻ a2, there exists b ∈ B with a1 ≽ b ≽ a2.
The above definition was provided by French and Insua (2000) and simply states that an ordinal value function is a numerical assignment to a set of alternatives which represents the person or decision maker's preference ranking by a numerical ranking. Ordinal value functions represent preference rankings and nothing more: there is no further information, such as strength of preference, in the numbers used to represent the decision maker's preferences.

Example 5-1: Suppose that U(1, 1) = 1, U(2, 1) = 2, U(1, 2) = 3, and U(2, 2) = 4. Then this means only that the preference ordering is (2, 2) ≻ (1, 2) ≻ (2, 1) ≻ (1, 1); referring back to apples and bananas, two apples and two bananas are better than one apple and two bananas, which are better than two apples and one banana, which are better than one apple and one banana. It may not be concluded that two apples and one banana are "twice as good" as one apple and one banana, or that one additional apple yields the same "increase in utility" regardless of whether you have one banana or two bananas to start with. For example, letting f(x) = 2^x, we obtain a second utility function satisfying V(1, 1) = 2, V(2, 1) = 4, V(1, 2) = 8, and V(2, 2) = 16, which provides a completely equivalent representation of this hypothetical decision maker's preferences.
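Example 5-1 can be restated in code; the bundles and numbers are the ones above, with f(x) = 2^x as the illustrative strictly increasing transform:

```python
from itertools import product

# U and V = f(U), with f strictly increasing (here f(x) = 2**x), rank every
# pair of bundles identically: they are the same ordinal preference.
U = {(1, 1): 1, (2, 1): 2, (1, 2): 3, (2, 2): 4}
V = {bundle: 2**x for bundle, x in U.items()}

assert V == {(1, 1): 2, (2, 1): 4, (1, 2): 8, (2, 2): 16}
for a, b in product(U, U):
    assert (U[a] >= U[b]) == (V[a] >= V[b])   # same weak preference order
print("U and V represent the same preferences")
```

Note that V(2, 1)/V(1, 1) differs from U(2, 1)/U(1, 1): ratios and differences of ordinal utilities carry no information, only the order does.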
Section 5.2: Marginal Utility

Suppose now that the (non-unique) ordinal utility function of concern is differentiable — i.e., it is a smooth function of the numbers of apples and bananas. Then marginal utility (of an additional apple or banana)8 can be defined as the partial derivative of the utility function (with respect to apples or bananas), evaluated at the point of interest. Since it was previously stated that ordinal utilities are nearly-meaningless numbers, it might be plausible to believe that marginal utilities would be nearly-meaningless numbers as well. However, the ratios of marginal ordinal utilities are always uniquely defined: the same ratios must be obtained for any other utility function that is a monotonic transformation of the original one, hence any two utility functions that represent the same preferences must yield the same ratio of marginal utilities between any two commodities (Nau, 2000). The ratios of marginal utilities are precisely the marginal rates of substitution — the rates of exchange between commodities that leave total utility unchanged. For example, a ratio of the marginal utility of apples to that of bananas equal to 1/2 when the decision maker already has one apple and one banana means that the person would indifferently trade an ε fraction of an additional banana for a 2ε fraction of an additional apple. Marginal rates of substitution are uniquely determined by preferences, and as such they are observable.

8 The marginal utility of an apple is not necessarily the increase in utility yielded by an additional whole apple — it is the increase in utility yielded by an additional ε fraction of an apple, divided by ε, where ε is infinitesimally small.
Section 5.3: Indifference Curves

Section 4 illustrated the geometrical perspective of a decision maker's utility curve. Geometrically, an ordinal utility function determines the shape of indifference curves in the space of possible prospects or bundles of prospects. An indifference curve is a set of points representing endowments that all yield the same numerical utility — i.e., that are equally preferred. A decision maker's indifference curves in ℜ2 space are illustrated in Figure 5-1. The slope of a tangent line to an indifference curve (such as the dotted line through point y) is precisely the marginal rate of substitution between prospects at that point.

[Figure 5-1: Points x and y are on the same indifference curve, meaning that they yield exactly the same utility — the decision maker is precisely indifferent between them — whereas point z lies on a "higher" indifference curve, so it would be preferred to either x or y.]
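A numerical sketch of the invariance of marginal rates of substitution: U(a, b) = a·b^2 and the transform g = exp are assumptions for illustration, and the partial derivatives are taken by finite differences:

```python
import math

# The ratio of marginal utilities (the MRS) is the same for U and for any
# increasing transform of U, even though the marginal utilities themselves differ.
def U(a, b):
    return a * b**2          # assumed illustrative utility

def grad(f, a, b, eps=1e-6):
    # central finite differences for the two partial derivatives
    da = (f(a + eps, b) - f(a - eps, b)) / (2 * eps)
    db = (f(a, b + eps) - f(a, b - eps)) / (2 * eps)
    return da, db

V = lambda a, b: math.exp(U(a, b))   # a strictly increasing transform of U

a, b = 1.0, 1.0
Ua, Ub = grad(U, a, b)
Va, Vb = grad(V, a, b)
# Marginal utilities themselves differ between the two representations ...
assert abs(Ua - Va) > 0.1
# ... but the marginal rate of substitution does not:
assert abs(Ua / Ub - Va / Vb) < 1e-4
```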
Note 1: A person wishes to climb higher (to another indifference curve) if at all possible. In other words, if the person started at point x on the map, he or she would be indifferent to walking over to point y, which is at the same elevation, but would prefer to climb higher, up to point z.

Note 2: While the shapes of the indifference curves are uniquely determined by preferences, the relative heights of the utility surface above different indifference curves are completely arbitrary.

In what follows, let the vectors x, y, and z denote general multidimensional prospects9. The term non-satiation refers to the assumption that the marginal utility of every prospect always remains positive — so that more is always better no matter how much you already have; marginal utility approaches zero only asymptotically. A stronger assumption of strictly convex preferences is usually made, which means that if x ≽ z and y ≽ z, then αx + (1 − α)y ≻ z for any number α strictly between 0 and 1. In other words, if x and y are both weakly preferred to z, then everything "in between" x and y is strictly preferred to z10. In particular, if x and y are indifferent to each other, then αx + (1 − α)y is strictly preferred to either of them — i.e., the decision maker would prefer to have an average of x and y rather than either extreme. This property captures the intuition of diminishing marginal utility — namely, that ½x provides "more than half as much utility" as x — but it does so by referring to observable preferences rather than unobservable utility. (See Section 4.2.) This assumption means that if the holdings of all prospects are fixed except one, then the marginal utility of that prospect should decrease as the "endowment" of it increases (Nau, 2000).

9 This is explored in Section 8.
10 Lindgren (1971) discusses the case of arbitrary prospects more in depth in Chapter 2.
Note 3: Another utility function that represented the same preferences would have to yield the same indifference curves, whose tangent lines would have exactly the same slopes at all points, but the corresponding utility surface might otherwise look very different.

Section 5.4: Assumptions — Marginal Utility and Multi-Attributed Theory

The assumption of diminishing marginal utility was briefly addressed in Section 5.3. Decreasing marginal utility is actually not a strong enough condition for the decision maker's problem to always have a unique solution unless the utility function is also additively separable — i.e., of the form U(a, b) = u1(a) + u2(b). If the number of prospects is increased to some n — so that there exist prospects a, b, c, d, ... — then the additive form of utilities refers to consequences or alternatives that are multi-attributed in ℜn. See Section 6.3.
If preferences are strictly convex, then any utility function that represents those preferences is necessarily strictly quasi-concave, which means that its level sets are strictly convex sets: in particular, the set of all y such that U(y) = U(x) is a strictly convex set. The strict convexity of the level sets ensures that the decision maker's problem has a unique solution (Nau, 2000)11.

Section 5.5: Subjective Probability

The classical concept of probability was initially introduced (in Part A) by way of a repeatable experiment (of chance) that (at each trial) results in one of a given set of possible outcomes. The probability assigned to each outcome is thought of as an idealization of its relative frequency of occurrence in a long sequence of trials. However, this frequency concept does not always suffice when dealing with θ. The theory of subjective probability was created to enable one to talk about probabilities when the frequency viewpoint does not apply (Berger, 1985). The term subjective probability is used to refer to probability that represents a person's degree of belief concerning the occurrence of an event — a quantification of the "chances" involving one's beliefs or convictions concerning the occurrence of the event; embodying those beliefs in odds indicates that the matter of chance in this kind of experiment is actually personal. Although it can be assumed that the subjective probability of a given state exists, it is seldom easy to determine that probability (Lindgren, 1971). Nau (2000) states that there are four "best-known" concepts of probability: (1) the classical interpretation, (2) the frequentist (or objective) interpretation, (3) the logical (or necessary) interpretation, and (4) the subjective interpretation.

Although objective probability and subjective probability are different in conception and interpretation, they are not different as mathematical entities, inasmuch as the same basic axioms are assumed to hold for both. Any definitions or theorems concerning objective probabilities apply equally well to subjective probabilities. In the case of a chance or random variable, the outcomes are ordinary numbers; when the outcomes are thought of as prospects with corresponding probabilities, the result is termed a random prospect. In particular, the expected value of a random variable, defined earlier for the case of objective probabilities — namely, as the weighted sum of the possible values, where the weights are the corresponding probabilities — would be defined in exactly the same way if the probabilities are subjective.

This section discussed the role of ordinal utility. This closing subsection has introduced the topic to be discussed in the following section, which more formally develops utility theory within its three comparable forms.
11 Nau (2000) discusses this within the constructs of consumer theory as understood in microeconomic theory. He also provides a detailed development of utility theory and of equilibrium states as they formed in utility theory and impacted microeconomic theory.
Section 6: Development of Utility Theory II – Comparing Axiomatic Theories

In the last section, ordinal utility functions were discussed according to the marginalist view of modeling decision problems under conditions of certainty. This section compares the main elements of the theories of individual rational modeling under risk and uncertainty developed in the mid-1900s: the theories of subjective probability (SP), expected utility (EU), and subjective expected utility (SEU). The primary sources of information for this section are Nau (2000) and French and Insua (2000), stated at the beginning of Section 5; both provide detailed accounts of the developments of all three works, together with citations for the original publications and other perspectives on these developments. Nau discusses these three theories in terms of four attributes or concepts: "axiomatic fever", "primitive preferences", "concepts of probability", and "the independence condition for preferences → cardinal utility". Of these four concepts, the first two have been discussed thoroughly in Section 4, and the third was addressed briefly in the closing subsection of Section 5. The last concept has not been addressed thus far, so a little insight into its importance is given first.

Section 6.1: The Independence Condition for Preferences → Cardinal Utility

Recall from the previous section that decision making under conditions of certainty was modeled with ordinal utility functions. Under such conditions, it is meaningless to say something like "the utility of y is exactly halfway between the utility of x and the utility of z." Von Neumann and Morgenstern discovered that such a statement can be made if the decision maker has preferences not only among definite objects such as x, y, and z, but also among probability distributions over those objects. Then it can be said that, by definition, a 50-50 gamble between x and z has a utility that is exactly halfway between the utility of x and the utility of z; and if y is indifferent to such a gamble, then it too has a utility exactly halfway between the utility of x and the utility of z. There are several important things to note here (Nau, 2000; French and Insua, 2000).

Note 1: Additional "psychological" data are required to construct a cardinal utility function — namely, data about preferences among probability distributions over rewards, not just data about preferences among the rewards themselves.

Von Neumann and Morgenstern, in their aim to establish a "new game-theoretic foundation for economics",
decided it was necessary to first axiomatize a concept of cardinal, measurable utility that could serve as a "single monetary commodity" whose expected value the players would seek to maximize through their strategy choices. While the von Neumann–Morgenstern utility function was being established, other economists had discovered that a simple behavioral restriction on preferences-under-certainty implies that a consumer must have an additively separable utility function — i.e., a utility function of the form U(x1, x2, ..., xn) = u1(x1) + u2(x2) + ... + un(xn) — which is unique up to an increasing linear transformation and is therefore cardinally measurable (Nau, 2000).
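The 50-50-gamble device of Section 6.1 can be sketched as a small elicitation routine; the agent's hidden attitude (a square-root "satisfaction" function) and the reward range are assumptions for illustration:

```python
# Sketch of the von Neumann-Morgenstern device: fix u(x) = 0 and u(z) = 1;
# any reward y indifferent to a 50-50 gamble between x and z gets u(y) = 0.5.
# We simulate a decision maker whose true attitude is an assumed hidden
# function `satisfaction` that the analyst never observes directly.
def satisfaction(m):
    return m**0.5            # hidden risk-averse attitude (an assumption)

x, z = 0.0, 100.0

def indifferent_reward(p):
    # The certain amount the simulated agent deems equal in value to a gamble
    # giving z with probability p and x otherwise (found by bisection).
    target = p * satisfaction(z) + (1 - p) * satisfaction(x)
    lo, hi = x, z
    for _ in range(100):
        mid = (lo + hi) / 2
        if satisfaction(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

y = indifferent_reward(0.5)      # certainty equivalent of the 50-50 gamble
# The elicited cardinal utility of y is halfway between u(x) = 0 and u(z) = 1,
# even though y itself lies well below the monetary midpoint $50:
assert y < 50.0
assert abs(satisfaction(y) - 0.5 * satisfaction(z)) < 1e-6
```

This is exactly the "psychological" data of Note 1: the midpoint claim is meaningful only because a preference over a gamble was observed.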
Note 2: Those preferences must satisfy an independence condition, namely that x is preferred to y if and only if an α chance of x and a 1 − α chance of z is preferred to an α chance of y and a 1 − α chance of z, regardless of the value of z — i.e., a preference for x over y is independent of common consequences received in other events.12

Note 3: Once probabilities have been introduced into the definition and measurement of utilities, a cardinal meaning cannot be attached to the utility values except in a context that involves risky choices. If the utility of y is found to lie exactly halfway between the utilities of x and z, it cannot be concluded that the increase in the measure resulting from the exchange of x for y is exactly the same as the increase resulting from the exchange of y for z. All that can be said is that y is indifferent to a 50-50 gamble between x and z.

12 The independence condition is implicit in von Neumann and Morgenstern's axiomatization.
Section 6.2: Comparative Axioms of SP, EU, and SEU

Many authors contributed to the development of axiomatic theories of rational decision making under uncertainty; this section, however, concentrates on three main contributions: de Finetti's theory of subjective probability (SP), von Neumann and Morgenstern's theory of expected utility (EU), and Savage's theory of subjective expected utility (SEU). Savage's theory is essentially a merger of the other two and will be presented in conjunction with the work of Anscombe and Aumann. Because various authors present various different aspects of these developments, I have tried to be consistent with my notation; what follows is an amalgamation of work published by Nau (2000) and French and Insua (2000). Table 6-1 (appended to the end of Section 6) presents the axioms at a glance as a comparative aid and was adapted from Nau (2000).

In all three theories, there is a primitive ordering relation, which I will denote by ≽. In SP theory, the objects of comparison are events A, B, ..., which are subsets of some grand set of possible states of nature; the ordering relation between events is one of comparative likelihood: A ≽ B means that event A is considered "at least as likely" as event B. In EU theory, the objects of comparison are probability distributions f, g, ..., over some set of possible consequences; the ordering relation is one of preference: f ≽ g means that distribution f is "at least as preferred" as distribution g. The consequences could consist of amounts of money or
other tangible rewards or prizes, or they could simply be elements of some abstract set. SEU theory merges the main elements of these two frameworks: the objects of comparison are "acts," which are real or imaginary mappings from states of nature to consequences. In other words, an act is an assignment of some definite consequence to each state of nature that might obtain.

Similarity of Axioms Across Theories

1. Assumption that the relation is complete: for any A and B, either A ≽ B or B ≽ A or both. Hence, it is always possible to determine at least a weak direction of ordering between two alternatives.

2. Assumption that the ordering relation is transitive: i.e., A ≽ B and B ≽ C implies A ≽ C.

3. Assumption that the relation has a property called independence or cancellation, which means that in determining the order of two elements (such as A ≽ B), there is some sense in which common events or consequences can be ignored on both sides of the ordering relation.

A Closer Look at the Independence Condition Across the Theories

1. The independence condition for subjective probability states:

A ≽ B ⇔ A ∪ C ≽ B ∪ C

for any C that is disjoint from both A and B. Simply stated, if the same non-overlapping event is joined to two events whose likelihoods are being compared, the balance of likelihood is not "tipped." This condition also implies:

A ∪ C ≽ B ∪ C ⇔ A ∪ C* ≽ B ∪ C*

for any C, C* that are disjoint from both A and B; that is, a common disjoint event on both sides of a likelihood comparison can be replaced by another common disjoint event without tipping the balance.
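A small sketch of condition 1: when the likelihood relation comes from an additive probability measure (the state probabilities below are assumed numbers), joining a common disjoint event can never tip the balance:

```python
from itertools import combinations

# With an additive measure, P(A ∪ C) = P(A) + P(C) for disjoint C, so
# A ≽ B  ⇔  A ∪ C ≽ B ∪ C holds automatically.
prob = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.25, "s5": 0.15}

def P(event):
    return sum(prob[s] for s in event)

states = set(prob)
A, B = {"s1", "s3"}, {"s2"}          # P(A) = 0.4 >= P(B) = 0.2
for r in range(3):
    for C in combinations(states - A - B, r):
        C = set(C)
        # joining a common disjoint event never tips the likelihood balance
        assert (P(A) >= P(B)) == (P(A | C) >= P(B | C))
print("union-independence holds for the additive measure")
```

The converse direction — that a comparative-likelihood relation satisfying this condition (plus the other axioms) admits an additive representation — is the substance of the SP representation theorems.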
2. The independence condition for expected utility is:

f ≽ g ⇔ αf + (1 − α)h ≽ αg + (1 − α)h

for all h and 0 < α < 1. The expression αf + (1 − α)h is an "objective mixture" of the distributions f and h — i.e., an α chance of receiving f and a 1 − α chance of receiving h, where the chances are objectively determined. In other words, if f and g are objectively mixed with the same other distribution h, in exactly the same proportions, the balance of preference between them is not tipped. The condition also implies:

αf + (1 − α)h ≽ αg + (1 − α)h ⇔ αf + (1 − α)h* ≽ αg + (1 − α)h*,

which says that if f is preferred to g when they are both mixed with a common third distribution h, then the direction of preference is not changed if they are mixed with a different common distribution h* instead: the "common element" h can be replaced with any other common element.

3a. The independence condition for subjective expected utility (Savage's axiom P2, otherwise known as the "sure-thing principle," as noted by Nau (2000)) states:

Af + (1 − A)h ≽ Ag + (1 − A)h ⇔ Af + (1 − A)h* ≽ Ag + (1 − A)h*.

3b. There is also a second kind of independence condition that appears in SEU theory, namely an assumption about the state-independence of preferences for particular consequences. This assumption (Savage's P3) states that if x, y, and z are constant acts (acts that yield the same consequence in every state of nature) and A is any non-null event13, then:

x ≽ y ⇔ Ax + (1 − A)z ≽ Ay + (1 − A)z.

13 A non-null event is an event that has non-negligible probability in the sense that at least some preferences among acts are affected by the consequences received in that event.
The expression Af + (1 − A)h appearing in 3a and 3b is a "subjective mixture" of f and h: it means that the consequence specified by act f is received in every state where event A is true, and the consequence specified by act h is received in all other states. Simply, condition 3a says that in comparing two acts which happen to agree with each other in some states (namely, the states in which A is not true), different agreeing consequences can be substituted in those states without tipping the balance of preference between the two acts. This property is necessary to define the concept of a conditional preference between two acts: the relation Af + (1 − A)h ≽ Ag + (1 − A)h means that f is preferred to g conditional on event A. A decision-tree illustration of this condition and its interpretation is given in Section 9.
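Savage's P2 can be illustrated with a toy subjective-expected-utility maximizer; the states, probabilities, consequences, and utilities below are all assumed numbers:

```python
# For an SEU maximizer, swapping the common consequence received when A is
# false for any other common consequence never reverses the preference
# between the two composite acts (the sure-thing principle).
states = ["s1", "s2", "s3"]
prob = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
A = {"s1", "s2"}                       # the event on which f and g differ

u = {"apple": 1.0, "banana": 2.0, "cherry": 5.0}
f = {"s1": "banana", "s2": "cherry"}   # consequences of f on A
g = {"s1": "apple", "s2": "banana"}    # consequences of g on A

def seu(act_on_A, common_outside):
    # subjective expected utility of "act_on_A if A, common_outside otherwise"
    return sum(prob[s] * u[act_on_A[s] if s in A else common_outside]
               for s in states)

for h in u:
    for h_star in u:
        # the direction of preference is independent of the common consequence
        assert (seu(f, h) >= seu(g, h)) == (seu(f, h_star) >= seu(g, h_star))
print("sure-thing principle holds for SEU preferences")
```

The reason is visible in the sum: the common consequence contributes the same additive term to both sides, so it cancels from the comparison.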
Simply, if the consequence x is preferred to the consequence y "for sure," then it is also preferred to y conditional on any event. This assumption is needed to help separate utilities from probabilities, i.e., to ensure that relative utilities for consequences don't depend on the states in which they are received.

The important thing to note is that, in all these conditions, the independence axiom also works in reverse. This is so because the completeness axiom requires the decision maker to have a definite direction of preference between f and g, and if it were not the same as the direction of preference between αf + (1 − α)h and αg + (1 − α)h, the independence axiom would produce a contradiction.

Section 6.3: Subjective Probability versus Expected Utility Theory

In his 1937 paper, de Finetti pointed out that there are actually two equivalent ways to axiomatize the theory of subjective probability. One way is in terms of a binary relation of comparative likelihood (as illustrated in the table above). The other way is in terms of the acceptance of monetary gambles. The two theories are "dual" to each other, i.e., they are really the same theory, but their working parts are merely labeled and interpreted in different ways (Nau, 2000).

Suppose that there is some set S of mutually exclusive, collectively exhaustive states of nature, and consider possible distributions of monetary wealth over those states: if f is a vector representing a wealth distribution, then f(s) is the amount of money you get if state s ∈ S occurs. Now consider the following set of axioms:

A0: (reflexivity) f ≽ f for all f.
A1: (completeness) for any f and g, either f ≽ g or g ≽ f or both.
A2: (transitivity) if f ≽ g and g ≽ h, then f ≽ h.
A3: (strict monotonicity) if f(s) > g(s) for all s, then f ≻ g (i.e., NOT g ≽ f).
A4: (independence) f ≽ g ⇔ αf + (1 − α)h ≽ αg + (1 − α)h for all h and 0 < α < 1.

The independence property is necessary in order for the relation to be represented by a probability distribution that is additive across mutually exclusive events and/or a utility function that is linear in probability. The additivity or linearity of the preference representation, in turn, means that the direction of preference or comparative likelihood between two alternatives depends only on the difference between them. Thus, other things being equal, common additive factors on opposite sides of the ordering relation can be ignored or canceled.

60
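The de Finetti setup above can be illustrated numerically. The following sketch (with invented states, probabilities, and wealth vectors, not taken from the report) ranks wealth distributions by their expectation under a subjective distribution π and checks that the induced ordering satisfies the independence axiom A4:

```python
# Illustrative sketch: ranking wealth distributions by expectation under a
# subjective probability pi. All numbers below are invented for illustration.

pi = [0.2, 0.5, 0.3]            # subjective probabilities over 3 states

def expectation(f, pi):
    """Expected wealth of distribution f under pi."""
    return sum(p * w for p, w in zip(pi, f))

def weakly_preferred(f, g, pi):
    """f is weakly preferred to g in the induced preference order."""
    return expectation(f, pi) >= expectation(g, pi)

def mix(alpha, f, h):
    """Objective mixture alpha*f + (1 - alpha)*h, componentwise."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(f, h)]

f = [10.0, 0.0, 5.0]
g = [0.0, 6.0, 2.0]
h = [1.0, 1.0, 1.0]

# Independence (A4): mixing both sides with a common h preserves the ordering,
# because expectation is linear in the wealth vector.
assert weakly_preferred(f, g, pi) == weakly_preferred(mix(0.4, f, h), mix(0.4, g, h), pi)
```

Because expectation is linear, the common mixture component contributes the same amount to both sides, which is exactly why the common element can be canceled.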
SP THEOREM: If preferences among wealth distributions over states satisfy A0-A1-A2-A3-A4, then there exists a unique subjective probability distribution π such that f ≽ g if and only if the expectation of the wealth distribution f is greater than or equal to the expectation of the distribution g according to π (or, equivalently, if acceptable $-gambles satisfy B0-B1-B2-B3-B4, then x is an acceptable $-gamble if and only if its expectation is nonnegative according to π). This is de Finetti's theorem on subjective probability.

This means the decision maker effectively has linear utility for money: scaling both f and g up or down by the same positive factor, and/or increasing the decision maker's initial wealth by a constant amount, does not lead to a reversal of preference.

There are many results and important conclusions that follow from this line of thought, which are discussed in some depth in Nau (2000). Two of prominence are: first, that the direction of preference between any two wealth distributions depends only on the difference between them; and second, that the inequalities between different pairs of wealth distributions can be added.

Consider now the differences between wealth distributions as gambles, of which there are two types: acceptable and unacceptable. Let G denote the set of acceptable gambles. Then:

B0: (reflexivity) 0 ∈ G.
B1: (completeness) for any x, either x ∈ G or −x ∈ G or both.
B2: (linearity) if x ∈ G, then αx ∈ G for any α > 0.
B3: (additivity) if x ∈ G and y ∈ G, then x + y ∈ G.
B4: (coherence) G contains no strictly negative vectors.

These are the assumptions used by de Finetti in his gamble-based method of axiomatizing subjective probability, and (by the arguments above) they are logically equivalent to A0−A4 (Nau, 2000).

Now consider reversing the roles of money and probabilities. Let C be some set of consequences or prizes, and let f, g, h, etc., denote probability distributions over those prizes; f(c) now represents the probability of receiving prize c rather than an amount of money. The axioms are as follows:

61
A0: (reflexivity) f ≽ f for all f.
A1: (completeness) for any f and g, either f ≽ g or g ≽ f or both.
A2: (transitivity) if f ≽ g and g ≽ h, then f ≽ h.
A3: (nontriviality) there exist f* and g* such that f* ≻ g* (i.e., NOT g* ≽ f*).
A4: (independence) f ≽ g ⇔ αf + (1 − α)h ≽ αg + (1 − α)h, where 0 < α < 1, and vice versa.
A5: (continuity) for fixed f, the set of g such that f ≽ g is closed.

There is a small difference in Axiom 3: the strict monotonicity condition is weakened to a nontriviality condition. Note also that a difference between two probability distributions is itself a kind of "gamble" in the sense that it changes your risk profile; it just does so by shifting the probabilities attached to fixed prizes rather than by shifting the amounts of money attached to events with presumably-fixed subjective probabilities. The following axioms are then developed for the set G of acceptable p-gambles:

B0: (reflexivity) 0 ∈ G.
B1: (completeness) for any x, either x ∈ G or −x ∈ G or both.
B2: (linearity) if x ∈ G, then αx ∈ G for any α > 0.
B3: (additivity) if x ∈ G and y ∈ G, then x + y ∈ G.
B4: (nontriviality) there is at least one nonzero p-gamble not in G.
B5: (closure) G is closed.

The continuity/closure (A5/B5) requirement is needed to rule out unbounded utilities as well as lexicographic preferences (Nau, 2000; French and Insua, 2000).

The results are summarized as the expected utility theorem.

EU THEOREM: If preferences among probability distributions over consequences satisfy A0-A1-A2-A3-A4-A5 (or equivalently, if acceptable p-gambles satisfy B0-B1-B2-B3-B4-B5), then there exists a utility function u, unique up to positive affine transformations, such that f ≽ g if and only if the expected utility of f is greater than or equal to the expected utility of g (or, equivalently, x is an acceptable p-gamble if and only if it yields a nonnegative change in expected utility according to u).

Here u denotes a utility function for prizes: f ≽ g if and only if f′u ≥ g′u, and the quantities on the left and right of this expression are just the expectations of u under the distributions f and g, respectively. The function u is subject to arbitrary positive scaling, and a constant can also be added to it without loss of generality (wlog), because this will just add the same constant to the expected utility of every probability distribution.

62
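The EU representation and its affine-invariance can be sketched as follows (the prizes, utilities, and lotteries are invented for illustration):

```python
# Illustrative sketch: ranking lotteries over prizes by expected utility.
# The utility function u is unique only up to a*u + b with a > 0.

u = {"nothing": 0.0, "small": 1.0, "large": 3.0}   # invented utilities

def expected_utility(f, u):
    """f maps each prize c to the probability f(c) of receiving it."""
    return sum(f[c] * u[c] for c in f)

f = {"nothing": 0.1, "small": 0.5, "large": 0.4}
g = {"nothing": 0.0, "small": 1.0, "large": 0.0}   # the "small" prize for sure

# A positive affine transform of u never flips the preference order.
a, b = 2.0, 5.0
u2 = {c: a * u[c] + b for c in u}
assert (expected_utility(f, u) >= expected_utility(g, u)) == \
       (expected_utility(f, u2) >= expected_utility(g, u2))
```

Adding the constant b adds the same amount (b, since probabilities sum to one) to the expected utility of every lottery, which is why it cancels out of every comparison.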
This is von Neumann-Morgenstern's theorem on expected utility, and it is isomorphic to de Finetti's theorem on subjective probability, merely with the roles of probabilities and payoffs reversed (French and Insua, 2000). Putting the two together: the axioms of subjective probability or expected utility are satisfied if and only if (a) your set of acceptable directions of change in (stochastic) wealth is a convex cone, no matter what your current distribution; (b) it excludes at least some directions, especially those that lead to a sure loss; and (c) it is always the same set of directions. In either case, you get probabilities and utilities together.

Section 6.4: Subjective Expected Utility (Anscombe-Aumann & Savage)

Savage combined the features of subjective probability and expected utility theory in order to model situations in which the decision maker may attach their own subjective probabilities to events and their own subjective utilities to consequences. Anscombe and Aumann followed in much the same way, but had a simpler approach: they merged Savage's concept of an "act" (a mapping from states to consequences) with von Neumann and Morgenstern's concept of an objective probability distribution over consequences. They proposed that the objects of choice should be mappings from states to objective probability distributions over consequences. Such objects are known as "horse lotteries" in the literature. This is supposed to conjure up an image of a horse race in which each horse carries a different lottery ticket offering objective probabilities of receiving various prizes. Think of them as vectors with doubly-subscripted elements, where f(c, θ) denotes the objective probability that the horse lottery f assigns to consequence c when state θ occurs (i.e., when horse θ wins the race).

Henceforth, let f, g, and h denote horse lotteries, and let x, y, and z denote horse lotteries that are "constant" in the sense that they yield the same objective probability distribution over consequences in every state of the world (i.e., no matter which horse wins). Let the usual axioms A0-A5 apply to preferences among horse lotteries, together with one additional axiom, which is essentially the same as Savage's P3:

A6: (state-independence) x ≽ y ⇔ Ax + (1 − A)z ≽ Ay + (1 − A)z

for all constant x, y, z and every event A. Here, Ax + (1 − A)z denotes the horse lottery that agrees with x if A occurs and agrees with z otherwise, where A is an event (a subset of the states of the world) (French and Insua, 2000).

63
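The horse-lottery representation can be sketched numerically. In the SEU representation, the value of a horse lottery is the subjective-probability-weighted average of the objective expected utilities of its tickets (states, beliefs, utilities, and lotteries below are invented for illustration):

```python
# Illustrative sketch of the SEU value of a "horse lottery":
# f[theta][c] = objective probability of consequence c if state theta occurs.
# p and u are an invented subjective belief and state-independent utility.

p = {"theta1": 0.3, "theta2": 0.7}          # subjective probabilities over states
u = {"win": 1.0, "lose": 0.0}               # state-independent utilities

def seu(f, p, u):
    """Subjective expected utility: sum over states of p(theta) times the
    objective expected utility of the lottery ticket carried by theta."""
    return sum(p[th] * sum(f[th][c] * u[c] for c in f[th]) for th in f)

# A horse lottery: a different objective lottery ticket for each "horse" (state).
f = {"theta1": {"win": 0.9, "lose": 0.1}, "theta2": {"win": 0.2, "lose": 0.8}}
# A constant horse lottery: the same ticket no matter which horse wins.
x = {"theta1": {"win": 0.5, "lose": 0.5}, "theta2": {"win": 0.5, "lose": 0.5}}

print(round(seu(f, p, u), 2))   # 0.41  (= 0.3*0.9 + 0.7*0.2)
print(round(seu(x, p, u), 2))   # 0.5   (constant acts are belief-independent)
```

Note how the constant lottery's value does not depend on p at all, which is what makes constant acts useful for calibrating utilities separately from beliefs.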
Note 1: From axioms A0-A5 and some work, one obtains the existence of an expected-utility vector v such that f ≽ g if and only if f′v ≥ g′v. The elements of the vector v are subscripted the same as those of f and g, with v(c, θ) representing the expected utility of consequence c when it is received in state θ. Intuitively, v(c, θ) combines the probability of the state with the utility of the consequence.

Note 2: Axiom A6 then implies the further restriction that the expected utilities assigned to different consequences in state θ must be proportional to the expected utilities assigned to the same consequences in any other state, in which case the ratio of v(c, θ) to v(c, θ′) can be interpreted as the ratio of the subjective probabilities of states θ and θ′.

Note 3: It follows that v can be decomposed as v(c, s) = p(s)u(c), where p(s) is a (unique!) nonnegative probability assigned to state s and u(c) is a state-independent utility assigned to consequence c.

Note 4: Preferences among Savage acts are thus represented by a unique probability distribution p over states and a state-independent utility function u over consequences, with act f preferred to act g if and only if the expected utility of f is greater than or equal to the expected utility of g.

This is summarized as the subjective expected utility theorem.

SEU THEOREM: If preferences among horse lotteries satisfy A0-A1-A2-A3-A4-A5-A6, then there exists a unique probability distribution p and a state-independent utility function u that is unique up to positive affine transformations, such that f ≽ g if and only if the expected utility of f is greater than or equal to the expected utility of g.

However, one further remark addresses a weakness in this point. Remark: It should be noted that the uniqueness of the derived probability distribution depends on the unstated assumption that consequences are not only similarly ordered by preference in every state of the world (which is the content of axiom A6 above, since it requires conditional preferences among objective probability distributions over consequences to be the same in all states), but also have the very same numerical utilities in every state. By convention, the utilities attached to particular consequences are assumed to be identical, not merely proportional, in different states.

Section 6.5: Incomplete Preferences

The completeness axiom causes some distress. It forces the notion that a gamble must be compared: although a gamble may be unacceptable according to the other axioms, either it or its opposite is still acceptable according to the completeness axiom. The three theorems stated above, one for each theory, are now restated generally with incomplete preferences (i.e., with the completeness axiom dropped).

SP THEOREM with incomplete preferences: If preferences among wealth distributions over states satisfy A0-A2-A3-A4-A5, then there exists a convex set P of subjective probability distributions such that f ≽ g if and only if the expectation of the

64
wealth distribution f is greater than or equal to the expectation of distribution g for every distribution π in P (or, equivalently, x is an acceptable gamble if and only if its expectation is nonnegative for every distribution π in P).

Similar results are obtained when the completeness axiom is dropped from EU or SEU theory: one ends up with convex sets of utility functions or expected-utility functions. See French and Insua (2000).

EU THEOREM with incomplete preferences: If preferences among probability distributions over consequences satisfy A0-A2-A3-A4-A5 (or equivalently, if acceptable p-gambles satisfy B0-B2-B3-B4-B5), then there exists a convex set U of utility functions, unique up to positive affine transformations,14 such that f ≽ g if and only if the expected utility of f is greater than or equal to the expected utility of g for every utility function u in U (or, equivalently, x is an acceptable p-gamble if and only if it yields a nonnegative change in expected utility for every utility function u in U).

SEU THEOREM with incomplete preferences: If preferences among horse lotteries satisfy A0-A2-A3-A4-A5-A6, then there exists a convex set V of expected-utility functions, at least one of which can be decomposed as the product of a probability distribution p and a state-independent utility function u, such that f ≽ g if and only if the expected utility of f is greater than or equal to the expected utility of g for every expected-utility function v in V.

14 Note: Theorem 3 in Chapter 2 of French and Insua (2000) addresses positive affine transformations.

65
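The incomplete-preference representation can be sketched as follows. With a convex set P of beliefs, f is preferred to g only if f's expectation is at least g's under every distribution in P; because expectation is linear in π, it suffices to check the extreme points of P. All numbers are invented for illustration:

```python
# Illustrative sketch: preference with an incomplete (set-valued) belief P,
# represented here by its extreme points. f >= g requires dominance under
# EVERY distribution in P; some pairs of acts are simply incomparable.

def expectation(f, pi):
    return sum(p * w for p, w in zip(pi, f))

def robustly_preferred(f, g, extreme_points):
    """True iff E[f] >= E[g] under every pi in the convex hull of the
    extreme points (checking the extreme points suffices, by linearity)."""
    return all(expectation(f, pi) >= expectation(g, pi) for pi in extreme_points)

P = [[0.2, 0.8], [0.6, 0.4]]     # two extreme points of a convex belief set
f = [4.0, 4.0]                   # a constant act
g = [0.0, 6.0]
h = [1.0, 1.0]

assert robustly_preferred(f, h, P)    # f dominates h under every pi in P
# f vs g: neither direction holds for all pi in P, so they are incomparable.
assert not robustly_preferred(f, g, P) and not robustly_preferred(g, f, P)
```

The incomparable pair at the end is exactly what dropping completeness permits: the decision maker is not forced to rank f against g.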
Table 6.1: Comparing the axioms of the three theories (SP, EU & SEU). The table is arranged so that axioms playing the same role appear under the same label.

de Finetti (subjective probability):
  Objects of comparison: events A, B, C (sets of states).
  Primitive relation: comparative likelihood ≽.
  Reflexivity: A ≽ A.
  Completeness: for all A, B: A ≽ B or B ≽ A or both.
  Transitivity: for all A, B, C: if A ≽ B and B ≽ C, then A ≽ C.
  Independence (cancellation): if A ∩ C = B ∩ C = ∅, then A ≽ B iff A ∪ C ≽ B ∪ C.
  Nontriviality: Ω ≻ ∅, and Ω ≽ A ≽ ∅ for any A that is "uncertain".
  Continuity: there exists a fair coin.
  Conditional probability: if A ⊆ C and B ⊆ C, then A|C ≽ B|C iff A ≽ B.

von Neumann-Morgenstern (expected utility):
  Objects of comparison: probability distributions f, g, h over an abstract set closed under probability mixtures.
  Primitive relation: preference ≽.
  Reflexivity: f ≽ f.
  Completeness: for all f, g: f ≽ g or g ≽ f or both.
  Transitivity: for all f, g, h: if f ≽ g and g ≽ h, then f ≽ h.
  Independence (separability, substitution): if f ≽ g, then αf + (1 − α)h ≽ αg + (1 − α)h for all α strictly between 0 and 1, and conversely; whence αf + (1 − α)h ≽ αg + (1 − α)h iff αf + (1 − α)h* ≽ αg + (1 − α)h*.
  Nontriviality: ∃ f and g such that f ≻ g.
  Continuity: if f ≻ g ≻ h, then αf + (1 − α)h ≻ g ≻ βf + (1 − β)h for some α and β strictly between 0 and 1.

66

Savage (subjective expected utility):
  Objects of comparison: "acts" f, g, h (mappings from states to consequences), including constant acts x, y, z.
  Primitive relation: preference ≽.
  P1a (reflexivity, completeness): f ≽ f; for all f, g: f ≽ g or g ≽ f or both.
  P1b (transitivity): for all f, g, h: if f ≽ g and g ≽ h, then f ≽ h.
  P2 (independence, a.k.a. the sure-thing principle): Af + (1 − A)h ≽ Ag + (1 − A)h iff Af + (1 − A)h* ≽ Ag + (1 − A)h*.
  P3 (state-independent utility: "value can be purged of belief"): if event A is nonnull, then x ≽ y iff Ax + (1 − A)z ≽ Ay + (1 − A)z.
  P4 (qualitative probability: "belief can be discovered from preference"): for all events A and B and constant acts x, y, x*, y* such that x ≻ y and x* ≻ y*: Ax + (1 − A)y ≽ Bx + (1 − B)y iff Ax* + (1 − A)y* ≽ Bx* + (1 − B)y*.
  P5 (nontriviality): ∃ f and g such that f ≻ g.
  P6 (continuity): if f ≻ g, then for any h there exists a finite partition {Ai} of the set of states such that Ai h + (1 − Ai)f ≻ g and f ≻ Ai h + (1 − Ai)g for any event Ai in the partition.
  P7 (dominance): if f(θ) ≽ g(θ) in every state θ in event A, then Af + (1 − A)h ≽ Ag + (1 − A)h.

67
Part III: SEQUENTIAL STATISTICAL DECISION THEORY

Introduction to Markov Decision Processes

The concepts of decision processes presented in prior sections have assumed a single decision-making problem. In most instances, however, this is rarely the case: decisions are "nested in a series of related decisions" (French and Insua, 2000). This iterates the concept that a decision is contingent upon the outcome of previous decisions made. This sequential pattern of decision processes has wide applications to all areas of the sciences, engineering, finance, operations research, etc.

This final theoretical part of the report focuses on sequential decision processes and provides a framework for formulating sequential decision problems. An examination of sequential sampling leads to the formulation of decision rules and stopping rules in statistical decision theory; these concepts establish the goal for preposterior and sequential analysis. The formulation of sequential problems is most naturally expressed as a recursive equation, whose solution is credited to Bellman. Subsequent developments in sequential methods have instigated growth in sequential probability, leading to a large body of theory on Markov decision processes. Since many problems can be formulated in Markovian ways, these developments and their applicability to a wide range of problems have propagated the use of sequential methods in many fields. Consequently, the latter section focuses on Markov chains and Markov decision processes as an extension of the developments in sequential probability. Emphasis is placed on abstract and technical formulations of the concepts and methodology.

68
Section 7: Preposterior and Sequential Methods

The framework for decision problems as introduced in Part I assumed a fixed sample of observations for a single decision. The first introduction to statistical decision problems introduced the concept of having observations upon which a decision could be based, and those statistical procedures assumed a fixed sample size. However, decisions lead to consequences which may subsequently lead to further decision making. A natural extension of the framework to encompass sequential decision processes establishes new concepts for statistical decision theory.

Although it was not explicitly stated, the choice of whether to experiment is also a very important aspect of statistical decision theory; that is, the decision maker has the option to experiment or make new observations before making a decision. This decision-making process is frequently termed preposterior analysis. It instigated the expansion of hypothesis testing to encompass sequential methods, and other strategies suggest sequential sampling and, moreover, multistage procedures. Thus, sampling is the principal issue in sequential statistical decision theory, and it establishes the concepts of optimal stopping rules and decision rules.

Extending the framework for decision problems, consider independent random variables X1, X2, ..., and define X^j = (X1, X2, ..., Xj), where X^j is assumed to have density f^j(x^j | θ). If X1, X2, ... is a sequential sample from the density f(x | θ), then

f^j(x^j | θ) = ∏_{i=1}^{j} f(xi | θ).

The cost of the experiment is attributed to two factors: n, the number of observations taken, and s, the way in which the observations are taken (Berger, 1985). The goal is to minimize the overall cost, which incorporates both the decision loss and the cost of conducting the experiment. Note that l(θ, a, n, s) denotes the overall loss when θ is the true state of nature and a is the action taken; it can be considered to be the sum of the general loss function l(θ, a) and the sampling cost c(n, s). Berger (1985) states that these functions are additive when the decision maker has a linear utility function, which can be stated as

l(θ, a, n, s) = l(θ, a) + c(n, s),

and, when the ith observation costs ci,

l(θ, a, n) = l(θ, a) + ∑_{i=1}^{n} ci.

69
Recall that the loss function can be defined as l(θ, a) = −u(θ, a). Furthermore, if the utility function is nonlinear and there is a monetary gain or loss g(θ, a), then l(θ, a, n, s) = −u(g(θ, a) − c(n, s)).

The two most common approaches to sampling are (a) the fixed sample size method and (b) the sequential sampling approach. Although the latter approach is of primary interest, this section begins by describing the methodology for determining the optimal fixed sample size, leaving the sequential approach for the remainder of the section. Recall that preposterior analysis is coined as such because the choice is made before the data are obtained.

Within the Bayesian context, consider first the "nonstatistical" decision problem, that is, the no-observation problem. The situation in which the decision maker has no option to make an observation and must choose an action immediately can be expressed as

r0(π) = min_{a∈A} Eθ[l(a, θ)],

where the expectation over θ is taken with respect to π. More generally, let δn^π denote a Bayes decision rule for the fixed sample size problem, incorporating the loss in observing X^n = (X1, ..., Xn). The Bayes risk of a sample of fixed size n is then

rn(π) = E^π Eθ^X[l(θ, δn^π(X^n), n)].

If the fixed sample size loss is equal to the sequential loss, then δn^π is also the Bayes decision rule for the sequential loss function.

Determining the optimal fixed sample size in a decision problem requires resolving the n which minimizes rn(π). In simple cases this can be resolved by treating n as continuous, differentiating rn(π) with respect to n, and setting the derivative equal to zero.

The advantages of sequential analysis are somewhat obvious: at the forefront lies the decision maker's ability to gather just the amount of data needed to make a decision with a specified degree of precision. Put simply, a sequential analysis of the problem is typically cheaper than a fixed sample size analysis (Berger, 1985). The difficulty in sequential decision problems resides in all the possible ways in which the observations can be taken. The remainder of this section focuses on the foundations, concepts and methods of sequential analysis.

70
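The minimization of rn(π) over n can be sketched in a standard conjugate setting (not a worked example from the report): for X_i ~ N(θ, σ²) with prior θ ~ N(μ, τ²) and squared-error loss, the expected posterior decision risk after n observations is σ²τ²/(σ² + nτ²), so rn(π) is this quantity plus the sampling cost nc. All numbers below are invented:

```python
# Illustrative sketch: optimal fixed sample size n minimizing
# r_n(pi) = (expected posterior risk) + n*c for a normal mean with a
# conjugate normal prior and squared-error loss. Numbers are invented.

sigma2 = 4.0   # sampling variance
tau2 = 1.0     # prior variance
c = 0.01       # cost per observation

def r_n(n):
    decision_risk = sigma2 * tau2 / (sigma2 + n * tau2)
    return decision_risk + n * c

# Direct search over integer sample sizes (the calculus solution, setting the
# derivative to zero, gives (sigma2 + n*tau2)^2 = sigma2*tau2^2/c, i.e. n = 16).
n_opt = min(range(0, 200), key=r_n)
print(n_opt, round(r_n(n_opt), 4))   # 16 0.36
```

The integer search and the derivative condition agree here; in general the derivative gives a non-integer n, and the two adjacent integers must be compared.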
Section 7.2: Sequential Analysis - The Basics

The distinguishing feature of sequential analysis is the ability to take observations one at a time, with the decision maker or experimenter having the option of stopping the experiment and making a decision at any stage, at which point the decision maker chooses to take action a ∈ A. Keeping within the sequential decision problem framework presented earlier, it can be reiterated that if n observations are taken sequentially, then the loss function when θ is the true state of nature can be expressed as l(θ, a, n). Moreover, let us assume that the cost is proportional to the number of observations, so that the loss function is increasing in n.

A nonrandomized sequential decision procedure is denoted d = (φ, δ) and consists of two components: φ, a stopping rule, and δ, a (terminal) decision rule. A stopping rule is characterized by a family of functions of the form φ = {φ0, φ1(X1), φ2(X^2), ...}. Here φ0 is the probability of making an immediate decision without sampling, while φn is the probability (zero or one, in the nonrandomized case) of stopping sampling and making a decision after X^n is observed. That is,

φ0 = 1, iff the rule says to take no observations;
φ0 = 0, iff the rule says to take the first observation;

and

φn(X1, ..., Xn) = 1, iff the rule says to take no more observations, given that X1, ..., Xn are at hand;
φn(X1, ..., Xn) = 0, iff the rule says to take another observation, given that X1, ..., Xn are at hand.

Thus, given the functions φn, one can determine when to stop; and if the rule is given in other terms, one can determine the functions φn from the rule. Alternatively, the stopping rule can be characterized by the family of functions τ = {τ0, τ1(X1), τ2(X^2), ...}, where τ0 = φ0.

71
Similar to the φn functions, the τn functions are given by

τn(X1, ..., Xn) = (1 − φ0)[1 − φ1(X1)] ··· [1 − φn−1(X1, ..., Xn−1)] φn(X1, ..., Xn),

so that τn = 1 if the rule calls for stopping sampling for the first time at Xn, and τn = 0 if the rule either stopped the sampling prior to the nth observation or says to go on after it. The function τn therefore gives the conditional probability that precisely n observations are called for:

P(N = n | X1, X2, ...) = τn(X1, ..., Xn).

Given the τn functions one can determine when to stop, and given the stopping rule, one can determine the τn functions (Lindgren, 1971).

The (terminal) decision rule is simply a collection of decision rules of the type used in problems with fixed sample size (Lindgren, 1971). More formally, a terminal decision rule consists of a series of decision functions δ0, δ1(X1), δ2(X^2), ..., where δi(X^i) is the action taken after sampling has stopped upon observing X^i, and δ0 is the action taken when no data are available and a decision is made immediately.

Since the overall problem, given the stream X1, X2, ..., is to determine the final sample size at which φ says to stop and make a decision, it is convenient to discuss this issue in terms of the stopping time N. The stopping time is the random function of X given (for a nonrandomized sequential procedure) by

N(X) = min{n ≥ 0 : τn(X^n) = 1},

so that {N = n} is the set of observations for which the sequential procedure stops at time n. The probability of stopping at the nth observation, for a given decision and stopping rule, is simply the expected value Pθ(N = n) = E{τn(X1, ..., Xn)}. Note that there is nothing in the definition to prevent N = ∞; as shown in Berger (1985),

P(N < ∞) = P(N = 0) + ∑_{n=1}^{∞} P(N = n) = P(N = 0) + ∑_{n=1}^{∞} ∫_{{N = n}} dF(x^n | θ).

In applying sequential procedures, it is favorable to have a rule for which P(N < ∞) = 1.

72
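The relation between the φn functions, the τn functions, and the stopping time N can be sketched directly. The particular stopping rule below (stop once the running sum leaves (−2, 2)) and the data stream are invented for illustration:

```python
# Illustrative sketch: phi_n, tau_n, and the stopping time N for a simple
# nonrandomized stopping rule on a fixed data stream (all values invented).

def phi(n, xs):
    """phi_n: 1 if the rule says to stop after observing xs[0:n], else 0."""
    if n == 0:
        return 0                      # always take the first observation
    return 1 if abs(sum(xs[:n])) >= 2 else 0

def tau(n, xs):
    """tau_n = (1 - phi_0)(1 - phi_1)...(1 - phi_{n-1}) * phi_n."""
    t = phi(n, xs)
    for k in range(n):
        t *= (1 - phi(k, xs))
    return t

def stopping_time(xs):
    """N = min{n >= 0 : tau_n = 1}; None if the rule never stops on xs."""
    for n in range(len(xs) + 1):
        if tau(n, xs) == 1:
            return n
    return None

xs = [0.5, -1.0, 1.4, 1.3, -0.2]      # invented observations
print(stopping_time(xs))              # 4  (running sums 0.5, -0.5, 0.9, 2.2)
```

Note that tau(n, xs) is 1 for exactly one n on any stream where the rule stops, which is what makes the events {N = n} a partition of the stopped sample paths.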
Section 7.3: Bayesian Sequential Procedures

Recall from section 7.1 that a preposterior analysis includes both the decision loss and the sampling or experimental cost when a particular action is taken for a particular state of nature. If a sequence of observations X1, X2, ..., XN is taken at a cost of c(N), where N is the random variable whose value is the number of observations actually used in reaching a decision, then for a sequential procedure d = (φ, δ) the loss incurred is l(θ, d) + c(N), and the expected loss, or risk function, is

r(θ, d) = E[l(θ, d) + c(N)].

The Bayes risk of a sequential procedure d = (φ, δ) is defined to be

r(π, d) = E^π[r(θ, d)],

and the objective of Bayesian sequential analysis is to find the sequential procedure which minimizes r(π, d). This minimum is defined as

r(π) = inf_d r(π, d).

Bayesian analysis with fixed sample size problems is straightforward: the posterior distribution is applied to the given loss function to obtain averages over which the possible actions are ordered and the optimum chosen (Lindgren, 1971). Bayesian sequential analysis, by contrast, is quite cumbersome. The idea is that at every stage of the procedure, one should compare the (posterior) Bayes risk of making an immediate decision with the "expected" (posterior) Bayes risk of continuing to sample (Berger, 1985).

Now suppose that there are potentially n stages, corresponding to observations X1, X2, ..., Xn, and that the terminal decision is made according to the Bayes criterion. If the Bayes decision rule δπ calls for taking all n observations in a particular application, and assuming that l(θ, a, n) ≥ −K, it then follows that

73
r(π, d) = E^π[r(θ, d)]
        = P(N = 0) E^π[l(θ, δ0, 0)] + ∑_{n=1}^{∞} ∫_Θ ∫_{{N = n}} l(θ, δn(x^n), n) dFn(x^n | θ) dF^π(θ)
        = P(N = 0) r0(π, δ0) + ∑_{n=1}^{∞} ∫_{{N = n}} r(π, δn(x^n), n) dF^{mn}(x^n).

(Note that the subscript m refers to the marginal density.) Thus, half the problem of determining the Bayes sequential procedure is deciphered: regardless of the stopping rule used, r(π, d) is minimized if δ0 and the δn are chosen to minimize the posterior expected loss. Hence, once one has stopped, the optimal action is simply the Bayes action for the given observations (Berger, 1985). In other words, if τ is a given stopping rule and π a given prior, then r(π, d) is minimized by the decision rule d where each di is the Bayes decision rule based on the first i observations, considered as a sample of fixed size i. This is summarized by a theorem found in Lindgren (1971). However, the problem of determining the optimal stopping time remains, and it is the focus of the next section.

Section 7.4: Optimal Stopping Time

The difficulty of determining a Bayes sequential stopping rule resides in the fact that there is an infinite "stream of identically distributed observations" upon which to base a sequential test. Subsequently, if the stopping rule τ calls for at least n − k observations, the problem of whether to stop (using the Bayes terminal decision) or obtain more data is solved by comparing two conditional expected losses (Lindgren, 1971): one of these conditional expressions is based on those n − k observations, while the other corresponds to taking more observations. The conditional expected loss (given the observations) is the function corresponding to the posterior probabilities at each stage, and those posterior probabilities depend on the observations at hand, which enter the decision process through the posterior distribution (Lindgren, 1971). This is made easier if the Bayes stopping rule can be determined stage by stage.

Furthermore, if the posterior Bayes risk at time n is a constant, independent of the actual observations, then the optimal stopping rule corresponds to the choice, before the experiment, of an optimal fixed sample size n; this is equivalent to the fixed sample size issue (Berger, 1985). However, this situation is rare. Yet when the observations are independent and identically distributed, a simplification arises because at every stage the future looks exactly the same (Berger, 1985).

74
There are many situations which may arise when trying to determine the optimal stopping rule. Before considering these situations, it is important to develop the framework for such a test. Consider testing a simple hypothesis H_0 against a simple alternative H_1, with losses l_a and l_b corresponding to the erroneous rejection and acceptance of the null hypothesis, respectively, and cost c per observation. For a given procedure d = (τ, δ), the expected loss is

r(θ, d) = l_a α + c E(N | H_0) ≡ R_0,   if θ = H_0,
r(θ, d) = l_b β + c E(N | H_1) ≡ R_1,   if θ = H_1,

where α and β are the usual error sizes and N is the number of observations used in making a decision. Then, assuming a prior probability π assigned to H_0, the Bayes risk is

r(π, d) = π R_0 + (1 − π) R_1.

Lindgren (1971) demonstrates that if no observations are permitted by the stopping rule τ, the minimum Bayes risk is just its value when the no-data Bayes procedure is used (as mentioned previously):

r_0(π) = min r(π, d) = l_a π,         if π ≤ l_b / (l_a + l_b),
                     = l_b (1 − π),   if π > l_b / (l_a + l_b).

Recall that this was shown in section 2, as it is essentially the no-data problem, or "nonstatistical" decision problem. This is shown graphically in Figure 7-1 below.

Figure 7-1: [graph of the no-data Bayes risk r_0(π)]

Now consider a set of rules, C_1, which suggest obtaining at least one observation, and the Bayes procedure which minimizes r(π, d). The minimum Bayes risk for this set of rules is denoted

r_1(π) = min_{C_1} [π R_0 + (1 − π) R_1].

Figure 7-2: [graph of r_1(π) compared with r_0(π)]

The figure above represents a general graphical view of this function of π, which has the following properties (Lindgren, 1971):
(a) r_1(π) ≥ c;
(b) r_1(0) = r_1(1) = c;
(c) the graph of r_1(π) is concave down.
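The shape of the no-data risk is easy to verify numerically. The following sketch is my own illustration, not part of the report; the loss values are arbitrary, and it assumes the convention used here that l_a is the loss for erroneous rejection of H_0 and l_b the loss for erroneous acceptance.

```python
# Sketch of the no-data Bayes risk r0(pi) for testing H0 against H1.
# Convention assumed here: rejecting H0 erroneously costs l_a (H0 true,
# probability pi); accepting H0 erroneously costs l_b (H1 true, 1 - pi).

def r0(pi, l_a, l_b):
    """Minimum of the two actions' expected losses when no data is taken."""
    return min(l_a * pi, l_b * (1.0 - pi))

l_a, l_b = 1.0, 4.0
crossover = l_b / (l_a + l_b)          # the two line segments meet here
print(r0(0.0, l_a, l_b))               # 0.0 at pi = 0 ...
print(r0(1.0, l_a, l_b))               # 0.0 ... and at pi = 1
print(round(r0(crossover, l_a, l_b), 3))  # 0.8, the peak l_a*l_b/(l_a+l_b)
```

The peak at the crossover prior is where the decision is hardest and where taking observations is most valuable, which is exactly where the sampling curve r_1(π) can dip below r_0(π).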
The first property suggests that at least one observation should be taken, at cost c. The second property suggests that if the prior is either 0 or 1, then no amount of sampling will alter the distribution, so the cost is just c plus the 0 of the no-data curve. The last property implies that the curve r_1(π) is the lower envelope of a family of line segments, each joining the points (0, R_1) and (1, R_0) for a particular procedure; Berger (1985) and Lindgren (1971) both establish the concavity more formally. With these properties, it is now possible to compare the curves graphically. For instance, if the graph of r_1(π) lies above r_0(π), then this suggests not taking any observations. This visual analysis can be carried out for the given hypotheses. Lindgren (1971) provides three steps for determining the appropriate Bayes sequential procedure, which can be summarized as follows:

1. Given losses l_a and l_b, and cost c per observation, the values of C and D as shown in Figure 7-2 can be determined from the graphs of r_0(π) and r_1(π).
2. If π ≤ C, reject H_0; and if π ≥ D, do not reject H_0. Otherwise, sample and determine A and B (from C and D and π), where

   A = [(1 − π)/π] · [C/(1 − C)],   and   B = [(1 − π)/π] · [D/(1 − D)].

   This is established since the condition for continuing sampling can be expressed in terms of the likelihood ratio for the first n observations, namely A < Λ_n < B.
3. After each observation X_n, determine the corresponding likelihood ratio Λ_n for X_1, ..., X_n. If Λ_n ≤ A, reject H_0; if Λ_n ≥ B, do not reject H_0; and if A < Λ_n < B, take another observation.

The drawback to this procedure is that r_1(π) is rarely known; however, given the properties of r_1(π), it is possible to determine the form of the Bayes sequential procedure.
Section 7.5: Sequential Probability Ratio Test

The sequential probability ratio test provides one of the first applications of sequential analysis within statistics (French and Insua, 2000). It tests a simple hypothesis H_0 against a simple alternative H_1. As constructed in the previous subsection, the following constants are defined, where α and β are the respective error sizes:

A = α / (1 − β),   B = (1 − α) / β.
These constants form the limits for the likelihood ratio Λ_n computed after each observation is taken. If Λ_n ≤ A, the sampling stops and the null hypothesis is rejected. If Λ_n ≥ B, the sampling stops and the null hypothesis is not rejected. And if A < Λ_n < B, another observation is taken. Notice that this is essentially the Bayes sequential procedure described earlier. The essential difference is that there is no formal loss structure in this instance; however, for any given sequential probability ratio test there exist losses and a sampling cost per observation such that the Bayes sequential test for some prior distribution is the given sequential probability ratio test. Thus, the analysis, both statistically and graphically, proceeds in the same manner. The decision theoretic approach of balancing the competing factors of expected losses and sampling costs resides in both the Bayesian and non-Bayesian approaches, and this makes it difficult to determine an optimal policy or rule. However, French and Insua (2000) state that there is an intuitive structure for the optimal policy: if the likelihood ratio of the observations is not sufficiently small or large, then keep sampling; otherwise, stop and take action. Even if the constant values of A and B remain undefined, knowing the general structure of the optimal policy allows for finding possible good values. Berger (1985), French and Insua (2000), and Lindgren (1971) provide further details on the development, theory and other methodological issues (constraints) with regard to the sequential probability ratio test.
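As a concrete illustration, the stopping rule can be sketched as follows. The Bernoulli example and all names are my own, not the report's; the sketch uses the convention above, Λ_n = f_0(x_1, ..., x_n) / f_1(x_1, ..., x_n), so that small values favor H_1.

```python
# Sketch of the sequential probability ratio test for a simple null
# H0: p = p0 against a simple alternative H1: p = p1 (Bernoulli data),
# with limits A = alpha/(1 - beta) and B = (1 - alpha)/beta.
from math import log

def sprt(observations, p0=0.5, p1=0.7, alpha=0.05, beta=0.05):
    A = alpha / (1.0 - beta)        # stop and reject H0 when Lambda_n <= A
    B = (1.0 - alpha) / beta        # stop and accept H0 when Lambda_n >= B
    log_lam = 0.0                   # log Lambda_n, updated one x at a time
    for n, x in enumerate(observations, start=1):
        # one Bernoulli observation x in {0, 1}: add log of f0(x)/f1(x)
        log_lam += log(p0 / p1) if x == 1 else log((1 - p0) / (1 - p1))
        if log_lam <= log(A):
            return ("reject H0", n)
        if log_lam >= log(B):
            return ("do not reject H0", n)
    return ("continue sampling", len(observations))

print(sprt([1] * 100))  # ('reject H0', 9): a run of successes favors p = 0.7
print(sprt([0] * 100))  # ('do not reject H0', 6): failures favor p = 0.5
```

The appeal of the procedure shows here: rather than fixing a sample size in advance, the data themselves determine how soon the evidence is decisive.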
Section 8: Markov Decision Processes
The previous section provided a framework for sequential decision problems. Such problems can be formulated in Markovian ways, which has led to a large body of theory on Markov decision processes. Decision processes which make a sequence of probabilistic transitions, and in which the transition probabilities depend only on the current state of the system, are said to undergo a Markov process. Subsequently, if there are alternative actions amongst which a choice can be made, then the process is referred to as a Markov decision process. Since such processes are built upon Markov chains, they also possess the Markovian properties, which provide favorable analytic capabilities. This section begins by introducing the underlying concepts of Markov chains and their properties, as well as other attributes of both finite and infinite chains. It concludes with the framework and computational aspects of Markov decision processes for determining optimal policies.
Section 8.1: Sequential Decision Problems Formulated in Markovian Ways
The previous section introduced a framework for sequential decision making under uncertainty. Intervals of time separate the stages at which decisions must be made, and the effect of a decision at any stage is to influence the transition from current to succeeding state. If the transition from state to state is a probabilistic sequence, and the transition probability from current to succeeding state is dependent only on the current state, then the decision process is said to have the Markovian property. This section begins by providing the basic notation and framework for sequential decision making processes which can be formulated in Markovian ways. A random process observed over time will be in different states at different times; this change in state is referred to as a state transition. Suppose a process has N states and its behavior over time is specified by the state the "system" is in at each stage in time. In this report, the stage structure for the sequential problems considered (whether they are time based or not) will be discrete. More formally, a typical random process X is a family {X_t : t ∈ T} of random variables indexed by some set T. Further, if we define T = {0, 1, 2, ...}, the system is a discrete-time process (Grimmett and Stirzaker, 1982). In sequential decision problems, we may define {X_0, X_1, X_2, ...} to be a sequence of random variables which take values in some countable set S, the state space. Thus each X_n is a discrete random variable taking on one of N possible values, where N = |S|. It is evident that this system is not deterministic, and hence its behavior can be neither prophesied nor controlled exactly. It is possible, however, to deduce some statistical description of the behavior of the process (Buchanan, 1982).
Definition: The process X is a Markov chain if it satisfies the Markov property

P(X_n = s | X_0, X_1, ..., X_{n−1}) = P(X_n = s | X_{n−1})
for all n ≥ 1 and s ∈ S. Such a chain is also described as having stationary transition probabilities. The Markovian assumption, that only the current stage (n) and the state (i) are relevant to the determination of the future behavior of the process, is described as being memoryless. Much of the theory of Markov chains is simplified by the condition that S is finite (Grimmett and Stirzaker, 1982); this assumption is attributable to the attractive use of Markov chains in sequential decision problems.

For simplicity of notation, let p_ij(n) denote the transition probability from i to j if the transition takes place at the nth step. A finite Markov chain is said to be time-homogeneous if for every pair of states i and j, p_ij(n) = p_ij for all n. The transition matrix for a Markov chain is the N × N matrix P = [p_ij], where N is the number of states; the transition probability matrix P describes the likelihood of transitions from state to state.

Theorem: P is a stochastic matrix, which is to say that
(a) P has nonnegative entries, p_ij ≥ 0, and
(b) P has row sums equal to one, ∑_j p_ij = 1.
(Grimmett and Stirzaker, 1982)

This summarizes the basic elements of Markov chains. The remainder of this section presents the favorable attributes of Markov chains which demonstrate their potential applicability, especially in infinite stage Markov decision processes.

Section 8.2: Multistep Transition and State Probabilities

The previous subsection focused on the single-step structure of the chain; this part now works toward the multistep behavior needed for infinite stage Markovian decision problems. Since our interest lies in the evolution of X, it becomes important to determine the multistep transition probability values. Define φ_ij(n) = P(X_n = j | X_0 = i) for 1 ≤ i, j ≤ N and n = 0, 1, 2, ..., as the n-step transition probabilities for the Markov chain (defined by P). If the system starts in state i at time 0 and is in state j at time (n + 1), the result is achieved by a transition in n steps from state i to some state k (for which the probability is φ_ik(n)), followed by a transition from k to j (for which the probability is p_kj). Given that the state at the nth step is k, the probability of this route to j in (n + 1) steps is φ_ik(n) p_kj. These N routes from i to j via the nth state are mutually exclusive and exhaustive; hence

φ_ij(n + 1) = ∑_{k=1}^{N} φ_ik(n) p_kj   (Buchanan, 1982).
Buchanan (1982) provides the following justification:

φ_ij(n + 1) = P{X_{n+1} = j | X_0 = i}
            = ∑_{k=1}^{N} P{X_{n+1} = j, X_n = k | X_0 = i}
            = ∑_{k=1}^{N} P{X_n = k | X_0 = i} P{X_{n+1} = j | X_n = k, X_0 = i}   (from the definition of conditional probability)
            = ∑_{k=1}^{N} P{X_n = k | X_0 = i} P{X_{n+1} = j | X_n = k}            (by the Markovian assumption)
            = ∑_{k=1}^{N} φ_ik(n) p_kj.

Since multistep transition probabilities are probabilities, they must also adhere to the regular conditions of probability; that is, they must satisfy 0 ≤ φ_ij(n) ≤ 1 and ∑_{j=1}^{N} φ_ij(n) = 1.

The next step is to formulate this in matrix form, which in turn develops the Chapman-Kolmogorov equations (Grimmett and Stirzaker, 1982). Define Φ(n) = [φ_ij(n)] as the n-step transition probability matrix for all n. From the justification above, we can rewrite the final statement (quite simply) as

Φ(n + 1) = Φ(n) P.

So, for n = 0, 1, 2, ...,

Φ(0) = I, where I is the N × N identity matrix,
Φ(1) = Φ(0) P = P,
Φ(2) = Φ(1) P = P²,
Φ(3) = Φ(2) P = P³,

and thus, more generally, Φ(n) = Φ(n − 1) P = P^{n−1} P = P^n. Hence P^{m+n} = P^m P^n, and so Φ(n) is simply the nth power of P. This can be recognized as a special case of the Chapman-Kolmogorov equations as stated in Grimmett and Stirzaker (1982):

φ_ij(m + n) = ∑_k φ_ik(m) φ_kj(n).
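The relation Φ(n) = P^n is easy to check numerically. The 2-state chain below is my own example, not from the report; the sketch verifies both the matrix-power identity and the Chapman-Kolmogorov decomposition.

```python
# A minimal numeric check that the n-step transition matrix is the nth
# matrix power, Phi(n) = P^n, and that Phi(m + n) = Phi(m) Phi(n).

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def phi(n, P):
    """n-step transition probability matrix Phi(n) = P^n, with Phi(0) = I."""
    result = [[1.0 if i == j else 0.0 for j in range(len(P))]
              for i in range(len(P))]
    for _ in range(n):
        result = matmul(result, P)
    return result

P = [[0.9, 0.1],   # a 2-state stochastic matrix: rows sum to one
     [0.4, 0.6]]

# m + n steps decomposes as m steps followed by n steps
lhs, rhs = phi(5, P), matmul(phi(2, P), phi(3, P))
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12 for i in range(2) for j in range(2))
# every Phi(n) is again stochastic: rows still sum to one
assert all(abs(sum(row) - 1.0) < 1e-12 for row in phi(5, P))
print([[round(x, 2) for x in row] for row in phi(2, P)])  # [[0.85, 0.15], [0.6, 0.4]]
```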
Now define Φ = lim_{n→∞} P^n (if it exists) and call Φ the limiting multistep transition probability matrix, or simply the limiting distribution for the chain. It is important to keep clear that the probabilities of transitions from state to state are fully described by P, while Φ summarizes the long-term behavior of the process. Thus, the above relates the long-term development to the short-term development, and informs us how X_n depends on X_0. Note that as the number of transitions increases, the influence of the original state diminishes; this is expected since the Markovian property assumes no memory to the process.

Section 8.3: Some Classes of Markov Chains

This next subsection considers the way in which the states of a Markov chain are related to each other. This will lead to a classification of the states. First we begin with some basic terminology, which essentially focuses on the term communication (which exists between states).

Definition: It is said that i communicates with j, written i → j, if the chain may ever visit j with positive probability, starting from i. That is, i → j if φ_ij(m) > 0 for some m ≥ 0. It is also said that i and j intercommunicate if i → j and j → i, in which case we write i ↔ j.

Thus, two states intercommunicate if transition is possible in either direction in some finite number of steps; the number of steps need not be the same for the two directions of communication (Buchanan, 1982). Intercommunication is transitive: Buchanan (1982) demonstrates this by noting that if i ↔ j and j ↔ k, then there exist numbers n_1 and n_2 such that φ_ij(n_1) and φ_jk(n_2) are both positive. From the Chapman-Kolmogorov equation, we get

φ_ik(n_1 + n_2) = ∑_{l=1}^{N} φ_il(n_1) φ_lk(n_2) ≥ φ_ij(n_1) φ_jk(n_2) > 0.

This establishes the communication from i to k, and hence k ↔ i. Grimmett and Stirzaker (1982) proceed by stating the following theorem.

Theorem: If i ↔ j then
(a) i and j have the same period,
(b) i is transient if and only if j is transient,
(c) i is null persistent if and only if j is null persistent.

For the first property, the period of a state is defined as the greatest common divisor of the epochs at which return is possible (Grimmett and Stirzaker, 1982); the state is said to be periodic if this divisor is greater than one and aperiodic if it is equal to one. The second is probably the most important property of the relationship of communication. The last property, which draws upon the term persistent, simply claims that state i is persistent if the probability of eventual
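The communication relation can be computed directly from P by reachability in the directed graph whose edges are the positive-probability transitions. The chain and function names below are my own illustration, not from the report.

```python
# Deciding i -> j ("i communicates with j") by reachability, and grouping
# states that intercommunicate (i <-> j) into classes.

def reaches(P, i, j):
    """True if phi_ij(m) > 0 for some m >= 0 (m = 0 covers i = j)."""
    seen, stack = set(), [i]
    while stack:
        k = stack.pop()
        if k == j:
            return True
        if k not in seen:
            seen.add(k)
            # follow every positive-probability one-step transition out of k
            stack.extend(l for l, p in enumerate(P[k]) if p > 0)
    return False

def communicating_classes(P):
    """Partition the states into classes of intercommunicating states."""
    classes, assigned = [], set()
    for i in range(len(P)):
        if i in assigned:
            continue
        cls = {j for j in range(len(P))
               if reaches(P, i, j) and reaches(P, j, i)}
        classes.append(sorted(cls))
        assigned |= cls
    return classes

# State 2 is absorbing; states 0 and 1 intercommunicate but can leak into 2.
P = [[0.5, 0.4, 0.1],
     [0.3, 0.7, 0.0],
     [0.0, 0.0, 1.0]]
print(communicating_classes(P))  # [[0, 1], [2]]
```

The asymmetry the text describes shows up here: 0 → 2 holds but 2 → 0 does not, so 0 and 2 communicate in one direction only and land in different classes.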
return to i, having started from i, is 1; if this probability is strictly less than 1, i is transient.

In any Markov chain, sets of states (or at least one set of states) can be found where all states which are members of the same set communicate with one another (Buchanan, 1982). We now define an ergodic set of states, which is said to be a set in which every state communicates with every other state of the set, and once the set is entered there can be no transition out of the set. Thus, an ergodic set of states is a set of states in which every state can be reached from every other state of the set, and an ergodic state is an element of an ergodic set; every Markov chain contains at least one ergodic state. Buchanan (1982) suggests considering ergodic states to be "collective" trapping states; single states of this type are absorbing states. It should also be noted that there may be states which do not belong to an ergodic set; these are referred to as transient states. Grimmett and Stirzaker (1982) define a state to be ergodic if it is persistent, nonnull, and aperiodic. We now provide another definition, or classification, of states which leads to a classification of chains.

Definition: A set C of states is called
(a) closed if p_ij = 0 for all i ∈ C, j ∉ C;
(b) irreducible if i ↔ j for all i, j ∈ C.

Once a chain has taken a value in a closed set C of states, it never leaves. By the equivalence i ↔ j, an irreducible set C is aperiodic if all states in the set are aperiodic. A Markov chain whose states form a single ergodic set is an ergodic chain. The final remark to be made here is with regard to defining a regular Markov chain. If such an ergodic set has all entries of the multistep transition probability matrix Φ(n) = P^n greater than zero for some n, that is, a sufficient number of transitions is reached at which all state-to-state transitions are possible, then it is said to be regular. Any additional transitions will not lead to any such transition having zero probability again: since Φ(n + 1) = P Φ(n), P^{n+1} will have no zero entries if P^n does not have any (Buchanan, 1982). An ergodic chain which is regular is subsequently termed a regular Markov chain. This concludes the introduction to Markov chains and their properties.

It is a property of regular Markov chains that the limiting multistep transition probability matrix Φ = [φ_ij] will have identical rows. Denote the state probability at time n by π_i(n). Buchanan (1982) shows that the state probability vector satisfies π(n) = π(0) Φ(n) = π(0) P^n. Taking the limit of both sides as n tends to infinity, we get π = π(0) Φ. Buchanan (1982) uses this to show that for a regular Markov chain the limiting state probability vector π is independent of the initial state probability vector π(0) and is the same as the identical rows of the limiting multistep transition probability matrix Φ.
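Both claims, identical rows of Φ and independence of π(0), can be seen numerically on a small regular chain. The 2-state example below is my own, not from the report.

```python
# Numeric illustration of the regular-chain property: the rows of
# Phi = lim P^n become identical, and the limiting state probability
# vector pi = pi(0) * Phi is the same for any initial vector pi(0).

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0.9, 0.1],
     [0.4, 0.6]]           # regular: P itself has no zero entries

Phi = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(100):       # P^100 is numerically indistinguishable from Phi
    Phi = matmul(Phi, P)

# identical rows: every starting state forgets where it began
assert all(abs(Phi[0][j] - Phi[1][j]) < 1e-12 for j in range(2))

def limit_vector(pi0):     # pi = pi(0) * Phi
    return [sum(pi0[i] * Phi[i][j] for i in range(2)) for j in range(2)]

pi_a = limit_vector([1.0, 0.0])   # start surely in state 0
pi_b = limit_vector([0.2, 0.8])   # a quite different pi(0)
assert all(abs(a - b) < 1e-12 for a, b in zip(pi_a, pi_b))
print([round(x, 3) for x in pi_a])  # [0.8, 0.2]
```

For this chain the limiting vector [0.8, 0.2] also solves π = πP, the usual stationarity condition.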
Section 8.4: Markov Decision Processes

The section now extends the properties and methods of Markov chains to develop the body of knowledge known as Markov decision processes. The formulation and solution of Markov decision problems will be illustrated by extending the decision structure. Now that Markov decision processes have been placed within a decision theoretic framework, we extend this to include the concepts of utility theory; recall that in a decision framework every action taken in the current state results in a consequence, and we now place things within the context of rewards rather than losses.

In a Markov decision process there is a system with a finite number of states, labeled i = 1, ..., N, with θ_i ∈ Θ = {θ_1, ..., θ_m}. The system or process may move between states at each of an infinite series of equal stages (or time steps) t = 1, 2, .... At each time step t, the decision maker is to select an action a_t = a_k ∈ A, where A is finite; in state i there are a number of actions available. If action k is chosen, a return r(k, i) is generated and the system goes to state j with probability p(i, j, k). (Note that the reward includes the cost of taking the action.) Depending on the action a_k chosen by the decision maker at time t and its current state θ_t, the system evolves to its next state θ_{t+1}. The probabilities

p^k_ij = p(θ_{t+1} = θ_j | θ_t = θ_i, a_k)

describe the evolution of the system. It is assumed that these probabilities hold the Markovian property, and subsequently do not depend on either the earlier states of the process or any of the earlier actions. It follows that all r_t(k, i) are bounded. Further, we define c_t(a_k, θ_i), the consequence a decision maker receives at each stage; this can be extended to show the overall "time-stream" of consequences (French and Insua, 2000).

A set of actions, one for each state, constitutes a policy. Under a given policy δ the system is a Markov process with returns and has a steady-state gain g(δ). A policy which has the highest gain (or the lowest in a minimization problem) is an optimal policy; a policy whose gain differs from the optimal by not more than a known amount is called a tolerance optimal policy. Hastings and Mello (1978) provide the formulation for the bounds and convergence test.

There are two common ways of evaluating the overall time-stream of rewards in Markov decision processes. Assume that the reward received at stage t is u_t(c_t(a_k, θ_i)) = r_t(k, i). The first way includes a discount factor ρ and a discounted additive utility function such that

u(c_1(a_1, θ_1), c_2(a_2, θ_2), ...) = ∑_{t=1}^{∞} ρ^{t−1} r_t(k, i).
The other common approach is to choose a series of actions to maximize the average reward:

u(c_1(a_1, θ_1), c_2(a_2, θ_2), ...) = lim_{T→∞} (1/T) ∑_{t=1}^{T} r_t(k, i).

The latter approach gives the long-run average monetary return, or the expected monetary value, of the policy. If it is assumed that this Markov chain is single and ergodic, then for each policy there exists a state which can be reached from any other state under the policy.

Now, a decision policy (δ_1, δ_2, ...) for a Markovian decision process is a sequence of decision rules which prescribe, for each stage t, the action the decision maker should take if the current state of the process is θ_t (French and Insua, 2000). If the decision rule selected does not depend on the stage t, then the policy is said to be stationary. A stationary policy applied to a Markov decision process determines the transition probabilities from state to state and thus, due to the attributes of Markov chains, induces a Markov chain.

Finite Markovian decision processes decide upon a policy to be applied at a (previously known) finite number of decision epochs, with the objective of optimizing some measure of performance; for a finite stage problem, the criterion for discriminating between policies was the maximization of the expected reward associated with a policy. In the infinite stage problem, however, the cumulative expected reward from using a given policy will not be finite, and hence the value of state i will increase without bound; the finite stage aspect may be too limiting a prospect, as it "unbounds" the criterion presented above. Discounting the rewards by some factor, as stated above, obviates this problem and consequently gives an expected discounted reward for each policy which is finite; the decision maker then seeks a policy for choosing actions which maximizes the total expected discounted reward. An infinite Markovian decision process poses the problem of approximation itself, yet in certain instances the infinite approximation is a feasible proposition. Since the aim is to find the optimal policy with minimal long-term average costs, there are other propositions for handling infinite Markov decision processes; linear programming is one approach used, and some of these alternatives are presented in Hastings and Mello (1978) and Buchanan (1982). French and Insua (2000) also provide a discussion of the merits of some alternative utility functions for solving Markov decision processes.

There are several algorithms (or classes of algorithms) which can be utilized for solving Markov decision processes; value iteration and policy iteration are the most common. (Computational issues are expounded in section 12.) The first of these, value iteration, considers that if the decision maker takes action a_k, he or she receives r_t(k, i) immediately, then passes to state θ_j with probability p^k_ij and continues optimally, receiving the expected total discounted reward. Beginning with an arbitrary function, the algorithm evaluates the function which defines the total expected discounted reward and seeks an improved policy until convergence is reached; due to the attributes of Markov chains, the sequence will eventually converge for any starting point. There are many variants of this algorithmic procedure. The latter algorithmic procedure, policy iteration, has a similar objective; this procedure begins with an arbitrary (stationary) policy instead of an arbitrary function as defined for the value iteration algorithm. French and Insua (2000) illustrate a couple of these variants, and Hastings and Mello (1978) present another. This describes the possible approaches for solving infinite Markov decision processes. Finite Markov decision processes are simpler and can be solved by backward dynamic programming, or backward induction, which is presented in section 13.
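The value iteration scheme just described can be sketched as follows. The two-state example, numbers, and all names are my own illustration, not the report's; it iterates the discounted Bellman update V(i) ← max_k [ r(k, i) + ρ ∑_j p(i, j, k) V(j) ] until successive value functions agree.

```python
# A minimal value iteration sketch for an infinite-horizon Markov decision
# process with discount factor rho < 1.

def value_iteration(states, actions, r, p, rho=0.9, tol=1e-8):
    V = {i: 0.0 for i in states}                  # arbitrary starting function
    while True:
        V_new = {i: max(r[k, i] + rho * sum(p[i, j, k] * V[j] for j in states)
                        for k in actions)
                 for i in states}
        if max(abs(V_new[i] - V[i]) for i in states) < tol:
            # read off a stationary optimal policy from the fixed point
            policy = {i: max(actions, key=lambda k, i=i:
                             r[k, i] + rho * sum(p[i, j, k] * V_new[j]
                                                 for j in states))
                      for i in states}
            return V_new, policy
        V = V_new

# Two states, two actions.  In state 0, action 0 pays 1 and stays put, while
# action 1 pays 2 but usually moves the system to the rewardless state 1.
states, actions = [0, 1], [0, 1]
r = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 0.0, (1, 1): 0.0}   # r[k, i]
p = {(0, 0, 0): 1.0, (0, 1, 0): 0.0,   # p[i, j, k]: from i to j under k
     (0, 0, 1): 0.2, (0, 1, 1): 0.8,
     (1, 0, 0): 0.5, (1, 1, 0): 0.5,
     (1, 0, 1): 0.5, (1, 1, 1): 0.5}
V, policy = value_iteration(states, actions, r, p)
print(round(V[0], 3), policy[0])   # 10.0 0: the steady reward beats the big one
```

The convergence the text asserts is visible in the code: each sweep is a contraction with modulus ρ, so the starting function does not matter.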
Part IV: MODEL BUILDING FOR DECISION ANALYSIS: Topics in Probabilistic Graphical Models

Methodologies for decision processes (such as those presented in prior sections) are based upon probability and utility, or different ways of handling uncertainty. Certain criteria should be satisfied as a provision for guiding decision-based problems. For instance, French and Insua list axiomatic basis, lack of counter examples, feasibility, robustness, "transparency to users", and "comparability with a wider philosophy" as criteria which should be satisfactorily determined. The first two provide the theoretical merit; the second two, feasibility and robustness, relate to the practicality of performing the analysis; while the latter two entail the implementation of the analysis results. However, none of this is applicable without a decision model upon which the criteria are built and subsequently assessed.

This part of the report focuses on building models to illustrate decision problems under various assumptions. Emphasis is placed on probabilistic graphical models. These probability models are extensions of the procedures and methodology presented in earlier sections of this report. Each section presented in Part IV describes a modeling approach for a particular formulation and framework for a given decision problem. Concepts of problem structuring, parameters and attributes lay the foundation for model building and are introduced in the description of each inception of the model building process. These latter parts of the report focus on the practicality and implications of decision processes and attempt to bridge the gap between the conceptual and practical components of decision theory.
Section 9: Graphical Representation of Decision Problems

Section 1 introduced the framework for decision problems. In this section, a formulation of a general decision problem is revisited in a concise manner in order to introduce the assumptions and criteria required for model building. Without forgoing the complexities of modeling such problems, the components of a decision problem provide the construction for a graphical representation. An introduction to graphical models for describing decision processes is illustrated through the conception of decision trees and influence diagrams; a comparative assessment of these two formulations demonstrates the strengths and weaknesses of the approaches in both structural and practical settings.

Section 9.1: Modeling Decision Problems

Sections 1 through 3 demonstrated the key concepts and ideas pertaining to decision modeling. In essence, decision problems consist of actions, states of nature and consequences; the terminology may vary, however, the distinction between these elements is common. That is, the interaction between an action and a state of nature leads to a consequence, so that with respect to the three elements (actions, states and consequences) a decision model can be concisely formulated as a ⊕ θ → c. Recall that the choice of action or decision is under the control of the decision maker, but the state which pertains is beyond his/her control. This leads to a model of decision making in which there is some representation of uncertainty relating to the unknown state, together with some representation of preference between the possible consequences. This separation is found throughout statistical decision theory.

When both the action space A and the state space Θ are finite, the model can be presented in a decision table, as shown below.

                        States of Nature
   Actions     θ_1      θ_2      ...      θ_n
   a_1         c_11     c_12     ...      c_1n
   a_2         c_21     c_22     ...      c_2n
   ...         ...      ...      ...      ...
   a_m         c_m1     c_m2     ...      c_mn

The simplest discussions of decision theory assume that the decision maker chooses an action (row) in the table, whereupon nature will 'choose' a state (column), leading to the consequence c_ij. This tabular representation of decision problems has very close conceptual and historical connections with game theory (which is not explored in this report).
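A decision table pairs naturally with the report's running theme of minimizing expected loss. The sketch below is my own illustration (the 2 × 3 table and its numbers are invented): it treats the table entries as losses and picks the row that is best against a prior over the columns.

```python
# A decision table as a matrix of losses loss[a][theta], with the Bayes
# choice of action (row) minimizing expected loss under a prior over the
# states of nature (columns).

def bayes_action(loss_table, prior):
    """Return the index of the action with smallest prior expected loss."""
    expected = [sum(l * p for l, p in zip(row, prior)) for row in loss_table]
    return min(range(len(expected)), key=expected.__getitem__)

# rows: actions a1, a2; columns: states theta1, theta2, theta3
loss = [[0.0, 5.0, 9.0],    # a1: excellent under theta1, poor otherwise
        [4.0, 4.0, 4.0]]    # a2: a safe constant loss

print(bayes_action(loss, [0.7, 0.2, 0.1]))  # 0: a1 wins when theta1 is likely
print(bayes_action(loss, [0.2, 0.4, 0.4]))  # 1: the safe a2 wins otherwise
```

Changing the prior flips the chosen row, which is precisely the sense in which the uncertainty representation and the preference representation are brought together to guide choice.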
Conceptually, neither the actions nor the states of nature need to be thought of quantitatively. Similarly, the "consequences" considered thus far need not have been considered to be numerical; that is to say, they may be thought of as descriptions of the outcome of actions chosen under various possible "states of nature". Then the previously defined sets C and A can be taken in a general sense, without any further mathematical structure other than the existence of suitable (σ-)fields required for the introduction of probability measures. The notation c : A × Θ → C is commonly used; however, other authors (such as Savage) employ an equivalent but more succinct notation which identifies the actions with maps from the state space to the consequence space: a : Θ → C. Notably, neither notation recognizes the possibility that the set of possible states may depend upon the action chosen (French and Insua, 2000). The tabular representation (shown above) is replaced by a functional one when the action and state spaces are infinite.

Statistical decision theory, on the other hand, assumes that an observation X = x from an observation space X is observed according to the conditional distribution P_X(· | θ). Thus, decision problems can be formulated with D = A^X = {d | d : X → A}, which relates the decision maker's actions to each possible observation and presents the set of all possible choices. Consequently, a choice d ∈ D induces a probability distribution for each pair (d, θ) ∈ D × Θ as follows:

P_C(c′ | θ) = ∫_{{x : c(d(x), θ) = c′}} dP_X(x | θ).

Thus, this structure of statistical decision theory reduces to the general form of a decision problem (French and Insua, 2000).

Section 9.2: Model Building – Fundamental and Means Objectives

Modeling a decision problem (in practicality) requires three fundamental steps, as stated by Robert T. Clemen in Making Hard Decisions. These can be summarized concisely as (1) identifying and structuring the values and objectives, (2) structuring the elements of the decision situation into a logical framework, and (3) refining the precise definition of all of the elements of the decision model. The aforementioned steps not only provide the foundation for the decision making process but also provide the construction process for a graphical representation of decision problems and methodology.

Much of the work presented in this report has focused on single objective problems. Often, however, multiple objective or multiattribute problems surface which encompass conflicting goals. For instance, an investor may wish to maximize the financial return of an investment portfolio but also minimize the volatility of the portfolio's value. When considering multiple objectives, the consequence space becomes complex and presents a multiattributed structure. Thus, it
It is important to address the issue of value structuring, which is necessary in order to display complex decision processes graphically. When considering multiple objectives, the consequence space becomes complex and presents a multiattributed structure. An attribute refers to a factor to be considered; when it is given a "direction" of preference, say minimize or maximize, it is termed an objective. Once a set of objectives, consistent with the decision context, has been established, the problem arises of structuring these objectives in a consequential manner. This process of arranging attributes holds cognitive advantages through graphical modeling and representation.

First, the objectives should be separated into means objectives and fundamental objectives. This step provides a value-structuring process, distinguishing between those objectives that are important because they help achieve other objectives and those that are important because they reflect the goal: what we really want to accomplish. Fundamental objectives are organized into hierarchies: the upper levels in a hierarchy represent more general objectives, and the lower levels explain or describe important elements of the more general levels. In contrast to a hierarchical organization, means objectives are organized into networks; that is, a means objective can be connected to several objectives in order to accomplish the task at hand. For example, means objectives may include "Minimize Accidents" and "Maximize Use of Vehicle Safety Features", and there are several other possible means objectives which may work together (as a network) in order to accomplish the maximization of safety.

Figure 9-1: A fundamental-objectives hierarchy. "Maximize Safety" decomposes into "Minimize Loss of Life", "Minimize Serious Injuries", and "Minimize Minor Injuries", each considered separately for adults and children.
Recall that the consequence space C has, thus far, simply been stated as a set of objects which can be defined on an interval of the real line. To clarify the structuring of objectives, consider Figure 9-1, which illustrates a possible hierarchy that might arise in the context of defining vehicle regulations. A higher-level fundamental objective might be "Maximize Safety", where lower-level fundamental objectives might be "Minimize Loss of Life", "Minimize Serious Injuries", and "Minimize Minor Injuries". Furthermore, these can be separated so as to pertain to the particular issues of a particular group; in this example, the objectives are parsed out separately for adults and children.
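A hierarchy such as the one in Figure 9-1 can be represented as a simple nested structure, and traversing it yields the lowest-level objectives, which are the candidates for measurable attributes. A minimal sketch, with the labels taken from the vehicle-safety example:

```python
# A fundamental-objectives hierarchy as a nested dict: upper levels are more
# general objectives, lower levels describe elements of the more general ones.
hierarchy = {
    "Maximize Safety": {
        "Minimize Loss of Life": {"Adults": {}, "Children": {}},
        "Minimize Serious Injuries": {"Adults": {}, "Children": {}},
        "Minimize Minor Injuries": {"Adults": {}, "Children": {}},
    }
}

def leaves(tree, path=()):
    """Yield the lowest-level objectives of the hierarchy as full paths."""
    for name, sub in tree.items():
        if sub:
            yield from leaves(sub, path + (name,))
        else:
            yield path + (name,)

for leaf in leaves(hierarchy):
    print(" / ".join(leaf))
# first line printed: Maximize Safety / Minimize Loss of Life / Adults
```

Each leaf path is one cell of the multiattributed consequence space; an attribute scale would then be attached to each leaf.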
Figure 9-2: A means-objectives network for vehicle safety. "Maximize Safety" is served by means objectives such as "Maximize Vehicle Safety", "Minimize Accidents", "Motivate Purchase of Safety Features", "Maintain Vehicle", "Maximize Driving Quality", "Require Safety Features", "Educate Public", "Enforce Traffic Laws", "Reasonable Traffic Laws", and "Minimize Driving Under Alcohol Influence".

Structuring the fundamental objectives into a hierarchy or tree is crucial for a multiple-objective decision process. In reality, problems are much more complex in structure than the one presented here, and in some instances a well-measured means objective can substitute for a more difficult-to-measure fundamental objective (Clemen, 1996). The means-objectives network also provides an important basis for generating creative new alternatives. Although much of this discussion is intuitive, there are methods or techniques which facilitate this thinking process. Clemen (1996) presents Table 9-1, which summarizes four questions for organizing the two types of objectives, useful in constructing means-objectives networks and fundamental-objectives hierarchies; French and Insua (2000) refer to these as "top-down" questions. Brownlow and Watson (1987) argue that structuring attributes in this way eases the "cognitive overload" brought about by the complexities of a decision-making process. The graphical interface that this methodology induces, however, still requires the building of the model of the decision problem.

Table 9-1: Constructing Means-Objectives Networks and Fundamental-Objectives Hierarchies

Fundamental objectives:
  To move downward in the hierarchy, ask: "What do you mean by that?"
  To move upward in the hierarchy, ask: "Of what more general objective is this an aspect?"
Means objectives:
  To move away from fundamental objectives, ask: "How could you achieve this?"
  To move toward fundamental objectives, ask: "Why is that important?"
It is obvious that attributes must hold certain properties or meet certain requirements in order to be useful. Recall that the intention is to quantify such attributes; hence, they must be measurable on some scale for each consequence. French and Insua (2000) distinguish three types of scale, noting that they may be objective or subjective measures. The first, a natural scale, provides a direct measure, such as cost. The second, a proxy, uses an attribute which is perceived to describe the objective and is measurable. The third, a constructed or subjective scale, defines the objective as well as indicating the impact of a specific consequence, e.g. minimizing caregiver burden; such scales are created specifically within the context of the decision problem. Keeney and Raiffa (1976) affirm that no two attributes should measure the same aspect, and yet attributes should distinguish between consequences. Other practical concerns, such as decision ownership and feasibility, although important in the practical setting, are not explored in this report.
Section 9.3: Model Building – A Graphical Approach

The process of specifying, structuring, and sorting out the means objectives from the fundamental objectives is the initial step of building a graphical model of the decision problem. The focus now turns to the graphical representation of the elements of a decision problem: decisions or alternatives, uncertain events or outcomes, and consequences. Decision trees and influence diagrams both provide a graphical approach to modeling decision problems, and both approaches require these elements of a decision problem; they are useful aids in model building. The most common graphical representation of decision problems is a decision tree; in recent years, however, it has been superseded by influence diagrams. This (sub)section focuses on the common elements used in such model-building processes, while the following two sections describe a general procedure for constructing and modifying influence diagrams and decision trees to model the structure of a decision problem. Later sections delve into the specifics of each modeling approach and conclude with a comparative assessment.

Although notation varies from source to source, there are three types of nodes which are used in both influence diagrams and decision trees:
• Square nodes represent decisions (or actions) and mathematically are associated with a decision set D.
• Circular nodes represent random quantities or events, which are associated with a random outcome X.
• Diamond or rounded-square nodes represent the consequences (value) of the decision process, represented by the set of all possible consequences C.
These shapes are generally referred to as decision nodes, chance nodes, and consequence nodes. Nodes are integrated into graphs or networks, connected by arrows/branches or (directed) arcs, and are arranged so that they coincide with the progression of time.

Section 9.4: Constructing Influence Diagrams
Influence diagrams provide a visual perspective of the decisions, uncertain events (states of nature), and consequences, and of the interrelationships that exist within a decision problem, in a compact form (Marshall and Oliver, 1995). From a practical viewpoint, model building for decision problems can become quite complex, and complex decision problems have the innate ability to become "messy". Influence diagrams provide a format whereby a large volume of information can be summarized into its relevant and sufficient elements, and an appropriate model ensures that the probabilistic structure of the decision problem is captured.

Influence diagrams hold many advantages. First, they present the alternative situations and are easily interpretable. Secondly, they provide a framework by which a decision maker can reject or confirm assumptions and accurately model dependencies through a graphical display: the timing of available information and the interdependence of the decisions that can be taken, which may arise under certain states of nature. The use of algorithms and numerical techniques also enables an efficient and simple analysis.

The design and construction of influence diagrams are based on the aforementioned three basic elements: decision nodes, chance (states of nature) nodes, and consequence nodes, connected by arcs. A node at the beginning of an arc is qualified as a predecessor, and a node at the end of an arc is referred to as a successor. In general, an arc can represent either relevance or sequence (Clemen, 1996); the direction of the arc indicates which meaning applies. The rules for using arcs to represent relationships among the nodes are shown in Figures 9-3 and 9-4.

Figure 9-3: Relevance. A first cluster of nodes A, B, C and a second cluster D, E, F.
An arrow pointing into a chance node indicates relevance: the predecessor is relevant for assessing the chances associated with the uncertain event. For instance, the first cluster of nodes in the figure above shows an arc from A to C, which suggests that the chances (states of nature) associated with C may be different for different outcomes of A. Likewise, an arrow pointing from a decision node to a chance node means that the chosen decision is relevant for assessing the chances associated with the succeeding uncertain event.
In the diagram above, the choice taken in decision B is relevant for assessing the chances associated with C's possible outcomes. Relevance arcs can also point into consequence or calculation nodes, indicating that the consequence or calculation depends on the specific outcome of the predecessor node; the second cluster of nodes shows that consequence node F depends on both decision D and event E.

Arrows that point to decisions represent information available at the time of the decision, and hence represent sequence (Figure 9-4). Such an arrow indicates that the decision is made knowing the outcome of the predecessor node. An arrow from a chance node to a decision means that, from the decision maker's point of view, all uncertainty associated with the chance event is resolved and the outcome is known when the decision is made. An arrow from one decision to another decision, such as decision nodes G and I, simply means that the first decision is made before the second. Thus, the sequential ordering of decisions is shown in an influence diagram by the path of arcs through the decision nodes.

Figure 9-4: Sequence. Chance node H between decision nodes G and I.

When the decision maker has a choice to make, the choice would normally be made on the basis of the information available at the time. This is the case, as illustrated in the figure above, with chance node H and decision node I: the decision maker waits to learn the outcome of H before making decision I.

A simple procedure for constructing an influence diagram to model the structure of a decision problem is stated in Decision Making and Forecasting by Marshall and Oliver (1995). These sequential steps are summarized below.
1. Create a preliminary list of decisions and random events (or quantities of interest) whose outcomes are believed to be important in the formulation of the problem. Name each random quantity and decision.
2. Represent each random quantity with a circular node and each decision with a square node. Draw them in order of occurrence from left to right.
3. Identify any influences or dependencies between random quantities and decisions.
4. Insert directed arcs between nodes that influence one another, with the direction corresponding to the natural influence believed.
5. Identify the attributes and objectives that are to be used to measure the consequences of the decisions and outcomes.
6. Determine any conditional independencies and represent them accurately.
In addition to these steps, two checks should be made on the completed diagram. First, check that any decision node that occurs before a later decision node has a directed arc from the former to the latter; similarly, a chance node known at a given decision node must be known at every later decision node. Secondly, check that there are no directed cycles, that is, no connected set of arcs forming a directed path that leads out of one node and back to itself: an influence diagram should not contain any cycles.

Clemen (1996) offers some other remarks on the construction of influence diagrams. Although the construction of an influence diagram may be technically correct, there is no clear-cut supposition that the diagram constructed is the only correct one; it cannot be stated that there is a unique correct diagram, as there are many ways in which a diagram can appropriately represent a decision problem. This process is fairly simple, and alternative diagrams can be easily drawn. The representation that is most appropriate is the one that is requisite for the decision maker: a requisite model contains everything that the decision maker considers important in making the decision (Raiffa and Schlaifer, 2000), and incorporating all of the important concerns¹ of the decision is the only way to get a requisite model and an adequate representation of the problem. Furthermore, the nature of an arc, relevance or sequence, can be ascertained from the context of the arc within the diagram.

¹ Sensitivity analysis is helpful in determining which elements are important.
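Checking an influence diagram for directed cycles can be automated with a depth-first search over its arc list. A minimal sketch, with node labels purely illustrative:

```python
# Check an influence diagram's arc list for directed cycles via depth-first
# search. Returns True if a cycle exists. Node labels are illustrative.

def has_cycle(nodes, arcs):
    children = {v: [] for v in nodes}
    for i, j in arcs:
        children[i].append(j)
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on current path / done
    color = {v: WHITE for v in nodes}

    def visit(v):
        color[v] = GREY
        for w in children[v]:
            if color[w] == GREY:          # back arc: a path leads back to itself
                return True
            if color[w] == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in nodes)

# A small acyclic diagram: decision D1 -> chance H -> decision I -> consequence C.
print(has_cycle(["D1", "H", "I", "C"], [("D1", "H"), ("H", "I"), ("I", "C")]))  # False
print(has_cycle(["A", "B", "C"], [("A", "B"), ("B", "C"), ("C", "A")]))         # True
```

A GREY node found among a node's children means the search has looped back onto the current path, which is exactly the forbidden directed cycle.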
Section 9.5: Constructing Decision Trees

Influence diagrams are an excellent display of a decision problem's basic structure; however, they hide much of the detail which may be pertinent to a decision maker. To display more of these details, a decision tree can be constructed. Yet decision trees hold the drawback that the size of the problem cannot always be practically represented. The design of a decision tree holds much of the character of an influence diagram: squares represent decisions to be made, circles represent chance nodes, and consequence nodes are found at the ends of the branches which connect the nodes. The interpretation of decision trees follows much the same path as explained in Section 9.4 on influence diagrams.

In designing a decision tree, the options represented by branches from a decision node must be such that the decision maker can choose only one option, although in some instances combination strategies are possible. Similarly, each chance node must have branches that correspond to a set of mutually exclusive and collectively exhaustive outcomes. Mutually exclusive means that only one of the outcomes can happen; collectively exhaustive means that no other possibilities exist, so that one of the specified outcomes has to occur. Putting these two specifications together means that when the uncertainty is resolved, one and only one of the outcomes occurs. This is a requirement of the principle of coherence (Marshall and Oliver, 1995).
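A computational counterpart of this coherence requirement is to verify that a chance node's branch probabilities are nonnegative and sum to one. A minimal sketch, with illustrative outcome labels and probabilities:

```python
import math

# Coherence check for a chance node: branch probabilities must be nonnegative
# and sum to one (mutually exclusive, collectively exhaustive outcomes).
# Outcome labels and probabilities below are illustrative.

def coherent(branches):
    """branches: dict mapping outcome label -> probability."""
    probs = branches.values()
    return all(p >= 0 for p in probs) and math.isclose(sum(probs), 1.0)

print(coherent({"theta1": 0.2, "theta2": 0.5, "theta3": 0.3}))  # True
print(coherent({"theta1": 0.2, "theta2": 0.5}))                 # False: an outcome is missing
```

`math.isclose` guards against floating-point rounding in the sum.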
A decision tree represents all possible paths that the decision maker might follow through time, including all possible decision alternatives and outcomes of chance events. Each path represents a particular scenario that could unfold over time, and it is sometimes useful to think of the nodes as occurring in a time sequence. In a complicated decision situation with many sequential decisions or sources of uncertainty, the number of potential paths may be very large. Much like (sequence) influence diagrams, decision trees are constructed in chronological order, and nodes are connected by branches. Some basic decision trees will be described later in this section, in conjunction with influence diagrams, more illustratively.

Including multiple objectives in a decision tree is straightforward: at the end of each branch, simply list all of the relevant consequences. Consequently, each alternative must be measured on every objective. An easy way to accomplish this systematically is with a consequence matrix, as in the figure below; each column of the matrix represents a fundamental objective, and each row represents an alternative. Evaluating the alternatives requires "filling in the boxes" in the matrix.

Figure 9-5: A decision tree representation of the FAA's multiple-objective bomb-detection decision (Clemen, 1996). The choice of "Best System" branches into Systems 1 through 5, each measured on the objectives Detection Effectiveness, Time to Implement, Passenger Acceptance, and Cost.
Marshall and Oliver (1995) provide a procedure for constructing decision trees, applicable once the set of random events and actions has been defined together with their sets of probabilities and respective utilities (costs/payoffs). The steps include:
1. Draw a branch for each possible action from the first (leftmost) decision node.
2. If the node at the end of a branch is a chance node, do the same, drawing a branch for each possible outcome.
3. Label each branch emanating from a decision node with its respective loss or consequence, and label each branch emanating from a chance node i with a unique element from the set θ_i ∈ Θ, with the corresponding probability and consequence.
4. Place the consequence values from the set on the terminal nodes at the end of each branch.
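Once such a tree is built, it can be evaluated by "rolling back": chance nodes are replaced by their expected payoff, and decision nodes by the payoff of the best available action. A minimal sketch, with an illustrative tree structure and numbers:

```python
# Roll back a small decision tree: chance nodes average over their outcomes,
# decision nodes pick the action with the highest expected payoff.
# The tree layout and all numbers are illustrative.

def rollback(node):
    """Return the expected payoff of a (sub)tree."""
    kind = node["type"]
    if kind == "terminal":
        return node["payoff"]
    if kind == "chance":
        # Expected value over mutually exclusive, collectively exhaustive branches.
        return sum(p * rollback(child) for p, child in node["branches"])
    if kind == "decision":
        # Choose the branch (action) maximizing expected payoff.
        return max(rollback(child) for _, child in node["branches"])
    raise ValueError(kind)

tree = {
    "type": "decision",
    "branches": [
        ("a1", {"type": "chance",
                "branches": [(0.6, {"type": "terminal", "payoff": 100}),
                             (0.4, {"type": "terminal", "payoff": -20})]}),
        ("a2", {"type": "terminal", "payoff": 40}),
    ],
}

print(rollback(tree))  # a1 gives 0.6*100 + 0.4*(-20) = 52 > 40, so 52.0
```

Working from the terminal nodes back toward the root mirrors the right-to-left evaluation order of the chronological tree.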
The designation between decision and chance nodes is critical. The sequence of decisions is shown in a decision tree by the order in the tree from left to right. Placing a chance event before a decision means that the decision is made conditional on the specific chance outcome having occurred; conversely, if a chance node is to the right of a decision node, the decision must be made in anticipation of the chance event. If chance events have a logical time sequence between decisions, they may be appropriately ordered. If no natural sequence exists, then the order in which they appear in the decision tree is not critical, although the order used does suggest the conditioning sequence for modeling uncertainty (Clemen, 1996). The key to constructing a decision tree from an influence diagram is the specification of every possible outcome of every node: chance outcomes from each chance node, actions from each decision node, and all possible consequences defined. As stated in Marshall and Oliver (1995), a path in a decision tree is a series of decisions and random outcomes that leads from a single initial chance or decision node to a distinct end point or terminal node.

Section 9.6: Comparing Problem Structuring

Decision problems can be modeled by use of graphical methods, such as those presented above; it is beneficial to construct either a decision tree or an influence diagram and bring the structure of the decision problem to the forefront. In this section, a general formulation of a decision problem is presented in both graphical forms, providing the basis for a comparative assessment of these approaches. Figure 9-6 below provides a decision tree representation of the general problem presented in the table in Section 9.1. The decision tree displays the different possible scenarios but does not provide a clear picture of the interrelations and influences between the uncertainties and decisions: the one fundamental problem with decision trees is that the dependencies among the branches on each path are not made evident, as they are in influence diagrams.
Figure 9-6: A decision tree for the general problem. Actions a1, a2, a3, ..., am branch from the initial decision node; from each action a chance node branches over the states θ1, θ2, ..., θn, terminating in consequences c11, c12, ..., c1n, and so on.

The mn possible scenarios (histories) of the decision tree are represented by the mn possible paths in the tree. From the tree alone, however, it is not immediately clear, for analysis purposes, whether the m chance nodes at the ends of the action branches represent the same uncertainty about θ, that is, whether the probability distribution of θ is conditioned on the choice of action. The influence diagram in Figure 9-7 makes this clear: the probability distribution of θ is not conditioned on the choice of action.
Figure 9-7: An influence diagram for the general problem, with action node a, state node θ, and consequence node c.

The arrows in Figure 9-7 indicate that the consequence depends on the choice of action and the unknown state, while the absence of an arrow from the action node to the state node indicates that the uncertainty about the state is not conditioned on the choice of action. Generally, in influence diagrams, an arrow entering a chance node indicates that the probability distribution represented by that node is conditional on the quantity (random variable or decision) at the other end of the arrow. An arrow entering a decision node indicates that the quantity at the other end of the arrow is known to the decision maker at the time of the decision. An arrow entering a value node indicates that the quantity in the value node is partially determined by the quantity at the other end of the arrow.

Figures 9-8 and 9-9 provide a decision tree and an influence diagram², respectively, for a statistical decision problem; these emphasize a further difference between the two representations.

Figure 9-8: A decision tree for a statistical decision problem. An observation X (with possible values x1, x2, ..., xq) is followed by a choice of action a, a chance node over the states θ1, θ2, ..., θn, and consequences c11, ..., c1n. Figure 9-9: the corresponding influence diagram, with nodes X, a, θ, and c.

From the tree, it is clear that the observation X is made before an action is decided upon. So, in this case, the uncertainty concerning θ differs between chance nodes, being conditional upon the observed value of X. Within the influence diagram, it might appear that the arrow from X to a indicates that X becomes known to the decision maker before the action a is chosen.

² An influence diagram can be used to represent the structure of a decision maker's knowledge or beliefs, in probabilistic terms, simply by avoiding the use of decision or value nodes. Such a diagram is referred to as a belief net, and its use is most common in the field of artificial intelligence.
However, if this interpretation were made for all influence arrows, it would have to be assumed that the uncertainty about θ is resolved and becomes known to the decision maker before X is observed, which it does not.

In summary, a decision tree representation displays two facets of a problem particularly clearly: contingency, and the temporal relationships between decisions and realizations of random events. An influence diagram representation hides these facets but does bring influences, or conditionality, to the forefront. Together they bring complementary perspectives on the issues facing a decision maker.
Decision trees have a difficulty in that, for many problems, they rapidly become very large: too large for the eye to comprehend as one, so that decision trees are often displayed as a series of subtrees. As such, they have been described as a "bushy mess" (French and Insua, 2000). As previously commented, decision trees get "messy" much faster than do influence diagrams as decision problems become more complicated. Influence diagrams are a much more compact representation and are particularly valuable for the structuring phase of problem solving and for representing large problems. Moreover, the graphical representation of influence diagrams is regarded as especially easy to understand, regardless of one's mathematical training (Clemen, 1996).

The discussion and the examples shown demonstrate, on the surface at least, that decision trees display considerably more information than do influence diagrams. However, their advantage in this respect is, in a sense, illusory, as influence diagrams and decision trees are isomorphic: a properly built influence diagram can be converted into a decision tree, and vice versa, although the conversion may not be easy. Indeed, even one of the most complicated decision trees may not be capable of showing all of the intricate details contained in an influence diagram depicting the same decision problem.

Yet influence diagrams, as commonly formulated, cannot model asymmetric problems, i.e. problems in which a particular choice of action at a decision node makes available different choices of action at subsequent decision nodes than those available after an alternative choice (Raiffa and Schlaifer, 2000). Decision trees can represent asymmetric decision problems, and it is the asymmetry within problems that causes decision trees to grow into bushy messes. Such asymmetric problems are rather the rule than the exception in decision analysis, although within the classes of problems commonly considered in statistical decision theory, asymmetry is less common.
Influence diagrams and decision trees thus provide two approaches for modeling a decision. Because the two approaches have different strengths, and they complement each other well, one may be more appropriate than the other depending on the modeling requirements of the particular situation. For example, if it is important to communicate the overall structure of a model to other people, an influence diagram may be more appropriate, whereas careful reflection and sensitivity analysis on specific probability and value inputs may work better in the context of a decision tree. Using both approaches together may prove useful: one strategy is to start with an influence diagram to help understand the major elements of the situation and then convert to a decision tree to fill in the details. Both methods are worthwhile, and they should be regarded as complementary techniques rather than as competitors in the decision modeling process. In practice, the ultimate decision need not depend on the representation; after all, the goal is to make sure that the model accurately represents the decision situation.
Section 10: Probabilistic Graphical Models

The axiomatic developments presented in Part II justify the subjective expected utility representation of a rational decision maker's preferences in a decision problem. The decision model presented in Part I, and revisited in the previous section, concedes that the consequences of a decision are assumed to be determined by the action chosen and the unknown state: a ⊕ θ → c. Such a model has been described as having three essentials: the action {A}, state {Θ}, and consequence {C} spaces. As stated in Section 9, the graphical representations presented so far relate these elements but ignore the structure within the state, observation, or consequence spaces. This section focuses on the process of structuring the "unknowns" into parameters and observations, thereby facilitating probability model building; this leads to probability modeling and the representation of probabilistic graphical models from a Bayesian perspective. The development of models such as Bayesian networks (belief nets), together with multilevel or hierarchical models and exchangeability, is the emphasis of the later portion of this section.

Section 10.1: Subjective Expected Utility Model – Rebuilding the Decision Framework

The modeling of a decision maker's preferences or beliefs is justifiably represented through the established subjective expected utility axioms. French and Insua (2000) restate the previously defined decision model in a more general framework containing parameters and observable variables, and proceed to develop comments on such a modeling framework. Drawing upon the scientific process of modeling, consider a model with parameters θ and input variables α of the form M(α, θ): the model M(·,·) is used to predict a set of observed values y = M(α, θ), where the α values are observed while the parameters θ contain information obtained from previous studies. It is proposed that in decision modeling there is no correct model; such models simply provide a means of encoding the current assumptions and stating the understood relationships. These models are descriptive in nature and are further taken to be deterministic, except when allowing a prescriptive interpretation of probability. If α_a is defined to be the part of α that is fixed by choosing the action a ∈ A, and α_u to be those random quantities which remain unknown, then partitioning α into (α_a, α_u) relates back to the model previously defined, and relabeling α_u as θ gives M(α_a, θ). That is, the uncertainty lies within θ, whereas α_a is known.
The decision model can thus be restated as

c(α, θ) = M(α_a, θ).

In this way, the section revisits the decision model framework with emphasis on the statistical elements of parameters and observations.
Both the general form of the model and the consequence model c(α, θ) = M(α_a, θ) are given to be deterministic, unless accentuated by a "propensity interpretation of probability" (French and Insua, 2000); the relationship between the deterministic model M_X(·,·) and the ascribed joint probability distribution P_{θ,X} thereby encodes the decision maker's uncertainty and preferences. Decision making under uncertainty is quantified through the development of probability distributions, and this epitomizes the intrinsic nature of statistical decision theory.

The justification for the use of conditional probabilities is provided by the following axiom (French and Insua, 2000): for all R, S, T ∈ Q,

(R | T) ≽ (S | T) ⇔ R ∩ T ≽ S ∩ T.

DeGroot (1970) proceeds with the following theorem, based on the axioms stated in Part II of this report, asserting that there exists Pθ(·), a probability distribution on Θ, such that
(a) for all R, S ∈ Q: R ≽ S ⇔ Pθ(R) ≥ Pθ(S), and
(b) for all R, S, T ∈ Q with T ≻ ∅: (R | T) ≽ (S | T) ⇔ Pθ(R | T) ≥ Pθ(S | T).
DeGroot's next step is to construct the utility function. Based on the axiomatic developments in Part II, the subjective expected utility model takes the following form:

Choose a = arg max_a E_θ[u(c(a, θ))].

The joint distribution of the unknown states of nature θ and the observed random variable X is expressed through the joint distribution P_{θ,X}(θ, X). The observed value X is subtly introduced into the model by redefining the parameter θ to include the observations, writing θ~ = (θ, X). Including the observed value X = x, and writing δ(x) = a for the action taken after observing x, gives the subjective expected utility model

Choose a = arg max_a E_θ[u(c(a, θ)) | x].

Note that the loss function is l(a, θ) = −u(c(a, θ)).

Section 10.2: Building the Probability Framework

The decision model elements A, Θ, and C (defined as the action, state, and consequence spaces), which constitute the parameters and the random and/or observed variables of the model, have been prescribed as vectors of real or integer numbers or as categorical variables.
This brings us back to the development of the statistical decision problem presented in Part I, within the context of the axiomatic developments found in Part II.
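For finite spaces, the subjective expected utility rule of Section 10.1, choose a = arg max_a E_θ[u(c(a, θ)) | x], can be computed directly once a prior and a likelihood are specified. A minimal sketch, with all priors, likelihoods, and utilities illustrative:

```python
# Choosing the Bayes action a = argmax_a E[u(c(a, theta)) | x] over finite
# spaces. The prior, likelihood, and utility numbers are illustrative.

prior = {"theta1": 0.5, "theta2": 0.5}
likelihood = {"theta1": {"x1": 0.8, "x2": 0.2},   # P(x | theta)
              "theta2": {"x1": 0.3, "x2": 0.7}}
utility = {("a1", "theta1"): 10, ("a1", "theta2"): -5,
           ("a2", "theta1"): 0,  ("a2", "theta2"): 2}

def posterior(x):
    """P(theta | x) by Bayes' theorem: posterior proportional to prior x likelihood."""
    unnorm = {t: prior[t] * likelihood[t][x] for t in prior}
    z = sum(unnorm.values())
    return {t: w / z for t, w in unnorm.items()}

def bayes_action(x, actions=("a1", "a2")):
    """argmax over actions of the posterior expected utility given x."""
    post = posterior(x)
    return max(actions, key=lambda a: sum(post[t] * utility[(a, t)] for t in post))

print(bayes_action("x1"))  # posterior favors theta1, so the risky action a1 wins
print(bayes_action("x2"))  # posterior favors theta2, so the safe action a2 wins
```

Maximizing expected utility here is equivalent to minimizing the expected loss l(a, θ) = −u(c(a, θ)) noted above.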
As opposed to the descriptive structure of decision modeling construed in earlier sections, the prescriptive approach discussed here is indicative of the structure of a decision maker's thoughts or belief process (French and Insua, 2000). The subjective expected utility model accounts for the decision maker's uncertainties and the confounding effects within the modeling framework. The development of the distributions P_{θ,X}(θ, X), Pθ(θ | X), and P_X(X | θ) is intrinsic in determining the confounding effects of both the observations and the parameter estimates; furthermore, these distributions aid in developing the structural form of the utilities and in the assessment of the problem. A review of the development of the form or structure of the joint probability distribution (the Bayesian perspective) leads to the construction of probability models and an understanding of the graphical approach.

The construction of the joint probability distribution elicits Bayes' theorem, which relates the joint distribution to the marginal and conditional distributions: the joint probability of two events A and B is given as

Pθ(A, B) = Pθ(A | B) Pθ(B).

This concept of conditional probability within the utility context is introduced by DeGroot (1970) in his presentation of the subjective expected utility axioms, as shown in Section 10.1. Bayes' theorem tells us that "our revised belief for A", the posterior probability Pθ(A | B), can be obtained as the product of the prior probability Pθ(A) and the ratio Pθ(B | A) / Pθ(B), where the numerator of the ratio is the likelihood of event A. Thus, the posterior probability can be expressed in the following form:

posterior ∝ prior × likelihood
Pθ(A | B) ∝ Pθ(A) × Pθ(B | A).

This process of identifying conditional independence allows the beliefs of a decision maker to be represented in the form of a probability distribution.
2000) and resides in the relationship between the construction of probability distributions and utility functions.Part IV: Model Building for Decision Analysis accounts for the decision maker’s uncertainties and the confounding effects within the modeling framework. the symmetry of the condition given above is immediate. X (θ . Thus. denoted A ⊥ B. The concept of conditional probabilistic independence extends further the modeling structure of a decision maker’s beliefs (French and Insua. The construction of the joint probability distribution elicits the Bayes’ theorem. The Bayes’ theorem rearranges this form to demonstrate that Pθ ( A  B) = Pθ ( B  A) Pθ ( A) . Pθ ( B) The events A and B are independent.
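As a small numerical sketch of the prior-to-posterior relation above (the events and all probability values below are hypothetical, chosen only for illustration):

```python
# Posterior via Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B),
# with P(B) obtained by total probability over A and its complement.
p_A = 0.3                # prior P(A)
p_B_given_A = 0.8        # likelihood P(B|A)
p_B_given_notA = 0.2     # P(B|~A)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # 0.24 + 0.14 = 0.38
p_A_given_B = p_B_given_A * p_A / p_B                  # posterior belief for A

print(round(p_A_given_B, 4))   # 0.24/0.38 ≈ 0.6316
```

Note how the posterior is simply the prior rescaled by the ratio Pθ(B | A) / Pθ(B), exactly as in the displayed formula.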
As illustrated in section 9, decision models can be represented graphically, providing insight into the structure of the problem. It was stated, in the section on influence diagrams, that influence diagrams constitute the formation of a network through a graphical framework. A formal description of directed acyclic graphs (DAGs) is given in this section and offers a graphical approach to Bayesian inference. This section begins with a description of directed graphs as stated in graph theory and leads to their ability to illustrate the Bayesian inference process; this representation is useful when discussing causation. For completeness purposes, a few terms in relation to graph theory are provided below.

A directed graph consists of nodes and directed arcs. Formally, we may consider a network with nodes i, j = 1, ..., n and a subset of the possible ordered pairs (i, j) which specifies the directed arcs. A path consists of a sequence i1 → i2 → i3 → ... → ik; it should be noted that a path only exists if there is a link from one node to another. Such a directed graph is said to be acyclic when there exists no path with i1 = ik. A graph is said to be complete if every pair of nodes is connected by an arc; nodes with no connecting arc are disconnected. Much of the terminology surrounding graph theory is centered on kinship relationships such as parent, child, ancestor, and so on. A node in a directed graph is said to be a root if it has no parents and a sink if it has no children (Pearl, 1988). Every DAG has at least one root and one sink. The term tree is used to qualify a connected DAG in which every node has at most one parent; similarly, if every node also has at most one child, the tree is referred to as a chain.

Bather (2000) gives the following proposition: if a directed network is acyclic, then the vertices can be renumbered in such a way that i < j for each directed arc i → j. That is, the order of the nodes can be rearranged so that every directed arc runs from a smaller to a larger label. Bather (2000) goes on to prove this proposition; the proof is omitted from this report. The figure below illustrates an example of a DAG containing four nodes.

Figure 101: a DAG on the four nodes A, B, C and D.
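Bather's proposition can be checked mechanically. The sketch below (the edge list is hypothetical, in the spirit of the four-node DAG of Figure 101) renumbers the nodes of an acyclic network so that every arc i → j satisfies i < j, and reports failure on a cyclic graph:

```python
from collections import defaultdict

def topological_order(n, arcs):
    """Return an ordering of nodes 0..n-1 in which every arc goes from an
    earlier to a later position, or None if the directed graph is cyclic."""
    indeg = [0] * n
    children = defaultdict(list)
    for i, j in arcs:
        children[i].append(j)
        indeg[j] += 1
    frontier = [v for v in range(n) if indeg[v] == 0]   # the roots
    order = []
    while frontier:
        v = frontier.pop()
        order.append(v)
        for c in children[v]:
            indeg[c] -= 1
            if indeg[c] == 0:
                frontier.append(c)
    return order if len(order) == n else None

# A hypothetical acyclic network on four nodes.
arcs = [(0, 1), (0, 2), (1, 3), (2, 3)]
order = topological_order(4, arcs)
pos = {v: k for k, v in enumerate(order)}
print(all(pos[i] < pos[j] for i, j in arcs))   # True: every arc respects the order
```

A cyclic input such as `[(0, 1), (1, 0)]` makes the function return `None`, which is the contrapositive of the proposition.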
and C. B. This causal framework suggests the way influence may run within a causal network. Figure 102a: A B C Figure 102b: B C A Converging Figure102c: A B Diverging C In both the linear and diverging cases. we have ( A ⊥ B  C ) while in the converging case we have ( A ⊥ B  C ) . represents a causal network consisting of variables and directed links. a directed graph. There are three types of connections between triplets of variables A. 2001). / 103 . Two nodes (variables) are said to be separated if “new evidence on one of them has no impact on our belief of the other” (Jensen.Part IV: Model Building for Decision Analysis This mathematical structure.
DAGs illustrate the prior to posterior inference process. Consider figures 103(a) to 103(c), each of which represents the joint distribution of two random variables A and B:

Figure 103(a): A → B, factored as P(A)P(B | A).
Figure 103(b): the joint distribution P(A, B).
Figure 103(c): B → A, factored as P(B)P(A | B).

A causal interpretation follows from figure 103(a), which describes the prior beliefs; the posterior beliefs are represented by the factorization P(B)P(A | B), eliciting an inferential process. Such a representation forms the basis of Bayesian networks or belief nets.

Definition (Jensen, 2001): Two variables A and B in a causal network are said to be d-separated if for all trails between A and B there is an intermediate variable V such that either
i. the connection is serial (linear) or diverging and the state of V is known, or
ii. the connection is converging and neither V nor any of V's descendents have received evidence.

This follows the properties of conditional independence (described in section 10.2), as the quantification of uncertainty in causal structures must obey the principle that whenever A and B are d-separated, new information on one of them does not change the certainty of the other.

Section 10.4: Directed Acyclic Graphs (DAGs) for Inference on Bayesian Networks

The use of DAGs for Bayesian inference illustrates the prior to posterior inference process as demonstrated in the previous section. Bayesian networks, on the other hand, are generally more complicated than the diagrams presented above. Robert Cowell, in his article Introduction to Inference for Bayesian Networks, defines Bayesian networks as a "model representation for the joint distribution of a set of variables in terms of conditional and prior probabilities, in which the orientations of the arrows represent influence, usually though not always of a causal nature, such that these conditional probabilities for these particular orientations are relatively straightforward to specify (from data or eliciting from an expert)". This section defines a Bayesian network and shows how one can be constructed from prior knowledge.

Let's first consider a simple causal network where A is a parent of B. Then it is simple to calculate Pθ(B | A); applying Bayes' theorem is equivalent, diagrammatically, to reversing one or more of the Bayesian network arrows. However, if C is also a parent of B, then the individual conditional probabilities Pθ(B | A) and Pθ(B | C) are not sufficient to provide any information on the interaction of A and C on B: a specification of Pθ(B | A, C) is required. This ideology can be expanded to consider n random variables. Typically, the interest lies in looking for relationships among a large number of variables, for which the Bayesian network is suitable.

A Bayesian network for a set of variables X = {X1, X2, ..., Xn} consists of (a) a network structure S that encodes a set of conditional independence assertions about variables in X, and (b) a set P of local probability distributions associated with each variable (Heckerman, 1999). The set of directed arcs provides a one-to-one correspondence between S and X, and the variables together with the directed arcs form a directed acyclic graph. A conditional probability distribution is associated with each node, where the conditioning is on the parents of the node, P(Xi | par(Xi)). Together these components define the joint probability distribution for X: the joint probability distribution over the set of all variables U is given by

P(U) = ∏_i P(Xi | par(Xi)),

where par(Xi) is the parent set of Xi. Jensen (2001) defines this as the chain rule. Given a network structure S, the local probability distributions P are simply the distributions corresponding to the terms in the product of the equation given above (Heckerman, 1999).

Thus, a Bayesian network is a DAG whose structure defines a set of conditional independence properties. These properties can be found through graphical manipulations such as those presented in section 10.3 (Pearl, 1988); in particular, d-separation can be used to read off conditional independencies from the DAG representation of a Bayesian network. Indeed, it has been contended that any uncertainty must obey the definition of d-separation (Jensen, 2001). In this section, the focus then falls on the inferential procedure, which involves the calculation of marginal probabilities conditional on the observed data using Bayes' theorem.
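The factorization P(U) = ∏ P(Xi | par(Xi)) can be sketched on the toy network A → B ← C just discussed, where B needs the full table P(B | A, C). All conditional probability tables below are hypothetical:

```python
from itertools import product

# Hypothetical CPTs for the network A -> B <- C (binary variables).
P_A = {0: 0.6, 1: 0.4}
P_C = {0: 0.7, 1: 0.3}
# P(B=1 | A=a, C=c): a full table is required once B has two parents.
P_B1_given_AC = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}

def joint(a, b, c):
    """P(A=a, B=b, C=c) as the product of the local conditional probabilities."""
    p_b1 = P_B1_given_AC[(a, c)]
    return P_A[a] * P_C[c] * (p_b1 if b == 1 else 1 - p_b1)

total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
print(round(total, 10))   # 1.0 -- the factorization defines a proper joint
```

The point of the sketch is that the three small local tables, one per node, determine the entire eight-entry joint distribution.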
Next, let's consider the process of building a Bayesian network. The first task is much like that of developing a decision tree or influence diagram: it entails the logical structuring of the parameters and/or variables of interest. This can be characterized to include: (1) the correct identification of the goals of modeling (i.e., prediction versus explanation versus exploration); (2) the identification of all possible observations that may be relevant to the problem; (3) the identification of a subset of those observations to model; and (4) the classification of observations into variables having mutually exclusive and collectively exhaustive states. As seen earlier, this is a task embedded in decision analysis and hence is not exclusive to Bayesian modeling. This concludes the more logical framework development in the construction of a Bayesian network.

The next step in the construction of a Bayesian network entails the more mathematical or statistical framework. It is at this stage that a DAG encoding the "assertions of conditional independence" is built. Heckerman (1999) illustrates this process through an example; here, we generalize his "learning by example" approach. The approach is based on the chain rule of probability, which is equivalent to the P(U) shown above. In general, the chain rule of probability is

P(X) = ∏_{i=1}^{n} P(xi | x1, ..., xi−1) = ∏_{i=1}^{n} P(xi | πi).

Then for every Xi there exists some subset Πi ⊆ {X1, ..., Xi−1} such that Xi and {X1, ..., Xi−1} \ Πi are conditionally independent given Πi. Thus, it is clear that the variable sets (Π1, ..., Πn) correspond to the parent nodes of a Bayesian network and in turn specify the arcs in the network structure S. It follows that determining the structure of the Bayesian network requires ordering the variables and determining the most appropriate subsets Πi.

DAGs can always have their nodes linearly ordered so that for each node X all of its parents precede it in the ordering; such an ordering is referred to as a topological ordering (Cowell, 1999). Consider the graph shown below with nine nodes, taken from Cowell (1999).

Figure 104: a DAG on the nine nodes A to I (adapted from Cowell, 1999).

This example shows that the same DAG admits more than one topological ordering of its nine nodes. However, this task (of ordering the variables) may not be the most feasible or reasonable process to use unless done so under a more logical framework. For instance, if the variable ordering chosen is not probable, the resulting network structure may fail to reveal some of the important conditional independencies, or in the worst case n! variable orderings may need to be explored. Instead, applying the semantics of causal relationships (as observed naturally) readily asserts the corresponding conditional independencies.

It has been illustrated that each of these models determines a set of conditional constraints, represented implicitly in the DAG. Furthermore, a variety of belief nets can represent the same conditional independencies. Verma and Pearl (1990) proved that two DAGs are observationally equivalent if they have (i) the same skeleton, and (ii) the same sets of nodal structure, i.e., two converging arcs whose tails are not connected by an arc. Using this criterion, it can be seen that figures 105(a) and (b) are probabilistically indistinguishable, while figures 105(c) and (d) are not.

Figure 105: four DAGs, panels (a) to (d), on the nodes A, B, C, D and E.

So far, the implied criteria or assumptions have not been discussed explicitly; one is mentioned here, which Pearl (1988) refers to as stability. The stability condition requires that all of the probabilistic independence relations implied by the model should be invariant across (small) perturbations to the parameters of the model; that is, the parameters of the model should not be functionally related to each other.

The final step in the construction of Bayesian networks "simply" entails the assessment of the local probabilities, defined as P(Xi | par(Xi)) for each i. Then the task at hand is to compute the joint density over the set of all variables U as defined earlier. It follows a similar procedure to that illustrated above, by simplifying the individual terms (or nodes) with respect to their structural parents; this is referred to as recursive factorization, or "the distribution being graphical over the DAG" (Cowell, 1999). This concludes the systematic approach to the construction of Bayesian networks.
So far the emphasis has been largely on the construction of a Bayesian network (from prior knowledge, data or a combination). This section has described the construction steps in a sequential manner; it is important to assert, however, that in practice many of these "steps" are intermingled, and both the problem of structuring and the assessment of probabilities can lead to changes in the network structure. Moreover, the task of refining the structure and local probability distributions of a Bayesian network given data results in a set of techniques for data analysis that combine prior knowledge with data to produce improved knowledge (Heckerman, 1999). This illustrates the basic ideas for learning probabilities (and structure) within the Bayesian or belief net framework.

We now focus on probabilistic inference in Bayesian networks. Because a Bayesian network for U determines a joint probability distribution for U (the set of all variables), the Bayesian network can be used, in principle, to compute any probability of interest; the need is usually to determine various marginal probabilities from the model. This process is referred to as "belief updating in Bayesian networks". Jensen (2001) summarizes the mathematical process involved, where θs = (θ1, ..., θn) is the vector of parameters and S^h denotes the configuration of the network:

1. Determine P(U | θs, S^h) = ∏_{i=1}^{n} P(Xi | par(Xi), θi, S^h), where θi is the vector of parameters for the distribution P(Xi | par(Xi), θi, S^h).
2. Use the chain rule to calculate P(U).
3. Marginalize P(U | θs, S^h) to P(Xi, par(Xi) | θs, S^h) for each Xi in U.
4. Normalize P(Xi, par(Xi) | θs, S^h) to P(Xi | par(Xi), θs, S^h).

The section following presents an extension of both probability modeling and belief nets within a hierarchical form.
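Belief updating can be sketched by brute-force enumeration on a minimal two-node network (all tables hypothetical): build the joint, keep the slice consistent with the evidence, and normalize.

```python
# Toy network A -> B; belief updating computes P(A | B = 1).
P_A = {0: 0.75, 1: 0.25}
P_B1_given_A = {0: 0.2, 1: 0.9}   # P(B=1 | A=a)

def joint(a, b):
    p_b1 = P_B1_given_A[a]
    return P_A[a] * (p_b1 if b == 1 else 1 - p_b1)

# Marginalize the joint down to the evidence-consistent slice B = 1 ...
unnorm = {a: joint(a, 1) for a in (0, 1)}   # {0: 0.15, 1: 0.225}
# ... and normalize.
Z = sum(unnorm.values())
posterior = {a: p / Z for a, p in unnorm.items()}
print(posterior[1])   # 0.225 / 0.375 = 0.6
```

Exact algorithms such as the junction tree method discussed later avoid enumerating the full joint, but compute the same quantity.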
Section 10.5: Hierarchical Structures in Belief Nets

The natural occurrence of hierarchical structures in physical phenomena is quite common, and these structures find themselves in belief net representations as hierarchical models. Hierarchical models were introduced into statistical modeling in 1972 by Lindley and Smith (Gelman et al., 1995). The simplest form of a hierarchical model can be described as a representation of beliefs which consists of a layered structure; the concept is simply to extend a common parameter θ to model the distribution of many observable quantities Xi.

For instance, consider a set of observations Y1, Y2, ..., Yn, each related independently to a set of parameters θ1, θ2, ..., θn which share some similarity or characteristic. The assumption that the θi have a common distribution Pθ(· | γ) represents this "similarity", while the distributions PYi(· | θi) may vary. If the parameter has many components, then it may be useful to specify their joint prior distribution using a common hyperparameter, say γ (Neal, 1996), which is given its own prior; a third level suggests that γ may be represented by a prior distribution with parameters ξ. Thus the model described is a three level model:

Level 1: Yi ~ PYi(· | θi), i = 1, ..., n;
Level 2: θi ~ Pθ(· | γ), i = 1, ..., n;
Level 3: γ ~ Pγ(· | ξ).

If the θi are independent given γ, then

P(θ) = P(θ1, ..., θn) = ∫ P(γ) ∏_{i=1}^{n} P(θi | γ) dγ.

Although γ could have been dispensed with (θ given a direct prior), Neal (1996) states that using a hyperparameter "may be much more intelligible". This could be extended to the level 1 data, which would then require the decision maker to integrate over θ as well. Such models hold several advantages: they allow for the articulation of dependence or independence in the beliefs about the components, a variety of experimental conditions can be ascribed, and the approach extends to the case where the Yi are multidimensional (French and Insua, 2000).

A simple and naturally occurring example of a hierarchical structure arises in determining the school to school variability of student achievement. Suppose the response is math achievement on a standardized test. The innate structural formation describes students to be nested within classrooms.
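A minimal simulation sketch of the three-level structure follows; the normal distributions and all numeric values are hypothetical choices, not part of the model above, and γ plays the role of the hyperparameter:

```python
import random

random.seed(0)

# Level 3: hyperparameter gamma drawn from a hypothetical hyperprior N(50, 10).
gamma = random.gauss(50, 10)
# Level 2: classroom means theta_i ~ N(gamma, 5), i = 1, ..., n.
n = 4
theta = [random.gauss(gamma, 5) for _ in range(n)]
# Level 1: one observed score per classroom, Y_i ~ N(theta_i, 2).
Y = [random.gauss(t, 2) for t in theta]

print(len(Y))   # 4 -- one observation generated at the bottom of each branch
```

Simulating top-down like this mirrors the layered factorization P(γ) ∏ P(θi | γ) P(Yi | θi); inference runs in the opposite direction, from the Yi back up to γ.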
Classrooms, in turn, are nested within schools, and this could go on and on: hierarchical models are not restricted to three levels, and where appropriate, more may be used. In this example, θi is the mean math score for classroom i, specified by a normal distribution for that particular classroom. Learning about the hyperparameter γ allows us to update or modify our belief about the math ability of students in general by adjusting the likely value of γ. Gelman et al. (1995) present the advantages of hierarchical models in terms of statistical inference: such strengths draw upon the ability to learn about the hyperparameter γ as well as drawing upon some information about the particular effects θi. These three level hierarchical models are not restricted to the structural formation illustrated above but have also been used to structure statistical inference in terms of observational and modeling error and prior information on the parameters. Conditional independence, Bayesian networks and hierarchical models thus give the decision analyst the tools for exploring and constructing the form of the joint distribution through a decomposition into its marginal and conditional components.

Section 10.6: Exchangeability and Other Forms of Probability Models

One of the most fundamental concepts inlaid within subjective probability modeling is the concept of exchangeability, introduced by De Finetti. Let A and B denote two groups, treated and untreated, and let PA1(y) and PA0(y) denote the distributions of group A under two hypothetical conditions, treatment and no treatment. Pearl (1999) explains that if the interest lies in some parameter µ of the response distribution, then µA1 and µA0 denote the values of that parameter in the corresponding distributions PA1(y) and PA0(y), with µB1 and µB0 defined similarly for group B. Pearl (1999) defines the two groups to be exchangeable relative to the parameter µ if the two pairs are indistinguishable, that is, if (µA1, µA0) = (µB1, µB0). De Finetti's "twist" of hypothetical permutation suggests that rather than "judging" the resemblance between two groups, the decision maker should "imagine" a hypothetical exchange of the two groups and then decide whether the observed data under the swap would be distinguishable from the actual data (Pearl, 1999); in measuring the pair (µA1, µA0), the hypothetical swap then suggests measuring (µB1, µB0). That judgment is somewhat left to the decision maker. Note that the independence conditions do not suggest the parametric form of any of the conditional or marginal distributions, whereas exchangeability can (French and Insua, 2000). French and Insua (2000) extend this concept and state the following proposition:
A collection of random variables is exchangeable if and only if any n-tuple, with n less than or equal to the size of the collection, has the same distribution as any other n-tuple. Thus, "exchangeability develops the parametric form of a distribution from relatively weak ideas and of symmetry between future observables rather than from modeling ideas derived from scientific understanding" (French and Insua, 2000). Literature also contends, however, that these sets of symmetry conditions are not easily identifiable, and sometimes the parametric structure of the distribution is based more on intuition.

Hence, there are many choices in the structuring of decision problems, and the methods illustrated in Section 10 provide a framework for structuring the functional form of the decision maker's probability distributions. In practice, many of the methods are mixed and matched to accomplish the ultimate goal: a decision maker may choose to build a Bayesian network, then focus on some nodes, borrow conditions of exchangeability, and modify the structure.
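Exchangeability can be illustrated numerically in the De Finetti spirit: if observations are iid given an uncertain parameter (here a coin whose bias is itself random, a hypothetical two-point mixing distribution), then every reordering of an n-tuple has the same probability.

```python
import math
from itertools import permutations

# Hypothetical mixing distribution: the coin's bias is 0.2 or 0.8, each with weight 0.5.
biases = [0.2, 0.8]

def prob_sequence(seq):
    """P(seq) = sum_b 0.5 * prod_i P(x_i | b): iid given the bias, hence exchangeable."""
    return sum(0.5 * math.prod(b if x == 1 else 1 - b for x in seq) for b in biases)

probs = {round(prob_sequence(p), 12) for p in permutations((1, 1, 0))}
print(len(probs) == 1)   # True: all orderings of (1, 1, 0) share one probability
```

Note the sequence is not iid unconditionally (early outcomes carry information about the bias), yet it is exchangeable, which is exactly the weaker symmetry the proposition describes.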
Section 11: Probabilistic Graphical Models with Markovian Properties

The framework for constructing probabilistic graphical models as DAGs is an initial step in modeling complex decision problems. Thus far, however, the formalism of hidden components or sequential decision making processes has been ignored. This section extends such structures to discuss probabilistic inference in graphical models which hold hidden components or nodes. It focuses primarily on sequential problems which can be formulated in Markovian ways; such applications stem from the body of theory presented in Part III. This section does not survey sequential decision making, but does strive to present some of the extensions of the graphical models presented earlier: the development of models such as Bayesian networks (belief nets) or hierarchical models is extended to the junction tree algorithm, neural networks, hidden Markov models and dynamic linear models. This extends both probability modeling and the representation of probabilistic graphical models described in the previous section.

Section 11.1: Inference in Probabilistic Graphical Models

The modeling of probabilistic graphs was introduced in section 10. As shown there, when belief updating in Bayesian networks, the joint probability distribution P(U | θs, S^h) is sufficient for the calculations required. However, because the joint probability and the corresponding calculations (based on its decomposition) increase exponentially with the number of variables or nodes, more efficient methods are desired (Jensen, 2001). Jordan (1999) presents an array of such methods encompassing both exact and approximation algorithms. Here, we begin with exact inference when encountering hidden nodes, and then briefly introduce dynamic linear modeling, which underpins many Bayesian forecasting techniques.

Let H denote the set of hidden nodes and E denote the set of "evidence" or observed nodes. The problem in such inference making resides in the computation of the conditional probability distribution over the unobserved or hidden nodes given the observed information. Thus the goal is to calculate

P(H | E) = P(H, E) / P(E).

Further, another aim is to calculate marginal probabilities in graphical models, in particular P(E). Notably, this is the likelihood, which of course has a direct relationship with the conditional probability P(H | E); these distributions are intertwined, and hence inference and algorithmic approaches to computing such probabilities coincide.

The focus in this part of the report has been on graphical models, more specifically on DAGs. To recap, DAGs (directed acyclic graphs) are numerically assessed by determining the local conditional probabilities associated with each node; that is, these conditional probabilities give the P(Xi | par(Xi)). To determine the joint probability distribution of all of the nodes, the product is taken over all nodes:

P(U) = ∏_i P(Xi | par(Xi)).

Probabilistic inference in DAGs then involves the computation of conditional probabilities under this joint distribution.

This report has thus far neglected undirected graphical models, as the introduction to this subject area was directed by Bayesian networks in a very systematic approach; we introduce them here. Such models are also referred to as Markov random fields. Undirected graphical models are numerically assessed by "associating potentials with the cliques of the graph" (Jordan, 1999). A clique is a subset of nodes which is fully connected and maximal, and a potential is a function on the set of configurations of a clique that associates a positive number with each configuration. Figure 111 is an example of an undirected graph adapted from Jordan (1999); its cliques are C1 = {X1, X2, X3}, C2 = {X3, X4, X5} and C3 = {X4, X5, X6}.

Figure 111: an undirected graph on the six nodes X1, ..., X6, with clique potentials Φ1(C1) and Φ2(C2) marked.

To compute the joint probability distribution for all of the nodes, the product is taken over all the clique potentials:

P(U) = ∏_{i=1}^{M} φi(Ci) / Σ_U ∏_{i=1}^{M} φi(Ci),

where M is the total number of cliques and the normalization factor (the denominator) sums the numerator over all configurations.
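The clique-potential factorization can be sketched on a tiny undirected chain with two cliques (all potential values below are hypothetical positive numbers, not probabilities):

```python
from itertools import product

# Undirected model on binary X1, X2, X3 with cliques {X1, X2} and {X2, X3}.
phi_12 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
phi_23 = {(0, 0): 1.5, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def unnorm(x1, x2, x3):
    """Unnormalized measure: the product of the clique potentials."""
    return phi_12[(x1, x2)] * phi_23[(x2, x3)]

# The normalization factor sums the product over all configurations of U.
Z = sum(unnorm(*u) for u in product((0, 1), repeat=3))
P = {u: unnorm(*u) / Z for u in product((0, 1), repeat=3)}
print(round(sum(P.values()), 10))   # 1.0
```

Unlike the directed case, the potentials carry no local probabilistic meaning on their own; only after dividing by Z does the product become a distribution.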
The junction tree algorithm is commonly applied as a means of making exact inferences; it compiles directed graphical models into undirected graphical models through graph theoretic methods. Recall that it was stated that this approach requires two main steps, "moralization" and "triangulation". These are described next.

The first stage, moralization, translates the directed graph into an undirected graph. The objective is to "marry" the parents of each node in the graph: an edge is added between every pair of parents of a common child, and the directions are then simply dropped from the arcs in the graph (Jordan, 1999). The resulting graph is called a moral graph and can be used to represent the probability distribution of the original directed graph within the undirected structure. Both the directed and undirected forms use the product of local functions to obtain the joint probability distribution of the nodes; the directed local functions P(Xi | par(Xi)) hold the property of being real-valued functions over the configurations, and may therefore serve as potentials.

The second stage of the algorithm is referred to as triangulation. This processes the moral graph (taking it as input) and produces an undirected graph, perhaps with additional edges. If a graph contains a cycle of four or more nodes without a chord, it is considered not triangulated, but it can be made triangulated by simply adding a chord, as seen below. Figure 112 illustrates two graphs: (a) represents a nontriangulated graph, whereas (b) demonstrates its triangulated form.

Figure 112(a): a nontriangulated graph. Figure 112(b): its triangulated form.

A triangulated graph holds a property which allows recursive calculation of probabilities (Jordan, 1999): joint probabilities can be built up sequentially through the graph. Recall that if X is instantiated, then all of its neighbors are independent (Jensen, 2001); in a general network, however, such nodes are not always situated together within a clique, which is why the aforementioned algorithmic structure is made possible for trees and not for arbitrary networks.

Once a graph has been triangulated, the cliques of the graph can be arranged into a data structure known as a junction tree. Thus, the independence properties in a network are converted and organized into a set of sets of variables (the cliques), and a tree is constructed over the cliques. Jensen (2001) provides a complete description of, and the sequential steps involved in, this graph theoretic representation of the approach.
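Moralization itself is a short graph operation; the sketch below uses a hypothetical parent map for the converging connection A → C ← B, where marrying the co-parents adds the edge A-B:

```python
from itertools import combinations

def moralize(parents):
    """parents: dict mapping each node to the list of its parents.
    Returns the edge set of the moral (undirected) graph."""
    edges = set()
    for child, pars in parents.items():
        for p in pars:                       # keep each arc, now undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(pars, 2):   # marry every pair of co-parents
            edges.add(frozenset((p, q)))
    return edges

moral = moralize({'C': ['A', 'B'], 'A': [], 'B': []})
print(frozenset(('A', 'B')) in moral)   # True: the parents of C are now married
```

The added A-B edge ensures that C and both of its parents land in a common clique, so the potential can carry the full table P(C | A, B).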
Section 11.2: Neural Networks
Neural networks are layered graphs endowed with a nonlinear "activation" function at each node (Jordan, 1999). A neural network is characterized by: (1) its architecture, the pattern of connections between the nodes; (2) its learning algorithm, the method of determining the weights on the connections (probability distributions); and (3) its activation function, which determines the output (Ripley, 1996). Figure 113 below illustrates the layered graphical structure of a neural network.
Figure 113:
Typically, a neural network consists of a large number of processing elements (nodes), or neurons (when speaking of artificial neural networks). Each node has an internal state, referred to as its activity level or activation, which is a function of the inputs it has received; a node sends its activation as a signal to several other nodes, though it can only send one "message" at a time. Let's first consider a typical example of a simple neural network. For simplicity, consider a single output node Y which receives input (activation) from n nodes X1, ..., Xn, with an associated weight θj on the edge from each input node j to the output node Y. To determine the output of the node Y, simply sum the inputs with their respective weights (compare multiple regression, with the θj as the regression coefficients):

y = θ0 + θ1 x1 + θ2 x2 + ... + θn xn.

Here θ0 is the "bias" parameter, which allows one to change (or bias) the output independently of the inputs. The next step is to apply an activation function, of which a number could be used; a common choice is the logistic sigmoid function (an S-shaped curve).
Let's consider an activation function that is bounded between 0 and 1, such as that obtained through the logistic function f(y) = 1/(1 + e^(−y)). A binary variable Xi is then associated with each node in the graphical model, and the activation of the node is interpreted as the probability that the associated binary variable takes one of its two values. The corresponding logistic function is:

P(Xi = 1 | par(Xi)) = 1 / (1 + exp(−θi0 − Σ_{j ∈ par(Xi)} θij Xj)).
Thus, the θij are the parameters associated with the edges between the parent nodes j and node i, and θi0 is the bias parameter associated with node i. This is the sigmoid belief network introduced by Radford Neal (1992). Such treatments of neural networks hold many advantages, including the ability to perform diagnostic calculations and to handle missing data (Jordan, 1999). From figure 113 above, it is clear that a node in a neural network generally has the entire preceding layer as its parent nodes; thus, when applying the junction tree algorithm, the moral graph links all of the nodes in that layer. Further note that, by the definitions of Section 11.1, the output nodes are the "observed" or "evidence" nodes, and thus the hidden nodes are probabilistically dependent, as are any of their ancestors (Jordan, 1999). Computationally this can be quite intensive, as the clique size arising from the triangulation procedure will grow exponentially.
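The logistic node probability above can be written directly; the parent states, weights and bias below are hypothetical values chosen for illustration:

```python
import math

def node_probability(parent_states, weights, bias):
    """P(X_i = 1 | par(X_i)) for a sigmoid belief network node:
    logistic of the bias plus the weighted sum of the parent states."""
    s = bias + sum(w * x for w, x in zip(weights, parent_states))
    return 1.0 / (1.0 + math.exp(-s))

# Three binary parents; s = -0.5 + 0.7 + 0.0 + 0.4 = 0.6.
p = node_probability([1, 0, 1], weights=[0.7, -1.2, 0.4], bias=-0.5)
print(round(p, 4))   # logistic(0.6) ≈ 0.6457
```

The output is always strictly between 0 and 1, so it can serve as the conditional probability table entry for any joint configuration of the parents.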
Section 11.3: Hidden Markov Models
This section introduces hidden Markov models, one of the simplest kinds of dynamic Bayesian network. As the model's name suggests, hidden Markov models hold Markovian properties, for which we refer to Part III. Instead of restating those properties, we simply relay the directed Markov property in graphical models. This is a conditional independence property which states that a variable is conditionally independent of its nondescendents given its parents:

X ⊥ ND(X) | par(X),

where ND(X) denotes the nondescendents of X. In section 10 it was mentioned that the conditional probability P(Xi | par(Xi)) did not merely mean that if par(X) = π* then P(X = x) = P(x | π*); the suggestion was that any other information is irrelevant for this conditioning to hold. In DAGs, this other information refers to knowledge about the node itself or any of its descendents. So, we can summarize this
by saying that having information about a nondescendent does not tell us anything more about X, because it can neither influence nor be influenced by X, either directly or indirectly (Cowell, 1999). (Or, if it can influence X indirectly, it can only do so through the parents, which are all known anyway.)
Figure 114: A hidden Markov model with state nodes X and output nodes Y.
A hidden Markov model (HMM) is a graphical model in the form of a chain, as seen in the figure above. Applying the same property, consider a sequence of multinomial "state" nodes X_i and assume that the conditional probability of X_i depends only on its immediate predecessor X_{i-1}. This forms a matrix of transition probabilities A = P(X_i | X_{i-1}) which is time invariant. The output nodes Y_i have an associated "emission" probability law B = P(Y_i | X_i) which is also time invariant. In determining probabilistic inferences in such graphical models, the output nodes are treated as evidence nodes and the states as hidden nodes. The EM (expectation-maximization) algorithm is generally applied to update the parameters A, B, π. This two-step iterative procedure first computes the necessary conditional probabilities and then updates the parameters via weighted maximum likelihood. Since a chain has no node with more than one parent and is already triangulated, the moralization and triangulation procedures of the junction tree algorithm are innocuous for such graphical models as hidden Markov models (in this form).
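As a concrete illustration of exact inference on this chain, the sketch below runs the standard forward recursion to compute the likelihood of an output sequence. All numbers (a two-state transition matrix A, emission matrix B, and initial distribution π) are hypothetical, chosen only to make the example runnable; they do not come from the text.

```python
# Sketch of exact inference in a small HMM via the forward recursion.
# A, B and pi below are illustrative two-state values, not from the report.

A  = [[0.7, 0.3],
      [0.4, 0.6]]        # A[i][j] = P(X_t = j | X_{t-1} = i)
B  = [[0.9, 0.1],
      [0.2, 0.8]]        # B[i][y] = P(Y_t = y | X_t = i)
pi = [0.5, 0.5]          # P(X_1 = i)

def forward(obs):
    """Return P(Y_1..Y_T = obs) by summing over all hidden state paths."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(alpha))) * B[j][y]
                 for j in range(len(pi))]
    return sum(alpha)

likelihood = forward([0, 0, 1])
```

The chain structure keeps this computation linear in the sequence length, which is why moralization and triangulation add nothing here.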
Figure 115:
Now consider a variation on the aforementioned hidden Markov model with a number of chains. In this situation, the structure is composed of M chains, where the state node for the mth chain at time i is denoted X_i^(m) and the corresponding transition matrix for the mth chain is denoted A^(m). The overall transition probability for the full model is computed by taking the product across the intra-chain transition probabilities:

P(X_i | X_{i-1}) = ∏_{m=1}^{M} A^(m)(X_i^(m) | X_{i-1}^(m)).

Such a graphical model is called a factorial HMM and is shown above. The task at hand is again to compute the conditional probability distribution over the hidden states. However, a factorial hidden Markov model does not hold the tractable properties held by its more general formalism: the triangulation process will result in an exponential growth of cliques, as there are multiple chains to be considered, and inference is shown to be intractable (numerically) (Jordan, 1999).

Another variation, or amalgamation of two previously described models, is the hidden Markov decision tree (HMDT). As the name suggests, this is a model which is essentially a decision tree endowed with Markovian properties (Jordan, 1999).

Figure 116: The figure above illustrates such a model, where the horizontal edges represent the Markovian temporal dependence. Note that the term temporal is used as it is assumed that the model structure does not change (Murphy, 2003).

In an HMDT, given a sequence of input vectors U_i and corresponding output vectors Y_i, the decisions in the decision tree are not only conditional on the current data point, but are also conditional on the decision at the previous moment in time. This is much like the case of the FHMM and is shown to be intractable when making inferences (numerically) (Jordan, 1999).
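The product form of the factorial HMM transition can be checked directly: the joint transition over the chains is the product of per-chain transitions, and the joint state space is the Cartesian product of the chain state spaces. The sketch below uses two hypothetical binary chains (the matrices A1, A2 are illustrative, not from the text).

```python
# Sketch: the joint transition of a factorial HMM with two hypothetical
# binary chains is the product of per-chain transitions, so the joint
# state space has 2 * 2 = 4 configurations.

A1 = [[0.9, 0.1], [0.2, 0.8]]   # chain 1: A1[i][j] = P(X^1_t = j | X^1_{t-1} = i)
A2 = [[0.6, 0.4], [0.3, 0.7]]   # chain 2

def joint_transition(prev, nxt):
    """P(X_t = nxt | X_{t-1} = prev) with joint state X = (X^1, X^2)."""
    return A1[prev[0]][nxt[0]] * A2[prev[1]][nxt[1]]

# Each row of the implied joint transition matrix still sums to one:
row_sum = sum(joint_transition((0, 1), (a, b)) for a in (0, 1) for b in (0, 1))
```

The exponential blow-up mentioned in the text shows up here as well: with M chains of K states each, the joint transition ranges over K^M configurations.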
Several other variations of the HMM exist, including those of higher orders, where each state depends on the previous k states as opposed to the single previous state, and those with a mixture of both continuous and discrete nodes. However, although these models provide a graphical representation of complex problems, they are not easily tractable, especially where exact inference algorithms are concerned; the number of cliques for such models grows exponentially. Many of these models can still be assessed using approximation methods, which will be discussed generally in Section 11.5.

Section 11.4:
Linear Dynamic Models

Another important type of dynamic Bayesian network is the class of decision problems which has arisen out of the area of control of stochastic systems within the engineering field. Essentially, the objective is to keep systems running over a number of stages along target trajectories of states (French and Insua, 2000). Linear dynamic models have the same topology as HMMs, but all the nodes are assumed to have linear-Gaussian distributions. Their importance resides in their formulation as a tool for Bayesian forecasting and sequential procedures. They have also been developed for the problem without controls, known as dynamic linear models (DLM). Let us consider the basic structure:

y_t | θ_t ~ (F_t^T θ_t, V_t)
θ_t | θ_{t-1} ~ (G_t θ_{t-1}, W_t)

where y_t is a scalar (the data) and θ_t is the state vector. The model is defined by four quantities {F_t, G_t, V_t, W_t} which are assumed to be known at time t: F_t is an n-vector, G_t is an n × n matrix, V_t is a nonnegative scalar, and W_t is an n × n symmetric and positive (semi)definite variance matrix. A more general way of stating this model is:

Observation equation: y_t = F_t^T θ_t + v_t,  v_t ~ (0, V_t)
System equation: θ_t = G_t θ_{t-1} + w_t,  w_t ~ (0, W_t)

Figure 117: HMM with Gaussian nodes.
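The observation and system equations above admit a well-known sequential updating of the posterior moments of the state (the Kalman filter recursion used in DLM practice). The following is a minimal scalar (n = 1) sketch of that recursion; the values of F, G, V, W and the data are hypothetical, chosen only for illustration.

```python
# A minimal scalar (n = 1) sketch of the DLM updating recursion, computing
# the posterior moments (m_k, C_k) from (m_{k-1}, C_{k-1}) and y_k.
# F, G, V, W and the observations are illustrative values, not from the report.

F, G, V, W = 1.0, 1.0, 0.5, 0.1

def dlm_update(m_prev, C_prev, y):
    # Prior at time k:    theta_k | D_{k-1} ~ (a, R)
    a = G * m_prev
    R = G * C_prev * G + W
    # One-step forecast:  y_k | D_{k-1} ~ (f, Q)
    f = F * a
    Q = F * R * F + V
    # Posterior at time k: theta_k | D_k ~ (m, C)
    K = R * F / Q                    # adaptive coefficient (Kalman gain)
    m = a + K * (y - f)
    C = R - K * F * R
    return m, C

m, C = 0.0, 1.0                      # prior moments (m_0, C_0)
for y in [1.2, 0.9, 1.1]:
    m, C = dlm_update(m, C, y)
```

Each pass through the loop realizes exactly the recursive definition of the information sets D_k described in the text: only the previous moments and the new observation are needed.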
The best way to think of a DLM is through its graphical structure, as shown above. This graph has two noticeable features. The first is that the state vector θ_t follows a Markov process, so that θ_k ⊥ θ_j | θ_{k-1} for all 1 ≤ j < k ≤ t; this implies that w_j ⊥ w_k. The second is that θ_k separates y_k from everything else, so that y_k ⊥ y_j, θ_j | θ_k; this further implies that v_k ⊥ w_k and v_k ⊥ v_j, w_j (West et al., 1997).

Let D_k = {y_1, ..., y_k} represent the data, where each y_k takes on either a numeric or missing value. The prior beliefs about the state vector can then be written as:

Prior information: θ_0 | D_0 ~ (m_0, C_0).

The DLM allows us to compute m_k and C_k from m_{k-1}, C_{k-1}, and y_k. In this way, the information sets D_k are defined recursively and can be computed through stochastic dynamic programming algorithms. Although the computations involved can be daunting, these models do provide a highly effective means of modeling univariate time series (French and Insua, 2000).

Section 11.5:
Variational Methods for Dynamic Bayesian Networks

The junction tree algorithm and dynamic programming algorithms provide solutions for probabilistic inference in graphical models. A graphical model specifies a complete joint probability distribution over all the nodes; given the joint probability distribution, all possible inference queries can be determined from marginal distributions. Computing several marginal distributions at the same time can be done using dynamic programming, which, much like the junction tree algorithm, avoids the redundant computation required by other marginalization procedures. However, the cost grows exponentially with the size of the largest clique. Thus, as we have seen in this section, even in cases where exact algorithms are manageable, it is still important to consider approximation methods, and a need for other approaches is crucial for many forms of dynamic Bayesian networks as they grow in complexity.

Approximation algorithms provide an alternate solution. Variational methods and sampling methods are two approximation approaches readily used; Jordan (1999) provides an overview of some of the valuable existing approximation methods. The objective of such methods is to convert a complex problem into a simpler one. Variational methods generally provide bounds on probabilities of interest (Jordan, 1999). The simplest example is the mean-field approximation, which exploits the law of large numbers to approximate large sums of random variables by their means (Murphy, 2003). This procedure requires the decoupling of all the nodes and the updating of the variational parameters, formulating a lower bound on the likelihood.
Since the basic idea (stemming from above) is to simplify the joint probability distribution by transforming the local probability functions, an appropriate choice can lead to a simplified inference problem. The sequential approach makes use of variational transformations and aims to transform the network until the resulting structure is amenable to exact methods (Jordan, 1999). Transforming some of the nodes may allow some of the original graphical structure to remain intact, or may introduce a new graphical structure to which exact methods may be applied.

Another approach to the design of approximation algorithms involves sampling (Monte Carlo) methods. The general inference problem is the computation of the conditional distribution over the hidden values given the observed information; this has been touched on briefly through some of the examples in this section. Markov chain Monte Carlo (MCMC) methods are an efficient approach, and many of these algorithms have been applied to the inference problem in graphical models. Their simplicity of implementation and theoretical guarantees of convergence make them advantageous; however, they can be slow to converge, and it can be hard to diagnose their convergence (Jordan, 1999; French and Insua, 2000). There is a great deal of literature in this area, especially on algorithms to accomplish this task, and some of the more common methods are described in more detail in PART V to illustrate analysis procedures.

Other similar approaches exist and have been shown to be fruitful in accomplishing the inference task. Some of these have been briefly discussed in this section. These various approaches to inference are not mutually exclusive, and in many instances an amalgamation of such methods may prove to be the best solution for a given graphical model.
Part V: ANALYTIC & COMPUTATIONAL METHODS FOR DECISION PROBLEMS

The complexity of decision problems has led to the rapid development of methodological solutions to these problems. Earlier parts of this report introduced the framework for modeling decision problems and the decision theoretic aspects. PART IV of this report focused on building models to illustrate decision problems, with an emphasis on probabilistic graphical models; the graphical structure of probabilistic models has added a more intuitive approach to the problem at hand. This concludes the methodological issues which arise in decision problems.

These developments have also led to the advancement of computational methods to deal with the complexities of the numerical assessment of these problems. Recall that French and Insua (2000) commented on feasibility and robustness, which relate to the practicality and implications of decision processes; thus, the need for computational methods which meet these criteria is essential to decision modeling. The task at hand is probabilistic inference, whether it lies in graphical models or in expected losses. This part of the report concludes the paper by discussing some of the computational issues which arise in decision analysis. These analytic concepts are the focus of the beginning sections. Computational methods, including Monte Carlo methods and dynamic programming, are introduced as both feasible and efficient approaches for the analytic assessment of decision problems. The last section is an application of the typical methodology to a prototypical example.
Section 12: Decision Theoretic Computations
Part I and Part IV introduced the framework for decision models and the construction of graphical models, respectively. Both approaches introduced the computational issues which arise; many of these issues reside in the calculation and/or minimization of quantities defined by probability distributions. Some of the methodological approaches were introduced, but never fully developed. This section reintroduces the problem and begins to address algorithms which aim to resolve some of these computational constraints. A review of analytic approaches and numerical integration is presented, demonstrating limitations which are then resolved through Monte Carlo methods. Although this section concentrates on Bayesian decision theoretic computations, other statistical decision theoretic approaches face the same computational problems; these methods therefore also provide insight into solving the computational issues which arise through the classical approach.
Section 12.1:
Computational Issues in Decision Problems
The first half of this report focused on the construction and development of decision models. Recall that the loss function l(θ, a) is one of the key attributes of decision theory, and that the process of calculating and minimizing expected losses is the key objective. This section addresses the computational issues which arise in the modeling of decision problems, with specific attention to these objectives.
French and Insua (2000) state that there are two objectives which need to be solved:

min_{a ∈ A} E_θ[ l(a, θ) | x ] = min_{a ∈ A} ∫ l(a, θ) p_θ(θ | x) dθ

and

min_{a ∈ A} ∫ l(a, θ) p_θ(θ | x, a) dθ.
The former is simply the calculation of an expectation. In the latter case, the probability model involves dependencies on a, which occurs in influence diagram and decision tree formulations. For a fixed action, only the explicit computation of the posterior expectation is required; thus the solution of these decision problems requires the computation of posterior expectations. In most instances, the integrals involved are difficult to evaluate analytically, and hence more powerful integration methods need to be implemented. This section reviews analytical and numerical approaches before introducing simulation, or Monte Carlo, methods. The limitations of the analytic and numeric approaches underscore the importance of more computationally efficient and feasible methods; these methods, however, are not without their drawbacks, which are also addressed. The latter part of this section attends to combining the issues of optimization and integration. Although there is a focus on Bayesian decision theoretic computations, these issues arise equally in other statistical decision theoretic approaches, and hence the methods provide insightful possibilities there as well.
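When the parameter space is discrete, the first objective reduces to a finite sum, which makes the structure of the problem easy to see. The toy sketch below (with entirely hypothetical losses and posterior probabilities) picks the action minimizing posterior expected loss.

```python
# Toy illustration of minimizing posterior expected loss over a discrete
# parameter space. The losses and posterior are hypothetical values.

actions = ["a1", "a2"]
thetas = ["low", "high"]
posterior = {"low": 0.3, "high": 0.7}           # p(theta | x)
loss = {("a1", "low"): 0.0, ("a1", "high"): 4.0,
        ("a2", "low"): 1.0, ("a2", "high"): 1.0}

def expected_loss(a):
    """E_theta[ l(a, theta) | x ] as a finite sum."""
    return sum(loss[(a, t)] * posterior[t] for t in thetas)

best = min(actions, key=expected_loss)
```

In continuous problems the sum becomes the integral above, and it is that integral which motivates the numerical and Monte Carlo methods of the following sections.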
Section 12.2:
The Bayesian Framework
In addressing the computational issues within a Bayesian framework, it is necessary to describe the Bayesian perspective in order to fully understand the issues which arise in such methodology. Thus, we begin with an overview of Bayesian data analysis. The essential characteristic of Bayesian methods is their explicit use of probability models for quantifying uncertainty in inferences based on statistical data analysis. Bayesian data analysis can be conceptualized in three steps: (1) setting up a full probability model; (2) conditioning on the observed data; and (3) evaluating the fit of the model and the implications of the resulting posterior distribution (Gelman et al., 1995). Setting up a full probability model entails the construction of a joint probability distribution for all observable and unobservable quantities in a problem; the model should be consistent with knowledge about the underlying scientific problem. The second step involves calculating and interpreting the appropriate posterior distribution – the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data. Lastly, questions addressing "does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in the first step" (Gelman et al., 1995) should be tended to. Contextually, it is often easier with the Bayesian approach to build highly complex models when such complexity is realistic. The central feature of Bayesian inference is the direct quantification of uncertainty, which in principle allows models with multiple parameters and complex layers to be fit (Gelman et al., 1995) with a conceptually simple method. The purpose of this report is not to glorify Bayesian inference over the frequentist approach: there is no need to choose exclusively between the two perspectives, nor is the Bayesian approach without its constraints.
It is, however, the purpose of this section to introduce and outline the general concept of the Bayesian paradigm. Before launching into a discussion of computational advances in Bayesian methods, a review of some terminology and standard notation and a formal presentation of the problem are provided. The task in decision problems has been to make probabilistic inferences. Statistical inference refers to the process of drawing conclusions about a population on the basis of measurements or observations made on a sample of individuals from the population (Everitt, 2002). From a Bayesian perspective, there is no fundamental distinction between observables and parameters of a statistical model, as they are all considered random quantities; herein lies the task – the quantification of such uncertainty. Recall that Bayesian data analysis involves first setting up the full (joint) probability model. Let D denote the observed data, and θ denote model parameters and missing data. The joint distribution is comprised of two parts: a likelihood P(D | θ) and a prior
distribution P(θ). To summarize, the likelihood P(D | θ) describes the process which gives rise to the data D in terms of the unknown parameters θ, while the prior distribution P(θ) expresses what is known about θ prior to observing the data. Specifying P(D | θ) and P(θ) gives a full probability model, in which P(D, θ) = P(D | θ) P(θ). The second step in Bayesian data analysis then uses Bayes' theorem to determine the distribution of θ conditional on D, after observing the data:
P(θ | D) = P(θ) P(D | θ) / ∫ P(θ) P(D | θ) dθ.
Applying Bayes' theorem yields the posterior distribution of θ, which is the object of all Bayesian inference (Gilks et al., 1996). Moments, quantiles and other features of the posterior distribution can be expressed in terms of posterior expectations of functions of θ. The posterior expectation of a function f(θ) is
E[ f(θ) | D ] = ∫ f(θ) P(θ) P(D | θ) dθ / ∫ P(θ) P(D | θ) dθ.
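As a small numerical check of this ratio-of-integrals formula, consider a conjugate case where the answer is known in closed form: a flat Beta(1, 1) prior with Bernoulli data, where the posterior is Beta(1 + s, 1 + n − s). The data below are hypothetical, and the crude midpoint sums stand in for the integrals.

```python
# Sketch checking the posterior-expectation formula numerically for a
# conjugate Beta(1,1) prior with Bernoulli data (illustrative data below).
# The posterior is Beta(1 + s, 1 + n - s), whose mean is (1 + s) / (2 + n).

data = [1, 0, 1, 1, 0, 1]                 # hypothetical observations
s, n = sum(data), len(data)

def unnorm_posterior(t):                  # P(theta) * P(D | theta), flat prior
    return t ** s * (1 - t) ** (n - s)

# Crude midpoint sums for the numerator and denominator of E[theta | D]
grid = [(i + 0.5) / 10000 for i in range(10000)]
num = sum(t * unnorm_posterior(t) for t in grid) / 10000
den = sum(unnorm_posterior(t) for t in grid) / 10000
posterior_mean = num / den

closed_form = (1 + s) / (2 + n)           # mean of Beta(1 + s, 1 + n - s)
```

The two agree to several decimal places here; in high dimensions, of course, such grid sums are exactly what becomes infeasible, as the following discussion explains.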
The integration of this expression has, until recently, been the source of most of the practical difficulties in Bayesian inference. The problem of calculating expectations in high-dimensional distributions arises both in Bayesian inference and in some areas of frequentist inference (Geyer, 1995). To avoid an unnecessarily Bayesian "flavor" to this discussion, the problem is restated in more general terms. Let X be a vector of k random variables with distribution π(⋅). In Bayesian applications, X will comprise both model parameters and missing data; in frequentist applications, it may comprise data or random effects (Gilks et al., 1996; Draper, 1997). For Bayesians, π(⋅) will be a posterior distribution (as described above), while for frequentists it will refer to a likelihood. Either way, the task is to evaluate the expectation
E[ f(X) ] = ∫ f(x) π(x) dx / ∫ π(x) dx
for some function of interest f(⋅). A common situation arises in which the normalizing constant (the denominator of the above equation) is unknown. In Bayesian practice, P(θ | D) ∝ P(θ) P(D | θ); however, the normalizing constant ∫ P(θ) P(D | θ) dθ is not easily evaluated, and this is the focus of the next section.

Section 12.3:
Integration Methods

Although this problem has been described generally, here it is assumed that X takes values in k-dimensional space, i.e. it is comprised of k vectors of continuous random variables. X could instead consist of discrete random variables (in which case the integral would be replaced by a summation) or of a mixture of both continuous and discrete random variables. As discussed in the previous section, the difficulty and the interest in Bayesian inference lie in evaluating expectations ∫ f(x)π(x)dx for various functions f(x).

Integration is usually simple in conjugate models – priors and posteriors are members of the same parametric family of distributions, and thus posteriors are easily determined (French and Insua, 2000). Even for conjugate models, however, there may be cases (moments or probabilities) for which the solution cannot be computed explicitly. In the last decade alone, largely due to increased computing power, there have been a number of advances in numerical (Tierney and Kadane, 1986) and analytic (Smith et al., 1987) approximations for such calculations. Numerical quadrature methods for approximating integrals, such as the common trapezoid and Simpson's rules (CSC260H, 2001), are commonly applied, and such procedures also extend into k dimensions. Better rules are obtained when the form of the integrand is taken into account, such as Gaussian quadrature methods, which approximate well when the density function is a normal kernel (French and Insua, 2000). Even so, alternatives such as numerical evaluation are difficult and inaccurate in k > 20 dimensions. Analytic approximations such as the Laplace approximation are sometimes appropriate (Gilks et al., 1996). Asymptotic methods based on asymptotic normality arguments also exist; however, integration strategies based on these arguments are only valid if sample sizes are large and other necessary conditions apply. These methods are efficient in problems with specific structure and/or low dimensionality. In more complex problems, simulation, or Monte Carlo integration, is required. Monte Carlo methods are often the method of choice for high-dimensional problems, but it is important to note that many of these computational methods complement each other rather than compete (Tierney and Mira, 1999).

Monte Carlo Integration

The basic objective of the Monte Carlo approach is to draw an approximate i.i.d. sample from π(⋅) which can then be used to compute sample averages as approximations to population averages (STA4276H, 2002). Although this is an oversimplification of the approach, it introduces the concept concisely. Monte Carlo integration evaluates E[f(X)] by drawing samples {X_t, t = 1, ..., n} from π(⋅) and then approximating
E[ f(X) ] ≈ (1/n) Σ_{t=1}^{n} f(X_t).

So, to summarize, the population mean of f(X) is estimated by a sample mean. When the samples {X_t} are independent, the Law of Large Numbers ensures that the approximation can be made as accurate as desired by increasing the sample size n (Gilks et al., 1996). Note that here n is under the control of the analyst; it is not the size of a fixed data sample. The problem at hand is to simulate observations from a posterior distribution, obtained via Bayes' rule as

P(θ | X) = P(θ) P(X | θ) / ∫ P(θ) P(X | θ) dθ.

In general, since π(⋅) can be quite complex and nonstandard, drawing independent samples {X_t} from π(⋅) = P(θ | X) is not feasible. However, the samples {X_t} need not be independent: they can be generated by any process which (loosely) draws samples throughout the support of π(⋅) in the correct proportions (Gilks et al., 1996). One way this is accomplished is through a Markov chain having π(⋅) as its stationary distribution.

More formally, let X_1, X_2, ..., X_t, ... be a sequence of random variables with outcomes in some set S, and assume that the conditional distribution of X_t given all previous observations X_{t-1}, X_{t-2}, ..., X_1 depends only on the last observation X_{t-1}. That is, given X_t, the next state X_{t+1} is sampled from a distribution P(X_{t+1} | X_t) which depends only on the current state of the chain; the next state does not depend further on the history {X_1, X_2, ..., X_{t-1}}. Simply, if the current value is known, then predicting the next value does not require the previous values (CHL5223H, 2003). This property is known as the Markovian property, and such a sequence of random variables is known as a Markov chain.

An important issue in basic Markov chain theory is the effect of the starting value X_1 on X_t. The key trait of the distribution of X_t given X_1, denoted P^(t)(X_t | X_1), is that, subject to some conditions, X_t settles down to a limiting distribution as t increases, no matter what the starting value X_1 is. That is, the chain will gradually 'forget' its initial state, and P^(t)(⋅ | X_1) will eventually converge to a unique limiting distribution (Gilks et al., 1996).
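A one-line instance of the sample-mean approximation makes the idea concrete. Below, E[f(X)] for X ~ Uniform(0, 1) and f(x) = x² is estimated by Monte Carlo; the exact value is 1/3, so the quality of the approximation can be checked directly. The target and sample size are illustrative choices, not from the text.

```python
# Minimal sketch of Monte Carlo integration: E[f(X)] for X ~ Uniform(0,1)
# with f(x) = x^2 is approximated by a sample mean (exact value: 1/3).

import random

random.seed(0)
n = 100_000
estimate = sum(random.random() ** 2 for _ in range(n)) / n
```

Here the draws are genuinely i.i.d.; the point of the Markov chain machinery that follows is to produce usable (dependent) draws when direct sampling from π(⋅) is impossible.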
Section 12.4:
Markov Chain Monte Carlo Methods

There are some basic properties of Markov chains which should be stated for completeness. If the set of states S is finite, then the limiting distribution exists provided the chain is aperiodic (not cyclic) and irreducible (it is possible to move from any state to any other); if the set of states is infinite, then the additional property of an existing invariant, or stationary, distribution is required (CHL5223H, 2003). The stationarity of a distribution simply means that the limit of the chain's distribution as n → ∞ is π(⋅). If the chain is irreducible and the expectations exist, then the sample path averages converge to the expectations; therefore these path averages can be used to estimate expected values under the limiting distribution. (See Part III for a discussion of Markov chains and their properties.)

Recall that the basic objective of the Monte Carlo approach was to draw an approximate i.i.d. sample from π(⋅) which could then be used to compute sample averages as approximations to population averages. Since direct i.i.d. sampling is rarely possible once the distribution becomes too complex, Markov chains can be used: a sample from a similar distribution is obtained by constructing a Markov chain which has π(⋅) as its unique invariant distribution (Tierney and Mira, 1999). These are the concepts which formulate the Markov chain Monte Carlo (MCMC) method.

The idea behind the method is simple. If a Markov chain has a stationary distribution which is the same as the desired probability distribution π(⋅) (the target distribution), then the Markov chain can be run for a long time, say m iterations. The probability that the chain is in state i will then be approximately the probability assigned to i by the target distribution. Thus, as t increases, the sample points {X_t} look increasingly like dependent samples from the stationary distribution φ(⋅), after a long burn-in of m iterations. The output from the Markov chain is used to estimate the expectation E[f(X)], where the expectation is taken over the stationary distribution φ(⋅); burn-in samples are usually discarded in the calculation of the estimator:

f̄ = (1/(n − m)) Σ_{t=m+1}^{n} f(X_t) → E_∞[ f(X) ].

The convergence to the expectation is confirmed by the ergodic theorem, whereby the above estimator is referred to as the ergodic average (Gilks et al., 1996). The ergodic average shows how a Markov chain can be used to estimate E[f(X)].

The problem of finding a Markov chain with the desired stationary distribution still remains. To apply the MCMC method in a particular problem, a sampler has to be constructed: a Markov chain whose stationary distribution φ(⋅) is precisely the distribution of interest π(⋅) needs to be established. There are two central ideas for building samplers, guided by the two most common methods – dimension reduction by conditioning (the Gibbs sampling approach) and acceptance and rejection (the Metropolis-Hastings algorithm) – which are described in the next two subsections. These algorithms are not in competition with each other; in fact they complement each other. The practical appeal of simulation methods, including MCMC, is that, given a set of random draws from the posterior distribution, one can estimate all summary statistics of the posterior directly from the simulations. MCMC methods have been widely successful because they allow one to draw simulations from a wide range of distributions, including many that arise in statistical work, for which simulation methods were previously much more difficult to implement. A thorough theoretical discussion is omitted for brevity's sake.

Gibbs Sampling

The Gibbs sampling algorithm holds the central idea of "dimension reduction by conditioning" (Tierney and Mira, 1999); this was the initially proposed algorithm, and it can be automated (Gilks et al., 1996). Suppose π(⋅) is the joint distribution of n components X_1, ..., X_n. Each component is updated based on the values of all the other components:

π(X_t | X_{−t}) = π(X_t | X_1, ..., X_{t−1}, X_{t+1}, ..., X_n).

These distributions are called full conditionals. The Gibbs sampler is best described mathematically by its transition kernel (Gilks et al., 1996; Brooks et al., 1998). The transition kernel describes the density of moving from one point X to another point Y (Gilks et al., 1996):

K(X, Y) = ∏_{i=1}^{d} π(Y_i | {Y_j, j < i}, {X_j, j > i}).

This is the product of the conditional densities of the individual steps required to produce one iteration of a d-dimensional Gibbs sampler. The Metropolis-within-Gibbs algorithm, in which one parameter is updated at a time, is identified as a conceptual blend of the two samplers.
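A concrete sketch of these full-conditional updates is the standard textbook two-component Gibbs sampler for a bivariate normal target with correlation ρ (an example chosen here for illustration, not one from the report). A known property of the bivariate standard normal is that each full conditional is itself normal: X | Y = y ~ N(ρy, 1 − ρ²), and likewise for Y | X.

```python
# Sketch of a two-component Gibbs sampler for a bivariate standard normal
# with correlation rho (illustrative target). Each full conditional is
# X | Y = y ~ N(rho * y, 1 - rho^2), and symmetrically for Y | X.

import random

random.seed(1)
rho, n, burn = 0.8, 20_000, 1_000
sd = (1 - rho ** 2) ** 0.5

x, y = 0.0, 0.0                     # arbitrary starting state
xs = []
for t in range(n):
    x = random.gauss(rho * y, sd)   # draw from pi(x | y)
    y = random.gauss(rho * x, sd)   # draw from pi(y | x)
    if t >= burn:                   # discard burn-in samples
        xs.append(x)

mean_x = sum(xs) / len(xs)
var_x = sum((v - mean_x) ** 2 for v in xs) / len(xs)
```

After burn-in, the marginal draws of X behave like (dependent) samples from N(0, 1), so the ergodic averages recover a mean near 0 and a variance near 1.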
Let S be a state space with probability distribution π (⋅) on S. which is a distribution over a lattice. a simple example is featured below. The joint posterior is invariant. n − m t = m +1 It should be noted that when this sampling was developed by Geman and Geman (1984) its purpose was to sample from a Gibbs distribution. MetropolisHastings Algorithm The notion that if the chain is irreducible and the expectations exist. j . X n +1 is computed as the MH algorithm follows: 1. a Markov chain which has π (⋅) as a stationary distribution is created. Given the mth update.∈ S} with qij ≥ 0 and ∑ qij = 1 for each i ∈ S . The central idea to the MetropolisHastings algorithm is acceptance (or proposals) and rejection (Tierney and Mira. data ) θ 2( m +1) ~ f (θ 2  θ 1 = θ 1( m ) . 1999).θ 2 = θ 2( m ) . therefore: f = n 1 ∑ f (θ ( m) ) → E∞ [ f (θ)]. The algorithm follows: 1. The Gibbs sampler is a special case of the MetropolisHastings algorithm which is explained next. the Markov chain remains in the same state. Set α ij = min π (i )qij 130 .θ 3 = θ 3( m ) . then the sample path averages converge to the expectations is the basic idea proposed by Metropolis et al. If the point is accepted. Then choose a proposal distribution {qij : i. data ) θ 3( m +1) ~ f (θ 3  θ 1 = θ 1( m ) . however since its inception it has expanded over many distributions and thus. get the (m+1)th sample by: θ 1( m +1) ~ f (θ 1  θ 2 = θ 2( m ) . j . (1953) and was extended by Hastings (1970). the Markov chain moves to a new point. data ) 3. Let θ be 3dimensional and the parameters of a Bayesian model.Part V: Analytic & Computational Methods for Decision Problems The Gibbs sampling algorithm is quite simpler in concept and thus. If it is rejected. 2.∈ S } π ( j )qij 2. being called a Gibbs sampling algorithm is a bit of a misnomer.θ 3 = θ 3( m ) . Choose Yn +1 = j according to the Markov chain {qij : i. Given that j∈S X n = i. Start with a set of initial values: θ ( 0 ) . 
This algorithm proposes a new point for the Markov chain, which is either accepted or rejected; by choosing the acceptance probability correctly, the resulting chain has π(·) as its stationary distribution.
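As an illustrative sketch (not from the report), the following samples a small discrete target π(i) ∝ w_i with a symmetric ±1 random-walk proposal; for a symmetric proposal the acceptance probability reduces to min{1, π(j)/π(i)}. All weights are invented.

```python
import random

def metropolis_discrete(weights, n_steps, seed=1):
    # Target pi(i) proportional to weights[i]; symmetric +/-1 random-walk
    # proposal (wrapping around), so alpha = min(1, pi(j)/pi(i)).
    rng = random.Random(seed)
    d = len(weights)
    x = 0
    counts = [0] * d
    for _ in range(n_steps):
        y = (x + rng.choice([-1, 1])) % d       # propose a neighbouring state
        alpha = min(1.0, weights[y] / weights[x])
        if rng.random() < alpha:                # accept: move to the proposal
            x = y                               # (otherwise stay put)
        counts[x] += 1
    return [c / n_steps for c in counts]

freqs = metropolis_discrete([1.0, 2.0, 3.0, 4.0], n_steps=200000)
print([round(f, 2) for f in freqs])  # visit frequencies approach 0.1, 0.2, 0.3, 0.4
```

Because both accepted and rejected steps count as iterations, the long-run visit frequencies converge to the normalized target probabilities.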
The Metropolis-Hastings algorithm can also be used for continuous random variables, by using densities and continuous proposal distributions, and it is also possible to extend the concept to the multidimensional case, though this is not explored further here. The "Gibbs sampler" is a special case of the multivariate Metropolis-Hastings algorithm in which the proposals q are the full conditional distributions and the acceptance probability is always 1, as explained above.

Convergence Issues
A properly derived and implemented MCMC method will produce draws from the joint posterior density once it has converged to its stationary distribution φ(·). As a result, any inference based upon MCMC output relies critically upon the assumption that the Markov chain being simulated has achieved a steady state, or "converged"; that is, that after an initial burn-in period (designed to remove the dependence of the simulated chain on its starting location), all further samples may safely be thought of as coming from the stationary distribution. This is complicated by two factors. First, what is produced at convergence is not a single value but a random sample of values. Second, since the sampled results come from a Markov chain, they will typically be serially correlated. Convergence diagnostics are often used to estimate the degree of mixing in a simulation, which is the extent to which a simulated chain traverses the entire parameter space (Kass et al., 1998). The paper by Brooks and Roberts (1998) reviews some of these methods, with an emphasis on the mathematics underpinning the techniques, in an "attempt to summarize the current 'state of play' for convergence assessment". A round table discussion held at the Joint Statistical Meetings in August of 1996, with a panel of experts in the field, published parts of the discussion covering more practical applications and the procedures the panelists follow.
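The serial correlation just mentioned is what an autocorrelation diagnostic quantifies. As a small illustration (not from the report), the lag-k sample autocorrelation of a simulated chain can be computed directly; an invented AR(1) chain with coefficient φ near 1 mixes slowly and shows high lag-1 autocorrelation:

```python
import random

def autocorrelation(chain, lag):
    # Sample autocorrelation at the given lag: the quantity an
    # autocorrelation plot displays, one value per lag.
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var

# An AR(1) chain x_t = phi * x_{t-1} + noise mixes slowly when phi is
# close to 1; its lag-1 autocorrelation should be roughly phi.
rng = random.Random(2)
phi, x, chain = 0.9, 0.0, []
for _ in range(50000):
    x = phi * x + rng.gauss(0.0, 1.0)
    chain.append(x)
print(round(autocorrelation(chain, 1), 2))
```

High autocorrelation at long lags signals poor mixing, which is why such plots are among the practical diagnostics recommended by the panelists cited above.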
As desired, accepting each proposal with probability α_ij creates a chain with stationary distribution π(·). Many techniques have been developed for trying to determine whether or not a particular Markov chain has converged, and a primary concern in applying such methodology is determining that it has in fact essentially converged; somehow the natural variability in the converged chain and the typically greater variability in the pre-convergence samples must be distinguishable. As various diagnostics are available, the round-table experts mentioned that autocorrelation plots and trace plots are amongst the most useful and practical: the trace plot is simply a plot of the path of the parameter over the iterations, while the autocorrelation plot displays the chain's internal serial correlations. These are not elaborated upon here for brevity.

Section 12.5: Integrating Integration & Optimization Methods
To summarize, some of the aspects addressed in the introductory section on the Bayesian framework (§ 12.2), such as awkward posterior distributions and further distributional complexity introduced by model parameters, would otherwise require subtle and sophisticated analytic approximation. Although MCMC methods have effectively revolutionized the field of Bayesian statistics over the past decade, they are not without their limitations.
For decision-theoretic computations, problems of optimization and integration must be combined. There are two operations: (1) compute the expected loss, and (2) minimize the expected loss. Markov chain Monte Carlo methods ease many of the constraints which have otherwise limited Bayesian data analysis, providing expectations of general functions of model parameters and alleviating much awkward predictive inference; for general loss functions, however, such computational effort is not sufficient, and dimensionality problems typically put the sophisticated approximation techniques introduced in § 12.3 out of reach. In problems with optimization requirements, such as general loss functions, it seems only natural to use Monte Carlo optimization methods. An alternative approach, introduced into the statistical decision theory literature by Shao (1989), is referred to as sample path optimization. The objective is to determine the alternative a* which minimizes the expected loss. The strategy follows:
1. Select a sample θ_1, ..., θ_m ~ p_θ(θ | x).
2. Solve the optimization problem

   min_{a∈A} (1/m) Σ_{i=1}^{m} l(a, θ_i)

   yielding a_m(θ).
By the law of large numbers, a_m(θ) converges to a minimum expected loss alternative a*; thus the problem of computing expectations, that is, of making probabilistic inferences, is remedied in this instance. The approach can be extended to any form of the loss function, and so presents an alternative route through the computational problem.
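Shao's two-step strategy can be sketched as follows. This is an illustrative sketch, not from the report: the posterior is an invented stand-in (normal draws), and quadratic loss is used so that the known Bayes action, the posterior mean, can serve as a check on a_m.

```python
import random

def sample_path_optimum(posterior_draws, actions):
    # Step 2 of the strategy: minimize the Monte Carlo estimate of expected
    # loss, (1/m) * sum_i l(a, theta_i), over a grid of candidate actions.
    def mc_expected_loss(a):
        return sum((a - th) ** 2 for th in posterior_draws) / len(posterior_draws)
    return min(actions, key=mc_expected_loss)

# Step 1: draw theta_1, ..., theta_m from the posterior (stand-in: N(2, 1)).
rng = random.Random(3)
draws = [rng.gauss(2.0, 1.0) for _ in range(5000)]

# Under quadratic loss the Bayes action is the posterior mean, so a_m should
# land near 2 as m grows (law of large numbers).
grid = [i / 10.0 for i in range(0, 50)]   # candidate actions 0.0, 0.1, ..., 4.9
a_m = sample_path_optimum(draws, grid)
print(a_m)
```

The grid search stands in for whatever optimizer suits the action space; the essential point is that the same fixed sample path of posterior draws is reused for every candidate action.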
These approaches have their own limitations: such problems require both powerful integration and optimization methods, as well as algorithms that achieve both simultaneously, and adaptive methods which try to accomplish this task have been developed. It is also important to note that some of the integration and sampling methods previously described themselves include an optimization step, for instance to compute maximum likelihood estimators. French and Insua (2000) review several optimization methods which are useful within statistical decision theory. This section has introduced some of the main computational tools used in statistical decision theory.
Section 13: Analytic Approaches to Solving Decision Problems
Part I, Part II and Part IV introduced the framework for decision models and the construction of graphical models, and the previous section emphasized some of the computational issues which arise in decision problems. This section builds upon the concepts introduced in those earlier sections and relates them to more analytic approaches to solving decision problems. The emphasis is placed on dynamic programming, progressing to its importance as a computational tool for decision problems. Sensitivity analysis is addressed last, yet it is important as a means of checking the sensitivity of the output, the implications of the findings, and possible inconsistencies.

Section 13.1: Dynamic Programming
The computational methods presented in Section 12 neglected decision problems which involve a sequential approach, or which have been transformed so that they can be assessed sequentially; sequential problems lend themselves to another extension of optimization methods. Before presenting the features that characterize such problems, the general notation used to describe a dynamic programming problem is summarized below:
N = number of stages
n = index for the current stage
θ_n = current state at stage n
x_n = decision variable for stage n
x_n* = optimal value of x_n given θ_n
f_n(θ_n, x_n) = contribution of stages n, n+1, ..., N to the objective function
Such sequential problems arise within the sequential framework described in Part III, which is prominent in decision tree structures. Dynamic programming is a mathematical technique which allows a sequence of interrelated decisions to be made; further, it provides a systematic procedure for determining optimal combinations of decisions. Hillier and Lieberman (2001) present the basic features that characterize dynamic programming problems. These include:
1. The problem can be divided into stages, with a decision required at each stage.
2. Each stage has a number of states associated with the beginning of that stage.
3. The consequence of the decision at each stage is to transform the current state into a state associated with the beginning of the next stage.
4. The solution procedure is designed to find an optimal choice for the overall problem.
5. Given the current state, an optimal policy for the remaining stages is independent of the decisions adopted in previous stages; this is Bellman's principle (Bather, 2000).
6. The solution begins by finding an optimal choice for the last stage, and dynamic programming is defined by a recursive relationship.
The recursive relationship takes one of two possible forms, depending on the problem:

f_n*(θ_n) = max_{x_n} { f_n(θ_n, x_n) }   or   f_n*(θ_n) = min_{x_n} { f_n(θ_n, x_n) }

This recursive procedure induces a backward form of mathematical induction in order to achieve the required task. Sequential problems, and subsequently dynamic programming problems, can be classified into two groups: deterministic and stochastic. The former (deterministic) approach suggests that the state at the next stage is completely determined by the state and decision at the current stage; the latter (probabilistic) approach implies that a probability distribution dictates what the next state will be. The probabilistic approach is more conducive to our needs and is associated with solving decision tree diagrams. This approach, as dynamic programming, was first outlined by Bellman, who coined the term to describe the techniques he had brought together to study a class of optimization problems involving sequential decisions (Bather, 2000).

Consider the diagram above illustrating the general structure of a stochastic dynamic programming problem. Let S denote the number of possible states at stage n + 1. Given state θ_n and decision x_n at stage n, the system goes to state i with probability p_i for i = 1, ..., S; at state i, the contribution of stage n to the objective function is represented by C_i. This figure can be extended in such a manner for all possible stages, and diagrammatically it has the form of a decision tree. In statistical decision theory¹, it might be said that the transition from one stage to another is controlled by a sequence of actions: the choice of action a_n determines the probability distribution of the next state.

¹ Bather provides a concise introduction to both deterministic and probabilistic dynamic programming and its application to utility theory.
The inclusion of a probability distribution dictating the consequence at the next state means that the precise form of the objective function will differ slightly from the one given above. It can now be written as:

f_n(θ_n, x_n) = Σ_{i=1}^{S} p_i [ C_i + f*_{n+1}(i) ]

with

f*_{n+1}(i) = min_{x_{n+1}} f_{n+1}(i, x_{n+1})

where the minimization is taken over all feasible values of x_{n+1}.
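This backward recursion can be sketched directly in code. The following is a minimal sketch in which all states, actions, probabilities and costs are invented for illustration; `trans[state][action]` lists the triples (p_i, C_i, next state), and the recursion minimizes Σ_i p_i (C_i + f*_{n+1}(i)) at each stage.

```python
# Backward recursion for a small stochastic dynamic program (all numbers
# invented). trans[state][action] lists tuples (p_i, cost_i, next_state_i).
def solve(stages, states, actions, trans, terminal):
    f = {s: terminal[s] for s in states}        # boundary values f*_{N+1}
    policy = []
    for n in range(stages, 0, -1):              # stages N, N-1, ..., 1
        f_new, best = {}, {}
        for s in states:
            # f_n(s, x) = sum_i p_i * (C_i + f*_{n+1}(i)); minimize over x
            vals = {x: sum(p * (c + f[i]) for p, c, i in trans[s][x])
                    for x in actions}
            best[s] = min(vals, key=vals.get)
            f_new[s] = vals[best[s]]
        f, policy = f_new, [best] + policy      # prepend this stage's policy
    return f, policy

states, actions = ["good", "bad"], ["safe", "risky"]
trans = {
    "good": {"safe":  [(1.0, 1.0, "good")],
             "risky": [(0.5, 0.0, "good"), (0.5, 3.0, "bad")]},
    "bad":  {"safe":  [(1.0, 2.0, "bad")],
             "risky": [(0.2, 0.0, "good"), (0.8, 4.0, "bad")]},
}
f, policy = solve(2, states, actions, trans, terminal={"good": 0.0, "bad": 0.0})
print(f, policy[0])
```

Working backwards, the optimal value and action at each state depend only on the current state, exactly as the Markov structure discussed next requires.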
We are thus dealing with a Markov system: given θ_n at stage n, only the current state is important. The essential characteristic of such modeling, in summary, is that given θ_n at stage n, the choice of action or decision x_n determines both the probability distribution of the next state θ_{n+1} and the expected cost² of the transition θ_n → θ_{n+1}. This expounds the aforementioned Bellman Principle of Optimality:

The optimal sequential decision policy for the problem which begins with P_θ(·) as the decision maker's prior for θ and has N stages to run must have the property that, at any stage n < N, after the observations X_1 = x_1, ..., X_n = x_n have been made, the continuation of the optimal policy must be the optimal sequential policy for the problem beginning with P_θ(· | x_1, ..., x_n) as the prior and having N − n stages to run. (French and Insua, 2000)

French and Insua make a point of emphasizing the importance of the decision maker's prior in Bellman's principle: since the prior describes the state of knowledge, it also describes the state of the decision-making process. Bellman's principle also assumes that after the observation(s), the optimal Bayes action is taken (French and Insua, 2000).

Section 13.2: Bellman's Principle and the Bayesian Perspective
Recall that one of the fundamental elements of decision problems is the loss function, and in dynamic programming it is this objective function we wish to minimize. Now consider an additive objective function, which by definition is monotonic and separable (see Part II on Utility Theory): if the first few terms are fixed, then minimizing the whole sum is equivalent to minimizing the remaining individual terms, a property which also holds when taking expectations. For simplicity of notation, let π = P_θ(·) and let r_n(π) be the Bayes risk of the optimal policy with at most n stages left to run with knowledge π. The situation in which the decision maker has no option to make an observation and must choose an action can then be expressed as:

r_0(π) = min_{a∈A} E_θ[l(a, θ)]

where the expectation over θ is taken with respect to π. Next, consider that the decision maker is given the chance to make an observation. She has two options:
1. Take an action a ∈ A without making an observation, at an expected loss of r_0(π).
2. Make a single observation X at cost γ and choose an action a ∈ A in light of her updated knowledge π(X), at an expected loss of γ + E_X[r_0(π(X))].
This can be extended to n observations.

² This refers to the expected loss function.
The state and action variables may be discrete or continuous, and the range of possible choices of an action may vary from state to state.
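The choice between acting now and paying γ for another observation leads to a backward recursion over the number of stages left, r_n(π) = min{ r_0(π), γ + E_X[r_{n−1}(π(X))] }. As an illustrative sketch (all numbers invented, not from the report), consider two hypotheses θ ∈ {0, 1} under 0-1 loss, with a Bernoulli observation X whose success probability depends on θ:

```python
# Bellman recursion r_n(pi) = min{ r_0(pi), gamma + E_X[ r_{n-1}(pi(X)) ] }
# for a hypothetical two-hypothesis problem: theta in {0,1}, 0-1 loss, and a
# Bernoulli observation with P(X=1 | theta=0) = 0.2, P(X=1 | theta=1) = 0.8.
def r(n, pi, gamma=0.05, p0=0.2, p1=0.8):
    stop = min(pi, 1.0 - pi)                  # r_0: act now under 0-1 loss
    if n == 0:
        return stop
    cont = gamma                              # pay gamma, then observe X
    for x in (0, 1):
        like1 = p1 if x == 1 else 1.0 - p1    # P(X=x | theta=1)
        like0 = p0 if x == 1 else 1.0 - p0    # P(X=x | theta=0)
        px = pi * like1 + (1.0 - pi) * like0  # marginal P(X=x)
        post = pi * like1 / px                # pi(X): posterior P(theta=1 | x)
        cont += px * r(n - 1, post, gamma, p0, p1)
    return min(stop, cont)

# With observations available, sampling lowers the Bayes risk at pi = 0.5:
print(r(0, 0.5), r(2, 0.5))
```

The recursion exhibits the Bellman structure directly: the value with n stages left is the better of stopping now and paying for one observation followed by optimal play with n − 1 stages left.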
Extending to n observations gives her option (1) stated above, or an extension of option (2) in which observations are made for the remaining n − 1 stages, at an expected loss of γ + E_X[r_{n−1}(π(X))]. Accordingly, for n = 1, 2, ..., N:

r_n(π) = min{ r_0(π), γ + E_X[r_{n−1}(π(X))] }

Note the form of this recursion formula: it is exactly the dynamic programming, or backward induction, form introduced in the previous subsection. Thus dynamic programming allows for both the calculation of r_n(π) and the characterization of the optimal policy.

Section 13.3: Analysis in Extensive Form
There are two basic modes of analysis which can be utilized to determine which course of action will maximize the decision maker's utility: the extensive form of analysis and the normal form. The two forms are mathematically equivalent and lead to identical results, yet each contributes its own insight into the decision problem and has separate technical advantages in certain situations. This section focuses on the extensive form, which relates more readily to the use of decision trees, and extends the basic principles demonstrated in the earlier subsections to illustrate the analytic approach to solving decision problems as applied to utility theory, or more precisely to the maximization of utilities. Recall the notation introduced in Section 1.

Backwards Induction
Consider a simple decision tree in which an initial decision is followed by a chance outcome, and a subsequent action results in a final state. The extensive form of analysis proceeds by working backwards from the end of the decision tree (the right side of the tree) to the initial starting point: instead of starting by asking which experiment e the decision maker should choose, the procedure starts by asking which terminal act he should choose if he had already performed a particular experiment e and observed a particular outcome x. Even at this point the utilities of the various possible terminal acts are uncertain, because θ, which will be chosen by chance at the end, is still unknown; but this difficulty is easily resolved by treating the utility of any a for a given history (e, x) as a random variable and applying the expectation operator E′_θ|x.
Symbolically, with a known history (e, x), we can compute for any terminal act a

u*(e, x, a) ≡ E′_θ|x u(e, x, a, θ̃)

which takes the expected value of the random variable u(e, x, a, θ̃)³ with respect to the conditional measure P′_θ|x.

³ The unknown quantity θ is treated notationally as a random variable, θ̃, in this case.
This u*(e, x, a) is the utility, looking forward, of being at the terminal position with history (e, x) and act a chosen. The utility of being at the juncture (e, x) with the choice of a still to make is then

u*(e, x) ≡ max_a u*(e, x, a)

since the decision maker, faced with a given history (e, x), will wish to choose the a (or one of the a's, if more than one exists) for which u*(e, x, a) is greatest. After u*(e, x) has been computed in this way for all possible histories (e, x), the problem of the initial choice of an experiment can be addressed. At this point the utilities of the various possible experiments are uncertain only because x, which will be chosen by chance, is still unknown; this difficulty is resolved in exactly the same way that the difficulty in choosing a for a given (e, x) was resolved: by putting a probability measure over chance's moves and taking expected values.

Section 13.4: Analysis in Normal Form
The final product of the extensive form of analysis presented in the previous section can be thought of as a description of the optimal strategy consisting of two parts:
1. A prescription of the experiment e which should be performed.
2. A decision rule prescribing the optimal terminal act a for every possible outcome x of the chosen e.
These two parts summarize the backward computation. Because x̃ is a random variable at the initial decision point, the utility of any experiment e, with the choice of a still to make, can be defined for any e as

u*(e) ≡ E′_x|e u*(e, x̃)

where E′_x|e expects with respect to the marginal measure P′_x|e. Since the decision maker's objective is to maximize his expected utility, he will wish to choose the e for which u*(e) is greatest, before chance has made a choice of θ (Raiffa and Schlaifer, 2000); the utility of being at the initial decision node with the choice of e still to make is therefore

u* ≡ max_e u*(e) = max_e E′_x|e max_a E′_θ|x u(e, x̃, a, θ̃)

This procedure of working back from the outermost branches of the decision tree to the base of the trunk is often called "backward induction"; more descriptively, it could be called a process of "averaging out and folding back". The normal form of analysis likewise has as its end product the description of an optimal strategy, and it arrives at the same optimal strategy as the extensive form, but via a different route.
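The "averaging out and folding back" procedure can be sketched in a few lines. In this illustrative sketch (all numbers invented, not from the report), the inner expectation over θ is assumed already taken, so each u value below plays the role of E′_θ|x u; the routine then chooses the best act at every (e, x) juncture and averages out over x:

```python
# "Averaging out and folding back" on a tiny tree (all numbers invented).
# u[e][x][a] gives the expected utility of terminal act a at juncture (e, x),
# with the expectation over theta assumed already folded in; p_x[e][x] is
# P(x | e). Then u*(e) = E_{x|e} max_a u(e, x, a).
def fold_back(u, p_x):
    best_e, best_val = None, float("-inf")
    for e in u:
        # average out over x the value of the best act at each (e, x)
        val = sum(p_x[e][x] * max(u[e][x].values()) for x in u[e])
        if val > best_val:
            best_e, best_val = e, val
    return best_e, best_val

u = {
    "survey":    {"pos": {"drill": 60.0, "sell": 9.0},
                  "neg": {"drill": -15.0, "sell": 9.0}},
    "no_survey": {"none": {"drill": 10.0, "sell": 9.0}},
}
p_x = {"survey": {"pos": 0.3, "neg": 0.7}, "no_survey": {"none": 1.0}}
print(fold_back(u, p_x))
```

The outer loop then maximizes over experiments, which is precisely the backward induction formula u* = max_e E′_x|e max_a (...) above.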
θ ) . θ ) . Recall the decision rules as presented in Sections 2 and 3. δ ) = Eθ u * (e. then by taking the expectation of u[e. and incidentally these same results also enable the optimal decision rule to accompany any other e in E. xe u (e. ~. δ ) and a particular pair of values ( x. the decision maker’s act as prescribed by the rule will be a = δ (x) and his utility will be u (e. d ( ~ ). we obtain ~ ′ u * (e. δ ( x). also has as its end product the description of an optimal strategy. δ ) which maximizes his expected utility ~ u * (e. the normal form of analysis starts by explicitly considering every possible decision rule for a given e and then choosing the optimal rule for that e. using the same measures P x x xe . δ ) for a given state θ. Given a particular strategy (e. x The decision maker’s objective is therefore to choose the strategy (e.θ ] .θ we obtain x x u * (e. δ ) ≡ Eθ . Instead of first determining the optimal action a for every possible outcome x. Next taking the ~ expected value over θ with respect to the unconditional measure Pθ′ . u (e. but before ~ the experiment has been conducted and its outcome observed. even though the e in question is not itself optimal. θ′ ~ If e and δ are given and θ is held fixed.Part V: Analytic & Computational Methods for Decision Problems The whole decision rule for the optimal e can be simply determined from the part of the analysis which determined the optimal a for every x in X. which can be referred to as the unconditional utility of (e.θ u[e. ~. The normal form of analysis. δ ( x).θ ) . ~. δ ( ~ ). and arrives at the same optimal strategy as the extensive form but via a different route. 138 . and thus implicitly defining the optimal decision rule for any e. θ ) ≡ E xe . δ . This can be done for ∀e ∈ E such that an optimal e can be found as in the extensive form of analysis. a decision rule δ for a given experiment e is a mapping which carries x in X into δ (x) in A. 
Equivalently, the unconditional utility can be obtained as a double expectation over both random variables,

u*(e, δ) ≡ E_θ̃,x̃ u[e, x̃, δ(x̃), θ̃]

This double expectation will actually be accomplished by iterated expectation, and the iterated expectation can be carried out in either order: we can first expect over θ̃ holding x̃ fixed and then over x̃, or first over x̃ holding θ̃ fixed and then over θ̃, using the same measures P′_x|e,θ and P′_θ.
For any particular experiment e, the decision maker is free to choose the decision rule δ whose expected utility is greatest, and therefore the utility of any experiment is

u*(e) ≡ max_δ u*(e, δ)

It is then apparent that the best δ will be the one which, for every x, maximizes E′_θ|x u[e, x, δ(x), θ̃]. This, however, is exactly the same thing as selecting for every x an a_x which satisfies

E′_θ|x u(e, x, a_x, θ̃) = max_a E′_θ|x u(e, x, a, θ̃)

After computing the utility of every e ∈ E, the decision maker is free to choose the experiment with the greatest utility, such that

u* ≡ max_e u*(e) = max_e max_δ E_θ̃ E_x̃|e,θ u[e, x̃, δ(x̃), θ̃]

Section 13.5: Equivalence of the Extensive and Normal Forms⁴
The extensive and normal forms of analysis are equivalent if and only if they assign the same utility to every potential e in E; that is, if the formula

u_n*(e) = max_δ E_θ̃ E_x̃|e,θ u[e, x̃, δ(x̃), θ̃]

derived by the normal form of analysis agrees for all e with the formula

u_e*(e) = E′_x|e max_a E′_θ|x u(e, x̃, a, θ̃)

derived by the extensive form. The operation E_θ̃ E_x̃|e,θ is equivalent to the expectation over the entire possibility space Θ × X, and is therefore equivalent to E′_x|e E′_θ|x.

⁴ See Raiffa and Schlaifer (2000) for further details on the extensive and normal forms of analysis; this section has been adapted from their book on the analysis of statistical decision problems.
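The equivalence can be checked numerically on a toy problem. In the sketch below (numbers invented; the values u(x, a) stand for the already-computed posterior expectations E′_θ|x u), the extensive-form value E_x max_a agrees with the normal-form value obtained by enumerating every decision rule δ : X → A:

```python
import itertools

# Toy check that normal and extensive forms agree (invented numbers).
u = {("x1", "a1"): 5.0, ("x1", "a2"): 1.0,
     ("x2", "a1"): 0.0, ("x2", "a2"): 4.0}   # E_{theta|x} u, pre-computed
p_x = {"x1": 0.6, "x2": 0.4}
outcomes, acts = ["x1", "x2"], ["a1", "a2"]

# Extensive form: choose the best act at each outcome, then average out.
u_ext = sum(p_x[x] * max(u[(x, a)] for a in acts) for x in outcomes)

# Normal form: enumerate every decision rule delta: X -> A, pick the best.
u_norm = max(sum(p_x[x] * u[(x, d[i])] for i, x in enumerate(outcomes))
             for d in itertools.product(acts, repeat=len(outcomes)))

print(u_ext, u_norm)
```

The best rule found by enumeration assigns to each x exactly the act a_x that maximizes the inner expectation, which is the content of the equivalence argument above; the price of the normal form is that the number of rules grows exponentially in |X|.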
Section 13.6: Combination of Formal and Informal Analysis⁵
The general model stated in the overview of modeling a decision problem is often criticized on the grounds that utilities cannot rationally be assigned to the various possible (e, x, a, θ) combinations, because the costs, profits, or in general the consequences of these combinations would not be certain even if θ were known. In principle, such criticisms reflect nothing but an incomplete definition of the state space Θ, which can be made rich enough to include all possible values of these additional unknowns. Consider the state θ expressed as a doublet (θ^(1), θ^(2)), so that the state space has the form Θ = Θ^(1) × Θ^(2); θ^(1) might be the parameter of a Bernoulli process, for example, while θ^(2) might be the cost of the product. The decision maker's uncertainties about these values can then be evaluated, together with his other uncertainties, in the probability measure which is assigned to Θ. In terms of the original decision tree in Figure 4, a play was a 4-tuple (e, x, a, θ) and utilities were assigned directly to each (e, x, a, θ); if θ is split into (θ^(1), θ^(2)), a play is a 5-tuple (e, x, a, θ^(1), θ^(2)), and the utility of any (e, x, a, θ^(1)) is the expected value of the random variable u(e, x, a, θ^(1), θ̃^(2)), the expectation being taken with respect to the conditional measure on Θ^(2) given the history (e, x, a, θ^(1)). The analysis of the decision problem can then proceed as before. This asserts the idea that it is possible to consider a further split of the tree by conducting only a partial analysis (and partial construction of the tree) based on θ^(1) while holding θ^(2) constant. Besides cutting the decision tree before it is logically complete, the decision maker may rationally decide not to make a complete formal analysis of even a truncated tree which he has constructed.

⁵ See Raiffa and Schlaifer (2000), French and Insua (2000), and Hillier and Lieberman (2001).
As stated by the extensive form of analysis, and letting δ*(x) denote the optimal decision rule selected by the normal form, it has just been shown that δ*(x) = a_x and that the formulas for u_n*(e) and u_e*(e) are equivalent. Thus the extensive and normal forms of analysis require exactly the same inputs of information and yield exactly the same results, even though the intermediate steps in the analysis are different. The whole decision rule for the optimal e can be determined simply from the part of the analysis which determined the optimal a for every x in X. The extensive form has the merit that one has only to choose an appropriate action for the particular x which actually materializes: there is no need to find the decision rule which selects the best action for every x which might have occurred but in fact did not occur. In either form, however, one must evaluate u*(e) for all e in E in order to choose the best e. Thus, if E consists of two experiments e_1 and e_2, u*(e_1) may be formally worked out by evaluating
u*(e_1) = E′_x|e1 max_a E′_θ|x u(e_1, x̃, a, θ̃)

but after the formal analysis of e_1 is completed, it may be possible to conclude, without any formal analysis at all, that e_2 is not as good as e_1, and to adopt e_1: the decision maker will quite rationally decide to use e_1 without formally evaluating v = u*(e_2). That behavior is perfectly consistent with the principle of choice described in Sections 2 and 3 and readdressed in Section 6. Indeed, completely formal analysis and completely intuitive analysis are not the only possible methods of determining a utility such as u*(e_2): in many instances it is possible to make a partial analysis in order to gain some insight, at a less prohibitive cost than a full analysis entails. Before making a formal analysis of e_2, the decision maker can think of the unknown quantity u*(e_2) as a random variable ṽ. If the number v resulting from a formal analysis were known and were greater than the known number u*(e_1), then e_2 would be adopted rather than e_1, and the value of the information that ṽ = v could be measured by the difference v − u*(e_1) in the decision maker's utility which results from this change of choice.
If, on the contrary, v were less than u*(e_1), the decision maker would adhere to the original choice of e_1 and the information would have been worthless. In other words, the value of information regarding e_2 can be defined as the random variable

max{ 0, ṽ − u*(e_1) }

and before expending the cost of making a formal analysis of e_2 to obtain such information, the decision maker may prefer to compute its expected value by assigning a probability measure to ṽ and then expecting with respect to this measure. If the expected value is less than the cost, the formal analysis of e_2 is forgone. Operationally, one usually does not formally compute either the value or the cost of information on e_2: these are subjectively assessed. The computations could, of course, be formalized, but ultimately direct subjective assessments must be used if the decision maker is to avoid an infinite regress.

Section 13.7: Sensitivity Analysis⁶
Sensitivity analysis is an essential element of decision analysis. Its principle applies most readily to areas such as meta-analysis and cost-effectiveness analysis, but it is not confined to such approaches. This section simply describes the overall purpose of sensitivity analysis, and describes one-way analysis and its expansion to n-way analysis, as applied to decision analysis at a very general level.

⁶ See Petitti (2000), Hillier and Lieberman (2001), and Clemen (1996).
Sensitivity analysis evaluates the stability of the conclusions of an analysis with respect to the assumptions made in the analysis. An implicit assumption of decision analysis is that the values of the probabilities and of the utility measure are the correct values for these variables. When the assumed value of a variable affects the conclusion of the analysis, the analysis is said to be "sensitive" to that variable; when the conclusion does not change, the analysis is said to be "insensitive" to that variable. When a conclusion is shown to be invariant to the assumptions, confidence in the validity of the conclusions of the analysis is enhanced. Such analysis also helps identify the most critical assumptions of the analysis (Petitti, 2000).

In one-way sensitivity analysis, the assumed value of each variable in the analysis is varied, one at a time, while the values of all other variables in the analysis are held constant at baseline. The highest and the lowest values within a reasonable range are first substituted for the baseline estimate in the decision tree; if substitution of the highest or the lowest value changes the conclusions, more values within the range are substituted to determine where the conclusion changes. If an analysis is sensitive to the assumed value of a variable even when the sensitivity analysis is confined to values within a reasonable range, the likelihood that the extreme value is the true value can be assessed qualitatively, weighing the benefit of one strategy over the other under the extreme assumption. It is usual to do one-way sensitivity analysis for each variable in the analysis.

An extension of one-way sensitivity analysis is threshold analysis, in which the value of one variable is varied until the alternative decision strategies have equal expected outcomes. In such a case, neither of the alternative decision options being compared is clearly favored over the other, and there is no benefit of one alternative over the other in terms of estimated outcome; the threshold point is also called the break-even point, at which the decision is a "toss-up". Threshold analysis is especially useful when the intervention is being considered for use in groups that can be defined a priori based on the values of the variable that is the subject of the threshold analysis.
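A one-way threshold search can be sketched in a few lines. The helper names below are our own; the payoffs are those of the oil-drilling example developed in Section 14 (a $700,000 profit if oil is found, the $100,000 drilling loss if the land is dry, and a $90,000 sale price either way).

```python
# One-way threshold analysis: vary the probability p of the favourable state
# and find the break-even point where two strategies have equal expected value.
def expected(payoff_good, payoff_bad, p):
    return p * payoff_good + (1.0 - p) * payoff_bad

def threshold(strategy_a, strategy_b, step=0.0001):
    # Scan p over [0, 1]; return the first p at which the preference switches.
    p = 0.0
    prefer_a = expected(*strategy_a, p) >= expected(*strategy_b, p)
    while p <= 1.0:
        now_a = expected(*strategy_a, p) >= expected(*strategy_b, p)
        if now_a != prefer_a:
            return p          # the break-even ("toss-up") point
        p += step
    return None               # no switch: one strategy dominates

drill = (700_000.0, -100_000.0)   # payoff if oil, payoff if dry
sell = (90_000.0, 90_000.0)       # sale price is state-independent
print(round(threshold(drill, sell), 4))
```

Analytically the break-even probability solves 700000p − 100000(1 − p) = 90000, i.e. p = 0.2375, so with the 1-in-4 chance of oil assumed later, drilling is (narrowly) favored.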
In two-way sensitivity analysis, the expected outcome is determined for every combination of estimates of two variables, while the values of the other variables in the analysis remain fixed. It is usual to identify the pairs of values that equalize the expected outcome or expected utility of the alternatives, and it is simpler to interpret the results of a two-way sensitivity analysis with the aid of graphs. In n-way sensitivity analysis, the expected outcome is determined for every possible combination of every reasonable value of every variable; n-way sensitivity analysis is analogous to n-way regression, is difficult to interpret, and is not discussed any further in this report.
In an analysis with many probabilities there are numerous combinations of two and three variables, and the computational burden of doing all possible two-way and three-way sensitivity analyses is large; it is not usually feasible to conduct two-way sensitivity analysis for all possible combinations, let alone n-way. For this reason, the choice of variables for multiple-way analysis requires considerable judgment, and is not laid down by any hard and fast rules.

In simple terms, the solution of a statistical decision problem proceeds by modeling the decision maker's judgments by means of a loss function, a probability model of the observation process and a prior, and then uses these, through the posterior expected loss, to identify a 'good' decision rule. This suggests a variety of reasons why a sensitivity analysis should be conducted: that is, checking the sensitivity of the decision rule (the output) with respect to the model and the decision maker's judgments (the inputs). In other words, when applying Bayesian methodology, the objective is to check the impact of the loss function, the prior and the model on the Bayes decision rule or Bayes alternative. In Section 3, the notion of dominant alternatives was introduced, and this can be considered a type of sensitivity analysis. Another area of emphasis is sensitivity with respect to the prior, which is notably one of the most difficult elements to assess. Such analyses are typically performed when there is little imprecision in the loss, because a standard choice of inference loss such as the quadratic has been adopted; it is usually suggested to start with this case, as (a) it is the simplest and most thoroughly studied, and (b) it provides many insights for more general problems.
the graphical approach applied in these instances also lends itself to the discussion of dominance considerations. In an analysis with many probabilities. checking the sensitivity of the decision rule (output) with respect to the model and decision maker’s judgments (inputs). For this reason. In section 3. this presents a variety of reasons as to why a sensitivity analysis should be conducted – that is. probability model of the observation process and a prior. 143 .
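The one-way and threshold procedures described above can be sketched in a few lines of code. This is a minimal sketch: the payoffs and function names below are hypothetical placeholders chosen only for illustration, not values from the report.

```python
# One-way sensitivity analysis sketch: vary a single probability p while all
# other inputs stay at baseline. All numbers here are hypothetical
# placeholders, not values from the report.

def expected_payoff(p, win, lose):
    """Expected payoff of a gamble paying `win` with probability p, else `lose`."""
    return p * win + (1 - p) * lose

def favoured_region(win, lose, certain, steps=100):
    """Grid of p values at which the gamble beats the certain alternative."""
    return [i / steps for i in range(steps + 1)
            if expected_payoff(i / steps, win, lose) > certain]

# The conclusion is "sensitive" to p if the preferred alternative changes
# somewhere inside the plausible range of p; the smallest favoured p
# approximates the threshold (break-even) point.
region = favoured_region(win=500, lose=-200, certain=75)
print(min(region))  # grid approximation of the threshold (exact value 275/700)
```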
Section 14: Prototype Example

The conceptual ideas presented thus far have laid the foundation of statistical decision theory. The purpose of this section is to present a prototype example which exemplifies the key ideas and methods discussed in this report. The example has been illustrated in many statistical decision theory books in various forms, and it is developed here in much the same way as the methodological sections of this report: first as a problem of decision without experimentation, then with experimentation, and concluding with ways of refining the evaluation of the consequences of the various possible outcomes.

Section 14.1: Summary of the Problem

An oil company owns a tract of land that may contain oil. A consulting geologist has reported to management a belief that there is a 1 in 4 chance of oil. Another oil company has offered to purchase the land for $90,000. However, the landowning company is considering holding the land in order to drill for oil itself. The cost of drilling is $100,000. If oil is found, the resulting expected revenue will be $800,000, so the company's expected profit (after deducting the cost of drilling) will be $700,000. A loss of $100,000 will be incurred if the land is dry (no oil). This oil company is operating without much capital, so a loss of $100,000 would be quite serious. Deciding whether to drill or sell also hinges on the option to conduct a detailed seismic survey of the land to obtain a better estimate of the probability of finding oil. Table 14.1 summarizes the prospective profits for this landowning company.

Table 14.1: Prospective profits of oil company

                                 Payoff (Status of Land)
  Alternative                    Oil            Dry
  Drill for oil                  $700,000       -$100,000
  Sell the land                  $90,000        $90,000
  Chance of status               1 in 4         3 in 4

Section 14.2: Summary of the Decision Modeling Framework

The preceding sections have explained the decision modeling framework; here we recap it before proceeding with the prototype example. Statistical decision problems refer to problems containing data or observations on the state of nature. We first consider cases without that information, that is, problems of making decisions in the absence of data (or without experimentation). This lays a framework for decision analysis which is later extended to incorporate data (decision problems with experimentation) and various other decision methodologies.

Framework for Decision Analysis:
In general terms:

1. The decision maker needs to choose one of the alternative actions: a ∈ A. The set A contains all the feasible alternatives under consideration for how to proceed with the problem of concern. This choice must be made in the face of uncertainty, because the outcome will be affected by random factors that are outside the control of the decision maker.
2. Nature chooses one of the possible states of nature: θ ∈ Θ. These random factors determine what is referred to as a possible state of nature.
3. Each combination of an action and a state of nature results in a payoff, c(a, θ) ∈ C, given as one of the entries in a payoff/decision table (the terms payoff table and decision table are used interchangeably). The payoff is a quantitative measure of the value to the decision maker of the consequences of the outcome. Although in many instances the payoff is a monetary gain, other measures can be used (as explained in utility theory). If the consequences of the outcome do not become completely certain even when the state of nature is given, then the payoff becomes an expected value (in the statistical sense) of the measure of the consequences.
4. The payoff/decision table is used to find an optimal action for the decision maker according to an appropriate criterion which suits the decision maker's beliefs.

An additional element needs to be incorporated into this framework: prior probabilities. Many decision makers would generally want to incorporate additional information to account for the relative likelihood of the possible states of nature. This information is translated into a probability distribution in which the state of nature is considered to be a random variable, making up the prior probability distribution.

The decision problem faced by the oil company can now be placed within the framework just presented, summarized in the following decision table:

Table 14.2: Formulation of the problem within the framework of decision analysis

                           State of Nature Θ
  Alternative A            θ1 (oil)        θ2 (dry)
  a1 (drill)               c11 = 700       c12 = -100
  a2 (sell)                c21 = 90        c22 = 90
  Prior probability        0.25            0.75

Section 14.3: Decision Making without Experimentation
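The four elements of the framework map directly onto simple data structures. Below is a minimal sketch using the entries of Table 14.2; the variable names are my own.

```python
# Decision framework for the oil-company example (payoffs in $1000s).
actions = ["drill", "sell"]
states = ["oil", "dry"]

# payoff[action][state], from Table 14.2 (note the lost minus sign on c12
# in some printings: drilling on dry land is a loss of 100).
payoff = {
    "drill": {"oil": 700, "dry": -100},
    "sell":  {"oil": 90,  "dry": 90},
}

# Prior probabilities of the states of nature (the geologist's 1-in-4 chance of oil).
prior = {"oil": 0.25, "dry": 0.75}

# The two priors must sum to one: raising one lowers the other by the same amount.
assert abs(sum(prior.values()) - 1.0) < 1e-12
```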
This section illustrates the decision-making process without experimentation. The formulation of the problem has two possible actions under consideration: drill for oil or sell the land. The possible states of nature are that the land contains oil and that it does not. Since the consulting geologist has estimated that there is a 1 in 4 chance of oil, the prior probabilities of the two states of nature are 0.25 (for oil) and 0.75 (for no oil). Table 14.1 has been recast as Table 14.2 with the payoff units in thousands of dollars of profit. This payoff table will be used to determine the optimal action under the three main criteria discussed here.

1. Minimax Criterion: The minimax criterion was described in Section 2. For each possible action, find the minimum payoff over all possible states of nature; then find the maximum of these minimum payoffs, and choose the action that achieves it. Equivalently, in terms of losses, one determines for each action a the maximum loss over the possible states of nature, M(a) = max over θ of l(θ, a), and chooses the action that minimizes M(a). In Table 14.2, the payoff from selling the land cannot be less than 90 regardless of what the true state of nature turns out to be, so the sell alternative has the maximum of the minimum payoffs; the choice is to sell the land. The rationale of this criterion is that it provides the best guarantee of the payoff that will be obtained: it takes the pessimistic viewpoint that, regardless of which action is selected, the worst state of nature for that action is likely to occur, so one should choose the action whose worst state of nature yields the best payoff. The application of this criterion to the prototype example thus suggests that selling the land is the optimal action.

2. Maximum Likelihood Criterion: The maximum likelihood criterion focuses on the most likely state of nature. Recall a simple general definition of the likelihood function: for the observed data x, the function l(θ) = f(x | θ), considered as a function of θ, is called the likelihood function. The intuitive reason for the name is that a θ for which f(x | θ) is small is implausible, in that x would be a more plausible occurrence if f(x | θ) were large; this provides an ordering among the possible values in Θ. The steps in this approach begin by identifying the most likely state of nature (the one with the largest prior probability) and then, for this state of nature, finding the action with the maximum payoff. Applied to the prototype example, the dry state has the largest prior probability; in the dry column of Table 14.2, the sell alternative has the maximum payoff, so the choice is to sell the land.
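The two criteria just described can be sketched as follows, using the Table 14.2 payoffs in thousands of dollars; the function names are my own.

```python
# Minimax (here: maximin payoff) and maximum likelihood criteria for the
# oil-company payoff table (payoffs in $1000s).
payoff = {
    "drill": {"oil": 700, "dry": -100},
    "sell":  {"oil": 90,  "dry": 90},
}
prior = {"oil": 0.25, "dry": 0.75}

def maximin_action(payoff):
    """For each action take its worst-case payoff, then pick the best such action."""
    return max(payoff, key=lambda a: min(payoff[a].values()))

def max_likelihood_action(payoff, prior):
    """Focus on the most likely state of nature and pick the best action for it."""
    likely_state = max(prior, key=prior.get)
    return max(payoff, key=lambda a: payoff[a][likely_state])

print(maximin_action(payoff))                # worst cases: drill -100, sell 90
print(max_likelihood_action(payoff, prior))  # most likely state is "dry"
```

Both criteria select the sell alternative here, matching the analysis in the text.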
The appeal of this criterion is that the most important state of nature is the most likely one, so the action chosen is the best one for this particularly important state of nature. Moreover, the criterion does not rely on questionable subjective estimates of the probabilities of the respective states of nature, other than identifying the most likely state. However, its major drawback is that it completely ignores much relevant information: no state of nature besides the most likely one is considered, and in a problem with many possible states of nature, the probability of the most likely one may be quite small, or may differ little from that of other "quite likely" states of nature.

3. Bayes' Decision Rule (Expected Monetary Value Criterion): The Bayes decision rule was introduced in Section 2, and the methodology of including prior information was described in Section 3. This approach uses the best available estimates of the probabilities of the respective states of nature (currently the prior probabilities) and calculates the expected value of the payoff for each of the possible actions; one then chooses the action with the maximum expected payoff. With g(θi) denoting the probability weight assigned to state of nature θi, the expected payoff for a given action a is

B(a) = Σi g(θi) l(θi, a).

The expected payoffs can be calculated from Table 14.2 directly as follows:

E[c1] = g(θ1)l(θ1, a1) + g(θ2)l(θ2, a1) = 0.25(700) + 0.75(−100) = 100,
E[c2] = g(θ1)l(θ1, a2) + g(θ2)l(θ2, a2) = 0.25(90) + 0.75(90) = 90.

Since 100 > 90, the alternative action to drill is selected. Note that this choice differs from that of the two preceding criteria. The advantage of the Bayes decision rule is that it incorporates all the available information, including the payoffs and the best available estimates of the probabilities of the respective states of nature. It is sometimes argued that these probability estimates are largely subjective and so are too shaky to be trusted; nevertheless, under many circumstances, past experience and current evidence enable one to develop reasonable estimates of the probabilities. Before applying this approach further to the problem at hand, we consider the use of sensitivity analysis to assess the effect of possible inaccuracies in the prior probabilities.
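Bayes' decision rule for the prototype example reduces to a few lines; this is a sketch and the function names are my own.

```python
# Bayes' decision rule: choose the action with maximum expected payoff
# under the prior (payoffs in $1000s, from Table 14.2).
payoff = {
    "drill": {"oil": 700, "dry": -100},
    "sell":  {"oil": 90,  "dry": 90},
}
prior = {"oil": 0.25, "dry": 0.75}

def expected_payoff(action, probs):
    """B(a) = sum over states of g(theta) * l(theta, a)."""
    return sum(probs[s] * payoff[action][s] for s in probs)

def bayes_action(probs):
    return max(payoff, key=lambda a: expected_payoff(a, probs))

print(expected_payoff("drill", prior))  # 0.25*700 + 0.75*(-100) = 100
print(expected_payoff("sell", prior))   # 0.25*90 + 0.75*90 = 90
print(bayes_action(prior))              # drilling maximizes expected payoff
```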
Section 13 briefly discussed the role and importance of sensitivity analysis in various applications: studying the effect on the conclusions if some of the elements included in the mathematical model were not correct. The decision table in Table 14.2 representing the payoffs is the mathematical model of concern here. The prior probabilities in this model are the most questionable element and will be the focus of the sensitivity analysis to be conducted; a similar approach could, however, be applied to the payoffs given in the table.

The oil company's management team feels that the true "chances" of having oil on the tract of land are more likely to lie in the range from 0.15 to 0.35. Basic probability theory requires that the two prior probabilities sum to 1, so increasing one of these probabilities automatically decreases the other by the same amount, and vice versa; the corresponding prior probability of the land being dry therefore ranges from 0.85 to 0.65. Conducting a sensitivity analysis in this situation requires applying the Bayes' decision rule twice: once when the prior probability of oil is at the lower end of this range (0.15) and once when it is at the upper end (0.35). When the prior probability of oil is conjectured to be 0.15, the Bayes' decision rule finds that

E[c1] = g(θ1)l(θ1, a1) + g(θ2)l(θ2, a1) = 0.15(700) + 0.85(−100) = 20,
E[c2] = g(θ1)l(θ1, a2) + g(θ2)l(θ2, a2) = 0.15(90) + 0.85(90) = 90,

and when the prior probability is thought to be 0.35, we find

E[c1] = g(θ1)l(θ1, a1) + g(θ2)l(θ2, a1) = 0.35(700) + 0.65(−100) = 180,
E[c2] = g(θ1)l(θ1, a2) + g(θ2)l(θ2, a2) = 0.35(90) + 0.65(90) = 90.

Thus, if the prior probability of oil is closer to 0.15, the optimal action would be to sell the land rather than to drill for oil as suggested by the other prior probabilities. The decision is therefore very sensitive to the prior probability of oil: the expected payoff of drilling shifts from 20 to 100 to 180 as the prior probability moves from 0.15 to 0.25 to 0.35. This suggests that it would be valuable to determine the true value of the probability of oil more accurately.
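The same calculation at the ends of management's plausible range can be scripted directly; this is a sketch reusing the Table 14.2 payoffs.

```python
# Bayes' decision rule at the ends of the plausible prior range for oil.
def e_drill(p_oil):
    """Expected payoff of drilling ($1000s) for a given prior probability of oil."""
    return p_oil * 700 + (1 - p_oil) * (-100)

for p in (0.15, 0.25, 0.35):
    # Selling always yields 90, so compare drilling against that benchmark.
    decision = "drill" if e_drill(p) > 90 else "sell"
    print(p, round(e_drill(p), 6), decision)
# approximately: 0.15 -> 20 (sell), 0.25 -> 100 (drill), 0.35 -> 180 (drill)
```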
Let p = prior probability of oil. Then the expected payoff from drilling for any p is

E[c1] = 700p − 100(1 − p) = 800p − 100,

where c1 is the expected payoff of the action to drill and, correspondingly, c2 is the expected payoff of the action to sell the land. The value of p at which the decision shifts is simple to determine: setting

E[c1] = E[c2],
800p − 100 = 90,
p = 190/800 = 0.2375.

Thus the conclusion should be to sell the land if p < 0.2375 and to drill for oil if p > 0.2375.

Figure 14.1 (graph omitted here): Expected payoff of each alternative (drill for oil, sell the land) as a function of the prior probability of oil. The point where the two lines intersect is the threshold point where the decision shifts from one alternative (selling the land) to the other (drilling for oil) as the prior probability increases.

Because the decision for the oil company depends so heavily on the true probability of oil, serious consideration should be given to conducting a seismic survey to estimate this probability more accurately. This is considered in the next subsection.

Section 14.4: Decision Making with Experimentation

Here another element is added to the decision analysis framework: posterior probabilities. This additional element allows for testing or experimentation to be conducted.
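The break-even prior follows from solving the linear equation above; a minimal sketch (the function name is my own):

```python
# Break-even (threshold) prior probability p*: solve p*win + (1-p)*lose = certain,
# which gives p* = (certain - lose) / (win - lose).
def threshold(win, lose, certain):
    return (certain - lose) / (win - lose)

p_star = threshold(win=700, lose=-100, certain=90)
print(p_star)  # 190/800 = 0.2375: sell if p < 0.2375, drill if p > 0.2375
```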
Similarly. Thus.2. n. This data is used to find the posterior probabilities of the respective states of nature given the seismic readings. i. then the probability of unfavorable seismic soundings (USS) is P ( USS  State = Oil) = 0. oil is likely to be found. 150 . Dividing the possible findings of the survey into the following two categories: USS: unfavorable seismic soundings. Therefore for each i=1.2. improving the preliminary estimates of the probabilities of the respective states of natures.4 = 0. thus.6 . …. what is P(Θ = θi X = xj)? The answer is solved by following standard formulas of probability theory which states that the conditional probability. if there is oil. and FSS: favorable seismic soundings.8.4. the prior probabilities.n. X = finding from experimentation ( a random variable). so P(FSS  State = Oil) = 1 − 0. The cost is $30. P(Θ = θi) = prior probability that true state of nature is state i. for i =1. n. for j = 1.000. then the probability of unfavorable seismic soundings is estimated to be P ( USS  State = Dry) = 0. the question being addressed is: Given P(Θ = θi) and P(X = xj  Θ = θi). 2. let n = number of possible states of nature. given X = xj.8 = 0. if there is no oil (state of nature = Dry). 2. oil is unlikely.. Now suppose before choosing an action. …. For this problem. …. for i =1..Part V: Analytic & Computational Methods for Decision Problems conducted. P(Θ = θi X = xj) = posterior probability that true state of nature is state i. n.e. The main ‘criterion’ implemented here is Bayes Theorem. 2. xj = one possible value of the finding. A seismic survey obtains seismic soundings that indicate whether the geological structure if favorable to the presence of oil. an outcome from an experiment is observed X = x. the corresponding posterior probability is P (θ = θ i  X = x) = P( X = x  θ = θ i ) P (θ = θ i ) ∑ P( X = x  θ = θ k =1 n k ) P(θ = θ k ) Proceeding in general terms. 
an available option before making a decision is to conduct a detailed seismic survey of the land to obtain a better estimate of the probability of oil. so P(FSS  State = Dry) = 1 − 0. Based on past experience..
25(0.25) + 0. Similarly. one finds if the finding of the seismic survey is unfavorable.5 0. After these computations have been completed. the desired formula for the corresponding posterior probability is the Bayes theorem as stated above. we obtain the results shown below.15 0.1/0. then the posterior probabilities are 0.3=0.15/0.2)=0. 0.7 5 0.8(0.25(0.2 0.2 and subtracting the cost of experimentation. then 0. Such a diagram is presented in Figure 142. 2 2 P (State = Oil  Finding = USS) = P(Θ = θ1  X = x1 ) = Probability Tree diagrams can be a useful and visually appealing way of organizing the outcomes and the corresponding probabilities. Therefore.15 0.6 Posterior Probabilities 0.4)=0.25) 1 = .6/0.8 0. 2.6 0. for i =1.75(0. 7 7 Note that USS = x1 and FSS = x2.7=0.25) 1 P (State = Oil  Finding = F SS) = P(Θ = θ1  X = x 2 ) = = . if the seismic survey is favorable. with the posterior probabilities replacing the prior probabilities.5 0. Reading the probability diagram from left to right one can see all the probabilities – prior probabilities to conditional probabilities to joint probabilities – leading to the posterior probabilities. n.75) 2 1 1 P (State = Dry  Finding = FSS) = P(Θ = θ 2  X = x 2 ) = 1 − = .6(0.6(0. X = x j ). Again using the “payoffs” (in units of thousands of dollars) from Table 14.25) + 0. 0.1 0. Figure 142: Prior Probabilities Conditional Probabilities Joint Probabilities 0.7=0. Returning to the prototype example and applying this formula. P ( X = x j ) = ∑ P(Θ = θ k .75) 7 1 6 P (State = Dry  Finding = USS) = P(Θ = θ 2  X = x1 ) = 1 − = .3=0.86 151 .2 0.15/0. X = x j ) P( X = x j ) .14 0.2(0. k =1 n gives P (Θ = θ i  X = x j ) = P( X = x j  Θ = θ i ) P (Θ = θ i ).4(0.Part V: Analytic & Computational Methods for Decision Problems P (Θ = θ i  X = x j ) = where P (Θ = θ i .4 5 0 . Bayes’ decision rule can be applied again just as before.6)=0.4(0. ….8)=0.75(0.
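The posterior computation by Bayes' theorem can be sketched directly, using the values from the prototype example; the names below are my own.

```python
# Posterior probabilities of the states of nature given each survey finding.
prior = {"oil": 0.25, "dry": 0.75}
likelihood = {"oil": {"USS": 0.4, "FSS": 0.6},   # P(finding | state)
              "dry": {"USS": 0.8, "FSS": 0.2}}

def posterior(finding):
    """Return P(finding) and the dict of P(state | finding) via Bayes' theorem."""
    joint = {s: prior[s] * likelihood[s][finding] for s in prior}
    total = sum(joint.values())                   # P(finding), the normalizer
    return total, {s: joint[s] / total for s in joint}

p_uss, post_uss = posterior("USS")   # P(USS) = 0.7, P(oil | USS) = 1/7
p_fss, post_fss = posterior("FSS")   # P(FSS) = 0.3, P(oil | FSS) = 1/2
print(post_uss["oil"], post_fss["oil"])
```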
The expected payoffs (after subtracting the $30 survey cost) if the finding is unfavorable seismic soundings are

E[c1(a1 | X = x1)] = (1/7)(700) + (6/7)(−100) − 30 = −15.7,
E[c2(a2 | X = x1)] = (1/7)(90) + (6/7)(90) − 30 = 60,

and if the finding is favorable seismic soundings,

E[c1(a1 | X = x2)] = (1/2)(700) + (1/2)(−100) − 30 = 270,
E[c2(a2 | X = x2)] = (1/2)(90) + (1/2)(90) − 30 = 60.

Since the objective is to maximize the expected payoff, these results yield the optimal policy shown in Table 14.3.

Table 14.3: Optimal policies

  Finding from      Optimal    Expected Payoff             Expected Payoff
  Seismic Survey    Action     Excluding Cost of Survey    Including Cost of Survey
  USS               Sell       90                          60
  FSS               Drill      300                         270

However, this analysis has not yet addressed the issue of whether the expense of conducting a seismic survey is truly worthwhile, or whether one should simply choose the optimal action without experimentation. This issue is addressed next. The value of experimentation can be determined by two complementary methods.

1. The first method assumes (unrealistically) that the experiment will remove all uncertainty about the true state of nature, and then calculates the resulting "improvement" in the expected payoff (ignoring the cost of the experiment). This quantity, the expected value of perfect information, provides an upper bound on the potential value of the experiment. If this upper bound is less than the cost of the experiment, the experiment should be forgone; if it exceeds the cost of the experiment, then the second method should be implemented.
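The optimal policy of Table 14.3 can be reproduced by combining the posteriors with the payoff table; a sketch, with the survey cost and payoffs all in thousands of dollars:

```python
# Expected payoff of each action given each survey finding, net of the
# $30 survey cost (all amounts in $1000s).
payoff = {"drill": {"oil": 700, "dry": -100},
          "sell":  {"oil": 90,  "dry": 90}}
posteriors = {"USS": {"oil": 1 / 7, "dry": 6 / 7},
              "FSS": {"oil": 1 / 2, "dry": 1 / 2}}
SURVEY_COST = 30

def expected(action, probs, cost=0):
    return sum(probs[s] * payoff[action][s] for s in probs) - cost

policy = {}
for finding, post in posteriors.items():
    best = max(payoff, key=lambda act: expected(act, post, SURVEY_COST))
    policy[finding] = (best, expected(best, post, SURVEY_COST))
print(policy)  # approximately: USS -> ("sell", 60), FSS -> ("drill", 270)
```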
2. The second method calculates the actual "improvement" in the expected payoff (again ignoring the cost of the experiment) that would result from performing the experiment. Comparing this improvement with the cost indicates whether the experiment should be performed.

Expected Value of Perfect Information

Suppose that the experiment could definitely identify the true state of nature, thereby providing "perfect" information. Whichever state of nature is identified, one naturally chooses the action with the maximum payoff for that state. Since we do not know in advance which state of nature will be identified, calculating the expected payoff with perfect information (ignoring the cost of the experiment) requires weighting the maximum payoff for each state of nature by the prior probability of that state:

Expected payoff with perfect information = 0.25(700) + 0.75(90) = 242.5.

Thus, if the oil company could learn whether the land contains oil before choosing its action, the expected payoff as of now (before acquiring this information) would be $242,500, excluding the cost of the experiment generating the information. To evaluate whether the experiment should be conducted, we calculate the expected value of perfect information (EVPI)[8]:

EVPI = E[payoff with perfect information] − E[payoff without experimentation].

The expected payoff without experimentation was determined earlier to be 100 under Bayes' decision rule, so

EVPI = 242.5 − 100 = 142.5.

Since 142.5 far exceeds 30, the cost of experimentation (a seismic survey) may well be worthwhile. However, since experimentation usually cannot provide perfect information, EVPI provides only an upper bound on the expected value of experimentation. Thus we carry on and implement the second method of evaluating the potential benefit of experimentation.

[8] The value of perfect information is a random variable equal to the payoff with perfect information minus the payoff without experimentation. The expected value of perfect information is the expected value of this random variable.

Expected Value of Experimentation

Now we determine this expected increase directly; it is referred to as the expected value of experimentation. Calculating this quantity requires first computing the expected payoff with experimentation (excluding the cost of experimentation):

Expected payoff with experimentation = Σj P(X = xj) E[c | X = xj],

where the summation is taken over all possible j and c denotes the payoff or consequence. Obtaining this quantity requires the previously determined posterior probabilities. For the prototype example, much of this work has already been done. The values of P(X = xj) for each of the two possible findings from the seismic survey, unfavorable or favorable, were calculated at the bottom of the probability tree diagram in Figure 14.2 to be P(x1) = 0.7 and P(x2) = 0.3.
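Both quantities can be sketched numerically; the EVE portion below uses the per-finding optimal payoffs of Table 14.3 (excluding the survey cost), and the variable names are my own.

```python
# EVPI and EVE for the prototype example (all figures in $1000s).
prior = {"oil": 0.25, "dry": 0.75}
payoff = {"drill": {"oil": 700, "dry": -100},
          "sell":  {"oil": 90,  "dry": 90}}

# Expected payoff without experimentation: Bayes' rule under the prior.
no_exp = max(sum(prior[s] * payoff[a][s] for s in prior) for a in payoff)

# With perfect information the state is learned first, so average the
# per-state maxima under the prior: 0.25*700 + 0.75*90 = 242.5.
perfect = sum(prior[s] * max(payoff[a][s] for a in payoff) for s in prior)
evpi = perfect - no_exp        # upper bound on the value of any experiment

# With the imperfect survey: weight the best expected payoff for each finding
# (excluding the survey cost, from Table 14.3) by the finding's probability.
p_finding = {"USS": 0.7, "FSS": 0.3}
best_given = {"USS": 90, "FSS": 300}
with_exp = sum(p_finding[k] * best_given[k] for k in p_finding)
eve = with_exp - no_exp        # about 53, exceeding the survey cost of 30
print(evpi, eve)
```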
For the optimal policy with experimentation, the corresponding expected payoff (excluding the cost of experimentation) for each possible finding can be obtained from Table 14.3 as

E(c | X = x1) = 90,   E(c | X = x2) = 300.

Each of these expected payoffs is then weighted by the probability of the corresponding finding, so the expected payoff with experimentation is determined to be

0.7(90) + 0.3(300) = 153.

Similarly to before, the expected value of experimentation (EVE) can be calculated as

EVE = E[payoff with experimentation] − E[payoff without experimentation] = 153 − 100 = 53,

which identifies the potential value of the experimentation. Since this value exceeds 30, the cost of conducting a detailed seismic survey is shown to be worthwhile.

Section 14.5: Decision Trees

Decision trees provide a useful visual display of a sequential decision process and its consequences, and can help in organizing the computational work described thus far. This prototype problem involves a sequence of two decisions, each resulting in possible consequences which in turn lead to further decisions and consequences:

1. Should a seismic survey be conducted before an action is chosen?
2. Which action (drill for oil or sell the land) should be chosen?

The corresponding decision tree, with the numbers added and the computations performed, is displayed in Figure 14.3. Note that this is the final, completed tree, shown in this form to save space.

Recall from Section 9 the interpretation of the specific types of nodes. The nodes of the decision tree are referred to as forks, and the arcs are called branches. A decision fork, represented by a square, indicates that a decision needs to be made at that point in the process. A chance fork, represented by a circle, indicates that a random event occurs at that point. In this example, the first decision is represented by decision fork a. Fork b is a chance fork representing the random event of the outcome of the seismic survey; the two branches emanating from fork b represent the two possible outcomes of the survey. The decisions at forks c and d follow the possible survey outcomes and lead to the random events at chance forks f and g, where the two branches again correspond to the two possible states of nature. Decision fork e leads to chance fork h when no survey is performed. The numbers over or under the branches that are not in parentheses are the cash flows that occur at those branches. Since the decision tree is filled in with the available information, some interpretation can be made before the actual analysis proceeds. The resulting total payoff is recorded in boldface at the end of each terminal branch.
In parentheses, the probabilities of the random events are recorded: the probabilities emanating from chance fork b are the probabilities of the two survey findings, those emanating from chance forks f and g are the posterior probabilities of the states of nature given the finding from the seismic survey, and those emanating from chance fork h are the prior probabilities, since no seismic survey was conducted on that path.

Performing the Analysis

The backward induction procedure was introduced in Section 10 and is implemented here for the prototype example. The steps involved can be summarized as follows:

1. Start at the right side of the decision tree and move left one column at a time. For each column, perform either step 2 or step 3 depending on whether the forks in that column are chance forks or decision forks.
2. For each chance fork, calculate its expected payoff by multiplying the expected payoff of each branch (shown in boldface to the right of the branch) by the probability of that branch and then summing these products. Record this expected payoff in boldface next to the fork, and designate this quantity as also being the expected payoff for the branch leading to the fork.
3. For each decision fork, compare the expected payoffs of its branches and choose the alternative whose branch has the largest expected payoff. Record the choice on the decision tree by inserting a double dash as a barrier through each rejected branch.

To begin the procedure, consider the rightmost column of forks, namely chance forks f, g, and h. Applying step 2, their expected payoffs are calculated as

E[cf] = (1/7)(670) + (6/7)(−130) = −15.7,
E[cg] = (1/2)(670) + (1/2)(−130) = 270,
E[ch] = (1/4)(700) + (3/4)(−100) = 100,

where E[ck] denotes the expected payoff or consequence of the kth chance fork. These expected payoffs are placed above these forks as displayed in Figure 14.3.

Next, moving one column to the left, the forks are the decision forks c, d, and e, and step 3 can be applied as follows:

E[cc(a1)] = −15.7, E[cc(a2)] = 60; since 60 > −15.7, choose a2 = sell.
E[cd(a1)] = 270, E[cd(a2)] = 60; since 270 > 60, choose a1 = drill.
E[ce(a1)] = 100, E[ce(a2)] = 90; since 100 > 90, choose a1 = drill.

Next, move one column to the left again and apply step 2 to chance fork b:

E[cb] = 0.7(60) + 0.3(270) = 123.

In the final step, we consider the last decision fork a and apply step 3 as before, where a01 denotes the action to do a seismic survey and a02 the action not to conduct one:

E[ca(a01)] = 123, E[ca(a02)] = 100; since 123 > 100, choose a01 = do the seismic survey.

Now the decision tree is complete, with the payoffs given on the far right side.

Figure 14.3: Final decision tree for the oil company problem, showing the cash flows on the branches, the probabilities of random events in parentheses, the expected payoffs in boldface, and double dashes through the rejected branches.
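The backward induction just performed can be sketched as a small program; the node names follow Figure 14.3, the cash flows are net of the survey and drilling costs, and the function names are my own.

```python
# Backward induction on the oil-company decision tree (cash flows in $1000s).
# Terminal payoffs are net of the $30 survey and $100 drilling costs
# where those apply (e.g., 700 - 30 = 670, -100 - 30 = -130, 90 - 30 = 60).

def chance(branches):
    """branches: iterable of (probability, value); returns the expected value."""
    return sum(p * v for p, v in branches)

def decide(options):
    """options: dict action -> value; returns (best_action, best_value)."""
    best = max(options, key=options.get)
    return best, options[best]

f = chance([(1 / 7, 670), (6 / 7, -130)])  # drill after an unfavorable survey
g = chance([(1 / 2, 670), (1 / 2, -130)])  # drill after a favorable survey
h = chance([(1 / 4, 700), (3 / 4, -100)])  # drill with no survey

c = decide({"drill": f, "sell": 60})       # decision after unfavorable finding
d = decide({"drill": g, "sell": 60})       # decision after favorable finding
e = decide({"drill": h, "sell": 90})       # decision with no survey

b = chance([(0.7, c[1]), (0.3, d[1])])     # chance fork: outcome of the survey
a = decide({"survey": b, "no survey": e[1]})
print(a)  # chooses "survey" with an expected payoff of about 123
```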
The double dashes have blocked off the undesirable paths. Therefore, when applying Bayes' decision rule, follow the open paths from left to right to obtain the optimal policy:

• Do the seismic survey.
• If the result is unfavorable, sell the land.
• If the result is favorable, drill for oil.
• The expected payoff (including the cost of the seismic survey) is $123,000.

This is the same unique optimal solution obtained in the earlier section without the decision tree, as shown in Table 14.3. The optimal policy obtained via the backward induction process agrees with the policy or policies obtained via the more formal process.

Section 14.6: Utility Theory

Thus far, we have assumed that the expected payoff in monetary terms is the appropriate measure of the consequences of taking an action. However, in many situations this assumption is inappropriate. Fortunately, there is a way of transforming monetary values to an appropriate scale that reflects the decision maker's preferences: the utility function for money. This section focuses on various models of individual rationality, which can be categorized by an individual's traits toward risk.

A utility function for money describes an individual's behavior pertaining to money. A person who is risk-averse exhibits a decreasing marginal utility for money, whereas individuals who are risk seekers demonstrate an increasing marginal utility for money. The intermediate case is that of a risk-neutral individual, who prizes money at its face value; such an individual's utility for money is simply proportional to the amount of money involved. It is also possible to exhibit a mixture of these kinds of behavior: for example, an individual might be essentially risk-neutral with small amounts of money, then become a risk seeker with moderate amounts, and then turn risk-averse with large amounts. Moreover, one's attitude toward risk can change over time and circumstances. The fact that different people have different utility functions for money has an important implication for decision making in the face of uncertainty: when a utility function for money is incorporated into a decision analysis approach to a problem, the utility function must be constructed to fit the preferences and values of the decision maker involved.

The key to constructing the utility function for money to fit the decision maker is based on the fundamental properties explained in Section 6. The following fundamental property summarizes, in words, the central notion: under the assumptions of utility theory, the decision maker's utility for money has the property that the decision maker is indifferent between two alternative courses of action if the two alternatives have the same expected utility. The scale of the utility function is irrelevant; it is only the relative values of the utilities that matter. All the utilities can be multiplied by any positive constant without affecting which alternative course of action will have the largest expected utility.
To apply the decision maker's utility function for money to the problem as described earlier, it is necessary to identify the utilities for all of the possible monetary payoffs. Returning to the prototype example, recall that the summary of the problem stated that the oil company was operating without much capital, so a loss of $100,000 would be steep: the worst-case scenario is to conduct the seismic survey and then drill when there is no oil. On the other hand, striking oil is an exciting prospect of earning $700,000. These payoffs (in thousands of dollars) and their corresponding utilities are presented in Table 14.4 and can be obtained by the following methodology.

Table 14.4: Utilities for the oil company

Monetary Payoff     Utility
−130                −150
−100                −105
60                  60
90                  90
670                 580
700                 600

As a starting point in constructing the utility function, it is natural to let the utility of zero money be zero, so u(0) = 0. An appropriate next step is to consider the worst scenario and the best scenario, and to address the question of what value of p would make the decision maker indifferent between two alternatives: a gamble paying 700 with probability p and −130 with probability 1 − p, or definitely obtaining a payoff of 0. Suppose the decision maker's choice is p = 1/5. This choice of p implies that

(4/5) u(−130) + (1/5) u(700) = 0 (utility of alternative 1).

The value of either u(−130) or u(700) can be set arbitrarily, to establish the scale of the utility function. By choosing u(−130) = −150, this equation yields u(700) = 600.

To identify u(−100), a choice of p is made that makes the decision maker indifferent between a payoff of −130 with probability p or definitely incurring a payoff of −100. The choice is p = 0.7, so u(−100) = p u(−130) = 0.7(−150) = −105.

To summarize the basic role of the utility function in decision analysis: when the decision maker's utility function for money is used to measure the relative worth of the various possible monetary outcomes, Bayes' decision rule replaces the monetary payoffs by the corresponding utilities, and the optimal action (or sequence of actions) is the one which maximizes the expected utility. Although only utility functions for money are discussed in this section, utility functions can also be constructed for consequences of the alternative actions that are not monetary.
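The indifference arithmetic above can be checked numerically. This sketch replays the elicitation equations exactly as stated in the text; the dictionary u is just bookkeeping of my own.

```python
# Reconstructing the elicited utilities from the indifference probabilities.

# Scale: set u(0) = 0 and u(-130) = -150 arbitrarily (only relative values matter).
u = {0: 0.0, -130: -150.0}

# Worst-vs-best lottery: indifferent between {700 w.p. 1/5, -130 w.p. 4/5}
# and 0 for certain, so (1/5) u(700) + (4/5) u(-130) = u(0) = 0.
p = 1 / 5
u[700] = -(1 - p) * u[-130] / p          # = 600

# u(-100): indifferent between {-130 w.p. 0.7} and -100 for certain
# (implicitly, the complementary outcome with probability 0.3 is a payoff of 0).
u[-100] = 0.7 * u[-130]                  # = -105

# u(90): indifferent between {700 w.p. 0.15} and 90 for certain.
u[90] = 0.15 * u[700]                    # = 90

print(u)
```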
To obtain u(90), a value of p is selected that makes the decision maker indifferent between a payoff of 700 with probability p or definitely obtaining a payoff of 90. The value chosen is p = 0.15, so u(90) = p u(700) = 0.15(600) = 90.

To obtain the decision maker's utility function for money, a smooth curve was drawn through u(−130), u(−100), u(90), and u(700), as shown in Figure 10.4. The values on this curve at M = 60 and M = 670 provide the corresponding utilities, u(60) = 60 and u(670) = 580, which completes the list of utilities displayed in Table 14.4. For contrast, a dashed line drawn at 45° shows the monetary value of the payoffs used exclusively in the preceding sections. Note how u(M) essentially equals M for small values of M, and then how u(M) gradually falls below M for larger values of M; this is typical of a moderately risk-averse individual. There are other approaches to estimating u(M); however, they are not necessarily appropriate in this instance.

Decision trees can also be used in a fashion identical to the analysis in the preceding section, except for substituting utilities for the monetary payoffs.

Section 14.7: Combination of Formal and Informal Analysis

Continuing with the same problem (with some variety in the treatment), the desirability of drilling depends on the amount of oil which will be found: oil or no oil. Before making this decision, the oil company wishes to obtain more geological and geophysical evidence by means of a seismographic recording, which is quite expensive. It is also assumed that these recordings, if made, will give completely reliable information that one of three conditions prevails: (1) there is no subsurface structure, (2) there is an open subsurface structure, or (3) there is a closed subsurface structure.

For this analysis of a drilling decision we add an extra element, the space of experiments E, to the act space A, the state space Θ, and the outcome space X. The descriptions of the four spaces are summarized in Table 14.5, and the possible sequences of choices are displayed in the decision tree in Figure 14.5. We also give a more general analysis perspective to end this section.

Table 14.5: Possible Choices

Space   Elements          Interpretations
A       a1, a2            a1: drill, do not sell location; a2: do not drill, sell location
Θ       θ1, θ2            θ1: oil; θ2: no oil
E       e0, e1            e0: do not take seismic readings; e1: take seismic readings
X       x0, x1, x2, x3    x0: dummy outcome of e0; x1: e1 reveals no structure; x2: e1 reveals an open structure; x3: e1 reveals a closed structure
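The two utilities u(60) and u(670) above were read off a smooth hand-drawn curve rather than elicited directly. A piecewise-linear interpolation through the elicited points is a crude stand-in for that curve: it reproduces u(60) = 60 exactly (the curve coincides with the 45° line there), but it gives roughly 575 rather than 580 at M = 670, because a straight chord undershoots a smooth concave curve.

```python
# Piecewise-linear stand-in for the smooth utility curve drawn through the
# elicited points (payoffs in $1000s).

ELICITED = [(-130, -150.0), (-100, -105.0), (0, 0.0), (90, 90.0), (700, 600.0)]

def u(m: float) -> float:
    """Linearly interpolate the utility of a payoff m between elicited points."""
    pts = sorted(ELICITED)
    if not pts[0][0] <= m <= pts[-1][0]:
        raise ValueError("payoff outside the elicited range")
    for (m0, u0), (m1, u1) in zip(pts, pts[1:]):
        if m0 <= m <= m1:
            return u0 + (u1 - u0) * (m - m0) / (m1 - m0)

print(u(60), u(670))   # 60.0 exactly; about 575 at M = 670 (the smooth curve gave 580)
```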
Assignment of Utilities

The psychological stimulus associated in this problem with an (e, a, x, θ) 4-tuple is highly complicated. Different wells entail different drilling costs, and different strikes produce different quantities and qualities of oil, which can be recovered over different periods of time and sold at different prices. Furthermore, each potential consequence of the present drilling venture may interact with future potential drilling ventures, for example through the geological information gained. There are uncertainties surrounding any particular (e, a, x, θ) complex, which must be compared formally or informally. Assume, nevertheless, that the decision maker can assign utility numbers to reflect his preferences.

Assignment of Probabilities

The assignment of a probability measure to the probability space Θ × X bears more importance on the conditional (or "posterior") measure Pθ′′|x and the marginal measure Px|e than on the complementary measures Px|e,θ and Pθ′. Previous experience with the amounts of oil found in the three possible types of geological structure (x1 = no structure, x2 = open structure, x3 = closed structure) may make it possible to assign a nearly "objective" measure Pθ′′|x to the amount of oil which will be found given any particular experimental result, whereas it would be much less clear what measure Pθ′ should be assigned to the amount of oil in the absence of knowledge of the structure. At the same time, it will in general be much more meaningful to a geologist to assign a marginal measure Px|e to the various structures, and thus to the sample space X, than it would be to assign conditional measures Px|e,θ to X depending on the amount of oil which will be found.

The hypothetical measures Pθ′′|x and Px|e are shown on those branches of the decision tree in Figure 14.5 which emanate from e1; the prior probabilities Pθ′{θ1} = 0.2 and Pθ′{θ2} = 0.8 shown on the branches emanating from e0 were computed from them by use of the formula

Pθ′(θi) = Pθ′′(θi | x1) Px(x1 | e1) + Pθ′′(θi | x2) Px(x2 | e1) + Pθ′′(θi | x3) Px(x3 | e1).

Analysis

Since all the data required for the analysis appear in the decision tree in Figure 14.5, it can easily be verified that the optimal decision is to pay for the seismographic recordings (e1) and then drill (a1) if and only if the recordings reveal an open (x2) or a closed (x3) structure. It is worthwhile to pay for the recordings because the expected utility of this decision is 15.25, whereas the expected utility of the optimal act without seismic information is 0. This presents the idea of cutting the decision tree, as discussed in Sections 4 and 9.
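The formula above for recovering the prior measure from the posterior and marginal measures is the law of total probability, and it is easy to verify numerically. The excerpt quotes only the resulting priors (0.2 and 0.8), so the posterior and marginal values below are made-up placeholders chosen to reproduce them.

```python
# Law of total probability: P'(theta_i) = sum_j P''(theta_i | x_j) P(x_j | e1).
# The numbers below are illustrative placeholders, not the report's actual
# measures; they are chosen so that the priors come out to 0.2 and 0.8.

px_given_e1 = {"x1": 0.5, "x2": 0.3, "x3": 0.2}    # marginal P(x_j | e1)
post_oil = {"x1": 0.04, "x2": 0.20, "x3": 0.60}    # posterior P(theta1 | x_j)

prior_oil = sum(post_oil[x] * px_given_e1[x] for x in px_given_e1)
prior_dry = sum((1 - post_oil[x]) * px_given_e1[x] for x in px_given_e1)

print(prior_oil, prior_dry)   # 0.2 and 0.8 (up to rounding)
```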
Courses and Professors:
STA4276H (2003): MCMC algorithms by Professor Jeffrey Rosenthal.
CHL5223H (2002/3): Applied Bayesian Methods by Professor Michael Escobar (meetings).