
Chapter 3.

Properties of Random Variables

IN CHAPTER 2 random variables and their probability density functions were discussed in
general and somewhat abstract terms. Actually, nearly every hydrologic variable is a random
variable. This includes rainfall, streamflow, infiltration rates, evaporation, reservoir storage, etc.
Any process whose outcome is a random variable can be thought of as an experiment. A single
outcome from an experiment is a realization of the experiment or an observation from the
experiment. Thus, daily rainfall values are observations generated by a set of meteorologic
conditions that comprise the experiment.

The terms realization and observation can be used interchangeably; however, an observation is generally taken to be a single value of a random variable and a realization is
generally taken as a time series of random variables generated by a random experiment. A 10-year
record of daily rainfall might be considered as a single realization of a stochastic process (daily
rainfall). A second 10-year record of daily rainfall from the same location would then be a second
realization of the process.

In this chapter we will be concerned mainly with observations of random variables and
with the collection of possible values that these observations may take on. The complete
assemblage of all of the values representative of a particular random process is called a population.
Any subset of these values would be a sample from the population. For example, the pages of this
book could represent a population while the pages of this chapter are a sample of that population.
All of the books in a library might be taken as a population and should this book be found in the
library, it would be a sample from the total population.

Generally, one has at hand a sample of observations or data from which inferences about
the originating population are to be made, and then possibly inferences about another sample from
this population. Streamflow records for the past 50 years on a particular stream would be a sample
from which inferences about the behavior of the stream for all time (the population) could be
made. This information could also be used to estimate the behavior of the stream during some
future period of years (another but yet unrealized sample) so that a structure could be properly
designed for the stream. Thus, one might use information gleaned from one sample to make
decisions regarding another sample.

Quantities that are descriptive of a population are called parameters. In most situations
these parameters must be estimated from samples of data. Sample statistics are estimates for
population parameters. Sample statistics are estimated from samples of data and as such are
functions of random variables (the sample values) and thus are themselves random variables. The
average number of pages in all of the books in a particular library would be a parameter
representing the population (the books in the library). This parameter could be estimated by
determining the average number of pages in all of the books on a particular shelf in the library (a
sample of the population). This estimate of the parameter would be a statistic.

As pointed out in chapter 1, for a decision based on a sample to be valid in terms of the
population, the sample statistics must be representative of the population parameters. This in turn
requires that the sample itself be representative of the population and that “good” parameter
estimation procedures are used. One could not get a “good” estimate of the average number of
pages per book in a library by sampling a shelf that contained only fat, engineering handbooks. By
the same token, one cannot get "good" estimates for the parameters of a streamflow synthesis
model if the estimates are based on a short period of record during which an extreme drought
occurred.

One rarely, if ever, has available a population of observations on a hydrologic variable. What is generally available is a sample (of observations) from the population. Thus, population
parameters are rarely, if ever, known and must be estimated by sample statistics. By the same
token, the true probability density function that generated the available sample of data is not
known. Thus, it is necessary not only to estimate population parameters but also to estimate the form of the random process (experiment) that generated the data.

This chapter is devoted to a discussion of parameters descriptive of populations and how estimates (statistics) for these parameters can be obtained from samples drawn from populations.

MOMENTS AND EXPECTATION – UNIVARIATE DISTRIBUTIONS

A convenient way of quantifying the location and some measures of the shape of a
probability distribution is by computing the moments of the distribution. Referring to figure 3.1,
the first moment of the elemental area dA about the origin is given by

𝑑𝜇1′ = 𝑥𝑑𝐴

and the first moment of the total area about the origin is

𝜇1′ = ∫𝐴 𝑥𝑑𝐴 (3.1)

In case of a random variable and its associated probability density function such as shown
in figure 3.2, the first moment about the origin is again given by

𝜇1′ = ∫𝐴 𝑥𝑑𝐴

In this case dA = p_X(x) dx, so that

μ₁′ = ∫_{-∞}^{∞} x p_X(x) dx     (3.2)
Figure 3.1. Moment of arbitrary area.

Figure 3.2. Moment of probability distribution.

Generalizing the situation, the ith moment about the origin is

μᵢ′ = ∫_{-∞}^{∞} xⁱ p_X(x) dx     (3.3)

In the case of a discrete distribution

μᵢ′ = ∑_j x_jⁱ f_X(x_j)     (3.4)

The ith central moment is defined as the ith moment about the mean, μ, of a distribution and is given by

μᵢ = ∫_{-∞}^{∞} (x − μ)ⁱ p_X(x) dx     (3.5)

The expected value of the random variable X is defined to be

E(X) = ∫_{-∞}^{∞} x p_X(x) dx     X continuous     (3.6)

E(X) = ∑_j x_j f_X(x_j)     X discrete     (3.7)

If g(X) is a function of X, then the expected value of g(X) is given by

E[g(X)] = ∫_{-∞}^{∞} g(x) p_X(x) dx     X continuous     (3.8)

E[g(X)] = ∑_j g(x_j) f_X(x_j)     X discrete     (3.9)

It is apparent that the expected value of (X − μ)ⁱ is equal to the ith central moment

E[(X − μ)ⁱ] = μᵢ     (3.10)

and that E(X) = μ₁′ is the first moment about the origin.

Some rules for finding expected values are:

𝐸(𝑐) = 𝑐 (3.11)

𝐸[𝑐𝑔(𝑋)] = 𝑐𝐸[𝑔(𝑋)] (3.12)

𝐸[𝑔1 (𝑋) ± 𝑔2 (𝑋)] = 𝐸[𝑔1 (𝑋)] ± 𝐸[𝑔2 (𝑋)] (3.13)

MEASURES OF CENTRAL TENDENCY

Arithmetic Mean

Generally the first property of a random variable that is of interest is its mean or average
value. The mean, μX, of a random variable, X, is its expected value. Thus

𝜇𝑋 = 𝐸(𝑋) = 𝜇1′ (3.14)

A sample estimate of the population mean is the arithmetic average, X̄, calculated from

x̄ = ∑_{i=1}^{n} x_i / n     (3.15)

where n is the number of observations or items in the sample. The arithmetic mean can be estimated from grouped data by

x̄ = (1/n) ∑_{i=1}^{k} x_i n_i     (3.16)

where k is the number of groups, n is the number of observations, ni is the number of observations
in the ith group and xi is the class mark of the ith group.
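For readers who prefer a computational check, the short Python sketch below evaluates equations 3.15 and 3.16; it is a minimal illustration only, and the sample values, class marks, and group counts are made up rather than taken from the text.

def arithmetic_mean(x):
    """Equation 3.15: x-bar is the sum of the observations divided by n."""
    return sum(x) / len(x)

def grouped_mean(class_marks, counts):
    """Equation 3.16: x-bar = (1/n) * sum of (class mark * group count)."""
    n = sum(counts)
    return sum(xm * ni for xm, ni in zip(class_marks, counts)) / n

observations = [2.3, 3.1, 2.8, 4.0, 3.6, 2.9]   # hypothetical sample
marks = [2.5, 3.5, 4.5]                          # hypothetical class marks
counts = [3, 2, 1]                               # observations in each class

print(arithmetic_mean(observations))   # exact sample mean
print(grouped_mean(marks, counts))     # approximation from the grouped data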

Geometric Mean

The sample geometric mean, X̄_G, is defined as

X̄_G = (∏_{i=1}^{n} x_i)^{1/n}     (3.17)

where ∏_{i=1}^{n} x_i = x₁x₂x₃ ⋯ x_n.

The logarithm of X̄_G is equal to the arithmetic average of the logarithms of the x_i's. The logarithm of the population geometric mean would be the expected value of the logarithm of X.

Median

The sample median, Xmd, is the observation such that half of the values lie on either side of
Xmd. The population median, μmd, would be the value satisfying
∫_{-∞}^{μ_md} p_X(x) dx = 0.5     X continuous     (3.18)

or μ_md = x_p where p is determined from

∑_{i=1}^{p} f_X(x_i) = 0.5     X discrete     (3.19)

The median of a sample or a population may not exist.

Mode

The mode is the most frequently occurring value. Thus the population mode, μmo, would be
a value of X maximizing pX(x) and thus satisfying the equations

dp_X(x)/dx = 0   and   d²p_X(x)/dx² < 0     X continuous     (3.20)

or the value of X associated with

max_{i=1,…,n} f_X(x_i)     X discrete     (3.21)

The sample mode, Xmo, would simply be the most frequently occurring value in the sample.
A sample or a population may have none, one or more than one mode.

Weighted Mean

The calculation of the arithmetic mean of grouped data is an example of calculating a weighted mean where n_i/n is the weighting factor. In general the weighted mean is

x̄_w = ∑_{i=1}^{k} w_i x_i / ∑_{i=1}^{k} w_i     (3.22)

where wi is the weight associated with the ith observation or group and k is the number of
observations or groups.

MEASURES OF DISPERSION

Range

The two most common measures of dispersion are the range and the variance. The range of
a sample is simply the difference between the largest and smallest sample values. The range of a
population is many times the interval from -∞ to ∞ or from 0 to ∞. The sample range is a function
of only two of the sample values but does convey some idea of the spread of the data. The
population range of many continuous hydrologic variables would be 0 to ∞ and would convey little
information. The range has the disadvantage of not reflecting the frequency or magnitude of values
that deviate either positively or negatively from the mean since only the largest and smallest values
are used in its determination. Occasionally the relative range -- the range divided by the mean -- is used.

Variance

By far the most common measure of dispersion is the variance or its positive square root
the standard deviation. The variance of the random variable X is defined as the second moment about the mean and is denoted by σ_X².

Var(X) = σ_X² = μ₂ = E[(X − μ)²] = E(X²) − E²(X)     (3.23)

Thus, the variance is the average squared deviation from the mean. For a discrete population of
size n, equation 3.23 becomes
σ_X² = ∑_i (x_i − μ)² / n     (3.24)

The sample estimate of σ_X² is denoted by s_X² and calculated from

s_X² = ∑_i (x_i − x̄)² / (n − 1) = [∑_i x_i² − (∑_i x_i)²/n] / (n − 1) = [∑_i x_i² − n x̄²] / (n − 1)     (3.25)

Two basic differences should be noted between equations 3.24 and 3.25. First, in 3.25, x̄ is used instead of μ. This is because in dealing with a sample, the population mean would not be known. Secondly, n − 1 is used as the denominator in determining s_X² rather than the n used in calculating σ_X². The reason for this is that ∑_i (x_i − x̄)²/n would result in a biased estimate for σ_X².

The variance for grouped data can be estimated from

s_X² = ∑_{i=1}^{k} n_i (x_i − x̄)² / (n − 1)     (3.26)

where k is the number of groups, n is the number of observations, xi is the class mark and ni the
number of observations in the ith group.

The variance of some functions of the random variable X can be determined from the
following relationships:

𝑉𝑎𝑟(𝑐) = 0 (3.27)

𝑉𝑎𝑟(𝑐𝑥) = 𝑐 2 𝑉𝑎𝑟(𝑥) (3.28)

𝑉𝑎𝑟(𝑎 + 𝑏𝑋) = 𝑏 2 𝑉𝑎𝑟(𝑥) (3.29)

The units on the variance are the same as the units on X². The units on the standard deviation are the same as the units on the random variable. A dimensionless measure of dispersion is the coefficient of variation, defined as the standard deviation divided by the mean. The coefficient of variation is estimated from

c_v = s_X / x̄     (3.30)
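The following sketch, again with made-up data, evaluates the sample variance of equation 3.25 together with the standard deviation and the coefficient of variation of equation 3.30.

import math

# Hypothetical sample; equations 3.25 (sample variance) and 3.30 (coefficient of variation).
x = [12.0, 15.5, 9.8, 14.2, 11.7, 13.3]
n = len(x)
x_bar = sum(x) / n

# Equation 3.25: s_X^2 = sum((x_i - x_bar)^2) / (n - 1)
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s = math.sqrt(s2)        # standard deviation
cv = s / x_bar           # equation 3.30

print(x_bar, s2, s, cv)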

MEASURES OF SYMMETRY

As is apparent from figure 2.13 many distributions are not symmetrical. They may tail off
to the right or to the left and as such are said to be skewed. A distribution tailing to the right is said
to be positively skewed and one tailing to the left is negatively skewed. The skewness is the third
moment about the mean and is given by

skewness = ∫_{-∞}^{∞} (x − μ)³ p_X(x) dx     (3.31)

One measure of absolute skewness would be the difference between the mean and the mode. A measure such as this would not be too meaningful, however, because it would depend on the units of measurement. A relative measure of skewness, known as Pearson's first coefficient of skewness, can be obtained by dividing the difference between the mean and the mode by the standard deviation.

population measure of skewness = (μ − μ_mo) / σ     (3.32)

which can be estimated by

sample measure of skewness = (x̄ − x_mo) / s_X     (3.33)

The mode of moderately skewed distributions can be estimated from (Parl 1967)

x_mo ≅ x̄ − 3(x̄ − x_md)     (3.34)

so that

sample measure of skewness = 3(x̄ − x_md) / s_X     (3.35)

If sample estimates are replaced by population values in equation 3.35, Pearson's second
coefficient of skewness results.

The most commonly used measure of skewness is the coefficient of skew given by
γ = μ₃ / μ₂^{3/2}     (3.36)

An unbiased estimate for the coefficient of skew based on a sample of size n is

c_s = n² M₃ / [(n − 1)(n − 2) s_X³]     (3.37)

where M3 is the sample estimate for μ3. The sample coefficient of skew has the advantage of
being a function of all of the observations in the sample. Figure 3.3 shows symmetrical, positively
and negatively skewed distributions.
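A minimal sketch of equation 3.37 follows; the sample is hypothetical and was chosen to tail to the right so that the computed coefficient of skew comes out positive.

import math

# Hypothetical sample; equation 3.37: c_s = n^2 * M3 / ((n - 1)(n - 2) * s_X^3)
x = [10.0, 12.5, 11.0, 30.0, 13.5, 12.0, 11.5]   # one large value gives positive skew
n = len(x)
x_bar = sum(x) / n
s = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
M3 = sum((xi - x_bar) ** 3 for xi in x) / n       # third sample moment about the mean

c_s = n ** 2 * M3 / ((n - 1) * (n - 2) * s ** 3)
print(c_s)   # positive, since the sample tails to the right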

Figure 3.3. Location of mean, median and mode.

Figure 3.4. Illustration of kurtosis.

MEASURES OF PEAKEDNESS

A fourth property of random variables based on moments is the kurtosis. Kurtosis refers to
the extent of peakedness or flatness of a probability distribution in comparison with the normal
probability distribution. Kurtosis is the fourth moment about the mean. A coefficient of kurtosis is
defined as
κ = μ₄ / μ₂²     (3.38)

The sample estimate for the coefficient of kurtosis is


k = M₄ / s_X⁴     (3.39)

where M₄ is the sample estimate for μ₄. According to Yevjevich (1972a), a less biased estimate for the coefficient of kurtosis is obtained by multiplying equation 3.39 by n³/[(n − 1)(n − 2)(n − 3)] where n is the sample size.

The coefficient of kurtosis for a normal distribution is 3. The normal distribution is said to
be mesokurtic. If a distribution has a relatively greater concentration of probability near the mean
than does the normal, the coefficient of kurtosis will be greater than 3 and the distribution is said to
be leptokurtic. If a distribution has a relatively smaller concentration of probability near the mean
than does the normal, the coefficient of kurtosis will be less than 3 and the distribution is said to be
platykurtic. Figure 3.4 illustrates kurtosis. The coefficient of excess, ξ, is defined as κ-3. Therefore
for a normal distribution ξ is 0, for a leptokurtic distribution ξ is positive and for a platykurtic
distribution ξ is negative.
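The same style of sketch applies to equation 3.39 and the coefficient of excess; the sample is hypothetical, and the second estimate applies the Yevjevich (1972a) correction factor mentioned above.

import math

# Hypothetical sample; equation 3.39: k = M4 / s_X^4, coefficient of excess = k - 3.
x = [10.0, 12.5, 11.0, 30.0, 13.5, 12.0, 11.5]
n = len(x)
x_bar = sum(x) / n
s = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
M4 = sum((xi - x_bar) ** 4 for xi in x) / n       # fourth sample moment about the mean

k = M4 / s ** 4                                       # equation 3.39
k_adj = k * n ** 3 / ((n - 1) * (n - 2) * (n - 3))    # less biased estimate (Yevjevich 1972a)
excess = k - 3.0                                      # coefficient of excess

print(k, k_adj, excess)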

MOMENTS AND EXPECTATION – JOINTLY DISTRIBUTED RANDOM VARIABLES

If X and Y are jointly distributed continuous random variables and U is some function of X
and Y, U = g(X,Y), then E(U) can be found by using the methods of chapter 2 to derive the
marginal distribution of U, pU(u), so that

E(U) = E[g(X,Y)] = ∫ u p_U(u) du     (3.40)

A much simpler and more direct method of finding E[g(X,Y)] would be to use the relationship

𝐸[𝑔(𝑋, 𝑌)] = ∬ 𝑔(𝑥, 𝑦) 𝑝𝑋,𝑌 (𝑥, 𝑦)𝑑𝑥𝑑𝑦 (3.41)

In either case the result is the average value of the function g(X,Y) weighted by the probability that
X=x and Y=y or more simply the mean of the random variable U.

In the discrete case


E[g(X,Y)] = ∑_i ∑_j g(x_i, y_j) f_{X,Y}(x_i, y_j)     (3.42)

A general expression for the r,s moment about the origin of the jointly distributed random
variables X and Y is

𝜇𝑟,𝑠 = ∬ 𝑥 𝑟 𝑦 𝑠 𝑝𝑋,𝑌 (𝑥, 𝑦)𝑑𝑥𝑑𝑦 (3.43)

for X and Y continuous and



μ_{r,s} = ∑_i ∑_j x_i^r y_j^s f_{X,Y}(x_i, y_j)     (3.44)

for X and Y discrete.

The r,s central moment is defined as

μ_{r,s} = ∬ (x − μ_X)^r (y − μ_Y)^s p_{X,Y}(x, y) dx dy     (3.45)

for continuous random variables and as

μ_{r,s} = ∑_i ∑_j (x_i − μ_X)^r (y_j − μ_Y)^s f_{X,Y}(x_i, y_j)     (3.46)

for discrete random variables.

For most situations only moments about the origin and about the means are of interest. As
in the case of univariate distributions, the r,s moment about the origin of a bivariate distribution is
equal to the expected value of 𝑋 𝑟 𝑌 𝑠 .

The cases where r =1 and s = 0 and r = 0 and s = 1 are of special interest. For example

𝐸(𝑋1 𝑌 0 ) = ∬ 𝑥𝑝𝑋,𝑌 (𝑥, 𝑦)𝑑𝑦𝑑𝑥 (3.47)

= ∫ 𝑥 ∫ 𝑝𝑋,𝑌 (𝑥, 𝑦) 𝑑𝑦𝑑𝑥

= ∫ 𝑥𝑝𝑥 (𝑥)𝑑𝑥

= 𝜇𝑥

The analogous result holds for E(X⁰Y¹).

The most useful central moments are for (r = 2, s = 0), (r = 1, s = 1) and (r = 0, s = 2). For the
case (r = 2, s = 0) we have

𝐸[(𝑋 − 𝜇𝑥 )2 ] = ∬(𝑥 − 𝜇𝑋 )2 𝑝𝑋,𝑌 (𝑥, 𝑦)𝑑𝑥𝑑𝑦 (3.48)

= ∫(𝑥 − 𝜇𝑥 )2 ∫ 𝑝𝑋,𝑌 (𝑥, 𝑦)𝑑𝑦𝑑𝑥

= ∫(𝑥 − 𝜇𝑥 )2 𝑝𝑥 (𝑥)𝑑𝑥

= 𝑉𝑎𝑟(𝑋)

The analogous result holds for (r = 0, s = 2). The comparable results for discrete random
variables are easily obtained.

Covariance

The covariance of X and Y is defined as the 1, 1 central moment

Cov(X,Y) = σ_{X,Y} = μ_{1,1}

= E[(X − μ_X)(Y − μ_Y)]     (3.49)

= E(XY) − E(X)E(Y)

= ∬ (x − μ_X)(y − μ_Y) p_{X,Y}(x, y) dx dy

For the case where X and Y are independent, equation 3.49 can be written

σ_{X,Y} = ∫ (x − μ_X) p_X(x) dx ∫ (y − μ_Y) p_Y(y) dy     (3.50)

since 𝑝𝑋,𝑌 (𝑥, 𝑦) would equal 𝑝𝑥 (𝑥)𝑝𝑦 (𝑦). Furthermore both of the integrals in equation 3.50 are
equal to zero so that

Cov(X,Y) = σ_{X,Y} = 0     (3.51)

if X and Y are independent. The converse of this is not necessarily true however.

The sample estimate for the population covariance σX,Y is sX,Y computed from

s_{X,Y} = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (n − 1)     (3.52)

Correlation Coefficient

The covariance has units equal to the units of X times the units of Y. A normalized
covariance called the correlation coefficient is obtained by dividing the covariance by the product of the standard deviations of X and Y

ρ_{X,Y} = σ_{X,Y} / (σ_X σ_Y)     (3.53)

It can be shown (Thomas 1971) that −1 ≤ 𝜌𝑋,𝑌 ≤ 1 . Obviously, if X and Y are
independent, ρX,Y = 0 . Again the converse is not necessarily true. X and Y can be functionally
related and still have ρX,Y (and σX,Y ) equal to zero. Actually ρX,Y is a measure of the linear
dependence between X and Y. If ρX,Y = 0 then X and Y are linearly independent; however, they
may be related by some other functional form. A value of ρX,Y equal to ± 1 implies that X and Y
are perfectly related by Y=a + bX. If ρX,Y = 0, X and Y are said to be uncorrelated. Any nonzero
value of ρX,Y means X and Y are correlated.

The covariance and the correlation coefficient are a measure of how the two variables X
and Y vary together. If ρX,Y and σX,Y are positive, large values of X tend to be paired with large
values of Y and vice versa. If ρX,Y and σX,Y are negative, large values of X tend to be paired with
small values of Y and vice versa.

The population correlation coefficient ρX,Y can be estimated by the sample correlation
coefficient as
r_{X,Y} = s_{X,Y} / (s_X s_Y)     (3.54)

where sX and sY are the sample estimates for σX and σY given by equation 3.25 and sX,Y is the
sample covariance given by equation 3.52.
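A short sketch of equations 3.52 and 3.54 for a hypothetical bivariate sample follows.

import math

# Hypothetical paired sample; equations 3.52 (sample covariance) and 3.54 (sample correlation).
x = [1.2, 2.5, 3.1, 4.0, 5.3, 6.1]
y = [2.0, 3.2, 3.9, 5.1, 6.0, 7.4]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)  # eq. 3.52
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

r_xy = s_xy / (s_x * s_y)                                                  # eq. 3.54
print(s_xy, r_xy)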

Figure 3.5 demonstrates some typical values for rX,Y . In figure 3.5a all of the points lie on
the line Y=X-1; consequently there is perfect linear dependence between X and Y and the
correlation coefficient is unity. In figure 3.5b the points are either on or slightly off of the line
Y=X-1, and rX,Y = 0.986. Perfect linear dependence does not exist in this case because some of the
points deviate slightly from the straight line. In measuring and relating naturally occurring
hydrologic variables, a correlation coefficient of 0.986 would be considered quite good and the
resulting straight line, Y=X-1 in this case, would usually be judged a good usable relationship
between X and Y.

In figure 3.5c the correlation coefficient has dropped to -0.671. The points in this case are
scattered about the line Y = 1.264 − 1.571X. The scatter of the points is much greater than in the
previous case although the existence of some dependence (stochastic) is still in evidence.

In figure 3.5d the scatter of the points is very great, with a corresponding lack of a strong
(stochastic) dependence. Generally, a correlation coefficient of 0.211 is considered too small to
indicate a useful stochastic dependence as knowledge about X gives very little information about
Y.

In the last two paragraphs the modifier "stochastic" has appeared with the word
dependence. This is because in reality there are two kinds of dependence -- stochastic and
functional. Generally throughout this book the word dependence alone should be taken to mean
stochastic (or statistical) dependence.
Figures 3.5e and 3.5f contain examples of functionally dependent variables. In figure 3.5e the relationship is Y = X²/4 for X > 0 and in figure 3.5f the relationship is Y = ±√(9 − X²) for −3 < X < 3. The correlation coefficient for figure 3.5e is 0.963, indicating a high degree of stochastic (linear) dependence. This illustrates that even though the dependence between X and Y is nonlinear, a high correlation coefficient can result. If the plot of figure 3.5e were to cover a different range of X, the correlation coefficient would change as well.

Figure 3.5. Examples of the correlation coefficient.

Figure 3.5f illustrates a situation where Y and X are perfectly functionally related even
though the correlation coefficient is zero. The functional relationship is not linear, however. This
figure demonstrates that one cannot conclude that X and Y are unrelated based on the fact that their
correlation coefficients are small.

The fact that two variables have a high degree of linear correlation should not be
interpreted as indicating a functional or cause and effect relationship exists between the two
variables. The annual water yield on two adjacent watersheds may be highly positively correlated
even though a high yield from one watershed does not cause a high yield from the second
watershed. More likely the same climatic factors and geomorphic factors are operating on the two
watersheds, causing their water yields to be similar. It is often overlooked that high correlation does not necessarily mean a cause and effect relationship exists between the correlated variables.

Further Properties of Moments

If Z is a linear function of two random variables X and Y, then

𝑍 = 𝑎𝑋 + 𝑏𝑌

𝐸(𝑍) = 𝐸(𝑎𝑋 + 𝑏𝑌) = 𝑎𝐸(𝑋) + 𝑏𝐸(𝑌) (3.55)

𝑉𝑎𝑟(𝑍) = 𝑉𝑎𝑟(𝑎𝑋 + 𝑏𝑌) = 𝐸(𝑎𝑋 + 𝑏𝑌)2 − 𝐸 2 (𝑎𝑋 + 𝑏𝑌)

or

𝑉𝑎𝑟(𝑍) = 𝑎2 𝑉𝑎𝑟(𝑋) + 𝑏 2 𝑉𝑎𝑟(𝑌) + 2𝑎𝑏𝐶𝑜𝑣(𝑋, 𝑌) (3.56)

Equations 3.55 and 3.56 can be generalized when Y is a linear function of n random
variables as follows.

𝑌 = ∑𝑛𝑖=1 𝑎𝑖 𝑋𝑖

then

𝐸(𝑌) = 𝐸(∑𝑛𝑖=1 𝑎𝑖 𝑋𝑖 ) = ∑𝑛𝑖=1 𝑎𝑖 𝐸(𝑋𝑖 ) (3.57)

and

𝑉𝑎𝑟(𝑌) = ∑𝑛𝑖=1 𝑎𝑖2 𝑉𝑎𝑟(𝑋𝑖 ) + 2 ∑𝑖<𝑗 𝑎𝑖 𝑎𝑗 𝐶𝑜𝑣(𝑋𝑖 , 𝑋𝑗 ) (3.58)

A noteworthy result of equation 3.56 or 3.58 is that for uncorrelated random variables, the variance of a sum or difference is equal to the sum of the variances. This is because the variation in each of the random variables contributes to the variation of their sum or difference.
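Equation 3.56 is easy to verify approximately by simulation. The sketch below generates independent (hence uncorrelated) X and Y from arbitrary normal distributions and compares the variance of Z = aX + bY with a²Var(X) + b²Var(Y); the distributions and coefficients are illustrative choices only.

import random
import statistics

# Rough Monte Carlo check of equation 3.56 with Cov(X, Y) = 0 (X and Y generated independently).
random.seed(1)
a, b = 2.0, -3.0                                   # arbitrary coefficients
x = [random.gauss(10.0, 2.0) for _ in range(100_000)]
y = [random.gauss(5.0, 1.5) for _ in range(100_000)]
z = [a * xi + b * yi for xi, yi in zip(x, y)]

lhs = statistics.variance(z)                                         # Var(aX + bY)
rhs = a ** 2 * statistics.variance(x) + b ** 2 * statistics.variance(y)
print(lhs, rhs)   # the two values should agree closely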

As a special case of a linear function consider the Xi to be a random sample of size n. Let
the a_i all be equal to 1/n. Then Y is equal to X̄, the mean of the sample. The Var(Y) is the Var(X̄) and can be found from equation 3.58. Since the X_i form a random sample, Cov(X_i, X_j) = 0 for i ≠ j and Var(X_i) = Var(X), so we now have

Var(Y) = Var(X̄) = ∑_{i=1}^{n} (1/n²) Var(X) = (n/n²) Var(X)

or
Var(X̄) = Var(X) / n     (3.59)

Equation 3.59 states that the variance of the mean of a random sample is equal to the
variance of the sample divided by the number of observations used to estimate the mean of the
sample. If X and Y are independent random variables then the equation 3.49 shows that the
expectation of their product is equal to the product of their expectation.

𝐸(𝑋𝑌) = 𝐸(𝑋)𝐸(𝑌) 𝑖𝑓 𝑋 𝑎𝑛𝑑 𝑌 𝑖𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡) (3.22)

The variance of the product XY for X and Y independent can be obtained from

𝑉𝑎𝑟(𝑋𝑌) = 𝐸(𝑋𝑌)2 − 𝐸 2 (𝑋𝑌)

and noting that

𝐸(𝑋𝑌)2 = ∫ 𝑥 2 𝑦 2 𝑝𝑋,𝑌 (𝑥, 𝑦)𝑑𝑥𝑑𝑦

Because X and Y are independent, p_{X,Y}(x, y) = p_X(x)p_Y(y) and E(XY)² becomes E(X²)E(Y²), or E(XY)² = (μ_X² + σ_X²)(μ_Y² + σ_Y²). Also from equation 3.60, E²(XY) = E²(X)E²(Y) = μ_X²μ_Y². Thus

𝑉𝑎𝑟(𝑋𝑌) = (𝜇𝑋2 + 𝜎𝑋2 )(𝜇𝑌2 + 𝜎𝑌2 ) − 𝜇𝑋2 𝜇𝑌2

which reduces to

𝑉𝑎𝑟(𝑋𝑌) = 𝜇𝑋2 𝜎𝑌2 + 𝜇𝑌2 𝜎𝑋2 + 𝜎𝑋2 𝜎𝑌2 (3.61)

for X and Y independent.

A final word of caution in closing this section concerns the expected value of a function of random variables. The caution is that in general

E(g(X)) ≠ g(E(X))

That this is true is obvious from the example of g(X) = X². From equation 3.23 it can be seen that E(X²) = σ_X² + μ_X²

𝐸�𝑔(𝑋)� = 𝐸(𝑋 2 ) = 𝜎𝑋2 + 𝜇𝑋2

𝑔�𝐸(𝑋)� = 𝑔(𝜇𝑋 ) = 𝜇𝑋2

thus demonstrating that in general E(g(X)) ≠ g(E(X)).

SAMPLE MOMENTS

If xi for i = 1 to n is a random sample, then the rth sample moment about the origin is

M_r′ = ∑_{i=1}^{n} x_i^r / n     (3.62)

and the rth sample moment about the sample mean is


M_r = ∑_{i=1}^{n} (x_i − x̄)^r / n     (3.63)

For the bivariate case involving a random sample of xi and yi, the r,s sample moment about
the origin is

M_{r,s}′ = ∑_{i=1}^{n} x_i^r y_i^s / n     (3.64)

and the r,s sample moment about x̄, ȳ is

M_{r,s} = ∑_{i=1}^{n} (x_i − x̄)^r (y_i − ȳ)^s / n     (3.65)

The expected value of sample moments is equal to the population moments (Mood, et al. 1974).

Two important properties of moments worthy of repeating are:

1. The first moment about the mean is zero.

𝐸(𝑋 − 𝜇𝑥 ) = 𝐸(𝑋) − 𝜇𝑋 = 𝜇𝑥 − 𝜇𝑥 = 0

2. The second moment about the origin is equal to the variance plus the square of the mean.

𝜎𝑋2 = 𝐸(𝑋 − 𝜇𝑋 )2 = 𝐸(𝑋 2 ) − 𝐸 2 (𝑋) = 𝐸(𝑋 2 ) − 𝜇𝑋2


𝐸(𝑋 2 ) = 𝜎𝑋2 + 𝜇𝑋2

The moments about the mean are related to the moments about the origin by the following
general equation (Thomas 1971)
M_r = ∑_{s=0}^{r} (−1)^s C(r, s) x̄^s M′_{r−s}     (3.66)

where C(r, s) is the binomial coefficient. For the computation of sample moments it is often convenient to use equation 3.66. The results of equation 3.66 for the first four sample moments are

𝑀1 = 0

𝑀2 = 𝑀2′ − 𝑥̅ 2

𝑀3 = 𝑀3′ − 3𝑥̅ 𝑀2′ + 2𝑥̅ 3

𝑀4 = 𝑀4′ − 4𝑥̅ 𝑀3′ + 6𝑥̅ 2 𝑀2′ − 3𝑥̅ 4
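These relations can be checked numerically. The sketch below computes M₂′, M₃′, and M₄′ for a hypothetical sample (equation 3.62), converts them to central moments with the relations above, and compares the results with direct computation from equation 3.63.

# Hypothetical sample; central moments from moments about the origin versus direct computation.
x = [3.2, 4.1, 5.0, 2.8, 6.3, 4.7, 3.9]
n = len(x)
x_bar = sum(x) / n

# Sample moments about the origin, equation 3.62.
M2p = sum(xi ** 2 for xi in x) / n
M3p = sum(xi ** 3 for xi in x) / n
M4p = sum(xi ** 4 for xi in x) / n

# Central moments from the equation 3.66 results listed above.
M2 = M2p - x_bar ** 2
M3 = M3p - 3 * x_bar * M2p + 2 * x_bar ** 3
M4 = M4p - 4 * x_bar * M3p + 6 * x_bar ** 2 * M2p - 3 * x_bar ** 4

# Direct central moments (equation 3.63) for comparison.
M2_direct = sum((xi - x_bar) ** 2 for xi in x) / n
M3_direct = sum((xi - x_bar) ** 3 for xi in x) / n
M4_direct = sum((xi - x_bar) ** 4 for xi in x) / n
print(M2, M2_direct)
print(M3, M3_direct)
print(M4, M4_direct)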

Sample moments can be computed from grouped data by using the equations

M_i′ = ∑_{j=1}^{k} x_j^i n_j / n     (3.67)

and

M_i = ∑_{j=1}^{k} (x_j − x̄)^i n_j / n     (3.68)

where xj and nj are the class mark and number of observations, respectively, in the jth group, n is
the total number of observations, and k is the number of groups.

Moments of greater than third order are generally not computed for hydrologic variables
because of the small sample size. Higher order moments are very unreliable (have a high
variance) for small samples. For example, the variance of s2 (the variance of the sample variance)
is (Mood et al. 1974)
Var(s²) = (1/n)[M₄ − (n − 3)s⁴/(n − 1)]

Yevjevich (1972a) presents general expressions for the variance of the variance, coefficient of
skew and kurtosis.

PROBABILITY-WEIGHTED MOMENTS AND L-MOMENTS

Probability weighted moments (PWMs) and linear functions of ranked observations (known as L-moments) are another way of characterizing a pdf. This discussion of PWMs and
L-moments relies heavily on Stedinger et al. (1994) which should be consulted for more details.
The rth PWM, βr, is given by

β_r = E(X[P_X(x)]^r)     (3.69)

β0 is the population mean and an estimator b0 of β0 is 𝑥̅ . Estimates for other PWMs can
be obtained from order statistics. A random sample of observations can be arranged so that x(n)
≤ x(n-1) ≤ ... ≤ x(1). The x(i) are known as order statistics. An estimator, 𝑏𝑟∗ , for βr for r ≥ 1 is

b_r* = (1/n) ∑_{j=1}^{n} x_(j) [1 − (j − 0.35)/n]^r     (3.70)

where 1 - (j-0.35)/n are estimators for PX(x(j)). Stedinger et al. (1994) recommend this estimator
for single site estimation even though it is biased because it generally results in a smaller mean
square error than the unbiased estimator given below.

When unbiasedness is important, the following estimators may be used

b₀ = x̄

b₁ = ∑_{j=1}^{n−1} (n − j) x_(j) / [n(n − 1)]

b₂ = ∑_{j=1}^{n−2} (n − j)(n − j − 1) x_(j) / [n(n − 1)(n − 2)]     (3.71)

b₃ = ∑_{j=1}^{n−3} (n − j)(n − j − 1)(n − j − 2) x_(j) / [n(n − 1)(n − 2)(n − 3)]

Stedinger et al. (1994) recommend these unbiased estimators in regionalization studies.

As previously indicated, L-moments are linear functions of ranked observations. Let x_(i|n) be the ith largest observation in a sample of size n. The ith L-moment, λ_i, is given by

λ₁ = E(X)

λ₂ = (1/2) E[x_(1|2) − x_(2|2)]

λ₃ = (1/3) E[x_(1|3) − 2x_(2|3) + x_(3|3)]     (3.72)

λ₄ = (1/4) E[x_(1|4) − 3x_(2|4) + 3x_(3|4) − x_(4|4)]
L-moment estimates for the mean, standard deviation, skewness, and kurtosis are given by

X̄ = λ̂₁

s_X = λ̂₂

C_s = λ̂₃/λ̂₂     (3.73)

k = λ̂₄/λ̂₃

Because L-moments do not involve squares and cubes of observations, they tend to produce less
variable estimates for higher moments especially when an unusually large or small observation
happens to be present in a sample.

L-moments and probability weighted moments are related by

𝜆1 = 𝛽0

𝜆2 = 2𝛽1 − 𝛽0

𝜆3 = 6𝛽2 − 6𝛽1 + 𝛽0 (3.74)

𝜆4 = 20𝛽3 − 30𝛽2 + 12𝛽1 − 𝛽0

Estimates, λ̂_i, of λ_i are obtained by replacing the β_r with the sample estimates b_r.
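As an illustration, the sketch below computes the unbiased estimators b₀ through b₃ of equation 3.71 for a hypothetical sample (ranked from largest to smallest, as in the text, so that x_(1) is the largest observation) and converts them to L-moment estimates with equation 3.74.

# Hypothetical sample; unbiased PWM estimators (eq. 3.71) and L-moments from them (eq. 3.74).
x = [45.0, 12.0, 30.5, 22.1, 60.3, 18.7, 25.4, 40.2]
n = len(x)
xs = sorted(x, reverse=True)          # x_(1) is the largest observation, as in the text

b0 = sum(xs) / n
b1 = sum((n - j) * xs[j - 1] for j in range(1, n)) / (n * (n - 1))
b2 = sum((n - j) * (n - j - 1) * xs[j - 1] for j in range(1, n - 1)) / (n * (n - 1) * (n - 2))
b3 = sum((n - j) * (n - j - 1) * (n - j - 2) * xs[j - 1]
         for j in range(1, n - 2)) / (n * (n - 1) * (n - 2) * (n - 3))

# Equation 3.74: L-moments from the probability weighted moments.
l1 = b0
l2 = 2 * b1 - b0
l3 = 6 * b2 - 6 * b1 + b0
l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
print(l1, l2, l3, l4)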

PARAMETER ESTIMATION

Thus far, probability distribution functions have been written pX(x) or fX(x) depending on
whether they were continuous or discrete. More correctly, they should be written
𝑝𝑋 (𝑥; θ1 , 𝜃2 , … , 𝜃𝑚 ) or 𝑓𝑋 (𝑥; θ1 , 𝜃2 , … , 𝜃𝑚 ) indicating that in general the distributions are a
function of a set of parameters as well as the random variables. To use probability distributions to
estimate probabilities, values for the parameters must be available. This section discusses
methods for estimating the parameter values for probability distributions. Certain properties of
these parameter estimates or statistics are also discussed. Rather than carry a dual set of
relationships - one for continuous and one for discrete random variables - only the expressions for
the continuous random variables will be displayed. The results are equally applicable to discrete
distributions.

The usual procedure for estimating a parameter is to obtain a random sample x1, x2, ..., xn
from the population X. This random sample is then used to estimate the parameters. Thus θ̂_i, an estimate for the parameter θ_i, is a function of the observations or random variables. Since θ̂_i is a function of random variables, θ̂_i is itself a random variable possessing a mean, variance, and probability distribution.

Intuitively, one would feel that the more observations of the random variable that were available for parameter estimation, the closer θ̂ should be to θ. Also, if many samples were used for obtaining θ̂, one would feel that the average value of θ̂ should equal θ. These two statements deal with two properties of estimators known as consistency and unbiasedness.

Unbiasedness

An estimator θ̂ of a parameter θ is said to be unbiased if E(θ̂) = θ. The bias, if any, is given by

bias = E(θ̂) − θ     (3.75)

The fact that an estimator is unbiased does not guarantee that an individual θ̂ is equal to θ or even close to θ; it simply means that the average of many independent estimates for θ will equal θ.

Consistency

An estimator θ̂ of a parameter θ is said to be consistent if the probability that θ̂ differs from θ by more than an arbitrary constant ε approaches 0 as the sample size approaches infinity.

Consistency is an asymptotic property since it says that by selecting n sufficiently large, prob(|θ̂ − θ| > ε) can be made as small as desired. For small samples (as are many times used in practice) consistency does not guarantee that a small error will be made. In spite of this, one feels more comfortable knowing that θ̂ would converge to θ if a larger sample were used.

A single estimate of θ from a small sample is a problem since neither unbiasedness nor consistency gives us much comfort. In choosing between several methods for estimating θ, in addition to being unbiased and consistent, it would be desirable if Var(θ̂) were as small as possible. This would mean that the probability distribution of θ̂ would be more concentrated about θ.

Efficiency

An estimator θ̂ is said to be the most efficient estimator for θ if it is unbiased and its variance is at least as small as that of any other unbiased estimator for θ. The relative efficiency of θ̂₁ with respect to θ̂₂ for estimating θ is the ratio of Var(θ̂₂) to Var(θ̂₁).

Sufficiency

Finally, it is desirable that θ̂ use all of the information contained in the sample relative to θ. If only a fraction of the observations in a sample is used for estimating θ, then some information about θ is lost. An estimator θ̂ is said to be a sufficient estimator for θ if θ̂ uses all of the information relevant to θ that is contained in the sample.

More formal statements of the above four properties of estimators and procedures for
determining if an estimator has these properties can be found in books on mathematical statistics
(Lindgren 1968; Freund 1962; Mood et al. 1974).

There are many ways for estimating population parameters from samples of data. A few
of these are graphical procedures, matching selected points, method of moments, maximum
likelihood, and minimum chi-square. The graphical procedure consists of drawing a line through
plotted points and then using certain points on the line to calculate the parameters. This
procedure is very arbitrary and dependent upon the individual doing the analysis. Frequently the
method is employed when few observations are available -- with the thought that few
observations will not produce good parameter estimates anyway. When few points are available
is precisely the time when the best methods of parameter estimation should be used.

The method of matching points is not a commonly used method but can produce
reasonable first approximations to the parameters. The procedure can be valuable in getting
initial estimates for the parameters to be employed in iterative solutions that can arise when the
method of moments or maximum likelihood are used.

Example 3.1. A certain set of data is thought to follow the distribution 𝑝𝑋 (𝑥) = 𝜆𝑒 −𝜆𝑥 for
X>0. In this particular data set, 75% of the values are less than 3.0. Estimate the parameter λ.

Solution:

p_X(x) = λe^{−λx}

P_X(x) = ∫_0^x λe^{−λt} dt = 1 − e^{−λx}

1 − P_X(x) = e^{−λx}

λx = −ln[1 − P_X(x)]

λ̂ = −ln[1 − P_X(x)] / x = −ln(1 − 0.75) / 3.00 = 0.46

Comment: If a sample of size n is available, the above procedure could be used to obtain n estimates for λ. These n estimates could then be averaged to obtain λ̂. If the probability distribution of interest had m parameters, then the values of P_X(x) and x at m points would be used to obtain m equations in the m unknown parameters. The method of matching points is not recommended
for general use in getting final parameter estimates. Certainly this method would not use all of
the information in the sample. Also several different estimates for the parameters could be
obtained from the same sample depending on which observations were used in the estimation
process.

Method of Moments

One of the most commonly used methods for estimating the parameters of a probability
distribution is the method of moments. For a distribution with m parameters, the procedure is to
equate the first m moments of the distribution to the first m sample moments. This results in m
equations which can be solved for the m unknown parameters. Moments about the origin, the
mean, or any other point can be used. Generally for 1-parameter distributions the first moment
about the origin, the mean, is used. For 2- parameter distributions the mean and the variance are
generally used. If a third parameter is required, the skewness may be used.

Similarly, L-moments may be used in parameter estimation by equating sample estimates of the L-moments to the population expression for the corresponding L-moment
depending on the particular pdf being used. Again, for m parameters, m L-moments would be
required. This technique will be illustrated in chapter 6 for some particular pdfs.

Example 3.2. Estimate the parameter λ of the distribution 𝑝𝑋 (𝑥) = 𝜆𝑒 −𝜆𝑥 for X > 0 by the
method of moments.

Solution: The first moment about the origin of p_X(x) is

μ_X = λ ∫_0^∞ x e^{−λx} dx = λ · (1/λ²) = 1/λ

Thus the mean of p_X(x) is 1/λ so that λ can be estimated by λ̂ = 1/X̄.

Example 3.3. Use the method of moments to estimate the parameters of

p_X(x) = (1/√(2πθ₂²)) exp[−(x − θ₁)²/(2θ₂²)]     −∞ < x < ∞

Solution:

μ_X = ∫_{-∞}^{∞} x p_X(x) dx = (1/√(2πθ₂²)) ∫_{-∞}^{∞} x exp[−(x − θ₁)²/(2θ₂²)] dx

let y = (x − θ₁)/θ₂ so that dx = θ₂ dy

and

μ_X = (1/√(2π)) ∫_{-∞}^{∞} (θ₂y + θ₁) e^{−y²/2} dy

= (θ₂/√(2π)) ∫_{-∞}^{∞} y e^{−y²/2} dy + (θ₁/√(2π)) ∫_{-∞}^{∞} e^{−y²/2} dy

The first integral has an integrand h(y) such that h(-y) = -h(y) and is therefore zero. The second
integral can be written as

2 ∫_0^∞ e^{−y²/2} dy = √(2π)

Therefore, μX = θ1, or the parameter θ1 of this distribution is equal to the mean of the distribution
and can be estimated by

θ̂₁ = X̄

The second moment about the mean is equal to the variance.

σ_X² = ∫_{-∞}^{∞} (x − μ_X)² p_X(x) dx = (1/√(2πθ₂²)) ∫_{-∞}^{∞} (x − μ_X)² exp[−(x − θ₁)²/(2θ₂²)] dx

but θ₁ = μ_X so

σ_X² = (1/√(2πθ₂²)) ∫_{-∞}^{∞} (x − μ_X)² exp[−(x − μ_X)²/(2θ₂²)] dx

let y = (x − μ_X)/(√2 θ₂) so that dx = √2 θ₂ dy

and

σ_X² = (2θ₂²/√π) ∫_{-∞}^{∞} y² e^{−y²} dy

= (4θ₂²/√π) ∫_0^∞ y² e^{−y²} dy

= θ₂²

Thus the parameter θ₂² is equal to the variance and can be estimated by s_X² (the sample variance).

θ̂₂² = s_X²

Substituting the parameter estimates in terms of their population values into the
expression for pX(x), the result is

p_X(x) = (1/√(2πσ_X²)) exp[−(x − μ_X)²/(2σ_X²)]     −∞ < x < ∞

which is the normal distribution.

Maximum Likelihood

Assume we have in hand n random observations x1, x2, ..., xn. Their joint probability
distribution is 𝑝𝑋 (𝑥1 , 𝑥2 , … , 𝑥𝑛 ; 𝜃1 , 𝜃2 , … , 𝜃𝑚 ) . Since for a random sample the xi's are
independent their joint distribution can be written

𝑝𝑋 (𝑥1 ; 𝜃1 , 𝜃2 , … , 𝜃𝑚 )𝑝𝑋 (𝑥2 ; 𝜃1 , 𝜃2 , … , 𝜃𝑚 ) … 𝑝𝑋 (𝑥𝑛 ; 𝜃1 , 𝜃2 , … , 𝜃𝑚 )

Now this latter expression is proportional to the probability that the particular random sample
would be obtained from the population and is known as the likelihood function.

𝐿(𝜃1 , 𝜃2 , … 𝜃𝑚 ) = ∏𝑛𝑖=1 𝑝𝑋 (𝑥𝑖 ; 𝜃1 , 𝜃2 , … , 𝜃𝑚 ) (3.76)

The m parameters are unknown. The values of these m parameters that maximize the
likelihood that the particular sample in hand is the one that would be obtained if n random
observations were selected from pX(x; θ1,θ2,...,θm ) are known as the maximum likelihood
estimators. The parameter estimation procedure becomes one of finding the values
of θ1,θ2,...,θm that maximize the likelihood function. This can be done by taking the partial
derivative of L(θ1,θ2,...,θm) with respect to each of the θi ’s and setting the resulting
expressions equal to zero. These m equations in m unknowns are then solved for the m unknown
parameters.

Because many probability distributions involve the exponential function, it is many times easier to maximize the natural logarithm of the likelihood function. Since the logarithmic
function is monotonic, the values of the θ's that maximize the logarithm of the likelihood
function also maximize the likelihood function.

Example 3.4. Find the maximum likelihood estimator for the parameter λ of the
distribution p_X(x) = λe^{−λx} for X > 0.

Solution:

𝐿(𝜆) = ∏𝑛𝑖=1 𝜆𝑒 −𝜆𝑥𝑖 = 𝜆𝑛 𝑒 −𝜆 ∑ 𝑥𝑖

𝑙(𝜆) = 𝑙𝑛𝐿(𝜆) = 𝑛𝑙𝑛(𝜆) − 𝜆 ∑𝑛𝑖=1 𝑥𝑖


∂l(λ)/∂λ = n/λ − ∑_{i=1}^{n} x_i = 0

λ̂ = n / ∑_{i=1}^{n} x_i = 1/X̄

Note that this is the same estimate as obtained in example 3.2 using the method of moments.
The two methods do not always produce the same estimates.
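A small numerical illustration of examples 3.2 and 3.4 follows: since both the method of moments and maximum likelihood give λ̂ = 1/x̄ for this distribution, the estimate can be checked against the value of λ used to simulate a sample. The data below are simulated, not observed.

import random

# Hypothetical check that lambda-hat = 1 / x-bar recovers the exponential parameter.
random.seed(7)
true_lambda = 0.46
x = [random.expovariate(true_lambda) for _ in range(5_000)]   # simulated observations

x_bar = sum(x) / len(x)
lambda_hat = 1.0 / x_bar        # method of moments and maximum likelihood estimate
print(lambda_hat)               # should be close to 0.46 for a sample this large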

Example 3.5. Find the maximum likelihood estimators for the parameters θ₁ and θ₂² of the distribution

p_X(x) = (1/√(2πθ₂²)) exp[−(x − θ₁)²/(2θ₂²)]     −∞ < x < ∞

Solution (all summations from 1 to n):

L(θ₁, θ₂²) = [1/(2πθ₂²)^{n/2}] exp[−∑(x_i − θ₁)²/(2θ₂²)]

l(θ₁, θ₂²) = ln L(θ₁, θ₂²) = −(n/2) ln(2π) − (n/2) ln(θ₂²) − ∑(x_i − θ₁)²/(2θ₂²)

∂l(θ₁, θ₂²)/∂θ₁ = ∑(x_i − θ₁)/θ₂² = 0

Therefore ∑(x_i − θ₁) = 0

or θ̂₁ = ∑x_i/n = x̄

∂l(θ₁, θ₂²)/∂θ₂² = −n/(2θ₂²) + (1/(2θ₂⁴)) ∑(x_i − x̄)² = 0

θ̂₂² = ∑(x_i − x̄)²/n = (n − 1)s²/n

Example 3.5 shows that the maximum likelihood estimators are not unbiased. It can be
shown, however, that the maximum likelihood estimators are asymptotically (as 𝑛 → ∞ )
unbiased. Maximum likelihood estimators are sufficient and consistent. If an efficient estimator
exists, maximum likelihood estimators, adjusted for bias, will be efficient. In addition to these
four properties, maximum likelihood estimators are said to be invariant; that is, if θ̂ is a maximum likelihood estimator of θ and the function h(θ) is continuous, then h(θ̂) is a maximum likelihood estimator of h(θ).

The method of moments and the method of maximum likelihood do not always produce
the same estimates for the parameters. In view of the properties of the maximum likelihood
estimators, this method is generally preferred over the method of moments. Cases arise,
however, where one can get maximum likelihood estimators only by iterative numerical
solutions (if at all) thus leaving room for the use of more readily obtainable estimates possibly
by the method of moments. The accuracy of the method of moments is severely affected if the
data contains errors in the tails of the distribution where the moment arms are long (Chow 1954).
This is especially troublesome with highly skewed distributions.

Finally it should be kept in mind that the properties of maximum likelihood estimators
are asymptotic properties (for large n) and there well may exist better estimation procedures for
small samples for particular distributions.

CHEBYSHEV INEQUALITY

Certain general statements about random variables can be made without placing
restrictions on their distributions. More precise probabilistic statements require more
restrictions on the distribution of the random variables. Exact probabilistic statements require
complete knowledge of the probability distribution of the random variable.

One general result that applies to random variables is known as the Chebyshev
inequality. This inequality states that a single observation selected at random from any
probability distribution will deviate more than kσ from the mean, μ, of the distribution with
probability less than or equal to 1/k².

prob(|x − μ_X| ≥ kσ_X) ≤ 1/k²     (3.77)

For most situations this is a very conservative statement. The Chebyshev inequality produces an
upper bound on the probability of a deviation of a given magnitude from the mean.

Example 3.6. The data of table 2.1 has a mean of 66,540 cfs and a standard deviation of 22,322
cfs. Without making any distributional assumptions regarding the data, what can be said of the
probability that the peak flow in a year selected at random will deviate more than 40,000 cfs from the mean?

Solution: Applying Chebyshev's inequality we have kσ = 40,000 cfs. Using 22,322 cfs as an
estimate for σ we obtain k = 1.79.
prob(|x − μ_X| ≥ kσ_X) ≤ 1/k² = 1/(1.79)² = 0.311

The probability that the peak flow in any year will deviate more than 40,000 cfs from the
mean is thus less than or equal to 0.311.

Comment: That this is a very conservative figure can be seen by noting that only 6 values out of
99 (6/99 = 0.061) lie outside the interval 66,540 ± 40,000. By not making any distributional
assumptions, we are forced to accept very conservative probability estimates. In later chapters we
will again look at this problem making use of selected probability distributions.
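The arithmetic of example 3.6 can be reproduced with a few lines of Python; the empirical count of 6 exceedances in 99 values is quoted from the comment above rather than recomputed, since table 2.1 is not reproduced here.

# Chebyshev bound for example 3.6 (mean 66,540 cfs, standard deviation 22,322 cfs).
mean, sigma = 66_540.0, 22_322.0
deviation = 40_000.0

k = deviation / sigma               # number of standard deviations
bound = 1.0 / k ** 2                # prob(|X - mean| >= k*sigma) <= 1/k^2
print(k, bound)                     # about 1.79 and 0.311

# Observed relative frequency quoted in the text (6 of 99 values outside the interval).
print(6 / 99)                       # about 0.061, well inside the bound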

LAW OF LARGE NUMBERS

Chebyshev's inequality is sometimes written in terms of the mean x̄ of a random sample of size n. In such a case equation 3.77 becomes
prob(|x̄ − μ_X| ≥ kσ_X/√n) ≤ 1/k²     (3.78)

If we now let δ = 1/k² and choose n so that n ≥ σ_X²/(ε²δ), we have the (weak) Law of Large Numbers (Mood and Graybill 1963) which states:

Let p_X(x) be a probability density function with mean μ_X and finite variance σ_X². Let x̄_n be the mean of a random sample of size n from p_X(x). Let ε and δ be any two specified small numbers such that ε > 0 and 0 < δ < 1. Then for n any integer greater than σ_X²/(ε²δ)

prob(|x̄_n − μ_X| ≥ ε) ≤ δ     (3.79)

This statement assures us that we can estimate the population mean with whatever accuracy we desire by selecting a large enough sample. The actual application of equation 3.79 requires knowledge of population parameters and is thus of limited usefulness.

Example 3.7. Assume that the standard deviation of peak flows on the Kentucky River near
Salvisa, Kentucky, is 22,322 cfs. How many observations would be required to be at least 95%
sure that the estimated mean peak flow was within 10,000 cfs of its true value if we know nothing of the distribution of peak flows?

Solution: Applying equation 3.79 we have

δ = 1 − 0.95 = 0.05     ε = 10,000     σ_X = 22,322

n ≥ σ_X²/(ε²δ) = (22,322)²/[(10,000)²(0.05)] = 100

We must have at least 100 observations to be 95% sure that the sample mean is within
10,000 cfs of the population mean if we know nothing of the population distribution except its
standard deviation. This happens to be very close to the number of observations in the sample
(99).

Comment: We will look at this problem again later making certain distributional assumptions.
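The sample-size bound of example 3.7 follows directly from equation 3.79; a sketch of the calculation is shown below.

import math

# Sample size from equation 3.79: n >= sigma^2 / (epsilon^2 * delta), as in example 3.7.
sigma = 22_322.0        # standard deviation of peak flows, cfs
epsilon = 10_000.0      # desired accuracy of the estimated mean, cfs
delta = 1.0 - 0.95      # allowed probability of missing that accuracy

n_required = math.ceil(sigma ** 2 / (epsilon ** 2 * delta))
print(n_required)       # about 100 observations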

EXERCISES

3.1 What is the expected mean and variance of the sum of values obtained by tossing two dice?
What is the coefficient of skew and kurtosis?
3.2 Modular coefficients defined as K_t = x_t/x̄ are occasionally used in hydrology. What are the
mean, variance and coefficient of variation of modular coefficients in terms of the original data?

3.3 What effect does the addition of a constant to each observation from a random sample have
on the mean, variance and coefficient of variation?

3.4 What effect does multiplying each observation in a random sample by a constant have on the
mean, variance and coefficient of variation?

3.5 Without any knowledge of the probability distribution of peak flows on the Kentucky River
(table 2.1), what can be said about the probability that |Q̄ − μ_Q| is greater than 10,000 cfs?

3.6 Without any knowledge of the probability distribution of peak flows on the Kentucky River
(table 2.1), what can be said about the probability that a single random observation will deviate
more than 10,000 cfs from μQ?

3.7 Using the data of exercise 2.2 calculate the mean and variance from the grouped data. How do
the grouped data mean and variance compare to the ungrouped mean and variance? Which
estimate do you prefer?

3.8 Calculate the covariance between the peak discharge Q in thousands of cfs and the area A in
thousands of square miles for the following data.

Q A Q A
(10³ cfs) (10³ mi²) (10³ cfs) (10³ mi²)
15.50 1.250 18.00 1.400
8.50 0.871 8.75 0.297
85.00 5.690 8.25 0.322
105.00 8.270 3.56 0.178
24.80 1.620 1.90 0.148
3.80 0.175 16.50 0.872
1.76 0.148 2.80 0.091

3.9 Calculate the correlation coefficient between Q and A for the data in exercise 3.8.
3.10 Calculate the coefficient of skew for Q in exercise 3.8. Note that this estimate is relatively
unreliable because of the small sample.

3.11 Calculate the kurtosis and the coefficient of excess for Q in exercise 3.8. Note that these
estimates are unreliable because of the small sample size.

3.12 Complete the steps necessary to arrive at equation 3.56 from 3.55.

3.13 Show that σ_X σ_Y ≥ |σ_{X,Y}|

3.14 A convenient relationship for calculating the estimated variance of a sample of data is

s_X² = [∑ x_i² − n x̄²]/(n − 1) = [∑ x_i² − (∑ x_i)²/n]/(n − 1)

Derive this relationship from equation 3.25.

3.15 The estimated covariance between X and Y of a bivariate random sample can be calculated
from
s_{X,Y} = [∑ x_i y_i − n x̄ȳ]/n = [∑ x_i y_i − (∑ x_i)(∑ y_i)/n]/n

Derive this expression from equation 3.49. Note that the above estimated covariance is biased. In
practice the final divisor of n is replaced by n-1 to correct for bias.

3.16 In exercise 2.14, if the future maximum life of the ferry is 15 years, what is the expected net
profit? Neglect the interest or discount rate.

3.17 What are the maximum likelihood estimates for the parameters of the two parameter
exponential distribution? This distribution is given by

𝑝𝑋 (𝑥) = 𝜆𝑒 −𝜆(𝑥−𝜀) 𝑋 ≥ 𝜀, 𝜆 > 0

3.18 What are the moment estimates for the parameters of the exponential distribution given in
exercise 3.17?

3.19 For the following data, what are the moment and maximum likelihood estimates for the
parameters of the distribution given in exercise 3.17?

x = 15.0, 10.5, 11.0, 12.0, 18.0, 10.5, 19.5

3.20 Calculate the coefficient of skew for the Kentucky River data of table 2.1.

3.21 Calculate the kurtosis of the Kentucky River data of table 2.1.
3.22 Using the data of exercise 2.2, calculate the coefficient of skew from the grouped data.

3.23 Using the data of exercise 2.2, calculate the kurtosis from the grouped data.

3.24 What are the maximum likelihood estimates for α and β in the distribution
p_X(x) = 1/(β − α)     α ≤ X ≤ β

3.25 What are the mean and variance of f_X(x) = 1/N for x = 1, 2, ..., N?

3.26 What are the mean and variance of p_X(x) = a sin²x for 0 < X < π?

3.27 Use the method of moments to estimate a in p_X(x) = a sin²x for 0 < X < π based on the random
sample given by

X = 0.5, 2.0, 3.0, 2.5, 1.5, 1.8, 1.0, 0.8, 2.5, 2.2.

3.28 The rth moment about x₀ can be written as E[(X − x₀)^r]. Show that the variance is the smallest
possible second moment.

