
Statistical Methods in
HYDROLOGY
Second Edition

CHARLES T. HAAN

Iowa State Press


A Blackwell Publishing Company
CHARLES T. HAAN is Regents Professor and Sarkeys Distinguished Professor, Emeritus, from
the Department of Biosystems and Agricultural Engineering, Oklahoma State University,
Stillwater.

© 1974 Iowa State University Press


© 2002 Iowa State Press
A Blackwell Publishing Company
All rights reserved

Iowa State Press


2121 State Avenue, Ames, Iowa 50014

Orders: 1-800-862-6657
Office: 1-515-292-0140
Fax: 1-515-292-3348
Web site: www.iowastatepress.com

Authorization to photocopy items for internal or personal use, or the internal or personal use of
specific clients, is granted by Iowa State Press, provided that the base fee of $.10 per copy is paid
directly to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For those
organizations that have been granted a photocopy license by CCC, a separate system of payments
has been arranged. The fee code for users of the Transactional Reporting Service is
0-8138-1503-7/2002 $.10.

Printed on acid-free paper in the United States of America

First edition, 1974


Second edition, 2002

Library of Congress Cataloging-in-Publication Data


Haan, C. T. (Charles Thomas)
Statistical methods in hydrology / Charles T. Haan.-2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-8138-1503-7 (acid-free paper)
1. Hydrology-Statistical methods. I. Title.
GB656.2.S7 H3 2002
551.48'07'27-dc21 2002000060

The last digit is the print number: 9 8 7 6 5 4 3 2 1


I dedicate this book once again to my wife, Janice, who has been
my constant companion, friend, helpmate, and source of
encouragement for the past 34 years.

Secondly, I dedicate the book to my two daughters,
Patti and Pam, and to my son Chris, his wife Rie,
and their two children, Katrina and Daniel.

Thirdly, I dedicate the book to my parents, Charles and Dorothy,
who gave me a start in life and taught me
many of the values I hold dear.

Finally, the book is dedicated to the many graduate students
that I have worked with. They have been a constant
source of renewal, challenge, inspiration, and joy.

Contents
PREFACE TO SECOND EDITION
PREFACE TO FIRST EDITION
ACKNOWLEDGMENTS FOR THE SECOND EDITION
ACKNOWLEDGMENTS FOR THE FIRST EDITION

1 INTRODUCTION
  Hydrologic data

2 PROBABILITY AND PROBABILITY DISTRIBUTIONS-BASIC CONCEPTS
  Probability
  Total probability theorem
  Bayes theorem
  Counting
  Graphical presentation
  Random variables
  Univariate probability distributions
  Bivariate distributions
  Marginal distributions
  Conditional distributions
  Independence
  Derived distributions
  Mixed distributions
  Exercises

3 PROPERTIES OF RANDOM VARIABLES
  Moments and expectation-univariate distributions
  Measures of central tendency
    Arithmetic mean
    Geometric mean
    Median
    Mode
    Weighted mean
  Measures of dispersion
    Range
    Variance
  Measures of symmetry
  Measures of peakedness
  Moments and expectation-jointly distributed random variables
    Covariance
    Correlation coefficient
    Further properties of moments
  Sample moments
  Probability-weighted moments and L-moments
  Parameter estimation
    Unbiasedness
    Consistency
    Efficiency
    Sufficiency
    Method of moments
    Maximum likelihood
  Chebyshev inequality
  Law of large numbers
  Exercises

4 SOME DISCRETE PROBABILITY DISTRIBUTIONS AND THEIR APPLICATIONS
  Hypergeometric distribution
  Bernoulli processes
    Binomial distribution
    Geometric distribution
    Negative binomial distribution
    Summary of Bernoulli process
  Poisson process
    Poisson distribution
    Exponential distribution
    Gamma distribution
    Summary of Poisson process
  Multinomial distribution
  Exercises

5 NORMAL DISTRIBUTION
  General normal distribution
  Reproductive properties
  Standard normal distribution
  Approximations for standard normal distribution
  Central limit theorem
  Constructing pdf curves for data
  Normal approximations for other distributions
    Binomial distribution
    Negative binomial distribution
    Poisson distribution
    Continuous distributions
  Exercises

6 CONTINUOUS PROBABILITY DISTRIBUTIONS
  Uniform distribution
  Triangular distribution
  Exponential distribution
  Gamma distribution
  Lognormal distribution
  Extreme value distributions
    Extreme Value Type I
    Extreme Value Type III Minimum (Weibull)
    Discussion
  Generalized extreme value distribution
  Beta distribution
  Pearson distributions
  Some important distributions of sample statistics
    Chi-square distribution
    The t distribution
    The F distribution
  Transformations
  Exercises

7 FREQUENCY ANALYSIS
  Probability plotting
    Historical data
    Outliers
  Analytical hydrologic frequency analysis
    Normal distribution
    Lognormal distribution
    Log Pearson type III distribution
    Extreme value type I distribution (Gumbel distribution)
    Other distributions
  General considerations
  Confidence intervals
  Treatment of zeros
  Truncation of low flows
  Use of paleohydrologic data
  Probable maximum flood
  Discussion of flood frequency determinations
  Regional frequency analysis
    Delineation of homogeneous regions
    Historical development
    Statistical methods
    Frequency distributions
    Regression-based procedures
    Index-flood method
    Regional index-flood relationship
    Regionalization using L-moments and the GEV distribution
    Regionalization using modeling
  Frequency analysis of precipitation data
  Frequency analysis of other hydrologic variables
  Exercises

8 CONFIDENCE INTERVALS AND HYPOTHESIS TESTING
  Confidence intervals
    Mean of a normal distribution
    Variance of a normal distribution
    One-sided confidence intervals
    Parameters of probability distributions
  Hypothesis testing
    $H_0$: $\mu = \mu_1$, $H_a$: $\mu = \mu_2$, normal distribution, known variance
    $H_0$: $\mu = \mu_1$, $H_a$: $\mu = \mu_2$, normal distribution, unknown variance
    $H_0$: $\mu = \mu_0$, $H_a$: $\mu \neq \mu_0$, normal distribution, known variance
    $H_0$: $\mu = \mu_0$, $H_a$: $\mu \neq \mu_0$, normal distribution, unknown variance
    Test for differences in means of two normal distributions
    Test of $H_0$: $\sigma^2 = \sigma_0^2$ versus $H_a$: $\sigma^2 \neq \sigma_0^2$, normal population
    Test of $H_0$: $\sigma_1^2 = \sigma_2^2$ versus $H_a$: $\sigma_1^2 \neq \sigma_2^2$ for two normal populations
    Test for equality of variances from several normal distributions
  Testing the goodness of fit of data to probability distributions
    Chi-square goodness of fit test
    Distributional tests based on cumulative distributions
    Comparing two empirical distributions
    General comments on goodness of fit tests
  Exercises

9 SIMPLE LINEAR REGRESSION
  Simple regression
  Evaluating the regression
  Confidence intervals and tests of hypotheses
    Inferences on regression coefficients
    Confidence intervals on regression line
    Confidence intervals on standard error
  Extrapolation
  General considerations
  Exercises

10 MULTIPLE LINEAR REGRESSION
  Notation
  General linear model
  Confidence intervals and tests of hypotheses
    Confidence intervals on standard error
    Inferences on the regression coefficients
    Confidence intervals on the regression line
    Other inferences in regression
  Which line is best
  Extrapolation
  Autocorrelated errors
    Testing for serial correlation
    Corrective action
  Multicolinearity
    Detection of multicolinearity
  An application of multiple regression
  Transforming linear models
  Indicator variables in regression
  General comments
  Logistic regression
  Exercises

11 CORRELATION
  Inferences about population correlation coefficients
  Serial correlation
  Correlation and regional analysis
  Correlation and cause and effect
  Spurious correlation
  Exercises

12 MULTIVARIATE ANALYSIS
  Notation
  Principal components
  Regression on principal components
  Multivariate multiple regression
  Canonical correlation
  Cluster analysis
  Exercises

13 DATA GENERATION
  Univariate data generation
  Multivariate data generation
    Multivariate, correlated, normal random variables
    Multivariate, correlated, nonnormal random variables
  Applications of data generation
  Exercises

14 ANALYSIS OF HYDROLOGIC TIME SERIES
  Definitions
  Trend analysis
  Jumps
  Autocorrelation
  Periodicity
  Autoregressive integrated moving average models (ARIMA)
    Moving Average Processes (MA)
    Autoregressive processes
    Autoregressive Moving Average Models ARMA (p, q)
    Autoregressive Integrated Moving Average ARIMA (p, d, q)
    Estimate of noise variance $\sigma^2$
  Parameter estimation via least squares
    AR models
    MA models
  Parameter estimation via maximum likelihood
  Exercises

15 SOME STOCHASTIC HYDROLOGIC MODELS
  Purely random stochastic models
  First-order Markov process
  First-order Markov process with periodicity
  Higher-order autoregressive models
  Markov chain models
  Exercises

16 PROBABILISTIC METHODS FOR UNCERTAINTY, RISK, AND RELIABILITY ANALYSIS
  Sensitivity analysis
    Traditional or local sensitivity analysis
    Global sensitivity analysis
    Uncertainty analysis
  Reliability and risk analysis
    Uncertainty, risk, and reliability analysis methods
    First-order approximation method
    Simplified FOA estimates for some functional forms
    Monte Carlo simulation
    Corrected FOA method
    Correcting FOA mean and variance estimates of an individual function
    Second-order approximation method
  First-order reliability method
  Generic expectation functions
  Other methods
    Second-order reliability methods
    Point estimation methods
    Transform methods

17 GEOSTATISTICS
  Descriptive statistics
  Semivariogram models
  Combination semivariogram models
  Estimation
  An example
  Anisotropy
  Cokriging
  Local and global estimation
    Polygon declustering
    Cell declustering
    Point kriging
    Block kriging
  Estimation of cumulative distributions
  Uncertainty
  Modeling using geostatistics

APPENDIXES
  A.1. Common distributions
  Hydrologic data
    A.2. Monthly runoff (in.), Cave Creek near Fort Spring, Kentucky
    A.3. Peak discharge (cfs), Cumberland River at Cumberland Falls, Kentucky
    A.4. Peak discharge (cfs), Piscataquis River, Dover-Foxcroft, Maine
    A.5. Total Precipitation (in.) for week of March 1 to March 7, Ashland, Kentucky
    A.6. Flow and sediment load, Green River at Munfordville, Kentucky
    A.7. Streamflow (in.), Walnut Gulch near Tombstone, Arizona
    A.8. Monthly Rainfall (in.), Walnut Gulch near Tombstone, Arizona
    A.9. Annual discharge (cfs), Spray River, Banff, Canada
    A.10. Annual discharge (cfs), Piscataquis River, Dover-Foxcroft, Maine
    A.11. Annual discharge (cfs), Llano River, Junction, Texas
  Statistical tables
    A.12. Standard normal distribution
    A.13. Percentile values for the t distribution
    A.14. Percentile values for the chi square distribution
    A.15. Percentile values for the F distribution
    A.16. Critical values for the Kolmogorov-Smirnov test statistic
    A.17. Durbin-Watson test bounds

BIBLIOGRAPHY

INDEX
Preface to the
Second Edition
SINCE THE publication of the first edition of this book, statistics has come to play an
increasingly important role in hydrology. The advancements in computing technology and data
management have made the application of statistical techniques that were previously known but
difficult to implement almost routine. User-friendly software for personal computers has made
powerful statistical routines available to nearly all hydrologists. Generally, this software comes
with user manuals or help files that lead a new user through the steps needed to use the programs.
Unfortunately, these aids rarely indicate the assumptions inherent in the techniques, the limita-
tions of the techniques, and the situations in which the techniques should or should not be used.
They are generally weak in instructing one on the interpretation of the results of the analysis as
well. This software is a tool that is available for use in hydrology but does not replace sound
hydrologic understanding of the problem at hand nor does it replace a basic understanding of the
statistical technique being used.
This current edition should serve as a companion to many of the software programs
available, not to explain how to use the software, but to provide guidance as to the proper
routines to use for a particular problem and the interpretation of the results of the analysis.
The basic philosophy of the current edition is the same as that of the first edition. Enough
detail on particular statistical methods is presented to gain a working understanding of the tech-
nique. Certainly the treatment on any particular statistical technique is not exhaustive. Much
theory and derivation are omitted and left to more in-depth treatments found in books dealing
specifically with the various topics.
Two chapters have been added to the book. One of these chapters deals with uncertainty
analysis and the other with geostatistics. Both of these topics have received great emphasis in
the past decade. Uncertainty analysis is a growing concern as it is increasingly recognized that
both statistical and deterministic analyses result in estimates that are far from absolute answers.
Increasingly, attempts are made to evaluate how much uncertainty should be associated with
various types of analyses. Rather than providing a point estimate of some quantity, confidence
limits are sought, such that one can assert with various degrees of confidence the bounds within
which the sought-after quantity is thought to lie. Geostatistics has become increasingly
important as geographically referenced information becomes available and is used in geographical
information systems (GISs) to produce hydrologic estimates.
The chapter on uncertainty was written by Aditya Tyagi, a former PhD candidate at
Oklahoma State University and currently a water resources engineer with CH2M Hill. Jason
Vogel, a research engineer and PhD candidate at Oklahoma State University, was a coauthor of
the chapter on geostatistics.
Preface to the
First Edition
THE RANDOM variability of such hydrologic variables as streamflow and precipitation has
been recognized for centuries. The general field of hydrology was one of the first areas of science
and engineering to use statistical concepts in an effort to analyze natural phenomena. Many pa-
pers have been published that amply demonstrate the value of statistical tools in analyzing and
solving hydrologic problems. In spite of the long history and proven utility of statistical tech-
niques in hydrology, relatively few comprehensive and basic treatments of statistical methods in
hydrology have been published.
This book has been prepared to assist engineers and hydrologists in developing an elementary
knowledge of some statistical tools that have been successfully applied to hydrologic problems.
The intent of the book is to familiarize the reader with various statistical techniques, point out
their strengths and weaknesses and demonstrate their usefulness. The serious reader will want to
supplement the material with formal courses or independent study of those individual topics that
are major interests. No single topic has been developed completely. Books have been written cov-
ering many of the topics discussed as single chapters in this presentation. Again the purpose here
is to develop understanding and illustrate the usefulness of the techniques. Most of the techniques
are discussed in sufficient detail for a thorough understanding and application to problem situa-
tions. The philosophy of the presentation has been that one does not have to understand hydro-
dynamics to swim even though it could help one to become a more proficient swimmer.
The book has not been written for statisticians or for those primarily interested in statistical
theory. Rather it has been prepared for hydrologists and engineers interested in learning how
statistical models and methods can be valuable tools in the analysis and solution of many
hydrologic and engineering problems. The basic premise has been taken (and justifiably so) that
statisticians are competent so that many statistical results are presented without developing a
rigorous proof of their validity. Proofs for most results can be found in mathematical statistics
books, many of which are listed in the bibliography.
No prior knowledge of statistics is required if one starts with Chapter 2. Those with varying
degrees of statistical knowledge may choose to start with later chapters. A knowledge of calculus
is required throughout and some familiarity with matrices is needed for material in later chapters.
Appendix D is a review of the basic matrix manipulation used in the book (not in this new
edition).
This is not a statistical "cookbook" for hydrologists. It does not contain step-by-step
calculation procedures for "standard" hydrologic problems. Basic statistical concepts are discussed
and illustrated in enough detail so that one can develop his own computational procedures or
methods.
Most of the computations in actual work situations would be done on digital computers.
Computer programs have not been included because it is felt that most computer centers will
have programs or programmers available. Likewise computational techniques are not emphasized.
For example, in the chapter on multiple regression, efficient techniques for matrix inversion
are not presented as it is felt that these techniques are readily available at most computer
centers. The emphasis is thus retained on the statistical technique being used and not on the
computational aspects of the problem.
Some liberties have been taken in that many terms are not precisely defined in a mathemat-
ical sense unless such a definition is warranted. Where terms are loosely defined, it is hoped that
the meticulous reader will accept the general connotation of the terms for purposes of simplicity
and to avoid placing emphasis on terms rather than concepts.
Many of the problems require sets of data. Those data may be supplied by the reader or se-
lected from the data in Appendix C.
I am grateful to the Literary Executor of the late Sir Ronald A. Fisher, F.R.S., to Dr. Frank
Yates, F.R.S., and to Longman Group Ltd., London, for permission to reprint Table E.5 from their
book Statistical Tables for Biological, Agricultural and Medical Research, 6th Edition (1974) (not
in this new edition).
Acknowledgments for
the Second Edition
IT HAS been nearly a quarter century since I wrote the first edition of this book. During
that time I have become indebted to many people. I have spent nearly this entire period with
the Biosystems and Agricultural Engineering Department at Oklahoma State University. This
Department has provided a wonderful atmosphere for intellectual growth and accomplishment.
The faculty, staff, and students that I have been associated with have helped to create a working
environment that was challenging, friendly, and one in which my only limitation was myself.
I am grateful to many individuals. Bill Barfield has continued to be a valued friend and
coworker. Dan Storm, Bruce Wilson, and many graduate students have been especially instru-
mental in much of my research and teaching in the field of statistical hydrology.
My daughter, Dr. Patricia Haan, assistant professor in the Biological and Agricultural
Engineering Department at Texas A&M University, has been very helpful in clarifying some
points in the text and correcting errors.
Certainly my wife of 34 years, Jan, has been most supportive and forgiving as I have devoted
far too much time to work.
As is true of all of us, I owe whatever I have accomplished to my Creator without Whom
I could accomplish nothing.
Acknowledgments for
the First Edition
MUCH OF the material presented in this book was developed for a course taught to students
in the Agricultural Engineering and Civil Engineering Departments at the University of Ken-
tucky. The suggestions and clarifications made by the students in this course over the past 8 years
have been a great aid in attempting to make this book more understandable.
Special acknowledgment must be given to Dan Carey for his careful readings of the entire
manuscript. These readings resulted in several corrections and clarifications. Several individu-
als have read parts of the book and made valuable suggestions for its improvement. Among
those reviewing parts of the manuscript were Donn DeCoursey, David Allen, David Culver,
and personnel of the U.S. Soil Conservation Service under the direction of Neil Bogner.
Several individuals in the Agricultural Engineering Department at the University of Kentucky
offered valuable suggestions and considerable encouragement. Deserving special mention are Billy
Barfield, Blaine Parker, and John Walker.
This undertaking has required sacrifice on the part of my family and especially my wife Janice.
She not only typed the early drafts of the book but offered continued encouragement over the years
as work and revisions were done on the book.
This manuscript was reproduced from photo-ready copy. The excellent typing involved in
preparing this final draft as well as an earlier draft was done by Pat Owens. Buren Plaster drafted
all of the figures.
Of course any failings and shortcomings of this book must be credited to me. My hope is that it
will be found useful in at least partially meeting the need for an elementary treatment of statistical
methods in hydrology. Whatever is accomplished along these lines I owe to our Father for giving me
the will to see this project through and the ability to withstand the setbacks experienced along the way.
Finally I express my appreciation to all of the members of the Agricultural Engineering
Department at the University of Kentucky for their understanding during the preparation of this
manuscript.
1. Introduction
MORE THAN 25 years ago I set about writing a book on the application of statistical tech-
niques to hydrology. That book, published in 1977, became the first edition of this current work
and was appropriately titled Statistical Methods in Hydrology. Although soundly criticized for
producing a book of the general type "Statistics for ___," that was little more than a "relevant
Schaum's Outline series" on statistics with a little hydrology thrown in (Burges 1978), the book
has had a very wide reception, has gone through several printings and has been widely quoted in
the literature. However, as I have reflected on this critique over the years, and as I have used sta-
tistics to address problems in hydrology and observed others doing the same, I have come to the
conclusion that this critique contained a large element of truth.
There is no shortage of very fine books at many levels of complexity on statistics. The
theory of statistical procedures and the assumptions in statistical procedures are well explained
and widely available. The same statistical techniques might be applied to hydrologic data or to
the comparison of the value of the Japanese yen to the U.S. dollar. Statistical techniques are based
in mathematics and probability. The units attached to the data being studied are immaterial from
a statistical standpoint. What is important is the degree to which the data agree with the assump-
tions inherent in the statistical procedure being applied.
Similarly, there are many books on hydrology. Some of these books are quite general, some
are quite theoretical, some are quite empirical, and none are really exhaustive. The problem with
hydrology is that it is, in practice, very messy. For example, we can present in great detail the
mathematical development of equations describing the overland flow of water on planes of vari-
ous types and how flow profiles develop and how runoff hydrographs result at the lower end of
these planes. There exist very elegant solutions for these problems, although numerical
procedures are often required to arrive at them. With rapid advances in computing technology,
this presents a rapidly diminishing problem.
The real problem as I see it is that we have developed an elegant solution to a nonexistent
problem. In my lifetime I have observed many rainfall-runoff events and have rarely seen the
type of flow described above except in artificial situations such as parts of parking lots or streets
covering a tiny fraction of a drainage basin. If there is any overland flow, before it goes very far
flow concentration develops and the overland flow "planes" become very nonuniform.
Does that mean it is wrong to develop and present these idealized equations? Does that
mean it is wrong to use models that contain these equations to develop runoff hydrographs? NO!
It simply means that one must be aware of the relationships between the mathematics of the
model and the actual hydrology that is occurring. Through proper selection of roughness coeffi-
cients and other coefficients in such models, good estimates of runoff hydrographs may result.
Yet that does not mean that the model actually describes in exact detail the hydrologic processes
that are occurring. We must not confuse actual hydrologic processes with models of these
processes.
On numerous occasions I have seen those practicing hydrology confusing hydrologic mod-
els with actual hydrologic systems. The complexity, the nonhomogeneity, the dynamic nature of
actual hydrologic systems are not recognized. The uncertainty inherent in the parameters used by
hydrologic models to particularize the model to a specific catchment or hydrologic problem is not
recognized. The numbers produced by the model are taken as the true hydrologic response of the
actual hydrologic system. More disturbing, the algorithms that make up the model are taken as
true and exact representations of the hydrologic systems they purport to represent. Quite likely
the one using the hydrologic model has great skills in modeling and in computers but little
understanding of the complexity of hydrologic systems.
At this point one might be wondering why I have jumped on mathematical models when this
book is about statistics. The answer lies in my experience over the years that statistical methods
are often criticized for not being physically based and not representing what is actually occurring
in the field. Yet all hydrologic models, not just statistical models, are susceptible to this criticism.
Statistical models, just like mathematical models, are often applied with little regard to the
assumptions in the models. Some take model results as truth, especially if the statistical or
mathematical technique is complex. Others will reject model results on the basis that not all
assumptions are met. So basically, in hydrology, we face the same dilemmas whether we use
mathematical or statistical models.
No model describes the actual and complete hydrology of anything but the simplest of
settings. Regardless of what approach we use toward solving an actual hydrologic problem,
compromise must be made with the methodology employed. One can never turn professional
judgment over to any particular hydrologic model whether the model is mathematical, statistical,
or some combination of the two. Any model must be seen as an aid to judgment and not as a
replacement for it.
There are no completely theoretical models and no completely statistical models. All mod-
els have components of both theory and statistics. Both are techniques for quantifying our
understanding and our observations of hydrologic processes. The presence of theory or statistics
may not be a formal presence, but it is there. This leads to the conclusion that all models have
statistical components to some degree. Any constants that are estimated based on observations,
even observations formalized into tables like Manning's n values, have been determined by for-
mal or informal application of statistics. Any statistical model should be formulated based on
some understanding of the system being modeled. This understanding may be brought into the
model through a conceptual structure of the model. These conceptual components are what bring
hydrology into the model as opposed to having a purely statistical model. In my view, one should
not ignore hydrology when developing models for use in hydrology no matter how sophisticated
the statistical techniques that are being used. To the extent that hydrologic knowledge is used in
structuring a statistical model, the model may be said to contain conceptual components. Statis-
tical models should not be developed by simply throwing data on every conceivable variable into
some computerized statistical routine and hoping for the best.
As far as the hydrologist is concerned, statistics is not an end in itself. Statistics is a tool that
may help one to understand hydrological data. That statistics is a tool for hydrology must
be kept foremost in mind. It must also be kept in mind that statistics is just one of several tools
available for application in hydrology.
Hydrologic processes are not driven by principles of statistics but by physical, chemical,
and biological principles, the so-called "Laws of Nature." Often the hydrologic setting is of such
complexity that the underlying component hydrologic processes cannot be expressed in such a
way as to yield a suitable computational framework for describing the system. Perhaps the
mix of surface soil properties, land uses, topography, and so forth is such that the setting of a
particular hydrologic problem cannot be adequately described. Perhaps the complexity and
heterogeneity of the system are such as to preclude deterministic modeling. Perhaps data are
available on a response variable such as stream flow, water quality, or ground water level, but not on
the causative variables of rainfall, evaporation, infiltration, and so on. In such a case statistical
techniques may be needed in an effort to uncover descriptive behavioral relationships among the
data. Such relationships are not cause-effect relationships but descriptive relationships. The
relationships may support hypotheses concerning cause and effect but do not conclusively estab-
lish such relationships.
Over the past 20 years I have seen many inappropriate applications of statistics in hydrol-
ogy. I have seen hydrologists stake their reputation as hydrologists on statements made based on
poor knowledge of statistics. I have also seen statisticians make far-reaching conclusions with a
very elementary knowledge of hydrology; here the argument goes "the data show . . .". The data
are separated from their hydrologic reality and analyzed as pure numbers!
One thing that has compounded the problem of inappropriate use of statistics in hydrology
(or any other field, I suspect) is the ready availability of powerful statistical software that is easy
to use. I applaud the availability of this software but shudder at some of the applications that are
made with it.
Sometimes a statistical procedure is improperly applied or applied in inappropriate
circumstances. The numbers generated by a statistical analysis are then venerated as absolute truth. It
would be better to apply a technique while recognizing and admitting its shortcomings, and then to use
the results as a guide, rather than religiously adopting the results and claiming they represent reality.
This long introduction has been composed to impart some of my hydrologic-statistical
modeling philosophy and to alert the reader that this book will emphasize the assumptions
inherent in statistical techniques and the consequences of violating these assumptions. Statistical
techniques will be explained at the practical level without many derivations and proofs. Refer-
ences to these will be given. The book will be most useful to someone having at least an elemen-
tary knowledge of mathematical statistics and hydrology. This book addresses the interface of
these two disciplines.
The question naturally arises as to what is meant by hydrology in this book. Hydrology
broadly defined is the study of water. The Federal Council for Science and Technology (1962)
defined hydrology as

the science that treats of the waters of the Earth, their occurrence, circulation, and
distribution, their chemical and physical properties, and their reaction with their envi-
ronment, including their relation to living things. The domain of hydrology embraces
the full life history of water on the earth.

This definition is more or less used in this book. The definition is broad and includes topics
some may consider to be more proper to geology, engineering, environmental science, biology,
chemistry, paleontology, or some other science. Some may even feel it includes aspects which are
nonscientific. By using this definition, when the word "hydrology" is used, it includes these other
areas as well.
Statistics will be considered in a limited sense in the context of this book. Statistics will be
defined as

a science devoted to developing an understanding of a system that can be used to make
inferences about the system based on observation relative to that system.

Models are often used in developing this understanding and in making inferences. Model is
a general term that will be taken to mean

a collection of physical laws and empirical observations written in mathematical terms
and combined in such a way as to produce estimates based on a set of known and/or
assumed conditions.

There are many ways of collecting physical laws and empirical observations and of combining
them to produce a model. Models can generally be represented as

$$O = f(I, P) + e$$

where $O$ represents the outputs or quantities to be estimated; $f(\ldots)$ represents the mathematical
structure of the model; $I$ represents inputs to the model, boundary conditions, and initial
conditions; $P$ represents parameters that help particularize the model to a specific situation; and $e$
represents differences between what actually occurs, $O$, and what the model predicts, $\hat{O} = f(I, P)$.
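
The general form just described (outputs as a function of inputs and parameters, plus an error term) can be made concrete with a small sketch. The Python below is only an illustration and is not a model taken from this book: the single linear reservoir used as the model structure, the parameter names k and s0, and the observed values are all assumptions invented for the example.

```python
def linear_reservoir(inputs, params):
    """Hypothetical model structure f(I, P): a single linear reservoir.

    inputs -- rainfall depth in each time step (the inputs I)
    params -- {"k": fraction of storage released per step,
               "s0": initial storage} (parameters P and an initial condition)
    """
    storage = params["s0"]
    outflow = []
    for rain in inputs:
        storage += rain              # add this step's rainfall to storage
        q = params["k"] * storage    # release a fixed fraction of storage
        storage -= q
        outflow.append(q)
    return outflow


def residuals(observed, inputs, params, model=linear_reservoir):
    """Error term e: observed output minus model prediction, step by step."""
    predicted = model(inputs, params)
    return [o - p for o, p in zip(observed, predicted)]


rain = [10.0, 0.0, 5.0, 0.0]
params = {"k": 0.5, "s0": 0.0}
observed = [5.5, 2.0, 3.5, 2.0]      # made-up observations, for illustration only

predicted = linear_reservoir(rain, params)
errors = residuals(observed, rain, params)
print(predicted)
print(errors)
```

The errors are nonzero even though the inputs and parameters are known exactly, which is the point of carrying an explicit error term in the general form.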

There are many ways of classifying models. Some people draw sharp distinctions between
statistical models and other models. In practice one cannot do a thorough modeling exercise
without drawing on statistics in some way. Often some type of statistical work has to be done to
come up with values for parameters for a model that might otherwise be considered a nonstatis-
tical model. Thus, the parameters of the model become some function of observations. If another
set of observations were used presumably different parameter values would result. Since obser-
vations (data) in hydrology are generally thought of as random variables and any function of a
random variable is a random variable, the parameters for the model effectively become random
variables and thus a statistical element enters a model that might otherwise not be considered as
a statistical model.
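
A short simulation may help show why parameters estimated from data behave as random variables. The sketch below is an assumption-laden illustration, not a procedure from this book: the "true" runoff coefficient of 0.4, the noise level, the sample size of 20 events, and the function names are all hypothetical. Each repetition draws a fresh set of observations and refits the same one-parameter model.

```python
import random


def fit_runoff_coefficient(rain, runoff):
    """Least-squares slope through the origin: c = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(rain, runoff)) / sum(x * x for x in rain)


def sample_events(rng, n, true_c=0.4, noise_sd=0.5):
    """Generate n hypothetical (rain, runoff) pairs scattered about runoff = true_c * rain."""
    rain = [rng.uniform(1.0, 10.0) for _ in range(n)]
    runoff = [true_c * x + rng.gauss(0.0, noise_sd) for x in rain]
    return rain, runoff


rng = random.Random(1)   # fixed seed so the sketch is repeatable

# Refit the same model to 200 independent sets of 20 observations each.
estimates = [fit_runoff_coefficient(*sample_events(rng, 20)) for _ in range(200)]

mean_c = sum(estimates) / len(estimates)
spread = max(estimates) - min(estimates)
print(f"mean estimate = {mean_c:.3f}, range of estimates = {spread:.3f}")
```

Every set of observations yields a somewhat different coefficient, so the fitted parameter has a distribution of its own even though the model structure never changes.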
Broadly speaking, quantitative hydrologic models fall on a continuous spectrum of model
"types" ranging from completely deterministic on the one hand to completely stochastic on the
other. A completely deterministic model would be one arrived at through consideration of the un-
derlying physical relationships and would require no experimental data for its application.
Statistical models range in complexity from estimating the most likely outcome or result of an
experiment to describing in detail a sequence (time series) of outcomes that mimic actual out-
comes. All statistical approaches rely on observations. The mathematical techniques used to
extract the information contained in the observations may be as simple as computing an average
or so complex as to require thousands of stochastic simulations.
Most hydrologic models fall somewhere between the extremities of this model spectrum.
Often such models are termed parametric models. A parametric model may be thought of as de-
terministic in the sense that once model parameters are determined, the model always produces
the same output from a given input. On the other hand, a parametric model is stochastic in the
sense that parameter estimates depend on observed data and will change as the observed data
change. A stochastic model is one whose outputs are predictable only in a probabilistic sense.
With a stochastic model, repeated use of a given set of model inputs produces outputs that are not
the same but follow certain statistical patterns. A statistical model is one arrived at by applying
statistical methods to a set of data to produce an estimation procedure. Multiple regression
models are examples of statistical models. In this sense, all stochastic models are statistical models,
but not all statistical models are stochastic models.
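
The distinction between a model that is deterministic once its parameters are fixed and a stochastic model can be sketched in a few lines. The functions and numbers below are hypothetical and chosen only for illustration: a runoff coefficient of 0.4 and a normally distributed disturbance are assumptions, not recommendations.

```python
import random


def parametric_runoff(rain, c):
    """Deterministic once c is fixed: the same input always gives the same output."""
    return c * rain


def stochastic_runoff(rain, c, sd, rng):
    """Stochastic: each call is a random draw, so only the pattern of outputs is predictable."""
    return max(0.0, c * rain + rng.gauss(0.0, sd))


rng = random.Random(42)  # fixed seed so the sketch is repeatable

deterministic = [parametric_runoff(5.0, 0.4) for _ in range(3)]
stochastic = [stochastic_runoff(5.0, 0.4, 0.3, rng) for _ in range(1000)]

mean_stochastic = sum(stochastic) / len(stochastic)
print(deterministic)
print(f"mean of 1000 stochastic outputs = {mean_stochastic:.3f}")
```

Repeated calls with the same input give identical deterministic outputs, while the stochastic outputs differ from call to call yet cluster around a stable mean.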
No matter how simple the hydrologic system or how complex the hydrologic model, the model
is always an approximation to the system. There are no hydrologic models (deterministic,
stochastic, or combined) that represent exactly anything but the most trivial of hydrologic systems.
The digital computer has made possible great advances in all types of hydrologic models.
These advancements are noteworthy for both stochastic and deterministic models and have led
some hydrologists to vigorously adopt the philosophy that all hydrologic problems should be
attacked stochastically and some the philosophy that they should be attacked deterministically. The
purpose of this book is not to promote statistical or stochastic models but to present some basic
statistical concepts that have been found useful as aids for the solution of hydrologic problems.
Many hydrologic problems can best be solved through the joint application of the various
modeling methods. For instance, it may be possible to adequately predict the runoff hydrograph
from a simple watershed deterministically given the rainfall input. It is unlikely, however, that
rainfalls that will occur during the life of a water resources project will be deterministically pre-
dictable. Thus, one approach to project evaluation would be a stochastic simulation of rainfall,
deterministic conversion of the rainfall to streamflow, and a statistical analysis of the resulting
streamflows.
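
The three-step approach described above (stochastic rainfall, deterministic conversion, statistical analysis) might be sketched as follows. This is a minimal illustration under stated assumptions: the exponential distribution for storm depth, its 3-inch mean, and the curve number of 75 are all hypothetical, and the SCS curve number equation is used here only as a convenient, widely known deterministic rainfall-runoff relation, not as the method this book prescribes.

```python
import random
import statistics


def scs_runoff(p_in, cn):
    """Deterministic step: SCS curve number runoff depth (inches) for rainfall p_in."""
    s = 1000.0 / cn - 10.0           # potential maximum retention (inches)
    ia = 0.2 * s                     # initial abstraction
    if p_in <= ia:
        return 0.0
    return (p_in - ia) ** 2 / (p_in + 0.8 * s)


def simulate_storm_depths(rng, n, mean_in=3.0):
    """Stochastic step: n hypothetical storm depths (inches) from an exponential
    distribution, chosen only for illustration."""
    return [rng.expovariate(1.0 / mean_in) for _ in range(n)]


rng = random.Random(7)               # fixed seed so the sketch is repeatable
rain = simulate_storm_depths(rng, 1000)
runoff = [scs_runoff(p, cn=75) for p in rain]

# Statistical step: summarize the simulated runoff series.
mean_runoff = statistics.mean(runoff)
q90 = statistics.quantiles(runoff, n=10)[-1]   # 90th percentile
print(f"mean runoff = {mean_runoff:.2f} in., 90th percentile = {q90:.2f} in.")
```

Each step could be swapped independently, for example a different rainfall generator or a more detailed runoff model, without changing the overall structure of the evaluation.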
Regardless of the type of model that is used, model parameters must be determined in some
way from observed hydrologic data. The validity and applicability of a model depend directly on
the characteristics of the data used to estimate model parameters. A model can be no better than
the data available for parameter estimation. The data used for parameter estimation must be rep-
resentative of the situation in which the model is going to be used. Obviously, if one is attempt-
ing to model streamflow from an urban area, model parameters cannot be estimated from forested
watersheds. Similarly, future hydrologic behavior of a watershed can be modeled based on past
observations only if available historical data are representative of future conditions. If drastic
land use changes are to be made, then the model parameters must be adjusted accordingly.
All techniques used for hydrologic analysis rely on assumptions. Often the strict validity of
the analysis depends on how well the true system meets these assumptions. This is certainly true
of statistical models and statistical methods applied to hydrologic systems.
There are no statistical procedures whose assumptions exactly match particular hydrologic
systems. Likewise there are no hydrologic systems that exactly meet the assumptions made in
any particular hydrologic model.
With this in mind one is forced to the conclusion that models cannot yield an exact solution
to any realistic hydrologic problem. Models must be treated as tools that can be used to gain
insight and to arrive at potential outcomes in a given hydrologic setting, but the final decision
regarding any hydrologic process rests with the hydrologist, not the models. The hydrologist may
choose to adopt a solution generated from modeling considerations, but this decision must be
based on the hydrologist's convictions that the solution is hydrologically sound and not simply
on how well the model describes the data. How close the final real solution is to the model
solution will certainly depend on how well the physical setting matches the assumptions of the
modeling techniques employed. It is the hydrologist who must make the determination as to
the relationship between the model result and hydrologic reality.
The fact that a statistical modeling procedure requires assumptions that are not strictly met
in a particular hydrologic setting does not mean that statistically derived results are of no value.
Again, the statistical modeling technique is used to provide insight into the problem at hand and
not the final result. Even when it is known that certain assumptions are violated, useful informa-
tion can often be obtained from a statistical modeling effort.
Throughout this book, assumptions that accompany the statistical technique being discussed
will be set forth and discussed from a hydrologic standpoint. The potential problems associated
with violating the assumptions will be discussed. One of the frustrations constantly faced
in using statistical models to represent hydrologic systems is determining whether, or to what
extent, the assumptions are met for a particular set of data, and how any violation affects the
conclusions reached using the method.
One might come away feeling that it is inappropriate to use statistics in hydrology. That is not
the case at all. What is inappropriate is for an analyst to relegate absolute hydrologic authority to
a statistical analysis at the expense of hydrologic knowledge of the system and to give no weight
to other tools available, such as mathematical models and common sense.
Deterministic hydrologic models, whether numerical or conceptual, suffer the same prob-
lems in terms of assumptions as do statistical models. Rarely are hydrologic models adequately
tested over the full range of conditions for which they will be applied. Rarely are all of the as-
sumptions associated with hydrologic models actually set forth. For instance, one assumption
inherent in hydrologic models is that a basin's hydrologic response to a rare or extreme event can
be modeled with the same algorithms used to model common or predominant events.
In hydrologic frequency analysis, the criticism is often justifiably leveled that estimating a
rare flood, say a 500-year flood, from a record of 20 or 30 years, none of which are
extraordinarily large, is fraught with the possibility of error. The question is asked, how could
relatively common flow levels have information embedded in them that would determine the
magnitude of a 500-year event? Said in another way by example, in Oklahoma most annual peak
flows from smaller watersheds are generated from thunderstorms that arise over the Great Plains
of the central United States. The really big floods may be the result of a hurricane sweeping in
from the Gulf of Mexico and traveling over Oklahoma. How can flow data from thunderstorms
predict flow magnitudes of hurricane-related floods?
But the same questions apply to deterministic hydrologic models. If a model is formulated
and parameters estimated based on common flow levels, how can one be sure these same
parameter values and algorithms apply to extreme events?
In both cases, flood frequency analysis and modeling, information is gained about the pos-
sible magnitude of the 500-year event. For certain neither estimate is exact! In addition to these
estimates the hydrologist should do some field work, look at channel capacities, possibly look
for evidence of extreme floods in the geologic past (paleohydrology), and rely on as much
hydrologic reasoning as possible to arrive at the final estimate of the 500-year event. One should
additionally attempt to place some type of uncertainty bands on the estimate.
What is being suggested is that responsibility for a hydrologic estimate rests squarely on the
hydrologist rather than on some analytic technique. One cannot blame the log-Pearson type III
distribution for making a bad flood frequency estimate. The problem is not the distribution itself
(after all the distribution is just a mathematical equation) but the inappropriate application of the
distribution in making the estimate. One cannot blame a hydrologic model if a hydraulic struc-
ture fails because the flow estimated by the model was in error. One may conclude that the model
was inappropriate but it was the hydrologist that made the estimate using the model as a tool.

HYDROLOGIC DATA
Hydrologic data seems to be simultaneously abundant and scarce. We are deluged with data
on rainfall, temperature, snowfall, and relative humidity from around the world on a daily basis
in newspapers, radio and television reports, and on world-wide computer information networks.
Many agencies worldwide collect and archive hydrologic data on streamflow, lake and reservoir
levels, ground water elevations, water quality measures, and other aspects of the hydrologic
cycle. These data are available in many different forms. Currently access to hydrologic data is
being rapidly improved as the data is made available over electronic networks.
Yet in the face of this apparent abundance, data on a particular aspect of the hydrologic cycle
at a particular location for a particular time period are often inadequate or completely lacking. It
is often the task of the hydrologist to use any data that can be found having some application to
the problem at hand, hydrologic models of various kinds, plus their own hydrologic knowledge
to explain past, present, or anticipated hydrologic behavior of the system under study. Statistical
procedures are used to evaluate the data, transfer the data to the problem at hand, select models
and model parameters, evaluate model predictions, organize one's personal conception of how
available data and knowledge come to bear on the problem, make predictions of future behavior
of the system, and many other aspects of hydrologic problem-solving.
Hydrologic data are generally presented as values at particular times, such as a river stage at
a particular time, or values averaged over time, such as the annual flow for a stream for a partic-
ular year. Aggregating data into averages over time intervals may cause a loss of information if
the variability of the process within the time period is of interest. Conversely, aggregation may
make it possible to more clearly visualize long-term trends because short-term variations about
the trend may be removed. The variability from observation to observation in a time series of hy-
drologic data may be very rapid and significant or very minor. Generally systems having a lot of
storage vary more slowly than systems lacking that storage. Figure 1.1 is a plot of the water sur-
face elevation of the Great Salt Lake near Salt Lake City, Utah. This figure shows that during the
period of this record, water level changes of about 20 feet have occurred but year-to-year change
is relatively slow with the exception of 1982-1984 when a rise of about 4 feet per year occurred
and in the late 1980s when the level dropped rather quickly.
Figure 1.2 shows the annual peak discharge for the Kentucky River near Salvisa, Kentucky.
There is little year-to-year carry-over or storage in this river system, so the flows vary more or
less randomly from one year to the next.
Figure 1.3 shows the water surface elevation of Devils Lake in North Dakota. The behav-
ior of this lake is puzzling in that it has gone from nearly 1440 feet in elevation in 1867 to 1401
feet in 1940 in an almost continuous decline, at which point an erratic but steady increase in ele-
vation began until it reached 1447 feet in 1999.

[Plot omitted; horizontal axis: Year.]

Fig. 1.1. Water surface elevation of the Great Salt Lake near Salt Lake City, Utah.

[Plot omitted; horizontal axis: Year, 1895 to 1995.]

Fig. 1.2. Annual peak flows on the Kentucky River near Salvisa, Kentucky.

[Plot omitted; horizontal axis: Year, 1850 to 1990.]

Fig. 1.3. Water surface elevation of Devils Lake, North Dakota.

In the case of the Salt Lake data, a model that estimated the water level in one year based
solely on the level the previous year might produce reasonable estimates. The form of such a
model would be $y_t = y_{t-1}$, where $y_t$ is the water level at time $t$ and $y_{t-1}$ is the
water level at the previous time $t - 1$. Such a model may give a better prediction of the lake
level in year $t$ than would a model $y_t = \bar{y}$, where $\bar{y}$ is the average lake level.
The opposite is the case in the Kentucky River peak flow data. Here $y_t = \bar{y}$ would be
better than $y_t = y_{t-1}$. The previous year's flow is of
little value in predicting the current year's flow.
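
The contrast between these two simple models can be checked numerically. The following Python sketch is illustrative only. It does not use the Great Salt Lake or Kentucky River records; the two synthetic series, their lengths, and all parameter values are assumptions chosen to mimic a system with a lot of storage and a system with little storage. It compares the mean squared error of the persistence model $y_t = y_{t-1}$ with that of the mean model $y_t = \bar{y}$.

```python
import random


def mse(predictions, observations):
    """Mean squared error between two equal-length sequences."""
    return sum((p - o) ** 2 for p, o in zip(predictions, observations)) / len(observations)


def compare_models(series):
    """Return (mse_persistence, mse_mean) for predictions of series[1:]."""
    observed = series[1:]
    persistence = series[:-1]                 # y_t = y_(t-1)
    mean_level = sum(series) / len(series)    # y_t = ybar (in-sample mean, for simplicity)
    mean_model = [mean_level] * len(observed)
    return mse(persistence, observed), mse(mean_model, observed)


rng = random.Random(1)

# Hypothetical "lake-like" series: each level is last year's level plus a small change.
lake = [4200.0]
for _ in range(149):
    lake.append(lake[-1] + rng.gauss(0.0, 1.0))

# Hypothetical "peak-flow-like" series: independent values from year to year.
peaks = [rng.lognormvariate(11.0, 0.4) for _ in range(150)]

lake_persist, lake_mean = compare_models(lake)
peak_persist, peak_mean = compare_models(peaks)

print(f"lake-like series: persistence MSE={lake_persist:.2f}, mean-model MSE={lake_mean:.2f}")
print(f"peak-like series: persistence MSE={peak_persist:.3g}, mean-model MSE={peak_mean:.3g}")
```

For series built this way, the persistence model should do better on the lake-like series and the mean model should do better on the independent series, which mirrors the argument made above for the two records.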
A model for Devils Lake would be difficult to surmise based simply on lake level data,
because even a reasonable estimate for the long-term average lake level could not be determined
on this record of over 100 years. Simply based on the data, one cannot determine the maximum
elevation reached prior to 1867 or what elevation the lake might achieve in the absence of
human interference after 1999. Presumably, physical and hydrologic information would shed
some light on this problem. These considerations will be discussed in detail and quantified later
in the book.
In selecting data for model parameter estimation, it is important to establish that the data are
representative and homogeneous over time or can be adjusted for any nonhomogeneities that
may be present. If anything has occurred to cause a change in the characteristic being analyzed,
the data must either be adjusted to account for the change or analyzed in two sections: one before
the change and one after.
Some common causes of nonhomogeneities are relocating gages (especially rain gages),
diverting streamflows, constructing dams, watershed changes such as urbanization or deforesta-
tion, stream channel alterations and possibly weather modification, as well as natural events of a
catastrophic nature such as earthquakes, hurricane floods, and so forth. In some instances the data
can be corrected for changes. One possible adjustment would be by reverse reservoir routing to
determine what streamflows would have been had a reservoir not been constructed. Some
changes such as gradual urbanization of a watershed are difficult to correct.
The statement that the data must be representative means, for example, that data from only
unusually wet or dry periods should not be used alone as this will bias the results of the analysis.
If there are only a few years of record available for analysis, the chances are good that the data
are not representative of the long-term variability that actually exists. Most stochastic models as-
sume that the data being considered are homogeneous and representative.
The concept of the return period of hydrologic events plays an important role in hydrology.
The return period of an event is defined as the average elapsed time between occurrences of an
event with a certain magnitude or greater. For example, a 25-year peak discharge is a discharge
that is equaled or exceeded on average once every 25 years over a long period of time. It does not
mean that an exceedance occurs every 25 years, but that the average time between exceedances
is 25 years. An exceedance is an event with a magnitude equal to or greater than a certain value.
Sometimes the actual time between exceedances is called the recurrence interval. With this
definition for recurrence interval, the average recurrence interval for a certain event is equal to
the return period of that event. In this book, recurrence interval is used in the same sense as return
period.
Of course, the concept of return period can also be applied to low flows, droughts, shortages,
and so on. In this case the return period would be the average time between events with a certain
magnitude or less. Such an event might still be called an exceedance in the sense that the sever-
ity of a drought exceeds some preset level.
Regardless of whether the return period is referring to an event greater than some value or
to an event less than some value, the return period can be related to a probability of an
exceedance. If an exceedance occurs on the average once every 25 years, then the probability or
chance that the event occurs in any given year is 1/25 = 0.04 or 4%. Probability, p, of an event
occurring in any one year and return period, T, in years, are thus related by

$$p = \frac{1}{T}$$

This is a fundamental definition in statistical hydrology.
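
A short Python sketch can make the link between p and T concrete. It is a minimal illustration, assuming that years are independent and that each year has the same exceedance probability; the function names, the seed, and the length of the simulated record are hypothetical choices and not part of the text. The simulation counts the years between successive exceedances and averages them, which should come out close to the return period T.

```python
import random


def annual_exceedance_probability(return_period_years):
    """Probability of an exceedance in any one year: p = 1 / T."""
    return 1.0 / return_period_years


def mean_recurrence_interval(p, n_years, seed=0):
    """Simulate n_years of independent years, each with exceedance probability p,
    and return the average number of years between successive exceedances."""
    rng = random.Random(seed)
    exceedance_years = [year for year in range(n_years) if rng.random() < p]
    gaps = [later - earlier for earlier, later in zip(exceedance_years, exceedance_years[1:])]
    return sum(gaps) / len(gaps)


p25 = annual_exceedance_probability(25)           # 0.04 for the 25-year event
simulated_T = mean_recurrence_interval(p25, n_years=200_000)

print(f"p = {p25:.2f}, simulated average recurrence interval = {simulated_T:.1f} years")
```

The individual gaps in such a simulation vary widely, which is the point made above: the 25 years is an average, not a fixed spacing between exceedances.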


The concept of a random sample is used throughout this book. A sample might be thought
of as a collection of objects selected from a larger collection of these same objects. The larger
collection of objects, if it contains all of the objects possible, is called the population. For exam-
ple, 20 years of peak flow data from a certain river is a sample of the possible peak flows on the
river. A random sample is one that is selected in such a fashion that any other sample could have
resulted with equal likelihood. If the 20 years of peak flow data are considered a random sample,
then one is assuming that these 20 years of data are just as likely as any other possible 20 years
of data and vice versa.
In some types of analysis it is assumed that the order of occurrence of the data is not impor-
tant, only the data values are important. The traditional hydrologic frequency analysis is an ex-
ample of this. If a sample contains elements that are independent of each other, then the order of
occurrence of the data is not important. This is the same as saying that the magnitude of an ele-
ment in the sample is not affected by the temporal pattern of the other elements in the sample.
Each element in the sample might be thought of as a random sample of size 1.
On the other hand, there are situations where the order of occurrence of the events is impor-
tant. In designing a storage reservoir to meet projected water demands, the fact that low flows
tend to follow low flows makes it necessary to have a larger reservoir than would be required if
the low flows occurred randomly throughout time. This is known as persistence and indicates the
elements of the sample are not independent of each other. In this case the entire sequence of data
values must be considered the random sample. That is, the sequence contained in the sample
is assumed to be as likely as any other sequence. The individual events in the sample are not
independent.
If one wanted a random sample consisting of 7 observations of daily flows on a river during
a particular year, the daily flows in a particular week of that year could not be used. This is because
the flow on the second, third, and so on, day of the week would be dependent on the flows on the
preceding days. The flow on day 2, for example, would not represent all possible daily flows but
would be highly dependent on the flow during day 1. To get a random sample of daily flows, each
of the 365 daily flows would have to have an equal chance of being selected. The sample of flows
during the 7 consecutive days could be considered as a random sample of size 1 of weekly flows
(if the week was randomly selected) but not a random sample of size 7 of daily flows.
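
The distinction between a random sample of 7 daily flows and one randomly selected week can be sketched in Python. The daily flow series below is synthetic: the recursion used to generate it, its parameters, and the seed are assumptions made only to produce flows with strong day-to-day dependence. The sketch draws 7 days so that every day of the year has an equal chance of selection, and separately draws one week at random and treats it as a single weekly observation.

```python
import random

rng = random.Random(7)

# Hypothetical daily flows for one year; each day depends strongly on the day before.
flows = [100.0]
for _ in range(364):
    flows.append(100.0 + 0.9 * (flows[-1] - 100.0) + rng.gauss(0.0, 5.0))


def lag1_correlation(x):
    """Correlation between each value and the value on the following day."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


r1 = lag1_correlation(flows)   # close to 1 means consecutive days are far from independent

# Random sample of size 7 of daily flows: each of the 365 days has an equal chance.
random_days = sorted(rng.sample(range(365), 7))
random_daily_sample = [flows[d] for d in random_days]

# One randomly selected week: a random sample of size 1 of weekly flows.
start = rng.randrange(0, 365 - 6)
week = flows[start:start + 7]
weekly_flow = sum(week) / 7    # mean flow for the selected week, a single observation

print(f"lag-1 correlation of daily flows: {r1:.2f}")
print("randomly selected days:", random_days)
print(f"weekly mean flow starting on day {start}: {weekly_flow:.1f}")
```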
In any hydrologic data there are errors of various kinds. The errors include measurement
errors, data transmittal errors, processing errors, and others. The errors may be systematic errors
and show up as a bias in the data or they may be random errors. In most error analysis it is
assumed that the errors are random errors and follow the normal distribution. The treatment of
hydrologic data contained in this book is not concerned so much with these types of errors as it
is with sampling errors.
Sampling error is a misnomer in that there are no errors in the usual sense involved. Sam-
pling errors should more properly be called sampling variability, sampling fluctuation, or sample
uncertainty. What is meant by sampling error is simply that a random sample has statistical prop-
erties that are similar to the population parameters but only equal to the population properties as
the sample size gets very large (or the entire population is sampled). If two samples are selected
from the same population, their statistical properties will again be similar but equal to each other
only as the sample size gets very large.
For example, we may desire to know the average annual rainfall at a given location. Assume
we can measure exactly, that is without any measurement error, the rainfall at the desired
location. Measurements are collected over a 5-year period and the average annual rainfall is
calculated without error in the calculations. A second 5-year period elapses and data from this
period is used to calculate the average annual rainfall. The two estimates will be different.
Neither will equal the true average annual rainfall. The differences between the estimated values
and the true value are the sampling errors. Note we cannot exactly determine the sampling error
since the true average is not known.
Thus, variability or uncertainty in the statistical properties of a population based on
estimates of the properties from samples is called sampling error. It is clear that errors in the sense of
mistakes, faulty data, or carelessness are not involved in sampling errors. Sampling error is sim-
ply an inherent property of random samples. If it weren't for sampling errors, this book or hun-
dreds of others on statistics would not be needed since populations would then be completely
specified by any sample from that population.
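
Sampling error can be demonstrated with a small simulation. The Python sketch below is illustrative only and assumes a population that would never be known in practice: annual rainfall drawn independently from a normal distribution with a hypothetical mean of 40 and standard deviation of 8 (units arbitrary). Two 5-year averages are computed without any measurement or calculation error, and the spread of many such averages is compared for 5-year and 50-year samples.

```python
import random
import statistics

rng = random.Random(42)
TRUE_MEAN = 40.0   # hypothetical population mean annual rainfall
TRUE_SD = 8.0      # hypothetical population standard deviation


def sample_mean(n_years):
    """Average of n_years of error-free annual rainfall drawn from the assumed population."""
    return statistics.fmean([rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n_years)])


first_5yr = sample_mean(5)
second_5yr = sample_mean(5)


def spread_of_means(n_years, n_repeats=2000):
    """Standard deviation of the sample mean over many repeated samples."""
    return statistics.stdev([sample_mean(n_years) for _ in range(n_repeats)])


spread_5 = spread_of_means(5)
spread_50 = spread_of_means(50)

print(f"two 5-year means: {first_5yr:.1f} and {second_5yr:.1f} (true mean {TRUE_MEAN})")
print(f"spread of 5-year means: {spread_5:.2f}; spread of 50-year means: {spread_50:.2f}")
```

The two 5-year means differ from each other and from the assumed true mean even though no mistakes are involved, and the spread of the estimates shrinks as the sample size grows.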

Example 1.1. The mean annual suspended sediment load for the Green River near Munfordville,
Kentucky, can be estimated from the data contained in Appendix B. This data and the resulting
estimated mean annual suspended sediment load may contain many types of errors. Systematic
errors could result if the flow was sampled for sediment only when the depth of flow exceeded a
preset stage. This is because low flows would not be sampled.
Generally, the sediment concentration in low flows is less than that in higher flows. Thus a
built in bias or systematic error is produced. Measurement errors could result from plugged sam-
plers, samplers not properly aligned with the direction of flow, allowing the sampler to pick up
some bed load, and a number of other reasons. Data transmittal errors and processing errors can
result from mistakes in transcribing data from data forms, placing data in the wrong columns on
spreadsheets or data entry forms, illegibly written data, and other sources.
Sampling error can be illustrated by assuming that the tabulated data are exactly correct
(contain no systematic, measurement, transmittal, or processing errors). If the mean annual sus-
pended sediment load is calculated for each successive 5-year period, the results are 640,827;
484,739; 497,604; and 460,392 tons per year. Under the no error assumption, 4 different values
of the mean annual suspended sediment discharge have been calculated each of which contains
no errors yet none of which are the same. The difference in the 4 estimates is caused by natural
variability in the phenomenon (sediment load) being sampled. This difference is called sampling error.
If conditions on the watershed contributing to the Green River near Munfordville never change
and if the climatic conditions do not change, then theoretically the sampling error can be made as
small as desired by an increase in the sample size above the 5 years used in this illustration.
A practical limitation is imposed by the length of the available sediment load data record.

Much of the statistical machinery discussed in this book is concerned with sampling errors
and the estimation of population characteristics from samples of data. The fact that sampling
errors are inherent in random data does not mean, however, that statistical manipulations and
sophistication can in any way overcome faulty data. The quality of any statistical analysis is no
better than the quality of the data used. It can be worse but no better. Furthermore, statistical
considerations should not be used to replace judgement and careful thought in analyzing hydro-
logic data. In many instances some intelligent thought is worth reams of computer output based
on a statistical analysis of some data. Statistics should be regarded as a tool, an aid to under-
standing, but never as a replacement for useful thought.
Rarely will one find a hydrologic problem that exactly fulfills all of the requirements for the
application of one statistical technique or another. Two choices are thus available. One can rede-
fine the problem so that it meets the requirement of the statistical theory and thus produce an
"exact" answer to the artificial problem. The second approach is to alter the statistical technique
where possible and then apply it to the real problem realizing that the results will be an approxi-
mate answer to the real problem. In this case the degree of the approximation depends on the
severity of the violated assumptions. This latter approach is preferable and requires knowledge
of available statistical techniques, of assumptions and theory underlying the techniques, and of
the consequences of violating the assumptions. It is toward this latter approach that this book is
oriented.
Most of the examples and exercises used in this book were selected for pedagogical reasons,
not to promote a particular technique. Thus, when a problem involves fitting a normal distribu-
tion to annual peak flow, the purpose of the problem revolves around learning about the normal
distribution and is not to demonstrate that a normal distribution is applicable to peak flows. Sim-
ilarly, many examples and problems had to be simplified so that they could be realistically solved
with attention focused on the statistical technique and not the many fascinating intricacies of
most real problems. That is not to say the techniques do not apply to real problems; quite the
contrary. However, most real problems involve multiple aspects, lots of data, and many consid-
erations other than statistical ones. Rather than get involved in these other important aspects,
many of the examples and problems are idealizations of real situations.
Because the exercises were selected as a learning aid, it will be instructive to at least read
the problems at the end of each chapter. Many of the problems present useful results that supple-
ment the material in that chapter.
Many actual problems in hydrology require considerable computation. Digital computers are
used for this purpose. Special statistical-numerical procedures have been developed to simplify
the computations involved and improve the accuracy of the results obtained from many of the
analyses presented in this book. These procedures are not presented here. Rather the emphasis is
on the principles involved. Some statistical techniques such as geostatistics and multivariate tech-
niques often require extensive calculation and considerable efficiency is gained by using special-
purpose programs incorporating numerical shortcuts and safeguards against roundoff errors.
Finally, there are many important areas of statistical analysis applicable to hydrology that
are not included in this book. These omitted techniques for the most part require knowledge of
the material contained in this book before they can be applied. Thus, this book is an introduction
to statistical methods in hydrology. Furthermore, the book is not intended as a handbook or
statistical "cookbook" for hydrologists. The purpose of this book is to enable the reader to better
apply statistical methods to hydrologic problems through a knowledge of the methods, their
foundations and limitations.

2. Probability and Probability Distributions: Basic Concepts

HYDROLOGIC PROCESSES may be thought of as stochastic processes. Stochastic in this
sense means involving an element of chance or uncertainty where the process may take on any of the
values of a specified set with a certain probability. An example of a stochastic hydrologic process
is the annual maximum daily rainfall over a period of several years. Here the variable would be the
maximum daily rainfall for each year and the specified set would be the set of positive numbers.
The instantaneous maximum peak flow observed during a year would be another example
of a stochastic hydrologic process. Table 2.1 contains such a listing for the Kentucky River near
Salvisa, Kentucky. By examining this table it can be seen that there is some order to the values
yet a great deal of randomness exists as well. Even though the peak flow for each of the 99 years
is listed, one cannot estimate with certainty what the peak flow for 1998 was. From the tabulated
data one could surmise that the 1998 peak flow was "probably" between 20,600 cfs and 144,000
cfs. We would like to be able to estimate the magnitude of this "probably". The stochastic nature
of the process, however, means that one can never estimate with certainty the exact value for the
process (peak discharge) based solely on past observations.
The definition of stochastic given above has some theoretical drawbacks, as we shall see.
Hydrologic processes are continuous processes. The probability of realizing a given value from
a continuous probability distribution is zero. Thus, the probability that a variable will take on a
certain value from a specified set is zero, if the variable is continuous. Practically this presents no
problem because we are generally interested in the probabilities that the variate will be in some
range of values. For instance, we are generally not interested in the probability that the flow rate
will be exactly 100,000 cfs but may desire to estimate the probability that the flow will exceed
100,000 cfs, or be less than 100,000 cfs, or be between 90,000 and 120,000 cfs.
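
The point that only ranges of a continuous variable carry nonzero probability can be illustrated numerically. The Python sketch below assumes, purely for illustration, that the logarithm of the annual peak flow follows a normal distribution with the parameter values shown; neither the distribution nor the parameters come from the Kentucky River record. It evaluates the probabilities of the three kinds of ranges mentioned above and shows the probability of an ever narrower interval around 100,000 cfs shrinking toward zero.

```python
from math import log
from statistics import NormalDist

# Hypothetical model: the natural log of annual peak flow (cfs) is normally distributed.
log_flow = NormalDist(mu=log(65_000), sigma=0.35)


def prob_between(low_cfs, high_cfs):
    """Probability that the flow lies between low_cfs and high_cfs under the assumed model."""
    return log_flow.cdf(log(high_cfs)) - log_flow.cdf(log(low_cfs))


p_less = log_flow.cdf(log(100_000))      # prob(flow < 100,000 cfs)
p_exceed = 1.0 - p_less                  # prob(flow > 100,000 cfs)
p_range = prob_between(90_000, 120_000)  # prob(90,000 < flow < 120,000 cfs)

# Probability of intervals of half-width w around 100,000 cfs, for shrinking w.
half_widths = (10_000, 1_000, 100, 1)
shrinking = [prob_between(100_000 - w, 100_000 + w) for w in half_widths]

print(f"prob(exceed) = {p_exceed:.3f}, prob(less) = {p_less:.3f}, prob(range) = {p_range:.3f}")
for w, p in zip(half_widths, shrinking):
    print(f"prob(flow within {w} cfs of 100,000) = {p:.6f}")
```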
Table 2.1. Peak discharge (cfs) Kentucky River near Salvisa, Kentucky

Year Flow Year Flow Year Flow

With this introduction, several concepts such as probability, continuous, and probability
distribution have been introduced. We will now define these concepts and others as a basis for
considering statistical methods in hydrology.

PROBABILITY
In the mathematical development of probability theory, the concern has been not so much
how to assign probability to events, but what can be done with probability once these assign-
ments are made. In most applied problems in hydrology, one of the most important and difficult
tasks is the initial assignment of probability. We may be interested in the probability that a cer-
tain flood level will be exceeded in any year or that the elevation of a piezometric head may be
more than 30 feet below the ground surface for 20 consecutive months. We may want to deter-
mine the capacity required in a storage reservoir so that the probability of being able to meet the
projected water demand is 0.97. To address these problems we must understand what probability
means and how to relate magnitude to probabilities.
The definition of probability has been labored over for many years. One definition that is
easy to grasp is the classical or a priori definition:

If a random event can occur in n equally likely and mutually exclusive ways, and if $n_a$
of these ways have an attribute A, then the probability of the occurrence of the event
having attribute A is $n_a/n$, written as

$$\mathrm{prob}(A) = \frac{n_a}{n} \qquad (2.1)$$

This definition is an a priori definition because it assumes that one can determine before the
fact all of the equally likely and mutually exclusive ways that an event can occur and all of the
ways that an event with attribute A can occur. The definition is somewhat circular in that
"equally likely" is another way of saying "equally probable" and we end up using the word
"probable" to define probability. This classical definition is widely used in games of chance
such as card games and dice and in selecting objects with certain characteristics from a larger
group of objects. This definition is difficult to apply in hydrology because we generally cannot
divide hydrologic events into equally likely categories. To do that would require knowledge of
the likelihood or probability of the events, which is generally the objective of our analysis and
not known before the analysis.
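
For a game of chance the classical definition can be applied by direct enumeration. The Python sketch below uses an example that is not in the text, the roll of two fair dice with the attribute A that the faces sum to 7, and it assumes that all 36 ordered outcomes are equally likely. It counts n and $n_a$ and forms their ratio.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of rolling two dice, assumed equally likely and mutually exclusive.
outcomes = list(product(range(1, 7), repeat=2))
n = len(outcomes)

# Attribute A: the two faces sum to 7.
n_a = sum(1 for first, second in outcomes if first + second == 7)

prob_sum_is_7 = Fraction(n_a, n)   # classical definition: n_a / n

print(f"n = {n}, n_a = {n_a}, prob(A) = {prob_sum_is_7}")
```

No comparable enumeration is available for a flood peak, which is why the definition is hard to use directly in hydrology.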
The classical definition of probability takes on more utility in hydrology in terms of relative
frequencies and limits.

If a random event occurs a large number of times n and the event has attribute A in
$n_a$ of these occurrences, then the probability of the occurrence of the event having
attribute A is

$$\mathrm{prob}(A) = \lim_{n \to \infty} \frac{n_a}{n} \qquad (2.2)$$

The relative frequency approach to estimating probabilities is empirical in that it is based on
observations. Obviously, we will not have an infinite number of observations. For this
probability estimate to be very accurate, n may have to be quite large. This is frequently a
limitation in hydrology.
The relative frequency concept of probability is the source of the relationship given in
chapter 1 between the return period, T, of an event and its probability of occurrence, p.
These two definitions of probability can be illustrated by considering the probability of get-
ting heads in a single flip of a coin. If we know a priori that the coin is balanced and not biased
toward heads or tails, we can apply the first definition. There are two possible and equally likely
outcomes, heads or tails, so n is 2. There is one outcome with heads so $n_a$ is 1. Thus the
probability of a head is 1/2. If the coin is not balanced so that the two outcomes are not equally likely,
we could not use the a priori definition. We had to know the answer to our question before we
could apply the a priori definition.

Fig. 2.1. Coin flipping experiment. (Plot of the estimated probability of getting a "head" against the number of trials.)

This is not the case when the relative frequency definition is used. Obviously we cannot flip
the coin an infinite number of times. We have to resort to a finite sample of flips. Figure 2.1 shows
how the estimate of the probability of a head changes as the number of trials (flips) changes. A
trend toward 1/2 is noted. This is called stochastic convergence towards 1/2. One question that might
be asked is, "is the coin unbiased?" One's initial reaction is that more trials are needed. It can be
seen that the probability is slowly converging toward 1/2 but after 250 trials is not exactly equal to
1/2. This is the plight of the hydrologist. Many times more trials or observations are needed but are
not available. Still, the data does not clearly indicate a single answer. This is where probability
and statistics come into play.
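
A coin flipping experiment of this kind is easy to imitate on a computer. The Python sketch below is a minimal illustration and does not reproduce the data behind the figure; the function name, the seed, and the assumed true probability of 1/2 are hypothetical choices. It records the relative frequency of heads after each trial, which is the finite-sample version of the limit in the relative frequency definition.

```python
import random


def running_relative_frequency(n_trials, p_head=0.5, seed=0):
    """Return the estimate n_a / n of prob(head) after each of n_trials simulated flips."""
    rng = random.Random(seed)
    heads = 0
    estimates = []
    for trial in range(1, n_trials + 1):
        if rng.random() < p_head:
            heads += 1
        estimates.append(heads / trial)
    return estimates


estimates = running_relative_frequency(250)
long_run = running_relative_frequency(100_000)

for n in (1, 10, 50, 250):
    print(f"after {n:>3} flips: estimated prob(head) = {estimates[n - 1]:.3f}")
print(f"after 100,000 flips: estimated prob(head) = {long_run[-1]:.4f}")
```

With only 250 flips the estimate will generally be near but not equal to 1/2, and only a much longer run, of a length rarely available for hydrologic records, brings it reliably close.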
Equation 2.2 allows us to estimate probabilities based on observations and does not require
that outcomes be equally likely or that they all be enumerated. This advantage is somewhat off-
set in that estimates of probability based on observations are empirical and will only stochasti-
cally converge to the true probability as the number of observations becomes large. For example,
in the case of annual flood peaks, only one value per year is realized. Figure 2.2 shows the
probability of an annual peak flow on the Kentucky River exceeding the mean annual peak flow as a
function of time starting in 1895. Note that each year additional data becomes available to determine
both the mean annual peak flow and the probability of exceeding that value. Here again, a convergence
toward 1/2 is noted yet not assured. In fact, there is no reason to believe that 1/2 is the "correct"
probability since the probability distribution of annual peak flows is likely not symmetrical about
the mean.

Fig. 2.2. Probability that the annual peak flow on the Kentucky River exceeds the mean annual
peak flow. (Horizontal axis: year.)
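
The remark about asymmetry can be checked with a simulated example. The Python sketch below assumes, for illustration only, that annual peak flows follow a lognormal distribution with the parameter values shown; these are hypothetical and are not fitted to the Kentucky River data. For this right-skewed distribution the mean lies above the median, so the fraction of values exceeding the mean settles at about 0.40 rather than 1/2.

```python
import random

rng = random.Random(3)

# Hypothetical right-skewed annual peak flows (lognormal with assumed parameters).
peaks = [rng.lognormvariate(11.0, 0.5) for _ in range(20_000)]

mean_peak = sum(peaks) / len(peaks)
fraction_above_mean = sum(1 for q in peaks if q > mean_peak) / len(peaks)

print(f"mean peak = {mean_peak:,.0f} cfs")
print(f"relative frequency of exceeding the mean = {fraction_above_mean:.3f}")
```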
If two independent sets of observations are available (samples), an estimate of the probabil-
ity of an event A could be determined from each set of observations. These two estimates of
prob(A) would not necessarily equal each other nor would either estimate necessarily equal the
true (population) prob(A) based on an infinitely large sample. This dilemma results in an
important area of concern to hydrologists: how many observations are required to produce
"acceptable" estimates for the probabilities of events?
From either equation 2.1 or 2.2 it can be seen that the probability scale ranges from zero to
one. An event having a probability of zero is impossible, whereas one having a probability of one
will happen with certainty. Many hydrologists like to avoid the endpoints of the probability scale,
zero and one, because they cannot be absolutely certain regarding the occurrence or nonoccur-
rence of an event. Sometimes probability is expressed as a percent chance with a scale ranging
from 0% to 100%. Care must be taken to not confuse the percent chance values with true proba-
bilities. A probability of one is very different from a 1% chance of occurrence as the former im-
plies the event will certainly happen while the latter means it will happen only one time in 100.
In mathematical statistics and probability, set and measure theory are used in defining and
manipulating probabilities. An experiment is any process that generates values of random vari-
ables. All possible outcomes of an experiment constitute the total sample space known as the
population. Any particular point in the sample space is a sample point or element. An event is a
collection of elements known as a subset.
To each element in the sample space of an experiment a non-negative weight is assigned
such that the sum of the weights on all of the elements is 1. The magnitude of the weight is pro-
portional to the likelihood that the experiment will result in a particular element. If an element is
quite likely to occur, that element would have a weight of near 1. If an element was quite unlikely
to occur, that element would have a weight of near zero. For elements outside the sample space,
a weight of zero is assigned. The weights assigned to the elements of the sample space are known
as probabilities. Here again, the word likelihood is used to define probability so that the defini-
tion becomes circular.
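Letting S represent the sample space; Ei for i = 1, 2, ..., represent elements in S; A and B
represent events in S; and prob(Ei) represent the probability of Ei, it follows that

$$0 \le \text{prob}(E_i) \le 1$$

Since the sample space is made up of the totality of elements in S, we have

$$S = \bigcup_i E_i$$

where ∪i represents the union or total collection of all of the Ei and

$$\text{prob}(S) = \sum_i \text{prob}(E_i) = 1 \tag{2.4}$$

An event, A, is made up of a subset of elements in S so that

$$A = \bigcup_{E_i \in A} E_i$$

and

$$\text{prob}(A) = \sum_{E_i \in A} \text{prob}(E_i)$$
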
These concepts are illustrated in figure 2.3 as a Venn diagram.

Fig. 2.3. Venn diagram illustrating a sample space, elements, and events.
Using notation from set theory and Venn diagrams, several probabilistic relationships can be
illustrated. If A and B are two events in S, then the probability of A or B, shown as the shaded
areas of Figure 2.3, is given by

$$\text{prob}(A \cup B) = \text{prob}(A) + \text{prob}(B) - \text{prob}(A \cap B) \tag{2.6}$$

Note that in probability the word "or" means "either or both". The notation ∪ represents a union
so that A ∪ B represents all elements in A or B or both. The notation ∩ represents an intersection
so that A ∩ B represents all elements in both A and B. The last term of equation 2.6 is needed
since prob(A) and prob(B) both include prob(A ∩ B). Thus, prob(A ∩ B) must be subtracted
once so the net result is only one inclusion of prob(A ∩ B) on the right-hand side of the equation.
If A and B are mutually exclusive, then both cannot occur and prob(A ∩ B) = 0. In this case

$$\text{prob}(A \cup B) = \text{prob}(A) + \text{prob}(B)$$

Figure 2.3 illustrates the case where events A and B are mutually exclusive and figure 2.4
shows A and B when they are not mutually exclusive.
If A^c represents all elements in the sample space S that are not in A, then

$$A \cup A^c = S$$

A^c is known as the complement of A. Equation 2.4 indicates that

$$\text{prob}(A \cup A^c) = \text{prob}(S) = 1$$

This statement says that the probability of A or A^c is certainty since one or the other must occur.
All of the possibilities have been exhausted. Since A and A^c are mutually exclusive

$$\text{prob}(A) + \text{prob}(A^c) = 1$$

or we have the very useful result that the probability of an event A is

$$\text{prob}(A) = 1 - \text{prob}(A^c) \tag{2.7}$$

Equation 2.7 often makes it easy to evaluate probability by first evaluating the probability that an
outcome will not occur.
An example is evaluating the probability that a peak flow q exceeds some particular flow q0.
A would be all q's greater than q0 and A^c would be all q's less than q0. Because q must be either
greater than or less than q0, prob(q > q0) = 1 - prob(q < q0). We show later that for continuous
random variables prob(q = q0) = 0.

Fig. 2.4. Venn diagram showing A ∪ B and A ∩ B.

If the probability of an event B depends on the occurrence of an event A, then we write
prob(B | A), read as the probability of B given A or the conditional probability of B given A has
occurred. The prob(B) is conditioned on the fact that A has occurred. Referring to figure 2.4 it is
apparent that conditioning on the occurrence of A restricts consideration to A. Our total sample
space is now A. The occurrence of B given that A has occurred is represented by A ∩ B. Thus the
prob(B | A) is given by

$$\text{prob}(B \mid A) = \frac{\text{prob}(A \cap B)}{\text{prob}(A)} \tag{2.8}$$

assuming of course that prob(A) ≠ 0. Equation 2.8 can be rearranged to give the probability of
A and B as

$$\text{prob}(A \cap B) = \text{prob}(A)\,\text{prob}(B \mid A) \tag{2.9}$$

Now if prob(B | A) = prob(B), we say that B is independent of A. Thus the joint probability of
two independent events is the product of their individual probabilities.

Example 2.1. Using the data shown in table 2.1, estimate the probability that a peak flow in excess
of 100,000 cfs will occur in 2 consecutive years on the Kentucky River near Salvisa, Kentucky.

Solution: From table 2.1 it can be seen that a peak flow of 100,000 cfs was exceeded 7 times in
the 99-year record. If it is assumed that the peak flows from year to year are independent, then the
probability of exceeding 100,000 cfs in any one year is approximately 7/99 or 0.0707. Applying
equation 2.9, the probability of exceeding 100,000 cfs in two successive years is found to be
0.0707 × 0.0707 or 0.0050.
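
The arithmetic of example 2.1 can be checked with a few lines of Python. This is only a sketch: the counts (7 exceedances in 99 years) come from the example, while the variable names are illustrative, and the product rule is valid only under the independence assumption stated in the solution.

```python
# Relative-frequency estimate and joint probability for two independent years.
n_exceed = 7    # years in which the peak flow exceeded 100,000 cfs (example 2.1)
n_years = 99    # length of the record in years

p_one_year = n_exceed / n_years          # estimate of prob(exceedance in one year)
p_two_years = p_one_year * p_one_year    # product rule, valid only under independence

print(f"one year:  {p_one_year:.4f}")
print(f"two years: {p_two_years:.4f}")
```

If the annual peaks were serially dependent, the second factor would have to be replaced by a conditional probability, as in equation 2.9.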

Example 2.2. A study of daily rainfall at Ashland, Kentucky, has shown that in July the proba-
bility of a rainy day following a rainy day is 0.444, a dry day following a dry day is 0.724, a rainy
day following a dry day is 0.276, and a dry day following a rainy day is 0.556. If it is observed
that a certain July day is rainy, what is the probability that the next two days will also be rainy?

Solution: Let A be a rainy day 1 and B be a rainy day 2 following the initial rainy day. The
probability of A is 0.444 since this is the probability of a rainy day following a rainy day.

$$\text{prob}(A \cap B) = \text{prob}(A)\,\text{prob}(B \mid A)$$

Now, prob(B | A) is also 0.444 since this is the probability of a rainy day following a rainy day.
Therefore

$$\text{prob}(A \cap B) = 0.444 \times 0.444 = 0.197$$

The probability of two rainy days following a dry day would be 0.276 × 0.444, or 0.122.
Note that the probabilities of wet and dry days are dependent on the previous day. Independence
does not exist. It can be shown that over a long period of time, 67% of the days will be
dry and 33% will be rainy with the conditional probabilities as stated. If one had assumed
independence, then the probability of two consecutive rainy days would have been 0.33 × 0.33
= 0.1089, regardless of whether the preceding day had been rainy or dry. Since the probability
of a rainy day following a rainy day is much greater than following a dry day, persistence
is said to exist.
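
The figures quoted in example 2.2 can be reproduced with the short sketch below. It treats the wet and dry days as a two-state chain in which tomorrow depends only on today, which is an assumption made here for illustration; the long-run rainy fraction is obtained by solving the balance equation shown in the comment. Variable names are hypothetical.

```python
# Conditional (transition) probabilities for July days, from example 2.2.
p_rain_given_rain = 0.444
p_rain_given_dry = 0.276
p_dry_given_rain = 0.556

# Two rainy days in a row, conditioned on the state of the starting day.
p_two_rainy_after_rain = p_rain_given_rain * p_rain_given_rain
p_two_rainy_after_dry = p_rain_given_dry * p_rain_given_rain

# Long-run fraction of rainy days, assuming the conditional probabilities
# do not change with time:
#   pi = pi * P(rain | rain) + (1 - pi) * P(rain | dry)
# which rearranges to pi = P(rain | dry) / (P(rain | dry) + P(dry | rain)).
pi_rain = p_rain_given_dry / (p_rain_given_dry + p_dry_given_rain)

print(f"rainy, rainy after a rainy day: {p_two_rainy_after_rain:.3f}")
print(f"rainy, rainy after a dry day:   {p_two_rainy_after_dry:.3f}")
print(f"long-run fraction of rainy days: {pi_rain:.2f}")
```

The long-run fraction agrees with the 33% rainy and 67% dry split quoted in the example.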

TOTAL PROBABILITY THEOREM


If B1, B2, ..., Bn represent a set of mutually exclusive and collectively exhaustive events,
one can determine the probability of another event A from

$$\text{prob}(A) = \sum_{i=1}^{n} \text{prob}(A \mid B_i)\,\text{prob}(B_i) \tag{2.10}$$

This is called the theorem of total probability. Equation 2.10 is illustrated by figure 2.5.

Fig. 2.5. Venn diagram for theorem of total probability.



Example 2.3. It is known that the probability that the solar radiation intensity will reach a
threshold value is 0.25 for rainy days and 0.80 for nonrainy days. It is also known that for this
particular location the probability that a day picked at random will be rainy is 0.36. What is the
probability the threshold intensity of solar radiation will be reached on a day picked at
random?

Solution: Let A represent the threshold solar radiation intensity, B1 represent a rainy day and B2
a nonrainy day. From equation 2.10, we know that

$$\text{prob}(A) = \text{prob}(A \mid B_1)\,\text{prob}(B_1) + \text{prob}(A \mid B_2)\,\text{prob}(B_2) = 0.25(0.36) + 0.80(0.64) = 0.602$$

This is an example of a weighted probability. The probability of A given a condition is weighted
by the probability of the condition and summed.
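
A minimal sketch of equation 2.10 applied to example 2.3 follows. The function name `total_probability` is hypothetical; the inputs are the conditional probabilities and the probability of a rainy day given in the example.

```python
def total_probability(cond_probs, weights):
    """Return sum_i prob(A | B_i) * prob(B_i) for exclusive, exhaustive B_i."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("the prob(B_i) must sum to 1 (collectively exhaustive)")
    return sum(c * w for c, w in zip(cond_probs, weights))


p_rainy = 0.36
# prob(threshold | rainy) = 0.25, prob(threshold | nonrainy) = 0.80
p_threshold = total_probability([0.25, 0.80], [p_rainy, 1.0 - p_rainy])
print(f"prob(threshold reached) = {p_threshold:.3f}")
```

The check on the weights guards against the most common misuse, which is conditioning on events that do not cover the whole sample space.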

BAYES THEOREM
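By rewriting equation 2.8 in the form

$$\text{prob}(B_j \mid A) = \frac{\text{prob}(B_j \cap A)}{\text{prob}(A)} = \frac{\text{prob}(A \mid B_j)\,\text{prob}(B_j)}{\text{prob}(A)}$$

and then substituting from equation 2.10 for prob(A), we get what is called Bayes Theorem:

$$\text{prob}(B_j \mid A) = \frac{\text{prob}(A \mid B_j)\,\text{prob}(B_j)}{\sum_{i=1}^{n} \text{prob}(A \mid B_i)\,\text{prob}(B_i)} \tag{2.11}$$
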
As pointed out by Benjamin and Cornell (1970), this simple derivation of Bayes Theorem
belies its importance. It provides a method for incorporating new information with previous or
so-called prior probability assessments to yield new values for the relative likelihood of events
of interest. These new (conditional) probabilities are called posterior probabilities. Equation
2.11 is the basis of Bayesian Decision Theory. Bayes theorem provides a means of estimating
probabilities of one event by observing a second event. Such an application is illustrated in
example 2.4.

Example 2.4. The manager of a recreational facility has determined that the probability of 1000
or more visitors on any Sunday in July depends upon the maximum temperature for that Sunday
as shown in the following table. The table also gives the probabilities that the maximum temper-
ature will fall in the indicated ranges. On a certain Sunday in July, the facility has more than 1000
visitors. What is the probability that the maximum temperature was in the various temperature
classes?

Temp (°F)    Prob of 1000 or    Prob of being    Prob of Tj | 1000
Tj           more visitors      in temp class    or more visitors

<60          0.05               0.05             0.005
60-70        0.20               0.15             0.059
70-80        0.50               0.20             0.197
80-90        0.75               0.35             0.517
90-100       0.50               0.20             0.197
>100         0.25               0.05             0.025

Total                                            1.000
Solution: Let Tj for j = 1, 2, ..., 6 represent the 6 intervals of temperature. Then from equation
2.11

$$\text{prob}(T_j \mid 1000 \text{ or more}) = \frac{\text{prob}(1000 \text{ or more} \mid T_j)\,\text{prob}(T_j)}{\sum_{i=1}^{6} \text{prob}(1000 \text{ or more} \mid T_i)\,\text{prob}(T_i)}$$

$$\text{prob}(T_j \mid 1000 \text{ or more}) = \frac{\text{prob}(1000 \text{ or more} \mid T_j)\,\text{prob}(T_j)}{0.05(0.05) + 0.20(0.15) + \cdots + 0.25(0.05)}$$

For example

$$\text{prob}(<60\,^{\circ}\text{F} \mid 1000 \text{ or more}) = \frac{0.05(0.05)}{0.507} = 0.005$$

Similar calculations yield the last column in the above table. Note that the sum over j = 1 to 6
of prob(Tj | 1000 or more) is equal to 1.
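
The last column of the table in example 2.4 can be regenerated with the sketch below, which applies equation 2.11 class by class. The list names are illustrative; the numbers are the two probability columns given in the example.

```python
# prob(1000 or more visitors | T_j) for the six temperature classes
likelihood = [0.05, 0.20, 0.50, 0.75, 0.50, 0.25]
# prob(T_j), the prior probability of each temperature class
prior = [0.05, 0.15, 0.20, 0.35, 0.20, 0.05]

# Numerator of Bayes theorem for each class
joint = [lk * pr for lk, pr in zip(likelihood, prior)]
# Denominator: total probability of 1000 or more visitors (equation 2.10)
evidence = sum(joint)
# Posterior probability of each temperature class given 1000 or more visitors
posterior = [j / evidence for j in joint]

labels = ["<60", "60-70", "70-80", "80-90", "90-100", ">100"]
for label, p in zip(labels, posterior):
    print(f"{label:>7}: {p:.3f}")
print(f"  total: {sum(posterior):.3f}")
```

The denominator evaluates to 0.5075, which the worked example shows as 0.507.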

COUNTING
In applying equation 2.1 to determine probabilities, one often encounters situations where it
is impractical to actually enumerate all of the possible ways that an event can occur. To assist in
this matter certain general mathematical formulas have been developed.
If E1, E2, ..., En are mutually exclusive events such that Ei ∩ Ej = ∅ for all i ≠ j where ∅
represents an empty set and Ei can occur in ni ways, then the compound event E made up of
outcomes E1, E2, ..., En can occur in n1 n2 ... nn ways.
The problem of sampling or selecting a sample of r items from n items is commonly
encountered. Sampling can be done either with replacement, so that the item selected is
immediately returned to the population, or without replacement, so that the item is not returned.
The order of sampling may be important in some situations and not in others. Thus, we may have
four types of samples: ordered with replacement, ordered without replacement, unordered with
replacement, and unordered without replacement.
In case of an ordered sample with replacement, the number of ways of selecting item 1 is n
since there are n items in the population. Similarly, item 2 can be selected in n ways since the first
item selected is returned to the population. Thus two items can be selected in n × n ways.
Extending this argument to r selections, the number of ways of selecting r items from n with
replacement and order being important is simply n^r.
If the items are not replaced after being selected, then the first item can be selected in n ways,
the second in n - 1 ways and so on until the rth item can be selected in n - r + 1 ways. Thus r
ordered items can be selected from n without replacement in n(n - 1)(n - 2) ... (n - r + 1) ways.
This is commonly written as (n)_r and called the number of permutations of n items taken r at a
time.

$$(n)_r = n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!} \tag{2.12}$$

The notation n! is read "n factorial". n! = n(n - 1)(n - 2) ... (2)(1). By definition 0! = 1.
Unordered sampling without replacement is similar to ordered sampling without replacement
except that in the case of ordered sampling the r items selected can be arranged in r! ways. That is,
an ordered sample of r items can be selected from r items in (r)_r, or r!, ways. Thus r! of the ordered
samples will contain the same elements. The number of different unordered samples is therefore
(n)_r/r!, commonly written as $\binom{n}{r}$ and called the binomial coefficient. The binomial
coefficient gives the number of combinations possible when selecting r items from n without
replacement.

$$\binom{n}{r} = \frac{(n)_r}{r!} = \frac{n!}{(n-r)!\,r!}$$

Finally, in unordered sampling, selecting r items from n with replacement is equivalent to
selecting r items from n + r - 1 items without replacement. That is, we can consider the
population as containing r - 1 items more than it really does since the items selected will be replaced.
The number of ways of selecting r unordered items from n items with replacement is then

$$\binom{n+r-1}{r} = \frac{(n+r-1)!}{(n-1)!\,r!}$$

The number of ways of selecting samples under the four above conditions is summarized in the
following table:

             With replacement                          Without replacement

Ordered      n^r                                       (n)_r = n!/(n - r)!

Unordered    (n + r - 1)!/[(n - 1)! r!]                n!/[(n - r)! r!]

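
The four counting rules summarized above map directly onto functions in Python's standard `math` module (`math.perm` and `math.comb`, available from Python 3.8). The wrapper `count_samples` below is a hypothetical helper written for illustration.

```python
from math import comb, perm


def count_samples(n, r, ordered, replacement):
    """Number of ways of selecting r items from n under the four sampling schemes."""
    if ordered and replacement:
        return n ** r               # n choices on each of the r draws
    if ordered:
        return perm(n, r)           # (n)_r = n!/(n - r)!
    if replacement:
        return comb(n + r - 1, r)   # same as r from n + r - 1 without replacement
    return comb(n, r)               # binomial coefficient n!/((n - r)! r!)


# Selecting r = 2 items from n = 4 under each scheme
for ordered in (True, False):
    for replacement in (True, False):
        ways = count_samples(4, 2, ordered, replacement)
        print(f"ordered={ordered!s:5} replacement={replacement!s:5} -> {ways}")
```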
Example 2.5. For a particular watershed, records from 10 rain gages are available. Records from
3 of the gages are known to be bad. If 4 records are selected at random from the 10 records, (a)
What is the probability that 1 bad record will be selected? (b) What is the probability that 3 bad
records will be selected? (c) What is the probability that at least 1 bad record will be selected?

Solution: The total number of ways of selecting 4 records from the 10 available records (order is
not important) is

$$\binom{10}{4} = \frac{10!}{6!\,4!} = 210$$

(a) The number of ways of selecting 1 bad record from 3 bad records and 3 good records
from 7 good records is

$$\binom{3}{1}\binom{7}{3} = 3(35) = 105$$

Applying equation 2.1 and letting a = 1 bad and 3 good records, the probability of a is 105/210
or 0.500.

(b) The number of ways of selecting 3 bad records and 1 good record is

$$\binom{3}{3}\binom{7}{1} = 1(7) = 7$$

Thus the probability of selecting 3 bad records is 7/210 or 0.033.

(c) The probability of at least 1 bad record is equal to the probability of 1 or 2 or 3 bad
records. The probability of 1 and 3 bad records is known to be 0.500 and 0.033
respectively. The probability of 2 bad records is

$$\frac{\binom{3}{2}\binom{7}{2}}{\binom{10}{4}} = \frac{3(21)}{210} = 0.300$$

Thus, the probability of at least 1 bad record is 0.500 + 0.300 + 0.033 = 0.833. This latter result
could also be determined from the fact that the probability of 0 or 1 or 2 or 3 bad records must
equal one. The probability of at least 1 bad record thus equals

$$1 - \text{prob}(0 \text{ bad records}) = 1 - \frac{\binom{3}{0}\binom{7}{4}}{210} = 1 - \frac{35}{210} = 0.833$$

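
The counts in example 2.5 can be verified with the sketch below. The function name `prob_bad` and its default arguments (3 bad gages, 7 good gages, 4 records drawn) are illustrative choices that encode the data of the example.

```python
from math import comb


def prob_bad(k, n_bad=3, n_good=7, n_draw=4):
    """Probability of exactly k bad records when n_draw records are drawn at random."""
    favorable = comb(n_bad, k) * comb(n_good, n_draw - k)
    total = comb(n_bad + n_good, n_draw)
    return favorable / total


p1 = prob_bad(1)                   # part (a): 105/210
p3 = prob_bad(3)                   # part (b): 7/210
p_at_least_one = 1 - prob_bad(0)   # part (c), using the complement

print(f"(a) {p1:.3f}  (b) {p3:.3f}  (c) {p_at_least_one:.3f}")
```

Part (c) uses the complement rule of equation 2.7, which avoids summing the probabilities of 1, 2, and 3 bad records.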
Example 2.6. For the situation described in Example 2.5, what is the probability of selecting at
least 2 bad records given that one of the records selected is bad?

$$\text{prob}(\text{at least 2 bad out of 4} \mid \text{1 is bad}) = \frac{\text{prob}(\text{at least 2 bad out of 4})}{\text{prob}(\text{1 bad in 4})}$$

GRAPHICAL PRESENTATION
Hydrologists are often faced with large quantities of data. Since it is difficult to grasp the
total data picture from tabulations such as table 2.1, a useful first step in data analysis is to use
graphical procedures. Throughout this book various graphical presentations will be illustrated.
Helsel and Hirsch (1992) contains a very good treatment of graphical presentations of hydrologic
data. The appropriate graphical technique depends on the purpose of the analysis. As various
analytic procedures are discussed throughout this book, graphical techniques that supplement the
procedures will be illustrated. Undoubtedly the most common graphical representation of hydro-
logic data is a time series plot of magnitude versus time. Figure 1.2 is such a plot for the peak
flow data for the Kentucky River. A plot such as this is useful for detecting obvious trends in the
data in terms of means and variances or serial dependence of data values next to each other in
time. Figure 1.2 reveals no obvious trends in the mean annual peak flow or variances of annual
peak flows. It also shows no strong dependence of peak flow in one year on the peak flow of the
previous year. We previously indicated that figure 1.1, showing the water level in the Great Salt
Lake of Utah, indicated an apparent year-to-year relationship. Such a relationship is reflected in
the serial correlation coefficient. Serial correlation will be discussed in detail later in the book.
For now suffice it to say that the serial correlation for the annual lake level for the Great Salt Lake
is 0.969 and for the annual peak flows on the Kentucky River is -0.067. This indicates a very
strong year-to-year correlation for the Salt Lake data and insignificant correlation for the peak
flow on the Kentucky River. Serial correlation indicates dependence from one observation to the
next. It also indicates persistence. In the Salt Lake data, lake levels tend to change slowly with
high levels following high levels. In the Kentucky River data, no such pattern is evident.
In conducting a probabilistic analysis, a graphical presentation that is often quite useful is
a plot of the data as a frequency histogram. This is done by grouping the data into classes and
then plotting a bar graph with the number or the relative frequency (proportion) of observations
in a class versus the midpoint of the class interval. The midpoint of a class is called the class
mark. The class interval is the difference between the upper and lower class boundaries.
Figure 2.6 is such a plot for the Kentucky River peak flow data. Frequency histograms are of
most value for data that are independent from observation to observation. A frequency his-
togram for the Kentucky River data would have the same general shape and location regardless
of what period of record was selected. The Salt Lake data would have a different location if the
period 1890 to 1910 was used in comparison to the period 1870 to 1890 because the levels were
higher over the latter period. Certainly no satisfactory histogram of levels could be developed for
Devils Lake, North Dakota (figure 1.3).
[Figure 2.6: bar chart of relative frequency versus peak flow (1000s cfs) for the Kentucky River, with class marks from 22.5 to 142.5 in steps of 15.]
Fig. 2.6. Frequency histogram for the Kentucky River peak flow data.

The selection of the class interval and the location of the first class mark can appreciably af-
fect the appearance of a frequency histogram. The appropriate width for a class interval depends
upon the range of the data, the number of observations, and the behavior of the data. Several sug-
gestions have been put forth for forming frequency histograms. Spiegel (1961) suggests that
there should be 5 to 20 classes. Steel and Torrie (1960) state that the class interval should not ex-
ceed one-fourth to one-half of the standard deviation of the data. Sturges (1926) recommends that
the number of classes be determined from

m = 1 + 3.3 log n (2.14)

where m is the number of classes, n is the number of data values, and the logarithm to the base
10 is used.
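
Equation 2.14 is simple to evaluate. The sketch below applies it to a record of 99 values, the length of the Kentucky River record used in example 2.1; rounding the result up to a whole number of classes is a choice made here, not part of the rule as stated.

```python
import math


def sturges_classes(n):
    """Number of classes suggested by equation 2.14: m = 1 + 3.3 log10(n)."""
    return 1 + 3.3 * math.log10(n)


m = sturges_classes(99)
print(f"m = {m:.2f}, rounded up to {math.ceil(m)} classes")
```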
Whatever criterion is used, it should be kept in mind that sensitivity is lost if too few or too
many classes are used. Too few classes will eliminate detail and obscure the basic pattern of the
data. Too many classes result in erratic patterns of alternating high and low frequencies. Figure 2.7
represents the Kentucky River data with too many class intervals. If possible, the class intervals
and class marks should be round figures. This is not a computational or theoretical consideration,
but one aimed at making it easier for those viewing the histogram to grasp its full meaning.
In some situations it may be desirable to use nonuniform class intervals. In chapter 8 a situ-
ation is presented where the intervals are such that the expected relative frequencies are the same
in each class.
Fig. 2.7. Frequency histogram for the Kentucky River data with too many classes.

Fig. 2.8. Cumulative frequency distribution of Kentucky River data.

Another common method of presenting data is in the form of a cumulative frequency
distribution. Cumulative frequency distributions show the frequency of events less than (greater than)
some given value. They are formed by ranking the data from the smallest (largest) to the largest
(smallest), dividing the rank by the number of data points and plotting this ratio against the
corresponding data value. If the data are ranked from the smaller (larger) data values to the larger
(smaller) values, the resulting cumulative frequency refers to the frequency of observations less
(more) than or equal to the corresponding data value. Figure 2.8 is a cumulative frequency plot
based on the Kentucky River peak flow data. Again, ranking and plotting data in this fashion is
most meaningful if the data are not serially correlated.
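
The ranking procedure just described can be sketched as follows. The five peak flows are hypothetical values chosen only to show the mechanics, and the function name is illustrative.

```python
def cumulative_frequency(values):
    """Rank values from smallest to largest and pair each with rank / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, rank / n) for rank, x in enumerate(ordered, start=1)]


# Hypothetical peak flows in 1000s of cfs (not the Kentucky River record)
peaks = [42.0, 88.5, 61.2, 55.0, 73.9]
cf = cumulative_frequency(peaks)
for x, f in cf:
    print(f"{x:6.1f}  {f:.2f}")
```

Each ratio estimates the frequency of observations less than or equal to the paired value. Sorting in descending order instead would give the frequency of observations greater than or equal to it.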

RANDOM VARIABLES
Simply stated, a random variable is a real-valued function defined on a sample space. If the
outcome of an experiment, process, or measurement has an element of uncertainty associated
with it such that its value can only be stated probabilistically, the outcome is a random variable.
This means that nearly all quantities in hydrology, flows, precipitation depths, water levels,
storages, roughness coefficients, aquifer properties, water quality parameters, sediment loads,
number of rainy days, drought duration, and so forth, are random variables.
Random variables may be discrete or continuous. If the set of values a random variable can
assume is finite (or countably infinite), the random variable is said to be a discrete random
variable. If the set of values a random variable can assume is uncountably infinite, such as every
value in an interval, the random variable is said to be a continuous random variable. An example
of a discrete random variable would be the number of rainy days experienced at a particular
location over a period of 1 year. The amount of rain received over the year would be a continuous
random variable. For the most part in this text, capital letters will be used to denote random
variables and the corresponding lower case letter will represent values of the random variable.
It is important to note that any function of a random variable is also a random variable. That
is if X is a random variable then Z = g(X) is a random variable as well. This follows from the
fact that if X is uncertain, then any function of X must be uncertain as well. Physically, this means
that any hydrologic quantity that is dependent on a random variable is also a random variable. If
runoff is a random variable, then erosion and sediment delivery to a stream is a random variable.
If sediment has absorbed chemicals, then water quality is a random variable.

Example 2.7. Nearly every hydrologic variable can be taken as a random variable. Rainfall for
any duration, streamflow, soil hydraulic properties, time between hydrologic events such as flows
above a certain base or daily rainfalls in excess of some amount, the number of times a stream-
flow rate exceeds a given base over a period of a year, and daily pan evaporation are all random
variables. Quantities derived from random hydrologic variables are also random variables. The
storage required in a water supply reservoir to meet a given demand is a function of the demand
and the inflow to the reservoir. Since reservoir inflow is a random variable, required storage is
also a random variable. As a matter of fact, the demand that is placed on the reservoir would be
a random variable as well. The velocity of flow through a porous medium is a function of the
hydraulic conductivity and hydraulic gradient which are both random variables. Therefore, the flow
velocity is also a random variable.

UNIVARIATE PROBABILITY DISTRIBUTIONS


If the random variable X can only take on values x1, x2, ..., xn with probabilities fx(x1),
fx(x2), ..., fx(xn) and Σi fx(xi) = 1, then X is a discrete random variable. With a discrete random
variable there are "lumps" or spikes of probability associated with the values that the random
variable can assume. Figure 2.9 is a typical plot of the distribution of probability associated with
the values that a discrete random variable can assume. This information would constitute a
probability distribution function (pdf) for a discrete random variable.
The cumulative probability distribution (cdf), Fx(xk), for this discrete random variable is
shown in figure 2.10. The cumulative distribution represents the probability that X is less than or
equal to xk.

$$F_X(x_k) = \text{prob}(X \le x_k) = \sum_{x_i \le x_k} f_X(x_i)$$

The cumulative distribution has jumps in it at each xi equal in magnitude to fx(xi) or the
probability that X = xi. The probability that X = xi can be determined from

$$f_X(x_i) = F_X(x_i) - F_X(x_{i-1}) \tag{2.16}$$

The notation fx(x) and Fx(x) denote the pdf and cdf of the discrete random variable X
evaluated at X = x.

Fig. 2.9. A discrete probability distribution function.

Fig. 2.10. A discrete cumulative distribution function.

Often, continuous data are treated as though they were discrete. Looking again at the
Kentucky River peak flow data, we can define the event A as having a peak flow in the ith class.
Letting ni be the number of observed peak flows in the ith interval and n be the total number of
observed peak flows, the probability that a peak flow is in the ith class is given by

$$\text{prob}(A) = f_{x_i} = \frac{n_i}{n}$$

Thus, the relative frequency, fxi, can be interpreted as a probability estimate, the frequency
histogram can be interpreted as an approximation for a pdf, and the cumulative frequency can be
interpreted as an approximation for a cdf.
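
A short sketch of this grouping is given below. The data are hypothetical peak flows (in 1000s of cfs), and the class layout (lower boundary 15, class interval 15) is chosen here so that the class marks start at 22.5 as in figure 2.6; the function name is illustrative.

```python
def relative_frequencies(values, lower, width, n_classes):
    """Group values into equal-width classes and return
    (class marks, relative frequencies, cumulative relative frequencies)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)   # index of the class containing v
        if 0 <= k < n_classes:          # values outside the classes are not counted
            counts[k] += 1
    n = len(values)
    marks = [lower + (k + 0.5) * width for k in range(n_classes)]
    freqs = [c / n for c in counts]     # n_i / n, the estimate of prob(class i)
    cumulative = []
    running = 0.0
    for f in freqs:
        running += f
        cumulative.append(running)      # approximation of the cdf at the upper boundary
    return marks, freqs, cumulative


# Hypothetical sample, not the Kentucky River record
flows = [22, 31, 38, 44, 47, 52, 58, 66, 71, 95]
marks, freqs, cumulative = relative_frequencies(flows, lower=15, width=15, n_classes=6)
for mark, f, c in zip(marks, freqs, cumulative):
    print(f"{mark:6.1f}  {f:.2f}  {c:.2f}")
```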
Many times it is desirable to treat continuous random variables directly. Continuous random
variables can take on any value in a range of values permitted by the physical processes involved.
Probability distribution functions of continuous random variables are smooth curves. The pdf of
a continuous random variable X is denoted by px(x). The cdf is denoted by Px(x). Px(x)
represents the probability that X is less than or equal to x.

$$P_X(x) = \text{prob}(X \le x) = \int_{-\infty}^{x} p_X(t)\,dt$$

The pdf and the cdf function are related by

$$p_X(x) = \frac{dP_X(x)}{dx} \tag{2.20}$$

The notation px(x) and Px(x) denote the pdf and cdf, respectively, of the continuous random
variable X evaluated at X = x. Thus, py(a) represents the pdf of the random variable Y evaluated
at Y = a. Py(a) represents the cdf of the random variable Y and gives prob(Y ≤ a).
A function, px(x), defined on the real line can be a pdf if and only if

1. px(x) ≥ 0 for all x ∈ R, the range of X (2.21)

2. $\int_R p_X(x)\,dx = 1$

By definition px(x) is zero for X outside R. Also Px(xl) = 0 and Px(xu) = 1 where xl and
xu are the lower and upper limits of X in R. For many distributions these limits are -∞ to ∞
or 0 to ∞. It is also apparent that the probability that X takes on a value between a and b is
given by

$$\text{prob}(a \le X \le b) = P_X(b) - P_X(a) = \int_a^b p_X(x)\,dx$$

The prob(a ≤ X ≤ b) is the area under the pdf between a and b. The probability that a random
variable takes on any particular value from a continuous distribution is zero. This can be seen
from

$$\text{prob}(X = d) = \int_d^d p_X(x)\,dx = P_X(d) - P_X(d) = 0$$

Because the probability that a continuous random variable takes on a specified value is zero, the
expressions prob(a ≤ X ≤ b), prob(a < X ≤ b), prob(a ≤ X < b) and prob(a < X < b) are all
equivalent. It is also apparent that Px(x) can be interpreted as the probability that X is strictly less
than x since prob(X = x) = 0.
Figures 2.11 and 2.12 illustrate a possible pdf and its corresponding cdf. In addition to
density functions that are symmetrical and bell-shaped, densities may take on a number of different
shapes including distributions that are skewed to the right or left, rectangular, triangular,
exponential, "J"-shaped, and reverse "J"-shaped (Fig. 2.13).

Fig. 2.11. Probability density function.

Fig. 2.12. Cumulative probability distribution function.

Fig. 2.13. Some possible shapes for probability density functions.

At this point a cautionary note is added to the effect that the probability density function,
px(x), is not a probability and can have values exceeding one. The cumulative probability
distribution, Px(x), is a probability [prob(X ≤ x) = Px(x)] and must have values ranging from 0 to 1.
Of course, px(x) and Px(x) are related as indicated by equation 2.20 and knowledge of one spec-
ifies the other.

Example 2.8. Evaluate the constant a for the following expression to be considered a probability
density function:

$$p_X(x) = a x^2 \text{ for } 0 \le x \le 5, \qquad p_X(x) = 0 \text{ elsewhere}$$

What is the probability that a value selected at random from this distribution will (a) be less than 2?
(b) fall between 1 and 3? (c) be larger than 4? (d) be larger than or equal to 4? (e) exceed 6?

Solution:
From equation 2.21 we must have

$$\int_0^5 a x^2\,dx = \frac{125a}{3} = 1 \quad \text{or} \quad a = \frac{3}{125}$$

and

$$P_X(x) = \int_0^x \frac{3t^2}{125}\,dt = \frac{x^3}{125} \text{ for } 0 \le x \le 5$$

(a) prob(X ≤ 2) = Px(2) = 8/125

(b) prob(1 ≤ X ≤ 3) = Px(3) - Px(1) = 26/125

(c) prob(X > 4) = 1 - Px(4) = 1 - 64/125 = 61/125

(d) prob(X ≥ 4) = 61/125 since prob(X = 4) = 0

(e) prob(X > 6) = 1 - Px(6) = 0 since Px(x) for X > 5 = 1
Piecewise continuous distributions satisfying the requirements for a probability distribution
in which the prob(X = d) is not zero are possible. Such a distribution could be defined by

$$P_X(x) = \begin{cases} P_1(x) & \text{for } x < d \\ P_2(x) & \text{for } x \ge d \end{cases} \tag{2.24}$$

where P2(d) > P1(d), P1(xl) = 0, P2(xu) = 1, and P1(x) and P2(x) are nondecreasing functions of
X. Figure 2.14 is a plot of such a distribution. For this situation the prob(X = d) equals the
magnitude of the jump ΔP at X = d or is equal to P2(d) - P1(d). Any finite number of discontinuities
of this type are possible.
An example of a distribution as shown in figure 2.14 is the distribution of daily rainfall
amounts. The probability that no rainfall is received, prob(X = 0), is finite, whereas the proba-
bility distribution of rain on rainy days would form a continuous distribution. A second example
would be the probability distribution of the water level in some reservoir. The water level may be
maintained at a constant level d as much as possible but may fluctuate below or above d at times.
The distribution shown in figure 2.14 could represent this situation.
Fig. 2.14. A possible piecewise continuous pdf for the case prob(X = d) ≠ 0.

The relationship between relative frequency and probability can be envisioned by
considering an experiment whose outcome is a value of the random variable X. Let px(x) be the
probability density function of X. The probability that a single trial of the experiment will result in an
outcome between X = a and X = b is given by

$$\text{prob}(a \le X \le b) = \int_a^b p_X(x)\,dx$$

In N independent trials of the experiment, the expected number of outcomes in the interval a to b
would be

$$N \int_a^b p_X(x)\,dx$$

and the expected relative frequency of outcomes in the interval a to b is

$$\frac{1}{N}\,N \int_a^b p_X(x)\,dx = \int_a^b p_X(x)\,dx$$

In general, if xi represents the midpoint of an interval of X given by xi - Δxi/2 to xi +
Δxi/2, then the expected relative frequency of outcomes in this interval of repeated, independent
trials of the experiment is given by

$$f_{x_i} = \int_{x_i - \Delta x_i/2}^{x_i + \Delta x_i/2} p_X(x)\,dx$$

Because the right-hand side of this equation represents the area under px(x) between xi - Δxi/2
and xi + Δxi/2, it can be approximated by

$$f_{x_i} = \Delta x_i\,p_X(x_i) \tag{2.25}$$

Equation 2.25 can be used to determine the expected relative frequency of repeated, independent
outcomes of a random experiment whose outcome is a value of the random variable X.
If N independent observations of X are available, the actual relative frequency of outcomes
in an interval of width Δxi centered on xi may not equal fxi as given by equation 2.25 because X
is a random variable whose behavior can only be described probabilistically. The most probable
outcome or the expected outcome will equal the observed outcome only if px(x) is truly the
probability density function for X and for an infinitely large number of observations. Even if the true
probability density function is being used, the actual frequency of outcomes in the interval Δxi
approaches the expected number only as the number of trials or observations becomes very large.

Example 2.9. Plot the expected frequency histogram using the probability density function of
example 2.8 and a class interval of ½.

Solution:

$$f_{x_i} = \Delta x_i\,p_X(x_i)$$

  xi       fxi

 0.25    0.00075
 0.75    0.00675
 1.25    0.01875
 1.75    0.03675
 2.25    0.06075
 2.75    0.09075
 3.25    0.12675
 3.75    0.16875
 4.25    0.21675
 4.75    0.27075

 Sum     0.99750

The desired plot is shown in figure 2.15.

Fig. 2.15. Plot for example 2.9.
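
The tabulated values for example 2.9 can be regenerated with the sketch below. It assumes the density of example 2.8 is 3x²/125 on 0 ≤ x ≤ 5, which is what the cdf x³/125 implies, and it uses ten classes of width 0.5 with class marks at 0.25, 0.75, ..., 4.75.

```python
def pdf(x):
    """Assumed density of example 2.8: 3 x^2 / 125 on [0, 5], zero elsewhere."""
    return 3.0 * x ** 2 / 125.0 if 0.0 <= x <= 5.0 else 0.0


width = 0.5                                       # class interval
marks = [0.25 + width * k for k in range(10)]     # class marks 0.25 ... 4.75
expected = [width * pdf(x) for x in marks]        # equation 2.25: width * pdf(class mark)

for x, f in zip(marks, expected):
    print(f"{x:4.2f}  {f:.5f}")
print(f"Sum   {sum(expected):.5f}")
```

The sum is 0.9975 rather than 1 because equation 2.25 replaces the area under the curve in each class by a rectangle evaluated at the class mark; a narrower class interval brings the sum closer to 1.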
BIVARIATE DISTRIBUTIONS
The situation frequently arises where one is interested in the simultaneous behavior of two or
more random variables. An example might be the flow rates on two streams near their confluence.
One might like to know the probability of both streams having peak flows exceeding a given value.
A second example might be the probability of a rainfall exceeding 2.5 inches at the same time the
soil is nearly saturated. Rainfall depth and soil water content would be two random variables.

Example 2.10. The magnitude of peak flows from small watersheds is often estimated from the
"Rational Equation" given by Q = CIA where Q is the estimated flow, C is a coefficient, I is a
rainfall intensity, and A is the watershed area. The assumption is made that the return period of
flow will be the same as the return period of the rainfall that is used. To verify this assumption it
is necessary to study the joint probabilities of the two random variables Q and I.

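If X and Y are continuous random variables, their joint probability density function is
$p_{X,Y}(x, y)$ and the corresponding cumulative probability distribution is $P_{X,Y}(x, y)$. These two are
related by

$p_{X,Y}(x, y) = \dfrac{\partial^2 P_{X,Y}(x, y)}{\partial x \, \partial y}$    (2.26)

and

$P_{X,Y}(x, y) = \text{prob}(X \le x \text{ and } Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} p_{X,Y}(t, s) \, ds \, dt$    (2.27)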

The corresponding relationships for X and Y being discrete random variables are

$f_{X,Y}(x_i, y_j) = \text{prob}(X = x_i \text{ and } Y = y_j)$    (2.28)

$F_{X,Y}(x, y) = \text{prob}(X \le x \text{ and } Y \le y) = \sum_{x_i \le x} \sum_{y_j \le y} f_{X,Y}(x_i, y_j)$    (2.29)

It should be noted that the bivariate analogy of equation 2.16 is

Some of the properties of continuous bivariate distributions are

1) $P_{X,Y}(x, \infty)$ is a cumulative univariate probability function of X only (the cumulative marginal
distribution of X).

2) $P_{X,Y}(\infty, y)$ is a cumulative univariate probability function of Y only (the cumulative marginal
distribution of Y).

MARGINAL DISTRIBUTIONS
If one is interested in the behavior of one of a pair of random variables regardless of the
value of the second random variable, the marginal distribution may be used. For instance,
the marginal density of X, $p_X(x)$, is obtained by integrating $p_{X,Y}(x, y)$ over all possible values
of Y.

$p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x, y) \, dy$    (2.31)

The cumulative marginal distribution is given by

$P_X(x) = P_{X,Y}(x, \infty) = \text{prob}(X \le x \text{ and } Y \le \infty)$    (2.32a)

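Similarly, the marginal density and cumulative marginal distribution of Y are

$p_Y(y) = \int_{-\infty}^{\infty} p_{X,Y}(x, y) \, dx$

and

$P_Y(y) = P_{X,Y}(\infty, y) = \text{prob}(X \le \infty \text{ and } Y \le y)$

The corresponding relationships for a discrete bivariate distribution are

$f_X(x_i) = \sum_j f_{X,Y}(x_i, y_j)$ and $F_X(x) = \sum_{x_i \le x} f_X(x_i)$

$f_Y(y_j) = \sum_i f_{X,Y}(x_i, y_j)$ and $F_Y(y) = \sum_{y_j \le y} f_Y(y_j)$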
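For a discrete bivariate distribution stored as a table, the marginal distributions are simply the row and column sums. The following minimal sketch uses a hypothetical joint probability table (the numbers are invented for illustration) and exact fractions so that the sums can be checked without rounding.

```python
from fractions import Fraction as F

# HYPOTHETICAL joint pmf f_{X,Y}(x_i, y_j): rows are x values, columns are y values.
x_values = [0, 1, 2]
y_values = [0, 1]
joint = [
    [F(1, 10), F(2, 10)],   # x = 0
    [F(3, 10), F(1, 10)],   # x = 1
    [F(2, 10), F(1, 10)],   # x = 2
]

def marginals(table):
    """Return (f_X, f_Y): sums over y for each x, and sums over x for each y."""
    f_x = [sum(row) for row in table]
    f_y = [sum(col) for col in zip(*table)]
    return f_x, f_y

f_x, f_y = marginals(joint)
print("f_X:", dict(zip(x_values, f_x)))
print("f_Y:", dict(zip(y_values, f_y)))
```

Each marginal distribution sums to 1, as it must if the joint table is a proper probability distribution.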

CONDITIONAL DISTRIBUTIONS
A marginal distribution is the distribution of one variable regardless of the value of the second
variable. The distribution of one variable with restrictions or conditions placed on the second
variable is called a conditional distribution. Such a distribution might be the distribution of X
given that Y equals $y_0$ or the distribution of Y given that $x_1 \le X \le x_2$.
In general, the conditional distribution of X given that Y is in some region R is arrived at
using the same reasoning that was used in obtaining equation 2.8. The total sample space of Y is
now the region R. Because

then

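represents a probability density function of X given that Y is in R. The conditional density of X
given Y is in R is given by

$p_{X|Y}(x \mid Y \text{ is in } R) = \dfrac{\int_R p_{X,Y}(x, s) \, ds}{\int_R p_Y(s) \, ds}$    (2.37)

for X and Y continuous. Similarly, the conditional distribution of (X | Y is in R) for X and Y discrete is

$f_{X|Y}(x_i \mid Y \text{ is in } R) = \dfrac{\sum_{y_j \in R} f_{X,Y}(x_i, y_j)}{\sum_{y_j \in R} f_Y(y_j)}$    (2.38)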

The determination of conditional probabilities from equations 2.37 and 2.38 is done in the
usual way.

for X and Y continuous and

for X and Y discrete.


For the special case where X and Y are continuous and the conditional density of X given
$Y = y_0$ is desired, equation 2.37 breaks down because both the numerator and denominator
become zero. In this case

$p_{X|Y}(x \mid Y = y_0) = \dfrac{p_{X,Y}(x, y_0)}{p_Y(y_0)}$

The proof of this may be found in Neuts (1973). In most statistics books $p_{X|Y}(x \mid Y = y_0)$ is simply
written as

$p_{X|Y}(x \mid y) = \dfrac{p_{X,Y}(x, y)}{p_Y(y)}$

and called the conditional density of X given Y.


All of the above results are symmetrical with respect to X and Y. For example

for X and Y continuous where

If the region R of equation 2.37 is the entire region of definition with respect to Y, then

$\int_R p_Y(s) \, ds = 1$

and

$\int_R p_{X,Y}(x, s) \, ds = p_X(x)$

so that

$p_{X|Y}(x \mid Y \text{ is in } R) = p_X(x)$

This results from the fact that the condition that Y is in R, when R encompasses the entire
region of definition of Y, is really no restriction but simply a condition stating that Y may
take on any value in its range. In this case, $p_{X|Y}(x \mid Y \text{ is in } R)$ is identical to the marginal
density of X.
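The discrete form of the conditional distribution is a ratio of two sums and is easy to evaluate directly. The sketch below uses a hypothetical joint probability table (invented numbers) and shows that conditioning on the entire range of Y returns the marginal distribution of X.

```python
from fractions import Fraction as F

# HYPOTHETICAL joint pmf: joint[(x, y)] = f_{X,Y}(x, y)
joint = {
    (0, 0): F(1, 10), (0, 1): F(2, 10),
    (1, 0): F(3, 10), (1, 1): F(1, 10),
    (2, 0): F(2, 10), (2, 1): F(1, 10),
}

def conditional_x_given_y_in(joint, region):
    """f_{X|Y}(x | Y in region): sum of joint over region divided by prob(Y in region)."""
    denom = sum(p for (x, y), p in joint.items() if y in region)
    cond = {}
    for (x, y), p in joint.items():
        if y in region:
            cond[x] = cond.get(x, 0) + p / denom
    return cond

given_y1 = conditional_x_given_y_in(joint, {1})       # restricted to Y = 1
given_all = conditional_x_given_y_in(joint, {0, 1})   # R is the whole range of Y
print(given_y1)
print(given_all)
```

With R equal to the whole range of Y the denominator is 1 and the result is the marginal distribution of X.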

INDEPENDENCE
From equation 2.37 or 2.38 it can be seen that in general the conditional density of X given
Y is a function of y. If the random variables X and Y are independent, this functional relationship
disappears (i.e., $p_{X|Y}(x \mid y)$ is not a function of y). In fact, in this case

$p_{X|Y}(x \mid y) = p_X(x)$

or the conditional density equals the marginal density. Furthermore, if X and Y are independent
(continuous or discrete) random variables, their joint density is equal to the product of their marginal
densities.

$p_{X,Y}(x, y) = p_X(x) \, p_Y(y)$

The random variables X and Y are independent in the probabilistic sense (stochastically
independent) if and only if their joint density is equal to the product of their marginal densities.
Independence is an extremely important property. A bivariate distribution is much more
difficult to define and to work with than is a univariate distribution. If independence exists and
the proper pdf for X and for Y can be determined, the bivariate distribution for X and Y is given
as the product of these two univariate distributions.
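The product rule gives a direct test for independence of a discrete bivariate distribution: compute the marginals and compare their product with every entry of the joint table. The sketch below applies this test to two hypothetical tables (invented numbers), one built as a product of marginals and one that is not.

```python
from fractions import Fraction as F
from itertools import product

def marginals(joint, xs, ys):
    """Marginal pmfs of X and Y from a joint pmf keyed by (x, y)."""
    f_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    f_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    return f_x, f_y

def is_independent(joint, xs, ys):
    """True if the joint pmf equals the product of its marginals everywhere."""
    f_x, f_y = marginals(joint, xs, ys)
    return all(joint[(x, y)] == f_x[x] * f_y[y] for x, y in product(xs, ys))

xs, ys = [0, 1], [0, 1]
# HYPOTHETICAL tables used only for illustration.
independent_joint = {(0, 0): F(1, 10), (0, 1): F(3, 20),
                     (1, 0): F(3, 10), (1, 1): F(9, 20)}
dependent_joint = {(0, 0): F(2, 5), (0, 1): F(1, 10),
                   (1, 0): F(1, 10), (1, 1): F(2, 5)}

print(is_independent(independent_joint, xs, ys))
print(is_independent(dependent_joint, xs, ys))
```

Exact fractions are used so that the equality test is not affected by floating-point rounding; with estimated probabilities a tolerance or a formal statistical test would be needed instead.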

DERIVED DISTRIBUTIONS
Situations often arise where the joint probability distribution of a set of random variables is
known and the distribution of some function or transformation of these variables is desired. For
example, the joint probability distribution of the flows in two tributaries of a stream may be
known whereas the item of interest may be the sum of the flows in the two tributaries. Some
commonly used transformations are translation or rotation of axes, logarithmic transformations,
nth-root transformations for n equal to 2 and 3, and certain trigonometric transformations.
Thomas (1971) presents the developments that lead to the results presented here concerning
transformations and derived distributions for continuous random variables. The procedure for
discrete random variables is simply an accounting procedure.

Example 2.11. Let X have the distribution function

$f_X(x) = c/x$ for X = 2, 3, 4, 5

Let $Y = X^2 - 7X + 12$. The probability distribution and possible values of Y can be determined
from the following table.

  X          2      3      4      5
  $f_X(x)$   c/2    c/3    c/4    c/5
  Y          2      0      0      2

Thus $f_Y(y)$ = c/3 + c/4 = 35c/60 for Y = 0
            = c/2 + c/5 = 42c/60 for Y = 2
            = 0 elsewhere

The value for c can be evaluated from either the requirement that $\sum_x f_X(x) = 1$ or that $\sum_y f_Y(y) = 1$.
In either case, the value of c will be found to be 60/77.
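The accounting procedure for a discrete derived distribution amounts to adding up the probabilities of all X values that map to the same Y value. A minimal sketch for example 2.11 follows; the variable and function names are illustrative.

```python
from fractions import Fraction
from collections import defaultdict

x_values = [2, 3, 4, 5]

# Normalizing constant from the requirement that the probabilities sum to 1.
c = 1 / sum(Fraction(1, x) for x in x_values)
f_x = {x: c / x for x in x_values}

def transform(x):
    """Y = X^2 - 7X + 12"""
    return x**2 - 7 * x + 12

# Accumulate the probability of every x that maps to the same y.
f_y = defaultdict(Fraction)
for x, p in f_x.items():
    f_y[transform(x)] += p

print("c =", c)
print("f_Y =", dict(f_y))
```

The two probabilities of Y are 35c/60 and 42c/60 evaluated at c = 60/77.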


For a univariate continuous distribution of the random variable X, the distribution of U
where

$U = u(X)$

is a monotonic function ($u(X)$ is monotonically increasing if $u(x_2) \ge u(x_1)$ for $x_2 > x_1$ and
monotonically decreasing if $u(x_2) \le u(x_1)$ for $x_2 > x_1$) can be found from

$p_U(u) = p_X(x) \left| \dfrac{dx}{du} \right|$    (2.47)
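Example 2.12. Find the probability of 0 < U < 10 if $U = X^2$ and X is a continuous random variable
with

$p_X(x) = 3x^2/125$ for 0 < X < 5

Solution:

$p_U(u) = p_X(x) \left| \dfrac{dx}{du} \right| = \dfrac{3x^2}{125} \cdot \dfrac{1}{2x} = \dfrac{3x}{250}$

or, since $U = X^2$,

$p_U(u) = \dfrac{3\sqrt{u}}{250}$ for 0 < U < 25

A check to see that $p_U(u)$ is a probability density can be made by integrating $p_U(u)$ from 0 to 25

$\int_0^{25} \dfrac{3\sqrt{u}}{250} \, du = \left. \dfrac{u^{3/2}}{125} \right|_0^{25} = 1$

Now $\text{prob}(0 < U < 10) = \int_0^{10} \dfrac{3\sqrt{u}}{250} \, du = \dfrac{10^{3/2}}{125}$

The same result could have been obtained by noting that

$\text{prob}(0 < U < 10) = \text{prob}(0 < X < \sqrt{10}) = \int_0^{\sqrt{10}} \dfrac{3x^2}{125} \, dx = \dfrac{10^{3/2}}{125}$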
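A numerical check of the change of variable is sketched below. It assumes the density $p_X(x) = 3x^2/125$ on 0 < X < 5 (inferred from the values of example 2.9) and uses a simple midpoint rule written for this illustration rather than any library routine.

```python
import math

def p_x(x):
    """Assumed density of X: 3x^2/125 on (0, 5)."""
    return 3.0 * x**2 / 125.0 if 0.0 < x < 5.0 else 0.0

def p_u(u):
    """Density of U = X^2 from equation 2.47: p_X(x) |dx/du| with x = sqrt(u)."""
    if not 0.0 < u < 25.0:
        return 0.0
    x = math.sqrt(u)
    return p_x(x) * abs(1.0 / (2.0 * x))

def integrate(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

total = integrate(p_u, 0.0, 25.0)      # should be close to 1
prob = integrate(p_u, 0.0, 10.0)       # prob(0 < U < 10)
exact = 10.0**1.5 / 125.0              # closed form from the example
print(round(total, 6), round(prob, 6), round(exact, 6))
```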


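In the case of a continuous bivariate density, the transformation from $p_{X,Y}(x, y)$ to $p_{U,V}(u, v)$,
where U = u(X, Y) and V = v(X, Y) are one-to-one continuously differentiable transformations,
can be made by the relationship

$p_{U,V}(u, v) = p_{X,Y}(x, y) \left| J\left(\dfrac{x, y}{u, v}\right) \right|$    (2.48)

where $J\left(\dfrac{x, y}{u, v}\right)$ is the Jacobian of the transformation computed as the determinant of the matrix of
partial derivatives

$J\left(\dfrac{x, y}{u, v}\right) = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix}$

The limits on U and V must be determined from the individual problem at hand.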

Example 2.13. Given that $p_{X,Y}(x, y) = (5 - y/2 - x)/14$ for 0 < X < 2 and 0 < Y < 2. If U =
X + Y and V = Y/2, what is the joint probability density function for U and V? What are the
proper limits on U and V?

Solution:

The limits on U and V can be determined by noting that Y = 2V and X = U - 2V. Therefore, the
limit of Y = 0 maps to V = 0, Y = 2 maps to V = 1, X = 0 maps to U = 2V, and X = 2 maps to
U = 2V + 2. These limits are shown in figure 2.16. A check can be made by integrating $p_{U,V}(u, v)$
over the region 0 < V < 1, 2V < U < 2V + 2.

Fig. 2.16. Mapping from X, Y to U, V for example 2.13.

Therefore $p_{U,V}(u, v)$ over the above defined region is indeed a probability density function.
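The intermediate steps of example 2.13 can be written out as follows. This is a reconstruction from equation 2.48 and the stated transformation, offered as a worked check and not as a copy of the original solution.

```latex
% Inverse transformation: x = u - 2v, y = 2v
\[
J\!\left(\frac{x, y}{u, v}\right)
  = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\
                    \partial y/\partial u & \partial y/\partial v \end{vmatrix}
  = \begin{vmatrix} 1 & -2 \\ 0 & 2 \end{vmatrix} = 2
\]
% Substitute x and y into p_{X,Y} and multiply by |J|
\[
p_{U,V}(u, v) = \frac{5 - \tfrac{1}{2}(2v) - (u - 2v)}{14}\,(2) = \frac{5 - u + v}{7},
\qquad 0 < v < 1,\; 2v < u < 2v + 2
\]
% Check that the density integrates to one over the mapped region
\[
\int_0^1 \!\int_{2v}^{2v+2} \frac{5 - u + v}{7}\, du\, dv
  = \frac{1}{7}\int_0^1 (8 - 2v)\, dv = \frac{1}{7}(8 - 1) = 1
\]
```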

A special case of a bivariate transformation is when the distribution of U = u(X, Y) is desired.
In this case, one method of obtaining $p_U(u)$ is to define a dummy random variable V = v(X, Y).
Equation 2.48 is then used to find the joint density of U and V, $p_{U,V}(u, v)$. The univariate density of
U is now the marginal distribution of U found by integrating out V.
Other special cases of bivariate transformations involve the sums, products, and quotients of
random variables. If the joint distribution of X and Y is $p_{X,Y}(x, y)$ for X > 0 and Y > 0, the
following results are obtained for the distribution of U as a function of X and Y.

Function    pdf

In some cases, the function U = u(X) may be such that it is difficult to analytically determine
the distribution of U from the distribution of X. In this case it may be possible to generate a large
sample of X's (chapter 13), calculate the corresponding U's and then fit a probability distribution
to the U's (chapter 6). It should be noted, however, that this empirical method will not in general
satisfy equations 2.47 or 2.48.
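The empirical approach can be sketched with the transformation of example 2.12, where the exact answer is known and can be used as a check. The sketch assumes $p_X(x) = 3x^2/125$ on 0 < X < 5, so that $P_X(x) = x^3/125$ and X can be generated by inverting the cumulative distribution. It estimates a probability directly from the generated U values instead of fitting a distribution to them; the sample size and seed are arbitrary choices.

```python
import random

def sample_x(rng):
    """Inverse-CDF draw from the assumed density 3x^2/125 on (0, 5)."""
    return 5.0 * rng.random() ** (1.0 / 3.0)

def empirical_prob_u_below(limit, n, seed=12345):
    """Generate n values of U = X^2 and return the fraction below `limit`."""
    rng = random.Random(seed)
    u_values = [sample_x(rng) ** 2 for _ in range(n)]
    return sum(1 for u in u_values if u < limit) / n

estimate = empirical_prob_u_below(10.0, 200_000)
exact = 10.0**1.5 / 125.0
print(round(estimate, 4), round(exact, 4))
```

The estimate differs from the exact value by sampling error only, and the difference shrinks as the number of generated values grows.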

MIXED DISTRIBUTIONS
If $p_i(x)$ for i = 1, 2, ..., m represent probability density functions and $\lambda_i$ for i = 1, 2, ..., m
represent parameters satisfying $\lambda_i \ge 0$ and $\sum_{i=1}^{m} \lambda_i = 1$, then

$p_X(x) = \sum_{i=1}^{m} \lambda_i \, p_i(x)$    (2.54)

is a probability density function known as a mixture or mixed distribution because it is composed
of a mixture of $p_i(x)$. The parameter $\lambda_i$ may be thought of as the probability that a random variable
is from the probability distribution $p_i(x)$, and $p_i(x)$ is the probability distribution of X given
that X is from the ith distribution. The cumulative distribution of X is given by

$P_X(x) = \sum_{i=1}^{m} \lambda_i \, P_i(x)$    (2.55)

where $P_i(x)$ is the cumulative distribution corresponding to $p_i(x)$.

Mixed distributions in hydrology may be applicable in situations where more than one distinct
cause for an event may exist. For example, flood peaks from convective storms might be
described by $p_1(x)$ and from hurricane storms by $p_2(x)$. If $\lambda_1$ is the proportion of flood peaks generated
by convective storms and $\lambda_2 = (1 - \lambda_1)$ is the proportion generated by hurricane storms,
then equations 2.54 and 2.55 would describe the probability distribution of flood peaks.
Singh (1974), Hawkins (1974), Singh (1987a, 1987b), Hirschboeck (1987), and Diehl and
Potter (1987) discuss procedures for applying mixed distributions in the form of equation 2.54 to
flood frequency determinations. Two general approaches are used. One is to allow the data and
statistical estimation procedures to determine the mixing parameter, $\lambda$, and the parameters of the
distributions, $p_i(x)$. The other is to use physical information on the actual events to classify them
and thus determine $\lambda$. Once classified, the two sets of data can independently be used to determine
the parameters of the pdfs.

Example 2.14. A certain event has probability 0.3 of being from the distribution $p_1(x) = e^{-x}$,
x > 0. The event may also be from the distribution $p_2(x) = 2e^{-2x}$, x > 0. What is the probability
that a random observation will be less than 1?

Solution:
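A sketch of the computation using the cumulative form of the mixed distribution (equation 2.55) is given below. It assumes that only the two stated distributions are involved, so that the second mixing weight is 1 - 0.3 = 0.7, and it uses the exponential cumulative distribution $1 - e^{-ax}$ for a density $a e^{-ax}$.

```python
import math

def exponential_cdf(rate):
    """Return P(x) = 1 - exp(-rate * x) for x > 0, the CDF of rate * exp(-rate * x)."""
    return lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def mixture_cdf(weights, cdfs, x):
    """Weighted sum of component CDFs (equation 2.55)."""
    return sum(w * cdf(x) for w, cdf in zip(weights, cdfs))

weights = [0.3, 0.7]                       # ASSUMPTION: second weight is 1 - 0.3
cdfs = [exponential_cdf(1.0), exponential_cdf(2.0)]

prob_less_than_1 = mixture_cdf(weights, cdfs, 1.0)
print(round(prob_less_than_1, 4))
```

The result is $0.3(1 - e^{-1}) + 0.7(1 - e^{-2})$, or about 0.795.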
Exercises

2.1. (a) Construct the theoretical relative frequency histogram for the sum of values obtained in
tossing two dice. (b) Toss two dice 100 times and tabulate the frequency of occurrence of the
sums of the two dice. Plot the results on the histogram of part a. (c) Why do the results of part b
not equal the theoretical results of part a? What possible kinds of errors are involved? Which kind
of error was the largest in your case?

2.2. Select a set of data consisting of 50 or more observations. Construct a relative frequency
plot using at least two different groupings of the data. Which of the two groupings do you prefer?
Why?

2.3. In a period of one week, 3 rainy days were observed. If the occurrence of a rainy day is an
independent event, how many ways could the sequence consisting of 4 dry and 3 wet days be
arranged?

2.4. If the occurrence of a rainy day is an independent event with probability equal to 0.3, what
is the probability of (a) exactly 3 rainy days in one week? (b) rain on each of the next 3 days? (c) 3
rainy days in a row during any week with the other 4 days dry?

2.5. Consider a coin with the probability of a head equal to p and the probability of a tail equal
to q = 1 - p. (a) What is the probability of the sequence HHTHTTH in 7 flips of the coin? (b)
What is the probability of a specified sequence resulting in r H's and s T's? (c) How many ways
can r H's and s T's be arranged? (d) What is the probability of r H's and s T's without regard to
the order of the sequence?

2.6. The distribution given by $f_X(x) = 1/N$ for X = 1, 2, 3, ..., N is known as the discrete uniform
distribution. In the following consider $N \ge 5$. (a) What is the probability that a random
value from $f_X(x)$ will be equal to 5? (b) What is the probability that a random value from $f_X(x)$
will be between 3 and 5 inclusive? (c) What is the probability that in a random sample of 3
values from $f_X(x)$ all will be less than 5? (d) What is the probability that the 3 random values from
$f_X(x)$ will all be less than 5 given that 1 of the values is less than 5? (e) If 2 random values are selected
from $f_X(x)$, what is the probability that one will be less than 5 and the other greater than 5?
(f) For what X from $f_X(x)$ is $\text{prob}(X \le x) = 0.5$?

2.7. Consider the continuous probability density function $p_X(x) = a \sin^2 mx$ for $0 < X < \pi$. (a)
What must be the value of a and m? (b) What is $P_X(x)$? (c) What is $\text{prob}(0 < X < \pi/2)$? (d) What
is prob(X > a/2 | X < a/4)?

2.8. Consider the continuous probability density function given by $p_X(x) = 0.25$ for 0 < X < a.
(a) What is a? (b) What is prob(X > a/2)? (c) What is prob(X > a/2 | X > a/4)? (d) What is
prob(X > a/2 | X < a/4)?

2.9. Let $p_X(x) = 0.25$ for 0 < X < a as in exercise 2.8. What is the distribution of Y = ln X?
Sketch $p_Y(y)$.

2.10. Many probability distributions can be defined simply by consulting a table of definite integrals.
For example, $\int_0^\infty x^{n-1} e^{-x} \, dx$ is equal to $\Gamma(n)$ where $\Gamma(n)$ is defined as the gamma function
(see chapter 6). Therefore one can define $p_X(x) = x^{n-1} e^{-x}/\Gamma(n)$ to be a probability density function
for n > 0 and $0 < X < \infty$. This distribution is known as the 1-parameter gamma distribution.
Using a table of definite integrals, define several possible continuous probability distributions.
Give the appropriate range on X and any parameters.

2.11. The annual inflow into a reservoir (acre-feet) follows a probability density given by $p_X(x) =
1/(\beta_1 - \alpha_1)$. The total annual outflow in acre-feet follows a probability distribution given by $p_Y(y)
= 1/(\beta_2 - \alpha_2)$. Consider that $\beta_1 > \beta_2$ and $\alpha_1 < \alpha_2$. (a) Calculate the expression for the probability
distribution of the annual change in storage. (b) Plot the probability distribution of the annual
change in storage. (c) If $\beta_1$ = 100,000, $\alpha_1$ = 20,000, $\beta_2$ = 70,000 and $\alpha_2$ = 50,000, what is the
probability that the change in storage will be i) negative and ii) greater than 15,000 acre-feet?

2.12. The probability of receiving more than 1 inch of rain in each month is given in the follow-
ing table. If a monthly rainfall record selected at random is found to have more than 1 inch of
rain, what is the probability the record is for July? April?

  Jan .25    Apr .40    Jul  .05    Oct .05
  Feb .30    May .20    Aug  .05    Nov .10
  Mar .35    Jun .10    Sept .05    Dec .20

2.13. It is known that the discharge from a certain plant has a probability of 0.001 of containing
a fish killing pollutant. An instrument used to monitor the discharge will indicate the presence of
the pollutant with probability 0.999 if the pollutant is present and with probability 0.01 if the pol-
lutant is not present. If the instrument indicates the presence of the pollutant, what is the proba-
bility that the pollutant is really present?

2.14. A potential purchaser of a ferry across a river knows that if a flow of 100,000 cfs or more
occurs, the ferry will be washed down stream, go over a low dam, and be destroyed. He knows
that the probability of a flow of this kind in any year is 0.05. He also knows that for each year that
the ferry operates a net profit of $10,000 is realized. The purchase price of the ferry is $50,000.
Sketch the probability distribution of the potential net profit over a period of years neglecting in-
terest rates and other complications. Assume that if a flow of 100,000 cfs or more occurs in a
year, the profit for that year is zero.

2.15. Assume that the probability density function of daily rainfall is given by
(a) Is this a proper probability density function? (b) What is prob(X > 0.5)? (c) What is
$\text{prob}(X > 0.5 \mid X \ne 0)$?

2.16. Consider the probability density function given by

This is a mixture of two uniform distributions. (a) Sketch $p_X(x)$ for $\lambda_1 = 0.5$. (b) Sketch $p_X(x)$ for
$\lambda_1 = 0.1$. (c) Sketch $p_X(x)$ for $\lambda_1 = 0.333$. (d) In a random sample from $p_X(x)$, 60% of the values
were between 0 and 2. What would be an estimate for the value of $\lambda_1$?

2.17. Show that equations 2.50 through 2.53 are valid.


3. Properties of
Random Variables
IN CHAPTER 2 random variables and their probability density functions were discussed in
general and somewhat abstract terms. Actually, nearly every hydrologic variable is a random
variable. This includes rainfall, streamflow, infiltration rates, evaporation, reservoir storage, and
so on. Any process whose outcome is a random variable can be thought of as an experiment. A
single outcome from an experiment is a realization of the experiment or an observation from the
experiment. Thus, daily rainfall values are observations generated by a set of meteorologic con-
ditions that comprise the experiment.
The terms realization and observation can be used interchangeably; however, an observation
is generally taken to be a single value of a random variable and a realization is generally taken as
a time series of random variables generated by a random experiment. A 10-year record of daily
rainfall might be considered as a single realization of a stochastic process (daily rainfall). A
second 10-year record of daily rainfall from the same location would then be a second realization
of the process.
In this chapter we will be concerned mainly with observations of random variables and with
the collection of possible values that these observations may take on. The complete assemblage
of all of the values representative of a particular random process is called a population. Any sub-
set of these values would be a sample from the population. For example, the pages of this book
could represent a population while the pages of this chapter are a sample of that population. All
of the books in a library might be taken as a population and should this book be found in the
library, it would be a sample from the total population.
Generally, one has at hand a sample of observations or data from which inferences about the
originating population are to be made, and then possibly inferences about another sample from

this population. Streamflow records for the past 50 years on a particular stream would be a
sample from which inferences about the behavior of the stream for all time (the population) could
be made. This information could also be used to estimate the behavior of the stream during some
future period of years (another but yet unrealized sample) so that a structure could be properly
designed for the stream. Thus, one might use information gleaned from one sample to make
decisions regarding another sample.
Quantities that are descriptive of a population are called parameters. In most situations these
parameters must be estimated from samples of data. Sample statistics are estimates for popula-
tion parameters. Sample statistics are estimated from samples of data and as such are functions
of random variables (the sample values) and thus are themselves random variables. The average
number of pages in all of the books in a particular library would be a parameter representing
the population (the books in the library). This parameter could be estimated by determining the
average number of pages in all of the books on a particular shelf in the library (a sample of the
population). This estimate of the parameter would be a statistic.
As pointed out in chapter 1, for a decision based on a sample to be valid in terms of the
population, the sample statistics must be representative of the population parameters. This in
turn requires that the sample itself be representative of the population and that "good" parame-
ter estimation procedures are used. One could not get a "good" estimate of the average number
of pages per book in a library by sampling a shelf that contained only fat, engineering
handbooks. By the same token, one cannot get "good" estimates for the parameters of a stream-
flow synthesis model if the estimates are based on a short period of record during which an
extreme drought occurred.
One rarely, if ever, has available a population of observations on a hydrologic variable.
What is generally available is a sample (of observations) from the population. Thus, population
parameters are rarely, if ever, known and must be estimated by sample statistics. By the same
token, the true probability density function that generated the available sample of data is not
known. Thus, it is necessary to not only estimate population parameters, but it is also necessary
to estimate the form of the random process (experiment) that generated the data.
This chapter is devoted to a discussion of parameters descriptive of populations and
how estimates (statistics) for these parameters can be obtained from samples drawn from
populations.

MOMENTS AND EXPECTATION-UNIVARIATE DISTRIBUTIONS


A convenient way of quantifying the location and some measures of the shape of a probability
distribution is by computing the moments of the distribution. Referring to figure 3.1, the
first moment of the elemental area dA about the origin is given by

$d\mu'_1 = x \, dA$

and the first moment of the total area about the origin is

$\mu'_1 = \int x \, dA$

Fig. 3.1. Moment of arbitrary area.

In the case of a random variable and its associated probability density function such as shown
in figure 3.2, the first moment about the origin is again given by

$\mu'_1 = \int x \, dA$

In this case $dA = p_X(x) \, dx$ so that

$\mu'_1 = \int_{-\infty}^{\infty} x \, p_X(x) \, dx$

Fig. 3.2. Moment of probability distribution.



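Generalizing the situation, the ith moment about the origin is

$\mu'_i = \int_{-\infty}^{\infty} x^i \, p_X(x) \, dx$

In the case of a discrete distribution

$\mu'_i = \sum_j x_j^i \, f_X(x_j)$

The ith central moment is defined as the ith moment about the mean, $\mu$, of a distribution and
is given by

$\mu_i = \int_{-\infty}^{\infty} (x - \mu)^i \, p_X(x) \, dx$    X continuous

$\mu_i = \sum_j (x_j - \mu)^i \, f_X(x_j)$    X discrete

The expected value of the random variable X is defined to be

$E(X) = \int_{-\infty}^{\infty} x \, p_X(x) \, dx$    X continuous

$E(X) = \sum_j x_j \, f_X(x_j)$    X discrete

If g(X) is a function of X, then the expected value of g(X) is given by

$E[g(X)] = \int_{-\infty}^{\infty} g(x) \, p_X(x) \, dx$    X continuous (3.8)

$E[g(X)] = \sum_j g(x_j) \, f_X(x_j)$    X discrete (3.9)

It is apparent that the expected value of $(X - \mu)^i$ is equal to the ith central moment

$E[(X - \mu)^i] = \mu_i$

and that $E(X) = \mu'_1$ is the first moment about the origin.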
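For a discrete distribution the moments are finite sums and can be computed directly. The sketch below uses the distribution $f_X(x) = c/x$, X = 2, 3, 4, 5, of example 2.11 with c = 60/77; the function names are illustrative.

```python
from fractions import Fraction

def moment_about_origin(pmf, order):
    """i-th moment about the origin: sum of x^i * f_X(x)."""
    return sum(x**order * p for x, p in pmf.items())

def central_moment(pmf, order):
    """i-th moment about the mean: sum of (x - mean)^i * f_X(x)."""
    mean = moment_about_origin(pmf, 1)
    return sum((x - mean)**order * p for x, p in pmf.items())

# Distribution of example 2.11: f_X(x) = (60/77)/x for x = 2, 3, 4, 5
pmf = {x: Fraction(60, 77) / x for x in (2, 3, 4, 5)}

mean = moment_about_origin(pmf, 1)       # E(X), the first moment about the origin
second = central_moment(pmf, 2)          # second central moment
print(mean, second)
```

The first central moment is always zero, and the second central moment equals the second moment about the origin minus the square of the mean.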


Some rules for finding expected values are

MEASURES OF CENTRAL TENDENCY

Arithmetic Mean
Generally, the first property of a random variable that is of interest is its mean or average
value. The mean, $\mu_X$, of a random variable, X, is its expected value. Thus

$\mu_X = E(X) = \int_{-\infty}^{\infty} x \, p_X(x) \, dx$

A sample estimate of the population mean is the arithmetic average, $\bar{x}$, calculated from

$\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$

where n is the number of observations or items in the sample. The arithmetic mean can be estimated
from grouped data by

$\bar{x} = \dfrac{1}{n} \sum_{i=1}^{k} n_i x_i$

where k is the number of groups, n is the number of observations, $n_i$ is the number of observations
in the ith group and $x_i$ is the class mark of the ith group.
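The two ways of estimating the mean can be compared on a small sample. The data below are hypothetical annual rainfall depths invented for illustration, grouped into classes of width 5 with class marks 27.5, 32.5, 37.5, and 42.5.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by their number."""
    return sum(values) / len(values)

def grouped_mean(class_marks, counts):
    """Mean from grouped data: sum of n_i * x_i divided by n."""
    n = sum(counts)
    return sum(n_i * x_i for n_i, x_i in zip(counts, class_marks)) / n

# HYPOTHETICAL annual rainfall depths (inches)
sample = [31.2, 28.4, 40.1, 35.6, 29.9, 33.3, 37.8, 30.5]

# The same sample grouped into classes 25-30, 30-35, 35-40, 40-45
class_marks = [27.5, 32.5, 37.5, 42.5]
counts = [2, 3, 2, 1]

print(arithmetic_mean(sample))
print(grouped_mean(class_marks, counts))
```

The grouped estimate differs slightly from the ungrouped one because every observation in a class is represented by the class mark.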

Geometric Mean
The sample geometric mean, $\bar{x}_G$, is defined as

$\bar{x}_G = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$

where $\prod_{i=1}^{n} x_i = x_1 x_2 x_3 \cdots x_n$.

The logarithm of $\bar{x}_G$ is equal to the arithmetic average of the logarithms of the $x_i$'s. The
logarithm of the population geometric mean would be the expected value of the logarithm of X.
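The relationship between the geometric mean and the mean of the logarithms suggests a convenient way to compute it. The sketch below works through the logarithms, which avoids overflow of the product for long records; the sample values are hypothetical.

```python
import math

def geometric_mean(values):
    """n-th root of the product of the values, computed as exp(mean of logs)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

flows = [1.0, 10.0, 100.0]        # HYPOTHETICAL values
gm = geometric_mean(flows)
mean_log = sum(math.log(v) for v in flows) / len(flows)
print(gm, math.log(gm), mean_log)
```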

Median
The sample median, $x_{md}$, is the observation such that half of the values lie on either side of
$x_{md}$. The population median, $\mu_{md}$, would be the value satisfying

$\int_{-\infty}^{\mu_{md}} p_X(x) \, dx = 0.5$    X continuous (3.18)

or $\mu_{md} = x_p$ where p is determined from

$\sum_{i=1}^{p} f_X(x_i) = 0.5$    X discrete (3.19)

The median of a sample or a population may not exist.

Mode
The mode is the most frequently occurring value. Thus the population mode, $\mu_{mo}$, would be
a value of X maximizing $p_X(x)$ and thus satisfying the equations

$\dfrac{dp_X(x)}{dx} = 0$ and $\dfrac{d^2 p_X(x)}{dx^2} < 0$    X continuous (3.20)

or the value of X associated with

$\max_i f_X(x_i)$    X discrete (3.21)


The sample mode, $x_{mo}$, would simply be the most frequently occurring value in the sample.
A sample or a population may have none, one, or more than one mode.

Weighted Mean
The calculation of the arithmetic mean of grouped data is an example of calculating a
weighted mean where $n_i/n$ is the weighting factor. In general, the weighted mean is

$\bar{x}_w = \dfrac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$    (3.22)

where $w_i$ is the weight associated with the ith observation or group and k is the number of observations
or groups.

MEASURES OF DISPERSION

Range
The two most common measures of dispersion are the range and the variance. The range of
a sample is simply the difference between the largest and smallest sample values. The range of a
population is often the interval from $-\infty$ to $\infty$ or from 0 to $\infty$. The sample range is a function
of only two of the sample values but does convey some idea of the spread of the data. The
population range of many continuous hydrologic variables would be 0 to $\infty$ and would convey
little information. The range has the disadvantage of not reflecting the frequency or magnitude of
values that deviate either positively or negatively from the mean because only the largest and
smallest values are used in its determination. Occasionally, the relative range (the range divided
by the mean) is used.

Variance
By far the most common measure of dispersion is the variance, or its positive square root,
the standard deviation. The variance of the random variable X is defined as the second moment
about the mean and is denoted by $\sigma_X^2$.

$\sigma_X^2 = \mu_2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 \, p_X(x) \, dx$    (3.23)

Thus, the variance is the average squared deviation from the mean. For a discrete population of
size n, equation 3.23 becomes

$\sigma_X^2 = \dfrac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$    (3.24)

The sample estimate of $\sigma_X^2$ is denoted by $s_X^2$ and calculated from

$s_X^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$    (3.25)

Two basic differences should be noted between equations 3.24 and 3.25. First, in 3.25 $\bar{x}$ is used
instead of $\mu$. This is because in dealing with a sample, the population mean would not be known.
Second, n - 1 is used as the denominator in determining $s_X^2$ rather than the n used in calculating $\sigma_X^2$.
The reason for this is that $\sum_i (x_i - \bar{x})^2 / n$ would result in a biased estimate for $\sigma_X^2$.
The variance for grouped data can be estimated from

$s_X^2 = \dfrac{\sum_{i=1}^{k} n_i (x_i - \bar{x})^2}{n - 1}$

where k is the number of groups, n is the number of observations, $x_i$ is the class mark and $n_i$ the
number of observations in the ith group.
The variance of some functions of the random variable X can be determined from the
following relationships:

The units on the variance are the same as the units on $X^2$. The units on the standard deviation
are the same as the units on the random variable. A dimensionless measure of dispersion is
the coefficient of variation, defined as the standard deviation divided by the mean. The coefficient
of variation is estimated from

$C_v = s_X / \bar{x}$
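The sample variance, the effect of the divisor, and the coefficient of variation can be illustrated with a short sketch. The sample values are hypothetical and the function names are illustrative.

```python
import math

def sample_variance(values):
    """Sample estimate of the variance with divisor n - 1 (equation 3.25)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def variance_divisor_n(values):
    """Same sum of squares with divisor n; biased when used as a sample estimate."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return math.sqrt(sample_variance(values)) / (sum(values) / len(values))

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # HYPOTHETICAL observations
print(sample_variance(sample))
print(variance_divisor_n(sample))
print(coefficient_of_variation(sample))
```

The divisor n - 1 always gives the larger of the two values, and the difference becomes negligible as the sample size grows.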

MEASURES OF SYMMETRY
As is apparent from figure 2.13, many distributions are not symmetrical. They may tail off
to the right or to the left and as such are said to be skewed. A distribution tailing to the right is
said to be positively skewed and one tailing to the left is negatively skewed. The skewness is the
third moment about the mean and is given by

skewness $= \mu_3 = \int_{-\infty}^{\infty} (x - \mu)^3 \, p_X(x) \, dx$

One measure of absolute skewness would be the difference between the mean and the mode. A measure
such as this would not be too meaningful, however, because it would depend on the units of
measurement. A relative measure of skewness, known as Pearson's first coefficient of skewness,
can be obtained by dividing the difference between the mean and the mode by the standard deviation.

population measure of skewness $= \dfrac{\mu - \mu_{mo}}{\sigma}$    (3.32)
Fig. 3.3. Location of mean, median, and mode. (Panels: symmetrical, where mean = mode = median; positive skew; negative skew.)

which can be estimated by

sample measure of skewness $= \dfrac{\bar{x} - x_{mo}}{s_X}$

The mode of moderately skewed distributions can be estimated from (Parl 1967)

$x_{mo} = \bar{x} - 3(\bar{x} - x_{md})$

so that

sample measure of skewness $= \dfrac{3(\bar{x} - x_{md})}{s_X}$    (3.35)

If sample estimates are replaced by population values in equation 3.35, Pearson's second coefficient
of skewness results.
The most commonly used measure of skewness is the coefficient of skew given by

$\gamma = \dfrac{\mu_3}{\sigma^3}$

An unbiased estimate for the coefficient of skew based on a sample of size n is

$C_s = \dfrac{n^2 M_3}{(n - 1)(n - 2) s_X^3}$

where $M_3$ is the sample estimate for $\mu_3$. The sample coefficient of skew has the advantage of being
a function of all of the observations in the sample. Figure 3.3 shows symmetrical, positively
and negatively skewed distributions.
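A sample coefficient of skew can be computed as sketched below. The sketch assumes that $M_3$ is the average cubed deviation from the sample mean and applies the commonly used adjustment factor $n^2/[(n-1)(n-2)]$; both are assumptions of this illustration, and the sample values are hypothetical.

```python
def coefficient_of_skew(values):
    """Adjusted sample skew: n^2 * M3 / ((n - 1)(n - 2) * s^3).

    ASSUMPTION: M3 = sum((x - mean)^3) / n and s uses the divisor n - 1.
    """
    n = len(values)
    if n < 3:
        raise ValueError("at least 3 observations are required")
    mean = sum(values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    s = (sum((x - mean) ** 2 for x in values) / (n - 1)) ** 0.5
    return n**2 * m3 / ((n - 1) * (n - 2) * s**3)

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]            # HYPOTHETICAL
right_tailed = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]   # HYPOTHETICAL
left_tailed = [-x for x in right_tailed]

print(coefficient_of_skew(symmetric))
print(coefficient_of_skew(right_tailed))
print(coefficient_of_skew(left_tailed))
```

A symmetric sample gives zero, a sample with a long right tail gives a positive value, and its mirror image gives the same magnitude with a negative sign.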

MEASURES OF PEAKEDNESS
A fourth property of random variables based on moments is the kurtosis. Kurtosis refers to
the extent of peakedness or flatness of a probability distribution in comparison with the normal

Fig. 3.4. Illustration of kurtosis. (Curves shown: leptokurtic, $\kappa > 3$, $\varepsilon > 0$; normal, $\kappa = 3$, $\varepsilon = 0$; platykurtic, $\kappa < 3$, $\varepsilon < 0$.)

probability distribution. Kurtosis is the fourth moment about the mean. A coefficient of kurtosis
is defined as

$\kappa = \dfrac{\mu_4}{\sigma^4}$

The sample estimate for the coefficient of kurtosis is

$k = \dfrac{M_4}{s_X^4}$    (3.39)

where $M_4$ is the sample estimate for $\mu_4$. According to Yevjevich (1972a), a less biased estimate
for the coefficient of kurtosis is obtained by multiplying equation 3.39 by
$n^3 / [(n - 1)(n - 2)(n - 3)]$, where n is the sample size.
The coefficient of kurtosis for a normal distribution is 3. The normal distribution is said to
be mesokurtic. If a distribution has a relatively greater concentration of probability near the mean
than does the normal, the coefficient of kurtosis will be greater than 3 and the distribution is said
to be leptokurtic. If a distribution has a relatively smaller concentration of probability near the
mean than does the normal, the coefficient of kurtosis will be less than 3 and the distribution is
said to be platykurtic. Figure 3.4 illustrates kurtosis. The coefficient of excess, $\varepsilon$, is defined as
$\kappa - 3$. Therefore, for a normal distribution $\varepsilon$ is 0, for a leptokurtic distribution $\varepsilon$ is positive, and
for a platykurtic distribution $\varepsilon$ is negative.
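The sample coefficient of kurtosis and the coefficient of excess can be computed as sketched below. The sketch assumes that $M_4$ is the average fourth-power deviation from the sample mean and that the sample variance uses the divisor n - 1; the optional factor is the $n^3/[(n-1)(n-2)(n-3)]$ adjustment mentioned in the text. The sample is hypothetical.

```python
def coefficient_of_kurtosis(values, less_biased=False):
    """Sample kurtosis k = M4 / s^4, optionally times n^3/((n-1)(n-2)(n-3)).

    ASSUMPTION: M4 = sum((x - mean)^4) / n and s^2 uses the divisor n - 1.
    """
    n = len(values)
    if n < 4:
        raise ValueError("at least 4 observations are required")
    mean = sum(values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)
    k = m4 / s2**2
    if less_biased:
        k *= n**3 / ((n - 1) * (n - 2) * (n - 3))
    return k

sample = [1.0, 2.0, 3.0, 4.0, 5.0]      # HYPOTHETICAL
k = coefficient_of_kurtosis(sample)
k_adjusted = coefficient_of_kurtosis(sample, less_biased=True)
excess = k - 3.0
print(k, k_adjusted, excess)
```

For very small samples such as this one the adjustment factor is large, which shows how unreliable kurtosis estimates are when few observations are available.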

MOMENTS AND EXPECTATION-JOINTLY DISTRIBUTED RANDOM VARIABLES


If X and Y are jointly distributed continuous random variables and U is some function of
X and Y, U = g(X, Y), then E(U) can be found by using the methods of chapter 2 to derive the
marginal distribution of U, $p_U(u)$, so that

$E(U) = \int_{-\infty}^{\infty} u \, p_U(u) \, du$
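A much simpler and more direct method of finding E[g(X, Y)] would be to use the
relationship

$E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) \, p_{X,Y}(x, y) \, dx \, dy$

In either case, the result is the average value of the function g(X, Y) weighted by the probability
that X = x and Y = y or more simply the mean of the random variable U.
In the discrete case

$E[g(X, Y)] = \sum_i \sum_j g(x_i, y_j) \, f_{X,Y}(x_i, y_j)$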

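A general expression for the r, s moment about the origin of the jointly distributed random
variables X and Y is

$\mu'_{r,s} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^r y^s \, p_{X,Y}(x, y) \, dx \, dy$

for X and Y continuous and

$\mu'_{r,s} = \sum_i \sum_j x_i^r y_j^s \, f_{X,Y}(x_i, y_j)$

for X and Y discrete.

The r, s central moment is defined as

$\mu_{r,s} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)^r (y - \mu_Y)^s \, p_{X,Y}(x, y) \, dx \, dy$

for continuous random variables and as

$\mu_{r,s} = \sum_i \sum_j (x_i - \mu_X)^r (y_j - \mu_Y)^s \, f_{X,Y}(x_i, y_j)$

for discrete random variables.

For most situations, only moments about the origin and about the means are of interest. As
in the case of univariate distributions, the r, s moment about the origin of a bivariate distribution
is equal to the expected value of $X^r Y^s$.
The cases where (r = 1, s = 0) and (r = 0, s = 1) are of special interest. For example

$E(X^1 Y^0) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \, p_{X,Y}(x, y) \, dy \, dx = \int_{-\infty}^{\infty} x \, p_X(x) \, dx = \mu_X$

The analogous result holds for $E(X^0 Y^1)$.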


The most useful central moments are for (r = 2, s = 0), (r = 1, s = 1) and (r = 0, s = 2).
For the case (r = 2, s = 0) we have

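$E[(X - \mu_X)^2] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)^2 \, p_{X,Y}(x, y) \, dx \, dy$

$= \int_{-\infty}^{\infty} (x - \mu_X)^2 \int_{-\infty}^{\infty} p_{X,Y}(x, y) \, dy \, dx$
$= \int_{-\infty}^{\infty} (x - \mu_X)^2 \, p_X(x) \, dx$
$= \text{var}(X)$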

The analogous result holds for (r = 0, s = 2). The comparable results for discrete random variables
are easily obtained.

Covariance
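The covariance of X and Y is defined as the 1, 1 central moment

$\text{Cov}(X, Y) = \sigma_{X,Y} = E[(X - \mu_X)(Y - \mu_Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) \, p_{X,Y}(x, y) \, dx \, dy$    (3.49)

For the case where X and Y are independent, equation 3.49 can be written

$\sigma_{X,Y} = \int_{-\infty}^{\infty} (x - \mu_X) \, p_X(x) \, dx \int_{-\infty}^{\infty} (y - \mu_Y) \, p_Y(y) \, dy$    (3.50)

since $p_{X,Y}(x, y)$ would equal $p_X(x) \, p_Y(y)$. Furthermore, both of the integrals in equation 3.50 are
equal to zero so that

$\sigma_{X,Y} = 0$

if X and Y are independent. The converse of this is not necessarily true, however.
The sample estimate for the population covariance $\sigma_{X,Y}$ is $s_{X,Y}$, computed from

$s_{X,Y} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$    (3.52)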
Correlation Coefficient
The covariance has units equal to the units of X times the units of Y. A normalized covariance
called the correlation coefficient is obtained by dividing the covariance by the product of
the standard deviations of X and Y

$$\rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y} \qquad (3.53)$$

It can be shown (Thomas 1971) that $-1 \le \rho_{X,Y} \le 1$. Obviously, if X and Y are independent,
$\rho_{X,Y} = 0$. Again, the converse is not necessarily true. X and Y can be functionally related and still
have $\rho_{X,Y}$ (and $\sigma_{X,Y}$) equal to zero. Actually $\rho_{X,Y}$ is a measure of the linear dependence between
X and Y. If $\rho_{X,Y} = 0$, then X and Y are linearly independent; however, they may be related by
some other functional form. A value of $\rho_{X,Y}$ equal to $\pm 1$ implies that X and Y are perfectly related
by Y = a + bX. If $\rho_{X,Y} = 0$, X and Y are said to be uncorrelated. Any nonzero value of $\rho_{X,Y}$
means X and Y are correlated.
The covariance and the correlation coefficient are measures of how the two variables X and
Y vary together. If $\rho_{X,Y}$ and $\sigma_{X,Y}$ are positive, large values of X tend to be paired with large
values of Y and vice versa. If $\rho_{X,Y}$ and $\sigma_{X,Y}$ are negative, large values of X tend to be paired with
small values of Y and vice versa.
The population correlation coefficient $\rho_{X,Y}$ can be estimated by the sample correlation
coefficient as

$$r_{X,Y} = \frac{s_{X,Y}}{s_X s_Y} \qquad (3.54)$$

where $s_X$ and $s_Y$ are the sample estimates for $\sigma_X$ and $\sigma_Y$ given by equation 3.25 and $s_{X,Y}$ is the
sample covariance given by equation 3.52.
Figure 3.5 demonstrates some typical values for $r_{X,Y}$. In figure 3.5a all of the points lie on
the line Y = X - 1; consequently, there is perfect linear dependence between X and Y and
the correlation coefficient is unity. In figure 3.5b the points are either on or slightly off the line
Y = X - 1, and $r_{X,Y} = 0.986$. Perfect linear dependence does not exist in this case because some
of the points deviate slightly from the straight line. In measuring and relating naturally occurring
hydrologic variables, a correlation coefficient of 0.986 would be considered quite good and the
resulting straight line, Y = X - 1 in this case, would usually be judged a good usable relationship
between X and Y.
In figure 3.5c the correlation coefficient has dropped to -0.671. The points in this case are
scattered about the line Y = 1.264 - 1.571X. The scatter of the points is much greater than in
the previous case, although the existence of some dependence (stochastic) is still in evidence.
In figure 3.5d the scatter of the points is great, with a corresponding lack of a strong (stochastic)
dependence. Generally, a correlation coefficient of 0.211 is considered too small to
indicate a useful stochastic dependence as knowledge about X gives very little information
about Y.
In the last two paragraphs the modifier "stochastic" has appeared with the word dependence.
This is because in reality there are two kinds of dependence-stochastic and functional. Gener-
ally, throughout this book the word dependence alone should be taken to mean stochastic (or sta-
tistical) dependence.
Figures 3.5e and 3.5f contain examples of functionally dependent variables. In figure 3.5e
the relationship is Y = x 2 / 4 for X > 0 and in figure 3.5f the relationship is Y = &.-
for -3 < X < 3. The correlation coefficient for figure 3.5e is 0.963, indicating a high degree of
stochastic (linear) dependence. This illustrates that even though the dependence between X and Y
is nonlinear, a high correlation coefficient can result. If the plot of figure 3.5e were to cover a
different range of X, the correlation coefficient would change as well.

Fig. 3.5. Examples of the correlation coefficient.

Figure 3.5f illustrates a situation where Y and X are perfectly functionally related even
though the correlation coefficient is zero. The functional relationship is not linear, however. This
figure demonstrates that one cannot conclude that X and Y are unrelated based on the fact that
their correlation coefficient is small.
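The behavior described for figure 3.5 can be reproduced numerically. The following is a minimal sketch, assuming the n - 1 divisor for the sample covariance and variances; the function names and the small data sets are illustrative and are not the data plotted in the figure.

```python
import math

def sample_covariance(x, y):
    """Sample covariance with an n - 1 divisor (assumed form)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Sample correlation: covariance divided by the product of standard deviations."""
    sx = math.sqrt(sample_covariance(x, x))
    sy = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sx * sy)

# Perfect linear dependence: every point lies on Y = X - 1.
x_line = [1.0, 2.0, 3.0, 4.0, 5.0]
y_line = [xi - 1.0 for xi in x_line]
r_line = sample_correlation(x_line, y_line)

# Perfect functional (but nonlinear) dependence on a symmetric range of X.
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_sym = [xi ** 2 for xi in x_sym]
r_sym = sample_correlation(x_sym, y_sym)

print(r_line, r_sym)
```

The first correlation is unity and the second is zero, even though in both cases Y is known exactly once X is known.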
The fact that two variables have a high degree of linear correlation should not be interpreted
as indicating a functional or cause-and-effect relationship exists between the two variables. The
annual water yield on two adjacent watersheds may be highly positively correlated even though
a high yield from one watershed does not cause a high yield from the second watershed. More
likely the same climatic factors and geomorphic factors are operating on the two watersheds,
causing their water yields to be similar. The fact is often overlooked that high correlation does not
necessarily mean a cause-and-effect relationship exists between the correlated variables.
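Further Properties of Moments
If Z is a linear function of two random variables X and Y, say Z = aX + bY, then

$$E(Z) = a E(X) + b E(Y) \qquad (3.55)$$

$$\mathrm{Var}(Z) = a^2\, \mathrm{Var}(X) + b^2\, \mathrm{Var}(Y) + 2ab\, \mathrm{Cov}(X, Y) \qquad (3.56)$$

Equations 3.55 and 3.56 can be generalized when Y is a linear function of n random
variables as follows. If

$$Y = \sum_{i=1}^{n} a_i X_i$$

then

$$E(Y) = \sum_{i=1}^{n} a_i E(X_i)$$

and

$$\mathrm{Var}(Y) = \sum_{i=1}^{n} a_i^2\, \mathrm{Var}(X_i) + 2 \sum_{i<j} a_i a_j\, \mathrm{Cov}(X_i, X_j) \qquad (3.58)$$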

A noteworthy result of equation 3.56 or 3.58 is that for uncorrelated random variables, the
variance of a sum or difference is equal to the sum of the variances. This is because the variation
in each of the random variables contributes to the variation of their sum or difference.
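As a special case of a linear function, consider the $X_i$ to be a random sample of size n. Let
the $a_i$ all be equal to 1/n. Then Y is equal to $\bar{X}$, the mean of the sample. The Var(Y) is the $\mathrm{Var}(\bar{X})$
and can be found from equation 3.58. Since the $X_i$ form a random sample, $\mathrm{Cov}(X_i, X_j) = 0$
for $i \ne j$ and $\mathrm{Var}(X_i) = \mathrm{Var}(X)$. We now have

$$\mathrm{Var}(\bar{X}) = \sum_{i=1}^{n} \frac{1}{n^2}\, \mathrm{Var}(X) = \frac{\mathrm{Var}(X)}{n} \qquad (3.59)$$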
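The result that the variance of a sample mean is Var(X)/n can be checked by brute force on a small discrete population. The sketch below is illustrative only: it assumes independent draws from a fair six-sided die (not an example from the text) and enumerates every possible sample of size n with exact fractions.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
mu = Fraction(sum(faces), 6)
var_x = sum((f - mu) ** 2 for f in faces) / 6  # population variance of one die

def variance_of_mean(n):
    """Exact variance of the mean over all equally likely samples of size n."""
    means = [Fraction(sum(s), n) for s in product(faces, repeat=n)]
    grand_mean = sum(means) / len(means)
    return sum((m - grand_mean) ** 2 for m in means) / len(means)

for n in (1, 2, 3):
    print(n, variance_of_mean(n), var_x / n)
```

For each n the enumerated variance of the mean equals the population variance divided by n.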

Equation 3.59 states that the variance of the mean of a random sample is equal to the variance
of the population from which the sample is drawn divided by the number of observations used
to estimate the mean. If X and Y are independent random variables, then equation 3.49 shows that the
expectation of their product is equal to the product of their expectations.

E(XY) = E(X)E(Y) if X and Y independent (3.60)

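The variance of the product XY for X and Y independent can be obtained from

$$\mathrm{Var}(XY) = E[(XY)^2] - E^2(XY)$$

and noting that

$$E[(XY)^2] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^2 y^2\, p_{X,Y}(x, y)\, dx\, dy$$

Because X and Y are independent, $p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$ and $E[(XY)^2]$ becomes $E(X^2)E(Y^2)$
or $E[(XY)^2] = (\mu_X^2 + \sigma_X^2)(\mu_Y^2 + \sigma_Y^2)$. Also from equation 3.60, $E^2(XY) = E^2(X)E^2(Y) = \mu_X^2 \mu_Y^2$.
Thus

$$\mathrm{Var}(XY) = (\mu_X^2 + \sigma_X^2)(\mu_Y^2 + \sigma_Y^2) - \mu_X^2 \mu_Y^2$$

which reduces to

$$\mathrm{Var}(XY) = \mu_X^2 \sigma_Y^2 + \mu_Y^2 \sigma_X^2 + \sigma_X^2 \sigma_Y^2$$

for X and Y independent.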
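The product-variance result can be verified exactly on a small case. This sketch assumes X and Y are two independent fair dice, an illustrative choice that is not from the text, and compares the enumerated variance of XY with the value given by the means and variances of X and Y.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
mu = Fraction(sum(faces), 6)                 # mean of one die
var = sum((f - mu) ** 2 for f in faces) / 6  # variance of one die

# Enumerate all 36 equally likely (x, y) pairs and compute Var(XY) directly.
products = [x * y for x, y in product(faces, faces)]
mean_xy = Fraction(sum(products), 36)
var_xy_enumerated = sum((p - mean_xy) ** 2 for p in products) / 36

# Var(XY) from the means and variances, valid for independent X and Y.
var_xy_formula = mu**2 * var + mu**2 * var + var * var

print(mean_xy, var_xy_enumerated, var_xy_formula)
```

The enumerated mean of XY also equals the product of the two means, as equation 3.60 requires.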


A final word of caution on closing this section concerning the expected value of a function
of random variables. The caution is that in general

$$E[g(X)] \ne g[E(X)]$$

That this is true is obvious from the example of $g(X) = X^2$. From equation 3.23 it can be
seen that $E(X^2) = \sigma_X^2 + \mu_X^2$, whereas $[E(X)]^2 = \mu_X^2$, so that

$$E(X^2) \ne [E(X)]^2 \quad \text{whenever } \sigma_X^2 > 0$$

thus demonstrating that in general $E[g(X)] \ne g[E(X)]$.

SAMPLE MOMENTS
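If $x_i$ for i = 1 to n is a random sample, then the rth sample moment about the origin is

$$M'_r = \frac{1}{n} \sum_{i=1}^{n} x_i^r$$

and the rth sample moment about the sample mean is

$$M_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$$

For the bivariate case involving a random sample of $x_i$ and $y_i$, the r, s sample moment about
the origin is

$$M'_{r,s} = \frac{1}{n} \sum_{i=1}^{n} x_i^r y_i^s$$

and the r, s sample moment about $\bar{x}$, $\bar{y}$ is

$$M_{r,s} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r (y_i - \bar{y})^s$$

The expected value of sample moments about the origin is equal to the corresponding population
moments (Mood et al. 1974).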
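Two important properties of moments worthy of repeating are:

1. The first moment about the mean is zero.

$$E(X - \mu_X) = E(X) - \mu_X = \mu_X - \mu_X = 0$$

2. The second moment about the origin is equal to the variance plus the square of the mean.

$$E(X^2) = \sigma_X^2 + \mu_X^2$$

The moments about the mean are related to the moments about the origin by the following
general equation (Thomas 1971)

$$\mu_r = \sum_{j=0}^{r} (-1)^j \binom{r}{j} \mu^j \mu'_{r-j} \qquad (3.66)$$

For the computation of sample moments it is often convenient to use equation 3.66. The results
of equation 3.66 for the first four sample moments are

$$\begin{aligned}
M_1 &= 0 \\
M_2 &= M'_2 - \bar{x}^2 \\
M_3 &= M'_3 - 3\bar{x} M'_2 + 2\bar{x}^3 \\
M_4 &= M'_4 - 4\bar{x} M'_3 + 6\bar{x}^2 M'_2 - 3\bar{x}^4
\end{aligned}$$
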
Sample moments can be computed from grouped data by using the equations

$$M'_r = \frac{1}{n} \sum_{j=1}^{k} n_j x_j^r$$

and

$$M_r = \frac{1}{n} \sum_{j=1}^{k} n_j (x_j - \bar{x})^r$$

where $x_j$ and $n_j$ are the class mark and number of observations, respectively, in the jth group, n is
the total number of observations, and k is the number of groups.
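Moments of greater than third order are generally not computed for hydrologic variables
because of the small sample size. Higher-order moments are very unreliable (have a high variance)
for small samples. For example, the variance of $s^2$ (the variance of the sample variance) is
(Mood et al. 1974)

$$\mathrm{Var}(s^2) = \frac{1}{n} \left( \mu_4 - \frac{n - 3}{n - 1} \sigma^4 \right)$$

where $\mu_4$ is the fourth central moment of the population.
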
Yevjevich (1972a) presents general expressions for the variance of the variance, coefficient of
skew, and kurtosis.
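The relation between sample moments about the mean and about the origin can be checked numerically. The sketch below assumes sample moments defined with a divisor of n and uses the binomial expansion of $(x_i - \bar{x})^r$; the function names are illustrative, and the data are the seven values listed in exercise 3.19.

```python
from math import comb

def origin_moment(x, r):
    """r-th sample moment about the origin (divisor n assumed)."""
    return sum(xi ** r for xi in x) / len(x)

def central_moment(x, r):
    """r-th sample moment about the sample mean, computed directly."""
    xbar = origin_moment(x, 1)
    return sum((xi - xbar) ** r for xi in x) / len(x)

def central_from_origin(x, r):
    """Same moment, built from moments about the origin by binomial expansion."""
    xbar = origin_moment(x, 1)
    return sum((-1) ** j * comb(r, j) * xbar ** j * origin_moment(x, r - j)
               for j in range(r + 1))

data = [15.0, 10.5, 11.0, 12.0, 18.0, 10.5, 19.5]
for r in (1, 2, 3, 4):
    print(r, central_moment(data, r), central_from_origin(data, r))
```

The two columns agree to within floating-point rounding, which becomes more noticeable for higher orders because large terms nearly cancel.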

PROBABILITY-WEIGHTED MOMENTS AND L-MOMENTS


Probability-weighted moments (PWMs) and linear functions of ranked observations
(known as L-moments) are another way of characterizing a pdf. This discussion of PWMs and
L-moments relies heavily on Stedinger et al. (1994), which should be consulted for more details.
The rth PWM, $\beta_r$, is given by

$$\beta_r = E\{X [P_X(X)]^r\}$$

$\beta_0$ is the population mean and an estimator $b_0$ of $\beta_0$ is $\bar{x}$. Estimates for other PWMs can
be obtained from order statistics. A random sample of observations can be arranged so that
$x_{(n)} \le x_{(n-1)} \le \ldots \le x_{(1)}$. The $x_{(j)}$ are known as order statistics. An estimator, $b_r^*$, for $\beta_r$ for $r \ge 1$
is

$$b_r^* = \frac{1}{n} \sum_{j=1}^{n} x_{(j)} \left[ 1 - \frac{j - 0.35}{n} \right]^r$$

where $1 - (j - 0.35)/n$ are estimators for $P_X(x_{(j)})$. Stedinger et al. (1994) recommend this
estimator for single site estimation despite its bias because it generally results in a smaller mean
square error than the unbiased estimator given below.
When unbiasedness is important, the following estimators may be used

$$b_r = \frac{1}{n} \sum_{j=1}^{n-r} \frac{\binom{n-j}{r}}{\binom{n-1}{r}}\, x_{(j)} \qquad r = 1, 2, 3$$

Stedinger et al. (1994) recommend these unbiased estimators in regionalization studies.


As previously indicated, L-moments are linear functions of ranked observations. Let $x_{(i|n)}$ be
the ith largest observation in a sample of size n. The ith L-moment, $\lambda_i$, is given by

L-moment estimates for the mean, standard deviation, skewness, and kurtosis are given by

Because L-moments do not involve squares and cubes of observations, they tend to produce less
variable estimates for higher moments, especially when an unusually large or small observation
happens to be present in a sample.
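L-moments and probability-weighted moments are related by

$$\begin{aligned}
\lambda_1 &= \beta_0 \\
\lambda_2 &= 2\beta_1 - \beta_0 \\
\lambda_3 &= 6\beta_2 - 6\beta_1 + \beta_0 \\
\lambda_4 &= 20\beta_3 - 30\beta_2 + 12\beta_1 - \beta_0
\end{aligned}$$

Estimates, $\hat{\lambda}_i$, of $\lambda_i$ are obtained by replacing the $\beta_r$ with sample estimates $b_r$.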
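A minimal sketch of how sample L-moments might be computed from PWM estimates follows. It assumes the commonly used unbiased PWM estimator, with observations ranked so that $x_{(1)}$ is the largest, and the standard linear relations between the first four L-moments and PWMs; the function names and the five-value sample are illustrative and are not from the text.

```python
from math import comb

def pwm_unbiased(x, r):
    """Unbiased estimate b_r of the r-th PWM (assumed standard form).

    Observations are ranked in descending order, so xs[0] is x_(1), the largest.
    """
    xs = sorted(x, reverse=True)
    n = len(xs)
    total = sum(comb(n - j, r) * xs[j - 1] for j in range(1, n + 1))
    return total / (n * comb(n - 1, r))

def sample_l_moments(x):
    """First four sample L-moments from b_0..b_3 (needs at least 4 observations)."""
    b0, b1, b2, b3 = (pwm_unbiased(x, r) for r in range(4))
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3, l4

l1, l2, l3, l4 = sample_l_moments([1.0, 2.0, 3.0, 4.0, 5.0])
print(l1, l2, l3, l4)
```

For this symmetric sample the first L-moment is the sample mean and the third L-moment is zero, as expected for data with no skew.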
PARAMETER ESTIMATION
Thus far, probability distribution functions have been written $p_X(x)$ or $f_X(x)$, depending on
whether they were continuous or discrete. More correctly, they should be written
$p_X(x; \theta_1, \theta_2, \ldots, \theta_m)$ or $f_X(x; \theta_1, \theta_2, \ldots, \theta_m)$, indicating that in general the distributions are a function of a set of
parameters as well as of random variables. To use probability distributions to estimate probabilities,
values for the parameters must be available. This section discusses methods for estimating
the parameter values for probability distributions. Certain properties of these parameter estimates
or statistics are also discussed. Rather than carry a dual set of relationships (one for continuous
and one for discrete random variables), only the expressions for the continuous random variables
will be displayed. The results are equally applicable to discrete distributions.
The usual procedure for estimating a parameter is to obtain a random sample $x_1, x_2, \ldots, x_n$
from the population X. This random sample is then used to estimate the parameters. Thus $\hat{\theta}_i$, an
estimate for the parameter $\theta_i$, is a function of the observations or random variables. Since $\hat{\theta}_i$ is a
function of random variables, $\hat{\theta}_i$ is itself a random variable possessing a mean, variance, and
probability distribution.
Intuitively, one would feel that the more observations of the random variables that were
available for parameter estimation, the closer $\hat{\theta}$ should be to $\theta$. Also, if many samples were used
for obtaining $\hat{\theta}$, one would feel that the average value of $\hat{\theta}$ should equal $\theta$. These two statements
deal with two properties of estimators known as consistency and unbiasedness.

Unbiasedness
An estimate $\hat{\theta}$ of a parameter $\theta$ is said to be unbiased if $E(\hat{\theta}) = \theta$. The bias, if any, is given
by $E(\hat{\theta}) - \theta$.

$$\text{bias} = E(\hat{\theta}) - \theta \qquad (3.75)$$

The fact that an estimator is unbiased does not guarantee that an individual $\hat{\theta}$ is equal to $\theta$
or even close to $\theta$; it simply means that the average of many independent estimates for $\theta$ will
equal $\theta$.

Consistency
An estimator $\hat{\theta}$ of a parameter $\theta$ is said to be consistent if the probability that $\hat{\theta}$ differs from
$\theta$ by more than an arbitrary constant $\varepsilon$ approaches zero as the sample size approaches infinity.
Consistency is an asymptotic property because it states that by selecting an n sufficiently
large, the prob$(|\hat{\theta} - \theta| > \varepsilon)$ can be made as small as desired. For small samples (as are often
used in practice) consistency does not guarantee that a small error will be made. In spite of
this, one feels more comfortable knowing that $\hat{\theta}$ would converge to $\theta$ if a larger sample were
used.
A single estimate of $\theta$ from a small sample is a problem because neither unbiasedness nor
consistency gives us much comfort. In choosing between several methods for estimating $\theta$, in
addition to being unbiased and consistent it would be desirable if the $\mathrm{Var}(\hat{\theta})$ were as small as
possible. This would mean that the probability distribution of $\hat{\theta}$ would be more concentrated
about $\theta$.

Efficiency
An estimator $\hat{\theta}$ is said to be the most efficient estimator for $\theta$ if it is unbiased and its variance
is at least as small as that of any other unbiased estimator for $\theta$. The relative efficiency of $\hat{\theta}_1$
with respect to $\hat{\theta}_2$ for estimating $\theta$ is the ratio of $\mathrm{Var}(\hat{\theta}_2)$ to $\mathrm{Var}(\hat{\theta}_1)$.

Sufficiency
Finally, it is desirable that $\hat{\theta}$ use all of the information contained in the sample relative to $\theta$.
If only a fraction of the observations in a sample are used for estimating $\theta$, then some information
about $\theta$ is lost. An estimator $\hat{\theta}$ is said to be a sufficient estimator for $\theta$ if $\hat{\theta}$ uses all of the
information relevant to $\theta$ that is contained in the sample.
More formal statements of the above four properties of estimators and procedures for deter-
mining if an estimator has these properties can be found in books on mathematical statistics
(Lindgren 1968; Freund 1962; Mood et al. 1974).
There are many ways for estimating population parameters from samples of data. A few of
these are graphical procedures, matching selected points, method of moments, maximum likeli-
hood, and minimum chi-square. The graphical procedure consists of drawing a line through plot-
ted points and then using certain points on the line to calculate the parameters. This procedure is
very arbitrary and is dependent upon the individual doing the analysis. Frequently, the method is
employed when few observations are available-with the thought that few observations will not
produce good parameter estimates anyway. When few points are available is precisely the time
when the best methods of parameter estimation should be used.
The method of matching points is not a commonly used method but can produce reasonable
first approximations to the parameters. The procedure can be valuable in getting initial estimates
for the parameters to be employed in iterative solutions that can arise when the method of
moments or maximum likelihood is used.

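Example 3.1. A certain set of data is thought to follow the distribution $p_X(x) = \lambda e^{-\lambda x}$ for x > 0.
In this particular data set, 75% of the values are less than 3.0. Estimate the parameter $\lambda$.

Solution:

$$p_X(x) = \lambda e^{-\lambda x}$$

$$P_X(x) = \int_0^x \lambda e^{-\lambda t}\, dt = 1 - e^{-\lambda x}$$

$$1 - P_X(x) = e^{-\lambda x}$$

$$\lambda x = -\ln(1 - P_X(x))$$

With $P_X(3.0) = 0.75$,

$$\hat{\lambda} = -\ln(1 - 0.75)/3.0 = 0.462$$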

Comment: If a sample of size n is available this procedure could be used to obtain n estimates for
$\lambda$. These n estimates could then be averaged to obtain $\hat{\lambda}$. If the probability distribution of interest
had m parameters, then the value of $P_X(x)$ and x at m points would be used to obtain m equations
in the m unknown parameters. The method of matching points is not recommended for general use
in getting final parameter estimates. Certainly this method would not use all of the information in
the sample. Also, several different estimates for the parameters could be obtained from the same
sample depending on which observations were used in the estimation process.
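The matching-point calculation of example 3.1 can be sketched in a few lines. The function name is hypothetical; the only inputs are the matched value x = 3.0 and its cumulative probability 0.75 from the example, and the one-parameter exponential distribution is assumed.

```python
import math

def exponential_rate_from_point(x, cum_prob):
    """Solve 1 - exp(-lam * x) = cum_prob for lam (one matched point)."""
    return -math.log(1.0 - cum_prob) / x

lam_hat = exponential_rate_from_point(3.0, 0.75)

# Check: the fitted distribution reproduces the matched point.
check = 1.0 - math.exp(-lam_hat * 3.0)
print(round(lam_hat, 3), round(check, 2))
```

Matching a different point of the same data set would generally give a different estimate, which is the weakness noted in the comment above.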

Method of Moments
One of the most commonly used methods for estimating the parameters of a probability dis-
tribution is the method of moments. For a distribution with m parameters, the procedure is to
equate the first m moments of the distribution to the first m sample moments. This results in m
equations which can be solved for the m unknown parameters. Moments about the origin, the
mean, or any other point can be used. Generally, for 1-parameter distributions the first moment
about the origin, the mean, is used. For 2-parameter distributions the mean and the variance are
generally used. If a third parameter is required, the skewness may be used.
Similarly, L-moments may be used in parameter estimation by equating sample estimates of
the L-moments to the population expression for the corresponding L-moment depending on the
particular pdf being used. Again, for m parameters, m L-moments would be required. This tech-
nique will be illustrated in chapter 6 for some particular pdfs.

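Example 3.2. Estimate the parameter $\lambda$ of the distribution $p_X(x) = \lambda e^{-\lambda x}$ for x > 0 by the
method of moments.

Solution: The first moment about the origin of $p_X(x)$ is

$$\mu_X = \int_0^{\infty} x \lambda e^{-\lambda x}\, dx = \frac{1}{\lambda}$$

Thus, the mean of $p_X(x)$ is $1/\lambda$ so that $\lambda$ can be estimated by $\hat{\lambda} = 1/\bar{x}$.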

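Example 3.3. Use the method of moments to estimate the parameters of

$$p_X(x) = (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x - \theta_1)^2}{2\theta_2^2}\right] \qquad -\infty < x < \infty$$

Solution:

$$\mu_X = \int_{-\infty}^{\infty} x\, (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x - \theta_1)^2}{2\theta_2^2}\right] dx$$

Let $y = (x - \theta_1)/\theta_2$ so that $dx = \theta_2\, dy$ and

$$\mu_X = \frac{\theta_2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y\, e^{-y^2/2}\, dy + \frac{\theta_1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\, dy$$

The first integral has an integrand h(y) such that h(-y) = -h(y) and is therefore zero. The
second integral can be written as

$$\frac{2\theta_1}{\sqrt{2\pi}} \int_0^{\infty} e^{-y^2/2}\, dy = \theta_1$$

Therefore $\mu_X = \theta_1$, or the parameter $\theta_1$ of this distribution is equal to the mean of the distribution
and can be estimated by

$$\hat{\theta}_1 = \bar{x}$$

The second moment about the mean is equal to the variance.

$$\sigma_X^2 = \int_{-\infty}^{\infty} (x - \theta_1)^2\, (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x - \theta_1)^2}{2\theta_2^2}\right] dx$$

Let $y = (x - \theta_1)/(\sqrt{2}\,\theta_2)$ so that $dx = \sqrt{2}\,\theta_2\, dy$ and

$$\sigma_X^2 = \frac{2\theta_2^2}{\sqrt{\pi}} \int_{-\infty}^{\infty} y^2 e^{-y^2}\, dy = \frac{4\theta_2^2}{\sqrt{\pi}} \int_0^{\infty} y^2 e^{-y^2}\, dy = \theta_2^2$$

Thus, the parameter $\theta_2^2$ is equal to the variance and can be estimated by $s_X^2$ (the sample variance).

$$\hat{\theta}_2^2 = s_X^2$$

Substituting the parameter estimates in terms of their population values into the expression
for $p_X(x)$, the result is

$$p_X(x) = (2\pi\sigma_X^2)^{-1/2} \exp\left[-\frac{(x - \mu_X)^2}{2\sigma_X^2}\right]$$

which is the normal distribution.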

Maximum Likelihood
Assume we have in hand n random observations $x_1, x_2, \ldots, x_n$. Their joint probability distribution
is $p_X(x_1, x_2, \ldots, x_n; \theta_1, \theta_2, \ldots, \theta_m)$. Because for a random sample the $x_i$'s are independent,
their joint distribution can be written

$$p_X(x_1; \theta_1, \ldots, \theta_m)\, p_X(x_2; \theta_1, \ldots, \theta_m) \cdots p_X(x_n; \theta_1, \ldots, \theta_m) = \prod_{i=1}^{n} p_X(x_i; \theta_1, \theta_2, \ldots, \theta_m)$$

Now, this latter expression is proportional to the probability that the particular random sample
would be obtained from the population and is known as the likelihood function.

$$L(\theta_1, \theta_2, \ldots, \theta_m) = \prod_{i=1}^{n} p_X(x_i; \theta_1, \theta_2, \ldots, \theta_m)$$

The m parameters are unknown. The values of these m parameters that maximize the likelihood
that the particular sample in hand is the one that would be obtained if n random observations
were selected from $p_X(x; \theta_1, \theta_2, \ldots, \theta_m)$ are known as the maximum likelihood estimators.
The parameter estimation procedure becomes one of finding the values of $\theta_1, \theta_2, \ldots, \theta_m$ that
maximize the likelihood function. This can be done by taking the partial derivative of $L(\theta_1, \theta_2, \ldots, \theta_m)$
with respect to each of the $\theta_i$'s and setting the resulting expressions equal to zero. These m
equations in m unknowns are then solved for the m unknown parameters.
Because many probability distributions involve the exponential function, it is often
easier to maximize the natural logarithm of the likelihood function. The logarithmic function is
monotonic; thus the values of the $\theta$'s that maximize the logarithm of the likelihood function also
maximize the likelihood function.

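Example 3.4. Find the maximum likelihood estimator for the parameter $\lambda$ of the distribution
$p_X(x) = \lambda e^{-\lambda x}$ for x > 0.

Solution:

$$L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n \exp\left(-\lambda \sum_{i=1}^{n} x_i\right)$$

$$\ln L(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i$$

$$\frac{d \ln L(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0$$

$$\hat{\lambda} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}$$

Note that this is the same estimate as obtained in example 3.2 using the method of moments. The
two methods do not always produce the same estimates.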
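The maximum likelihood result for the exponential distribution can be checked numerically. This is a small sketch with made-up observations (not data from the text): it evaluates the log-likelihood at the closed-form estimate, the reciprocal of the sample mean, and at nearby values.

```python
import math

def log_likelihood(lam, x):
    """Log-likelihood of the one-parameter exponential distribution."""
    return len(x) * math.log(lam) - lam * sum(x)

x = [0.5, 1.2, 2.0, 0.3, 1.0]   # illustrative observations
lam_hat = len(x) / sum(x)       # closed-form estimate, 1 / (sample mean)

for lam in (0.8 * lam_hat, lam_hat, 1.2 * lam_hat):
    print(round(lam, 3), round(log_likelihood(lam, x), 4))
```

The log-likelihood is largest at the closed-form estimate and falls off on either side.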

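Example 3.5. Find the maximum likelihood estimators for the parameters $\theta_1$ and $\theta_2^2$ of the
distribution

$$p_X(x) = (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x - \theta_1)^2}{2\theta_2^2}\right]$$

Solution (all summations from 1 to n):

$$L(\theta_1, \theta_2^2) = \prod (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x_i - \theta_1)^2}{2\theta_2^2}\right]$$

$$\ln L = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \theta_2^2 - \frac{1}{2\theta_2^2} \sum (x_i - \theta_1)^2$$

$$\frac{\partial \ln L}{\partial \theta_1} = \frac{1}{\theta_2^2} \sum (x_i - \theta_1) = 0$$

Therefore $\sum (x_i - \theta_1) = 0$ and

$$\hat{\theta}_1 = \frac{\sum x_i}{n} = \bar{x}$$

$$\frac{\partial \ln L}{\partial \theta_2^2} = -\frac{n}{2\theta_2^2} + \frac{1}{2\theta_2^4} \sum (x_i - \theta_1)^2 = 0$$

$$\hat{\theta}_2^2 = \frac{\sum (x_i - \bar{x})^2}{n}$$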
Example 3.5 shows that maximum likelihood estimators are not necessarily unbiased: the
estimator of the variance has a divisor of n rather than n - 1. It can be
shown, however, that the maximum likelihood estimators are asymptotically (as $n \to \infty$) unbiased.
Maximum likelihood estimators are sufficient and consistent. If an efficient estimator exists,
maximum likelihood estimators, adjusted for bias, will be efficient. In addition to these four
properties, maximum likelihood estimators are said to be invariant, that is, if $\hat{\theta}$ is a maximum
likelihood estimator of $\theta$ and the function $h(\theta)$ is continuous, then $h(\hat{\theta})$ is a maximum likelihood
estimator of $h(\theta)$.
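The bias of a variance estimator with divisor n can be seen exactly by enumeration. The sketch below is an illustration under stated assumptions: the population is a fair six-sided die rather than a normal distribution (the bias factor (n - 1)/n holds for any population with finite variance), and every possible sample of size n is treated as equally likely.

```python
from fractions import Fraction
from itertools import product

def variance_divisor_n(sample):
    """Variance estimate with divisor n, the form obtained in example 3.5."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((v - mean) ** 2 for v in sample) / n

faces = range(1, 7)
sigma2 = Fraction(35, 12)  # population variance of a fair die

n = 3
samples = list(product(faces, repeat=n))
expected_estimate = sum(variance_divisor_n(s) for s in samples) / len(samples)

print(expected_estimate, Fraction(n - 1, n) * sigma2, sigma2)
```

The average of the estimates is (n - 1)/n times the population variance, so the bias shrinks toward zero as n grows.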
The method of moments and the method of maximum likelihood do not always produce the
same estimates for the parameters. In view of the properties of the maximum likelihood estima-
tors, this method is generally preferred over the method of moments. Cases arise, however, where
one can get maximum likelihood estimators only by iterative numerical solutions (if at all), thus
leaving room for the use of more readily obtainable estimates possibly by the method of
moments. The accuracy of the method of moments is severely affected if the data contain errors
in the tails of the distribution where the moment arms are long (Chow 1954). This is especially
troublesome with highly skewed distributions.
Finally, it should be kept in mind that the properties of maximum likelihood estimators are
asymptotic properties (for large n) and there may well exist better estimation procedures for
small samples for particular distributions.

CHEBYSHEV INEQUALITY
Certain general statements about random variables can be made without placing restrictions
on their distributions. More precise probabilistic statements require more restrictions on the dis-
tribution of the random variables. Exact probabilistic statements require complete knowledge of
the probability distribution of the random variable.
One general result that applies to random variables is known as the Chebyshev inequality.
This inequality states that a single observation selected at random from any probability distribution
with finite variance will deviate more than $k\sigma$ from the mean, $\mu$, of the distribution with
probability less than or equal to $1/k^2$.

$$\text{prob}(|X - \mu| > k\sigma) \le \frac{1}{k^2} \qquad (3.77)$$

For most situations this is a very conservative statement. The Chebyshev inequality produces an
upper bound on the probability of a deviation of a given magnitude from the mean.

Example 3.6. The data of table 2.1 has a mean of 66,540 cfs and a standard deviation of 22,322
cfs. Without making any distributional assumptions regarding the data, what can be said of the
probability that the peak flow in a year selected at random will deviate more than 40,000 cfs from
the mean?

Solution: Applying Chebyshev's inequality we have $k\sigma$ = 40,000 cfs. Using 22,322 cfs as an
estimate for $\sigma$ we obtain k = 1.79.

$$\text{prob}(|X - \mu| > 40{,}000) \le \frac{1}{k^2} = \left(\frac{22{,}322}{40{,}000}\right)^2 = 0.311$$

The probability that the peak flow in any year will deviate more than 40,000 cfs from the
mean is thus less than or equal to 0.311.
Comment: One can see that this is a very conservative figure by noting that only 6 values out of
99 (6/99 = 0.061) lie outside the interval 66,540 ± 40,000. By not making any distributional
assumptions, we are forced to accept very conservative probability estimates. In later chapters we
will again look at this problem making use of selected probability distributions.
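The bound in example 3.6 is a one-line calculation. The sketch below uses a hypothetical helper name; the deviation and standard deviation are the values quoted in the example.

```python
def chebyshev_bound(deviation, sigma):
    """Upper bound on prob(|X - mean| > deviation) from Chebyshev's inequality."""
    k = deviation / sigma
    return min(1.0, 1.0 / k ** 2)  # a probability bound above 1 carries no information

bound = chebyshev_bound(40000.0, 22322.0)
print(round(bound, 3))
```

The cap at 1.0 matters only when the deviation is smaller than one standard deviation, in which case the inequality says nothing useful.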

LAW OF LARGE NUMBERS


Chebyshev's inequality is sometimes written in terms of the mean $\bar{x}$ of a random sample of
size n. In such a case equation 3.77 becomes

$$\text{prob}\left(|\bar{x} - \mu_X| > \frac{k\sigma_X}{\sqrt{n}}\right) \le \frac{1}{k^2} \qquad (3.78)$$

If we now let $\delta = 1/k^2$ and choose n so that $n \ge \sigma_X^2/(\varepsilon^2 \delta)$, we have the (weak) Law of Large Numbers
(Mood and Graybill 1963) which states:
Let $p_X(x)$ be a probability density function with mean $\mu_X$ and finite variance $\sigma_X^2$. Let $\bar{x}_n$ be the
mean of a random sample of size n from $p_X(x)$. Let $\varepsilon$ and $\delta$ be any two specified small numbers
such that $\varepsilon > 0$, $0 < \delta < 1$. Then for n any integer greater than $\sigma_X^2/(\varepsilon^2 \delta)$

$$\text{prob}(|\bar{x}_n - \mu_X| < \varepsilon) \ge 1 - \delta \qquad (3.79)$$

This statement assures us that we can estimate the population mean with whatever accuracy
we desire by selecting a large enough sample. The actual application of equation 3.79 requires
knowledge of population parameters and is thus of limited usefulness.

Example 3.7. Assume that the standard deviation of peak flows on the Kentucky River near
Salvisa, Kentucky, is 22,322 cfs. How many observations would be required to be at least 95%
sure that the estimated mean peak flow was within 10,000 cfs of its true value if we know noth-
ing of the distribution of peak flows?

Solution: Applying equation 3.79 with $\varepsilon$ = 10,000 cfs, $\delta$ = 0.05, and $\sigma_X$ = 22,322 cfs we have

$$n > \frac{\sigma_X^2}{\varepsilon^2 \delta} = \frac{(22{,}322)^2}{(10{,}000)^2 (0.05)} = 99.7$$

We must have at least 100 observations to be 95% sure that the sample mean is within 10,000
cfs of the population mean if we know nothing of the population distribution except its standard
deviation. This happens to be very close to the number of observations in the sample (99).
Comment: We will look at this problem again later making certain distributional assumptions.
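The sample-size calculation of example 3.7 can be sketched as follows. The helper name is hypothetical; it returns the smallest integer strictly greater than $\sigma^2/(\varepsilon^2 \delta)$, matching the wording "any integer greater than" in the statement of the law.

```python
import math

def required_sample_size(sigma, eps, delta):
    """Smallest integer n strictly greater than sigma**2 / (eps**2 * delta)."""
    bound = sigma ** 2 / (eps ** 2 * delta)
    return math.floor(bound) + 1

n_required = required_sample_size(22322.0, 10000.0, 0.05)
print(n_required)
```

Halving the tolerance $\varepsilon$ quadruples the required sample size, which shows how quickly the distribution-free requirement grows.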
Exercises

3.1. What is the expected mean and variance of the sum of values obtained by tossing two dice?
What is the coefficient of skew and kurtosis?

3.2. Modular coefficients defined as $K_t = x_t/\bar{x}$ are occasionally used in hydrology. What is the
mean, variance, and coefficient of variation of modular coefficients in terms of the original data?

3.3. What effect does the addition of a constant to each observation from a random sample have
on the mean, variance, and coefficient of variation?

3.4. What effect does multiplying each observation in a random sample by a constant have on
the mean, variance, and coefficient of variation?

3.5. Without any knowledge of the probability distribution of peak flows on the Kentucky River
(table 2.1), what can be said about the probability that $|\bar{Q} - \mu_Q|$ is greater than 10,000 cfs?

3.6. Without any knowledge of the probability distribution of peak flows on the Kentucky River
(table 2.1), what can be said about the probability that a single random observation will deviate
more than 10,000 cfs from $\mu_Q$?

3.7. Using the data of exercise 2.2 calculate the mean and variance from the grouped data. How
do the grouped data mean and variance compare to the ungrouped mean and variance? Which
estimate do you prefer?

3.8. Calculate the covariance between the peak discharge Q in thousands of cfs and the area A in
thousands of square miles for the following data.

3.9. Calculate the correlation coefficient between Q and A for the data in exercise 3.8.

3.10. Calculate the coefficient of skew for Q in exercise 3.8. Note that this estimate is relatively
unreliable because of the small sample.

3.11. Calculate the kurtosis and the coefficient of excess for Q in exercise 3.8. Note that these
estimates are unreliable because of the small sample size.
3.12. Complete the steps necessary to arrive at equation 3.56 from 3.55.

3.13. Show that $\sigma_X \sigma_Y \ge |\sigma_{X,Y}|$.

3.14. A convenient relationship for calculating the estimated variance of a sample of data is

$$s_X^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1}$$

Derive this relationship from equation 3.25.

3.15. The estimated covariance between X and Y of a bivariate random sample can be calculated
from

Derive this expression from equations 3.49. Note that this estimated covariance is biased. In
practice, the final divisor of n is replaced by n - 1 to correct for bias.

3.16. In exercise 2.14, if the future maximum life of the ferry is 15 years, what is the expected
net profit? Neglect the interest or discount rate.

3.17. What are the maximum likelihood estimates for the parameters of the two parameter
exponential distribution? This distribution is given by

3.18. What are the moment estimates for the parameters of the exponential distribution given in
exercise 3.17?

3.19. For the following data, what are the moment and maximum likelihood estimates for the
parameters of the distribution given in exercise 3.17? x = 15.0, 10.5, 11.0, 12.0, 18.0, 10.5, 19.5.

3.20. Calculate the coefficient of skew for the Kentucky River data of table 2.1.

3.21. Calculate the kurtosis of the Kentucky River data of table 2.1.

3.22. Using the data of exercise 2.2, calculate the coefficient of skew from the grouped data.

3.23. Using the data of exercise 2.2, calculate the kurtosis from the grouped data.
3.24. What are the maximum likelihood estimates for $\alpha$ and $\beta$ in the distribution

3.25. What are the mean and variance of $f_X(x) = 1/N$ for x = 1, 2, ..., N?

3.26. What are the mean and variance of $p_X(x) = a \sin^2 x$ for $0 < x < \pi$?

3.27. Use the method of moments to estimate a in $p_X(x) = a \sin^2 x$ for $0 < x < \pi$ based on the
random sample given by x = 0.5, 2.0, 3.0, 2.5, 1.5, 1.8, 1.0, 0.8, 2.5, 2.2.

3.28. The rth moment about $x_0$ can be written as $E(X - x_0)^r$. Show that the variance is the smallest
possible second moment.
4. Some Discrete Probability Distributions and Their Applications
THUS FAR, probability distributions have been considered in general terms. This chapter is
devoted to some particular discrete distributions and their applications. The following two chapters
are devoted to selected continuous distributions. These chapters are by no means exhaustive treat-
ments of probability distributions; only some of the more common distributions are considered.

HYPERGEOMETRIC DISTRIBUTION
Drawing a random sample of size n (without replacement) from a finite population of size
N, with the elements of the population divided into two groups with k elements belonging to one
group, is an example of sampling from a hypergeometric distribution. The two groups may be de-
fective or nondefective objects, rainy or nonrainy days, success or failure of a project, and so
forth. For discussion purposes we will consider that an element (or outcome) from the population
is either a success or a failure. The probability of x successes in a sample of size n selected from
a population of size N containing k successes can be determined by applying equation 2.1.
The total number of possible outcomes or ways of selecting a sample of size n from N objects
is $\binom{N}{n}$. The number of ways of selecting x successes and n - x failures from the population
containing k successes and N - k failures is $\binom{k}{x}\binom{N-k}{n-x}$. Thus the probability is

$$f_X(x; N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}} \qquad (4.1)$$

The distribution given by equation 4.1 is known as the hypergeometric distribution where
$f_X(x; N, n, k)$ is the probability of obtaining x successes in a sample of size n drawn from a
population of size N containing k successes.
The cumulative hypergeometric distribution giving the probability of x or fewer successes is

$$F_X(x; N, n, k) = \sum_{i=0}^{x} \frac{\binom{k}{i}\binom{N-k}{n-i}}{\binom{N}{n}}$$

There are certain natural restrictions on this distribution. For example: x cannot exceed k, x cannot
exceed n, k cannot exceed N, and n cannot exceed N. N, n, k, and x are all nonnegative integers.
Furthermore, the outcomes must be random and equally likely.
The mean of the hypergeometric distribution is

$$E(X) = \frac{nk}{N}$$

and the variance is

$$\mathrm{Var}(X) = \frac{nk(N - k)(N - n)}{N^2 (N - 1)}$$
Example 4.1. The hypergeometric distribution applies in example 2.5. In this example, a success is selecting
a bad record and N = 10, k = 3, n = 4. The solutions can be written in terms of the hypergeometric
as

$$\text{(a)} \quad f_X(1; 10, 4, 3) = \frac{\binom{3}{1}\binom{7}{3}}{\binom{10}{4}} = \frac{(3)(35)}{210} = 0.500$$

Example 4.2. Assume that during a certain September, 10 rainy days occurred. Also assume that
at this particular location the occurrence of rain on any day is independent of whether or not it
rained on any previous day. (This is often not a good assumption).
A sample of 10 September days is selected at random. (a) What is the probability that 4 of
these days will have been rainy? (b) What is the probability that less than 4 of these days were
rainy?

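Solution: Use the hypergeometric distribution with N = 30, n = 10, and k = 10.

$$\text{(a)} \quad f_X(4; 30, 10, 10) = \frac{\binom{10}{4}\binom{20}{6}}{\binom{30}{10}} = 0.271$$

$$\text{(b)} \quad F_X(3; 30, 10, 10) = \sum_{i=0}^{3} \frac{\binom{10}{i}\binom{20}{10-i}}{\binom{30}{10}} = 0.560$$
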

Example 4.3. Examples of the hypergeometric distribution commonly found in statistics books
include card sampling problems (What is the probability of exactly 2 aces in a 5-card hand
selected at random from a 52-card deck?) and acceptance sampling problems (What is the prob-
ability of selecting 5 defective items from a lot of 50 items if 20 items are selected and the lot
actually contains 12 defectives?)

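Solution: Card problem

$$f_X(2; 52, 5, 4) = \frac{\binom{4}{2}\binom{48}{3}}{\binom{52}{5}} = 0.0399$$

Acceptance Sampling Problem

$$f_X(5; 50, 20, 12) = \frac{\binom{12}{5}\binom{38}{15}}{\binom{50}{20}} = 0.260$$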
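The hypergeometric probabilities in examples 4.1 through 4.3 can be evaluated with a short script. This is a minimal sketch using only the standard-library function `math.comb`; the function names are illustrative, and the arguments follow the order (x; N, n, k) used in the text.

```python
from math import comb

def hypergeom_pmf(x, N, n, k):
    """Probability of exactly x successes in a sample of n from N items with k successes."""
    if x < 0 or x > n:
        return 0.0
    # comb(a, b) is 0 when b > a, which handles x > k and n - x > N - k.
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

def hypergeom_cdf(x, N, n, k):
    """Probability of x or fewer successes."""
    return sum(hypergeom_pmf(i, N, n, k) for i in range(x + 1))

print(hypergeom_pmf(1, 10, 4, 3))              # example 4.1(a)
print(round(hypergeom_pmf(4, 30, 10, 10), 3))  # example 4.2(a)
print(round(hypergeom_cdf(3, 30, 10, 10), 3))  # example 4.2(b)
print(round(hypergeom_pmf(2, 52, 5, 4), 4))    # example 4.3, card problem
print(round(hypergeom_pmf(5, 50, 20, 12), 3))  # example 4.3, acceptance sampling
```

Summing the probability function over all feasible x gives 1, which is a useful check on any hand calculation.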


BERNOULLI PROCESSES

Binomial Distribution
Consider a discrete time scale. At each point on this time scale an event may either occur
or not occur. Let the probability of the event occurring be p for every point on the time scale;
thus, the occurrence of the event at any point on the time scale is independent of the history of
any prior occurrences or nonoccurrences. The probability of an occurrence at the ith point on the
time scale is p for i = 1, 2, .... A process having these properties is said to be a Bernoulli
process.
An example of a Bernoulli process might be the occurrence of rainy days. The time scale has
units of days. On any particular day, rainfall may or may not occur. If the occurrence of rainfall
on any given day is independent of the past history of rainfall occurrences, the sequence of rainy
and dry days can be considered a Bernoulli process.
As an example of another Bernoulli process, consider that during any year the probability of
the maximum flow exceeding 10,000 cfs on a particular stream is p. Common terminology for a
flow exceeding a given value is an exceedance. Further consider that the peak flow in any year is
independent from year to year (a necessary condition for the process to be a Bernoulli process).
Let q = 1 - p be the probability of not exceeding 10,000 cfs. We can neglect the probability of
a peak of exactly 10,000 cfs since the peak flow rates would be a continuous process. In this ex-
ample the time scale is discrete with the points being nominally 1 year in time apart. We can now
make certain probabilistic statements about the occurrence of a peak flow in excess of 10,000 cfs
(an exceedance).
For example, the probability of an exceedance occurring in year 3 and not in years 1 or 2 can
be evaluated from equation 2.9 as qqp since the process is independent from year to year. The
probability of (exactly) one exceedance in any 3-year period is pqq + qpq + qqp since the
exceedance could occur in the first, second, or third year. Thus, the probability of (exactly)
one exceedance in three years is $3pq^2$.
In a similar manner, the probability of 2 exceedances in 5 years can be found from the
summation of the terms ppqqq, pqpqq, pqqpq, ..., qqqpp. Each of these terms is
equivalent to $p^2q^3$, and the number of terms is equal to the number of ways of arranging 2
items (the p's) among 5 items (the p's and q's). Therefore, the total number of terms is $\binom{5}{2}$, or 10,
so that the probability of exactly 2 exceedances in 5 years is $10p^2q^3$.
This result can be generalized so that the probability of x exceedances in n years is
$\binom{n}{x} p^x q^{n-x}$. The result is applicable to any Bernoulli process, so the probability of x occurrences
of an event in n independent trials, if p is the probability of an occurrence in a single trial, is given by

$$f_X(x; n, p) = \binom{n}{x} p^x q^{n-x} \qquad x = 0, 1, 2, \ldots, n$$

This equation is known as the binomial distribution.
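The counting argument behind the binomial distribution can be verified by brute force. The sketch below enumerates every sequence of exceedance (1) and nonexceedance (0) outcomes and compares the summed probabilities with the closed form. The value p = 0.1 and the function names are assumptions chosen only for illustration.

```python
from itertools import product
from math import comb, isclose

def binom_pmf(x, n, p):
    """Binomial probability of x occurrences in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_by_enumeration(x, n, p):
    """Sum the probabilities of all length-n sequences with exactly x occurrences."""
    q = 1 - p
    total = 0.0
    for seq in product((0, 1), repeat=n):
        if sum(seq) == x:
            total += p**x * q**(n - x)  # each qualifying sequence has this probability
    return total

p = 0.1  # assumed exceedance probability, for illustration only
q = 1 - p
print(isclose(binom_by_enumeration(1, 3, p), 3 * p * q**2))       # pqq + qpq + qqp
print(isclose(binom_by_enumeration(2, 5, p), 10 * p**2 * q**3))   # ten terms of p^2 q^3
print(isclose(binom_by_enumeration(2, 5, p), binom_pmf(2, 5, p)))
```

Enumeration grows as 2^n, so it is useful only as a check on small cases; the closed form should be used in practice.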


The binomial distribution and the Bernoulli process are not limited to a time scale. Any
process that may occur with probability p at discrete points in time or space or in individual trials
may be a Bernoulli process and follow the binomial distribution.

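The cumulative binomial distribution is

$$F_X(x; n, p) = \sum_{i=0}^{x} \binom{n}{i} p^i q^{n-i}$$

and gives the probability of x or fewer occurrences of an event in n independent trials if the
probability of an occurrence in any trial is p.
Continuing the above example, the probability of fewer than 3 exceedances in 5 years is

$$F_X(2; 5, p) = \sum_{i=0}^{2} \binom{5}{i} p^i q^{5-i} = q^5 + 5pq^4 + 10p^2q^3$$
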
The mean, variance, and coefficient of skew of the binomial distribution are

$$E(X) = np$$

$$\mathrm{Var}(X) = npq \qquad (4.8)$$

$$\gamma = \frac{q - p}{\sqrt{npq}}$$

The distribution is symmetrical for p = q, skewed to the right for q > p and skewed to the left
for q < p.
Because the probability of a success on any trial is independent of past history, the origin of
the time scale of a Bernoulli process can be taken at any time point. Thus the probability of any
combination of successes or failures is the same for any sequence of n points regardless of their
location with respect to the origin.

Example 4.4. On the average, how many times will a 10-year flood occur in a 40-year period?
What is the probability that exactly this number of 10-year floods will occur in a 40-year period?

Solution: A 10-year flood has p = 1/10 = 0.1.

$$E(X) = np = 40(0.1) = 4$$

$$f_X(4; 40, 0.1) = \binom{40}{4}(0.1)^4(0.9)^{36} = 0.2059$$

Comment: This problem illustrates the difficulty of explaining the concept of return period. On
the average a 10-year event occurs once every 10 years and in a 40-year period is expected to
occur 4 times. Yet in about 80% (100[1 - 0.2059]) of all possible independent 40-year periods,
the 10-year event will not occur exactly 4 times. As a matter of fact the probability that it will
occur 3 times is nearly identical to the probability it will occur 4 times (0.2003 vs. 0.2059). The
number of occurrences, X, is truly a random variable (with a binomial distribution).
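The numbers quoted in example 4.4 follow directly from the binomial formula. A minimal sketch, with `binom_pmf` as an illustrative helper name:

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability of x occurrences in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 40, 0.1                 # 40-year period, 10-year flood
expected = n * p               # mean number of occurrences
p4 = binom_pmf(4, n, p)        # exactly 4 occurrences
p3 = binom_pmf(3, n, p)        # exactly 3 occurrences

print(expected, round(p4, 4), round(p3, 4))
print(round(100 * (1 - p4)))   # percent of 40-year periods without exactly 4 events
```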
The binomial distribution has an additive property (Gibra 1973). That is, if X has a binomial
distribution with parameters $n_1$ and p and Y has a binomial distribution with parameters $n_2$ and p,
then Z = X + Y has a binomial distribution with parameters $n = n_1 + n_2$ and p.
A useful property of the binomial distribution is that

The binomial distribution can be used to approximate the hypergeometric distribution if the
sample selected is small in comparison to the number of items N from which the sample is drawn.
In this case, the probability of a success would be about the same for each trial, and sampling
without replacement (hypergeometric) would be very similar to sampling with replacement
(binomial).

Example 4.5. Compare the hypergeometric and binomial for N = 40, n = 5, k = 10 and X = 0,
1,2,3,4,5.

Solution:

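     Hypergeometric                         Binomial
x    f_X(x; N, n, k) = f_X(x; 40, 5, 10)    f_X(x; n, p) = f_X(x; 5, 10/40)
0    0.2166                                 0.2373
1    0.4165                                 0.3955
2    0.2777                                 0.2637
3    0.0793                                 0.0879
4    0.0096                                 0.0146
5    0.0004                                 0.0010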

Comment: This merely indicates that drawing a small sample without replacement from a large
population and drawing the same sample with replacement (so probabilities in each trial are con-
stant) are nearly equivalent.
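The comparison in example 4.5 can be generated programmatically. The following sketch tabulates both distributions side by side; the helper names are illustrative, and the binomial uses p = k/N as in the example.

```python
from math import comb

def hypergeom_pmf(x, N, n, k):
    """Sampling without replacement: x successes in n draws from N items (k successes)."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    """Sampling with replacement: x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

N, n, k = 40, 5, 10
rows = [(x, hypergeom_pmf(x, N, n, k), binom_pmf(x, n, k / N)) for x in range(n + 1)]

print("x  hypergeometric  binomial")
for x, h, b in rows:
    print(f"{x}  {h:14.4f}  {b:8.4f}")
```

Increasing N while holding n and k/N fixed should bring the two columns closer together, which is one way to explore how large the population must be for the approximation to be acceptable.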

Example 4.6. The operator of a boat dock has decided to put in a new facility along a certain
river. In an economic analysis of the situation it was decided to have the facility designed to with-
stand floods up to 75,000 cfs. Furthermore, it was determined that if one flood greater than this
occurs in a 5-year period, repairs can be made and the operator will still break even on its opera-
tion during the 5-year period. If more than one flow in excess of 75,000 cfs occurs, money will
be lost. If the probability of exceeding 75,000 cfs is 0.15, what is the probability the operator will
make money?

Solution: Money will be made if no floods exceeding 75,000 cfs occur during the 5-year period.
Let X be the number of floods. From the binomial distribution

$$f_X(0; 5, 0.15) = \binom{5}{0}(0.15)^0(0.85)^5 = 0.4437$$

Comment: The probability that the operator will make the investment, work for 5 years, and just
break even is very high

$$f_X(1; 5, 0.15) = \binom{5}{1}(0.15)^1(0.85)^4 = 0.3915$$

Thus, even though the risk or probability of losing money is low (1 - 0.3915 - 0.4437 = 0.1648),
the investment may not be an attractive one.
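The three outcomes in example 4.6 (make money, break even, lose money) partition the binomial distribution, which the short sketch below makes explicit. The variable names are illustrative only.

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability of x occurrences in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 0.15                              # 5-year period, annual exceedance probability
p_make_money = binom_pmf(0, n, p)           # no floods above 75,000 cfs
p_break_even = binom_pmf(1, n, p)           # exactly one such flood
p_lose_money = 1 - p_make_money - p_break_even  # two or more floods

print(round(p_make_money, 4), round(p_break_even, 4), round(p_lose_money, 4))
```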

Whenever a decision is made based on uncertain information or relative to a system subject
to random inputs or behavior, there is a chance that the decision will result in an adverse
outcome. A bridge that may be underdesigned, a water supply reservoir that may be too small, and
an investment that may fail are examples of decisions made in the face of uncertainty. These de-
cisions are said to be risky decisions with risk defined as the probability of an adverse outcome.
Generally, all decisions dependent on hydrologic data and hydrologic analysis are risky in this
sense. A risky decision is not a bad decision. Risk must be balanced against costs and available
alternatives. For informed decisions to be made under uncertainty, quantitative estimates of the
resulting risk are desirable. Risk and uncertainty are treated in detail in chapter 17.

Example 4.7. In order to be 90% sure that a design storm is not exceeded in a 10-year period,
what should be the return period of the design storm?

Solution: Let p be the probability of the design storm being exceeded. Based on the binomial
distribution, the probability of no exceedances is given by

$$f_X(0; 10, p) = \binom{10}{0} p^0 (1 - p)^{10} = 0.90$$

$$p = 1 - (0.90)^{1/10} = 0.0105$$

$$T = \frac{1}{p} = 95 \text{ years}$$

Comment: To be 90% sure that a design storm is not exceeded in a 10-year period, a 95-year
return period storm must be used. If a 10-year return period storm is used, the chances of it being
exceeded are

$$1 - f_X(0; 10, 0.1) = 1 - (0.9)^{10} = 0.6513$$

In general, the chance of at least one occurrence of a T-year event in T years is

$$1 - f_X(0; T, 1/T) = 1 - (1 - 1/T)^T$$

It can be shown that as T gets large, this expression approaches 1 - 1/e or 0.632. For T = 5,
10, and 25, the probability is 0.67, 0.65, and 0.64, respectively. Thus, if the design life of a structure
and its design return period are the same, the chances are very great that the capacity of the
structure will be exceeded during its design life. The risk associated with a return period T over n years is

$$\text{risk} = 1 - (1 - 1/T)^n$$
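The risk relation and its inverse are easy to evaluate numerically. The sketch below is one way to do so; the function names `risk` and `design_return_period` are illustrative, and the inverse simply solves risk = 1 - (1 - 1/T)^n for T.

```python
def risk(T, n):
    """Probability of at least one exceedance of a T-year event in n years."""
    return 1.0 - (1.0 - 1.0 / T) ** n

def design_return_period(n, allowable_risk):
    """Return period T for which the risk over n years equals allowable_risk."""
    p = 1.0 - (1.0 - allowable_risk) ** (1.0 / n)  # annual exceedance probability
    return 1.0 / p

# Example 4.7: 90% sure of no exceedance in 10 years (allowable risk of 10%)
print(round(design_return_period(10, 0.10)))
# Design life equal to design return period
for T in (5, 10, 25):
    print(T, round(risk(T, T), 2))
```

The same two functions can be used to trace out curves of design return period against design life for a fixed percent chance of no exceedance.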

The procedure outlined in example 4.7 can be used to determine a design return period when
the allowable risk is stated. Note that the design return period must be much greater than the life of
the project to be reasonably sure that an exceedance will not occur. No matter what design return pe-
riod is selected, there is still a chance that an exceedance will occur. Some may argue that there is an
upper limit to the magnitude of natural events, such as flood peaks. They would argue that a peak of
100,000 cfs from a 1-acre watershed would be impossible. In practice the probability that would be
assigned to an event of this sort is so small that it can be neglected for most practical purposes.
Figure 4.1 shows the design return period that must be used to be a certain percent confident
that the design will not be exceeded during the design life of the project. The parameters on the
curves are the percent chance of no exceedance during the design life. For example, to be 90%
sure that a design condition will not be exceeded during a project whose design life is 100 years,
the project would have to be designed on the basis of a 900-year event. Figure 4.1 is derived from
calculations like those contained in example 4.7.
Figure 4.1 can also be used to evaluate the risk or percent chance of an event in excess of the
design event during the design life. For example, if a project is designed on the basis of a 50-year
event and the design life of the project is 10 years, the designer is taking a 19% chance (100 - 81)
that the design condition will be exceeded.

Fig. 4.1. Design return period required as a function of design life to be a given percent confident
(curve parameter) that the design condition is not exceeded.

0.4. What is the probability of 3 successes in the next 5 trials?

Solution:

Comment: What has occurred prior to the trials of interest is of no concern since the Bernoulli
process is based on the assumption of independence from trial to trial.

Geometric Distribution
The probability that the first exceedance (or success) of a Bernoulli trial occurs on the
xth trial can be found by noting that for the first exceedance to be on the xth trial there must be
x - 1 preceding trials without an exceedance followed by 1 trial with an exceedance. Thus,
the desired probability is $pq^{x-1}$. This is known as the geometric distribution

$$f_X(x; p) = pq^{x-1} \qquad x = 1, 2, 3, \ldots$$

The mean and variance of the geometric distribution are

$$E(X) = \frac{1}{p}$$

$$\mathrm{Var}(X) = \frac{q}{p^2}$$

E(X) = 1/p means that on the average a T-year event occurs on the Tth year, which agrees
with our intuitive concept of a return period.

Example 4.9. What is the probability that a 10-year flood will occur for the first time during the
fifth year after the completion of a project? What is the probability it will be at least the fifth year
before a 10-year flood occurs?

Solution: The probability that the first exceedance is in year 5 is

$$f_X(5; 0.1) = (0.1)(0.9)^4 = 0.0656$$

The probability that it will be at least the fifth year before the first occurrence is not the same as
the probability of the first occurrence in the fifth year. The expression "at least" implies the first
occurrence might be in the fifth year or some later year. The desired probability is equal to the
probability of no occurrences in the first 4 years, which is $(0.9)^4 = 0.6561$.
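The distinction between "first occurrence in year 5" and "first occurrence in year 5 or later" can be made concrete in code. This is a minimal sketch with illustrative function names; the second function sums the geometric tail, which collapses to q^(x-1).

```python
def geom_pmf(x, p):
    """Probability that the first occurrence is on trial x."""
    return p * (1 - p) ** (x - 1)

def geom_at_least(x, p):
    """Probability that the first occurrence is on trial x or later,
    i.e. no occurrences in the first x - 1 trials."""
    return (1 - p) ** (x - 1)

p = 0.1  # 10-year flood
print(round(geom_pmf(5, p), 4))       # first occurrence in year 5
print(round(geom_at_least(5, p), 4))  # at least the fifth year
```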
Solution: This is the same as the probability of the first occurrence on the tenth year or

Negative Binomial Distribution


The probability that the kth exceedance (success) occurs on the xth trial (x ≥ k) of a
Bernoulli process can be found by noting that there must be k - 1 exceedances in the x - 1
trials preceding the kth exceedance on the xth trial. The probability of k - 1 exceedances in
x - 1 trials is given by the binomial distribution as $\binom{x-1}{k-1} p^{k-1} q^{x-k}$. The probability that the
xth trial results in an exceedance is p, so the desired probability is given by the negative binomial
distribution.

$$f_X(x; k, p) = \binom{x-1}{k-1} p^k q^{x-k} \qquad x = k, k+1, \ldots$$

The mean and variance of the negative binomial distribution are

$$E(X) = \frac{k}{p}$$

$$\mathrm{Var}(X) = \frac{kq}{p^2}$$

As might be expected because the negative binomial is based on the binomial, the additive
feature holds. Thus, if X and Y are described by $f_X(x; k_1, p)$ and $f_Y(y; k_2, p)$, respectively, then
Z = X + Y follows the negative binomial $f_Z(z; k_1 + k_2, p)$.

Example 4.11. What is the probability that the fourth occurrence of a 10-year flood will be on the
fortieth year?

Solution:

$$f_X(40; 4, 0.1) = \binom{39}{3}(0.1)^4(0.9)^{36} = 0.0206$$
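The negative binomial probability in example 4.11 can be computed directly, and the derivation (k - 1 occurrences in the first x - 1 trials, then an occurrence on trial x) can be checked against the binomial. The function names below are illustrative.

```python
from math import comb, isclose

def binom_pmf(x, n, p):
    """Binomial probability of x occurrences in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def negbinom_pmf(x, k, p):
    """Probability that the kth occurrence falls on trial x (x >= k)."""
    if x < k:
        return 0.0
    return comb(x - 1, k - 1) * p**k * (1 - p)**(x - k)

p = 0.1  # 10-year flood
answer = negbinom_pmf(40, 4, p)
print(round(answer, 4))
# Same value built from the derivation: 3 occurrences in 39 years, then one in year 40
print(isclose(answer, binom_pmf(3, 39, p) * p))
```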

Summary of Bernoulli Process

In a Bernoulli process at each instant of time (or location, or trial) an event may either occur
with probability p or not occur with probability q = 1 - p. The probability of the event
occurring is independent of the time and independent of the past history of occurrences. The number
of occurrences in a given time interval (or distance or number of trials) follows the binomial
distribution. The probability that the first occurrence is at the xth time is described by the
geometric distribution. The probability that the kth occurrence is at the xth time is described by
the negative binomial distribution. It was also found that the probability distribution of the length
of time between occurrences can be found from the geometric distribution by noting that the
probability that x trials elapse between occurrences is the same as the probability that the first
occurrence is at the (x + 1)th time, or $f_X(x + 1; p) = pq^x$.

POISSON PROCESS

Poisson Distribution
Consider a Bernoulli process defined over an interval of time (or space) so that p is the prob-
ability that an event may occur during the time interval. If the time interval is allowed to become
shorter and shorter so that the probability, p, of an event occurring in the interval gets smaller and
the number of trials, n, increases in such a fashion that np remains constant, then the expected
number of occurrences in any total time interval remains the same. It can be shown that as n gets
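large and p gets small so that np remains a constant, $\lambda$, the binomial distribution approaches the
Poisson distribution given by

$$f_X(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \qquad x = 0, 1, 2, \ldots; \quad \lambda > 0$$

The mean, variance, and coefficient of skew of the Poisson distribution are

$$E(X) = \lambda \qquad \mathrm{Var}(X) = \lambda \qquad \gamma = \lambda^{-1/2}$$

As $\lambda$ gets large, the distribution goes from a positively skewed distribution to a nearly
symmetrical distribution. The cumulative Poisson distribution is

$$F_X(x; \lambda) = \sum_{i=0}^{x} \frac{\lambda^i e^{-\lambda}}{i!}$$
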
Example 4.12. What is the probability that a storm with a return period of 20 years will occur
once in a 10-year period?

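Solution: Using the binomial distribution the exact answer is

$$f_X(1; 10, 0.05) = \binom{10}{1}(0.05)^1(0.95)^9 = 0.3151$$

Approximating with the Poisson, $\lambda = np = 10(0.05) = 0.5$ and

$$f_X(1; 0.5) = \frac{(0.5)^1 e^{-0.5}}{1!} = 0.3033$$
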
Thus the solutions are not identical but are quite close to each other.

Example 4.13. What is the probability of 5 occurrences of a 2-year storm in a 10-year period?

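Solution: Using the binomial

$$f_X(5; 10, 0.5) = \binom{10}{5}(0.5)^5(0.5)^5 = 0.2461$$

Approximating with the Poisson, $\lambda = np = 10(0.5) = 5$ and

$$f_X(5; 5) = \frac{5^5 e^{-5}}{5!} = 0.1755$$
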
Comment: For this situation n is not large enough and p not small enough for a good approximation.

Example 4.14. What is the probability of fewer than 5 occurrences of a 20-year storm in a 100-
year period?

Solution: n is relatively large and p small so the Poisson will be used, with $\lambda = np = 100(0.05) = 5$.

$$F_X(4; 5) = \sum_{i=0}^{4} \frac{5^i e^{-5}}{i!} = 0.4405$$
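Examples 4.12 through 4.14 compare the binomial with its Poisson approximation. The sketch below reproduces those comparisons with the standard library only; the function names are illustrative and the Poisson parameter is always taken as np.

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    """Binomial probability of x occurrences in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """Poisson probability of x occurrences with parameter lam."""
    return lam**x * exp(-lam) / factorial(x)

def poisson_cdf(x, lam):
    """Poisson probability of x or fewer occurrences."""
    return sum(poisson_pmf(i, lam) for i in range(x + 1))

# Example 4.12: 20-year storm, once in 10 years (p small, approximation close)
print(round(binom_pmf(1, 10, 0.05), 4), round(poisson_pmf(1, 10 * 0.05), 4))
# Example 4.13: 2-year storm, 5 times in 10 years (p large, approximation poor)
print(round(binom_pmf(5, 10, 0.5), 4), round(poisson_pmf(5, 10 * 0.5), 4))
# Example 4.14: fewer than 5 occurrences of a 20-year storm in 100 years
print(round(poisson_cdf(4, 100 * 0.05), 4))
```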

The Poisson distribution possesses the additive property that the sum of two Poisson
random variables with parameters $\lambda_1$ and $\lambda_2$ is a Poisson random variable with parameter
$\lambda = \lambda_1 + \lambda_2$. A Poisson process for a continuous time scale can be defined analogous to a
Bernoulli process on a discrete time scale. The Poisson process refers to the occurrence of
events along a continuous time (or location) scale. The assumptions underlying the process
are:

1. The probability of an event in any short interval t to t + Δt is λΔt (proportional to the length
of the interval) for all values of t. This property is known as stationarity.

2. The probability of more than one event in any short interval t to t + Δt is negligible in
comparison to λΔt.

3. The number of events in any interval of time is independent of the number of events in any
other non-overlapping interval of time.

The probability distribution of the number of events X in time t for a Poisson process is
given by

$$f_X(x; \lambda t) = \frac{(\lambda t)^x e^{-\lambda t}}{x!} \qquad \lambda > 0; \quad t > 0; \quad x = 0, 1, 2, \ldots \qquad (4.20)$$

where $f_X(x; \lambda t)$ is the probability of x events in time t. Equation 4.20 is a Poisson distribution
with parameter $\lambda t$. The mean and variance of $f_X(x; \lambda t)$ are $E(X) = \lambda t$ and $\mathrm{Var}(X) = \lambda t$. The
parameter $\lambda$ is the average rate of occurrence of the event.

Exponential Distribution
The probability distribution of the time, T, between occurrences of the event can be found
by noting that prob(T ≤ t) is equal to 1 - prob(T > t). The prob(T > t) is equal to the
probability of no occurrences in time t, which is $f_X(0; \lambda t)$ or $e^{-\lambda t}$. Thus

$$P_T(t) = \text{prob}(T \le t) = 1 - e^{-\lambda t} \qquad t \ge 0$$

which is a cumulative distribution known as the exponential distribution. The probability density
function is

$$p_T(t) = \frac{dP_T(t)}{dt} = \lambda e^{-\lambda t} \qquad t \ge 0$$

and is the probability distribution of the length of the time interval between occurrences of the
event. The mean and variance of the exponential distribution are $1/\lambda$ and $1/\lambda^2$, respectively.

Gamma Distribution
The probability distribution of the time to the nth occurrence can be found by noting that the
time to the nth occurrence is the sum of n independent random variables, $T_1 + T_2 + \cdots + T_n$, from
the exponential distribution. The method of derived distributions can be used with the result that
the probability density function of the time to the nth occurrence is

$$p_T(t) = \frac{\lambda^n t^{n-1} e^{-\lambda t}}{(n-1)!} \qquad t > 0; \quad n = 1, 2, \ldots$$

which is the gamma distribution for integer values of the parameter n. The gamma distribution
has $E(T) = n/\lambda$ and $\mathrm{Var}(T) = n/\lambda^2$.

Example 4.15. Barges arrive at a lock an average of 4 each hour. (a) If the arrival of barges at the
lock can be considered to follow a Poisson process, what is the probability that 6 barges will
arrive in 2 hours? (b) If the lock master has just locked through all of the barges at the lock, what
is the probability she can take a 15-minute break without another barge arriving? (c) If the oper-
ation of the lock is such that 4 barges can be locked through at once and the lock master insists
that this always be the case, what is the probability that the first barge to arrive after 4 previous
barges have been locked through will have to wait at least 1 hour before being locked through?

Solution:

(a) For this problem the rate constant $\lambda$ is 4 per hour. The probability of 6 arrivals in 2 hours
can be determined from the Poisson distribution

$$f_X(x; \lambda t) = f_X(6; 8) = \frac{8^6 e^{-8}}{6!} = 0.1221$$

(b) The probability of no arrivals in 15 minutes is also from the Poisson

$$f_X(0; 1) = \frac{1^0 e^{-1}}{0!} = 0.368$$

Note that this is not the same as the probability that it will be 15 minutes until the next arrival.
The time scale is continuous so the probability that it will be exactly 15 minutes until the next
arrival is zero. We can only talk of probabilities associated with time intervals, not specific
times.

(c) The barge must wait for the arrival of 3 additional barges. The probability that the time
$T_3$ for 3 barges to arrive is greater than 1 hour, prob($T_3$ > 1), is 1 - prob($T_3$ ≤ 1).

The probability that $T_3$ ≤ 1 for 3 arrivals comes from the gamma distribution

$$\text{prob}(T_3 \le 1) = \int_0^1 \frac{4^3 t^2 e^{-4t}}{2!}\,dt = 0.762$$

The desired probability is 1 - 0.762 = 0.238.
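All three parts of example 4.15 can be evaluated without numerical integration, because for integer n the gamma cumulative distribution equals one minus the Poisson probability of fewer than n events in time t. The sketch below relies on that identity; the function names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam_t):
    """Probability of x events when the expected number of events is lam_t."""
    return lam_t**x * exp(-lam_t) / factorial(x)

def time_to_nth_cdf(t, n, rate):
    """prob(T_n <= t): the nth event has occurred by time t, which is the
    same as n or more events in (0, t]. Valid for integer n."""
    return 1.0 - sum(poisson_pmf(i, rate * t) for i in range(n))

rate = 4.0  # barges per hour
part_a = poisson_pmf(6, rate * 2.0)            # 6 arrivals in 2 hours
part_b = poisson_pmf(0, rate * 0.25)           # no arrivals in 15 minutes
part_c = 1.0 - time_to_nth_cdf(1.0, 3, rate)   # third arrival takes more than 1 hour

print(round(part_a, 4), round(part_b, 3), round(part_c, 3))
```

With n = 1 the same function gives the exponential cumulative distribution, so part (b) could equally be written as `1.0 - time_to_nth_cdf(0.25, 1, rate)`.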

Summary of Poisson Process

The Poisson process is a discrete process on a continuous time scale. Therefore, the probability
distribution of the number of events in a time T is a discrete distribution, whereas the probability
distributions for the time between events and the time to the nth event are continuous distributions.
For a Poisson process, the probability that an event will occur in a short time interval t to t + Δt
is λΔt for all t. The probability that more than one event occurs in Δt is negligible. The probability
distribution of the number of events in a given time T is the Poisson distribution. The exponential
distribution describes the time between events and the gamma distribution the time to the nth event.

Example 4.16. It has been proposed that an event-based rainfall simulation model can be
constructed by modeling the occurrence of rainstorms by a Poisson process and the amount of
rain in each storm by some continuous probability distribution. In this way, the time between
rainstorms would follow an exponential distribution, the time for X rainstorms would follow a
gamma distribution, and the number of rainstorms in a time interval would follow a Poisson
distribution. Duckstein et al. (1975) and Fogel et al. (1974) used a modification of this approach.
Part of Fogel et al.'s results are shown as figure 4.2.

Fig. 4.2. Distribution of occurrences of warm season rainfall in which the areal mean of five
gages in New Orleans, Louisiana, exceeded 0.50 inches and at least one gage recorded
more than 1.0 inch (horizontal axis: number of events per year). (Fogel et al. 1974).
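An event-based rainfall simulation of the kind described in example 4.16 might be sketched as follows. This is not the model of Duckstein et al. or Fogel et al.; it is a minimal illustration that assumes exponentially distributed storm depths, a rate of 10 storms per year, and a mean depth of 0.8 inch, all of which are hypothetical choices.

```python
import random
import statistics

def simulate_storms(rate, duration, mean_depth, rng):
    """Simulate one realization of a Poisson process of storms.

    rate       : average number of storms per unit time (assumed)
    duration   : length of the simulated period
    mean_depth : mean storm depth; exponential depths are an assumption here
    Returns a list of (arrival_time, depth) tuples.
    """
    storms = []
    t = rng.expovariate(rate)          # exponential time to the first storm
    while t < duration:
        storms.append((t, rng.expovariate(1.0 / mean_depth)))
        t += rng.expovariate(rate)     # exponential time between storms
    return storms

rng = random.Random(42)                # fixed seed for repeatability
counts = [len(simulate_storms(10.0, 1.0, 0.8, rng)) for _ in range(2000)]
mean_count = statistics.fmean(counts)
var_count = statistics.pvariance(counts)
print(round(mean_count, 2), round(var_count, 2))
```

Because the number of storms per year should follow a Poisson distribution, the sample mean and variance of the yearly counts should both be close to the assumed rate.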

MULTINOMIAL DISTRIBUTION
The binomial distribution can be generalized to include the probabilities of outcomes of
several types rather than the two possible outcomes of the binomial. If the probabilities associated
with each of k distinct outcomes are $p_1, p_2, \ldots, p_k$, then in n independent trials the probability of $x_1$
outcomes of type 1, $x_2$ outcomes of type 2, ..., $x_k$ outcomes of type k is given by the multinomial
distribution as

$$f_{\mathbf{X}}(\mathbf{x}; n, \mathbf{p}) = \frac{n!}{x_1!\,x_2!\cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$$

where $\mathbf{x}$ and $\mathbf{p}$ are 1 × k vectors. Some restrictions on this distribution are

$$\sum_{i=1}^{k} p_i = 1 \quad \text{and} \quad \sum_{i=1}^{k} x_i = n$$

The mean and variance of the multinomial distribution are

$$E(X_i) = np_i \qquad (4.25)$$

$$\mathrm{Var}(X_i) = np_i(1 - p_i) \qquad (4.26)$$

Example 4.17. On a certain stream the probability that the maximum peak flow during a l-year
period will be less than 5,000 cfs is 0.2 and the probability that it will be between 5,000 cfs and
10,000 cfs is 0.4. In a 20-year period, what is the probability of 4 peak flows less than 5,000 cfs
and 8 peak flows between 5,000 and 10,000 cfs?

Solution: To apply the multinomial distribution we define the third event as a peak flow in
excess of 10,000 cfs. This event has probability 1 - 0.2 - 0.4 = 0.4. The event of a peak flow
greater than 10,000 cfs must occur 20 - 4 - 8 = 8 times. The desired probability is

$$f_{\mathbf{X}}(4, 8, 8; 20, 0.2, 0.4, 0.4) = \frac{20!}{4!\,8!\,8!}(0.2)^4(0.4)^8(0.4)^8 = 0.0429$$

Comment: The expected result from 20 years of flood peak data would be

$$E(X_1) = np_1 = 20(0.2) = 4$$

$$E(X_2) = np_2 = 20(0.4) = 8$$

$$E(X_3) = np_3 = 20(0.4) = 8$$

This problem demonstrates that even though the expected results are 4, 8, and 8, the probability
of this happening is very low.
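The multinomial probability in example 4.17 can be evaluated with integer arithmetic for the coefficient, which avoids overflow and rounding in the factorials. The function name `multinomial_pmf` is illustrative.

```python
from math import factorial, prod

def multinomial_pmf(xs, ps):
    """Probability of xs[i] outcomes of type i in n = sum(xs) independent
    trials, where ps[i] is the probability of a type-i outcome."""
    coef = factorial(sum(xs))
    for x in xs:
        coef //= factorial(x)          # exact integer multinomial coefficient
    return coef * prod(p**x for p, x in zip(ps, xs))

xs = (4, 8, 8)          # counts of the three flow classes in 20 years
ps = (0.2, 0.4, 0.4)    # class probabilities
print(round(multinomial_pmf(xs, ps), 4))
print([sum(xs) * p for p in ps])   # expected counts n * p_i
```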

Exercises

4.1. Compute the terms of the binomial distribution with n = 10 and p = 0.2. Plot in the form
of a histogram.

4.2. Compute the terms of the cumulative binomial with n = 10 and p = 0.2. Plot the terms.

4.3. If a project is designed on a 10-year return period, what is the probability of at least 1
exceedance during the 10-year life of the project?

4.4. What design return period should be used to ensure a 95% chance that the design will not be
exceeded in a 25-year period?
4.5. Construct a curve relating the design return period to the life of a project when a 90 percent
chance of no exceedance is used.

4.6. What design return period should be used to ensure a 50% chance of no exceedance in a
10-year period?

4.7. What design return period should be used to ensure a 75% chance of no more than 1
exceedance in 10 years?

4.8. Construct an example where the Poisson is not a good approximation for the binomial.

4.9. In a certain locality contractors A, B, and C get about 50%, 25% and 25% respectively of
all water resources projects. Five contracts are coming up for bid. What is the probability that
contractor A will get all 5 jobs? What is the probability that A will get 2 jobs and B will get
2 jobs?

4.10. In 100 years the following number of floods were recorded at a specific location. Draw a
relative frequency histogram of the data. Fit a Poisson distribution to the data and plot the relative
frequencies according to the Poisson distribution on the histogram. Is the Poisson a good
approximation for the data?

No. of floods No. of occurrences

4.11. Based on a Poisson approximation to the data of exercise 4.10, what is the probability of 5
successive years without a flood?

4.12. Based on a Poisson approximation to the data of exercise 4.10, what is the probability of
exactly five years between floods?

4.13. Compute the probability of at least 1 n-year event in a k-year period using (a) n = 100,
k = 20; (b) n = 500, k = 50.

4.14. Using the Poisson approximation to the binomial distribution show that the probability of
at least one occurrence of a T-year event in T years is 0.632.
4.15. The Bernoulli distribution is given by

$$f_X(x; p) = p^x (1 - p)^{1-x} \qquad x = 0, 1$$

What is E(X) and Var(X) for this distribution?

4.16. Use the Poisson distribution to approximate the binomial distribution of exercise 4.1. Plot
the terms of this Poisson distribution on the histogram of exercise 4.1.

4.17. Two widely separated watersheds are selected for a study on peak discharges. If the
occurrence of flood flows on the two basins can be considered as independent events, what is the
probability of experiencing a total of five 20-year events on the two watersheds in a 10-year period?

4.18. A well-known scientist has predicted that during a certain 3-year period a severe drought
will occur on the plains east of the Rocky Mountains. He made this prediction based on his
observance of sunspot activity. If the probability of a drought is 0.10 in any year, what is the
probability that the scientist's prediction will come true if the occurrence of a drought is a strictly
random phenomenon unrelated to sunspot activity?

4.19. In a certain region there are 20 possible small watersheds suitable for a research project. Un-
known to the project manager, 6 of these basins have subsurface geological features that permit
large quantities of surface water to enter underground formations and leave the basin via subsurface
flow. The project manager wants to select 6 watersheds from the 20 for study. (a) What is the prob-
ability that 1 of the basins having the above described geologic features will be selected? (b) What
is the probability that 3 of these basins will be selected? (c) What is the probability that at least one
of the basins will be selected? (d) What is the probability that all of these basins will be selected?

4.20. In the situation described in exercise 4.19 the project manager wants to pick 3 pairs of
watersheds for the evaluation of an evapotranspiration suppressant. One basin in each pair will
be used for a control and one will be treated with the suppressant. What is the probability that all
of the control watersheds will have the geologic problem while all of the rest will not?

4.21. It is desired to model the number of rainy days in July and August as a Bernoulli process.
Based on the data below and the assumption that the Bernoulli model is applicable: (a) What is
the probability of 10 or more rainy days in each of the months of July and August? (b) What is
the probability of 20 rainy days in the 2-month period? (c) What assumptions concerning the
Bernoulli process are likely violated by this problem? For this problem write answers in terms of
summations. Do not evaluate the summations.

Year 1 2 3 4 5 6 7 8 9 10
No. of rainy days
July 10 15 17 8 9 19 17 14 20 4
August 4 9 8 3 0 10 12 2 8 6

4.22. For the binomial distribution show that $f_X(x; n, p) = f_X(x - 1; n - 1, p)\, f_X(1; 1, p) +
f_X(x; n - 1, p)\, f_X(0; 1, p)$. Write out a narrative description of the meaning of this equation.

4.23. Work exercise 4.21 using the Poisson distribution to approximate the binomial.

4.24. Pool the data of exercise 4.21 so that a single estimate is obtained for p of the binomial distri-
bution. Compute the probability of 20 rainy days in the 2-month period of July-August. Compare
this probability to the one computed in part b of exercise 4.21. Which answer would you prefer?

4.25. Using the data of exercise 4.21, what is the probability that the sixth wet day of August
occurs on August 29, 30, or 31?

4.26. Show that for the Poisson process the time for n occurrences follows the gamma distribution.
(Hint: Use the method of derived distributions to find the distribution of the time to 2 occurrences.
Using the distribution of the time to 2 occurrences the method of derived distributions can be used
to get the time to 3 occurrences. This process can then be repeated until a pattern emerges.
Induction could also be used by showing that if the time for n - 1 occurrences is given by equation 4.20
by substituting n - 1 for n then the time for n occurrences is given by equation 4.20. Also, the time
for 1 occurrence is given by equation 4.19, which is the same as equation 4.20 with n = 1.)
5. Normal Distribution
THE MOST widely used and most important continuous probability distribution is the
Gaussian, or normal distribution. The normal distribution has been widely used because of its
early connection with the "Theory of Errors" and because it has certain useful mathematical
properties. Many statistical techniques such as analysis of variance and the testing of certain
hypotheses rely on the assumption of normality. The errors involved in incorrectly assuming
normality (purposely or unknowingly) depend on the use under consideration. Many statistical
methods derived under the assumption of normality remain approximately valid when moderate
departures from normality are present and as such are said to be robust.
The very name "normal" distribution is misleading in that it implies that random variables
that are not normally distributed are abnormal in some sense. The Central Limit Theorem indicates
the conditions under which a random variable can be expected to be normally distributed. In a
strict theoretical sense, most hydrologic variables cannot be normally distributed because the
range on any random variable that is normally distributed is the entire real line (-∞ to +∞). Thus
non-negative variables such as rainfall, streamflow, reservoir storage, and so on, cannot be strictly
normally distributed. However, if the mean of a random variable is 3 or 4 times greater than its
standard deviation, the probability of a normal random variable being less than zero is very small
and can in many cases be neglected.

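GENERAL NORMAL DISTRIBUTION
The normal distribution is a 2-parameter distribution whose density function is

$$p_X(x) = \frac{1}{\sqrt{2\pi\theta_2^2}} \exp\left[-\frac{(x - \theta_1)^2}{2\theta_2^2}\right] \qquad -\infty < x < \infty$$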
NORMAL DISTRIBUTION 101

Fig. 5.1. Normal distributions with same mean and different variances.

Fig. 5.2. Normal distributions with same variance and different means.

In examples 3.3 and 3.5 it was shown that if either the method of moments or the method of
maximum likelihood is used to estimate the two parameters of this distribution, the result is $\theta_1 = \mu$
and $\theta_2^2 = \sigma^2$, where $\mu$ and $\sigma^2$ are the mean and variance of X, respectively. For this reason the
normal distribution is generally written as

$$p_X(x) = (2\pi\sigma^2)^{-1/2} e^{-(x-\mu)^2/2\sigma^2} \qquad -\infty < x < \infty$$

Thus, the normal distribution is a 2-parameter distribution which is bell-shaped, continuous,
and symmetrical about $\mu$ (the coefficient of skew is zero). If $\mu$ is held constant and $\sigma^2$ varied, the
distribution changes as in figure 5.1. If $\sigma^2$ is held constant and $\mu$ varied, the distribution does not
change scale but does change location as in figure 5.2. The parameters $\mu$ and $\sigma^2$ are sometimes
denoted as location and scale parameters. A common notation for indicating that a random
variable is normally distributed with mean $\mu$ and variance $\sigma^2$ is $N(\mu, \sigma^2)$.

REPRODUCTIVE PROPERTIES
If a random variable $X$ is $N(\mu, \sigma^2)$ and $Y = a + bX$, the distribution of $Y$ can be shown to
be $N(a + b\mu, b^2\sigma^2)$. Furthermore, if $X_i$ for $i = 1, 2, \ldots, n$ are independently and normally
distributed with mean $\mu_i$ and variance $\sigma_i^2$, then $Y = a + b_1X_1 + b_2X_2 + \cdots + b_nX_n$ is normally
distributed with

$$\mu_Y = a + \sum_{i=1}^{n} b_i\mu_i \qquad (5.2)$$

and

$$\sigma_Y^2 = \sum_{i=1}^{n} b_i^2\sigma_i^2 \qquad (5.3)$$

Any linear function of independent normal random variables is also a normal random variable.
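
The reproductive property can be sketched numerically. The short Python example below is only an illustration of equations 5.2 and 5.3; the function name and the sample values ($n = 4$ observations from $N(10, 9)$) are assumptions made for the example and do not come from the text.

```python
def linear_combination_normal(a, coeffs, means, variances):
    """Mean and variance of Y = a + sum(b_i * X_i) for independent normal X_i.

    Implements equations 5.2 and 5.3; Y itself is then N(mean_y, var_y).
    """
    if not (len(coeffs) == len(means) == len(variances)):
        raise ValueError("coeffs, means, and variances must have equal length")
    mean_y = a + sum(b * m for b, m in zip(coeffs, means))
    var_y = sum(b * b * v for b, v in zip(coeffs, variances))
    return mean_y, var_y


# Hypothetical case: the sample mean of n = 4 observations from N(10, 9)
# is the linear combination with a = 0 and every b_i = 1/n.
n = 4
mean_xbar, var_xbar = linear_combination_normal(0.0, [1.0 / n] * n, [10.0] * n, [9.0] * n)
print(mean_xbar, var_xbar)
```

The variance returned for the sample mean is the population variance divided by $n$, which is the result worked out by hand in example 5.1.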

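Example 5.1. If $x_i$ is a random observation from the distribution $N(\mu, \sigma^2)$, what is the
distribution of $\bar{X} = \sum_{i=1}^{n} x_i/n$?

Solution: $\bar{X}$ is a linear function of the $x_i$ given by $\bar{X} = (x_1 + x_2 + \cdots + x_n)/n$. From equations 5.2 and
5.3 and the reproductive properties of the normal distribution, $\bar{X}$ is normally distributed with mean

$$\mu_{\bar{X}} = \sum_{i=1}^{n} \frac{1}{n}\mu = \mu$$

and variance

$$\sigma_{\bar{X}}^2 = \sum_{i=1}^{n} \frac{1}{n^2}\sigma^2 = \frac{\sigma^2}{n}$$

Therefore, $\bar{X}$ is $N(\mu, \sigma^2/n)$.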

STANDARD NORMAL DISTRIBUTION


The cumulative distribution function for the normal distribution is

$$P_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(t - \mu)^2}{2\sigma^2}\right] dt \qquad (5.4)$$

Unfortunately, equation 5.4 cannot be evaluated analytically. Approximate methods of
integration are required. If a tabulation of the integral were made, a separate table would be required
for each value of $\mu$ and $\sigma^2$. By using the linear transformation

$$Z = \frac{X - \mu}{\sigma}$$

the random variable $Z$ will be $N(0, 1)$. This is a special case of $a + bX$ with $a = -\mu/\sigma$ and
$b = 1/\sigma$. The random variable $Z$ is said to be standardized (has $\mu = 0$ and $\sigma^2 = 1$) and $N(0, 1)$ is said
to be the standard normal distribution. The standard normal distribution is given by

$$p_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \qquad -\infty < z < \infty \qquad (5.5)$$

and the cumulative standard normal is given by

$$P_Z(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$

Fig. 5.3. Standard normal distribution ($\mu = 0$, $\sigma^2 = 1$).

Figure 5.3 shows the standard normal distribution which, along with the transformation $Z =
(X - \mu)/\sigma$, contains all of the information shown in figures 5.1 and 5.2. Both $p_Z(z)$ and $P_Z(z)$ are
widely tabulated. Most tables utilize the symmetry of the normal distribution so that only
positive values of $Z$ are shown. Tables of $P_Z(z)$ may show prob($Z < z$), prob($0 < Z < z$), or
prob($-z < Z < z$). Care must be exercised when using normal probability tables to see what values are
tabulated. The table of $P_Z(z)$ in the appendix gives prob($Z < z$). There are many routines
programmed into computer software to evaluate the normal pdf and cdf. Some approximations for
the standard normal distribution are given below.
A table of $P_Z(z)$ shows that 68.26% of the normal distribution is within 1 standard deviation
of the mean, 95.44% within 2 standard deviations of the mean, and 99.74% within 3 standard
deviations of the mean. These are called the 1, 2, and 3 sigma bounds of the normal distribution.
The fact that only 0.26% of the area of the normal distribution lies outside the 3 sigma bound
demonstrates that the probability of a value less than $\mu - 3\sigma$ is only 0.0013 and is the
justification for using the normal distribution in some instances even though the random variable under
consideration may be bounded by $X = 0$. If $\mu$ is greater than $3\sigma$, the chance that $X$ is less than
zero is often negligible (this is not always true, however).

Example 5.2. Compare the 1, 2, and 3 sigma bounds under the assumption of normality and
under no distributional assumptions using Chebyshev's inequality.

Solution: The 1, 2, and 3 sigma bounds of $N(\mu, \sigma^2)$ contain 68.26, 95.44, and 99.72% of the
distribution. Thus, the probability that $X$ deviates more than $\sigma$, $2\sigma$, and $3\sigma$ from $\mu$ is 0.3174,
0.0456, and 0.0028, respectively.

Chebyshev's inequality states that prob($|X - \mu| > k\sigma$) $< 1/k^2$. This corresponds to a
probability that $X$ deviates more than $\sigma$, $2\sigma$, and $3\sigma$ from $\mu$ of less than 1.00, less than 0.25, and
less than 0.11, respectively.

Comment: By making no distributional assumptions, we are forced to make very conservative
probability statements. It is emphasized that Chebyshev's inequality gives an upper bound to the
probability and not the probability itself.
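
The comparison in example 5.2 can be reproduced with a few lines of Python. This is a minimal sketch, assuming the standard normal cdf is evaluated with the error function in the standard library; the function names are illustrative. Values computed this way may differ from the tabulated ones in the fourth decimal place because the tables are rounded.

```python
import math


def std_normal_cdf(z):
    """prob(Z < z) for the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sided_tail(k):
    """prob(|X - mu| > k*sigma) when X is normal."""
    return 2.0 * (1.0 - std_normal_cdf(k))


def chebyshev_bound(k):
    """Upper bound on prob(|X - mu| > k*sigma) with no distributional assumption."""
    return min(1.0, 1.0 / k**2)


for k in (1, 2, 3):
    print(f"k = {k}: normal = {two_sided_tail(k):.4f}, Chebyshev bound = {chebyshev_bound(k):.4f}")
```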

Example 5.3. As an example of using tables of the normal distribution consider a sample drawn
from an $N(15, 25)$. What is the prob($15.6 < X < 20.4$)?

Solution: The desired probability could be evaluated from

$$\text{prob}(15.6 < X < 20.4) = \int_{15.6}^{20.4} \frac{1}{\sqrt{2\pi(25)}} \exp\left[-\frac{(x - 15)^2}{2(25)}\right] dx$$

However, this integral is difficult to evaluate. Making use of the standard normal distribution, we
can transform the limits on $X$ to limits on $Z$ and then use standard normal tables.

$x = 15.6$ transforms to $z = (15.6 - 15.0)/5 = 0.12$

$x = 20.4$ transforms to $z = (20.4 - 15.0)/5 = 1.08$

The desired probability is

$$\text{prob}(0.12 < Z < 1.08) = P_Z(1.08) - P_Z(0.12)$$

From the standard normal table $P_Z(1.08) = 0.860$ and $P_Z(0.12) = 0.548$. The desired
probability is $0.860 - 0.548$, or 0.312.
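
The table lookup in example 5.3 can be checked in software. The sketch below assumes the normal cdf is computed from `math.erf`; the helper names are hypothetical and chosen only for this illustration.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """prob(X < x) for X ~ N(mean, std**2), using the transformation z = (x - mean)/std."""
    z = (x - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_interval_prob(lower, upper, mean, std):
    """prob(lower < X < upper) as a difference of two cdf values."""
    return normal_cdf(upper, mean, std) - normal_cdf(lower, mean, std)


# N(15, 25) has standard deviation 5.
p = normal_interval_prob(15.6, 20.4, mean=15.0, std=5.0)
print(f"prob(15.6 < X < 20.4) = {p:.4f}")
```

The result should agree with the tabulated answer of 0.312 to about three decimal places.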

APPROXIMATIONS FOR STANDARD NORMAL DISTRIBUTION


Maidment (1993) presents several approximations for the normal distribution. Let $P_Z(z) = p$
for $0.005 \le P_Z(z) \le 0.995$ where $Z$ is the standard normal variate. Then $z$ can be approximated
from

Let $y = -\ln(2p)$. For $0.005 < P_Z(z) < 0.5$, an approximation for $z$ is given by

An approximation for $P_Z(z)$ for positive values of $z$ is given by

$$P_Z(z) = 1 - 0.5 \exp\left[-\frac{(83z + 351)z + 562}{703/z + 165}\right]$$

Of course, for negative values of $z$, $P_Z(|z|)$ for the absolute value of $z$ can be obtained and then
$P_Z(z) = 1 - P_Z(|z|)$.
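
The approximation for $P_Z(z)$ is easy to program. The following Python sketch implements the formula as quoted above and compares it with an error-function evaluation; the function names are illustrative, and returning 0.5 at $z = 0$ (where the formula divides by $z$) is an assumption based on the limiting value.

```python
import math


def normal_cdf_exact(z):
    """Standard normal cdf from the error function, used here as the reference."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_cdf_approx(z):
    """Approximation 1 - 0.5*exp(-((83z + 351)z + 562)/(703/z + 165)) for z > 0.

    Negative z is handled by symmetry: P(z) = 1 - P(|z|).
    """
    if z == 0.0:
        return 0.5  # limiting value; the formula itself divides by z
    a = abs(z)
    upper = 1.0 - 0.5 * math.exp(-((83.0 * a + 351.0) * a + 562.0) / (703.0 / a + 165.0))
    return upper if z > 0 else 1.0 - upper


for z in (-0.9, 0.12, 1.08, 2.0):
    print(f"z = {z:5.2f}  approx = {normal_cdf_approx(z):.5f}  exact = {normal_cdf_exact(z):.5f}")
```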

Example 5.4. Use a normal approximation to determine prob($10.5 < X < 20.4$) if $X$ is
distributed $N(15, 25)$.

Solution: The limits on $X$ transform to $z = (10.5 - 15.0)/5 = -0.9$ and $z = (20.4 - 15.0)/5 = 1.08$.
Using the approximation for $P_Z(z)$

$$\text{prob}(z < 1.08) = 1 - 0.5 \exp\left[-\frac{[(83)(1.08) + 351]1.08 + 562}{703/1.08 + 165}\right] = 0.860$$

prob($0 < z < 1.08$) $= 0.85987 - 0.50000 = 0.360$

Similarly, prob($z < 0.9$) $= 0.816$ so that
prob($0 < z < 0.9$) $= 0.316$ and
prob($-0.9 < z < 1.08$) $= 0.360 + 0.316 = 0.676$

Comment: Often, in solving problems of this type, it is useful to sketch a normal distribution and
then shade in the area corresponding to the desired probability. For this problem the sketch would
be as in figure 5.4.

Fig. 5.4. Prob($-0.9 < z < 1.08$).
Example 5.5. Repeat example 3.7 assuming the Kentucky River data is normally distributed.

Solution: Since $X$ is assumed normal, $\bar{X}$ is $N(\mu, 22{,}322^2/n)$. Therefore,
$Z = (\bar{X} - \mu)/(22{,}322/\sqrt{n})$ is $N(0, 1)$.
From the problem statement $|\bar{X} - \mu| < 10{,}000$. So $n$ must be determined so that

$$\text{prob}\left(-\frac{10{,}000}{22{,}322/\sqrt{n}} < Z < \frac{10{,}000}{22{,}322/\sqrt{n}}\right) = 0.95$$

From the standard normal table it is seen that 95% of the normal distribution is enclosed by
$-1.96 < Z < 1.96$. From this $n$ is calculated as

$$\frac{10{,}000\sqrt{n}}{22{,}322} = 1.96$$

or at least 19 observations are required to be 95% sure that $\bar{X}$ is within 10,000 cfs of $\mu$ if $X$ is
$N(\mu, 22{,}322^2)$.
Comment: By assuming normality, the required minimum number of observations has been
reduced from 100 to 19. The Law of Large Numbers has placed a lower limit on $n$ without
knowledge of the distribution of $X$. The price for this ignorance of the distribution of $X$ is seen to be
very great if in fact $X$ is normally distributed.

CENTRAL LIMIT THEOREM


The conditions under which a random variable might be expected to follow a normal
distribution are specified by the Central Limit Theorem.

If $S_n$ is the sum of $n$ independently and identically distributed random variables $X_i$
each having a mean, $\mu$, and variance, $\sigma^2$, then in the limit as $n$ approaches infinity, the
distribution of $S_n$ approaches a normal distribution with mean $n\mu$ and variance $n\sigma^2$.

In practice, if the $X_i$ are identically and independently distributed, $n$ does not have to be very
large for $S_n$ to be approximated by a normal distribution. If interest lies in the central part of the
distribution of $S_n$, values of $n$ as small as 5 or 6 will result in the normal distribution producing
reasonable approximations to the true distribution of $S_n$. If interest lies in the tails of the
distribution of $S_n$, as it often does in hydrology, larger values of $n$ may be required.
As stated above, the Central Limit Theorem is of limited value in hydrology since most
hydrologic variables are not the sum of a large number of independently and identically
distributed random variables. Fortunately, under some very general conditions it can be shown that if $X_i$
for $i = 1, 2, \ldots, n$ is a random variable independent of $X_j$ for $j \ne i$ and $E(X_i) = \mu_i$ and
Var($X_i$) $= \sigma_i^2$, then the sum $S_n = X_1 + X_2 + \cdots + X_n$ approaches a normal distribution with
$E(S_n) = \sum_{i=1}^{n} \mu_i$ and Var($S_n$) $= \sum_{i=1}^{n} \sigma_i^2$ as $n$ approaches infinity (Thomas 1971). One condition for this
generalized Central Limit Theorem is that each $X_i$ has a negligible effect on the distribution of $S_n$
(i.e., there cannot be one or two dominating $X_i$'s).

This general theorem is very useful in that it says that if a hydrologic random variable is the
sum of $n$ independent effects and $n$ is relatively large, the distribution of the variable will be
approximately normal. Again, how large $n$ must be depends on the area of interest (central part or
tail of the distribution) and on how good an approximation is needed.

Example 5.6. In the last chapter the gamma distribution for integer values of $n$ was derived
as the sum of $n$ exponentially distributed random variables. The mean and variance of the
exponential distribution are given as $1/\lambda$ and $1/\lambda^2$, respectively. The Central Limit Theorem
gives the mean and variance of the sum of $n$ values from the exponential distribution as $n/\lambda$
and $n/\lambda^2$ for large $n$. This agrees with the mean and variance of the gamma distribution.
In chapter 6, the coefficient of skew of the gamma distribution is given as $2/\sqrt{n}$, which
approaches zero as $n$ gets large. Thus, the sum of $n$ random variables from an exponential
distribution is a gamma distribution which approaches a normal distribution (with $\gamma$
approaching 0) as $n$ gets large.
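
A small simulation can illustrate example 5.6. The Python sketch below is only a demonstration under assumed values ($n = 30$, $\lambda = 2$, 20,000 trials, a fixed seed); none of these numbers or function names come from the text. It compares the sample mean, variance, and skew of simulated sums with $n/\lambda$, $n/\lambda^2$, and $2/\sqrt{n}$.

```python
import math
import random
import statistics


def simulate_exponential_sums(n, lam, trials, seed=0):
    """Return `trials` sums, each the sum of n independent exponential(lam) draws."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(lam) for _ in range(n)) for _ in range(trials)]


def sample_skew(values):
    """Coefficient of skew computed as the mean cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


n, lam = 30, 2.0  # assumed illustration values
sums = simulate_exponential_sums(n, lam, trials=20000)
sim_mean = statistics.fmean(sums)
sim_var = statistics.pvariance(sums)
sim_skew = sample_skew(sums)

print(f"mean: simulated {sim_mean:.3f}, theory {n / lam:.3f}")
print(f"variance: simulated {sim_var:.3f}, theory {n / lam**2:.3f}")
print(f"skew: simulated {sim_skew:.3f}, theory {2 / math.sqrt(n):.3f}")
```

Increasing `n` in this sketch should push the simulated skew toward zero, which is the sense in which the gamma distribution approaches the normal.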

CONSTRUCTING PDF CURVES FOR DATA


Frequently, the histogram of a set of observed data suggests that the data may be
approximated by a particular probability density function. One way to investigate the goodness of this
approximation is by superimposing a pdf on the frequency histogram and then visually
comparing the two distributions. Statistical procedures for testing the hypothesis that a set of data can be
approximated by a particular distribution are given in chapter 8.
Consider the data of table 2.1 and the frequency histogram of figure 2.6. The probability (or
relative frequency) of a peak flow in any one of the class intervals assuming a normal
distribution can be obtained by integrating the normal distribution over the limits of the class interval.
For example, the expected (according to the normal distribution) relative frequency in the first
interval can be calculated from

$$\int_{20{,}000}^{30{,}000} \frac{1}{22{,}322\sqrt{2\pi}} \exp\left[-\frac{(x - 66{,}540)^2}{2(22{,}322)^2}\right] dx$$

because the mean of the data is 66,540 cfs and the standard deviation is 22,322 cfs. This integral
is easily evaluated using standard normal tables as 0.0322.
An approximation to the relative frequency in a class interval can also be made by using
equation 2.25b.

$$f_{x_i} = \Delta x_i\, p_X(x_i)$$

Using the standard normal distribution through the transformation $z_i = (x_i - \bar{x})/s$, this becomes

$$f_{x_i} = \Delta x_i\, p_Z(z_i)/\sigma$$

For the first class interval $\Delta x_i = 10{,}000$, $z_i = (25{,}000 - 66{,}540)/22{,}322 = -1.8609$, $p_Z(z_i) =
0.0706$ (from equation 5.5), and $\sigma$ is estimated by $s = 22{,}322$, so that

$$f_{25{,}000} = 10{,}000(0.0706)/22{,}322 = 0.0316$$

Similar calculations for each of the class intervals are shown in table 5.1, with the results plotted
in figure 5.5. The sum of the expected relative frequencies is not 1 because the entire range of the
normal distribution was not covered.

Table 5.1. Expected relative frequencies according to the normal distribution for the Kentucky River data

Class mark $x_i$ (cfs)    Expected relative frequency $f_{x_i}$
 25,000                   0.0316
 35,000                   0.0659
 45,000                   0.1122
 55,000                   0.1564
 65,000                   0.1783
 75,000                   0.1663
 85,000                   0.1270
 95,000                   0.0793
105,000                   0.0405
115,000                   0.0169
Sum                       0.9744

Fig. 5.5. Comparison of normal distribution with the observed distribution, Kentucky River
peak flows (horizontal axis: peak flow, 1000 cfs).

The procedure of integrating $p_X(x)$ over each class interval or of using equation 2.25b can
be used for any continuous probability distribution to get the expected relative frequencies for
that distribution.
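
Both ways of obtaining expected relative frequencies can be sketched in Python. The example below uses the mean (66,540 cfs) and standard deviation (22,322 cfs) quoted in the text; the ten class marks from 25,000 to 115,000 cfs in steps of 10,000 are an assumption inferred from the first class interval and the length of table 5.1, and the function names are illustrative.

```python
import math


def std_normal_pdf(z):
    """Standard normal density (equation 5.5)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mean, std = 66540.0, 22322.0  # values quoted in the text
width = 10000.0               # class interval width
class_marks = [25000.0 + width * i for i in range(10)]  # assumed class marks

# Midpoint approximation: interval width times the density at the class mark.
approx = [width * std_normal_pdf((x - mean) / std) / std for x in class_marks]

# Integration of the density over each class interval, using the cdf.
exact = [
    std_normal_cdf((x + width / 2 - mean) / std) - std_normal_cdf((x - width / 2 - mean) / std)
    for x in class_marks
]

for x, a, e in zip(class_marks, approx, exact):
    print(f"{x:9,.0f}  midpoint = {a:.4f}  integrated = {e:.4f}")
print(f"sum of midpoint values = {sum(approx):.4f}")
```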

NORMAL APPROXIMATIONS FOR OTHER DISTRIBUTIONS


The normal distribution can be shown to be a good approximation to several other
distributions, both discrete and continuous. Before using the normal to approximate some other
distribution, care must be taken to see that the conditions for the approximation to be valid are
met. Generally, the approximations are quite good in the central part of the distribution with the
accuracy dropping off in the tails of the distribution. Throughout our study of distributions, the
sensitivity of the tails of distributions to distributional assumptions will be of concern. This is
of particular importance in hydrology, when the magnitude of a rare event is to be estimated,
because this estimate must come from the tail of the distribution being used.
Whenever a continuous distribution is used to approximate a discrete distribution,
half-interval corrections must be applied to the continuous distribution. For example, the probability
that $X$ is equal to some positive integer $x$ can be evaluated for a discrete distribution. This same
probability is zero if a continuous distribution is used. When a continuous distribution is used to
approximate the prob($X = x$), the prob($x - \frac{1}{2} < X < x + \frac{1}{2}$) must be evaluated. This illustrates
the general rule that a $\frac{1}{2}$ interval correction must be added to the upper limit and subtracted from
the lower limit. The prob($X = x, x + 1, x + 2, \ldots, y$) in a discrete case is approximated by
prob($x - \frac{1}{2} < X < y + \frac{1}{2}$) in the continuous case. The prob($X \le x$) in a discrete case is
approximated by prob($X < x + \frac{1}{2}$) in the continuous case. More examples of these corrections are
shown in table 5.2.
The Central Limit Theorem provides the mechanism by which the normal distribution
becomes an approximation for several other distributions.

Binomial Distribution
It was stated in chapter 4 that if $X$ is a binomial random variable with parameters $n_1$ and
$p$ and $Y$ is a binomial random variable with parameters $n_2$ and $p$, then $Z = X + Y$ is a
binomial random variable with parameters $n = n_1 + n_2$ and $p$. Extending this to the sum of
several binomial random variables, the Central Limit Theorem would indicate that the normal
distribution approximates the binomial distribution if $n$ is large. Thus, as $n$ gets large the
distribution of

$$Z = \frac{X - np}{\sqrt{np(1 - p)}}$$

approaches a $N(0, 1)$. This is sometimes known as the DeMoivre-Laplace limit theorem (Mood
et al. 1974).

Table 5.2. Corrections for approximating a discrete random variable by a continuous random variable

Discrete Continuous

Example 5.7. $X$ is a binomial random variable with $n = 25$ and $p = 0.3$. Compare the binomial
and normal approximation to the binomial for evaluating the prob($5 < X \le 8$).

Solution: Using the binomial distribution this is equivalent to

$$\text{prob}(5 < X \le 8) = \sum_{x=6}^{8} \binom{25}{x} (0.3)^x (0.7)^{25-x} = 0.483$$

Using the normal approximation, the probability is determined as prob($5.5 < X < 8.5$), which is
0.476. Therefore, the exact probability of 0.483 is approximated by the normal to be 0.476 for an
$n$ of 25.
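
Example 5.7 can be checked with a short Python sketch. The code below assumes the usual binomial mean $np$ and variance $np(1 - p)$ and applies the half-interval correction at both limits; the function names are illustrative. The normal value computed this way can differ from the table-based 0.476 in the third decimal place because of table rounding.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binomial_range_exact(lo, hi, n, p):
    """prob(lo <= X <= hi) summed directly from the binomial probabilities."""
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(lo, hi + 1))


def binomial_range_normal(lo, hi, n, p):
    """Normal approximation with half-interval corrections at both limits."""
    mu = n * p
    sigma = math.sqrt(n * p * (1.0 - p))
    return std_normal_cdf((hi + 0.5 - mu) / sigma) - std_normal_cdf((lo - 0.5 - mu) / sigma)


# prob(5 < X <= 8) is prob(X = 6, 7, or 8) for n = 25, p = 0.3.
exact = binomial_range_exact(6, 8, 25, 0.3)
approx = binomial_range_normal(6, 8, 25, 0.3)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```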

Negative Binomial Distribution


Following reasoning similar to that given for the binomial distribution, the negative bino-
mial distribution with large k can be approximated by a normal distribution. In the case of the
negative binomial, the distribution of

approaches N(0, 1) as k gets large.

Example 5.8. Work example 4.11 using the normal approximation for the negative binomial.

Solution: The desired probability is prob(39.5 < X < 40.5). Using the standard normal distri-
bution, the limits on Z are

This compares favorably with the 0.0206 computed using the negative binomial.

Poisson Distribution
The sum of two Poisson random variables with parameters $\lambda_1$ and $\lambda_2$ is also a Poisson
random variable with parameter $\lambda = \lambda_1 + \lambda_2$. Extending this to the sum of a large number of
Poisson random variables, the Central Limit Theorem indicates that for large $\lambda$, the Poisson may
be approximated by a normal distribution. In this case the distribution of

$$Z = \frac{X - \lambda}{\sqrt{\lambda}}$$

approaches an $N(0, 1)$. Since the Poisson is the limiting form of the binomial and the binomial
can be approximated by the normal, it is no surprise that the Poisson can also be approximated by
the normal.
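
The quality of the normal approximation to the Poisson can be explored with the sketch below. It assumes the Poisson mean and variance are both equal to the Poisson parameter and applies the half-interval correction for prob($X \le x$); the parameter values 2, 9, and 50 and the function names are chosen only for illustration.

```python
import math


def poisson_cdf(x, lam):
    """Exact prob(X <= x) for a Poisson variable with parameter lam."""
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(x + 1))


def poisson_cdf_normal(x, lam):
    """Normal approximation to prob(X <= x) with the half-interval correction."""
    z = (x + 0.5 - lam) / math.sqrt(lam)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


for lam in (2.0, 9.0, 50.0):
    x = int(lam)  # evaluate near the center of the distribution
    exact = poisson_cdf(x, lam)
    approx = poisson_cdf_normal(x, lam)
    print(f"lambda = {lam:4.0f}: exact = {exact:.4f}, normal = {approx:.4f}, error = {approx - exact:+.4f}")
```

For these illustrative cases the error near the center should shrink as the parameter grows; agreement in the tails is generally poorer, as the text cautions.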

Continuous Distributions
Many continuous distributions can be approximated by the normal distribution for certain
values of their parameters. For instance, in example 5.6, it was shown that for large $n$ the gamma
distribution approaches the normal distribution. To make these approximations one merely
equates the mean and variance of the distribution to be approximated to the mean and variance of
the normal and then uses the fact that

$$Z = \frac{X - \mu}{\sigma}$$

is $N(0, 1)$ if $X$ is $N(\mu, \sigma^2)$. Not all continuous distributions can be approximated by the normal,
and for those that can the approximation is only valid for certain parameter values. Things to
look for are parameters that produce near zero skew, symmetry, and tails that asymptotically
approach $p_X(x) = 0$ as $X$ approaches large and small values. Again, it is emphasized that
approximations in the tails of the distributions may not be as good as in the central region of the
distribution.

Exercises

5.1. Consider sampling from a normal distribution with a mean of 0 and a variance of 1. What is
the probability of selecting (a) an observation between 0.5 and 1.5? (b) an observation outside the
interval -0.5 to +0.5? (c) 3 observations inside and 2 observations outside the interval of 0.5 and
1.5? (d) 4 observations inside the interval 0.5 to 1.5 exactly two of which are not in the interval
-0.5 to 1.0?

5.2. What is the probability of selecting an observation at random from an N(100,2500) that is
(a) less than 75? (b) equal to 75?

5.3. For the Kentucky River data of table 2.1, what is the probability of a peak flow exceeding
100,000 cfs if the peaks are assumed to be normally distributed?
5.4. Construct the theoretical distribution for the data of exercise 2.2 if it is assumed that the data
are normally distributed. From a visual comparison with the data histogram, would you say the
data are normally distributed?

5.5. Work exercise 4.1 using the normal approximation to the binomial and plot the results on the
histogram developed for exercise 4.1.

5.6. Show that if $X$ is $N(\mu, \sigma^2)$ then $Y = a + bX$ is $N(a + b\mu, b^2\sigma^2)$.


5.7. For a particular set of data the coefficient of variation is 0.4. If the data are normally dis-
tributed, what percent of the data will be less than 0.0?

5.8. A sample of 150 observations has a mean of 10,000, a standard deviation of 2,500 and is
normally distributed. Plot a frequency histogram showing the number of observations expected
in each interval.

5.9. The appendix contains a listing of the annual runoff from Cave Creek watershed near Fort
Spring, Kentucky. What is the probability that the true mean annual runoff is less than 14.0 in. if
one can assume the true variance is 22.56 in.$^2$? What other assumptions are needed?

5.10. Random digits are the numbers 0, 1, 2, ..., 9 selected in such a fashion that each is equally
likely (i.e., has probability 1/10 of being selected). An experiment is performed by selecting 5
random digits, adding them together and calling their sum X. The experiment is repeated 10
times and X is calculated. What is the probability that X is less than 21.5? (Exercise 13.9 requires
that this experiment be carried out.)

5.11. Plot the individual terms of the Poisson distribution for $\lambda = 2$. Approximate the Poisson
by the normal and plot the normal approximations on the same graph.

5.12. Repeat exercise 5.11 for $\lambda = 9$.

5.13. Assume the data of exercise 4.21 is normally distributed. (a) Within each month what is
the probability of 10 or more rainy days? (b) What is the probability of 20 or more rainy days in
the July-August period? (c) What is the difference in assuming the data are normally distrib-
uted, and in assuming the data are binomially distributed and approximating the binomial with
the normal?

5.14. Plot the observed frequency histogram and the frequency histogram expected from the nor-
mal distribution for the annual peak flows for the following rivers. Discuss how well the normal
approximates the data in terms of the coefficient of variation and skewness. (Note: data are in the
appendix or may be obtained from the Internet).

a) North Llano River near Junction, Texas

b) Cumberland River at Cumberland Falls, Kentucky

c) Piscataquis River near Dover-Foxcroft, Maine

5.15. The occurrence of rainstorms is sometimes considered to be a Poisson process so that the
time between rainstorms is exponentially distributed. If for a certain locality the mean of this
exponential distribution is 10 days, what is the probability that the elapsed time for 15 storms to
occur will exceed 120 days?

5.16. Lane and Osborn (1973) present the following data for the mean number of days with more
than 0.10 inches of precipitation at Tombstone, Arizona. If the occurrence of more than 0.10
inches of rain in any month can be considered as an independent Poisson process, what is the
probability of fewer than 30 days with more than 0.10 inches of rain in one year at Tombstone?

Month No. of days Month No. of days

Jan. 2 July 7
Feb. 2 Aug. 7
Mar. 2 Sept. 3
Apr. 1 Oct. 2
May 0 Nov. 2
June 2 Dec. 2
-
Total 32

5.17. An experimenter is measuring the water level in an experimental towing channel. Because
of waves and surges, a single measurement of the water level is known to be inaccurate. Past
experience indicates the variance of these measurements is 0.0025 ft$^2$. How many independent
observations are required to be 90% confident that the mean of all the measurements will be
within .02 feet of the true water level?

5.18. At a certain location the annual precipitation is approximately normally distributed with
a mean of 45 in. and a standard deviation of 15 in. Annual runoff can be approximated by
R = -7.5 + 0.5P where R is annual runoff and P is annual precipitation. What is the mean and
variance of annual runoff? What is the probability that the annual runoff will exceed 20 in.?

5.19. Plot a frequency distribution for a mixture of two normal distributions. Use as the first
distribution an $N(0, 1)$ and as the second an $N(1, 1)$. Use as values for the mixing parameter 0.2,
0.5, and 0.8.
6. Continuous Probability
Distributions
THERE ARE many continuous probability distributions in addition to the normal distribu-
tion. This chapter covers some of these distributions, methods for estimating their parameters,
properties of the distributions, and potential applications for them. Further discussion on distribu-
tion selection is contained in chapter 7. Other books may be consulted for more detailed treatment
of the various distributions (Kececioglu, 1991). Rao and Harned (2000) is particularly applicable
to hydrology.

UNIFORM DISTRIBUTION
If a continuous random process is defined over an interval $\alpha$ to $\beta$ and the probability of an
outcome of this process being in a subinterval of $\alpha$ to $\beta$ is proportional to the length of the
subinterval, the process is said to be uniformly distributed over the interval $\alpha$ to $\beta$ (figure 6.1). The
probability density function for the continuous uniform distribution is

$$p_X(x) = \frac{1}{\beta - \alpha} \qquad \text{for } \alpha < x < \beta$$

and the cumulative distribution function is

$$P_X(x) = \frac{x - \alpha}{\beta - \alpha} \qquad \text{for } \alpha < x < \beta$$

Fig. 6.1. Uniform distribution.

The mean and variance of the uniform distribution are

$$E(X) = \frac{\beta + \alpha}{2} \qquad \text{Var}(X) = \frac{(\beta - \alpha)^2}{12}$$

The skewness is zero since the distribution is symmetrical about the mean. The method of
moments yields the following estimators for the parameters $\alpha$ and $\beta$:

$$\hat{\alpha} = \bar{x} - \sqrt{3}\,s \qquad \hat{\beta} = \bar{x} + \sqrt{3}\,s$$

The method of maximum likelihood when applied to the uniform distribution results in
the estimators for $\alpha$ and $\beta$ being the smallest and largest sample values, respectively. That this
is the case can be seen by writing out the likelihood function and then selecting those values of
$\alpha$ and $\beta$ (within the constraints that $\alpha < x < \beta$ for all $x$) that maximize the function.
The uniform distribution finds its greatest application as the distribution of $P_X(x)$ for all
probability density functions. That is, for any continuous probability distribution, $P_X(X)$ is
uniformly distributed over the interval 0 to 1, so that prob($P_X(X) < y$) $= y$ for $0 < y < 1$. This fact is
used in generating random observations from some probability distributions.
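
The use of the uniform distribution in generating random observations can be sketched as follows. The example assumes an exponential target distribution with cdf $1 - e^{-\lambda x}$, picked only because its cdf can be inverted in closed form; the rate 0.5, the seed, and the function name are likewise assumptions for the illustration.

```python
import math
import random


def sample_exponential(lam, rng):
    """Draw one exponential(lam) value by inverting the cdf at a uniform(0, 1) draw.

    Setting u = 1 - exp(-lam * x) and solving for x gives x = -ln(1 - u) / lam.
    """
    u = rng.random()  # uniform on [0, 1)
    return -math.log(1.0 - u) / lam


rng = random.Random(42)
lam = 0.5  # assumed rate; the mean of the distribution is then 1/lam = 2
draws = [sample_exponential(lam, rng) for _ in range(50000)]
sample_mean = sum(draws) / len(draws)
print(f"sample mean = {sample_mean:.3f}, theoretical mean = {1.0 / lam:.3f}")
```

The same idea applies to any continuous distribution whose cdf can be inverted, analytically or numerically.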

Example 6.1. Use the method of moments to estimate the parameters of the uniform distribution
based on the following sample: 1, 4, 3, 4, 5, 6, 7, 6, 9, 5. What are the maximum likelihood
estimators for this sample?

Solution: By method of moments

$\bar{x} = 5.00$ and $s = 2.26$

By maximum likelihood

$\hat{\alpha} = 1.00$ (smallest sample value)

$\hat{\beta} = 9.00$ (largest sample value)

Comment: This problem illustrates that the method of moments and the method of maximum
likelihood do not always produce the same parameter estimates. In this case, the parameters estimated
by moments are not reasonable since values of $X$ outside the limits of $\hat{\alpha}$ and $\hat{\beta}$ are present in the
sample. This is a common problem when the method of moments is used to estimate the
parameters of the uniform distribution for small samples. Of course, for large samples neither the moment
nor the maximum likelihood estimates will be "good" if the sample is not truly a random sample
from a uniform distribution.

TRIANGULAR DISTRIBUTION
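The triangular distribution shown in figure 6.2 is given by

$$p_X(x) = \begin{cases} \dfrac{2(x - \alpha)}{(\beta - \alpha)(\delta - \alpha)} & \alpha \le x \le \delta \\[2ex] \dfrac{2(\beta - x)}{(\beta - \alpha)(\beta - \delta)} & \delta \le x \le \beta \end{cases} \qquad (6.6)$$

It is unlikely that any natural hydrologic process would exactly follow a triangular distribution.
The distribution may be a reasonable approximation to the actual but unknown distribution of
some hydrologic quantities. The triangular distribution has been used in simulation studies
involving bounded random variables whose central tendencies are known.
The mean, variance, and coefficient of skew of the triangular distribution are

$$E(X) = \frac{\alpha + \beta + \delta}{3}$$

$$\text{Var}(X) = \frac{\alpha^2 + \beta^2 + \delta^2 - \alpha\beta - \alpha\delta - \beta\delta}{18}$$

$$\gamma = \frac{\sqrt{2}\,(\alpha + \beta - 2\delta)(2\alpha - \beta - \delta)(\alpha - 2\beta + \delta)}{5(\alpha^2 + \beta^2 + \delta^2 - \alpha\beta - \alpha\delta - \beta\delta)^{3/2}}$$

Fig. 6.2. Triangular distribution (here $\gamma$ is $\delta$ of equation 6.6).

The parameter $\delta$ gives the mode of the triangular distribution. If $\delta$ is known, the parameters
$\alpha$ and $\beta$ may be estimated based on the method of moments as

$$\hat{\alpha} = \frac{-B - \sqrt{B^2 - 4AC}}{2A} \qquad \hat{\beta} = \frac{-B + \sqrt{B^2 - 4AC}}{2A}$$

where
$A = 3$
$B = 3\delta - 9\bar{x}$
$C = 9\bar{x}^2 - 9\delta\bar{x} + 3\delta^2 - 18s^2$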
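
The moment estimators for the triangular distribution can be sketched in Python using the coefficients A, B, and C listed above. The sketch assumes that the two estimates are the two roots of the quadratic $At^2 + Bt + C = 0$, which follows from setting the distribution mean $(\alpha + \beta + \delta)/3$ and variance $(\alpha^2 + \beta^2 + \delta^2 - \alpha\beta - \alpha\delta - \beta\delta)/18$ equal to the sample values; the function name and the check values are illustrative.

```python
import math


def triangular_moment_estimates(mean, variance, mode):
    """Method-of-moments estimates of the lower and upper limits for a known mode.

    Assumes both limits are roots of A*t**2 + B*t + C = 0 with the A, B, C of the text.
    """
    A = 3.0
    B = 3.0 * mode - 9.0 * mean
    C = 9.0 * mean**2 - 9.0 * mode * mean + 3.0 * mode**2 - 18.0 * variance
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        raise ValueError("mean, variance, and mode are not consistent with a triangular distribution")
    root = math.sqrt(disc)
    return (-B - root) / (2.0 * A), (-B + root) / (2.0 * A)


# Consistency check with a hypothetical distribution: limits 0 and 10, mode 4.
alpha, beta, delta = 0.0, 10.0, 4.0
mean = (alpha + beta + delta) / 3.0
variance = (alpha**2 + beta**2 + delta**2 - alpha * beta - alpha * delta - beta * delta) / 18.0
alpha_hat, beta_hat = triangular_moment_estimates(mean, variance, delta)
print(alpha_hat, beta_hat)
```

Feeding the exact population mean and variance back in should recover the limits used to generate them, which is a quick way to confirm the algebra.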
Some special cases of the triangular distribution yield the following estimators:
Mode
A

a I3

A treatment of a generalized triangular distribution is contained in chapter 16 beginning
with equation 16.59.

EXPONENTIAL DISTRIBUTION
In chapter 4 it was shown that the exponential distribution arises as the probability
distribution of the time between occurrences of events of a Poisson process. Among other things, the
exponential distribution has been used as the distribution of the time between rainfall events in
stochastic precipitation models. The exponential density function is given by

$$p_X(x) = \lambda e^{-\lambda x} \qquad x > 0,\ \lambda > 0$$

and the cumulative exponential by

$$P_X(x) = 1 - e^{-\lambda x} \qquad x > 0$$

The mean and variance of the exponential distribution are

$$E(X) = 1/\lambda \qquad \text{Var}(X) = 1/\lambda^2$$

The coefficient of skew is a constant, 2, indicating the exponential is skewed to the right for all
values of $\lambda$. The curve labeled $\eta = 1$ in figure 6.4 is an exponential distribution with $\lambda = 1$.
Examples 3.2 and 3.4 demonstrated that when either the method of moments or maximum
likelihood is used for parameter estimation, the result is

$$\hat{\lambda} = 1/\bar{x} \qquad (6.15)$$

or the parameter $\lambda$ may be estimated by the reciprocal of the sample mean.

Example 6.2. Haan and Johnson (1967) studied the physical characteristics of depressions in
north-central Iowa. The data tabulated below show the number of depressions falling into
various classes based on the surface area of the depression. Plot a relative frequency histogram of the
data. Superimpose on the histogram the best fitting exponential distribution. Estimate the
probability that a depression selected at random will have an area greater than 2.25 acres.

Area class midpoint (acres)    No. of depressions
0.25                           106
0.75                            36
1.25                            18
1.75                             9
2.25                            12
2.75                             2
3.25                             5
3.75                             1
4.25                             4
4.75                             5
5.25                             2
5.75                             6
6.25                             3
6.75                             1
7.25                             1
7.75                             1
Total                          212

Solution: The relative frequencies are computed by dividing the number of depressions in each
class by the total number of depressions. The best fitting exponential is estimated by using
equation 6.15 to estimate the exponential parameter $\lambda$. $\bar{x}$ is calculated from equation 3.16 as 1.27
acres. Then $\hat{\lambda} = 1/\bar{x} = 0.787$. The expected relative frequency in each class is then calculated
from equation 2.25b as

$$f_{x_i} = \Delta x_i\, p_A(x_i)$$

where $x_i$ is the midpoint of the class interval, $\Delta x_i = \frac{1}{2}$, and $p_A(x_i)$ is the exponential distribution
of area given by

$$p_A(x_i) = \hat{\lambda} e^{-\hat{\lambda} x_i}$$

Therefore

$$f_{x_i} = (1/2)\,0.787\, e^{-0.787 x_i}$$

For example, for the second class interval

$$f_{0.75} = 0.393\, e^{-0.787(0.75)} = 0.22$$

compared to an observed value of 36/212, or 0.17.

The estimated probability that a depression will have an area in excess of 2.25 acres is

$$\text{prob}(A > 2.25) = 1 - P_A(2.25) = e^{-0.787(2.25)} = 0.170$$

The observed fraction of depressions with areas in excess of 2.25 acres is 31/212, or 0.146.

Fig. 6.3. Observed and expected (according to the exponential distribution) number of
depressions in various size categories for example 6.2 (horizontal axis: area in acres, class marks 0.25 to 7.75).
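
The calculations of example 6.2 can be sketched in Python. The class midpoints (0.25 to 7.75 acres in steps of 0.5) are taken from the axis of figure 6.3, the counts from the example, and the final count is assumed to be 1 so that the total is 212; the variable names are illustrative. Because the code keeps more digits in the mean than the rounded 1.27 acres, its results differ slightly from the hand calculation.

```python
import math

class_marks = [0.25 + 0.5 * i for i in range(16)]  # midpoints from figure 6.3
counts = [106, 36, 18, 9, 12, 2, 5, 1, 4, 5, 2, 6, 3, 1, 1, 1]
total = sum(counts)

# Grouped-data mean and the exponential parameter estimate (reciprocal of the mean).
mean_area = sum(x * c for x, c in zip(class_marks, counts)) / total
lam = 1.0 / mean_area

# Expected relative frequency: class width times the density at the class midpoint.
width = 0.5
expected = [width * lam * math.exp(-lam * x) for x in class_marks]
observed = [c / total for c in counts]

# Estimated probability of an area in excess of 2.25 acres.
tail = math.exp(-lam * 2.25)

print(f"mean area = {mean_area:.3f} acres, lambda = {lam:.3f}")
for x, o, e in zip(class_marks, observed, expected):
    print(f"{x:5.2f}  observed = {o:.3f}  expected = {e:.3f}")
print(f"prob(area > 2.25) = {tail:.3f}")
```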
GAMMA DISTRIBUTION
The distribution of the sum of $n$ exponentially distributed random variables each with
parameter $\lambda$ is a gamma distribution with parameters $\eta = n$ and $\lambda$. In general, $\eta$ does not have to
be an integer. A comprehensive treatment of the gamma distribution and other distributions in the
gamma family of distributions is given by Bobee and Ashkar (1991). The gamma density
function is given by

$$p_X(x) = \frac{\lambda^{\eta} x^{\eta - 1} e^{-\lambda x}}{\Gamma(\eta)} \qquad x, \lambda, \eta > 0$$

$\Gamma(\eta)$ is the gamma function having the properties

$\Gamma(\eta) = (\eta - 1)!$ for $\eta = 1, 2, 3, \ldots$

$\Gamma(\eta + 1) = \eta\,\Gamma(\eta)$ for $\eta > 0$

$\Gamma(\eta) = \int_0^{\infty} t^{\eta - 1} e^{-t}\, dt$ for $\eta > 0$

The mean, variance and coefficient of skew for the gamma distribution are

$$E(X) = \eta/\lambda \qquad (6.18)$$

$$\text{Var}(X) = \eta/\lambda^2 \qquad (6.19)$$

$$\gamma = 2/\sqrt{\eta}$$

The gamma distribution is positively skewed with $\gamma$ decreasing as $\eta$ increases. Plots of the
distribution for various values of $\eta$ and $\lambda$ are shown in figure 6.4. A wide variety of shapes
ranging from reverse J-shaped for $\eta < 1$ to single peaked with the peak (mode) at $x = (\eta - 1)/\lambda$ for
$\eta > 1$ can be produced by the gamma density function. Changing $\lambda$ and holding $\eta$ constant
changes the scale of the distribution, whereas changing $\eta$ and holding $\lambda$ constant changes the
shape of the distribution. Thus, $\lambda$ and $\eta$ are sometimes known as scale and shape parameters.
The cumulative gamma distribution is

$$P_X(x) = \int_0^{x} \frac{\lambda^{\eta} t^{\eta - 1} e^{-\lambda t}}{\Gamma(\eta)}\, dt$$

If $\eta$ is an integer, the cumulative gamma distribution is given by (Mood et al. 1974)

$$P_X(x) = 1 - e^{-\lambda x} \sum_{j=0}^{\eta - 1} \frac{(\lambda x)^j}{j!}$$

Some computer spreadsheets will evaluate $P_X(x)$ for the gamma distribution.

Fig. 6.4. Gamma distribution with several values for $\eta$ and $\lambda$.

The exponential distribution is a special case of the gamma distribution with $\eta = 1$. If $X$
and $Y$ are independent gamma random variables with parameters $\eta_1$, $\lambda$ and $\eta_2$, $\lambda$ respectively,
then $Z = X + Y$ is a gamma variable with parameters $\eta = \eta_1 + \eta_2$ and $\lambda$. This can be
extended to the sum of any number of independent gamma random variables having a common
parameter $\lambda$. It is an expected result because in chapter 4 the gamma distribution was shown to
arise as the distribution of the sum of $n$ independent exponential random variables.
The moment estimators for the parameters of the gamma distribution result from equations
6.18 and 6.19 as

$$\hat{\lambda} = \bar{x}/s^2 \qquad \hat{\eta} = \bar{x}^2/s^2$$
The maximum likelihood estimators for $\lambda$ and $\eta$ are given by

where $\bar{x}_G$ is the sample geometric mean and $\psi(x) = d \ln \Gamma(x)/dx$ is the psi-function. Thom
(1958) has proposed an approximate relationship based on the truncation of a series expansion of
the maximum likelihood estimator for $\eta$ given by

where $y$ is $\ln \bar{x} - \overline{\ln x}$, $\Delta\hat{\eta}$ is a correction term arising because of the truncation, and $\overline{\ln x}$ is the
mean natural logarithm of the observations. Table 6.1 contains the values of $\Delta\hat{\eta}$ for $\hat{\eta}$ ranging
from 0.2 to 5.6. For $\hat{\eta} > 5.6$ the correction is negligible (as it is anyway for many practical
situations regardless of the value of $\hat{\eta}$). The procedure for finding the correction factor is to assume
that $\hat{\eta}$ is equal to the first term of equation 6.24 and use the $\Delta\hat{\eta}$ from table 6.1 corresponding to
this initial estimate for $\hat{\eta}$.

Table 6.1. Correction factor for the maximum likelihood estimator for the parameter $\eta$ of the
gamma distribution

The parameter $\lambda$ is then estimated by

Thom (1958) states that for $\eta < 10$ the method of moments produces unacceptable
estimates for both $\lambda$ and $\eta$. For $\eta$ near 1 the method of moments uses only 50% of the sample
information for estimating $\lambda$ and only 40% for $\eta$. This means the maximum likelihood estimators
would do as well with one half the number of observations.
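
The two estimation approaches can be compared with a short Python sketch. It is not the Thom or the Greenwood and Durand procedure: it assumes the usual likelihood equations for the gamma distribution, $\ln\eta - \psi(\eta) = y$ with $y = \ln\bar{x} - \overline{\ln x}$ and $\lambda = \eta/\bar{x}$, and solves the first by bisection with a numerically differentiated psi-function. The sample values are hypothetical (they are not the Cave Creek data), and all function names are illustrative.

```python
import math


def digamma(x, h=1e-5):
    """Psi-function approximated by a central difference of ln Gamma(x)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def gamma_moment_estimates(data):
    """Method of moments: eta = mean^2 / s^2 and lambda = mean / s^2."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    return mean * mean / var, mean / var


def solve_shape(y):
    """Solve ln(eta) - psi(eta) = y for eta > 0 by bisection on a log scale.

    The left side decreases steadily toward zero as eta grows, so the root is unique for y > 0.
    """
    lo, hi = 1e-3, 1e4
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if math.log(mid) - digamma(mid) > y:
            lo = mid  # left side still too large, so eta must increase
        else:
            hi = mid
    return math.sqrt(lo * hi)


def gamma_ml_estimates(data):
    """Maximum likelihood estimates (eta, lambda) under the assumed likelihood equations."""
    n = len(data)
    mean = sum(data) / n
    y = math.log(mean) - sum(math.log(x) for x in data) / n
    eta = solve_shape(y)
    return eta, eta / mean


# Hypothetical annual runoff values in inches, for illustration only.
sample = [14.2, 9.8, 21.5, 12.3, 17.9, 11.1, 25.4, 15.6, 8.7, 19.3]
eta_mm, lam_mm = gamma_moment_estimates(sample)
eta_ml, lam_ml = gamma_ml_estimates(sample)
print(f"moments: eta = {eta_mm:.3f}, lambda = {lam_mm:.3f}")
print(f"maximum likelihood: eta = {eta_ml:.3f}, lambda = {lam_ml:.3f}")
```

Both sets of estimates reproduce the sample mean through $\eta/\lambda$; they differ in how the shape is inferred, which is where the efficiency differences noted by Thom arise.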
Greenwood and Durand (1960) present the following rational fraction approximations for
the maximum likelihood estimators

for

and

for
where

A is then estimated from equation 6.25. Greenwood and Durand (1960) state that the maximum
error in equation 6.26 is 0.0088% and in equation 6.27 is 0.0054%.
Equations 6.24-6.27 produce estimates for η and λ that have a slight asymptotic bias. For
small samples the bias may be appreciable (Shenton and Bowman 1970). Bowman and Shenton
(1968) present the following approximate relationship for estimating the bias in the parameter η
when equations 6.24-6.27 are used.

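$$E(\hat{\eta} - \eta) \approx \frac{3\eta - 0.677 + \dfrac{0.111}{\eta} + \dfrac{0.032}{\eta^2}}{n - 3} \qquad \text{for } n \ge 4 \text{ and } \eta \ge 1 \qquad (6.29)$$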

where E(η̂ − η) is the bias in η̂ with error of less than 1.4%. The result of using this relationship
for estimating the bias in η̂ for a sample size n from a gamma distribution having a population
parameter of η = 2 is shown in figure 6.5. In practice, equation 6.29 can be used to correct η̂ for
bias. If the population η were known, there would of course be no need for estimating η.
Bowman and Shenton (1968) suggest that the bias in η̂ can be approximated from

which yields

The gamma distribution has been widely used in hydrology (Bobee and Ashkar 1991).
Rainfall probabilities for durations of days, weeks, months, and years have been estimated by the
gamma distribution (Barger and Thom 1949; Barger, Shaw and Dale 1959; Friedman and Janes
1957; Mooley and Crutcher 1968). Annual runoff (Markovic 1965) has been described by the
gamma distribution.

Fig. 6.5. Expected bias in η̂ for the gamma distribution with η = 2 (horizontal axis: sample size n, 0 to 100).

Example 6.3. The annual water yield for Cave Creek near Fort Spring, Kentucky (USGS #
03288500) is shown in the following table. Estimate the parameters of the gamma distribution for
this data using both the method of moments and the method of maximum likelihood. Assuming
the data follows a gamma distribution, estimate the probability of an annual water yield exceed-
ing 20.00 inches.
Annual Runoff Annual Runoff
Year (inches) Year (inches)

Solution: Method of Moments

Method of Maximum Likelihood (Thom procedure)


Method of Maximum Likelihood (Greenwood and Durand procedure)

Thus, the maximum likelihood estimators are λ̂ = 0.485 and η̂ = 7.107. These estimates
may be corrected for bias using either equation 6.29 or 6.30. If 6.30 is used

Note that E(η̂ − η) = E(η̂) − E(η) = 7.107 − 5.922 = 1.185

If η = 5.922 is substituted into equation 6.29, the result is E(η̂ − η) = 1.141, which is in
good agreement with the 1.185 produced by equation 6.30. The final estimate for η is now
η̂ = 5.922 and λ̂ = η̂/x̄ = 0.404. Using the method of moments the parameter estimates are
η̂ = 9.513 and λ̂ = 0.649, whereas the maximum likelihood estimates are η̂ = 5.922 and λ̂ =
0.404. Following the recommendation of Thom (1958), the latter estimates will be used in estimating
the probability of an annual water yield in excess of 20.00 inches.

Thus 1 − P_X(20.00) is 0.176, which is the desired probability. The prob(yield > 20.00) = 0.176
if the annual water yield follows a gamma distribution with parameters η = 5.922 and λ =
0.404. In these calculations Microsoft Excel 97 was used to evaluate the gamma distribution.
Comment: If the moment parameter estimates had been used, the resulting probability would
have been 0.132, which is reasonably close to 0.176. This is because η is reasonably close to the
10.00 that Thom (1958) suggested is the smallest value of η for which the method of moments
results in good parameter estimates. For this data C_s = 2/√η = 0.82, so that the distribution is
moderately skewed to the right. If the normal distribution had been used to estimate prob(X >
20.00), the result would have been 0.126, which again is a reasonable approximation. However,
if the annual water yield with a return period of 100 years or a 1% chance of being exceeded is
evaluated by the gamma with η = 5.922 and λ = 0.404 and by the normal with μ = 14.56 and
σ = 4.75, the results are 32.2 inches and 25.6 inches, again showing the sensitivity of estimates
of rare events to the distributional assumption even though in the main body of the distribution the
agreement is good.
Generally, 18 observations are not enough to make reliable probability estimates or to
determine the proper probability distribution to use. It is a small enough number that one can fol-
low through all of the needed calculations for this example in a short time on a desk calculator,
however. The fact that the gamma and normal estimates differ greatly for this data at large return
periods does not mean the gamma (or the normal) is a better approximation for the data. This
question will be taken up later. Exercise 6.21 should be consulted for another approximate solu-
tion to this example.
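The probability and quantile quoted in example 6.3 can be checked without a spreadsheet. The following is a minimal sketch, assuming the gamma density λ^η x^(η−1) e^(−λx)/Γ(η); the cumulative distribution is evaluated with the standard series for the incomplete gamma function and the quantile by bisection. The function names are illustrative only.

```python
import math

def gamma_cdf(x, eta, lam):
    """P(X <= x) for a gamma variable with shape eta and rate lam."""
    if x <= 0.0:
        return 0.0
    t = lam * x
    term = 1.0 / eta            # first term of the incomplete-gamma series
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= t / (eta + k)
        total += term
        k += 1
    return total * math.exp(eta * math.log(t) - t - math.lgamma(eta))

def gamma_quantile(p, eta, lam):
    """Value x with P(X <= x) = p, found by bisection."""
    lo, hi = 0.0, eta / lam
    while gamma_cdf(hi, eta, lam) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, eta, lam) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Bias-corrected maximum likelihood estimates from example 6.3
p_exceed = 1.0 - gamma_cdf(20.0, 5.922, 0.404)   # prob(yield > 20 inches)
x_100 = gamma_quantile(0.99, 5.922, 0.404)       # 100-year annual yield
```

With the parameter values of the example, `p_exceed` should land near the 0.176 reported above and `x_100` near 32.2 inches.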

LOGNORMAL DISTRIBUTION
The Central Limit Theorem was used in deriving the general result that if a random variable
X is made up of the sum of many small effects, then X might be expected to be normally distributed.
Similarly, if X is equal to the product of many small effects, that is if X = X₁X₂···Xₙ, then
the logarithm of X, ln X, can be expected to be normally distributed. This can be seen by letting
Y = ln X so that Y = ln(X₁X₂···Xₙ) = ln X₁ + ln X₂ + ··· + ln Xₙ. Because the Xᵢ are random
variables, the ln Xᵢ are also random variables and Y = ln X is a random variable made up from
the sum of many other random variables. From the Central Limit Theorem, Y can be expected to
be normally distributed with mean μ_y and variance σ_y².

The distribution of X can be found from

Because Y = ln X

and

Note that equation 6.31 gives the distribution of Y as a normal distribution with mean μ_y
and variance σ_y². Equation 6.32 gives the distribution of X as the lognormal distribution with
parameters μ_y and σ_y². Y = ln X is normally distributed while X is lognormally distributed.

The parameters μ_y and σ_y² can be estimated by ȳ and s_y² in the usual manner by first transforming
all of the xᵢ's to yᵢ's by

then

and

with all of the summations from 1 to n. If a digital computer is used the above equations are
easily applied. ȳ and s_y² may be determined without taking the logarithms of all of the data from

where C_v is the coefficient of variation of the original data (C_v = s_x/x̄). These relationships are
not general results but depend on data being lognormally distributed.
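The shortcut described above, obtaining the log-space statistics from the mean and coefficient of variation of the untransformed data, can be sketched as follows. The sketch assumes the standard lognormal relations E(X) = exp(μ_y + σ_y²/2) and C_v² = exp(σ_y²) − 1; the function names are hypothetical and, as the text notes, the result is only meaningful if the data are in fact lognormal.

```python
import math

def lognormal_params_from_moments(mean_x, sd_x):
    """Return (mean_y, var_y) of Y = ln X from the mean and sd of X."""
    cv2 = (sd_x / mean_x) ** 2
    var_y = math.log(1.0 + cv2)
    mean_y = math.log(mean_x) - 0.5 * var_y
    return mean_y, var_y

def lognormal_moments(mean_y, var_y):
    """Return (mean, coefficient of variation, skew) of X from Y = ln X."""
    mean_x = math.exp(mean_y + 0.5 * var_y)
    cv = math.sqrt(math.exp(var_y) - 1.0)
    skew = 3.0 * cv + cv ** 3
    return mean_x, cv, skew
```

For instance, a mean of 100 and a standard deviation of 50 give C_v = 0.5, and passing the resulting log-space parameters back through `lognormal_moments` recovers the same mean and C_v.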
The mean, variance, and coefficient of variation of the lognormal distribution are

The coefficient of skew of the X's is

Thus, the lognormal distribution is positively skewed with the skew decreasing as the coefficient
of variation decreases. Based on the properties of the normal distribution, the skewness of the
logarithms of lognormal data is zero.
Tables of the standard normal distribution can be used to evaluate the lognormal distribution.
From equation 6.32 we have p_X(x) = p_Y(y)/x. But p_Y(y) is a normal density function. From
equation 5.7, p_Y(y) = p_Z(z)/σ_y or

$$p_X(x) = \frac{p_Z(z)}{x\,\sigma_y} \qquad (6.41)$$

The prob(X ≤ x) is equal to the prob(Y ≤ y) because Y = ln X is a monotonic, single-valued
function. Since Y is normally distributed, prob(Y ≤ y) = prob(Z ≤ z) where

Therefore, standard normal tables can be used with the proper transformations to evaluate p_X(x)
and P_X(x) for the lognormal distribution.
Certain reproductive properties of the lognormal follow directly from the reproductive
properties of the normal distribution. For example, if X is lognormally distributed then Y = aX^b
is lognormally distributed with

$$\mu_{\ln Y} = \ln a + b\,\mu_{\ln X}$$

and

$$\sigma^2_{\ln Y} = b^2\,\sigma^2_{\ln X}$$

This follows from the fact that ln Y = ln a + b ln X, ln X is normally distributed, and ln Y is a
linear function of ln X so is also normally distributed. Thus Y is lognormally distributed. This
can be extended so that if X₁, X₂, ..., Xₙ are independent and lognormally distributed, then
$Y = X_1^{b_1} X_2^{b_2} \cdots X_n^{b_n}$ is lognormally distributed with

and

Two special cases of the above are if Z = XY and Z = X/Y with X and Y being independ-
ently and lognormally distributed, then Z is lognormally distributed with its mean and variance
easily determined from equations 6.43 and 6.44.
Because of its simplicity, its ready availability in tables for its evaluation, and the fact that
many hydrologic variables are bounded by zero on the left and positively skewed, the lognormal
distribution has received wide usage in hydrology.

Example 6.4. Use the lognormal distribution and calculate the expected relative frequency for
the third class interval of the data in table 5.1.

Solution: The expected relative frequency according to the lognormal distribution is

The evaluation of p_X(x) from equation 6.41 requires an estimate for μ_y and σ_y. These are
estimated from equations 6.35 and 6.36.

or the expected relative frequency in the interval 40,000 to 50,000 according to the lognormal
distribution is

Example 6.5. Assume the data of table 5.1 follow the lognormal distribution. Calculate the
magnitude of the 100-year peak flow.

Solution: The 100-year peak flow corresponds to a prob(X > x) of 0.01. X must be evaluated
such that P_X(x) = 0.99. This can be accomplished by evaluating Z such that P_Z(z) = 0.99 and
then transforming to X. From the standard normal tables the value of Z corresponding to P_Z(z) of
0.99 is 2.326. From equation 6.37

The values of s_y and ȳ are given in example 6.4.

y = 0.326(2.326) + 11.0524 = 11.812


x = exp(y) = 134,683 cfs

The 100-year peak flow according to the lognormal distribution is about 134,700 cfs.
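The table lookup in example 6.5 can be replaced by the inverse normal distribution in Python's standard library. This is a minimal sketch using the log-space mean (11.0524) and standard deviation (0.326) quoted in the example; `lognormal_quantile` is an illustrative name.

```python
from math import exp
from statistics import NormalDist

def lognormal_quantile(p, mean_y, sd_y):
    """Value x with prob(X <= x) = p when ln X is normal(mean_y, sd_y)."""
    z = NormalDist().inv_cdf(p)        # standard normal deviate
    return exp(mean_y + sd_y * z)      # transform y = mean_y + sd_y*z back to x

q100 = lognormal_quantile(0.99, 11.0524, 0.326)   # 100-year peak flow, cfs
```

Because `inv_cdf` returns z to more decimal places than the tabulated 2.326, the computed flow differs from the hand calculation only in the last few digits.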

EXTREME VALUE DISTRIBUTIONS


Often, interest exists in extreme events such as the maximum peak discharge of a stream or
minimum daily flows. The extreme value of a set of random variables is also a random variable.
The probability distribution of this extreme value random variable will in general depend on the
sample size and the parent distribution from which the sample was obtained. Hahn and Shapiro
(1967), Ang and Tang (1984), Kececioglu (1991), Rao and Hamed (2000), and Benjamin and
Cornell (1970) contain very readable treatments of some of the extreme value distributions.
Consider a random sample of size n consisting of x₁, x₂, ..., xₙ. Let Y be the largest of the
sample values. Let P_Y(y) be the prob(Y ≤ y) and P_Xi(x) be the prob(Xᵢ ≤ x). Let p_Y(y) and p_Xi(x)
be the corresponding probability density functions.
P_Y(y) = prob(Y ≤ y) = prob(all of the x's ≤ y). If the x's are independently and identically
distributed we have

$$P_Y(y) = [P_X(y)]^n \qquad (6.45)$$

and

$$p_Y(y) = n\,[P_X(y)]^{n-1}\,p_X(y) \qquad (6.46)$$
Therefore, the probability distribution of the maximum of n independently and identically
distributed random variables depends on the sample size n and the parent distribution P_X(x) of the
sample. A similar result can be derived for the distribution of the smallest of n independently and
identically distributed random variables.

Example 6.6. Assume that the time between rains follows an exponential distribution with a
mean of 4 days. Also assume that the time between rains is independent from one rain to the next.
Irrigators may be interested in the maximum time between rains. Over a period of 10 rains, what
is the probability that the maximum time between rains exceeds 8 days?

Solution: 10 rains means 9 interrain periods, or n = 9. From equation 6.45 the probability that
the maximum interrain time is less than 8 days is

In this example, P_X(y) is the cumulative exponential with parameter λ = 1/x̄ = 1/4.

Therefore, the probability that the maximum interrain time will be greater than 8 days is 1 − 0.271
= 0.729.
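The calculation in example 6.6 follows directly from raising the parent cumulative distribution to the nth power. A short sketch, with a hypothetical function name and the exponential parent assumed in the example:

```python
import math

def prob_max_exceeds(x, n, mean):
    """prob(largest of n independent exponential values > x)."""
    p_single = 1.0 - math.exp(-x / mean)   # exponential cdf, lambda = 1/mean
    return 1.0 - p_single ** n             # 1 - [P_X(x)]**n

p = prob_max_exceeds(8.0, 9, 4.0)          # 9 interrain periods, mean of 4 days
```

The same function shows how quickly the probability approaches 1 as n grows, which is the point made about the tail of the parent distribution.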

Comment: The probability density function for the maximum interrain time is from equation 6.46

This distribution is plotted in figure 6.6 for various values of n. Note that for even moderately
large n, the probability is very high that the extreme value (longest interrain time) will be from
the tail of the parent (exponential) distribution.

Fig. 6.6. Distribution of the largest sample value from a sample of size n from an exponential
distribution.

Frequently the parent distribution from which the extreme is an observation is not known
and cannot be determined. If the sample size is large, use can be made of certain general asymptotic
results that depend on limited assumptions concerning the parent distribution to find the
distribution of extreme values. Much of the work on extreme value distributions is due to
Gumbel (1954, 1958). Three types of asymptotic distributions have been developed based on
different (but not all) parent distributions. The types are:

a. Type I: parent distribution unbounded in direction of the desired extreme and all
moments of the distribution exist (exponential type distributions).

b. Type II: parent distribution unbounded in direction of the desired extreme and not all
moments of the distribution exist (Cauchy type distributions).

c. Type III: parent distribution bounded in the direction of the desired extreme (limited
distributions).

Interest may exist in either the distribution of the largest or smallest extreme values. Examples
of parent distributions falling under the various types are:

a. Type I-extreme value largest: normal, lognormal, exponential, gamma

b. Type I-extreme value smallest: normal

c. Type II-extreme value largest or smallest: Cauchy distribution (Hahn and Shapiro
1967; Thomas 1971)

d. Type III-extreme value largest: beta distribution (Hahn and Shapiro 1967; Gibra 1973;
Benjamin and Cornell 1970)

e. Type III-extreme value smallest: beta, lognormal, gamma, exponential


The type II or Cauchy type extreme value distributions have found little application in
hydrology. The distribution of the largest extreme value in hydrology generally arises as a type I
extreme value largest distribution because most hydrologic variables are unbounded on the right.
(See Van Montfort [1970] for a test to determine whether a type I or type II extreme value largest
best fits the observed data.) The distribution of extreme value smallest commonly found in hydrologic
work is the type III extreme value smallest since many hydrologic variables are bounded on
the left by zero. The following is a treatment of these two (type I largest and type III smallest) extreme
value distributions plus the type I smallest because of its symmetry with the type I largest.

Extreme Value Type I


The type I extreme value has been referred to as Gumbel's extreme value distribution, the
extreme value distribution, the Fisher-Tippett type I distribution, and the double exponential
distribution. The type I asymptotic distribution for maximum (minimum) values is the limiting
model as n approaches infinity for the distribution of the maximum (minimum) of n independent
values from an initial distribution whose right (left) tail is unbounded and which is an exponential
type; that is, the initial cumulative distribution approaches unity (zero) with increasing
(decreasing) values of the random variable at least as fast as the exponential distribution approaches
unity. The normal, lognormal, exponential, and gamma distributions all meet this requirement for
maximum values while the normal distribution satisfies the requirement for minimum values.
The type I extreme value distribution has been used for rainfall depth-duration-frequency
studies (Hershfield 1961) and as the distribution of the yearly maximum of daily and peak river
flows. Gumbel (1958) states that this latter application assumes 1) the distribution of daily
discharges (the parent distribution) is of the exponential type, 2) n = 365 is a sufficiently large
sample and 3) the daily discharges are independent. Gumbel states that the first and second
assumptions cannot be checked because the analytical form of the distribution of discharges is
unknown and that the third assumption is clearly not true so that the number of independent
observations is something less than 365. In spite of violating the last assumption, experience with
the type I for the maximum of daily discharges has been reasonably good. Maximum annual
flood peaks would more nearly fulfill assumption 3 although the effective sample size would be
much less than 365.
The probability density function for the type I extreme value distribution is

$$p_X(x) = \frac{1}{\alpha}\exp\left[\mp\frac{x-\beta}{\alpha} - \exp\left(\mp\frac{x-\beta}{\alpha}\right)\right] \qquad (6.47)$$

where the − applies for maximum values and the + for minimum values. The parameters α and
β are scale and location parameters with β being the mode of the distribution. The type I for maximum
and minimum values are symmetrical with each other about β. Figure 6.7 is a plot of the
distributions for α = 3,897 and β = 7,750.
The mean and variance of the extreme value type I distribution are

$$E(X) = \beta + \gamma_e\alpha \quad \text{(maximum)} \qquad (6.48)$$

$$E(X) = \beta - \gamma_e\alpha \quad \text{(minimum)}$$

$$\mathrm{Var}(X) = \frac{\pi^2}{6}\,\alpha^2 \quad \text{(both)}$$

where γ_e is Euler's constant, which has a value of 0.577216.

Fig. 6.7. Example of extreme value type I density curves (largest and smallest; horizontal axis X in 1000s).

The skewness coefficient is

$$\gamma = 1.1396 \quad \text{(maximum)}$$

$$\gamma = -1.1396 \quad \text{(minimum)}$$

Thus, the type I has a constant coefficient of skewness.


If the transformation

is used, the type I extreme value density function becomes

where the - applies for the maximum values and the + for the minimum values. The cumula-
tive distribution is

$$P_Y(y) = \int_{-\infty}^{y} \exp[\mp t - \exp(\mp t)]\,dt \qquad -\infty < y < \infty$$

$$= \exp[-\exp(-y)] \quad \text{(maximum)} \qquad (6.53)$$

$$= 1 - \exp[-\exp(y)] \quad \text{(minimum)} \qquad (6.54)$$

The designation "double exponential" distribution follows from these equations. The cumu-
lative distribution for maximum and minimum values are related by
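The parameters of the type I extreme value distribution can be estimated in a number of
ways. Lowery and Nash (1970) compared several methods and concluded that the method of
moments was as satisfactory as other methods. If the method of moments is used, the estimators
are

$$\hat{\alpha} = \frac{\sqrt{6}}{\pi}\,s \qquad (6.56)$$

and

$$\hat{\beta} = \bar{x} - \frac{\gamma_e\sqrt{6}}{\pi}\,s \quad \text{(maximum)} \qquad (6.57)$$

$$\hat{\beta} = \bar{x} + \frac{\gamma_e\sqrt{6}}{\pi}\,s \quad \text{(minimum)}$$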

The maximum likelihood estimators (Lowery and Nash 1970) can be determined by a
simultaneous solution to the equations

Unfortunately, these equations cannot be easily solved explicitly for α̂ and β̂, so that a numerical
solution is required.
The type I extreme value distribution for maximums has been used to define the "mean annual
flood". The probability that an observation from this distribution will exceed the mean of the
distribution is 1 − P_Y(y) where P_Y(y) is evaluated from equation 6.53 for y = (μ − β)/α. Since
μ = E(X) = β + γ_e α (equation 6.48), we simply have that y = γ_e and P_Y(y) = 0.5703. The
probability of a value in excess of the mean is 1 − P_Y(y) = 0.4297. The return period of a flood
equal in magnitude to the mean is

$$T = \frac{1}{1 - P_Y(y)} = 2.33 \text{ years}$$

Often the "mean annual flood" refers to a flood with a return period of 2.33 years.
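The 2.33-year result does not depend on the particular sample, which can be seen by fitting the type I distribution for maximums by moments and asking for the return period of the mean. The sketch below uses the moment estimators of equations 6.56 and 6.57 and the cumulative distribution of equation 6.53; the sample statistics (10,000 and 4,000) and the function names are purely illustrative.

```python
import math

EULER_GAMMA = 0.577216   # Euler's constant as quoted in the text

def gumbel_max_moments(mean, sd):
    """Moment estimates (alpha, beta) for the type I largest distribution."""
    alpha = math.sqrt(6.0) * sd / math.pi
    beta = mean - EULER_GAMMA * alpha
    return alpha, beta

def gumbel_max_cdf(x, alpha, beta):
    """prob(X <= x), equation 6.53 with y = (x - beta)/alpha."""
    return math.exp(-math.exp(-(x - beta) / alpha))

def return_period(x, alpha, beta):
    """Average recurrence interval, in years, of an annual maximum above x."""
    return 1.0 / (1.0 - gumbel_max_cdf(x, alpha, beta))

alpha, beta = gumbel_max_moments(10000.0, 4000.0)   # hypothetical sample
t_mean = return_period(10000.0, alpha, beta)        # return period of the mean
```

Changing the assumed mean or standard deviation leaves `t_mean` unchanged, because y = (μ − β)/α always reduces to Euler's constant.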

Extreme Value Type III Minimum (Weibull)


The extreme value type III distribution arises when the extreme is from a parent distribution
that is limited in the direction of interest. This distribution has found use in hydrology as the
distribution of low stream flows. Naturally low flows are bounded by zero on the left. The type
III for minimum values is also known as the Weibull distribution and is defined as

The cumulative Weibull is given by

The mean and variance of the distribution are

Hahn and Shapiro (1967) give the coefficient of skew as

The parameters of the Weibull distribution can be estimated by the method of moments by substituting
the sample mean and variance for the population mean and variance respectively in
equations 6.62 and 6.63 and then solving the two equations simultaneously for α̂ and β̂.
The maximum likelihood estimates can be determined by letting

and then solving the equations

and

simultaneously for α̂ and λ̂. β̂ is then given by

Either method of parameter estimation is difficult. Exercise 6.18 provides a method for
simplifying the solution of the moment equations.
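One way to handle the moment equations numerically is to note that the coefficient of variation of the Weibull depends on α alone. The sketch below assumes the 2-parameter form with density α x^(α−1) β^(−α) exp[−(x/β)^α], for which the mean is βΓ(1 + 1/α) and the variance is β²[Γ(1 + 2/α) − Γ²(1 + 1/α)]; it is not the method of exercise 6.18, and the function names and search bounds are arbitrary choices.

```python
import math

def weibull_cv(alpha):
    """Coefficient of variation of a Weibull with shape alpha."""
    g1 = math.gamma(1.0 + 1.0 / alpha)
    g2 = math.gamma(1.0 + 2.0 / alpha)
    return math.sqrt(g2 - g1 * g1) / g1

def weibull_moment_fit(mean, sd, lo=0.1, hi=50.0):
    """Moment estimates (alpha, beta) by bisection on the shape parameter."""
    target = sd / mean
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if weibull_cv(mid) > target:   # CV decreases as alpha increases
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    beta = mean / math.gamma(1.0 + 1.0 / alpha)
    return alpha, beta
```

A sample with a coefficient of variation of 1 returns α = 1, the exponential case mentioned in connection with figure 6.8.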

Fig. 6.8. Examples of extreme value type III minimum (Weibull) density curves.

The Weibull probability density function can range from a reverse-J with α < 1, to an exponential
with α = 1 and to a nearly symmetrical distribution (figure 6.8) as α increases. If the
lower bound on the parent distribution is not zero, a displacement parameter must be added to the
type III extreme value distribution for minimums so that the density function becomes

and the cumulative distribution function becomes

By using the transformation

tables of e^(−y) can be used to determine P_X(x). Equation 6.68 is sometimes known as the 3-parameter
Weibull distribution, or as the bounded exponential distribution.
The mean and variance of the three parameter Weibull distribution are

and

The coefficient of skew is again given by equation 6.64. Through algebraic manipulation, equations
6.70 and 6.71 can be put in the form (Gumbel 1958)

$$\beta = \mu + \sigma A(\alpha) \qquad (6.72)$$

$$\varepsilon = \beta - \sigma B(\alpha) \qquad (6.73)$$

where

The moment estimates for α, β, and ε can now be obtained by 1) solving equation 6.64 for
α̂, 2) solving 6.74 and 6.75 for A(α̂) and B(α̂), 3) solving 6.72 for β̂, and 4) solving 6.73 for ε̂.
Table 6.2 can be used to simplify the calculations.

Table 6.2. Table for solution of equations 6.64, 6.74, and 6.75



Example 6.7. The minimum annual daily discharges on a stream are found to have an average of
125 cfs, a standard deviation of 50 cfs, and a coefficient of skew of 1.4. Using both the type III
minimum and the type I minimum extreme value distributions, evaluate the probability of an
annual minimum flow being less than 100 cfs.

Solution: Type III minimum using interpolation in table 6.2.

$$\hat{\beta} = 125 + 50(0.098) = 129.9 \qquad \text{(eq. 6.72)}$$

$$\hat{\varepsilon} = 129.9 - 50(1.36) = 61.9 \qquad \text{(eq. 6.73)}$$

prob(X ≤ 100) = P_X(100) = 1 − e^(−y) (eq. 6.69) where

Type I minimum

$$\hat{\alpha} = \sqrt{6}\,s/\pi = 0.78(50) = 39 \qquad \text{(eq. 6.56)}$$

$$\hat{\beta} = \bar{x} + \gamma_e\sqrt{6}\,s/\pi = 125 + 0.45(50) = 147.5 \qquad \text{(eq. 6.57)}$$

$$\text{prob}(X \le 100) = P_X(100) = 1 - \exp(-e^{y}) \qquad \text{(eq. 6.54)}$$

where y = (x − β̂)/α̂ = (100 − 147.5)/39 = −1.22

$$P_X(100) = 1 - \exp(-e^{-1.22}) = 1 - 0.744 = 0.256$$

Comment: The results of applying these two distributions to this problem are very different. This
should be expected as it is a situation where the type I for minimums would not be expected to
apply because there would be a lower bound and because the coefficient of skew was given as 1.4
whereas the coefficient of skew for the type I minimum is - 1.1396.
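The type I minimum part of example 6.7 is easily reproduced. A minimal sketch, using equations 6.56 and 6.57 for the moment estimates and equation 6.54 for the cumulative distribution; the function names are illustrative, and the type III part is not attempted because it relies on table 6.2.

```python
import math

EULER_GAMMA = 0.577216   # Euler's constant as quoted in the text

def gumbel_min_moments(mean, sd):
    """Moment estimates (alpha, beta) for the type I smallest distribution."""
    alpha = math.sqrt(6.0) * sd / math.pi      # eq. 6.56
    beta = mean + EULER_GAMMA * alpha          # eq. 6.57, minimum form
    return alpha, beta

def gumbel_min_cdf(x, alpha, beta):
    """prob(X <= x) for the type I smallest distribution (eq. 6.54)."""
    y = (x - beta) / alpha
    return 1.0 - math.exp(-math.exp(y))

alpha, beta = gumbel_min_moments(125.0, 50.0)
p_below_100 = gumbel_min_cdf(100.0, alpha, beta)
```

Carrying full precision gives α̂ and β̂ slightly different from the rounded 39 and 147.5, but the probability still rounds to 0.256.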

Discussion
The theory on which the extreme value distributions depend is not as strong as the Central
Limit Theorem for the normal distribution. More assumptions concerning the underlying or
parent distribution must be made and the rate of convergence to an asymptotic extreme value
distribution may be rather slow. However, the extreme value distributions do provide a connection
between observed extreme events and models that may be used to evaluate the probabilities
of future extreme events.
The conditions under which the various extreme value distributions arise are such that for
many parent distributions (lognormal, gamma) the distribution of maximum values and the distribution
of minimum values are not of the same type. The minimum values from a lognormal
would be expected to follow the type III distribution while the maximum values would follow a
type I distribution.
Various types of extreme value distributions are related. The logarithms of a random vari-
able that follows a type III minimum are distributed as the type I minimum extreme value distri-
bution. Chow (1954) has shown that if the coefficient of variation of the type I maximum extreme
value distribution is 0.364, the distribution is practically the same as the lognormal distribution
with the same coefficient of variation and coefficient of skew (1.139).

GENERALIZED EXTREME VALUE DISTRIBUTION


Stedinger et al. (1994) discuss the generalized extreme value (GEV) distribution given by

$$P_X(x) = \exp\left\{-\left[1 - \frac{\kappa(x - \xi)}{\alpha}\right]^{1/\kappa}\right\} \qquad \text{for } \kappa \ne 0$$

The Gumbel distribution is obtained when κ = 0. For |κ| < 0.3, the general shape of the GEV is
similar to the Gumbel extreme value distribution with some differences in the right tail. The
parameters ξ, α, and κ are location, scale, and shape parameters. For κ > 0, the distribution has
a finite upper bound at ξ + α/κ and corresponds to the extreme value type III distribution for
maximums that are bounded on the right. The moments of the GEV are

The Var(X) exists for κ > −0.5. Other restrictions are

$$\text{for } \kappa > 0 \text{ then } x < \xi + \frac{\alpha}{\kappa}; \qquad \text{for } \kappa < 0 \text{ then } x > \xi + \frac{\alpha}{\kappa}$$

For κ > −1/3, the skewness is


where sign(κ) is +1 or −1 depending on the sign of κ. For κ > −1, the order r probability
weighted moment β_r of a GEV is

and may be estimated by equations 3.71. The L-moments, λᵢ, may then be estimated by equations
3.74.
The parameters of the GEV in terms of L-moments are:

where

Quantile values from the GEV can be determined from

where P_X(x_p) is the cdf of X. In chapter 7 an example of the use of the GEV for flood frequency
analysis is given.
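The cumulative distribution and its inverse can be sketched as a pair of functions. The quantile expression used here, x_p = ξ + (α/κ){1 − [−ln P_X(x_p)]^κ}, is obtained by inverting the cdf given at the start of this section and is assumed to correspond to the quantile relationship referred to above; the κ = 0 branch is the Gumbel limit, and the function names are hypothetical.

```python
import math

def gev_cdf(x, xi, alpha, kappa):
    """prob(X <= x) for location xi, scale alpha, and shape kappa."""
    if kappa == 0.0:                       # Gumbel limit
        return math.exp(-math.exp(-(x - xi) / alpha))
    arg = 1.0 - kappa * (x - xi) / alpha
    if arg <= 0.0:                         # x lies beyond the bound xi + alpha/kappa
        return 1.0 if kappa > 0.0 else 0.0
    return math.exp(-arg ** (1.0 / kappa))

def gev_quantile(p, xi, alpha, kappa):
    """Value x_p with prob(X <= x_p) = p, for 0 < p < 1."""
    if kappa == 0.0:
        return xi - alpha * math.log(-math.log(p))
    return xi + (alpha / kappa) * (1.0 - (-math.log(p)) ** kappa)
```

For example, `gev_quantile(0.99, xi, alpha, kappa)` gives the 100-year event for a series of annual maximums, and for κ > 0 it can never exceed ξ + α/κ.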

BETA DISTRIBUTION
A distribution that has both an upper and lower bound is the beta distribution. Generally, the
beta distribution is defined over the interval 0 to 1. It can, however, be transformed to any interval a
to b. If the limits of the distribution are unknown, they become parameters of the distribution, making
it a 4-parameter rather than a 2-parameter distribution. The beta density function is given by

The function $B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1 - x)^{\beta-1}\,dx$ is called the beta function. The beta function is
related to the gamma function by

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$
The beta function is tabulated. The mean and variance of the beta distribution are

The mean and variance can be used to get the moment estimators for α and β.
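Solving the mean and variance for the two parameters gives closed-form moment estimators. The sketch below assumes the beta distribution on the interval 0 to 1, with mean α/(α + β) and variance αβ/[(α + β)²(α + β + 1)]; the function names are illustrative.

```python
def beta_moments(alpha, beta):
    """Mean and variance of the beta distribution on (0, 1)."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total * total * (total + 1.0))
    return mean, var

def beta_moment_fit(mean, var):
    """Moment estimates (alpha, beta) from a sample mean and variance."""
    common = mean * (1.0 - mean) / var - 1.0     # equals alpha + beta
    if common <= 0.0:
        raise ValueError("variance too large for a beta distribution")
    return mean * common, (1.0 - mean) * common
```

Data on some other interval would first be rescaled to the range 0 to 1 before applying `beta_moment_fit`.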

PEARSON DISTRIBUTIONS
Karl Pearson (Elderton 1953) has proposed that frequency distributions can be represented by

By choosing appropriate values for the parameters, equation 6.90 becomes a large number of
families of distributions including the normal, beta, and gamma distributions.
The Pearson type III has found application in hydrology especially as the distribution of logarithms
of flood peaks. This distribution can be written

with the mode at X = 0. The lower bound of the distribution is X = −a. The difference in the
mean and mode is δ and the value of p_X(x) at the mode is p₀. It can be shown that the Pearson type
III is the same as the 3-parameter gamma distribution. By shifting equation 6.91 so that the mode
is at X = a and the lower bound is at X = 0, we have

The gamma distribution has the mode at (η − 1)/λ and the mean at η/λ. Thus a = (η − 1)/λ
and δ = η/λ − (η − 1)/λ = 1/λ. The value of p_X(x) at the mode for the gamma distribution is

Substituting these quantities into 6.92 results in

which is the gamma distribution.


SOME IMPORTANT DISTRIBUTIONS OF SAMPLE STATISTICS
We have already seen that sample statistics as functions of random variables are themselves
random variables. Statistical tests depend on the probability distribution of test statistics, which
are merely sample statistics. In this section three of the important distributions of sample
statistics are briefly discussed.

Chi-Square Distribution
If Z is a standardized, normally distributed random variable, Z = (X − μ)/σ, then

where Y is the sum of squares of n random values of Z and has a chi-square distribution with n
degrees of freedom. The chi-square distribution is a special case of the gamma distribution when
λ = 1/2 and η is a multiple of 1/2. The distribution thus has a single parameter ν = 2η known as the
degrees of freedom. The expression for the distribution is

The mean and variance of the distribution are

The parameter ν is usually known in any application of the chi-square to statistical testing.
Equation 6.95 produces the moment estimator for ν as ν̂ = x̄. In figure 6.4, the curve labeled
λ = 1/2 is a chi-square distribution with ν = 6. The coefficient of skew for the chi-square
distribution is 2√2/√ν. The cumulative chi-square distribution is contained in the appendix in the
form

for various values of ν and α.


The fact that the sum of squares of n random standard normal variates is a chi-square
distribution with ν = n makes it apparent that if Xᵢ is a chi-square random variable with parameter
nᵢ then X = ΣXᵢ is a chi-square random variable with parameter ν = Σnᵢ if all the Xᵢ are
independent.
If z₁, z₂, ..., zₙ is a random sample from a standard normal distribution, then y = Σzᵢ² =
Σ(xᵢ − x̄)²/σ² has a chi-square distribution with ν = n − 1. Furthermore since s² = Σ(xᵢ − x̄)²/
(n − 1), the quantity (n − 1)s²/σ² has a chi-square distribution with ν = n − 1 (Mood et al. 1974).

The t Distribution

If Y is a standardized normal variate and U is a chi-square variate with ν degrees of freedom
and Y and U are independent, then

has a t distribution with ν degrees of freedom. The t distribution is given by

The mean and variance of the t distribution are

$$\mathrm{Var}(T) = \frac{\nu}{\nu - 2} \qquad \text{for } \nu > 2$$

A table of the cumulative t distribution is in the appendix in the form

for various values of ν and α.


One use of the t distribution is as the sampling distribution of the mean from a normal dis-
tribution with unknown variance. If we write

and then divide the numerator and denominator by σ, we get

where Y = (x̄ − μ)/(σ/√n) has a standard normal distribution and U = (n − 1)s²/σ² has a χ²
distribution. Thus from equation 6.98, T has a t distribution with n − 1 degrees of freedom.
As ν of the t distribution gets large, the t distribution approaches the standard normal distribution.
Thus, for large samples, the sampling distribution of the mean of a normal distribution
with unknown variance approaches a normal distribution. We have already seen that the distri-
bution of the sample mean from a normal distribution with a known variance is exactly a normal
distribution. One can reason that as the sample size increases, the estimate for the variance im-
proves to the point where the sampling distribution of the mean of a normal distribution with
unknown variance can be approximated by the sampling distribution of the mean of a normal
distribution with a known variance, which is itself a normal distribution. In practice one rarely
knows the variance of the distribution from which a sample is obtained.

Example 6.8. A sample of size 8 from a normal distribution results in x̄ = 12.7 and s² = 9.8.
What is the probability that x̄ is in error by more than 1.0?

Solution: (x̄ − μ_x)/(s/√n) has a t distribution with n − 1 degrees of freedom. To be in error by
more than 1.0 units we must have |x̄ − μ_x| > 1.0.

The desired probability is the area to the right of t = 0.904. By interpolation in the t table,
this value is found to be 0.198. By symmetry, the area to the left of -0.904 is 0.198. The desired
probability is 0.198 + 0.198 = 0.396.
If a standard normal distribution had been used rather than a t distribution, it would have been
necessary to find prob(|Z| > 0.904). This probability can be found from the standard normal table
to be 0.366. Thus, even for a sample as small as 8, the normal is a reasonable approximation.
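Both probabilities in example 6.8 can be checked numerically. Python's standard library has no cumulative t distribution, so the sketch below integrates the t density with Simpson's rule; that choice, the number of integration steps, and the function names are assumptions of the illustration, not part of the example.

```python
import math
from statistics import NormalDist

def t_pdf(t, nu):
    """Density of the t distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2.0) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2.0))
    return c * (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)

def t_upper_tail(t, nu, steps=2000):
    """prob(T > t) for t >= 0: one half minus the Simpson integral over [0, t]."""
    h = t / steps
    s = t_pdf(0.0, nu) + t_pdf(t, nu)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 - s * h / 3.0

n, s2 = 8, 9.8
t_stat = 1.0 / math.sqrt(s2 / n)                  # (error of 1.0) / (s / sqrt(n))
p_t = 2.0 * t_upper_tail(t_stat, n - 1)           # two-sided, t with 7 d.f.
p_z = 2.0 * (1.0 - NormalDist().cdf(t_stat))      # normal approximation
```

Repeating the calculation with larger n shows `p_t` and `p_z` converging, as described in the discussion of the t distribution above.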

The F Distribution
If U is a chi-square variate with γ₁ = m degrees of freedom and V is a chi-square variate
with γ₂ = n degrees of freedom and U and V are independent, then

has an F distribution with γ₁ = m and γ₂ = n degrees of freedom (m and n are known as the
numerator and denominator degrees of freedom, respectively). The F distribution is given by

The mean and variance of the F distribution are


The cumulative F distribution is contained in the appendix as a function of m and n for
values of P_F(f) = 0.90, 0.95, 0.975, 0.99, and 0.995.
The table contains values of F_{α,m,n} such that 100α% of the distribution with m and n
degrees of freedom lies to the left of F_{α,m,n}. For example, the probability that a random observation
from an F distribution with 5 numerator and 10 denominator degrees of freedom exceeds
4.2 is 1.00 − 0.975 = 0.025.

TRANSFORMATIONS
Often a transformation can be made in an attempt to arrive at a probability distribution that
will describe the data. Common transformations are logarithmic transformations, translations
along the x axis, and n" power transformations for n = K,, %, 2, and 3.
We have already made one application of the logarithmic transformation to get the lognor-
mal distribution from the normal distribution. Other distributions can be transformed by means
of this transformation as well. Benson (1968) and an Interagency Subcommittee on Water Data
(1982) discuss the use of the log-Pearson type III distribution for flood frequencies.
Translations are especially useful in the case of bounded distributions. We made use of a
translation in deriving the 3-parameter extreme value type III for minimums from the corresponding
2-parameter distribution. In general, a translation is accomplished by subtracting a location
parameter, $\varepsilon$, from the random variable. For example

$$p_X(x) = \lambda e^{-\lambda (x - \varepsilon)}, \qquad x \ge \varepsilon$$

could be considered a 2-parameter exponential distribution with the lower bound at $X = \varepsilon$.


Exercises 3.17 and 3.18 deal with estimating the two parameters of this distribution. In general,
the addition of a displacement parameter, if the displacement parameter is unknown, makes
parameter estimation via maximum likelihood much more difficult.
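Moment estimators are relatively simple in that the addition of a displacement parameter affects
the mean by $\mu = \mu_0 + \varepsilon$, where $\mu_0$ is the mean without the displacement, and has no
effect on the variance or skewness. Thus, a 3-parameter gamma distribution might be given by

$$p_X(x) = \frac{\lambda^{\eta} (x - \varepsilon)^{\eta - 1} e^{-\lambda (x - \varepsilon)}}{\Gamma(\eta)}, \qquad x \ge \varepsilon$$

with the moment estimators for $\lambda$, $\eta$, and $\varepsilon$ determined from

$$\bar{X} = \frac{\hat{\eta}}{\hat{\lambda}} + \hat{\varepsilon}, \qquad s^2 = \frac{\hat{\eta}}{\hat{\lambda}^2}, \qquad \hat{\gamma} = \frac{2}{\sqrt{\hat{\eta}}}$$

The fact that $\gamma$ must now be used to estimate $\eta$ means that for small samples accuracy is lost,
because $\gamma$ is based on the third sample moment. As shown earlier, the 3-parameter gamma is the
same as the Pearson type III distribution.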
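
The moment estimation just described can be sketched in a few lines of Python. This is an illustration under stated assumptions: the gamma density is taken in the form $\lambda^{\eta}(x-\varepsilon)^{\eta-1}e^{-\lambda(x-\varepsilon)}/\Gamma(\eta)$, so that the mean is $\eta/\lambda + \varepsilon$, the variance is $\eta/\lambda^2$, and the skewness is $2/\sqrt{\eta}$; the skewness is assumed positive; and the function name `gamma3_moment_estimates` is hypothetical.

```python
import math

def gamma3_moment_estimates(mean, std, skew):
    """Method-of-moments estimates (lam, eta, eps) for a 3-parameter gamma.

    Assumes positive skew and the relations
    mean = eta/lam + eps, variance = eta/lam**2, skew = 2/sqrt(eta).
    """
    if skew <= 0:
        raise ValueError("this sketch assumes positive skewness")
    eta = 4.0 / skew ** 2            # shape from the skewness
    lam = math.sqrt(eta) / std       # scale from the standard deviation
    eps = mean - eta / lam           # displacement from the mean
    return lam, eta, eps

# Hypothetical sample statistics
lam, eta, eps = gamma3_moment_estimates(mean=10.0, std=2.0, skew=1.0)
print(lam, eta, eps)   # prints 1.0 4.0 6.0
```

Because the shape parameter depends on the square of the sample skewness, small errors in the skewness carry directly into all three estimates.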
Sangal and Biswas (1970) have used the 3-parameter lognormal distribution obtained by fitting
a normal distribution to the logarithms of $(X - \varepsilon)$, where $\varepsilon$ is a parameter that must be
estimated from the data. They found for 10 Canadian rivers that the 3-parameter lognormal
distribution fit the observed distribution of peak flows. They also state that the Gumbel extreme
value distribution is a special case of the 3-parameter lognormal distribution.
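The 3-parameter lognormal is given by

$$p_X(x) = \frac{1}{(x - \varepsilon)\,\sigma_y \sqrt{2\pi}} \exp\left[-\frac{\left(\ln(x - \varepsilon) - \mu_y\right)^2}{2\sigma_y^2}\right], \qquad x > \varepsilon$$

where $\mu_y$ and $\sigma_y^2$ are the mean and variance of $Y = \ln(X - \varepsilon)$.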

Stidd (1953) and Kendall (1967) discuss transforming variables by $Y = X^{1/n}$ and then fitting a
normal distribution to Y. They discuss this transformation in terms of precipitation probabilities.

Exercises

6.1. Show that the mean of the uniform distribution is $(\beta + \alpha)/2$ and the variance is
$(\beta - \alpha)^2/12$.

6.2. What is the skewness and kurtosis of the uniform distribution?

6.3. What is the skewness and coefficient of variation for the exponential distribution?

6.4. Fit the gamma distribution to the data of exercise 2.2. Plot the expected relative frequency
according to the gamma distribution on the plot of exercise 2.2.

6.5. Repeat exercise 6.4 using the lognormal distribution.

6.6. Fit the lognormal distribution to the Kentucky River data of table 2.1. Is this a good
approximation for the data?

6.7. Work exercise 5.8 using the lognormal distribution.


6.8. A set of data having a mean of 4.5 and a standard deviation of 2.0 is thought to follow the
type I extreme value distribution for maximums. What proportion of the observations from this
distribution exceed 6.0? Plot the probability density function.

6.9. Repeat exercise 6.8 using the type I extreme value distribution for minimums.

6.10. Repeat exercise 6.8 using the Weibull distribution.

6.11. Repeat exercise 6.8 using the lognormal distribution.

6.12. Show that the exponential distribution is memoryless [i.e., show that
$\text{prob}(X \ge t + \tau \mid X \ge \tau) = \text{prob}(X \ge t)$].

6.13. Plot the probability density function and the cumulative probability distribution for the
lognormal distribution with $\mu_x$ = 50,000 and $\sigma_x$ = 25,000.

6.14. Plot the theoretical distribution of the largest value selected from a normal distribution
with $\mu = 4$ and $\sigma = 4$ for sample sizes of n = 2, 5, 9, and 33. Compare the results with those of
example 6.6.

6.15. Derive expressions analogous to equations 6.45 and 6.46 for the smallest of n independ-
ently and identically distributed random variables.

6.16. Verify equation 6.52 from equations 6.47 and 6.51.

6.17. Assume that during month 1 the mean and standard deviation of the monthly rainfall are
0.750 and 0.433 inches, respectively. Similarly, during month 2 the mean and standard deviations
of monthly rainfall are 3.000 and 0.866 inches, respectively. Assume monthly rainfall amounts
can be approximated by the gamma distribution and that rainfall in month 2 is independent of
rainfall in month 1. What is the probability of receiving more than 3 inches of rain during the
two-month period?

6.18. Show that for the 2-parameter Weibull distribution the parameter $\alpha$ is a function only of the
coefficient of variation. Using this fact, describe a procedure for estimating $\alpha$ and $\beta$ of the
distribution.

6.19. If peak discharge, q, is lognormally distributed with mean $\mu_q$ and variance $\sigma_q^2$, what is the
probability distribution of stage S? Assume stage and discharge are related by $q = aS^b$.

6.20. Work exercise 6.19 assuming the peak discharges are distributed as the type I extreme
value distribution.
4
6.21. In example 6.3 let be approximated by 6.0. Calculate from equation 6.25 and then
evaluate the prob(yield > 20.0) by using the equation following equation 6.21. Compare the
results with those of example 6.3.

6.22. Use the method of moments to estimate the parameters of the 3-parameter lognormal
distribution for the North Llano River near Junction, Texas. What is the return period of a mean
annual flow of 273 cfs or more?

6.23. Calculate the return period associated with an annual runoff of 0.500 inches for Walnut
Gulch near Tombstone, Arizona (Data in Appendix C). Assume (a) lognormal distribution, (b)
gamma distribution, (c) extreme value type I, (d) normal distribution.

6.24. Assume the data of exercise 4.10 are distributed as a 2-parameter exponential distribution.
Estimate the parameters of this distribution and prepare a table comparing the observed and
expected number of floods over the 100-year period.
7. Frequency Analysis
ONE OF the earliest and most frequent uses of statistics in hydrology has been that of
frequency analysis. Early applications of frequency analysis were largely in the area of flood flow
estimation. Today nearly every phase of hydrology is subjected to frequency analysis. Although
most of the discussion in this chapter centers on flood flows or peak flows, the techniques are
generally applicable to a wide range of problems including runoff volumes, low flows, rainfall
events of various kinds, water quality parameters, measures of ground water levels and flows,
and many other environmental variables. The statistical and mathematical manipulations dis-
cussed in this chapter do not depend on the units of measurement or the quantity measured. The
assumptions that are made, however, must be carefully compared to the situation under study.
The goal of a frequency analysis is to estimate the magnitude of an event having a given
frequency of occurrence or to estimate the frequency of occurrence of an event having a given
magnitude. The frequency is often stated in terms of a return period, T, in years, or a probability
of occurrence in any one year, p. Other terminology commonly used includes the estimation of a
"quantile" or "percentile" of the probability distribution of the quantity of interest. The 100pth
percentile is simply the event having a probability, p, of occurring. The term "quantile" is used in
a similar manner. The 90th quantile is the same as the 90th percentile. The 100pth percentile or the
100pth quantile is the value, $x_p$, of the random variable X satisfying

where p is the exceedance probability, or the probability that $x_p$ is exceeded.


There have been and continue to be volumes of material written on the proper probability
distribution to use in various situations. One cannot, in most instances, analytically determine
which probability distribution should be used. Certain limit theorems such as the Central Limit
Theorem and Extreme Value Theorems might provide guidance. One should also evaluate the ex-
perience that has been accumulated with the various distributions and how well they describe the
phenomena of interest. Certain properties of the distributions can be used in screening distribu-
tions for possible application in a particular situation. For example, the range of the distribution,
the general shape of the distribution, and the skewness of the distribution often indicate that a
particular distribution may or may not be applicable in a given situation. When two or more dis-
tributions appear to describe a given set of data equally well, the distribution that has been tradi-
tionally used should be selected unless there are contrary overriding reasons for selecting another
distribution. However, if a traditionally used distribution is inferior, its use should not be contin-
ued just because "that's the way it's always been done".
The first part of this chapter discusses empirical frequency analysis by plotting data in the
form of a cumulative probability distribution. The second topic covered is analytical frequency
analysis based on probability distributions. A simplified technique based on frequency factors is
shown for determining the magnitude of an event with a given return period. In general, the
frequency factor is a function of the distributional assumption that is made and of the mean,
variance, and, for some distributions, the coefficient of skew of the data. Regional frequency
analysis is then discussed. Regional frequency analysis attempts to use data from several
locations in a "homogeneous" region to determine the frequency relationship for a point having
limited data. The chapter closes with a discussion of the frequency analysis of precipitation data
and other forms of hydrologic data.
Frequency analysis of hydrologic data requires that the data be homogeneous and independent.
The restriction of homogeneity ensures that all the observations are from the same population. Non-
homogeneity may result from a stream gaging station being moved, a watershed becoming
urbanized, or structures being placed in the stream or its major tributaries. Different types of storms,
such as frontal storms and storms associated with hurricanes, may introduce nonhomogeneity. In
this latter situation a mixed population model may be required for the frequency analysis.
The restriction of independence ensures that a hydrologic event such as a single large storm
does not enter the data set more than once. For example, a single storm system may produce two
or more large runoff peaks only one of which (the largest) should enter the data set. Dependence
may also result when a major rainfall occurs, producing very wet antecedent conditions on a
catchment. A subsequent rainfall may then produce much larger flows than would have occurred
had a more normal antecedent condition existed. The flow from the second storm is then de-
pendent on the fact that the first storm had occurred. Runoff from only one of these events, the
largest one, should enter the analysis.
For the prediction of the frequency of future events, the restriction of homogeneity requires
that the data on hand be representative of future flows (i.e., there will be no new structures,
diversions, land use changes, etc., in the case of stream flow data). Recently, the possibility of cli-
mate change has been raised as a factor contributing to nonhomogeneity of a hydrologic record.
If climate change is occurring at a rate rapid enough to affect the usefulness of a particular
hydrologic analysis, this change must be reckoned with in the analysis.

Hydrologic frequency analysis can be made with or without making any distributional
assumptions. The procedure to be followed in either case is much the same. If no distributional
assumptions are made, the observed data are plotted on any kind of paper (not necessarily prob-
ability paper) and judgment used to determine the magnitude of past or future events for various
return periods. If a distributional assumption is made, the magnitude of events for various return
periods is selected from the theoretical "best-fit" line according to the assumed distribution. If an
analytical technique is used, the data should still be plotted so that one can get an idea of how
well the data fit the assumed analytical form and to spot potential problems.

PROBABILITY PLOTTING
Once data for a frequency analysis have been selected, they must be carefully scrutinized to
ensure that all of the observations are valid representations of the hydrologic characteristic under
consideration. For example, in a flood frequency data set consisting of the annual maximum flow,
it is possible that the lower values are merely flows somewhat above the flows for the remainder
of the year but do not truly represent high flows or flood flows. In such a case, some truncation
of low flows might be instituted with the analysis done on the truncated data set and adjusted to
the full record length.
After accepting the data as valid, basic statistics (mean, variance, skewness) of the data
should be computed and the data plotted as a probability plot. Plotting probability density func-
tions and cumulative probability distributions on arithmetic paper has already been discussed. In
general, when the cumulative distribution function, Px(x), is plotted on arithmetic paper versus
the value of X, a straight line does not result. To get a straight line on arithmetic paper, Px(x)
would have to be given by the expression $P_X(x) = ax + b$ or $p_X(x) = a$, the uniform distribution.
Thus, if the cumulative distribution of a set of data plots as a straight line on arithmetic paper, the
data follow a uniform distribution. Probability paper can be developed so that any cumulative
distribution can be plotted as a straight line. Generally, the scaling of the probability axes is
unique for each of the different probability distributions to plot as a straight line. The scaling of
the probability axis may even have to change as the parameters of a particular distribution
change. Constructing probability paper is a process of transforming the probability scale so that
the resulting cumulative curve is a straight line. Many types of probability paper are commercially
available, including paper for the normal, lognormal, exponential, certain cases of the
gamma, extreme value (type I), Weibull, and chi-square distributions.
A few computer software packages provide for plotting using a normal distribution proba-
bility scale. Some of the packages will plot probability directly whereas others use the Z trans-
formation of the normal distribution. The resulting plots are similar. When the Z transformation
is used, the probability associated with the plotted Z values must be independently determined.
The most common probability paper has a normal probability scale and either an arithmetic
(normal probability paper) or logarithmic (lognormal probability paper) scale. Normally distrib-
uted data will plot as a straight line on normal probability paper and lognormally distributed data
will plot as a straight line on lognormal probability paper. One way to determine if data might be
from a normal or lognormal distribution is to plot the data on normal and lognormal probability
paper and visually determine if a straight line is obtained.
A probability plot is a plot of a magnitude versus a probability. Determining the probability
to assign a data point is commonly referred to as determining the plotting position. For a
population, determining the plotting position is merely a matter of determining the fraction of
the data values less (greater) than or equal to the value in question. Thus the smallest (largest)
population value would plot at 0 and the largest (smallest) population value would plot at 1.00.
Assigning plotting positions to sample data is not as straightforward. Generally, a sample will
not contain the smallest or largest value of the unknown population. Thus, plotting positions of
0 and 1 should be avoided for sample data unless one has additional information on the popula-
tion limits.
Plotting position may be expressed as a probability from 0 to 1 or a percent from 0 to 100.
Which method is being used should be clear from the context. In some discussions of
probability plotting, especially in hydrologic literature, the probability scale is used to denote
prob(X > x) or 1 - Px(x). In this book we will adopt this convention. The reason for this is that
the return period, Tx(x), is 1/prob(X > x) = 1/(1 - Px(x)), or the reciprocal of the probabil-
ity scale. One can always transform the probability scale from 1 - Px(x) to Px(x) or even Tx(x)
if desired.
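
The conversions among Px(x), the exceedance probability, and the return period are simple enough to state as code. This is a minimal sketch; the function names `exceedance_from_cdf` and `return_period` are hypothetical.

```python
def exceedance_from_cdf(cdf_value):
    """prob(X > x) = 1 - Px(x)."""
    return 1.0 - cdf_value

def return_period(p):
    """Return period in years for an annual exceedance probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("exceedance probability must be in (0, 1]")
    return 1.0 / p

# An event with Px(x) = 0.99 is exceeded with probability 0.01 in any year,
# which corresponds to a return period of about 100 years.
T = return_period(exceedance_from_cdf(0.99))
print(round(T, 1))
```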
Probability plotting of hydrologic data requires that individual observations or data points
be independent of each other and that the sample data be representative of the population (unbi-
ased). Some common types of sample data are complete duration series, annual series, partial
duration series, and extreme value series.
The complete duration series consists of all available data. An example would be all the
available daily flow data for a stream. This particular data set would most likely not have inde-
pendent observations. Complete duration series data are rarely subjected to a standard frequency
analysis because they likely contain significant serial correlation. Since what is generally of
interest are rare events, often only the largest or smallest event over a period of time, generally
one year, is selected. Such a series is known as the annual series. The data in table 2.1 are an
annual series.
The partial duration series consists of all values above (below) a certain base. All peak flows
above 40,000 cfs in the Kentucky River, Salvisa, Kentucky, would represent a partial duration
series. This series may have more or fewer values in it than the annual series. For example, there
would be 9 years that would not have contributed any data to a partial duration series with a base
of 40,000 cfs for the data in table 2.1; however, some years may have more than one peak above
the base. The partial duration series is also known as the 'peaks over threshold' series.
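
The difference between the two series can be illustrated with a short Python sketch. The data below are hypothetical (they are not the table 2.1 flows), the function names are illustrative, and the peaks are assumed to have already been screened for independence.

```python
def annual_series(peaks):
    """Largest flow in each year from an iterable of (year, flow) pairs."""
    largest = {}
    for year, flow in peaks:
        if year not in largest or flow > largest[year]:
            largest[year] = flow
    return [largest[year] for year in sorted(largest)]

def partial_duration_series(peaks, base):
    """All flows above the base, regardless of the year in which they occurred."""
    return [flow for _, flow in peaks if flow > base]

# Hypothetical independent flood peaks (year, cfs)
peaks = [(1990, 52000), (1990, 41000), (1991, 38000), (1992, 47000)]

print(annual_series(peaks))                    # one value per year
print(partial_duration_series(peaks, 40000))   # 1991 contributes nothing, 1990 contributes twice
```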
The annual series and the partial duration series approach one another for long return
periods. Beard (1974) has shown that the relationship between annual series and partial duration
series flood peaks varies throughout the United States and recommends the use of empirically de-
rived, regionalized relationships. Frequently, the annual series and the partial duration series are
combined so that the largest (smallest) annual value plus all independent values above (below)
some base are used. For periods of record longer than about 10 years, the annual series and the
partial duration series give very similar results.
The extreme value series consists of the largest (smallest) observation in a given time inter-
val. The annual series is a special case of the extreme value series with the time interval being
one year.
Regardless of the type of sample data used, the plotting position can be determined in the
same manner. Gumbel (1958) states the following criteria for plotting position relationships:

1. The plotting position must be such that all observations can be plotted.

2. The plotting position should lie between the observed frequencies of (m - 1)/n and m/n
where m is the rank of the observation beginning with m = 1 for the largest (smallest) value
and n is the number of years of record (if applicable) or the number of observations.

3. The return period of a value equal to or larger than the largest observation and the return
period of a value equal to or smaller than the smallest observation should converge toward n.

4. The observations should be equally spaced on the frequency scale.

5. The plotting position should have an intuitive meaning, be analytically simple, and be easy to use.

Several plotting position relationships are presented in Chow (1964) and Singh (1992). A general
plotting position relationship is given by

$$pp = \frac{m - a}{n + b} \qquad (7.2)$$

where a and b are constants (Adamowski 1981). Some of the most common relationships for
plotting positions are shown in Table 7.1. Unless specifically stated to the contrary, the Weibull
relationship is used in the remainder of this book. Benson (1962a), in a comparative study of
several plotting position relationships, found on the basis of theoretical sampling from extreme

Table 7.1. Common plotting position relationships

Name          Source                  Relationship

California    California (1923)       m/n

Hazen         Hazen (1930)            (2m - 1)/(2n)

Weibull       Weibull (1939)          m/(n + 1)

Cunnane       Cunnane (1978)          (m - 0.4)/(n + 0.2)

Gringorten    Rao and Hamed (2000)    (m - 0.44)/(n + 0.12)

Adamowski     Adamowski (1981)        (m - 0.25)/(n + 0.5)


value and the normal distributions that the Weibull relationship provided estimates that were
consistent with experience.
The Weibull plotting position formula meets all 5 of the above criteria: 1) All of the
observations can be plotted since the plotting positions range from 1/(n + 1), which is greater than
zero, to n/(n + 1), which is less than one. Probability paper for distributions with infinitely long tails
does not contain the points zero and one; 2) The relationship m/(n + 1) lies between (m - 1)/n and
m/n for all values of m and n; 3) The return period of the largest value is (n + 1)/1, which
approaches n as n gets large, and the return period of the smallest value is (n + 1)/n = 1 + 1/n,
which approaches 1 as n gets large; 4) The difference between the plotting positions of the (m + 1)th
and mth values is 1/(n + 1) for all values of m and n; and 5) The fact that condition 3 is met plus the
simplicity of the Weibull relationship fulfills condition 5.
One objection to the Hazen plotting position is that the return period for the largest
(m = 1) event is 2n, or twice the record length. An objection to the California plotting position
is that the smallest value (m = n) has a plotting position of 1, which implies that the smallest
sample value is the smallest possible value. A value of 1 cannot be plotted on many types of
probability paper.
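
The behavior of the different relationships in the tails can be examined with a short Python sketch. It assumes the general form (m - a)/(n + b) for the plotting position; the constants listed for each name are the commonly quoted ones and should be checked against Table 7.1, and the names `PLOTTING_CONSTANTS` and `plotting_position` are hypothetical.

```python
# (a, b) pairs for the assumed general relationship pp = (m - a)/(n + b)
PLOTTING_CONSTANTS = {
    "California": (0.0, 0.0),
    "Hazen": (0.5, 0.0),
    "Weibull": (0.0, 1.0),
    "Cunnane": (0.4, 0.2),
    "Gringorten": (0.44, 0.12),
    "Adamowski": (0.25, 0.5),
}

def plotting_position(m, n, a, b):
    """Plotting position of the observation with rank m in a sample of size n."""
    return (m - a) / (n + b)

n = 20
for name, (a, b) in PLOTTING_CONSTANTS.items():
    largest = plotting_position(1, n, a, b)    # rank 1 = largest value
    smallest = plotting_position(n, n, a, b)   # rank n = smallest value
    print(f"{name:11s} T(largest) = {1 / largest:6.1f}  pp(smallest) = {smallest:.4f}")
```

For n = 20 the Hazen relationship assigns the largest value a return period of 2n = 40 years and the California relationship assigns the smallest value a plotting position of 1, which are the two objections noted above.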
It should be noted that all of the relationships give similar values near the center of the
distribution but may vary considerably in the tails. Predicting extreme events depends on the tails

Table 7.2. Determination of plotting position for Kentucky River data

Flow Rank pp Flow Rank pp Flow Rank pp Flow Rank pp



of the distribution, so care must be exercised. The quantity 1 - Px(x) represents the probability
of an event with a magnitude equal to or greater than the event in question. When the data are
ranked from the largest (m = 1) to the smallest (m = n), the plotting positions correspond to
1 - Px(x). If the data are ranked from the smallest (m = 1) to the largest (m = n), the plotting
position formulas are still valid; however, the plotting position now corresponds to the
probability of an event equal to or smaller than the event in question, which is Px(x). Probability
paper may contain scales of Px(x), 1 - Px(x), Tx(x), or a combination of these.
Plotting data on probability paper results in an empirical distribution of the data. As an
example of probability plotting, consider the data in table 2.1. The steps in plotting these data are:

1. Rank the data from the largest (smallest) to the smallest (largest) value. If two or more obser-
vations have the same value, several procedures can be used for assigning a plotting position.
The procedure adopted here is to assume they have different values and assign each a unique
rank. For example, in the data of Table 7.2, the value of 82,900 is assigned a rank of both 22
and 23 since it occurs twice in the data set.

2. Calculate the plotting position.

3. Select the type of probability paper to be used. Normal probability paper is used in this
example.

4. Plot the observations on the probability paper.

The data of Table 2.1 are ranked and the plotting positions calculated based on the Weibull
relationship in Table 7.2. Figure 7.1 presents the plotted data.
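
The ranking and plotting position steps can be sketched in Python as follows. The flows are hypothetical values chosen for illustration (they are not the Table 2.1 data), the function name `weibull_plotting_positions` is an assumption, and `statistics.NormalDist` is used only to convert each probability to the standard normal Z value that locates it on a normal probability scale.

```python
from statistics import NormalDist

def weibull_plotting_positions(values):
    """Return (rank, value, exceedance plotting position) tuples, largest value first."""
    ranked = sorted(values, reverse=True)   # tied values receive separate, consecutive ranks
    n = len(ranked)
    return [(m, x, m / (n + 1)) for m, x in enumerate(ranked, start=1)]

# Hypothetical annual peak flows (cfs); one value occurs twice
flows = [66600, 82900, 71000, 82900, 54200, 91300, 60500]

for rank, flow, pp in weibull_plotting_positions(flows):
    z = NormalDist().inv_cdf(1.0 - pp)      # Z value for plotting on a normal scale
    print(rank, flow, round(pp, 3), round(z, 2))
```

Plotting flow against Z on arithmetic axes is equivalent to plotting flow against probability on normal probability paper.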

Fig. 7.1. Normal probability plot of Kentucky River flow data (horizontal axis: exceedance probability, percent).


When probability plots are made and a line drawn through the data, the tendency to extrap-
olate the data to high return periods is great. The distance on the probability paper from a return
period of 20 years to a return period of 200 years is not very much; however, it represents a
10-fold extrapolation of the data. If the data do not truly follow the assumed distribution with
population parameters equal to the sample statistics, the error in this extrapolation can be quite
large. This fact has already been referred to when it was stated that the estimation of probabilities
in the tails of distributions is very sensitive to distributional assumptions. Because one of the
usual purposes of probability plotting is to estimate events with longer return periods, Blench
(1959) and Dalrymple (1960) have criticized the blind use of analytical flood frequency methods
because of this tendency toward extrapolation.
If a set of data plots as a straight line on probability paper, the data can be said to be distrib-
uted as the distribution corresponding to the probability paper. Because it would be rare for a set
of data to plot exactly on a line, a decision must be made as to whether or not the deviations from
the line are random deviations or represent true deviations, indicating that the data do not follow
the given probability distribution. Examining figure 7.1, it is apparent that, with the exception
of the largest value, the deviations from a straight line are small. It might be assumed that the
data can be approximated by the normal distribution.
So far two tests, both based on judgment, have been described for determining if a set of
data follows a certain distribution. The first method was to visually compare observed and theo-
retical frequency histograms and the second to visually compare observed and theoretical
cumulative frequency curves in the form of probability plots. In chapter 8, statistical tests based
on these two visual tests will be presented.

Historical Data
Occasionally, flood information outside of the systematic flow record is available from his-
torical sources such as newspaper reports, earlier flood investigations, or from paleohydrologic
investigations. Such data contain valuable information that should not be ignored in a frequency
analysis. Bulletin 17B of the United States Water Resources Council (1981) demonstrates com-
puting the plotting position of the historical observations on the basis of the historical record
length. Likewise, the plotting position of the systematic data is computed on the basis of the
historic record length, except that the rank used in the calculation is adjusted by a factor, W,
depending on the historic record length, H, the number of historic flows, Z, and the length of the
systematic record, N. These are related by

$$W = \frac{H - Z}{N}$$

The adjusted rank for the systematic data is

$$\tilde{m} = Wm - (W - 1)(Z + 0.5)$$

with m being the unadjusted rank of the total record (systematic plus historic).
Thus, if 20 years of systematic data and 2 historic observations larger than any values in the
systematic record are available from a 50-year period preceding the systematic record, the historic
record length is H = 70 years and the plotting positions for the 2 largest values would be
1/71 = 0.014 and 2/71 = 0.028. The weighting factor would be

$$W = \frac{70 - 2}{20} = 3.40$$

The remaining plotting positions would be calculated from the adjusted rank given by

$$\tilde{m} = 3.40m - (3.40 - 1)(2 + 0.5) = 3.40m - 6$$
The adjusted rank is then used in the plotting position relationship (equation 7.2). Thus, for
m = 3 (the largest systematic flow observation), the plotting position using the Weibull plotting
position relationship would be [3.40(3) - 6]/71, or 0.0592, and for m = 22 (the smallest value)
the plotting position would be [3.40(22) - 6]/71, or 0.9690. This compares to plotting positions
of 1/21, or 0.0476, and 20/21, or 0.9524, respectively, if the historic data had been ignored. If
the historic data had simply been used to augment the systematic record without using the
weighting factor, the plotting positions for the two historic events would have been 1/23, or 0.0435,
and 2/23, or 0.0870, respectively. Clearly, a plotting position of 0.0435 assigns too high a
probability of occurrence to the largest historic value. Knowledge that there were 48 years
with no flows larger than the two historical events has been ignored in this latter case. It is also
apparent that the weighting procedure adjusts the plotting position toward a more frequent
occurrence for the largest systematic value thus taking into account the fact that two flows greater
in magnitude than the largest systematic flow occurred.
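
The historic adjustment can be reproduced with a short Python sketch. The adjusted rank is taken as Wm - (W - 1)(Z + 0.5) for the systematic observations, which is the Bulletin 17B form and is consistent with the values quoted above; historic observations keep their unadjusted rank. The function names are hypothetical.

```python
def historic_weight(H, Z, N):
    """Weighting factor W for a historic period H, Z historic flows, N systematic years."""
    return (H - Z) / N

def adjusted_rank(m, W, Z):
    """Rank used in the plotting position; historic peaks (m <= Z) are not adjusted."""
    if m <= Z:
        return float(m)
    return W * m - (W - 1.0) * (Z + 0.5)

def weibull_position(rank, H):
    """Weibull plotting position based on the historic record length H."""
    return rank / (H + 1)

H, Z, N = 70, 2, 20
W = historic_weight(H, Z, N)
positions = {m: weibull_position(adjusted_rank(m, W, Z), H) for m in (1, 2, 3, 22)}
for m, pp in positions.items():
    print(m, round(pp, 4))
```

For H = 70, Z = 2, and N = 20 the sketch gives about 0.0592 for m = 3 and 0.9690 for m = 22, matching the values in the text.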
Bulletin 17B also suggests the flow statistics be computed by weighting the contribution of
the systematic record to the various statistics by the factor W. Thus the adjusted mean is

where X represents the systematic record and $X_z$ the historic data. Similarly, the variance and
skew can be determined from

If a log based distribution such as the lognormal or log Pearson III is being used, the X's and $X_z$'s
would be based on logarithms.
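
A possible implementation of the weighted statistics is sketched below. It is an assumption-laden illustration, not a transcription of the equations above: it uses the Bulletin 17B weighting with no low outliers removed, so that the divisor for the mean is H, the divisor for the variance is H - 1, and the skew carries the factor H/[(H - 1)(H - 2)]. The function name `historically_weighted_moments` and the sample values are hypothetical.

```python
def historically_weighted_moments(systematic, historic, H):
    """Weighted mean, variance, and skew (assumes no low outliers were removed)."""
    N, Z = len(systematic), len(historic)
    W = (H - Z) / N                      # weight applied to each systematic observation
    mean = (W * sum(systematic) + sum(historic)) / H
    dev2 = (W * sum((x - mean) ** 2 for x in systematic)
            + sum((x - mean) ** 2 for x in historic))
    dev3 = (W * sum((x - mean) ** 3 for x in systematic)
            + sum((x - mean) ** 3 for x in historic))
    variance = dev2 / (H - 1)
    skew = H * dev3 / ((H - 1) * (H - 2) * variance ** 1.5)
    return mean, variance, skew

# Hypothetical log10 flows: 5 systematic years and 1 historic peak in a 12-year period
log_systematic = [3.9, 4.1, 4.0, 4.3, 3.8]
log_historic = [4.7]
print(historically_weighted_moments(log_systematic, log_historic, H=12))
```

When there are no historic observations and H equals the systematic record length, W is 1 and the expressions reduce to the ordinary sample mean, variance, and skew.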
Outliers
When probability plots of hydrologic data are made, frequently one or two extreme events
are present that appear to be from a different population because they plot far off of the line
defined by the other points. For example, it is entirely possible that a 100-year event is contained
in 10 years of record. If this is the case, assigning a normal plotting position of 1/11 to this value
would not be reflective of its true return period. Unfortunately, the true return period is not
known. The treatment of these "outliers" is an unresolved and controversial question. The fact
that this occurs frequently in hydrologic data should not be surprising.
Using methods discussed in chapter 4, the probability of at least one occurrence of an n-year
event in a k-year record can be calculated as $1 - (1 - 1/n)^k$. For example, the probability of at
least one occurrence of a 100-year event in a 32-year record is $1 - 0.99^{32}$, or 0.275. If we have
four independent 32-year records, we expect one to contain at least one 100-year event. This is
the case even though the 100-year event is from the same population as the other 31 events in the
32-year record.
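
This calculation is easily checked. The following Python lines are a minimal sketch, with `prob_at_least_one` a hypothetical function name.

```python
def prob_at_least_one(return_period, years):
    """Probability of at least one event with the given return period in a record of given length."""
    return 1.0 - (1.0 - 1.0 / return_period) ** years

p = prob_at_least_one(100, 32)
print(round(p, 3))        # probability for one 32-year record
print(round(4 * p, 2))    # expected number of such records among four independent records
```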
Bulletin 17B suggests that outliers can be identified from

where $X_H$ and $X_L$ are threshold values for high and low outliers and $K_n$ can be approximated from

$$K_n = 1.055 + 0.981 \log_{10} n \qquad (7.8)$$

where n is the number of observations.


If a peak in the record exceeds XH and historical information of the type discussed earlier
is available regarding that peak, it should be removed from the systematic record and treated as
a historical observation as discussed in the section Historical Data. If historical information on the
flow is not available, it should be retained as a part of the systematic data. If a flow is less than
XL, that value should be deleted from the record and conditional probability procedures as
explained in the section on Treatment of Zeros should be employed. More detail on the treat-
ment of outliers is contained in Bulletin 17B. One should be very reluctant to ignore high
outliers in a flood frequency analysis unless strong evidence exists that the data point contains
substantial error.
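
The outlier screening can be sketched in Python as follows. Equation 7.8 is used for $K_n$; the threshold form, mean plus or minus $K_n$ standard deviations, is an assumption here, as is applying it to the logarithms of the flows (the practice in Bulletin 17B). The function names and the data are hypothetical.

```python
import math
from statistics import mean, stdev

def outlier_factor(n):
    """Approximate K_n from equation 7.8 for a sample of n observations."""
    return 1.055 + 0.981 * math.log10(n)

def outlier_thresholds(values):
    """Low and high thresholds, assumed to be mean -/+ K_n * standard deviation."""
    k = outlier_factor(len(values))
    centre, spread = mean(values), stdev(values)
    return centre - k * spread, centre + k * spread

# Hypothetical log10 annual peak flows
log_flows = [4.21, 4.35, 4.10, 4.52, 4.40, 4.28, 4.95, 4.33, 4.18, 4.47]
low, high = outlier_thresholds(log_flows)
flagged = [x for x in log_flows if x < low or x > high]
print(round(low, 3), round(high, 3), flagged)
```

A flagged value is only a candidate for special treatment; the text above describes when it should be retained, treated as historic data, or handled by conditional probability procedures.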

ANALYTICAL HYDROLOGIC FREQUENCY ANALYSIS


Probability plotting without any distributional assumptions is an empirical method of
frequency analysis. If one is willing to make a distributional assumption concerning a data set, an
analytical frequency analysis may be done by estimating the parameters of the assumed
distribution and then using the fitted distribution to estimate the relationship between magnitudes
and probabilities. Such a procedure would be a direct application of the techniques of chapter 6.
For example, if the lognormal distribution is used, the parameters of the distribution would be
estimated based on either the actual observations or their logarithms. Then the magnitude of a
flow having a particular exceedance probability or return period would be based on the lognormal
distribution and the estimated parameters.
Fitting probability distributions to data and estimating quantiles or probabilities from these
distributions has the advantage of smoothing the data and of making it possible to standardize
frequency estimation procedures. It also provides a consistent way of extrapolating short records
to obtain estimates corresponding to 50- to 200-year flows. Of course, such extrapolations are
fraught with ambiguities. The selection of an appropriate probability density function is critical, as
is having an adequate sample from which to estimate the parameters of the selected distribution.
Rao and Hamed (2000) have an extensive discussion of the mathematical properties of most
of the probability distributions that are used in hydrologic frequency analysis.
Chow (1951) has shown that many frequency analyses can be reduced to the form

$$X_T = \bar{X}(1 + C_v K_T) \tag{7.9}$$

where $X_T$ is the magnitude of the event having a return period T and $K_T$ is a frequency factor. This
relationship comes about by writing any X as

$$X = \bar{X} + \Delta X \tag{7.10}$$

and then stating that $\Delta X$, the deviation from the mean, is the product of the standard deviation s
and a frequency factor K.

$$X_T = \bar{X} + s K_T \tag{7.11}$$

$K_T$ depends on the probability distribution being used and the return period.
Recalling that $C_v = s/\bar{X}$, equation 7.11 takes on the form of equation 7.9. Chow (1951,
1964) presents the frequency factors for many different types of frequency distributions.
Equation 7.9 can also be used to construct the probability scale on plotting paper so that the
distribution corresponding to $K_T$ plots as a straight line. The use of frequency factors is equivalent
to using the method of moments for estimating the parameters of a pdf.

Normal Distribution
For the normal distribution it can easily be shown that KT is the standardized normal
variate Z. The standard normal distribution, along with equation 7.9, can be used to determine
the magnitude of normally distributed events corresponding to various probabilities. For
example, the magnitude of a 20-year peak flow for the data of table 2.1 can be determined by
calculating

and
The 20-year event corresponds to a prob(X > x) of 0.05, so the probability of an event less
than the 20-year event is 0.95. The value of Z corresponding to a probability of 0.95 is found
from standard normal tables to be 1.645. Thus

$$X_{20} = \bar{X}(1 + C_v K_{20}) = 66{,}540(1 + 0.335 \times 1.645) = 103{,}209 \text{ cfs}$$

which agrees with the value given by figure 7.1.
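The normal-distribution case of equation 7.9 can be sketched as below, using the standard library's `statistics.NormalDist` in place of a printed normal table. The function name is hypothetical; the mean and coefficient of variation are the values used in the calculation above.

```python
from statistics import NormalDist


def normal_quantile(mean_flow: float, cv: float, return_period: float) -> float:
    """X_T = mean * (1 + Cv * K_T), with K_T the standard normal variate
    whose non-exceedance probability is 1 - 1/T."""
    k_t = NormalDist().inv_cdf(1.0 - 1.0 / return_period)
    return mean_flow * (1.0 + cv * k_t)


x20 = normal_quantile(66_540, 0.335, 20)
print(round(x20))
```

The result differs from 103,209 cfs by a few cfs because the inverse normal is evaluated to more digits than the tabulated 1.645.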

Lognormal Distribution
For the lognormal distribution, the magnitude of a flow with a given return period can be
determined by recalling that the logarithms of the flow are normally distributed. The data are first
converted to their natural logarithms by Y = ln(X). The mean and standard deviation of the
logarithms are then determined. $X_T$ is then given by

$$X_T = \exp(\bar{Y} + K_N s_y) \tag{7.12}$$

where $\bar{Y}$ and $s_y$ are based on the natural logarithms of X, and $K_N$ is from the standard normal
distribution.
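The lognormal procedure can be sketched as follows, assuming the form $X_T = \exp(\bar{Y} + K_N s_y)$ with sample statistics of the natural logarithms. The function name and the flows are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def lognormal_quantile(flows, return_period: float) -> float:
    """Estimate X_T from a sample, assuming a lognormal distribution."""
    logs = [math.log(q) for q in flows]          # Y = ln(X)
    y_bar, s_y = mean(logs), stdev(logs)         # statistics of the logarithms
    k_n = NormalDist().inv_cdf(1.0 - 1.0 / return_period)
    return math.exp(y_bar + k_n * s_y)


# Hypothetical annual peaks (cfs), for illustration only
peaks = [820.0, 1150.0, 1400.0, 1900.0, 2600.0, 3300.0, 5200.0]
for t in (2, 10, 100):
    print(t, round(lognormal_quantile(peaks, t)))
```

For T = 2 the frequency factor is zero, so the estimate is the geometric mean of the sample.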

Log Pearson Type III Distribution


Benson (1968) reported on a method of flood frequency analysis based on the log Pearson
type III distribution, which is obtained when the base 10 logarithms of observed data are used
along with the Pearson type III distribution (equation 6.91). This method is applied as follows:

1. Transform the n annual flood magnitudes, $X_i$, to their logarithmic values, $Y_i$ (i.e., $Y_i = \log_{10} X_i$
for i = 1, 2, ..., n).

2. Compute the mean logarithm, $\bar{Y}$.

3. Compute the standard deviation of the logarithms, $s_y$.

4. Compute the coefficient of skewness, $C_s$.

5. Compute

$$Y_T = \bar{Y} + s_y K_T$$

where $K_T$ is obtained from table 7.3. Note that this relationship is identical to equation 7.9 except
the logarithms are used.

6. Compute $X_T = \text{antilog } Y_T = 10^{Y_T}$.

This method has as a special case the lognormal distribution when $C_s = 0$. For short periods
of record, the skew coefficient calculated from equation 7.13 may not be a reliable estimate of the
population skew coefficient and it may be desirable to replace it with a regionalized coefficient
(Beard 1962, 1974; Benson 1968). Figure 7.2 contains regionalized skew coefficients of annual
streamflow maximum logarithms computed by the U.S. Geological Survey.
The frequency factors of table 7.3 can be used for the Pearson type III distribution in the
same manner as for the log Pearson type III. The actual data values rather than their logarithms
would then be used.

Table 7.3a. $K_T$ values for positive skew coefficients, Pearson type III distribution

                     Recurrence interval (years)
Skew coef.   1.0101    2     5     10    25    50    100   200
                     Percent chance (≥)
             99        50    20    10    4     2     1     0.5

Source: Interagency Advisory Committee on Water Data (1982).

Table 7.3b. $K_T$ values for negative skew coefficients, Pearson type III distribution

                     Recurrence interval (years)
Skew coef.   1.0101    2     5     10    25    50    100   200
                     Percent chance (≥)
             99        50    20    10    4     2     1     0.5

Source: Interagency Advisory Committee on Water Data (1982).
Fig. 7.2. Generalized skew coefficients of annual maximum stream flow logarithms.

Approximate values of $K_T$ for the Pearson type III distribution are given by

$$K_T = \frac{2}{C_s}\left\{\left[\left(K_n - \frac{C_s}{6}\right)\frac{C_s}{6} + 1\right]^3 - 1\right\} \tag{7.14}$$

where $K_n$ is the standard normal deviate (Interagency Advisory Committee on Water Data
1982). Because of certain limitations on this approximation, the use of the table for $K_T$ is
recommended. Obviously, the use of analytic approximations for $K_T$ for any of the distributions
makes the calculations for flows of various return periods quite easy using spreadsheets or other
computer software. Table 7.4 contains the maximum percent error in equation 7.14 as compared
to table 7.3. Note that a 1% error in $K_T$ does not translate directly to a 1% error in flow. For
example, when the log Pearson type III is used in example problem 7.2, the 100-year flow is
estimated at 29,719 cfs. The skewness of the logarithms was 0.296, so use of equation 7.14 has
a maximum error of 0.09%. With such an error, $K_T$ would be 1.0009 × 2.542, or 2.544, and
the resulting flow estimate would be 29,752 cfs, which represents a difference of 0.11% from
the estimate using the table value. This is a very small error when one considers the uncertainties
present in estimates of this kind. Often interpolation has to be done in table 7.3, which may
introduce more error than the use of equation 7.14. Only for $C_s < -2.5$ is the error in $K_T$ > 2%
for T of 50, 100, and 200 years.

Table 7.4. Errors in the use of equation 7.14 for estimating $K_T$, log Pearson distribution
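The approximation for $K_T$ and the log Pearson type III steps can be sketched as follows. The cube-form expression coded here is the approximation published in Bulletin 17B; the function names and the `base` argument are hypothetical, and the statistics are those reported for example problem 7.2, which uses natural logarithms.

```python
import math
from statistics import NormalDist


def pearson3_k(skew: float, return_period: float) -> float:
    """Approximate Pearson type III frequency factor K_T.

    Falls back to the standard normal deviate when the skew is essentially
    zero, which is the lognormal special case.
    """
    k_n = NormalDist().inv_cdf(1.0 - 1.0 / return_period)
    if abs(skew) < 1e-8:
        return k_n
    g6 = skew / 6.0
    return (2.0 / skew) * (((k_n - g6) * g6 + 1.0) ** 3 - 1.0)


def lp3_quantile(log_mean, log_std, skew, return_period, base=10.0):
    """X_T = base ** (mean + std * K_T), statistics taken on logarithms of that base."""
    return base ** (log_mean + log_std * pearson3_k(skew, return_period))


# Statistics of the natural logarithms from example problem 7.2
k100 = pearson3_k(0.296, 100)
q100 = lp3_quantile(8.568, 0.681, 0.296, 100, base=math.e)
print(round(k100, 3), round(q100))
```

The computed flow differs slightly from the 29,719 cfs quoted above because the logarithmic statistics are used here only to three decimals.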

Extreme Value Type I Distribution (Gumbel Distribution)


Chow (1951) presents the following relationship for the frequency factor for the extreme
value type I maximum distribution

$$K_T = -\frac{\sqrt{6}}{\pi}\left\{\gamma_e + \ln\left[\ln\left(\frac{T_X(x)}{T_X(x) - 1}\right)\right]\right\} \tag{7.15}$$

where $\gamma_e$ is Euler's constant (0.577216) and $T_X(x)$ is the desired return period of the quantity
being calculated. Potter (1949) presents some curves that simplified the application of the
extreme value type I. Kendall (1959) presents the frequency factors shown in table 7.5 for the
extreme value type I distribution. The values computed from equation 7.15 are equivalent to an
infinite sample size in table 7.5.
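A short sketch of the extreme value type I frequency factor follows. It codes Chow's closed form for the infinite-sample case; the function name is hypothetical, and the Black Bear Creek mean and standard deviation from example problem 7.2 are used only to illustrate equation 7.11.

```python
import math

EULER_GAMMA = 0.577216  # Euler's constant, as quoted in the text


def gumbel_k(return_period: float) -> float:
    """Frequency factor for the extreme value type I (maximum) distribution,
    corresponding to an infinite sample size."""
    inner = math.log(return_period / (return_period - 1.0))
    return -(math.sqrt(6.0) / math.pi) * (EULER_GAMMA + math.log(inner))


for t in (2, 20, 100):
    k_t = gumbel_k(t)
    flow = 6683 + 5337 * k_t  # X_T = mean + s * K_T (equation 7.11)
    print(t, round(k_t, 3), round(flow))
```

For finite samples the factors of table 7.5 are larger than these limiting values, so the sketch will underestimate relative to the table for short records.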

Table 7.5. Frequency factors for extreme value type I distribution

Sample size n        Return period
Other Distributions
Any of the distributions discussed in chapter 6 can be fit to data by using the methods
discussed in that chapter. Frequency factors for some of the other distributions are given by Chow
(1951, 1964).

GENERAL CONSIDERATIONS
Many proponents (and opponents) of one analytical form or another for flood flow frequen-
cies have come to the fore over the past few decades. The proponents claim that some particular
method is superior to some other method and "prove" their claim by a few rationalizations and
some case studies. The fact remains that these rationalizations involve questionable assumptions.
There is no direct theoretical connection between any analytical form of the frequency distribu-
tion and the underlying mechanisms governing flood flows except through the limit theorems.
The primary consideration in selecting a particular analytical form for the frequency distribution
is that the distribution "fit" the observed data (Anderson 1967; Benson 1968).
Benson (1968) reported on the results of a study by a work group consisting of 18 represen-
tatives from 12 federal agencies of the U.S. government. This group studied 6 methods of flood
frequency analysis on 10 streams located throughout the United States. The records on these
streams ranged in length from 40 to 97 years with an average of 55 years. The drainage areas
ranged from 16.4 to 36,800 square miles. The six methods of analysis consisted of 1) the gamma
distribution, 2) Gumbel distribution, 3) Gumbel distribution using the logarithms of the data, 4)
lognormal distribution, 5) log Pearson type III distribution, and 6) Hazen's method. The
computational procedures used were much like those presented in this book. The Hazen method consists
of using an equation like equation 7.9 along with a table of empirically derived frequency factors
that are a function of the return period and the coefficient of skew (Hazen 1930). Large
differences were produced by the 6 different methods, especially at long return periods. The results
showed that the lognormal, log Pearson type III, and Hazen methods were about equally good.
The group suggested that the log Pearson type III be used unless there was a good reason to use
some other method. This recommendation was made even though the group realized that "there
are no rigorous statistical criteria on which to base a choice of method". Benson's (1968) report
states that the study showed that "the range of uncertainty in flood analysis, regardless of the
method used, is still quite large" and that many questions concerning it remain unresolved.
In a follow-up study, Beard (1974) examined flood peaks from 300 stations scattered throughout
the United States. Several probability distributions were tried, including the log Pearson type III,
lognormal, Gumbel's extreme value distribution, and the 2- and 3-parameter gamma distributions.
Beard concluded that only the lognormal and log Pearson type III with a regionalized skew
coefficient were not greatly biased in estimating future flood frequencies. He stated that the latter
distribution produced somewhat more consistent results but that "... regardless of the methodology
employed, substantial uncertainty in frequency estimates from station data will exist ...".
In selecting a particular analytical form for a frequency curve, one may be tempted to select
a distribution with a large number of parameters. Generally, the more parameters a distribution
has, the better it will adapt to a set of data. However, for the sample size usually available in
hydrology, the reliability in estimating more than 2 or 3 parameters may be quite low. Thus, a
compromise must be made between flexibility of the distribution and reliability of the parameters.
Recognizing the short record lengths often available for frequency analysis, methods of aug-
menting natural data by synthetic data are being developed. In some cases the rainfall record per-
taining to a watershed is much longer than its streamflow record. In this event it may be possible to
calibrate a deterministic streamflow model to the watershed and then use the long rainfall record to
generate a long synthetic streamflow hydrograph. This synthetic hydrograph can then be combined
with existing data into a single frequency analysis. In the absence of rainfall records, it may be pos-
sible to transfer records from a nearby station or to stochastically generate a series of rainfall data.
These data could then be used with the calibrated deterministic model to augment natural streamflow
data. One might consider weighting the natural data more than the augmented data in the final
frequency analysis. Regression and correlation techniques might be used to relate peak flows to
rainfall or to peaks from nearby gages, and this relationship used to extend the available record.
It was because of the many factors and uncertainties that are involved in the selection of a
probability distribution to use in flood frequency determinations, that several agencies of the U.S.
Federal government developed the guidelines published as "Guidelines for Determining Flood
Flow Frequency," commonly known as Bulletin 17B (Interagency Advisory Committee on Water
Data, 1982). Bulletin 17B has become a standard for flood frequency analysis of annual flood
peak discharges.
The developers of Bulletin 17B recognized that "there is no procedure or set of procedures
that can be adopted which, when rigidly applied to the available data, will accurately define that
flood potential of any given watershed. Statistical analysis alone will not resolve all flood
frequency problems." The basic Bulletin 17B approach is to use the log Pearson type III distribution
as explained above. Because this distribution is a 3-parameter distribution, the coefficient of
skew is used when estimating the parameters by the method of moments.
The skew coefficient is sensitive to extreme flood values and thus difficult to estimate from
the small samples typically available for many hydrologic studies. Figure 7.2 presents a map of
generalized skew coefficients for the logs of peak flows taken from Bulletin 17B of the Interagency
Committee. The station skew coefficient calculated from observed data and generalized skew
coefficients can be combined to improve the overall estimate for the skew coefficient. Under the
assumption that the generalized skew is unbiased and independent of the station skew, the mean
square error (MSE) of the weighted estimate is minimized by weighting the station and generalized
skew in inverse proportion to their individual mean square errors according to the equation
(Tasker 1978):

$$G_w = \frac{MSE_{\bar{G}}\,G + MSE_G\,\bar{G}}{MSE_{\bar{G}} + MSE_G} \tag{7.16}$$

where $G_w$ is the weighted skew coefficient, G is the station skew (from equation 7.13), $\bar{G}$ is the
generalized skew (from figure 7.2), $MSE_{\bar{G}}$ is the mean square error of the generalized skew, and
$MSE_G$ is the mean square error of the station skew. $MSE_{\bar{G}}$ is taken as a constant, 0.302, when the
generalized skew is estimated from figure 7.2. $MSE_G$ can be estimated from (Wallis, Matalas, and
Slack 1974):

$$MSE_G = 10^{A - B\log_{10}(N/10)} \tag{7.17}$$

where

A = -0.33 + 0.08|G|   if |G| ≤ 0.90          (7.18)
  = -0.52 + 0.30|G|   if |G| > 0.90

B = 0.94 - 0.26|G|    if |G| ≤ 1.50
  = 0.55              if |G| > 1.50

N = record length

It is recommended that if the generalized and station skews differ by more than 0.5, the data
and flood producing characteristics of the watershed should be examined and possibly greater
weight given to the station skew.
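The skew weighting can be sketched as below. The coefficients for A are those of equation 7.18; the expression for B and the power-of-ten form of the station-skew MSE are taken from Bulletin 17B and should be read as assumptions here. Function names are hypothetical, and the inputs are the station skew, generalized skew, and record length quoted for Black Bear Creek.

```python
import math


def mse_station_skew(station_skew: float, record_length: int) -> float:
    """Approximate mean square error of the station skew (Bulletin 17B form)."""
    g = abs(station_skew)
    a = -0.33 + 0.08 * g if g <= 0.90 else -0.52 + 0.30 * g
    b = 0.94 - 0.26 * g if g <= 1.50 else 0.55  # assumed from Bulletin 17B
    return 10.0 ** (a - b * math.log10(record_length / 10.0))


def weighted_skew(station_skew, generalized_skew, record_length, mse_generalized=0.302):
    """Weight the two skews in inverse proportion to their mean square errors."""
    mse_station = mse_station_skew(station_skew, record_length)
    numerator = mse_generalized * station_skew + mse_station * generalized_skew
    return numerator / (mse_generalized + mse_station)


g_w = weighted_skew(0.296, -0.22, 53)
differ_widely = abs(0.296 - (-0.22)) > 0.5  # triggers the review recommended above
print(round(g_w, 3), differ_widely)
```

Because the two skews differ by more than 0.5 in this case, the recommendation above to examine the data and consider giving more weight to the station skew applies.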

CONFIDENCE INTERVALS
Any stream flow record is but a sample of all possible such records. How well the sample
represents the population depends on the sample size and the underlying population probability
distribution, which is unknown. Both the form and parameters of the underlying distribution
must be estimated. If a second sample of data were available, certainly different estimates would
result for the parameters of the distribution even if the same distribution were selected. Different
parameter estimates will obviously result in different return period flow estimates. If many
samples were available, many estimates could be made of the distribution parameters and
consequently many estimates could be made of return period flows, say $Q_{100}$. One could then
examine the probabilistic behavior of these estimates of $Q_{100}$. The fraction of the $Q_{100}$'s that fell
between certain limits could be determined.
In actuality, we have just one sample of data from which to make estimates of $Q_T$. Statistical
procedures are available for estimating confidence intervals about estimated values of $Q_T$ that
will give a measure of the uncertainty associated with $Q_T$. Confidence limits give a probability that
the interval between them contains the true value of $Q_T$. A 90% confidence interval indicates that
90% of the intervals so calculated will contain the true value of $Q_T$.
Letting $L_T$ and $U_T$ be the lower and upper confidence limits

$$\text{prob}(L_T \le X_T \le U_T) = \alpha \tag{7.19}$$

where $\alpha$ is the degree of confidence expressed as a percent. Exact determination of $L_T$ and $U_T$
depends on the underlying parent population. Bulletin 17B of the Interagency Committee presents
some approximate relationships for confidence intervals

$$L_T = \bar{X} + s_x K_{T,L} \qquad U_T = \bar{X} + s_x K_{T,U} \tag{7.20}$$

where $\bar{X}$ and $s_x$ are the sample mean and standard deviation and $K_{T,L}$ and $K_{T,U}$ are the lower and
upper confidence coefficients. If a distribution like the log Pearson type III distribution is used, $\bar{X}$
and $s_x$ are based on the logarithms of the data and $L_T$ and $U_T$ are the logarithms of the confidence
limits.
Approximations for $K_{T,L}$ and $K_{T,U}$ based on large samples and the noncentral t-distribution are

$$K_{T,L} = \frac{K_T - \sqrt{K_T^2 - ab}}{a} \qquad K_{T,U} = \frac{K_T + \sqrt{K_T^2 - ab}}{a} \tag{7.21}$$

where

$$a = 1 - \frac{Z_c^2}{2(n - 1)} \tag{7.22a}$$

$$b = K_T^2 - \frac{Z_c^2}{n} \tag{7.22b}$$

In these relationships, $K_T$ is the frequency factor of equation 7.9, and $Z_c$ is the standard normal
deviate with cumulative probability $c = 50 + \alpha/2$ if $\alpha$ is expressed as a percent. If $\alpha$ is 90%,
then c is 95%. The sample size is n. Confidence limits can be placed on frequency curves plotted
on probability paper by making calculations such as above for several values of T.
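The confidence coefficients can be sketched as follows, using the large-sample approximation given in Bulletin 17B with coefficients a and b as described above. The function name and the fractional `confidence` argument are hypothetical; the inputs are the 100-year log Pearson type III frequency factor, record length, and natural-log statistics from example problem 7.2.

```python
import math
from statistics import NormalDist


def confidence_factors(k_t: float, n: int, confidence: float = 0.90):
    """Return (K_TL, K_TU) for a two-sided interval with the given confidence
    (as a fraction), using the large-sample approximation."""
    z_c = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    a = 1.0 - z_c ** 2 / (2.0 * (n - 1))
    b = k_t ** 2 - z_c ** 2 / n
    root = math.sqrt(k_t ** 2 - a * b)
    return (k_t - root) / a, (k_t + root) / a


k_lower, k_upper = confidence_factors(2.542, 53, confidence=0.90)

# Limits in log space, then transformed back to flow units (natural logs)
lower_flow = math.exp(8.568 + 0.681 * k_lower)
upper_flow = math.exp(8.568 + 0.681 * k_upper)
print(round(k_lower, 4), round(k_upper, 4), round(lower_flow), round(upper_flow))
```

The interval is not symmetric about the estimate in flow units because the limits are computed on the logarithms.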

TREATMENT OF ZEROS
Most hydrologic variables are bounded on the left by zero. A zero in a set of data that is be-
ing logarithmically transformed requires special handling. One solution is to add a small constant
to all of the observations. Another method is to analyze the non-zero values and then adjust the
relation to the full period of record. This method biases the results as the zero values are
essentially ignored. A third and theoretically more sound method would be to use the theorem of total
probability (equation 2.10).

$$\text{prob}(X \ge x) = \text{prob}(X \ge x \mid X = 0)\,\text{prob}(X = 0) + \text{prob}(X \ge x \mid X \ne 0)\,\text{prob}(X \ne 0)$$

Because prob(X ≥ x | X = 0) is zero for x > 0, the relationship reduces to

$$\text{prob}(X \ge x) = \text{prob}(X \ne 0)\,\text{prob}(X \ge x \mid X \ne 0)$$

In this relationship, prob(X ≠ 0) would be estimated by the fraction of non-zero values and
prob(X ≥ x | X ≠ 0) would be estimated by a standard analysis of the non-zero values with the
sample size taken to be equal to the number of non-zero values. This relation can be written as a
function of cumulative probability distributions.

$$1 - P_X(x) = k[1 - P_X^*(x)]$$

or

$$P_X(x) = 1 - k + k P_X^*(x) \tag{7.23}$$

where $P_X(x)$ is the cumulative probability distribution of all X (prob(X ≤ x | X ≥ 0)), k is the
probability that X is not zero, and $P_X^*(x)$ is the cumulative probability distribution of the
non-zero values of X (i.e., prob(X ≤ x | X ≠ 0)). This type of mixed distribution with a finite
probability that X = 0 and a continuous distribution of probability for X > 0 was discussed in
chapter 2. Jennings and Benson (1969) have demonstrated the applicability of this approach to
analyzing flood flow frequencies with zeros present.
Equation 7.23 can be used to estimate the magnitude of an event with return period $T_X(x)$ by
solving first for $P_X^*(x)$ and then using the inverse transformation of $P_X^*(x)$ to get the value of X.
For example, the 10-year event with k = 0.95 is found to be the value of X satisfying

$$P_X^*(x) = \frac{P_X(x) - 1 + k}{k} = \frac{0.90 - 1 + 0.95}{0.95} = 0.895$$

Note that it is possible to generate negative estimates for $P_X^*(x)$ from equation 7.23. For
example, if k = 0.25 and $P_X(x)$ = 0.50, the estimated $P_X^*(x)$ is

$$P_X^*(x) = \frac{0.50 - 1 + 0.25}{0.25} = -1.0$$

This merely means that the value of X corresponding to $P_X(x)$ = 0.50 is zero. This makes sense
because $P_X(x)$ = 0.50 corresponds to the 2-year flow, or the flow equaled or exceeded every
other year. If only 25% or 1/4 of the annual flows are greater than zero, then the flow exceeded
every other year must be zero.

Example 7.1. Seventy-five years of peak flow data are available from an annual series; 20 of
the values are zero; and the remaining 55 values have a mean of 100 cfs, a standard deviation
of 35.1 cfs, and are lognormally distributed. (a) Estimate the probability of a peak exceeding
125 cfs. (b) Estimate the magnitude of the 25-year peak flow.

Solution:

(a) prob(X > 125) = 1 - prob(X ≤ 125) = 1 - $P_X(125)$

Applying equation 7.23

$$P_X(125) = 1 - k + k P_X^*(125), \qquad k = 55/75 = 0.733$$

$P_X^*(125)$ can be evaluated by solving equation 7.12 for $K_N$ and then using the table for the
normal distribution to get the desired probability.
From equations 6.35 and 6.36

$$\bar{Y} = 4.547, \qquad s_y = 0.341, \qquad K_N = \frac{\ln 125 - 4.547}{0.341} = 0.825$$

From a table of the standard normal distribution, this $K_N$ for a $C_v$ of 0.351 corresponds to
$P_X^*(125)$ = prob(X ≤ 125 | X ≠ 0) of 0.795.

$P_X(125)$ = 1 - 0.733 + 0.733(0.795) = 0.850

prob(X ≥ 125) = 1 - $P_X(125)$ = 0.15 or T = 1/0.15 = 6.7 yrs.

The probability of a peak flow in any year exceeding 125 cfs is 0.15. The conditional probability
of a peak exceeding 125 cfs given that the peak is not zero is 1 - 0.795 = 0.205.

(b) $P_X^*(x)$ = [$P_X(x)$ - 1 + k]/k = [1 - (1/T) - 1 + k]/k

= (1 - 0.04 - 1 + 0.733)/0.733 = 0.945

The value of X corresponding to $P_X^*(x)$ = 0.945 can be obtained from equation 7.12. Z for
$P_X^*(x)$ = 0.945 is 1.60. Therefore, $X_{25}$ = exp(4.547 + 0.341 × 1.60) = 163 cfs.
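Example 7.1 can be reproduced with the short sketch below. It assumes the usual lognormal moment relations, $s_y^2 = \ln(1 + C_v^2)$ and $\bar{Y} = \ln \bar{X} - s_y^2/2$, which give the values 4.547 and 0.341 used in the example; the function names are hypothetical.

```python
import math
from statistics import NormalDist

n_total, n_zero = 75, 20
k = (n_total - n_zero) / n_total            # fraction of non-zero peaks
mean_x, sd_x = 100.0, 35.1                  # statistics of the non-zero peaks (cfs)

cv = sd_x / mean_x
s_y = math.sqrt(math.log(1.0 + cv ** 2))    # assumed lognormal moment relation
y_bar = math.log(mean_x) - 0.5 * s_y ** 2   # assumed lognormal moment relation
std_normal = NormalDist()


def cdf_all(x: float) -> float:
    """P_X(x) = 1 - k + k * P_X*(x) for the full record, zeros included."""
    p_star = std_normal.cdf((math.log(x) - y_bar) / s_y)
    return 1.0 - k + k * p_star


def quantile_all(return_period: float) -> float:
    """Flow with the given return period; zero when P_X*(x) is not positive."""
    p_star = (1.0 - 1.0 / return_period - 1.0 + k) / k
    if p_star <= 0.0:
        return 0.0
    return math.exp(y_bar + s_y * std_normal.inv_cdf(p_star))


print(round(1.0 - cdf_all(125.0), 3))   # part (a): exceedance probability
print(round(quantile_all(25)))          # part (b): 25-year peak
```

The guard for non-positive $P_X^*(x)$ implements the interpretation given earlier: a negative estimate simply means the requested flow is zero.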

Example 7.2. Table 7.6 contains annual peak flow data for Black Bear Creek near Pawnee,
Oklahoma, for the years 1945 through 1997.

(a) Plot the data on normal and lognormal probability paper.

(b) Plot the "best" fitting normal, lognormal, extreme value type I, and log Pearson type III
distributions on the plot of part a.

(c) Estimate the 100-year peak flow based on the four distributions of part b.

(d) Estimate the 90% confidence intervals on the log Pearson type III estimates.

(e) Estimate the 100-year peak flow using the log Pearson type III with a weighted skew
coefficient based on the station skew and the generalized skew coefficients.

Table 7.6. Annual peak flow data for Black Bear Creek near Pawnee, Oklahoma

Year   Flow (cfs)     Year   Flow (cfs)     Year   Flow (cfs)

Solution:

(a) The plotting positions are calculated by ranking the data from largest to smallest and
then using the relationship pp = m/(n + 1) where m is the rank and n is the number of
observations (53). Since the largest observation is 30,200 cfs, it is assigned a pp of 1/54,
or 0.0185. The second largest value is 19,200 cfs with a pp of 2/54 or 0.0370, and so
forth until the smallest value of 1,560 cfs with a pp of 53/54 or 0.9815. The data are
plotted in figure 7.3.
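The ranking and plotting-position step can be sketched as follows. The function name is hypothetical, and the flows in the usage lines are placeholders rather than the Table 7.6 record, apart from the three values quoted above.

```python
def plotting_positions(flows):
    """Rank flows from largest to smallest and assign pp = m / (n + 1),
    where m is the rank (1 = largest) and n is the number of observations."""
    ranked = sorted(flows, reverse=True)
    n = len(ranked)
    return [(flow, rank / (n + 1)) for rank, flow in enumerate(ranked, start=1)]


# Placeholder record of 53 values; only 30,200, 19,200, and 1,560 come from the text
record = [30200, 19200] + [2000 + 100 * i for i in range(50)] + [1560]
positions = plotting_positions(record)
print(positions[0], positions[1], positions[-1])
```

Each pair is a flow and its estimated exceedance probability, which is what is plotted against the probability scale.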

(b) The best fitting lines for the various distributions can be obtained by calculating several
points from equation 7.9. The basic statistics of the data are found to be

               Data      ln of data

Mean           6683      8.568
Std dev        5337      0.681
Skewness       2.262     0.296

The next step is to determine the appropriate frequency factors for various return periods for
the four distributions. The frequency factor for the normal and lognormal distributions comes
from the standard normal distribution. $K_T$ for the extreme value and log Pearson distributions
comes from equations 7.15 and 7.14, respectively. Sample calculations follow for a return period
of 20 years.

prob      N flow      LN flow      EV1 flow      LP3 flow

Fig. 7.3a. Flood frequency curves for Black Bear Creek using the standard normal z and
arithmetic flow scales.

Fig. 7.3b. Flood frequency curves for Black Bear Creek using the standard normal z and
logarithmic flow scales.

Fig. 7.3c. Flood frequency curves for Black Bear Creek using normal probability paper
(horizontal axis: exceedance probability).

Fig. 7.3d. Flood frequency curves for Black Bear Creek using lognormal probability paper
(horizontal axis: exceedance probability; curves: normal, lognormal, extreme value, log Pearson;
points: data).

Normal distribution:

Lognormal distribution:

$$X_T = \exp[\bar{Y}(1 + C_{vY} K_T)] = \exp\left[8.568\left(1 + \frac{0.681}{8.568}\,1.645\right)\right] = 16{,}132$$

Extreme value distribution:

Log Pearson distribution:


The calculations must be based on the logarithms.

$$X_T = \exp[\bar{Y}(1 + C_{vY} K_T)] = \exp\left[8.568\left(1 + \frac{0.681}{8.568}\,1.724\right)\right] = 17{,}029$$

Figures 7.3a-d show the resulting plot of the data and the best fitting distributions. The four
plots all contain the same information but show different formats. The first two plots use the z
transformation and the second two plots use normal probability scales. Both arithmetic and log-
arithmic scales are shown for flow. Note that the normal distribution plots as a straight line when
the arithmetic scale is used and the lognormal distribution plots as a straight line when the log-
arithmic scale is used.

(c) The 100-year flow estimates are contained in the last line of the above table.

(d) The calculations of the confidence intervals are contained in the following table:

   (4)            (5)
$K_{T,L}$      $K_{T,U}$

               -1.7499
                0.1786
                1.1122
                1.6588
                2.1360
                2.2795
                2.6994
                3.0891

Fig. 7.4. Flood frequency curves for Black Bear Creek with confidence intervals (horizontal axis:
exceedance probability).

Explanation of columns in above table: (1) Return period; (2) From equation 7.14; (3)
From equation 7.22b; (4) and (5) From equation 7.21; (6) and (7) Equation 7.20; (8) Exp(col(6));
(9) Exp(col(7)); (10) Last column of previous table.
The results are plotted in figure 7.4.

(e) Station skew, G, is 0.296. The generalized skew from figure 7.2 is -0.22. From equa-
tions 7.16 to 7.18 we get

$$Q_{100} = \exp[\bar{Y}(1 + C_{vY} K_T)] = \exp\left[8.568\left(1 + \frac{0.681}{8.568}\,2.3082\right)\right] = 25{,}333 \text{ cfs}$$

Example 7.3. Estimate the 100-year flow for Black Bear Creek using the GEV distribution.

Solution:
From example problem 7.2, $\bar{X}$ = 6683, $s_x$ = 5337, and $C_s$ = 0.296.
From equations 3.71

From equations 3.74

From equations 6.81-6.84

$$\xi = \lambda_1 + \frac{\alpha[\Gamma(1 + \kappa) - 1]}{\kappa} = 4108$$

From equation 6.85 for T = 100 and $P_X(x_T)$ = 0.99

$$x_T = \xi + \frac{\alpha}{\kappa}\left(1 - [-\ln(P_X(x_T))]^{\kappa}\right) = 29{,}803 \text{ cfs}$$

TRUNCATION OF LOW FLOWS


In some situations the lower flows in an annual series of peak flow data may not truly
represent flood flows. Such a situation may arise when no particularly heavy rainfalls or snow
melts occur over a period of a year. It may then be desirable to truncate or delete these low peak
flow values and analyze the remaining peaks, adjusting the probabilities as discussed in the
Treatment of Zeros section. Figure 7.5 shows the estimated 100-year peak flow for Black Bear
Creek using data from example problem 7.2, the lognormal distribution, and various truncation
levels. For example, if a truncation level of 3000 cfs is selected, the 11 values less than 3000
would be truncated and the k in equation 7.23 would be (53 - 11)/53, or 0.79.

Fig. 7.5. Estimated 100-year flow as a function of truncation level for Black Bear Creek
(horizontal axis: truncation level, cfs).

USE OF PALEOHYDROLOGIC DATA


Baker (1987) has defined paleohydrology as the study of past or ancient flood events that
occurred prior to the time of human observation or direct measurement. Paleohydrologic
techniques provide means of obtaining data over periods of time much longer than are available
from systematic records or even historical data. Paleohydrologic data may enable the evaluation
of long-term hydrologic conditions by complementing existing short-term systematic and histor-
ical records, providing information at ungaged locations, and helping reduce uncertainty in flow
estimates. Paleohydrology is discussed by Baker (1987), Kochel and Baker (1982), Costa (1987),
Jarrett (1991), and Stedinger and Baker (1987).
Once the magnitude and year of occurrence of a paleoflood are determined, that flow value can
be assigned a return period. For example, if it is determined that 3000 years ago there was a flood
in excess of any flow since that time, the flow could be assigned a return period of 3000 years.
Questions of the stationarity of flood flows, the dating of paleofloods, and the difficulty of esti-
mating the magnitude of paleofloods must be addressed in any paleoflood study.

PROBABLE MAXIMUM FLOOD


The probable maximum flood (PMF) is the flow that can reasonably be expected from the most
severe combination of meteorologic and hydrologic conditions, those that maximize runoff, for
the drainage basin in question. A PMF does not directly enter into a
flood frequency analysis since the probability of such a flood is unknown. The PMF may provide
an upper bound to a frequency analysis. The concept of a PMF has been criticized (Yevjevich
1968) as being neither probable nor maximum, yet it has found wide use for hydrologic designs
for facilities whose failure would endanger human life or cause great economic loss.
DISCUSSION OF FLOOD FREQUENCY DETERMINATIONS
A flood frequency study concerning the American River near Sacramento, California
(National Research Council 1999), illustrates many of the difficulties in flood frequency analysis.
Data available in that study included 93 years of systematic data, historical data, paleohydrologic
data, and data derived through the use of rainfall-runoff models. Still, considerable controversy
exists as to the proper estimate for the 100- and 200-year flood flows.
The foundation of any frequency analysis is the selection of a particular probability
distribution for describing the data. The parameters of this distribution are estimated and the
magnitudes of events for various return periods are calculated. Methods for plotting the observed
data on probability paper and for constructing the best fitting line according to the selected
distribution have also been discussed.
At this point it should be clear that there is nothing inherently hydrologic about frequency
analysis procedures. They are simply statistical techniques that operate on numbers. The fact that
the numbers being used are peak flows is of no concern to the technique. It should be of great
concern to the analyst, however.
Statistical frequency analysis simply attempts to extract information about the probabilistic
behavior of a set of numbers from the numbers themselves. In hydrologic frequency analysis this
probabilistic behavior is then generally extrapolated by the analyst to frequencies of occurrence
well beyond that contained in the original set of numbers. From these extrapolations the flows
having return periods of 25, 50, 100, or even 500 years are determined. The straightforward
application of hydrologic frequency analysis as generally employed uses little or no hydrologic
knowledge. In actuality, rare flows are determined by the hydrologic conditions that exist at
the time of these flows and not by the statistical behavior of a sample of maximum peak flows
that may have occurred some time in the past. Resolving the apparent conflict between these
statements is what separates the hydrologist from the statistician.
Statistics are descriptive of a set of observed data. Statistics do not define a cause and effect
relationship or a physical relationship. Any conclusion drawn on the basis of a statistical fre-
quency analysis assumes that the sample of data on hand is representative of a wider range of data
known as the population. In hydrologic terms what this means is that if we have a sample of
15 years or so of observed annual maximum peak flows and use these data to estimate the 100-year
flood, we are assuming the hydrologic behavior of the basin during the 100-year flood is some-
how imprinted in the 15 years of observed data and that the statistical technique being used can
uncover this imprint and use it. To determine if this is truly the case, the hydrology of the basin
must be examined. Some of the questions that must be answered are:

1. Is the type of storm that is likely to produce the 100-year flow represented in the observed
sample?

2. Is the contributing area of the basin the same for extreme floods as it is for small ones?

3. Are there ponds and reservoirs that may discharge at high rates during rare floods and not dur-
ing smaller flows? What is the possibility of a dam breach and what would be the resulting
flow?

4. Are the channel flow and storage characteristics the same for extreme flows as they are for
smaller flows?

5. Are land use and soil characteristics such that flows from rare storms may relate to precipita-
tion in a manner different from more common storms?

6. Are there seasonal effects such that rare floods are more likely to occur in a different season
than the more common floods?

7. Is the rare flood represented in the sample of data? If so how is it treated? Is it assigned a re-
turn period of 15 years where in fact its return period may be much greater than that?

8. Are changes going on within the basin that may cause change in the hydrologic response of
the basin to rainstorms?

9. Are there climatic changes occurring that may influence flood flow frequencies?
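
Question 7 can be made concrete with a short calculation. The sketch below assumes independent years and a stationary flood series, which are assumptions of this illustration rather than statements from the list above. It computes the chance that a short record happens to contain at least one flood as rare as the T-year event. The function name is illustrative only.

```python
def prob_at_least_one(T, n):
    """Chance that n independent years include at least one flood
    equal to or larger than the T-year event."""
    return 1.0 - (1.0 - 1.0 / T) ** n

# A 15-year record and the 100-year flood, as in the discussion above.
p = prob_at_least_one(T=100, n=15)
print(f"{p:.2f}")  # about 0.14
```

Under these assumptions roughly one 15-year record in seven contains such a flood. If it is present, a plotting position based on 15 years of data understates its return period.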

These last few paragraphs paint a discouraging picture for flood frequency analysis. That
need not be the case as long as one does not discard hydrologic knowledge in the process.
Often, the questions posed can be answered in such a way as to make the statistical analysis
valid. At other times, when problems with the statistical procedures are recognized, adjust-
ments can be made in the resulting flow estimates to more accurately reflect the hydrology of
the situation.
Hydrologic frequency analysis should be used as an aid in estimating rare floods. Some-
times the estimates made on the basis of the statistical frequency analysis can be taken as the
final estimate. Sometimes the statistical estimate may need to be adjusted to better reflect the
hydrology of the situation.
It should be kept in mind that other hydrologic estimation techniques suffer from some of
the same difficulties as do the statistical techniques. For example, if a hydrologic model is being
employed, the parameters of the model must be estimated in some way. This is generally done on
the basis of observed data from the basin in question, from observed data from a similar basin, or
from so-called physical relationships such as Manning's equation, infiltration parameters, and so
forth, and a set of accompanying tables. Regardless of how the parameters are estimated, the
same type of questions regarding these estimates and the nature of the hydrologic model itself
must be answered as outlined above for frequency analysis estimates. We cannot substitute math-
ematical and empirical relationships for hydrologic knowledge any more than we can substitute
statistics for hydrologic knowledge.
Based on this discussion, one might conclude that the magnitude of rare events should not
be estimated because the estimates may be so uncertain. Generally, however, this is not one of the
options available. An estimate must be made. Hydrology must not be ignored in making this
estimate. Statistical, modeling, or empirical flow estimates should be made and then adjusted, if
required, to reflect the hydrologic situation. This is not to say a factor of safety is to be applied.
Adjustments should be based on hydrology, not rules of thumb.
REGIONAL FREQUENCY ANALYSIS
Regional flood frequency analysis has three major components, namely, delineation of
homogeneous regions, determination of appropriate probability density functions (or frequency
curves) of the observed data, and the development of a regional flood frequency model (i.e., a
relationship between flows of different return periods, basin characteristics, and climatic data).

Delineation of Homogeneous Regions


Effective regionalization requires defining regions, generally geographic regions, that are
similar and then capturing hydrologic relationships, generally empirical, for the region. The
reason for this is that better predictions should result using data from a hydrologically similar
region than from a dissimilar region. The standard error of estimate for homogeneous regions
should be less than the standard error of estimate obtained without dividing the area into homo-
geneous regions.
Regions are generally defined based on several considerations. Among these considerations
are political boundaries, catchment boundaries, geologic boundaries, and climatic boundaries.
Regional definitions require judgment. Generally, regions cannot be defined independent of the
personal judgment of the individual doing the analysis. A particular technique for assigning data
to homogeneous groupings, known as Cluster Analysis, is covered in chapter 12.
Often, the actual definition of a region is arrived at somewhat by trial and error. Data are an-
alyzed, grouped, and examined. Data may be moved from one group to another in an attempt to
improve the quality of the analytic relationship being developed.
Although regional estimation techniques, such as the regional flood frequency analysis,
have been useful in the transfer of data from gaged to ungaged sites, they have also ushered in
several problems. If all the gaged stations simply represent realizations of the same underlying
population, then a straightforward pooling approach would be appropriate. The Index Flood
method discussed earlier makes such an assumption. This method assumes that the region from
which the observed data are obtained is homogeneous. The first task which must be completed
in the process of regional flood frequency analysis is therefore the identification of homoge-
neous regions.
Homogeneous regions may be defined as regions having similar hydrologic, climatic,
and physiographic characteristics. The criteria most often used to delineate homogeneous
regions are based on either geographic consideration (basin characteristics, weather regimes)
or basin response characteristics (such as probability distributions and regional statistical
flood parameters-e.g., skewness, coefficient of variation, etc.). There seems to be no
uniquely objective approach to the delineation of homogeneous regions. It is generally
agreed, however, that grouping basins within a homogeneous region will yield regional
relationships with lower standard errors than those for entirely different areas (Kite 1977).
Residual analysis has occasionally been used as a tool for defining homogeneous regions. The
residual pattern from a linear regression of a given design flood for the entire study area is
examined and regions are then delineated on the basis of geographic proximity of the positive
and negative residuals (Gingras and Adamowski 1993).
A second approach to defining homogeneous regions is to group all stations with the same
probability distributions or those that have constant distribution parameters (Hosking et al. 1985a,b).
De Coursey (1973) applied discriminant analysis, a multivariate procedure, to flood data from
Oklahoma to form groups of basins having a similar flood response. Burn (1988, 1989, 1990)
described techniques for identifying homogeneous regions based on the correlation structure of
the observed data, cluster analysis, and the Region of Influence (ROI) approach, respectively.
The importance of identifying hydrologically homogeneous regions was further demonstrated by
Lettenmaier et al. (1987) in a study that showed the effect on extreme flow estimation of regions
containing heterogeneity.
Of the many approaches that have been used to identify homogeneous regions, cluster
analysis, a multivariate technique, has gained prominence in this field. This is primarily
because, although cluster analysis does not entirely eliminate the subjective decisions
associated with the other methods, it greatly facilitates interpretation of a data set. The objective
of cluster analysis is to group gaging stations that have similar hydrologic or basin
characteristics. The most common similarity measure in cluster analysis is the Euclidean distance.
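
As a minimal sketch of the Euclidean distance as a similarity measure, the code below standardizes a few basin characteristics and computes the distance between every pair of stations. The station names, the characteristics chosen, and their values are hypothetical. Standardizing each characteristic first is an assumption of this sketch, made so that no single unit dominates the distance.

```python
import math
from statistics import mean, stdev

# Hypothetical stations: (drainage area km^2, mean annual rainfall mm, slope %).
stations = {
    "A": (120.0, 850.0, 2.1),
    "B": (135.0, 870.0, 2.4),
    "C": (900.0, 450.0, 0.6),
}

def standardize(table):
    """Rescale each characteristic to zero mean and unit standard deviation."""
    columns = list(zip(*table.values()))
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return {name: tuple((v - mu) / sd for v, mu, sd in zip(row, mus, sds))
            for name, row in table.items()}

def pairwise_distances(table):
    """Euclidean distance between every pair of standardized stations."""
    z = standardize(table)
    names = sorted(z)
    return {(a, b): math.dist(z[a], z[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

distances = pairwise_distances(stations)
closest = min(distances, key=distances.get)
print(closest)  # the pair a clustering algorithm would merge first
```

A clustering algorithm repeats this kind of comparison while merging stations or groups. Choosing which characteristics to include remains a matter of judgment.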

Historical Development
Weldu (1995) reviewed several articles on regional flow estimation. The earliest approach
to the regionalization problem was to use empirical equations relating flood flow to drainage area
within a particular region (Benson 1962c). The formulas were based on few data for a particular
region and contain one or more constants whose values are empirically determined. Such a
formula, in generalized form, is

$$Q = C A^{n}$$

where Q is the flow, C is a coefficient related to the region, A is the drainage area, and n is an
empirically determined exponent. The above equation, although simple to derive and apply, does
not address the frequency of the flow, nor does it account for the effect of variations in
precipitation or topography on the flows. The various "culvert formulas" used by railroad and
highway engineers, such as the Talbot formula
(AISI 1967), are of this general type. The Talbot formula is widely used and is denoted by

$$a = C A^{3/4}$$

where a = cross-sectional area of culvert in ft².


Various empirical formulas were later devised that attempted to include the concept of fre-
quency and that involved rainfall in computing flood peaks. Perhaps the most widely used of
such formulas is the Rational Formula (Shaw 1983), which expresses the peak flow ($Q_p$) in terms
of the rainfall intensity (i) with the desired return period, drainage area (A), and a coefficient that
accounts for basin characteristics (C) as

$$Q_p = C i A$$

One major weakness in this type of empirical formula is that the coefficients will remain
constant only within regions in which other hydrologic factors vary little, which implies that the
regions must of necessity be fairly small.
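
A minimal sketch of the Rational Formula follows. The unit handling is an assumption of this illustration: intensity in mm/h and area in km², so the product is divided by 3.6 to give m³/s. The coefficient and input values are hypothetical.

```python
def rational_peak_flow(c, intensity_mm_per_hr, area_km2):
    """Peak flow in m^3/s from Q_p = C i A, with i in mm/h and A in km^2.

    1 mm/h over 1 km^2 equals 1/3.6 m^3/s, hence the divisor.
    """
    return c * intensity_mm_per_hr * area_km2 / 3.6

# Hypothetical small basin: C = 0.3, design intensity 36 mm/h, area 2 km^2.
q_p = rational_peak_flow(0.3, 36.0, 2.0)
print(round(q_p, 2))  # 6.0
```

The return period enters only through the rainfall intensity, so the estimate carries a frequency only to the extent that the coefficient C is valid for a storm of that rarity.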
Statistical Methods
Other methods of regionalization include the application of statistical techniques to hydrologic
data. Statistics provides a means of reducing a mass of data to a few useful and meaningful figures.
The distribution of the data could be represented by a probability density function or a curve that
defines the frequency of values of the variable. Statistical procedures may also provide methods of
relating dependent variables to one or more independent variables through regression analysis.
Most applications of statistical techniques require a considerable amount of data. The value
of the analysis is directly related to the quantity and quality of the data that are available. Often,
hydrologic estimates are required at locations where there is little or no data. The design of a
bridge opening or culvert, for example, on one of the many streams for which there is no data
may be required. Regionalization is an attempt to use data from locations in the same region as
the point of interest to make hydrologic estimates at the point of interest.
Regional flood frequency models have been used extensively in hydrology for transferring
data from gaged to ungaged sites. Two such regionalization procedures, namely the index-flood
and regression-based methods, have evolved over the years and have been widely used in
regional flood frequency analysis.
This treatment will focus on flood frequency analysis. The goal is to estimate flood flows of
various return periods for streams and locations where there is little or no data.

Frequency Distributions
After a homogeneous region has been identified, the next stage in the specification of the
regional flood frequency model is the choice of appropriate frequency distribution(s) to represent
the observed data. The distributions most commonly used in hydrology are the normal, lognormal,
Gumbel extreme value (type I), and log Pearson type III. The U.S. Water Resources
Council (1982) conducted studies comparing different probability distribution
functions and recommended the log Pearson type III as the basic distribution
for defining the annual flood series. The Council also recommended that this distribution be fitted
to sample data using the method of moments. In a more detailed study, the U.K. Natural
Environment Research Council (1975) found that 3-parameter distributions such as the log
Pearson type III and the generalized extreme value distribution (GEV) fit data from
35 annual flood series better than the 2-parameter distribution functions.
The log Pearson type III (LP III) distribution has been used extensively in flood frequency
analysis since its favorable recommendation by the Water Resources Council in 1976. The
frequent use of the LP III attracted a number of detailed mathematical and statistical studies
regarding its role in flood frequency analysis. Various alternative fitting techniques for the LP III
distribution have been suggested by Matalas and Wallis (1973) and Condie (1977). These
researchers carried out comparisons between the method of moments and the method of maxi-
mum likelihood, and concluded that the latter method yielded solutions that are less biased than
the method of moment estimates. Bobee (1975) and Bobee and Robitaille (1977) suggested using
moments of the original data instead of using moments of the logarithmic values. Nozdryn-
Plotnicki and Watt (1979) studied the method of moments, the method of maximum likelihood,
and the procedure proposed by Bobee (1975), and found that none of the methods was superior
to the others and concluded that the method of moments was the best because of its
computational ease.
An important step in a regional flood frequency analysis is to ensure that the data that are
being used are of good quality. The data must be representative of the region and they must be
representative of the long-term flood characteristics of the region. Data on the physical charac-
teristics of the catchments and any other data that are used must be of good quality. There are no
regional flood frequency techniques that can overcome faulty data.
After collecting and screening the data, the first step is to fit various pdfs to the observed
peak flow data at locations where sufficient data exist. Once all of the available data are fit to the
candidate distributions, assumptions and statistical tests must be made in an effort to select the
distribution that best describes each data set. This selection of pdfs is based on probability plots
of observed data along with the fitted distributions. Statistical tests such as the chi-square test and
the Kolmogorov-Smirnov test discussed in chapter 8 may be made. Personal judgment based on
the probability plots is also used.
Once the best fitting pdf is selected for each data set, that pdf can be used to estimate the
peak flow for various return periods. The pdf which best fits the data for the majority of the sta-
tions or locations included in the study is generally used for all locations.
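
The fit-then-check step for one station can be sketched as follows. The example fits a lognormal distribution by the moments of the logarithms, computes the Kolmogorov-Smirnov statistic against the fitted distribution, and reads off flows for chosen return periods. The flow values are hypothetical and the lognormal is used only because it needs nothing beyond the standard library; the same pattern applies to the other candidate distributions. Critical values for the statistic are not included, and they differ from the standard tables when the parameters are estimated from the same sample.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical annual maximum peak flows (m^3/s) at one gaged station.
peaks = [212.0, 340.0, 158.0, 495.0, 270.0, 390.0, 188.0, 610.0, 305.0, 240.0]

def fit_lognormal(flows):
    """Normal distribution fitted to the logarithms of the flows."""
    logs = [math.log(q) for q in flows]
    return NormalDist(mean(logs), stdev(logs))

def ks_statistic(flows, log_dist):
    """Largest gap between the empirical and fitted cumulative distributions."""
    xs = sorted(math.log(q) for q in flows)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = log_dist.cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def flow_for_return_period(log_dist, T):
    """Flow exceeded on average once in T years under the fitted distribution."""
    return math.exp(log_dist.inv_cdf(1.0 - 1.0 / T))

dist = fit_lognormal(peaks)
print("KS statistic:", round(ks_statistic(peaks, dist), 3))
for T in (2, 10, 100):
    print(T, round(flow_for_return_period(dist, T), 1))
```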
Several options are now available for the next phase of the analysis:

1. Develop a relationship between the peak flows of various return periods and measurable
characteristics of the catchments producing the flows ($Q_T = f(X)$).

2. Develop a relationship between parameters of the pdf that best fits a majority of the flow data
and measurable characteristics of the catchments producing the flows ($\theta = f(X)$).

3. Develop a dimensionless flood frequency curve for the region plus a relationship between
some index flood for each catchment and measurable characteristics of the catchments ($Q_T/Q_I$
vs T and $Q_I = f(X)$).

Regression-Based Procedures
All three of the options mentioned above require relationships with measurable characteris-
tics from the catchments for which flow data are available. Characteristics that might be included
in the analysis include precipitation variables, such as mean annual rainfall and 24-hour rainfalls
for various return periods. Physical characteristics such as catchment area, land slopes, stream
lengths, stream slopes, and land use might be included. Soils information such as permeability
and water holding capacities can be used. There are also a large number of geomorphic parame-
ters such as drainage density, catchment shape factors, and measures of elevation changes that
might be included.
The result of this data collection effort will be a matrix of data having n observations on m
catchments. Therefore X is an m × n matrix. The n observations on each catchment come about
from making a single measurement or observation on each of the n characteristics included in the
analysis. The m represents the number of catchments in the study. Thus, a study that involved
30 catchments and 12 characteristics on each catchment would produce a data matrix having
30 rows (one for each catchment) and 12 columns (one for each characteristic).
A regional flood frequency approach, in addition to the m × n data matrix of independent
variables, will include an m × p data matrix of dependent variables which are the peak flow
estimates for the various return periods. With return periods of 2, 5, 10, 25, 50, and 100 years, p
will be 6. With 30 catchments, a 30 × 6 matrix of dependent variables, where the rows are the
catchments and the columns correspond to the various return periods, will result.
Multiple regression techniques can now be used to relate the dependent variables to the in-
dependent variables based on the 30 observations on hand. Regression based on the regional data
and based on logarithms of the regional data can be investigated. Through the estimation process
based on multiple regression, the independent variables that are not useful in predicting the
dependent variables can be eliminated. The goal is to find if the peak flow for the various return
periods can be estimated based on a small subset of the original n catchment characteristics.
Although not always possible, it is desirable to use the same subset of independent variables for
predicting each of the p dependent variables. This will help to ensure a consistent set of predic-
tions for various return periods on a particular catchment.
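
The matrix layout and the regression on logarithms can be sketched with NumPy. All data below are synthetic and generated inside the example, and the "true" coefficients are invented so the fit can be checked. Three characteristics are used instead of twelve to keep the output readable. Fitting all return periods against the same design matrix is one way to keep a common set of independent variables.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
m, n = 30, 3                                   # catchments, characteristics
return_periods = [2, 5, 10, 25, 50, 100]
p = len(return_periods)

# Synthetic, strictly positive catchment characteristics (m x n).
X = rng.uniform(1.0, 100.0, size=(m, n))

# Invented coefficients ((n + 1) x p): an intercept row, then one exponent per
# characteristic. The third characteristic has no influence on the flows.
B_true = np.vstack([
    np.linspace(0.5, 1.5, p),
    np.full(p, 0.7),
    np.full(p, 0.3),
    np.zeros(p),
])

A = np.column_stack([np.ones(m), np.log(X)])               # m x (n + 1)
log_Q = A @ B_true + rng.normal(0.0, 0.01, size=(m, p))    # m x p

# One least squares solve fits all p return periods with the same predictors.
B_hat, *_ = np.linalg.lstsq(A, log_Q, rcond=None)
print(B_hat.shape)            # (4, 6)
print(np.round(B_hat[3], 2))  # exponents of the uninformative characteristic
```

Exponents near zero for a characteristic across all return periods suggest it can be dropped, although a formal decision would rest on the significance tests used in multiple regression.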
Using multiple regression to estimate the magnitude of a flood event that will occur on
average once in T years, denoted by QT, by using physical and climatic characteristics of the
watershed has a long history (Benson 1962c, 1964; Benson and Matalas 1967; Thomas and
Benson 1970). Sauer (1974) developed regional equations relating flood frequency data for
unregulated streams in Oklahoma to basin characteristics through multiple linear regression
techniques. Similar studies have been done throughout the United States (Jennings et al. 1993).
The Hydrology Committee of the U.S. Water Resources Council (1981) investigated numerous
methods of estimating peak flows from ungaged watersheds and found that the results obtained
using regional regression compared favorably with more complex watershed models.
A logarithmic transformation of the $Q_T$, physiographic, and climatic data may be required to
linearize the regression model and to satisfy other assumptions of regression analysis. The
relationship most commonly used is of the form

$$Q_T = b_0 X_1^{b_1} X_2^{b_2} \cdots X_n^{b_n}$$

where $X_1, X_2, \dots, X_n$ represent the basin and climatic data, and $b_0, b_1, b_2, \dots, b_n$ are the
regression parameters. Regression parameters may be estimated using ordinary least squares
(OLS), weighted least squares (WLS), or generalized least squares (GLS). OLS does not account
for unequal variances in flood characteristics or any correlations that may exist between
streamflows from nearby stations. To overcome these deficiencies in the OLS method, Tasker
(1980) proposed the use of WLS regression with the variance of the errors of the
observed flow characteristics estimated as an inverse function of the record length. Using a
weighting function of

where N is the number of stations, $t_0$ and $t_1$ are constants, and $n_i$ is the record length of station i,
Tasker (1980) reported that the WLS produced a smaller expected standard error of predictions
than the OLS. Using Monte Carlo simulation, Stedinger and Tasker (1985) demonstrated that
the WLS and GLS provide more accurate estimates of regression parameters than the OLS. A
major drawback of the WLS and GLS is the need to estimate the covariance matrix of the resid-
ual errors. The covariance matrix of the residual errors is a function of the precision with which
the model can predict the streamflow values.
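
The difference between OLS and WLS can be sketched as below. The weights here are simply proportional to record length. That choice is an assumption made for illustration and is not the weighting function of Tasker (1980). The station data are hypothetical, and a single predictor (drainage area) is used to keep the example short.

```python
import numpy as np

def wls_fit(A, y, weights):
    """Weighted least squares, done by scaling each row by sqrt(weight)."""
    sw = np.sqrt(np.asarray(weights, dtype=float))
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef

# Hypothetical stations: drainage area (km^2), record length (yr), ln(Q_100).
area = np.array([50.0, 120.0, 300.0, 750.0, 1500.0])
record_years = np.array([12, 45, 20, 60, 15])
log_q100 = np.array([3.9, 4.6, 5.1, 5.9, 6.2])

A = np.column_stack([np.ones_like(area), np.log(area)])

b_ols = wls_fit(A, log_q100, np.ones_like(area))   # equal weights give OLS
b_wls = wls_fit(A, log_q100, record_years)         # longer records count more

print("OLS:", np.round(b_ols, 3))
print("WLS:", np.round(b_wls, 3))
```

GLS goes further by also allowing for correlation between stations, which requires the full covariance matrix of the residual errors rather than a single weight per station.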
Estimating a peak flow for some return period on an ungaged catchment now becomes an
exercise in applying the appropriate regression equation to the ungaged catchment. The required
catchment characteristics are used in the appropriate prediction equations to estimate the peak
flow.
Regional frequency analysis using option 2 is very similar to option 1 except that the dependent
variables in the regression analysis are the parameters or some function of the parameters of the
pdf selected to represent the flood peak flows. If a lognormal distribution is used, there will be 2
dependent variables, the mean and standard deviation of the logarithms of the flows. If the log
Pearson type III is used, there will be 3 dependent variables.

[Figure 7.6: regional flood frequency plot against return period (yrs.), showing points from
individual stations and the median of 18 stations, for stations in parts of Alberta and
Saskatchewan, Canada, 1911-1956 (Region A). Based on data from Durant and Blackwell (1959).]

Fig. 7.6. Regional flood frequency curve.

Again, it is desirable to use the same set of independent variables to predict all of the pa-
rameters of the selected pdf. This is because the parameters will most likely be correlated. Using
the same set of independent variables helps ensure that one maintains a consistent relationship
among the parameters of the pdf.
Estimating peak flows for an ungaged catchment consists of using the derived prediction
equations to estimate the parameters of the flow frequency pdf. These parameters are then used
in the pdf to estimate flow magnitude with the desired return periods.

Index-Flood Method
Another widely used statistical procedure in regional flood frequency analysis is the index-
flood method. This method, first described by Dalrymple (1960), involves the derivation and use of
a dimensionless flood frequency distribution applicable to all basins within a homogeneous region.

Regional-Index Flood Relationship


The next step in the index-flood method is to define the index flood. The ratios of peak flows
of various return periods to this index flood are then computed. The ratios are of the form QT/QI
where QT is the flood with return period T, and QI is the index flood. The index flood is often
taken as the mean annual flood or the 2-year flood.
A plot is made of QT/QI versus T containing data for all of the watersheds. A line is drawn
through the median of the data in this plot. The resulting line is the regional flood frequency line.
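
The construction of the regional line can be sketched as follows. The at-site quantiles, the station names, and the index flood of the ungaged catchment are hypothetical, and the 2-year flood is taken as the index flood.

```python
from statistics import median

# Hypothetical at-site estimates Q_T (m^3/s) for three gaged catchments.
site_quantiles = {
    "A": {2: 100.0, 10: 210.0, 50: 320.0},
    "B": {2: 40.0, 10: 88.0, 50: 120.0},
    "C": {2: 250.0, 10: 500.0, 50: 850.0},
}
INDEX_T = 2  # index flood Q_I taken as the 2-year flood

def regional_curve(quantiles, index_T, return_periods):
    """Median of Q_T / Q_I across stations for each return period."""
    return {T: median(q[T] / q[index_T] for q in quantiles.values())
            for T in return_periods}

def estimate_flow(curve, index_flood, T):
    """Rescale the dimensionless regional ratio by a catchment's index flood."""
    return curve[T] * index_flood

curve = regional_curve(site_quantiles, INDEX_T, [2, 10, 50])
print(curve)

# Ungaged catchment whose index flood (75 m^3/s here) came from a regression
# on catchment characteristics.
q50_ungaged = estimate_flow(curve, 75.0, 50)
print(round(q50_ungaged, 1))
```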
In the past, the index-flood method was widely used to perform regional frequency analysis
(Dalrymple 1960; Benson 1962). The basic premise of the method is that a combination of
streamflow records maintained at a number of gaging stations will produce a more reliable record
than that of a single station and thus will increase the reliability of frequency analysis within a
region. The index flood method consists of two major steps. The first involves the development
of dimensionless ratios by dividing the floods at various frequencies by an index flood, such as
the mean annual flood for each gaging station (Stedinger 1983; Lettenmaier and Potter 1985;
Lettenmaier et al. 1987). The averages or medians of the ratios are then determined for each
return period to estimate a dimensionless regional frequency curve. The second step consists of
the development of a relationship between the index-flood and physiographic and climatic char-
acteristics of the basin. Flood magnitudes and frequencies at required locations within the region
can then be estimated by rescaling the corresponding dimensionless quantile by the index flood.
The index-flood method, once the standard U.S. Geological Survey (USGS) approach, is based
on the assumption that the floods at every station in the region arise from the same or similar
distributions (Chowdhury et al. 1991). At some stage this procedure fell out of favor, primarily
due to the fact that the coefficient of variation of the flows, which is assumed to be constant in an
index-flood method, was found to be inversely related to the watershed area (Stedinger 1983).
This implies that the standard deviations of the normalized data do not remain constant for vari-
ous values of basin areas, because the coefficient of variation of the observed data is equal to the
standard deviation of the normalized flows. This can be demonstrated as follows. Let $Y_i$ be the
normalized flows given by

$$Y_i = \frac{x_i}{\bar{x}}$$

where $x_i$ represents the ordered observed flows (with $x_1$ being the largest observation and $x_n$ the
smallest) and $\bar{x}$ is the mean observed flow. The coefficient of variation of the observed data,
$CV_x$, is given by

$$CV_x = \frac{s_x}{\bar{x}} = \frac{1}{\bar{x}} \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Substituting $x_i = Y_i \bar{x}$ gives

$$CV_x = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - 1)^2}{n - 1}}$$

Because the mean of the $Y_i$ is 1, the right-hand side of this equation is nothing but the standard
deviation of the normalized flows.
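
The identity is easy to check numerically. The flows below are hypothetical, and the sample standard deviation (divisor n - 1) is assumed.

```python
from statistics import mean, stdev

# Hypothetical annual peak flows (m^3/s).
flows = [320.0, 150.0, 410.0, 275.0, 190.0, 505.0, 230.0]

cv_observed = stdev(flows) / mean(flows)

normalized = [q / mean(flows) for q in flows]   # Y_i = x_i / mean
sd_normalized = stdev(normalized)

print(round(cv_observed, 6), round(sd_normalized, 6))  # the two values agree
```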
The index-flood method regained popularity in the late 1970s and early 1980s following
the introduction of probability weighted moments (PWM), a generalization of the usual
moments of a probability distribution (Greenwood et al. 1979). Greis and Wood (1983) reported
that improved regional estimates of flood quantiles were obtained by applying PWM rather than
conventional methods such as the method of moments and maximum likelihood estimation.
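Parameter estimation by PWM requires the calculation of moments $M_{i,j,k}$ defined as

$$M_{i,j,k} = E\left[X^{i} F^{j} (1 - F)^{k}\right] = \int_0^1 [x(F)]^{i} F^{j} (1 - F)^{k} \, dF$$

where i, j, and k are real numbers and X is a random variable with distribution function F(x),
where F(x) = Prob(X ≤ x). $M_{1,0,0}$ is identical to the conventional moment about the origin, and
the probability weighted moments corresponding to $M_{1,0,k}$, or $M_k$, are denoted as

$$M_k = M_{1,0,k} = E\left[X (1 - F)^{k}\right] = \int_0^1 x(F) (1 - F)^{k} \, dF$$
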
All higher-order PWMs are linear combinations of the ranked observations $x_1 \le \dots \le x_n$,
which is an indication that PWM estimators are subject to less bias than ordinary moments.
Ordinary moment estimators such as the variance ($s^2$) and coefficient of skewness ($C_s$) involve
squaring and cubing of observations, respectively, with a potential to give greater weight to
outliers, resulting in substantial bias and variance. However, one major weakness of the PWM is
that it cannot be used to estimate parameters for those distributions which cannot be expressed in
inverse form, such as LP III.
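
The linearity of PWM estimates in the ranked observations can be seen in a short sketch. The estimator used is the unbiased sample form commonly given for the moments of type $M_{1,0,k}$. It is an assumption of this illustration that this is the form intended here, and the data are hypothetical.

```python
from math import comb

def sample_pwm(flows, k):
    """Unbiased sample estimate of E[X (1 - F)^k] for integer k >= 0.

    Each ranked observation x_(j), j = 1 for the smallest, gets the weight
    C(n - j, k) / C(n - 1, k), so the estimate is linear in the data.
    """
    x = sorted(flows)
    n = len(x)
    total = sum(comb(n - j, k) * xj for j, xj in enumerate(x, start=1))
    return total / (n * comb(n - 1, k))

flows = [1.0, 2.0, 3.0, 4.0, 5.0]
print([sample_pwm(flows, k) for k in range(3)])  # [3.0, 1.0, 0.5]

# Doubling every observation doubles every PWM, unlike the variance (x 4)
# or the third central moment (x 8).
doubled = [2.0 * q for q in flows]
print([sample_pwm(doubled, k) for k in range(3)])  # [6.0, 2.0, 1.0]
```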

Regionalization Using L-Moments and the GEV Distribution


Hosking et al. (1985) and Stedinger et al. (1994) discuss regional flood frequency analysis
using L-moments and the generalized extreme value distribution. The following is adapted from
their work.
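The generalized extreme value distribution (GEV) is given by

$$P_X(x) = \exp\left\{-\left[1 - \frac{\kappa (x - \xi)}{\alpha}\right]^{1/\kappa}\right\}$$

where $\xi$, $\alpha$, and $\kappa$ are the location, scale, and shape parameters.
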
Consider K sites with flood records $X_i(k)$ for $i = 1, 2, \dots, n_k$ and $k = 1, 2, \dots, K$. Normalize the
$X_i(k)$ by dividing the observations at a site by the mean of the observations at that site.

1. At each site compute the three L-moments $\hat{\lambda}_1(k)$, $\hat{\lambda}_2(k)$, and $\hat{\lambda}_3(k)$ of the normalized
observations using the probability weighted moments (PWM) estimators. The L-moments are linear
combinations of the ranked observations.

where $x_{(j)}$ is the jth order statistic of the normalized observations with $x_{(1)}$ the smallest and
$x_{(n)}$ the largest.

2. To get a normalized frequency distribution, compute the average of the normalized
L-moments of order r = 2 and r = 3.

$$\hat{\lambda}_r^R = \frac{\sum_{k=1}^{K} w_k \left[\hat{\lambda}_r(k) / \hat{\lambda}_1(k)\right]}{\sum_{k=1}^{K} w_k} \quad \text{for } r = 2, 3$$

The $w_k$ are weights. The weights might be based on $n_k$.

3. Use the $\hat{\lambda}_r^R$ to obtain the parameters of the normalized regional GEV by letting

Then, from the expression of $P_X(x)$, compute

which is the regional flood frequency curve evaluated at probability p or T = 1/(1 - p).

4. Estimate the 100p-th percentile of the flood distribution at any site k by

where $\hat{\mu}_k$ is the at-site sample mean for site k.

For sites without flow records on which to estimate $\hat{\mu}_k$, a regional regression could be used to
develop an equation of the form

where X is a set of physical and hydrologic characteristics.
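
The four steps above can be sketched in code. The expressions for the L-moments and for the GEV parameters are the widely published PWM relations attributed to Hosking et al. (1985). They are supplied here as an assumption and were not recovered from this text. Record length is assumed as the weight, the site data are hypothetical, and the Gumbel limit (shape parameter equal to zero) is not handled.

```python
import math

def sample_l_moments(x):
    """First three sample L-moments from the PWM estimators b0, b1, b2."""
    xs = sorted(x)                       # xs[0] is the smallest observation
    n = len(xs)
    b0 = sum(xs) / n
    b1 = sum((j - 1) * v for j, v in enumerate(xs, 1)) / (n * (n - 1))
    b2 = sum((j - 1) * (j - 2) * v
             for j, v in enumerate(xs, 1)) / (n * (n - 1) * (n - 2))
    return b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0

def regional_l_moments(sites):
    """Steps 1 and 2: weighted averages of lambda_r / lambda_1 for r = 2, 3."""
    w_sum = l2_sum = l3_sum = 0.0
    for flows in sites.values():
        site_mean = sum(flows) / len(flows)
        l1, l2, l3 = sample_l_moments([q / site_mean for q in flows])
        w = len(flows)                   # assumed weight: record length
        w_sum += w
        l2_sum += w * l2 / l1
        l3_sum += w * l3 / l1
    return 1.0, l2_sum / w_sum, l3_sum / w_sum

def gev_from_l_moments(l1, l2, l3):
    """Step 3: GEV location, scale, and shape (fails if the shape is exactly 0)."""
    c = 2.0 * l2 / (l3 + 3.0 * l2) - math.log(2.0) / math.log(3.0)
    kappa = 7.8590 * c + 2.9554 * c * c
    g = math.gamma(1.0 + kappa)
    alpha = kappa * l2 / (g * (1.0 - 2.0 ** (-kappa)))
    xi = l1 + alpha * (g - 1.0) / kappa
    return xi, alpha, kappa

def gev_quantile(p, xi, alpha, kappa):
    """Value with non-exceedance probability p, so T = 1 / (1 - p)."""
    return xi + (alpha / kappa) * (1.0 - (-math.log(p)) ** kappa)

# Hypothetical annual peak flows (m^3/s) at two sites in one region.
sites = {
    "A": [120.0, 95.0, 210.0, 150.0, 180.0, 75.0, 260.0, 130.0],
    "B": [40.0, 55.0, 33.0, 90.0, 61.0, 47.0, 72.0, 38.0, 52.0, 66.0],
}
xi, alpha, kappa = gev_from_l_moments(*regional_l_moments(sites))
growth_100 = gev_quantile(0.99, xi, alpha, kappa)     # regional curve at T = 100

# Step 4: rescale by the at-site mean.
for name, flows in sites.items():
    print(name, round(growth_100 * sum(flows) / len(flows), 1))
```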

Regionalization Using Modeling


Conceptual hydrologic models are, in a sense, regionalization tools. A hydrologic model is
used to estimate flow characteristics at a particular location. The model requires as input certain
parameter values that must be estimated. In the absence of flow data at the point of interest, these
parameters must be estimated from experience on other similar catchments. Some way of corre-
lating model parameter values with catchment characteristics is required. These relationships are
then used to estimate the values for the parameters of the catchment of interest. This represents a
regionalization approach in that parameters are estimated by transferal of information from other
basins to the basin of interest.

FREQUENCY ANALYSIS OF PRECIPITATION DATA


The amount of rainfall (depth) that can be expected to occur in a given period of time
(duration) on the average once every so many years (frequency) is an important design variable
for many hydraulic structures. Depth-duration-frequency relationships have been developed for
the United States (Hershfield 1961) for durations of 30 minutes to 24 hours and return periods of
1 to 100 years and published as U.S. Weather Bureau TP 40.
The procedure used in developing TP 40 (Hershfield 1961) was to prepare four key base
maps showing the 2-year, 1-hour; 2-year, 24-hour; 100-year, 1-hour; and 100-year, 24-hour
rainfalls for the United States. Annual series data were used, consisting of the maximum 60-minute
and 24-hour rainfall depths converted to a partial duration series by using the factors shown in
table 7.7. For example, if the 5-year partial duration series value estimated from the maps is 2.00
inches, the corresponding annual series depth would be 0.96(2.00) or 1.92 inches. For return
periods greater than 10 years, the conversion factor is essentially unity.

Table 7.7. Empirical factors for converting partial duration series to annual series (Hershfield 1961)

Return period    Conversion factor
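
The conversion in the worked example can be sketched as a small lookup. Only the 5-year factor of 0.96 appears in the text above. The 2-year and 10-year factors in the code are values commonly quoted from TP 40 and are assumed here, so they should be checked against table 7.7 before use.

```python
# 5-year factor from the worked example in the text. The 2- and 10-year
# factors are assumed values commonly quoted from TP 40.
PDS_TO_ANNUAL = {2: 0.88, 5: 0.96, 10: 0.99}

def annual_series_depth(pds_depth, return_period):
    """Convert a partial duration series depth to an annual series depth."""
    if return_period > 10:
        return pds_depth          # factor is essentially unity beyond 10 years
    return PDS_TO_ANNUAL[return_period] * pds_depth

print(round(annual_series_depth(2.00, 5), 2))   # 1.92
```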
The 2-year rainfall amounts were determined by plotting on log-log paper the return period
versus the rainfall depth using the California plotting position formula (Table 7.1), drawing a
smooth curve through the points, and reading the 2-year value.
The 100-year rainfall amounts were determined by using the type I Extreme Value distribu-
tion for selected stations with long rainfall records. The ratio of the 100-year to the 2-year rain-
fall amount was then determined for these stations and a map prepared showing the value of this
ratio. The 100-year rainfall amounts for the stations with short records were estimated by
using the 100-year to 2-year ratio.
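
The ratio approach can be sketched with a type I extreme value fit. The sketch uses the method of moments with the large-sample frequency factor, which is an assumption, since the fitting details used for TP 40 are not given above. It also takes the 2-year value from the same fitted distribution rather than from a plotted curve. The rainfall values are hypothetical.

```python
import math
from statistics import mean, stdev

def gumbel_frequency_factor(T):
    """Large-sample type I extreme value frequency factor for return period T."""
    return -(math.sqrt(6.0) / math.pi) * (0.5772 + math.log(math.log(T / (T - 1.0))))

def gumbel_value(sample, T):
    """T-year value from a method of moments fit."""
    return mean(sample) + gumbel_frequency_factor(T) * stdev(sample)

# Hypothetical annual maximum 24-hour rainfalls (inches) at a long-record station.
long_record = [2.1, 3.4, 2.8, 4.0, 2.5, 3.1, 5.2, 2.9, 3.6, 2.3]

ratio = gumbel_value(long_record, 100) / gumbel_value(long_record, 2)

# Short-record station: only its 2-year value (2.6 inches here) is trusted,
# and the mapped ratio supplies the 100-year value.
p100_short = ratio * 2.6
print(round(ratio, 2), round(p100_short, 2))
```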
The rainfall depths for other return periods were determined by plotting the 2-year and
100-year depths on special paper, connecting the points by a straight line, and reading off the
desired rainfall depths. The spacing of the return periods along the abscissa of this special paper
was empirical from 1 to 10 years based on free-hand plotting of partial duration series data and
theoretical according to the type I extreme value distribution from 20 to 100 years. The transition
between 10 and 20 years is smoothed by hand from the type I values.
The rainfall depths for durations other than 1 hour or 24 hours were obtained by plotting the
1-hour and 24-hour values on a second special paper and connecting the points with a straight
line. This diagram was obtained empirically from an analysis of records from 200 first-order U.S.
Weather Bureau stations. The depth of rainfall for the 30-minute duration is obtained by multi-
plying the 1-hour value by 0.79.
From these analyses, curves called depth (or intensity)-duration-frequency curves can be
prepared. Data from the maps in TP 40 can be used to determine depth-duration-frequency
(DDF) relationships for locations where actual data do not exist. Often, in developing DDF
curves, the interpolation from the maps of TP 40 may result in rather rough plots. The curves can
be smoothed by using an empirical smoothing equation. One such equation is

$$D = \frac{K T F^{x}}{(T + b)^{n}}$$

where D is the depth, T is the duration, and F is the frequency of the rainfall. The coefficients K,
x, b, and n may be estimated using nonlinear regression techniques. Figure 7.7 shows the results
of such an analysis for Stillwater, Oklahoma, based on TP 40 data.
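
The smoothing equation is simple to evaluate once the coefficients are known. In the sketch below the coefficient values are hypothetical and chosen only to exercise the function. They are not the Stillwater values. F is taken to be the return period in years, which is an assumption about how frequency is expressed.

```python
def ddf_depth(duration_hr, return_period_yr, K, x, b, n):
    """Rainfall depth from D = K T F^x / (T + b)^n."""
    return K * duration_hr * return_period_yr ** x / (duration_hr + b) ** n

# Hypothetical coefficients for illustration only.
coef = dict(K=1.5, x=0.2, b=0.3, n=0.8)

for T in (0.5, 1.0, 6.0, 24.0):
    depths = [round(ddf_depth(T, F, **coef), 2) for F in (2, 10, 100)]
    print(T, depths)
```

With n no greater than 1 the depth increases with duration, and with x positive it increases with return period, which is the behavior expected of a smoothed DDF family.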

[Figure 7.7: rainfall depth plotted against duration (hrs).]

Fig. 7.7. Rainfall depth-duration-frequency relationship for Stillwater, Oklahoma.

Rainfall data for longer durations, such as weeks or months, can be analyzed by using the
gamma distribution. Barger and Thom (1949) have shown the gamma distribution applicable to
rainfall data. Barger, Shaw, and Dale (1959), Friedman and Janes (1957), Strommen and Hors-
field (1969), and Mooley and Crutcher (1968) are among those who have used the gamma distri-
bution for rainfall.
By using equation 7.23, it can be seen that the probability of a rainfall R exceeding x is
given by

$$\operatorname{prob}(R > x) = k\,[1 - P^{*}(x)]$$

and the probability of R being less than x is given by

$$\operatorname{prob}(R < x) = (1 - k) + k\,P^{*}(x)$$

where k is the probability of rain or the proportion of time intervals with rainfall and $P^{*}(x)$ is the
cumulative probability distribution of rain given that $R \neq 0$. Often the gamma distribution is used
for rainfall data. The parameters of the gamma distribution generally are determined by using
equations 6.18 and 6.19. Bridges and Haan (1972) have presented a technique for determining
the reliability of rainfall estimates from the gamma distribution based on simulation studies.
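
The mixed treatment of dry and wet intervals can be sketched numerically. Under the definitions above, the probability of exceeding x is k[1 - P*(x)] and the probability of being less than x is (1 - k) + k P*(x). The sketch assumes a gamma distribution written with a shape and a rate parameter, evaluates its cumulative distribution with a standard series for the incomplete gamma function, and uses hypothetical values of k, shape, and rate.

```python
import math


def gamma_cdf(x, shape, rate):
    """P*(x) for a gamma distribution, from the series expansion of the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    y = rate * x
    term = 1.0 / shape
    total = term
    k = 1
    while term > 1e-15 * total:
        term *= y / (shape + k)
        total += term
        k += 1
    return total * math.exp(-y + shape * math.log(y) - math.lgamma(shape))


def prob_exceed(x, k, shape, rate):
    """prob(R > x) = k * [1 - P*(x)], with k the probability of rain."""
    return k * (1.0 - gamma_cdf(x, shape, rate))


def prob_less(x, k, shape, rate):
    """prob(R < x) = (1 - k) + k * P*(x); dry intervals count as less than x."""
    return (1.0 - k) + k * gamma_cdf(x, shape, rate)


# Hypothetical values: 80% of weeks have rain; gamma shape 2, rate 1.5 per inch.
print(round(prob_exceed(1.0, 0.8, 2.0, 1.5), 4))
print(round(prob_less(1.0, 0.8, 2.0, 1.5), 4))
```

The two probabilities sum to one for any x greater than zero, which is a quick check that the dry-interval probability 1 - k has been assigned to the lower tail.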

FREQUENCY ANALYSIS OF OTHER HYDROLOGIC VARIABLES


The principles set forth on flood frequencies and rainfall frequencies also apply to frequen-
cies of other hydrologic variables. Basically, the quantity to be analyzed must be defined, the data
tabulated, and then a frequency analysis made. For instance, in the case of flow volume-frequency
studies, the duration(s) of interest must be specified and then the maximum or minimum flow
volumes for each year having the specified duration are tabulated. The maximum flow volumes
would be used in the case of flood-flow volumes and the minimum volumes would be used in the
case of low-flow studies.
Frequency analysis can be applied to water quality parameters such as dissolved oxygen,
biological oxygen demand, sediment loads, and many other quantities. Care must be taken to see
that the data used meet the necessary requirements of homogeneity, independence, and represen-
tativeness. For example, if sediment concentration frequencies are being studied and part of the
data are collected during low flows and part during high flows, the data may not be homogeneous
because of the relationship between sediment concentration and flow rate.

Exercises

7.1. Assume that daily rainfall on rainy days follows an exponential distribution. The average
daily rainfall on rainy days is 0.3 inches. If 30% of all days are rainy, what is the probability that
on some future day, the amount of rainfall received will exceed 1.00 inch? Assume daily rainfalls
are independent.

7.2. Derive a table of frequency factors for the exponential distribution corresponding to T = 2,
5, 10, 20, 50, and 100 years.

7.3. Select several streams in a single locality and prepare a plot of the ratio of the T-year flood
to the mean annual flood (as in figure 7.6).

7.4. An analysis of 50 years of data showed that the probability of a flood peak exceeding 90,000
cfs on a certain river was 0.02. During a 10-year period 2 such peaks occurred. If the original
estimate of the probability of this exceedance was correct, what is the probability of getting 2
such exceedances in 10 years?

7.5. Forty years of peak streamflow data are available. All but one of the data points indicate that
a lognormal distribution with $\bar{x}$ = 125,000 cfs and $s_x$ = 50,000 cfs describes the data very nicely.
The one outlier is equal to 285,000 cfs. What is the probability that an event of 285,000 cfs or
greater could occur in the 40-year period if the flood peaks truly follow the lognormal distribution
with $\bar{x}$ and $s_x$ as given?

7.6. Select a set of data consisting of 20 or more independent observations. Plot these data on nor-
mal probability paper using several of the plotting position relationships contained in table 7.1.

7.7. Compute the 100-year peak flow for the annual series data of example 7.2 assuming the data
follow the gamma distribution.

7.8. Prepare a plot on log-log paper of low flow frequency-volume-duration for Cave Creek near
Fort Spring, Kentucky. Plot volume in inches as the ordinate, duration in months (use 1, 2, 3, 6,
and 12 months) as the abscissa and use as curve parameters frequency (use 2, 5, 10, and 25
years).

7.9. Work exercise 7.8 for maximum flow frequency-volume-duration on Cave Creek.

7.10. Plot the annual runoff data for Walnut Gulch near Tombstone, Arizona, on normal and
lognormal probability paper. Does either of these distributions appear to "fit" the data?

7.11. Plot on normal probability paper the annual runoff data for (a) Piscataquis River near
Dover-Foxcroft, Maine, (b) North Llano River near Junction, Texas, and (c) Spray River, Banff,
Canada. Is there any apparent relationship between the curvature (or lack of it) and the skewness?

7.12. Work exercise 7.11, only plot the data on lognormal probability paper.

7.13. For the Piscataquis River near Dover-Foxcroft, Maine, estimate the 100-year annual flow
assuming the data follow the (a) normal distribution, (b) lognormal distribution, (c) Pearson type
III distribution, (d) log Pearson type III distribution, (e) extreme value distribution.

7.14. Work exercise 7.13 for the 100-year annual flow on the North Llano River near Junction,
Texas.

7.15. Work exercise 7.13 for the 100-year annual flow on the Spray River, Banff, Canada.

7.16. In reference to exercises 7.13, 7.14, and 7.15, which distribution would you expect to give
the "best" estimate for the 100-year flow on each of the three rivers? Discuss in terms of the
means, variances, coefficient of variation, and skewness.

7.17. Plot the annual peak discharge of Walnut Gulch near Tombstone, Arizona, on lognormal
probability paper. Draw in what you consider the best fitting straight line. Estimate the mean and
variance of the data from this plot.

7.18. Plot the suspended sediment load data for the Green River at Munfordville, Kentucky on
normal and lognormal probability paper. Draw in the best fitting straight line.

7.19. Use the lognormal distribution to estimate the 25-year runoff volume for July on Walnut
Gulch near Tombstone, Arizona. Plot the data on lognormal probability paper and draw in the
theoretical best fitting straight line.
8. Confidence Intervals
and Hypothesis Testing
IN CHAPTER 3, parameter estimation was discussed in general terms. In chapters 4, 5, and
6 specific methods for estimating the parameters of certain probability distributions were
discussed. Again, it should be recalled that parameter estimates are called statistics, are functions
of the sample (random) values, and are themselves random variables. Parameter estimates have
associated with them probability distributions.
Thus far we have discussed methods of getting point estimates for parameters and certain
properties of these point estimates. The possible errors in these point estimates due to inherent
variability in random samples of data have not been discussed. This chapter considers the relia-
bility of parameter estimates and the testing of hypotheses regarding population parameters.
Hypothesis testing and confidence interval estimation may be classed as parametric or
nonparametric depending on whether or not assumptions are made regarding the probability
distribution of the observations and/or the parameters under consideration. Parametric and
nonparametric tests have certain assumptions in common. They both rely on independence in
the observations and randomness of the sample. They both require samples of data to be
representative of the situation under analysis. Parametric statistics deal with actual values of
observations while nonparametric methods often rely on the ranking or relative position of
data values.
The use of parametric statistics is frequently criticized because of deviations from the
distributions assumed by a particular test. One of the consequences of deviating from the
assumed distribution is that the level of significance of the test is no longer exact. This may be a
serious problem, but in most cases is not. Generally, the selection of the level of significance is
somewhat arbitrary. Early statisticians used 5 and 10%, so everybody uses 5 and 10%! If one
doesn't know how to select a level of significance, it makes little sense to be overly concerned if
the level of significance is unknown due to deviations from distributional assumptions. What is
purported to be an exact test becomes an approximate test, but that is often the nature of hydro-
logic analysis. Uncertainty abounds! An approximate test provides information to the decision
maker just as does a so-called "exact" test and is certainly better than no test at all. Several papers
are available indicating that nonparametric procedures are nearly as good as parametric pro-
cedures for some tests when distributional assumptions are met and are superior when distribu-
tional assumptions are not met (Helsel and Hirsch 1992).
In any application of hypothesis testing or confidence interval estimation, it must be kept in
mind that assumptions must be made concerning the data and the process under study. It is
unlikely that in an actual application the assumptions will be exactly met. Again, if the assump-
tions are not fully met, then the tests or confidence intervals become approximate.
If we reject the hypothesis that two streams have different BOD loadings, we do not neces-
sarily believe their BOD loadings are exactly the same. It would be rare indeed to have two nat-
ural streams that have identical BOD loadings or any other quantifiable characteristic. We know
before we run the test, indeed before we collect any data, that the BOD loadings are not precisely
the same on two streams.
What we are really concerned with is whether the BOD loadings are "significantly" differ-
ent. In statistical jargon, we are assessing whether the difference we detect in BOD is of such a
magnitude that it cannot be attributed to chance if the BOD loadings in the two streams are in fact
the same and meet the conditions of the test.
For example, consider a situation where the BOD level on two streams is sampled. Assume
that on each of the streams the true distribution of BOD is N(4, 1) and the BOD in the two streams
is uncorrelated. These are strong assumptions that we can never verify completely. If we could,
then statistical testing would be superfluous. It is hypothesized that the BOD levels are the same
in the two streams. The investigator decides to sample each of the streams and declare the BOD
levels different if the samples from the two streams differ by more than 1 mg/l. What is the prob-
ability an error will be made?
The error that might be made is to declare the BOD in the two streams different when, in fact,
they are, unknown to the investigator, the same. Since the BOD level is actually N(4, 1), the
difference in two independent samples is N(0, 2). The probability of selecting a random number
from an N(0, 2) that is larger in absolute value than one is the probability of making an error with
the test. Since the test statistic, the observed difference, has an N(0, 2) pdf, the standardized Z value
corresponding to a difference of one is $(1 - 0)/\sqrt{2} = 0.707$. The
probability of Z exceeding 0.707 in absolute value for a standard normal distribution is 0.48. There
is a 48% chance of rejecting the hypothesis even though it is true.
If the investigator thinks this probability of an error is too great, the appropriate value for the
test statistic consistent with the acceptable error probability can be determined. For example, if
the investigator wants to be 90% confident of not concluding the streams are different when in
fact they are not, the cutoff value for Z is such that $\operatorname{prob}(Z > z_c) = 0.05$ (5% in each tail), which corresponds
to $z_c = 1.645$. Then the actual difference is computed from $(d - 0)/\sqrt{2} = 1.645$ or d = 2.33.
Therefore, the streams would be considered not significantly different unless the absolute value of
the difference in the samples from the streams exceeded 2.33 mg/l.
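
The two calculations above can be reproduced with the normal distribution in the Python standard library. This is only a sketch of the arithmetic: it assumes, as in the text, that the difference of the two samples is normal with mean 0 and variance 2 when the hypothesis is true.

```python
import math
from statistics import NormalDist

# Difference of two independent N(4, 1) samples: mean 0, variance 1 + 1 = 2.
sd_diff = math.sqrt(1.0 + 1.0)
diff = NormalDist(mu=0.0, sigma=sd_diff)

# Probability of a Type I error when "different" is declared for |difference| > 1.
alpha_cutoff_1 = 2.0 * (1.0 - diff.cdf(1.0))

# Cutoff on the difference that leaves 5% in each tail (90% confidence).
z_c = NormalDist().inv_cdf(0.95)
d = z_c * sd_diff

print(round(alpha_cutoff_1, 2), round(z_c, 3), round(d, 2))
```

The first value is the error probability implied by an arbitrarily chosen cutoff; the last is the cutoff implied by a chosen error probability, which is the more usual direction of the calculation.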
If the BOD distribution on one stream was N(3, 1) and on the other N(4, 1), the distribution
of BOD would have been truly different on the two streams. The distribution of the difference in
BOD would be N(1, 2). The probability of getting a difference in excess of ±1 would be the
probability of a value < -1 or > 1 from an N(1, 2). Again, using the standard normal distribution,
the standardized values are $(-1 - 1)/\sqrt{2} = -1.414$ and $(1 - 1)/\sqrt{2} = 0$, so this probability
is 0.08 + 0.50 = 0.58. In this case, the BOD distributions are different yet there
is a 42% chance of erroneously concluding they are not different.
What becomes apparent is that there is always a chance of making an error in statistical
tests of hypotheses. The first part of the example demonstrates how one could wrongly conclude
a difference when none existed and the second part shows how one could fail to detect a difference
when one does exist. These two errors are rejecting a true hypothesis, known as a Type I
error, or accepting a false hypothesis, known as a Type II error.
The probabilities of a Type I and a Type II error are usually denoted by $\alpha$ and $\beta$, respectively.
In this example when the true situation was no difference, $\alpha$ was 0.48. In the situation where there
was a difference, $\beta$ was 0.42.

CONFIDENCE INTERVALS
A parameter $\theta$ is estimated by $\hat{\theta}$. The statistic $\hat{\theta}$ is a random variable having a probability
distribution. If $\hat{\theta}$ can take on any value in some continuous range, then $\operatorname{prob}(\theta = \hat{\theta})$ is zero.
Rather than a point estimate for $\theta$, it may be more desirable to get an interval estimate such that
the probability that this interval contains $\theta$ can be specified. Such an interval is known as a confidence
interval. This statement may be written

$$\operatorname{prob}(L < \theta < U) = 1 - \alpha \qquad (8.1)$$

where L and U are the lower and upper confidence limits, so that the interval from L to U is the
confidence interval and $1 - \alpha$ is the confidence level, or confidence coefficient. Note that in
equation 8.1, $\theta$ is not a random variable. One does not say that the probability that $\theta$ is between
L and U is $1 - \alpha$ but that the probability is $1 - \alpha$ that the interval L to U contains $\theta$. The difference
in these two interpretations is subtle but based on the fact that $\theta$ is a constant while L and U
are random variables.
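
The interpretation can be illustrated by simulation: the parameter stays fixed while the interval changes from sample to sample. The sketch below assumes a hypothetical N(4, 1) population whose standard deviation is known, so that each interval is the sample mean plus or minus z times sigma divided by the square root of n. The population values, sample size, number of trials, and seed are illustrative choices and are not from the text.

```python
import random
from statistics import NormalDist, mean

random.seed(1)
mu, sigma, n = 4.0, 1.0, 10            # hypothetical fixed population and sample size
z = NormalDist().inv_cdf(0.975)        # two-sided 95% interval
half_width = z * sigma / n ** 0.5

trials = 5000
hits = 0
for _ in range(trials):
    xbar = mean(random.gauss(mu, sigma) for _ in range(n))
    lower, upper = xbar - half_width, xbar + half_width   # random limits L and U
    if lower < mu < upper:                                # fixed parameter
        hits += 1

coverage = hits / trials
print(coverage)
```

The fraction of intervals that contain the fixed mean should be close to 0.95, which is the sense in which the interval, and not the parameter, carries the probability statement.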
Mood et al. (1974) discuss a general method for determining confidence intervals. Ostle
(1963) presents expressions for the confidence intervals for many different statistics. In the
discussion to follow, a procedure known as the method of pivotal quantities for determining confidence
limits will be illustrated. This method consists of finding a random variable V that is a
function of the parameter $\theta$ but whose distribution does not involve any other unknown parameters.
Then $v_1$ and $v_2$ are determined such that

$$\operatorname{prob}(v_1 < V < v_2) = 1 - \alpha \qquad (8.2)$$

This inequality is then manipulated so that it is in the form of equation 8.1 where U and L are random
variables depending on V but not $\theta$.
Mean of a Normal Distribution
As an example of using equation 8.2, the confidence intervals on the mean of a normal
distribution will be determined. We have shown that the quantity

$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}, \qquad s_{\bar{x}} = \frac{s_x}{\sqrt{n}}$$

has a t distribution with n - 1 degrees of freedom, where n is the number of observations used
to estimate $\bar{x}$. Using equation 8.2 we have

$$\operatorname{prob}\left(v_1 < \frac{\bar{x} - \mu}{s_{\bar{x}}} < v_2\right) = 1 - \alpha \qquad (8.3)$$

If it is desired that the confidence interval be symmetrical in probability, $v_1$ and $v_2$ can be
chosen so that the probability that a random t is less than $v_1$ equals the probability that a random
t exceeds $v_2$. Since the $100(1 - \alpha)$ percent confidence interval is being sought, both of these
probabilities must be $\alpha/2$. The probability that the confidence intervals do not contain $\mu$ has been
divided equally between the upper and lower bounds. In the following the notation $t_{\alpha,n}$ corresponds
to the value of t such that the probability of a random t with n degrees of freedom being
less than $t_{\alpha,n}$ is $\alpha$ (see figure 8.1).
Equation 8.3 is equivalent to

$$\operatorname{prob}\left(t_{\alpha/2,n-1} < \frac{\bar{x} - \mu}{s_{\bar{x}}} < t_{1-\alpha/2,n-1}\right) = 1 - \alpha$$

Since the t distribution is symmetrical, $t_{\alpha/2,n-1} = -t_{1-\alpha/2,n-1}$. Therefore

$$\operatorname{prob}\left(\bar{x} - t_{1-\alpha/2,n-1}\,s_{\bar{x}} < \mu < \bar{x} + t_{1-\alpha/2,n-1}\,s_{\bar{x}}\right) = 1 - \alpha$$

Fig. 8.1. Illustration of confidence intervals using the t distribution.

This latter equation is in the form of equation 8.1, so the confidence limits are

$$l = \bar{x} - t_{1-\alpha/2,n-1}\,s_{\bar{x}}, \qquad u = \bar{x} + t_{1-\alpha/2,n-1}\,s_{\bar{x}} \qquad (8.4)$$

Because $\bar{x}$ and $s_{\bar{x}}$ are both random variables, L and U are random variables as well, with
estimates l and u given by equation 8.4. Note that the assumption that the observations are
normally distributed was made.
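
A minimal sketch of the calculation follows, with limits of the form sample mean plus or minus t times the sample standard deviation divided by the square root of n. The data are hypothetical, and the critical t value is supplied by the user from a t table (here 2.262, the tabulated 0.975 point for 9 degrees of freedom), since the Python standard library has no t distribution.

```python
from statistics import mean, stdev


def mean_confidence_limits(sample, t_crit):
    """Two-sided confidence limits on the mean of a normal population.

    t_crit is the tabulated t value for probability 1 - alpha/2 and
    len(sample) - 1 degrees of freedom.
    """
    n = len(sample)
    xbar = mean(sample)
    s_xbar = stdev(sample) / n ** 0.5    # standard error of the mean
    return xbar - t_crit * s_xbar, xbar + t_crit * s_xbar


# Hypothetical sample of 10 annual values; t(0.975, 9 df) = 2.262 from a t table.
flows = [12.1, 15.3, 9.8, 14.2, 11.7, 13.5, 10.9, 16.0, 12.8, 13.7]
lower, upper = mean_confidence_limits(flows, t_crit=2.262)
print(round(lower, 2), round(upper, 2))
```

Passing the tabulated value explicitly keeps the sketch free of any assumption about how t quantiles are computed; a statistics package could supply the same value.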

Example 8.1. The sample mean and standard deviation of the Kentucky River data contained in table 2.1
have been calculated as $\bar{x}$ = 66,540 cfs and $s_x$ = 22,322 cfs. What are the 95% confidence limits on the
mean assuming the sample is from a normal population?

Solution:

From the t table in the appendix

From equation 8.4

Thus, we can say that we are 95% confident that the interval 62,076 to 71,004 contains the true
population mean.
Comment: If a 90% confidence interval is calculated, it is found to be 62,817 to 70,263. Thus, the
90% confidence interval is shorter than the 95% confidence interval but our degree of confidence
that the interval contains $\mu$ has decreased from 95% to 90%.
If a second independent sample of peak flows on the Kentucky River near Salvisa were avail-
able, this sample would have a different mean and variance. In this case, the 95% confidence
intervals would be different as well. If many samples were available and the 95% confidence
limits were calculated for each, 95% of the confidence limits would contain the true population
mean while 5% would not if the data were actually from a normal distribution. The 100(1 - a)%
confidence interval on the mean can be made as small as desired by increasing the sample size.
This is because $s_{\bar{x}}$ decreases as the sample size is increased. An increase in the reliability of the
sample mean comes at the expense of an increase in the sample size. Unfortunately, in many
hydrologic problems the sample size is fixed. For a normal distribution, equation 8.4 provides a
means for determining the sample size required in order to estimate $\mu$ within a given reliability.
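If the population variance of the normal distribution is known, then the pivotal quantity in
equation 8.3 becomes $(\bar{x} - \mu)/\sigma_{\bar{x}}$, with $\sigma_{\bar{x}} = \sigma_x/\sqrt{n}$, which has a standard normal distribution. The confidence
limits then become

$$l = \bar{x} - z_{1-\alpha/2}\,\sigma_{\bar{x}}, \qquad u = \bar{x} + z_{1-\alpha/2}\,\sigma_{\bar{x}} \qquad (8.5)$$

where $z_{1-\alpha/2}$ is the value of Z from the standard normal distribution such that the area to the right
of Z is $\alpha/2$.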
Equations 8.4 and 8.5 are based on the assumption that the underlying population of the
random variable X has a normal distribution. Only through the Central Limit Theorem can these
relations be applied to non-normal distributions. Confidence limits calculated by these relation-
ships for the means of random samples from non-normal populations are only approximate with
the approximation improving as the sample size increases. If these approximations are not satis-
factory, other methods are available (Ostle 1963; Mood et al. 1974).

Variance of a Normal Distribution


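The quantity $(n - 1)s_x^2/\sigma_x^2$ has a chi-square distribution with n - 1 degrees of freedom. Letting
this quantity equal V in equation 8.2 results in

$$\operatorname{prob}\left(v_1 < \frac{(n - 1)s_x^2}{\sigma_x^2} < v_2\right) = 1 - \alpha$$

Choose $v_1$ equal to $\chi^2_{\alpha/2,n-1}$ and $v_2$ as $\chi^2_{1-\alpha/2,n-1}$. Then

$$\operatorname{prob}\left(\frac{(n - 1)s_x^2}{\chi^2_{1-\alpha/2,n-1}} < \sigma_x^2 < \frac{(n - 1)s_x^2}{\chi^2_{\alpha/2,n-1}}\right) = 1 - \alpha$$

which is in the form of equation 8.1. Thus, the confidence limits on $\sigma_x^2$ are

$$l = \frac{(n - 1)s_x^2}{\chi^2_{1-\alpha/2,n-1}}, \qquad u = \frac{(n - 1)s_x^2}{\chi^2_{\alpha/2,n-1}} \qquad (8.6)$$

Again, equations 8.6 are strictly valid only if X is from a normal distribution and approximate
for X from a non-normal distribution, with the approximation improving as the sample
size increases.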
Fig. 8.2. Confidence limits on a chi-square distribution.

The chi-square distribution is not symmetrical so that $s_x^2 - l$ is not equal to $u - s_x^2$. As the
sample size and, thus, the degrees of freedom on the chi-square distribution increase, the distribution
approaches a symmetrical distribution so that the upper and lower confidence limits are
nearly the same distance from $s_x^2$. This is illustrated in figure 8.2.
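
The asymmetry of the limits on the variance, and its decrease with sample size, can be sketched as follows. The limits are (n - 1)s²/χ² evaluated at the upper and lower chi-square points. Because the Python standard library has no chi-square distribution, the sketch swaps in the Wilson-Hilferty approximation for the chi-square quantiles in place of a table lookup; this substitution and the sample variance of 25 are assumptions made for illustration.

```python
from statistics import NormalDist


def chi2_quantile_wh(p, dof):
    """Approximate chi-square quantile (Wilson-Hilferty); stands in for a table."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def variance_confidence_limits(s2, n, alpha):
    """Two-sided limits on the variance of a normal population."""
    dof = n - 1
    lower = dof * s2 / chi2_quantile_wh(1.0 - alpha / 2.0, dof)
    upper = dof * s2 / chi2_quantile_wh(alpha / 2.0, dof)
    return lower, upper


s2 = 25.0  # hypothetical sample variance
for n in (10, 100, 1000):
    l, u = variance_confidence_limits(s2, n, alpha=0.10)
    # Distances of the lower and upper limits from the sample variance.
    print(n, round(s2 - l, 2), round(u - s2, 2))
```

For small n the upper limit lies much farther from the sample variance than the lower limit; as n grows the two distances approach one another.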

Example 8.2. Determine the 90% confidence limits on the variance for the situation described in
example 8.1.

Solution:

The 90% confidence intervals on the standard deviation are found (by taking the square
roots of the above limits) to be 20,001 to 25,331 cfs.
Comment: In the preceding two examples the confidence limits on the mean and variance of a
normal distribution were calculated. If the joint confidence limits on $\mu$ and $\sigma_x^2$ are desired, they
cannot be computed separately as was done in these examples. Mood et al. (1974) discuss the
estimation of joint confidence intervals.

One-Sided Confidence Intervals


Situations may arise where one is only interested in an interval estimate on one side of a parameter.
For instance, it may be desired to find only a lower confidence limit. In this situation
equation 8.1 becomes

$$\operatorname{prob}(L < \theta) = 1 - \alpha$$

The same procedure for finding L would be followed as was used in the two-sided case,
except now all of the probability $\alpha$ will be in one tail. For instance, the one-sided lower limit on
the mean of a normal distribution with an unknown variance would be

$$l = \bar{x} - t_{1-\alpha,n-1}\,s_x/\sqrt{n}$$

The analogous results would hold for any one-sided, lower or upper confidence limit.

Parameters of Probability Distributions


For a wide class of distributions for large samples, the maximum likelihood estimators for
the parameters of the distribution are asymptotically normally distributed with the true parameter
value as the mean and a variance of

$$\left\{ n\,E\left[\left(\frac{\partial}{\partial\theta}\ln p_X(x;\theta)\right)^{2}\right]\right\}^{-1}.$$

Using this information, it is possible to construct confidence intervals and joint confidence
intervals for the parameters of these distributions. The book by Mood et al. (1974) should be con-
sulted for the procedures to be used.

HYPOTHESIS TESTING
Often the acceptability of statistical models can be judged without actually making any
statistical tests. This would be the case when observed data are predicted very closely by the model
or when observed data deviate very greatly from the model. On the other hand, a common
occurrence is for the observed data to deviate some from the model but not enough for one to
state that the model is obviously inadequate. In this latter situation one must determine whether
the deviations represent true inadequacies in the model, or whether the deviations are chance
variations from the true model.
The general procedure to be followed in making statistical tests is

1. Formulate the hypothesis to be tested.

2. Formulate an alternative hypothesis.

3. Determine a test statistic.

4. Determine the distribution of the test statistic.

5. Define the rejection region or critical region of the test statistic.

6. Collect the data needed to calculate the test statistic.

7. Determine if the calculated value of the test statistic falls in the rejection region of the distri-
bution of the test statistic.
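
The seven steps can be sketched for one simple case, a two-sided test on the mean of a normal population whose standard deviation is assumed known. The choice of test, the sample values, the hypothesized mean, and the value of alpha are all hypothetical and serve only to show where each step appears.

```python
from statistics import NormalDist, mean


def z_test_two_sided(sample, mu0, sigma, alpha):
    # Steps 1 and 2: hypothesis mu = mu0 against the alternative mu != mu0.
    # Steps 3 and 4: Z = (xbar - mu0) / (sigma / sqrt(n)) is standard normal
    # if the hypothesis is true.
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / n ** 0.5)
    # Step 5: the rejection region is |Z| > z(1 - alpha/2).
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Step 7: check whether the calculated statistic falls in the rejection region.
    return z, z_crit, abs(z) > z_crit


# Step 6: the data (hypothetical values, sigma assumed known and equal to 1).
sample = [4.6, 5.1, 3.9, 5.4, 4.8, 5.6, 4.2, 5.0]
z, z_crit, reject = z_test_two_sided(sample, mu0=4.0, sigma=1.0, alpha=0.10)
print(round(z, 3), round(z_crit, 3), reject)
```

Note that the rejection region is fixed before the data enter the calculation; only the last line of the function depends on the observed sample.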
Table 8.1. Errors in hypothesis testing

                             True situation
Decision              Hypothesis true     Hypothesis false

Accept hypothesis     No error            Type II error
Reject hypothesis     Type I error        No error

For many statistical tests, steps 2-4 have been completed and may be found in a wide variety
of statistics books. For many of the tests that a hydrologist might like to make, adequate test
statistics and their distributions have not been determined, largely because of restrictive assumptions.
Nonparametric tests relieve this problem to some extent.
It is not possible to develop tests that are absolutely conclusive. All of the tests have a
possibility of two kinds of error: rejecting a true hypothesis (Type I error) or accepting a false
hypothesis (Type II error). Table 8.1 depicts the two types of errors. The probability of a Type I
error is denoted by $\alpha$ and the probability of a Type II error by $\beta$. The significance level is defined
as $100(1 - \alpha)$ (in percent). In testing hypotheses, the probability of a Type I error can be specified;
however, the probability of a Type II error is not known unless the true parameter values
being tested are known. In general as the value of $\alpha$ decreases, the magnitude of $\beta$ increases.
As an example, assume we select an observation $x_0$ at random from a normal distribution
with variance $\sigma_0^2$ and hypothesize that the distribution has a mean $\mu_0$. The test statistic could be $x_0$
itself, which has a normal distribution with unknown mean and variance $\sigma_0^2$. If the hypothesis is
true (something that is not known or the test would not be made), the distribution of the test
statistic would be a normal distribution with mean $\mu_0$ and variance $\sigma_0^2$ and would appear as in
Figure 8.3. If it is decided to accept the hypothesis if $x_0$ is within 2 standard deviations of $\mu_0$ and
reject the hypothesis otherwise, the critical region or rejection region would be the shaded area in
Figure 8.3. From the properties of the normal distribution, it is known that 95.44% of the area of
the normal curve is within 2 standard deviations of the mean, so the critical region occupies 4.56%
of the area. It is also apparent that there is a 4.56% chance that $x_0$ will be in the critical region and
the hypothesis rejected even though it is true. Thus, by definition $\alpha = 0.0456$, or there is a 4.56%
chance of making a Type I error due to random variation in the $x_0$ selected. It is more common to
specify $\alpha$ and from this information determine the critical region. For example, if one wanted $\alpha$ to
be 0.10, then the critical region would be $|(x_0 - \mu_0)/\sigma_0| > 1.645$, which is the value of the standard
normal distribution such that the area outside the limits -1.645 to 1.645 is 0.10.
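
Both directions of this calculation, from a chosen region to alpha and from a chosen alpha to the region, can be sketched with the standard normal distribution. The values of the hypothesized mean and standard deviation used for the bounds are hypothetical. The computed alpha for the two-standard-deviation rule is about 0.0455, which agrees with the 4.56% quoted above to within table rounding.

```python
from statistics import NormalDist


def two_sided_critical_region(mu0, sigma0, alpha):
    """Bounds outside of which the hypothesis mu = mu0 is rejected
    when the test statistic is a single observation."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return mu0 - z * sigma0, mu0 + z * sigma0


# Alpha implied by accepting the hypothesis within 2 standard deviations.
alpha_two_sd = 2.0 * (1.0 - NormalDist().cdf(2.0))

# Hypothetical mu0 = 10 and sigma0 = 2, with alpha specified as 0.10.
lower, upper = two_sided_critical_region(mu0=10.0, sigma0=2.0, alpha=0.10)
print(round(alpha_two_sd, 4), round(lower, 2), round(upper, 2))
```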

Fig. 8.3. Critical region (horizontal axis marked at $\mu_0 - 2\sigma_0$, $\mu_0$, and $\mu_0 + 2\sigma_0$).

Fig. 8.4. Illustration of $\alpha$ and $\beta$.

In order to evaluate $\beta$, the true parameter values must be known. Again, consider selecting
a single value $x_0$ from a normal population with variance $\sigma_0^2$ and an unknown mean. Let the
hypothesis be that $\mu = \mu_0$ and the alternative be $\mu \neq \mu_0$. If $\mu$ actually equals $\mu_1$, then the
situation depicted in figure 8.4 would exist and there is a $100\beta$% chance that $x_0$ will fall in the
acceptance region of $N(\mu_0, \sigma_0^2)$ and thus a Type II error committed. From figure 8.4 it can be seen
that as $\alpha$ is increased, $\beta$ will decrease. It can also be seen that the nearer $\mu_1$ is to $\mu_0$, the greater
will be $\beta$. This is because it is increasingly difficult to tell the difference between the two distributions.
It is not possible to determine the magnitude of $\beta$ because it is a function of the unknown
population mean $\mu_1$. Example 8.3 shows how $\beta$ can be evaluated if $\mu_1$ is known. Of course, $\mu_1$
would not be known or else one would not hypothesize $\mu = \mu_0$.

Example 8.3. Assume a single observation is selected from a normal distribution with mean
$\mu_1 = 7$ and variance $\sigma_0^2 = 9$. It is hypothesized that $\mu = \mu_0 = 5$. If the test is conducted at the
10% significance level, what is $\beta$?

Solution:
Reference should be made to figure 8.5.

Fig. 8.5. Illustration for example 8.3.

$\alpha = 0.10$
$\alpha/2 = 0.05$, which corresponds to $z_{1-\alpha/2} = 1.645$
$(x_u - \mu_0)/\sigma_0 = z_{1-\alpha/2}$, where $x_u$ is the boundary of the upper critical region
$(x_u - 5)/3 = 1.645$
$x_u = 9.935$
$A_u$ = the area of a normal distribution with mean of 7 and variance of 9 to the left of 9.935.
The standardized variate corresponding to $x_u = 9.935$ is

$$z_u = (9.935 - 7)/3 = 0.978$$

The area to the left of $z_u = 0.978$ from a standard normal distribution is 0.8365. Similarly, if $x_l$
is the boundary of the lower critical region, we have $(x_l - 5)/3 = -1.645$, or $x_l = 0.065$. $A_l$ is the
area of a normal distribution with mean 7 and variance 9 to the left of 0.065. $z_l = (0.065 - 7)/3$
or $z_l = -2.31$. $A_l = 0.0104$. Now $\beta = A_u - A_l$ or $\beta = 0.8365 - 0.0104 = 0.8261$. Thus, the
probability of accepting the hypothesis that $\mu = 5$ when in fact $\mu = 7$ is 0.8261 when $\alpha$ is 0.10. The
probability of a Type II error is 0.8261.
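
The steps of this example can be collected into a short function. The sketch follows the same logic: find the acceptance region under the hypothesized mean, then find the area of the true distribution that falls inside it. The function and argument names are illustrative. Because it uses unrounded normal probabilities instead of table values, the result is about 0.826 and differs from 0.8261 only in the last digit.

```python
from statistics import NormalDist


def type_ii_error(mu1, mu0, sigma0, alpha):
    """Probability of accepting mu = mu0 from one observation when the
    true mean is mu1 (two-sided test, known standard deviation sigma0)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    x_lower = mu0 - z * sigma0           # boundary of the lower critical region
    x_upper = mu0 + z * sigma0           # boundary of the upper critical region
    true_dist = NormalDist(mu=mu1, sigma=sigma0)
    return true_dist.cdf(x_upper) - true_dist.cdf(x_lower)


beta = type_ii_error(mu1=7.0, mu0=5.0, sigma0=3.0, alpha=0.10)
print(round(beta, 3))
```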

If calculations such as those contained in example 8.3 are carried out for various values of
$\mu_1$, a curve relating $\beta$ to $\mu_1$ can be constructed. Such a curve is shown in figure 8.6. Figure 8.6
shows the $\beta$ curve for $\alpha = 0.05$ and $\alpha = 0.10$. Curves such as shown in figure 8.6 are often
called operating characteristic (OC) curves.
Figure 8.6 verifies the earlier statements that $\beta$ increases as $\alpha$ decreases and $\beta$ increases as
the true mean, $\mu_1$, approaches the hypothesized mean, $\mu_0$. In fact, as $\mu_1$ gets close to $\mu_0$, the
probability of accepting $\mu = \mu_0$ when $\mu = \mu_1$ is true gets very large. This may not be a serious
problem in practice because we may not care, for instance, whether $\mu$ is 5 or 5.5.

Fig. 8.6. Probability of a Type II error as a function of the true mean $\mu_1$ for example 8.3.

Fig. 8.7. Example power curve (power = $1 - \beta$ plotted against $\mu_1$).
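
An operating characteristic curve of this kind can be tabulated by repeating the Type II error calculation over a grid of true means. The sketch below uses the setting of example 8.3 (hypothesized mean 5, standard deviation 3); the grid of true means is an arbitrary choice, and the last printed column is the complement of the Type II error probability for alpha = 0.10.

```python
from statistics import NormalDist


def oc_curve(mu1_values, mu0, sigma0, alpha):
    """Type II error probability at each candidate true mean."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lower, upper = mu0 - z * sigma0, mu0 + z * sigma0    # acceptance region
    betas = []
    for mu1 in mu1_values:
        true_dist = NormalDist(mu=mu1, sigma=sigma0)
        betas.append(true_dist.cdf(upper) - true_dist.cdf(lower))
    return betas


mu1_grid = list(range(-10, 21, 5))                       # -10, -5, ..., 20
beta_05 = oc_curve(mu1_grid, mu0=5.0, sigma0=3.0, alpha=0.05)
beta_10 = oc_curve(mu1_grid, mu0=5.0, sigma0=3.0, alpha=0.10)

for mu1, b05, b10 in zip(mu1_grid, beta_05, beta_10):
    print(mu1, round(b05, 3), round(b10, 3), round(1.0 - b10, 3))
```

At every true mean the curve for alpha = 0.05 lies above the curve for alpha = 0.10, and both curves peak where the true mean equals the hypothesized mean.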
The quantity $1 - \beta$ is called the power of a test. Ideally, we would like the power to be large
for all values of $\mu_1$. In fact in testing a hypothesis, we would like $\alpha$ to be small and the power to be
large. Figure 8.7 shows that power of a test is a function of $\alpha$ and true parameter values. The power
of a test is also a function of the test itself. For instance, we could have chosen as our test statistic
$x_0 + 3$ and then rejected the hypothesis if $x_0 + 3$ fell in the critical region. Figure 8.7 compares the
power of this test with the test that rejected the hypothesis if $x_0$ fell in the critical region.
Figure 8.7 shows that for certain values of $\mu_1$, the $x_0 + 3$ test is more powerful than the $x_0$
test. Ideally, we would like to use the test that was the most powerful over the entire range of the
unknown parameter. Such a test is known as a uniformly most powerful test. Unfortunately, uniformly
most powerful tests do not exist in many situations.
Selecting which test to use comes down to the purposes of the test and the consequences of
making an error. In our example, if accepting the hypothesis $\mu = 5$ when in fact $\mu > 5$ is a very
serious error, whereas accepting it if $\mu < 5$ is of little consequence, we might prefer the $x_0 + 3$
test because it is more powerful in the region $\mu > 5$. If the consequence of an error depended only
on the magnitude of the error, the $x_0$ test might be preferred.
From the above discussion, it should be apparent that the selection of $\alpha$ and the type of test
to be used depends on the problem at hand. Mood et al. (1974) discuss these concepts in more
theoretical terms. The level of significance, $\alpha$, is usually chosen to be 0.10, 0.05, or 0.01. In
theory, $\alpha$ should be based on the problem at hand. In practice, $\alpha$ is generally arbitrarily selected.
Many tests of hypothesis are of the type $\theta = \theta_1$ versus the alternative $\theta \neq \theta_1$. Accepting
such a hypothesis as true does not mean that one strictly feels that $\theta = \theta_1$ but rather that $\theta$ is not
significantly different from $\theta_1$. For example, if we calculate the mean of a random sample and then
accept the hypothesis that the true mean is 5, we may not believe that the true mean is exactly 5 but
rather the true mean is not significantly different from 5. What constitutes a significant difference
has been defined by the type of test used and the level of significance. Furthermore, a statistically
significant difference and a physically significant difference are not the same. For example, if
$\hat{\theta} = 4.0$ is an estimate for $\theta$ and a test of hypothesis shows $\hat{\theta}$ is not significantly different from
zero, it does not mean $\theta = 0$ should be used in some physical analysis if this physical analysis is
sensitive to differences in $\theta$ of this order of magnitude. A physically significant difference depends
on the problem being studied.
The following is a discussion of several common tests of hypotheses. The hypothesis to be
tested is denoted by $H_0$ and the alternative hypothesis by $H_a$. For the tests that follow to be correct
statistical tests, the assumptions involved in developing the test statistic must not be violated.
A primary assumption is that the statistics are estimated based on a random sample. In practice,
at least some of the assumptions are generally violated, with the result that the tests are only
approximate tests. This approximation is manifest in the fact that the actual level of significance
will not be equal to $100\alpha$%. The fact that these tests are often approximate due to assumption violations
does not render the tests of no value. It is the analyst that must make the decision, not a
statistical test following some prescribed procedure. The analyst may put less weight on a statistical
test in arriving at a decision if the violations of the assumptions of the statistical test are of
concern, however.

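$H_0$: $\mu = \mu_1$, $H_a$: $\mu = \mu_2$, Normal Distribution, Known Variance

In this case, $H_0$ is a simple hypothesis and $H_a$ is a simple alternative hypothesis. The test
statistic is developed by considering that

$$Z = \frac{\bar{X} - \mu_1}{\sigma_x/\sqrt{n}}$$

has a standard normal distribution. If $\mu_1 > \mu_2$, then $H_0$ is rejected if

$$\bar{x} \le \mu_1 - z_{1-\alpha}\,\sigma_x/\sqrt{n}$$

If $\mu_1 < \mu_2$, then $H_0$ is rejected if

$$\bar{x} \ge \mu_1 + z_{1-\alpha}\,\sigma_x/\sqrt{n}$$

In the preceding expressions, $z_{1-\alpha}$ represents the point on the standard normal distribution
such that prob$(Z \ge z_{1-\alpha}) = \alpha$.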

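$H_0$: $\mu = \mu_1$, $H_a$: $\mu = \mu_2$, Normal Distribution, Unknown Variance

The test statistic for this situation is

$$t = \frac{\bar{x} - \mu_1}{s_x/\sqrt{n}}$$

$H_0$ is rejected if

$$\bar{x} \le \mu_1 - t_{1-\alpha,n-1}\,s_x/\sqrt{n} \quad \text{for } \mu_1 > \mu_2$$

and

$$\bar{x} \ge \mu_1 + t_{1-\alpha,n-1}\,s_x/\sqrt{n} \quad \text{for } \mu_1 < \mu_2 \tag{8.12}$$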

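$H_0$: $\mu = \mu_0$, $H_a$: $\mu \neq \mu_0$, Normal Distribution, Known Variance

This hypothesis is a simple hypothesis with a compound alternative hypothesis.
Again, the test statistic is

$$Z = \frac{\bar{X} - \mu_0}{\sigma_x/\sqrt{n}}$$

$H_0$ is rejected if

$$|z| = \left|\frac{\bar{x} - \mu_0}{\sigma_x/\sqrt{n}}\right| > z_{1-\alpha/2}$$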

$H_0$: $\mu = \mu_0$, $H_a$: $\mu \neq \mu_0$, Normal Distribution, Unknown Variance

Generally, a population variance is not known and must be estimated. In that case, $H_0$ is
tested by using

$$t = \frac{\bar{x} - \mu_0}{s_x/\sqrt{n}}$$

$H_0$ is rejected if

$$|t| = \left|\frac{\bar{x} - \mu_0}{s_x/\sqrt{n}}\right| > t_{1-\alpha/2,n-1}$$

This test cannot be applied to every set of data. The assumption has been made that the
observations are from a normal distribution.

Example 8.4. The annual runoff for Cave Creek near Fort Spring, Kentucky, for the period 1953
to 1970, has a mean of 14.65 inches and a standard deviation of 4.75 inches. Test the hypothesis
that the mean annual runoff is 16.5 inches.

Solution: The testing procedures we have available to us are all based on the assumption of
normality. If we assume the annual runoff is normally distributed, we can use equation 8.14 to
test $H_0$: $\mu = 16.5$ versus $H_a$: $\mu \neq 16.5$.
There are 18 observations. The test statistic is

$$t = \frac{14.65 - 16.5}{4.75/\sqrt{18}} = -1.65$$

Using a 5% level of significance, $\alpha = 0.05$ and $t_{1-\alpha/2,n-1} = t_{0.975,17} = 2.11$. Because
$|t| = 1.65 < 2.11$, we do not reject the hypothesis that the mean is 16.5.
Comment: Some statisticians do not like to "accept" $H_0$. Their reasoning is that we have not
proven $H_0$, only failed to find sufficient evidence against it. As a result of a statistical test, their
conclusions would be either reject $H_0$ or fail to reject $H_0$. Whichever wording is used, it should be
kept in mind that we have not proven $H_0$.
For instance, in this example, we have calculated the sample mean to be 14.65 and accepted
the hypothesis that the population mean is 16.5. This illustrates two points. First, the data and the
test obviously do not prove that $\mu = 16.5$. Second, what we really have accepted is not that the
mean is 16.5 but that when sampling from this distribution using a sample of size 18, the
difference between the sample mean of 14.65 and the hypothesized mean of 16.5 can reasonably be
ascribed to chance variations due to the random sample. Our conclusion is that, based on this
sample, we cannot say that the population mean is not 16.5 or, equivalently, that based on this sample
the population mean is not (statistically) significantly different from 16.5.
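
The arithmetic of example 8.4 can be checked with a few lines of Python. This is only a sketch: the function name `one_sample_t` is an illustrative choice, and the critical value 2.11 is simply the tabled $t_{0.975,17}$ quoted in the example rather than something computed here.

```python
import math

def one_sample_t(sample_mean, sample_sd, n, mu0):
    """t statistic for H0: mu = mu0 when the variance is estimated from the sample."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error

# Summary statistics for Cave Creek annual runoff as given in example 8.4
t_stat = one_sample_t(sample_mean=14.65, sample_sd=4.75, n=18, mu0=16.5)

# Two-sided critical value t_{0.975,17}, taken from a t table as in the text
t_crit = 2.11

reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

The decision rule compares $|t|$ with the tabled value, so the sign of the statistic does not matter for this two-sided test.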

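Test for Differences in Means of Two Normal Distributions

If the variances of the two normal distributions are known, then $H_0$: $\mu_1 - \mu_2 = \delta$ versus
$H_a$: $\mu_1 - \mu_2 \neq \delta$ can be tested by calculating the test statistic

$$Z = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\left(\sigma_1^2/n_1 + \sigma_2^2/n_2\right)^{1/2}}$$

In this case, $Z$ has a standard normal distribution, so the rejection region is $|z| > z_{1-\alpha/2}$.
If the variances of the two normal distributions are equal but unknown, $H_0$: $\mu_1 - \mu_2 = \delta$
versus $H_a$: $\mu_1 - \mu_2 \neq \delta$ is tested by calculating the statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\left[\dfrac{n_1 + n_2}{n_1 n_2}\right]^{1/2}\left[\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\right]^{1/2}}$$

which has a $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom. Thus, $H_0$ is rejected if

$$|t| > t_{1-\alpha/2,\,n_1+n_2-2}$$

Again, note that these two tests are based on sample normality. For large samples, the Central
Limit Theorem may enable us to use these tests as approximate tests for nonnormal samples.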
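
As an illustration of the equal-variance case, the following Python sketch computes the pooled $t$ statistic and its degrees of freedom. The function name `pooled_t` and the two small samples are hypothetical, chosen only so the arithmetic is easy to follow; the critical value would still be read from a $t$ table.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2, delta=0.0):
    """Pooled two-sample t statistic for H0: mu1 - mu2 = delta (equal, unknown variances)."""
    n1, n2 = len(x1), len(x2)
    # Pooled estimate of the common variance
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    t_stat = (mean(x1) - mean(x2) - delta) / standard_error
    return t_stat, n1 + n2 - 2

# Hypothetical samples, not taken from the text
sample_1 = [10.0, 12.0, 14.0]
sample_2 = [11.0, 13.0, 15.0, 17.0]

t_stat, dof = pooled_t(sample_1, sample_2)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The factor $(1/n_1 + 1/n_2)$ used in the code is the same quantity as $(n_1 + n_2)/(n_1 n_2)$.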
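Gibra (1973), Ostle (1963) and others discuss testing $H_0$: $\mu_1 - \mu_2 = \delta$ versus
$H_a$: $\mu_1 - \mu_2 \neq \delta$ when sampling from two normal populations with unknown and unequal
variances. Ostle recommends the following approximate procedure. Compute the test statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\left(s_1^2/n_1 + s_2^2/n_2\right)^{1/2}}$$

The hypothesis is rejected if

$$|t| > \frac{w_1 t_1 + w_2 t_2}{w_1 + w_2}$$

where

$$w_1 = s_1^2/n_1, \quad w_2 = s_2^2/n_2, \quad t_1 = t_{1-\alpha/2,\,n_1-1}, \quad t_2 = t_{1-\alpha/2,\,n_2-1}$$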
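Test of $H_0$: $\sigma^2 = \sigma_0^2$ versus $H_a$: $\sigma^2 \neq \sigma_0^2$, Normal Population

A test of $H_0$: $\sigma^2 = \sigma_0^2$ versus $H_a$: $\sigma^2 \neq \sigma_0^2$ when sampling from a normal distribution with
sample size $n$ can be made by calculating the test statistic

$$\chi_c^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\sigma_0^2} = \frac{(n - 1)s^2}{\sigma_0^2}$$

and then accepting $H_0$ if

$$\chi^2_{\alpha/2,\,n-1} < \chi_c^2 < \chi^2_{1-\alpha/2,\,n-1}$$

Otherwise $H_0$ is rejected.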

Test of $H_0$: $\sigma_1^2 = \sigma_2^2$ versus $H_a$: $\sigma_1^2 \neq \sigma_2^2$ for Two Normal Populations

To test the hypothesis that the variances of two normal populations are equal, the
sample test statistic is

$$F_c = s_1^2/s_2^2$$

where $s_1^2$ is the larger sample variance. $F_c$ is distributed as an $F$ distribution with $n_1 - 1$ and $n_2 - 1$
degrees of freedom, where $n_1$ is the sample size for the sample having the larger variance and $n_2$ is
the sample size for the sample with the smaller variance. $H_0$ is rejected if

$$F_c > F_{1-\alpha/2,\,n_1-1,\,n_2-1}$$

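Test for Equality of Variances from Several Normal Distributions

To test $H_0$: $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$ for $k$ independent samples, each from a normal population
with mean $\mu_i$ and variance $\sigma_i^2$, it is first necessary to calculate the $k$ sample variances $s_i^2$. The
quantity $Q/h$ is approximately distributed as a chi-square distribution with $k - 1$ degrees of
freedom where

$$Q = (N - k)\,\ln\!\left[\frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{N - k}\right] - \sum_{i=1}^{k} (n_i - 1)\ln s_i^2$$

and

$$h = 1 + \frac{1}{3(k - 1)}\left[\sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{N - k}\right]$$

Here $n_i$ is the size of the $i$th sample and $N = \sum n_i$. $H_0$ is rejected if

$$Q/h > \chi^2_{1-\alpha,\,k-1}$$

In this test, $H_a$ is that the $\sigma_i^2$ are not all equal. This means that at least one $\sigma_i^2$ is different
from the others.
The test is known as Bartlett's test for homogeneity of variances. Homogeneity of
variance is also known as homoscedasticity.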
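
A minimal Python sketch of Bartlett's statistic follows, assuming the standard form of $Q$ and $h$ with $n_i$ the size of sample $i$ and $N = \sum n_i$. The function name `bartlett_statistic` and the three small samples are hypothetical. The result would be compared with a tabled $\chi^2_{1-\alpha,k-1}$.

```python
import math
from statistics import variance

def bartlett_statistic(samples):
    """Return (Q/h, degrees of freedom) for Bartlett's test of equal variances."""
    k = len(samples)
    sizes = [len(s) for s in samples]
    variances = [variance(s) for s in samples]   # sample variances s_i^2
    total_n = sum(sizes)
    # Pooled variance weighted by degrees of freedom
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total_n - k)
    q = (total_n - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    h = 1.0 + (sum(1.0 / (n - 1) for n in sizes) - 1.0 / (total_n - k)) / (3.0 * (k - 1))
    return q / h, k - 1

# Hypothetical samples with sample variances 1, 25, and 4
groups = [[1.0, 2.0, 3.0], [0.0, 5.0, 10.0], [2.0, 4.0, 6.0]]
stat, dof = bartlett_statistic(groups)
print(f"Q/h = {stat:.3f} with {dof} degrees of freedom")
```

When all sample variances are equal, $Q$ is zero; the statistic grows as the variances become more dissimilar.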

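Example 8.5. For the preceding example, test the hypothesis that the variance is 36.00.

Solution: The assumption of normality is used. The test is based on equation 8.18 using $\alpha = 0.05$.

$$\chi_c^2 = \frac{(n - 1)s^2}{\sigma_0^2} = \frac{17(4.75)^2}{36.00} = 10.65$$

From a chi-square table

$$\chi^2_{0.025,17} = 7.6 \quad \text{and} \quad \chi^2_{0.975,17} = 30.2$$

Since 10.65 is between 7.6 and 30.2, $H_0$ is not rejected.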
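
The calculation in example 8.5 can be reproduced as follows. The function name `variance_chi_square` is illustrative, and the two critical values are the tabled chi-square values quoted in the example rather than computed quantities.

```python
def variance_chi_square(sample_sd, n, sigma0_sq):
    """Chi-square statistic (n - 1) s^2 / sigma0^2 for H0: sigma^2 = sigma0^2."""
    return (n - 1) * sample_sd ** 2 / sigma0_sq

# Cave Creek values from example 8.4: s = 4.75 inches, n = 18
chi_c = variance_chi_square(sample_sd=4.75, n=18, sigma0_sq=36.0)

# Tabled chi-square values for 17 degrees of freedom, as quoted in the example
lower_crit, upper_crit = 7.6, 30.2

accept_h0 = lower_crit < chi_c < upper_crit
print(f"chi-square = {chi_c:.2f}, H0 not rejected: {accept_h0}")
```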

TESTING THE GOODNESS OF FIT OF DATA TO PROBABILITY DISTRIBUTIONS


Two ways of judging whether or not a particular distribution adequately describes a set of
observations have already been discussed. Both of these methods required a visual judgment of
goodness of fit. One method was to compare the observed relative frequency curve with the
hypothesized relative frequency curve. The second method was to plot the data and the
hypothesized distribution as a cumulative probability distribution on appropriate paper and judge
whether or not the hypothesized distribution adequately describes the plotted points. Statistical
tests corresponding to these visual tests will be discussed. In the following discussion, the
hypothesis being tested is that the data are from a specified probability distribution.

Chi-square Goodness of Fit Test


One of the most commonly used tests for goodness of fit of empirical data to specified
theoretical frequency distributions is the chi-square test. This test makes a comparison between the
actual number of observations and the expected number of observations (expected according to
the distribution under test) that fall in the class intervals. The expected numbers are calculated by
multiplying the expected relative frequency by the total number of observations. The test
statistic is calculated from the relationship

$$\chi_c^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $k$ is the number of class intervals, and $O_i$ is the observed and $E_i$ the expected (according to
the distribution being tested) number of observations in the $i$th class interval. The distribution
of $\chi_c^2$ is a chi-square distribution with $k - p - 1$ degrees of freedom, where $p$ is the number of
parameters estimated from the data. The hypothesis that the data are from the specified
distribution is rejected if

$$\chi_c^2 > \chi^2_{1-\alpha,\,k-p-1}$$

Example 8.6. As an example of using the chi-square test, consider the Kentucky River data of
table 2.1 and test the hypothesis that the data are from a normal distribution. The observed and
expected numbers in each class interval are obtained by multiplying the relative frequency by 99,
which is the number of observations. Table 8.2 shows the calculation of $\chi_c^2$. The degrees of
freedom is $k - 3$, or 7, since two parameters ($\mu_X$ and $\sigma_X^2$) were estimated for the normal
distribution. Comparing $\chi_c^2$ of 4.98 with $\chi^2_{0.90,7} = 12.0$, it is concluded that the normal distribution
cannot be rejected for these data for $\alpha = 0.10$. If $\chi_c^2$ had exceeded $\chi^2_{1-\alpha,\,k-p-1}$, the hypothesis that the
normal distribution describes the data would be rejected.

Table 8.2. Chi-square test on Kentucky River data

Class mark    Observed number    Expected number    $(O - E)^2/E$
    25,000            3                5.03             0.820
    35,000            6                6.57             0.050
    45,000           16               11.10             2.162
    55,000           16               15.39             0.025
    65,000           18               17.51             0.014
    75,000           13               16.35             0.686
    85,000           13               12.54             0.017
    95,000            7                7.89             0.100
   105,000            3                4.08             0.284
   115,000            4                2.55             0.823
   Total             99               99                4.982

In constructing table 8.2 the expected number in a class interval is based on
$n[P_X(x_i) - P_X(x_{i-1})]$ for all intervals except the first and last ones. For the first interval the
expected number is $nP_X(x_1)$ and for the last interval $n[P_X(\infty) - P_X(x_{k-1})]$. In these expressions
$x_i$ represents the right boundary of the $i$th class.
Comment: By examining table 8.2 and equation 8.21, it is apparent that the chi-square goodness
of fit test is quite sensitive in the tails of the assumed distribution. Because of this many
statisticians recommend that classes be combined if the expected number in a class is less than 3 (or 5).
If the criterion of 5 is used, the first two classes and the last two classes must be combined. The
calculation of $\chi_c^2$ is then as shown in table 8.3, and the value of $\chi_c^2$ is reduced to 3.62. The degrees of
freedom are reduced to 5.

Table 8.3. Chi-square test on Kentucky River data (modified)

Class mark    Observed number    Expected number    $(O - E)^2/E$
   Total             99
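
The chi-square statistic of table 8.2, and the effect of combining the two lowest and two highest classes, can be sketched in Python as below. The function name `chi_square_statistic` is illustrative. The observed and expected counts are those listed in table 8.2; the combined value computed this way is about 3.61, which differs slightly from the 3.62 quoted in the text, presumably because of rounding in the tabulated expected numbers.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all class intervals."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed and expected counts from table 8.2 (Kentucky River data)
observed = [3, 6, 16, 16, 18, 13, 13, 7, 3, 4]
expected = [5.03, 6.57, 11.10, 15.39, 17.51, 16.35, 12.54, 7.89, 4.08, 2.55]

chi_full = chi_square_statistic(observed, expected)

# Combine the first two and the last two classes so every expected count is at least 5
observed_combined = [observed[0] + observed[1]] + observed[2:-2] + [observed[-2] + observed[-1]]
expected_combined = [expected[0] + expected[1]] + expected[2:-2] + [expected[-2] + expected[-1]]

chi_combined = chi_square_statistic(observed_combined, expected_combined)

# Degrees of freedom: k - p - 1 with p = 2 estimated parameters
dof_full = len(observed) - 2 - 1
dof_combined = len(observed_combined) - 2 - 1
print(f"{chi_full:.2f} ({dof_full} df), {chi_combined:.2f} ({dof_combined} df)")
```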

Perhaps a better way of conducting the chi-square goodness of fit test is to define the class
intervals so that under the hypothesis being tested the expected number of observations in each
class interval is the same. This means that the class intervals will be of unequal width and that the
interval widths will be a function of the distribution being tested.

Example 8.7. A chi-square test for normality of Kentucky River data using 10 class intervals
each having the same expected frequency can be conducted as follows.
Ten class intervals means that the expected relative frequency or probability in each interval
is 0.1. The class boundaries can be determined by solving the inverse of the cumulative
distribution. For instance, the boundaries of the 4th class interval are given by the values of $x$ satisfying
$P_X(x) = 0.3$ and $P_X(x) = 0.4$.

Table 8.4. Chi-square test based on equal expected numbers per class interval

Class     Lower       Upper       Observed    Expected
number    boundary    boundary    number      number      $(O - E)^2/E$
   1      $-\infty$     37933         8          9.9         0.365
   2       37933       47753        15          9.9         2.627
   3       47753       54834        13          9.9         0.971
   4       54834       60885         7          9.9         0.849
   5       60885       66540         7          9.9         0.849
   6       66540       72195        14          9.9         1.698
   7       72195       78246         5          9.9         2.425
   8       78246       85327        12          9.9         0.445
   9       85327       95147         8          9.9         0.365
  10       95147      $\infty$       10          9.9         0.001
 Total                               99         99          10.596

Table 8.4 contains the data for conducting the chi-square test based on 10 class intervals
having equal expected numbers of observations (99/10 or 9.9) in each interval. In this case, $\chi_c^2$
is 10.60, which is less than $\chi^2_{0.90,7}$ of 12.02. The hypothesis is, again, not rejected.
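
The equal-probability class boundaries of example 8.7 can be generated by inverting the normal cumulative distribution. In the sketch below the mean (66,540 cfs) and standard deviation (22,322 cfs) are assumptions back-calculated from the boundaries listed in table 8.4, not values stated in this section; the observed counts are those of table 8.4.

```python
from statistics import NormalDist

# Assumed parameters, back-calculated from the class boundaries in table 8.4
fitted = NormalDist(mu=66540.0, sigma=22322.0)

k = 10  # number of class intervals, each with probability 1/k

# Interior boundaries solve P_X(x) = i/k for i = 1, ..., k - 1
boundaries = [fitted.inv_cdf(i / k) for i in range(1, k)]

# Observed counts per class from table 8.4; expected count is n/k in every class
observed = [8, 15, 13, 7, 7, 14, 5, 12, 8, 10]
expected = sum(observed) / k

chi_c = sum((o - expected) ** 2 / expected for o in observed)

print([round(b) for b in boundaries])
print(f"expected per class = {expected}, chi-square = {chi_c:.3f}")
```

Because every class has the same expected count, the class widths are narrow near the mean and wide in the tails.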

Distributional Tests Based on Cumulative Distributions


Conover (1980) presents a good discussion of statistical tests based on cumulative
distributions. The most commonly used of these tests is the Kolmogorov-Smirnov one sample test (also
known as the Kolmogorov test). The hypothesis being tested is that a set of empirical
observations come from a particular, known, and completely specified cumulative distribution. This test
is conducted as follows:

1. Let $P_X(x)$ be the completely specified theoretical cumulative distribution function under the
null hypothesis.

2. Let $S_n(x)$ be the sample cumulative distribution function based on $n$ observations. For any
observed $x$, $S_n(x) = k/n$, where $k$ is the number of observations less than or equal to $x$.

3. Determine the maximum deviation, $D$, defined by

$$D = \max\,|P_X(x) - S_n(x)|$$

4. If, for the chosen significance level, the observed value of $D$ is greater than or equal to the
critical tabulated value of the Kolmogorov-Smirnov (K-S) statistic, the hypothesis is rejected.
The Kolmogorov-Smirnov test statistic is included in the appendix.

This test can be conducted by calculating the quantities $P_X(x)$ and $S_n(x)$ at each observed
point, or by plotting the data as in the probability plots of chapter 7 (for example, figure 7.3) and
selecting the greatest deviation on the probability scale of a point from the theoretical line. If the
latter approach is used, care must be taken to select the largest deviation on the probability scale,
which is not necessarily linear. The
largest deviation of the empirical distribution from the known distribution is sought. The
empirical distribution gives the Prob$(X \le x)$ and is thus a step function with steps at each data point.
Rather than falling on a data point, the largest deviation may be at a point where the probability
takes a step change. Example 8.8 and figure 8.8 illustrate the determination of this maximum
deviation, which in this case occurs just prior to $X = 18$ and has a value of 0.29 on the
probability scale. In this case, the known distribution is an exponential distribution. There are eight data
points. The critical value for the K-S statistic with $\alpha = 0.10$ is 0.411. Thus, the hypothesis that
the data are from this particular distribution cannot be rejected.

Fig. 8.8. Graphical determination of critical K-S value. (Horizontal axis: probability, 0 to 1.)

When few observations are available, it is very difficult to use a statistical test to find an
appropriate distribution for the data. In figure 8.8 only 8 observations are available. Obviously,
the chi-square test cannot be used because adequate data for grouping are not available. As
already seen, the K-S test is insensitive for a sample of this size since it requires a large deviation
to reject the hypothesis with this small sample. If the K-S test is used to test the hypothesis
that these data came from a uniform distribution or a normal distribution, these hypotheses could
not be rejected either. With small samples, the power of the K-S test is not very great and the
probability of failing to reject a false hypothesis, a Type II error, is great.

Example 8.8. Consider the data values 18, 29, 45, 56, 50, 40, 20, 10. Test the hypothesis that the
data are from an exponential distribution with known mean of 33.5.

Rank    Ranked data    $S_n(x_i)$    $P_X(x_i)$    $|S_n(x_i) - P_X(x_i)|$    $|S_n(x_{i-1}) - P_X(x_i)|$

The critical value is 0.411 for $n = 8$ and $\alpha = 0.10$. The hypothesis cannot be rejected.
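
The computation behind example 8.8 can be sketched in Python as follows. The function name `ks_statistic` is illustrative. At each ranked observation the theoretical probability is compared with the empirical step function both at the top of the step, $S_n(x_i)$, and just before it, $S_n(x_{i-1})$, which is where the maximum occurs in this example.

```python
import math

def ks_statistic(data, cdf):
    """Maximum deviation between the empirical step function and a specified CDF."""
    ranked = sorted(data)
    n = len(ranked)
    d_max = 0.0
    for i, x in enumerate(ranked, start=1):
        p = cdf(x)
        # Compare with the step value at x and with the value just before the step
        d_max = max(d_max, abs(i / n - p), abs((i - 1) / n - p))
    return d_max

data = [18, 29, 45, 56, 50, 40, 20, 10]
known_mean = 33.5

def exponential_cdf(x):
    """Exponential CDF with the known mean of example 8.8."""
    return 1.0 - math.exp(-x / known_mean)

d = ks_statistic(data, exponential_cdf)

ks_critical = 0.411  # tabled K-S value for n = 8 and alpha = 0.10, as quoted in the text
print(f"D = {d:.3f}, reject: {d >= ks_critical}")
```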

Note that for the Kolmogorov-Smirnov test, $P_X(x)$ is a completely specified cumulative
probability distribution. That is, no parameters for the distribution are to be estimated from the
observed data. Crutcher (1975) points out that when parameters must be estimated to specify
$P_X(x)$, the Kolmogorov-Smirnov test is conservative with respect to the Type I error. That is, if
the critical value is exceeded by the test statistic obtained from the observed values, the
hypothesis is rejected with considerable confidence. Crutcher (1975) presents a table of critical values
for sample sizes of 25 and 30 as well as infinitely large samples for the exponential, gamma,
normal, and extreme value distributions when parameters of these distributions must be estimated.
In general, these critical values are smaller than the values given in the Kolmogorov-Smirnov
table in the appendix.
Conover (1980) discusses Lilliefors's extension of the K-S test to the normal distribution
with mean and variance estimated from the data (Lilliefors, 1967) and the exponential
distribution with mean estimated from the data (Lilliefors, 1969). The tests are conducted as with the
K-S except that the critical values are smaller. Conover (1980) presents tables for the required
critical values. Based on data in Conover, letting $KS$ represent the critical value of the
Kolmogorov-Smirnov statistic and $L$ represent the critical value for the Lilliefors test, the
approximation $L = a + b\,KS$ can be used, where $a$ and $b$ are given in the following table for 4 to
30 observations. For $n$ greater than 30 the approximation $L = c/\sqrt{n}$ from Conover (1980) yields
reasonable estimates for the critical values.

Distribution    $\alpha$     $a$       $b$       $c$
Normal          0.10    0.021    0.586    0.805
Normal          0.05    0.027    0.565    0.886
Normal          0.01    0.040    0.528    1.031
Exponential     0.10    0.003    0.780    0.977
Exponential     0.05    0.009    0.767    1.075
Exponential     0.01    0.016    0.744    1.274

Example 8.9. Repeat example 8.8 assuming the mean is unknown.

Solution: The calculated mean is 33.5, so the observed maximum deviation is 0.29, as before. If
the calculated mean had been other than 33.5 the values for $P_X$ would change. The critical $D$
based on the Lilliefors test using the exponential distribution and the approximations above is

$$L = 0.003 + 0.780\,KS$$

with $\alpha = 0.10$. The tabled value for $KS$ is 0.411. Therefore $L$ is found to be 0.003 +
0.780(0.411) or 0.324. The hypothesis cannot be rejected.
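
The approximations for the Lilliefors critical value can be collected in a small helper. This is a sketch: the function name `lilliefors_critical` and the dictionary layout are illustrative, while the coefficients are those tabulated above. The tabled K-S critical value must still be supplied by the user for $n \le 30$.

```python
import math

# (distribution, alpha) -> (a, b, c), copied from the table of coefficients above
COEFFICIENTS = {
    ("normal", 0.10): (0.021, 0.586, 0.805),
    ("normal", 0.05): (0.027, 0.565, 0.886),
    ("normal", 0.01): (0.040, 0.528, 1.031),
    ("exponential", 0.10): (0.003, 0.780, 0.977),
    ("exponential", 0.05): (0.009, 0.767, 1.075),
    ("exponential", 0.01): (0.016, 0.744, 1.274),
}

def lilliefors_critical(distribution, alpha, n, ks_critical=None):
    """Approximate Lilliefors critical value: a + b*KS for n <= 30, c/sqrt(n) otherwise."""
    a, b, c = COEFFICIENTS[(distribution, alpha)]
    if n > 30:
        return c / math.sqrt(n)
    if ks_critical is None:
        raise ValueError("the tabled K-S critical value is required for n <= 30")
    return a + b * ks_critical

# Example 8.9: exponential distribution, n = 8, alpha = 0.10, tabled KS = 0.411
small_sample_value = lilliefors_critical("exponential", 0.10, 8, ks_critical=0.411)

# Large-sample case: normal distribution, n = 99, alpha = 0.10
large_sample_value = lilliefors_critical("normal", 0.10, 99)

print(f"{small_sample_value:.3f}, {large_sample_value:.3f}")
```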
Example 8.10. Test the hypothesis that the Kentucky River peak flow data are normally
distributed. Use the Kolmogorov-Smirnov test.

Solution: The data are plotted in figure 8.9. The maximum deviation between the best fitting
line, $P_X(x)$, and the plotted points, $S_n(x)$, on the probability scale is about 0.074 at $X$ = 55,200
cfs (table 8.5). Because the test for normality is being done and the mean and variance are
estimated from the data, Lilliefors's approach is used. For $\alpha = 0.10$ and $n = 99$, the critical value
for the Lilliefors statistic is $0.805/\sqrt{99}$ or 0.081. Table 8.5 shows the calculations needed to find
the maximum deviation. The maximum deviation is the maximum value in the columns under
Sx - Px and S(x - 1) - Px. Because the largest value is less than the critical value, the
hypothesis of a normal distribution cannot be rejected.

Fig. 8.9a. Normal probability plot of Kentucky River data on annual flow.

Fig. 8.9b. Lognormal probability plot of Kentucky River data on annual flow.

Table 8.5. continued

Rank    Data      Sx      Px    Sx - Px    S(x - 1) - Px
        87100     0.82
        87200     0.83
        88900     0.84
        89400     0.85
        91500     0.86
        92500     0.87
        93700     0.88
        94300     0.89
        96100     0.90
        98400     0.91
        99100     0.92
        101000    0.93
        105000    0.94
        107000    0.95
        111000    0.96
        112000    0.97
        115000    0.98
        144000    0.99
Max dev

Fig. 8.10. Critical correlation values for normality test.

Other tests for normality include the Shapiro-Wilk test (Conover 1980) and a test based
on the correlation between the standardized Z-value associated with the plotting position and the
values of the observations (Helsel and Hirsch, 1992). The critical correlation values are shown in
figure 8.10. The test is credited to Looney and Gulledge (1985). For this test the Weibull plotting
position should not be used.
Table 8.6 shows the Kentucky River data, the plotting positions calculated from the
Cunnane (1978) relationship and the Z-values associated with the plotting positions. The
correlation between the flow and the Z-values is 0.989. Figure 8.10 shows that for 99 observations and
$\alpha = 0.05$, the critical correlation value is about 0.98. Thus, the hypothesis of normality cannot
be rejected. It is interesting to note that the correlation between the logarithms of the values and
the Z-values is 0.993, indicating that a hypothesis of lognormality cannot be rejected either.
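
A minimal sketch of the correlation test follows. It assumes the Cunnane plotting position takes the commonly quoted form $(i - 0.4)/(n + 0.2)$, which is not written out in this section; the function names and the ten-value sample are hypothetical. The resulting correlation would be compared with a critical value such as those in figure 8.10.

```python
import math
from statistics import NormalDist

def plotting_position_z(n, a=0.4):
    """Standard normal Z-values for plotting positions (i - a)/(n + 1 - 2a)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]

def pearson_r(u, v):
    """Ordinary product-moment correlation coefficient."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    return s_uv / math.sqrt(s_uu * s_vv)

def normal_ppcc(data, a=0.4):
    """Correlation between the ranked data and the Z-values of their plotting positions."""
    ranked = sorted(data)
    return pearson_r(ranked, plotting_position_z(len(ranked), a))

# Hypothetical sample, not the Kentucky River data
sample = [41.2, 55.0, 48.7, 62.3, 51.9, 45.5, 58.1, 50.2, 53.4, 47.0]
r = normal_ppcc(sample)
print(f"correlation = {r:.3f}")
```

Applying `normal_ppcc` to the logarithms of the data gives the corresponding check for lognormality.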

Comparing Two Empirical Distributions


Occasionally, it is desired to determine if the distributions of two independent random
samples are the same. A two-sided Kolmogorov-Smirnov test can be used to assist in this
determination. Two independent samples of size $m$ and $n$ are available. $P_1(x)$ and $P_2(y)$ represent the two
unknown distributions. The hypothesis that the two distributions are the same is tested by comparing
the empirical distributions $S_1$ and $S_2$ and finding the maximum value of $|S_1(x) - S_2(x)|$ over all
values of $x$. Let this maximum difference be $T$. The hypothesis is rejected if $T$ exceeds a critical
value. Critical values of $T$ are given in Conover (1980), Beyer (1968), and other statistical handbooks.

Table 8.6. Kentucky River data

Example 8.11. The following data are from two independent samples. Test the hypothesis that the
underlying distribution is the same for both samples.
X = 2.25, 2.63, 3.09, 3.47, 3.76, 4.01, 4.14, 5.51, 6.10, 6.33
Y = 5.37, 5.60, 6.33, 8.90

The maximum deviation is 0.70. From Conover (1980), with $\alpha = 0.10$, the critical test
value is 13/20, or 0.65. Thus, one can reject the hypothesis that the two samples are from the
same distribution. Conover (1980) can be consulted for more details on this test and for a
companion one-sided test.
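
The maximum deviation in example 8.11 can be reproduced with a short Python sketch. The function name `two_sample_ks` is illustrative; the critical value 13/20 is the tabled value quoted from Conover (1980) in the example.

```python
def two_sample_ks(x, y):
    """Maximum absolute difference between two empirical distribution functions."""
    nx, ny = len(x), len(y)
    t_max = 0.0
    # The difference can only change at an observed value from either sample
    for point in sorted(set(x) | set(y)):
        s1 = sum(v <= point for v in x) / nx
        s2 = sum(v <= point for v in y) / ny
        t_max = max(t_max, abs(s1 - s2))
    return t_max

x = [2.25, 2.63, 3.09, 3.47, 3.76, 4.01, 4.14, 5.51, 6.10, 6.33]
y = [5.37, 5.60, 6.33, 8.90]

t_stat = two_sample_ks(x, y)

critical = 13 / 20  # tabled value for alpha = 0.10, as quoted in the example
print(f"T = {t_stat:.2f}, reject: {t_stat > critical}")
```

The maximum occurs at 4.14, where seven of the ten X values but none of the Y values have been reached.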

General Comments on Goodness of Fit Tests

Many hydrologists discourage the use of the chi-square and Kolmogorov-Smirnov tests when
testing hydrologic frequency distributions. The reason for this is the importance of the tails of
hydrologic frequency distributions and the insensitivity of these statistical tests in the tails of the
distributions. For the example above with 99 observations and $\alpha = 0.10$, the critical value of the
Kolmogorov-Smirnov statistic is about 0.12. It is nearly impossible to get a deviation of
this magnitude in the tails of distributions when the procedures outlined in this chapter are followed.
The sensitivity of the chi-square test can be improved in the tails of the distribution if classes are not
combined to get an expected frequency of 3 to 5 as recommended earlier. The disadvantage of this
is that a single observation in a class with a low expectation can result in a $\chi^2$ value in excess of the
critical value. This single observation can lead to rejecting the hypothesis. Unfortunately, no
satisfactory alternate tests are presently available for making goodness of fit tests.
Neither the chi-square test nor the Kolmogorov-Smirnov test is very powerful, in the sense
that the probability of accepting the hypothesis when it is in fact false is very high when these
tests are used. This is especially true for small samples. These criticisms of the goodness of fit
tests can be illustrated in the exercises dealing with simulation, as shown in chapter 13.

Exercises

8.1. A sample of 20 random observations produced a mean of 145 and a variance of 30. What
are the 95% confidence intervals on the mean assuming a normal distribution if (a) the true
variance is estimated by 30; (b) the true variance is 30? Discuss the reason you feel that the
confidence intervals computed for part (a) are wider than for part (b).

8.2. What are the 95% confidence intervals on the variance for the samples of exercise 8.1?

8.3. Test the hypothesis that the true mean of the data producing the sample whose properties are
given in exercise 8.1 is 165.

8.4. Discuss any connection between hypothesis testing and confidence intervals that you can
discern. What are the differences?
8.5. Assuming the data are normally distributed, test the hypothesis that the mean peak discharge
on the Kentucky River near Salvisa (table 2.1) for the period 1895-1916 is different than it is for
the period 1939-1960.

8.6. Repeat exercise 8.5, except test for equality of variances.

8.7. Using the data of table 2.1, test the hypothesis that the variances of the peak discharges are
the same for the three periods 1895-1916, 1917-1938, and 1939-1960.

8.8. Test the hypothesis that the mean monthly rainfall for September and October are the same
on the Walnut Gulch watershed near Tombstone, Arizona. What assumptions did you make? Are
these assumptions reasonable?

8.9. Repeat exercise 8.8 for equality of variances.

8.10. Test the hypothesis that the difference in the mean monthly rainfall on Walnut Gulch near
Tombstone, Arizona, for September and October is 0.50 inches. Discuss the validity of the
assumptions that are made.

8.11. Test the hypothesis that monthly rainfall in October on the Walnut Gulch watershed near
Tombstone, Arizona, is normally distributed.

8.12. Test the hypothesis that annual rainfall on the Walnut Gulch watershed near Tombstone,
Arizona, is normally distributed.

8.13. Comment on the results of exercises 8.11 and 8.12 in terms of the Central Limit Theorem.

8.14. Would the plotting position relationship used in exercise 7.6 have any effect on the results
of a test for normality on the data set you selected?

8.15. Use the Kolmogorov-Smirnov test to answer exercise 7.10.

8.16. Use the Kolmogorov-Smirnov test to test for normality the three sets of data plotted in
exercise 7.11.

8.17. Use the Kolmogorov-Smirnov test to test for lognormality of the three sets of data plotted in
exercise 7.12.

8.18. Work exercise 8.16 using the chi-square test.

8.19. Work exercise 8.17 using the chi-square test.

8.20. What distribution do you think would fit the data of exercise 2.2? Use the chi-square test
to evaluate your assertion.

8.21. The following are experimentally determined values of Manning's n for plastic pipe as
determined by Haan (1965). Test the hypothesis that the mean value of n is different from the
recommended design value of 0.0090.
9. Simple Linear Regression

NOTATION

In this chapter an upper case letter will represent a variable, a lower case letter will represent
the difference between a variable and its mean, and a subscript will be used to denote a particular
value for the variable. Thus $Y$ represents a variable which may take on values $Y_1$, $Y_2$, $Y_3$, and so on.
$\bar{Y}$ is the mean of $Y$; $y = Y - \bar{Y}$ and $y_i = Y_i - \bar{Y}$. Parameters are denoted by Greek letters and a
corresponding English letter is used to denote an estimate for the parameter. Thus $\alpha$ is a parameter
estimated by $a$ ($\hat{\alpha} = a$). The lower case letter $e$ will be used to denote the difference between an
observed value of $Y$ and its predicted value $\hat{Y}$. Thus $Y - \hat{Y} = e$ and $Y_i - \hat{Y}_i = e_i$. All summations
in this chapter will run from 1 to $n$ unless otherwise specified, where $n$ is the number of observations
on $Y$ and $X$.

SIMPLE REGRESSION
Possibly the most common model used in hydrology is based on the assumption of a linear
relationship between two variables. Generally, the objective of such a model is to provide a
means of predicting or estimating one variable, the dependent variable, from knowledge of a
second variable, the independent variable. The statistical procedure used for determining a linear
relationship between two variables is known as regression. Often the term regression is reserved
for use when all of the $X$ variables being considered are random variables. In this book liberties
will be taken and the term applied whether or not the $X$ variables are random variables. As used
in this chapter, dependent and independent are not the same as dependence or independence of
random variables. Here, dependent means that the variable may be expressed as a (linear)
function of a second variable known as the independent variable. Obviously, if the variables are
strictly independent in a statistical sense, one variable would give no information about the other.
Figure 9.1 shows a situation where it may be desirable to find a linear relationship between the
annual runoff, $Y$, and the annual precipitation, $X$, for Cave Creek near Lexington, Kentucky. The
annual runoff is the dependent variable and the rainfall the independent variable. The data used in
constructing figure 9.1 are contained in table 9.1.

Fig. 9.1. Annual rainfall-runoff relation for Cave Creek. (Horizontal axis: annual precipitation,
in.; plotted: data, mean, the line $\hat{Y} = a + bX$, 95% CI on regression line, 95% CI on individual
predicted value.)

Table 9.1. Annual precipitation and runoff for Cave Creek, near Lexington, Kentucky

Year    Precip. (inches)    Runoff (inches)    Year    Precip. (inches)    Runoff (inches)

Two questions are of immediate concern. Can a model of the form

$$Y = \alpha + \beta X + \varepsilon \tag{9.1}$$

adequately represent the relationship between $Y$ and $X$? For what values of $\alpha$ and $\beta$ is the
representation the best? Here $\varepsilon$ is the difference between $Y$ and $\alpha + \beta X$.
In looking at the question of the "best" straight line, a criterion for judging "bestness" is
needed. One intuitive criterion would be to estimate $\alpha$ and $\beta$ by $a$ and $b$ so as to minimize the
deviation $e_i$ between the observed values of $Y$, $Y_i$, and the predicted values of $Y$, $\hat{Y}_i$. In this way,
values for $a$ and $b$ would be sought that minimize the sum

$$\sum e_i = \sum (Y_i - \hat{Y}_i) \tag{9.2}$$

Closer scrutiny of equation 9.2 reveals that it is not desirable to minimize the sum in an algebraic
sense because that would be equivalent to finding an $a$ and $b$ such that $\sum e_i$ is $-\infty$.
Another criterion might be to find an $a$ and $b$ such that $\sum e_i$ is zero. The fallacy with this can
be seen by considering two points. If the line $\hat{Y} = a + bX$ goes through the two points, then $\sum e_i$
would be zero; however, the sum is also zero for any line that over-predicts one point by the same
amount that it under-predicts the second point. Thus, there is an infinity of lines such that
$\sum e_i = 0$, and an additional restriction or criterion is needed to select a single line.
The $\sum e_i$ may be positive or negative. A criterion that is not sign dependent is needed. Such
a criterion might be to minimize $\sum |e_i|$ or to minimize $\sum e_i^2$. Since absolute values are difficult to
work with mathematically, the second criterion is generally selected. Thus it is desired to
estimate $\alpha$ and $\beta$ by $a$ and $b$ such that $\sum e_i^2$ is a minimum. Denoting this sum by $M$, we have

$$M = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - a - bX_i)^2 \tag{9.3}$$

This sum can be minimized with respect to $a$ and $b$ by taking the partial derivatives of $M$
with respect to $a$ and $b$ and setting the resulting equations equal to zero.

$$\frac{\partial M}{\partial a} = -2\sum (Y_i - a - bX_i) = 0 \qquad \frac{\partial M}{\partial b} = -2\sum X_i (Y_i - a - bX_i) = 0 \tag{9.4}$$

These equations can then be written in the following form, known as the normal equations.

$$\sum Y_i = na + b\sum X_i \qquad \sum X_i Y_i = a\sum X_i + b\sum X_i^2 \tag{9.5}$$

The solution of the normal equations in terms of $a$ and $b$ is

$$b = \frac{\sum X_i Y_i - \sum X_i \sum Y_i / n}{\sum X_i^2 - \left(\sum X_i\right)^2 / n} = \frac{\sum x_i y_i}{\sum x_i^2} \tag{9.6}$$

$$a = \bar{Y} - b\bar{X} \tag{9.7}$$

Equations 9.6 and 9.7 provide estimates for $a$ and $b$ such that $\sum e_i^2$ is a minimum. Because
the procedure is based on minimizing the error sum of squares, $\sum e_i^2$, the estimates $a$ and $b$ are
commonly called least squares estimates. Equation 9.4 indicates that this solution also satisfies
$\sum e_i = 0$. Equation 9.7 indicates that the line $\hat{Y} = a + bX$ goes through the point $\hat{Y} = \bar{Y}$ and $X = \bar{X}$.
The line $\hat{Y} = a + bX$ is commonly known as the regression line of $Y$ on $X$. The procedure
of determining $a$ and $b$ is known as simple regression. The term "simple" regression is used when
only one independent variable is involved, as opposed to multiple regression when several
independent variables are involved. The parameter estimates, $a$ and $b$, are known as the regression
coefficients.
Equations 9.6 and 9.7 show that $a$ and $b$ are functions of the sample values of $Y$ and $X$. If
another sample of observations were obtained and $a$ and $b$ were estimated from this sample,
different estimates would result. We have already seen that

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

Similarly

$$Y_i = a + bX_i + e_i$$

Thus, $e_i$ represents the deviation between an observed $Y_i$ and its predicted value $\hat{Y}_i$ based on the
regression equation estimated from the particular sample of data at hand. $\varepsilon_i$ represents the
deviation between an observed $Y_i$ and the assumed true but unknown relation between $Y$ and $X$ given
by $Y = \alpha + \beta X$.

Example 9.1. Determine the regression coefficients for the data plotted in figure 9.1.

Solution: The data required for solving equations 9.6 and 9.7 are contained in table 9.2. The
equation used to calculate b would depend on the method of calculation. If a small desk calcula-
tor is used, the first of equations 9.6 might be employed. If an electronic calculator or computer
is used, the latter of equations 9.6 might be employed. Generally, less roundoff error will result
if the latter form of equation 9.6 is used. In practice, readily available software would be used.

Therefore $\hat{Y} = -13.1951 + 0.6480X$. This line is plotted in figure 9.1.


Table 9.2. Calculations on data of table 9.1

13.26
3.31
15.17
15.50
14.22
21.20
7.70
17.64
22.91
18.89
12.82
11.58
15.17
10.40
18.02
16.25
Total 234.04
Average 14.63

Comment: The last two columns of table 9.2 contain $\hat{Y}_i$ and $Y_i - \hat{Y}_i$. Note that except for rounding errors, $\bar{Y} = \bar{\hat{Y}}$, $\sum (Y_i - \hat{Y}_i) = \sum e_i = 0$ and $\bar{e} = 0$.
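
The least squares calculation of example 9.1 can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to prepare table 9.2: the function name `fit_simple_regression` and the data values are hypothetical, and the slope is assumed to be the ratio of the sum of cross products to the sum of squares of deviations from the means, with the intercept taken from the condition that the line passes through $(\bar{X}, \bar{Y})$.

```python
def fit_simple_regression(x, y):
    """Return (a, b) for the least squares line Y-hat = a + b*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross products of deviations from the means
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum_xy / sum_xx      # slope
    a = y_bar - b * x_bar    # intercept: line passes through (x_bar, y_bar)
    return a, b


# Hypothetical data, NOT the values of table 9.1
x = [30.0, 35.0, 40.0, 45.0, 50.0, 55.0]
y = [7.0, 9.5, 13.0, 14.5, 20.0, 21.0]

a, b = fit_simple_regression(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(f"a = {a:.4f}, b = {b:.4f}, sum of residuals = {sum(residuals):.2e}")
```

Apart from rounding, the residuals sum to zero and the fitted line reproduces $\bar{Y}$ at $X = \bar{X}$, which are the two properties noted in the comment above.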

EVALUATING THE REGRESSION


The second question is now considered. Can the data be adequately described by the re-
gression line? Naturally, the answer to this query depends on the definition of adequate. The
question will not be answered here but methods for assessing the adequacy of the model will be
explored.
One approach that does not involve any assumptions is to determine how much of the vari-
ability in the dependent variable is explained by the regression. The variability in the dependent
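variable is quantified as a sum of squares. From figure 9.2 it can be seen that $Y_i$ can be expressed as

$$Y_i = \bar{Y} + (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$

or

$$Y_i - \hat{Y}_i = (Y_i - \bar{Y}) - (\hat{Y}_i - \bar{Y})$$

Through algebraic manipulations, it can be shown that

$$\sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \bar{Y})^2 - \sum (\hat{Y}_i - \bar{Y})^2$$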



Fig. 9.2. Components of Y.

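Rearranging terms results in

$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$

However, $\sum (Y_i - \bar{Y})^2$ is equal to $\sum Y_i^2 - n\bar{Y}^2$ so we have

$$\sum Y_i^2 = n\bar{Y}^2 + \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2 \tag{9.10}$$

The total sum of squares, $\sum Y_i^2$, has been partitioned into three components. These three components are:

1. $n\bar{Y}^2$, the sum of squares due to the mean

2. $\sum (Y_i - \hat{Y}_i)^2 = \sum e_i^2$, the sum of squares of deviations from regression or the residual sum of squares

3. $\sum (\hat{Y}_i - \bar{Y})^2$, the sum of squares due to regression

The sum of squares about the mean or the sum of squares corrected for the mean is

$$\sum (Y_i - \bar{Y})^2 = \sum y_i^2 = \sum Y_i^2 - n\bar{Y}^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2 \tag{9.11}$$

which may be written

$$\sum y_i^2 = \sum e_i^2 + b \sum x_i y_i$$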
Therefore, the total sum of squares corrected for the mean is made up of two components: the sum of squares of deviation from regression (also known as the error or residual sum of squares) and the sum of squares due to regression. The larger the sum of squares due to regression in comparison to the residual sum of squares, the more of the total sum of squares corrected for the mean is explained by the regression equation. The ratio of the sum of squares due to regression to the total sum of squares corrected for the mean can be used as a measure of the ability of the regression line to explain variations in the dependent variable. This ratio is commonly denoted by $r^2$ and may be written in a number of ways.

$$r^2 = \frac{\text{sum of squares for regression}}{\text{sum of squares corrected for mean}} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum y_i^2} = \frac{b \sum x_i y_i}{\sum y_i^2} \tag{9.13}$$

$r^2$ is called the coefficient of determination. If the regression equation perfectly predicts every value of $Y_i$, then $e_i$ would be zero for every i and $\sum e_i^2$ would be zero. Under these conditions, equation 9.11 states that $\sum y_i^2 = \sum (\hat{Y}_i - \bar{Y})^2$, so that from equation 9.13 $r^2$ is seen to be one. On the other hand, if the regression equation explains none of the variations in Y, then $\sum e_i^2$ will equal $\sum y_i^2$ and $\sum (\hat{Y}_i - \bar{Y})^2$ will be zero. Under this condition $r^2$ will be zero. Thus, the range in possible values for $r^2$ is from 0 to 1. The closer $r^2$ is to 1, the better the regression equation "fits" the data. $r^2$ is the fraction of the total sum of squares about the mean that is explained by the regression equation.
From equations 9.6 and 9.13 we can write

$$r^2 = \frac{b \sum x_i y_i}{\sum y_i^2} = \frac{b^2 \sum x_i^2}{\sum y_i^2} = \frac{b^2 s_x^2}{s_y^2} \quad \text{or} \quad r = b\,\frac{s_x}{s_y} \tag{9.14}$$

Because $0 \le r^2 \le 1$, we have $-1 \le r \le 1$. The sign on r is identical to the sign on b because $s_x$ and $s_y$ are always positive. From equation 9.14 it can be seen that r may also be written as

$$r = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \sum y_i^2}}$$

which would be equal to the sample correlation coefficient if X and Y were both random variables. In fact, r is commonly called the correlation coefficient and can be shown to be equal to the correlation between Y and $\hat{Y}$. Correlation is discussed in more detail in chapter 11.

Example 9.2. What percent of the variation in Y is accounted for by the regression of example 9.1?

Solution:

Thus, 66% of the variation in Y is explained by the regression equation. The remaining 34% of the variation is due to unexplained causes.
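
The partition of the sum of squares corrected for the mean, and the coefficient of determination that follows from it, can be checked numerically. The sketch below is illustrative only: the function name `sums_of_squares` and the data are hypothetical (they are not the values of table 9.1), so the resulting $r^2$ is not the 66% of example 9.2.

```python
def sums_of_squares(x, y):
    """Return (total corrected for mean, residual, regression) sums of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    y_hat = [a + b * xi for xi in x]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)              # sum of y_i^2
    ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of e_i^2
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)             # due to regression
    return ss_total, ss_resid, ss_reg


# Hypothetical data, NOT the values of table 9.1
x = [30.0, 35.0, 40.0, 45.0, 50.0, 55.0]
y = [7.0, 9.5, 13.0, 14.5, 20.0, 21.0]

ss_total, ss_resid, ss_reg = sums_of_squares(x, y)
r_squared = ss_reg / ss_total
print(f"r^2 = {r_squared:.4f}")
```

The two ways of writing the ratio, regression over total and one minus residual over total, agree to within rounding because of the partition in equation 9.11.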

CONFIDENCE INTERVALS AND TESTS OF HYPOTHESES


Thus far in the discussion of simple regression no assumptions have been made concerning the model. In order to use some well-developed theorems concerning hypothesis testing and confidence interval estimation, it is necessary to make the assumption that the $\epsilon_i$ are identically and independently distributed as a normal distribution with a mean of zero and a variance of $\sigma^2$. (A shorthand way of writing this is $\epsilon_i$ is i.i.d. $N(0, \sigma^2)$.) For further discussion of the assumptions involved in regression analysis, see the closing section of this chapter, General Considerations. Also see Johnston (1963) and Graybill (1961).
This assumption contains many implications. The fact that the $E(\epsilon_i) = 0$ has been guaranteed by our estimation procedures. The assumption of independence means that the correlation between $\epsilon_i$ and $\epsilon_j$ for any $i \ne j$ must be zero. The assumption that the $\epsilon_i$ are identically distributed with variance $\sigma^2$ means that the variance of $\epsilon_i$ must equal the variance of $\epsilon_j$ for all i and j. That is, the variance of $\epsilon_i$ cannot change as $X_i$ changes. This is known as homoscedasticity. Finally we must have the $\epsilon_i$ normally distributed.
The assumption of normality of the $\epsilon_i$ can be checked by the procedures of chapter 8. A rough check would be to note that, for the normal distribution, 95% of the values of $\epsilon_i$ should be within 2 standard deviations of the mean or only about 5% of the residuals should lie outside the interval $-2\sigma$ to $2\sigma$. For a further discussion of examining the $e_i$, reference should be made to Draper and Smith (1966).
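Under the normality assumption, we have $E(\epsilon) = 0$. The Var($\epsilon$) is given by

$$\mathrm{Var}(\epsilon) = E(\epsilon^2) - [E(\epsilon)]^2 = E(\epsilon^2) = \sigma^2$$

The positive square root of the Var($\epsilon$) is known as the standard error of the regression equation. An unbiased estimate (Graybill 1961) for Var($\epsilon$) is $s^2$ calculated from

$$s^2 = \frac{\sum e_i^2}{n - 2} \tag{9.17}$$

The least squares estimation procedure produces estimates for a and b such that the standard error of the regression equation is a minimum.
Another way to look at the coefficient of determination is to write equation 9.13 as

$$r^2 = \frac{\sum y_i^2 - \sum e_i^2}{\sum y_i^2} = 1 - \frac{\sum e_i^2}{\sum y_i^2} = 1 - \frac{(n - 2)s^2}{(n - 1)s_y^2} \tag{9.18}$$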

Fig. 9.3. Variability in linear regression.

Therefore, if the estimated standard error of the regression equation is nearly equal to the standard deviation of Y, $r^2$ will be close to zero and the regression equation is of little value in explaining variation in Y.
Figure 9.3 depicts the relationships among the pdfs of X, Y, and e in a linear regression. What is of interest is the spread or variance in the pdf of e, $s^2$, in comparison to that of Y, $s_y^2$. The smaller is $s^2$ in comparison to $s_y^2$, the greater is $r^2$ and the stronger is the linear relationship between Y and X. This is stated mathematically by equation 9.18.

Example 9.3. Is there reason to believe the residuals of example 9.1 are not normally
distributed?

Solution:

95% of the $e_i$ should be between $-2s$ and $2s$ or between $-5.94$ and $+5.94$. An inspection of table 9.2 shows that none of the 16 observations are outside this interval. The number of observations is not sufficient to determine if the $\epsilon_i$ are $N(0, \sigma^2)$; however, there is not sufficient evidence to reject this possibility.
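
The rough check used in example 9.3 is easy to automate. The sketch below assumes a simple regression (two estimated coefficients, so the divisor is n - 2); the function names and the residual values are hypothetical and are not those of table 9.2.

```python
import math


def standard_error(residuals, n_params=2):
    """Estimated standard error s: sqrt(sum of e_i^2 / (n - n_params))."""
    n = len(residuals)
    return math.sqrt(sum(e * e for e in residuals) / (n - n_params))


def fraction_outside_2s(residuals, n_params=2):
    """Fraction of residuals lying outside the interval -2s to +2s."""
    s = standard_error(residuals, n_params)
    return sum(1 for e in residuals if abs(e) > 2.0 * s) / len(residuals)


# Hypothetical residuals (they sum to zero, as least squares residuals do)
residuals = [1.2, -0.8, 0.3, -2.1, 1.9, -0.5]

s = standard_error(residuals)
print(f"s = {s:.3f}, fraction outside +/-2s = {fraction_outside_2s(residuals):.2f}")
```

As the example notes, a small fraction outside the interval does not establish normality; it only fails to contradict it.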
Inferences on Regression Coefficients
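In order to place confidence intervals on $\alpha$ and $\beta$ and to test hypotheses concerning them, it is necessary to know the Var(a) and Var(b) which will be designated as $\sigma_a^2$ and $\sigma_b^2$ and estimated by $s_a^2$ and $s_b^2$. $s_b^2$ and $s_a^2$ can be estimated from

$$s_b^2 = \frac{s^2}{\sum x_i^2} \tag{9.19}$$

and

$$s_a^2 = s^2\left(\frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2}\right) \tag{9.20}$$

where $s^2$ is estimated from equation 9.17.

If the model is correct, then the quantities $(b - \beta)/s_b$ and $(a - \alpha)/s_a$ are distributed as a t distribution with n - 2 degrees of freedom. Thus the confidence limits on $\alpha$ can be estimated from

$$l = a - t_{1-\alpha/2,\,n-2}\, s_a \qquad u = a + t_{1-\alpha/2,\,n-2}\, s_a$$

where $s_a$ is estimated from equation 9.20.

Similarly, the confidence limits on $\beta$ are estimated from

$$l = b - t_{1-\alpha/2,\,n-2}\, s_b \qquad u = b + t_{1-\alpha/2,\,n-2}\, s_b$$

where $s_b$ is estimated from equation 9.19.

Tests of hypotheses concerning $\alpha$ and $\beta$ can be made by noting that $(a - \alpha_0)/s_a$ and $(b - \beta_0)/s_b$ both have t distributions with n - 2 degrees of freedom. Thus the hypothesis $H_0$: $\alpha = \alpha_0$ versus $H_a$: $\alpha \ne \alpha_0$ is tested by computing

$$t = \frac{a - \alpha_0}{s_a}$$

$H_0$ is rejected if $|t| > t_{1-\alpha/2,\,n-2}$.

Similarly, $H_0$: $\beta = \beta_0$ versus $H_a$: $\beta \ne \beta_0$ is tested by computing

$$t = \frac{b - \beta_0}{s_b}$$

$H_0$ is rejected if $|t| > t_{1-\alpha/2,\,n-2}$.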


The significance of the overall regression equation can be evaluated by testing the hypothesis that $\beta = 0$. The $H_0$: $\beta = 0$ is equivalent to $H_0$: $r = 0$. If this hypothesis is accepted, then Y may be estimated by $\bar{Y}$. Note that if r = 0, equation 9.18 shows that $s^2 \approx s_y^2$, or the regression line does not explain a significant amount of the variation in Y. In this situation one would be as well off using $\bar{Y}$ as an estimator for Y regardless of the value of X.

Example 9.4. Compute the 95% confidence intervals on $\alpha$ and $\beta$ and test the hypothesis that $\alpha = 0$ and the hypothesis that $\beta = 0.500$ for the regression of example 9.1.

Solution:

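$$s_a = s\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2}\right]^{1/2}$$

The 95% confidence intervals on $\alpha$ are

The 95% confidence intervals on $\beta$ are

To test $H_0$: $\alpha = 0$ versus $H_a$: $\alpha \ne 0$, compute

Because $|t| > t_{0.975,14}$, we reject $H_0$: $\alpha = 0$.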

To test the $H_0$: $\beta = 0.5$ versus $H_a$: $\beta \ne 0.5$, compute

Since $|t| < t_{0.975,14}$, we cannot reject $H_0$. The slope is not significantly different from 0.5.
Comment: The significance of the overall regression can be evaluated by testing $H_0$: $\beta = 0$. Under this hypothesis

Because $|t| > t_{0.975,14}$ we reject $H_0$. The regression equation explains a significant amount of the variation in Y.
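
The calculations of example 9.4 can be retraced from the summary values quoted elsewhere in this chapter (s = 2.97, n = 16, $\bar{X}$ = 42.94, $\sum x_i^2$ = 570.0559, a = -13.1951, b = 0.6480 and $t_{0.975,14}$ = 2.145). The sketch below is a hedged illustration: the function names are hypothetical, and because the inputs are rounded, the printed values may differ slightly from those obtained with unrounded sums.

```python
import math


def coefficient_std_errors(s, n, x_bar, sum_x2):
    """Return (s_a, s_b) following equations 9.20 and 9.19."""
    s_b = s / math.sqrt(sum_x2)
    s_a = s * math.sqrt(1.0 / n + x_bar ** 2 / sum_x2)
    return s_a, s_b


def t_statistic(estimate, hypothesized, std_error):
    """t = (estimate - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_error


def confidence_limits(estimate, std_error, t_crit):
    """Lower and upper limits: estimate -/+ t_crit * standard error."""
    return estimate - t_crit * std_error, estimate + t_crit * std_error


# Rounded summary values quoted in examples 9.1 and 9.5
s, n, x_bar, sum_x2 = 2.97, 16, 42.94, 570.0559
a, b, t_crit = -13.1951, 0.6480, 2.145

s_a, s_b = coefficient_std_errors(s, n, x_bar, sum_x2)
t_alpha_0 = t_statistic(a, 0.0, s_a)   # H0: alpha = 0
t_beta_05 = t_statistic(b, 0.5, s_b)   # H0: beta = 0.5
t_beta_0 = t_statistic(b, 0.0, s_b)    # H0: beta = 0

print("limits on alpha:", confidence_limits(a, s_a, t_crit))
print("limits on beta: ", confidence_limits(b, s_b, t_crit))
print(f"t values: {t_alpha_0:.2f}, {t_beta_05:.2f}, {t_beta_0:.2f}")
```

Comparing each |t| with 2.145 leads to the same three decisions stated in the example: reject $\alpha = 0$, do not reject $\beta = 0.5$, and reject $\beta = 0$.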

Confidence Intervals on Regression Line


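Confidence intervals on the regression line can be determined by first calculating the variance of $\hat{Y}_k$ where $\hat{Y}_k$ represents the predicted mean value of $Y_k$ for a given $X_k$.

$$\hat{Y}_k = a + bX_k$$

From equation 3.56

$$\mathrm{Var}(\hat{Y}_k) = \mathrm{Var}(a) + X_k^2\,\mathrm{Var}(b) + 2X_k\,\mathrm{Cov}(a, b)$$

Mood and Graybill (1963) give $\mathrm{Cov}(a, b) = -\sigma^2 \bar{X} / \sum x_i^2$. Therefore

$$\mathrm{Var}(\hat{Y}_k) = \sigma^2\left[\frac{1}{n} + \frac{(X_k - \bar{X})^2}{\sum x_i^2}\right] \tag{9.25}$$

The standard error of $\hat{Y}_k$ could be estimated by $s_{\hat{Y}_k}$ calculated as

$$s_{\hat{Y}_k} = s\left[\frac{1}{n} + \frac{(X_k - \bar{X})^2}{\sum x_i^2}\right]^{1/2} \tag{9.26}$$

Equation 9.25 indicates that the variance of $\hat{Y}_k$ depends on the particular value of X at which the variance is being determined. The Var($\hat{Y}_k$) is a minimum when $X_k = \bar{X}$ and increases as $X_k$ deviates from $\bar{X}$.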
Confidence limits on the regression line are now given by

$$l = \hat{Y}_k - t_{1-\alpha/2,\,n-2}\, s_{\hat{Y}_k} \qquad u = \hat{Y}_k + t_{1-\alpha/2,\,n-2}\, s_{\hat{Y}_k} \tag{9.27}$$

where $\hat{Y}_k = a + bX_k$ and $s_{\hat{Y}_k}$ is given by equation 9.26. Because $s_{\hat{Y}_k}$ increases as $|x_k| = |X_k - \bar{X}|$ increases, the confidence intervals are the narrowest at $X_k = \bar{X}$ and widen as $X_k$ deviates from $\bar{X}$.
The confidence limits on an individual predicted value of Y would be wider than the confidence interval on the regression line since for an individual Y, the Var($\epsilon$) or $\sigma^2$ would have to be added to the Var($\hat{Y}_k$). Thus the variance of an individual predicted value of Y would be $\mathrm{Var}(\hat{Y}_k) + \sigma^2$. Confidence intervals on an individual predicted value of Y could then be estimated from equations 9.27 where the expression

$$s\left[1 + \frac{1}{n} + \frac{(X_k - \bar{X})^2}{\sum x_i^2}\right]^{1/2}$$

would be substituted for $s_{\hat{Y}_k}$. The confidence limits on a future predicted value of Y are the same as those for an individual predicted value of Y.

Example 9.5. Calculate the 95% confidence limits for the regression line of example 9.1.
Calculate the 95% confidence interval for an individual predicted value of Y for the same
problem.

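Solution: s = 2.97, n = 16, $\sum x_i^2 = 570.0559$, $t_{0.975,14} = 2.145$ and $\bar{X} = 42.94$. Therefore, from equations 9.27 we have for the 95% confidence intervals on the regression line

$$l, u = -13.1951 + 0.6480X_k \mp 2.145(2.97)\left[\frac{1}{16} + \frac{(X_k - 42.94)^2}{570.0559}\right]^{1/2}$$

where the - applies to the lower limit, l, and the + to the upper limit, u. Similarly, the 95% confidence interval on an individual predicted value of Y is given by

$$l, u = -13.1951 + 0.6480X_k \mp 2.145(2.97)\left[1 + \frac{1}{16} + \frac{(X_k - 42.94)^2}{570.0559}\right]^{1/2}$$

By substituting various values of $X_k$ into these equations, the desired confidence limits are obtained. These intervals are plotted in figure 9.1.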
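
Substituting values of $X_k$ into the two interval expressions of example 9.5 is a natural task for a short script. The sketch below uses the summary values quoted in the example; the function names and the chosen values of $X_k$ (30 and 55) are hypothetical and serve only to show how the intervals widen away from $\bar{X}$.

```python
import math


def line_half_width(x_k, s, n, x_bar, sum_x2, t_crit):
    """Half-width of the confidence interval on the regression line at x_k."""
    return t_crit * s * math.sqrt(1.0 / n + (x_k - x_bar) ** 2 / sum_x2)


def prediction_half_width(x_k, s, n, x_bar, sum_x2, t_crit):
    """Half-width of the interval on an individual predicted value at x_k."""
    return t_crit * s * math.sqrt(1.0 + 1.0 / n + (x_k - x_bar) ** 2 / sum_x2)


# Values quoted in examples 9.1 and 9.5
s, n, sum_x2, t_crit, x_bar = 2.97, 16, 570.0559, 2.145, 42.94
a, b = -13.1951, 0.6480

for x_k in (30.0, x_bar, 55.0):  # illustrative X values, chosen arbitrarily
    y_hat = a + b * x_k
    hw_line = line_half_width(x_k, s, n, x_bar, sum_x2, t_crit)
    hw_pred = prediction_half_width(x_k, s, n, x_bar, sum_x2, t_crit)
    print(f"X = {x_k:5.2f}: line {y_hat - hw_line:6.2f} to {y_hat + hw_line:6.2f}, "
          f"individual {y_hat - hw_pred:6.2f} to {y_hat + hw_pred:6.2f}")
```

At $X_k = \bar{X}$ the bracketed term reduces to 1/n, so the interval on the line is at its narrowest there, and the interval on an individual value is wider at every $X_k$.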
Confidence Intervals on Standard Error
Confidence intervals may be placed on $\sigma^2$ by noting that the quantity $(n - 2)s^2/\sigma^2$ is distributed as a chi-square distribution with n - 2 degrees of freedom. Thus, confidence limits on $\sigma^2$ are given by

$$l = \frac{(n - 2)s^2}{\chi^2_{1-\alpha/2,\,n-2}} \qquad u = \frac{(n - 2)s^2}{\chi^2_{\alpha/2,\,n-2}}$$

where $s^2$ is determined from equation 9.17.
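
As an illustration of these limits, the sketch below applies them to the regression of example 9.1 (s = 2.97, n = 16). The function name is hypothetical, and the chi-square quantiles for 14 degrees of freedom (5.629 and 26.119) are taken from standard tables rather than from this chapter, so they are labeled as assumptions in the code.

```python
import math


def variance_confidence_limits(s2, dof, chi2_lower, chi2_upper):
    """Limits on sigma^2 from dof*s^2/chi2, where chi2_lower and chi2_upper are
    the alpha/2 and 1 - alpha/2 quantiles of chi-square with dof degrees of freedom."""
    return dof * s2 / chi2_upper, dof * s2 / chi2_lower


s = 2.97           # standard error quoted in example 9.5
dof = 16 - 2       # n - 2 degrees of freedom
# Assumed table values: chi-square quantiles for 14 degrees of freedom
chi2_025, chi2_975 = 5.629, 26.119

lower, upper = variance_confidence_limits(s ** 2, dof, chi2_025, chi2_975)
print(f"95% limits on sigma^2: {lower:.2f} to {upper:.2f}")
print(f"95% limits on sigma:   {math.sqrt(lower):.2f} to {math.sqrt(upper):.2f}")
```

The interval is not symmetric about $s^2$ because the chi-square distribution is skewed.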

EXTRAPOLATION
The extrapolation of a regression equation beyond the range of X used in estimating $\alpha$ and $\beta$ is discouraged for two reasons. First, as can be seen from figure 9.1 and equation 9.27, the confidence intervals on the regression line become very wide as the distance from $\bar{X}$ is increased. Second, the relation between Y and X may be nonlinear over the entire range of X and only approximately linear for the range of X investigated. A typical example of this is shown in figure 9.4.

GENERAL CONSIDERATIONS
Many authors discuss several different linear models depending on the assumptions made concerning Y, X, and $\epsilon$ (Graybill 1961; Benjamin and Cornell 1970; Mood and Graybill 1963). These different models revolve around whether X (or $\underline{X}$ in multiple regression) is a random or nonrandom variable, whether measurement errors are made on Y and/or X, the distribution of X if X is a random variable, and the joint distribution of Y and X if X is a random variable.

Fig. 9.4. Effect on nonlinearity and extrapolation.


The most common assumptions are:

1. X is a nonrandom variable measured without error, Y is a random variable, and $(Y_i \mid X)$ is normally and independently distributed with mean $\alpha + \beta X$ and variance $\sigma^2$.

2. Y and X are both random variables having a joint distribution, the conditional distribution of Y is $N(\alpha + \beta X, \sigma^2)$, and the marginal distribution of X is independent of $\alpha$, $\beta$ and $\sigma^2$.

It turns out that under either of the above conditions, the procedures given in this chapter are valid for tests of hypotheses and confidence interval estimation at a specified level of significance. Graybill (1961) points out that the power of the tests is not the same for the two conditions.
If X is a fixed variable measured without error and $\epsilon_i$ is independently and identically distributed $N(0, \sigma^2)$; or Y and X are from a bivariate normal distribution and are measured without error; or Y and X are from a bivariate non-normal population with the conditional distribution of Y being $N(\alpha + \beta X, \sigma^2)$ and the marginal distribution of X independent of $\alpha$, $\beta$ and $\sigma^2$; then the least squares estimates of $\alpha$, $\beta$ and $\sigma^2$ are also maximum likelihood estimators. The least squares estimates for the regression coefficients are unbiased.
If significant measurement errors are made on the X variables, then complications arise. For this situation reference can be made to Graybill (1961) or Johnston (1963). Certainly, measurement errors are always present; however, if these errors are small relative to X, then the theory presented in this chapter and chapters 10, 11, and 12 may still be applied.
The reason that measurement errors on X cause problems can be seen by considering the model $Y = \alpha + \beta X + \epsilon$. If Y and X contain measurement errors, then Y and X are not observed. What is observed is $Y^*$ and $X^*$, where

$$Y^* = Y + e_y \qquad X^* = X + e_x$$

and $e_y$ and $e_x$ are the measurement errors on Y and X. Thus, the normal equations are solved in terms of $Y^* = \alpha + \beta X^* + \epsilon$, or $Y + e_y = \alpha + \beta(X + e_x) + \epsilon = \alpha + \beta X + \beta e_x + \epsilon$. Now if $e_x$ is small in comparison to X, this latter equation becomes $Y = \alpha + \beta X + \epsilon - e_y$, or $Y = \alpha + \beta X + \epsilon'$ with $\epsilon' = \epsilon - e_y$, which can be handled by the methods outlined in this chapter.
Recall that no distributional assumptions are required to get the least squares estimates for $\alpha$ and $\beta$. The assumptions are involved when confidence intervals and tests of hypotheses are of concern, or when it is desired to state that the least squares estimates for $\alpha$ and $\beta$ are also maximum likelihood estimates. Johnston (1963) points out that the least squares estimates for $\alpha$ and $\beta$ are biased if significant measurement errors are present on X.
One of the assumptions used in developing confidence intervals and tests of hypotheses was that the $\epsilon_i$ are independent. If $\epsilon_i$ is correlated with $\epsilon_{i+1}$, the least squares estimates of $\alpha$ and $\beta$ are unbiased; however, the sampling variance of a and b will be unduly large and will be underestimated by the least squares formulas for variances, rendering the level of significance of tests of hypotheses unknown. Also, the sampling variances on predictions made with the resulting equation will be needlessly large. Correlation between $\epsilon_i$ and $\epsilon_{i+1}$ frequently arises when time series data are being analyzed. This type of correlation is known as autocorrelation or serial correlation.

Fig. 9.5. Illustration of situation where Var($\epsilon_i$) $\ne \sigma^2$ for all i.

Johnston (1963) discusses least squares estimation procedures in the presence of autocorrelation.
Autocorrelation of errors is discussed in more detail in the next chapter of this book.
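
A simple first look at serial correlation is to compute the correlation between successive residuals when the observations are in time order. The sketch below uses one common estimator of the lag-one serial correlation coefficient; this particular formula, the function name, and both residual series are assumptions made for illustration and are not taken from this chapter.

```python
def lag1_serial_correlation(residuals):
    """Lag-one serial correlation of residuals listed in time order."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    numerator = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    denominator = sum(d * d for d in dev)
    return numerator / denominator


# Hypothetical residual series
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # signs flip every step
persistent = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]    # runs of like signs

print(f"alternating: {lag1_serial_correlation(alternating):.3f}")
print(f"persistent:  {lag1_serial_correlation(persistent):.3f}")
```

Values near zero are consistent with independent errors, while values well away from zero suggest that the variance formulas of this chapter should be used with caution.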
In some situations the assumption of homoscedasticity [Var($\epsilon_i$) = $\sigma^2$ for all i] is violated. Quite commonly, Var($\epsilon_i$) increases as X increases. Such a situation is depicted in figure 9.5. Draper and Smith (1966) and Johnston (1963) discuss least squares estimation under this condition.
Another point to be made concerning hypothesis testing in general is that a statistically significant difference and a physically significant difference are two entirely different quantities. For example, when the $H_0$: $\beta = 0$ was tested in example 9.4, the conclusion was that the regression line explained a significant amount of the variation in Y. This refers to a statistically significant amount of the variation at the chosen level of significance. It means that, recognizing an $\alpha$% chance of an error, the relationship $\hat{Y} = a + bX$ cannot be attributed to chance. It does not imply a cause and effect relationship between Y and X.
Looking at the confidence limits on the regression as plotted in figure 9.1 and the scatter of the data, it can be seen that this simple relationship $\hat{Y} = a + bX$ leaves a lot to be desired in terms of predicting annual runoff. Whether or not the derived relationship is usable depends on the use to be made of the predicted values of Y and not on the fact that the $H_0$: $\beta = 0$ is rejected. It may be that the standard error of the equation, s, is so large as to render the estimate made with the equation in some particular application too uncertain to be used even though the equation is explaining a statistically significant portion of the variability in the dependent variable.

Exercises

9.1. The following data are the maximum air and soil temperatures (bare soil at 2-inch depth)
recorded for the first 30 days of July 1973, at Lexington, Kentucky. Derive a linear relationship
via simple regression for predicting the maximum soil temperature from the maximum air
Max Temp
Air Soil Air Soil Air Soil

temperature. Estimate a and 3 for the resulting regression. Test the hypothesis that (a) the inter-
cept is 0, (b) the slope is 1, (c) the regression explains a significant amount of the variation in the
maximum soil temperature. Would you recommend using this relationship for predicting maxi-
mum soil temperature?

9.2. The asterisks following the soil data in exercise 9.1 indicate days on which rainfall occurred.
Using only these rainfall days, work exercise 9.1.

9.3. Calculate the regression coefficients in the relationship $Q_s = a + bQ$ where $Q_s$ is the annual
suspended sediment load and Q is the annual water discharge for the Green River at Munfordville,
Kentucky. Calculate the standard error of the regression equation and the correlation coefficient.
Plot the data along with the 95% confidence intervals on the regression line. Is this a usable
prediction equation?

9.4. Show that the correlation coefficient in simple regression is equivalent to the correlation between Y and $\hat{Y}$.

9.5. Calculate the regression equation for the data of table 9.1 considering the runoff as the in-
dependent variable and the precipitation as the dependent variable. Rearrange the resulting
equation to be in the form of the prediction equation of example 9.1. Does the resulting
regression equation agree with the regression equation in example 9.1? Should it agree? Why?
Which equation should be used?

9.6. A technique used by hydrologists to detect changes in the hydrologic response of a watershed
is to examine mass curves for changes in slope. A mass curve is a plot of the accumulation over
time of one variable versus the accumulation over time of a second variable. The data below are
the annual runoff and precipitation for Thorne Creek experimental watershed in Pulaski County,
Virginia. It is thought that there was a change in the hydrologic characteristics of this watershed
during the 11-year period of study. Plot the accumulated precipitation as the abscissa and the
accumulated runoff as the ordinate. Does there appear to be a change in the rainfall-runoff
relationship? During what year? Calculate the slope of the regression lines describing the data
both before and after the apparent change. Test the hypothesis that these slopes are not signifi-
cantly different.

Year Precipitation Runoff Year Precipitation Runoff

9.7. Occasionally it is desirable to restrict the intercept of a simple regression to 0, thus requir-
ing the regression line to pass through the origin. Derive the normal equation for the slope in this
case. Use the resulting equation to calculate the slope of the line describing the data plotted for
exercise 9.6. Neglect the apparent change in the slope for this problem (i.e., use all of the data to
estimate b in the equation accumulated runoff = b [accumulated precipitation]).

9.8. Hydrologists frequently use watershed physical characteristics as an aid in studying watershed hydrology. The data below are the area (square miles) and length (miles) of several Colorado mountain watersheds (Julian et al. 1967). Derive a linear regression equation for predicting the area of similar watersheds as a function of the watershed length. Plot the data and the derived regression line. Plot the 95% confidence intervals on the regression line.

Area Length Area Length


10. Multiple Linear Regression
NOTATION
THE NOTATION set forth in chapter 9 will be followed in this chapter unless otherwise noted. Additionally, vectors and matrices will be denoted by underlined letters such as $\underline{Y}$, $\underline{X}$, or $\underline{b}$. The inverse of a matrix $\underline{X}$ will be denoted by $\underline{X}^{-1}$. The transpose of $\underline{X}$ will be denoted by $\underline{X}'$. The number of rows and columns in a matrix will be shown as $\underset{n \times p}{\underline{X}}$ if $\underline{X}$ has n rows and p columns. Thus $\underset{n \times 1}{\underline{Y}}$ represents a column vector with n elements. The element of $\underline{X}$ corresponding to the ith row and the jth column will be denoted by $X_{i,j}$. The expression $\underline{X} = [X_{i,j}]$ indicates that $\underline{X}$ is made up of elements $X_{i,j}$. A matrix made up of elements which are deviations from a mean will be denoted by a lower case, underlined letter $\underline{y}$. The i, jth element of $\underline{y}$ will be given as $y_{i,j}$. The ith element of a vector $\underset{n \times 1}{\underline{Y}}$ will be given by $Y_i$.
The concepts of chapter 9 must be understood before proceeding to this chapter. Calcula-
tions would normally be done on a computer for problems dealing with multiple regression.
Standard programs are available, so the emphasis in this chapter is not on computing but on the
principles involved in multiple regression.

GENERAL LINEAR MODEL


Quite often, a dependent variable may be expressed as a linear combination of several other
quantities. For example, the peak rate of runoff from watersheds in a given region may be related
to the watershed area, slope of the mainstream, rainfall, and so on. A linear regression model for predicting peak runoff would then contain all of these variables. This is an extension of the linear model discussed in chapter 9 to include several independent variables.
A general linear model is of the form

$$Y = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \tag{10.1}$$

where Y is a dependent variable, $X_1, X_2, \ldots, X_p$ are independent variables, $\beta_1, \beta_2, \ldots, \beta_p$ are unknown parameters, and $\epsilon$ is an error component. This model is linear in the parameters, $\beta_j$.

is also linear in the parameters, $\beta_j$, whereas the models

and

are not linear in the parameters.


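In practice, n observations would be available on Y with the corresponding n observations on each of the p independent variables. Thus, n equations like equation 10.1 can be written, one for each observation. The p unknown parameters are estimated from the n equations. Thus, n must be equal to or greater than p. In practice, n should be at least 3 or 4 times as large as p. The n equations are

$$\begin{aligned}
Y_1 &= \beta_1 X_{1,1} + \beta_2 X_{1,2} + \cdots + \beta_p X_{1,p} + \epsilon_1 \\
Y_2 &= \beta_1 X_{2,1} + \beta_2 X_{2,2} + \cdots + \beta_p X_{2,p} + \epsilon_2 \\
&\;\;\vdots \\
Y_n &= \beta_1 X_{n,1} + \beta_2 X_{n,2} + \cdots + \beta_p X_{n,p} + \epsilon_n
\end{aligned} \tag{10.2}$$

where $Y_i$ is the ith observation on Y and $X_{i,j}$ is the ith observation on the jth independent variable. Equations 10.2 can be written

$$Y_i = \sum_{j=1}^{p} \beta_j X_{i,j} + \epsilon_i$$

for i = 1 to n. In matrix notation the equations become

$$\underline{Y} = \underline{X}\,\underline{\beta} + \underline{\epsilon} \tag{10.4}$$

where $\underline{Y}$ is an n × 1 vector of observations, $\underline{X}$ is an n × p matrix made up of n observations on each of p independent variables, and $\underline{\beta}$ is a p × 1 vector of unknown parameters. If the matrices in equation 10.4 are written out, we get

$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} =
\begin{bmatrix}
X_{1,1} & X_{1,2} & \cdots & X_{1,p} \\
X_{2,1} & X_{2,2} & \cdots & X_{2,p} \\
\vdots & \vdots & & \vdots \\
X_{n,1} & X_{n,2} & \cdots & X_{n,p}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \tag{10.5}$$

When the model is written in the form of equation 10.5, it is easy to see that $\underline{Y}$ is an n × 1 vector of observations on the dependent variable, $\underline{X}$ is an n × p matrix made up of n observations on each of p independent variables, and $\underline{\beta}$ is a p × 1 vector of unknown parameters. For equation 10.4 to have an intercept term, it is necessary that $X_{i,1} = 1$ for all i. $\beta_1$ is then the intercept. In the following development, it is assumed that $X_{i,1} = 1$ for i = 1 to n.
The model discussed in chapter 9,

$$Y_i = \alpha + \beta X_i + \epsilon_i$$

is a special case of equation 10.5 with $X_{i,1} = 1$, $X_{i,2} = X_i$, $\beta_1 = \alpha$ and $\beta_2 = \beta$.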


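Following the pattern of chapter 9, the unknown parameters, $\underline{\beta}$, can be estimated by minimizing $\sum e_i^2$ where

$$e_i = Y_i - \hat{Y}_i = Y_i - \sum_{j=1}^{p} \hat{\beta}_j X_{i,j}$$

In matrix notation

$$\sum e_i^2 = \underline{e}'\underline{e} = (\underline{Y} - \underline{X}\hat{\underline{\beta}})'(\underline{Y} - \underline{X}\hat{\underline{\beta}})$$

Differentiating this expression with respect to $\hat{\underline{\beta}}$ and setting the partial derivative equal to zero results in

$$\underline{X}'\underline{X}\,\hat{\underline{\beta}} = \underline{X}'\underline{Y} \tag{10.7}$$

which represents the normal equations. The solution of equation 10.7 is obtained by premultiplying by $(\underline{X}'\underline{X})^{-1}$

$$(\underline{X}'\underline{X})^{-1}\underline{X}'\underline{X}\,\hat{\underline{\beta}} = (\underline{X}'\underline{X})^{-1}\underline{X}'\underline{Y}$$

or we have the result that $\underline{\beta}$ can be estimated by

$$\hat{\underline{\beta}} = (\underline{X}'\underline{X})^{-1}\underline{X}'\underline{Y} \tag{10.8}$$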

The $\underline{X}'\underline{X}$ matrix plays an important role in estimating $\underline{\beta}$ and in the variance of the $\hat{\beta}_i$'s. The $\underline{X}'\underline{X}$ matrix is made up of the sum of squares and cross products of the independent variables. For the p × p matrix $\underline{X}'\underline{X}$ to be inverted, its rank must be p. That is, no row or column can be a linear function of any combination of the other rows and columns. If this occurs, it is known as multicollinearity.
If we define $z_{i,j}$ to be $(X_{i,j} - \bar{X}_j)/s_j$ and let $\underline{Z} = [z_{i,j}]$, then $\underline{Z}'\underline{Z}/(n - 1)$ is a p × p correlation matrix, $\underline{R} = [R_{i,j}]$, where $R_{i,j}$ is the correlation coefficient between the ith and jth independent variables. By definition, $R_{i,j} = 1$ for i = j. If $|R_{i,j}| = 1$ for some $i \ne j$, then the ith independent variable is a linear function of the jth independent variable and the rank of $\underline{X}'\underline{X}$ will be less than p. This means that an independent variable cannot be a (perfect) linear function of any other independent variable. Furthermore, for the rank of $\underline{X}'\underline{X}$ to be p, an independent variable cannot be linearly dependent on any linear function of the remaining independent variables. For example, if p is 4 and $X_2 = aX_1 + bX_3 + c$, then $X_2$ is a linear function of $X_1$ and $X_3$ so that the rank of $\underline{X}'\underline{X}$ would be at most 3. If there is near linear dependence in $\underline{X}$, the calculation of $(\underline{X}'\underline{X})^{-1}$ may involve roundoff errors and loss of significance leading to nonsensical estimates for $\underline{\beta}$ (Draper and Smith 1966).
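
The normal equations can be formed and solved directly, which also shows what happens when the rank of $\underline{X}'\underline{X}$ is less than p. The sketch below is an assumption-laden illustration, not a recommended production routine: it uses plain Python lists, a small Gauss-Jordan elimination written for this example, hypothetical function names, and hypothetical data constructed so that Y = 2 + 3X2 - X3 exactly.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: columns of X are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(x_rows, y):
    """Estimate the parameters from the normal equations X'X b = X'Y."""
    xt = transpose(x_rows)
    xtx = matmul(xt, x_rows)
    xty = [sum(c * yi for c, yi in zip(col, y)) for col in xt]
    return solve(xtx, xty)


# Hypothetical data: first column of 1's gives the intercept
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0]
X = [[1.0, u, v] for u, v in zip(x2, x3)]
Y = [2.0 + 3.0 * u - 1.0 * v for u, v in zip(x2, x3)]

beta_hat = least_squares(X, Y)
print([round(b, 6) for b in beta_hat])
```

If a column of X is an exact linear function of the others (for instance X3 = 2 X2), the elimination meets a zero pivot and the function raises an error, which is the rank condition described above.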
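As in the case of simple regression, the total sum of squares can be partitioned into three parts. Draper and Smith (1966) demonstrate that equation 9.10 can be written in matrix notation as

$$\underline{Y}'\underline{Y} = n\bar{Y}^2 + (\underline{Y}'\underline{Y} - \hat{\underline{\beta}}'\underline{X}'\underline{Y}) + (\hat{\underline{\beta}}'\underline{X}'\underline{Y} - n\bar{Y}^2)$$

so that the three components of the total sum of squares, $\underline{Y}'\underline{Y}$ or $\sum Y_i^2$, are:

1. $n\bar{Y}^2$, the sum of squares due to the mean

2. $\underline{Y}'\underline{Y} - \hat{\underline{\beta}}'\underline{X}'\underline{Y} = (\underline{Y} - \underline{X}\hat{\underline{\beta}})'(\underline{Y} - \underline{X}\hat{\underline{\beta}}) = \underline{e}'\underline{e} = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$, the sum of squares of deviations from regression or the residual sum of squares

3. $\hat{\underline{\beta}}'\underline{X}'\underline{Y} - n\bar{Y}^2 = \sum (\hat{Y}_i - \bar{Y})^2$, the sum of squares due to regression

A multiple coefficient of determination, $R^2$, can now be defined from equation 9.13 as

$$R^2 = \frac{\text{Sum of squares due to regression}}{\text{Sum of squares corrected for the mean}} = \frac{\hat{\underline{\beta}}'\underline{X}'\underline{Y} - n\bar{Y}^2}{\underline{Y}'\underline{Y} - n\bar{Y}^2} \tag{10.10}$$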
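Table 10.1. ANOVA for multiple regression

Source        Degrees of freedom    Sum of squares                                                              Expected mean square

Mean          1                     $n\bar{Y}^2$
Regression    p - 1                 $\hat{\underline{\beta}}'\underline{X}'\underline{Y} - n\bar{Y}^2$
Residual      n - p                 $\underline{Y}'\underline{Y} - \hat{\underline{\beta}}'\underline{X}'\underline{Y}$      $s^2$
Total         n                     $\underline{Y}'\underline{Y}$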

As in the case of $r^2$, the range of $R^2$ is from 0 to 1. The multiple correlation coefficient is defined as the positive square root of $R^2$. Again, $R^2$ is the fraction of the total sum of squares corrected for the mean that is explained by the regression equation $\hat{\underline{Y}} = \underline{X}\hat{\underline{\beta}}$.
Quite frequently the partitioning of the sum of squares is shown as in table 10.1 in the form of an analysis of variance (ANOVA) table. A mean square in the ANOVA is simply a sum of squares divided by its degrees of freedom.

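Continuing the analogy with simple regression, define $\underline{e}$ as $\underline{Y} - \underline{X}\hat{\underline{\beta}}$. The estimation procedure guarantees that $E(\underline{e}) = \underline{0}$. An unbiased estimate for the Var($\epsilon_i$) or $\sigma^2$ is $s^2$ where

$$s^2 = \frac{\underline{e}'\underline{e}}{n - p} = \frac{(\underline{Y} - \underline{X}\hat{\underline{\beta}})'(\underline{Y} - \underline{X}\hat{\underline{\beta}})}{n - p} = \frac{\underline{Y}'\underline{Y} - \hat{\underline{\beta}}'\underline{X}'\underline{Y}}{n - p}$$

The standard error of the regression equation $\sigma$ is estimated by s. An expression for $R^2$ that is analogous to equation 9.18 is

$$R^2 = 1 - \frac{(n - p)s^2}{(n - 1)s_y^2}$$

Again, this shows that if the regression equation is explaining a large part of the variation in Y, the standard error of the equation will be significantly less than the standard deviation of Y.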

Example 10.1. Benson (1962) studied flood frequencies on many streams in the northeastern United States. The following table contains a partial listing of some of Benson's data. Using this data: (a) Estimate the regression coefficients for the model

$$Q = \beta_1 + \beta_2 A + \beta_3 I$$

where Q is the mean annual flood in thousands of cfs, A is the watershed area in thousands of square miles, and I is the average annual maximum 24-hour rainfall depth in inches. (b) Calculate $R^2$. (c) Calculate $\hat{Q}$ for each observation on the independent variables. (d) Calculate $e_i$ for each $Q_i$.

Station No. Q A I $\hat{Q}$ e

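Solution: To maintain consistency in notation, let $Y_i = Q_i$, $X_{i,1} = 1$, $X_{i,2} = A_i$, $X_{i,3} = I_i$. For this problem n = 14 and p = 3. The column of data under Q is the 14 × 1 vector $\underline{Y}$, a column of 1's along with the data under A and I is the 14 × 3 matrix $\underline{X}$, and the 3 × 1 vector $\hat{\underline{\beta}}$ is made up of $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$. From equation 10.8, we have

$$\hat{\underline{\beta}} = (\underline{X}'\underline{X})^{-1}\underline{X}'\underline{Y}$$

$(\underline{X}'\underline{X})^{-1}$ is found to be

$$(\underline{X}'\underline{X})^{-1} = \begin{bmatrix}
3.71678 & -0.18094 & -1.37537 \\
-0.18094 & 0.02028 & 0.06124 \\
-1.37537 & 0.06124 & 0.52329
\end{bmatrix}$$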

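The parameter estimates are $\hat{\beta}_1 = 1.6570$, $\hat{\beta}_2 = 13.1510$, and $\hat{\beta}_3 = 0.0112$. From equation 10.10, we get

$$R^2 = \frac{\hat{\underline{\beta}}'\underline{X}'\underline{Y} - n\bar{Y}^2}{\underline{Y}'\underline{Y} - n\bar{Y}^2} = 0.99$$

This means that 99% of the variation in Y is explained by the regression equation

$$\hat{Q} = 1.6570 + 13.1510A + 0.0112I$$

Values for $\hat{Q}$ contained in the above table were calculated from this relationship. Values for $e_i$ were computed from

$$e_i = Q_i - \hat{Q}_i$$

and are also contained in the above tabulation.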


The ANOVA table for this example would be

Source        d.f.    Sum of squares    Mean square

Mean            1         6,606.381
Regression      2        13,182.600      6,591.300
Residual       11           171.090         15.554
Total          14        19,960.071

From the ANOVA table $R^2$ = 13,182.600/(19,960.071 - 6,606.381) = 0.99 and $s^2$ = 15.554 or the estimated standard error of the regression equation is s = 3.94.
Comment: The purpose of this example is to demonstrate the meaning of the various matrices
and to provide practice in their calculation. Hydrologic significance should not be attached to the
high R2 since the watersheds are all close to one another (Maine) and the units on Q are cfs and
the watershed area is contained in the equation. Many of the gaging stations are located at vari-
ous points along the same stream. The number of significant figures that are carried in the calcu-
lations should be as large as practical. In reporting the results, the number of significant figures
should be reduced. Thus, the reported results on the above regression might be

If a large number of significant figures are not carried in computing the $(\underline{X}'\underline{X})^{-1}$ matrix, significant errors can result. To demonstrate this, the elements of the $\underline{X}'\underline{X}$ and $\underline{X}'\underline{Y}$ matrices were rounded to two decimal places resulting in estimates for $\underline{\beta}$ of $\hat{\beta}_1 = 1.10$, $\hat{\beta}_2 = 12.24$, and $\hat{\beta}_3 = 5.28$. Computational problems of this type are rarely a problem when using well-established computer routines unless there is near collinearity in the $\underline{X}$ matrix.
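
The entries of the ANOVA table in example 10.1 are tied together by simple arithmetic, which can be rechecked as follows. The sums of squares are the values printed in the table; the variable names are hypothetical.

```python
import math

n, p = 14, 3

# Sums of squares as printed in the ANOVA table of example 10.1
ss_total = 19960.071
ss_mean = 6606.381
ss_regression = 13182.600

ss_residual = ss_total - ss_mean - ss_regression    # residual by difference
ms_regression = ss_regression / (p - 1)             # mean square = SS / d.f.
ms_residual = ss_residual / (n - p)                  # this is s^2
r_squared = ss_regression / (ss_total - ss_mean)     # equation 10.10
s = math.sqrt(ms_residual)                           # standard error of the equation

print(f"residual SS = {ss_residual:.3f}, s^2 = {ms_residual:.3f}")
print(f"R^2 = {r_squared:.4f}, s = {s:.2f}")
```

The residual sum of squares recovered by difference matches the tabulated 171.090, and the rounded $R^2$ and s agree with the values reported in the example.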
CONFIDENCE INTERVALS A I D TESTS OF HYPOTHESES
As was the case in simple regression, in order to use some well-developed theorems on con-
fidence intervals and tests of hypotheses in multiple regression, some assumptions must be made.
All of the comments of chapter 9 regarding the assumptions in simple regression remain valid in
multiple regression. The assumption will now be made that the $\varepsilon_i$ are identically, independently,
and normally distributed with mean 0 and variance $\sigma^2$. That is, the $\varepsilon_i$ are iid $N(0, \sigma^2)$ (see the General
Comments section of chapter 9).

Confidence Intervals On Standard Error


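Confidence intervals can be placed on $\sigma^2$ by noting that the quantity $(n - p)s^2/\sigma^2$ has a chi-square distribution with $n - p$ degrees of freedom. Thus, the confidence limits on $\sigma^2$ are

$$L = \frac{(n - p)s^2}{\chi^2_{1-\alpha/2,\,n-p}} \qquad U = \frac{(n - p)s^2}{\chi^2_{\alpha/2,\,n-p}} \tag{10.13}$$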
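As a minimal sketch of this calculation, the function below divides $(n - p)s^2$ by the upper and lower chi-square quantiles. The function name is illustrative, and the quantiles are assumed to be read from a chi-square table and passed in, so no statistical library is needed. The usage lines apply it to $s^2 = 15.554$ with 11 degrees of freedom from example 10.1, using the commonly tabled values 21.92 and 3.82.

```python
import math

def sigma2_limits(s2, dof, chi2_upper, chi2_lower):
    """Confidence limits on sigma^2 based on dof * s2 / sigma^2 ~ chi-square.

    chi2_upper and chi2_lower are the (1 - alpha/2) and (alpha/2) quantiles
    of the chi-square distribution with `dof` degrees of freedom.
    """
    return dof * s2 / chi2_upper, dof * s2 / chi2_lower

# 95% limits for example 10.1; the quantiles are tabled values for 11 d.f.
low, high = sigma2_limits(15.554, 11, chi2_upper=21.92, chi2_lower=3.82)

# Limits on sigma itself are the square roots of the limits on sigma^2.
sigma_low, sigma_high = math.sqrt(low), math.sqrt(high)
print(low, high, sigma_low, sigma_high)
```

The limits are not symmetric about $s^2$ because the chi-square distribution is skewed.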

Inferences on the Regression Coefficients


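To make inferences concerning $\boldsymbol\beta$, the variance of $\hat{\boldsymbol\beta}$ must be estimated. The variance-covariance matrix of $\hat{\boldsymbol\beta}$ is given by

$$\operatorname{var}(\hat{\boldsymbol\beta}) = E\left[(\hat{\boldsymbol\beta} - \boldsymbol\beta)(\hat{\boldsymbol\beta} - \boldsymbol\beta)'\right]$$

which can be shown to be

$$\operatorname{var}(\hat{\boldsymbol\beta}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$$

The variance of $\hat\beta_i$ is equal to the covariance of $\hat\beta_i$ with itself and is therefore $\sigma^2$ times the $i$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. The covariance of $\hat\beta_i$ with $\hat\beta_j$ is $\sigma^2$ times the $i,j$th element of $(\mathbf{X}'\mathbf{X})^{-1}$. If we let $\mathbf{C} = \mathbf{X}'\mathbf{X}$, then $\mathbf{C}^{-1} = (\mathbf{X}'\mathbf{X})^{-1}$ and

$$\operatorname{var}(\hat\beta_i) = \sigma^2 c^{-1}_{ii} \qquad \operatorname{cov}(\hat\beta_i, \hat\beta_j) = \sigma^2 c^{-1}_{ij}$$

where $c^{-1}_{ii}$ is the $i$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$.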


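If the model is correct, then the quantity $(\hat\beta_i - \beta_i)/s_{\hat\beta_i}$ is distributed as a $t$ distribution with $n - p$ degrees of freedom, where $s_{\hat\beta_i}$ is an estimate for $\sigma_{\hat\beta_i}$ and is calculated as the positive square root of

$$s^2_{\hat\beta_i} = c^{-1}_{ii}s^2$$

Confidence intervals on $\beta_i$ are given by

$$l = \hat\beta_i - t_{1-\alpha/2,\,n-p}\,s_{\hat\beta_i} \qquad u = \hat\beta_i + t_{1-\alpha/2,\,n-p}\,s_{\hat\beta_i} \tag{10.15}$$

A test of the hypothesis $\beta_i = \beta_0$, where $\beta_0$ is a known constant, can be made by noting that $(\hat\beta_i - \beta_0)/s_{\hat\beta_i}$ has a $t$ distribution when the hypothesis is true. Thus, to test $H_0$: $\beta_i = \beta_0$ versus $H_a$: $\beta_i \ne \beta_0$, the test statistic

$$t = \frac{\hat\beta_i - \beta_0}{s_{\hat\beta_i}} \tag{10.17}$$

is computed. $H_0$ is rejected if $|t| > t_{1-\alpha/2,\,n-p}$.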


Because in general $\hat\beta_i$ is not independent of $\hat\beta_j$ (their covariance is given by $c^{-1}_{ij}\sigma^2$), repeated applications of equation 10.17 to test $H_0$: $\beta_i = \beta_{0i}$ and $H_0$: $\beta_j = \beta_{0j}$ are not independent tests.
A test of $H_0$: $\beta_i = 0$ versus $H_a$: $\beta_i \ne 0$ is equivalent to testing the hypothesis that the $i$th independent variable is not contributing significantly to explaining the variation in the dependent variable. If $H_0$: $\beta_i = 0$ is not rejected, it is often advisable to delete the $i$th independent variable from the model and recalculate the regression.
A test of the hypothesis that the entire regression equation is not explaining a significant amount of the variation in Y is equivalent to $H_0$: $\beta_2 = \beta_3 = \cdots = \beta_p = 0$ versus $H_a$: at least one of these $\beta$'s is not zero. Since $\hat\beta_i$ is not independent of $\hat\beta_j$, repeated application of equation 10.17 is not a valid way to test this hypothesis. Use can be made of the fact that the ratio of the mean square due to regression to the residual mean square has an F distribution with $p - 1$ and $n - p$ degrees of freedom. To test $H_0$: $\beta_2 = \beta_3 = \cdots = \beta_p = 0$, calculate the test statistic

$$F = \frac{(\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y} - n\bar{Y}^2)/(p - 1)}{(\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y})/(n - p)} \tag{10.18}$$

and reject $H_0$ if $F$ exceeds $F_{1-\alpha,\,p-1,\,n-p}$.
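The overall F test is a ratio of two mean squares, so it can be sketched directly from ANOVA sums of squares. The function name below is illustrative. The usage lines take the sums of squares from the ANOVA table of example 10.1 and the tabled critical value 3.98 for 2 and 11 degrees of freedom at the 5% level.

```python
def overall_f(ss_regression, ss_residual, n, p):
    """Ratio of the regression mean square to the residual mean square."""
    ms_regression = ss_regression / (p - 1)
    ms_residual = ss_residual / (n - p)
    return ms_regression / ms_residual

# Example 10.1: n = 14 observations, p = 3 coefficients including the intercept.
f_stat = overall_f(13182.600, 171.090, n=14, p=3)
f_critical = 3.98            # tabled F(0.95; 2, 11)
reject_h0 = f_stat > f_critical
print(round(f_stat, 1), reject_h0)
```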


A test of the hypothesis that k of the independent variables are not contributing significantly
to explaining the linear variation in the dependent variable can be made by rearranging the model
so that the last k variables are the ones to be tested. The hypothesis is that the last k independent
variables are not contributing significantly to explaining the linear variation in Y. In practice, the
model does not have to be so arranged. The order of the X's makes no difference. The assumption here is that the last k variables are under test. This makes the notation easier. This is equivalent to $H_0$: $\beta_{p-k+1} = \beta_{p-k+2} = \cdots = \beta_p = 0$ versus $H_a$: at least one of these $\beta$'s is not zero. To test $H_0$, denote the full model as the model containing all p of the independent variables. Denote as the reduced model the model obtained by deleting the last k independent variables. The reduced model contains $p - k$ independent variables. Now let
$Q_2$ = sum of squares due to regression on the full model with $p - 1$ degrees of freedom
$Q_1$ = residual sum of squares on the full model with $n - p$ degrees of freedom
$Q_2^*$ = sum of squares due to regression on the reduced model with $p - k - 1$ degrees of freedom
The quantity

$$F = \frac{(Q_2 - Q_2^*)/k}{Q_1/(n - p)} \tag{10.19}$$

will have an F distribution with $k$ and $n - p$ degrees of freedom. $H_0$ is rejected if $F$ exceeds $F_{1-\alpha,\,k,\,n-p}$.
Note that $Q_2 - Q_2^*$ is the reduction in the sum of squares due to regression brought about by deleting k independent variables. If $Q_2^*$ nearly equals $Q_2$, then the deletion of the k variables has not greatly changed the ability of the model to explain the linear variation in Y. Under these conditions F will be small and $H_0$ will not be rejected, indicating that one might eliminate the last k variables from further consideration. Rejection of $H_0$ does not imply that all of the last k variables are important; it only implies that at least one of these variables is explaining a significant amount of the variation in Y.
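The test on a group of k variables can be sketched as a small function of the three sums of squares defined above. The function name and the numbers in the usage lines are hypothetical and serve only to show the arithmetic. A critical value would still have to be read from an F table with k and n - p degrees of freedom.

```python
def partial_f(q2_full, q2_reduced, q1_full, k, n, p):
    """F statistic for testing that the last k coefficients are all zero.

    q2_full    : regression sum of squares, full model
    q2_reduced : regression sum of squares, reduced model
    q1_full    : residual sum of squares, full model
    """
    return ((q2_full - q2_reduced) / k) / (q1_full / (n - p))

# Hypothetical values: dropping k = 2 of the variables from a model with
# p = 3 coefficients fitted to n = 13 observations.
f_stat = partial_f(q2_full=100.0, q2_reduced=80.0, q1_full=50.0, k=2, n=13, p=3)

# If the reduced model explains exactly as much as the full model, F is zero.
f_zero = partial_f(q2_full=100.0, q2_reduced=100.0, q1_full=50.0, k=2, n=13, p=3)
print(f_stat, f_zero)
```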

Confidence Intervals on the Regression Line


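To place confidence limits on $Y_h$, where $Y_h = \mathbf{X}_h\boldsymbol\beta$, it is necessary to have an estimate for the variance of $\hat{Y}_h$. In this discussion $\hat{Y}_h$ is an estimate of Y (a scalar) at the point $\mathbf{X}_h$ (a $1 \times p$ vector) in p dimensional space. $\hat{\boldsymbol\beta}$ is a $p \times 1$ vector consisting of the estimates for $\boldsymbol\beta$. The $\operatorname{var}(\hat{Y}_h)$ is given by (Draper and Smith, 1966)

$$\operatorname{var}(\hat{Y}_h) = \sigma^2\,\mathbf{X}_h(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h' \tag{10.20}$$

which can be estimated by replacing $\sigma^2$ with $s^2$. The confidence limits on $Y_h$ are given by

$$L = \hat{Y}_h - t_{1-\alpha/2,\,n-p}\sqrt{\widehat{\operatorname{var}}(\hat{Y}_h)} \qquad U = \hat{Y}_h + t_{1-\alpha/2,\,n-p}\sqrt{\widehat{\operatorname{var}}(\hat{Y}_h)} \tag{10.21}$$

The confidence intervals on an individual predicted value of $Y_h$ are given by equations 10.21 where $\operatorname{var}(\hat{Y}_h)$ is replaced by the variance of an individual predicted value of Y at $\mathbf{X}_h$, which is given by $\sigma^2(1 + \mathbf{X}_h(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h')$.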
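The quadratic form $\mathbf{X}_h(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h'$ drives both interval widths, which the following sketch makes explicit. The design is hypothetical (a simple regression with x = 0, 1, 2, so the inverse can be written by hand), as are the function names and the value of $s^2$. The critical value 12.706 is the tabled $t_{0.975}$ for 1 degree of freedom.

```python
import math

def quadratic_form(x_h, xtx_inv):
    """Return x_h (X'X)^-1 x_h' for a row vector x_h."""
    size = len(x_h)
    return sum(x_h[i] * xtx_inv[i][j] * x_h[j]
               for i in range(size) for j in range(size))

def interval_half_widths(x_h, xtx_inv, s2, t_crit):
    """Half-widths of the limits on the line and on an individual value."""
    h = quadratic_form(x_h, xtx_inv)
    on_line = t_crit * math.sqrt(s2 * h)
    on_individual = t_crit * math.sqrt(s2 * (1.0 + h))
    return on_line, on_individual

# Hypothetical design: intercept plus x = 0, 1, 2, so X'X = [[3, 3], [3, 5]].
xtx_inv = [[5 / 6, -1 / 2], [-1 / 2, 1 / 2]]
x_h = [1.0, 1.0]                          # evaluate at x = 1, the mean of x
h = quadratic_form(x_h, xtx_inv)          # equals 1/n at the mean of x
line_hw, indiv_hw = interval_half_widths(x_h, xtx_inv, s2=4.0, t_crit=12.706)
print(h, line_hw, indiv_hw)
```

The interval on an individual value is always wider than the interval on the line because of the added 1 inside the square root.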
-- -

Other Inferences in Regression


Many other tests of hypotheses can be made and confidence intervals constructed relative to
multiple regression. For example, one might make tests concerning linear relationships among
the $\beta$'s or that the $\beta$'s obtained from one situation are equal to those obtained from another
situation. Reference can be made to Graybill (1961), Johnston (1963), Draper and Smith (1966),
or Neter et al. (1996) for these and other tests.

Example 10.2. For the regression equation of example 10.1: (a) Test the hypothesis that the regression equation is not explaining a significant amount of the variation of Y. (b) Test $H_0$: $\beta_2 = 0$. (c) Test $H_0$: $\beta_3 = 0$. (d) Calculate the 95% confidence limits on $\beta_2$. (e) Calculate the 95% confidence limits on the regression line at the point A = 4,000 square miles and I = 2.0 inches. (f) Calculate the 95% confidence intervals on $\sigma^2$.
Solution:

(a) This $H_0$ is equivalent to $H_0$: $\beta_2 = \beta_3 = 0$ versus $H_a$: at least one of $\beta_2$ or $\beta_3 \ne 0$. The test is conducted by calculating the test statistic from equation 10.18. The quantities in equation 10.18 are contained in the ANOVA table with the numerator being the mean square due to regression and the denominator being the residual mean square.

$$F = \frac{6{,}591.300}{15.554} = 423.8$$

The tabulated $F_{0.95,2,11}$ is 3.98. Therefore, $H_0$ is rejected. The regression equation does explain a significant amount of the variation in Y.

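(b) $H_0$: $\beta_2 = 0$    $H_a$: $\beta_2 \ne 0$

The test statistic is from equation 10.17.

$$t = \frac{\hat\beta_2}{s_{\hat\beta_2}} = \frac{13.1510}{\sqrt{0.02028(15.554)}} = \frac{13.1510}{0.562} = 23.4$$

The tabled value of t is $t_{0.975,11} = 2.201$, so we reject $H_0$. Area does explain a significant amount of the variation in Y.

(c) $H_0$: $\beta_3 = 0$    $H_a$: $\beta_3 \ne 0$

The test statistic is again from equation 10.17.

$$t = \frac{\hat\beta_3}{s_{\hat\beta_3}} = \frac{0.0112}{\sqrt{0.52329(15.554)}} = \frac{0.0112}{2.85} = 0.004$$

Because $|t| < t_{0.975,11}$, we cannot reject $H_0$. The mean annual maximum 24-hour rainfall depth does not explain a significant amount of the variation in the mean annual peak flow.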

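(d) The 95% confidence limits on $\beta_2$ are calculated from equations 10.15 as

$$l = 13.1510 - 2.201(0.562) = 11.91 \qquad u = 13.1510 + 2.201(0.562) = 14.39$$

(e) The 95% confidence limits on the regression line at $X_{2,h} = 4.00$ and $X_{3,h} = 2.0$ are determined from equation 10.21. The $\operatorname{var}(\hat{Y}_h)$ is from equation 10.20,

$$\widehat{\operatorname{var}}(\hat{Y}_h) = 15.554\,\mathbf{X}_h(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h'$$

where $\mathbf{X}_h = (1, 4.00, 2.0)$ and $(\mathbf{X}'\mathbf{X})^{-1}$ is given in example 10.1.

(f) The 95% confidence intervals on $\sigma^2$ are calculated from equation 10.13 as

$$L = \frac{11(15.554)}{21.92} = 7.81 \qquad U = \frac{11(15.554)}{3.82} = 44.8$$

The 95% confidence intervals on $\sigma$ can be obtained by taking the square root of these limits to obtain 2.80-6.69.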
Comment: The hypotheses $H_0$: $\beta_2 = 0$ and $H_0$: $\beta_3 = 0$ were both tested in this example as though the tests were independent. In fact, $\hat\beta_2$ and $\hat\beta_3$ are not independent. The $\operatorname{cov}(\hat\beta_2, \hat\beta_3)$ can be determined from $c^{-1}_{23}s^2$ as 0.0612(15.554) = 0.9519. The correlation between $\hat\beta_2$ and $\hat\beta_3$ can be estimated from $\operatorname{cov}(\hat\beta_2, \hat\beta_3)/(s_{\hat\beta_2}s_{\hat\beta_3})$ as 0.9519/(0.562 x 2.85) = 0.59. The test of $H_0$: $\beta_3 = 0$ is made relative to the full model that includes all of the $\beta$'s. The acceptance of $H_0$ implies that $\beta_3 = 0$ given that $\beta_1$ and $\beta_2$ are in the model. In general, if there are p $\beta$'s and $H_0$: $\beta_i = 0$ is tested for each of them, with the result that k of the hypotheses can be accepted, one cannot eliminate these k variables from the model on the basis of this test alone because each of the individual $H_0$: $\beta_i = 0$ assumes all of the other $p - 1$ $\beta$'s are still in the model. To eliminate k variables at once, the test must be based on equation 10.19.
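The figures quoted in this example for the individual coefficients can be re-derived from the elements of $(\mathbf{X}'\mathbf{X})^{-1}$ and $s^2$ reported in example 10.1. The sketch below uses only those reported values and the tabled $t_{0.975,11} = 2.201$; the variable names are illustrative.

```python
import math

s2 = 15.554                                   # residual mean square
c22, c33, c23 = 0.02028, 0.52329, 0.06124     # elements of (X'X)^-1
b2, b3 = 13.1510, 0.0112                      # estimated coefficients
t_crit = 2.201                                # tabled t(0.975; 11)

# Standard errors: square root of s^2 times the matching diagonal element.
se_b2 = math.sqrt(c22 * s2)
se_b3 = math.sqrt(c33 * s2)

# t statistics for H0: beta_i = 0 and the 95% limits on beta_2.
t_b2, t_b3 = b2 / se_b2, b3 / se_b3
ci_b2 = (b2 - t_crit * se_b2, b2 + t_crit * se_b2)

# Covariance and correlation between the two coefficient estimates.
cov_b2_b3 = c23 * s2
corr_b2_b3 = cov_b2_b3 / (se_b2 * se_b3)

print(round(se_b2, 3), round(se_b3, 2), round(t_b2, 1), round(corr_b2_b3, 2))
```

A correlation near 0.6 between the two estimates is the reason the two t tests cannot be treated as independent.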
As an example of the application of equation 10.19, $H_0$: $\beta_3 = 0$ will be tested. The ANOVA for the full model is contained in example 10.1. The reduced model is simply $Y = b_1 + b_2X$ where X is the watershed area in thousands of square miles. Because this is a simple regression situation, we can compute the sum of squares due to regression from $b_2\sum x_iy_i$ where $b_2 = \sum x_iy_i/\sum x_i^2$, with $x_i$ and $y_i$ taken as deviations from their respective means. The result of this calculation is the sum of squares due to regression for the reduced model, which is 13,182.60.
The test statistic from equation 10.19 is

$$F = \frac{(13{,}182.60 - 13{,}182.60)/1}{15.554} \approx 0$$

The table value of $F_{0.95,1,11} = 4.84$, so we fail to reject $H_0$: $\beta_3 = 0$. Note that this test is identical to the test conducted in part (c) of this example. From F and t tables it can be seen that $F_{1-\alpha,1,n} = t^2_{1-\alpha/2,n}$, so for the special case where k = 1 variable is being tested, equations 10.17 and 10.19 produce identical results.
Because $H_0$: $\beta_3 = 0$ was not rejected, the next logical step is to eliminate I from the model and consider only A. In so doing the resulting regression equation is

The dependence of the $\hat\beta$'s again is evident because the intercept is not the same as was obtained
when rainfall depth was included in the model. This is a somewhat special example in that $\beta_2$
accounts for nearly all of the variation in Y, leaving virtually none of the variation to be explained
by $\beta_3$. Again, one reason for this unusual situation is the units on Y and A and the proximity of all
of the watersheds to each other, resulting in similar rainfalls on all of the watersheds. Unless the
relationship between the dependent variable and an independent variable is quite strong, vari-
ability in the dependent variable due to variability in the independent variable cannot be detected
if there is little variability in the independent variable.

WHICH LINE IS BEST


A common situation in which multiple regression is used is when one dependent variable
and several independent variables are available and it is desired to find a linear model for pre-
dicting unobserved values for the dependent variable. The model that is developed does not nec-
essarily have to contain all of the independent variables. Thus, the points of concern are: 1) can
a linear model be used and 2) what independent variables should be included?
A factor complicating the selection of the model is that in most cases the independent vari-
ables are not statistically independent at all but are correlated. One of the first steps that should
be done in a regression analysis is to compute the correlation matrix $\mathbf{R}$ of the independent variables. The correlation matrix can be computed as follows. Let

$$z_{i,j} = \frac{X_{i,j} - \bar{X}_j}{(n - 1)^{1/2}\,s_j}$$

where $\bar{X}_j$ and $s_j$ are the mean and standard deviation of the $j$th independent variable. Then define $\mathbf{Z} = [z_{i,j}]$ so that the correlation matrix is

$$\mathbf{R} = \mathbf{Z}'\mathbf{Z}$$

where $R_{i,j}$ is the correlation between the $i$th and $j$th independent variables. $\mathbf{R}$ is a symmetric matrix because $R_{i,j} = R_{j,i}$. We have already seen that if $R_{i,j} = 1$ for $i \ne j$, then either variable i or variable j must be omitted from the model or else the $\mathbf{X}'\mathbf{X}$ matrix cannot be inverted. If $R_{i,j}$ is close to unity (but not equal to unity), then $\mathbf{X}'\mathbf{X}$ can be inverted and $\boldsymbol\beta$ estimated. If $R_{i,j}$ is close to unity, then the $\operatorname{var}(\hat\beta_i)$ or $\operatorname{var}(\hat\beta_j)$ may be very large. Tests of hypothesis on $\beta_i$ and $\beta_j$ may indicate that neither is significantly different from zero when in fact either $\beta_i$ or $\beta_j$ when used alone may be significantly different from zero. The problem here is that since $X_i$ and $X_j$ are nearly linearly
related, they both are attempting to explain the same thing in the linear model. By having both Xi
and Xj in the model, the part of the variation in Y that either would explain if used alone may be
split between them in such a fashion that neither is significant. In other words, the effect of one
explanatory factor (which may be reflected in either Xi or Xj) is being divided between two
correlated variables.
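The correlation matrix of the independent variables can be sketched in a few lines. The version below assumes each column is centered on its mean and scaled by $(n - 1)^{1/2}$ times its standard deviation (computed with an $n - 1$ divisor), so that the products of the scaled columns are the correlations. The data and the function name are hypothetical.

```python
import math

def correlation_matrix(columns):
    """Correlation matrix R = Z'Z for a list of equal-length data columns."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        # Center, then scale so that each scaled column has unit length.
        scaled.append([(v - mean) / (math.sqrt(n - 1) * sd) for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in scaled] for zi in scaled]

# Hypothetical variables: the second is an exact multiple of the first.
x2 = [1.0, 2.0, 3.0, 4.0]
x3 = [2.0, 4.0, 6.0, 8.0]
x4 = [1.0, 0.0, 1.0, 0.0]
R = correlation_matrix([x2, x3, x4])
for row in R:
    print([round(v, 3) for v in row])
```

Here the off-diagonal entry for the first two variables is 1, which is the situation in which one of the pair must be dropped before $\mathbf{X}'\mathbf{X}$ can be inverted.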
Retaining variables in a regression equation that are highly correlated (multicolinearity)
makes the interpretation of the regression coefficients difficult. Many times the sign of the
regression coefficient may be the opposite of what is expected if the corresponding variable is
highly correlated with another independent variable in the equation. Multicolinearity is discussed
below.
A common practice in selecting a multiple regression model (and one that is not necessarily
being advocated) is to perform several regressions on a given set of data using different
combinations of the independent variables. The regression that "best" fits the data is then
selected. A commonly used criterion for the "best" fit is to select the equation yielding the largest
value of R2.
Looking at equations 10.21, another and perhaps better criterion is apparent. The confidence
intervals on the regression line are a function of s, the estimated standard error. The line with the
smallest standard error will have the narrowest confidence intervals.
Often the two criteria of the largest $R^2$ and the smallest $s$ give the same results, but not
always. As more variables are added to a regression equation, the $R^2$ value can never decrease.
Thus, from the standpoint of the $R^2$ criterion, one should use all of the available variables. This,
however, makes a clumsy equation and one in which it is extremely difficult to place a meaning-
ful interpretation on the coefficients.
As more variables are added to a regression equation, the standard error may get larger. This can be seen from equation 10.11. Every time a variable is added, $n - p$ gets smaller, as does $\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y}$. However, the numerator may not, and often does not, decrease proportionally to $n - p$, so that as variables are added $s$ may actually increase. This is a tip-off that the added variables are not contributing significantly to the regression and can just as well be left out.
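A small numerical sketch shows how the two criteria can disagree. All of the sums of squares below are hypothetical: adding a third coefficient lowers the residual sum of squares only slightly, so $R^2$ rises while the standard error, which divides by $n - p$, also rises.

```python
import math

def r_squared(sse, sst):
    """Fraction of the corrected total sum of squares that is explained."""
    return 1.0 - sse / sst

def standard_error(sse, n, p):
    """Square root of the residual mean square."""
    return math.sqrt(sse / (n - p))

n, sst = 5, 54.0                  # hypothetical sample size and corrected total SS
sse_two, sse_three = 14.0, 13.9   # hypothetical residual SS with p = 2 and p = 3

r2_two, r2_three = r_squared(sse_two, sst), r_squared(sse_three, sst)
s_two, s_three = standard_error(sse_two, n, 2), standard_error(sse_three, n, 3)
print(round(r2_two, 4), round(r2_three, 4), round(s_two, 3), round(s_three, 3))
```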
All of the variables retained in a regression should make a significant contribution to the
regression unless there is an overriding reason (theoretical or intuitive) for retaining a non-
significant variable. The variables retained should have physical significance. If two variables are
equally significant when used alone but are not both needed, the one that is easiest to obtain or
easiest to interpret should be used.
The number of coefficients estimated should not exceed 25-35% of the number of observa-
tions. This is a rule of thumb used to avoid "over-fitting", whereby oscillations in the equation
may occur between observations on the independent variables.
Thus far all decisions on which regression equation to use have been made by the
investigator. In many cases this is the most reliable method of selecting a regression equa-
tion. Using computers, it is possible to perform many regressions on large sets of data. This
has led to several formal procedures for selecting a regression equation. Two methods will
be discussed here: all-possible-regressions and stepwise regression. For a discussion of
some other techniques, reference should be made to Draper and Smith (1966) and Neter
et al. (1996).
All-possible-regressions involves calculating regression equations having every possible combination of the X variables. If all of the equations are required to have an intercept term, $2^{p-1}$ regression equations would have to be calculated, where p is the number of independent variables, one of which is always equal to one to produce the intercept term. Thus, if p = 4, 8 regression equations would be calculated (not an impossible task or a bad procedure); however, if p = 11, 1,024 regressions would have to be calculated and examined. Thus, as p gets even moderately large, the number of regressions required becomes prohibitive, and intelligent thought could eliminate many of them. When this many regressions are calculated, the probability of getting a significant regression by chance becomes large.
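The growth in the number of candidate equations can be seen by enumerating the subsets directly. In this sketch the variable names are placeholders and the intercept is assumed to be present in every model, so only the remaining p - 1 variables are switched in and out.

```python
from itertools import combinations

def candidate_models(variables):
    """Yield every subset of the candidate variables (intercept implied)."""
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset

# p = 4 counting the intercept: three candidate variables.
models = list(candidate_models(["X2", "X3", "X4"]))

# p = 11 counting the intercept: ten candidate variables.
n_large = sum(1 for _ in candidate_models([f"X{i}" for i in range(2, 12)]))

print(len(models), n_large)
```

The empty subset, which corresponds to the intercept-only model, is included in the count.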
One of the most commonly used procedures for selecting the "best" regression equation is
stepwise regression. This procedure consists of building the regression equation one variable at a
time by adding at each step the variable that explains the largest amount of the remaining unex-
plained variation. After each step all the variables in the equation are examined for significance
and discarded if they are no longer explaining a significant amount of the variation. Thus, the first
variable added is the one with the highest simple correlation with the dependent variable. The
second variable added is the one explaining the largest variation in the dependent variable that
remains unexplained by the first variable added. At this point the first variable is tested for sig-
nificance and retained or discarded depending on the results of this test. The third variable added
is the one that explains the largest portion of the variation that is not explained by the variables
already in the equation. The variables in the equation are then tested for significance. This pro-
cedure is continued until all of the variables not in the equation are found to be insignificant and
all of the variables in the equation are significant. This is a very good procedure to use but care
must be exercised to see that the resulting equation is rational. An alternative stepwise procedure
is to start with a full model and eliminate variables one at a time with the least significant vari-
able being chosen for elimination. At each step the significance of all remaining variables is
checked to ensure the retained variables are, in fact, more important than the eliminated ones.
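The forward part of the stepwise procedure can be sketched as follows. This is a simplified illustration, not the full procedure described above: it adds variables by their F-to-enter value but omits the re-testing and removal of variables already in the equation, and it uses a single fixed threshold (`f_to_enter`, an arbitrary illustrative value) in place of tabled critical values. The data, names, and helper functions are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def residual_ss(columns, y):
    """Residual sum of squares from least squares on the given columns."""
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    coef = solve(xtx, xty)
    fitted = [sum(c * col[i] for c, col in zip(coef, columns)) for i in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))

def forward_select(candidates, y, f_to_enter):
    """Add, one at a time, the variable with the largest F-to-enter."""
    n = len(y)
    chosen = {"intercept": [1.0] * n}
    remaining = dict(candidates)
    while remaining:
        sse_now = residual_ss(list(chosen.values()), y)
        best_name, best_f = None, 0.0
        for name, col in remaining.items():
            cols = list(chosen.values()) + [col]
            dof = n - len(cols)
            if dof <= 0:
                continue
            sse_new = residual_ss(cols, y)
            f_value = (sse_now - sse_new) / (sse_new / dof)
            if f_value > best_f:
                best_name, best_f = name, f_value
        if best_name is None or best_f < f_to_enter:
            break
        chosen[best_name] = remaining.pop(best_name)
    return [name for name in chosen if name != "intercept"]

# Hypothetical data: y depends on x_strong; x_noise is unrelated to y.
y = [0.0, -1.0, 5.0, 3.0, 8.0]
candidates = {
    "x_strong": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "x_noise": [1.0, -2.0, 0.0, 2.0, -1.0],
}
selected = forward_select(candidates, y, f_to_enter=4.0)
print(selected)
```

In practice the threshold would come from an F table with the appropriate degrees of freedom at each step, and the resulting equation should still be checked for physical rationality.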
Of course, the real test of how good the resulting regression model is depends on the ability
of the model to predict the dependent variable for observations on the independent variables that
were not used in estimating the regression coefficients. To make a comparison of this nature, it is
necessary to randomly divide the data into two parts. One part of the data is then used to develop
the model and the other part to test the model. Unfortunately, in hydrologic applications there are
often not enough observations to carry out this procedure.

EXTRAPOLATION
The comments on extrapolation contained in chapter 9 relative to simple regression are
equally applicable to multiple regression. In multiple regression an additional problem arises. It
is sometimes difficult to tell the range of the data. In example 10.1, A ranges from 0.091 to 8.27
and I ranges from 1.7 to 3.2. Is the point A = 6.0 and I = 2.7 in the range of the data?
A plot of A and I is shown in figure 10.1. From this plot it is apparent that A and I do not
cover the entire range defined by 0.091 < A < 8.27 and 1.7 < I < 3.2. The point A = 6.0 and
I = 2.7 does not appear to be in the range of the data. In more than 2 dimensions it is much more
difficult to visualize the range of the data.

Fig. 10.1. Range of data used in example 10.1.

AUTOCORRELATED ERRORS
One of the assumptions that is made in linear regression is that the errors are independent.
This means that there should be no correlation between the errors at successive observations.
Correlation in the errors from one observation to the next is common in time series data, espe-
cially if the hydrologic system involves considerable storage. For example, if the dependent
variable is the elevation of the ground water in a particular observation well on a monthly basis,
it would not be uncommon that if this water level were under-predicted at a particular time step,
it would tend to be under-predicted in the next time step. Correlation of this type is often called
autocorrelation or serial correlation. The chapters in this book on Correlation and on Time Series
Analysis deal with this topic as well. Neter et al. (1996) has a good treatment of regression when
serial correlation is present.
It is important to note that what is of concern is autocorrelation in the error term of the
regression model, not in the dependent or the independent variables. Often, but not always,
autocorrelation in the dependent variable leads to autocorrelation in the error term of the regres-
sion model. Time series data such as daily or monthly streamflow, monthly ground water levels,
and monthly reservoir levels generally have significant serial correlation, and regressions using
these as dependent variables often have serial correlation in the error terms. The error term rep-
resents deviations between the predicted and observed values of the dependent variable. Serial
correlation in the predicted variables can arise because the model predicts similar values from
one time step to the next. It is only when over-predictions at one time step tend to follow over-
predictions at the previous time step and under-predictions tend to follow under-predictions that
serial correlation in the error term exists.
Serial correlation in the errors can be detected by examining a time series plot of the errors
and noting any patterns. Random scattering of the errors indicates a lack of serial correlation or
independence of the errors. Any pattern in the errors may be indicative of serial correlation. The
correlogram (chapter 14) of the errors can also be computed. A large first order serial correlation
indicates correlated errors.
Estimated regression coefficients in the presence of serial correlation in the errors are unbi-
ased but their variances are incorrectly estimated, and thus the level of significance of hypothe-
sis tests regarding these coefficients is unknown. The standard error of the regression equation is
also affected so that hypothesis tests involving the standard error are also at an unknown level of
significance.
Serial correlation may indicate that one or more important explanatory variables are missing from the regression equation. Serial correlation implies that

$$e_t = \rho e_{t-1} + \varepsilon_t$$

where $e_t$ is the error at time t, $\rho$ is the serial correlation, and $\varepsilon_t$ is independent with mean zero. Neter et al. (1996) indicate that if $\varepsilon_t$ is iid $N(0, \sigma^2)$ then $e_t$ has a mean of 0 and a variance of $\sigma^2/(1 - \rho^2)$ where $\rho$ is the first order serial correlation between $e_t$ and $e_{t-1}$. This in turn implies that

$$Y_t - \rho Y_{t-1} = \beta_1(1 - \rho) + \sum_{i=2}^{p}\beta_i(X_{i,t} - \rho X_{i,t-1}) + \varepsilon_t$$

Transforming all variables so that

$$Y'_t = Y_t - \rho Y_{t-1} \qquad X'_{i,t} = X_{i,t} - \rho X_{i,t-1}$$

we can perform a regression of $Y'_t$ versus $X'_{i,t}$ and eliminate the problem of serial correlation in the errors. Equation (10.25) requires that $\rho$ be known. It can be estimated by computing the first order serial correlation of the errors from the original equation involving $Y_t$ and the $X_{i,t}$'s from

where $\hat\sigma$ is the estimated standard error of the original regression equation.
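The two steps, estimating $\rho$ from the residuals and then transforming the variables, can be sketched as below. The estimator used here is the slope of a regression through the origin of each residual on the one before it; this is one common choice and is an assumption, not necessarily the formula intended above. The residuals, the series, and the function names are hypothetical.

```python
def lag1_rho(residuals):
    """Estimate rho as the through-origin slope of e_t on e_(t-1)."""
    numerator = sum(cur * prev for prev, cur in zip(residuals[:-1], residuals[1:]))
    denominator = sum(prev * prev for prev in residuals[:-1])
    return numerator / denominator

def quasi_difference(series, rho):
    """Return the transformed series v_t - rho * v_(t-1), for t = 2, ..., n."""
    return [cur - rho * prev for prev, cur in zip(series[:-1], series[1:])]

# Hypothetical residuals in which each value is half of the previous one.
residuals = [4.0, 2.0, 1.0, 0.5]
rho_hat = lag1_rho(residuals)

# The same transformation is applied to Y and to every X column.
y = [10.0, 12.0, 11.0, 13.0]
y_transformed = quasi_difference(y, rho_hat)
print(rho_hat, y_transformed)
```

One observation is lost in the transformation because the first value has no predecessor.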


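An alternative to this estimation of $\rho$ would be to include $Y_{t-1}$ and the $X_{i,t-1}$ as predictor variables so that equation (10.25) would become

$$Y_t = \beta'_1 + \rho Y_{t-1} + \sum_{i=2}^{p}\beta_i X_{i,t} + \sum_{i=2}^{p}\beta'_i X_{i,t-1} + \varepsilon_t \tag{10.28}$$

In the event that $\varepsilon_t$ turns out to be iid $N(0, \sigma^2)$, standard tests of hypotheses can then be used to eliminate nonsignificant $\beta$'s. In equation (10.28) the $Y_{t-1}$ and $X_{i,t-1}$ are known as lagged variables.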
Lagged variables can often represent changes in storage. We know from continuity

$$I - O = \Delta S$$

where I is inflow, O is outflow, and $\Delta S$ is the change in storage for a particular hydrologic system. In many systems of areal extent A, $(Y_t - Y_{t-1})A$ may be proportional to the change in storage from time $t - 1$ to $t$. A prediction of $Y_t$ might be based on the difference in inflow and outflow from $t - 1$ to $t$ and $Y_{t-1}$.

$$\bar{I} - \bar{O} = (Y_t - Y_{t-1})A$$

$$\frac{I_t + I_{t-1}}{2}\Delta t - \frac{O_t + O_{t-1}}{2}\Delta t = (Y_t - Y_{t-1})A$$

which may be written

$$Y_t = Y_{t-1} + \frac{\Delta t}{2A}\left(I_t + I_{t-1} - O_t - O_{t-1}\right)$$

and is in the form of equation 10.28.


Equation (10.28) may eliminate serial correlation in the error term but may introduce multicolinearity in the $\mathbf{X}$ matrix. Multicolinearity in $\mathbf{X}$ is discussed below. Multicolinearity may arise from serial correlation in one or more of the X variables. Further, if $Y_t$ is linearly related to the $X_t$'s, then $Y_{t-1}$ can be expected to be linearly related to the $X_{t-1}$'s.

Testing for Serial Correlation


One way to detect serial correlation in the errors is through the correlogram (chapter 14).
Serial correlation in the errors would be indicated by a significant first order autocorrelation
coefficient. A test for serial correlation is presented in chapter 11. The hypothesis is that $\rho(k) = 0$
versus $\rho(k) \ne 0$ where $\rho(k)$ is the $k$th order serial correlation. The test for k = 1 is of special
concern.
Possibly the most widely used test for serial correlation in the error term is the Durbin-
Watson test. In hydrology any correlation tends to be positive rather than negative because the
correlation often comes about due to storage in the system under analysis. Ground water levels
change slowly because of ground water storage. Flows in major rivers change slowly because of
storage in the watershed. Storage tends to promote positive serial correlation. Thus, the normal
test for serial correlation would be for $\rho = 0$ versus $\rho > 0$. The test statistic is

$$D = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2}$$

An exact test is not available, but Durbin and Watson have obtained lower and upper bounds $d_L$ and $d_U$ such that values of D outside these bounds lead to a decision: the hypothesis cannot be rejected if $D > d_U$, and the hypothesis is rejected if $D < d_L$. If $d_L < D < d_U$, the test is inconclusive. Tables of $d_L$ and $d_U$ are contained in the appendix for various values of n, p, and for levels of significance equal to 0.05 and 0.01.
Neter et al. (1996) indicate that a test for negative serial correlation can be done by using as
a test statistic 4 - D. The test is then the same as for positive serial correlation. Helsel and Hirsch
(1992) indicate that the Durbin-Watson statistic requires the data to be evenly spaced in time.
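The Durbin-Watson statistic is simple to compute from a series of residuals. The sketch below uses the standard definition, the sum of squared successive differences divided by the sum of squared residuals, and applies the three-way decision rule. The residuals and the bounds passed to `dw_decision` are hypothetical; real bounds must be read from the tables for the actual n and p.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((cur - prev) ** 2
                    for prev, cur in zip(residuals[:-1], residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

def dw_decision(d, d_lower, d_upper):
    """Decision for H0: rho = 0 versus Ha: rho > 0."""
    if d < d_lower:
        return "reject"
    if d > d_upper:
        return "do not reject"
    return "inconclusive"

# Hypothetical residuals that hold their sign for two steps at a time.
d_runs = durbin_watson([1.0, 1.0, -1.0, -1.0, 1.0, 1.0])

# Residuals that alternate in sign give D above 2; for a test of negative
# serial correlation the statistic 4 - D is compared with the same bounds.
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
d_for_negative_test = 4.0 - d_alternating

print(d_runs, d_alternating, dw_decision(0.5, d_lower=1.0, d_upper=1.4))
```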

Corrective Action
When serial correlation in the errors is detected, the first step should be to determine if some
important explanatory variable is missing from the regression equation. Often in hydrology the
serial correlation is a result of storage in the system. In this case, a measure of this storage may
need to be included as a predictor variable. In other cases some function of time may correct the
problem.
Aggregating data over longer time periods may reduce or eliminate serial correlation. As the
time between observations increases, the dependence of one observation on another can be
expected to decrease. At large enough time intervals, independence may be achieved.
As indicated earlier, the inclusion of lagged variables, both on the dependent and the inde-
pendent variables, may help reduce serial correlation. The chapter on time series modeling
should be consulted for more on this topic.

MULTICOLINEARITY
In multiple linear regression it is unfortunate that the predictor variables in $\mathbf{X}$ are called "independent" variables. This terminology reflects that $\mathbf{Y}$ is being predicted as a function of $\mathbf{X}$. Thus $\mathbf{Y}$ has been termed the "dependent" variable because $\mathbf{Y}$ is thought to depend on $\mathbf{X}$. By extension, the X's have become known as the "independent" variables because they are what $\mathbf{Y}$ is dependent upon. Independence has a special meaning in statistics that differs from the above, as we have seen. We know that if all of the X's in $\mathbf{X}$ are mutually independent, then the correlation matrix computed from $\mathbf{X}$ will be a diagonal matrix with ones on the diagonal and zeros elsewhere.
In most natural sciences where "independent" variables are measured values from uncontrolled experimentation, it is rare to achieve true statistical independence. Some level of correlation almost always exists among the predictor variables. These correlations among the independent variables are often called multicolinearities. Much of the discussion on multicolinearity comes from Neter et al. (1996). Generally the term multicolinearity is reserved for the case when rather strong correlations exist within the $\mathbf{X}$ matrix.
As the name implies, multiple linear regression attempts to exploit linear relationships between $\mathbf{Y}$ and $\mathbf{X}$ to develop a prediction or descriptive equation for $\mathbf{Y}$. If two X variables, say $X_1$ and $X_2$, are perfectly linearly related, then $r_{1,2} = 1$. Furthermore, all of the information relative to a linear relationship between $\mathbf{Y}$ and $\mathbf{X}_1$ will be contained in the relationship between $\mathbf{Y}$ and $\mathbf{X}_2$. In other words, nothing is gained by including both $\mathbf{X}_1$ and $\mathbf{X}_2$ in a linear regression with $\mathbf{Y}$. As a matter of fact, there is not a unique linear relationship between $\mathbf{Y}$ and $\mathbf{X}_1$ and $\mathbf{X}_2$ if $\mathbf{X}_1$ and $\mathbf{X}_2$ are perfectly correlated. Each of the relationships will predict the same value of Y for all $X_1$ and $X_2$ pairs that follow the linear relationship between $\mathbf{X}_1$ and $\mathbf{X}_2$.
When $\mathbf{X}_1$ and $\mathbf{X}_2$ are perfectly correlated, the residuals of the regressions of $\mathbf{Y}$ on $\mathbf{X}_1$, $\mathbf{Y}$ on $\mathbf{X}_2$, and $\mathbf{Y}$ on $\mathbf{X}_1$ and $\mathbf{X}_2$ will all be exactly the same since the same information, in a linear sense, will be contained in all three regressions. (Note the brief mention of multicolinearity following equation 10.8.)
If we now relax the requirement that $\mathbf{X}_1$ and $\mathbf{X}_2$ are perfectly correlated to requiring that they be "highly" correlated, an approximation to what is discussed above results. Now the residual sums of squares of regressions of $\mathbf{Y}$ on $\mathbf{X}_1$, $\mathbf{Y}$ on $\mathbf{X}_2$, and $\mathbf{Y}$ on $\mathbf{X}_1$ and $\mathbf{X}_2$ will be nearly the same, depending on the strength of the linear relationship between $\mathbf{X}_1$ and $\mathbf{X}_2$. Thus, if a regression of $\mathbf{Y}$ on $\mathbf{X}_1$ is performed followed by a regression of $\mathbf{Y}$ on $\mathbf{X}_1$ and $\mathbf{X}_2$, the reduction in the residual sum of squares brought about by the addition of $\mathbf{X}_2$ will be small because very little information in a linear sense is added to the regression.
What may happen if both $\mathbf{X}_1$ and $\mathbf{X}_2$ are included is that the linear effects between $\mathbf{Y}$ and $\mathbf{X}_1$ or $\mathbf{X}_2$ may be split between $\mathbf{X}_1$ and $\mathbf{X}_2$ in such a fashion that the regression coefficients do not make physical sense. For example, they may have the wrong sign. Furthermore, the individual regression coefficients may test nonsignificant on both $\mathbf{X}_1$ and $\mathbf{X}_2$ even though the overall regression is significant.
By splitting the importance of either $\mathbf{X}_1$ or $\mathbf{X}_2$ among both $\mathbf{X}_1$ and $\mathbf{X}_2$, the variances of the regression coefficients on $\mathbf{X}_1$ and $\mathbf{X}_2$ become larger, indicating increased sampling variability relative to these coefficients. Again, this is brought about by splitting the effect of one important linear relationship among two (or more) variables that are closely linearly related. Substantial changes in the values for the regression coefficients upon the addition or removal of a variable from a regression equation are an indication that multicolinearity may be present.
Having both $\mathbf{X}_1$ and $\mathbf{X}_2$ in the regression equation will not cause prediction problems as long as the predictions are confined to the region of $\mathbf{X}_1$ and $\mathbf{X}_2$ defined by the original data sets. This means that values used for $\mathbf{X}_1$ and $\mathbf{X}_2$ for prediction must exhibit the same near linear relationship as did the original values used in estimating the regression coefficients.
Multicolinearity is not restricted to correlations between pairs of X variables. It also includes correlation between any one of the X's and any linear combination of any of the remaining X's. Obviously, correlations between pairs of X's are easily detected from the correlation matrix of $\mathbf{X}$. Correlations with linear functions of several X's are not always easily detected. One way to identify the possibility of an X being correlated with a linear combination of the other X's is to compute the regression of $\mathbf{X}_i$ on $\mathbf{X}^*$, where $\mathbf{X}^*$ is $\mathbf{X}$ with $\mathbf{X}_i$ removed. The multiple $R^2$ can be examined and used as an indication of multicolinearity. This procedure can be carried out for all of the $X_i$'s, $i = 2, \ldots, p$.
A summary of what has been indicated about the effect of multicolinearity is:

1. Multicolinearity in itself does not inhibit the predictive ability of a regression model provided
the prediction is made within the regions of the independent variables used in deriving the
regression coefficients.

2. Multicolinearity may contribute to an inflated variance in the estimated regression
coefficients. The sampling error of the coefficients may be large, resulting in individual coefficients
being nonsignificant even though the overall regression indicates that a definite linear
relationship exists between Y and X.

3. Individual regression coefficients may be hard to interpret in terms of their impact on Y. They
may even have the wrong sign. Thus, even though the overall equation makes a valid predic-
tion, the contributions of the individual X variables may not be decipherable.
4. The values for individual regression coefficients may change substantially upon the addition
or deletion of an X variable that involves multicolinearity.

Detection of Multicolinearity
Some general indications of the possible presence of multicolinearity that have been identi-
fied are:

1. Large correlations in the correlation matrix of X.


2. Regression coefficients that do not make good physical sense.

3. Nonsignificant regression coefficients on important variables.

4. Large changes in the values of regression coefficients upon the addition or deletion of a vari-
able from the regression equation.

Possibly the most commonly used formal method for detecting multicolinearity is through
the use of the Variance Inflation Factor, VIF, defined as

$$\mathrm{VIF} = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the multiple coefficient of determination between $X_i$ and all of the other X's in the
regression equation. When $R_i^2$ is zero, then $X_i$ is linearly independent of the other X's and the VIF
is one. If $R_i^2 = 1$, then the Var($\beta_i$) and the VIF are unbounded. Large values of VIF indicate the
presence of multicolinearity. The exact value of VIF at which multicolinearity is declared
depends on the individual investigator. Some use a value of 5 and others 10. A VIF of 10
corresponds to an $R_i^2$ of 0.90 and a VIF of 5 corresponds to an $R_i^2$ equal to 0.80.
Some will compute an average VIF over all p - 1 regression coefficients and declare that if
this average VIF is "considerably" larger than one, multicolinearity is indicated.
Some statistical packages will compute the VIF. Some statistical packages use an indicator
called the tolerance, which is 1/VIF. Thus, a VIF of 10 corresponds to a tolerance of 0.1 and a
VIF of 5 corresponds to a tolerance of 0.2.
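
The VIF and tolerance described above can be computed directly by regressing each X on the remaining X's. The following is a minimal sketch, not a routine from any particular statistical package. The function name `variance_inflation_factors` and the synthetic data are assumptions introduced for illustration only; the data are not those of table 10.2.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (n observations by m predictors, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    vif = np.empty(m)
    for i in range(m):
        xi = X[:, i]
        # Regress X_i on an intercept plus all of the other X's.
        A = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(A, xi, rcond=None)
        resid = xi - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((xi - xi.mean()) ** 2)
        # VIF = 1 / (1 - R_i^2); unbounded when R_i^2 = 1.
        vif[i] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return vif

# Synthetic data (hypothetical): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vif = variance_inflation_factors(np.column_stack([x1, x2, x3]))
tolerance = 1.0 / vif
print(vif, tolerance)
```

In this sketch the first two variables would be flagged under either the 5 or the 10 rule, while the third has a VIF close to one.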

AN APPLICATION OF MULTIPLE REGRESSION


The following illustration of using multiple regression is adapted from Haan and Read
(1970). A part of their study was devoted to developing a prediction equation for the mean annual
runoff for small watersheds in Kentucky. The data for the example are contained in table 10.2. The
number of observations (13) is very small and does not permit splitting the sample and using a
portion of the data for testing the resulting model. Table 10.3 contains definitions of the symbols
used in table 10.2. The correlation matrix for the independent variables is contained in table 10.4.
Table 10.2. Data from Haan and Read (1970)

Watershed No.   Runoff   Precipitation   A   S   L   P   di   Rs   F   Rr

Table 10.3. Definition of symbols used by Haan and Read (1970)

Runoff          Mean annual runoff (inches)
Precipitation   Mean annual precipitation (inches)
A               Area (square miles)
S               Average land slope (%)
L               Axial length (miles)
P               Perimeter (miles)
di              Diameter of largest circle that can be drawn entirely within the basin (miles)
Rs              Shape factor: ratio of di to do, where do is the diameter of the smallest circle
                that can be drawn which entirely encloses the basin (-)
F               Stream frequency: ratio of number of streams in basin to total area of basin
                (square miles)
Rr              Relief ratio: ratio of total relief to largest dimension of basin generally
                parallel to main stream (feet per mile)

Table 10.4. Correlation matrix for data of Haan and Read (1970)
Table 10.5. Regression analysis of data of Haan and Read (1970) (10 independent variables)

Analysis of Variance

Source                      Degrees of freedom   Sum of squares   Mean square

Regression
Residual
Total corrected for mean
R = 0.98
Std. Error = 0.69

Variable        b       s_b     t

Constant
Precipitation
A
S
L
P
di
Rs
F
Rr

Since the correlation matrix is symmetrical, it is customary to show only the diagonal elements
and the elements either above or below the diagonal.
The mean and standard deviation of runoff are 16.55 and 1.93 inches, respectively. Table
10.5 contains the results of the multiple regression of runoff on all 9 of the independent
variables. Because an intercept term was included, p is equal to 10. In the ANOVA table, the
sum of squares for the mean and the total sum of squares are not shown. Instead the total sum of
squares corrected for the mean is given. The F that is given is the calculated F for the overall
regression equation (from equation 10.18) used in testing the hypothesis that the regression does
not explain a significant amount of the variation in Y. Because $F_{0.95,9,3}$ is 8.81, this hypothesis is
rejected.
The lower part of table 10.5 contains the estimated regression coefficients, the standard
errors of the regression coefficients, and the calculated t (equation 10.17) used in testing $H_0$: $\beta_i = 0$.
The only b's with calculated t's greater than 2.0 are those based on precipitation, P, and Rr. If all of
the variables except these three and the intercept are eliminated at one time, the regression shown
in table 10.6 results. In going to the second regression, $R^2$ has been reduced from 0.97 to 0.91, the
F increased to 28.7, and the standard error has remained unchanged. All of the regression
coefficients with the exception of the intercept are now significantly different from zero at the one
percent level of significance since $t_{0.995,9}$ is 3.25.
Table 10.6. Regression analysis of data of Haan and Read (1970) (4 independent variables)

Analysis of Variance

Source        Degrees of freedom   Sum of squares   Mean square

Regression            3                40.64           13.55
Residual              9                 4.25            0.47
Total                12                44.89

R^2 = 0.91      R = 0.95
F = 28.7        Std. Error = 0.69

Variable          b        s_b       t

Constant        -9.65     4.440    -2.17
Precipitation    0.430    0.093     4.62
P                0.620    0.075     8.25
Rr               0.010    0.002     5.19

The t test used to test the hypothesis that $\beta_i = 0$ makes the test assuming that all of the
other $\beta$'s are still in the equation. Thus, when a decision is made to eliminate more than one
variable, the t's are unreliable and the F test using equation 10.19 should be used. This test
determines if several variables are simultaneously making a significant contribution to
explaining the variation in the dependent variable. As an illustration of the use of equation
10.19, the hypothesis that $\beta_A = \beta_S = \beta_L = \beta_{d_i} = \beta_{R_s} = \beta_F = 0$ will be tested. For this example
n = 13, p = 10, k = 6, $Q_2$ = 43.45, $Q_2^*$ = 40.64, and the residual sum of squares for the full
model is 1.44. The F calculated from equation 10.19 is 0.98. Since $F_{0.95,6,3}$ = 8.94, it is
concluded that the variables A, S, L, di, Rs, and F are not significant.
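
The arithmetic behind the calculated F can be checked with a few lines of code. The sketch below assumes that equation 10.19 has the usual form of a partial F test, the reduction in the regression sum of squares per eliminated variable divided by the residual mean square of the full model. That form is an assumption here, although it does reproduce the value of 0.98 quoted above. The function name `partial_f` is hypothetical.

```python
def partial_f(ss_reg_full, ss_reg_reduced, ss_resid_full, n, p, k):
    """F statistic for testing that k regression coefficients are simultaneously zero.

    Assumed form: [(Q2 - Q2*) / k] / [residual SS of full model / (n - p)].
    """
    numerator = (ss_reg_full - ss_reg_reduced) / k
    denominator = ss_resid_full / (n - p)
    return numerator / denominator

# Values quoted in the text for the Haan and Read (1970) example.
f_calc = partial_f(43.45, 40.64, 1.44, n=13, p=10, k=6)
f_crit = 8.94  # F(0.95; 6, 3) as quoted in the text
print(round(f_calc, 2), f_calc > f_crit)
```

Because the calculated value is far below the table value, the six variables can be dropped together, which is the conclusion reached in the text.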
The resulting prediction model is

Runoff = -9.65 + 0.43 Precipitation + 0.62 P + 0.010 Rr


The observed values of runoff and values predicted from the above equation are shown in
the lower half of table 10.6.
To demonstrate the behavior of s, $R^2$, and F, several regressions were run using various
combinations of the data in table 10.2. The results of these regressions are summarized in table
10.7 and figure 10.2. This table illustrates that $R^2$ never increases as variables are removed from
the equation, whereas s may decrease as some variables are removed and then increase as more
variables are removed. $R^2$ approaches unity as the number of variables increases. If the
number of variables were increased to 12, then p would be 13 (because the model has an intercept)
and $R^2$ would be unity. In figure 10.2 the lines connect the best values of the quantities s, $R^2$, and
F contained in table 10.7. This is because it is possible, for example, to have many combinations
of 3 variables in the regression equation and each combination would produce a different s, $R^2$,
and F.

Table 10.7. Some results of several regressions on s, R^2, and F

Variables included

Eq. No.   Precipitation   A   S   L   P   di   Rs   F   Rr   s   R^2   F

* = Variable included and was significant.
x = Variable included and was not significant.

Fig. 10.2. Behavior of s, R^2, and F as a function of n for data contained in table 10.2.

TRANSFORMING LINEAR MODELS


Many models are not naturally linear models but can be transformed to linear models. For
example

$$Y = \alpha X^{\beta} \qquad (10.31)$$

is not a linear model. It can be linearized by using a logarithmic transformation

$$\ln Y = \ln \alpha + \beta \ln X \qquad (10.32)$$

or

$$Y' = \alpha' + \beta' X' \qquad (10.33)$$

where

$$Y' = \ln Y, \quad \alpha' = \ln \alpha, \quad \beta' = \beta, \quad X' = \ln X \qquad (10.34)$$

Standard regression techniques can now be used to estimate $\alpha'$ and $\beta'$ for equation 10.33,
and $\alpha$ and $\beta$ can then be estimated from equations 10.34. Two important points should be noted.
First, the estimates of $\alpha$ and $\beta$ obtained in this way will be such that $\sum (Y_i' - \hat{Y}_i')^2$ is a
minimum and not such that $\sum (Y_i - \hat{Y}_i)^2$ is a minimum. Second, the error term on equation 10.33
is additive ($Y' = \alpha' + \beta' X' + \varepsilon'$), implying that it is multiplicative on equation 10.31
($Y = \alpha X^{\beta} \varepsilon$). These errors are related by $\varepsilon' = \ln \varepsilon$. The assumptions used in hypothesis
testing and confidence intervals must now be valid for $\varepsilon'$, and the tests and confidence intervals
must be made relative to the transformed model.
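
A short sketch of the logarithmic transformation follows. It fits the power model by ordinary least squares on the logarithms of X and Y and then transforms the intercept back. The function name `fit_power_law` and the noise-free data are assumptions made for illustration; with real data the estimates minimize the squared errors in the logarithms, not in Y itself.

```python
import math

def fit_power_law(x, y):
    """Fit Y = alpha * X**beta by least squares on ln Y = ln alpha + beta ln X."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_lx = sum(lx) / n
    mean_ly = sum(ly) / n
    sxy = sum((a - mean_lx) * (b - mean_ly) for a, b in zip(lx, ly))
    sxx = sum((a - mean_lx) ** 2 for a in lx)
    beta = sxy / sxx                               # slope of the transformed model
    alpha = math.exp(mean_ly - beta * mean_lx)     # back-transformed intercept
    return alpha, beta

# Hypothetical noise-free data generated from Y = 2 X^1.5.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.0 * v ** 1.5 for v in x]
alpha, beta = fit_power_law(x, y)
print(alpha, beta)
```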
In some situations the logarithmic transformation makes the data conform more closely to
the regression assumptions. For example, if the data plot as in figure 10.3, a logarithmic trans-
formation may make the assumption of constant variance on the error more realistic.
The normal equations for a logarithmic transformation are based on a constant percentage
error along the regression line, whereas the standard regression is based on a constant absolute
error along the regression line. For example, the difference between Yi = 200 and Yi = 100 on
an arithmetic scale is 100 times as large as the difference between Yi = 2 and Yi = 1. However,
on a logarithmic scale $\ln 200 - \ln 100 = 5.29832 - 4.60517 = 0.69315$, which is the same as
$\ln 2 - \ln 1 = 0.69315 - 0.00000 = 0.69315$. In a situation of this type, the standard regression
procedure would attempt to fit the point at $Y_i = 100$ in order to minimize $\sum (Y_i - \hat{Y}_i)^2$ at the
expense of the point $Y_i = 1$ because its contribution to $\sum (Y_i - \hat{Y}_i)^2$ is small. The
logarithmically transformed model would give equal percentage weight to both points.

Fig. 10.3. Example of the effect of a logarithmic transformation on the error variance.
The above discussion can be extended to the model

through the transformation

Other models and transformations are available. For example

$$Y = \alpha e^{\beta X}$$

can be transformed to

$$\ln Y = \ln \alpha + \beta X$$
Yevjevich (1972a) lists several possible transformations. Whatever the transformation, it
must be remembered that the principles of and assumptions regarding least squares apply to the
transformed model, not the original model.

INDICATOR VARIABLES IN REGRESSION


Consider a relationship between Y and X that may be a function of the year in which the data
were collected, as shown in figure 10.4.
Fig. 10.4. Use of a simple indicator variable.

If a single regression is performed using the model

$$Y = a + bX$$

the line labelled "overall" results. If two regressions are done, one on the 1991 data and one on
the 1992 data, the two individually labeled lines result. It is possible using indicator variables to
obtain the two individual lines with a single regression using the model

$$Y = a + bX + cI$$

where I is an indicator variable. Using this approach, the data would be coded such that I would
be 0 for one of the years (say 1991) and 1 for the other year. The resulting equation would then
effectively be

Y =a + bX for 1991
Y = (a + c) + bX for 1992

Thus, the slopes for the two regressions are the same, but the intercepts are a function of
year. The advantage of using the indicator variable is that all of the data are used to estimate a
common slope for the two lines. If two independent regressions were done, the slopes would
likely be different.
Indicator variables can be used to generate two lines having different slopes but a common
intercept using the model

$$Y = a + bX + cIX$$

or having different slopes and intercepts using the model

$$Y = a + bX + cI + dIX$$

In this latter case the result would be

Y =a + bX for 1991
Y = (a + c) + (b + d)X for 1992

Obviously, the use of indicators uses extra degrees of freedom and thus requires more data for
parameter estimation.
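
The two-line model with different slopes and intercepts can be fit as a single multiple regression by adding I and the product IX as extra columns. The sketch below uses hypothetical, noise-free data in which the 1991 line is Y = 1 + 2X and the 1992 line is Y = 3 + 2.5X; the variable names are assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical noise-free data: four points for each year.
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
ind = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])  # I = 0 for 1991, I = 1 for 1992
y = np.where(ind == 0, 1.0 + 2.0 * x, 3.0 + 2.5 * x)

# Columns: intercept, X, I, I*X, which corresponds to Y = a + bX + cI + dIX.
A = np.column_stack([np.ones_like(x), x, ind, ind * x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b, c, d = coef

line_1991 = (a, b)          # (intercept, slope) when I = 0
line_1992 = (a + c, b + d)  # (intercept, slope) when I = 1
print(line_1991, line_1992)
```

Dropping the IX column gives the common-slope model, and dropping the I column gives the common-intercept model.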
The use of indicator variables can be extended to produce three lines. For three equally
spaced lines having a common slope, the appropriate model is

$$Y = a + bX + cI$$

where values of -1, 0, and 1 are used for I. The resulting models are

Y = (a - c) + bX     for I = -1
Y = a + bX           for I = 0
Y = (a + c) + bX     for I = 1
Three unequally spaced lines can be generated using the model

$$Y = a + bX + cI_1 + dI_2 \qquad (10.38)$$

and assigning values to $I_1$ and $I_2$ as follows

           I1   I2   Line

Line 1     0    0    Y = a + bX
Line 2     0    1    Y = (a + d) + bX
Line 3     1    0    Y = (a + c) + bX

The resulting three equations are shown above.
Three lines with different slopes and intercepts can be generated from the model

$$Y = a + bX + cI_1 + dI_2 + eI_1X + fI_2X$$

using the same values for the indicator variables as above, with the result

Y =a + bX for line 1
Y = (a + d) + (b + f)X for line 2
Y = (a + c) + (b + e)X for line 3

Occasionally, it is desirable to fit a line through a set of data such that the line has a definite
break in its slope at some fixed point X = C. Figure 10.5 shows such a situation.

Fig. 10.5. Indicator variables and a change in slope.

A regression of the form

$$Y = a + bX + c(X - C)I$$

will accomplish this where I = 0 for $X \le C$ and I = 1 for $X > C$. The resulting equations are

Y = a + bX                  for $X \le C$
Y = (a - cC) + (b + c)X     for $X > C$
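
A change in slope at a known point can be sketched in the same way. The example below assumes the model Y = a + bX + c(X - C)I, which is the form implied by the two resulting equations, and uses hypothetical noise-free data with a break at C = 5.

```python
import numpy as np

C = 5.0                    # assumed known break point
x = np.arange(1.0, 11.0)   # X = 1, 2, ..., 10
# Hypothetical data: slope 1 up to C and slope 3 beyond it, continuous at X = C.
y = 2.0 + 1.0 * x + 2.0 * np.where(x > C, x - C, 0.0)

ind = (x > C).astype(float)                    # I = 0 for X <= C, I = 1 for X > C
A = np.column_stack([np.ones_like(x), x, (x - C) * ind])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b, c = coef

upper_intercept = a - c * C   # intercept of the line for X > C
upper_slope = b + c           # slope of the line for X > C
print(a, b, upper_intercept, upper_slope)
```

Because the extra column is zero at X = C, the two segments meet at the break point without any additional constraint.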
Three slopes can be accomplished using a model of the form

$$Y = a + bX + c(X - C_1)I_1 + d(X - C_2)I_2$$

where $C_1$ and $C_2$ ($C_1 < C_2$) are the values of X at which the slope changes and the indicator
variables have values given by

$I_1 = 0$ for $X < C_1$     $I_1 = 1$ for $X \ge C_1$
$I_2 = 0$ for $X < C_2$     $I_2 = 1$ for $X \ge C_2$

The resulting equations are

Y = a + bX                                      $X < C_1$
Y = (a - c$C_1$) + (b + c)X                     $C_1 \le X < C_2$
Y = (a - c$C_1$ - d$C_2$) + (b + c + d)X        $C_2 \le X$

Finally, a line with a change in slope and a jump or discontinuity as shown in figure 10.6 can
be estimated using the model

where $I_1$ and $I_2$ are 1 for $X > C$ and 0 for $X \le C$. The resulting equations are

Fig. 10.6. Discontinuity plus a change in slope.


GENERAL COMMENTS
Regression analysis should be regarded simply as a tool for exploiting linear tendencies that
may exist between a dependent variable and a set of independent variables. It is also a useful
device for estimating the parameters of a model that is linear or can be transformed to a linear model.
Any regression analysis should be preceded by a great deal of thought devoted to what
variables should be included in the analysis, how these variables might influence the dependent
variable, the correlations among the independent variables, and the ease of using a predictive
model based on the selected independent variables. The ready availability of digital computers
and library regression programs has led many to collect data with little thought, throw it into the
computer, and hope for a model. This temptation must be avoided.
Not infrequently, an investigator finds that a satisfactory regression equation cannot be
developed from the data at hand. This should not be surprising because if a relationship exists, it
may be much more complex than is indicated by a linear model. Commonly, factors that are
important in determining the behavior of a dependent variable are omitted from a regression
equation. In this case a good predictive model cannot be expected.
In some regression problems, it is possible to improve the model by including cross-product
terms (called interactions) by multiplying together two independent variables to form a new
variable. Thus, a variable $X_t$ may be defined as $X_r X_s$. Ratios may be used, such as $X_t = X_r/X_s$.
Powers of variables may improve the model, $X_t = X_r^n$ where n is a known constant. If any of these
procedures are used, care must be exercised to see that large correlations (see chapter 11) are not
built into X'X.
One may frequently know (or think they do, anyway) the factors affecting a particular
phenomenon. They cannot, however, easily measure these factors and are forced to use either another
related factor or a rough measure of the important factor. For instance, flood peaks depend among
other things on how rapidly the surface runoff reaches a particular point on a stream. This, in turn,
depends on surface flow characteristics such as the steepness and roughness of the flow surface
and the distance the flow must travel to a stream channel, plus the stream flow characteristics
such as roughness, hydraulic radius, slope, tortuosity, length, and so forth. All of the factors are
not linearly related to flood peaks and could not be included in the model if they were. Indices or
summaries are used, such as the average land slope and the average channel slope. It is hoped that
the real causative factors are correlated with these indices sufficiently to reflect this true impor-
tance and that the dependent variable is linearly related to the indices. These are large and
important assumptions. They point out that there is a limit to how well one can predict a depend-
ent variable with a regression model.
If it is at all possible, the first step in a regression analysis should be the development of the
form of the predictive model based on a rational analysis of the problem. Regression analysis can
then be used to develop the parameters of the model, test the importance of the variables
included, and develop confidence intervals for the predictions.

LOGISTIC REGRESSION
Frequently, one must be able to classify a variable into one of two possible classes. For
example, in looking at ground water for drinking one might want to class the water as acceptable
(Y = 1) or unacceptable (Y = 0). Based on a set of independent variables, it may be desired to
determine the probability, p, that water from a particular well is acceptable for drinking. This is
equivalent to determining prob (Y = 1).
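A regression model for classifying the binary variable Y as a 0 or 1 might be written

$$Y_i = \boldsymbol{\beta}'\mathbf{X}_i + \varepsilon_i \qquad (10.43)$$

The expected value of $Y_i$ is given by

$$E(Y_i) = \boldsymbol{\beta}'\mathbf{X}_i \qquad (10.44)$$

If $Y_i$ is a binary random variable such that

$$\mathrm{prob}(Y_i = 1) = p_i \quad \text{and} \quad \mathrm{prob}(Y_i = 0) = 1 - p_i \qquad (10.45)$$

then

$$E(Y_i) = 1(p_i) + 0(1 - p_i) = p_i \qquad (10.46)$$

so that

$$E(Y_i) = \boldsymbol{\beta}'\mathbf{X}_i = p_i \qquad (10.47)$$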

A major difficulty with this regression model is that the assumptions of ordinary least
squares regression are violated in that the error term is not iid N(0, $\sigma^2$). Neter et al. (1996) show
that the $\varepsilon_i$ are not normal since they can take on only values of $1 - \boldsymbol{\beta}'\mathbf{X}_i$ when $Y_i = 1$ and
$-\boldsymbol{\beta}'\mathbf{X}_i$ when $Y_i = 0$. They also show that the variance of $\varepsilon_i$ is
$(\boldsymbol{\beta}'\mathbf{X}_i)(1 - \boldsymbol{\beta}'\mathbf{X}_i)$, indicating a nonconstant variance.
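Experience has shown that often p or E(Y) is related to $\boldsymbol{\beta}'\mathbf{X}_i$ in a sigmoidal fashion as in
figure 10.7. Such a function can be expressed as

$$p = \frac{\exp(\boldsymbol{\beta}'\mathbf{X})}{1 + \exp(\boldsymbol{\beta}'\mathbf{X})} \qquad (10.48)$$

or equivalently

$$p = \frac{1}{1 + \exp(-\boldsymbol{\beta}'\mathbf{X})} \qquad (10.49)$$

Fig. 10.7. Typical logistics models.

Defining the odds, Od, as the ratio of the probability that Y = 1 to the probability that Y = 0,
one obtains

$$Od = \frac{p}{1 - p} = \exp(\boldsymbol{\beta}'\mathbf{X}) \qquad (10.50)$$

Letting p' = ln(Od) results in

$$p' = \boldsymbol{\beta}'\mathbf{X} \qquad (10.51)$$

The transformation to p' is sometimes called a logit transformation. As p goes from 0 to 1, p'
goes from $-\infty$ to $\infty$.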
Equation 10.51 provides an alternative to equation 10.43. Neter et al. (1996) present details
of maximum likelihood estimation of the $\beta$'s for equation 10.51. The procedure is known as
logistic regression with equation 10.51 being the logistic model. Some statistical packages
contain routines for carrying out the computations involved in logistic regression. The programs
result in estimates for the $\beta$'s and the standard error of the estimate for $\beta_i$, $s_{b_i}$.
A test of the hypothesis that $\beta_i = 0$ versus $\beta_i \ne 0$ is made by computing

$$z_c = \frac{b_i}{s_{b_i}} \qquad (10.52)$$

The hypothesis is rejected if $|z_c| > z_{1-\alpha/2}$ where $z_{1-\alpha/2}$ is the standard normal variate and $\alpha$ is
the level of significance. Most logistic regression computer programs provide estimates of $s_{b_i}$.
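
The z test on an individual coefficient is simple enough to verify by hand. The sketch below takes the coefficients and standard errors reported for example 10.3 in table 10.9 and reproduces the z values and two-sided probability levels shown there. The function name `coefficient_z_test` is hypothetical, and the test statistic is assumed to be the estimated coefficient divided by its standard error.

```python
import math

def coefficient_z_test(coef, std_err):
    """Return z = coef / std_err and the two-sided standard normal probability level."""
    z = coef / std_err
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF at |z|
    return z, 2.0 * (1.0 - cdf)

# Coefficients and standard errors from table 10.9.
z_l50, p_l50 = coefficient_z_test(3.489421, 1.355271)
z_dist, p_dist = coefficient_z_test(2.245529e-2, 0.1801355)
print(round(z_l50, 2), round(p_l50, 2))
print(round(z_dist, 2), round(p_dist, 2))
```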
The significance of the overall model is tested by formulating the hypothesis that $\boldsymbol{\beta} = 0$ and
using a chi-square test based on p - 1 degrees of freedom where p is the number of coefficients
estimated. The chi-square value is based on the likelihood ratio. Again, the calculated chi-square
value is generally provided by a logistic regression program. If the calculated chi-square exceeds
the table value, then the hypothesis that $\boldsymbol{\beta} = 0$ is rejected.
Neter et al. (1996) and Helsel and Hirsch (1992) discuss other aspects of logistic regression
including evaluating the overall ability of the model to correctly classify observed values of the
dependent variable.
The predicted value, $E(P_i)$, represents the predicted probability that $Y_i$ should be set equal
to 1. One classification scheme is to use the logistic regression model to estimate $E(P_i)$ and set
$Y_i = 0$ if $E(P_i) < 0.5$ and $Y_i = 1$ if $E(P_i) \ge 0.5$. Such a decision rule is appropriate if it is equally
likely that the actual outcome is 0 or 1 and it is desired to have equal probability of incorrectly
classifying the outcome.
A decision rule can also be arrived at in other ways. For instance, if some observations had
been set aside and not used to estimate the model, various decision rules might be evaluated and
the one selected that performs the best in classifying these holdout observations. The same
scheme could be used if there were no holdout observations by using the model to classify the
observations used to develop the model. Obviously, this latter procedure would not independ-
ently evaluate the model because the same observations are used to evaluate as were used to
develop the model.

Example 10.3. In a certain locality wetland areas are thought to be impacted by groundwater
pumpage. By examining a wetland, ecologists can determine if a wetland is impacted or not. By
looking at certain bio-indicators such as fungi lines on trees, the normal water level for a wetland
may be determined. Water level records can be used to estimate the median water level. The
distance to the nearest pumping water well is also known. It is desired to develop a model for
classifying a wetland as impacted (Y = 0) or not impacted (Y = 1) based on the difference in the
median water level and the normal water level, X2, and the distance to the nearest pumping well, X3.

Solution: A logistic regression model of the form of equation 10.49 is fit to the data shown in
table 10.8. The results of the logistic regression are shown in table 10.9. Table 10.9 shows that
the overall regression is significant ($\chi^2 = 40.14$) but that $\hat{\beta}_3$ on $X_3$ is not significant
($z_c = 0.022/0.180 = 0.12$). A second logistic regression was computed eliminating $X_3$ with the

Table 10.8. Data for example problem

Obs. No. L50 Impact Dist Predicted impact E (Y) Residual

0.004
0.005
0.007
0.007
0.014
0.015
0.016
0.019
0.021
0.022
0.025
0.029
0.037
0.038
0.045
0.054
0.062
0.082
0.207
-0.390
0.658
-0.304
-0.123
-0.028
-0.008
-0.003
-0.001
-0.001
0.000

Table 10.9. First attempt logistic regression report

Parameter Estimation Section

             Regression          Standard     z           Prob
Variable     coefficient         error        Beta = 0    level

Intercept    7.014575            2.638804     2.66        0.00
L50          3.489421            1.355271     2.57        0.01
Dist         2.245529 x 10^-2    0.1801355    0.12        0.90

Model in transformation form
7.014575 + 3.489421*L50 + 2.245529 x 10^-2*Dist
Note that this is XB. Prob(Y = 1) is 1/(1 + Exp(-XB)).

Model Summary Section

Model    Model    Model         Model
R^2      D.F.*    chi-square    prob

Classification Table
                            Predicted
Actual                      0         1         Total

0       Count               16.00     2.00      18.00
        Row percent         88.89     11.11     100.00
        Column percent      88.89     9.52      46.15
1       Count               2.00      19.00     21.00
        Row percent         9.52      90.48     100.00
        Column percent      11.11     90.48     53.85
Total   Count               18.00     21.00     39.00
        Row percent         46.15     53.85
        Column percent      100.00    100.00
Percent correctly classified = 89.74.
*D.F. = degrees of freedom.
results shown in table 10.10. The overall regression is significant ($\chi^2 = 40.12$) and both $\hat{\beta}_1$
and $\hat{\beta}_2$ are significantly different from zero. The resulting model is

$$E(Y) = \frac{1}{1 + \exp[-(7.09 + 3.44X_2)]} \qquad (10.53)$$

Table 10.10. Second attempt logistic regression report

Parameter Estimation Section

             Regression     Standard     z           Prob
Variable     coefficient    error        Beta = 0    level

Intercept    7.094155       2.553149     2.78        0.005460
L50          3.443749       1.283279     2.68        0.007284

Model in transformation form
7.094155 + 3.443749*L50
Note that this is XB. Prob(Y = 1) is 1/(1 + Exp(-XB)).

Model Summary Section

Model    Model    Model         Model
R^2      D.F.     chi-square    prob

Classification Table
                            Predicted
Actual                      0         1         Total

0       Count
        Row percent
        Column percent
1       Count
        Row percent
        Column percent
Total   Count
        Row percent
        Column percent
Percent correctly classified = 89.74.

Misclassified Rows Section

        Actual    Predicted
Row     group     group        Score    Residual

Fig. 10.8. Logistics regression model for classifying wetlands.

This model correctly classified 35 of the 39 observations. Figure 10.8 shows the data and the
resulting model. Using a probability level of 0.5, E(Y) = 0.5, for classification, the value of $X_2$
that the model equation 10.53 specifies as the division between impacted and unimpacted
wetlands is

$$X_2 = \frac{-7.09}{3.44} = -2.06 \text{ feet}$$

If an E(Y) = 0.78 is used as a cutoff for impact evaluation, only 2 of the wetlands would be
misclassified. The true test of the model would be how it performs on an independent data set.
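
The dividing value of the water level difference for any cutoff probability follows from inverting the fitted logistic model. The sketch below uses the intercept and L50 coefficient from table 10.10 with the relation Prob(Y = 1) = 1/(1 + Exp(-XB)) given in that table. The function names `prob_y1` and `boundary` are hypothetical.

```python
import math

B0, B1 = 7.094155, 3.443749   # intercept and L50 coefficient from table 10.10

def prob_y1(x2):
    """Predicted probability that Y = 1 for a water level difference x2 (feet)."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * x2)))

def boundary(cutoff):
    """Value of x2 at which the predicted probability equals the cutoff."""
    logit = math.log(cutoff / (1.0 - cutoff))   # ln(odds) at the cutoff
    return (logit - B0) / B1

x_half = boundary(0.5)     # division point for a 0.5 cutoff
x_078 = boundary(0.78)     # division point for the 0.78 cutoff mentioned in the text
percent_correct = 100.0 * 35 / 39
print(round(x_half, 2), round(percent_correct, 2))
```

At a cutoff of 0.5 the logit is zero, so the boundary reduces to the ratio of the intercept to the slope with the sign changed, which is the calculation shown in the text.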

Exercises

10.1. Use the matrix methods of this chapter to work example 9.1.

10.2. Compute R for example 10.1.

10.3. Use the matrix methods of this chapter to work example 9.4.

10.4. Use the matrix methods of this chapter to work example 9.5. Calculate the confidence
interval for the point X equals 50.0 inches of rainfall.

10.5. If X is an $n \times p$ matrix of n observations on p variables, and Z is the $n \times p$ matrix of
deviations of the variables from their means, what is contained in the matrix $Z'Z/(n - 1)$?
($Z = [z_{ij}] = [x_{ij} - \bar{x}_j]$)

10.6. Use the data in table 10.2 to develop a prediction equation for annual runoff using the
model $Y = aX_1^{b_1}X_2^{b_2} \cdots X_p^{b_p}$. Would you prefer this equation over the one contained in table
10.6? Compare the equations in terms of the confidence interval on the regression lines at the
mean values of the variables contained in the respective equations.

10.7. Show that $\sum (Y_i - \hat{Y}_i)^2 = \mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y}$.

10.8. Derive the normal equations that minimize $\sum (Y_i - \hat{Y}_i)^2$ for the model $Y = aX^b$. Suggest
a method for solving these equations.

10.9. The relationship between stage and discharge (rating curve) for many streams has been
found to follow an equation of the type $Q = aS^b$ where Q is the discharge and S is the stage.
Using the following data from the Cumberland River at Cumberland Falls, Kentucky, derive such
a rating curve. Test the hypothesis that b = 1.5.

10.10. The data in table 10.11 is a partial listing of the data used by Benson (1964) in a study of
floods in the Southwest. Derive a prediction equation for Q,, the mean annual flood, in terms of
the remaining variables. Consider both the models given by equation 10.1 and by the multiple
regression extension of equation 10.24.

Table 10.11. Independent variables, by station, in rain-flood area


A, contributing drainage area in square miles.
S, main-channel slope (85 to 10 percent points), in feet per mile.
St, percentage of area in lakes and ponds, increased by 1 percent.
E, altitude index (mean of 85 and 10 percent points), in feet above mean sea level.
L, basin length (total length of main channel), in miles.
H, basin rise (elevation difference between 85 and 10 percent points), in feet.
P, mean annual precipitation, in inches.
I, 10-year, 24-hour rainfall intensity in inches.
R, ratio of runoff to precipitation during months when annual peak discharges occur.
R,, mean annual runoff, in inches.
Q,, mean annual flood, cfs.
11. Correlation
IN CHAPTER 3 the population correlation coefficient between two random variables X and
Y was defined in terms of the covariance of X and Y and the variances of X and Y as

$$\rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y} \qquad (11.1)$$

The sample estimate $r_{X,Y}$ for $\rho_{X,Y}$ is similarly given by

$$r_{X,Y} = \frac{s_{X,Y}}{s_X s_Y} \qquad (11.2)$$

where $s_{X,Y}$ is the sample covariance between X and Y, and $s_X$ and $s_Y$ are the sample standard
deviations of X and Y, respectively. Figure 3.5 and the accompanying description discussed some
typical values for $r_{X,Y}$ and their meaning. Here it was emphasized that 1) $r_{X,Y}$ can range from -1
to 1; 2) $r_{X,Y} = \pm 1$ implies a perfect linear relationship between X and Y; 3) $r_{X,Y} = 0$ implies
linear independence but leaves room for other types of dependence; and 4) if X and Y are
independent, then $r_{X,Y} = 0$.
In chapters 9 and 10 the concept of correlation was extended to give a measure of the
strength of the linear relationship between a random variable Y and a second variable which was
a linear function of one or more X variables, each of which may or may not be a random variable.
Throughout the text many of the results that have been developed have included the
assumption that the random variables were independent or that the sample being analyzed was
composed of random observations. A random observation simply means that every possible
element in the sample space has an equal chance of being selected during any trial.
Random variables may be either uncorrelated ($r_{X,Y} = 0$) or correlated ($r_{X,Y} \ne 0$). Even when
sampling from uncorrelated populations, it would be rare for the sample correlation coefficient to
be exactly zero. More likely it will deviate from zero due to chance. Thus, statistical tests are
needed to evaluate whether the deviation of the sample correlation coefficient from zero may be
ascribed to chance or whether the deviation is too large to attribute to chance.
If successive observations in a time series of hydrologic data are correlated, this must be
taken into account in any inferences made about the data or in attempts to model the process that
produced the data. Again, a procedure is required for determining if the sampled elements from
a time series can be considered as random. These and other properties of correlation are the sub-
ject of this chapter.

INFERENCES ABOUT POPULATION CORRELATION COEFFICIENTS


Situations frequently arise where it is desired to test $H_0$: $r_{X,Y} = 0$ or $H_0$: $r_{X,Y} = r^*$ where $r^*$
is known. These and other tests about the population correlation coefficient will be discussed in
this section. For a more detailed treatment, reference can be made to Graybill (1961).
As in the case of all hypothesis tests, certain assumptions are needed. In this section the
assumption is made that X and Y are random variables from a bivariate normal distribution. The
population correlation coefficient is given by p and the sample estimate of p given by r is based
on a random sample.
If $\rho = 0$, then the quantity

$$t = r\left[\frac{n - 2}{1 - r^2}\right]^{1/2} \qquad (11.3)$$

has a t distribution with n - 2 degrees of freedom, where n is the sample size. Thus, to test
$H_0$: $\rho = 0$, the test statistic is calculated from equation 11.3 and $H_0$ is rejected if $|t| > t_{1-\alpha/2,\,n-2}$.
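
The test of zero correlation can be sketched as follows. The statistic is assumed to take the standard form $t = r[(n-2)/(1-r^2)]^{1/2}$ with n - 2 degrees of freedom. The values r = 0.5 and n = 30 are hypothetical, and the critical value is the tabled two-sided 5 percent point of the t distribution with 28 degrees of freedom.

```python
import math

def correlation_t(r, n):
    """t statistic for testing H0: rho = 0, referred to t with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

t_calc = correlation_t(0.5, 30)   # hypothetical sample correlation and sample size
t_crit = 2.048                    # tabled t(0.975; 28)
reject = abs(t_calc) > t_crit
print(round(t_calc, 3), reject)
```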
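If n is moderately large (n > 25), then the quantity W is approximately normally distributed
with mean $\omega$ and variance $(n - 3)^{-1}$ where

$$W = \frac{1}{2} \ln\left[\frac{1 + r}{1 - r}\right] = \operatorname{arctanh} r \qquad (11.4)$$

and

$$\omega = \frac{1}{2} \ln\left[\frac{1 + \rho}{1 - \rho}\right] = \operatorname{arctanh} \rho \qquad (11.5)$$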

To test the hypothesis $H_0$: $\rho = \rho^*$ against the alternative $H_a$: $\rho \ne \rho^*$ for $\rho^*$, a known
constant, the quantity

$$z = (\operatorname{arctanh} r - \operatorname{arctanh} \rho^*)(n - 3)^{1/2} \qquad (11.6)$$

can be considered to be normally distributed with a mean of zero and a variance of one. If
$|z| > z_{1-\alpha/2}$ (Z is the standard normal variable), $H_0$ is rejected.
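
A sketch of this test, and of the confidence limits that follow from the same normal approximation, is given below. It assumes the transformation W = arctanh r with variance 1/(n - 3) and builds the limits by transforming the interval on W back with the hyperbolic tangent. The function names and the values r = 0.6, n = 30, and $\rho^*$ = 0.5 are hypothetical.

```python
import math

def fisher_z_statistic(r, n, rho_star):
    """Standard normal statistic for testing H0: rho = rho_star."""
    return (math.atanh(r) - math.atanh(rho_star)) * math.sqrt(n - 3)

def fisher_limits(r, n, z_crit=1.96):
    """Approximate confidence limits on rho; z_crit = 1.96 gives 95% limits."""
    w = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(w - half_width), math.tanh(w + half_width)

z_calc = fisher_z_statistic(0.6, 30, 0.5)
lower, upper = fisher_limits(0.6, 30)
print(round(z_calc, 3), round(lower, 3), round(upper, 3))
```

The limits are not symmetric about r because the interval is symmetric in W, not in the correlation itself.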
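Confidence limits on $\rho$ can be estimated from

$$\tanh\left[W - \frac{z_{1-\alpha/2}}{(n - 3)^{1/2}}\right] < \rho < \tanh\left[W + \frac{z_{1-\alpha/2}}{(n - 3)^{1/2}}\right] \qquad (11.7)$$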

Consider k bivariate normal populations having population correlation coefficients of
$\rho_1, \rho_2, \ldots, \rho_k$ and sample correlation coefficients of $r_1, r_2, \ldots, r_k$ based on samples of size
$n_1, n_2, \ldots, n_k$. Then the hypothesis $H_0$: $\rho_1 = \rho_2 = \cdots = \rho_k = \rho^*$ for $\rho^*$, a known constant, is
tested by noting that

$$\chi^2 = \sum_{i=1}^{k} (\operatorname{arctanh} r_i - \operatorname{arctanh} \rho^*)^2 (n_i - 3) \qquad (11.8)$$

has a chi-square distribution with k degrees of freedom. $H_0$ is rejected if $\chi^2 > \chi^2_{1-\alpha,k}$. Rejection
of the hypothesis implies that at least one of the $\rho_i$'s is not equal to $\rho^*$.
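The hypothesis $H_0: \rho_1 = \rho_2 = \cdots = \rho_k$ (all correlation coefficients are equal) is tested by
noting that

$$\chi^2 = \sum_{i=1}^{k} (n_i - 3)(W_i - \bar{W})^2 \tag{11.9}$$

has a chi-square distribution with k - 1 degrees of freedom. In equation 11.9, $W_i$ is given by
equation 11.4 as $W_i = \operatorname{arctanh} r_i$ and

$$\bar{W} = \frac{\sum_{i=1}^{k} (n_i - 3) W_i}{\sum_{i=1}^{k} (n_i - 3)} \tag{11.10}$$

$H_0$ is rejected if $\chi^2 > \chi^2_{1-\alpha,k-1}$. Rejection of this hypothesis implies that at least one of the $\rho_i$'s is
not equal to the other $\rho_j$'s for $i \ne j$.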
If the hypothesis that all of the correlation coefficients are equal is not rejected, it may be
desirable to calculate a "best" combined estimate $\bar{r}$ of the common correlation $\rho$ ("best" means
weighted with inverse variance). Such an estimate is given by

where $\bar{W}$ is given by equation 11.10


and

Example 11.1. Burges and Johnson (1973) present the following sample correlation coefficients
for monthly flow volumes for the Sauk River in Washington and Arroyo Seco in California. In the
following table $r_j$ represents the sample correlation coefficient between the monthly flow
volumes in months j and j - 1. Assume the coefficients are based on 30 observations each and
that the parent populations are all bivariate normal (Burges and Johnson actually used the
lognormal distribution in their study). 1) Test the hypothesis that $\rho_8$ for the Sauk River is equal to
0.50. 2) Compute the 95% confidence limits for $\rho_8$ of the Sauk River. 3) Test the hypothesis that
$\rho_5$ on Arroyo Seco is zero. 4) Test the hypothesis that on each of the streams all of the monthly
correlation coefficients are equal. 5) Assume the hypothesis in part 4 is accepted for the Sauk
River and estimate an average correlation coefficient for the Sauk River.
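Month j        Sauk River    Arroyo Seco

October           0.61          0.00
November          0.58          0.00
December          0.50          0.00
January           0.31          0.45
February          0.38          0.21
March             0.37          0.70
April             0.44          0.60
May               0.34          0.75
June              0.17          0.98
July              0.65          0.97
August            0.93          0.96
September         0.51          0.00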

Solution:
1) $H_0: \rho_8 = 0.5$ for Sauk River
From equation 11.6

$$z = (W - \omega)(n - 3)^{1/2}$$

where

W = arctanh r = arctanh (.34) = .35409

$\omega$ = arctanh $\rho$ = arctanh (.50) = .54931

so that

$$z = (.35409 - .54931)(30 - 3)^{1/2} = -1.014$$

Since |z| < 1.96, we cannot reject $H_0: \rho_8 = 0.5$ for the Sauk River.

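2) The 95% confidence limits on $\rho_8$ for the Sauk River are calculated from equation 11.7 as

$$\tanh\left[.35409 - \frac{1.96}{(27)^{1/2}}\right] < \rho_8 < \tanh\left[.35409 + \frac{1.96}{(27)^{1/2}}\right] \quad\text{or}\quad -0.023 < \rho_8 < 0.624$$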
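3) $H_0: \rho_5 = 0$ for Arroyo Seco is tested by using equation 11.3.

$$t = 0.21\left[\frac{30 - 2}{1 - (0.21)^2}\right]^{1/2} = 1.137 \qquad t_{0.975,28} = 2.048$$

Therefore we cannot reject $H_0: \rho_5 = 0$.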

4) The test, $H_0$: all $\rho_j$ are equal, is tested by using equation 11.9.

             Sauk River                    Arroyo Seco

 i      r_i     W_i = arctanh r_i      r_i     W_i = arctanh r_i

 1      0.61    0.71                   0.00    0.00
 2      0.58    0.66                   0.00    0.00
 3      0.50    0.55                   0.00    0.00
 4      0.31    0.32                   0.45    0.49
 5      0.38    0.40                   0.21    0.21
 6      0.37    0.39                   0.70    0.87
 7      0.44    0.47                   0.60    0.69
 8      0.34    0.35                   0.75    0.97
 9      0.17    0.17                   0.98    2.30
10      0.65    0.78                   0.97    2.09
11      0.93    1.66                   0.96    1.95
12      0.51    0.56                   0.00    0.00

Sauk River $\bar{W}$ = 0.585
Arroyo Seco $\bar{W}$ = 0.798

Sauk River $\chi^2 = 27[5.707 - 12(.585)^2] = 43.208$

Arroyo Seco $\chi^2 = 27[15.919 - 12(.798)^2] = 223.489$

Both values exceed $\chi^2_{0.95,11} = 19.68$ (taking $\alpha = 0.05$). Therefore $H_0$ is rejected for both the rivers.
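
The equal-correlation test of part 4 can be checked numerically. The sketch below is an assumption-laden illustration: the function name is hypothetical, the statistic is taken as $\sum (n_i - 3)(W_i - \bar{W})^2$ with $\bar{W}$ the weighted mean of the $W_i$, and the critical value comes from a chi-square table. Because the code uses unrounded arctanh values, its result differs slightly from the hand calculation based on two-decimal $W_i$.

```python
import math


def equal_correlation_chi_square(r_values, n_values):
    """Return (chi-square statistic, weighted mean W) for H0: all rho_i equal."""
    weights = [n - 3 for n in n_values]
    w = [math.atanh(r) for r in r_values]
    w_bar = sum(wt * wi for wt, wi in zip(weights, w)) / sum(weights)
    chi2 = sum(wt * (wi - w_bar) ** 2 for wt, wi in zip(weights, w))
    return chi2, w_bar


# Sauk River sample correlations listed in the solution table, 30 observations each
sauk_r = [0.61, 0.58, 0.50, 0.31, 0.38, 0.37, 0.44, 0.34, 0.17, 0.65, 0.93, 0.51]
sauk_chi2, sauk_w_bar = equal_correlation_chi_square(sauk_r, [30] * 12)
chi2_critical = 19.68  # chi-square(0.95, 11 df), taken from a standard table
sauk_reject = sauk_chi2 > chi2_critical
```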

5) An average correlation coefficient for the Sauk River is calculated from equation 11.11.

where

$\bar{W} = 0.585$

$$m = \frac{\sum_{i=1}^{k}\left[(n_i - 3)/(n_i - 1)\right]}{\sum_{i=1}^{k}(n_i - 3)} = \frac{27/29}{27} = \frac{1}{29}$$

Comment: In parts 4 and 5 of this problem several simplifications were made in the summations
since ni was equal to 30 for all i; in general this cannot be done. In part 5 an overall average
correlation coefficient was calculated. Since in part 4 it was shown that the correlations for the
various months are significantly different, the utility of an overall average correlation is suspect.

Graybill (1961) presents the exact probability distribution of r and states that for small
samples, the exact distribution should be used in hypothesis testing. References to tables that aid
in hypothesis testing for small samples and examples of their use are also given.
Again, it is emphasized that the above tests are based on a random sample from multivariate
normal distributions. Even under these conditions, only the test of $H_0: \rho = 0$ conducted using
equation 11.3 is "exact". The other tests are approximate with the approximation improving as
the sample size increases.
For non-normal populations, it may be possible to transform the variables to normality
and then apply the above tests to the transformed data. If a transformation of a non-normal
random variable is not possible or not desired, then the above tests must be considered as
approximate with the approximation becoming poorer as the coefficient of skew of the random
variables increases.

SERIAL CORRELATION
It is not uncommon to find in a time series of hydrologic data that an observation at one time
period is correlated with the observation in the preceding time period. Such correlation is termed
serial correlation or autocorrelation. By definition, the elements of a sample of data possessing
serial correlation are not random elements. A serially correlated sample of size n contains less
information about a process than a completely random sample of size n. In a serially correlated
sample, part of the information contained in each observation is already known through its
correlation with the preceding observation.
Such correlation can also exist between an observation at one time period and an
observation k time periods earlier for k = 1, 2, . . . In this discussion of serial correlation, it is
assumed that observations are equally spaced in time and that the statistical properties of the
process do not change with time (stationary process). The population serial correlation coefficient
is denoted by $\rho(k)$ (and frequently called the autocorrelation coefficient) where k is the lag
or number of time intervals between the observations being considered. The sample serial
correlation coefficient will be given by r(k). The sample serial correlation coefficient for a sample of
size n is given by

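$$r(k) = \frac{\displaystyle\sum_{i=1}^{n-k} x_i x_{i+k} - \frac{1}{n-k}\sum_{i=1}^{n-k} x_i \sum_{i=1}^{n-k} x_{i+k}}{\left[\displaystyle\sum_{i=1}^{n-k} x_i^2 - \frac{1}{n-k}\left(\sum_{i=1}^{n-k} x_i\right)^2\right]^{1/2}\left[\displaystyle\sum_{i=1}^{n-k} x_{i+k}^2 - \frac{1}{n-k}\left(\sum_{i=1}^{n-k} x_{i+k}\right)^2\right]^{1/2}} \tag{11.12}$$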

From equation 11.12 it is seen that r(0) is unity. That is, the correlation of an observation
with itself is 1. Equation 11.12 also demonstrates that as k increases, the number of pairs of
observations used in estimating r(k) decreases because all of the summations contain n - k
terms. Serial correlation should only be estimated for k considerably less than n.
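
The lag-k sample serial correlation can be sketched as the ordinary correlation between the first n - k values and the last n - k values of the series, so that every sum runs over n - k terms as described above. The function name and the short flow series are hypothetical and serve only to exercise the code.

```python
import math


def serial_correlation(x, k):
    """Lag-k serial correlation from the n - k overlapping pairs (x_i, x_{i+k})."""
    m = len(x) - k  # number of pairs available at lag k
    if k < 0 or m < 2:
        raise ValueError("k must be non-negative and leave at least two pairs")
    head = x[:m]   # x_1 ... x_{n-k}
    tail = x[k:]   # x_{1+k} ... x_n
    sum_head, sum_tail = sum(head), sum(tail)
    numerator = sum(a * b for a, b in zip(head, tail)) - sum_head * sum_tail / m
    ss_head = sum(a * a for a in head) - sum_head ** 2 / m
    ss_tail = sum(b * b for b in tail) - sum_tail ** 2 / m
    return numerator / math.sqrt(ss_head * ss_tail)


flows = [12.0, 15.0, 14.0, 9.0, 11.0, 16.0, 18.0, 13.0, 10.0, 12.0]  # hypothetical annual flows
r_lag0 = serial_correlation(flows, 0)
r_lag1 = serial_correlation(flows, 1)
```

A constant series has zero variance and would cause a division by zero; a production routine would need to guard against that case.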
If $\rho(k) = 0$ for all $k \ne 0$, the process is said to be a purely random process. This indicates that
all of the observations in a sample will be independent of each other. Chapter 14, Yevjevich
(1972b), Matalas (1966, 1967b), Julian (1967) and others treat hydrologic time series in more detail.
Anderson (1942) has proposed a test of significance for the serial correlation coefficient for
a circular, normal, stationary time series. A circular series is one that closes on itself so that $x_n$ is
followed by $x_1$. Under these assumptions

$$r(k) = \frac{\displaystyle\sum_{i=1}^{n} x_i x_{i+k} - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2} \tag{11.13}$$

where $x_{n+j} = x_j$.
Although the assumption of a circular series is unrealistic, values of r(k) from equation
11.13 will not differ greatly from those calculated from equation 11.12 if n is large in comparison
to k. Under these conditions r(k) will be approximately normally distributed with mean
$-1/(n - 1)$ and variance $(n - 2)/(n - 1)^2$ if $\rho(k) = 0$. The confidence limits on $\rho(k)$ are then
estimated by

$$\frac{-1 - z_{1-\alpha/2}(n - 2)^{1/2}}{n - 1} \quad\text{and}\quad \frac{-1 + z_{1-\alpha/2}(n - 2)^{1/2}}{n - 1} \tag{11.14}$$

If the calculated r(k) falls outside these confidence limits, the hypothesis that $\rho(k)$ is zero
[$H_0: \rho(k) = 0$ versus $H_a: \rho(k) \ne 0$] is rejected.
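
The circular estimate and its limits can be sketched as follows. Both function names are hypothetical. The code assumes the usual circular form, in which the index wraps so that $x_{n+j} = x_j$, and limits of mean $-1/(n-1)$ plus or minus $z_{1-\alpha/2}$ standard deviations $(n-2)^{1/2}/(n-1)$. The sample size n = 18 is an assumption, chosen because it reproduces the limits quoted in example 11.2.

```python
import math
from statistics import NormalDist


def circular_serial_correlation(x, k):
    """Lag-k serial correlation for a series treated as circular (x_{n+j} = x_j)."""
    n = len(x)
    total = sum(x)
    cross = sum(x[i] * x[(i + k) % n] for i in range(n))  # index wraps around
    squares = sum(v * v for v in x)
    return (cross - total ** 2 / n) / (squares - total ** 2 / n)


def anderson_limits(n, alpha=0.05):
    """Approximate limits for r(k) under the hypothesis rho(k) = 0."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    center = -1.0 / (n - 1)
    half_width = z_crit * math.sqrt(n - 2) / (n - 1)
    return center - half_width, center + half_width


limit_low, limit_high = anderson_limits(18)
```

An observed r(k) inside (limit_low, limit_high) does not lead to rejection of the hypothesis that $\rho(k)$ is zero.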

Example 11.2. Frequently, in the analysis of runoff volumes, one finds there is significant serial
correlation caused by storages on the watershed. Appendix C contains a listing of the monthly
and annual runoff volumes for Cave Creek near Lexington, Kentucky. Test the hypothesis that
$\rho(1) = 0$ for the annual runoff volumes.

Solution: This solution assumes $\alpha = 0.05$ and is based on equation 11.14, and therefore assumes
that the annual runoff is normally distributed and is a stationary time series. Furthermore, $\rho(1)$ is
estimated from equation 11.13 assuming that the series is circular [in this case this is equivalent
to assuming $x_{n+1} = x_1$ in calculating r(1)].

Since -0.520 < r(1) < 0.402, $H_0: \rho(1) = 0$ is not rejected.
Comment: From the width of the confidence interval, it is apparent that the above test is not very
powerful for small samples. A sample of around 400 observations would be required to reject $H_0$:
$\rho(k) = 0$ if r(k) = 0.1.
Matalas (1967b) has suggested that for hydrologic data r(1) tends to be greater than zero due
to persistence caused by storage. If r(1) is found to be less than zero, it is in many cases difficult
to explain hydrologically. In this case one might take r(1) as equal to zero.
Matalas and Langbein (1962) state that in an autocorrelated series, each observation
represents part of the information contained in the previous observation. They discuss stationary
time series having $r(1) \ne 0$ and r(i) = 0 for i = 2, 3, . . . They state that n observations of a
nonrandom series having r(1) > 0 give only as much information (measured in terms of a
variance) about the mean as some lesser number, $n_e$, of observations in a purely random time
series.
This lesser number of observations is called the effective number of observations and is
given by

$$n_e = \frac{n}{\dfrac{1 + r(1)}{1 - r(1)} - \dfrac{2\,r(1)\left[1 - r(1)^n\right]}{n\left[1 - r(1)\right]^2}} \tag{11.15}$$

If r(1) = 0, then $n_e = n$. If r(1) > 0, then $n_e < n$. Equation 11.15 is expressed graphically as
figure 11.1. As an example, a 50-year record for which r(1) = 0.2 contains only as much
information about the mean as a 33-year record with r(1) = 0. Note that if n is large or r(1)
small, the second term in the denominator of equation 11.15 can be neglected with little loss
in accuracy.
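
The effective number of observations can be evaluated directly. The sketch below assumes the Matalas and Langbein form $n_e = n / \{[1 + r(1)]/[1 - r(1)] - 2r(1)[1 - r(1)^n]/(n[1 - r(1)]^2)\}$; the function name is hypothetical. It reproduces the 50-year example quoted above.

```python
def effective_sample_size(n: int, r1: float) -> float:
    """Effective number of independent observations for lag-one correlation r1."""
    if not -1.0 < r1 < 1.0:
        raise ValueError("r1 must lie strictly between -1 and 1")
    first_term = (1.0 + r1) / (1.0 - r1)
    second_term = 2.0 * r1 * (1.0 - r1 ** n) / (n * (1.0 - r1) ** 2)
    return n / (first_term - second_term)


n_effective = effective_sample_size(50, 0.2)    # about 33.6 years
n_uncorrelated = effective_sample_size(50, 0.0)  # no loss of information
```

Dropping the second term gives 50/1.5 = 33.3, which illustrates how little that term matters when n is large or r(1) is small.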

Fig. 11.1. Relation between n and $n_e$ for various values of $\rho(1)$ (after Matalas and Langbein 1962).

CORRELATION AND REGIONAL ANALYSIS
Matalas and Langbein (1962), Yevjevich (1972a), Alexander (1954), and others demonstrate
that the information relative to estimating the regional mean contained in data from n stations
in a region having an average interstation correlation of $\bar{\rho}$ is equivalent to the information
contained in n' uncorrelated stations in the region where n' is given by

$$n' = \frac{n}{1 + \bar{\rho}(n - 1)} \tag{11.16}$$

As n gets large, n' approaches $1/\bar{\rho}$. For a $\bar{\rho}$ of 0.2, the maximum information about the
regional mean contained in n stations could not exceed the information contained in 5 uncorrelated
stations.
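
The limiting behavior is easy to see numerically. This sketch assumes the form $n' = n/[1 + \bar{\rho}(n - 1)]$; the function name is hypothetical.

```python
def equivalent_independent_stations(n: int, rho_bar: float) -> float:
    """Number of uncorrelated stations carrying the same information about the regional mean."""
    return n / (1.0 + rho_bar * (n - 1))


stations_10 = equivalent_independent_stations(10, 0.2)        # 10 / 2.8
stations_many = equivalent_independent_stations(10000, 0.2)   # close to 1 / 0.2 = 5
```

Even a very dense network with an average interstation correlation of 0.2 carries less information about the regional mean than five uncorrelated stations.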
From a consideration of equation 11.16, it seems it would be logical to establish relatively
few independent hydrologic stations in a region rather than several correlated stations. However,
by the very concept of a hydrologic region, the hydrologic characteristics may be correlated.
Correlation within a region can be exploited to yield improved estimates of a particular
hydrologic variable at a point through correlation with another hydrologic variable at that point
or a similar characteristic at another point. For instance, let Y and X represent two random
hydrologic variables having no serial correlation for which $n_1$ and $n_1 + n_2$ observations, respectively,
are available. Also consider that Y and X are correlated with a correlation coefficient of
$\rho_{yx}$. Now, the record on Y can be extended by using the correlation between Y and X. This
relation is merely a simple regression considering Y as the dependent and X the independent
variable. The relation is developed based on the $n_1$ common observations. From equation 9.15 it can
be shown that the regression between Y and X is given by

$$y = r_{yx}\,\frac{s_y}{s_x}\,x \tag{11.17}$$

where $r_{yx}$ is the estimate for $\rho_{yx}$ and y and x are deviations from their respective means. Now $n_2$
estimates of Y can be computed from equation 11.17 based on the $n_2$ observations on X not
common to the observations on Y. Let $\bar{Y}_1$ and $\bar{Y}_2$ represent the mean of Y based on the original
$n_1$ observations and the $n_2$ estimated observations, respectively. A new weighted mean for Y
based on $n_1 + n_2$ observations can now be computed from

$$\bar{Y} = \frac{n_1\bar{Y}_1 + n_2\bar{Y}_2}{n_1 + n_2} \tag{11.18}$$

For the $n_2$ additional observations to improve the estimate of the mean of Y, it is necessary that $r_{yx}$ be greater
than $1/(n_1 - 2)$ (Matalas and Langbein 1962).
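
The record-extension procedure can be sketched as follows. The function name and the data are hypothetical, and the code assumes that the first $n_1$ values of the long record on X are concurrent with the short record on Y. The regression slope is computed as $\sum xy / \sum x^2$ on deviations from the common-period means, which is algebraically the same as $r_{yx} s_y / s_x$.

```python
def extend_record_mean(y_short, x_long):
    """Estimate the missing Y values from X and return (estimates, weighted mean of Y)."""
    n1 = len(y_short)
    n2 = len(x_long) - n1
    if n1 < 3 or n2 < 1:
        raise ValueError("need at least three common observations and one extra X value")
    x_common, x_extra = x_long[:n1], x_long[n1:]
    y_bar1 = sum(y_short) / n1
    x_bar1 = sum(x_common) / n1
    sxy = sum((x - x_bar1) * (y - y_bar1) for x, y in zip(x_common, y_short))
    sxx = sum((x - x_bar1) ** 2 for x in x_common)
    slope = sxy / sxx  # equals r_yx * s_y / s_x for the common period
    y_estimated = [y_bar1 + slope * (x - x_bar1) for x in x_extra]
    y_bar2 = sum(y_estimated) / n2
    y_bar = (n1 * y_bar1 + n2 * y_bar2) / (n1 + n2)
    return y_estimated, y_bar


# Hypothetical, perfectly linear data used only to exercise the function
y_est, y_mean = extend_record_mean([3.0, 5.0, 7.0, 9.0], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

With real data the estimated values carry regression error, which is why the extension helps only when the correlation is strong enough.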
If the random variables Y and X contain significant serial correlation, the situation is
somewhat more complex. Matalas and Langbein (1962), Matalas and Rosenblatt (1962), and
Yevjevich (1972a) contain treatments of this case. In general, serial correlation serves to decrease
the information relative to the mean while cross-correlation tends to improve information relative
to the mean.
CORRELATION AND CAUSE AND EFFECT
At this point it should be apparent that a high correlation between two variables does not
necessarily imply that there is a cause-and-effect relation between the variables. The fact that the
monthly flows on adjacent small streams are correlated does not mean that changes in the
monthly flow of one stream cause a corresponding change in the other stream. More likely both
changes are caused by the same external factors operating on both watersheds.
Again, it is emphasized that independent variables are uncorrelated and correlated variables
are not necessarily related through cause and effect. The dependence in correlated variables is a
stochastic dependence and not a physical or cause-and-effect dependence. The correlation
coefficient measures only linear dependence. Dependence among variables may be strong and nonlinear in the
presence of a nonsignificant (linear) correlation coefficient.

SPURIOUS CORRELATION
Spurious correlation is any apparent correlation between variables that are in fact
uncorrelated. Spurious correlation can arise due to clustering of data. For example, in figure 11.2,
the correlation of Y with X within either of the data clusters is near zero. When the data from both
clusters are used to calculate a single correlation coefficient, this correlation is found to be quite
high. This is spurious correlation. Figure 11.3 shows a plot of Y versus X where both $Y_i$ and $X_i$ are
random variables obtained by adding 11 to a random observation from a standard normal
distribution. For a sufficiently large sample $r_{X,Y}$ would be zero. If both $Y_i$ and $X_i$ are divided by yet
a third random observation $Z_i$, obtained in the same manner as $X_i$ and $Y_i$, and the correlation between
$Y_i/Z_i$ and $X_i/Z_i$ computed, for a sufficiently large sample the correlation will be near 0.5.
Figure 11.4 is a plot of $Y_i/Z_i$ versus $X_i/Z_i$. Figure 11.4 indicates that $X_i$ furnishes information useful
in estimating $Y_i$ when in fact $Y_i$ and $X_i$ are uncorrelated. The correlation between $Y_i/Z_i$ and
$X_i/Z_i$ is spurious.
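
The common-divisor effect can be reproduced by simulation. The sketch below is illustrative only: the seed and sample size are arbitrary choices, the helper name is hypothetical, and the variables are generated as 11 plus a standard normal deviate as described above.

```python
import math
import random


def pearson_r(a, b):
    """Ordinary product-moment correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


rng = random.Random(12345)  # fixed seed so the run is repeatable
size = 20000
x = [11.0 + rng.gauss(0.0, 1.0) for _ in range(size)]
y = [11.0 + rng.gauss(0.0, 1.0) for _ in range(size)]
z = [11.0 + rng.gauss(0.0, 1.0) for _ in range(size)]

r_xy = pearson_r(x, y)  # independent variables: near zero
r_ratio = pearson_r([a / c for a, c in zip(x, z)],
                    [b / c for b, c in zip(y, z)])  # common divisor: near 0.5
```

The correlation of the ratios comes entirely from the shared divisor, not from any relation between X and Y.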

Fig. 11.2. Spurious correlation due to data clustering.

Fig. 11.3. Absence of correlation between two random variables.

Fig. 11.4. Spurious correlation introduced by dividing 2 random variables by a common third
random variable.

Pearson (1896-1897) investigated the spurious correlation that can arise between ratios. Let
$Y = X_1/X_2$ and $Z = X_3/X_4$. The correlation between Y and Z, $r_{YZ}$, was found to be a function of
the variances, covariances, and means of the X's. Pearson's derivation assumed that the X's were
normally distributed and that the coefficient of variation of each X was small enough so that its
third and higher powers could be neglected. Reed (1921) arrived at the same results without
specifying the parent distribution of the X's. Pearson's general formula is

$$r_{YZ} = \frac{r_{13}C_1C_3 - r_{14}C_1C_4 - r_{23}C_2C_3 + r_{24}C_2C_4}{\left(C_1^2 + C_2^2 - 2r_{12}C_1C_2\right)^{1/2}\left(C_3^2 + C_4^2 - 2r_{34}C_3C_4\right)^{1/2}} \tag{11.19}$$

where $r_{ij}$ is the correlation between $X_i$ and $X_j$, and $C_i$ is the coefficient of variation of $X_i$.
Chayes (1949) and Benson (1965) considered many special cases of equation 11.19. For
example, if $X_2 = X_4$, $r_{12} = r_{13} = r_{34} = 0$, $r_{24} = 1$, and $C_1 = C_2 = C_3 = C_4$, equation 11.19
reduces to $r_{YZ} = 0.5$, which is the case shown in figure 11.4. Benson (1965) produced a table
(Table 11.1) showing many special cases of ratio and product correlations.
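
Special cases of the ratio correlation are easy to explore numerically. The sketch below assumes the standard form of Pearson's approximation, $r_{YZ} = (r_{13}C_1C_3 - r_{14}C_1C_4 - r_{23}C_2C_3 + r_{24}C_2C_4) / [(C_1^2 + C_2^2 - 2r_{12}C_1C_2)^{1/2}(C_3^2 + C_4^2 - 2r_{34}C_3C_4)^{1/2}]$; the function name and argument order are hypothetical.

```python
import math


def pearson_ratio_correlation(c1, c2, c3, c4, r12, r13, r14, r23, r24, r34):
    """Approximate correlation between X1/X2 and X3/X4 from CVs and pairwise correlations."""
    numerator = r13 * c1 * c3 - r14 * c1 * c4 - r23 * c2 * c3 + r24 * c2 * c4
    denominator = (math.sqrt(c1 ** 2 + c2 ** 2 - 2.0 * r12 * c1 * c2)
                   * math.sqrt(c3 ** 2 + c4 ** 2 - 2.0 * r34 * c3 * c4))
    return numerator / denominator


# Common divisor (X2 = X4, so r24 = 1), all other pairs uncorrelated, equal CVs
cv = 1.0 / 11.0
r_common_divisor = pearson_ratio_correlation(cv, cv, cv, cv,
                                             r12=0.0, r13=0.0, r14=0.0,
                                             r23=0.0, r24=1.0, r34=0.0)
```

With equal coefficients of variation the result is 0.5 regardless of the size of the common CV, which agrees with the case described for figure 11.4.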
Spurious correlation can arise in hydrology when dimensionless terms or standardized
variables are used. Benson (1965) presents several examples of possible spurious correlation in
hydrology.

Exercises

11.1. Calculate the first-order serial correlation coefficients for the sediment load and annual dis-
charge data for the Green River at Munfordville, Kentucky. Test the hypothesis that these two
correlations are equal. Discuss the assumptions you have made and how they affect the validity
of the tests you have made.

11.2. Calculate the correlation between the sediment load and annual discharge for the Green
River at Munfordville, Kentucky. Test the hypothesis that this correlation is equal to 0.50.

11.3. Verify the "comment" of example 11.2.

11.4. Calculate the first-order serial correlation coefficient for the Spray River, Banff, Canada.
Test the hypothesis that the first order serial correlation is zero.

11.5. Work exercise 11.4 for the Piscataquis River near Dover-Foxcroft, Maine.

11.6. If the annual runoff from the Spray River, Banff, Canada, is normally distributed, how
many independent observations would provide as much information relative to estimating the
mean annual runoff as does the 45 years of actual record?

11.7. Work exercise 11.6 for the Piscataquis River, near Dover-Foxcroft, Maine and its 54 years
of record.

Table 11.1. Special cases of ratio and product correlations (after Benson 1965). [Table contents not recoverable.]

11.8. The following data were collected on two streams in southeastern Kentucky. Use the data
to extend the peak flow record of Cave Branch through 1972. Estimate the average peak flow for
the entire record plus estimated record for Cave Branch. Is this estimated average an
improvement over an estimate based on the actual observed record of Cave Branch?

Peak Flow Data

Year Cave Branch Helton Branch Year Helton Branch


12. Multivariate Analysis
THERE ARE several multivariate data analysis techniques that may prove useful in
working with hydrologic data. The treatment here is concerned with the principles and tech-
niques of principal components analysis, cluster analysis, and multivariate regression analysis.
Multivariate techniques have been around for some time but have found limited use in
hydrologic applications. For a more complete treatment of multivariate analysis, especially the
inferential aspects of multivariate analysis, reference should be made to the books by Morrison
(1967), Press (1972), Cooley and Lohnes (1971), Harman (1967), Anderson (1958), Harris
(1975), and Karson (1982). Snyder (1962) and Wong (1963) were among the first to apply
multivariate analysis to hydrology.

NOTATION
In this chapter an uppercase underlined letter will denote a matrix and a lowercase underlined
letter will denote a column vector. Thus $\underline{Z}$ could be an $n \times p$ matrix made up of p $n \times 1$
column vectors $\underline{z}_j$ for j = 1, 2, ..., p.

PRINCIPAL COMPONENTS
Often, when data are collected on p variables, these p variables are correlated. This correlation
indicates that some of the information contained in one variable is also contained in some of
the other p - 1 variables. The objective of principal components analysis is to transform the p
original correlated variables into p uncorrelated, or orthogonal, components. These components
are linear functions of the original variables. Such a transformation can be written

$$\underline{Z} = \underline{X}\,\underline{A} \tag{12.1}$$

where $\underline{X}$ is an $n \times p$ matrix of n observations on p variables. Because we will be dealing with
variances and covariances, all X's will be assumed to be deviations from their respective
means so that $\underline{X}$ is a matrix of deviations from means. $\underline{Z}$ is an $n \times p$ matrix of n values for each
of p components, and $\underline{A}$ is a $p \times p$ matrix of coefficients defining the linear transformation.
Because the original p-variate set of observations $\underline{X}$ contains correlation, it might be
possible to characterize the variance of $\underline{X}$ with q < p orthogonal components. Thus, it is desired
to construct $\underline{Z}$ so that each component, $\underline{z}_j$ (an $n \times 1$ column vector), explains the maximum
amount of the variance of $\underline{X}$ left unexplained by the first j - 1 components. In this way it may
be found that the first q components explain most of the system variance and that the last p - q
components explain little of the system variance. The total variance of $\underline{X}$ is defined to be the
sum of the variances of the p variables contained in $\underline{X}$. The variance-covariance matrix of $\underline{X}$ is
defined to be the $p \times p$ matrix $\underline{\Sigma}$ where $\underline{\Sigma} = [\sigma_{i,j}]$ and $\sigma_{i,j}$ is the covariance of the ith and jth
variables in $\underline{X}$ for $i \ne j$ and $\sigma_{i,i}$ is the variance of the ith variable. $\underline{\Sigma}$ is estimated by $\underline{S}$ whose
elements are given by

$$s_{i,j} = \frac{1}{n - 1}\sum_{k=1}^{n} x_{k,i}\,x_{k,j} \tag{12.2}$$
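The total system variance, V, is defined as the sum of the variances of the original variables
and can be estimated as

$$V = \operatorname{Trace}\,\underline{S} = \sum_{i=1}^{p} s_{i,i} \tag{12.3}$$

The jth principal component, $\underline{z}_j$, is the linear function

$$\underline{z}_j = \underline{X}\,\underline{a}_j \tag{12.4}$$

where $\underline{z}_j$ is an $n \times 1$ and $\underline{a}_j$ a $p \times 1$ column vector. $\underline{z}_j$ can also be written $\underline{z}_j = [z_{ij}]$ where

$$z_{ij} = \sum_{k=1}^{p} x_{i,k}\,a_{k,j} \tag{12.5}$$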
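The variance of $\underline{z}_j$ is found from

$$\operatorname{Var}(\underline{z}_j) = \underline{a}_j'\,\underline{\Sigma}\,\underline{a}_j \tag{12.6}$$

and may be estimated by $\underline{a}_j'\underline{S}\,\underline{a}_j$. Note that this is simply a matrix equivalent of equation 3.58 for
the variance of a linear function.
The variance of the first principal component $\underline{z}_1$ is estimated by

$$\operatorname{Var}(\underline{z}_1) = \underline{a}_1'\,\underline{S}\,\underline{a}_1 \tag{12.7}$$

$\underline{z}_1$ is thus defined by the vector $\underline{a}_1$ that maximizes the variance of $\underline{z}_1$ subject to the constraint that
$\underline{a}_1'\underline{a}_1 = 1$. This is a normalizing constraint without which there would be no unique solution.
Equation 12.7 can be maximized by using the Lagrangian multiplier $\lambda_1$ to introduce the
constraint $\underline{a}_1'\underline{a}_1 = 1$. Let

$$Q = \underline{a}_1'\,\underline{S}\,\underline{a}_1 - \lambda_1(\underline{a}_1'\underline{a}_1 - 1)$$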
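Q is maximized by differentiation.

$$\frac{\partial Q}{\partial \underline{a}_1} = 2\underline{S}\,\underline{a}_1 - 2\lambda_1\underline{a}_1 = \underline{0} \quad\text{or}\quad (\underline{S} - \lambda_1\underline{I})\,\underline{a}_1 = \underline{0} \tag{12.8}$$

For the solution of equation 12.8 to be other than the trivial solution $\underline{a}_1 = \underline{0}$ we must have

$$|\underline{S} - \lambda_1\underline{I}| = 0 \tag{12.9}$$

This is a classical characteristic value problem. $\lambda_1$ is called the characteristic root and $\underline{a}_1$ the
characteristic vector of $\underline{S}$. Equation 12.9 has p solutions for $\lambda_1$. This is easily seen by considering the
special case of $\underline{S}$ to be a $2 \times 2$ matrix in which case equation 12.9 becomes

$$\begin{vmatrix} s_{1,1} - \lambda_1 & s_{1,2} \\ s_{2,1} & s_{2,2} - \lambda_1 \end{vmatrix} = 0$$

or $(s_{1,1} - \lambda_1)(s_{2,2} - \lambda_1) - s_{2,1}s_{1,2} = 0$. This is a quadratic equation in $\lambda_1$ having 2 solutions.
Special properties of $\underline{S}$ guarantee that the p solutions will be real.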
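
For the $2 \times 2$ case the two characteristic roots follow from the quadratic formula. The sketch below is an illustration with a hypothetical function name. It uses the fact that for a symmetric matrix ($s_{1,2} = s_{2,1}$) the discriminant equals $(s_{1,1} - s_{2,2})^2 + 4s_{1,2}^2$, which cannot be negative, so both roots are real.

```python
import math


def characteristic_roots_2x2(s11: float, s12: float, s22: float):
    """Roots of (s11 - L)(s22 - L) - s12^2 = 0 for a symmetric 2 x 2 matrix, largest first."""
    trace = s11 + s22
    discriminant = (s11 - s22) ** 2 + 4.0 * s12 * s12  # never negative
    root = math.sqrt(discriminant)
    return (trace + root) / 2.0, (trace - root) / 2.0


lambda_1, lambda_2 = characteristic_roots_2x2(2.0, 1.0, 2.0)  # hypothetical matrix
```

The two roots sum to the trace of the matrix, here 4, which anticipates the result that the characteristic roots add up to the total system variance.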
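Multiplying equation 12.8 by $\underline{a}_1'$ results in

$$\underline{a}_1'\,\underline{S}\,\underline{a}_1 - \lambda_1\underline{a}_1'\underline{a}_1 = 0 \quad\text{or}\quad \underline{a}_1'\,\underline{S}\,\underline{a}_1 = \lambda_1 \tag{12.10}$$

Because our objective was to maximize $\underline{a}_1'\underline{S}\,\underline{a}_1$, the desired solution to equation 12.9 is the
largest characteristic root (the largest value) for $\lambda$.
Equations 12.7 and 12.10 demonstrate the important point that

$$\operatorname{Var}(\underline{z}_1) = \lambda_1 \tag{12.11}$$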
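Having found the characteristic root, $\lambda_1$, of $\underline{S}$, the characteristic vector, $\underline{a}_1$, is found from
equation 12.8 using the constraint that $\underline{a}_1'\underline{a}_1 = 1$, which is equivalent to $\sum_{i=1}^{p} a_{i,1}^2 = 1$.
The second principal component is found in a similar manner. Now it is desired to find $\underline{a}_2$
such that $\operatorname{Var}(\underline{z}_2) = \underline{a}_2'\underline{S}\,\underline{a}_2$ is maximized subject to the constraints that $\underline{a}_2'\underline{a}_2 = 1$ and
$\underline{a}_1'\underline{a}_2 = \underline{a}_2'\underline{a}_1 = 0$. This latter constraint guarantees that $\underline{z}_1$ and $\underline{z}_2$ are orthogonal (uncorrelated).
Using a procedure similar to the above for $\underline{a}_1$, let Q be

$$Q = \underline{a}_2'\,\underline{S}\,\underline{a}_2 - \lambda_2(\underline{a}_2'\underline{a}_2 - 1) - \gamma\,\underline{a}_1'\underline{a}_2$$

Differentiating with respect to $\underline{a}_2$ and setting the result to zero gives

$$2\underline{S}\,\underline{a}_2 - 2\lambda_2\underline{a}_2 - \gamma\,\underline{a}_1 = \underline{0} \tag{12.12}$$

Premultiplication by $\underline{a}_1'$ results in

$$2\underline{a}_1'\,\underline{S}\,\underline{a}_2 - \gamma = 0 \tag{12.13}$$

Due to orthogonality, premultiplication of equation 12.8 by $\underline{a}_2'$ results in $\underline{a}_2'\underline{S}\,\underline{a}_1 = 0$. Substituting
this into equation 12.13 results in $\gamma = 0$. Therefore, from equation 12.12 we have

$$(\underline{S} - \lambda_2\underline{I})\,\underline{a}_2 = \underline{0} \tag{12.14}$$

from which it follows that $\underline{a}_2$, the coefficients of the second largest principal component, are the
coefficients of the characteristic vector associated with the second largest characteristic root of
$\underline{S}$. Premultiplying equation 12.14 by $\underline{a}_2'$ also results in $\lambda_2 = \underline{a}_2'\underline{S}\,\underline{a}_2 = \operatorname{Var}(\underline{z}_2)$. In general, the jth
principal component of the p-variate sample $\underline{X}$ is the linear function $\underline{z}_j = \underline{X}\,\underline{a}_j$ where $\underline{a}_j$ are the
elements of the characteristic vector associated with the jth largest characteristic root of $\underline{S}$.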
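From equation 12.1 we can find $\underline{Z}'\underline{Z}$ as $\underline{Z}'\underline{Z} = (\underline{X}\,\underline{A})'(\underline{X}\,\underline{A}) = \underline{A}'\underline{X}'\underline{X}\,\underline{A} = (n - 1)\underline{A}'\underline{S}\,\underline{A}$. It can
be easily shown (see equations 12.24-12.28) that $\underline{A}'\underline{S}\,\underline{A}$ is a diagonal $p \times p$ matrix with the ith
diagonal element equal to $\lambda_i$. This matrix may also be written as $\underline{A}'\underline{S}\,\underline{A} = \underline{D}_\lambda$ where $\underline{D}_\lambda$ is the
diagonal matrix whose diagonal elements are the characteristic roots of $\underline{S}$.
One property of matrices is that if $\underline{E}$ is an orthogonal matrix, then the trace of $\underline{E}'\underline{F}\,\underline{E}$ equals
the trace of $\underline{F}$. Therefore

$$\operatorname{Trace}(\underline{D}_\lambda) = \operatorname{Trace}(\underline{A}'\underline{S}\,\underline{A}) = \operatorname{Trace}(\underline{S}) = V$$

However

$$\operatorname{Trace}(\underline{D}_\lambda) = \sum_{i=1}^{p} \lambda_i$$

The sum of the characteristic roots which equals the sum of the variances of the principal
components also equals the total system variance.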

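The covariance between $\underline{z}_i$ and $\underline{z}_j$ is $\operatorname{Cov}(\underline{z}_i, \underline{z}_j) = \operatorname{Cov}(\underline{X}\,\underline{a}_i, \underline{X}\,\underline{a}_j) = \underline{a}_i'\underline{S}\,\underline{a}_j = 0$ for $i \ne j$.
Therefore, $\underline{z}_i$ and $\underline{z}_j$ are uncorrelated.
Some important properties obtained thus far are:

1. $\underline{z}_i$ and $\underline{z}_j$ are uncorrelated for $i \ne j$

4. $\sum_{i=1}^{p} \operatorname{Var}(\underline{z}_i) = \sum_{i=1}^{p} \lambda_i = \operatorname{Trace}\,\underline{S} = V$

5. $\underline{Z} = \underline{X}\,\underline{A}$ where $\underline{A}$ is the $p \times p$ matrix whose jth column is the characteristic vector $\underline{a}_j$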

From item 4 above, it can be seen that the fraction of the total variance accounted for by the
jth principal component is $\lambda_j/V$. In many situations the first q components account for a large
fraction (say 90% or more) of the system variance, indicating that the last p - q components are not
needed in terms of explaining variance. Many times these last p - q components are discarded with
the effect that the problem has been reduced from one of dealing with an $n \times p$ matrix $\underline{X}$ containing
correlation to dealing with an $n \times q$ (q < p) matrix $\underline{Z}$ that is orthogonal.
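
The extraction of characteristic roots and the fraction of variance each explains can be sketched with the standard library alone. The sketch uses power iteration with deflation, which is a simple stand-in for a library eigenvalue routine and is not the method of the text. It assumes distinct roots and a starting vector that is not orthogonal to the vector being sought. All names and the small data set are hypothetical.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix S from n rows of p observations (divisor n - 1)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    dev = [[row[j] - means[j] for j in range(p)] for row in data]
    return [[sum(d[i] * d[j] for d in dev) / (n - 1) for j in range(p)] for i in range(p)]


def principal_components(s, iterations=500):
    """Return (roots, vectors), largest root first, by power iteration with deflation."""
    p = len(s)
    work = [row[:] for row in s]
    roots, vectors = [], []
    for _ in range(p):
        v = [1.0 / (i + 1) for i in range(p)]  # arbitrary starting vector
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-12:  # remaining variance is numerically zero
                break
            v = [c / norm for c in w]
        lam = sum(v[i] * work[i][j] * v[j] for i in range(p) for j in range(p))
        roots.append(lam)
        vectors.append(v)
        # Remove the component just found so the next largest root dominates
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return roots, vectors


# Hypothetical data: five observations on two strongly correlated variables
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
s_matrix = covariance_matrix(data)
roots, vectors = principal_components(s_matrix)
total_variance = sum(s_matrix[i][i] for i in range(len(s_matrix)))
fractions = [lam / total_variance for lam in roots]
```

In this example nearly all of the system variance falls on the first component, so the second could be discarded with little loss.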
The question of how many components are needed to satisfactorily explain the system vari-
ance or what part of the total system variance should be explained is an unresolved one. Morrison
(1967) suggests that only the first 4 or 5 components should be extracted since later components
will be difficult to physically interpret in terms of the problem at hand. Unfortunately, there are
no statistical tests that can be used to determine the significance of a component. The sampling
theory of principal components is not well developed, especially when the components are ex-
tracted from the correlation matrix rather than the covariance matrix as in later examples.
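The covariance between the original variables, $\underline{X}$, and the principal components, $\underline{Z}$, is given by

$$\operatorname{Cov}(\underline{X}, \underline{Z}) = \operatorname{Cov}(\underline{X}, \underline{X}\,\underline{A}) = \underline{S}\,\underline{A}$$

The covariance between the variables and the jth component is given by

$$\operatorname{Cov}(\underline{X}, \underline{z}_j) = \underline{S}\,\underline{a}_j$$

From $(\underline{S} - \lambda_j\underline{I})\,\underline{a}_j = \underline{0}$ we have $\underline{S}\,\underline{a}_j = \lambda_j\underline{a}_j$. Therefore, $\operatorname{Cov}(\underline{X}, \underline{z}_j) = \lambda_j\underline{a}_j$. The covariance
between the ith variable and the jth component is given by $\lambda_j a_{i,j}$. The correlation between the ith
variable and the jth component is

$$\operatorname{Cor}(\underline{x}_i, \underline{z}_j) = \frac{\lambda_j a_{i,j}}{\left[\operatorname{Var}(\underline{x}_i)\operatorname{Var}(\underline{z}_j)\right]^{1/2}}$$

The $\operatorname{Var}(\underline{x}_i) = s_{i,i} = s_i^2$ and $\operatorname{Var}(\underline{z}_j) = \lambda_j$. Therefore

$$\operatorname{Cor}(\underline{x}_i, \underline{z}_j) = \frac{\lambda_j^{1/2} a_{i,j}}{s_i} \tag{12.17}$$

This equation can be used to transform $\underline{A}$ into a $p \times p$ matrix of correlations between the ith
observed variable and the jth computed component. These correlations can then be used in an
attempt to assess the physical meaning of the components.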
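
The correlation between a variable and a component is a one-line calculation once the root, the vector element, and the variable's variance are known. The sketch assumes the form $\lambda_j^{1/2} a_{i,j}/s_i$; the function name is hypothetical, and the inputs are the values quoted in example 12.1 for the second variable and the first component.

```python
import math


def variable_component_correlation(root_j: float, a_ij: float, variance_i: float) -> float:
    """Correlation between variable i and component j: sqrt(lambda_j) * a_ij / s_i."""
    return math.sqrt(root_j) * a_ij / math.sqrt(variance_i)


cor_x2_z1 = variable_component_correlation(155.963, -0.999, 155.769)
```

The result is essentially -1, which shows that the first component is little more than the second variable with its sign reversed.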
In some situations, some of the p variables can be eliminated from further consideration
by examining the correlations defined by equation 12.17. If a variable has no significant
correlation with a component, then that variable is not contributing much to the variance of the
component. By eliminating the variable from the component, the fraction of the system
variance explained by the component would be changed very little. The difficulty here is that
this variable may be correlated with a second component, in which case its
elimination would decrease the variance explained by the second component. For these
reasons, variables are generally eliminated only if they are not correlated with any of the q
components retained for analysis.

Example 12.1. Consider the data in table 10.2. Let $\underline{X}$ be a $13 \times 3$ matrix made up of 13
observations on A(mi$^2$), S(%), and L(ft). Compute the principal components of $\underline{X}$ based on the
covariance matrix. Compute the correlation between the variables and the components.

Solution: $\underline{S}$ is computed from equation 12.2.

$|\underline{S} - \lambda\underline{I}|$ is computed from

which simplifies to

The solutions to this cubic equation are

Note that $\sum \lambda_i = \operatorname{Trace}\,\underline{S} = 161.557$.


The first principal component accounts for $100\lambda_1/\operatorname{Trace}\,\underline{S}$ = 100(155.963)/161.557 =
96.54% of the total system variance. The coefficients of the characteristic vectors can be
computed from equation 12.8. For example, for the first principal component we have

Solving these three equations simultaneously for $a_{1,1}$, $a_{2,1}$, and $a_{3,1}$ results in

$a_{2,1} = -51.43a_{3,1}$ and $a_{1,1} = 1.5503a_{3,1}$.

Using the constraint that $a_{1,1}^2 + a_{2,1}^2 + a_{3,1}^2 = 1$, the solution is $a_{1,1}$ = .030, $a_{2,1}$ = -.999 and
$a_{3,1}$ = .020. Similarly, for $\lambda_2$ and $\lambda_3$ we get

Thus,

The values for the principal components can now be calculated from

where $\underline{X}$ is composed of deviations from the mean, that is, deviations of A, S, and L from their
means.

The correlation matrix between the variables and the components can be computed from
equation 12.17. For example, the correlation between $\underline{x}_2$ and $\underline{z}_1$ is

$$\operatorname{Cor}(\underline{x}_2, \underline{z}_1) = \lambda_1^{1/2} a_{2,1}/s_2 = 155.963^{1/2}(-0.999)/155.769^{1/2} = -0.9995$$

The resulting correlation matrix is

Example 12.1 illustrates that using the $\underline{S}$ matrix in a principal component analysis presents some
problems if the units of the X variables differ greatly. In example 12.1, the magnitude of the
observations associated with the second variable was much greater than those associated with
the other two variables. Consequently, the variance of $\underline{x}_2$ was much greater than either $\operatorname{Var}(\underline{x}_1)$ or
$\operatorname{Var}(\underline{x}_3)$. $\underline{x}_2$ accounted for $100\operatorname{Var}(\underline{x}_2)/\operatorname{Trace}\,\underline{S}$ or 96.4% of the system variance. This means that
the first principal component is merely a restatement of $\underline{x}_2$. This can also be seen from the fact
that the correlation between $\underline{x}_2$ and $\underline{z}_1$ is -1.000.
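In most hydrologic studies the problem of noncommensurate units on the X's has been
handled by standardizing the X's through the transformation $(X_{i,j} - \bar{X}_j)/s_j$. The covariance matrix
of the standardized variables becomes the correlation matrix $\underline{S} = \underline{R}$, as can be seen from
equation 12.2. The principal components analysis is then done on $\underline{R}$. The total system "variance"
now becomes $\operatorname{Trace}\,\underline{R} = p$ because $\underline{R}$ has 1's on the diagonal.
The characteristic roots and vectors are determined from

$$(\underline{R} - \lambda_j\underline{I})\,\underline{a}_j = \underline{0} \tag{12.18}$$

and the numerical value of the components is computed from

$$\underline{Z} = \underline{X}\,\underline{A} \tag{12.19}$$

where $\underline{X}$ now contains the standardized variables.
The correlation between the ith standardized variable and the jth component (equation 12.17)
reduces to

$$\operatorname{Cor}(\underline{x}_i, \underline{z}_j) = \lambda_j^{1/2} a_{i,j} \tag{12.20}$$
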
These correlations are sometimes called factor loadings. The factor loadings can be used to attach
physical significance to the components. If a particular component is highly correlated with 1,2,
or 3 variables, then the component is a reflection of these variables. For example, in a study of
watershed geomorphic factors, it might be found that a component is highly correlated with the
average stream slope and the basin relief ratio. This being the case, that particular component
might be termed a measure of watershed steepness.

Example 12.2. Repeat example 12.1 using $\underline{R}$ instead of $\underline{S}$.

Solution:

which has solutions


$\lambda_1$ = 1.9692
$\lambda_2$ = 0.9273
$\lambda_3$ = 0.1035

In this formulation, $\underline{z}_1$ accounts for 100(1.9692)/3 or 65.64% of the system "variance"
whereas $\underline{z}_2$ and $\underline{z}_3$ account for 30.91% and 3.45%, respectively.
The corresponding characteristic vectors are

The factor loadings computed from $\lambda_j^{1/2} a_{i,j}$ are

Since component 1 is highly correlated with both area and length, this component might be
called a "size" component. Likewise, component 2 might be called a slope component. In terms of
explaining the "variance" of $\underline{R}$, component 3 could be eliminated because it explains only 3.45%
of the variance and is not correlated with any of the variables. We cannot eliminate any variables,
however, because component 1 is strongly dependent on $\underline{X}_1$ and $\underline{X}_3$ whereas component 2 depends
on $\underline{X}_2$.
In terms of explaining the variance of $\underline{R}$, we have reduced our problem from one of considering
a $13 \times 3$ matrix $\underline{X}$ with correlations to a $13 \times 2$ matrix $\underline{Z}$ without correlations (assuming $\underline{z}_3$ is
discarded).
The values for the components are computed from

where

thus

REGRESSION ON PRINCIPAL COMPONENTS


Often a principal components analysis is the first step in the development of a prediction
model for some dependent variable, Y. Once the principal components are derived, they are used
as the independent variables in a multiple regression analysis with the dependent variable, Y.
Because of the differing units usually present in the original independent variables, the principal
components are generally abstracted from the correlation matrix.
The steps in performing a multiple regression on principal components are outlined here.
First, the independent variables are standardized and the dependent variable centered so that
$X = [x_{ij}]$ and $Y = [y_i]$ where

$$x_{ij} = (X_{ij} - \bar{X}_j)/s_j \quad \text{and} \quad y_i = Y_i - \bar{Y} \qquad (12.21)$$

where $Y_i$ is the ith observation on Y, $\bar{Y}$ is the mean of Y, $X_{ij}$ is the ith observation on the jth variable,
and $\bar{X}_j$ and $s_j$ are the mean and standard deviation of the jth variable. Centering Y is not necessary.
It eliminates the need for an intercept and simplifies notation. The matrix of principal
components, Z, is determined from $Z = XA$ with A being a $p \times p$ matrix whose jth column is $a_j$,
the characteristic vector computed from equation 12.18 with $R = X'X/(n - 1)$.
The regression model is

where Y is an $n \times 1$ vector whose elements are the n observations of the centered dependent
variable, and Z is an $n \times p$ matrix whose elements, $z_{ij}$, represent the ith value of the jth principal
component.
$\beta$ is estimated from equation 10.8 as

The expression for $\hat{\beta}$ can be simplified by writing Z as

$$Z = [z_1, z_2, \ldots, z_p]$$

where $z_j$ is an $n \times 1$ vector whose elements are the n values of the jth principal component,
so that

$$Z'Z = [z_i' z_j]$$

From equation 12.4 we have

$$z_i' z_j = a_i' X'X a_j = (n - 1)\, a_i' R a_j$$

Now $a_i' R a_j$ is 0 for $i \ne j$ and is $\lambda_j$ for $i = j$. Thus, $Z'Z$ is a $p \times p$ matrix whose off-diagonal
elements ($i \ne j$) are all zeros and whose jth diagonal element ($i = j$) is $(n - 1)\lambda_j$.

$(Z'Z)^{-1}$ is therefore

Equation 12.23 can now be written as

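From equation 10.14 and the above results it is apparent that

$$\text{cov}(\hat{\beta}_i, \hat{\beta}_j) = 0 \quad \text{for } i \ne j$$
$$\text{var}(\hat{\beta}_j) = \frac{\sigma^2}{(n - 1)\lambda_j} \quad \text{for } i = j \qquad (12.32)$$

where $\sigma$ is the standard error of the regression equation.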



Thus $\hat{\beta}_i$ is independent of $\hat{\beta}_j$ for $i \ne j$. The independence of the $\hat{\beta}$'s is a result of the
orthogonality of the principal components. Since the $\hat{\beta}$'s are independent, the t-test given by equation
10.17 can be repeatedly applied to test hypotheses on the $\beta$'s from a single regression equation.
Furthermore, the numerical value for the $\hat{\beta}$'s retained in the regression will not be altered by
eliminating any number of the other $\hat{\beta}$'s. This is the distinct advantage of having an orthogonal
matrix of independent variables.
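The orthogonality argument can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are hypothetical, and only two independent variables are used so that the characteristic values of the correlation matrix ($1 + r$ and $1 - r$) and the characteristic vectors ($(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$) are available in closed form.

```python
import math

def standardize(values):
    """Return (value - mean) / sample standard deviation for each value."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical observations (not taken from the text).
X1 = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
X2 = [1.0, 3.0, 2.0, 6.0, 7.0, 8.0]
Y = [3.1, 5.9, 6.2, 10.4, 12.8, 15.9]

n = len(Y)
x1, x2 = standardize(X1), standardize(X2)
y_bar = sum(Y) / n
y = [v - y_bar for v in Y]            # centered dependent variable

# For two standardized variables R = [[1, r], [r, 1]].
r = dot(x1, x2) / (n - 1)
lam = [1.0 + r, 1.0 - r]              # characteristic values of R
c = 1.0 / math.sqrt(2.0)
z1 = [c * (a + b) for a, b in zip(x1, x2)]   # first principal component
z2 = [c * (a - b) for a, b in zip(x1, x2)]   # second principal component

# Because Z'Z is diagonal, each coefficient is computed on its own:
# beta_j = z_j' y / ((n - 1) * lambda_j)
beta_hat = [dot(z, y) / ((n - 1) * l) for z, l in zip((z1, z2), lam)]

print("z1'z2 =", dot(z1, z2))
print("beta_hat =", beta_hat)
```

Since each coefficient depends only on its own component, dropping `z2` from the model leaves `beta_hat[0]` unchanged.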
A second advantage of having independent $\hat{\beta}$'s is that the interpretation of the $\hat{\beta}$'s in terms
of the independent variables is greatly simplified. Thus, if some hydrologic meaning can be
attached to a component through an examination of the factor loadings, hydrologic significance
can also be attached to the $\hat{\beta}$'s. Unfortunately, in most hydrologic applications of principal
components analysis, a clear and distinct interpretation of the principal components has not been
possible. This, in turn, means the hydrologic significance of the $\hat{\beta}$'s is unclear as well.
Some authors (DeCoursey and Deal 1974) state that yet another advantage for using regres-
sion on principal components as compared to normal multiple regression is that the resulting re-
gression coefficients are more stable when applied to a new set of data because the coefficients
are fitted on the basis of only statistically significant orthogonal components. This could imply
that using an equation based on regression on principal components for prediction on a sample
not included in the equation development would have a smaller standard error on this sample
than would a normal multiple regression equation. If this is the case, it would be an important
advantage for the regression on principal components technique. An adequate demonstration of
this hypothesis needs to be developed, however.
A disadvantage of using principal components in a regression analysis is that even if all but
one of the components is eliminated, all of the original variables (the X's) must still be measured
because each component is a function of all of the X's (equation 12.4).
In reporting the results of a regression on principal components, it is generally desirable to
transform the resulting regression equation into an equation in terms of the original X variables.
This can be done since $y_i = Y_i - \bar{Y}$, the $\hat{\beta}$'s are known constants, $z_{ij} = \sum_k a_{kj} x_{ik}$, and
$x_{ij} = (X_{ij} - \bar{X}_j)/s_j$. Thus equation 12.22 becomes

Equation 12.33 can then be simplified by collecting terms to be of the form

where the $\beta^*$'s are constants. If only q (q < p) components are retained in the final regression
equation, and the components are rearranged so that the first q components are retained, the first
summation in equation 12.33 would run from 1 to q; however, the second summation would still
run from 1 to p. This means the summation in equation 12.34 would run from 1 to p. It also means
that even though the equation contains only q components, all p of the original variables must be
measured to predict Y.
Some of the original X variables can be eliminated from the analysis before any regres-
sions are performed by examining the factor loadings and eliminating variables that are not
highly correlated with any of the components. The remaining X variables are then resubmitted
to a principal components analysis with the multiple regression being performed on the new
components. This procedure has the advantage of reducing the number of variables that must be
measured to use the resulting regression equation. It has the disadvantage of eliminating X vari-
ables rather arbitrarily (there is no statistical test for the significance of the factor loadings)
without ever having them in a position to determine their usefulness in explaining the variation
in the dependent variable, Y.
In many applications of regression on principal components, the last p - q components are
discarded before the regression is performed. The number of retained components, q, is selected
so that a large proportion of the variance of X is accounted for. This procedure reduces the
number of coefficients that must be estimated but runs the risk of eliminating a component that
may explain a significant amount of the variation in Y even though it explains little of the
variance of X.
Equation 12.31 gives $\hat{\beta}_j = a_j' X'Y/[(n - 1)\lambda_j]$, whereas equation 12.32 gives
$\text{var}(\hat{\beta}_j) = \sigma^2/[(n - 1)\lambda_j]$. The statistical significance of $\hat{\beta}_j$ is tested using equation 10.17 with $\beta_0 = 0$. Thus
the test statistic is
the test statistic is

There is no reason to believe before the regression is performed that this test statistic will be
nonsignificant for small values of $\lambda_j$ (i.e., for the last p - q components). Therefore, the
regression should be performed on all of the components and then the components that prove to
be nonsignificant can be eliminated.
The value of the test statistic given by equation 12.35 can be shown to be proportional to the
correlation between Y and $z_j$ as follows:

Therefore

or the significance of the jth component is directly proportional to its correlation with the
dependent variable. Equation 12.38 can be used to test the significance of the jth component.
At this point, it should be noted that if a dependent variable Y is regressed on p principal
components extracted from a $p \times p$ correlation matrix and then transformed via equation 12.33,
the results are identical to those that would be obtained by a direct regression of Y on the original
p variables. This is because multiple regression is a linear operation and the principal
components are independent linear functions of the original variables that explain all of the
variance of the variables.
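This equivalence can be illustrated with a small numerical check. The following is a sketch under stated assumptions: the data are hypothetical, only two independent variables are used so that the characteristic vectors are known in closed form, and the names `b_star` and `b0` are illustrative labels for the back-transformed coefficients and intercept.

```python
import math

def mean(v):
    return sum(v) / len(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical observations (not taken from the text).
X1 = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
X2 = [1.0, 3.0, 2.0, 6.0, 7.0, 8.0]
Y = [3.1, 5.9, 6.2, 10.4, 12.8, 15.9]
n = len(Y)

m1, m2, my = mean(X1), mean(X2), mean(Y)
s1 = math.sqrt(sum((v - m1) ** 2 for v in X1) / (n - 1))
s2 = math.sqrt(sum((v - m2) ** 2 for v in X2) / (n - 1))
x1 = [(v - m1) / s1 for v in X1]
x2 = [(v - m2) / s2 for v in X2]
y = [v - my for v in Y]

# Principal components of the 2 x 2 correlation matrix [[1, r], [r, 1]].
r = dot(x1, x2) / (n - 1)
lam = (1.0 + r, 1.0 - r)
c = 1.0 / math.sqrt(2.0)
A = ((c, c), (c, -c))                 # A[k][j]: weight of variable k in component j
z = [[x1[i] * A[0][j] + x2[i] * A[1][j] for i in range(n)] for j in range(2)]

# Regression on both components, one coefficient at a time.
b = [dot(z[j], y) / ((n - 1) * lam[j]) for j in range(2)]

# Back-transform to coefficients on the original (unstandardized) variables.
b_std = [sum(b[j] * A[k][j] for j in range(2)) for k in range(2)]
b_star = [b_std[0] / s1, b_std[1] / s2]
b0 = my - b_star[0] * m1 - b_star[1] * m2

# Direct least squares on the original variables (2 x 2 normal equations).
d1 = [v - m1 for v in X1]
d2 = [v - m2 for v in X2]
s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
s1y, s2y = dot(d1, y), dot(d2, y)
det = s11 * s22 - s12 ** 2
ols = [(s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det]

print("from principal components:", b_star, "intercept", b0)
print("direct regression:        ", ols)
```

The two sets of slopes agree to rounding error only because both components are retained; dropping a component would change the back-transformed coefficients.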
MULTIVARIATE MULTIPLE REGRESSION
Occasionally, it is desirable to predict several dependent variables from the same set of
independent variables. Such a situation might be predicting the mean annual flood, 10-year peak
flow, and 25-year peak flow for a setting where it is desirable to maintain the correlation among
the dependent variables. This can be accomplished using a multivariate extension of multiple
regression. The prediction model would be

where Y is an $n \times q$ matrix of dependent variables, X is an $n \times p$ matrix of independent
variables, and $\beta$ is a $p \times q$ matrix of coefficients. Press (1972) discusses this model in more detail.
The coefficients, $\beta$, can be estimated in a manner similar to that employed in multiple
regression as

This equation can be written as

where $\hat{\beta}$ is partitioned into q $p \times 1$ vectors and Y into q $n \times 1$ vectors. Furthermore

demonstrating that the solution to equations 12.40 is equivalent to q multiple regressions each
involving the same X but a different vector of dependent variables. Tests of hypothesis
concerning $\beta_j$ can be made using the procedures set forth in chapter 10.
In multivariate regression as in multiple regression, one commonly has a large number of
independent variables, not all of which are important in predicting the q dependent variables. If q
separate multiple regressions are performed and independent variables eliminated using the pro-
cedures of chapter 10, it would be unlikely that the resulting equations would contain the same
set of independent variables.
If the multivariate regression model is used, all q of the prediction equations will contain the
same set of independent variables. Press (1972) presents a procedure for testing the hypothesis that
$\beta_i = \beta_i^*$ where $\beta_i$ is a $1 \times q$ vector made up of the coefficients associated with the ith independent
variable for each of the q dependent variables and $\beta_i^*$ is a $1 \times q$ vector of constants. To test that
the ith independent variable was not significant would be equivalent to the test that $\beta_i = 0$. Thus,
a procedure is available for eliminating variables from the regression to produce a useable model.
One distinct advantage in using the same independent variables for estimating several
dependent variables is that the correlation structure of the dependent variables is preserved.
DeCoursey (1973) used such an approach to derive prediction equations for the 2-, 5-, 10-, and
25-year peak flows on watersheds in Oklahoma. In situations like this it is highly desirable to
retain the observed correlations among the dependent variables in the resulting prediction
equations. In the case of flood flows, if this is not done it might be possible to have equations that
are inconsistent and predict, say, a 10-year peak to be greater than the 25-year peak flow.
Another place where retention of the correlation structure among a set of dependent variables
is important is in estimating the parameters descriptive of runoff hydrographs. Rice (1967) dis-
cusses this application of multivariate, multiple regression in simultaneously estimating the runoff
volume, peak discharge, and a base time parameter for runoff hydrographs based on data presented
by Reich (1962). Rice states that even though three separate regressions produce slightly better fits
to the original pool of data, the multivariate solution might be more effective in predicting hydro-
graphs for storms on watersheds not included in the original data sample.

CANONICAL CORRELATION
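Canonical correlation examines the relationship between two sets of variables. Consider the
$n \times p$ matrix X with covariance matrix $\Sigma$. Partition X and $\Sigma$ so that

$$X = [Y, Z]$$

where Y is $n \times p_1$ and Z is $n \times p_2$, and

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$

where $\Sigma$ is $p \times p$, $\Sigma_{11}$ is $p_1 \times p_1$, $\Sigma_{12}$ is $p_1 \times p_2$, $\Sigma_{21}$ is $p_2 \times p_1$, and $\Sigma_{22}$ is $p_2 \times p_2$ with
$p_1 + p_2 = p$ and $p_1 \le p_2$. In this formulation $\Sigma_{11} = \text{Var}(Y)$, $\Sigma_{22} = \text{Var}(Z)$, and
$\Sigma_{12} = \Sigma_{21}' = \text{Cov}(Y, Z)$.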
Canonical correlation investigates the correlation between Y and Z. Linear functions of Y
and Z are formed and then the correlation between these linear functions determined. Define

$$U_1 = a_1' Y' \quad \text{and} \quad V_1 = a_2' Z'$$

so that $a_1$ is $p_1 \times 1$, $Y'$ is $p_1 \times n$, $a_2$ is $p_2 \times 1$, and $Z'$ is $p_2 \times n$. The linear functions $U_1$ and $V_1$ are
formed in such a way as to maximize the correlation between them.

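The variances of $U_1$ and $V_1$ are

$$\text{Var}(U_1) = \text{Var}(a_1' Y') = a_1' \Sigma_{11} a_1 \quad \text{and} \quad \text{Var}(V_1) = \text{Var}(a_2' Z') = a_2' \Sigma_{22} a_2$$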

Therefore

Our goal is to find the $a_1$ and $a_2$ that maximize $\text{Cor}(U_1, V_1)$. Because correlation is not changed
by linear operations, we must use a constraint to get a unique solution. We will use

$$a_1' \Sigma_{11} a_1 = a_2' \Sigma_{22} a_2 = 1$$

as a normalizing constraint. In this case $\text{Cor}(U_1, V_1) = a_1' \Sigma_{12} a_2$. As was done in the case of
principal components, Lagrangian multipliers are used to maximize
principal components, Lagrangian multipliers are used to maximize

$\Gamma$ comes from matrix multiplication of the partitions of $\Sigma$ so that

Unfortunately, $\Gamma$ is not symmetric, so the determination of the resulting $p_2$ values of the $\lambda$'s
may require special computing techniques. $\lambda_i$ is numerically equal to the square of the correlation
between $U_i$ and $V_i$. For convenience the $\lambda_i$ are arranged so that $\lambda_1 > \lambda_2 > \cdots > \lambda_{p_2}$.
The hope is that $\lambda_1$ is sufficiently large that other $\lambda$'s can be dropped so that attention can be
focused on $U_1$ and $V_1$, which are vectors, rather than Y and Z, which are matrices. Of course,
there will be a $U_i$ and a $V_i$ associated with each of the $\lambda_i$'s. If $\lambda_2$ is sufficiently large, two vectors
on U and two vectors on V may have to be considered.
The vector $a_2$ used to transform Z to V is found by determining the eigenvectors of

Once $a_2$ is found, the vector $a_1$ transforming Y to U is found from

The partitioning of X into Y and Z has to be done up front by the investigator and is not a result
of the analysis.
Interpretation and use of canonical correlation in hydrology appear to have some of the
same drawbacks associated with principal components. The problem might be reduced from
considering a $p_1$-variate Y and a $p_2$-variate Z to single variates U and V, yet U is a function of all $p_1$
Y's and V is a function of all $p_2$ Z's. Some investigators eliminate some of the Y and Z variables
if the coefficients in $a_1$ or $a_2$ are "small" on those particular variables.

CLUSTER ANALYSIS
The main objective of a regional flood frequency analysis is to develop regional regression
models which can be used to estimate flow characteristics at ungaged stream sites. Hydrologic
data from several gaging stations in hydrologically homogeneous regions are collected and ana-
lyzed to obtain estimates of the regression parameters. Identification of these hydrologically ho-
mogeneous regions is a vital component in any regional frequency analysis. One method used to
identify these regions is a multivariate statistical procedure known as cluster analysis.
Cluster analysis is a method used to group objects with similar characteristics. Two clustering
methods are used for this purpose. The first type of procedures is known as hierarchical methods, and
they attempt to group objects by a series of successive mergers. The most similar objects are first
grouped and as the similarity decreases, all subgroups are progressively merged into a single cluster.
The second type of procedures is collectively referred to as nonhierarchical clustering techniques
and, if required, can be used to group objects into a specified number of clusters. The clustering
process starts from an initial set of seed points, which will form the nuclei of the final clusters.
The most commonly used similarity measure in cluster analysis is the Euclidean distance,
defined by:

$$D_{ij} = \left[ \sum_{k=1}^{p} (z_{ik} - z_{jk})^2 \right]^{1/2} \qquad (12.47)$$

where $D_{ij}$ is the Euclidean distance from site i to site j, p is the number of variables included in
the computation of the distance (i.e., the basin and climatic variables) and $z_{ik}$ is a standardized
value for variable k at site i.
In many applications the variables describing the objects to be clustered (discharges, water-
shed areas, stream lengths, etc.) will not be measured in the same units. It is reasonable to assume
that it would not be sensible to treat, say, discharge measured in cubic meters per second, area in
square kilometers, and stream length in kilometers as equivalent in determining a measure of
similarity. The solution suggested most often is to standardize each variable to unit variance prior
to analysis. This is done by dividing the variables by the standard deviations calculated from the
complete set of objects to be clustered. The standardization process eliminates the units from
each variable and reduces any differences in the range of values among the variables.
To get a feel for how cluster analysis works, consider six precipitation stations and their
associated annual precipitation in mm:

station 1 2 3 4 5 6
precipitation 1000 1200 600 700 500 1100

It is desired to see if these stations can be grouped into homogeneous groups based on the aver-
age annual precipitation.
The first thing that is done is to standardize the precipitation values. For this set of data, the
mean is 850 and the standard deviation is 288. Table 12.1 contains the data and results. Equation
12.47 is used to calculate $D_{ij}$. For example, $D_{1,2}$ is $\sqrt{(0.52 - 1.21)^2}$, which equals $|0.52 - 1.21|$,
or 0.69. The results for all of the $D_{ij}$ are shown in Section A of table 12.1.
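The standardization and the distance calculation can be reproduced in a few lines. This is a minimal sketch using the six precipitation values given above; the dictionary layout and function names are illustrative choices.

```python
import math

# Annual precipitation (mm) at the six stations, from the text.
precip = {1: 1000, 2: 1200, 3: 600, 4: 700, 5: 500, 6: 1100}

n = len(precip)
mean = sum(precip.values()) / n
# Sample standard deviation (divisor n - 1), which reproduces the 288 in the text.
sd = math.sqrt(sum((v - mean) ** 2 for v in precip.values()) / (n - 1))
z = {k: (v - mean) / sd for k, v in precip.items()}

def distance(i, j):
    """Euclidean distance between stations i and j for a single variable."""
    return math.sqrt((z[i] - z[j]) ** 2)

# Upper triangle of the distance matrix (Section A of table 12.1).
D = {(i, j): distance(i, j) for i in precip for j in precip if i < j}

print(f"mean = {mean:.0f}, st dev = {sd:.0f}")
for pair in sorted(D):
    print(pair, f"{D[pair]:.2f}")
```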
The next step is to find the minimum value of the similarity measure, $D_{ij}$. This value is seen
to be 0.35. The value 0.35 appears several times. The pair (3, 4) was arbitrarily chosen as the first
similar pair. Section B of table 12.1 contains the $D_{ij}$ values from Section A except for the (3, 4)
row. This row contains the minimum of $D_{3,j}$ and $D_{4,j}$ for j = 1, 2, 5, and 6. For example, $D_{3,1}$ is
1.39 and $D_{4,1}$ is 1.04. Therefore, the (3, 4), 1 entry in Section B is 1.04. Other values in the (3, 4)
row are similarly determined.
Again, the minimum entry in Section B is found to be 0.35 corresponding to the (1,6) pair. Thus
(1,6) is clustered as in Section C and entries for Section C are determined from Section B in the same
manner as entries in Section B were determined from Section A. The next step results in (1,6) and 2
being clustered to form (1, 2, 6). This is followed by (3, 4) being clustered with 5 to form (3,4, 5).
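The merging procedure just described, in which the distance from a merged cluster to another cluster is the minimum of the member distances, can be sketched as follows. The function name `single_linkage` and the tie-breaking rule (the first minimum found) are assumptions of this sketch; the text chooses among tied pairs arbitrarily.

```python
import math

def standardize(values):
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def single_linkage(z, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    z maps a station label to its standardized value (one variable).
    The distance between two clusters is the minimum distance between
    any member of one and any member of the other.
    """
    clusters = [frozenset([k]) for k in z]
    history = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(z[i] - z[j]) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append((round(d, 2), sorted(merged)))
    return clusters, history

stations = [1, 2, 3, 4, 5, 6]
precip = [1000, 1200, 600, 700, 500, 1100]      # mm, table 12.1
z = dict(zip(stations, standardize(precip)))

clusters, history = single_linkage(z, n_clusters=2)
final = sorted(sorted(c) for c in clusters)
print(final)
for step in history:
    print(step)
```

The order of the early merges depends on how ties are broken, but the two final clusters do not.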
Table 12.2 is similar to table 12.1 except that the value of precipitation for the third station
is changed from 600 to 1050 mm. Carrying through the analysis as was done for table 12.1 re-
sults in forming the clusters (4,5) and (1, 2, 3,6).
In table 12.3, the third station value is changed to 1800 mm. The cluster results are (1,2,4,
5, 6) and 3. In all of these analyses, the Di, entry is a measure of the similarity that exists. For
Table 12.1. First cluster analysis of rainfall data

Station          1      2      3      4      5      6      Mean   St dev

Precipitation    1000   1200   600    700    500    1100   850    288

z                0.52   1.21   -0.87  -0.52  -1.21  0.87   0      1

Table A

Table B

Table C

Table D

Table E
Table 12.3. Third cluster analysis of rainfall data

Station          1      2      3      4      5      6      Mean   St dev

Precipitation    1000   1200   1800   700    500    1100   1050   451

z                -0.11  0.33   1.66   -0.78  -1.22  0.11   0      1

Table A

Table B

Table C

Table D

Table E
example, in table 12.3, the $D_{ij}$ values of 0.22 indicate strong similarity. The value of 0.44 shows
that stations 4 and 5 are not as similar as are stations 1, 2, and 6. The value 0.67 shows that the
clusters (4, 5) and (1, 2, 6) are less similar than either 4 and 5 or 1, 2, and 6. Finally, the value 1.33
shows that 3 is not very similar to the cluster (1, 2, 4, 5, 6).
Clustering may stop when there is a significant jump in the similarity measure. In table 12.3 one
might conclude with three clusters, (1, 2, 6), (4, 5), and (3), or with two clusters, (1, 2, 4, 5, 6) and (3).
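One way to look for such a jump is to record the distance at which each merger takes place. The sketch below does this for the table 12.3 precipitation values, again taking the minimum member distance as the distance between clusters; the list name `heights` is illustrative.

```python
import math

stations = [1, 2, 3, 4, 5, 6]
precip = [1000, 1200, 1800, 700, 500, 1100]     # mm, table 12.3

n = len(precip)
mean = sum(precip) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in precip) / (n - 1))
z = {s: (v - mean) / sd for s, v in zip(stations, precip)}

clusters = [{s} for s in stations]
heights = []                                    # distance at each successive merger
while len(clusters) > 1:
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(abs(z[i] - z[j]) for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    d, a, b = best
    clusters[a] = clusters[a] | clusters[b]
    del clusters[b]
    heights.append(round(d, 2))

print(heights)
```

The merger distances are the values 0.22, 0.44, 0.67, and 1.33 discussed above; the largest increase occurs at the last merger, which joins station 3 to the rest.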
Table 12.4 extends the analysis to consideration of two measures of the stations being con-
sidered, precipitation and potential evapotranspiration. Again, Section A was constructed from
equation 12.47. For example, the $D_{1,2}$ entry is calculated from standardized values as
$D_{1,2} = \sqrt{(-0.11 - 0.33)^2 + (-1.21 - 1.21)^2}$ or 2.47. The analysis is completed based on
Section A in the same manner as for tables 12.1-12.3. Here a satisfactory clustering doesn't
appear to exist. It looks as though 2 and 6 might be clustered but possibly the other stations can
not be clustered.
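With two variables the distance has two squared terms. A minimal sketch using the precipitation and PET values listed in table 12.4 (function names are illustrative):

```python
import math

precip = [1000, 1200, 1800, 700, 500, 1100]   # mm, table 12.4
pet = [500, 1200, 600, 700, 1000, 1100]       # potential evapotranspiration, table 12.4

def standardize(values):
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# z[i] holds the standardized (precipitation, PET) pair for station i + 1.
z = list(zip(standardize(precip), standardize(pet)))

def euclidean(zi, zj):
    """Euclidean distance over all p standardized variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(zi, zj)))

d12 = euclidean(z[0], z[1])
print(f"D(1,2) = {d12:.2f}")
```

Carrying full precision in the standardized values gives 2.47, the value reported in the text; the two-decimal values shown in the table give a slightly smaller result.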
Table 12.5 is based on the ratio of precipitation to potential evapotranspiration. Using this
system measure, 2, 4, and 6 certainly form a cluster. Depending on the purpose of the analysis,
one might conclude that (1, 3) and (2,4,5, 6) represent the final clustering.

Exercises

12.1. Calculate the correlation matrix for the first two variables contained in the tables of exer-
cise 10.8.

12.2. Calculate the characteristic values and characteristic vectors associated with the correla-
tion matrix of exercise 12.1.

12.3. Compute the numerical values of the principal components of the data in the first two
columns of the table in exercise 10.8 (based on the correlation matrix).

12.4. (a) Work exercise 12.1 using the first three variables. (b) Work exercise 12.2 based on the
first three variables. (c) Work exercise 12.3 based on the first three variables.

12.5. (a) Work exercises 12.1, 12.2, and 12.3 based on the covariance matrix. (b) Work exercise
12.4 based on the covariance matrix.

12.6. Work exercise 12.4 using all of the variables in the table of exercise 10.8 except Q,. (Note:
Don't try this without a computer; life is too short!)

12.7. Calculate the factor loadings for the data of (a) exercise 12.2, (b) exercise 12.4, or (c) ex-
ercise 12.5.

12.8. Show that $Z'Z = (n - 1)D_\lambda$ by using as an example the data of exercise (a) 12.2, (b) 12.4,
or (c) 12.5.
Table 12.4. Fourth cluster analysis of rainfall data

Station          1      2      3      4      5      6      Mean   St dev

Precipitation    1000   1200   1800   700    500    1100   1050   451

z1               -0.11  0.33   1.66   -0.78  -1.22  0.11   0      1
PET              500    1200   600    700    1000   1100   850    288
z2               -1.21  1.21   -0.87  -0.52  0.52   0.87   0      1

Table A

Table B

Table C

Table D

Table E
Table 12.5. Cluster analysis of rainfall-evaporation data

Station 1 2 3 4 5 6 Mean St dev

Ratio
z

Table A

Table B

Table C

Table D

Table E
13. Data Generation
CHAPTER 15 discusses several stochastic models that have been found useful in hydrol-
ogy. Stochastic models contain random components. These random components contain random
elements. If a stochastic model is to be used to generate hydrologic data, methods must be avail-
able for generating the random elements of the models.
A random element is usually thought of as an element selected in a fashion such that each
element in the population has an equal chance of being selected. If the sample results from choos-
ing a number at random from a population of numbers in such a fashion that each number in the
population has an equal chance of being selected, the procedure is equivalent to sampling from a
uniform distribution. More generally, a random element can be selected from any probability
distribution as long as the elements are independent of each other. This chapter first sets forth tech-
niques for generating random samples from probability distributions. Next, a method for generat-
ing a multivariate random sample that preserves the correlations between the variates is presented.
Finally, several possible areas of application for data generation methods are discussed.
In any application of data generation methods, it must be kept firmly in mind that data gen-
eration cannot improve or overcome faulty data. At best, one can generate a set of data having
statistical properties equal to the properties of the sample used in estimating the population
parameters. In addition to this, data generated stochastically is subject to the same sampling
errors as natural data. As a matter of fact, data generation has been widely used to study sampling
errors, an application discussed later in the chapter.

UNIVARIATE DATA GENERATION


A random number is defined as a number selected from a population of numbers in such
a fashion that the probability of the number being in some interval is strictly governed by the
probability density function of the population. If this probability is proportional to the length
of the interval, the pdf is a uniform distribution. A random digit would be one of the numbers
0, 1, ..., 9 selected in such a fashion that any one of these numbers would have an equal probability
of being selected. In a sample of 100 random digits, the expected result (but with a very low
probability of occurrence) would be ten each of the digits 0, 1, ..., 9.
Tables of random numbers are generally available in many statistics books. Computer routines
for generating random numbers are included as a part of the program libraries for most computers.
Care must be exercised when using computer routines in that some generate biased samples.
Many computer routines generate uniformly distributed random numbers in the interval (0, 1).
A uniform random number, Y, in the interval (a, b) can be generated from a uniform random
number in the interval (0, 1), $R_u$, by the relationship $Y = (b - a)R_u + a$.
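The rescaling can be sketched as follows. Python's standard `random` module is used here as the source of (0, 1) uniform numbers; the function name `uniform_ab`, the seed, and the interval are illustrative choices.

```python
import random

def uniform_ab(a, b, rng):
    """Rescale a uniform (0, 1) random number R_u to the interval (a, b)."""
    r_u = rng.random()              # uniform on [0, 1)
    return (b - a) * r_u + a

rng = random.Random(42)             # fixed seed so the run is repeatable
sample = [uniform_ab(10.0, 25.0, rng) for _ in range(1000)]
print(min(sample), max(sample))
```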
Random observations may be generated from probability distributions by making use of the fact
that the cumulative probability function for any continuous variate is uniformly distributed over the
interval 0 to 1. Thus, for any random variable Y with probability density function $p_Y(y)$, the variate

$$P_Y(y) = \int_{-\infty}^{y} p_Y(t)\, dt \qquad (13.1)$$

is uniformly distributed over (0, 1).


A procedure, illustrated in figure 13.1, for generating a random value y from $p_Y(y)$ is

1. Select a random number $R_u$ from a uniform distribution in the interval (0, 1).

2. Set $P_Y(y) = R_u$ in equation 13.1.

3. Solve for y.

Step 3 in this procedure is known as obtaining the inverse transform of the probability
distribution.
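The three steps can be sketched for a distribution whose inverse transform is easy to write down. The density $p_Y(y) = 2y$ on $0 \le y \le 1$ is a hypothetical example chosen for this sketch (it does not come from the text); its cumulative function is $P_Y(y) = y^2$, so solving $y^2 = R_u$ gives $y = \sqrt{R_u}$.

```python
import math
import random

def inverse_cdf(r_u):
    """Inverse transform for P_Y(y) = y**2 on 0 <= y <= 1 (hypothetical example)."""
    return math.sqrt(r_u)

rng = random.Random(1)              # fixed seed so the run is repeatable

# Step 1: select R_u; steps 2 and 3: set P_Y(y) = R_u and solve for y.
sample = [inverse_cdf(rng.random()) for _ in range(20000)]
sample_mean = sum(sample) / len(sample)

# The population mean of this density is 2/3.
print(f"sample mean = {sample_mean:.3f}")
```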

Fig. 13.1. Procedure for generating a random observation from a probability distribution.
As an example, consider the Weibull distribution with

and

Solving for y results in the inverse transform

By substituting $R_u$ for $P_Y(y)$, random values of Y from the 3-parameter Weibull distribution can
be generated from

For some distributions it is not possible to solve equation 13.1 explicitly for y. That is, an
analytic inverse transform cannot be found. The normal and gamma distributions are examples
of this. Fortunately, in the case of the normal distribution, numerically generated tables of
standard random normal deviates are widely available. A standard random normal deviate is a
random observation from a standard normal distribution. Random observations for any normal
distribution can be generated from the relationship

$$y = \mu + \sigma R_N$$

where $R_N$ is a standard random normal deviate and $\mu$ and $\sigma$ are the parameters of the desired normal
distribution of Y. Computer routines are available for generating standard random normal deviates.
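A sketch of this relationship follows, with `random.gauss(0, 1)` from the Python standard library standing in for the routine that supplies standard random normal deviates. The parameter values and the seed are arbitrary illustrations.

```python
import math
import random

def normal_variate(mu, sigma, rng):
    """Shift and scale a standard random normal deviate R_N."""
    r_n = rng.gauss(0.0, 1.0)       # standard random normal deviate
    return mu + sigma * r_n

rng = random.Random(7)              # fixed seed so the run is repeatable
mu, sigma = 850.0, 288.0            # illustrative parameter values
sample = [normal_variate(mu, sigma, rng) for _ in range(20000)]

m = sum(sample) / len(sample)
s = math.sqrt(sum((v - m) ** 2 for v in sample) / (len(sample) - 1))
print(f"sample mean = {m:.1f}, sample st dev = {s:.1f}")
```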
For some distributions, relationships with other distributions can be used in the generating
process. For example, a gamma variate with integer values for $\eta$ has been shown to be
the sum of $\eta$ exponential variates each with parameter $\lambda$. Therefore, gamma variates with
integer values for $\eta$ can be generated by summing $\eta$ values generated from an exponential
distribution.
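A minimal sketch of this summation follows. The exponential variates are generated by the inverse transform $y = -\ln(R_u)/\lambda$, and the parameter values are arbitrary illustrations.

```python
import math
import random

def exponential_variate(lam, rng):
    """Inverse transform for the exponential distribution with parameter lam."""
    r_u = 1.0 - rng.random()        # uniform on (0, 1], so log(0) cannot occur
    return -math.log(r_u) / lam

def gamma_integer(eta, lam, rng):
    """Gamma variate with integer shape eta as the sum of eta exponential variates."""
    return sum(exponential_variate(lam, rng) for _ in range(eta))

rng = random.Random(3)              # fixed seed so the run is repeatable
eta, lam = 3, 2.0
sample = [gamma_integer(eta, lam, rng) for _ in range(20000)]
sample_mean = sum(sample) / len(sample)

# The population mean of the gamma distribution is eta / lam.
print(f"sample mean = {sample_mean:.3f}, eta / lam = {eta / lam:.3f}")
```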
Whittaker (1973) discusses a method for generating random gamma variates with any
shape parameter $\eta$. Because the gamma distribution is closed under addition, a gamma random
variable with any shape parameter can be constructed if one with a shape parameter in the
interval $0 < \eta < 1$ can be constructed. Let $R_{u1}$, $R_{u2}$, and $R_{u3}$ be independent uniform random
variables on (0, 1). Define $S_1$ and $S_2$ by

then if $S_1 + S_2 \le 1$, define Y and Z as $Z = S_1/(S_1 + S_2)$ and



Then Y has a gamma distribution with shape parameter $\eta$ and scale parameter $\lambda$.
This procedure requires the generation of at least 3 uniform random variables. If $S_1 + S_2 > 1$,
then $R_{u1}$ and $R_{u2}$ are rejected and new values generated. The probability that $S_1 + S_2 \le 1$
is given as $\pi\eta(1 - \eta)\,\text{cosec}(\pi\eta)$ and has a minimum of $\pi/4$ at $\eta = 1/2$ and is symmetric about
this value.
To generate a gamma variate with $\eta > 1$, a gamma variate with an integer shape parameter
and one with a shape parameter < 1 can be added as long as the scale parameter, $\lambda$, is held constant. For
example, to generate a gamma random variate with $\eta = 3.6$ and any $\lambda$, a gamma variate with
$\eta = 3$ and $\lambda$ can be added to a gamma variate with $\eta = 0.6$ and $\lambda$.
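The rejection procedure and the addition of an integer-shape variate can be sketched as follows. The expressions for $S_1$, $S_2$, and Y are not legible in this copy of the text, so the sketch assumes the standard construction for this type of generator (often attributed to Jöhnk): $S_1 = R_{u1}^{1/\eta}$, $S_2 = R_{u2}^{1/(1-\eta)}$, and $Y = -Z \ln(R_{u3})/\lambda$. These forms are consistent with the acceptance probability quoted above but should be checked against Whittaker (1973).

```python
import math
import random

def gamma_fractional(eta, lam, rng):
    """Gamma variate with shape 0 < eta < 1 (assumed Johnk-type construction)."""
    while True:
        s1 = rng.random() ** (1.0 / eta)            # assumed form of S1
        s2 = rng.random() ** (1.0 / (1.0 - eta))    # assumed form of S2
        if 0.0 < s1 + s2 <= 1.0:                    # otherwise reject and redraw
            break
    z = s1 / (s1 + s2)
    r_u3 = 1.0 - rng.random()                       # uniform on (0, 1]
    return -z * math.log(r_u3) / lam                # assumed form of Y

def gamma_variate(eta, lam, rng):
    """Add an integer-shape gamma variate and a fractional-shape gamma variate."""
    k = int(eta)
    frac = eta - k
    y = sum(-math.log(1.0 - rng.random()) / lam for _ in range(k))
    if frac > 0.0:
        y += gamma_fractional(frac, lam, rng)
    return y

def acceptance_probability(eta):
    """Probability that S1 + S2 <= 1, as quoted in the text."""
    return math.pi * eta * (1.0 - eta) / math.sin(math.pi * eta)

rng = random.Random(11)             # fixed seed so the run is repeatable
sample = [gamma_variate(3.6, 2.0, rng) for _ in range(20000)]
sample_mean = sum(sample) / len(sample)

print(f"sample mean = {sample_mean:.3f}, eta / lam = {3.6 / 2.0:.3f}")
print(f"acceptance probability at eta = 0.5: {acceptance_probability(0.5):.4f}")
```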
Table 13.1 presents a summary of some analytical methods for generating observations from
selected common probability distributions. The table is modified from Hahn and Shapiro (1967).
Computer routines are available for generating random numbers from many different probability
distributions.
Where analytical inverse transforms cannot be found, numerical procedures can be
employed. One numerical method is to select a random number between 0 and 1 and then
numerically integrate equation 13.1 along the x-axis until the accumulated integral equals the
selected random number. At this point y would be equal to the value of x that had been reached.
A second numerical method, and one that would be faster if a large number of random
observations were needed, would be to numerically integrate equation 13.1 starting at the
extreme left of the distribution. The integration would proceed to the right in small increments
along the x-axis until the accumulated integral was sufficiently close to 1. At each step of the
integration, the value of x and the accumulated integral would be saved in the form of a table. The
generation process would then consist of selecting a random number in the interval (0, 1),
entering the table with this random number considered as an accumulated integral, and finding
the corresponding value of x. This value of x would then be set equal to the desired random
variate y.
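The second numerical method can be sketched as a table of accumulated integrals followed by a lookup. The standard normal density, the integration limits, the step size, the trapezoidal rule, and the linear interpolation between table entries are all choices made for this illustration rather than details given in the text.

```python
import bisect
import math
import random

def build_cdf_table(pdf, lower, upper, step):
    """Integrate pdf from lower to upper, saving x and the accumulated integral."""
    n_steps = int(round((upper - lower) / step))
    xs = [lower + k * step for k in range(n_steps + 1)]
    cum = [0.0]
    for k in range(1, n_steps + 1):
        area = 0.5 * (pdf(xs[k - 1]) + pdf(xs[k])) * step   # trapezoidal rule
        cum.append(cum[-1] + area)
    return xs, cum

def lookup(xs, cum, r_u):
    """Enter the table with r_u as an accumulated integral and return x."""
    k = bisect.bisect_left(cum, r_u)
    if k == 0:
        return xs[0]
    if k >= len(cum):
        return xs[-1]
    frac = (r_u - cum[k - 1]) / (cum[k] - cum[k - 1])
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

xs, cum = build_cdf_table(standard_normal_pdf, -6.0, 6.0, 0.01)

rng = random.Random(5)              # fixed seed so the run is repeatable
sample = [lookup(xs, cum, rng.random()) for _ in range(5)]
print(sample)
print(f"x at accumulated integral 0.975: {lookup(xs, cum, 0.975):.3f}")
```

The table is built once, so each additional observation costs only a search and an interpolation.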

Example 13.1. Generate 22 observations from an exponential distribution with $\lambda = 2$. Plot the
observations on semilogarithmic (probability) paper. Estimate $\lambda$ from the observations.

Solution: The 22 observations are generated from the relationship $y = -\ln(R_u)/\lambda$ where
$R_u$ is a randomly selected number in the interval 0 to 1. The values of Y so generated are
shown below. Figure 13.2 is a plot of the resulting numbers along with the lines describing
the exponential distribution with parameter $\lambda = 2$ and with parameter $\hat{\lambda} = 1.718$ calculated
as $\hat{\lambda} = 1/\bar{y}$.

Comment: This problem illustrates the random variations possible when sampling from a
probability distribution. As the sample size increases, $\hat{\lambda}$ should approach $\lambda$ and the plotted points will
lie more nearly on the line describing the exponential distribution with $\lambda = 2$.
In hydrologic frequency analysis, the data represent a sample from an unknown population.
Thus, uncertainty as to the proper frequency distribution exists as well as uncertainty in the
values for population parameters.
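The calculation in example 13.1 can be repeated with any source of uniform random numbers. In the sketch below the seed is arbitrary, so the 22 generated values (and the resulting estimate) will differ from those in the text; only the final line uses the book's own sum of 12.806.

```python
import math
import random

rng = random.Random(13)             # arbitrary seed; values will differ from the text
lam = 2.0
n = 22

# y = -ln(R_u) / lam, with R_u taken on (0, 1] so that log(0) cannot occur.
sample = [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

# Estimate of lam is the reciprocal of the sample mean.
lam_hat = n / sum(sample)
print(f"estimate from this sample: {lam_hat:.3f}")

# Check of the arithmetic reported in the example: n = 22 and sum of Y = 12.806.
book_lam_hat = 22 / 12.806
print(f"estimate from the tabulated sum: {book_lam_hat:.3f}")
```

Rerunning with different seeds gives a feel for how far the estimate can stray from 2 with only 22 observations.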

Ru Y Rank Plotting Position

Sum = 12.806

Fig. 13.2. Exponentially generated data for example 13.1.


Several exercises at the end of this chapter are designed to help develop a "feel" for the
scatter that can be expected when sampling from various frequency distributions. Problems
dealing with testing distributional assumptions are also included. These problems demonstrate
that for small samples a single set of data is not a reliable indicator of the distribution that gener-
ated the sample.

MULTIVARIATE DATA GENERATION


Multivariate uncorrelated random variables can be generated by repeated application of the
univariate generation techniques using the appropriate probability distributions. If correlations
exist among the variables, specific techniques which preserve the correlations must be used.

Multivariate, Correlated, Normal Random Variables


Consider the vector $\mathbf{Y}$ made up of random variables $Y_i$ that are normally distributed.
Random values for this p-variate, normally distributed, $1 \times p$ vector $\mathbf{Y}$ that preserve the means,
variances, covariances, and correlations between the variables can be generated by using
principal components. Recall from chapter 12 that $\mathbf{Z} = \mathbf{X}\mathbf{A}$ where $\mathbf{Z}$ is an $n \times p$ matrix of n
values for each of p components having a mean of zero and a variance of $\lambda_i$, $\mathbf{X}$ is an $n \times p$ matrix
of n observations on each of p standardized variables (mean zero and variance one), and $\mathbf{A}$ is a
$p \times p$ orthogonal matrix of characteristic vectors of the correlation matrix $\mathbf{R}$.
Since $\mathbf{A}$ is orthogonal we have

$$\mathbf{X} = \mathbf{Z}\mathbf{A}'$$

or, for a single observation, $\mathbf{x} = \mathbf{z}\mathbf{A}'$, where $\mathbf{z}$ is a $1 \times p$ vector consisting of a single value
for each of the p uncorrelated components. The mean and variance of the jth principal component
are 0 and $\lambda_j$, respectively. This equation can be used to generate standardized normally distributed
random variables that preserve their correlation. The components are uncorrelated so that a random
value of $\mathbf{z}$ is generated as $\mathbf{z} = (z_1, z_2, \ldots, z_p)$ where $z_k$ is a random observation from a normal
distribution with mean zero and variance $\lambda_k$. Postmultiplying $\mathbf{z}_j$ by $\mathbf{A}'$ then produces $\mathbf{x}_j$. The n
values for $\mathbf{x}_j$ can be generated by repeating this process n times. Then

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{bmatrix}$$

is an $n \times p$ matrix of standardized normally distributed random variables having the property
that $\mathbf{X}'\mathbf{X} = (n - 1)\mathbf{R}$.
-
Letting $\mathbf{X} = [x_{ij}]$, the matrix $\mathbf{Y} = [y_{ij}]$ can be computed from $y_{ij} = \sigma_j x_{ij} + \mu_j$ where $\mu_j$
and $\sigma_j$ are the desired mean and standard deviation of the jth variable. The resulting $n \times p$ matrix
$\mathbf{Y}$ is made up of n normally distributed random observations on p variables with the mean and
variance of the jth variable being $\mu_j$ and $\sigma_j^2$, respectively, and the correlation between the ith and
jth variable being contained in the matrix $\mathbf{R} = [r_{ij}]$. Because the desired correlations and the
variances are produced, the correct covariances are also produced.
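The principal-component procedure above can be sketched in a few lines of Python. This is a minimal illustration that assumes the third-party NumPy library is available; the function name `generate_correlated_normals`, the seed, and the sample size are hypothetical choices, and the population values are those stated in example 13.2.

```python
import numpy as np

def generate_correlated_normals(n, mu, sigma, R, rng):
    """Return an n x p array of normal variates with means mu, standard
    deviations sigma, and population correlation matrix R."""
    lam, A = np.linalg.eigh(R)             # characteristic values and vectors (columns of A)
    lam = np.clip(lam, 0.0, None)          # guard against tiny negative round-off
    Z = rng.standard_normal((n, len(lam))) * np.sqrt(lam)  # component k has variance lam[k]
    X = Z @ A.T                            # x = z A', standardized correlated variables
    return X * sigma + mu                  # y_ij = sigma_j * x_ij + mu_j

# Population values from example 13.2
mu = np.array([3.173, 16.462, 2.566])
sigma = np.array([2.113, 12.481, 1.150])
R = np.array([[1.0, -0.1713, 0.8958],
              [-0.1713, 1.0, -0.2059],
              [0.8958, -0.2059, 1.0]])

rng = np.random.default_rng(2002)          # arbitrary seed
Y = generate_correlated_normals(20000, mu, sigma, R, rng)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
print(np.round(eigenvalues, 4))
print(np.round(np.corrcoef(Y, rowvar=False), 3))
```

For a finite sample the correlation matrix of the generated values only approximates $\mathbf{R}$; the agreement improves as n increases.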

Multivariate, Correlated, Nonnormal Random Variables


Taylor et al. (1993) present a procedure to generate a matrix $\mathbf{Y}$ of nonnormal, correlated random
variables by first generating a matrix $\mathbf{X}$ of N(0, 1), correlated random variables and then
transforming $\mathbf{X}$ to $\mathbf{P}$ through the cumulative standard normal distribution. $\mathbf{P}$ is a matrix of
probabilities such that $\text{prob}(X \le x_{ij}) = P_{ij}$. The resulting $\mathbf{P}$ is transformed to $\mathbf{Y}$ based on the
desired probability distributions for the $Y_i$'s.
The steps in the procedure are:

1. Assume the data are multivariate normal and generate a sample of the required size of
correlated, multivariate N(0, 1) observations, $x_{ij}$, having the desired correlation structure, $\mathbf{R}_1$.

2. Transform the generated multivariate normal observations to their corresponding cumulative
probability. That is, for each $x_{ij}$, find the corresponding $P_X(x_{ij})$ for the standard normal
distribution.

3. Transform the $P_X(x_{ij})$ to $y_{ij}$ by substituting $P_X(x_{ij})$ into the desired inverse cumulative
distribution for the corresponding $y_{ij}$ to find vectors having the correct pdfs. Note that this last
transformation guarantees that the y's will have the correct pdfs. Because all of the
transformations are nonlinear, the correlation matrix, $\mathbf{R}_2$, for the generated y's will only
approximately preserve the original correlation matrix. In other words, $\mathbf{R}_2$ will not be exactly
equal to $\mathbf{R}_1$.
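The three steps can be illustrated with a short Python sketch. It is restricted to two variables so that the correlated N(0, 1) values can be generated without matrix routines, which is a simplification of this sketch and not part of the procedure of Taylor et al. (1993). The target distributions (exponential and lognormal), the correlation of 0.8, the seed, and all function names are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, N(0, 1)

def correlated_standard_normals(n, r, rng):
    """Step 1 (bivariate case): n pairs of N(0, 1) values with correlation r."""
    pairs = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = r * x1 + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0)
        pairs.append((x1, x2))
    return pairs

def cumulative_probability(x):
    """Step 2: standard normal cumulative probability, kept strictly inside (0, 1)."""
    return min(max(STD_NORMAL.cdf(x), 1e-12), 1.0 - 1e-12)

def to_target_marginals(pairs, exp_mean, ln_mean, ln_sd):
    """Step 3: substitute each probability into the inverse target cdf."""
    out = []
    for x1, x2 in pairs:
        p1, p2 = cumulative_probability(x1), cumulative_probability(x2)
        y1 = -exp_mean * math.log(1.0 - p1)                      # inverse exponential cdf
        y2 = math.exp(ln_mean + ln_sd * STD_NORMAL.inv_cdf(p2))  # inverse lognormal cdf
        out.append((y1, y2))
    return out

def pearson(u, v):
    """Sample product-moment correlation coefficient."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

rng = random.Random(1993)  # arbitrary seed
x_pairs = correlated_standard_normals(20000, 0.8, rng)
y_pairs = to_target_marginals(x_pairs, exp_mean=16.462, ln_mean=0.85083, ln_sd=0.42782)
r1 = pearson([a for a, _ in x_pairs], [b for _, b in x_pairs])
r2 = pearson([a for a, _ in y_pairs], [b for _, b in y_pairs])
print(round(r1, 3), round(r2, 3))
```

Comparing the two printed correlations shows the effect noted in step 3: the correlation of the transformed values differs somewhat from that of the underlying normal values.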

Example 13.2. Generate a sample of 20 observations from the 3-variate normal distribution
having the properties $\mu_1 = 3.173$, $\mu_2 = 16.462$, $\mu_3 = 2.566$, $\sigma_1 = 2.113$, $\sigma_2 = 12.481$, $\sigma_3 =
1.150$, $\rho_{1,2} = -0.1713$, $\rho_{1,3} = 0.8958$, and $\rho_{2,3} = -0.2059$.

Solution: This correlation structure corresponds to the correlation matrix in example 12.2. The
procedure is to first generate 20 observations from a 3-variate standard normal distribution
having the desired correlation structure by 20 applications of equation 13.8. The matrix $\mathbf{A}$ is
contained in example 12.2. A $1 \times 3$ vector $\mathbf{z}$ is generated as $(z_1, z_2, z_3)$ where $z_i$ is a random
observation from a normal distribution with mean 0 and variance $\lambda_i$. The $\lambda_i$ are obtained from
example 12.2 as 1.9692, 0.9273, and 0.1035. The $1 \times 3$ vector $\mathbf{x} = (x_1, x_2, x_3)$ is then computed
from equation 13.8. Finally, a $1 \times 3$ vector $\mathbf{y} = (y_1, y_2, y_3)$ is computed as $y_i = x_i\sigma_i + \mu_i$. This
process is repeated 20 times, generating the required 20 values for $\mathbf{y}$. The following matrix
contains the resulting 20 observations on $\mathbf{Y}$.

The means, standard deviations, and correlations of this $\mathbf{Y}$ are shown below. These statistics
are not the same as the desired population parameters (as expected) because they are based on a
random sample of size 20.
The above procedure was also carried out for samples of 200 and 999 observations with the
results shown below. Again, it should be kept in mind that these results are based on random sam-
ples. A second random sample of the same size would result in different estimates for the
population parameters.

              Means                    Standard deviations      Correlations
n             1      2       3         1      2       3         1,2    1,3    2,3
Population    3.17   16.46   2.57      2.11   12.48   1.15      -.17   .90    -.21
20            3.79   12.83   2.90      1.52   13.14   0.94      -.13   .89     .03
200           3.12   17.10   2.58      2.22   11.39   1.20      -.13   .91    -.15
999           3.13   15.48   2.54      2.08   12.38   1.12      -.20   .90    -.23

Example 13.3. Generate a sample of 20 observations having the properties given in example
13.2 except assume that the distributions of the random variables $X_1$, $X_2$, and $X_3$ are normal,
exponential, and lognormal, respectively. Note that $X_2$ can only be approximately exponential
because $\mu_2$ is not exactly equal to $\sigma_2$.

Solution: The first steps are the same as for example 13.2. The $\mathbf{Y}$ matrix is then transformed
to a cumulative probability matrix $\mathbf{P}$, and $\mathbf{P}$ is transformed to the required $\mathbf{X}$ by finding the
values of $x_{ij}$ corresponding to $p_{ij}$ based on an inverse transformation using the appropriate
probability distribution. For $X_1$ the distribution is normal so $Y_1 = X_1$. For $X_2$ the distribution is
exponential so

$$x_{i2} = -\frac{\ln(1 - p_{i2})}{\lambda}$$

where $\lambda$ is the parameter of the exponential distribution.

For $X_3$, the mean and standard deviation of the logarithms are computed from equations
6.30-6.31. The $\ln(X_3)$ is then obtained as the inverse of the normal distribution, having the
determined logarithmic mean and standard deviation. Finally, $X_3$ is the antilogarithm of the
resulting normally generated value. A part of each of the indicated matrices is shown here:

As an example, consider $p_{1,2}$. This is the probability that $Y_2 < -5.56$ if $Y_2$ is N(16.462,
$12.481^2$). The value of this probability is 0.038829. In general, $p_{ij}$ = probability of $Y_j < y_{ij}$ if $Y_j$
is N($\mu_j$, $\sigma_j^2$). To transform $p_{ij}$ to the actual $x_{ij}$, one finds the value of $x_{ij}$ satisfying prob($X_j < x_{ij}$)
= $p_{ij}$, where the probability is based on the appropriate pdf. For example, $x_{1,2}$ is the value from the
cumulative exponential distribution whose cumulative probability is $p_{1,2}$.

The value of $x_{1,3}$ is generated from a lognormal distribution. Using equations 6.35 and 6.36,
the mean and standard deviation of the logs of $X_3$ are found to be 0.85083 and 0.42782. The value
of the standard normal distribution having a cumulative probability of 0.610928 is 0.281739.
Thus, $x_{1,3}$ is given as

$$x_{1,3} = \exp(0.85083 + 0.42782 \times 0.281739) = 2.64$$
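The numerical values quoted in this example can be checked with the Python standard library. The sketch below assumes that the logarithmic mean and standard deviation follow from the usual moment relations for the lognormal distribution, $\sigma_{\ln}^2 = \ln(1 + \sigma^2/\mu^2)$ and $\mu_{\ln} = \ln\mu - \sigma_{\ln}^2/2$; these relations reproduce the quoted values of 0.85083 and 0.42782, but whether they are written in exactly this form in the equations cited above is an assumption. The variable names are illustrative.

```python
import math
from statistics import NormalDist

# p_{1,2}: probability that Y2 < -5.56 when Y2 is N(16.462, 12.481^2)
p12 = NormalDist(mu=16.462, sigma=12.481).cdf(-5.56)

# Log-space parameters of a lognormal variable with mean 2.566 and
# standard deviation 1.150 (assumed moment relations, see the text above)
mean3, sd3 = 2.566, 1.150
ln_var = math.log(1.0 + (sd3 / mean3) ** 2)
ln_sd = math.sqrt(ln_var)
ln_mean = math.log(mean3) - 0.5 * ln_var

# Standard normal deviate with cumulative probability 0.610928,
# and the resulting lognormal value
z13 = NormalDist().inv_cdf(0.610928)
x13 = math.exp(ln_mean + ln_sd * z13)

print(round(p12, 4), round(ln_mean, 4), round(ln_sd, 4), round(z13, 4), round(x13, 3))
```

Small differences from the quoted figures in the last digits are to be expected because the inputs, such as -5.56, are rounded.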

This procedure was also carried out generating 1000 observations with the results shown
below:

Figure 13.3 shows histograms of the resulting simulations of the 1000 observations. The
histograms indicate the data generally follow the desired distributions.
Fig. 13.3. Frequency histograms for data generated for example 13.3 (panels: normal plot of
variable 1, exponential plot of variable 2, lognormal plot of variable 3).

APPLICATIONS OF DATA GENERATION


Data generation techniques or Monte Carlo simulation have been widely used in hydrology.
These uses range from generating large samples of data from known probability distributions to
studying the probabilistic behavior of complex water resources systems. Chapter 15 treats
stochastic hydrologic models in more detail. The use of simulation in hydrology is certainly not a
recent development. Sudler (1927) generated a 1000-year record of annual runoff values
to develop probability distributions of reservoir capacities. Chow (1964) has indicated that the
risk and uncertainty associated with a proposed investment can be estimated by the use
of multiple sequences of generated data.
Fiering (1966) has discussed the stochastic simulation of water resources systems. In his
paper, he makes the following points:

1. Synthetic hydrologic traces do not provide a mechanism for overcoming biased or faulty data.

2. Simulation is not a substitute for analytical solution.


3. When system simulation appears necessary, it is statistically unjustifiable to rely solely on the
observed sequence of hydrologic events.

McMahon et al. (1972) discuss the use of simulated streamflow in reservoir design. Burges
and Linsley (1971) investigated the influence of the number of traces used in determining the
frequency distribution of reservoir stage. In their study, they generated inflows from both an
annual and a monthly, normal, Markov model. They found that, in general, fewer traces were
required to define the storage distribution when the monthly model was used than when the annual
model was used. They also found that about 1000 traces should be used to determine the storage
frequency distribution when the annual model is used.
Hahn and Shapiro (1967) discuss evaluating system performance by Monte Carlo simula-
tion. Benjamin and Cornell (1970) discuss using simulation to derive the probability distribution
of a random variable that is a function of other random variables. Smart (1973) discusses the use
of simulation to determine relationships between certain parameters of random geomorphological
models. Shreve (1970) used simulation to generate a sample of topologically random channel
networks. Fiering (1961) discusses simulation in reservoir design. Fiering and Jackson (1971)
develop models for simulating streamflow.
A widely used application of data generation has been in the general area of uncertainty,
reliability, and risk analysis. Data generation is used to examine a large number of outcomes or
possibilities from a system from which probabilistic statements can be made. Chapter 16 should
be consulted for more detail on this topic.
The stochastic nature of quantities estimated from stochastic models can be investigated
using data generation techniques. The design of any water resources system is dependent upon
estimates of hydrologic quantities. These estimates are based on some type of stochastic model,
whether it be a flood frequency curve or a comprehensive river basin simulation model. One of
the first steps in developing design estimates is the selection of the stochastic model to be used.
Regardless of what stochastic model is finally selected, the parameters of this model must be
estimated from historical data. Because the parameters are functions of random variables (the his-
torical data), the parameters themselves are random variables. Furthermore, the design estimate that
is arrived at using the model is a random variable because it is dependent on the model parameters.
As an example, consider the design capacity of a reservoir required to meet a given criterion.
This capacity might be determined based on an available historical streamflow record. If a
different historical streamflow record were available and were used to determine the required
capacity, the estimate based on this record would differ from the estimate based on the
original historical record. The estimated design would be a random variable because it is a
function of the available streamflow record and streamflow is a random variable. Intuitively, if two
extremely long streamflow records were used, one would expect less difference in the estimated
reservoir capacity than if two short streamflow records were used. Furthermore, one would expect
the estimated capacity based on the long record to more closely approximate the "true" capacity
than the estimate based on a short record.
In general, the variance of a parameter estimate is a decreasing function of the sample size.
The larger the sample, the smaller the variance of the parameter estimate. This, in turn, implies
that the variance of the design estimate will decrease as the sample size increases. The difference
between a design estimate and its true population value may be thought of as a prediction error.
A general procedure for determining the probability distribution of prediction errors as a
function of sample size is presented in Haan (1972b). The procedure assumes that the correct
stochastic model is being employed. The procedure is as follows:

1. Estimate the parameters of the stochastic model and assume these estimates are equal to the
population values.

2. Simulate k independent sets of data of the type being studied with the model using the as-
sumed population parameters. Each set of data consists of n observations or years of record.

3. Reestimate the parameters of the model being used from the n simulated observations for each
of the k data sets. This results in k parameter sets.

4. Estimate the desired quantity, Q (mean annual runoff, 50-year peak flow, 90-day low flow,
etc.) with the model using each of the k parameter sets. This will result in k estimates for Q.

5. Look at the probability distribution, $P_Q(q)$, of the k estimates for Q and determine the
probability of an individual estimate being outside some acceptable limits. If $Q^*$ represents
the estimate of Q and $Q_l$ and $Q_u$ are the lower and upper limits, then the probability that $Q^*$
will be outside the desired interval is given by

$$\text{prob}(Q^* < Q_l \text{ or } Q^* > Q_u) = 1 - [P_Q(Q_u) - P_Q(Q_l)]$$

6. Repeat steps 2 through 5 for various values of n, the record length.

7. Select the record length that gives an acceptable probability (sufficiently low) of $Q^*$ falling
outside the interval $Q_l$ to $Q_u$.
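The steps above can be sketched for a deliberately simple case. The Python example below assumes that annual runoff is independent and normally distributed, that the quantity Q is the mean annual runoff, and that the limits are symmetric about the assumed population mean. The standard deviation of 100 mm, the number of simulated records, the seed, and the function name are hypothetical, and the probability is taken as the observed fraction of the k estimates outside the limits, not from a fitted distribution.

```python
import random
from statistics import fmean

def prob_outside(n, k, mu, sigma, d, rng):
    """Fraction of k simulated records of length n whose estimated mean
    misses the assumed population mean mu by more than d."""
    misses = 0
    for _ in range(k):
        record = [rng.gauss(mu, sigma) for _ in range(n)]  # step 2: simulate n years
        q_star = fmean(record)                             # steps 3 and 4: reestimate Q
        if abs(q_star - mu) > d:                           # step 5: outside the limits?
            misses += 1
    return misses / k

rng = random.Random(13)  # arbitrary seed
for n in (10, 15, 25, 50):                                 # step 6: vary the record length
    print(n, prob_outside(n, 2000, 371.6, 100.0, 50.8, rng))
```

The printed fractions decrease as n increases, which is the behavior used in step 7 to select a record length.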

This procedure can be applied to many types of stochastic models. Haan (1972a) presents an
illustration of its use in conjunction with the Thomas and Fiering (1962) streamflow simulation
model (chapter 15). In this example, a set of population parameters was assumed for the model
and k = 100 sets of observations were generated. Each set of observations consisted of n years of
record. The process was carried out for n = 10, 15, 25, and 50 years. Using each set of
observations and each record length, the parameters of the Thomas and Fiering streamflow model were
estimated, and, from these parameters, the mean annual flow determined by simulation. This
gave 100 estimates for the mean annual flow for each of the 4 record lengths. A probability
distribution (normal distribution) was fit to the 100 estimates for the mean annual flow. The prob-
ability that a single estimate of the mean annual flow would deviate more than a given amount,
d, from the population mean annual runoff of 371.6 mm was evaluated. This entire process was
repeated 3 times, giving a total of 300 simulated traces.
Table 13.2. Probability of error greater than d in mean annual runoff for problem described
by Haan (1972a)

n, in d in millimeters
years 6.40 12.70 25.40 38.10 50.80 76.20 101.60

The results of this analysis, presented in table 13.2, show the expected result that as the
number of years of record increases, the probability of making an error greater than a given
value decreases. For example, for this particular stream, there is a probability of 0.22 of missing
the true mean annual runoff by more than 50.8 mm if 10 years of record are available, whereas
the probability is only 0.03 if 50 years of data are available for parameter estimation.
This procedure for estimating prediction error probabilities requires that the population
parameters for the stochastic model be known. Because this is rarely the case in hydrology, these
parameters must be estimated from all of the available information. Obviously, these estimated
parameters will not equal the population parameters, but when used as population values along
with the above simulation technique, they will yield estimates of error probabilities that can serve
as a guide in determining how much data is needed to ensure an acceptably low probability of
making an unacceptable error with the stochastic model.

Exercises

13.1. Without using a table of standard normal deviates, generate 20 observations from a normal
distribution with a mean of 100 and a variance of 100. What are the mean and variance of the 20
observations?

13.2. Select 100 observations from a normal distribution with mean 0 and variance 1 (use a table
of standard normal deviates). Plot a histogram of these observations. Test the hypothesis that
these are from a normal distribution using the $\chi^2$ test and the Kolmogorov-Smirnov test. Why do
the mean and variance of the data not equal 0 and 1, respectively?

13.3. Generate 20 observations from an exponential distribution with $\lambda = 0.5$. (a) Test the
hypothesis that the observations are normally distributed. (b) Test the hypothesis that the observations
are exponentially distributed.

13.4. Generate independent samples of size 10, 20, 30, 50, and 100 from an N(0, 1). Plot the ob-
servations on probability paper using one plot for each sample. Repeat this entire process 5 times.
Study the resulting probability plots in an attempt to develop a "feel" for the scatter that one can ex-
pect when sampling from a frequency distribution. (This might be undertaken as a class project with
each student working through a sample of 10, 20, 30, 50, and 100. The results can then be shared.)
13.5. Any number of variations of exercise 13.3 can be worked using different initial
distributions, test distributions, parameter values, and sample sizes. Some variations should be
used to assist in developing a "feel" for the scatter present in random samples from frequency
distributions and for the discriminatory power of the chi-square and Kolmogorov-Smirnov tests.

13.6. Write a computer program for generating random observations from a gamma distribution
for integer values of $\eta$. Generate independent sets of size 10, 20, 30, 40, 50, and 100 observations
using $\eta = 2$ and $\lambda = 1.5$. Test the hypothesis that these generated values are from (a) a gamma
distribution, (b) a normal distribution, (c) an exponential distribution.

13.7. Repeat example 13.2 for samples of size 20, 200, and 999. Why are your results not
identical to those of example 13.2?

13.8. Weekly rainfall during a particular week of the year at a weather station is thought to
follow a gamma distribution with $\eta = 2$ and $\lambda = 1.5$. If 25 years of data are available for
estimating the parameters of the gamma distribution, what is the probability that the estimated 50-year
weekly rainfall based on the estimated gamma parameters will be in error by more than
0.5 inches?

13.9. See exercise 5.10.


14. Analysis of
Hydrologic Time Series
THIS IS an introduction to the analysis of time series of hydrologic data. There have been
many books and articles written on the subject of time series and stochastic processes. The
statistical literature contains a wide array of such books covering a wide range of complexity.
Bendat and Piersol (1966, 1971) have prepared very readable books on the analysis of random
data. Books and articles dealing with time series analysis and stochastic models in hydrology
include Yevjevich (1972b,c), Kisiel (1969), Matalas (1966, 1967b), Julian (1967), Salas et al.
(1980), Bras and Rodriguez-Iturbe (1985), Salas (1993), and Clarke (1998). The classic text of
Box and Jenkins (1976) lays the foundation for a particular type of time series analysis and
models now known as Box-Jenkins models. Pankratz (1983) and others have presented
examples, details and clarifications of the Box-Jenkins approach. Many books on time series
analysis are available. In the treatment here considerable reliance has been placed on Cryer
(1986). These references and others contain much more information than is presented here and
should be consulted by those requiring more than an introductory knowledge of time series
analysis. In addition to books and articles, several software packages for personal computers are
available that make time series analysis practical.

DEFINITIONS
A sequence of observations collected over time on a particular variable is a time series. A time
series can be composed of a quantity either observed at discrete times, averaged over a time
interval, or recorded continuously with time. An ensemble of time series is a set of several time series
measuring the same variable. A single time series is called a realization. Thus, an ensemble is
made up of several realizations. A time series may be composed of only deterministic events,
only stochastic events, or a combination of the two. Most generally, a hydrologic time series will
be composed of a stochastic component superimposed on a deterministic component. For
example, the series composed of average daily temperature at some point would contain seasonal
variation (the deterministic component) plus random deviations from the seasonal values (the
stochastic component). The deterministic components may be classified as a periodic component,
a trend, a jump, or a combination of these. Figure 14.1 shows typical stochastic time series with
various types of deterministic components.

Fig. 14.1. Time series containing stochastic and several types of deterministic components:
(a) stochastic, (b) stochastic + trend, (c) stochastic + periodic, (d) stochastic + jump.
Trends in a hydrologic time series can result from gradual natural or human induced
changes in the hydrologic environment producing the time series. Changes in watershed
conditions over a period of several years can result in corresponding changes in streamflow
characteristics that show up as trends in time series of streamflow data. Urbanization on a large
scale may result in changes in precipitation amounts that show up as trends in precipitation
(Huff and Changnon 1973). Climatic changes or shifts may introduce trends into hydrologic
time series.
Jumps in time series may result from catastrophic natural events such as earthquakes or
large forest fires that may quickly and significantly alter the hydrologic regime of an area.
Anthropogenic changes such as the closure of a new dam or the beginning or cessation of
pumping of ground water may also cause jumps in certain hydrologic time series. Astronomic
cycles are generally responsible for periodicities in natural hydrologic time series. Annual cycles
are often apparent in streamflow, precipitation, evapotranspiration, groundwater level, soil
moisture, and other types of hydrologic data. Weekly cycles may be present in water use data such
as industrial, domestic, or irrigation demands. Often the latter time series will contain both
annual and weekly periodicities. Salas-LaCruz and Yevjevich (1972) and Yevjevich (1972c)
discuss periodicities and trends in hydrologic data in more detail.
The time scale of time series may be either discrete or continuous. A discrete time scale
would result from observations at specific times with the times of the observations separated
by $\Delta t$ or from observations that are some function of the values that actually occurred during
$\Delta t$. Most hydrologic time series fall in this latter category. Examples would be the average
monthly flow in a stream ($\Delta t$ = 1 month), annual peak discharge ($\Delta t$ = 1 year), and daily
rainfall ($\Delta t$ = 1 day).
A continuous time scale results when data are recorded continuously with time such as the
stage at a stream gaging location. Even when a continuous time scale is used for collecting the
data, the analysis is usually done by selecting values at specific time intervals. For example,
rain gage charts are usually analyzed by reading the data at selected times (e.g., every 5 minutes) or at
"break points" (here $\Delta t$ is not a constant).
In this chapter it will be assumed that the data are available at discrete times evenly spaced
$\Delta t$ time units apart. Even though the discussion centers around a time scale concept, a distance or
space scale can be used as well. For example, the width of a stream along a certain reach might
be a stochastic process where the width would be the random variable and distance along the
reach the "time".
The random variable described by the time series may be discrete or continuous. A sequence
of 0's and 1's denoting rainless and rainy days would be a discrete stochastic process with a
discrete time scale. The amount of daily rainfall would be a continuous stochastic process with a
discrete time scale ($\Delta t$ = 1 day). Thus a time series may be composed of either discrete or
continuous random variables on discrete or continuous time scales.
A stochastic process can be represented by X(t). The probability density function of X(t) is
denoted by $p_X(x; t)$ which describes the probabilistic behavior of X(t) at the specified time, t. If
the properties of a time series do not change with time, the series is called stationary. For a
stationary series $p_X(x; t_1)$ equals $p_X(x; t_2)$ where $t_1$ and $t_2$ represent any two different possible times.
If $p_X(x; t_1)$ and $p_X(x; t_2)$ are not equal, the series is termed nonstationary. Of the series shown in
figure 14.1, only that given in 14.1a can possibly be stationary. If the deterministic component is
removed from 14.1b, c, and d, they too might be stationary.
The properties of a time series can be obtained based on a single realization over a time in-
terval or based on several realizations at a particular time. The properties based on a time inter-
val of a single realization are known as time average properties. The properties based on several
realizations at a given time are known as the ensemble properties. If the time average properties
and the ensemble properties are the same, the time series is said to be ergodic.
Figure 14.2 shows several possible realizations for a continuous stochastic process on a
continuous time scale. The time average over the time interval 0 to T of the ith realization is given by

$$\bar{X}_i = \frac{1}{T}\int_0^T X_i(t)\,dt \qquad (14.1)$$

Fig. 14.2. Several realizations of a stochastic process.

For a realization on a discrete time scale, the time average would be determined from

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i(t_j) \qquad (14.2)$$

where n is the total number of equally spaced points at which $X_i(t)$ was observed.
The ensemble average at time t is given by

$$\bar{X}(t) = \frac{1}{m}\sum_{i=1}^{m} X_i(t) \qquad (14.3)$$

where m is the number of realizations in the ensemble. As m gets large,

$$\bar{X}(t) \rightarrow E[X(t)] \qquad (14.4)$$

If the process is such that $\bar{X}(t) = \bar{X}(t + \tau)$ for all values of t and $\tau$, the process is said to be
stationary in the mean, or first-order stationary.
The ensemble covariance of X(t) and X(t + $\tau$) is given by

$$\text{Cov}[X(t), X(t+\tau)] = \lim_{m \to \infty} \frac{1}{m}\sum_{i=1}^{m}[X_i(t) - \bar{X}(t)][X_i(t+\tau) - \bar{X}(t+\tau)] \qquad (14.5)$$

If the covariance given by equation 14.5 is independent of t but dependent on $\tau$ (the lag), the
time series is stationary in the covariance. If $\tau = 0$, equation 14.5 gives the variance of the series.
Stationarity in the covariance implies stationarity of the variance. If a series is stationary in the
mean and in the covariance, the series is said to be second-order stationary, or weakly stationary.
If a series is stationary in the covariance but not in the mean, the term weakly stationary or
second-order stationary should not be used. For many hydrologic applications, one is satisfied with
second-order stationarity. If a process is second-order stationary and $p_X(x; t)$ is a normal distribution,
the process can be shown to be stationary.
Bendat and Piersol (1966) state that in actual practice, random data representing stationary
physical phenomena are generally ergodic. For ergodic random processes, the time average
mean, as well as all other time average properties, equals the ensemble averaged value. Thus the
properties of a stationary random phenomenon can be measured properly, in most cases, from a
single observed time history record.
Generally, only one realization of a stochastic process is available. More than one realiza-
tion can be obtained by breaking the single realization into several shorter series. Unfortunately,
most hydrologic records are so short that breaking them into even shorter series may not be prac-
tical. If the statistical properties of the parts of a time series are not significantly different from
one another, the series is said to be self-stationary.
For a single realization the mean is determined from equation 14.1 or 14.2. The covariance
can be determined by

depending on whether the series is on a continuous or discrete time scale.
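For a series on a discrete time scale, the time-average mean and the covariance at a given lag can be computed as in the following Python sketch. The divisor n - k (the number of available products at lag k) is an assumption of this sketch; other texts divide by n, and the form used in the equations referred to above may differ. The function names and the short data series are illustrative only.

```python
def time_average(x):
    """Mean of a single realization observed at equally spaced times."""
    return sum(x) / len(x)

def lag_covariance(x, k):
    """Sample covariance between X(t) and X(t + k) for a lag of k time steps,
    averaging the n - k available products (an assumed choice of divisor)."""
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag must satisfy 0 <= k < len(x)")
    xbar = time_average(x)
    total = sum((x[i] - xbar) * (x[i + k] - xbar) for i in range(n - k))
    return total / (n - k)

series = [2.0, 4.0, 6.0, 4.0, 2.0, 4.0, 6.0, 4.0]  # hypothetical periodic series
print(time_average(series), lag_covariance(series, 0), lag_covariance(series, 2))
```

With k = 0 the function returns the variance of the series about its time-average mean.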


In the remainder of this chapter, it is assumed that the random or stochastic component of
the time series is stationary and ergodic so that the time average properties of the stochastic com-
ponent can be used. This eliminates the need for more than one realization.
Several tools are available for analyzing time series. The first step is to identify and remove
the deterministic components of the series. Next, the stochastic components can be addressed
using regression analysis, Box-Jenkins models, autocorrelation functions, or spectral analysis.

TREND ANALYSIS
A common deterministic component in a time series is a trend. A trend is a tendency for
successive values to be increasing (or decreasing) over time. Changing hydrologic conditions can
introduce trends into a hydrologic time series. Urbanization may contribute to increased peak flows
or runoff volumes. Increased demands on groundwater may result in declining groundwater levels
or declining base flows in streams. Climate change may result in changes over time in rainfall,
temperature, and other climatic variables which in turn may alter streamflows and groundwater levels.
If the data meet the assumptions of regression, simple linear regression can be used to test
for the presence of a linear trend and multiple linear regression may be used to test for trends
having a more complex temporal relationship. If the regression coefficients are significantly
different from zero, a hypothesis of no trend would be rejected.

Fig. 14.3. Annual rainfall at Stillwater, Oklahoma, 1894-1997.
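A test for a linear trend by simple linear regression can be sketched as follows. This Python example regresses a series on the time index t = 1, 2, ..., n and returns the slope, its standard error, and the t statistic, which would then be compared with a tabulated t value having n - 2 degrees of freedom. The function name and the five-value series are hypothetical, and a series that fits a straight line exactly (zero residuals) is not handled.

```python
import math

def trend_regression(x):
    """Regress x on time t = 1, ..., n; return slope, standard error of the
    slope, and the t statistic for testing a slope of zero."""
    n = len(x)
    t = range(1, n + 1)
    t_bar = (n + 1) / 2.0
    x_bar = sum(x) / n
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    s_tx = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, x))
    slope = s_tx / s_tt
    intercept = x_bar - slope * t_bar
    sse = sum((xi - intercept - slope * ti) ** 2 for ti, xi in zip(t, x))
    se_slope = math.sqrt(sse / (n - 2) / s_tt)  # residual variance uses n - 2
    return slope, se_slope, slope / se_slope

slope, se_slope, t_stat = trend_regression([1.0, 3.0, 2.0, 5.0, 4.0])
print(round(slope, 3), round(se_slope, 3), round(t_stat, 3))
```

The regression assumptions, such as normality and lack of serial correlation of the residuals, still need to be checked separately before the t statistic is interpreted.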
Trends in a time series are often difficult to properly interpret if few observations are avail-
able. Even in a stationary time series, apparent trends may be present over short time intervals.
Figure 14.3 shows the annual rainfall at Stillwater, Oklahoma, for the period 1894-1997.
A regression of rainfall on time yields the relationship

where X(t) is the total annual rainfall in year t. The slope of 0.028 has a standard error of 0.0267.
The calculated t statistic for testing the hypothesis that the slope is zero is 1.05, which is clearly
not significant, indicating that the hypothesis cannot be rejected. Figure 14.4 is a probability
plot of the residuals indicating the assumption of normality is valid. The first-order serial
correlation coefficient of the residuals is 0.083, indicating a lack of serial correlation in the residuals.
Thus, the regression assumptions are satisfied and the conclusion of no trends is supported.

Fig. 14.4. Normal probability plot of residuals of time series regression of Stillwater, Oklahoma,
annual rainfall.

Fig. 14.5. Annual rainfall at Stillwater, Oklahoma, 1978-1987.

To illustrate the hazards of using a short hydrologic record to evaluate trends, the data for
Stillwater rainfall from 1978 through 1987, shown in figure 14.5, were investigated. The result-
ing relationship was

The standard error on the slope was 0.489 and the calculated t for testing for zero slope was
3.36, an indication that one must reject the hypothesis of no trend. Figure 14.6 shows a probability
plot of the residuals for this regression. The first-order serial correlation coefficient for the
residuals was 0.03. Again, the assumptions of regression are not violated.

Fig. 14.6. Normal probability plot of residuals of a portion of the Stillwater, Oklahoma, annual
rainfall.

The Stillwater rainfall example illustrates that short periods of apparently nonstationary
data may be embedded in a longer stationary data series. Concluding nonstationarity from the
short series and projecting data either forward or backward in time based on this conclusion can
clearly lead to erroneous projections.
If the data under consideration do not meet the assumptions of regression as set forth in
chapters 9 and 10, conclusions based on regression are approximate with an unknown level of
confidence associated with statistical tests. Helsel and Hirsch (1992), Salas (1993), and others
discuss the use of nonparametric tests for trends. Nonparametric tests do not depend on
distributional assumptions regarding the data and residuals but are generally based on relative ranks of
data points. Conover (1971) presents many nonparametric statistical procedures.
Salas describes the Mann-Kendall nonparametric test for trends in the series X(t) for t = 1,
2, ..., N. Each value in the series X(t') for t' = t + 1, t + 2, ..., N is compared to X(t) and assigned
a score z(k) given by

$$z(k) = \begin{cases} 1 & \text{if } X(t) > X(t') \\ 0 & \text{if } X(t) = X(t') \\ -1 & \text{if } X(t) < X(t') \end{cases} \qquad (14.8)$$

for k running from 1 to N(N - 1)/2. The Mann-Kendall statistic is given by

$$S = \sum_{k=1}^{N(N-1)/2} z(k)$$

The test statistic for $N \ge 10$ is given by

$$u_c = \frac{S + m}{\sqrt{V(S)}}$$

where m = 1 if S < 0 and m = -1 if S > 0. V(S) is given by

$$V(S) = \frac{N(N-1)(2N+5)}{18}$$

if there are few values of z(k) = 0 (ties). In hydrologic data such as rainfall totals, streamflow
rates or volumes, groundwater levels, and so forth, one would expect very few exact ties. If the
data series is on a discrete time scale, ties might be more common. In that event, Salas (1993) or
Helsel and Hirsch (1992) should be consulted.
The hypothesis of no trend is rejected if $|u_c| > z_{1-\alpha/2}$ where z is from the standard normal
distribution and $\alpha$ is the level of significance.
Example 14.1. The annual rainfall for Stillwater for the period 1978-1987 is given below. Use
the Mann-Kendall test for a significant trend in the data.

1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
Sums

Values of z(k) are determined by constructing N - 1 columns of z(k) values with the first value
of z(k) in column j occupying the (j + 1)st position. Thus, in column 3 the first value of z(k)
occupies the 4th position. The value of z(k) is determined by the assignment rules given as
equations 14.8. Consider the third column. By applying equations 14.8, t = 3 and X(3) = 34.03. The
first entry in the third column is in the 4th row and is -1 because t' = t + 1 = 4, and X(3) = 34.03
is less than X(4) = 35.72.
S is simply the sum of all N(N - 1)/2 or 45 values of z(k) and is equal to -25. The value
of m is +1 since S < 0. With N = 10, V(S) = 10(9)(25)/18 = 125 and

$$u_c = \frac{S + m}{\sqrt{V(S)}} = \frac{-25 + 1}{\sqrt{125}} = -2.15$$

For $\alpha$ = 0.10, $z_{1-\alpha/2}$ = 1.64. Because $|u_c| > z_{1-\alpha/2}$, the hypothesis of no trend is rejected.

Comment: This is the same result as obtained using the parametric regression test. For these data
the regression assumptions were satisfied. Most nonparametric tests have been found to be nearly
as powerful as parametric tests when the assumptions of the parametric test are met and more
powerful when they are not met. This leads many to adopt the nonparametric approach if any
doubt concerning distributional assumptions exists.
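
The Mann-Kendall calculation can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `mann_kendall` and the sample values are hypothetical (the rainfall values of example 14.1 are not reproduced here), the scoring follows the sign convention of this section (a score of +1 when the earlier value exceeds the later one, so an upward trend gives a negative S), and the variance expression assumes there are few or no ties.

```python
import math


def mann_kendall(x):
    """Return (S, V(S), u_c) for the series x, assuming few or no ties."""
    n = len(x)
    s = 0
    # Compare each X(t) with every later value X(t').
    for t in range(n - 1):
        for t_prime in range(t + 1, n):
            if x[t] > x[t_prime]:
                s += 1      # earlier value larger: score +1
            elif x[t] < x[t_prime]:
                s -= 1      # earlier value smaller: score -1
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity term m: +1 if S < 0, -1 if S > 0 (0 if S is exactly 0).
    m = 1 if s < 0 else (-1 if s > 0 else 0)
    u_c = (s + m) / math.sqrt(var_s)
    return s, var_s, u_c


# Hypothetical 10-value series with a steady increase.
example_series = [30.1, 30.9, 31.4, 32.0, 33.2, 33.8, 34.5, 35.1, 36.0, 36.7]
s_value, var_value, u_value = mann_kendall(example_series)
# Reject the hypothesis of no trend at alpha = 0.10 if |u_c| exceeds about 1.645.
reject_no_trend = abs(u_value) > 1.645
```

For a real analysis with many ties, a tie-corrected variance (see the references cited above) would be needed in place of the simple expression used here.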

The trend in a set of data may be removed by subtracting the trend line from each data point.
If the data follow the assumptions of linear regression, the detrended data X'(t) would be given by

$$X'(t) = X(t) - (a + bt) \qquad (14.12)$$

Nonparametric estimates of a and b may also be obtained. Helsel and Hirsch (1992) suggest
that the slope may be estimated from

$$\hat{b} = \text{median}\left[\frac{X(t) - X(t')}{t - t'}\right] \qquad (14.13)$$

for t' < t and t' = 1, 2, ..., n - 1; t = 2, 3, ..., n. The intercept is estimated from

$$\hat{a} = X(t)_{med} - \hat{b}\, t_{med} \qquad (14.14)$$

where med indicates the median value.

Example 14.2. Compute the nonparametric estimates for the slope and intercept for the Still-
water rainfall data of example 14.1.

Solution:

Values for a and b are estimated from equations 14.13 and 14.14. The median of the values in the
above table is 1.70667, which is the estimate for b. The median of the values in the column
labeled t is 1982.5 and in the column labeled X(t) is 34.88. Therefore

$$\hat{a} = 34.88 - 1.70667(1982.5) = -3348.59$$

Estimates for X(t) are then given by

$$\hat{X}(t) = -3348.59 + 1.70667t$$

For example, for t = 1980 the estimate is -3348.59 + 1.70667(1980), or 30.61. Figure 14.7
shows the resulting nonparametric regression line. The nonparametric estimate of the detrended
data, X'(t), is again given by equation 14.12 using the nonparametric estimates for the slope and
intercept.
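
The median-of-pairwise-slopes calculation used in this example can be sketched as follows. The function names and the sample data are hypothetical and are not the Stillwater values; the sketch simply forms every slope between pairs of points, takes the median as the slope estimate, and obtains the intercept from the medians of X(t) and t.

```python
from statistics import median


def nonparametric_line(t, x):
    """Return (a, b): median-based intercept and slope for points (t, x)."""
    slopes = []
    for i in range(len(t) - 1):
        for j in range(i + 1, len(t)):
            # Slope between an earlier point i and a later point j.
            slopes.append((x[j] - x[i]) / (t[j] - t[i]))
    b = median(slopes)
    a = median(x) - b * median(t)
    return a, b


def detrend(t, x, a, b):
    """Subtract the fitted line a + b*t from each observation."""
    return [xi - (a + b * ti) for ti, xi in zip(t, x)]


# Hypothetical data for illustration only.
years = [1978, 1979, 1980, 1981, 1982]
values = [28.0, 31.5, 30.2, 33.9, 35.1]
a_hat, b_hat = nonparametric_line(years, values)
residuals = detrend(years, values, a_hat, b_hat)
```

Because the estimate is a median, a single unusually wet or dry year changes the fitted slope much less than it would change a least squares slope.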

Fig. 14.7. Nonparametric regression line on part of the Stillwater, Oklahoma, annual rainfall.

JUMPS
Jumps or abrupt changes in the mean of a time series may be detected using equations 8.15
through 8.17 if the time point at which a jump is suspected is known and the necessary distributional
assumptions (normality) are valid. If the distributional assumptions are not met, these tests become
approximate with an unknown level of significance. The degree of approximation depends on the
severity of the deviation from normality. For highly skewed data, the approximation could be quite
poor. The procedure in making the test is to divide the time series into two subseries at the point of
the suspected jump with $n_1$ and $n_2$ observations in the subseries where $n_1 + n_2 = n$, the total number
of observations. A test of the hypothesis that $\mu_1 = \mu_2$ for the two subseries is then made.
For data that do not meet the assumptions associated with the parametric tests, a
nonparametric test for the hypothesis $\mu_1 = \mu_2$ is available. Conover (1971, 1980) and Salas (1993)
present the Mann-Whitney test for the equality of means.
The entire sample is ranked with $R_i$ being the rank of the ith observation in the series for
i = 1 to n. The quantity S (equation 14.15) and the test statistic T (equation 14.16)
are computed. If $|T| > z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ is the standardized normal z value with probability
$z > z_{1-\alpha/2}$ equal to $\alpha/2$, the hypothesis of equality of means is rejected. Conover (1971, 1980)
should be consulted if the data have many ties or groupings of equal values.
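
A rank-sum test of this kind can be sketched in Python as shown below. This is a sketch under stated assumptions, not necessarily the exact form of equations 14.15 and 14.16: it assumes S is the sum of the ranks of the first subseries and that T is S standardized by its mean $n_1(n+1)/2$ and variance $n_1 n_2 (n+1)/12$, which is the usual large-sample approximation when there are no ties. The function name and data are hypothetical.

```python
import math


def rank_sum_test(first, second):
    """Return (S, T): rank sum of `first` and its normal-approximation statistic."""
    values = list(first) + list(second)
    n1, n2 = len(first), len(second)
    n = n1 + n2
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Group equal values and give each the average of their ranks.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    s = sum(ranks[:n1])
    mean_s = n1 * (n + 1) / 2.0
    var_s = n1 * n2 * (n + 1) / 12.0   # no-ties variance (assumption)
    return s, (s - mean_s) / math.sqrt(var_s)


# Hypothetical flows before and after a suspected jump.
before = [820.0, 1040.0, 760.0, 990.0, 1150.0, 870.0]
after = [410.0, 530.0, 390.0, 610.0, 450.0]
s_stat, t_stat = rank_sum_test(before, after)
reject_equal_means = abs(t_stat) > 1.96   # alpha = 0.05, two-sided
```

With many tied values the variance used here is too large, and a tie-corrected form such as the one given by Conover should be used instead.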

Example 14.3. Below are annual flow data for Beaver Creek in western Oklahoma. The data are
plotted in figure 14.8. It has been hypothesized that after the 28th year, the flow regime has
distinctly changed. Test the hypothesis that the flow for years 1-28 differs from the flow for years
29-46 using (a) a normality assumption and (b) a nonparametric approach.

Fig. 14.8. Beaver Creek annual flow.

Solution:
(a) If a normality assumption is made, the test statistic comes from equation 8.17. The mean
and standard deviation of the first 28 observations are 83,684 and 75,046 and for observations
29-46 they are 33,046 and 31,759, respectively. Using equation 8.17, the t statistic is calculated as

which would indicate the two parts of the records have different means.

Year Flow Rank Year Flow Rank Year Flow Rank


(b) Based on the nonparametric approach, equations 14.15 and 14.16 are used. The sum S
is computed as

and the test statistic is

indicating that the hypothesis of a difference would probably be rejected.


Comment: Which of the two conclusions one would adopt would depend on the distributions of
the annual flows. The flows for the two periods should be examined and a determination made as
to the validity of a normality assumption.

AUTOCORRELATION
One method of characterizing correlation within a time series over time is the
autocorrelation function, $\rho(\tau)$, given by

$$\rho(\tau) = \frac{\mathrm{Cov}(X(t), X(t+\tau))}{\mathrm{Var}(X(t))} \qquad (14.17)$$

For $\tau = 0$ equation 14.17 indicates that $\rho(0) = 1$ because $\mathrm{Cov}(X(t), X(t+\tau)) = \mathrm{Cov}(X(t), X(t)) = \mathrm{Var}(X(t))$.
From figure 14.2 it can be seen that for small values of $\tau$ the covariance term would be
positive because for the most part like signs are being multiplied ($X(t) - \bar{X}$ and $X(t+\tau) - \bar{X}$
have the same sign for small $\tau$). As $\tau$ increases, a point is reached where the covariance, and thus
$\rho(\tau)$, may become negative. Some authors call $\mathrm{Cov}(X(t), X(t+\tau))$ the autocorrelation function.
In keeping with the terminology established earlier in this book, the $\mathrm{Cov}(X(t), X(t+\tau))$ will be
called the autocovariance.
A plot of the autocorrelation function against the lag $\tau$ is called a correlogram. For random
data such as shown in figure 14.1a, the correlogram would appear as in figure 14.9a. In the case
of data containing a cyclic and stochastic component such as shown in figure 14.1c, the
correlogram would be cyclic as in figure 14.9b where p is the period of the cycle.
Correlograms are useful in determining if successive observations are independent. If the
correlogram indicates a correlation between $X(t)$ and $X(t+\tau)$, the observations cannot be
assumed to be independent. The autocorrelation function may thus be said to indicate the
"memory" of a stochastic process. When $\rho(\tau)$ becomes zero, the process is said to have no memory for
what occurred prior to time $t - \tau$. In practice, $\rho(\tau)$ should be zero for large $\tau$ for most random
processes. If $\rho(\tau)$ for large $\tau$ exhibits a pattern that is not zero, it may be an indication of a
deterministic component. For example, if the correlogram appears as in figure 14.9b, it indicates the
data contain a periodic component.

Fig. 14.9. Typical correlograms: (a) Random process. (b) Random process superimposed on a
periodic process.

A hydrologic time series representing a process involving significant storage is likely to
have values at time t + 1 that are correlated with values at the previous time t. The correlation
may extend over several time increments so that X(t + k) is correlated with X(t) k time units
earlier. Daily flows in a stream and daily, monthly, and possibly annual groundwater levels are
examples of hydrologic time series that often exhibit correlation over time. Annual maximum peak
flow is an example of a time series that is unlikely to exhibit correlation over time.
For a discrete time scale, the autocorrelation function becomes $\rho(k)$ where k is the lag or
number of time intervals separating $X(t)$ and $X(t+\tau)$. The relationship between $\tau$ and k is given by

$$\tau = k\,\Delta t$$

where $\Delta t$ is the length of the time interval (e.g., 1 day, 1 month, 1 year, etc.). If $\rho(k) = 0$ for all
$k \ne 0$, the process is said to be a purely random one. This indicates that the observations are
linearly independent of each other. If $\rho(k) \ne 0$ for some $k \ne 0$, the observations k time increments
apart are dependent in a statistical sense and the process is referred to as simply a random one. If
a time series is nonstationary, $\rho(k)$ will not be zero for all $k \ne 0$ because of the deterministic
element, even if the random element is itself a purely random time series (Matalas 1967b).
Unless the deterministic element is removed, one cannot determine to what extent nonzero values
of $\rho(k)$ are affected by the deterministic element.
The population autocorrelation function, $\rho(k)$, may be estimated by r(k), which is given by

$$r(k) = \frac{\sum_{i=1}^{n-k} X_i X_{i+k} - \frac{1}{n-k}\left(\sum_{i=1}^{n-k} X_i\right)\left(\sum_{i=1}^{n-k} X_{i+k}\right)}{\left[\sum_{i=1}^{n-k} X_i^2 - \frac{1}{n-k}\left(\sum_{i=1}^{n-k} X_i\right)^2\right]^{1/2}\left[\sum_{i=1}^{n-k} X_{i+k}^2 - \frac{1}{n-k}\left(\sum_{i=1}^{n-k} X_{i+k}\right)^2\right]^{1/2}}$$

with $X_i = X(t_i)$, $X_{i+k} = X(t_i + k\Delta t)$, and n is the total number of observations. Some authors use
the terminology autocorrelation function for $\rho(k)$ or $\rho(\tau)$ and serial correlation function for r(k)
or $r(\tau)$. This distinction is not made in this text.
For any observed series, it is unlikely that r(k) will be exactly zero. If r(k) differs from zero
by more than is expected by chance, then the observations k time periods apart cannot be assumed
independent. Procedures are available for testing the hypothesis that $\rho(k) = 0$ and for placing
confidence intervals on $\rho(k)$ (see chapter 11). Often, computer programs that generate
correlograms also compute confidence intervals such that if the computed autocorrelation falls outside the
confidence interval for a particular value of k, a hypothesis of $\rho(k) = 0$ would be rejected.
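
The lag-k estimate r(k) is the ordinary correlation coefficient between the first n - k observations and the last n - k observations. A minimal Python sketch of that calculation follows; the function names and the example series are hypothetical, and the code makes no attempt to compute confidence limits.

```python
import math


def serial_correlation(x, k):
    """Estimate r(k) by correlating x[0..n-k-1] with x[k..n-1]."""
    m = len(x) - k                     # number of overlapping pairs, n - k
    lead = x[:m]                       # X_i
    lag = x[k:]                        # X_{i+k}
    sum_lead, sum_lag = sum(lead), sum(lag)
    cross = sum(u * v for u, v in zip(lead, lag))
    numerator = cross - sum_lead * sum_lag / m
    ss_lead = sum(u * u for u in lead) - sum_lead ** 2 / m
    ss_lag = sum(v * v for v in lag) - sum_lag ** 2 / m
    return numerator / math.sqrt(ss_lead * ss_lag)


def correlogram(x, max_lag):
    """Return [r(0), r(1), ..., r(max_lag)]."""
    return [serial_correlation(x, k) for k in range(max_lag + 1)]


# Hypothetical series: a 12-step cycle sampled over 10 cycles.
cycle = [math.cos(2.0 * math.pi * t / 12.0) for t in range(120)]
r_values = correlogram(cycle, 24)
```

For the cyclic series above, r(k) should be strongly negative near lag 6 and strongly positive near lag 12, mirroring the period of the underlying function.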

PERIODICITY
Autocorrelation analyzes a time series in the time domain. It provides information on the
behavior of the series over time, especially with regard to the memory of the process or how the
process at one instance of time is dependent on, or related to, the process at some prior time. An
alternate analytic approach is to examine the series in the frequency domain. With this approach,
an attempt is made to quantify the variability in the series in terms of repeating patterns having
fixed periods or, what is equivalent, fixed frequencies. The variance of the process is partitioned
among all possible frequencies so that the predominant frequencies can be identified. Let
$X_t = X_i = X(t = ih)$ for i = 1 to n. That is, the X's are equally spaced in the time domain.
We can express $X_t$ as a Fourier series

$$X_t = a_0 + \sum_{i=1}^{q}\left(a_i \cos 2\pi f_i t + b_i \sin 2\pi f_i t\right)$$

The maximum value for the number of terms in the series, q, is given by q = (n - 1)/2 if n
is odd and q = n/2 if n is even. The frequency, $f_i = i/n$, represents the ith harmonic of the
fundamental frequency 1/n.
The coefficients a and b can be estimated from

$$a_0 = \frac{1}{n}\sum_{t=1}^{n} X_t = \bar{X}$$

$$a_i = \frac{2}{n}\sum_{t=1}^{n} X_t \cos 2\pi f_i t$$

$$b_i = \frac{2}{n}\sum_{t=1}^{n} X_t \sin 2\pi f_i t$$

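for n odd and i = 1, 2, 3, ..., q. For n even, the same expressions apply for i = 1, 2, ..., q - 1 and

$$a_q = \frac{1}{n}\sum_{t=1}^{n} (-1)^t X_t$$

and

$$b_q = 0$$

The periodogram, $I(f_i)$, is defined as

$$I(f_i) = \frac{n}{2}\left(a_i^2 + b_i^2\right)$$

For a discrete time series, the angular frequency, $\omega_i$, is equal to $2\pi f_i$ or $2\pi i/n$.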
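The variance of $X_t$ is given by

$$\mathrm{Var}(X_t) = \mathrm{Var}\left[a_0 + \sum_{i=1}^{q} a_i \cos\omega_i t + \sum_{i=1}^{q} b_i \sin\omega_i t\right]$$

Because $a_0$ is a constant,

$$\mathrm{Var}(X_t) = \sum_{i=1}^{q} a_i^2\, \mathrm{Var}(\cos\omega_i t) + \sum_{i=1}^{q} b_i^2\, \mathrm{Var}(\sin\omega_i t)$$

By definition, the $\mathrm{Var}(\cos\omega_i t)$ is

$$\mathrm{Var}(\cos\omega_i t) = E[\cos^2\omega_i t] - \left(E[\cos\omega_i t]\right)^2$$

A property of the cos function is that $E(\cos\omega_i t) = 0$. Therefore

$$\mathrm{Var}(\cos\omega_i t) = \begin{cases} 1/2 & i \ne 0,\ i \ne n/2 \\ 1 & \text{otherwise} \end{cases}$$

Similarly,

$$\mathrm{Var}(\sin\omega_i t) = \begin{cases} 1/2 & i \ne 0,\ i \ne n/2 \\ 0 & \text{otherwise} \end{cases}$$

The net result of these manipulations is

$$\mathrm{Var}(X_t) = \frac{1}{2}\sum_{i=1}^{q}\left(a_i^2 + b_i^2\right) \qquad n \text{ odd}$$

$$\mathrm{Var}(X_t) = \frac{1}{2}\sum_{i=1}^{q-1}\left(a_i^2 + b_i^2\right) + a_q^2 \qquad n \text{ even}$$

or, the variance of $X_t$ has been partitioned among the frequencies so that the variance associated with
$f_i$ is $\frac{1}{2}(a_i^2 + b_i^2)$. $I(f_i)$ is n times the variance associated with $f_i$. By definition, $\mathrm{Var}(X_t) = \sigma^2$. Therefore

$$\sum_{i=1}^{q} I(f_i) = n\sigma^2$$

Let $g(f_i) = I(f_i)/(n\sigma^2)$ so that

$$\sum_{i=1}^{q} g(f_i) = 1$$
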
The function $g(f_i)$ is the spectral density function representing the fraction of the variance of
$X_t$ associated with $f_i$. Recall that

Fig. 14.10. $\cos(2\pi k/12)$ (top) and its correlogram (middle) and spectral density (bottom).

Some plot the spectral density function, $g(f_i)$, versus $f_i$ and some plot the periodogram, I(i),
versus i. Because p = 1/f = 1/(i/n) = n/i, the period associated with any i can be easily
determined as n/i. Peaks or spikes in $g(f_i)$ or I(i) indicate frequencies that predominate in determining
the variance of $X_t$.
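
The Fourier coefficients, periodogram, and spectral density described in this section can be computed directly from their definitions. The Python sketch below is illustrative only: the function name is hypothetical, the variance is computed with divisor n so that the spectral density values sum to one, and for n even the last harmonic is handled with the usual alternating-sign coefficient and an intensity equal to n times its variance contribution, which is an assumption about the convention intended here.

```python
import math


def periodogram(x):
    """Return a list of (f_i, I(f_i), g(f_i)) for i = 1, ..., q."""
    n = len(x)
    q = (n - 1) // 2 if n % 2 else n // 2
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n   # divisor n (assumption)
    rows = []
    for i in range(1, q + 1):
        f = i / n
        if n % 2 == 0 and i == q:
            # Last harmonic for n even: cos(pi*t) = (-1)^t and the sine term vanishes.
            a = sum(((-1) ** t) * x[t - 1] for t in range(1, n + 1)) / n
            intensity = n * a * a
        else:
            a = 2.0 / n * sum(x[t - 1] * math.cos(2.0 * math.pi * f * t)
                              for t in range(1, n + 1))
            b = 2.0 / n * sum(x[t - 1] * math.sin(2.0 * math.pi * f * t)
                              for t in range(1, n + 1))
            intensity = n / 2.0 * (a * a + b * b)
        rows.append((f, intensity, intensity / (n * variance)))
    return rows


# A pure 12-step cosine sampled over four full cycles.
signal = [math.cos(2.0 * math.pi * t / 12.0) for t in range(1, 49)]
spectrum = periodogram(signal)
peak_frequency = max(spectrum, key=lambda row: row[2])[0]
```

For this deterministic signal essentially all of the variance should fall at the frequency 1/12, corresponding to a period of n/i = 12 time units.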
Figures 14.10 and 14.11 show some correlograms and spectral density functions for
well-behaved functions. In figure 14.10 the function is

$$X(k) = \cos(2\pi k/12)$$

For this function f is 1/12 cycles per time unit and the wavelength is 12 time units. A software
package, NCSS 2000 (1998), was used to make the calculations used in generating the plots of
figure 14.10. The frequency axis of the periodogram is actually $2\pi/f$ in this plot. Note that for
this deterministic function the correlogram reflects the function exactly.

Fig. 14.11. Sum of 3 cosines (top) and its correlogram (middle) and spectral density (bottom).

The function for figure 14.11 is

$$X(k) = [\cos(2\pi k/6) + \cos(2\pi k/12) + \cos(2\pi k/24)]/3$$

Here again, the correlogram reflects the deterministic function. The three frequencies of 1/6,
1/12, and 1/24 can easily be seen in the periodogram.
Figure 14.12 is a similar analysis of the monthly stream flow on Cave Creek near Lexington,
Kentucky. The correlogram reflects a cycle of 12 months but does not reproduce the flow record. The
maximum correlations of ±0.4 are considerably less than the ±1.0 for the deterministic functions,
reflecting a combination of deterministic and random components in the data. The periodogram
clearly shows the periodic nature of monthly flow at this location with a period of 12 months.

Fig. 14.12. Cave Creek monthly stream flow.

AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS (ARIMA)


Autoregressive integrated moving average models, ARIMA, are often known as Box-Jenkins
models because of the early work of Box and Jenkins on this class of models (Box and Jenkins, 1976).
ARMA models are a subclass of ARIMA time series models. ARMA stands for autoregressive
moving average models. These models make an observation at time t a function of observations
and errors at time t, t - 1, t - 2, and so on. Autoregressive (AR) implies that an observation $z_t$ is
a linear function of $z_{t-1}, z_{t-2}, \ldots$. Moving average (MA) implies that an observation $z_t$ is a linear
function of white noise at t, t - 1, t - 2, .... Several software packages are available for
estimating ARIMA models.
ARMA models assume that the series is stationary in the mean. Nonstationarities such as
periodicities, trends, jumps, and so forth should be removed prior to an ARMA analysis. If there is
a trend or drift in the mean, this can be removed via differencing. For example, if

$$z_t = a + bt \quad\text{and}\quad w_t = z_t - z_{t-1}$$

then

$$w_t = a + bt - a - b(t - 1) = b$$

so that $w_t$ is stationary in the mean. $w_t$ is the first difference of $z_t$.
If $z_t = a + bt + ct^2$ then $w_t = z_t - z_{t-1} = a + bt + ct^2 - a - b(t - 1) - c(t - 1)^2 = b - c + 2ct$,
which is a linear trend. Taking the second difference

$$y_t = w_t - w_{t-1} = (b - c + 2ct) - [b - c + 2c(t - 1)] = 2c$$

Thus $y_t$ is stationary in the mean.
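
The effect of differencing on a trend is easy to check numerically. The following Python sketch uses a hypothetical helper named `difference`; the trend coefficients are arbitrary illustration values.

```python
def difference(z, d=1):
    """Apply first differencing d times: w_t = z_t - z_(t-1)."""
    w = list(z)
    for _ in range(d):
        w = [w[t] - w[t - 1] for t in range(1, len(w))]
    return w


# Linear trend z_t = a + b*t with a = 2, b = 3: first difference is the constant b.
linear = [2 + 3 * t for t in range(10)]
first_diff = difference(linear, 1)

# Quadratic trend z_t = a + b*t + c*t^2 with c = 2: second difference is 2c.
quadratic = [1 + 2 * t + 2 * t * t for t in range(10)]
second_diff = difference(quadratic, 2)
```

Each round of differencing shortens the series by one value, so a series of n observations differenced d times leaves n - d values for model fitting.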


The nth difference is denoted as $\nabla^n z_t$. For a linear trend, $\nabla z_t$ is stationary. For a quadratic
trend, $\nabla^2 z_t$ is stationary. Differencing is indicated by the term "integrated." An autoregressive
integrated moving average model is denoted by ARIMA(p, d, q), where p is the autoregressive
order, d is the order of differencing, and q is the order of the moving average components.
If $z_t$ is ARIMA(p, 1, q) then $\nabla z_t$ is ARIMA(p, 0, q) or ARMA(p, q). If $z_t$ is ARIMA(p, 2, q),
then $\nabla^2 z_t$ is ARIMA(p, 0, q) or ARMA(p, q).
A general ARMA(p, q) model is

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$$

where the $\phi_1 z_{t-1} + \cdots + \phi_p z_{t-p}$ represents the pth order AR and $a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$ represents the qth
order MA. Most hydrologic applications never get any more complex than an ARIMA(2, 1, 2).

Moving Average Processes (MA)


This treatment follows Cryer (1986), chapters 4 and 7, which should be consulted for more
complete coverage. A moving average process of order q, MA(q), is defined as

$$z_t = a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q}$$

where $a_t$ represents an unobserved white noise series. The $a_t$ are identically and independently
distributed random variables (iid rvs) with a mean $\mu_a = 0$ and variance $\sigma_a^2$, and $z_t$ is a stationary
time series with zero mean. A mean term can be added later if necessary.

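MA(1)
A first-order moving average process, MA(1), is given by

$$z_t = a_t - \theta_1 a_{t-1}$$

For notational convenience let $\gamma_k = \mathrm{Cov}(z_t, z_{t-k})$. Some properties of a MA(1) series are:

$$E(z_t) = 0$$

$$\gamma_0 = \mathrm{Var}(z_t) = \sigma_a^2(1 + \theta_1^2)$$

$$\gamma_1 = -\theta_1\sigma_a^2$$

$$\gamma_k = 0 \quad\text{for } k \ge 2$$

$$\rho_1 = \frac{\gamma_1}{\gamma_0} = \frac{-\theta_1}{1 + \theta_1^2}$$

Note that this last equation presents a way of estimating $\theta_1$. We can estimate $\rho_1$ by $r_1$ and
equate

$$r_1 = \frac{-\hat{\theta}_1}{1 + \hat{\theta}_1^2}$$

and solve for $\theta_1$. This is a moment estimator and is not very efficient.

$$\hat{\theta}_1 = \frac{-1 \pm \sqrt{1 - 4r_1^2}}{2r_1}$$

For $\theta_1$ to be real, $4r_1^2$ must be less than 1. This implies that $r_1^2$ must be less than 1/4 or $-1/2 \le r_1 \le 1/2$.
For a MA(1) process, $\rho_1$ must lie between ±1/2. When $\rho_1 = -0.5$, $\theta_1 = 1$. When $\rho_1 = +0.5$, $\theta_1 = -1$.
Therefore, $-1 \le \theta_1 \le 1$. Because of the randomness of a sample, it is possible for $|r_1| > 1/2$ but if that
occurs it brings into question the appropriateness of a MA(1) as a descriptor of the process.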
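
A small Python sketch can make the MA(1) moment estimator concrete. It assumes the sign convention $z_t = a_t - \theta_1 a_{t-1}$ used in this chapter, so that $\rho_1 = -\theta_1/(1 + \theta_1^2)$, and it returns the root of the resulting quadratic that lies between -1 and 1. The function names, the seed, and the choice of standard normal noise are all illustrative assumptions.

```python
import math
import random


def simulate_ma1(theta, n, seed=0):
    """Generate n values of z_t = a_t - theta * a_(t-1) with a_t iid N(0, 1)."""
    rng = random.Random(seed)
    a_previous = rng.gauss(0.0, 1.0)
    z = []
    for _ in range(n):
        a_current = rng.gauss(0.0, 1.0)
        z.append(a_current - theta * a_previous)
        a_previous = a_current
    return z


def ma1_moment_estimate(r1):
    """Solve r1 = -theta / (1 + theta^2) for the root with |theta| <= 1.

    Returns None when |r1| > 1/2, where no real solution exists.
    """
    if r1 == 0.0:
        return 0.0
    discriminant = 1.0 - 4.0 * r1 * r1
    if discriminant < 0.0:
        return None
    return (-1.0 + math.sqrt(discriminant)) / (2.0 * r1)


# A lag-one correlation of -0.4 corresponds to theta = 0.5.
theta_hat = ma1_moment_estimate(-0.4)
series = simulate_ma1(0.5, 100)
```

A return value of `None` plays the role of the warning in the text: a sample $r_1$ outside ±1/2 suggests that a MA(1) may not be an appropriate descriptor of the process.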

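MA(2)
A second-order moving average process is given by

$$z_t = a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2}$$

with the following properties:

$$\gamma_0 = (1 + \theta_1^2 + \theta_2^2)\sigma_a^2$$

$$\gamma_1 = (-\theta_1 + \theta_1\theta_2)\sigma_a^2$$

$$\gamma_2 = -\theta_2\sigma_a^2$$

$$\gamma_k = 0 \quad\text{for } k > 2$$

Moment estimates can be obtained by solving the following equations for $\theta_1$ and $\theta_2$.

$$\rho_1 = \frac{-\theta_1 + \theta_1\theta_2}{1 + \theta_1^2 + \theta_2^2} \qquad \rho_2 = \frac{-\theta_2}{1 + \theta_1^2 + \theta_2^2}$$

Certain restrictions apply to these results.

Once again, the expressions for $\rho_1$ and $\rho_2$ provide a means for estimating $\theta_1$ and $\theta_2$; however,
the two simultaneous equations may be difficult to solve.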

MA(q)
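The general result for $\rho_k$ for a MA(q) can now be written as

$$\rho_k = \frac{-\theta_k + \theta_1\theta_{k+1} + \theta_2\theta_{k+2} + \cdots + \theta_{q-k}\theta_q}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2} \quad\text{for } k = 1, 2, \ldots, q$$

Note the numerator for $\rho_q$ is $-\theta_q$, and $\rho_k = 0$ for k > q. This can be used to identify the order
of a MA(q) process.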

Autoregressive Processes
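An autoregressive process of order p is given by

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t$$

$z_t$ has a mean of zero and $a_t$ is independent of $z_{t-1}, z_{t-2}, \ldots$

AR(1)
A first-order autoregressive process is written as

$$z_t = \phi_1 z_{t-1} + a_t$$

and has the properties that

$$\gamma_0 = \mathrm{Var}(z_t) = \phi_1^2\gamma_0 + \sigma_a^2 \quad\text{or}\quad \gamma_0 = \frac{\sigma_a^2}{1 - \phi_1^2}$$

This last equation shows that $|\phi_1| < 1$ or else $\gamma_0$ would be less than zero, which is not
possible. $\phi_1$ must be inside the unit circle.

$$\gamma_k = E(z_t z_{t-k}) = E(\phi_1 z_{t-1} z_{t-k} + a_t z_{t-k}) = \phi_1\gamma_{k-1}$$

For k = 1

$$\gamma_1 = \phi_1\gamma_0$$

For k = 2

$$\gamma_2 = \phi_1\gamma_1 = \phi_1^2\gamma_0$$

For k = k

$$\gamma_k = \phi_1^k\gamma_0 \quad\text{and}\quad \rho_k = \frac{\gamma_k}{\gamma_0} = \phi_1^k$$

Because $|\phi_1| < 1$, $\rho_k$ exponentially decays toward zero. For $\phi_1$ positive, $\rho_k$ is positive. For $\phi_1$
negative, $\rho_k$ alternates between positive and negative values.
For k = 1, $\phi_1 = \gamma_1/\gamma_0 = \rho_1$ and can be estimated by $r_1$. Another way to estimate $\phi_1$ is
through linear regression using the model

$$z_t = \phi_1 z_{t-1} + a_t$$

which is analogous to $Y = a + bX + \varepsilon$. Because $z_t$ has a mean of 0 and the $\mathrm{Var}(z_t) = \mathrm{Var}(z_{t-1})$,
regression will result in a = 0 and b = $r_1$.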

AR(2)
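A second-order autoregressive process is given by

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + a_t$$

with the property

$$\gamma_k = E(z_t z_{t-k}) = E(\phi_1 z_{t-1} z_{t-k} + \phi_2 z_{t-2} z_{t-k} + a_t z_{t-k})$$

It can be noted that $E(z_{t-1} z_{t-k})$ is the same as $E(z_t z_{t-(k-1)})$, and $E(z_{t-2} z_{t-k})$ is the same as $E(z_t z_{t-(k-2)})$, so
that

$$\gamma_k = \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2}$$

Dividing by $\gamma_0$

$$\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2}$$

With k = 1, $\rho_0 = 1$, and $\rho_{-1} = \rho_1$, $\rho_1$ is given by

$$\rho_1 = \phi_1 + \phi_2\rho_1 \quad\text{or}\quad \rho_1 = \frac{\phi_1}{1 - \phi_2}$$

Successive values of $\rho_k$ can be determined from $\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2}$.
The autocorrelation function can take on many shapes. In all cases, however, $\rho_k$ dies out
exponentially fast as k gets large. This die-out may be strictly exponential or it may be in the form
of a damped sine wave. Again, regression techniques can be used to estimate the $\phi$'s because

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + a_t$$
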
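AR(p)
A general pth order autoregressive process is given by

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t$$

with the property

$$\gamma_k = E(z_t z_{t-k}) = E(\phi_1 z_{t-1} z_{t-k} + \phi_2 z_{t-2} z_{t-k} + \cdots + \phi_p z_{t-p} z_{t-k} + a_t z_{t-k})$$

$$= \phi_1 E(z_{t-1} z_{t-k}) + \phi_2 E(z_{t-2} z_{t-k}) + \cdots + \phi_p E(z_{t-p} z_{t-k}) + E(a_t z_{t-k})$$

$$= \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2} + \cdots + \phi_p\gamma_{k-p} \quad\text{for } k \ge 1.$$

Dividing $\gamma_k$ by $\gamma_0$ with $\rho_0 = 1$ and $\rho_{-k} = \rho_k$ results in

$$\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2} + \cdots + \phi_p\rho_{k-p}$$

This leads to the Yule-Walker equations

$$\rho_1 = \phi_1 + \phi_2\rho_1 + \cdots + \phi_p\rho_{p-1}$$

$$\rho_2 = \phi_1\rho_1 + \phi_2 + \cdots + \phi_p\rho_{p-2}$$

$$\vdots$$

$$\rho_p = \phi_1\rho_{p-1} + \phi_2\rho_{p-2} + \cdots + \phi_p$$

which may be written as

$$\boldsymbol{\rho} = \mathbf{P}\boldsymbol{\phi}$$

where $\mathbf{P}$ is the p by p matrix with elements $\rho_{|i-j|}$, and has solution

$$\boldsymbol{\phi} = \mathbf{P}^{-1}\boldsymbol{\rho}$$

For an AR(1) process, the Yule-Walker solution is $\rho_1 = \phi_1$. For an AR(2) process, the
Yule-Walker relationships are

$$\rho_1 = \phi_1 + \phi_2\rho_1 \qquad \rho_2 = \phi_1\rho_1 + \phi_2$$

Having as their solution

$$\phi_1 = \frac{\rho_1(1 - \rho_2)}{1 - \rho_1^2} \qquad \phi_2 = \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2}$$

To get an estimate of $\phi_1$ and $\phi_2$, the $\rho_1$ and $\rho_2$ are replaced by $r_1$ and $r_2$. In general, the
Yule-Walker equations are solved to estimate the $\phi$'s.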
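
For the AR(2) case the Yule-Walker relationships can be solved in closed form, and the recursion $\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2}$ generates the theoretical autocorrelations. The Python sketch below illustrates both steps; the function names and the parameter values are hypothetical, and in practice $r_1$ and $r_2$ from an observed series would be supplied in place of the population values.

```python
def yule_walker_ar2(r1, r2):
    """Solve the two Yule-Walker equations of an AR(2) for (phi1, phi2)."""
    denominator = 1.0 - r1 * r1
    phi1 = r1 * (1.0 - r2) / denominator
    phi2 = (r2 - r1 * r1) / denominator
    return phi1, phi2


def ar2_autocorrelations(phi1, phi2, max_lag):
    """Return [rho_0, rho_1, ..., rho_max_lag] for a stationary AR(2)."""
    rho = [1.0, phi1 / (1.0 - phi2)]
    for k in range(2, max_lag + 1):
        rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])
    return rho[:max_lag + 1]


# Illustrative parameters: phi1 = 0.5, phi2 = -0.3.
rho_values = ar2_autocorrelations(0.5, -0.3, 10)
# Feeding rho_1 and rho_2 back through Yule-Walker should recover the parameters.
phi1_back, phi2_back = yule_walker_ar2(rho_values[1], rho_values[2])
```

With a negative $\phi_2$ such as the one used here, the computed autocorrelations change sign as the lag increases, which is the damped sine wave behavior mentioned for the AR(2).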
Autoregressive Moving Average Models ARMA(p, q)
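A general ARMA(p, q) model is given by

$$z_t = \phi_1 z_{t-1} + \cdots + \phi_p z_{t-p} + a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$$

where $a_t$ is independent of $z_{t-1}$. Note that $z_t$ is not independent of $a_{t-1}$.

ARMA(1, 1)
An ARMA(1, 1) model is given by

$$z_t = \phi_1 z_{t-1} + a_t - \theta_1 a_{t-1}$$

and has the following properties:

$$\gamma_k = E(z_t z_{t-k}) = E(\phi_1 z_{t-1} z_{t-k} + a_t z_{t-k} - \theta_1 a_{t-1} z_{t-k})$$

$$\gamma_k = \phi_1\gamma_{k-1} \quad\text{for } k > 1$$

$$\gamma_1 = \phi_1\gamma_0 - \theta_1\sigma_a^2 \quad\text{for } k = 1$$

Because $\gamma_k = \phi_1\gamma_{k-1}$ for k > 1,

$$\rho_k = \frac{\gamma_k}{\gamma_0} = \frac{(1 - \theta_1\phi_1)(\phi_1 - \theta_1)}{1 - 2\theta_1\phi_1 + \theta_1^2}\,\phi_1^{k-1} \quad\text{for } k \ge 1$$

$\rho_k$ decays exponentially as k increases. The damping factor is $\phi_1$. The decay starts at $\rho_1$ rather
than $\rho_0 = 1$, as for the AR(1).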
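
The exponential decay of the ARMA(1, 1) autocorrelation function from its starting value $\rho_1$ can be tabulated with a few lines of Python. The sketch assumes the standard ARMA(1, 1) result $\rho_k = (1 - \theta_1\phi_1)(\phi_1 - \theta_1)\phi_1^{k-1}/(1 - 2\theta_1\phi_1 + \theta_1^2)$ for the sign convention $z_t = \phi_1 z_{t-1} + a_t - \theta_1 a_{t-1}$; the function name and parameter values are hypothetical.

```python
def arma11_autocorrelations(phi, theta, max_lag):
    """Return [rho_0, rho_1, ..., rho_max_lag] for an ARMA(1, 1) process."""
    rho1 = (1.0 - theta * phi) * (phi - theta) / (1.0 - 2.0 * theta * phi + theta ** 2)
    # rho_k = rho_1 * phi^(k-1): geometric decay with damping factor phi.
    return [1.0] + [rho1 * phi ** (k - 1) for k in range(1, max_lag + 1)]


# Illustrative parameters only.
rho_arma = arma11_autocorrelations(0.8, 0.3, 5)
# With theta = 0 the result reduces to the AR(1) pattern rho_k = phi^k.
rho_ar1 = arma11_autocorrelations(0.5, 0.0, 3)
```

Comparing the two lists shows the difference noted in the text: for the AR(1) the ratio $\rho_1/\rho_0$ equals $\phi_1$, while for the ARMA(1, 1) the constant ratio $\phi_1$ only applies from $\rho_1$ onward.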

Autoregressive Integrated Moving Average ARIMA(p, d, q)


The models discussed to this point, MA, AR, and ARMA, have been for stationary series. If
a series is not stationary, a stationary one can often be formed by differencing the original series.
Let $w_t = \nabla^d z_t$; $w_t$ is called the dth difference of $z_t$.

first difference: $w_t = \nabla z_t = z_t - z_{t-1}$

second difference: $w_t = \nabla^2 z_t = \nabla z_t - \nabla z_{t-1} = (z_t - z_{t-1}) - (z_{t-1} - z_{t-2}) = z_t - 2z_{t-1} + z_{t-2}$, etc.

In practice, rarely is it necessary to consider d > 2. The purpose of forming $w_t$ is to
attempt to define a stationary time series from a nonstationary one. Thus, if $z_t$ has a linear
appearing trend, the first difference may well be stationary. An ARMA model may then be fit to $w_t$. Such
a model is called an ARIMA(p, 1, q) model. The "1" indicates that an ARMA(p, q) model has been
applied to the first difference of $z_t$.
Some obvious special cases of ARIMA models are:

An ARIMA(p, 1, q) on $z_t$ is an ARMA(p, q) on $w_t = \nabla z_t$.

ARIMA(1, 1, 0)

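Estimate of noise variance $\sigma_a^2$

For AR(p)

$$\hat{\sigma}_a^2 = (1 - \hat{\phi}_1 r_1 - \hat{\phi}_2 r_2 - \cdots - \hat{\phi}_p r_p)s^2$$

For AR(1)

$$\hat{\sigma}_a^2 = (1 - r_1^2)s^2 \quad\text{since } \hat{\phi}_1 = r_1$$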

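For MA(q)

$$\hat{\sigma}_a^2 = \frac{s^2}{1 + \hat{\theta}_1^2 + \hat{\theta}_2^2 + \cdots + \hat{\theta}_q^2}$$

For ARMA(1, 1)

$$\hat{\sigma}_a^2 = \frac{1 - \hat{\phi}_1^2}{1 - 2\hat{\phi}_1\hat{\theta}_1 + \hat{\theta}_1^2}\,s^2$$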

PARAMETER ESTIMATION VIA LEAST SQUARES

AR Models
For all practical purposes, the least squares estimates for the +'s can be obtained by solving
the Yule-Walker equations.

MA Models

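MA(1)
A MA(1) process, $z_t = a_t - \theta a_{t-1}$, may be written as

$$a_t = z_t + \theta a_{t-1} \qquad \text{(Eq A)}$$

Similarly, $a_{t-1} = z_{t-1} + \theta a_{t-2}$, so that

$$a_t = z_t + \theta z_{t-1} + \theta^2 a_{t-2}$$

This process can be continued to get

$$a_t = z_t + \theta z_{t-1} + \theta^2 z_{t-2} + \cdots$$

or

$$z_t = (-\theta z_{t-1} - \theta^2 z_{t-2} - \cdots) + a_t \qquad \text{(Eq B)}$$

Thus, an MA(1) is an AR($\infty$). Equation B is clearly nonlinear in the parameter $\theta$. The
objective is to minimize

$$S(\theta) = \sum_{t=1}^{n} a_t^2$$

From equation A, calculate $a_i$ for i = 1 to n by taking $a_0$ equal to its expected value of zero. Then

$$a_1 = z_1,\quad a_2 = z_2 + \theta a_1,\quad \ldots,\quad a_n = z_n + \theta a_{n-1}$$

The procedure is to search the interval ($-1 < \theta < 1$) for $\theta$ that minimizes $S(\theta)$. Cryer (p. 133)
discusses a Gauss-Newton procedure based on a linear approximation and recursive differentiation.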
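
The search for the least squares estimate of the MA(1) parameter can be sketched as a simple grid search. The code below assumes the recursion $a_t = z_t + \theta a_{t-1}$ with $a_0 = 0$, and it evaluates the sum of squares on an evenly spaced grid inside (-1, 1) rather than using the Gauss-Newton procedure mentioned in the text. The function names, grid size, and sample series are illustrative assumptions.

```python
def conditional_sum_of_squares(theta, z):
    """S(theta) = sum of a_t^2 with a_t = z_t + theta * a_(t-1) and a_0 = 0."""
    a_previous = 0.0
    total = 0.0
    for z_t in z:
        a_t = z_t + theta * a_previous
        total += a_t * a_t
        a_previous = a_t
    return total


def fit_ma1_by_grid(z, steps=199):
    """Return (theta, S) minimizing S(theta) over a grid strictly inside (-1, 1)."""
    best_theta, best_s = None, float("inf")
    for i in range(1, steps + 1):
        theta = -1.0 + 2.0 * i / (steps + 1)
        s = conditional_sum_of_squares(theta, z)
        if s < best_s:
            best_theta, best_s = theta, s
    return best_theta, best_s


# Hypothetical short series for illustration.
sample = [0.4, -1.1, 0.9, 0.2, -0.7, 1.3, -0.5, 0.1, 0.8, -1.0]
theta_ls, s_min = fit_ma1_by_grid(sample)
```

A finer grid, or a one-dimensional minimizer started from the moment estimate, would normally be used once the rough location of the minimum is known.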

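MA(q)
We must simply generalize the above procedure. Now we have a multivariate search

$$a_t = z_t + \theta_1 a_{t-1} + \theta_2 a_{t-2} + \cdots + \theta_q a_{t-q}$$

with $a_0 = a_{-1} = a_{-2} = \cdots = a_{1-q} = 0$. The objective is to find the set of $\theta$'s that minimizes $\sum a_t^2$.

ARMA(1, 1)
Consider $a_t = a_t(\phi_1, \theta_1)$ and minimize $S(\phi_1, \theta_1) = \sum a_t^2$. We can write

$$a_t = z_t - \phi_1 z_{t-1} + \theta_1 a_{t-1}$$

To get $a_1$ we have a start-up problem, namely $z_0$. We could set $z_0 = E(z_t)$. Another way is to
simply start the sum at t = 2 and thus avoid the $z_0$ problem.

ARMA(p, q)

$$a_t = z_t - \phi_1 z_{t-1} - \cdots - \phi_p z_{t-p} + \theta_1 a_{t-1} + \cdots + \theta_q a_{t-q}$$

with $a_p = a_{p-1} = \cdots = a_{p+1-q} = 0$ and minimize $\sum_{t=p+1}^{n} a_t^2$ to get the $\theta$'s and $\phi$'s.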

PARAMETER ESTIMATION VIA MAXIMUM LIKELIHOOD


Cryer (1986) discusses maximum likelihood estimation for the ARIMA parameters. He
states that the large sample properties of the maximum likelihood and least squares estimators are
identical. The large sample properties of maximum likelihood and least squares estimators are:
AR(1)
Exercises

14.1. Let $X(k) = \cos(2\pi k/12) + \varepsilon$ where $\varepsilon$ is an independent random observation from a normal
distribution with a mean of zero and a variance of 0.5. Compute and plot the correlogram and
spectral density function for this process by generating values for X(k) for k = 0 to 200. Compare
the results with figure 14.10.

14.2. Let $X(k) = [\cos(2\pi k/6) + \cos(2\pi k/12) + \cos(2\pi k/24)]/3 + \varepsilon$ where $\varepsilon$ is an independent
random observation from a normal distribution with a mean of zero and a variance of 0.5.
Compute and plot the correlogram and spectral density function for this process by generating values
for X(k) for k = 0 to 200. Compare the results with figure 14.11.

14.3. The following data represent the years of eruptions of the volcano Aso for the period 1229
to 1962. Let k equal the eruption number beginning with 1229 so that k = 1 for 1239, k = 2 for
1240, and so on. Let X(k) be the number of years since the last eruption so that X(1) = 10, X(2)
= 1, X(3) = 25, and so forth. Compute the correlogram and spectral density function for X(k).
Is there any apparent pattern to the eruptions?

Years of eruptions of the volcano Aso for the period 1229 to 1962

Data from Davis (1973).

14.4. The following data represent an unusual phenomenon in that they are observations of a true
time series from the geologic past. The Eocene lake deposits of the Rocky Mountains consist
of thinly laminated dolomitic oil shales hundreds of feet thick. It has been established that the
laminations are varves, or layered deposits caused by seasonal climatic changes in the lake
basins. By measuring the thickness of these laminations, a record of the annual change in the rate
of deposition through the lake's history is obtained (Davis 1973). Compute and plot the
correlogram and spectral density function for these data. Discuss any apparent patterns. The data are
presented column-wise.

Thickness (mm) of successive varves of a section through the Green River Oil Shale

Data from Davis (1973).

14.5. Compute and plot the correlogram and spectral density function for the weekly precipita-
tion at Ashland, Kentucky, for the week of March 17. Discuss any apparent patterns. (Data in
appendix.)

14.6. Work exercise 14.5 using the monthly precipitation for Walnut Gulch near Tombstone,
Arizona. (Data in the appendix.)

14.7. For a MA(1) process with $a_t$ iid N(0, 1) and values of $\theta_1$ = 0.9, 0.4, and -0.9:

(a) Generate 100 values for $z_t$.

(b) Plot $z_t$ vs t.

(c) Compute and plot the autocorrelation function.

(d) Compute and plot the periodogram.

(e) Estimate $\theta_1$.

14.8. For the following MA(2) processes, assume $a_t$ is iid N(0, 1). For each case:

(a) Generate and plot 100 values of $z_t$.

(b) Compute and plot r(k) for k = 1 to 10.

(c) Compute and plot the periodogram.

(d) Based on population $\theta$'s, calculate $\rho_1$ and $\rho_2$. Compare to r(1) and r(2).

14.9. Assume an AR(1) process with $a_t$ iid N(0, 1). For $\phi_1$ = 0.9 and -0.4

(a) Generate and plot 100 values for $z_t$.

(b) Compute and plot r(k) for k = 1 to 10.

(c) Compute and plot the periodogram.

(d) Estimate $\phi_1$ from the generated data sets.

14.10. Assume an AR(2) process with $a_t$ iid N(0, 1) and $\phi_1$ = 0.5 and $\phi_2$ = -0.3

(a) Generate and plot 100 values of $z_t$.

(b) Compute and plot r(k) for k = 1 to 10.

(c) Compute and plot the periodogram.

(d) Estimate the $\phi$'s.


15. Some Stochastic Hydrologic Models
EVERY DESIGN decision requiring hydrologic knowledge is based on a hydrologic
model of some type. This model might be one that gives the peak discharge from a small
watershed as some function of the watershed area, it might be a flood frequency curve, it might
be a comprehensive "deterministic" model capable of generating synthetic streamflow records,
or it might be a stochastic model for generating a time series of hydrologic data. "Determinis-
tic" is used here in the sense that once the model parameters are known, the same inputs to the
model always produce the same outputs. Thus parametric models are included under "deter-
ministic" even though the model parameters may be functions of observed hydrologic records,
and thus be random variables. Ultimately, design decisions must be based on a stochastic
model or a combination of stochastic and deterministic models. This is because any system
must be designed to operate in the future. Deterministic models are not available for generat-
ing future watershed inputs in the form of precipitation, solar radiation, and so forth, nor is it
likely that deterministic models for these inputs will be available in the near future. Stochastic
models must be used for these inputs. If a design is based solely on a historical
record of rainfall or streamflow, the stochastic model employed is simply the historical record
itself. It should be kept in mind that any historical record is but one realization of a stochastic
time series and that future realizations will resemble the historical record only in a statistical
sense even if the process is stationary.
Designers of water resources systems have realized for years that evaluating their designs
using past or historical records provided no guarantee that the design would perform
satisfactorily in the future because future flow sequences will not be the same as past flow sequences.
Typically, historical flow sequences are quite short, generally less than 25 years in length.
Even during the 100-year life of a project, an observed historical flow sequence of 10 to 25
years in length will not repeat itself. In all cases the designer would agree that the worst flood
(or drought) on record is not the worst possible flood (or drought). The use of historical records
alone can only approximate the risk involved. That is, if a design is made based on a historical
record, chances are the design would be adequate if the historical record repeated itself.
However, we know that the historical record will not repeat itself. There is, thus, a certain risk
that the design will be inadequate for the unknown flow sequence that the system will actually
experience. This latter point can be illustrated by considering the design of a facility that might
have a 5-year life, say a small, temporary boat dock. For this design assume that 100 years of
flow records are available. The 100 years of record would provide 20 independent 5-year flow
sequences. A proposed design could then be evaluated on 20 independent flow sequences equal
in length to the design life of the facility. If it is found that in 15 of the sequences, the design
is adequate and inadequate in the remaining 5, then one would estimate that for some future
5-year period there would be a probability of 0.25 (5/20) that the design would be inadequate.
If the design is adopted, a risk of 0.25 exists. A risk of 0.25 may be unacceptable. For this case,
consider that an acceptable risk is 0.05. This means the design should be increased and
reevaluated until it proves inadequate in only 1 of the 20 observed 5-year sequences. Generally, the
design life of a water resources project exceeds the length of available record so that a risk
evaluation using this procedure is not possible.
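
The risk evaluation described above amounts to counting how many design-life-length sequences defeat the design. A minimal Python sketch follows; the function names, the 100-value record, and the adequacy rule (a sequence fails if any annual value exceeds a design capacity) are all hypothetical and serve only to reproduce the 5 failures in 20 sequences discussed in the text.

```python
def split_into_sequences(record, design_life):
    """Split a record into non-overlapping sequences of length design_life."""
    count = len(record) // design_life
    return [record[i * design_life:(i + 1) * design_life] for i in range(count)]


def estimated_risk(record, design_life, is_inadequate):
    """Fraction of sequences for which the design is judged inadequate."""
    sequences = split_into_sequences(record, design_life)
    failures = sum(1 for sequence in sequences if is_inadequate(sequence))
    return failures / len(sequences)


# Hypothetical 100-year record: low flows except for five large events.
annual_peaks = [1.0] * 100
for year_index in (2, 17, 33, 58, 91):
    annual_peaks[year_index] = 10.0

design_capacity = 5.0   # hypothetical design value
risk = estimated_risk(annual_peaks, 5, lambda seq: max(seq) > design_capacity)
```

Raising `design_capacity` and recomputing the risk mirrors the process of increasing and reevaluating the design until the estimated risk falls to an acceptable level.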
In other cases it may be desirable to know the severity of a shortage. For instance, one might
be looking at a system to control the thermal pollution from a power plant. It may be that the
design requirement is to affect the natural water temperature by less than 5°C. During low flows, the
ratio of the volume of heated water discharge to the volume of natural flow may be such that it is
difficult to keep the overall temperature rise to less than 5.5°C. In this case, it would be desirable
to know the magnitude as well as the frequency of failing to meet the design standard since a 6°C
temperature rise would be less damaging than a 15°C temperature rise.
The approach outlined above assumes that there is some probabilistic mechanism underly-
ing the generation of streamflows and that this mechanism is sufficiently stable that it can be
considered stationary. It is also assumed that the sample in hand is a representative one.
An even better approach to determining risk probabilities would be through operations
(analytic or Monte Carlo) on the underlying, exact probability distribution or distributions that
the natural hydrologic process follows. Of course, this type of information is never available and
in practice must be approximated. Even if the exact distribution were known, its parameters would
have to be estimated from an observed record (sample) and would not equal the population parameters.
Thus, to overcome the objections to design evaluation based on a single (and many
times short) flow record, a data generation scheme or stochastic model is needed.
A stochastic model is a probabilistic model having parameters that must be obtained from
observed data. Stochastic streamflow models, for example, do not convert rainfall to runoff
through theoretical or empirical relationships as do deterministic models, but use the information
in past or historical streamflows. Stochastic streamflows are neither historical flows nor predic-
tions of future flows, but they are representative of possible future flows in a statistical sense.
Stochastically generated data can be used in evaluating risk probabilities, provided a "satisfactory"
stochastic model is available.
It should be noted that a stochastic model depends heavily on the assumptions of stationarity
and representativeness. The effects of watershed changes, for example, cannot be evaluated. On the
other hand, deterministic models might be able to simulate a changing hydrology as the basin
changes, but remember the future rainfall problem. Thus, one approach to watershed modeling
is to stochastically generate rainfall and use a deterministic model to convert the rainfall to
streamflow.
In developing a stochastic model, it is assumed that the data are the result of a random
process or one that involves chance. One cannot precisely state what the data values will be at
any particular future time, but one will be able to make statements of probability concerning
future data values. In looking over past data records it is apparent that streamflows are not completely
random with no constraints, but do possess certain recognizable features. For instance, if
the average annual flow has been around 15 inches for a long period of time, it is unlikely that it
will suddenly change to 25 inches unless the watershed is altered in some fashion. If the flows
have tended to be between 10 and 20 inches per year with only an occasional yearly total outside
these limits, the model should not produce a large number of flows outside these limits. Thus, the
model should preserve the overall mean and spread or variance of the data.
Further, it may be noted that there is some degree of persistence in that low flows tend to fol-
low low flows and high flows tend to follow high flows. A streamflow model should retain this
property. From this it can be seen that historical records certainly guide us in model development.
It is not the purpose of this chapter to promote stochastic hydrologic models. Rather, some
of the most prominent models are discussed. There is a very rapidly expanding literature on
stochastic hydrologic models. No attempt is made to cover all of the models currently in the
literature or to discuss all of the features of the models that are covered in the chapter.
In selecting a stochastic model it is important to be able to state what characteristics of the
phenomena being modeled are important and what characteristics are unimportant. For example,
if streamflow is being modeled, the following is a partial listing of the questions that must be
considered.

1. Is it necessary to model the peak flows?

2. Are annual peaks sufficient or will other peaks occurring during the year be important?

3. Is the time during the year when the peak occurs important?

4. Is the sequence or order of occurrence of the peaks important?

5. Is the simultaneous occurrence of a peak flow and some other event important?

6. Is the volume of flow important?

7. Is daily, weekly, monthly, or annual volume of flow required?

8. Is the simultaneous volume and peak during some interval required?


9. Is the seasonality in volumes for durations of less than a year important?

10. Is the dependence of the flow in one time period on the flow in previous time periods
important?

11. Is it sufficient to model the mean flow for a period? Is the variance important too? What
about the skewness?

12. Is the relationship of the flow on one stream and that on nearby streams of concern?

13. Is the time series of flows stationary?

14. Is there evidence of trends or jumps in the flow record? Will there be trends or jumps in the
future? Is it important to model these features?

15. How well do the above properties have to be modeled?

16. What are the quality and quantity of data available for model selection and parameter
estimation?

17. Given the available historic data, can the model parameters be estimated with sufficient
accuracy?

The answers to these and many other questions must be obtained before a model can be
applied to a particular problem. Of course, the answers to these questions depend heavily on the
use that is to be made of the model. Generally, it is desirable to select the simplest model that will
provide the necessary information.
If one is considering the design of a reservoir to be located on a single stream to provide
irrigation water, quite likely a model that is capable of producing synthetic monthly streamflows
could be used. If a more complex model that considers daily flows or peak flows is used, it may
be necessary to sacrifice accuracy of monthly flow simulation to obtain accuracy of daily flows
or peaks. In any case, the more complex model would be more expensive to develop and test, and
certainly more expensive to use in developing long synthetic streamflow traces. On the other
hand, if the proposed reservoir is to also provide flood control benefits, then estimates of flood
peaks and possibly shorter duration flow volumes would be required.
The design of an irrigation reservoir illustrates the case where joint probabilities or joint hy-
drologic time series may be required. It may be that the land to be irrigated and the watershed
supplying water to the reservoir are subjected to similar climatic conditions. Thus, the irrigation
demand may be the greatest at precisely the same time that flows into the reservoir are the low-
est, and vice versa. Neglecting this possible correlation between supply and demand can result in
an under-designed reservoir. If the demand and supply are independent, designing for the highest
demand at a time of the lowest supply may be over-designing, because the joint occurrence of
these events may have a very low probability.
There is no substitute for a thorough knowledge of the problem to be solved and the fea-
tures of the problem that must be reproduced by the simulation model. It is relatively easy to
develop a simulation model for a problem by making unrealistic simplifying assumptions. It is
difficult to develop a model for use in solving a problem as it really exists. It is generally bet-
ter to develop an approximate solution to the real problem than an exact solution to an unreal
problem.
In chapter 14 it was stated that a hydrologic time series may contain trends or jumps. If a
historical record contains trends or jumps and it is desired to use this record for the estimation of
the parameters of a stochastic model, it is necessary to be able to separate the deterministic and
stochastic components of the historical record. Once the deterministic component is removed, the
stochastic component can be used for parameter estimation. The model that is developed may
have to incorporate both deterministic and stochastic components. This is especially apparent in
cases where trends are present. If the trend is expected to continue into the period being modeled,
the trend component must be present in the model that is developed. Trends can generally be
modeled by a polynomial equation of the type

$$T_p(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots$$

where $T_p(t)$ represents the trend in the parameter $p$ as a function of time $t$ and $\beta_0, \beta_1, \beta_2, \ldots$ are
coefficients that may be estimated by multiple regression. The order of the polynomial can also
be tested by determining the highest order trend having a regression coefficient that is significantly
different from zero.
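As a minimal sketch of the idea, the following Python fragment fits only the first-order (linear) trend by least squares and returns the t statistic of the slope. The function name `linear_trend` and the synthetic series are hypothetical; higher-order terms would require a multiple regression, which is not shown here.

```python
import math


def linear_trend(values):
    """Least squares fit of values[t] = b0 + b1*t; returns b0, b1 and the t statistic of b1."""
    n = len(values)
    t = list(range(n))
    t_bar = sum(t) / n
    y_bar = sum(values) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, values))
    b1 = sxy / sxx
    b0 = y_bar - b1 * t_bar
    # Residual sum of squares and standard error of the slope
    sse = sum((yi - (b0 + b1 * ti)) ** 2 for ti, yi in zip(t, values))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    t_stat = b1 / se_b1 if se_b1 > 0 else math.inf
    return b0, b1, t_stat


# Hypothetical series: a trend of 0.5 per year plus an alternating disturbance
series = [10 + 0.5 * i + (1 if i % 2 == 0 else -1) for i in range(20)]
b0, b1, t_stat = linear_trend(series)

# Remove the fitted trend; the remainder is treated as the stochastic component
detrended = [y - (b0 + b1 * i) for i, y in enumerate(series)]
```

The slope would be judged significant if `t_stat` exceeds the tabulated t value for n - 2 degrees of freedom at the chosen significance level.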
Jumps in a hydrologic time series may be identified by computing the mean value of the pa-
rameter of interest during the two time periods on either side of the jump. These two means are
then tested to see if they are significantly different from each other. The exact time at which a
jump occurs cannot be easily identified from the data alone because of the presence of stochastic
variation. In this case, a review of the data gathering procedure and factors affecting the variable
under study should be undertaken in an attempt to identify possible causes for the jump and the
time that these factors became important.
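The comparison of the two means can be sketched as below. This is an assumed form of the test (a two-sample t statistic with separate variances), not necessarily the one the text has in mind, and the split point is taken as known from a review of the record. The function name and data are hypothetical.

```python
import math
from statistics import mean, variance


def jump_t_statistic(series, split):
    """t statistic for the difference in means before and after index `split`."""
    before, after = series[:split], series[split:]
    se = math.sqrt(variance(before) / len(before) + variance(after) / len(after))
    return (mean(after) - mean(before)) / se


# Hypothetical record with a jump of 4 units after the 12th observation
record = [10, 11, 9, 10, 11, 9] * 2 + [14, 15, 13, 14, 15, 13] * 2
t_jump = jump_t_statistic(record, 12)
```

A large absolute value of `t_jump` relative to the tabulated t value suggests that the two means differ significantly.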
Many stochastic models require the estimation of a large number of parameters. Again, the
limited hydrologic data that are available at a point may be inadequate to estimate these parameters.
A regional approach to parameter estimation may help in this situation, provided regional
data are available (Benson and Matalas, 1967; Stedinger et al., 1994; Helsel and Hirsch, 1992; also
chapters 7 and 10).
Two classical stochastic models, the Bernoulli process and the Poisson process, and some
of their potential applications in hydrology, were discussed in chapter 4. The remainder of this
chapter is devoted to other selected models that appear frequently in the hydrologic literature.

PURELY RANDOM STOCHASTIC MODELS


Possibly the simplest stochastic process to model is one in which the events can be assumed to
occur at discrete times with the time between events constant, the events at any one time are
independent of the events at any other time, and the probability distribution of the event is
known. Stochastic generation from a model of this type merely amounts to generating a sample
of random observations from a univariate probability distribution.
This type of model might be appropriate for generating a synthetic record of flood peaks.
Problems with the method are the uncertainty as to the proper probability distribution to use and
the uncertainty in the parameter values of the probability distribution. These two types of
uncertainty exist in all stochastic models to some extent. The larger the sample for estimating the
model parameters and testing the derived model, the smaller these uncertainties will be. Regional data
can also be used in some situations to assist in distribution selection and parameter estimation.
Another slightly more advanced application of a purely random model might be in generat-
ing sequences of point storm rainfall amounts. The time between storms might be modeled as an
independent Poisson or Bernoulli process (Lane and Osborn 1973) and the amount of rain as a
gamma variable. The model could be made more complex by assuming the distribution parame-
ters are a function of the time of year or that the parameters of the gamma distribution depend on
the generated time since the last storm.
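A minimal sketch of such a rainfall generator follows, assuming a daily Bernoulli process for storm occurrence and gamma-distributed storm depths with constant parameters. The function name and all parameter values are hypothetical; `random.gammavariate` takes a shape and a scale parameter.

```python
import random


def generate_daily_rain(n_days, p_storm, shape, scale, seed=None):
    """Daily rainfall depths: a storm occurs with probability p_storm,
    and its depth is drawn from a gamma(shape, scale) distribution."""
    rng = random.Random(seed)
    return [rng.gammavariate(shape, scale) if rng.random() < p_storm else 0.0
            for _ in range(n_days)]


# Hypothetical parameters: 20% of days wet, mean storm depth 0.8 * 0.5 = 0.4 inches
rain = generate_daily_rain(10000, 0.2, 0.8, 0.5, seed=1)
wet = [r for r in rain if r > 0.0]
```

Making `p_storm`, `shape`, or `scale` depend on the day of the year would give the seasonal variant mentioned above.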
Whether or not a process can be considered as a purely random process may be indicated by
its correlogram, or spectral density. If r(k) is not significantly different from zero for k greater
than zero, or if the spectral density function oscillates randomly with no apparent peaks, the
process may be a purely random process. The difficulties in selecting the proper probability den-
sity function and in parameter estimation remain, however.
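The correlogram check can be sketched as follows. The estimator of r(k) used here and the approximate 95% limits of ±1.96/√n for an independent series are assumptions of this sketch; chapter 14 may define them slightly differently.

```python
import math


def serial_correlation(x, k):
    """Sample lag-k serial correlation r(k)."""
    n = len(x)
    xbar = sum(x) / n
    c0 = sum((v - xbar) ** 2 for v in x)
    ck = sum((x[i] - xbar) * (x[i + k] - xbar) for i in range(n - k))
    return ck / c0


def looks_purely_random(x, max_lag=5):
    """True if r(1)..r(max_lag) all lie within approximate 95% limits for independence."""
    limit = 1.96 / math.sqrt(len(x))
    return all(abs(serial_correlation(x, k)) <= limit for k in range(1, max_lag + 1))


# A strongly trending series is clearly not purely random
trending = list(range(100))
print(serial_correlation(trending, 1), looks_purely_random(trending))
```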

FIRST-ORDER MARKOV PROCESS


Many hydrologic time series exhibit significant serial correlation. That is, the value of the
random variable under consideration at one time period is correlated with the values of the random
variable at earlier time periods. The correlation of a random variable X at one time period with its
value k time periods earlier is denoted by $\rho_x(k)$ and is called the kth order serial correlation. If
$\rho_x(k)$ can be approximated by $\rho_x(k) = \rho_x^k(1)$, then the time series of the random variable X might
be modeled by a first-order Markov process. The first-order Markov process might also be used as
a model if serial correlations for lags greater than one are not important. A first-order Markov
process, as defined here, is the same as the first-order autoregressive model of chapter 14.
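A first-order Markov process is defined by

$$X_{i+1} = \mu_x + \rho_x(1)(X_i - \mu_x) + \varepsilon_{i+1} \tag{15.1}$$

where $X_i$ is the value of the process at time $i$, $\mu_x$ is the mean of X, $\rho_x(1)$ is the first-order serial
correlation, and $\varepsilon_{i+1}$ is a random component with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$. This model states
that the value of X in one time period is dependent only on the value of X in the preceding time
period plus a random component. It is also assumed that $\varepsilon_{i+1}$ is independent of $X_i$. The variance
of X is given by $\sigma_x^2$ and can be shown to be related to $\sigma_\varepsilon^2$ by

$$\sigma_\varepsilon^2 = \sigma_x^2[1 - \rho_x^2(1)] \tag{15.2}$$

If the distribution of X is $N(\mu_x, \sigma_x^2)$, then the distribution of $\varepsilon$ is $N(0, \sigma_\varepsilon^2)$. Random values of
$X_{i+1}$ can now be generated by selecting $\varepsilon_{i+1}$ randomly from an $N(0, \sigma_\varepsilon^2)$ distribution. If $t$ is
$N(0, 1)$, then $t\sigma_\varepsilon$ is $N(0, \sigma_\varepsilon^2)$. Thus, a model for generating X's that are $N(\mu_x, \sigma_x^2)$
and follow the first-order Markov model is

$$X_{i+1} = \mu_x + \rho_x(1)(X_i - \mu_x) + t_{i+1}\sigma_x\sqrt{1 - \rho_x^2(1)} \tag{15.3}$$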
The procedure for generating a value for $X_{i+1}$ is to estimate $\mu_x$, $\sigma_x$, and $\rho_x(1)$ by $\bar{X}$, $s_x$, and
$r_x(1)$, respectively, and then select a $t_{i+1}$ at random from an N(0, 1) distribution and calculate $X_{i+1}$
based on $\bar{X}$, $s_x$, $r_x(1)$, and $X_i$. The first value of $X_i$, $X_1$, might be selected at random from an
$N(\mu_x, \sigma_x^2)$ distribution. To eliminate the effect of $X_1$ on the generated sequence, the first 50 or 100 generated
values might be discarded.
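The generation procedure can be sketched in Python as follows. The function name, the default warm-up length of 100 values, and the parameter values in the usage line are illustrative assumptions; the recursion itself is the normal first-order Markov generating equation described above.

```python
import math
import random


def generate_markov(n, mean, std, r1, burn_in=100, seed=None):
    """Generate n normally distributed values following a first-order Markov model."""
    rng = random.Random(seed)
    x = rng.gauss(mean, std)                    # starting value drawn from N(mean, std^2)
    noise_scale = std * math.sqrt(1.0 - r1 ** 2)
    out = []
    for step in range(n + burn_in):
        t = rng.gauss(0.0, 1.0)                 # random N(0, 1) deviate
        x = mean + r1 * (x - mean) + t * noise_scale
        if step >= burn_in:                     # discard the warm-up values
            out.append(x)
    return out


# Hypothetical annual runoff: mean 15 inches, standard deviation 3 inches, r(1) = 0.4
flows = generate_markov(50, 15.0, 3.0, 0.4, seed=1)
```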
Equation 15.3 has been widely used for generating annual runoff from watersheds (Fiering
and Jackson 1971). Since t is N(0, 1), it is possible to generate values of X that are less than zero.
If this occurs it is generally recommended that the negative X be used to generate the next value
for X and then discarded. This procedure will result in a slight bias. If the occurrence of negative
X's is common in the generation process, it may indicate that X is not normally distributed. In
this event, some other distribution of $\varepsilon$ must be used. Equation 15.3 generates normally distributed
X's with a mean of $\mu_x$, a variance of $\sigma_x^2$, and a first-order serial correlation of $\rho_x(1)$. Serial
correlation is common in hydrology and, depending on the use to be made of the model, may be
quite important. Note that if $\rho_x(1) = 0$, equation 15.3 reduces to the independent process of
selecting a random observation from $N(\mu_x, \sigma_x^2)$. On the other hand, if $\rho_x(1) = 1$, equation 15.3
is completely deterministic in that $X_{i+1}$ is completely specified by $X_i$ ($X_{i+1} = X_i$).
For a first-order Markov process, the lag k serial correlation, $\rho_x(k)$, is given by

$$\rho_x(k) = \rho_x^k(1) \tag{15.4}$$

Thus, the correlogram decays exponentially from $\rho_x(0) = 1$ to $\rho_x(\infty) = 0$ according to equation
15.4. If an observed correlogram has this property, the Markov model may be an appropriate
generating model.
Equation 15.3 can be applied to the logarithms of data through the transformation
$Y_i = \ln(X_i)$. The generation model is given by

$$Y_{i+1} = \mu_y + \rho_y(1)(Y_i - \mu_y) + t_{i+1}\sigma_y\sqrt{1 - \rho_y^2(1)} \tag{15.5}$$

where $\mu_y$, $\sigma_y$, and $\rho_y(1)$ refer to the mean, standard deviation, and first-order serial correlation of
the logarithms of the original data. Generation by equation 15.5 preserves the mean, variance,
coefficient of skew, and first-order serial correlation of the logarithms of the original data, but not
of the data itself. Matalas (1967) suggests a procedure for using a first-order Markov model on
the logarithms that preserves the mean, variance, skewness, and first-order serial correlation of the
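original data. The procedure is based on the transformation $Y_i = \ln(X_i - a)$ with the parameters
of equation 15.5 related to the parameters of X through the following equations:

$$\mu_x = a + \exp\left(\frac{\sigma_y^2}{2} + \mu_y\right) \tag{15.6}$$

$$\sigma_x^2 = \exp[2(\sigma_y^2 + \mu_y)] - \exp(\sigma_y^2 + 2\mu_y) \tag{15.7}$$

$$\gamma_x = \frac{\exp(3\sigma_y^2) - 3\exp(\sigma_y^2) + 2}{[\exp(\sigma_y^2) - 1]^{3/2}} \tag{15.8}$$

$$\rho_x(1) = \frac{\exp[\sigma_y^2\rho_y(1)] - 1}{\exp(\sigma_y^2) - 1} \tag{15.9}$$

In these equations, $\mu_x$, $\sigma_x^2$, $\gamma_x$, and $\rho_x(1)$ refer to the mean, variance, coefficient of skew, and
first-order serial correlation of the original data and are estimated by $\bar{X}$, $s_x^2$, $C_{sx}$, and $r_x(1)$,
respectively. The quantities $\mu_y$, $\sigma_y$, $\rho_y(1)$, and $a$ are estimated from equations 15.6-15.9 and then
used in equation 15.5 to generate values for $Y_{i+1}$. $X_{i+1}$ is then calculated from

$$X_{i+1} = \exp(Y_{i+1}) + a \tag{15.10}$$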
The X's generated in this fashion have the same mean, variance, skewness, and first-order serial
correlation as the sample used to estimate $\mu_x$, $\sigma_x^2$, $\gamma_x$, and $\rho_x(1)$.
The procedure that is recommended for estimating $\mu_y$, $\sigma_y$, $\rho_y(1)$, and $a$ is to solve equation
15.8 for $s_y$. Equation 15.9 then yields $r_y(1)$, equation 15.7 yields $\bar{Y}$, and equation 15.6 yields $a$.
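The solution sequence just described can be sketched as follows, assuming the standard three-parameter lognormal moment relations for equations 15.6 to 15.9. With $w = \exp(\sigma_y^2)$ the skewness relation reduces to $\gamma_x = (w + 2)\sqrt{w - 1}$, which is solved here by bisection. The function name and the bisection scheme are choices of this sketch, and it requires a positive skewness.

```python
import math


def matalas_parameters(mean_x, var_x, skew_x, r1_x):
    """Return (mean_y, std_y, r1_y, a) for Y = ln(X - a) from the moments of X."""
    if skew_x <= 0:
        raise ValueError("this sketch requires a positive coefficient of skew")

    def f(w):
        # Skewness relation written in terms of w = exp(var_y)
        return (w + 2.0) * math.sqrt(w - 1.0) - skew_x

    lo, hi = 1.0, 2.0
    while f(hi) < 0:                  # expand the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):              # bisection on the skewness relation
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    w = 0.5 * (lo + hi)
    var_y = math.log(w)
    # Serial correlation of the logarithms (needs 1 + r1_x*(w - 1) > 0)
    r1_y = math.log(1.0 + r1_x * (w - 1.0)) / var_y
    # Variance relation: var_x = exp(2*mean_y) * w * (w - 1)
    mean_y = 0.5 * math.log(var_x / (w * (w - 1.0)))
    # Mean relation gives the lower bound a
    a = mean_x - math.exp(mean_y + 0.5 * var_y)
    return mean_y, math.sqrt(var_y), r1_y, a
```

A round trip (computing the moments of X from known log-space parameters and recovering them) is a convenient check on the implementation.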
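Equation 15.1 can be used to generate X's that are distributed approximately gamma with
mean $\bar{X}$, variance $s_x^2$, and skewness $C_{sx}$ (Thomas and Fiering 1963). The procedure is to define $\gamma_\varepsilon$
as the skewness of the random component, $\varepsilon$. $\gamma_\varepsilon$ is estimated by

$$\gamma_\varepsilon = \frac{[1 - r_x^3(1)]\,C_{sx}}{[1 - r_x^2(1)]^{3/2}} \tag{15.11}$$

Then a random element $\varepsilon_{i+1}$ is defined by

$$\varepsilon_{i+1} = \frac{2}{\gamma_\varepsilon}\left(1 + \frac{\gamma_\varepsilon t_{i+1}}{6} - \frac{\gamma_\varepsilon^2}{36}\right)^3 - \frac{2}{\gamma_\varepsilon} \tag{15.12}$$

where $t_{i+1}$ is a random value from an N(0, 1). $X_{i+1}$ is then generated by

$$X_{i+1} = \bar{X} + r_x(1)(X_i - \bar{X}) + \varepsilon_{i+1}s_x\sqrt{1 - r_x^2(1)} \tag{15.13}$$

with the resulting generated X's being approximately gamma distributed with mean $\bar{X}$, variance
$s_x^2$, first-order serial correlation $r_x(1)$, and skewness $C_{sx}$.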
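A sketch of this approximately gamma generator is given below. It assumes the cube-type transformation of a standard normal deviate described above, with the skewness of the random component obtained from the skewness and serial correlation of X. The function name, warm-up length, and parameter values are hypothetical, and the skewness must be nonzero.

```python
import math
import random


def generate_gamma_markov(n, mean, std, skew, r1, burn_in=100, seed=None):
    """First-order Markov values that are approximately gamma distributed."""
    rng = random.Random(seed)
    # Skewness required of the random component
    g = (1.0 - r1 ** 3) * skew / (1.0 - r1 ** 2) ** 1.5
    x = mean
    out = []
    for step in range(n + burn_in):
        t = rng.gauss(0.0, 1.0)
        # Transform the normal deviate into a skewed deviate with mean 0 and variance near 1
        eps = (2.0 / g) * (1.0 + g * t / 6.0 - g * g / 36.0) ** 3 - 2.0 / g
        x = mean + r1 * (x - mean) + eps * std * math.sqrt(1.0 - r1 ** 2)
        if step >= burn_in:
            out.append(x)
    return out


# Hypothetical statistics: mean 15, standard deviation 5, skewness 1.0, r(1) = 0.3
xs = generate_gamma_markov(50000, 15.0, 5.0, 1.0, 0.3, seed=11)
```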
The first-order Markov model (equation 15.1) is also known as the first-order autoregressive
model because $\rho_x(1)$ is equal to the regression coefficient $\beta$ that would be obtained with a
regression using $X_{i+1}$ as the dependent variable and $X_i$ as the independent variable.

FIRST-ORDER MARKOV PROCESS WITH PERIODICITY


The first-order Markov model of the previous section assumes that the process is stationary
in its first three moments. It is possible to generalize the model so that the periodicity in hydrologic
data is accounted for to some extent. The main application of this generalization has been in generating
monthly streamflow where pronounced seasonality in the monthly flows exists. This annual
cycle is prevalent in many types of hydrologic and climatic data. The periodicity may affect not
only the mean but all of the moments of the data as well as the first-order serial correlation.
To generalize the Markov model, we adopt the notation that the subscript i refers to the year under
consideration and the subscript j refers to the season within the year. Thus, j may run from 1 to 4
if 4 seasons are being considered, 1 to 12 if monthly data are being considered, 1 to 52 for weekly
data, and so on. In general, we will take j to run from 1 to m, the number of seasons in the year.
With this notation, $\mu_{x,j}$ refers to the mean of X in the jth season. $\mu_{x,j}$ is estimated by $\bar{X}_j$
where

$$\bar{X}_j = \frac{1}{n}\sum_{i=1}^{n} X_{i,j} \tag{15.14}$$

with n equal to the number of years of data, and $X_{i,j}$ the data value in the jth season of the ith year.
Similarly, $\sigma_{x,j}^2$ is estimated by $s_{x,j}^2$, $\gamma_{x,j}$ is estimated by $C_{sx,j}$, and $\rho_{x,j}(1)$ is estimated by $r_{x,j}(1)$.
Note that $\rho_{x,j}(1)$ is the first-order serial correlation between values in successive seasons. If
monthly streamflow is being considered, $\rho_{x,4}(1)$ would be the first-order serial correlation
between flows in months 4 and 5. $\rho_{x,j}(1)$ would be estimated by

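$$r_{x,j}(1) = \frac{\sum_{i=1}^{n}(X_{i,j} - \bar{X}_j)(X_{i,j+1} - \bar{X}_{j+1})}{(n - 1)\,s_{x,j}\,s_{x,j+1}} \tag{15.15}$$

where

$$s_{x,j}^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_{i,j} - \bar{X}_j)^2 \tag{15.16}$$

In equation 15.15 there are some notational problems when j = m. In this case, j + 1 should be
taken as 1 because the first season follows the mth season (January follows December, for example).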
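With this notation, the multiseason, first-order Markov model for normally distributed flows
becomes

$$X_{i,j+1} = \mu_{x,j+1} + \rho_{x,j}(1)\frac{\sigma_{x,j+1}}{\sigma_{x,j}}(X_{i,j} - \mu_{x,j}) + t_{i,j+1}\sigma_{x,j+1}\sqrt{1 - \rho_{x,j}^2(1)} \tag{15.17}$$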
In any application, the population parameters are estimated by the corresponding sample
statistics. The subscript notation of equation 15.17 again has problems in that $X_{i,j+1}$ is really
equal to $X_{i+1,1}$ when j = m. For instance, if a monthly model is considered, then $X_{i,13}$ (or the 13th
monthly value in the ith year) is actually $X_{i+1,1}$ (or the first monthly value in the next year). $t_{i,j+1}$
is again a random observation from an N(0, 1). Values generated by equation 15.17 are thus the
sum of the mean for the season plus a regression coefficient times the deviation from its mean of
the value in the previous period plus a random component that is normally distributed with mean
zero and variance $\sigma_{x,j+1}^2[1 - \rho_{x,j}^2(1)]$.
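A minimal sketch of a seasonal generator of this kind follows. It assumes the usual form of the multiseason model, in which the regression coefficient is the seasonal serial correlation times the ratio of the standard deviations of the two seasons. The function name, the choice of starting value, and the four-season statistics are hypothetical.

```python
import math
import random


def seasonal_markov(n_years, means, stds, r1, seed=None):
    """Generate n_years rows of seasonal values.

    means[j], stds[j] : mean and standard deviation of season j
    r1[j]             : serial correlation between season j and the season that
                        follows it (the last season is followed by season 0)
    """
    m = len(means)
    rng = random.Random(seed)
    x = means[m - 1]            # assumed start: last season at its mean
    years = []
    for _ in range(n_years):
        row = []
        for nxt in range(m):
            j = (nxt - 1) % m   # preceding season, wrapping across the year end
            slope = r1[j] * stds[nxt] / stds[j]
            noise = rng.gauss(0.0, 1.0) * stds[nxt] * math.sqrt(1.0 - r1[j] ** 2)
            x = means[nxt] + slope * (x - means[j]) + noise
            row.append(x)
        years.append(row)
    return years


# Hypothetical four-season statistics
season_means = [10.0, 30.0, 20.0, 5.0]
season_stds = [2.0, 6.0, 4.0, 1.0]
season_r1 = [0.5, 0.6, 0.4, 0.3]
years = seasonal_markov(5000, season_means, season_stds, season_r1, seed=3)
```

The modulo operation handles the wrap from the last season of one year to the first season of the next.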
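The first-order Markov model can also be generalized to a seasonal model for gamma variates
(Fiering and Jackson 1971) by generalizing equations 15.11, 15.12, and 15.13; equation
15.11 becomes

$$\gamma_{\varepsilon,j} = \frac{\gamma_{x,j} - \rho_{x,j-1}^3(1)\,\gamma_{x,j-1}}{[1 - \rho_{x,j-1}^2(1)]^{3/2}} \tag{15.18}$$

Equation 15.12 becomes

$$\varepsilon_{i,j} = \frac{2}{\gamma_{\varepsilon,j}}\left(1 + \frac{\gamma_{\varepsilon,j}t_{i,j}}{6} - \frac{\gamma_{\varepsilon,j}^2}{36}\right)^3 - \frac{2}{\gamma_{\varepsilon,j}} \tag{15.19}$$

where $t_{i,j}$ is a random value from an N(0, 1). Equation 15.13 becomes identical to equation 15.17
with population parameters replaced by their estimates except $\varepsilon_{i,j+1}$ is used in place of $t_{i,j+1}$. The
resulting $X_{i,j}$ will be distributed almost gamma. Because skewness varies from season to season,
the representation is not statistically pure (Fiering and Jackson 1971). This is because the sum of
gamma variates is not gamma unless the scale parameter, $\lambda$, is the same.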
Equation 15.17 can be applied to the logarithms of the original data. In this case, $X_{i,j}$ would
refer to the logarithm of the value in the ith year and jth season. The parameters of the model
would also be based on the logarithms. The model used in this way would preserve the mean,
variance, skewness, and first-order serial correlation of the logarithms of the data, but not of the
data itself. Equations 15.3 and 15.17 have been widely used in hydrology. Equation 15.17 is
sometimes known as the Thomas-Fiering model because of the early work of these two
researchers with the model (Thomas and Fiering 1962, 1963; Fiering 1967). The model in the
form of equation 15.17 requires that many parameters be estimated. For each season the mean,
variance, and first-order serial correlation must be estimated. This results in estimating 3m
parameters (a monthly model requires one to estimate 36 parameters). This large number of
parameters requires considerable data. The technique based on data generation given in chapter
13 can be used for evaluating the effect of the length of record available for parameter estimation
on the reliability of the Thomas-Fiering model or other stochastic models.

HIGHER-ORDER AUTOREGRESSIVE MODELS


The model given by equation 15.1 can be generalized to include the effects of more than one
preceding time period. Such a model has been called a multilag model, higher-order Markov
model, or higher-order autoregressive model. The model can be written as

$$X_{i+1} = \beta_0 + \beta_1 X_i + \beta_2 X_{i-1} + \cdots + \beta_m X_{i-m+1} + \varepsilon_{i+1} \tag{15.20}$$

The $X_i$'s might represent actual data values or their natural logarithms. In the case of a normal
model, the random element becomes

$$\varepsilon_{i+1} = t_{i+1}\sigma_x\sqrt{1 - R^2} \tag{15.21}$$

where $\sigma_x^2$ is the variance of X; $R^2$ is the multiple coefficient of determination between $X_{i+1}$ and $X_i$,
$X_{i-1}$, ..., $X_{i-m+1}$; $t_{i+1}$ is a random observation from N(0, 1); and the $\beta$'s are multiple regression
coefficients.
The multilag model permits one to incorporate linear influences on data in one period reflected
by data in several preceding periods. The regression coefficients, $\beta$, can be estimated by
normal multiple regression means. The question of how many lags to include can also be analyzed
by the methods of multiple regression devoted to determining whether or not a particular
"independent" variable is important.
One difference between this model and the multiple regression procedures is that the
number of observations available for parameter estimation, n*, changes as the number of lags
changes. If there are n total observations, then there are n - 1 observations available for
estimating px(l) of the first-order Markov model. If two lags are considered (m = 2), then there
are only n - 2 observations available for parameter estimation. In general, for an mth-order
Markov model, there are n* = n - m observations for parameter estimation. What this means is
that multiple regression techniques are not strictly applicable because the sample size and
variables involved in the regressions change as the number of lags changes. For instance, $R^2$ may
actually increase if the number of lags included decreases because the data set involved in the
regression has changed. Generally, it is recommended that if a kth-order lag is included, then all
lags up to k also be included. For example, if a third-order lag is included, then the first- and
second-order lags should be included as well.
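The loss of observations as lags are added can be seen by building the rows that would enter the regression. The helper below is a hypothetical illustration and is not a regression routine.

```python
def lagged_rows(x, m):
    """Rows for an mth-order model: ([X_i, X_(i-1), ..., X_(i-m+1)], X_(i+1))."""
    return [([x[i - k] for k in range(m)], x[i + 1])
            for i in range(m - 1, len(x) - 1)]


record = list(range(10))          # n = 10 observations
for m in (1, 2, 3):
    print(m, len(lagged_rows(record, m)))   # n* = n - m rows in each case
```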

MARKOV CHAIN MODELS


A Markov chain is a stochastic process having the property that the value of the process
at time t, $X_t$, depends only on its value at time t - 1, $X_{t-1}$, and not on the sequence of values
$X_{t-2}, X_{t-3}, \ldots, X_0$ that the process passed through in arriving at $X_{t-1}$. This can be written

$$\text{prob}(X_t = a_j \mid X_{t-1} = a_i, X_{t-2} = a_k, X_{t-3} = a_l, \ldots, X_0 = a_q) = \text{prob}(X_t = a_j \mid X_{t-1} = a_i) \tag{15.22}$$

The conditional probability, prob($X_t = a_j \mid X_{t-1} = a_i$), gives the probability that the process at time
t will be in "state" j given that at time t - 1 the process was in "state" i. Equation 15.22 says that
this conditional probability is independent of the "states" occupied at times prior to t - 1. A state
is simply a subdivision of the process $X_t$ into some interval. Thus, if $X_t$ represents the depth of
rainfall on day t, one state might be defined as no rainfall, another as between 0.00 and 0.05
inches of rainfall, and so forth.
The prob($X_t = a_j \mid X_{t-1} = a_i$) is commonly called the one-step transition probability. That is,
it is the probability that the process makes the transition from state $a_i$ to state $a_j$ in one time period
or one step. The prob($X_t = a_j \mid X_{t-1} = a_i$) is usually written as $p_{i,j}(t)$, indicating the probability of
a step from $a_i$ to $a_j$ at time t. If $p_{i,j}(t)$ is independent of t ($p_{i,j}(t) = p_{i,j}(t + \tau)$ for all t and $\tau$), then
the Markov chain is said to be homogeneous. In this event

$$\text{prob}(X_t = a_j \mid X_{t-1} = a_i) = p_{i,j} \tag{15.23}$$

Higher-order Markov chains can be defined to represent stochastic processes such that the
value of the process at time t is dependent on its value in several immediately preceding time periods.
Thus an nth-order Markov chain is one in which

$$\text{prob}(X_t = a_j \mid X_{t-1} = a_i, X_{t-2} = a_k, X_{t-3} = a_l, \ldots, X_0 = a_q) = \text{prob}(X_t = a_j \mid X_{t-1} = a_i, X_{t-2} = a_k, \ldots, X_{t-n} = a_s) \tag{15.24}$$

The treatment of Markov chains in this text will be limited to first-order homogeneous Markov
chains.
If a process is divided into m states, then $m^2$ transition probabilities must be defined. However,
at each step the process must either remain in state i or proceed to one of the other m - 1
states. Thus

$$\sum_{j=1}^{m} p_{i,j} = 1 \quad \text{for } i = 1, 2, \ldots, m \tag{15.25}$$

With this restriction, an m-state Markov chain requires that m(m - 1) transition probabilities
(parameters) be estimated. The remaining m $p_{i,j}$'s can be determined from equation 15.25. The $m^2$
transition probabilities can be represented by the m × m matrix $\mathbf{P}$ given by

$$\mathbf{P} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,m} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,m} \\ \vdots & \vdots & & \vdots \\ p_{m,1} & p_{m,2} & \cdots & p_{m,m} \end{bmatrix} \tag{15.26}$$

Equation 15.25 states that the elements in any row of $\mathbf{P}$ must sum to unity. A matrix having
this property is said to be a stochastic matrix. Some authors define $\mathbf{P}$ as $\mathbf{P} = [p_{i,j}]' = [p_{j,i}]$. Under
this definition the columns of $\mathbf{P}$ sum to unity. The definition given by equation 15.23 will be used
in this treatment.
The transitional probability matrix $\mathbf{P}$ can be estimated from observed data by tabulating the
number of times the observed data went from state i to state j, $n_{i,j}$. Then an estimate for $p_{i,j}$
would be

$$\hat{p}_{i,j} = \frac{n_{i,j}}{\sum_{j=1}^{m} n_{i,j}} \tag{15.27}$$

Considerable data may be required to get accurate estimates of $p_{i,j}$ if $p_{i,j}$ is small. This is because
in an observed set of data, $n_{i,j}$ may be uncharacteristically high or low if $p_{i,j}$ is close to zero and
the sample is small.
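The tabulation can be sketched as follows. States are numbered from 0 in the code rather than from 1 as in the text, and the function name and the short state sequence are hypothetical.

```python
def estimate_transition_matrix(states, m):
    """Count observed transitions and convert each row of counts to proportions."""
    counts = [[0] * m for _ in range(m)]
    for i, j in zip(states, states[1:]):
        counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left in the record gives an undefined row
        matrix.append([c / total if total else float("nan") for c in row])
    return counts, matrix


# Hypothetical record: 0 = dry day, 1 = wet day
sequence = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
counts, P_hat = estimate_transition_matrix(sequence, 2)
print(counts, P_hat)
```

With only nine transitions, the estimates are clearly unreliable, which illustrates the data requirement noted above.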
The more states that a process is divided into, the less accurate will be the estimates for pij.
For example, if a daily rainfall model is being considered, one might like to have 10 states to ad-
equately represent the possible amounts of rainfall. However, 10 states require the estimation of
90 transition probabilities. This, in turn, requires a large amount of data.
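Once $\mathbf{P}$ is known, all that is required to determine the probabilistic behavior of the
Markov chain is the initial state of the chain. In the following the notation $p_j^{(n)}$ means the probability
that the chain is in state j at step or time n. The 1 × m vector $\mathbf{p}^{(n)}$ has elements $p_j^{(n)}$.
Thus

$$\mathbf{p}^{(n)} = [p_1^{(n)}, p_2^{(n)}, \ldots, p_m^{(n)}] \tag{15.28}$$

Under this definition $\mathbf{p}^{(0)}$ is the initial probability vector. $\mathbf{p}^{(1)}$ is then given by

$$\mathbf{p}^{(1)} = \mathbf{p}^{(0)}\mathbf{P} \tag{15.29}$$

and $\mathbf{p}^{(2)}$ is given by

$$\mathbf{p}^{(2)} = \mathbf{p}^{(1)}\mathbf{P} = \mathbf{p}^{(0)}\mathbf{P}\mathbf{P} = \mathbf{p}^{(0)}\mathbf{P}^2 \tag{15.30}$$

where $\mathbf{P}^n$ is the nth power of $\mathbf{P}$. In general

$$\mathbf{p}^{(n)} = \mathbf{p}^{(0)}\mathbf{P}^n \tag{15.31}$$

Furthermore it can be shown that

$$\mathbf{p}^{(n+m)} = \mathbf{p}^{(n)}\mathbf{P}^m \tag{15.32}$$

For a proof of these relationships, reference should be made to any number of books on probability
or stochastic processes (see for instance Bailey 1964; Feller 1957; Breiman 1969).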
As the Markov chain advances in time, $p_j^{(n)}$ becomes less and less dependent on $\mathbf{p}^{(0)}$.
That is to say the probability of being in state j after a large number of steps becomes independent
of the initial state of the chain. A point is reached where $\mathbf{p}^{(n)} = \mathbf{p}^{(n+m)}$ for a sufficiently
large n. From equation 15.31 we then get, for a sufficiently large n, that $\mathbf{P}^n = \mathbf{P}^{n+m}$. When this
occurs the chain is said to have reached a steady state. Under steady state conditions
$\mathbf{p}^{(n)} = \mathbf{p}^{(n+m)}$ and can thus be denoted simply as $\mathbf{p}$. The 1 × m vector $\mathbf{p}$ can be thought of as
giving the probabilities of being in the various states after a large number of steps. Under
steady state conditions

$$\mathbf{p} = \mathbf{p}\mathbf{P} \tag{15.33}$$

The solution of equation 15.33 thus provides $\mathbf{p}$.


$\mathbf{P}^n$ is called the n-step transitional probability matrix. That is, $\mathbf{P}^n = [p_{i,j}^{(n)}]$ has elements which
give the probability of going from state i to state j in n steps. Since for large n, $p_{i,j}^{(n)}$ is independent
of the initial state, we must have $p_{i,j}^{(n)} = p_j^{(n)}$. Thus, $\mathbf{P}^n$ is made up of m 1 × m vectors all equal to
$\mathbf{p}$. That is, for large n

$$\mathbf{P}^n = \begin{bmatrix} \mathbf{p} \\ \mathbf{p} \\ \vdots \\ \mathbf{p} \end{bmatrix} \tag{15.34}$$

One can therefore calculate the steady state probabilities simply by computing $\mathbf{P}^n$ for a large
enough n. In practice, one would compute $\mathbf{P}^n$ and $\mathbf{P}^{2n}$. If the two differed by only an acceptably
small amount, $\mathbf{p}$ would be taken as one of the rows of $\mathbf{P}^{2n}$. Bailey (1964) gives a procedure for calculating
$\mathbf{P}^n$ based on characteristic roots. On a digital computer, $\mathbf{P}^n$ can be easily evaluated by multiplication.
This method for finding $\mathbf{p}$ may require n to be very large. The steady state probabilities
$\mathbf{p}$ can also be determined directly from equation 15.33. Example 15.2 illustrates this approach.

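Example 15.1. Consider a 2-state, first-order Markov chain for a sequence of wet and dry days.
Let state 1 be a dry day and state 2 be a wet day. Assume the transitional probability matrix to be

$$\mathbf{P} = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix}$$

Thus the probability of a dry day following a wet day is given by $p_{2,1}$ as 0.5. Evaluate:

(a) prob(day 1 wet | day 0 dry)

(b) prob(day 2 wet | day 0 dry)

(c) prob(day 100 wet | day 0 dry)

Solution:

(a) prob(day 1 wet | day 0 dry) = $p_{1,2} = p_{1,2}^{(1)} = 0.1$

(b) prob(day 2 wet | day 0 dry) = $p_{1,2}^{(2)}$

$$\mathbf{P}^2 = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix}\begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.86 & 0.14 \\ 0.70 & 0.30 \end{bmatrix}$$

thus $p_{1,2}^{(2)} = 0.14$

(c) prob(day 100 wet | day 0 dry) = $p_{1,2}^{(100)}$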


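However, the fact that day 0 was dry would not significantly affect the probability of rain on day
100. Therefore, it can be assumed that n is large and the solution can be based on the steady state
probabilities contained in $\mathbf{P}^n$ for large n.

$$\mathbf{P}^4 = \begin{bmatrix} 0.8376 & 0.1624 \\ 0.8120 & 0.1880 \end{bmatrix} \quad \mathbf{P}^8 = \begin{bmatrix} 0.8334 & 0.1666 \\ 0.8328 & 0.1672 \end{bmatrix} \quad \mathbf{P}^{16} = \begin{bmatrix} 0.8333 & 0.1667 \\ 0.8333 & 0.1667 \end{bmatrix}$$

$\mathbf{P}^{16}$ is assumed to be the steady state n-step transitional probability matrix because the $p_{i,j}^{(n)}$ are not
changing much and the two rows are identical. Thus, $p_{1,2}^{(100)} = p_{1,2}^{(16)} = 0.1667$. The probability of
rain on any day in the distant future is 0.1667. For this to be true, an analysis of rainfall records
should show that 16.67% of the days are wet. This serves as a check on $\mathbf{P}$.

Comment: Another check on the steady state probabilities is to see if equation 15.33 is valid.

$$\mathbf{p}\mathbf{P} = [0.8333 \;\; 0.1667]\begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} = [0.8333 \;\; 0.1667] = \mathbf{p}$$

This demonstrates that $\mathbf{p}$ = (0.8333, 0.1667) is the steady state probability vector. See example
15.2 for further comment on this.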
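The numbers in this example can be checked by direct matrix multiplication. The sketch below uses the matrix implied by $p_{1,2} = 0.1$ and $p_{2,1} = 0.5$ (each row must sum to one); the helper names `mat_mul` and `mat_power` are hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]


def mat_power(P, n):
    """n-step transition matrix obtained by repeated multiplication."""
    m = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


P = [[0.9, 0.1],    # dry -> dry, dry -> wet
     [0.5, 0.5]]    # wet -> dry, wet -> wet

P2 = mat_power(P, 2)
P16 = mat_power(P, 16)
print(P2[0][1])     # probability that day 2 is wet given day 0 dry
print(P16)          # rows are nearly identical at this point
```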


Example 15.2. Consider a Markov chain model for the amount of water in storage in a reservoir.
Let state 1 represent the nearly full condition, state 2 an intermediate condition, and state 3 the
nearly empty condition. Assume that the transition probability matrix is given by

Note that it is not possible to pass directly from state 1 to state 3 or from state 3 to state 1 with-
out going through state 2. Over the long run, what fraction of the time is the reservoir level in
each of the states?

Solution: The fraction of time spent in each state is given by $\mathbf{p}$. Equation 15.33 can be used to
determine $\mathbf{p}$. Examination of equation 15.33 shows that if $\mathbf{p}$ is a solution so is $\lambda\mathbf{p}$ for any
scalar $\lambda$. Therefore, a solution to 15.33 is unique only up to a scalar multiplication. However,
because $\mathbf{p}$ is a probability vector, the sum of its elements must be 1. Therefore, our solution
technique is to find an arbitrary solution to 15.33 and then scale it so that $\sum p_i = 1$.

If we let $p_1 = 1$, the first of these equations gives $p_2 = 3$. With $p_2 = 3$, the last of the equations
gives $p_3 = 6/7$. Therefore, one solution of $\mathbf{p}\mathbf{P} = \mathbf{p}$ is $\mathbf{p}$ = (1, 3, 6/7). Since $\sum p_i$ must equal 1, $\mathbf{p}$
can be scaled so that $\mathbf{p}$ = (0.2059, 0.6176, 0.1765). This solution can be substituted into equation
15.33 to verify that it is, in fact, a solution. Another check would be to compute $\mathbf{P}^n$ for large n and
show that $\mathbf{P}^n = (\mathbf{p}\ \mathbf{p}\ \mathbf{p})'$. Thus, over the long run the reservoir is nearly full 20.59% of the time,
nearly empty 17.65% of the time, and in the intermediate state 61.76% of the time.

Comment: This problem illustrates the direct solution via equation 15.33 for $\mathbf{p}$. The $\mathbf{P}^n$ approach
for determining $\mathbf{p}$ can, however, be used. For this example $\mathbf{P}^8$ and $\mathbf{P}^{16}$ can be found to be
(4 significant figures):

Thus $\mathbf{P}^8$ and $\mathbf{P}^{16}$ are nearly identical, and $\mathbf{P}^{16}$ has identical rows, each of which is in fact
equal to $\mathbf{p}$. This can be seen from

Thus p satisfies equation 15.33.
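The direct solution of equation 15.33 can also be sketched numerically. The transition matrix of example 15.2 did not survive in this copy, so the matrix `P_ASSUMED` below is an assumption: it is one matrix with one-decimal entries that forbids direct moves between states 1 and 3 and reproduces the stated solution $\mathbf{p}$ = (1, 3, 6/7). The function name `steady_state_direct` is likewise hypothetical.

```python
def steady_state_direct(p):
    """Solve p P = p subject to sum(p) = 1 by Gauss-Jordan elimination.

    The balance equations are (P' - I) p' = 0. One of them is redundant,
    so the last is replaced by the normalization sum(p) = 1.
    """
    m = len(p)
    # Augmented matrix [A | b], row i holding the i-th balance equation
    a = [[p[j][i] - (1.0 if i == j else 0.0) for j in range(m)] + [0.0]
         for i in range(m)]
    a[m - 1] = [1.0] * m + [1.0]
    for col in range(m):
        # Partial pivoting keeps the elimination numerically stable
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]


# ASSUMED matrix (not recovered from the text): consistent with the
# solution p = (1, 3, 6/7) and with p13 = p31 = 0.
P_ASSUMED = [[0.4, 0.6, 0.0],
             [0.2, 0.6, 0.2],
             [0.0, 0.7, 0.3]]

p_direct = steady_state_direct(P_ASSUMED)
print([round(x, 4) for x in p_direct])
```

Replacing a balance equation with the normalization condition performs the "find any solution, then scale" step of the example in a single solve.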

Data generation from a Markov chain requires only a knowledge of the initial state and the
transitional probability matrix $\mathbf{P}$. To determine the state at time 2, a random number is selected
between zero and one. If this random number is between $\sum_{j=1}^{n-1} p_{i,j}$ and $\sum_{j=1}^{n} p_{i,j}$, where i is
the current state, for n = 1, 2, ..., m, the next state is taken as state n. Example 15.3 illustrates
this procedure.

Example 15.3. Assume that the reservoir of example 15.2 is nearly full at t = 0. Generate a
sequence of 10 possible reservoir levels corresponding to t = 1,2, ..., 10.

Solution: The matrix $\mathbf{P}$ can be written in the form of a cumulative transition probability matrix
$\mathbf{P}^*$ where

and

For this example

Time t    State at t    Random no.    State at t + 1    Reservoir level at t

   0          1            0.48             2           Nearly full
   1          2            0.52             2           Intermediate
   2          2            0.74             2           Intermediate
   3          2            0.15             1           Intermediate
   4          1            0.27             1           Nearly full
   5          1            0.03             1           Nearly full
   6          1            0.49             2           Nearly full
   7          2            0.02             1           Intermediate
   8          1            0.97             2           Nearly full
   9          2            0.96             3           Intermediate
  10          3                                         Nearly empty
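The generation procedure of example 15.3 can be sketched as follows. The transition matrix is again an assumption (the original matrix is not legible in this copy); `P_ASSUMED` is one matrix consistent with example 15.2, and with it the random numbers tabulated above lead to the same sequence of states. The function names are hypothetical.

```python
import random
from itertools import accumulate


def cumulative_matrix(p):
    """Row-wise cumulative sums of P, giving the matrix P*."""
    return [list(accumulate(row)) for row in p]


def next_state(p_star, state, u):
    """Return the next state (1-based) for current state and uniform u."""
    for n, upper in enumerate(p_star[state - 1], start=1):
        if u < upper:
            return n
    return len(p_star)  # guards against round-off in the last cumulative sum


def generate_states(p, initial_state, uniforms):
    """Generate one state per uniform random number, starting from initial_state."""
    p_star = cumulative_matrix(p)
    states = [initial_state]
    for u in uniforms:
        states.append(next_state(p_star, states[-1], u))
    return states


# ASSUMED matrix, consistent with examples 15.2 and 15.3 but not recovered from the text
P_ASSUMED = [[0.4, 0.6, 0.0],
             [0.2, 0.6, 0.2],
             [0.0, 0.7, 0.3]]

# Random numbers listed in the table of example 15.3
BOOK_UNIFORMS = [0.48, 0.52, 0.74, 0.15, 0.27, 0.03, 0.49, 0.02, 0.97, 0.96]
book_states = generate_states(P_ASSUMED, 1, BOOK_UNIFORMS)
print(book_states)

# A fresh sequence using Python's own generator (seed chosen arbitrarily)
rng = random.Random(1)
fresh_states = generate_states(P_ASSUMED, 1, [rng.random() for _ in range(10)])
print(fresh_states)
```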

Markov chains have been used in hydrology for modeling rainfall (Gabriel and Neumann
1962; Pattison 1964; Bagley 1964; Grace and Eagleson 1966; Hudlow 1967). Lloyd (1967) pres-
ents a discussion of the application of Markov chains to reservoir theory.
Some of the difficulties in using Markov chains in hydrology are:

1. Determining the number of states to use.

2. Determining the intervals of the variable under study to associate with each state.

3. Assigning a number to the magnitude of an event once the state is determined (i.e., how much
rainfall should be assigned given that the chain moved to state 3 and that state 3 encompasses all
rainfalls between 1 and 2 inches).

4. Estimating the large number of parameters involved in even a moderate size Markov chain
model. A chain with 5 states has 20 parameters to estimate. If seasonality is encountered and
4 seasons are needed, 80 parameters are required.

5. Handling situations where some transitions are dependent on several previous time periods
while others are dependent on only one prior time period. Hudlow (1967) found the dry-dry
transition for hourly rainfall showed a sixth-order Markov dependence while a first-order
dependence was adequate for the other transitions.

Woolhiser, Rovey, and Todorovic (1973) discuss an n-day rainfall model in which the tran-
sition from wet to dry days is based on a 2-state Markov chain and the amount of rain on rainy
days is exponentially distributed. Haan et al. (1976) describe a 7-state Markov chain model of
daily rainfall in which the amount of rain in each state is assumed uniformly distributed except
for the last state, in which a shifted exponential distribution is used.
Carey and Haan (1976) present a modified Markov chain daily rainfall simulation model in
which the transitional probabilities are replaced by a continuous probability distribution. That is,
given that the system is in state i on day n in season k, then the probability distribution of the
amount of rain on day n + 1 is given by

$$\mathrm{prob}(X_{n+1} \le x \mid X_n \text{ in } i,\ \text{season } k) = p_{i1}^{k} + (1 - p_{i1}^{k})\,P_X(x \mid i, k) \qquad (15.35)$$

where $X_n$ is the amount of rain on day n, $p_{i1}^{k}$ is the probability of no rain on day n + 1 given $X_n$
was in state i and season k, and $P_X(x \mid i, k)$ is the cumulative probability distribution of rainfall on
day n + 1 given that rain occurs on day n + 1 and that $X_n$ was in state i and season k. Hence, to
each state in each season there is a corresponding distribution function of the form of equation
15.35. The parameters $p_{i1}^{k}$ are estimated as $f_{i1}^{k}/f_{i}^{k}$, where $f_{i1}^{k}$ is the historical frequency of
transition from state i to state 1 (no rain) in season k and $f_{i}^{k}$ is the total number of occurrences in
state i and season k. The parameters of each distribution $P_X(x \mid i, k)$ are determined from historical
data using the set of observations $[X_{n+1} \mid X_{n+1} > 0,\ X_n \text{ in } i,\ \text{season } k]$.
Synthetic traces of daily rainfall are generated from equation 15.35 in the following manner:

1. Determine the state i and season k of $X_n$.

2. Generate a uniform random number, $R_u$, from the interval (0, 1).

3. If $R_u < p_{i1}^{k}$, then $X_{n+1} = 0$.

4. If $R_u > p_{i1}^{k}$, generate a random observation, x, from $P_X(x \mid i, k)$, and set $X_{n+1} = x$.

5. Repeat steps 1-4 advancing in time and changing seasons as required.
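Steps 1 through 5 can be sketched in Python as shown below. Every numerical value in this sketch (the no-rain probabilities, gamma parameters, state boundary, and two-season calendar) is a hypothetical placeholder and not a value from Carey and Haan (1976), and the function names are illustrative only. A single gamma distribution per season is used for all states, as the text reports for the Kentucky data.

```python
import random


def state_of(rain, bounds):
    """Return the 1-based state of a rainfall amount; state 1 means no rain.

    bounds holds the (hypothetical) upper limits of the intermediate states.
    """
    if rain <= 0.0:
        return 1
    for s, upper in enumerate(bounds, start=2):
        if rain <= upper:
            return s
    return len(bounds) + 2


def generate_rain(n_days, p_dry, gamma_params, bounds, season_of_day, rng, x0=0.0):
    """Generate a synthetic trace of daily rainfall.

    p_dry[k][i - 1]  probability of no rain tomorrow given state i in season k
    gamma_params[k]  (shape, scale) of the rainfall amount in season k
    """
    rain = []
    x = x0
    for day in range(n_days):
        k = season_of_day(day)            # step 1: season of the current day
        i = state_of(x, bounds)           # step 1: state of the current day
        r_u = rng.random()                # step 2: uniform random number
        if r_u < p_dry[k][i - 1]:         # step 3: no rain tomorrow
            x = 0.0
        else:                             # step 4: draw a rainfall amount
            shape, scale = gamma_params[k]
            x = rng.gammavariate(shape, scale)
        rain.append(x)                    # step 5: advance one day
    return rain


# Hypothetical inputs: 3 states, 2 seasons
P_DRY = [[0.80, 0.60, 0.50],
         [0.70, 0.50, 0.40]]
GAMMA = [(0.8, 0.6), (0.9, 0.8)]
BOUNDS = [0.5]                            # state 2: 0 < rain <= 0.5, state 3: rain > 0.5


def season_of_day(day):
    """Hypothetical calendar: first half of the year is season 0."""
    return 0 if day % 365 < 182 else 1


trace = generate_rain(365, P_DRY, GAMMA, BOUNDS, season_of_day, random.Random(7))
print("wet days:", sum(1 for x in trace if x > 0.0))
```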

For Kentucky rainfall, Carey and Haan (1976) used 3 states and 12 seasons. Gamma
distributions were used for $P_X(x \mid i, k)$. Furthermore, for a given season, the same gamma
distribution could be used for all 3 states. Thus, for each season 5 parameters (2 parameters of the
gamma distribution and 3 values for $p_{i1}^{k}$) had to be estimated, or a total of 60 parameters. This
compares with 505 parameters when the Markov chain approach of Haan et al. (1976) was used
[a 7 × 7 transition probability matrix for each of 12 seasons plus an exponential parameter,
12(7 × 6) + 1 = 505]. When simulated rainfall for these two models was compared to historical
rainfall at 7 Kentucky locations, the Carey-Haan model proved superior.

Exercises

15.1. Develop a stochastic model for generating a sequence of numbers that could represent the
years between eruptions of the Volcano Aso (see exercise 14.3). Use the model to generate a
series of 100 possible times (years) between eruptions. Compare the correlogram and spectral
density functions for the generated and observed sequences.

15.2. Assume that the time (days) between rains follows a Poisson distribution with a mean of
2 days. Further assume that the amount of rain (inches) on rainy days follows a gamma distribution
with a mean of 1 inch and a variance of 0.50 inch. Simulate 1 year of rainfall using this model.

15.3. Use the first-order Markov model to generate 100 years of annual runoff (inches) for Cave
Creek near Fort Spring, Kentucky. (Basic data in Appendix.)

15.4. Generate a random sample of size 100 from a gamma distribution with $\eta$ = 3.5 and $\lambda$ =
2.5. Plot the observed and expected relative frequencies.

15.5. The following data are presented by Burges and Johnson (1973) for the Sauk River in
Washington. Based on this data and the first-order, seasonal, lognormal Markov model, generate
50 years of streamflow data. Compute and plot the correlogram and spectral density function for
the generated data.
Month    $\bar{x}_j$    $s_{x,j}$    $\gamma_{x,j}$        Month    $\bar{x}_j$    $s_{x,j}$    $\gamma_{x,j}$

Oct.      5.02     2.31     0.61         Apr.      6.42     1.80     0.44
Nov.      6.50     3.38     0.58         May      10.70     2.89     0.34
Dec.      7.33     3.23     0.50         June     12.76     3.32     0.17
Jan.      6.42     2.95     0.31         July      9.05     3.26     0.65
Feb.      5.35     2.62     0.38         Aug.      4.44     1.47     0.93
Mar.      5.02     1.66     0.37         Sept.     3.29     1.22     0.51

15.6. Generate 100 years of monthly streamflow data for Cave Creek near Fort Spring, Ken-
tucky, using the seasonal first-order normal Markov model. Compare the correlogram and spec-
tral density function of the simulated and observed data. (Basic data are in Appendix.)

15.7. Write out and explain how a model such as described by equations 15.20 and 15.21 can be
used as a higher-order, multiseason Markov model. Apply the model to Cave Creek near Fort
Spring, Kentucky, (Appendix for data), using a second-order, monthly Markov model.

15.8. Use the first-order, normal Markov model to generate 100 years of annual runoff for the
Spray River near Banff, Canada. Compare the correlogram and spectral density functions for the
observed and simulated data.
15.9. Use equation 15.29 to show the individual generating equations for a 2-site model in terms
of $\rho_{1,2}(0)$, $\rho_1(1)$, $\rho_2(1)$, and xi.

15.10. What is pj,,(l) for the model given by equation 15.25?

15.11. Generate 1 year of rainfall letting the sequence of wet and dry days be defined by the
Markov chain of example 15.1 and the amount of rainfall on a rainy day by a gamma distribution
with a mean of 1 and a variance of 0.50 inches.

15.12. Generate a succession of 200 water level states for the situation described in examples
15.2 and 15.3. What fraction of the time was the reservoir level in each of the three states? How
does this compare to the predicted results of example 15.2?
16. Probabilistic Methods
for Uncertainty, Risk,
and Reliability Analysis
WATER RESOURCES and environmental engineering systems deal with the extremely
complex nature of the physical, chemical, biological, and socio-economic processes. While
designing or analyzing the performance of a given system, most often a mathematical model is
used to describe the interrelationships and interactions among its component processes. Typically,
hydrologic and water quality models are complex and might be written in a generic form as

$$\mathbf{O} = f(\mathbf{I}, \mathbf{P}, t) + \mathbf{e} \qquad (16.1)$$

where $\mathbf{O}$ represents the outputs being modeled, $\mathbf{I}$ represents the inputs to the model such as
rainfall, temperature, and so on, $\mathbf{P}$ represents the parameters required by the model, t represents
time, and $\mathbf{e}$ represents errors associated with the modeling process.
One axiom of stochastic processes is that any function of a random variable is itself a
random variable. Thus, if any of the variables in $\mathbf{I}$ or $\mathbf{P}$ are uncertain and known only in a
probabilistic sense or if the nature of the functional relationships in the model are uncertain, then
$\mathbf{O}$ is also uncertain and can be known only in a probabilistic sense. The design and analysis of
hydrologic, hydraulic, and environmental projects are subject to uncertainty because of inherent
uncertainty in natural systems, a lack of understanding of the causes and effects in various
physical, chemical, and biological processes occurring in natural systems, and insufficient data.

This chapter was written by Dr. Aditya Tyagi, formerly a graduate research assistant in the
Biosystems and Agricultural Engineering Department of Oklahoma State University, Still-
water, Oklahoma, and currently a water resources engineer with CH2M Hill, Austin, Texas.
As a result of these uncertainties, the performance of a project will also be uncertain. The pres-
ence of uncertainties brings into question conventional deterministic design practices due to
their inability to account for possible variations of system responses. The issues involved in the
design and analysis of water resources and environmental engineering systems under uncer-
tainty are multidimensional. Therefore, quantification of system uncertainties is imperative in
order to design or operate a project successfully. Reliability, risk, and uncertainty analysis are
therefore becoming increasingly important in modeling and designing water resources infra-
structure and decision support systems. In some cases, uncertainty analysis is mandatory, par-
ticularly when critical decisions involve potentially high levels of risk. A systematic quantita-
tive uncertainty analysis provides insight into the level of confidence warranted in model
estimates and in understanding judgements associated with modeling processes. It may also
play an illuminating role in identifying how robust the conclusions about model results are and
help target data gathering efforts.
It is apparent that considerable work may be involved in gathering the data required to char-
acterize the uncertainty in each parameter and the parameters as a whole. Before making any data
collection effort, it would be wise to investigate the importance of various parameters to the
process being modeled. If a parameter has little impact on the output of a model, there is no need
to spend a great deal of time and money to estimate that parameter or to worry about uncertainty
in that parameter. Sensitivity analysis is used to measure the importance of a parameter.

SENSITIVITY ANALYSIS
Sensitivity analysis is the study of how the variation in the output of a model can be appor-
tioned, qualitatively or quantitatively, to different sources of variation, and how the given model
depends upon the information fed into it. It ranks model parameters based on their contribution
to overall error in model predictions. While carrying out sensitivity analysis, selection of an effi-
cient sensitivity analysis method is critical.

Traditional or Local Sensitivity Analysis


This method is also known as a one-parameter-at-a-time sensitivity analysis, in which the
effect of the variation in each uncertain input parameter is determined by keeping other uncertain
parameters at a constant level (generally at their expected value). The result is a series of partial
derivatives, one for each parameter, that defines the rate of change of the output function relative
to the rate of change of the input parameter. Two types of sensitivity coefficients are used. One is
called an absolute sensitivity coefficient, or simply the sensitivity coefficient, S, and the other is
called a relative sensitivity coefficient, $S_r$. These coefficients are defined as

$$S = \frac{\partial O}{\partial P}, \qquad S_r = \frac{\partial O}{\partial P}\,\frac{P}{O} \qquad (16.2)$$

where S is the absolute sensitivity (output units/input units), $S_r$ is the relative sensitivity
(dimensionless), O represents a particular output, and P represents a particular input parameter.
Graphically the terms in these relationships are shown in figure 16.1.
Fig. 16.1. Definitions for numerical derivatives.

Most hydrologic and water quality models are a collection of algorithms and not a continuous
function of the parameters in the usual sense of function; thus numerical derivatives are used
to approximate the partial derivatives of equation 16.2. The numerical derivatives may be
approximated as

$$\frac{\partial O}{\partial P} \approx \frac{O_{P+\Delta P} - O_{P-\Delta P}}{2\,\Delta P} \qquad (16.3)$$

where $\Delta P$ is the amount a parameter is perturbed from its base value (this is generally taken as
10% or 15% of P). The numerical derivatives are calculated about base parameter values. The
relative sensitivity coefficients are dimensionless and thus can be compared across parameters,
whereas the absolute sensitivity coefficients are affected by units of output and input and cannot
be directly compared across non-commensurate parameters.
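A numerical sensitivity calculation of this kind can be sketched for any model that can be called as a function. The sketch below assumes a central difference about the base value; `sensitivities` and `toy_model` are hypothetical names, and the toy model merely stands in for a real hydrologic model.

```python
def sensitivities(model, base, name, rel_step=0.10):
    """Central-difference absolute (S) and relative (S_r) sensitivity.

    model     callable taking the parameters as keyword arguments
    base      dict of base (for example, mean) parameter values
    name      parameter to perturb by rel_step times its base value
    """
    p0 = base[name]
    dp = rel_step * p0
    low = dict(base, **{name: p0 - dp})
    high = dict(base, **{name: p0 + dp})
    s = (model(**high) - model(**low)) / (2.0 * dp)
    s_r = s * p0 / model(**base)
    return s, s_r


def toy_model(a, b):
    """Hypothetical stand-in for a model: linear in a, quadratic in b."""
    return a * b ** 2


BASE = {"a": 3.0, "b": 2.0}
s_a, sr_a = sensitivities(toy_model, BASE, "a")
s_b, sr_b = sensitivities(toy_model, BASE, "b")
print("a: S =", round(s_a, 6), " S_r =", round(sr_a, 6))
print("b: S =", round(s_b, 6), " S_r =", round(sr_b, 6))
```

For this toy model the central difference happens to be exact, giving relative sensitivities equal to the exponents of a and b. For more strongly nonlinear responses the result depends on the size of the perturbation.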
Most hydrological and environmental engineering models are complex and contain a large
number of parameters. The disadvantage of determining model response to one parameter at a
time (performing the traditional sensitivity analysis) is that it requires considerable computation
and provides information about only one point in the parameter space. In some cases, this type of
sensitivity analysis may be misleading as such combinations of inputs would be unlikely in the
real world. To overcome this problem, global sensitivity analysis may be used.

Global Sensitivity Analysis


This method is also known as a variance-based method. The effect of variation in the
inputs, as all inputs are allowed to vary over their ranges, taking into account the shape of their
probability density functions, is determined. This usually requires some procedure for sampling
the parameters, perhaps in a Monte Carlo simulation (MCS) form. The MCS process is
illustrated in figure 16.2 and discussed in detail in the subsequent section, Uncertainty Methods.

Fig. 16.2. Monte Carlo simulation.

If several parameters are simultaneously and independently varied, then the multiple
regression of output O on all parameters, $P_i$, is

$$O = b_0 + \sum_{i} b_i P_i \qquad (16.4)$$

where $b_i$ represents regression coefficients. Normalized sensitivity indices can be obtained for
each variable in equation 16.4 by subtracting its mean and dividing by its estimated standard
deviation. The normalized regression model is

$$\frac{O - \bar{O}}{s_O} = \sum_{i} \beta_i\,\frac{P_i - \bar{P}_i}{s_i} \qquad (16.5)$$

where $\bar{O}$ and $s_O$ are the mean and standard deviation of simulated output O, and $\bar{P}_i$ and $s_i$ are the
mean and standard deviation of the ith parameter. By equating equation 16.4 with equation 16.5, the
relationship between the standardized coefficient and un-normalized multiple regression
coefficient is

$$\beta_i = b_i\,\frac{s_i}{s_O} \qquad (16.6)$$

where $\beta_i$ is the normalized sensitivity index of the ith parameter. It can be shown that $\beta_i$ is the
correlation coefficient between simulated output O and generated parameter $P_i$.

Example 16.1. The head loss, $h_f$ (m), in a pipe is given by the Hazen-Williams equation as

Compute the sensitivity coefficients (equation 16.2) assuming L as constant (1500 m) and mean
values of $\lambda_m$, Q, C, and D of 1.0 (non-dimensional), 0.915 (m$^3$/s), 130 (SI units), and 0.305 (m),
respectively.

Solution: Output, $h_f$, is calculated by substituting base values (mean values) in the given
functional relation of $h_f$ as

Analytical partial derivatives of $h_f$ with respect to various uncertain parameters are determined and
absolute sensitivity coefficients are evaluated by substituting mean values in the resulting expressions.
The calculation is presented in table 16.1, which indicates that D is the most sensitive parameter.

Table 16.1. Sensitivity analysis using analytical derivatives

Parameter, Pi

Symbol    Base value    S    $S_r$

Example 16.2. For the preceding example, determine S and $S_r$ using numerical approximation.

Solution: First, the output value O at the base values of the input parameters is found to be
535.29. Then, assuming $\Delta P$ = 10%, parameters are perturbed about their base values. The
calculation is presented in table 16.2.
Table 16.2. Sensitivity analysis by numerical approximation method at $\Delta P$ = 10%

             Perturbed input values    Output at perturbed input values    Sensitivity coefficients
Parameter P    $P-\Delta P$    $P+\Delta P$         $O_{P-\Delta P}$    $O_{P+\Delta P}$                   S    $S_r$

Table 16.3. Relative sensitivity coefficient at different levels of perturbation

             Relative sensitivity coefficient, $S_r$
Parameter    $\Delta P$ = 1%    $\Delta P$ = 5%    $\Delta P$ = 10%    $\Delta P$ = 15%

For nonlinear models, the error in sensitivity coefficients depends upon the magnitude of the
perturbation ($\Delta P$) and the non-linearity of the model response with respect to different parameters.
Table 16.3 presents the effect of the magnitude of perturbation on relative sensitivity coefficients.
Table 16.3 demonstrates that when the functional relationship is linear with respect to a
parameter, there is no impact of the magnitude of $\Delta P$ on $S_r$. The inexactness of $S_r$ increases with
increase in the magnitude of $\Delta P$ as the non-linearity of an output function increases with respect to a
parameter. At $\Delta P$ = 1%, the values of $S_r$ with respect to $\lambda_m$, Q, and C match those obtained by the
analytical method (table 16.1). This example suggests that one should be very cautious in
choosing $\Delta P$ for determining numerical derivatives in complex models, such as watershed models.
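The dependence of the numerical $S_r$ on the size of the perturbation can be illustrated for a power-function response. For an output of the form $O = cP^r$, the central-difference relative sensitivity reduces to $[(1+\delta)^r - (1-\delta)^r]/(2\delta)$, where $\delta$ is the fractional perturbation, whereas the analytical value is r. The exponents below are the standard Hazen-Williams exponents and are an assumption here, because the equation of example 16.1 is not legible in this copy.

```python
def relative_sensitivity_power(r, rel_step):
    """Central-difference S_r for O = c * P**r (independent of c and of P)."""
    return ((1.0 + rel_step) ** r - (1.0 - rel_step) ** r) / (2.0 * rel_step)


# ASSUMED exponents of the head-loss relation (standard Hazen-Williams form)
ASSUMED_EXPONENTS = {"lambda_m": 1.0, "Q": 1.852, "C": -1.852, "D": -4.87}
STEPS = (0.01, 0.05, 0.10, 0.15)

table = {name: [relative_sensitivity_power(r, d) for d in STEPS]
         for name, r in ASSUMED_EXPONENTS.items()}

for name, row in table.items():
    print(name.ljust(9), "  ".join(f"{v:8.3f}" for v in row))
```

Under these assumptions the linear parameter is unaffected by the step size, while the estimate for the most nonlinear parameter, D, drifts away from its analytical value as the perturbation grows.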

Example 16.3. Use the global sensitivity analysis method for the preceding example. Assume
the parameters are independent with $\lambda_m$ normally distributed with a coefficient of variation (CV)
of 0.12, and Q, C, and D lognormally distributed with CV's of 0.10, 0.15, and 0.05, respectively.

Solution: MCS is used to generate 5000 random observations for each of the uncertain parame-
ters. The value of hf is calculated for each of the 5000 sets of parameters. Based on these data the
following regression equation (R-square = 0.89) is obtained.

The corresponding normalized regression equation is

$h_f = 0.268\lambda_m + 0.435Q - 0.598C - 0.539D + \text{constant}$

The coefficient of each parameter is its normalized sensitivity index. For example, the normalized
sensitivity coefficient of $\lambda_m$ is 0.268. It means that a one standard deviation change in this model
parameter will lead to a 0.268 standard deviation change in the model prediction. As mentioned
earlier, the normalized sensitivity coefficient for an input parameter is its correlation coefficient
with the output random variable.
The difference between the local and global sensitivities for the head loss should be noted. The
local sensitivity coefficient does not require the use of probabilistic properties of uncertain
parameters. It is based on the functional characteristics. On the other hand, the global sensitivity
coefficient is based on both functional and probabilistic characteristics of the input random vari-
ables. Local sensitivity analysis indicates D as the most sensitive parameter. Global sensitivity
analysis indicates C as the most uncertain parameter. To investigate which sensitivity coefficient
is the most useful, the contribution due to each component function should be determined. This
is discussed in the next section.
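A Monte Carlo calculation in the spirit of example 16.3 can be sketched with the Python standard library alone. The sketch estimates each normalized sensitivity index as the correlation coefficient between the sampled parameter and the simulated output, which the text states is equivalent for independent parameters. The head-loss exponents are assumed (standard Hazen-Williams form), the leading constant and pipe length are omitted because they cancel in a correlation, and the function names, seed, and sample size are arbitrary choices made here.

```python
import math
import random


def lognormal_sampler(mean, cv, rng):
    """Return a function drawing lognormal values with the given mean and CV."""
    sigma = math.sqrt(math.log(1.0 + cv * cv))
    mu = math.log(mean) - 0.5 * sigma * sigma
    return lambda: rng.lognormvariate(mu, sigma)


def pearson(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def global_sensitivity(n, seed=1):
    """Correlation of each sampled parameter with the simulated head loss."""
    rng = random.Random(seed)
    draw = {
        "lambda_m": lambda: rng.gauss(1.0, 0.12),        # normal, CV = 0.12
        "Q": lognormal_sampler(0.915, 0.10, rng),
        "C": lognormal_sampler(130.0, 0.15, rng),
        "D": lognormal_sampler(0.305, 0.05, rng),
    }
    samples = {name: [] for name in draw}
    output = []
    for _ in range(n):
        p = {name: f() for name, f in draw.items()}
        for name, value in p.items():
            samples[name].append(value)
        # ASSUMED exponents; proportional to h_f (constant factors omitted)
        output.append(p["lambda_m"] * p["Q"] ** 1.852
                      * p["C"] ** -1.852 * p["D"] ** -4.87)
    return {name: pearson(samples[name], output) for name in draw}


indices = global_sensitivity(20000)
for name, value in indices.items():
    print(name.ljust(9), round(value, 3))
```

The printed indices can be compared with the coefficients of the normalized regression equation of example 16.3. Exact agreement should not be expected, since the values depend on the random sample.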

Uncertainty Analysis
The main objective of uncertainty analysis is to assess the statistical properties of model out-
puts as a function of stochastic input parameters. In water resources engineering projects, design
quantities and model outputs are functions of several parameters, not all of which can be quantified
with absolute accuracy. The task of uncertainty analysis is to determine the uncertainty features of
the model outputs as a function of uncertainties in the model itself and in the stochastic parameters
involved. It provides a formal and systematic framework to quantify the uncertainty associated with
the model outputs. Furthermore, it offers the designer useful insights regarding the contribution of
each stochastic parameter to the overall uncertainty of the model outputs. Such knowledge is es-
sential in identifying the important parameters to which more attention should be given to improve
assessment of their values and then reduce the overall uncertainty in the model output. Quantitative
characterization of uncertainty provides an estimate of the degree of confidence that can be placed
on the analysis and findings.
As an example, water quality models are formulated to describe both observed conditions
and predict planning scenarios that may be substantially different from observed conditions.
Planning and management activities such as checking basin-wide water quality for regulatory
compliance, waste load allocation, and so forth, require the assessment of hydrologic, hydraulic,
and water quality conditions beyond the range of observed data. These inadequacies in model
parameters or inputs force water quality modelers to characterize the impacts of parameter
uncertainties quantitatively so that appropriate decisions regarding water pollution abatement
programs can be made. The most complete and ideal description of uncertainty is the pdf of the
quantity subject to uncertainty. However, in most practical problems, a pdf is very difficult, if
not impossible, to derive precisely. In most situations, the main objective of uncertainty analysis
is to evaluate the first and second moments of a model output in terms of input random
variables.

RELIABILITY AND RISK ANALYSIS


Reliability and risk analysis is a technique for identifying, characterizing, quantifying, and
evaluating the probability of a pre-identified hazard. It is widely used by private and government
agencies to support regulatory and resource allocation decisions. In most hydrologic, hydraulic,
and environmental engineering problems, empirically developed or theoretically derived mathe-
matical models are used to evaluate a system's performance. These models involve several
uncertain parameters that are difficult to accurately quantify. An accurate reliability assessment
of such models would help the designer build more reliable systems and aid the operator in mak-
ing better maintenance and scheduling decisions.
The reliability of a system can be most realistically measured in terms of probability. The
failure of a system can be considered as an event in which the demand or loading, L, on the sys-
tem exceeds the capacity or resistance, R, of the system so that the system fails to perform satis-
factorily for its intended use. The objective of reliability analysis is to ensure that the probability
of the event (R < L) throughout the specified useful life is acceptably small. The risk, $P_f$, defined
as the probability of failure, can be expressed as (Ang and Tang 1984; Yen et al. 1986)

$$P_f = P(R < L) \qquad (16.7)$$

where P denotes the probability function. Equation 16.7 can be rewritten in terms of the
performance function Z as

$$P_f = P(Z < 0) \qquad (16.8)$$

where Z is defined alternatively as

$$Z = R - L \qquad (16.9)$$

The reliability, $\Re$, of the system can be written as

$$\Re = 1 - P_f \qquad (16.12)$$

In general, from equation 16.8, the risk can be expressed as

$$P_f = \int_a^b \int_c^l p_{R,L}(r, l)\,dr\,dl$$

where $p_{R,L}(r, l)$ is the joint pdf of R and L; c is the lower bound of R; and a and b are the lower and
upper bounds of L, respectively. The resistance, R, and load, L, are random variables given as

where $\mathbf{U}$ is the vector representing input parameters of the model representing R, and $\mathbf{V}$ is the
vector representing input parameters of the model representing L. In some problems L may be a
deterministic quantity representing a hydrologic/hydraulic/environmental target level such as
peak discharge; volume; contaminant concentration in soil, water, or air; minimum dissolved
oxygen in a stream; critical cancer risk; and so on. Alternatively, by using the performance
variable Z defined in equations 16.9, 16.10, and 16.11, the risk can be written as

$$P_f = \int_{-\infty}^{0} p_Z(z)\,dz$$

where $p_Z(z)$ is the pdf of Z. The pdf of Z is unknown, or difficult to obtain. In most cases the
exact distribution of Z may not be required, as any of several distributions can be used to make a
decision if correct information about the moments of $p_Z(z)$ is available.

Uncertainty, Risk, and Reliability Analysis Methods


Ideally, a pdf should be obtained for a complete assessment of the uncertainty, risk, and
reliability analysis of a given system. This requires determination of the joint pdf for all the sig-
nificant sources of uncertainty affecting the output of the system. However, the determination of
probability distributions for the basic variables is quite difficult and involves several assump-
tions. Furthermore, the multivariate combination and integration of the input variable distribu-
tions is a daunting task. The aggregation of uncertainties in the basic variables of a model into
measures of overall model output uncertainty/reliability is done in only an approximate manner.
Several methods that have been used in water resources and environmental engineering will be
discussed.

First-Order Approximation Method


The first-order approximation (FOA) method can be used to estimate the amount of uncer-
tainty, or scatter, of a dependent variable due to uncertainty in the independent variables included
in a functional relationship. Benjamin and Cornell (1970) have described the first-order approx-
imation technique in detail.
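Consider an output random variable, Y, which is a function of n random variables.
Mathematically, Y can be expressed as

$$Y = g(\mathbf{X})$$

where $\mathbf{X} = (X_1, X_2, \ldots, X_n)$, a vector containing n random variables. In FOA, a Taylor series
expansion of the model output is truncated after the first-order term

$$Y \approx g(\mathbf{X}_e) + \sum_{i=1}^{n} \left.\frac{\partial g}{\partial X_i}\right|_{\mathbf{X}_e} (X_i - X_{ie})$$

where $\mathbf{X}_e = (X_{1e}, X_{2e}, \ldots, X_{ne})$, a vector representing the expansion points. In FOA applications
to water resources and environmental engineering, the expansion point is commonly the mean
value of the basic variables. Thus, the expected value and variance of Y are

$$E[Y] \approx g(\bar{\mathbf{X}})$$

$$\mathrm{Var}(Y) = \sigma_Y^2 \approx \sum_{i=1}^{n} \sum_{j=1}^{n} \left.\frac{\partial g}{\partial X_i}\right|_{\bar{\mathbf{X}}} \left.\frac{\partial g}{\partial X_j}\right|_{\bar{\mathbf{X}}} \mathrm{Cov}(X_i, X_j)$$

where $\sigma_Y$ is the standard deviation of Y; $\bar{\mathbf{X}} = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_n)$, a vector of mean values of the
input basic variables. If the basic variables are statistically independent, the expression for
Var(Y) becomes

$$\mathrm{Var}(Y) \approx \sum_{i=1}^{n} \left(\left.\frac{\partial g}{\partial X_i}\right|_{\bar{\mathbf{X}}}\right)^2 \sigma_{X_i}^2$$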
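For statistically independent basic variables, the first-order estimates can be sketched with numerical derivatives taken about the mean values. The function name `foa`, the step size, and the resistance and load statistics used in the illustration are all hypothetical.

```python
import math


def foa(g, means, sds, rel_step=1e-4):
    """First-order mean and standard deviation of Y = g(x1, ..., xn).

    Assumes statistically independent inputs and expands about the means.
    Derivatives are approximated by central differences.
    """
    mean_y = g(*means)
    var_y = 0.0
    for i, (m, s) in enumerate(zip(means, sds)):
        h = rel_step * abs(m) if m != 0.0 else rel_step
        up = list(means)
        up[i] = m + h
        down = list(means)
        down[i] = m - h
        dg_dx = (g(*up) - g(*down)) / (2.0 * h)
        var_y += (dg_dx * s) ** 2
    return mean_y, math.sqrt(var_y)


def performance(resistance, load):
    """Performance function Z = R - L."""
    return resistance - load


# Hypothetical statistics: R with mean 10 and sd 1.5, L with mean 6 and sd 2
mean_z, sd_z = foa(performance, [10.0, 6.0], [1.5, 2.0])
print("E[Z] =", round(mean_z, 4), " sd of Z =", round(sd_z, 4))
```

Because this performance function is linear, the first-order result coincides with the exact mean and standard deviation. For a nonlinear g it is only an approximation.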

Simplified FOA Estimates for Some Functional Forms


In most hydrologic, hydraulic, and environmental engineering problems, empirically devel-
oped or theoretically derived model equations involving several uncertain parameters are used.
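Simplified formulas can be derived in terms of the mean and CV of input random variables so
that determination of the partial derivatives ($\partial g/\partial X_i$) can be avoided.
Consider a multiplicative model in which the output random variable Y is expressed as the
multiplication of n power functions.

$$Y = C_0 \prod_{i=1}^{n} X_i^{r_i} \qquad (16.22)$$

where $C_0$ and $r_i$ are constants and the $X_i$ are independent stochastic input random variables. The
first-order mean of the model output, $\mu_Y$, can be written as

$$\mu_Y \approx C_0 \prod_{i=1}^{n} \mu_{X_i}^{r_i} \qquad (16.23)$$

where $\mu_{X_i}$ is the mean of $X_i$. The first-order variance of the multiplicative form, $\sigma_Y^2$, can be
approximated as

$$\sigma_Y^2 \approx C_0^2 \left(\prod_{i=1}^{n} \mu_{X_i}^{2 r_i}\right) \sum_{i=1}^{n} r_i^2\, CV_{X_i}^2 \qquad (16.24)$$

where $CV_{X_i}$ is the coefficient of variation of $X_i$. Dividing equation 16.24 by the square of
equation 16.23 and taking the square root, the approximate coefficient of variation of Y, $CV_Y$, can
be evaluated as

$$CV_Y \approx \sqrt{\sum_{i=1}^{n} r_i^2\, CV_{X_i}^2} \qquad (16.25)$$

Another form of interest is the additive form obtained when two or more power functions
are added. The general additive form is written as:

$$Y = \sum_{i=1}^{n} C_i X_i^{r_i} \qquad (16.26)$$

The approximate mean of Y is given as

$$\mu_Y \approx \sum_{i=1}^{n} C_i\, \mu_{X_i}^{r_i} \qquad (16.27)$$

Similarly, the variance of the additive model can be approximated by

$$\sigma_Y^2 \approx \sum_{i=1}^{n} C_i^2\, r_i^2\, \mu_{X_i}^{2 r_i}\, CV_{X_i}^2 \qquad (16.28)$$

So $CV_Y$ can be evaluated by

$$CV_Y = \frac{\sqrt{\sum_{i=1}^{n} C_i^2\, r_i^2\, \mu_{X_i}^{2 r_i}\, CV_{X_i}^2}}{\sum_{i=1}^{n} C_i\, \mu_{X_i}^{r_i}} \qquad (16.29)$$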

A third functional form is the combination of multiplicative and additive forms. This form
is obtained when two or more multiplicative forms having common power function(s) are added.
The general form can be represented as

For evaluating the mean and variance of combined forms of Y such as equation 16.30, the mean
and variance of the additive part must be determined first using equation 16.27 and equation
16.28. Next equations 16.23, 16.24, and 16.25 are used to determine the mean, variance, and CV
of Y by treating the combined form as a multiplicative form assuming the additive part as a mul-
tiplicative component with known mean and variance.
To estimate the reliability, $\Re$, of a system, it is typically assumed that Z is normally
distributed. Taking $p_Z(z)$ to be a normal distribution with its parameters E[Z] and $\sigma_Z$ determined by
FOA, equations 16.8 and 16.12 are used to determine the risk and reliability of a given system.
An alternative method to define a system reliability is the reliability index, $\beta$, which is
defined as the reciprocal of the coefficient of variation of Z, given as

$$\beta = \frac{E[Z]}{\sigma_Z}$$
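Under the normality assumption, the risk follows directly from the reliability index, since P(Z < 0) is the standard normal probability below $-\beta$. A minimal sketch is given below. The moments of Z are hypothetical values chosen only for illustration, and the function names are not from the text.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def risk_and_reliability(mean_z, sd_z):
    """Reliability index, risk P(Z < 0), and reliability for normal Z."""
    beta = mean_z / sd_z
    risk = normal_cdf(-beta)
    return beta, risk, 1.0 - risk


# Hypothetical first-order moments of the performance function Z
beta, risk, reliability = risk_and_reliability(mean_z=4.0, sd_z=2.5)
print("beta =", round(beta, 3))
print("risk =", round(risk, 4), " reliability =", round(reliability, 4))
```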

The great advantage of FOA is its simplicity, requiring knowledge of only the first two sta-
tistical moments of the basic variables and simple sensitivity calculations about selected central
values. FOA is an approximate method that may suffice for many applications (Ku 1966), but the
method does have several theoretical and/or conceptual shortcomings (Melching 1992a; Cheng
1982). The main weakness of the FOA method is that it is assumed that a single linearization of
the system performance function at the central values of the basic variables is representative of
the statistical properties of system performance over the complete range of basic input variables.
The accuracy of the estimates is influenced in part by the degree of nonlinearity in the functional
relationship, and the importance of higher-order terms which are truncated in the Taylor series
expansion (Burn and McBean, 1985). In applying FOA in risk and reliability analyses, it is
generally assumed that the performance function is normally distributed, which is seldom true. Any
attempt to characterize the tails of the actual distribution based on an assumption of normality is
likely to result in an inexact answer (Burn and McBean, 1985).

Example 16.4. Determine the first-order mean and standard deviation of the head loss function
given in example 16.1. Use the same mean and CV values as given in the preceding examples.

Solution: The FOA estimate for the mean, $\hat{\mu}_{h_f}$, is calculated using equation 16.23 as

Using equation 16.25, the FOA estimate for the CV of $h_f$ is

Thus, the first-order standard deviation is $\hat{\sigma}_{h_f} = \hat{\mu}_{h_f} CV_{h_f} = 535.29(0.43) = 230.17$ m.

Example 16.5. Using Manning's equation, the flow in a compound channel is given as

where $\lambda_m$ is the model correction factor to account for model uncertainty with mean and CV
values of 1.0 and 0.15, respectively. $Y_m$ and $Y_b$ represent section factors ($Y_i = A R^{2/3}$) for the
main channel and overbank sections, respectively. Consider section factors to be deterministic
($Y_m = 296.9$ m$^{8/3}$ and $Y_b = 0.6$ m$^{8/3}$), and $n_m$, $n_b$, and S are random variables with mean values
of 0.034, 0.068, and 0.005 and CV values of 0.17, 0.38, and 0.25, respectively. Determine the
mean and standard deviation of Q.

Solution: Substituting values of $Y_m$ and $Y_b$, the expression for flow is rewritten as

$$Q = \lambda_m \psi S^{0.5}$$

where $\psi$ is a dummy variable representing the additive form $\psi = 296.9 n_m^{-1} + 1.2 n_b^{-1}$. The
first-order mean and standard deviation of $\psi$ are calculated from equations 16.27 and 16.28 as

$$\hat{\mu}_\psi = \sum_{i=1}^{2} C_i \hat{\mu}_{Y_i} = 296.9(0.034)^{-1} + 1.2(0.068)^{-1} = 8750$$

and

$$\hat{\sigma}_\psi^2 = \sum_{i=1}^{2} C_i^2 r_i^2 \mu_{X_i}^{2r_i} CV_{X_i}^2 = 296.9^2(-1)^2(0.034)^{2(-1)}(0.17)^2 + (1.2)^2(-1)^2(0.068)^{2(-1)}(0.38)^2 = 2{,}203{,}785.2$$

So $CV_\psi = \sqrt{2{,}203{,}785.2}/8750 = 0.17$.



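Now, $Q = \lambda_m \psi S^{0.5}$ can be considered as a multiplicative form and equations 16.23 and 16.25 can
be used to determine the overall FOA mean and CV of Q. The FOA estimate for the mean, $\hat{\mu}_Q$, is

$$\hat{\mu}_Q = \mu_{\lambda_m} \hat{\mu}_\psi \mu_S^{0.5} = 1.0(8750)(0.005)^{0.5} = 618.7 \text{ m}^3/\text{s}$$

Using equation 16.25, the FOA estimate for the CV of Q is

$$CV_Q = \sqrt{CV_{\lambda_m}^2 + CV_\psi^2 + (0.5)^2 CV_S^2} = \sqrt{(0.15)^2 + (0.17)^2 + (0.5)^2(0.25)^2} = 0.26$$

Thus, the first-order standard deviation is $\hat{\sigma}_Q = \hat{\mu}_Q CV_Q = (618.7)(0.26) = 160.86$ m$^3$/s.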
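
The arithmetic of this example can be checked with a short script. The sketch below is a minimal illustration, assuming the standard first-order forms for a sum of power functions and for a product of power functions (taken here to correspond to equations 16.27, 16.28, 16.23, and 16.25); all variable names are hypothetical.

```python
import math

# Means and CVs of the random inputs in example 16.5
mean = {"lam": 1.0, "n_m": 0.034, "n_b": 0.068, "S": 0.005}
cv = {"lam": 0.15, "n_m": 0.17, "n_b": 0.38, "S": 0.25}

# Additive part psi = 296.9 * n_m**-1 + 1.2 * n_b**-1,
# stored as (coefficient C_i, variable name, exponent r_i)
terms = [(296.9, "n_m", -1.0), (1.2, "n_b", -1.0)]

# First-order mean of psi: sum of C_i * mean_i**r_i
mu_psi = sum(c * mean[v] ** r for c, v, r in terms)

# First-order variance of psi: sum of C_i^2 * r_i^2 * mean_i^(2 r_i) * CV_i^2
var_psi = sum(c**2 * r**2 * mean[v] ** (2 * r) * cv[v] ** 2 for c, v, r in terms)
cv_psi = math.sqrt(var_psi) / mu_psi

# Multiplicative form Q = lam * psi * S**0.5 (assumed first-order formulas):
# mean is the product of component means, CV^2 is the sum of r_i^2 * CV_i^2
mu_Q = mean["lam"] * mu_psi * mean["S"] ** 0.5
cv_Q = math.sqrt(cv["lam"] ** 2 + cv_psi**2 + (0.5 * cv["S"]) ** 2)
sd_Q = mu_Q * cv_Q

print(f"mean(psi) = {mu_psi:.1f}, var(psi) = {var_psi:.1f}, CV(psi) = {cv_psi:.3f}")
print(f"mean(Q) = {mu_Q:.2f}, CV(Q) = {cv_Q:.3f}, sd(Q) = {sd_Q:.2f}")
```

Carrying the unrounded CV through the last step gives a standard deviation of about 160.0 m3/s; the 160.86 m3/s quoted in the example results from rounding the CV to 0.26 before multiplying.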

Example 16.6. For a storm sewer, peak runoff, $Q_L$, is given by the rational formula as

The capacity, $Q_C$, of a sewer is given by Manning's equation as

The definition and statistical characteristics of uncertain variables are listed in table 16.4. Determine
the risk.

Solution: Using equation 16.9, the performance function can be defined as

To determine the first-order mean and standard deviation of Z, first the mean and CVs of $Q_C$ and
$Q_L$ are determined using multiplicative formulas as

Using these estimates, the mean and standard deviation of Z are determined considering it as an
additive form. Using equation 16.27
Table 16.4. Statistical data for storm sewer design

Input variable                   Definition                 Mean     CV     Distribution

$\lambda_m$ (non-dimensionless)  model correction factor    1.100           Triangular
n (SI units)                     Manning's roughness        0.015           Gamma
D (ft)                           pipe diameter              3.000           Triangular
$S_0$ (ft/ft)                    hydraulic grade line       0.005           Triangular
$\lambda_L$ (dimensionless)      model correction factor    1.000           Triangular
C (dimensionless)                runoff coefficient         0.825           Triangular
I (in./h)                        rainfall intensity         4.000           Triangular
A (acre)                         drainage area              10.00           Triangular
Note: 1 ft = 0.305 m; 1 in. = 2.54 cm; and 1 acre = 4047 m$^2$.

Using equation 16.31, the reliability index, $\beta$, is 0.79 and the corresponding risk (equation 16.16
assuming a normal distribution) is

where z is the standard normal variate defined as $z = (X - \mu_X)/\sigma_X$, and $\Phi(z)$ is the standard
normal cumulative distribution function.

Example 16.7. For the preceding example, determine the risk using the following definition of
the performance function, $Z = Q_C/Q_L - 1$.

Solution: Substituting the expressions of $Q_C$ and $Q_L$, Z is expressed as

where $\psi = Q_C/Q_L$, also known as the safety factor. The first-order mean and standard deviation of $\psi$
are determined using formulas corresponding to multiplicative forms.

Thus, $\hat{\sigma}_\psi = \hat{\mu}_\psi CV_\psi = 1.362(0.365) = 0.497$


Now, $\hat{\mu}_Z = \hat{\mu}_\psi - 1 = 1.362 - 1 = 0.362$ and $\hat{\sigma}_Z = \hat{\sigma}_\psi = 0.497$
From equation 16.16 the corresponding risk is
The results of the above examples (16.6 and 16.7) clearly show that risk estimates are different for
the two mechanically equivalent formulations under the same underlying assumption of a normal
distribution for the performance function. This indicates that the probability of failure depends upon
the formulation of the performance function. This is known as the lack of invariance problem.
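
The lack of invariance can be illustrated numerically. The sketch below is a minimal check, assuming only that Z is normally distributed so that risk = 1 - Φ(β); it uses the reliability index of 0.79 reported in example 16.6 and the first-order mean and standard deviation of Z from example 16.7. The function names are hypothetical.

```python
import math


def std_normal_cdf(z):
    """Standard normal cumulative distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def risk_from_beta(beta):
    """Risk P(Z < 0) for a normally distributed Z with reliability index beta."""
    return 1.0 - std_normal_cdf(beta)


beta_66 = 0.79            # reliability index reported in example 16.6
beta_67 = 0.362 / 0.497   # mean of Z divided by its standard deviation, example 16.7

risk_66 = risk_from_beta(beta_66)
risk_67 = risk_from_beta(beta_67)

print(f"Z = Qc - QL    : beta = {beta_66:.3f}, risk = {risk_66:.3f}")
print(f"Z = Qc/QL - 1  : beta = {beta_67:.3f}, risk = {risk_67:.3f}")
```

Both formulations describe the same failure event (capacity less than load), yet the normal approximation assigns them different reliability indices and therefore different risks.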

Monte Carlo Simulation


In Monte Carlo simulation (MCS), probability distributions are assumed for the uncertain
input variables for the system to be studied. Random values of each of the uncertain variables are
generated according to their respective probability distributions and the model describing the
system is executed. By repeating the random generation of variable values and model execution
steps many times, the statistics and an empirical probability distribution of the model output can
be determined. A schematic of MCS is illustrated in figure 16.2. The accuracy of the statistics and
probability distribution obtained from MCS is a function of the number of simulations performed
and the adequacy of the assumed parameter distributions. It requires judgement on the part of the
modeler to create theoretical input sample distributions that are representative of the populations
and to estimate the number of trials needed to generate the input and output density functions.
There is no strictly defined answer to either of these questions. Further, if the input parameters
are correlated, a multivariate simulation of the input parameters must be used.
A key problem in applying the MCS method is estimating the necessary sample size. One
empirical test consists of iterating the sample program with increasingly greater sample sizes and
estimating the convergence rate of the sample mean value towards the population mean (Burges
and Lettenmaier, 1975). The error in the estimation of the population mean is inversely proportional
to the square root of the number of trials. To improve the estimate by a factor of two, the
sample size must increase by a factor of four. If the sample size is n, the standard deviation of the
mean is $1/\sqrt{n}$ times the standard deviation of the population. This indicates that the sample size
must be large (Siddal 1983). As the sample size increases, the precision of the empirical
percentile estimates of a model output improves. However, Martz (1983) noted that the rate of
convergence to the true distribution decreases as the size of sample increases. The method often
entails sample sizes that are in the range of 5,000 to 20,000 members. Generally, the number of
required samples increases with the variances and the coefficient of skewness of the input distri-
butions (Burges and Lettenmaier 1975).
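
The square-root convergence described above can be seen in a small experiment. The sketch below is a minimal illustration, not the procedure used in the text: the model and the two input distributions are hypothetical, chosen only to show how the standard error of the simulated mean shrinks as the number of trials grows.

```python
import math
import random
import statistics


def model(x1, x2):
    """Hypothetical nonlinear model, used only for illustration."""
    return x1 * math.sqrt(x2)


def run_mcs(n, seed=1):
    """Run n Monte Carlo trials; return mean, standard deviation, and standard error."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n):
        x1 = rng.gauss(1.0, 0.15)  # assumed normal input
        # assumed lognormal input; arguments are the log-space mean and sd
        x2 = rng.lognormvariate(math.log(0.005), 0.25)
        outputs.append(model(x1, x2))
    mean = statistics.fmean(outputs)
    sd = statistics.stdev(outputs)
    return mean, sd, sd / math.sqrt(n)


for n in (500, 2000, 8000, 32000):
    mean, sd, se = run_mcs(n)
    print(f"n={n:6d}  mean={mean:.5f}  sd={sd:.5f}  standard error={se:.6f}")
```

Each fourfold increase in n roughly halves the standard error, while the sample standard deviation itself settles toward a constant.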
The fraction, $F_i$, of the total variance in model output attributable to the ith parameter based
on a MCS can be estimated by computing

where $\rho_i$ is the correlation coefficient between the ith parameter and the output as defined in equation
16.6.
Another simulation technique similar to MCS is Latin hypercube sampling (LHS), in
which a stratified sampling approach is used. In LHS the probability distribution of each basic
variable is subdivided into non-overlapping intervals (say, m), each with equal probability (1/m).
Random values of the basic variables are simulated such that each range is sampled only once. The
order of the selection of the ranges is randomized and the model is executed m times with the random
combination of basic variables from each range for each basic variable. The output statistics
and distributions may then be approximated from the sample of m output values. McKay et al.
(1979) have shown that the stratified sampling procedure of LHS converges more quickly than the
equidistribution sampling employed in MCS. The main shortcoming of this stratification scheme
is that it is one-dimensional and does not provide good uniformity properties on a k-dimensional
unit hypercube (Diwekar and Kalagnanam 1997). Apart from reducing computational effort to some
extent, LHS has the same problems that are associated with MCS.
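
The stratification step can be sketched in a few lines. The code below is a minimal illustration of the idea rather than a production sampler: the function name, the choice of m and k, and the normal input used in the mapping step are all assumptions made for the example.

```python
import random
from statistics import NormalDist


def latin_hypercube(m, k, seed=0):
    """Return m points in the k-dimensional unit hypercube.

    Each variable's [0, 1) range is cut into m equal-probability intervals,
    one value is drawn inside each interval, and the order of the intervals
    is shuffled independently for every variable.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(k):
        column = [(i + rng.random()) / m for i in range(m)]
        rng.shuffle(column)  # randomize which interval is paired with which run
        columns.append(column)
    return [tuple(column[j] for column in columns) for j in range(m)]


# Map the uniform probabilities to an input distribution by its inverse CDF.
points = latin_hypercube(m=10, k=2, seed=42)
x1_dist = NormalDist(mu=1.0, sigma=0.15)  # hypothetical normal input
x1_values = [x1_dist.inv_cdf(min(max(u, 1e-12), 1.0 - 1e-12)) for u, _ in points]

for point, x1 in zip(points, x1_values):
    print(f"u = ({point[0]:.3f}, {point[1]:.3f})  x1 = {x1:.3f}")
```

Because each column is shuffled on its own, every one-dimensional margin is perfectly stratified, but nothing forces the m points to be evenly spread in the k-dimensional cube, which is the shortcoming noted above.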

Example 16.8. Using MCS, determine the mean and variance of head loss for example 16.3.
Also determine the contribution due to each input parameter.

Solution: Using MCS, means and standard deviations of head loss are determined for different
numbers of simulations as shown in figure 16.3. The MCS estimates for the mean and standard de-
viation of head loss with 20,000 simulations were obtained as 595.5 m and 270.2 m, respectively.
Equation 16.32 and the regression of example 16.3 show that $\lambda_m$, Q, C, and D contributed 7.9,
20.8, 39.3, and 32.0 percent, respectively, of the overall variance.

Fig. 16.3. Variation of mean and standard deviation. [Plotted against number of simulations, 0 to 20,000.]


Corrected FOA Method
In this section, the properties of statistical expectation of a random variable are used to derive
moments of various model forms. When Y has a multiplicative form with strictly independent
input parameters, $X_i$'s, the mean of Y, $\mu_Y$, can be written as

where E[ ] is an expectation operator, and $\mu_{Y_i}$ is the mean of the ith power function

The coefficient of variation of Z, $CV_Z$, can be written as

The variance of Z, $\sigma_Z^2$, can be written as

Equation 16.35 shows that the output uncertainty of a multiplicative model is governed by the
most uncertain component function. Using the additive form (equation 16.26), the mean of Z, $\mu_Z$,
is given by

Similarly, the variance of Z, $\sigma_Z^2$, can be written as

Equation 16.38 shows that the magnitude of $C_i$ is as important as the uncertainty in the component
function ($\sigma_{Y_i}$).
For evaluating the mean and variance of combined forms of Z such as equation 16.30, the
mean and variance of the additive part are determined first. Next $\mu_Z$, $CV_Z$, and $\sigma_Z^2$ are determined
by treating the combined form as a multiplicative form considering the additive part as a
multiplicative component with known mean and variance.
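
For independent components, the exact moments of a product and of a weighted sum follow from standard expectation identities. The sketch below is a minimal illustration under that independence assumption; the function names are hypothetical, and the correspondence with the numbered equations of this section is assumed rather than taken from them.

```python
import math


def product_moments(means, cvs):
    """Exact mean and CV of a product of independent components.

    mean = product of component means
    CV^2 = product of (1 + CV_i^2), minus 1
    """
    mean = math.prod(means)
    cv = math.sqrt(math.prod(1.0 + c * c for c in cvs) - 1.0)
    return mean, cv


def sum_moments(coeffs, means, sds):
    """Exact mean and standard deviation of sum(C_i * Y_i) for independent Y_i."""
    mean = sum(c * m for c, m in zip(coeffs, means))
    variance = sum((c * s) ** 2 for c, s in zip(coeffs, sds))
    return mean, math.sqrt(variance)


# Illustrative numbers only
mean_p, cv_p = product_moments([2.0, 3.0], [0.1, 0.2])
mean_s, sd_s = sum_moments([1.0, 2.0], [1.0, 3.0], [0.5, 0.5])
print(f"product: mean = {mean_p:.3f}, CV = {cv_p:.4f}")
print(f"sum:     mean = {mean_s:.3f}, sd = {sd_s:.4f}")
```

In the product, the component with the largest CV dominates the result; in the sum, each component's standard deviation is scaled by its coefficient before it contributes, so a large coefficient matters as much as a large uncertainty.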

Correcting FOA Mean and Variance Estimates of an Individual Function


Knowledge of the relative error corresponding to FOA estimates ($\hat{\mu}_Y$ and $\hat{\sigma}_Y^2$) can be used to
correct them to obtain their exact values. Consider a power function

$$Y = cX^r$$

where c and r are constants. The FOA estimate for the mean (Benjamin and Cornell 1970), $\hat{\mu}_Y$, is

$$\hat{\mu}_Y = c\mu_X^r$$

The FOA estimate for the variance of Y (Benjamin and Cornell, 1970), $\hat{\sigma}_Y^2$, is

$$\hat{\sigma}_Y^2 = c^2 r^2 \mu_X^{2r} CV_X^2$$

These estimates for $\mu_Y$ and $\sigma_Y^2$ contain errors. The exact value of any moment can be computed as

$$\text{Exact value} = \frac{\text{FOA estimate}}{1 - \varepsilon(\cdot)}$$

where $\varepsilon(\cdot)$ is the relative error in a moment estimated using FOA. Analytical relationships for
$\varepsilon(\cdot)$ in FOA estimates for the means and the variances of component functions were developed
(Tyagi 2000) for generic power and exponential functions using five common distributions.
These analytical expressions can be used as a guide for judging the suitability of the FOA by
determining the relative errors in the most sensitive parameters. Further, when relative error is
more than the acceptable error, these analytical relationships enable one to correct FOA estimates
for means and variances of model components to their true values. Using these corrected
values of means and variances for model components, one can determine the exact values of
mean and variance of an overall model output. Tables 16.5 and 16.6 present the developed
expressions for $\varepsilon(\hat{\mu}_Y)$ and $\varepsilon(\hat{\sigma}_Y^2)$ for a generic power function ($Y = cX^r$). Similarly, tables 16.7
and 16.8 present the developed expressions for $\varepsilon(\hat{\mu}_Y)$ and $\varepsilon(\hat{\sigma}_Y^2)$ for a generic exponential
function ($Y = be^{cX}$).
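
The correction can be illustrated for one case where the exact moment is available in closed form. The sketch below is a minimal example, assuming a lognormal X and using the standard lognormal moment identity E[X^r] = μ^r (1 + CV^2)^(r(r-1)/2) rather than an expression copied from the tables; the relative error is taken as (exact - FOA)/exact, the definition consistent with the correction formula above. Function names are hypothetical.

```python
def foa_mean_power(c, r, mu_x):
    """First-order estimate of the mean of Y = c * X**r."""
    return c * mu_x**r


def exact_mean_power_lognormal(c, r, mu_x, cv_x):
    """Exact mean of Y = c * X**r when X is lognormal with mean mu_x and CV cv_x."""
    return c * mu_x**r * (1.0 + cv_x**2) ** (r * (r - 1.0) / 2.0)


def relative_error(foa, exact):
    """Relative error of the FOA estimate, so that exact = foa / (1 - error)."""
    return (exact - foa) / exact


# Y = S**0.5 with the mean and CV of S from example 16.5, S assumed lognormal
c, r, mu_x, cv_x = 1.0, 0.5, 0.005, 0.25
foa = foa_mean_power(c, r, mu_x)
exact = exact_mean_power_lognormal(c, r, mu_x, cv_x)
eps = relative_error(foa, exact)
corrected = foa / (1.0 - eps)

print(f"FOA mean = {foa:.6f}, exact mean = {exact:.6f}")
print(f"relative error = {eps:.5f}, corrected mean = {corrected:.6f}")
```

For this mildly nonlinear function the FOA mean is within about one percent of the exact value; larger CVs or exponents farther from 0 and 1 widen the gap.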
To further simplify the correction procedure, these analytical relationships have been pre-
sented graphically by Tyagi (2000). The relative error plots show where FOA estimates are
acceptable and where they are unacceptable and need to be corrected. In specific situations, a given

Table 16.5. Generalized relative error in FOA predicted mean of a power function

Distribution              Relative error in FOA predicted mean, $\varepsilon(\hat{\mu}_Y)$

Uniform

Symmetrical triangular    $1 - \dfrac{6(r+1)(r+2)CV_X^2}{(1 + CV_X\sqrt{6})^{r+2} + (1 - CV_X\sqrt{6})^{r+2} - 2}$


Lognormal

Gamma

Exponential

Note:
(1) To avoid singularity at r = - 1, r should be taken as -0.9999.
(2) To avoid singularity at r = -2, r should be taken as - 1.9999.
Table 16.6. Generalized relative error in FOA predicted variance of a power function

Distribution              Relative error in FOA predicted variance, $\varepsilon(\hat{\sigma}_Y^2)$

Uniform                   $1 - \dfrac{12(2r+1)\,r^2(r+1)^2 CV_X^4}{2\sqrt{3}\,CV_X(r+1)^2\left[(1+CV_X\sqrt{3})^{2r+1} - (1-CV_X\sqrt{3})^{2r+1}\right] - (2r+1)\left[(1+CV_X\sqrt{3})^{r+1} - (1-CV_X\sqrt{3})^{r+1}\right]^2}$

Symmetrical triangular    $1 - \dfrac{36(2r+1)\,r^2(r+1)^2(r+2)^2 CV_X^6}{3(r+1)(r+2)^2 CV_X^2\left[(1+CV_X\sqrt{6})^{2r+2} + (1-CV_X\sqrt{6})^{2r+2} - 2\right] - (2r+1)\left[(1+CV_X\sqrt{6})^{r+2} + (1-CV_X\sqrt{6})^{r+2} - 2\right]^2}$

Lognormal

Gamma                     $1 - \dfrac{r^2 CV_X^{2(1-2r)}\left[\Gamma(CV_X^{-2})\right]^2}{\Gamma\left[CV_X^{-2}(1+2rCV_X^2)\right]\Gamma(CV_X^{-2}) - \left\{\Gamma\left[CV_X^{-2}(1+rCV_X^2)\right]\right\}^2}$

Exponential               $1 - \dfrac{r^2}{\Gamma(2r+1) - \Gamma^2(r+1)}$

Note:
(1) To avoid singularity at r = -1, r should be taken as -0.9999.
(2) To avoid singularity at r = -2, r should be taken as -1.9999.
Table 16.7. Generalized relative error in FOA predicted mean of an exponential function

Distribution              Relative error in FOA predicted mean, $\varepsilon(\hat{\mu}_Y)$

Uniform

Symmetrical triangular

Normal

Gamma                     $1 - (1 - c\mu_X CV_X^2)^{1/CV_X^2} \exp(c\mu_X)$

Exponential               $1 - (1 - c\mu_X)\exp(c\mu_X)$

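Table 16.8. Generalized relative error in FOA predicted variance of an exponential function

Distribution              Relative error in FOA predicted variance, $\varepsilon(\hat{\sigma}_Y^2)$

Uniform                   $1 - \dfrac{12 c^4 \mu_X^4 CV_X^4\, e^{2\sqrt{3}c\mu_X CV_X}}{\left(e^{2\sqrt{3}c\mu_X CV_X} - 1\right)\left[(\sqrt{3}c\mu_X CV_X - 1)\,e^{2\sqrt{3}c\mu_X CV_X} + \sqrt{3}c\mu_X CV_X + 1\right]}$

Symmetrical triangular    $1 - \dfrac{72 c^6 \mu_X^6 CV_X^6\, e^{2\sqrt{6}c\mu_X CV_X}}{\left(e^{\sqrt{6}c\mu_X CV_X} - 1\right)^2\left[(3c^2\mu_X^2 CV_X^2 - 2)\left(e^{2\sqrt{6}c\mu_X CV_X} + 1\right) + 2e^{\sqrt{6}c\mu_X CV_X}(3c^2\mu_X^2 CV_X^2 + 2)\right]}$

Normal                    $1 - \dfrac{c^2\sigma_X^2}{\exp(c^2\sigma_X^2)\left[\exp(c^2\sigma_X^2) - 1\right]}$

Gamma                     $1 - \dfrac{c^2\mu_X^2 CV_X^2 \exp(2c\mu_X)}{(1 - 2c\mu_X CV_X^2)^{-1/CV_X^2} - (1 - c\mu_X CV_X^2)^{-2/CV_X^2}}$

Exponential               $1 - \dfrac{c^2\mu_X^2 \exp(2c\mu_X)}{(1 - 2c\mu_X)^{-1} - (1 - c\mu_X)^{-2}}$
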

function may be very nonlinear (represented either by a very large or very small exponent of a
power function). These situations can be identified and dealt with by using the relative error plots.
In the absence of knowledge of the complete pdf, but knowing the mean and variance exactly,
certain exact statements on the probability of an output random variable lying within given
bounds can be made using the Chebyshev inequality (equation 3.77), which states that

where t is a constant. In example 16.7, the safety factor is defined as $\psi = R/L = Q_C/Q_L$. Then,
the greatest lower bound of the system reliability [here, $\Re = P(\psi \geq 1)$] is given (Huang 1986) as

Considering example 16.7, the greatest lower bound of the probability of safety is $\Re \geq 0.347$.
This result may be used as a reference value.
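
The quoted bound can be reproduced with the one-sided form of the Chebyshev inequality (often called Cantelli's inequality). The sketch below is a minimal check under the assumption that this one-sided form is the bound intended here; it is not taken from Huang (1986), and the function name is hypothetical.

```python
def reliability_lower_bound(mu_psi, sd_psi, limit=1.0):
    """Distribution-free lower bound on P(psi >= limit).

    Uses the one-sided Chebyshev (Cantelli) inequality,
    P(psi - mu <= -t) <= sd^2 / (sd^2 + t^2) with t = mu - limit > 0,
    so that P(psi >= limit) >= t^2 / (sd^2 + t^2).
    """
    margin = mu_psi - limit
    if margin <= 0.0:
        raise ValueError("the bound is informative only when the mean exceeds the limit")
    return margin**2 / (sd_psi**2 + margin**2)


# First-order mean and standard deviation of the safety factor from example 16.7
bound = reliability_lower_bound(mu_psi=1.362, sd_psi=0.497)
print(f"reliability >= {bound:.3f}")
```

The bound holds for any distribution with the stated mean and variance, which is why it is much lower than the reliability obtained under the normal assumption.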

Example 16.9. Estimate the mean and standard deviation of Q given in example 16.5 assuming a
normal distribution for $\lambda_m$, a uniform distribution for $n_m$ and $n_b$, and a lognormal distribution for S.

Solution: To determine exact values of the mean and standard deviation of Q, firstly, FOA
estimates for the mean and variance of component power functions are estimated using equations
16.38 and 16.39. Then using relative error functions corresponding to given distributions from
tables 16.5 and 16.6, the FOA estimates are corrected. The calculation procedure is presented in
table 16.9.
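
The corrected values for this example can be cross-checked by computing the exact component moments directly instead of applying tabulated relative errors to FOA estimates; the two routes agree because an FOA estimate divided by one minus its relative error is the exact moment. The sketch below is a minimal illustration that assumes independent inputs and uses standard closed forms for the uniform and lognormal distributions; all names are hypothetical.

```python
import math


def uniform_inverse_moments(mu, cv):
    """Exact E[1/X] and Var[1/X] for X uniform with mean mu and coefficient of variation cv."""
    half_width = math.sqrt(3.0) * cv * mu
    lo, hi = mu - half_width, mu + half_width
    first = math.log(hi / lo) / (hi - lo)   # E[1/X]
    second = 1.0 / (lo * hi)                # E[1/X^2]
    return first, second - first**2


# Additive part psi = 296.9 / n_m + 1.2 / n_b, with n_m and n_b uniform
e_m, v_m = uniform_inverse_moments(0.034, 0.17)
e_b, v_b = uniform_inverse_moments(0.068, 0.38)
mu_psi = 296.9 * e_m + 1.2 * e_b
sd_psi = math.sqrt(296.9**2 * v_m + 1.2**2 * v_b)
cv_psi = sd_psi / mu_psi

# S**0.5 with S lognormal (mean 0.005, CV 0.25)
r, mu_S, cv_S = 0.5, 0.005, 0.25
mu_root_S = mu_S**r * (1.0 + cv_S**2) ** (r * (r - 1.0) / 2.0)
cv_root_S = math.sqrt((1.0 + cv_S**2) ** (r * r) - 1.0)

# Model correction factor: mean 1.0 and CV 0.15 enter directly (exponent of 1)
mu_lam, cv_lam = 1.0, 0.15

# Multiplicative combination of independent components
mu_Q = mu_lam * mu_psi * mu_root_S
cv_Q = math.sqrt((1 + cv_lam**2) * (1 + cv_psi**2) * (1 + cv_root_S**2) - 1.0)
sd_Q = mu_Q * cv_Q

print(f"psi: mean = {mu_psi:.1f}, sd = {sd_psi:.1f}, CV = {cv_psi:.3f}")
print(f"Q:   mean = {mu_Q:.2f}, CV = {cv_Q:.3f}, sd = {sd_Q:.2f}")
```

The script returns values that agree, to rounding, with those quoted in this solution.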
Using equation 16.35 and corrected means of the individual power functions from
table 16.9, the exact mean of the additive form, $\mu_\psi$, is 9019.9 m$^3$/s. Similarly, using equation 16.36
and corrected variances of the component power functions, $\sigma_\psi$ is 1586.2 m$^3$/s. The corresponding
$CV_\psi$ is 0.176. Now, treating Q as a multiplicative form with $\lambda_m$, $\psi$, and $S^{0.5}$ as its components with
known means and CV values, $\mu_Q$ is 632.99 m$^3$/s from equation 16.31, and $CV_Q$ is 0.265 from
equation 16.33. Thus, $\sigma_Q$ is 167.73 m$^3$