Statistical Methods in
HYDROLOGY
Second Edition
CHARLES T. HAAN
Orders: 18008626657
Office: 15152920140
Fax: 15152923348
Web site: www.iowastatepress.com
Authorization to photocopy items for internal or personal use, or the internal or personal use of
specific clients, is granted by Iowa State Press, provided that the base fee of $.10 per copy is paid
directly to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For those
organizations that have been granted a photocopy license by CCC, a separate system of payments
has been arranged. The fee code for users of the Transactional Reporting Service is 08 1381503
712002 $.10.
1 INTRODUCTION
Hydrologic data
5 NORMAL DISTRIBUTION
General normal distribution
Reproductive properties
Standard normal distribution
Approximations for standard normal distribution
Central limit theorem
Constructing pdf curves for data
Normal approximations for other distributions
Binomial distribution
Negative binomial distribution
Poisson distribution
Continuous distributions
Exercises
Parameters of probability distributions
Hypothesis testing
$H_0$: $\mu = \mu_1$; $H_a$: $\mu = \mu_2$; normal distribution, known variance
$H_0$: $\mu = \mu_1$; $H_a$: $\mu = \mu_2$; normal distribution, unknown variance
$H_0$: $\mu = \mu_0$; $H_a$: $\mu \neq \mu_0$; normal distribution, known variance
$H_0$: $\mu = \mu_0$; $H_a$: $\mu \neq \mu_0$; normal distribution, unknown variance
Test for differences in means of two normal distributions
Test of $H_0$: $\sigma^2 = \sigma_0^2$ versus $H_a$: $\sigma^2 \neq \sigma_0^2$, normal population
Test of $H_0$: $\sigma_1^2 = \sigma_2^2$ versus $H_a$: $\sigma_1^2 \neq \sigma_2^2$ for two normal populations
Test for equality of variances from several normal distributions
Testing the goodness of fit of data to probability distributions
Chi-square goodness of fit test
Distributional tests based on cumulative distributions
Comparing two empirical distributions
General comments on goodness of fit tests
Exercises
13 DATA GENERATION
Univariate data generation
Multivariate data generation
Multivariate, correlated, normal random variables
Multivariate, correlated, nonnormal random variables
Applications of data generation
Exercises
17 GEOSTATISTICS
Descriptive statistics
Semivariogram models
Combination semivariogram models
Estimation
An example
Anisotropy
Cokriging
Local and global estimation
Polygon declustering
Cell declustering
Point kriging
Block kriging
Estimation of cumulative distributions
Uncertainty
Modeling using geostatistics
APPENDIXES
A.1. Common distributions
Hydrologic data
A.2. Monthly runoff (in.), Cave Creek near Fort Spring, Kentucky
A.3. Peak discharge (cfs), Cumberland River at Cumberland Falls, Kentucky
A.4. Peak discharge (cfs), Piscataquis River, Dover-Foxcroft, Maine
A.5. Total precipitation (in.) for week of March 1 to March 7, Ashland, Kentucky
A.6. Flow and sediment load, Green River at Munfordville, Kentucky
A.7. Streamflow (in.), Walnut Gulch near Tombstone, Arizona
A.8. Monthly rainfall (in.), Walnut Gulch near Tombstone, Arizona
A.9. Annual discharge (cfs), Spray River, Banff, Canada
A.10. Annual discharge (cfs), Piscataquis River, Dover-Foxcroft, Maine
A.11. Annual discharge (cfs), Llano River, Junction, Texas
Statistical tables
A.12. Standard normal distribution
A.13. Percentile values for the t distribution
A.14. Percentile values for the chi-square distribution
A.15. Percentile values for the F distribution
A.16. Critical values for the Kolmogorov-Smirnov test statistic
A.17. Durbin-Watson test bounds
BIBLIOGRAPHY
INDEX
Preface to the
Second Edition
SINCE THE publication of the first edition of this book, statistics has come to play an
increasingly important role in hydrology. The advancements in computing technology and data
management have made the application of statistical techniques that were previously known but
difficult to implement almost routine. User-friendly software for personal computers has made
powerful statistical routines available to nearly all hydrologists. Generally, this software comes
with user manuals or help files that lead a new user through the steps needed to use the programs.
Unfortunately, these aids rarely indicate the assumptions inherent in the techniques, the limitations
of the techniques, and the situations in which the techniques should or should not be used.
They are generally weak in instructing one on the interpretation of the results of the analysis as
well. This software is a tool that is available for use in hydrology but does not replace sound
hydrologic understanding of the problem at hand nor does it replace a basic understanding of the
statistical technique being used.
This current edition should serve as a companion to many of the software programs
available, not to explain how to use the software, but to provide guidance as to the proper routines
to use for a particular problem and the interpretation of the results of the analysis.
The basic philosophy of the current edition is the same as that of the first edition. Enough
detail on particular statistical methods is presented to gain a working understanding of the technique.
Certainly the treatment of any particular statistical technique is not exhaustive. Much
theory and derivation are omitted and left to more in-depth treatments found in books dealing
specifically with the various topics.
Two chapters have been added to the book. One of these chapters deals with uncertainty
analysis and the other with geostatistics. Both of these topics have received great emphasis in
the past decade. Uncertainty analysis is a growing concern as it is increasingly recognized that
both statistical and deterministic analyses result in estimates that are far from absolute answers.
Increasingly, attempts are made to evaluate how much uncertainty should be associated with
various types of analyses. Rather than providing a point estimate of some quantity, confidence
limits are sought, such that one can assert, with various degrees of confidence, bounds within
which the sought-after quantity is thought to be. Geostatistics has become of increasing importance
as geographically referenced information becomes available and is used in geographical
information systems (GISs) to produce hydrologic estimates.
The chapter on uncertainty was written by Aditya Tyagi, a former PhD candidate at
Oklahoma State University and currently a water resources engineer with CH2M Hill. Jason
Vogel, a research engineer and PhD candidate at Oklahoma State University, was a coauthor of
the chapter on geostatistics.
Preface to the
First Edition
THE RANDOM variability of such hydrologic variables as streamflow and precipitation has
been recognized for centuries. The general field of hydrology was one of the first areas of science
and engineering to use statistical concepts in an effort to analyze natural phenomena. Many papers
have been published that amply demonstrate the value of statistical tools in analyzing and
solving hydrologic problems. In spite of the long history and proven utility of statistical techniques
in hydrology, relatively few comprehensive and basic treatments of statistical methods in
hydrology have been published.
This book has been prepared to assist engineers and hydrologists in developing an elementary
knowledge of some statistical tools that have been successfully applied to hydrologic problems.
The intent of the book is to familiarize the reader with various statistical techniques, point out
their strengths and weaknesses and demonstrate their usefulness. The serious reader will want to
supplement the material with formal courses or independent study of those individual topics that
are major interests. No single topic has been developed completely. Books have been written covering
many of the topics discussed as single chapters in this presentation. Again the purpose here
is to develop understanding and illustrate the usefulness of the techniques. Most of the techniques
are discussed in sufficient detail for a thorough understanding and application to problem situations.
The philosophy of the presentation has been that one does not have to understand hydrodynamics
to swim even though it could help one to become a more proficient swimmer.
The book has not been written for statisticians or for those primarily interested in statistical
theory. Rather it has been prepared for hydrologists and engineers interested in learning how
statistical models and methods can be valuable tools in the analysis and solution of many
hydrologic and engineering problems. The basic premise has been taken (and justifiably so) that
statisticians are competent so that many statistical results are presented without developing a
rigorous proof of their validity. Proofs for most results can be found in mathematical statistics
books, many of which are listed in the bibliography.
No prior knowledge of statistics is required if one starts with Chapter 2. Those with varying
degrees of statistical knowledge may choose to start with later chapters. A knowledge of calculus
is required throughout and some familiarity with matrices is needed for material in later chapters.
Appendix D is a review of the basic matrix manipulation used in the book (not in this new
edition).
This is not a statistical "cookbook" for hydrologists. It does not contain step-by-step calculation
procedures for "standard" hydrologic problems. Basic statistical concepts are discussed
and illustrated in enough detail so that one can develop his own computational procedures or
methods.
Most of the computations in actual work situations would be done on digital computers.
Computer programs have not been included because it is felt that most computer centers will
have programs or programmers available. Likewise computational techniques are not emphasized.
For example, in the chapter on multiple regression, efficient techniques for matrix inversion
are not presented as it is felt that these techniques are readily available at most computer centers.
The emphasis is thus retained on the statistical technique being used and not on the
computational aspects of the problem.
Some liberties have been taken in that many terms are not precisely defined in a mathematical
sense unless such a definition is warranted. Where terms are loosely defined, it is hoped that
the meticulous reader will accept the general connotation of the terms for purposes of simplicity
and to avoid placing emphasis on terms rather than concepts.
Many of the problems require sets of data. Those data may be supplied by the reader or selected
from the data in Appendix C.
I am grateful to the Literary Executor of the late Sir Ronald A. Fisher, F.R.S., to Dr. Frank
Yates, F.R.S., and to Longman Group Ltd., London, for permission to reprint Table E.5 from their
book Statistical Tables for Biological, Agricultural and Medical Research, 6th Edition (1974) (not
in this new edition).
Acknowledgments for
the Second Edition
IT HAS been nearly a quarter century since I wrote the first edition of this book. During
that time I have become indebted to many people. I have spent nearly this entire period with
the Biosystems and Agricultural Engineering Department at Oklahoma State University. This
Department has provided a wonderful atmosphere for intellectual growth and accomplishment.
The faculty, staff, and students that I have been associated with have helped to create a working
environment that was challenging, friendly, and one in which my only limitation was myself.
I am grateful to many individuals. Bill Barfield has continued to be a valued friend and
coworker. Dan Storm, Bruce Wilson, and many graduate students have been especially instrumental
in much of my research and teaching in the field of statistical hydrology.
My daughter, Dr. Patricia Haan, assistant professor in the Biological and Agricultural
Engineering Department at Texas A&M University, has been very helpful in clarifying some
points in the text and correcting errors.
Certainly my wife of 34 years, Jan, has been most supportive and forgiving as I have devoted
far too much time to work.
As is true of all of us, I owe whatever I have accomplished to my Creator without Whom
I could accomplish nothing.
Acknowledgments for
the First Edition
MUCH OF the material presented in this book was developed for a course taught to students
in the Agricultural Engineering and Civil Engineering Departments at the University of Kentucky.
The suggestions and clarifications made by the students in this course over the past 8 years
have been a great aid in attempting to make this book more understandable.
Special acknowledgment must be given to Dan Carey for his careful readings of the entire
manuscript. These readings resulted in several corrections and clarifications. Several individuals
have read parts of the book and made valuable suggestions for its improvement. Among
those reviewing parts of the manuscript were Donn DeCoursey, David Allen, David Culver,
and personnel of the U.S. Soil Conservation Service under the direction of Neil Bogner.
Several individuals in the Agricultural Engineering Department at the University of Kentucky
offered valuable suggestions and considerable encouragement. Deserving special mention are Billy
Barfield, Blaine Parker, and John Walker.
This undertaking has required sacrifice on the part of my family and especially my wife Janice.
She not only typed the early drafts of the book but offered continued encouragement over the years
as work and revisions were done on the book.
This manuscript was reproduced from photo-ready copy. The excellent typing involved in
preparing this final draft as well as an earlier draft was done by Pat Owens. Buren Plaster drafted
all of the figures.
Of course any failings and shortcomings of this book must be credited to me. My hope is that it
will be found useful in at least partially meeting the need for an elementary treatment of statistical
methods in hydrology. Whatever is accomplished along these lines I owe to our Father for giving me
the will to see this project through and the ability to withstand the setbacks experienced along the way.
Finally I express my appreciation to all of the members of the Agricultural Engineering
Department at the University of Kentucky for their understanding during the preparation of this
manuscript.
Statistical Methods in
HYDROLOGY
1. Introduction
MORE THAN 25 years ago I set about writing a book on the application of statistical techniques
to hydrology. That book, published in 1977, became the first edition of this current work
and was appropriately titled Statistical Methods in Hydrology. Although soundly criticized for
producing a book of the general type "Statistics for ," that was little more than a "relevant
Schaum's Outline series" on statistics with a little hydrology thrown in (Burges 1978), the book
has had a very wide reception, has gone through several printings and has been widely quoted in
the literature. However, as I have reflected on this critique over the years, and as I have used statistics
to address problems in hydrology and observed others doing the same, I have come to the
conclusion that this critique contained a large element of truth.
There is no shortage of very fine books at many levels of complexity on statistics. The
theory of statistical procedures and the assumptions in statistical procedures are well explained
and widely available. The same statistical techniques might be applied to hydrologic data or to
the comparison of the value of the Japanese yen to the U.S. dollar. Statistical techniques are based
in mathematics and probability. The units attached to the data being studied are immaterial from
a statistical standpoint. What is important is the degree to which the data agree with the assumptions
inherent in the statistical procedure being applied.
Similarly, there are many books on hydrology. Some of these books are quite general, some
are quite theoretical, some are quite empirical, and none are really exhaustive. The problem with
hydrology is that it is, in practice, very messy. For example, we can present in great detail the
mathematical development of equations describing the overland flow of water on planes of various
types and how flow profiles develop and how runoff hydrographs result at the lower end of
these planes. There exist very elegant solutions for these problems, albeit often numerical
procedures are required to arrive at these solutions. With rapid advances in computing technology,
this presents a rapidly diminishing problem.
The real problem as I see it is that we have developed an elegant solution to a nonexistent
problem. In my lifetime I have observed many rainfall-runoff events and have rarely seen the
type of flow described above except in artificial situations such as parts of parking lots or streets
covering a tiny fraction of a drainage basin. If there is any overland flow, before it goes very far
flow concentration develops and the overland flow "planes" become very nonuniform.
Does that mean it is wrong to develop and present these idealized equations? Does that
mean it is wrong to use models that contain these equations to develop runoff hydrographs? NO!
It simply means that one must be aware of the relationships between the mathematics of the
model and the actual hydrology that is occurring. Through proper selection of roughness coefficients
and other coefficients in such models, good estimates of runoff hydrographs may result.
Yet that does not mean that the model actually describes in exact detail the hydrologic processes
that are occurring. We must not confuse actual hydrologic processes with models of these
processes.
On numerous occasions I have seen those practicing hydrology confusing hydrologic models
with actual hydrologic systems. The complexity, the nonhomogeneity, the dynamic nature of
actual hydrologic systems are not recognized. The uncertainty inherent in parameters used by hydrologic
models to particularize the model to a specific catchment or hydrologic problem is not
recognized. The numbers produced by the model are taken as the true hydrologic response of the
actual hydrologic system. More disturbing, the algorithms that make up the model are taken as
true and exact representations of the hydrologic systems they purport to represent. Quite likely
the one using the hydrologic model has great skills in modeling and in computers but little
understanding of the complexity of hydrologic systems.
At this point one might be wondering why I have jumped on mathematical models when this
book is about statistics. The answer lies in my experience over the years that statistical methods
are often criticized for not being physically based and not representing what is actually occurring
in the field. Yet all hydrologic models, not just statistical models, are susceptible to this criticism.
Statistical models are often applied just as are mathematical models with little regard to the
assumptions in the models. Some take model results as truth, especially if the statistical or mathematical
technique is complex. Others will reject model results on the basis that all assumptions
are not met. So basically, in hydrology, we face the same dilemmas whether we use mathematical
or statistical models.
No model describes the actual and complete hydrology of anything but the simplest of
settings. Regardless of what approach we use toward solving an actual hydrologic problem,
compromise must be made with the methodology employed. One can never turn professional
judgment over to any particular hydrologic model whether the model is mathematical, statistical,
or some combination of the two. Any model must be seen as an aid to judgment and not as a
replacement for it.
There are no completely theoretical models and no completely statistical models. All models
have components of both theory and statistics. Both are techniques for quantifying our
understanding and our observations of hydrologic processes. The presence of theory or statistics
may not be a formal presence, but it is there. This leads to the conclusion that all models have
statistical components to some degree. Any constants that are estimated based on observations,
even observations formalized into tables like Manning's n values, have been determined by formal
or informal application of statistics. Any statistical model should be formulated based on
some understanding of the system being modeled. This understanding may be brought into the
model through a conceptual structure of the model. These conceptual components are what bring
hydrology into the model as opposed to having a purely statistical model. In my view, one should
not ignore hydrology when developing models for use in hydrology no matter how sophisticated
the statistical techniques that are being used. To the extent that hydrologic knowledge is used in
structuring a statistical model, the model may be said to contain conceptual components. Statistical
models should not be developed by simply throwing data on every conceivable variable into
some computerized statistical routine and hoping for the best.
As far as the hydrologist is concerned, statistics is not an end in itself. Statistics is a tool that
may help one to understand hydrological data. The fact that, to hydrology, statistics is a tool must
be kept foremost in mind. It must also be kept in mind that statistics is just one of several tools
available for application in hydrology.
Hydrologic processes are not driven by principles of statistics but by physical, chemical,
and biological principles, the so-called "Laws of Nature". Often the hydrologic setting is of such
complexity that the underlying component hydrologic processes cannot be expressed in such a
way as to yield a suitable computational framework for describing the system. Perhaps the
mix of surface soil properties, land uses, topography, and so forth is such that the setting of a
particular hydrologic problem cannot be adequately described. Perhaps the complexity and
heterogeneity of the system are such as to preclude deterministic modeling. Perhaps data are available
on a response variable such as stream flow, water quality, or ground water level, but not on
the causative variables of rainfall, evaporation, infiltration, and so on. In such a case statistical
techniques may be needed in an effort to uncover descriptive behavioral relationships among the
data. Such relationships are not cause-effect relationships but descriptive relationships. The
relationships may support hypotheses concerning cause and effect but do not conclusively establish
such relationships.
Over the past 20 years I have seen many inappropriate applications of statistics in hydrology.
I have seen hydrologists stake their reputation as hydrologists on statements made based on
poor knowledge of statistics. I have also seen statisticians make far-reaching conclusions with a
very elementary knowledge of hydrology; here the argument goes "the data show . . .". The data
are separated from their hydrologic reality and analyzed as pure numbers!
One thing that has compounded the problem of inappropriate use of statistics in hydrology
(or any other field, I suspect) is the ready availability of powerful statistical software that is easy
to use. I applaud the availability of this software but shudder at some of the applications that are
made with it.
Sometimes a statistical procedure is improperly applied or applied in inappropriate circumstances.
The numbers generated by a statistical analysis are then venerated as absolute truth. It
would be better to apply a technique while recognizing and admitting its shortcomings, and then to use the
results as a guide rather than religiously adopting the results and claiming they represent reality.
This long introduction has been composed to impart some of my hydrologic-statistical
modeling philosophy and to alert the reader that this book will emphasize the assumptions
inherent in statistical techniques and the consequences of violating these assumptions. Statistical
techniques will be explained at the practical level without many derivations and proofs. References
to these will be given. The book will be most useful to someone having at least an elementary
knowledge of mathematical statistics and hydrology. This book addresses the interface of
these two disciplines.
The question naturally arises as to what is meant by hydrology in this book. Hydrology
broadly defined is the study of water. The Federal Council for Science and Technology (1962)
defined hydrology as
the science that treats of the waters of the Earth, their occurrence, circulation, and
distribution, their chemical and physical properties, and their reaction with their environment,
including their relation to living things. The domain of hydrology embraces
the full life history of water on the earth.
This definition is more or less used in this book. The definition is broad and includes topics
some may consider to be more proper to geology, engineering, environmental science, biology,
chemistry, paleontology, or some other science. Some may even feel it includes aspects which are
nonscientific. By using this definition, when the word "hydrology" is used, it includes these other
areas as well.
Statistics will be considered in a limited sense in the context of this book. Statistics will be
defined as
Models are often used in developing this understanding and in making inferences. Model is
a general term that will be taken to mean
There are many ways of collecting physical laws and empirical observations and of combining
them to produce a model. Models can generally be represented as

$$O = f(I, P) + e$$

where $O$ represents the outputs or quantities to be estimated; $f(\ldots)$ represents the mathematical
structure of the model; $I$ represents inputs to the model, boundary conditions, and initial conditions;
$P$ represents parameters that help particularize the model to a specific situation; and $e$ represents
differences between what actually occurs, $O$, and what the model predicts, $\hat{O}$.
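The general representation described above (observed outputs equal a model structure applied to inputs and parameters, plus an error term) can be made concrete with a minimal sketch. The linear runoff relation, the parameter names `runoff_coefficient` and `initial_loss`, and every number below are hypothetical, chosen only to show how the pieces relate; none of them is taken from this book.

```python
def runoff_model(rainfall, params):
    """f(I, P): a hypothetical model structure.

    Runoff is taken as a fixed fraction of rainfall in excess of an
    initial loss. Both the form and the parameters are assumptions.
    """
    c = params["runoff_coefficient"]
    loss = params["initial_loss"]
    return [c * max(r - loss, 0.0) for r in rainfall]


# I: inputs (storm rainfall depths, inches) -- made-up values
rainfall = [0.5, 1.2, 2.0, 3.1]

# P: parameters that particularize the model to one catchment -- made-up values
params = {"runoff_coefficient": 0.4, "initial_loss": 0.3}

# O: what actually occurred (observed runoff, inches) -- made-up values
observed = [0.10, 0.30, 0.75, 1.05]

# Model predictions and the error term e = observed - predicted
predicted = runoff_model(rainfall, params)
errors = [o - p for o, p in zip(observed, predicted)]

for r, o, p, e in zip(rainfall, observed, predicted, errors):
    print(f"rain={r:4.1f}  observed={o:5.2f}  predicted={p:5.2f}  error={e:+5.2f}")
```

The errors are not zero even though the model "fits" reasonably well, which is the point of carrying an explicit error term: the model output is an estimate of the hydrologic response, not the response itself.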
There are many ways of classifying models. Some people draw sharp distinctions between
statistical models and other models. In practice one cannot do a thorough modeling exercise
without drawing on statistics in some way. Often some type of statistical work has to be done to
come up with values for parameters for a model that might otherwise be considered a nonstatistical
model. Thus, the parameters of the model become some function of observations. If another
set of observations were used, presumably different parameter values would result. Since observations
(data) in hydrology are generally thought of as random variables and any function of a
random variable is a random variable, the parameters for the model effectively become random
variables and thus a statistical element enters a model that might otherwise not be considered as
a statistical model.
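
The point that fitted parameters are themselves random variables can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the one-parameter model (runoff = c × rainfall), the "true" coefficient of 0.4, the noise level, and the function names are hypothetical and are not taken from this book.

```python
import random

def fit_runoff_coefficient(rain, runoff):
    """Least-squares slope through the origin for the model runoff ~ c * rain."""
    return sum(r * q for r, q in zip(rain, runoff)) / sum(r * r for r in rain)

def synthetic_sample(n, true_c, rng):
    """Hypothetical observations: runoff = true_c * rain + random noise."""
    rain = [rng.uniform(10.0, 100.0) for _ in range(n)]
    runoff = [true_c * r + rng.gauss(0.0, 5.0) for r in rain]
    return rain, runoff

rng = random.Random(1)
TRUE_C = 0.4  # assumed value, known only because the data are synthetic

# Fit the same model to five independent sets of 20 observations.
estimates = []
for _ in range(5):
    rain, runoff = synthetic_sample(20, TRUE_C, rng)
    estimates.append(fit_runoff_coefficient(rain, runoff))

for k, c in enumerate(estimates, start=1):
    print(f"sample {k}: estimated c = {c:.3f}")
```

Each set of observations yields a slightly different estimate of c, even though the model structure never changes.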
Broadly speaking, quantitative hydrologic models fall on a continuous spectrum of model
"types" ranging from completely deterministic on the one hand to completely stochastic on the
other. A completely deterministic model would be one arrived at through consideration of the underlying
physical relationships and would require no experimental data for its application.
Statistical models range in complexity from estimating the most likely outcome or result of an
experiment to describing in detail a sequence (time series) of outcomes that mimic actual outcomes.
All statistical approaches rely on observations. The mathematical techniques used to
extract the information contained in the observations may be as simple as computing an average
or so complex as to require thousands of stochastic simulations.
Most hydrologic models fall somewhere between the extremities of this model spectrum.
Often such models are termed parametric models. A parametric model may be thought of as deterministic
in the sense that once model parameters are determined, the model always produces
the same output from a given input. On the other hand, a parametric model is stochastic in the
sense that parameter estimates depend on observed data and will change as the observed data
changes. A stochastic model is one whose outputs are predictable only in a probabilistic sense.
With a stochastic model, repeated use of a given set of model inputs produces outputs that are not
the same but follow certain statistical patterns. A statistical model is one arrived at by applying
statistical methods to a set of data to produce an estimation procedure. Multiple regression models
are examples of statistical models. In this sense, all stochastic models are statistical models
but not all statistical models are stochastic models.
No matter how simple the hydrologic system or how complex the hydrologic model, the model
is always an approximation to the system. There are no hydrologic models (deterministic, stochastic,
or combined) that represent exactly anything but the most trivial of hydrologic systems.
The digital computer has made possible great advances in all types of hydrologic models.
These advancements are noteworthy for both stochastic and deterministic models and have led
some hydrologists to vigorously adopt the philosophy that all hydrologic problems should be
attacked stochastically and some the philosophy that they should be attacked deterministically. The
purpose of this book is not to promote statistical or stochastic models but to present some basic
statistical concepts that have been found useful as aids for the solution of hydrologic problems.
Many hydrologic problems can best be solved through the joint application of the various
modeling methods. For instance, it may be possible to adequately predict the runoff hydrograph
from a simple watershed deterministically given the rainfall input. It is unlikely, however, that
rainfalls that will occur during the life of a water resources project will be deterministically predictable.
Thus, one approach to project evaluation would be a stochastic simulation of rainfall,
deterministic conversion of the rainfall to streamflow, and a statistical analysis of the resulting
streamflows.
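
The three-step approach just described (stochastic rainfall, deterministic conversion, statistical analysis) might be sketched as follows. This is only a schematic: the gamma distribution for annual rainfall, its parameters, and the linear rainfall-runoff rule with a fixed loss are hypothetical placeholders, not models recommended in this book.

```python
import random
import statistics

def simulate_annual_rainfall(n_years, rng):
    """Stochastic step: draw annual rainfall depths (mm) from an assumed gamma distribution."""
    return [rng.gammavariate(4.0, 250.0) for _ in range(n_years)]

def rainfall_to_streamflow(rain_mm, coeff=0.35, loss_mm=100.0):
    """Deterministic step: the same rainfall always gives the same flow (hypothetical rule)."""
    return max(0.0, coeff * (rain_mm - loss_mm))

rng = random.Random(42)
rain = simulate_annual_rainfall(1000, rng)
flow = [rainfall_to_streamflow(r) for r in rain]

# Statistical step: summarize the simulated streamflows.
mean_flow = statistics.mean(flow)
sd_flow = statistics.stdev(flow)
print(f"mean flow = {mean_flow:.1f} mm, standard deviation = {sd_flow:.1f} mm")
```

Only the first step involves chance; rerunning with a different seed changes the rainfall sequence and therefore the summary statistics, while the conversion rule stays fixed.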
Regardless of the type of model that is used, model parameters must be determined in some
way from observed hydrologic data. The validity and applicability of a model depend directly on
the characteristics of the data used to estimate model parameters. A model can be no better than
the data available for parameter estimation. The data used for parameter estimation must be representative
of the situation in which the model is going to be used. Obviously, if one is attempting
to model streamflow from an urban area, model parameters cannot be estimated from forested
watersheds. Similarly, future hydrologic behavior of a watershed can be modeled based on past
observations only if available historical data are representative of future conditions. If drastic
land use changes are to be made, then the model parameters must be adjusted accordingly.
All techniques used for hydrologic analysis rely on assumptions. Often the strict validity of
the analysis depends on how well the true system meets these assumptions. This is certainly true
of statistical models and statistical methods applied to hydrologic systems.
There are no statistical procedures whose assumptions exactly match particular hydrologic
systems. Likewise there are no hydrologic systems that exactly meet the assumptions made in
any particular hydrologic model.
With this in mind one is forced to the conclusion that models cannot yield an exact solution
to any realistic hydrologic problem. Models must be treated as a tool that can be used to gain
insight and to arrive at potential outcomes in a given hydrologic setting, but the final decision regarding
any hydrologic process rests with the hydrologist, not the models. The hydrologist may
choose to adopt a solution generated from modeling considerations, but this decision must be
based on the hydrologist's convictions that the solution is hydrologically sound and not simply
on how well the model describes the data. How close the final real solution is to the model
solution will certainly depend on how well the physical setting matches the assumptions of the
modeling techniques employed. It is the hydrologist who must make the determination as to
the relationship between the model result and hydrologic reality.
The fact that a statistical modeling procedure requires assumptions that are not strictly met
in a particular hydrologic setting does not mean that statistically derived results are of no value.
Again, the statistical modeling technique is used to provide insight into the problem at hand and
not the final result. Even when it is known that certain assumptions are violated, useful information
can often be obtained from a statistical modeling effort.
Throughout this book, assumptions that accompany the statistical technique being discussed
will be set forth and discussed from a hydrologic standpoint. The potential problems associated
with violating the assumptions will be discussed. One of the frustrations that is constantly faced
in using statistical models to represent hydrologic systems is trying to determine if assumptions
are met or to what extent assumptions are not met for a particular set of data and the effect of not
meeting assumptions on conclusions reached using the method.
One might come away feeling that it is inappropriate to use statistics in hydrology. That is not
the case at all. What is inappropriate is for an analyst to relegate absolute hydrologic authority to
a statistical analysis at the expense of hydrologic knowledge of the system and to give no weight
to other tools available, such as mathematical models and common sense.
Deterministic hydrologic models, whether numerical or conceptual, suffer the same problems
in terms of assumptions as do statistical models. Rarely are hydrologic models adequately
tested over the full range of conditions for which they will be applied. Rarely are all of the assumptions
associated with hydrologic models actually set forth. For instance, one assumption
inherent in hydrologic models is that a basin's hydrologic response to a rare or extreme event can
be modeled with the same algorithms used to model common or predominant events.
In hydrologic frequency analysis, the criticism is often justifiably leveled that estimating a
rare flood, say a 500-year flood, from a record of 20 or 30 years, none of which are extraordinarily
large, is fraught with the possibilities of errors. The question is asked, how could
relatively common flow levels have information embedded in them that would determine the
magnitude of a 500-year event? Said in another way by example, in Oklahoma most annual peak
flows from smaller watersheds are generated from thunderstorms that arise over the Great Plains
of the central United States. The really big floods may be the result of a hurricane sweeping in
from the Gulf of Mexico and traveling over Oklahoma. How can flow data from thunderstorms
predict flow magnitudes of hurricane-related floods?
But the same questions apply to deterministic hydrologic models. If a model is formulated
and parameters estimated based on common flow levels, how can one be sure these same parameter
values and algorithms apply to extreme events?
In both cases, flood frequency analysis and modeling, information is gained about the possible
magnitude of the 500-year event. For certain neither estimate is exact! In addition to these
estimates the hydrologist should do some field work, look at channel capacities, possibly look
for evidence of extreme floods in the geologic past (paleohydrology), and rely on as much
hydrologic reasoning as possible to arrive at the final estimate of the 500-year event. One should
additionally attempt to place some type of uncertainty bands on the estimate.
What is being suggested is that responsibility for a hydrologic estimate rests squarely on the
hydrologist rather than on some analytic technique. One cannot blame the log-Pearson type III
distribution for making a bad flood frequency estimate. The problem is not the distribution itself
(after all the distribution is just a mathematical equation) but the inappropriate application of the
distribution in making the estimate. One cannot blame a hydrologic model if a hydraulic structure
fails because the flow estimated by the model was in error. One may conclude that the model
was inappropriate but it was the hydrologist that made the estimate using the model as a tool.
HYDROLOGIC DATA
Hydrologic data seem to be simultaneously abundant and scarce. We are deluged with data
on rainfall, temperature, snowfall, and relative humidity from around the world on a daily basis
in newspapers, radio and television reports, and on worldwide computer information networks.
Many agencies worldwide collect and archive hydrologic data on streamflow, lake and reservoir
levels, ground water elevations, water quality measures, and other aspects of the hydrologic
cycle. These data are available in many different forms. Currently access to hydrologic data is
being rapidly improved as the data are made available over electronic networks.
Yet in the face of this apparent abundance, data on a particular aspect of the hydrologic cycle
at a particular location for a particular time period are often inadequate or completely lacking. It
is often the task of the hydrologist to use any data that can be found having some application to
the problem at hand, hydrologic models of various kinds, plus their own hydrologic knowledge
to explain past, present, or anticipated hydrologic behavior of the system under study. Statistical
procedures are used to evaluate the data, transfer the data to the problem at hand, select models
and model parameters, evaluate model predictions, organize one's personal conception of how
available data and knowledge come to bear on the problem, make predictions of future behavior
of the system, and many other aspects of hydrologic problem solving.
Hydrologic data are generally presented as values at particular times, such as a river stage at
a particular time, or values averaged over time, such as the annual flow for a stream for a particular
year. Aggregating data into averages over time intervals may cause a loss of information if
the variability of the process within the time period is of interest. Conversely, aggregation may
make it possible to more clearly visualize long-term trends because short-term variations about
the trend may be removed. The variability from observation to observation in a time series of hydrologic
data may be very rapid and significant or very minor. Generally systems having a lot of
storage vary more slowly than systems lacking that storage. Figure 1.1 is a plot of the water surface
elevation of the Great Salt Lake near Salt Lake City, Utah. This figure shows that during the
period of this record, water level changes of about 20 feet have occurred but year-to-year change
is relatively slow with the exception of 1982-1984, when a rise of about 4 feet per year occurred,
and in the late 1980s when the level dropped rather quickly.
Figure 1.2 shows the annual peak discharge for the Kentucky River near Salvisa, Kentucky.
There is little year-to-year carryover or storage in this river system, so the flows vary more or
less randomly from one year to the next.
Figure 1.3 shows the water surface elevation of Devils Lake in North Dakota. The behavior
of this lake is puzzling in that it has gone from nearly 1440 feet in elevation in 1867 to 1401
feet in 1940 in an almost continuous decline, at which point an erratic but steady increase in elevation
began until it reached 1447 feet in 1999.
Fig. 1.1. Water surface elevation of the Great Salt Lake near Salt Lake City, Utah.
Fig. 1.2. Annual peak flows on the Kentucky River near Salvisa, Kentucky (horizontal axis: year, 1895 to 1995).
In the case of the Salt Lake data, a model that estimated the water level in one year based
solely on the level the previous year might produce reasonable estimates. The form of such a
model would be $y_t = y_{t-1}$, where $y_t$ is the water level at time t and $y_{t-1}$ is the water level at the
previous time t - 1. Such a model may give a better prediction of the lake level in year t than
would a model $y_t = \bar{y}$, where $\bar{y}$ is the average lake level. The opposite is the case in the Kentucky
River peak flow data. Here $y_t = \bar{y}$ would be better than $y_t = y_{t-1}$. The previous year's flow is of
little value in predicting the current year's flow.
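
The contrast between the two simple models can be sketched numerically. The two series below are synthetic stand-ins, not the Great Salt Lake or Kentucky River records: a random walk is assumed for a storage-dominated system and independent lognormal values for a system with little carryover. The function names and all numerical settings are hypothetical.

```python
import random

def mse(predicted, observed):
    """Mean squared prediction error."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

def compare_models(series):
    """Return (MSE of y_t = y_{t-1}, MSE of y_t = mean of the series) for t = 1..n-1."""
    observed = series[1:]
    previous = series[:-1]
    mean_level = sum(series) / len(series)
    return mse(previous, observed), mse([mean_level] * len(observed), observed)

rng = random.Random(7)
n = 200

# Storage-dominated system (assumed): each value is the last value plus a small change.
lake = [1280.0]
for _ in range(n - 1):
    lake.append(lake[-1] + rng.gauss(0.0, 1.0))

# Little carryover (assumed): each year's value is independent of the previous one.
peaks = [rng.lognormvariate(10.8, 0.4) for _ in range(n)]

lake_prev, lake_mean = compare_models(lake)
peak_prev, peak_mean = compare_models(peaks)
print(f"storage series:     previous-value MSE = {lake_prev:.2f}, mean MSE = {lake_mean:.2f}")
print(f"independent series: previous-value MSE = {peak_prev:.3g}, mean MSE = {peak_mean:.3g}")
```

For the storage-dominated series the previous value is the better predictor; for the independent series the long-term mean is better, since using last year's value roughly doubles the squared error.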
A model for Devils Lake would be difficult to surmise based simply on lake level data,
because even a reasonable estimate for the long-term average lake level could not be determined
on this record of over 100 years. Simply based on the data, one cannot determine the maximum
elevation reached prior to 1867 or what elevation the lake might achieve in the absence of
human interference after 1999. Presumably, physical and hydrologic information would shed
some light on this problem. These considerations will be discussed in detail and quantified later
in the book.
In selecting data for model parameter estimation, it is important to establish that the data are
representative and homogeneous over time or can be adjusted for any nonhomogeneities that
may be present. If anything has occurred to cause a change in the characteristic being analyzed,
the data must either be adjusted to account for the change or analyzed in two sections: one before
the change and one after.
Some common causes of nonhomogeneities are relocating gages (especially rain gages),
diverting streamflows, constructing dams, watershed changes such as urbanization or deforestation,
stream channel alterations and possibly weather modification, as well as natural events of a
catastrophic nature such as earthquakes, hurricane floods, and so forth. In some instances the data
can be corrected for changes. One possible adjustment would be by reverse reservoir routing to
determine what streamflows would have been had a reservoir not been constructed. Some
changes such as gradual urbanization of a watershed are difficult to correct.
The statement that the data must be representative means, for example, that data from only
unusually wet or dry periods should not be used alone as this will bias the results of the analysis.
If there are only a few years of record available for analysis, the chances are good that the data
are not representative of the long-term variability that actually exists. Most stochastic models assume
that the data being considered are homogeneous and representative.
The concept of the return period of hydrologic events plays an important role in hydrology.
The return period of an event is defined as the average elapsed time between occurrences of an
event with a certain magnitude or greater. For example, a 25-year peak discharge is a discharge
that is equaled or exceeded on average once every 25 years over a long period of time. It does not
mean that an exceedance occurs every 25 years, but that the average time between exceedances
is 25 years. An exceedance is an event with a magnitude equal to or greater than a certain value.
Sometimes the actual time between exceedances is called the recurrence interval. With this
definition for recurrence interval, the average recurrence interval for a certain event is equal to
the return period of that event. In this book, recurrence interval is used in the same sense as return
period.
Of course, the concept of return period can also be applied to low flows, droughts, shortages,
and so on. In this case the return period would be the average time between events with a certain
magnitude or less. Such an event might still be called an exceedance in the sense that the severity
of a drought exceeds some preset level.
Regardless of whether the return period is referring to an event greater than some value or
to an event less than some value, the return period can be related to a probability of an
exceedance. If an exceedance occurs on the average once every 25 years, then the probability or
chance that the event occurs in any given year is 1/25 = 0.04 or 4%. Probability, p, of an event
occurring in any one year and return period, T, in years, are thus related by
$$p = \frac{1}{T}$$
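
The relation p = 1/T, and the reading of T as an average time between exceedances, can be checked with a short simulation. The sketch assumes that years are independent and that the exceedance probability is the same every year; the function names are hypothetical.

```python
import random

def annual_probability(return_period_years):
    """Annual exceedance probability p = 1/T for a return period T in years."""
    if return_period_years < 1:
        raise ValueError("return period must be at least 1 year")
    return 1.0 / return_period_years

def mean_time_between_exceedances(p, n_years, rng):
    """Simulate independent years and average the gaps between exceedance years."""
    exceed_years = [year for year in range(n_years) if rng.random() < p]
    gaps = [later - earlier for earlier, later in zip(exceed_years, exceed_years[1:])]
    return sum(gaps) / len(gaps)

p = annual_probability(25)
rng = random.Random(3)
avg_gap = mean_time_between_exceedances(p, 100_000, rng)
print(f"p = {p:.2f}, simulated average time between exceedances = {avg_gap:.1f} years")
```

Individual gaps in such a simulation vary widely; only their long-run average approaches 25 years.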
Example 1.1. The mean annual suspended sediment load for the Green River near Munfordville,
Kentucky, can be estimated from the data contained in Appendix B. This data and the resulting
estimated mean annual suspended sediment load may contain many types of errors. Systematic
errors could result if the flow was sampled for sediment only when the depth of flow exceeded a
preset stage. This is because low flows would not be sampled.
Generally, the sediment concentration in low flows is less than that in higher flows. Thus a
built-in bias or systematic error is produced. Measurement errors could result from plugged samplers,
samplers not properly aligned with the direction of flow, allowing the sampler to pick up
some bed load, and a number of other reasons. Data transmittal errors and processing errors can
result from mistakes in transcribing data from data forms, placing data in the wrong columns on
spreadsheets or data entry forms, illegibly written data, and other sources.
Sampling error can be illustrated by assuming that the tabulated data are exactly correct
(contain no systematic, measurement, transmittal, or processing errors). If the mean annual suspended
sediment load is calculated for each successive 5-year period, the results are 640,827;
484,739; 497,604; and 460,392 tons per year. Under the no error assumption, 4 different values
of the mean annual suspended sediment discharge have been calculated each of which contains
no errors yet none of which are the same. The difference in the 4 estimates is caused by natural
variability in the phenomena (sediment) being sampled. This difference is called sampling error.
If conditions on the watershed contributing to the Green River near Munfordville never changed
and if the climatic conditions do not change, then theoretically the sampling error can be made as
small as desired by an increase in the sample size above the 5 years used in this illustration. A practical
limitation is imposed by the length of the available sediment load data record.
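
The size of the sampling error in this example can be summarized from the four 5-year means quoted above. The sketch assumes the four periods are non-overlapping and of equal length, so that the mean of the four values equals the mean over the whole period they cover; the variable names are hypothetical.

```python
import statistics

# Mean annual suspended sediment loads (tons per year) for the four 5-year periods
# quoted in Example 1.1.
five_year_means = [640_827, 484_739, 497_604, 460_392]

overall_mean = statistics.mean(five_year_means)
spread = max(five_year_means) - min(five_year_means)
deviations = [m - overall_mean for m in five_year_means]

print(overall_mean)  # 520890.5
print(spread)        # 180435
for m, d in zip(five_year_means, deviations):
    print(f"{m:>8,d}  deviation from overall mean: {d:+,.1f}")
```

Each 5-year estimate differs from the longer-period mean even though the data are taken as exact, which is the sampling error described in the example.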
Much of the statistical machinery discussed in this book is concerned with sampling errors
and the estimation of population characteristics from samples of data. The fact that sampling
errors are inherent in random data does not mean, however, that statistical manipulations and
sophistication can in any way overcome faulty data. The quality of any statistical analysis is no
better than the quality of the data used. It can be worse but no better. Furthermore, statistical
considerations should not be used to replace judgement and careful thought in analyzing hydrologic
data. In many instances some intelligent thought is worth reams of computer output based
on a statistical analysis of some data. Statistics should be regarded as a tool, an aid to understanding,
but never as a replacement for useful thought.
Rarely will one find a hydrologic problem that exactly fulfills all of the requirements for the
application of one statistical technique or another. Two choices are thus available. One can redefine
the problem so that it meets the requirement of the statistical theory and thus produce an
"exact" answer to the artificial problem. The second approach is to alter the statistical technique
where possible and then apply it to the real problem realizing that the results will be an approximate
answer to the real problem. In this case the degree of the approximation depends on the
severity of the violated assumptions. This latter approach is preferable and requires knowledge
of available statistical techniques, of assumptions and theory underlying the techniques, and of
the consequences of violating the assumptions. It is toward this latter approach that this book is
oriented.
Most of the examples and exercises used in this book were selected for pedagogical reasons,
not to promote a particular technique. Thus, when a problem involves fitting a normal distribution
to annual peak flow, the purpose of the problem revolves around learning about the normal
distribution and is not to demonstrate that a normal distribution is applicable to peak flows. Similarly,
many examples and problems had to be simplified so that they could be realistically solved
with attention focused on the statistical technique and not the many fascinating intricacies of
most real problems. That is not to say the techniques do not apply to real problems; quite the
contrary. However, most real problems involve multiple aspects, lots of data, and many considerations
other than statistical ones. Rather than get involved in these other important aspects,
many of the examples and problems are idealizations of real situations.
Because the exercises were selected as a learning aid, it will be instructive to at least read
the problems at the end of each chapter. Many of the problems present useful results that supplement
the material in that chapter.
Many actual problems in hydrology require considerable computation. Digital computers are
used for this purpose. Special statistical-numerical procedures have been developed to simplify
the computations involved and improve the accuracy of the results obtained from many of the
analyses presented in this book. These procedures are not presented here. Rather the emphasis is
on the principles involved. Some statistical techniques such as geostatistics and multivariate techniques
often require extensive calculation and considerable efficiency is gained by using special-purpose
programs incorporating numerical shortcuts and safeguards against roundoff errors.
Finally, there are many important areas of statistical analysis applicable to hydrology that
are not included in this book. These omitted techniques for the most part require knowledge of
the material contained in this book before they can be applied. Thus, this book is an introduction
to statistical methods in hydrology. Furthermore, the book is not intended as a handbook or
statistical "cookbook" for hydrologists. The purpose of this book is to enable the reader to better
apply statistical methods to hydrologic problems through a knowledge of the methods, their
foundations and limitations.
2. Probability and Probability Distributions: Basic Concepts
HYDROLOGIC PROCESSES may be thought of as stochastic processes. Stochastic in this
sense means involving an element of chance or uncertainty where the process may take on any of the
values of a specified set with a certain probability. An example of a stochastic hydrologic process
is the annual maximum daily rainfall over a period of several years. Here the variable would be the
maximum daily rainfall for each year and the specified set would be the set of positive numbers.
The instantaneous maximum peak flow observed during a year would be another example
of a stochastic hydrologic process. Table 2.1 contains such a listing for the Kentucky River near
Salvisa, Kentucky. By examining this table it can be seen that there is some order to the values
yet a great deal of randomness exists as well. Even though the peak flow for each of the 99 years
is listed, one cannot estimate with certainty what the peak flow for 1998 was. From the tabulated
data one could surmise that the 1998 peak flow was "probably" between 20,600 cfs and 144,000
cfs. We would like to be able to estimate the magnitude of this "probably". The stochastic nature
of the process, however, means that one can never estimate with certainty the exact value for the
process (peak discharge) based solely on past observations.
The definition of stochastic given above has some theoretical drawbacks, as we shall see.
Hydrologic processes are continuous processes. The probability of realizing a given value from
a continuous probability distribution is zero. Thus, the probability that a variable will take on a
certain value from a specified set is zero, if the variable is continuous. Practically this presents no
problem because we are generally interested in the probabilities that the variate will be in some
range of values. For instance, we are generally not interested in the probability that the flow rate
will be exactly 100,000 cfs but may desire to estimate the probability that the flow will exceed
100,000 cfs, or be less than 100,000 cfs, or be between 90,000 and 120,000 cfs.
Table 2.1. Peak discharge (cfs) Kentucky River near Salvisa, Kentucky
With this introduction, several concepts such as probability, continuous, and probability
distribution have been introduced. We will now define these concepts and others as a basis for
considering statistical methods in hydrology.
PROBABILITY
In the mathematical development of probability theory, the concern has been not so much
how to assign probability to events, but what can be done with probability once these assignments
are made. In most applied problems in hydrology, one of the most important and difficult
tasks is the initial assignment of probability. We may be interested in the probability that a certain
flood level will be exceeded in any year or that the elevation of a piezometric head may be
more than 30 feet below the ground surface for 20 consecutive months. We may want to determine
the capacity required in a storage reservoir so that the probability of being able to meet the
projected water demand is 0.97. To address these problems we must understand what probability
means and how to relate magnitude to probabilities.
The definition of probability has been labored over for many years. One definition that is
easy to grasp is the classical or a priori definition:
If a random event can occur in n equally likely and mutually exclusive ways, and if $n_a$
of these ways have an attribute A, then the probability of the occurrence of the event
having attribute A is $n_a/n$, written as
$$\mathrm{prob}(A) = \frac{n_a}{n} \qquad (2.1)$$
This definition is an a priori definition because it assumes that one can determine before the
fact all of the equally likely and mutually exclusive ways that an event can occur and all of the
ways that an event with attribute A can occur. The definition is somewhat circular in that
"equally likely" is another way of saying "equally probable" and we end up using the word
"probable" to define probability. This classical definition is widely used in games of chance
such as card games and dice and in selecting objects with certain characteristics from a larger
group of objects. This definition is difficult to apply in hydrology because we generally cannot
divide hydrologic events into equally likely categories. To do that would require knowledge of
the likelihood or probability of the events, which is generally the objective of our analysis and
not known before the analysis.
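
For a game of chance, where the equally likely outcomes can be listed before the fact, the a priori definition amounts to counting. The following sketch uses fair dice as the assumed example; the helper name `classical_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def classical_probability(outcomes, has_attribute):
    """A priori probability n_a / n, assuming the listed outcomes are equally likely."""
    outcomes = list(outcomes)
    n = len(outcomes)
    n_a = sum(1 for outcome in outcomes if has_attribute(outcome))
    return Fraction(n_a, n)

die = range(1, 7)

# One fair die: 3 of the 6 faces are even.
p_even = classical_probability(die, lambda face: face % 2 == 0)

# Two fair dice: 6 of the 36 ordered pairs sum to 7.
p_sum_seven = classical_probability(product(die, repeat=2), lambda roll: sum(roll) == 7)

print(p_even, p_sum_seven)  # 1/2 1/6
```

No comparable enumeration is available for annual flood peaks, which is why this definition is of limited direct use in hydrology.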
The classical definition of probability takes on more utility in hydrology in terms of relative
frequencies and limits.
If a random event occurs a large number of times n and the event has attribute A in
$n_a$ of these occurrences, then the probability of the occurrence of the event having
attribute A is
$$\mathrm{prob}(A) = \lim_{n \to \infty} \frac{n_a}{n} \qquad (2.2)$$
Fig. 2.1. Coin flipping experiment (horizontal axis: number of trials).
outcomes (heads or tails), so n is 2. There is one outcome with heads, so $n_a$ is 1. Thus the probability
of a head is 1/2. If the coin is not balanced so that the two outcomes are not equally likely,
we could not use the a priori definition. We had to know the answer to our question before we
could apply the a priori definition.
This is not the case when the relative frequency definition is used. Obviously we cannot flip
the coin an infinite number of times. We have to resort to a finite sample of flips. Figure 2.1 shows
how the estimate of the probability of a head changes as the number of trials (flips) changes. A
trend toward 1/2 is noted. This is called stochastic convergence toward 1/2. One question that might
be asked is, "is the coin unbiased?" One's initial reaction is that more trials are needed. It can be
seen that the probability is slowly converging toward 1/2 but after 250 trials is not exactly equal to
1/2. This is the plight of the hydrologist. Many times more trials or observations are needed but are
not available. Still, the data does not clearly indicate a single answer. This is where probability
and statistics come into play.
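
A coin flipping experiment of this kind is easy to imitate with a pseudorandom number generator. The sketch below is not the experiment behind Figure 2.1; it assumes a fair coin (p_head = 0.5) and uses a hypothetical function name, and it simply tracks the relative frequency of heads as the number of trials grows.

```python
import random

def running_relative_frequency(n_trials, p_head, rng):
    """Return the estimates n_a / n of prob(head) after each of n_trials flips."""
    heads = 0
    estimates = []
    for n in range(1, n_trials + 1):
        if rng.random() < p_head:
            heads += 1
        estimates.append(heads / n)
    return estimates

rng = random.Random(2)
estimates = running_relative_frequency(100_000, 0.5, rng)

# Print the estimate at a few sample sizes, including the 250 trials discussed above.
for n in (10, 50, 250, 1000, 100_000):
    print(f"n = {n:>6d}: estimated prob(head) = {estimates[n - 1]:.4f}")
```

The early estimates wander noticeably; only with a very large number of flips does the relative frequency settle close to 1/2, and it need not equal 1/2 exactly at any finite n.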
Equation 2.2 allows us to estimate probabilities based on observations and does not require
that outcomes be equally likely or that they all be enumerated. This advantage is somewhat offset
in that estimates of probability based on observations are empirical and will only stochastically
converge to the true probability as the number of observations becomes large. For example,
in the case of annual flood peaks, only one value per year is realized. Figure 2.2 shows the probability
of an annual peak flow on the Kentucky River exceeding the mean annual flow as a function
of time starting in 1895. Note that each year additional data becomes available to determine
both the mean annual flow and the probability of exceeding that value. Here again, a convergence
toward 1/2 is noted yet not assured. In fact, there is no reason to believe that 1/2 is the "correct"
probability since the probability distribution of annual peak flows is likely not symmetrical about
the mean.
Fig. 2.2. Probability that the annual peak flow on the Kentucky River exceeds the mean annual
peak flow (horizontal axis: year).
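
The remark that 1/2 need not be the "correct" probability can be illustrated with a skewed distribution. The sketch assumes exponentially distributed values purely because that distribution is skewed and has a known answer, prob(X > mean) = exp(-1), or about 0.368; it is not offered as a model of Kentucky River peak flows, and the function name is hypothetical.

```python
import math
import random

def running_exceedance_of_mean(values):
    """After each new value, the fraction of values so far that exceed the mean so far."""
    estimates = []
    for n in range(1, len(values) + 1):
        so_far = values[:n]
        mean_so_far = sum(so_far) / n
        estimates.append(sum(1 for v in so_far if v > mean_so_far) / n)
    return estimates

rng = random.Random(11)
values = [rng.expovariate(1.0) for _ in range(3000)]  # assumed skewed "annual peaks"
estimates = running_exceedance_of_mean(values)

print(f"estimate after {len(values)} values: {estimates[-1]:.3f}")
print(f"exact value for this distribution:  {math.exp(-1):.3f}")
```

As in Figure 2.2, both the mean and the exceedance probability are re-estimated as each value arrives. For this right-skewed distribution the estimate settles well below 1/2.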
If two independent sets of observations are available (samples), an estimate of the probabil
ity of an event A could be determined from each set of observations. These two estimates of
prob(A) would not necessarily equal each other nor would either estimate necessarily equal the
true (population) prob(A) based on an infinitely large sample. This dilemma results in an important
area of concern to hydrologists: how many observations are required to produce "acceptable"
estimates for the probabilities of events?
From either equation 2.1 or 2.2 it can be seen that the probability scale ranges from zero to
one. An event having a probability of zero is impossible, whereas one having a probability of one
will happen with certainty. Many hydrologists like to avoid the endpoints of the probability scale,
zero and one, because they cannot be absolutely certain regarding the occurrence or nonoccur
rence of an event. Sometimes probability is expressed as a percent chance with a scale ranging
from 0% to 100%. Care must be taken to not confuse the percent chance values with true proba
bilities. A probability of one is very different from a 1% chance of occurrence as the former im
plies the event will certainly happen while the latter means it will happen only one time in 100.
In mathematical statistics and probability, set and measure theory are used in defining and
manipulating probabilities. An experiment is any process that generates values of random vari
ables. All possible outcomes of an experiment constitute the total sample space known as the
population. Any particular point in the sample space is a sample point or element. An event is a
collection of elements known as a subset.
To each element in the sample space of an experiment a nonnegative weight is assigned
such that the sum of the weights on all of the elements is 1. The magnitude of the weight is pro
portional to the likelihood that the experiment will result in a particular element. If an element is
quite likely to occur, that element would have a weight of near 1. If an element was quite unlikely
to occur, that element would have a weight of near zero. For elements outside the sample space,
a weight of zero is assigned. The weights assigned to the elements of the sample space are known
as probabilities. Here again, the word likelihood is used to define probability so that the defini
tion becomes circular.
Letting S represent the sample space, E_i for i = 1, 2, ... represent elements in S, A and B
represent events in S, and prob(E_i) represent the probability of E_i, it follows that

$$0 \le \text{prob}(E_i) \le 1$$

and

$$\text{prob}(S) = \sum_i \text{prob}(E_i) = 1$$

Fig. 2.3. Venn diagram illustrating a sample space, elements, and events.
Using notation from set theory and Venn diagrams, several probabilistic relationships can be
illustrated. If A and B are two events in S, then the probability of A or B, shown as the shaded
areas of figure 2.3, is given by

$$\text{prob}(A \cup B) = \text{prob}(A) + \text{prob}(B) - \text{prob}(A \cap B) \tag{2.6}$$

Note that in probability the word "or" means "either or both". The notation ∪ represents a union
so that A ∪ B represents all elements in A or B or both. The notation ∩ represents an intersection
so that A ∩ B represents all elements in both A and B. The last term of equation 2.6 is needed
since prob(A) and prob(B) both include prob(A ∩ B). Thus, prob(A ∩ B) must be subtracted
once so the net result is only one inclusion of prob(A ∩ B) on the right-hand side of the equation.
If A and B are mutually exclusive, then both cannot occur and prob(A ∩ B) = 0. In this case

$$\text{prob}(A \cup B) = \text{prob}(A) + \text{prob}(B)$$

Figure 2.3 illustrates the case where events A and B are mutually exclusive and figure 2.4
shows A and B when they are not mutually exclusive.
If A^c represents all elements in the sample space S that are not in A, then

$$\text{prob}(A \cup A^c) = 1$$

This statement says that the probability of A or A^c is certainty since one or the other must occur.
All of the possibilities have been exhausted. Since A and A^c are mutually exclusive,

$$\text{prob}(A) = 1 - \text{prob}(A^c) \tag{2.7}$$

Equation 2.7 often makes it easy to evaluate probability by first evaluating the probability that an
outcome will not occur.
An example is evaluating the probability that a peak flow q exceeds some particular flow q_0.
A would be all q's greater than q_0 and A^c would be all q's less than q_0. Because q must be either
greater than or less than q_0, prob(q > q_0) = 1 − prob(q < q_0). We show later that for continuous
random variables prob(q = q_0) = 0.
If the probability of an event B depends on the occurrence of an event A, then we write
prob(B | A), read as the probability of B given A or the conditional probability of B given that A
has occurred. The prob(B) is conditioned on the fact that A has occurred. Referring to figure 2.4,
it is apparent that conditioning on the occurrence of A restricts consideration to A. Our total sample
space is now A. The occurrence of B given that A has occurred is represented by A ∩ B. Thus
prob(B | A) is given by

$$\text{prob}(B \mid A) = \frac{\text{prob}(A \cap B)}{\text{prob}(A)} \tag{2.8}$$

assuming of course that prob(A) ≠ 0. Equation 2.8 can be rearranged to give the probability of
A and B as

$$\text{prob}(A \cap B) = \text{prob}(A)\,\text{prob}(B \mid A) \tag{2.9}$$

Now if prob(B | A) = prob(B), we say that B is independent of A. Thus the joint probability of
two independent events is the product of their individual probabilities.
Example 2.1. Using the data shown in table 2.1, estimate the probability that a peak flow in excess
of 100,000 cfs will occur in 2 consecutive years on the Kentucky River near Salvisa, Kentucky.
Solution: From table 2.1 it can be seen that a peak flow of 100,000 cfs was exceeded 7 times in
the 99-year record. If it is assumed that the peak flows from year to year are independent, then the
probability of exceeding 100,000 cfs in any one year is approximately 7/99 or 0.0707. Applying
equation 2.9, the probability of exceeding 100,000 cfs in two successive years is found to be
0.0707 × 0.0707 or 0.0050.
Example 2.2. A study of daily rainfall at Ashland, Kentucky, has shown that in July the
probability of a rainy day following a rainy day is 0.444, a dry day following a dry day is 0.724, a rainy
day following a dry day is 0.276, and a dry day following a rainy day is 0.556. If it is observed
that a certain July day is rainy, what is the probability that the next two days will also be rainy?
Solution: Let A be a rainy day 1 and B be a rainy day 2 following the initial rainy day. The
probability of A is 0.444 since this is the probability of a rainy day following a rainy day.
Now, the prob(B | A) is also 0.444 since this is the probability of a rainy day following a rainy day.
Therefore

$$\text{prob}(A \cap B) = \text{prob}(A)\,\text{prob}(B \mid A) = 0.444 \times 0.444 = 0.197$$

The probability of two rainy days following a dry day would be 0.276 × 0.444, or 0.123.
Note that the probabilities of wet and dry days are dependent on the previous day. Independence
does not exist. It can be shown that over a long period of time, 67% of the days will be
dry and 33% will be rainy with the conditional probabilities as stated. If one had assumed
independence, then the probability of two consecutive rainy days would have been 0.33 × 0.33
= 0.1089, regardless of whether the preceding day had been rainy or dry. Since the probability
of a rainy day following a rainy day is much greater than following a dry day, persistence
is said to exist.
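The arithmetic of example 2.2, including the long-run proportions quoted without derivation, can
be sketched as follows. The variable names are illustrative, and the long-run formula assumes the
rainfall process behaves as a two-state chain with the stated conditional probabilities.

```python
# Conditional probabilities for July at Ashland, Kentucky (from example 2.2).
p_rain_after_rain = 0.444
p_rain_after_dry = 0.276

# Two rainy days in a row, given the state of the starting day.
two_rainy_after_rainy = p_rain_after_rain * p_rain_after_rain
two_rainy_after_dry = p_rain_after_dry * p_rain_after_rain

# Long-run fraction of rainy days r solves r = r * p_rr + (1 - r) * p_dr.
long_run_rainy = p_rain_after_dry / (1.0 - p_rain_after_rain + p_rain_after_dry)

print(round(two_rainy_after_rainy, 3))
print(round(two_rainy_after_dry, 3))
print(round(long_run_rainy, 2), round(1.0 - long_run_rainy, 2))
```

The last line reproduces the 33% rainy and 67% dry proportions stated in the example.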
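If B_1, B_2, ..., B_n represent a set of mutually exclusive and collectively exhaustive events, the
probability of another event A can be determined from

$$\text{prob}(A) = \sum_{i=1}^{n} \text{prob}(A \mid B_i)\,\text{prob}(B_i) \tag{2.10}$$

This is called the theorem of total probability. Equation 2.10 is illustrated by figure 2.5.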
Example 2.3. It is known that the probability that the solar radiation intensity will reach a
threshold value is 0.25 for rainy days and 0.80 for nonrainy days. It is also known that for this
particular location the probability that a day picked at random will be rainy is 0.36. What is the
probability the threshold intensity of solar radiation will be reached on a day picked at
random?
Solution: Let A represent the threshold solar radiation intensity, B_1 represent a rainy day, and B_2
a nonrainy day. From equation 2.10, we know that

$$\text{prob}(A) = \text{prob}(A \mid B_1)\,\text{prob}(B_1) + \text{prob}(A \mid B_2)\,\text{prob}(B_2) = 0.25(0.36) + 0.80(0.64) = 0.602$$
BAYES THEOREM
By rewriting equation 2.8 in the form

$$\text{prob}(B_j \mid A) = \frac{\text{prob}(A \cap B_j)}{\text{prob}(A)} = \frac{\text{prob}(A \mid B_j)\,\text{prob}(B_j)}{\text{prob}(A)}$$

and then substituting from equation 2.10 for prob(A), we get what is called Bayes Theorem:

$$\text{prob}(B_j \mid A) = \frac{\text{prob}(A \mid B_j)\,\text{prob}(B_j)}{\sum_{i=1}^{n} \text{prob}(A \mid B_i)\,\text{prob}(B_i)} \tag{2.11}$$

As pointed out by Benjamin and Cornell (1970), this simple derivation of Bayes Theorem
belies its importance. It provides a method for incorporating new information with previous or
so-called prior probability assessments to yield new values for the relative likelihood of events
of interest. These new (conditional) probabilities are called posterior probabilities. Equation
2.11 is the basis of Bayesian Decision Theory. Bayes theorem provides a means of estimating
probabilities of one event by observing a second event. Such an application is illustrated in
example 2.4.
Example 2.4. The manager of a recreational facility has determined that the probability of 1000
or more visitors on any Sunday in July depends upon the maximum temperature for that Sunday
as shown in the following table. The table also gives the probabilities that the maximum
temperature will fall in the indicated ranges. On a certain Sunday in July, the facility has more than 1000
visitors. What is the probability that the maximum temperature was in the various temperature
classes?
Solution: Let T_j for j = 1, 2, ..., 6 represent the 6 intervals of temperature. Then from equation
2.11

$$\text{prob}(<60^{\circ}\text{F} \mid 1000 \text{ or more}) = \frac{0.05(0.05)}{0.507} = 0.005$$

Similar calculations yield the last column in the above table. Note that
$\sum_{j=1}^{6} \text{prob}(T_j \mid 1000 \text{ or more})$ is equal to 1.
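Because the table for example 2.4 is not reproduced here, the mechanics of Bayes theorem can
instead be sketched with the values of example 2.3. The function name `posterior` is an
assumption for illustration; the posterior probabilities it prints are computed from the example
2.3 inputs and do not appear in the text.

```python
def posterior(priors, likelihoods):
    """Bayes theorem: prob(B_j | A) from prob(B_j) and prob(A | B_j).

    Also returns prob(A), the denominator given by the theorem of total probability.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint], total


# Example 2.3: B_1 = rainy day, B_2 = nonrainy day, A = threshold intensity reached.
priors = [0.36, 0.64]
likelihoods = [0.25, 0.80]
post, prob_a = posterior(priors, likelihoods)

print(round(prob_a, 3))
print([round(p, 3) for p in post])
```

Observing that the threshold was reached lowers the probability that the day was rainy from the
prior value of 0.36 to roughly 0.15, and the posterior probabilities sum to 1 as noted in example 2.4.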
COUNTING
In applying equation 2.1 to determine probabilities, one often encounters situations where it
is impractical to actually enumerate all of the possible ways that an event can occur. To assist in
this matter certain general mathematical formulas have been developed.
If E_1, E_2, ..., E_n are mutually exclusive events such that E_i ∩ E_j = ∅ for all i ≠ j, where ∅
represents an empty set, and E_i can occur in n_i ways, then the compound event E made up of
outcomes E_1, E_2, ..., E_n can occur in n_1 n_2 ... n_n ways.
The problem of sampling or selecting a sample of r items from n items is commonly
encountered. Sampling can be done either with replacement, so that the item selected is immediately
returned to the population, or without replacement, so that the item is not returned. The
order of sampling may be important in some situations and not in others. Thus, we may have four
types of samples: ordered with replacement, ordered without replacement, unordered with
replacement, and unordered without replacement.
The notation n! is read "n factorial": n! = n(n − 1)(n − 2) ... (2)(1). By definition 0! = 1.
Unordered sampling without replacement is similar to ordered sampling without replacement
except that in the case of ordered sampling the r items selected can be arranged in r! ways. That is,
an ordered sample of r items can be selected from r items in (r)_r, or r!, ways. Thus r! of the ordered
samples will contain the same elements. The number of different unordered samples is therefore
(n)_r/r!, commonly written as $\binom{n}{r}$ and called the binomial coefficient. The binomial
coefficient gives the number of combinations possible when selecting r items from n without replacement.
The number of ways of selecting samples under the four above conditions is summarized in the
following table:

             With replacement                Without replacement
Ordered      n^r                             (n)_r = n!/(n − r)!
Unordered    (n + r − 1)!/[(n − 1)! r!]      n!/[(n − r)! r!]
Example 2.5. For a particular watershed, records from 10 rain gages are available. Records from
3 of the gages are known to be bad. If 4 records are selected at random from the 10 records, (a)
What is the probability that 1 bad record will be selected? (b) What is the probability that 3 bad
records will be selected? (c) What is the probability that at least 1 bad record will be selected?
Solution: The total number of ways of selecting 4 records from the 10 available records (order is
not important) is

$$\binom{10}{4} = \frac{10!}{6!\,4!} = 210$$

(a) The number of ways of selecting 1 bad record from 3 bad records and 3 good records
from 7 good records is

$$\binom{3}{1}\binom{7}{3} = 3(35) = 105$$

Applying equation 2.1 and letting A = 1 bad and 3 good records, the probability of A is 105/210
or 0.500.
(b) The number of ways of selecting 3 bad records and 1 good record is

$$\binom{3}{3}\binom{7}{1} = 1(7) = 7$$

so the probability of selecting 3 bad records is 7/210 or 0.033.
(c) The probability of at least 1 bad record is equal to the probability of 1 or 2 or 3 bad
records. The probability of 1 and 3 bad records is known to be 0.500 and 0.033 respectively.
The probability of 2 bad records is

$$\frac{\binom{3}{2}\binom{7}{2}}{210} = \frac{3(21)}{210} = 0.300$$

Thus, the probability of at least 1 bad record is 0.500 + 0.300 + 0.033 = 0.833. This latter result
could also be determined from the fact that the probability of 0 or 1 or 2 or 3 bad records must
equal one. The probability of at least 1 bad record thus equals

$$1 - \text{prob}(0 \text{ bad}) = 1 - \frac{\binom{3}{0}\binom{7}{4}}{210} = 1 - \frac{35}{210} = 0.833$$
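The counting in example 2.5 can be checked with the binomial coefficient from the Python standard
library. The function name `prob_bad` and its default arguments are assumptions that simply encode
the 3 bad and 7 good records of the example.

```python
from math import comb


def prob_bad(k, n_bad=3, n_good=7, n_draw=4):
    """Probability of exactly k bad records when n_draw records are drawn without replacement."""
    return comb(n_bad, k) * comb(n_good, n_draw - k) / comb(n_bad + n_good, n_draw)


p = [prob_bad(k) for k in range(4)]   # 0, 1, 2, or 3 bad records
print([round(x, 3) for x in p])
print(round(p[1] + p[2] + p[3], 3))   # at least 1 bad, summed directly
print(round(1 - p[0], 3))             # at least 1 bad, from the complement
```

Both routes to part (c) give the same value, and the four probabilities sum to one.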
GRAPHICAL PRESENTATION
Hydrologists are often faced with large quantities of data. Since it is difficult to grasp the
total data picture from tabulations such as table 2.1, a useful first step in data analysis is to use
graphical procedures. Throughout this book various graphical presentations will be illustrated.
Helsel and Hirsch (1992) contains a very good treatment of graphical presentations of hydrologic
data. The appropriate graphical technique depends on the purpose of the analysis. As various
analytic procedures are discussed throughout this book, graphical techniques that supplement the
procedures will be illustrated. Undoubtedly the most common graphical representation of hydrologic
data is a time series plot of magnitude versus time. Figure 1.2 is such a plot for the peak
flow data for the Kentucky River. A plot such as this is useful for detecting obvious trends in the
data in terms of means and variances or serial dependence of data values next to each other in
time. Figure 1.2 reveals no obvious trends in the mean annual peak flow or variances of annual
peak flows. It also shows no strong dependence of peak flow in one year on the peak flow of the
previous year. We previously indicated that figure 1.1, showing the water level in the Great Salt
Lake of Utah, indicated an apparent year-to-year relationship. Such a relationship is reflected in
the serial correlation coefficient. Serial correlation will be discussed in detail later in the book.
For now suffice it to say that the serial correlation for the annual lake level for the Great Salt Lake
is 0.969 and for the annual peak flows on the Kentucky River is 0.067. This indicates a very
strong year-to-year correlation for the Salt Lake data and insignificant correlation for the peak
flow on the Kentucky River. Serial correlation indicates dependence from one observation to the
next. It also indicates persistence. In the Salt Lake data, lake levels tend to change slowly with
high levels following high levels. In the Kentucky River data, no such pattern is evident.
In conducting a probabilistic analysis, a graphical presentation that is often quite useful is
a plot of the data as a frequency histogram. This is done by grouping the data into classes and
then plotting a bar graph with the number or the relative frequency (proportion) of observations
in a class versus the midpoint of the class interval. The midpoint of a class is called the class
mark. The class interval is the difference between the upper and lower class boundaries.
Figure 2.6 is such a plot for the Kentucky River peak flow data. Frequency histograms are of
most value for data that are independent from observation to observation. A frequency histogram
for the Kentucky River data would have the same general shape and location regardless
of what period of record was selected. The Salt Lake data would have a different location if the
period 1890 to 1910 was used in comparison to the period 1870 to 1890 because the levels were
higher over the later period. Certainly no satisfactory histogram of levels could be developed for
Devils Lake, North Dakota (figure 1.3).
Fig. 2.6. Frequency histogram for the Kentucky River peak flow data.
The selection of the class interval and the location of the first class mark can appreciably
affect the appearance of a frequency histogram. The appropriate width for a class interval depends
upon the range of the data, the number of observations, and the behavior of the data. Several
suggestions have been put forth for forming frequency histograms. Spiegel (1961) suggests that
there should be 5 to 20 classes. Steel and Torrie (1960) state that the class interval should not
exceed one-fourth to one-half of the standard deviation of the data. Sturges (1926) recommends that
the number of classes be determined from

$$m = 1 + 3.3 \log n$$

where m is the number of classes, n is the number of data values, and the logarithm to the base
10 is used.
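As a quick illustration, Sturges' rule can be evaluated for the 99-year Kentucky River record of
example 2.1. The function name `sturges_classes` is an assumption, and rounding the result up to a
whole number of classes is one reasonable convention, not something the rule itself prescribes.

```python
import math


def sturges_classes(n):
    """Number of classes from Sturges' rule, m = 1 + 3.3 log10(n)."""
    return 1 + 3.3 * math.log10(n)


m = sturges_classes(99)   # 99 annual peak flows
print(round(m, 2), math.ceil(m))
```

The result, between 7 and 8 classes, falls within the range of 5 to 20 classes suggested by
Spiegel (1961).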
Whatever criterion is used, it should be kept in mind that sensitivity is lost if too few or too
many classes are used. Too few classes will eliminate detail and obscure the basic pattern of the
data. Too many classes result in erratic patterns of alternating high and low frequencies. Figure 2.7
represents the Kentucky River data with too many class intervals. If possible, the class intervals
and class marks should be round figures. This is not a computational or theoretical consideration,
but one aimed at making it easier for those viewing the histogram to grasp its full meaning.
In some situations it may be desirable to use nonuniform class intervals. In chapter 8 a
situation is presented where the intervals are such that the expected relative frequencies are the same
in each class.
Another common method of presenting data is in the form of a cumulative frequency
distribution. Cumulative frequency distributions show the frequency of events less than (greater than)
some given value. They are formed by ranking the data from the smallest (largest) to the largest
(smallest), dividing the rank by the number of data points, and plotting this ratio against the
corresponding data value. If the data are ranked from the smaller (larger) data values to the larger
(smaller) values, the resulting cumulative frequency refers to the frequency of observations less
(more) than or equal to the corresponding data value. Figure 2.8 is a cumulative frequency plot
based on the Kentucky River peak flow data. Again, ranking and plotting data in this fashion is
most meaningful if the data are not serially correlated.

Fig. 2.7. Frequency histogram for the Kentucky River data with too many classes. (Horizontal
axis: peak flow, 1000s cfs.)

Fig. 2.8. Cumulative frequency distribution of Kentucky River data. (Horizontal axis: flow,
1000s cfs.)
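The ranking procedure just described can be sketched in a few lines. The function name
`cumulative_frequency` and the five flow values are hypothetical and are used only to show the
mechanics of ranking from smallest to largest and dividing the rank by the number of data points.

```python
def cumulative_frequency(data):
    """Return (value, rank / n) pairs with the data ranked from smallest to largest.

    Each ratio is the relative frequency of observations less than or equal to the value.
    """
    ordered = sorted(data)
    n = len(ordered)
    return [(x, (rank + 1) / n) for rank, x in enumerate(ordered)]


# Hypothetical peak flows in 1000s of cfs, for illustration only.
flows = [62.0, 48.5, 91.2, 70.3, 55.1]
for value, freq in cumulative_frequency(flows):
    print(value, freq)
```

Sorting with `sorted(data, reverse=True)` instead would give the frequency of observations greater
than or equal to each value.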
RANDOM VARIABLES
Simply stated, a random variable is a real-valued function defined on a sample space. If the
outcome of an experiment, process, or measurement has an element of uncertainty associated
with it such that its value can only be stated probabilistically, the outcome is a random variable.
This means that nearly all quantities in hydrology (flows, precipitation depths, water levels,
storages, roughness coefficients, aquifer properties, water quality parameters, sediment loads,
number of rainy days, drought duration, and so forth) are random variables.
Random variables may be discrete or continuous. If the set of values a random variable can
assume is finite (or countably infinite), the random variable is said to be a discrete random
variable. If the set of values a random variable can assume is uncountably infinite, such as an
interval of the real line, the random variable is said to be a continuous random variable. An example
of a discrete random variable would be the number of rainy days experienced at a particular
location over a period of 1 year. The amount of rain received over the year would be a continuous
random variable. For the most part in this text, capital letters will be used to denote random
variables and the corresponding lower case letter will represent values of the random variable.
It is important to note that any function of a random variable is also a random variable. That
is, if X is a random variable then Z = g(X) is a random variable as well. This follows from the
fact that if X is uncertain, then any function of X must be uncertain as well. Physically, this means
that any hydrologic quantity that is dependent on a random variable is also a random variable. If
runoff is a random variable, then erosion and sediment delivery to a stream are random variables.
If sediment has absorbed chemicals, then water quality is a random variable.
Example 2.7. Nearly every hydrologic variable can be taken as a random variable. Rainfall for
any duration, streamflow, soil hydraulic properties, time between hydrologic events such as flows
above a certain base or daily rainfalls in excess of some amount, the number of times a stream
flow rate exceeds a given base over a period of a year, and daily pan evaporation are all random
variables. Quantities derived from random hydrologic variables are also random variables. The
storage required in a water supply reservoir to meet a given demand is a function of the demand
and the inflow to the reservoir. Since reservoir inflow is a random variable, required storage is
also a random variable. As a matter of fact, the demand that is placed on the reservoir would be
a random variable as well. The velocity of flow through a porous medium is a function of the
hydraulic conductivity and hydraulic gradient, which are both random variables. Therefore, the flow
velocity is also a random variable.
The cumulative distribution has jumps in it at each x_i equal in magnitude to f_X(x_i) or the
probability that X = x_i. The probability that X = x_i can be determined from

$$f_X(x_i) = F_X(x_i) - F_X(x_{i-1})$$

Fig. 2.9. A discrete probability distribution function.

Fig. 2.10. A discrete cumulative distribution function.

The notation f_X(x) and F_X(x) denote the pdf and cdf of the discrete random variable X
evaluated at X = x.
Often, continuous data are treated as though they were discrete. Looking again at the
Kentucky River peak flow data, we can define the event A as having a peak flow in the ith class.
Letting n_i be the number of observed peak flows in the ith interval and n be the total number of
observed peak flows, the probability that a peak flow is in the ith class is given by

$$f_{x_i} = \frac{n_i}{n}$$

Thus, the relative frequency, f_{x_i}, can be interpreted as a probability estimate, the frequency
histogram can be interpreted as an approximation for a pdf, and the cumulative frequency can be
interpreted as an approximation for a cdf.
Many times it is desirable to treat continuous random variables directly. Continuous random
variables can take on any value in a range of values permitted by the physical processes involved.
Probability distribution functions of continuous random variables are smooth curves. The pdf of
a continuous random variable X is denoted by p_X(x). The cdf is denoted by P_X(x). P_X(x)
represents the probability that X is less than or equal to x.

$$P_X(x) = \text{prob}(X \le x) = \int_{-\infty}^{x} p_X(t)\,dt$$

The notation p_X(x) and P_X(x) denote the pdf and cdf, respectively, of the continuous random
variable X evaluated at X = x. Thus, p_Y(a) represents the pdf of the random variable Y evaluated
at Y = a. P_Y(a) represents the cdf of the random variable Y and gives prob(Y ≤ a).
A function, p_X(x), defined on the real line can be a pdf if and only if

$$p_X(x) \ge 0 \quad \text{for all } x \text{ in } R$$

$$\int_R p_X(x)\,dx = 1 \tag{2.21}$$

By definition p_X(x) is zero for X outside R. Also P_X(x_l) = 0 and P_X(x_u) = 1 where x_l and
x_u are the lower and upper limits of X in R. For many distributions these limits are −∞ to ∞
or 0 to ∞. It is also apparent that the probability that X takes on a value between a and b is
given by

$$\text{prob}(a \le X \le b) = P_X(b) - P_X(a) = \int_a^b p_X(x)\,dx$$
The prob(a ≤ X ≤ b) is the area under the pdf between a and b. The probability that a random
variable takes on any particular value from a continuous distribution is zero. This can be seen
from

$$\text{prob}(X = d) = \int_d^d p_X(x)\,dx = 0$$

Because the probability that a continuous random variable takes on a specified value is zero, the
expressions prob(a ≤ X ≤ b), prob(a < X ≤ b), prob(a ≤ X < b), and prob(a < X < b) are all
equivalent. It is also apparent that P_X(x) can be interpreted as the probability that X is strictly less
than x since prob(X = x) = 0.
Figures 2.11 and 2.12 illustrate a possible pdf and its corresponding cdf. In addition to
density functions that are symmetrical and bell-shaped, densities may take on a number of different
shapes including distributions that are skewed to the right or left, rectangular, triangular,
exponential, "J"-shaped, and reverse "J"-shaped (Fig. 2.13).

Fig. 2.11. Probability density function.

Fig. 2.12. Cumulative probability distribution function.

Fig. 2.13. Density shapes; panel titles include "J"-shaped, reverse "J"-shaped, "U"-shaped, and
exponential.
At this point a cautionary note is added to the effect that the probability density function,
p_X(x), is not a probability and can have values exceeding one. The cumulative probability
distribution, P_X(x), is a probability [prob(X ≤ x) = P_X(x)] and must have values ranging from 0 to 1.
Of course, p_X(x) and P_X(x) are related as indicated by equation 2.20 and knowledge of one
specifies the other.
Example 2.8. Evaluate the constant a for the following expression to be considered a probability
density function:

$$p_X(x) = \begin{cases} a x^2 & 0 \le x \le 5 \\ 0 & \text{elsewhere} \end{cases}$$

What is the probability that a value selected at random from this distribution will (a) be less than 2?
(b) fall between 1 and 3? (c) be larger than 4? (d) be larger than or equal to 4? (e) exceed 6?
Solution:
From equation 2.21 we must have

$$\int_0^5 a x^2\,dx = \frac{125a}{3} = 1$$

so that a = 3/125 and

$$P_X(x) = \int_0^x \frac{3t^2}{125}\,dt = \frac{x^3}{125} \quad \text{for } 0 \le x \le 5$$
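The five probabilities requested in example 2.8 follow directly from the cumulative distribution.
The sketch below assumes the density has the form a x^2 on 0 ≤ x ≤ 5, so that the cdf is x^3/125
there; the function name `cdf` and the dictionary of answers are illustrative.

```python
def cdf(x):
    """P_X(x) = x**3 / 125 on 0 <= x <= 5, with 0 below and 1 above that range."""
    if x <= 0:
        return 0.0
    if x >= 5:
        return 1.0
    return x**3 / 125.0


answers = {
    "a": cdf(2),            # prob(X < 2)
    "b": cdf(3) - cdf(1),   # prob(1 < X < 3)
    "c": 1 - cdf(4),        # prob(X > 4)
    "d": 1 - cdf(4),        # prob(X >= 4); equals (c) because prob(X = 4) = 0
    "e": 1 - cdf(6),        # prob(X > 6); 6 lies above the upper limit of X
}
for part, value in answers.items():
    print(part, round(value, 3))
```

Parts (c) and (d) agree because a continuous random variable takes on any single value with
probability zero.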
where P_2(d) > P_1(d), P_1(x_l) = 0, P_2(x_u) = 1, and P_1(x) and P_2(x) are nondecreasing functions of
X. Figure 2.14 is a plot of such a distribution. For this situation the prob(X = d) equals the
magnitude of the jump ΔP at X = d or is equal to P_2(d) − P_1(d). Any finite number of discontinuities
of this type are possible.
An example of a distribution as shown in figure 2.14 is the distribution of daily rainfall
amounts. The probability that no rainfall is received, prob(X = 0), is finite, whereas the
probability distribution of rain on rainy days would form a continuous distribution. A second example
would be the probability distribution of the water level in some reservoir. The water level may be
maintained at a constant level d as much as possible but may fluctuate below or above d at times.
The distribution shown in figure 2.14 could represent this situation.
Fig. 2.14. A possible piecewise continuous pdf for the case prob(X = d) ≠ 0.

The relationship between relative frequency and probability can be envisioned by considering
an experiment whose outcome is a value of the random variable X. Let p_X(x) be the
probability density function of X. The probability that a single trial of the experiment will result in an
outcome between X = a and X = b is given by

$$\text{prob}(a \le X \le b) = \int_a^b p_X(x)\,dx$$

In N independent trials of the experiment, the expected number of outcomes in the interval a to b
would be

$$N \int_a^b p_X(x)\,dx$$

and the expected relative frequency of outcomes in an interval of width Δx_i centered on x_i is

$$f_{x_i} = \int_{x_i - \Delta x_i/2}^{x_i + \Delta x_i/2} p_X(x)\,dx$$

Because the right-hand side of this equation represents the area under p_X(x) between x_i − Δx_i/2
and x_i + Δx_i/2, it can be approximated by

$$f_{x_i} = \Delta x_i\, p_X(x_i) \tag{2.25}$$

Equation 2.25 can be used to determine the expected relative frequency of repeated, independent
outcomes of a random experiment whose outcome is a value of the random variable X.
If N independent observations of X are available, the actual relative frequency of outcomes
in an interval of width Δx_i centered on x_i may not equal f_{x_i} as given by equation 2.25 because X
is a random variable whose behavior can only be described probabilistically. The most probable
outcome or the expected outcome will equal the observed outcome only if p_X(x) is truly the
probability density function for X and for an infinitely large number of observations. Even if the true
probability density function is being used, the actual frequency of outcomes in the interval Δx_i
approaches the expected number only as the number of trials or observations becomes very large.
Example 2.9. Plot the expected frequency histogram using the probability density function of
example 2.8 and a class interval of 1/2.

x_i      f_{x_i}
0.25     0.00075
0.75     0.00675
1.25     0.01875
1.75     0.03675
2.25     0.06075
2.75     0.09075
3.25     0.12675
3.75     0.16875
4.25     0.21675
4.75     0.27075
Sum      0.99750

Fig. 2.15. Plot for example 2.9.
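The tabulated expected relative frequencies can be reproduced from equation 2.25. The sketch
assumes the density of example 2.8 is 3x^2/125 on 0 ≤ x ≤ 5, which is consistent with the
tabulated values; the names `pdf`, `width`, `marks`, and `expected` are illustrative.

```python
def pdf(x):
    """Density assumed for example 2.8: 3 x**2 / 125 on 0 <= x <= 5, zero elsewhere."""
    return 3.0 * x**2 / 125.0 if 0 <= x <= 5 else 0.0


width = 0.5                                      # class interval
marks = [0.25 + width * i for i in range(10)]    # class marks 0.25, 0.75, ..., 4.75
expected = [width * pdf(x) for x in marks]       # equation 2.25

for x, f in zip(marks, expected):
    print(f"{x:4.2f}  {f:.5f}")
print(f"Sum   {sum(expected):.5f}")
```

The sum falls slightly short of one because equation 2.25 approximates the area in each class by
a rectangle evaluated at the class mark.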
BIVARIATE DISTRIBUTIONS
The situation frequently arises where one is interested in the simultaneous behavior of two or
more random variables. An example might be the flow rates on two streams near their confluence.
One might like to know the probability of both streams having peak flows exceeding a given value.
A second example might be the probability of a rainfall exceeding 2.5 inches at the same time the
soil is nearly saturated. Rainfall depth and soil water content would be two random variables.
Example 2.10. The magnitude of peak flows from small watersheds is often estimated from the
"Rational Equation" given by Q = CIA where Q is the estimated flow, C is a coefficient, I is a
rainfall intensity, and A is the watershed area. The assumption is made that the return period of
flow will be the same as the return period of the rainfall that is used. To verify this assumption it
is necessary to study the joint probabilities of the two random variables Q and I.
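If X and Y are continuous random variables, their joint probability density function is
p_{X,Y}(x, y) and the corresponding cumulative probability distribution is P_{X,Y}(x, y). These two are
related by

$$p_{X,Y}(x, y) = \frac{\partial^2 P_{X,Y}(x, y)}{\partial x\,\partial y} \tag{2.26}$$

and

$$P_{X,Y}(x, y) = \text{prob}(X \le x \text{ and } Y \le y) = \int_{-\infty}^{x}\int_{-\infty}^{y} p_{X,Y}(t, s)\,ds\,dt \tag{2.27}$$

The corresponding relationships for X and Y being discrete random variables are

$$f_{X,Y}(x_i, y_j) = \text{prob}(X = x_i \text{ and } Y = y_j) \tag{2.28}$$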
MARGINAL DISTRIBUTIONS
If one is interested in the behavior of one of a pair of random variables regardless of the
value of the second random variable, the marginal distribution may be used. For instance,
the marginal density of X, p_X(x), is obtained by integrating p_{X,Y}(x, y) over all possible values
of Y.

$$p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x, y)\,dy$$

The marginal cumulative distribution of X is

$$P_X(x) = P_{X,Y}(x, \infty) = \text{prob}(X \le x \text{ and } Y \le \infty) \tag{2.32a}$$
and
CONDITIONAL DISTRIBUTIONS
A marginal distribution is the distribution of one variable regardless of the value of the second
variable. The distribution of one variable with restrictions or conditions placed on the second
variable is called a conditional distribution. Such a distribution might be the distribution of X
given that Y equals y_0 or the distribution of Y given that x_1 ≤ X ≤ x_2.
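In general, the conditional distribution of X given that Y is in some region R is arrived at
using the same reasoning that was used in obtaining equation 2.8. The total sample space of Y is
now the region R. Because

$$\text{prob}(Y \text{ is in } R) = \int_R p_Y(s)\,ds$$

then

$$p_{X|Y}(x \mid Y \text{ is in } R) = \frac{\int_R p_{X,Y}(x, s)\,ds}{\int_R p_Y(s)\,ds} \tag{2.37}$$

for X and Y continuous. Similarly the conditional distribution of (X | Y is in R) for X and Y discrete is

$$f_{X|Y}(x_i \mid Y \text{ is in } R) = \frac{\sum_{y_j \text{ in } R} f_{X,Y}(x_i, y_j)}{\sum_{y_j \text{ in } R} f_Y(y_j)} \tag{2.38}$$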
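The determination of conditional probabilities from equations 2.37 and 2.38 is done in the
usual way.
When the condition is that Y equals a particular value y_0, the conditional density of X is

$$p_{X|Y}(x \mid Y = y_0) = \frac{p_{X,Y}(x, y_0)}{p_Y(y_0)}$$

The proof of this may be found in Neuts (1973). In most statistics books p_{X|Y}(x | Y = y_0) is simply
written as p_{X|Y}(x | y).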
If the region R of equation 2.37 is the entire region of definition with respect to Y, then

$$\int_R p_Y(s)\,ds = 1$$

and

$$\int_R p_{X,Y}(x, s)\,ds = p_X(x)$$

so that

$$p_{X|Y}(x \mid Y \text{ is in } R) = p_X(x)$$

This results from the fact that the condition that Y is in R when R encompasses the entire
region of definition of Y is really no restriction but simply a condition stating that Y may
take on any value in its range. In this case, p_{X|Y}(x | Y is in R) is identical to the marginal
density of X.
INDEPENDENCE
From equation 2.37 or 2.38 it can be seen that in general the conditional density of X given
Y is a function of y. If the random variables X and Y are independent, this functional relationship
disappears (i.e., p_{X|Y}(x | y) is not a function of y). In fact, in this case

$$p_{X|Y}(x \mid y) = p_X(x)$$

or the conditional density equals the marginal density. Furthermore, if X and Y are independent
(continuous or discrete) random variables, their joint density is equal to the product of their
marginal densities.

$$p_{X,Y}(x, y) = p_X(x)\,p_Y(y)$$

The random variables X and Y are independent in the probabilistic sense (stochastically
independent) if and only if their joint density is equal to the product of their marginal densities.
Independence is an extremely important property. A bivariate distribution is much more
difficult to define and to work with than is a univariate distribution. If independence exists and
the proper pdf for X and for Y can be determined, the bivariate distribution for X and Y is given
as the product of these two univariate distributions.
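For discrete random variables the product criterion can be checked cell by cell. The sketch
below uses two hypothetical joint probability tables, one built as a product of marginals and one
not; the function names `marginals` and `is_independent` and the tolerance are assumptions.

```python
def marginals(joint):
    """Row and column sums of a table joint[i][j] = prob(X = x_i and Y = y_j)."""
    fx = [sum(row) for row in joint]
    fy = [sum(col) for col in zip(*joint)]
    return fx, fy


def is_independent(joint, tol=1e-9):
    """True if every joint probability equals the product of its marginals."""
    fx, fy = marginals(joint)
    return all(
        abs(joint[i][j] - fx[i] * fy[j]) <= tol
        for i in range(len(fx))
        for j in range(len(fy))
    )


# Hypothetical tables, for illustration only.
independent = [[0.12, 0.28], [0.18, 0.42]]   # product of (0.4, 0.6) and (0.3, 0.7)
dependent = [[0.30, 0.10], [0.10, 0.50]]     # same marginals (0.4, 0.6) in X and Y

print(is_independent(independent), is_independent(dependent))
```

A tolerance is used because floating-point products rarely match tabulated probabilities exactly.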
DERIVED DISTRIBUTIONS
Situations often arise where the joint probability distribution of a set of random variables is
known and the distribution of some function or transformation of these variables is desired. For
example, the joint probability distribution of the flows in two tributaries of a stream may be
known whereas the item of interest may be the sum of the flows in the two tributaries. Some
commonly used transformations are translation or rotation of axes, logarithmic transformations,
nth-root transformations for n equal to 2 and 3, and certain trigonometric transformations.
Thomas (1971) presents the developments that lead to the results presented here concerning
transformations and derived distributions for continuous random variables. The procedure for
discrete random variables is simply an accounting procedure.
Let Y = X² − 7X + 12. The probability distribution and possible values of Y can be determined
from the following table.
For a continuous random variable X, the pdf of U = u(X), where u(X) is a monotonic function
(u(X) is monotonically increasing if u(x_2) ≥ u(x_1) for x_2 > x_1 and monotonically decreasing
if u(x_2) ≤ u(x_1) for x_2 > x_1), can be found from

$$p_U(u) = p_X(x)\left|\frac{dx}{du}\right|$$
Example 2.12. Find the probability of 0 < U < 10 if $U = X^2$ and X is a continuous random
variable with
Solution:
A check to see that $p_U(u)$ is a probability density can be made by integrating $p_U(u)$ from 0 to 25.
Now prob(0 < U < 10) = $\int_0^{10} \frac{3\sqrt{u}}{250}\,du = \frac{10^{3/2}}{125}$
where $J\left(\frac{x, y}{u, v}\right)$ is the Jacobian of the transformation, computed as the determinant of the matrix of
partial derivatives
The limits on U and V must be determined from the individual problem at hand.
Example 2.13. Let $p_{X,Y}(x, y) = (5 - y/2 - x)/14$ for 0 < X < 2 and 0 < Y < 2. If U =
X + Y and V = Y/2, what is the joint probability density function for U and V? What are the
proper limits on U and V?
Solution:
The limits on U and V can be determined by noting that Y = 2V and X = U - 2V. Therefore, the
limit of Y = 0 maps to V = 0, Y = 2 maps to V = 1, X = 0 maps to U = 2V, and X = 2 maps to
U = 2V + 2. These limits are shown in figure 2.16. A check can be made by integrating $p_{U,V}(u, v)$
over the region 0 < V < 1, 2V < U < 2V + 2.
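The check described above can be sketched numerically. The block below is an illustration only: it assumes the inverse transformation X = U - 2V, Y = 2V stated in the solution, for which the absolute value of the Jacobian works out to 2, and it uses a simple midpoint rule. The function names are hypothetical.

```python
# Numerical check for Example 2.13 using the midpoint rule (standard library only).

def p_xy(x, y):
    """Joint density of X and Y from the example; zero outside 0 < x, y < 2."""
    if 0.0 < x < 2.0 and 0.0 < y < 2.0:
        return (5.0 - y / 2.0 - x) / 14.0
    return 0.0

# |det [[dx/du, dx/dv], [dy/du, dy/dv]]| = |det [[1, -2], [0, 2]]| = 2
JACOBIAN = 2.0

def p_uv(u, v):
    """Derived joint density of U = X + Y and V = Y/2."""
    return p_xy(u - 2.0 * v, 2.0 * v) * JACOBIAN

def integrate_p_uv(steps=400):
    """Integrate p_uv over 0 < v < 1, 2v < u < 2v + 2."""
    dv = 1.0 / steps
    du = 2.0 / steps
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * dv
        for j in range(steps):
            u = 2.0 * v + (j + 0.5) * du
            total += p_uv(u, v) * du * dv
    return total

total_probability = integrate_p_uv()
print(round(total_probability, 6))
```

A total close to 1 indicates that the transformed density and its limits are consistent with each other.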
PROBABILITY 47
In some cases, the function U = u(X) may be such that it is difficult to analytically determine
the distribution of U from the distribution of X. In this case it may be possible to generate a large
sample of X's (chapter 13), calculate the corresponding U's and then fit a probability distribution
to the U's (chapter 6). It should be noted, however, that this empirical method will not in general
satisfy equations 2.47 or 2.48.
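The empirical approach can be sketched as follows. The choice of X as uniform on (0, 1) and U = X^2 is purely illustrative (it is not taken from the text) and was picked because the exact answer, prob(U <= u) = sqrt(u), is available for comparison.

```python
import random

def simulate_u(n, seed=1):
    """Draw n values of X ~ uniform(0, 1) and return U = X**2 (illustrative choice)."""
    rng = random.Random(seed)
    return [rng.random() ** 2 for _ in range(n)]

def empirical_cdf(sample, u):
    """Fraction of the sample that is less than or equal to u."""
    return sum(1 for value in sample if value <= u) / len(sample)

u_sample = simulate_u(100_000)
estimate = empirical_cdf(u_sample, 0.25)  # exact value is sqrt(0.25) = 0.5
print(estimate)
```

In practice a fitted distribution would replace the raw empirical fraction, and the estimate carries sampling error that shrinks as the generated sample grows.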
MIXED DISTRIBUTIONS
If $p_i(x)$ for i = 1, 2, ..., m represent probability density functions and $\lambda_i$ for i = 1, 2, ..., m
represent parameters satisfying $\lambda_i \ge 0$ and $\sum_{i=1}^{m} \lambda_i = 1$, then
Mixed distributions in hydrology may be applicable in situations where more than one
distinct cause for an event may exist. For example, flood peaks from convective storms might be
described by $p_1(x)$ and from hurricane storms by $p_2(x)$. If $\lambda_1$ is the proportion of flood peaks
generated by convective storms and $\lambda_2 = (1 - \lambda_1)$ is the proportion generated by hurricane storms,
then equations 2.54 and 2.55 would describe the probability distribution of flood peaks.
Singh (1974), Hawkins (1974), Singh (1987a, 1987b), Hirschboeck (1987), and Diehl and
Potter (1987) discuss procedures for applying mixed distributions in the form of equation 2.54 to
flood frequency determinations. Two general approaches are used. One is to allow the data and
statistical estimation procedures to determine the mixing parameter, $\lambda$, and the parameters of the
distributions, $p_i(x)$. The other is to use physical information on the actual events to classify them
and thus determine $\lambda$. Once classified, the two sets of data can independently be used to
determine the parameters of the pdfs.
Example 2.14. A certain event has probability 0.3 of being from the distribution $p_1(x) = e^{-x}$,
x > 0. The event may also be from the distribution $p_2(x) = 2e^{-2x}$, x > 0. What is the probability
that a random observation will be less than 1?
Solution:
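A numerical sketch of this calculation follows. It assumes that only the two distributions are involved, so that the second has mixing weight 1 - 0.3 = 0.7, and that the mixture cumulative distribution is the weighted sum of the component cumulative distributions. The function names are hypothetical.

```python
import math

def exponential_cdf(x, rate):
    """P(X <= x) for the density rate * exp(-rate * x), x > 0."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def mixture_cdf(x, weights, rates):
    """Weighted sum of exponential cdfs; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("mixing weights must sum to 1")
    return sum(w * exponential_cdf(x, r) for w, r in zip(weights, rates))

# p1(x) = exp(-x) with weight 0.3, p2(x) = 2 exp(-2x) with weight 0.7 (assumed)
prob_less_than_1 = mixture_cdf(1.0, weights=[0.3, 0.7], rates=[1.0, 2.0])
print(round(prob_less_than_1, 3))  # prints 0.795
```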
Exercises
2.1. (a) Construct the theoretical relative frequency histogram for the sum of values obtained in
tossing two dice. (b) Toss two dice 100 times and tabulate the frequency of occurrence of the
sums of the two dice. Plot the results on the histogram of part a. (c) Why do the results of part b
not equal the theoretical results of part a? What possible kinds of errors are involved? Which kind
of error was the largest in your case?
2.2. Select a set of data consisting of 50 or more observations. Construct a relative frequency
plot using at least two different groupings of the data. Which of the two groupings do you prefer?
Why?
2.3. In a period of one week, 3 rainy days were observed. If the occurrence of a rainy day is an
independent event, how many ways could the sequence consisting of 4 dry and 3 wet days be
arranged?
2.4. If the occurrence of a rainy day is an independent event with probability equal to 0.3, what
is the probability of (a) exactly 3 rainy days in one week? (b) rain on each of the next 3 days? (c) 3
rainy days in a row during any week with the other 4 days dry?
2.5. Consider a coin with the probability of a head equal to p and the probability of a tail equal
to q = 1 - p. (a) What is the probability of the sequence HHTHTTH in 7 flips of the coin? (b)
What is the probability of a specified sequence resulting in r H's and s T's? (c) How many ways
can r H's and s T's be arranged? (d) What is the probability of r H's and s T's without regard to
the order of the sequence?
2.6. The distribution given by $f_X(x) = 1/N$ for X = 1, 2, 3, ..., N is known as the discrete
uniform distribution. In the following consider $N \ge 5$. (a) What is the probability that a random
value from $f_X(x)$ will be equal to 5? (b) What is the probability that a random value from $f_X(x)$
will be between 3 and 5 inclusive? (c) What is the probability that in a random sample of 3
values from $f_X(x)$ all will be less than 5? (d) What is the probability that the 3 random values from
$f_X(x)$ will all be less than 5 given that 1 of the values is less than 5? (e) If 2 random values are
selected from $f_X(x)$, what is the probability that one will be less than 5 and the other greater than 5?
(f) For what x from $f_X(x)$ is prob($X \le x$) = 0.5?
2.7. Consider the continuous probability density function px(x) = a sin2 mx for $0 < X < \pi$. (a)
What must be the value of a and m? (b) What is Px(x)? (c) What is prob($0 < X < \pi/2$)? (d) What
is prob(X > a/2 | X < a/4)?
2.8. Consider the continuous probability density function given by px(x) = 0.25 for 0 < X < a.
(a) What is a? (b) What is prob(X > a/2)? (c) What is prob(X > a/2 | X > a/4)? (d) What is
prob(X > a/2 | X < a/4)?
50 CHAPTER 2
2.9. Let px(x) = 0.25 for 0 < X < a as in exercise 2.8. What is the distribution of Y = ln X?
Sketch py(y).
2.10. Many probability distributions can be defined simply by consulting a table of definite
integrals. For example, $\int_0^\infty x^{n-1} e^{-x}\,dx$ is equal to $\Gamma(n)$, where $\Gamma(n)$ is defined as the gamma function
(see chapter 6). Therefore one can define $p_X(x) = x^{n-1} e^{-x}/\Gamma(n)$ to be a probability density
function for n > 0 and $0 < X < \infty$. This distribution is known as the 1-parameter gamma distribution.
Using a table of definite integrals, define several possible continuous probability distributions.
Give the appropriate range on X and any parameters.
2.11. The annual inflow into a reservoir (acre-feet) follows a probability density given by $p_X(x) =
1/(\beta_1 - \alpha_1)$. The total annual outflow in acre-feet follows a probability distribution given by $p_Y(y)
= 1/(\beta_2 - \alpha_2)$. Consider that $\beta_1 > \beta_2$ and $\alpha_1 < \alpha_2$. (a) Calculate the expression for the
probability distribution of the annual change in storage. (b) Plot the probability distribution of the annual
change in storage. (c) If $\beta_1$ = 100,000, $\alpha_1$ = 20,000, $\beta_2$ = 70,000, and $\alpha_2$ = 50,000, what is the
probability that the change in storage will be (i) negative and (ii) greater than 15,000 acre-feet?
2.12. The probability of receiving more than 1 inch of rain in each month is given in the
following table. If a monthly rainfall record selected at random is found to have more than 1 inch of
rain, what is the probability the record is for July? April?
2.13. It is known that the discharge from a certain plant has a probability of 0.001 of containing
a fish-killing pollutant. An instrument used to monitor the discharge will indicate the presence of
the pollutant with probability 0.999 if the pollutant is present and with probability 0.01 if the
pollutant is not present. If the instrument indicates the presence of the pollutant, what is the
probability that the pollutant is really present?
2.14. A potential purchaser of a ferry across a river knows that if a flow of 100,000 cfs or more
occurs, the ferry will be washed downstream, go over a low dam, and be destroyed. He knows
that the probability of a flow of this kind in any year is 0.05. He also knows that for each year that
the ferry operates a net profit of $10,000 is realized. The purchase price of the ferry is $50,000.
Sketch the probability distribution of the potential net profit over a period of years neglecting
interest rates and other complications. Assume that if a flow of 100,000 cfs or more occurs in a
year, the profit for that year is zero.
2.15. Assume that the probability density function of daily rainfall is given by
(a) Is this a proper probability density function? (b) What is prob(X > 0.5)? (c) What is
prob($X > 0.5 \mid X \ne 0$)?
This is a mixture of two uniform distributions. (a) Sketch $p_X(x)$ for $\lambda_1$ = 0.5. (b) Sketch $p_X(x)$ for
$\lambda_1$ = 0.1. (c) Sketch $p_X(x)$ for $\lambda_1$ = 0.333. (d) In a random sample from $p_X(x)$, 60% of the values
were between 0 and 2. What would be an estimate for the value of $\lambda_1$?
this population. Streamflow records for the past 50 years on a particular stream would be a
sample from which inferences about the behavior of the stream for all time (the population) could
be made. This information could also be used to estimate the behavior of the stream during some
future period of years (another but yet unrealized sample) so that a structure could be properly
designed for the stream. Thus, one might use information gleaned from one sample to make
decisions regarding another sample.
Quantities that are descriptive of a population are called parameters. In most situations these
parameters must be estimated from samples of data. Sample statistics are estimates for
population parameters. Sample statistics are estimated from samples of data and as such are functions
of random variables (the sample values) and thus are themselves random variables. The average
number of pages in all of the books in a particular library would be a parameter representing
the population (the books in the library). This parameter could be estimated by determining the
average number of pages in all of the books on a particular shelf in the library (a sample of the
population). This estimate of the parameter would be a statistic.
As pointed out in chapter 1, for a decision based on a sample to be valid in terms of the
population, the sample statistics must be representative of the population parameters. This in
turn requires that the sample itself be representative of the population and that "good"
parameter estimation procedures are used. One could not get a "good" estimate of the average number
of pages per book in a library by sampling a shelf that contained only fat engineering
handbooks. By the same token, one cannot get "good" estimates for the parameters of a
streamflow synthesis model if the estimates are based on a short period of record during which an
extreme drought occurred.
One rarely, if ever, has available a population of observations on a hydrologic variable.
What is generally available is a sample (of observations) from the population. Thus, population
parameters are rarely, if ever, known and must be estimated by sample statistics. By the same
token, the true probability density function that generated the available sample of data is not
known. Thus, it is necessary to estimate not only the population parameters but also
the form of the random process (experiment) that generated the data.
This chapter is devoted to a discussion of parameters descriptive of populations and
how estimates (statistics) for these parameters can be obtained from samples drawn from
populations.
and the first moment of the total area about the origin is
Fig. 3.1. Moment of arbitrary area.
In case of a random variable and its associated probability density function such as shown
in figure 3.2, the first moment about the origin is again given by
The ith central moment is defined as the ith moment about the mean, $\mu$, of a distribution and
is given by
It is apparent that the expected value of $(X - \mu)^i$ is equal to the ith central moment
Arithmetic Mean
Generally, the first property of a random variable that is of interest is its mean or average
value. The mean, $\mu_X$, of a random variable, X, is its expected value. Thus
A sample estimate of the population mean is the arithmetic average, $\bar{x}$, calculated from
where n is the number of observations or items in the sample. The arithmetic mean can be
estimated from grouped data by
where k is the number of groups, n is the number of observations, $n_i$ is the number of
observations in the ith group and $x_i$ is the class mark of the ith group.
Geometric Mean
The sample geometric mean, K, is defined as
Median
The sample median, $x_{md}$, is the observation such that half of the values lie on either side of
$x_{md}$. The population median, $\mu_{md}$, would be the value satisfying
Mode
The mode is the most frequently occurring value. Thus the population mode, $\mu_{mo}$, would be
a value of X maximizing $p_X(x)$ and thus satisfying the equations
$\frac{dp_X(x)}{dx} = 0$ and $\frac{d^2 p_X(x)}{dx^2} < 0$ (X continuous)
Weighted Mean
The calculation of the arithmetic mean of grouped data is an example of calculating a
weighted mean where ni /n is the weighting factor. In general, the weighted mean is
where $w_i$ is the weight associated with the ith observation or group and k is the number of
observations or groups.
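The weighted mean and its grouped-data special case can be sketched as below. The weighted mean is assumed here to take the usual form, the sum of $w_i x_i$ divided by the sum of $w_i$; the class marks and counts are made-up illustrative values.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i (assumed form of the weighted mean)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Grouped data: class marks x_i with n_i observations per group (hypothetical values).
class_marks = [5.0, 15.0, 25.0]
counts = [2, 5, 3]

# Using the counts as weights is the same as weighting each class mark by n_i / n.
grouped_mean = weighted_mean(class_marks, counts)
print(grouped_mean)  # prints 16.0
```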
MEASURES OF DISPERSION
Range
The two most common measures of dispersion are the range and the variance. The range of
a sample is simply the difference between the largest and smallest sample values. The range of a
population is often the interval from $-\infty$ to $\infty$ or from 0 to $\infty$. The sample range is a
function of only two of the sample values but does convey some idea of the spread of the data. The
population range of many continuous hydrologic variables would be 0 to $\infty$ and would convey
little information. The range has the disadvantage of not reflecting the frequency or magnitude of
values that deviate either positively or negatively from the mean because only the largest and
smallest values are used in its determination. Occasionally, the relative range (the range divided
by the mean) is used.
Variance
By far the most common measure of dispersion is the variance, or its positive square root,
the standard deviation. The variance of the random variable X is defined as the second moment
about the mean and is denoted by $\sigma_X^2$.
Thus, the variance is the average squared deviation from the mean. For a discrete population of
size n, equation 3.23 becomes
where k is the number of groups, n is the number of observations, xi is the class mark and ni the
number of observations in the ith group.
The variance of some functions of the random variable X can be determined from the
following relationships:
The units on the variance are the same as the units on $X^2$. The units on the standard
deviation are the same as the units on the random variable. A dimensionless measure of dispersion is
the coefficient of variation, defined as the standard deviation divided by the mean. The coefficient
of variation is estimated from
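These dispersion measures can be sketched as below. The n - 1 divisor in the sample variance is an assumption about the form of equation 3.25, which is not reproduced here, and the flow values are hypothetical.

```python
import math

def sample_mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    """Sample variance with the n - 1 divisor (assumed form of equation 3.25)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sample_mean(values)
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean (dimensionless)."""
    return math.sqrt(sample_variance(values)) / sample_mean(values)

annual_flows = [12.0, 15.0, 9.0, 20.0, 14.0]  # hypothetical data
variance = sample_variance(annual_flows)       # units: flow units squared
std_dev = math.sqrt(variance)                  # units: flow units
cv = coefficient_of_variation(annual_flows)
print(variance, round(std_dev, 3), round(cv, 3))
```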
MEASURES OF SYMMETRY
As is apparent from figure 2.13, many distributions are not symmetrical. They may tail off
to the right or to the left and as such are said to be skewed. A distribution tailing to the right is
said to be positively skewed and one tailing to the left is negatively skewed. The skewness is the
third moment about the mean and is given by
One measure of absolute skewness would be the difference in the mean and the mode. A
measure such as this would not be too meaningful, however, because it would depend on the units of
measurement. A relative measure of skewness, known as Pearson's first coefficient of skewness,
can be obtained by dividing the difference in the mean and the mode by the standard deviation.
population measure of skewness = $\frac{\mu - \mu_{mo}}{\sigma}$ (3.32)
RANDOM VARIABLES 59
sample measure of skewness = $\frac{\bar{x} - x_{mo}}{s_X}$
The mode of moderately skewed distributions can be estimated from (Parl 1967)
$x_{mo} = \bar{x} - 3(\bar{x} - x_{md})$
so that
sample measure of skewness = $\frac{3(\bar{x} - x_{md})}{s_X}$ (3.35)
If sample estimates are replaced by population values in equation 3.35, Pearson's second coeffi
cient of skewness results.
The most commonly used measure of skewness is the coefficient of skew given by
where $M_3$ is the sample estimate for $\mu_3$. The sample coefficient of skew has the advantage of
being a function of all of the observations in the sample. Figure 3.3 shows symmetrical, positively
and negatively skewed distributions.
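The two sample measures of skewness can be compared on a small data set. This is a sketch under stated assumptions: the median-based measure follows equation 3.35, while the moment-based ratio $M_3 / s_X^3$ (with $M_3$ computed using divisor n and $s_X$ using divisor n - 1) is only one plain form and omits any sample-size correction the book's own equation may include. The data are hypothetical.

```python
import math

def central_moment(values, order):
    """Sample central moment with divisor n (an assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def sample_std(values):
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])

def pearson_second_skew(values):
    """3 * (mean - median) / s, the median-based measure of equation 3.35."""
    mean = sum(values) / len(values)
    return 3.0 * (mean - median(values)) / sample_std(values)

def moment_ratio_skew(values):
    """M3 / s**3 without any sample-size correction (assumption)."""
    return central_moment(values, 3) / sample_std(values) ** 3

data = [1.0, 2.0, 3.0, 4.0, 10.0]  # one large value gives a tail to the right
m3 = central_moment(data, 3)
pearson = pearson_second_skew(data)
ratio = moment_ratio_skew(data)
print(m3, round(pearson, 3), round(ratio, 3))
```

Both measures are positive for this sample, consistent with a distribution that tails off to the right.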
MEASURES OF PEAKEDNESS
A fourth property of random variables based on moments is the kurtosis. Kurtosis refers to
the extent of peakedness or flatness of a probability distribution in comparison with the normal
60 CHAPTER 3
probability distribution. Kurtosis is the fourth moment about the mean. A coefficient of kurtosis
is defined as
where $M_4$ is the sample estimate for $\mu_4$. According to Yevjevich (1972a), a less biased
estimate for the coefficient of kurtosis is obtained by multiplying equation 3.39 by
$\frac{n^3}{(n - 1)(n - 2)(n - 3)}$
where n is the sample size.
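The size of this correction factor is easy to tabulate. The sketch below computes only the multiplier stated above; equation 3.39 itself is not reproduced, and the function name is hypothetical.

```python
def kurtosis_correction(n):
    """Multiplier n**3 / ((n - 1)(n - 2)(n - 3)) for a sample of size n."""
    if n <= 3:
        raise ValueError("the factor is defined only for n > 3")
    return n ** 3 / ((n - 1) * (n - 2) * (n - 3))

for size in (10, 25, 50, 100, 1000):
    print(size, round(kurtosis_correction(size), 4))
```

The factor is close to 2 for a sample of 10 and approaches 1 as the sample grows, which shows how strongly the correction matters at the sample sizes typical of hydrologic records.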
The coefficient of kurtosis for a normal distribution is 3. The normal distribution is said to
be mesokurtic. If a distribution has a relatively greater concentration of probability near the mean
than does the normal, the coefficient of kurtosis will be greater than 3 and the distribution is said
to be leptokurtic. If a distribution has a relatively smaller concentration of probability near the
mean than does the normal, the coefficient of kurtosis will be less than 3 and the distribution is
said to be platykurtic. Figure 3.4 illustrates kurtosis. The coefficient of excess, $\varepsilon$, is defined as
$K - 3$. Therefore, for a normal distribution $\varepsilon$ is 0, for a leptokurtic distribution $\varepsilon$ is positive, and
for a platykurtic distribution $\varepsilon$ is negative.
In either case, the result is the average value of the function g(X, Y) weighted by the probability
that X = x and Y = y or more simply the mean of the random variable U.
In the discrete case
A general expression for the r, s moment about the origin of the jointly distributed random
variables X and Y is
The analogous result holds for (r = 0, s = 2). The comparable results for discrete random variables
are easily obtained.
Covariance
The covariance of X and Y is defined as the 1, 1 central moment
For the case where X and Y are independent, equation 3.49 can be written
since $p_{X,Y}(x, y)$ would equal $p_X(x)\,p_Y(y)$. Furthermore, both of the integrals in equation 3.50 are
equal to zero so that
if X and Y are independent. The converse of this is not necessarily true, however.
The sample estimate for the population covariance $\sigma_{X,Y}$ is $s_{X,Y}$, computed from
Correlation Coefficient
The covariance has units equal to the units of X times the units of Y. A normalized
covariance called the correlation coefficient is obtained by dividing the covariance by the product of
the standard deviations of X and Y
It can be shown (Thomas 1971) that $-1 \le \rho_{X,Y} \le 1$. Obviously, if X and Y are independent,
$\rho_{X,Y} = 0$. Again, the converse is not necessarily true. X and Y can be functionally related and still
have $\rho_{X,Y}$ (and $\sigma_{X,Y}$) equal to zero. Actually $\rho_{X,Y}$ is a measure of the linear dependence between
X and Y. If $\rho_{X,Y} = 0$, then X and Y are linearly independent; however, they may be related by
some other functional form. A value of $\rho_{X,Y}$ equal to $\pm 1$ implies that X and Y are perfectly related
by Y = a + bX. If $\rho_{X,Y} = 0$, X and Y are said to be uncorrelated. Any nonzero value of $\rho_{X,Y}$
means X and Y are correlated.
The covariance and the correlation coefficient are a measure of how the two variables X and
Y vary together. If $\rho_{X,Y}$ and $\sigma_{X,Y}$ are positive, large values of X tend to be paired with large
values of Y and vice versa. If $\rho_{X,Y}$ and $\sigma_{X,Y}$ are negative, large values of X tend to be paired with
small values of Y and vice versa.
The population correlation coefficient $\rho_{X,Y}$ can be estimated by the sample correlation
coefficient as
where $s_X$ and $s_Y$ are the sample estimates for $\sigma_X$ and $\sigma_Y$ given by equation 3.25 and $s_{X,Y}$ is the
sample covariance given by equation 3.52.
Figure 3.5 demonstrates some typical values for $r_{X,Y}$. In figure 3.5a all of the points lie on
the line Y = X - 1; consequently, there is perfect linear dependence between X and Y and
the correlation coefficient is unity. In figure 3.5b the points are either on or slightly off the line
Y = X - 1, and $r_{X,Y}$ = 0.986. Perfect linear dependence does not exist in this case because some
of the points deviate slightly from the straight line. In measuring and relating naturally occurring
hydrologic variables, a correlation coefficient of 0.986 would be considered quite good and the
resulting straight line, Y = X - 1 in this case, would usually be judged a good usable
relationship between X and Y.
In figure 3.5c the correlation coefficient has dropped to 0.671. The points in this case are
scattered about the line Y = 1.264  1.571X. The scatter of the points is much greater than in
the previous case, although the existence of some dependence (stochastic) is still in evidence.
In figure 3.5d the scatter of the points is great, with a corresponding lack of a strong
(stochastic) dependence. Generally, a correlation coefficient of 0.211 is considered too small to
indicate a useful stochastic dependence as knowledge about X gives very little information
about Y.
In the last two paragraphs the modifier "stochastic" has appeared with the word dependence.
This is because in reality there are two kinds of dependence: stochastic and functional.
Generally, throughout this book the word dependence alone should be taken to mean stochastic (or
statistical) dependence.
Figures 3.5e and 3.5f contain examples of functionally dependent variables. In figure 3.5e
the relationship is $Y = X^2/4$ for X > 0 and in figure 3.5f the relationship is Y = &.
for -3 < X < 3. The correlation coefficient for figure 3.5e is 0.963, indicating a high degree of
stochastic (linear) dependence. This illustrates that even though the dependence between X and Y
is nonlinear, a high correlation coefficient can result. If the plot of figure 3.5e were to cover a
different range of X, the correlation coefficient would change as well.
Fig. 3.5. Examples of the correlation coefficient.
Figure 3.5f illustrates a situation where Y and X are perfectly functionally related even
though the correlation coefficient is zero. The functional relationship is not linear, however. This
figure demonstrates that one cannot conclude that X and Y are unrelated based on the fact that
their correlation coefficient is small.
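The behavior of the sample correlation coefficient under functional dependence can be sketched as follows. The sample covariance is assumed to use the n - 1 divisor (equation 3.52 is not reproduced here), and $Y = X^2$ on a symmetric range is an illustrative stand-in for a perfectly related pair with zero correlation; it is not necessarily the function plotted in figure 3.5f.

```python
import math

def sample_correlation(xs, ys):
    """r = s_xy / (s_x * s_y), with n - 1 divisors assumed throughout."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

# Perfect linear dependence: all points on the line Y = X - 1.
x_line = [1.0, 2.0, 3.0, 4.0]
r_line = sample_correlation(x_line, [x - 1.0 for x in x_line])

# Nonlinear dependence over X > 0: Y = X**2 / 4 still gives a high r.
x_pos = [float(i) for i in range(1, 11)]
r_curve = sample_correlation(x_pos, [x ** 2 / 4.0 for x in x_pos])

# Perfect functional dependence on a symmetric range, yet r = 0.
x_sym = [float(i) for i in range(-3, 4)]
r_sym = sample_correlation(x_sym, [x ** 2 for x in x_sym])

print(round(r_line, 3), round(r_curve, 3), round(r_sym, 3))
```

Changing the range of `x_pos` changes `r_curve`, which mirrors the remark that the correlation coefficient depends on the range of X covered by the data.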
The fact that two variables have a high degree of linear correlation should not be interpreted
as indicating a functional or cause-and-effect relationship exists between the two variables. The
annual water yield on two adjacent watersheds may be highly positively correlated even though
a high yield from one watershed does not cause a high yield from the second watershed. More
likely the same climatic factors and geomorphic factors are operating on the two watersheds,
causing their water yields to be similar. The fact is often overlooked that high correlation does not
necessarily mean a cause-and-effect relationship exists between the correlated variables.
Further Properties of Moments
If Z is a linear function of two random variables X and Y, then
Equations 3.55 and 3.56 can be generalized when Y is a linear function of n random
variables as follows.
then
and
A noteworthy result of equation 3.56 or 3.58 is that for uncorrelated random variables, the
variance of a sum or difference is equal to the sum of the variances. This is because the variation
in each of the random variables contributes to the variation of their sum or difference.
As a special case of a linear function, consider the $X_i$ to be a random sample of size n. Let
the $a_i$ all be equal to 1/n. Then Y is equal to $\bar{X}$, the mean of the sample. Var(Y) is then Var($\bar{X}$)
and can be found from equation 3.58. Since the $X_i$ form a random sample, Cov($X_i$, $X_j$) = 0
for $i \ne j$ and Var($X_i$) = Var(X). We now have
Equation 3.59 states that the variance of the mean of a random sample is equal to the
variance of the sample divided by the number of observations used to estimate the mean of the
sample. If X and Y are independent random variables, then equation 3.49 shows that the
expectation of their product is equal to the product of their expectations.
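The statement of equation 3.59 about the variance of a sample mean can be checked by simulation. The sketch below draws repeated samples from a normal population with standard deviation 2 (an arbitrary illustrative choice) and compares the variance of the sample means with Var(X)/n.

```python
import random

def variance(values):
    """Variance with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def simulate_sample_means(n, replicates, sigma=2.0, seed=7):
    """Means of `replicates` independent random samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(replicates):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    return means

sample_means = simulate_sample_means(n=25, replicates=20_000)
observed = variance(sample_means)  # variance of the sample means
expected = 2.0 ** 2 / 25           # Var(X) / n = 0.16
print(round(observed, 4), expected)
```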
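The variance of the product XY for X and Y independent can be obtained from
$\mathrm{Var}(XY) = E[(XY)^2] - E^2(XY)$
Because X and Y are independent, $p_{X,Y}(x, y) = p_X(x)\,p_Y(y)$ and $E[(XY)^2]$ becomes $E(X^2)E(Y^2)$
or $E[(XY)^2] = (\mu_X^2 + \sigma_X^2)(\mu_Y^2 + \sigma_Y^2)$. Also from equation 3.60, $E^2(XY) = E^2(X)E^2(Y) = \mu_X^2 \mu_Y^2$.
Thus
$\mathrm{Var}(XY) = (\mu_X^2 + \sigma_X^2)(\mu_Y^2 + \sigma_Y^2) - \mu_X^2 \mu_Y^2$
which reduces to
$\mathrm{Var}(XY) = \mu_X^2 \sigma_Y^2 + \mu_Y^2 \sigma_X^2 + \sigma_X^2 \sigma_Y^2$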
That this is true is obvious from the example of $g(X) = X^2$. From equation 3.23 it can be
seen that $E(X^2) = \sigma_X^2 + \mu_X^2$
SAMPLE MOMENTS
If $x_i$ for i = 1 to n is a random sample, then the rth sample moment about the origin is
$M_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$
For the bivariate case involving a random sample of $x_i$ and $y_i$, the r, s sample moment about
the origin is
The expected value of sample moments is equal to the population moments (Mood et al. 1974).
Two important properties of moments worthy of repeating are:
2. The second moment about the origin is equal to the variance plus the square of the mean.
The moments about the mean are related to the moments about the origin by the following
general equation (Thomas 1971)
For the computation of sample moments it is often convenient to use equation 3.66. The results
of equation 3.66 for the first four sample moments are
Sample moments can be computed from grouped data by using the equations
and
where $x_j$ and $n_j$ are the class mark and number of observations, respectively, in the jth group, n is
the total number of observations, and k is the number of groups.
Moments of greater than third order are generally not computed for hydrologic variables
because of the small sample size. Higher-order moments are very unreliable (have a high
variance) for small samples. For example, the variance of $s^2$ (the variance of the sample variance) is
(Mood et al. 1974)
Yevjevich (1972a) presents general expressions for the variance of the variance, coefficient of
skew, and kurtosis.
$\beta_0$ is the population mean and an estimator $b_0$ of $\beta_0$ is $\bar{x}$. Estimates for other PWMs can
be obtained from order statistics. A random sample of observations can be arranged so that
$x_{(n)} \le x_{(n-1)} \le \dots \le x_{(1)}$. The $x_{(j)}$ are known as order statistics. An estimator, $b_r^*$, for $\beta_r$ for $r \ge 1$
is
where $1 - (j - 0.35)/n$ are estimators for $P_X(x_{(j)})$. Stedinger et al. (1994) recommend this
estimator for single site estimation despite its bias because it generally results in a smaller mean
square error than the unbiased estimator given below.
When unbiasedness is important, the following estimators may be used
L-moment estimates for the mean, standard deviation, skewness, and kurtosis are given by
Because L-moments do not involve squares and cubes of observations, they tend to produce less
variable estimates for higher moments, especially when an unusually large or small observation
happens to be present in a sample.
L-moments and probability weighted moments are related by
Estimates, $\hat{\lambda}_r$, of $\lambda_r$ are obtained by replacing the $\beta_r$ with sample estimates $b_r$.
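Sample L-moments can be sketched from the plotting-position estimator described earlier. The equations themselves are not reproduced in this text, so the block assumes the commonly used forms: $b_r$ as the average of $x_{(j)} [1 - (j - 0.35)/n]^r$ with $x_{(1)}$ the largest observation, and the standard linear relations between the first four L-moments and $\beta_0$ through $\beta_3$. The data are hypothetical.

```python
def pwm_estimates(sample, max_order=3):
    """Biased plotting-position estimates b_0..b_max_order (assumed form)."""
    ordered = sorted(sample, reverse=True)  # x_(1) is the largest value
    n = len(ordered)
    estimates = []
    for r in range(max_order + 1):
        total = 0.0
        for j, x in enumerate(ordered, start=1):
            total += x * (1.0 - (j - 0.35) / n) ** r
        estimates.append(total / n)
    return estimates

def l_moments(b):
    """First four L-moments from b_0..b_3 (standard relations, assumed here)."""
    b0, b1, b2, b3 = b
    return (
        b0,
        2.0 * b1 - b0,
        6.0 * b2 - 6.0 * b1 + b0,
        20.0 * b3 - 30.0 * b2 + 12.0 * b1 - b0,
    )

peaks = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 13.0]  # hypothetical sample
l1, l2, l3, l4 = l_moments(pwm_estimates(peaks))
print(round(l1, 3), round(l2, 3), round(l3, 3), round(l4, 3))
```

Every term is linear in the observations, which is why a single very large value influences these estimates less than it influences the third or fourth product moment.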
PARAMETER ESTIMATION
Thus far, probability distribution functions have been written $p_X(x)$ or $f_X(x)$, depending on
whether they were continuous or discrete. More correctly, they should be written $p_X(x; \theta_1, \theta_2, \dots,
\theta_m)$ or $f_X(x; \theta_1, \theta_2, \dots, \theta_m)$, indicating that in general the distributions are a function of a set of
parameters as well as of random variables. To use probability distributions to estimate
probabilities, values for the parameters must be available. This section discusses methods for estimating
the parameter values for probability distributions. Certain properties of these parameter estimates
or statistics are also discussed. Rather than carry a dual set of relationships (one for continuous
and one for discrete random variables), only the expressions for the continuous random
variables will be displayed. The results are equally applicable to discrete distributions.
The usual procedure for estimating a parameter is to obtain a random sample $x_1, x_2, \dots, x_n$
from the population X. This random sample is then used to estimate the parameters. Thus $\hat{\theta}_i$, an
estimate for the parameter $\theta_i$, is a function of the observations or random variables. Since $\hat{\theta}_i$ is a
function of random variables, $\hat{\theta}_i$ is itself a random variable possessing a mean, variance, and
probability distribution.
Intuitively, one would feel that the more observations of the random variables that were
available for parameter estimation, the closer $\hat{\theta}$ should be to $\theta$. Also, if many samples were used
for obtaining $\hat{\theta}$, one would feel that the average value of $\hat{\theta}$ should equal $\theta$. These two statements
deal with two properties of estimators known as consistency and unbiasedness.
Unbiasedness
An estimate $\hat{\theta}$ of a parameter $\theta$ is said to be unbiased if $E(\hat{\theta}) = \theta$. The bias, if any, is given
by $E(\hat{\theta}) - \theta$.
bias = $E(\hat{\theta}) - \theta$ (3.75)
The fact that an estimator is unbiased does not guarantee that an individual $\hat{\theta}$ is equal to $\theta$
or even close to $\theta$; it simply means that the average of many independent estimates for $\theta$ will
equal $\theta$.
Consistency
An estimator $\hat{\theta}$ of a parameter $\theta$ is said to be consistent if the probability that $\hat{\theta}$ differs from
$\theta$ by more than an arbitrary constant $\varepsilon$ approaches 0 as the sample size approaches infinity.
Consistency is an asymptotic property because it states that by selecting an n sufficiently
large, prob($|\hat{\theta} - \theta| > \varepsilon$) can be made as small as desired. For small samples (as are often
used in practice) consistency does not guarantee that a small error will be made. In spite of
this, one feels more comfortable knowing that $\hat{\theta}$ would converge to $\theta$ if a larger sample were
used.
A single estimate of $\theta$ from a small sample is a problem because neither unbiasedness nor
consistency gives us much comfort. In choosing between several methods for estimating $\theta$, in
addition to being unbiased and consistent it would be desirable if Var($\hat{\theta}$) were as small as
possible. This would mean that the probability distribution of $\hat{\theta}$ would be more concentrated
about $\theta$.
Efficiency
An estimator $\hat{\theta}$ is said to be the most efficient estimator for $\theta$ if it is unbiased and its
variance is at least as small as that of any other unbiased estimator for $\theta$. The relative efficiency of $\hat{\theta}_1$
with respect to $\hat{\theta}_2$ for estimating $\theta$ is the ratio of Var($\hat{\theta}_2$) to Var($\hat{\theta}_1$).
Sufficiency
Finally, it is desirable that $\hat{\theta}$ use all of the information contained in the sample relative to $\theta$.
If only a fraction of the observations in a sample are used for estimating $\theta$, then some
information about $\theta$ is lost. An estimator $\hat{\theta}$ is said to be a sufficient estimator for $\theta$ if $\hat{\theta}$ uses all of the
information relevant to $\theta$ that is contained in the sample.
More formal statements of the above four properties of estimators and procedures for
determining if an estimator has these properties can be found in books on mathematical statistics
(Lindgren 1968; Freund 1962; Mood et al. 1974).
There are many ways for estimating population parameters from samples of data. A few of
these are graphical procedures, matching selected points, method of moments, maximum
likelihood, and minimum chi-square. The graphical procedure consists of drawing a line through
plotted points and then using certain points on the line to calculate the parameters. This procedure is
very arbitrary and is dependent upon the individual doing the analysis. Frequently, the method is
employed when few observations are available, with the thought that few observations will not
produce good parameter estimates anyway. Yet the situation in which few points are available is
precisely the one in which the best methods of parameter estimation should be used.
The method of matching points is not a commonly used method but can produce reasonable
first approximations to the parameters. The procedure can be valuable in getting initial estimates
for the parameters to be employed in iterative solutions that can arise when the method of
moments or maximum likelihood is used.
Example 3.1. A certain set of data is thought to follow the distribution $p_X(x) = \lambda e^{-\lambda x}$ for X > 0.
In this particular data set, 75% of the values are less than 3.0. Estimate the parameter $\lambda$.
Solution:
px(x) = hepAx
1  Px(x) = eAx
Xx = In( 1  Px(x))
Comment: If a sample of size n is available this procedure could be used to obtain n estimates for
$\lambda$. These n estimates could then be averaged to obtain $\hat{\lambda}$. If the probability distribution of interest
had m parameters, then the value of $P_X(x)$ and x at m points would be used to obtain m equations
in the m unknown parameters. The method of matching points is not recommended for general use
in getting final parameter estimates. Certainly this method would not use all of the information in
the sample. Also, several different estimates for the parameters could be obtained from the same
sample depending on which observations were used in the estimation process.
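The matching-point calculation of example 3.1 can be sketched in a few lines of Python. This is only an illustration under the assumptions of that example (an exponential distribution and a single matched point); the function name `lambda_from_point` is hypothetical.

```python
import math

def lambda_from_point(x, cum_prob):
    """Solve 1 - exp(-lam * x) = cum_prob for lam (one matched point)."""
    return -math.log(1.0 - cum_prob) / x

# Example 3.1: 75% of the values are less than 3.0
lam_hat = lambda_from_point(3.0, 0.75)
print(round(lam_hat, 3))
```

With a full sample, the same function could be applied to each observation and its plotting-position probability, and the resulting estimates averaged, as the comment above describes.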
Method of Moments
One of the most commonly used methods for estimating the parameters of a probability distribution
is the method of moments. For a distribution with m parameters, the procedure is to
equate the first m moments of the distribution to the first m sample moments. This results in m
equations which can be solved for the m unknown parameters. Moments about the origin, the
mean, or any other point can be used. Generally, for 1-parameter distributions the first moment
about the origin, the mean, is used. For 2-parameter distributions the mean and the variance are
generally used. If a third parameter is required, the skewness may be used.
Similarly, L-moments may be used in parameter estimation by equating sample estimates of
the L-moments to the population expression for the corresponding L-moment depending on the
particular pdf being used. Again, for m parameters, m L-moments would be required. This technique
will be illustrated in chapter 6 for some particular pdfs.
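Example 3.2. Estimate the parameter $\lambda$ of the distribution $p_X(x) = \lambda e^{-\lambda x}$ for $x > 0$ by the
method of moments.
Solution: The first moment about the origin is
$$E(X) = \int_0^\infty x \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}$$
Thus, the mean of $p_X(x)$ is $1/\lambda$ so that $\lambda$ can be estimated by $\hat{\lambda} = 1/\bar{x}$.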
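Example 3.3. Use the method of moments to estimate the parameters $\theta_1$ and $\theta_2^2$ of the distribution
$$p_X(x) = (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x - \theta_1)^2}{2\theta_2^2}\right] \qquad -\infty < x < \infty$$
Solution: The mean is
$$\mu_X = \int_{-\infty}^{\infty} x\,(2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x - \theta_1)^2}{2\theta_2^2}\right] dx$$
Let $y = (x - \theta_1)/\theta_2$ so that $dx = \theta_2\,dy$ and
$$\mu_X = \frac{\theta_2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y\,e^{-y^2/2}\,dy + \frac{\theta_1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy$$
The first integral has an integrand $h(y)$ such that $h(-y) = -h(y)$ and is therefore zero. The
second integral can be written as
$$\frac{2\theta_1}{\sqrt{2\pi}} \int_0^{\infty} e^{-y^2/2}\,dy = \theta_1$$
Therefore $\mu_X = \theta_1$, or the parameter $\theta_1$ of this distribution is equal to the mean of the distribution
and can be estimated by
$$\hat{\theta}_1 = \bar{x}$$
The variance is
$$\sigma_X^2 = \int_{-\infty}^{\infty} (x - \theta_1)^2 (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x - \theta_1)^2}{2\theta_2^2}\right] dx$$
Let $y = (x - \theta_1)/(\sqrt{2}\,\theta_2)$ so that $dx = \sqrt{2}\,\theta_2\,dy$ and
$$\sigma_X^2 = \frac{2\theta_2^2}{\sqrt{\pi}} \int_{-\infty}^{\infty} y^2 e^{-y^2}\,dy = \theta_2^2$$
Thus, the parameter $\theta_2^2$ is equal to the variance and can be estimated by $s_X^2$ (the sample variance).
$$\hat{\theta}_2^2 = s_X^2$$
Substituting the parameter estimates in terms of their population values into the expression
for $p_X(x)$, the result is
$$p_X(x) = (2\pi\sigma_X^2)^{-1/2} \exp\left[-\frac{(x - \mu_X)^2}{2\sigma_X^2}\right]$$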
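As a rough sketch of how the method of moments is applied in practice, the following Python fragment equates sample moments to population moments for the two cases discussed above: a one-parameter exponential distribution (mean $1/\lambda$) and a two-parameter distribution whose parameters are its mean and variance. The sample values are hypothetical and chosen only for illustration; they are not data from the text.

```python
import statistics

# Hypothetical sample, made up for illustration only
sample = [2.1, 0.4, 3.7, 1.2, 0.9, 5.3, 0.6, 1.8]

x_bar = statistics.mean(sample)       # first sample moment about the origin
s_sq = statistics.variance(sample)    # sample variance (divisor n - 1)

# One-parameter exponential: population mean is 1/lambda
lam_hat = 1.0 / x_bar

# Two-parameter case: parameters equal the mean and the variance
theta1_hat = x_bar
theta2_sq_hat = s_sq

print(lam_hat, theta1_hat, theta2_sq_hat)
```

The two fits use the same two sample statistics; only the population expressions they are equated to differ.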
Maximum Likelihood
Assume we have in hand n random observations $x_1, x_2, \ldots, x_n$. Their joint probability distribution
is $p_X(x_1, x_2, \ldots, x_n; \theta_1, \theta_2, \ldots, \theta_m)$. Because for a random sample the $x_i$'s are independent,
their joint distribution can be written
$$L(\theta_1, \theta_2, \ldots, \theta_m) = \prod_{i=1}^{n} p_X(x_i; \theta_1, \theta_2, \ldots, \theta_m)$$
Now, this latter expression is proportional to the probability that the particular random sample
would be obtained from the population and is known as the likelihood function.
The m parameters are unknown. The values of these m parameters that maximize the likelihood
that the particular sample in hand is the one that would be obtained if n random observations
were selected from $p_X(x; \theta_1, \theta_2, \ldots, \theta_m)$ are known as the maximum likelihood estimators.
The parameter estimation procedure becomes one of finding the values of $\theta_1, \theta_2, \ldots, \theta_m$ that maximize
the likelihood function. This can be done by taking the partial derivative of $L(\theta_1, \theta_2, \ldots, \theta_m)$
with respect to each of the $\theta_i$'s and setting the resulting expressions equal to zero. These m
equations in m unknowns are then solved for the m unknown parameters.
Because many probability distributions involve the exponential function, it is often
easier to maximize the natural logarithm of the likelihood function. The logarithmic function is
monotonic, thus the values of the $\theta$'s that maximize the logarithm of the likelihood function also
maximize the likelihood function.
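Example 3.4. Find the maximum likelihood estimator for the parameter $\lambda$ of the distribution
$p_X(x) = \lambda e^{-\lambda x}$ for $x > 0$.
Solution:
$$L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n \exp\left(-\lambda \sum_{i=1}^{n} x_i\right)$$
$$\ln L(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i$$
$$\frac{d \ln L(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0$$
$$\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}$$
Note that this is the same estimate as obtained in example 3.2 using the method of moments. The
two methods do not always produce the same estimates.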
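When the likelihood equations cannot be solved in closed form, the log-likelihood can be maximized numerically. The sketch below does this for the one-parameter exponential distribution of example 3.4 so the numerical answer can be compared with $1/\bar{x}$. The golden-section search is a stand-in chosen here for simplicity, not a method prescribed by the text, and the sample values and function names are hypothetical.

```python
import math

def log_likelihood(lam, data):
    """ln L for p(x) = lam * exp(-lam * x): n ln(lam) - lam * sum(x)."""
    return len(data) * math.log(lam) - lam * sum(data)

def maximize(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - g * (b - a)
        d = a + g * (b - a)
        if f(c) > f(d):
            b = d          # maximum lies in [a, d]
        else:
            a = c          # maximum lies in [c, b]
    return (a + b) / 2.0

# Hypothetical sample, made up for illustration only
data = [2.1, 0.4, 3.7, 1.2, 0.9, 5.3, 0.6, 1.8]

lam_numeric = maximize(lambda lam: log_likelihood(lam, data), 1e-6, 10.0)
lam_closed = len(data) / sum(data)   # 1 / x_bar from example 3.4
print(lam_numeric, lam_closed)
```

For this distribution the log-likelihood has a single maximum, so a one-dimensional search of this kind is adequate; multi-parameter problems generally need a more capable optimizer.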
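Example 3.5. Find the maximum likelihood estimators for the parameters $\theta_1$ and $\theta_2^2$ of the
distribution
$$p_X(x) = (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x - \theta_1)^2}{2\theta_2^2}\right]$$
Solution:
$$L(\theta_1, \theta_2^2) = (2\pi\theta_2^2)^{-n/2} \exp\left[-\frac{1}{2\theta_2^2} \sum_{i=1}^{n} (x_i - \theta_1)^2\right]$$
$$\ln L = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \theta_2^2 - \frac{1}{2\theta_2^2} \sum_{i=1}^{n} (x_i - \theta_1)^2$$
$$\frac{\partial \ln L}{\partial \theta_1} = \frac{1}{\theta_2^2} \sum_{i=1}^{n} (x_i - \theta_1) = 0$$
Therefore $\sum (x_i - \theta_1) = 0$ and
$$\hat{\theta}_1 = \bar{x}$$
$$\frac{\partial \ln L}{\partial \theta_2^2} = -\frac{n}{2\theta_2^2} + \frac{1}{2\theta_2^4} \sum_{i=1}^{n} (x_i - \theta_1)^2 = 0$$
$$\hat{\theta}_2^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$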
Example 3.5 shows that the maximum likelihood estimators are not necessarily unbiased. It can be
shown, however, that the maximum likelihood estimators are asymptotically (as $n \to \infty$) unbiased.
Maximum likelihood estimators are sufficient and consistent. If an efficient estimator exists,
maximum likelihood estimators, adjusted for bias, will be efficient. In addition to these four
properties, maximum likelihood estimators are said to be invariant, that is, if $\hat{\theta}$ is a maximum
likelihood estimator of $\theta$ and the function $h(\theta)$ is continuous, then $h(\hat{\theta})$ is a maximum likelihood
estimator of $h(\theta)$.
The method of moments and the method of maximum likelihood do not always produce the
same estimates for the parameters. In view of the properties of the maximum likelihood estimators,
this method is generally preferred over the method of moments. Cases arise, however, where
one can get maximum likelihood estimators only by iterative numerical solutions (if at all), thus
leaving room for the use of more readily obtainable estimates, possibly by the method of
moments. The accuracy of the method of moments is severely affected if the data contain errors
in the tails of the distribution where the moment arms are long (Chow 1954). This is especially
troublesome with highly skewed distributions.
Finally, it should be kept in mind that the properties of maximum likelihood estimators are
asymptotic properties (for large n) and there may well exist better estimation procedures for
small samples for particular distributions.
CHEBYSHEV INEQUALITY
Certain general statements about random variables can be made without placing restrictions
on their distributions. More precise probabilistic statements require more restrictions on the distribution
of the random variables. Exact probabilistic statements require complete knowledge of
the probability distribution of the random variable.
One general result that applies to random variables is known as the Chebyshev inequality.
This inequality states that a single observation selected at random from any probability distribution
will deviate more than $k\sigma$ from the mean, $\mu$, of the distribution with probability less than or
equal to $1/k^2$.
$$\text{prob}(|X - \mu| > k\sigma) \le \frac{1}{k^2}$$
For most situations this is a very conservative statement. The Chebyshev inequality produces an
upper bound on the probability of a deviation of a given magnitude from the mean.
Example 3.6. The data of table 2.1 have a mean of 66,540 cfs and a standard deviation of 22,322
cfs. Without making any distributional assumptions regarding the data, what can be said of the
probability that the peak flow in a year selected at random will deviate more than 40,000 cfs from
the mean?
Solution: Applying Chebyshev's inequality we have $k\sigma$ = 40,000 cfs. Using 22,322 cfs as an
estimate for $\sigma$ we obtain k = 1.79.
$$\text{prob}(|X - \mu| > 40{,}000) \le \frac{1}{k^2} = 0.311$$
The probability that the peak flow in any year will deviate more than 40,000 cfs from the
mean is thus less than or equal to 0.311.
Comment: One can see that this is a very conservative figure by noting that only 6 values out of
99 (6/99 = 0.061) lie outside the interval 66,540 $\pm$ 40,000. By not making any distributional
assumptions, we are forced to accept very conservative probability estimates. In later chapters we
will again look at this problem making use of selected probability distributions.
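The bound in example 3.6 is a one-line calculation. The following sketch, with a hypothetical function name, evaluates it and compares it with the observed fraction quoted in the comment above.

```python
def chebyshev_bound(deviation, sigma):
    """Upper bound on prob(|X - mu| > deviation) for any distribution."""
    k = deviation / sigma
    return min(1.0, 1.0 / k**2)

# Example 3.6: deviation of 40,000 cfs, standard deviation of 22,322 cfs
bound = chebyshev_bound(40_000.0, 22_322.0)
observed = 6 / 99   # fraction of the record outside 66,540 +/- 40,000 cfs
print(round(bound, 3), round(observed, 3))
```

The `min(1.0, ...)` guard only matters for deviations smaller than one standard deviation, where the inequality gives no useful information.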
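Applying the Chebyshev inequality to the mean $\bar{x}_n$ of a random sample of size n, which has variance
$\sigma_X^2/n$, gives $\text{prob}(|\bar{x}_n - \mu_X| > k\sigma_X/\sqrt{n}) \le 1/k^2$. If we now let $\delta = 1/k^2$ and choose n so that
$n \ge \sigma_X^2/(\epsilon^2\delta)$, we have the (weak) Law of Large Numbers (Mood and Graybill 1963) which states:
Let $p_X(x)$ be a probability density function with mean $\mu_X$ and finite variance $\sigma_X^2$. Let $\bar{x}_n$ be the
mean of a random sample of size n from $p_X(x)$. Let $\epsilon$ and $\delta$ be any two specified small numbers
such that $\epsilon > 0$, $0 < \delta < 1$. Then for n any integer greater than $\sigma_X^2/(\epsilon^2\delta)$
$$\text{prob}(|\bar{x}_n - \mu_X| < \epsilon) \ge 1 - \delta \qquad (3.79)$$
This statement assures us that we can estimate the population mean with whatever accuracy
we desire by selecting a large enough sample. The actual application of equation 3.79 requires
knowledge of population parameters and is thus of limited usefulness.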
Example 3.7. Assume that the standard deviation of peak flows on the Kentucky River near
Salvisa, Kentucky, is 22,322 cfs. How many observations would be required to be at least 95%
sure that the estimated mean peak flow was within 10,000 cfs of its true value if we know nothing
of the distribution of peak flows?
Solution: Here $\epsilon$ = 10,000 cfs, $\delta$ = 0.05, and $\sigma_X$ = 22,322 cfs, so that
$$n > \frac{\sigma_X^2}{\epsilon^2\delta} = \frac{(22{,}322)^2}{(10{,}000)^2(0.05)} = 99.7$$
We must have at least 100 observations to be 95% sure that the sample mean is within 10,000
cfs of the population mean if we know nothing of the population distribution except its standard
deviation. This happens to be very close to the number of observations in the sample (99).
Comment: We will look at this problem again later making certain distributional assumptions.
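The sample-size calculation of example 3.7 can be sketched as follows. The function name is hypothetical, and the code simply evaluates the lower limit $\sigma_X^2/(\epsilon^2\delta)$ from the Law of Large Numbers statement and takes the smallest integer greater than it.

```python
import math

def min_sample_size(sigma, eps, delta):
    """Smallest integer n greater than sigma^2 / (eps^2 * delta)."""
    limit = sigma**2 / (eps**2 * delta)
    return math.floor(limit) + 1, limit

# Example 3.7: sigma = 22,322 cfs, within 10,000 cfs, 95% sure (delta = 0.05)
n_required, limit = min_sample_size(22_322.0, 10_000.0, 0.05)
print(n_required, round(limit, 2))
```

Because the bound makes no distributional assumption, the resulting n is conservative; a specific distribution would usually justify a smaller sample.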
Exercises
3.1. What is the expected mean and variance of the sum of values obtained by tossing two dice?
What is the coefficient of skew and kurtosis?
3.2. Modular coefficients defined as $K_t = x_t/\bar{x}$ are occasionally used in hydrology. What are the
mean, variance, and coefficient of variation of modular coefficients in terms of the original data?
3.3. What effect does the addition of a constant to each observation from a random sample have
on the mean, variance, and coefficient of variation?
3.4. What effect does multiplying each observation in a random sample by a constant have on
the mean, variance, and coefficient of variation?
3.5. Without any knowledge of the probability distribution of peak flows on the Kentucky River
(table 2.1), what can be said about the probability that $|\bar{Q} - \mu_Q|$ is greater than 10,000 cfs?
3.6. Without any knowledge of the probability distribution of peak flows on the Kentucky River
(table 2.1), what can be said about the probability that a single random observation will deviate
more than 10,000 cfs from $\mu_Q$?
3.7. Using the data of exercise 2.2 calculate the mean and variance from the grouped data. How
do the grouped data mean and variance compare to the ungrouped mean and variance? Which
estimate do you prefer?
3.8. Calculate the covariance between the peak discharge Q in thousands of cfs and the area A in
thousands of square miles for the following data.
3.9. Calculate the correlation coefficient between Q and A for the data in exercise 3.8.
3.10. Calculate the coefficient of skew for Q in exercise 3.8. Note that this estimate is relatively
unreliable because of the small sample.
3.11. Calculate the kurtosis and the coefficient of excess for Q in exercise 3.8. Note that these
estimates are unreliable because of the small sample size.
3.12. Complete the steps necessary to arrive at equation 3.56 from 3.55.
3.14. A convenient relationship for calculating the estimated variance of a sample of data is
3.15. The estimated covariance between X and Y of a bivariate random sample can be calculated
from
Derive this expression from equations 3.49. Note that this estimated covariance is biased. In
practice, the final divisor of n is replaced by n - 1 to correct for bias.
3.16. In exercise 2.14, if the future maximum life of the ferry is 15 years, what is the expected
net profit? Neglect the interest or discount rate.
3.17. What are the maximum likelihood estimates for the parameters of the two parameter
exponential distribution? This distribution is given by
3.18. What are the moment estimates for the parameters of the exponential distribution given in
exercise 3.17?
3.19. For the following data, what are the moment and maximum likelihood estimates for the
parameters of the distribution given in exercise 3.17? x = 15.0, 10.5, 11.0, 12.0, 18.0, 10.5, 19.5.
3.20. Calculate the coefficient of skew for the Kentucky River data of table 2.1.
3.21. Calculate the kurtosis of the Kentucky River data of table 2.1.
3.22. Using the data of exercise 2.2, calculate the coefficient of skew from the grouped data.
3.23. Using the data of exercise 2.2, calculate the kurtosis from the grouped data.
3.24. What are the maximum likelihood estimates for $\alpha$ and $\beta$ in the distribution
3.25. What are the mean and variance of $f_X(x) = 1/N$ for x = 1, 2, ..., N?
3.26. What are the mean and variance of $p_X(x) = a \sin^2 x$ for $0 < x < \pi$?
3.27. Use the method of moments to estimate a in $p_X(x) = a \sin^2 x$ for $0 < x < \pi$ based on the
random sample given by x = 0.5, 2.0, 3.0, 2.5, 1.5, 1.8, 1.0, 0.8, 2.5, 2.2.
3.28. The rth moment about $x_0$ can be written as $E(X - x_0)^r$. Show that the variance is the smallest
possible second moment.
4. Some Discrete Probability Distributions and Their Applications
THUS FAR, probability distributions have been considered in general terms. This chapter is
devoted to some particular discrete distributions and their applications. The following two chapters
are devoted to selected continuous distributions. These chapters are by no means exhaustive treatments
of probability distributions; only some of the more common distributions are considered.
HYPERGEOMETRIC DISTRIBUTION
Drawing a random sample of size n (without replacement) from a finite population of size
N, with the elements of the population divided into two groups with k elements belonging to one
group, is an example of sampling from a hypergeometric distribution. The two groups may be defective
or nondefective objects, rainy or nonrainy days, success or failure of a project, and so
forth. For discussion purposes we will consider that an element (or outcome) from the population
is either a success or a failure. The probability of x successes in a sample of size n selected from
a population of size N containing k successes can be determined by applying equation 2.1.
The total number of possible outcomes or ways of selecting a sample of size n from N objects
is $\binom{N}{n}$. The number of ways of selecting x successes and n - x failures from the population
containing k successes and N - k failures is $\binom{k}{x}\binom{N-k}{n-x}$. Thus the probability is
$$f_X(x; N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}} \qquad (4.1)$$
The distribution given by equation 4.1 is known as the hypergeometric distribution where
$f_X(x; N, n, k)$ is the probability of obtaining x successes in a sample of size n drawn from a population
of size N containing k successes.
The cumulative hypergeometric distribution giving the probability of x or fewer successes is
$$F_X(x; N, n, k) = \sum_{i=0}^{x} \frac{\binom{k}{i}\binom{N-k}{n-i}}{\binom{N}{n}}$$
There are certain natural restrictions on this distribution. For example: x cannot exceed k, x cannot
exceed n, k cannot exceed N, and n cannot exceed N. N, n, k, and x are all nonnegative integers.
Furthermore, the outcomes must be random and equally likely.
The mean of the hypergeometric distribution is
$$E(X) = \frac{nk}{N}$$
Example 4.1. The hypergeometric distribution applies in example 2.5. In this example, a success is
selecting a bad record and N = 10, k = 3, n = 4. The solutions can be written in terms of the
hypergeometric as
Example 4.2. Assume that during a certain September, 10 rainy days occurred. Also assume that
at this particular location the occurrence of rain on any day is independent of whether or not it
rained on any previous day. (This is often not a good assumption.)
A sample of 10 September days is selected at random. (a) What is the probability that 4 of
these days will have been rainy? (b) What is the probability that fewer than 4 of these days were
rainy?
  
Example 4.3. Examples of the hypergeometric distribution commonly found in statistics books
include card sampling problems (What is the probability of exactly 2 aces in a 5-card hand
selected at random from a 52-card deck?) and acceptance sampling problems (What is the probability
of selecting 5 defective items from a lot of 50 items if 20 items are selected and the lot
actually contains 12 defectives?)
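Questions of the kind posed in examples 4.2 and 4.3 can be evaluated directly from the hypergeometric probabilities. The sketch below is one way to do so in Python; the function names are hypothetical. For example 4.2 it is assumed that the population is the 30 days of September (N = 30) with k = 10 rainy days and a sample of n = 10 days.

```python
from math import comb

def hypergeom_pmf(x, N, n, k):
    """Probability of x successes in a sample of n drawn without
    replacement from N items of which k are successes."""
    if x < max(0, n - (N - k)) or x > min(n, k):
        return 0.0
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

def hypergeom_cdf(x, N, n, k):
    """Probability of x or fewer successes."""
    return sum(hypergeom_pmf(i, N, n, k) for i in range(x + 1))

# Example 4.2, assuming N = 30 September days, k = 10 rainy, n = 10 sampled
p_four_rainy = hypergeom_pmf(4, 30, 10, 10)
p_fewer_than_four = hypergeom_cdf(3, 30, 10, 10)

# Example 4.3, card question: exactly 2 aces in a 5-card hand
p_two_aces = hypergeom_pmf(2, 52, 5, 4)

print(round(p_four_rainy, 4), round(p_fewer_than_four, 4), round(p_two_aces, 4))
```

The range check at the top of `hypergeom_pmf` encodes the natural restrictions on x, so impossible outcomes return zero rather than raising an error.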
Binomial Distribution
Consider a discrete time scale. At each point on this time scale an event may either occur
or not occur. Let the probability of the event occurring be p for every point on the time scale;
thus, the occurrence of the event at any point on the time scale is independent of the history of
any prior occurrences or nonoccurrences. The probability of an occurrence at the ith point on the
time scale is p for i = 1, 2, .... A process having these properties is said to be a Bernoulli
process.
An example of a Bernoulli process might be the occurrence of rainy days. The time scale has
units of days. On any particular day, rainfall may or may not occur. If the occurrence of rainfall
on any given day is independent of the past history of rainfall occurrences, the sequence of rainy
and dry days can be considered a Bernoulli process.
As an example of another Bernoulli process, consider that during any year the probability of
the maximum flow exceeding 10,000 cfs on a particular stream is p. Common terminology for a
flow exceeding a given value is an exceedance. Further consider that the peak flow in any year is
independent from year to year (a necessary condition for the process to be a Bernoulli process).
Let q = 1 - p be the probability of not exceeding 10,000 cfs. We can neglect the probability of
a peak of exactly 10,000 cfs since the peak flow rates would be a continuous process. In this example
the time scale is discrete with the points being nominally 1 year in time apart. We can now
make certain probabilistic statements about the occurrence of a peak flow in excess of 10,000 cfs
(an exceedance).
For example, the probability of an exceedance occurring in year 3 and not in years 1 or 2 can
be evaluated from equation 2.9 as qqp since the process is independent from year to year. The
probability of (exactly) one exceedance in any 3-year period is pqq + qpq + qqp since the exceedance
could occur in either the first, second, or third year. Thus, the probability of (exactly)
one exceedance in three years is $3pq^2$.
In a similar manner, the probability of 2 exceedances in 5 years can be found from the summation
of the terms ppqqq, pqpqq, pqqpq, ..., qqqpp. It can be seen that each of these terms is
equivalent to $p^2q^3$ and that the number of terms is equal to the number of ways of arranging 2
items (the p's) among 5 items (the p's and q's). Therefore, the total number of terms is $\binom{5}{2}$, or 10,
so that the probability of exactly 2 exceedances in 5 years is
$$\binom{5}{2} p^2 q^3 = 10p^2q^3$$
This result can be generalized so that the probability of x exceedances in n years is
$\binom{n}{x} p^x q^{n-x}$. The result is applicable to any Bernoulli process so that the probability of x occurrences
of an event in n independent trials if p is the probability of an occurrence in a single trial is given by
$$f_X(x; n, p) = \binom{n}{x} p^x q^{n-x} \qquad x = 0, 1, 2, \ldots, n$$
The cumulative binomial distribution is
$$F_X(x; n, p) = \sum_{i=0}^{x} \binom{n}{i} p^i q^{n-i}$$
and gives the probability of x or fewer occurrences of an event in n independent trials if the probability
of an occurrence in any trial is p.
Continuing the above example, the probability of less than 3 exceedances in 5 years is
$$F_X(2; 5, p) = \sum_{i=0}^{2} \binom{5}{i} p^i q^{5-i}$$
The mean, variance, and coefficient of skew of the binomial distribution are
$$E(X) = np \qquad \mathrm{Var}(X) = npq \qquad \gamma = \frac{q - p}{\sqrt{npq}}$$
The distribution is symmetrical for p = q, skewed to the right for q > p and skewed to the left
for q < p.
Because the probability of a success on any trial is independent of past history, the origin of
the time scale of a Bernoulli process can be taken at any time point. Thus the probability of any
combination of successes or failures is the same for any sequence of n points regardless of their
location with respect to the origin.
Example 4.4. On the average, how many times will a 10-year flood occur in a 40-year period?
What is the probability that exactly this number of 10-year floods will occur in a 40-year period?
Solution: A 10-year flood has p = 1/10 = 0.1. For n = 40, the expected number of occurrences is
$$E(X) = np = 40(0.1) = 4$$
$$f_X(4; 40, 0.1) = \binom{40}{4} (0.1)^4 (0.9)^{36} = 0.2059$$
Comment: This problem illustrates the difficulty of explaining the concept of return period. On
the average a 10-year event occurs once every 10 years and in a 40-year period is expected to
occur 4 times. Yet in about 80% (100[1 - 0.2059]) of all possible independent 40-year periods,
the 10-year event will not occur exactly 4 times. As a matter of fact the probability that it will
occur 3 times is nearly identical to the probability it will occur 4 times (0.2003 vs. 0.2059). The
number of occurrences, X, is truly a random variable (with a binomial distribution).
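The figures quoted in example 4.4 can be checked with a short binomial calculation. The sketch below uses hypothetical function names and only the Python standard library.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x occurrences in n independent trials."""
    return comb(n, x) * p**x * (1.0 - p)**(n - x)

def binom_cdf(x, n, p):
    """Probability of x or fewer occurrences in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(x + 1))

n, p = 40, 0.1                      # 40-year period, 10-year flood
expected = n * p                    # mean number of occurrences
p_exactly_4 = binom_pmf(4, n, p)
p_exactly_3 = binom_pmf(3, n, p)
print(expected, round(p_exactly_4, 4), round(p_exactly_3, 4))
```

Printing the whole distribution with `[binom_pmf(x, n, p) for x in range(n + 1)]` makes the point of the comment above concrete: no single count is very probable, even the expected one.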
The binomial distribution has an additive property (Gibra 1973). That is, if X has a binomial
distribution with parameters $n_1$ and p and Y, independent of X, has a binomial distribution with
parameters $n_2$ and p, then Z = X + Y has a binomial distribution with parameters $n = n_1 + n_2$ and p.
A useful property of the binomial distribution is that
The binomial distribution can be used to approximate the hypergeometric distribution if the
sample selected is small in comparison to the number of items N from which the sample is drawn.
In this case, the probability of a success would be about the same for each trial, and sampling
without replacement (hypergeometric) would be very similar to sampling with replacement
(binomial).
Example 4.5. Compare the hypergeometric and binomial distributions for N = 40, n = 5, k = 10
and x = 0, 1, 2, 3, 4, 5.
Solution:
x    Hypergeometric $f_X(x; N, n, k) = f_X(x; 40, 5, 10)$    Binomial $f_X(x; n, p) = f_X(x; 5, 10/40)$
Comment: This merely indicates that drawing a small sample without replacement from a large
population and drawing the same sample with replacement (so probabilities in each trial are constant)
are nearly equivalent.
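The comparison called for in example 4.5 can be generated with a few lines of Python. This is an illustrative sketch with hypothetical function names; it tabulates both distributions for N = 40, n = 5, k = 10, using p = k/N = 0.25 for the binomial.

```python
from math import comb

def hypergeom_pmf(x, N, n, k):
    """Sampling without replacement: x successes in n draws."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    """Sampling with replacement: x successes in n trials."""
    return comb(n, x) * p**x * (1.0 - p)**(n - x)

N, n, k = 40, 5, 10
rows = [(x, hypergeom_pmf(x, N, n, k), binom_pmf(x, n, k / N))
        for x in range(n + 1)]

print("x  hypergeometric  binomial")
for x, h, b in rows:
    print(f"{x}  {h:.4f}          {b:.4f}")
```

Rerunning the table with a larger N and the same ratio k/N shows the two columns drawing closer together, which is the approximation described in the text.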
Example 4.6. The operator of a boat dock has decided to put in a new facility along a certain
river. In an economic analysis of the situation it was decided to have the facility designed to withstand
floods up to 75,000 cfs. Furthermore, it was determined that if one flood greater than this
occurs in a 5-year period, repairs can be made and the operator will still break even on its operation
during the 5-year period. If more than one flow in excess of 75,000 cfs occurs, money will
be lost. If the probability of exceeding 75,000 cfs is 0.15, what is the probability the operator will
make money?
Solution: Money will be made if no floods exceeding 75,000 cfs occur during the 5-year period.
Let X be the number of floods. From the binomial distribution
$$f_X(0; 5, 0.15) = \binom{5}{0} (0.15)^0 (0.85)^5 = 0.4437$$
Comment: The probability that the operator will make the investment, work for 5 years, and just
break even is very high
$$f_X(1; 5, 0.15) = \binom{5}{1} (0.15)^1 (0.85)^4 = 0.3915$$
Thus, even though the risk or probability of losing money is low (1 - 0.3915 - 0.4437 = 0.1648),
the investment may not be an attractive one.
Example 4.7. In order to be 90% sure that a design storm is not exceeded in a 10-year period,
what should be the return period of the design storm?
Solution: Let p be the probability of the design storm being exceeded. Based on the binomial
distribution, the probability of no exceedances is given by
$$f_X(0; 10, p) = (1 - p)^{10} = 0.90$$
$$p = 1 - (0.90)^{1/10} = 0.0105$$
$$T = \frac{1}{p} = 95 \text{ years}$$
Comment: To be 90% sure that a design storm is not exceeded in a 10-year period, a 95-year
return period storm must be used. If a 10-year return period storm is used, the chances of it being
exceeded are
$$1 - (1 - 1/T)^T = 1 - (0.9)^{10} = 0.65$$
It can be shown that as T gets large, this expression approaches 1 - 1/e or 0.632. For T = 5,
10, and 25, the probability is 0.67, 0.65, and 0.64, respectively. Thus, if the design life of a structure
and its design return period are the same, the chances are very great that the capacity of the structure
will be exceeded during its design life. The risk associated with a return period T over n years is
$$\text{risk} = 1 - (1 - 1/T)^n$$
The procedure outlined in example 4.7 can be used to determine a design return period when
the allowable risk is stated. Note that the design return period must be much greater than the life of
the project to be reasonably sure that an exceedance will not occur. No matter what design return period
is selected, there is still a chance that an exceedance will occur. Some may argue that there is an
upper limit to the magnitude of natural events, such as flood peaks. They would argue that a peak of
100,000 cfs from a 1-acre watershed would be impossible. In practice the probability that would be
assigned to an event of this sort is so small that it can be neglected for most practical purposes.
Figure 4.1 shows the design return period that must be used to be a certain percent confident
that the design will not be exceeded during the design life of the project. The parameters on the
curves are the percent chance of no exceedance during the design life. For example, to be 90%
sure that a design condition will not be exceeded during a project whose design life is 100 years,
the project would have to be designed on the basis of a 900-year event. Figure 4.1 is derived from
calculations like those contained in example 4.7.
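The calculations behind example 4.7 and the risk expression can be sketched as two small functions. The names `risk` and `design_return_period` are hypothetical; the formulas are the binomial probability of no exceedance in n years and its inverse.

```python
import math

def risk(T, n):
    """Probability of at least one exceedance of the T-year event in n years."""
    return 1.0 - (1.0 - 1.0 / T)**n

def design_return_period(n, confidence):
    """Return period needed to be `confidence` sure of no exceedance in n years."""
    p = 1.0 - confidence**(1.0 / n)
    return 1.0 / p

# Example 4.7: 90% sure of no exceedance in a 10-year period
T_design = design_return_period(10, 0.90)

# Risk when the design life equals the return period
same_life = {T: risk(T, T) for T in (5, 10, 25)}

print(round(T_design, 1), {T: round(r, 2) for T, r in same_life.items()})
```

Evaluating `design_return_period` over a grid of design lives and confidence levels would trace out curves of the kind the figure is described as showing.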
Figure 4.1 can also be used to evaluate the risk or percent chance of an event in excess of the
design event during the design life. For example, if a project is designed on the basis of a 50-year
Fig. 4.1. Design return period required as a function of design life to be a given percent confident
(curve parameter) that the design condition is not exceeded.
Comment: What has occurred prior to the trials of interest is of no concern since the Bernoulli
process is based on the assumption of independence from trial to trial.
Geometric Distribution
The probability that the first exceedance (or success) of a Bernoulli trial occurs on the
xth trial can be found by noting that for the first exceedance to be on the xth trial there must be
x - 1 preceding trials without an exceedance followed by 1 trial with an exceedance. Thus
the desired probability is $pq^{x-1}$. This is known as the geometric distribution
$$f_X(x; p) = pq^{x-1} \qquad x = 1, 2, 3, \ldots$$
The mean of the geometric distribution is $E(X) = 1/p$.
$E(X) = 1/p$ means that on the average a T-year event occurs on the Tth year, which agrees
with our intuitive concept of a return period.
Example 4.9. What is the probability that a 10-year flood will occur for the first time during the
fifth year after the completion of a project? What is the probability it will be at least the fifth year
before a 10-year flood occurs?
Solution: The probability that the first exceedance is in year 5 is
$$f_X(5; 0.1) = (0.1)(0.9)^4 = 0.0656$$
The probability that it will be at least the fifth year before the first occurrence is not the same as
the probability of the first occurrence in the fifth year. The expression "at least" implies the first
occurrence might be in the fifth year or some later year. The desired probability is equal to the
probability of no occurrences in the first 4 years, which is $(0.9)^4 = 0.6561$.
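The two probabilities distinguished in example 4.9 can be written as separate functions, which makes the difference between "first occurrence in year x" and "first occurrence in year x or later" explicit. The function names are hypothetical.

```python
def geom_pmf(x, p):
    """Probability that the first occurrence is on trial x (x = 1, 2, ...)."""
    return p * (1.0 - p)**(x - 1)

def geom_at_least(x, p):
    """Probability that the first occurrence is on trial x or later,
    i.e. no occurrence in the first x - 1 trials."""
    return (1.0 - p)**(x - 1)

p = 0.1                                # 10-year flood
first_in_year_5 = geom_pmf(5, p)
at_least_year_5 = geom_at_least(5, p)

# Numerical check of the mean, E(X) = 1/p, by summing a long series
mean_check = sum(x * geom_pmf(x, p) for x in range(1, 2000))

print(round(first_in_year_5, 4), round(at_least_year_5, 4), round(mean_check, 3))
```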
Solution: This is the same as the probability of the first occurrence on the tenth year or
As might be expected because the negative binomial is based on the binomial, the additive
feature holds. Thus, if X and Y are described by $f_X(x; k_1, p)$ and $f_Y(y; k_2, p)$ respectively, then
Z = X + Y follows the negative binomial $f_Z(z; k_1 + k_2, p)$.
Example 4.11. What is the probability that the fourth occurrence of a 10-year flood will be on the
fortieth year?
Solution:
geometric distribution. The probability that the kth occurrence was at the xth time is described by
the negative binomial distribution. It was also found that the probability distribution of the length
of time between occurrences can be found from the geometric distribution by noting that the
probability that x trials elapse between occurrences is the same as the probability that the first
occurrence is at the (x + 1)st time or $f_X(x + 1; p) = pq^x$.
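A question such as example 4.11 can be evaluated with the negative binomial distribution. The sketch below assumes the standard form $f_X(x; k, p) = \binom{x-1}{k-1} p^k q^{x-k}$ for the probability that the kth occurrence falls on trial x; this form and the function names are assumptions of the sketch, since the notation intended in the text is not shown in this passage.

```python
from math import comb

def neg_binom_pmf(x, k, p):
    """Probability that the kth occurrence is on trial x (x >= k).
    Assumes the standard form C(x-1, k-1) p^k q^(x-k)."""
    if x < k:
        return 0.0
    return comb(x - 1, k - 1) * p**k * (1.0 - p)**(x - k)

def geom_pmf(x, p):
    """Probability that the first occurrence is on trial x."""
    return p * (1.0 - p)**(x - 1)

# Example 4.11: fourth occurrence of a 10-year flood in the fortieth year
p_fourth_in_40 = neg_binom_pmf(40, 4, 0.1)

# With k = 1 the negative binomial reduces to the geometric distribution
reduces = abs(neg_binom_pmf(5, 1, 0.1) - geom_pmf(5, 0.1)) < 1e-15

print(round(p_fourth_in_40, 4), reduces)
```

The reasoning behind the assumed form is that the first x - 1 trials must contain exactly k - 1 occurrences (a binomial probability) and trial x must itself be an occurrence.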
POISSON PROCESS
Poisson Distribution
Consider a Bernoulli process defined over an interval of time (or space) so that p is the probability
that an event may occur during the time interval. If the time interval is allowed to become
shorter and shorter so that the probability, p, of an event occurring in the interval gets smaller and
the number of trials, n, increases in such a fashion that np remains constant, then the expected
number of occurrences in any total time interval remains the same. It can be shown that as n gets
large and p gets small so that np remains a constant, $\lambda$, the binomial distribution approaches the
Poisson distribution given by
$$f_X(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \qquad x = 0, 1, 2, \ldots;\ \lambda > 0$$
The mean, variance, and coefficient of skew of the Poisson distribution are
$$E(X) = \lambda \qquad \mathrm{Var}(X) = \lambda \qquad \gamma = \lambda^{-1/2}$$
As $\lambda$ gets large, the distribution goes from a positively skewed distribution to a nearly symmetrical
distribution. The cumulative Poisson distribution is
$$F_X(x; \lambda) = \sum_{i=0}^{x} \frac{\lambda^i e^{-\lambda}}{i!}$$
Example 4.12. What is the probability that a storm with a return period of 20 years will occur
once in a 10-year period?
Thus the solutions are not identical but are quite close to each other.
Example 4.13. What is the probability of 5 occurrences of a 2-year storm in a 10-year period?
Comment: For this situation n is not large enough and p not small enough for a good approximation.
Example 4.14. What is the probability of fewer than 5 occurrences of a 20-year storm in a 100-year
period?
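The quality of the Poisson approximation in situations like examples 4.12 to 4.14 can be examined by computing both the binomial probability and the Poisson probability with $\lambda = np$. The sketch below does this; the function names are hypothetical, and "once" in example 4.12 is read here as exactly one occurrence.

```python
import math
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1.0 - p)**(n - x)

def poisson_pmf(x, lam):
    return lam**x * math.exp(-lam) / math.factorial(x)

def compare(x_values, n, p):
    """Return (binomial, Poisson) probabilities summed over x_values."""
    lam = n * p
    b = sum(binom_pmf(x, n, p) for x in x_values)
    po = sum(poisson_pmf(x, lam) for x in x_values)
    return b, po

ex_4_12 = compare([1], 10, 1 / 20)          # one 20-year storm in 10 years
ex_4_13 = compare([5], 10, 1 / 2)           # five 2-year storms in 10 years
ex_4_14 = compare(range(5), 100, 1 / 20)    # fewer than 5 in 100 years

for name, (b, po) in [("4.12", ex_4_12), ("4.13", ex_4_13), ("4.14", ex_4_14)]:
    print(name, round(b, 4), round(po, 4))
```

The gap between the two columns is small when p is small and n is large, and noticeably larger for the 2-year storm, in line with the comment following example 4.13.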
The Poisson distribution possesses the additive property that the sum of two Poisson
random variables with parameters $\lambda_1$ and $\lambda_2$ is a Poisson random variable with parameter
$\lambda = \lambda_1 + \lambda_2$. A Poisson process for a continuous time scale can be defined analogous to a
Bernoulli process on a discrete time scale. The Poisson process refers to the occurrence of
events along a continuous time (or location) scale. The assumptions underlying the process
are:
1. The probability of an event in any short interval t to $t + \Delta t$ is $\lambda \Delta t$ (proportional to the length
of the interval) for all values of t. This property is known as stationarity.
2. The probability of more than one event in any short interval t to $t + \Delta t$ is negligible in comparison
to $\lambda \Delta t$.
3. The number of events in any interval of time is independent of the number of events in any
other nonoverlapping interval of time.
The probability distribution of the number of events X in time t for a Poisson process is
given by
$$f_X(x; \lambda t) = \frac{(\lambda t)^x e^{-\lambda t}}{x!} \qquad \lambda > 0;\ t > 0;\ x = 0, 1, 2, \ldots \qquad (4.20)$$
where $f_X(x; \lambda t)$ is the probability of x events in time t. Equation 4.20 is a Poisson distribution
with parameter $\lambda t$. The mean and variance of $f_X(x; \lambda t)$ are $E(X) = \lambda t$ and $\mathrm{Var}(X) = \lambda t$. The
parameter $\lambda$ is the average rate of occurrence of the event.
Exponential Distribution
The probability distribution of the time, T, between occurrences of the event can be found
by noting that the $\text{prob}(T \le t)$ is equal to $1 - \text{prob}(T > t)$. The $\text{prob}(T > t)$ is equal to the probability
of no occurrences in time t, which is $f_X(0; \lambda t)$ or $e^{-\lambda t}$. Thus
$$P_T(t; \lambda) = \text{prob}(T \le t) = 1 - e^{-\lambda t} \qquad t > 0$$
which is a cumulative distribution known as the exponential distribution. The probability density
function is
$$p_T(t; \lambda) = \lambda e^{-\lambda t} \qquad t > 0$$
and is the probability distribution of the length of the time interval between occurrences of the
event. The mean and variance of the exponential distribution are $1/\lambda$ and $1/\lambda^2$, respectively.
Gamma Distribution
The probability distribution of the time to the nth occurrence can be found by noting that the
time to the nth occurrence is the sum of n independent random variables, $T_1 + T_2 + \cdots + T_n$, from
the exponential distribution. The method of derived distributions can be used with the result that
the probability density function of the time to the nth occurrence is
$$p_T(t; n, \lambda) = \frac{\lambda^n t^{n-1} e^{-\lambda t}}{(n - 1)!} \qquad t > 0;\ n = 1, 2, \ldots$$
which is the gamma distribution for integer values of the parameter n. The gamma distribution
has $E(T) = n/\lambda$ and $\mathrm{Var}(T) = n/\lambda^2$.
Example 4.15. Barges arrive at a lock an average of 4 each hour. (a) If the arrival of barges at the
lock can be considered to follow a Poisson process, what is the probability that 6 barges will
arrive in 2 hours? (b) If the lock master has just locked through all of the barges at the lock, what
is the probability she can take a 15minute break without another barge arriving? (c) If the oper
ation of the lock is such that 4 barges can be locked through at once and the lock master insists
that this always be the case, what is the probability that the first barge to arrive after 4 previous
barges have been locked through will have to wait at least 1 hour before being locked through?
Solution:
(a) For this problem the rate constant is 4 hours'. The probability of 6 arrivals in 2 hours
can be determined from the Poisson distribution
86e8
fx(x; At) = fx(6; 8) = = 0.1221
6!
Note that this is not the same as the probability that it will be 15 minutes until the next amval.
The time scale is continuous so the probability that it will be exactly 15 minutes until the next
arrival is zero. We can only talk of probabilities associated with time intervals, not specific
times.
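(c) The barge must wait for the arrival of 3 additional barges. The probability that the time
T for 3 barges to arrive is greater than 1 hour is

$$\text{prob}(T > 1) = 1 - \text{prob}(T \le 1)$$

The probability that T ≤ 1 for 3 arrivals comes from the gamma distribution with n = 3 and λ = 4:

$$\text{prob}(T \le 1) = \int_0^1 \frac{4^3 t^2 e^{-4t}}{2!}\, dt = 1 - 13e^{-4} = 0.7619$$

so that prob(T > 1) = 1 − 0.7619 = 0.2381.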
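The three parts of example 4.15 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not from the text, and the closed-form sum used for the gamma cdf applies only to integer n.

```python
import math

RATE = 4.0  # barge arrivals per hour (lambda), from the example statement


def poisson_pmf(x: int, mean: float) -> float:
    """Probability of exactly x events when the expected count is `mean`."""
    return mean**x * math.exp(-mean) / math.factorial(x)


def erlang_cdf(t: float, n: int, rate: float) -> float:
    """prob(T <= t) for the time to the n-th event (gamma with integer n)."""
    partial = sum((rate * t) ** k / math.factorial(k) for k in range(n))
    return 1.0 - math.exp(-rate * t) * partial


# (a) exactly 6 arrivals in 2 hours
p_six_in_two_hours = poisson_pmf(6, RATE * 2.0)
# (b) no arrival during a 0.25-hour break (exponential survival function)
p_quiet_break = math.exp(-RATE * 0.25)
# (c) the third additional barge takes longer than 1 hour to arrive
p_wait_over_hour = 1.0 - erlang_cdf(1.0, 3, RATE)

print(f"(a) {p_six_in_two_hours:.4f}  (b) {p_quiet_break:.4f}  (c) {p_wait_over_hour:.4f}")
```

The same three functions cover any Poisson process once the rate and the time unit are stated consistently.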
For a Poisson process, the probability that an event will occur in a short time interval t to t + Δt
is λΔt for all t. The probability that more than one event occurs in Δt is negligible. The probability
distribution of the number of events in a given time T is the Poisson distribution. The exponential
distribution describes the time between events and the gamma distribution the time to the nth event.
Example 4.16. It has been proposed that an event-based rainfall simulation model can be
constructed by modeling the occurrence of rainstorms by a Poisson process and the amount of
rain in each storm by some continuous probability distribution. In this way, the time between
rainstorms would follow an exponential distribution, the time for X rainstorms would follow a
gamma distribution, and the number of rainstorms in a time interval would follow a Poisson
distribution. Duckstein et al. (1975) and Fogel et al. (1974) used a modification of this approach.
Part of Fogel et al.'s results are shown as figure 4.2.
Fig. 4.2. Distribution of occurrences of warm season rainfall in which the areal mean of five
gages in New Orleans, Louisiana, exceeded 0.50 inches and at least one gage recorded
more than 1.0 inch (Fogel et al. 1974). Horizontal axis: number of events per year, 0 to 30.
MULTINOMIAL DISTRIBUTION
The binomial distribution can be generalized to include the probabilities of outcomes of
several types rather than the two possible outcomes of the binomial. If the probabilities associated
with each of k distinct outcomes are p1, p2, ..., pk, then in n independent trials the probability of x1
outcomes of type 1, x2 outcomes of type 2, ..., xk outcomes of type k is given by the multinomial
distribution as

$$f_X(\mathbf{x}; n, \mathbf{p}) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$$

where x and p are 1 × k vectors. Some restrictions on this distribution are

$$\sum_{i=1}^{k} p_i = 1 \quad \text{and} \quad \sum_{i=1}^{k} x_i = n$$
Example 4.17. On a certain stream the probability that the maximum peak flow during a 1-year
period will be less than 5,000 cfs is 0.2 and the probability that it will be between 5,000 cfs and
10,000 cfs is 0.4. In a 20-year period, what is the probability of 4 peak flows less than 5,000 cfs
and 8 peak flows between 5,000 and 10,000 cfs?
Solution: To apply the multinomial distribution we define the third event as a peak flow in
excess of 10,000 cfs. This event has probability 1 − 0.2 − 0.4 = 0.4. The event of a peak flow
greater than 10,000 cfs must occur 20 − 4 − 8 = 8 times. The desired probability is

$$f_X(4, 8, 8;\; 20,\; 0.2, 0.4, 0.4) = \frac{20!}{4!\, 8!\, 8!}\,(0.2)^4 (0.4)^8 (0.4)^8 = 0.0429$$

Comment: The expected result from 20 years of flood peak data would be

$$E(X_1) = 20(0.2) = 4, \qquad E(X_2) = 20(0.4) = 8, \qquad E(X_3) = 20(0.4) = 8$$

This problem demonstrates that even though the expected results are 4, 8, and 8, the probability
of this happening is very low.
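The multinomial probability in example 4.17 is easy to evaluate directly. The sketch below is one way to do it with the Python standard library; `multinomial_pmf` is an illustrative name, not a function from the text.

```python
import math


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` outcomes of each type in sum(counts) trials."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    n = sum(counts)
    # n! / (x1! x2! ... xk!) computed with exact integer arithmetic
    coefficient = math.factorial(n)
    for x in counts:
        coefficient //= math.factorial(x)
    probability = float(coefficient)
    for x, p in zip(counts, probs):
        probability *= p**x
    return probability


# 4 peaks below 5,000 cfs, 8 between 5,000 and 10,000 cfs, 8 above 10,000 cfs
p_example = multinomial_pmf([4, 8, 8], [0.2, 0.4, 0.4])
expected_counts = [20 * p for p in (0.2, 0.4, 0.4)]
print(f"probability = {p_example:.4f}, expected counts = {expected_counts}")
```

Even the most likely single outcome of a multinomial with many trials has a small probability, because the total probability is spread over a very large number of possible count vectors.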
Exercises
4.1. Compute the terms of the binomial distribution with n = 10 and p = 0.2. Plot in the form
of a histogram.
4.2. Compute the terms of the cumulative binomial with n = 10 and p = 0.2. Plot the terms.
4.3. If a project is designed on a 10-year return period, what is the probability of at least 1
exceedance during the 10-year life of the project?
4.4. What design return period should be used to ensure a 95% chance that the design will not be
exceeded in a 25-year period?
4.5. Construct a curve relating the design return period to the life of a project when a 90 percent
chance of no exceedance is used.
4.6. What design return period should be used to ensure a 50% chance of no exceedance in a
10-year period?
4.7. What design return period should be used to ensure a 75% chance of no more than 1
exceedance in 10 years?
4.8. Construct an example where the Poisson is not a good approximation for the binomial.
4.9. In a certain locality contractors A, B, and C get about 50%, 25% and 25% respectively of
all water resources projects. Five contracts are coming up for bid. What is the probability that
contractor A will get all 5 jobs? What is the probability that A will get 2 jobs and B will get
2 jobs?
4.10. In 100 years the following number of floods were recorded at a specific location. Draw a
relative frequency histogram of the data. Fit a Poisson distribution to the data and plot the relative
frequencies according to the Poisson distribution on the histogram. Is the Poisson a good
approximation for the data?
4.11. Based on a Poisson approximation to the data of exercise 4.10, what is the probability of 5
successive years without a flood?
4.12. Based on a Poisson approximation to the data of exercise 4.10, what is the probability of
exactly five years between floods?
4.13. Compute the probability of at least 1 n-year event in a k-year period using (a) n = 100,
k = 20; (b) n = 500, k = 50.
4.14. Using the Poisson approximation to the binomial distribution show that the probability of
at least one occurrence of a T-year event in T years is 0.632.
4.15. The Bernoulli distribution is given by
4.16. Use the Poisson distribution to approximate the binomial distribution of exercise 4.1. Plot
the terms of this Poisson distribution on the histogram of exercise 4.1.
4.17. Two widely separated watersheds are selected for a study on peak discharges. If the
occurrence of flood flows on the two basins can be considered as independent events, what is the
probability of experiencing a total of five 20-year events on the two watersheds in a 10-year period?
4.18. A well-known scientist has predicted that during a certain 3-year period a severe drought
will occur on the plains east of the Rocky Mountains. He made this prediction based on his
observance of sunspot activity. If the probability of a drought is 0.10 in any year, what is the
probability that the scientist's prediction will come true if the occurrence of a drought is a strictly
random phenomenon unrelated to sunspot activity?
4.19. In a certain region there are 20 possible small watersheds suitable for a research project.
Unknown to the project manager, 6 of these basins have subsurface geological features that permit
large quantities of surface water to enter underground formations and leave the basin via subsurface
flow. The project manager wants to select 6 watersheds from the 20 for study. (a) What is the
probability that 1 of the basins having the above described geologic features will be selected? (b) What
is the probability that 3 of these basins will be selected? (c) What is the probability that at least one
of the basins will be selected? (d) What is the probability that all of these basins will be selected?
4.20. In the situation described in exercise 4.19 the project manager wants to pick 3 pairs of
watersheds for the evaluation of an evapotranspiration suppressant. One basin in each pair will
be used for a control and one will be treated with the suppressant. What is the probability that all
of the control watersheds will have the geologic problem while all of the rest will not?
4.21. It is desired to model the number of rainy days in July and August as a Bernoulli process.
Based on the data below and the assumption that the Bernoulli model is applicable: (a) What is
the probability of 10 or more rainy days in each of the months of July and August? (b) What is
the probability of 20 rainy days in the 2-month period? (c) What assumptions concerning the
Bernoulli process are likely violated by this problem? For this problem write answers in terms of
summations. Do not evaluate the summations.
Year                1    2    3    4    5    6    7    8    9   10
No. of rainy days
  July             10   15   17    8    9   19   17   14   20    4
  August            4    9    8    3    0   10   12    2    8    6
4.22. For the binomial distribution show that fX(x; n, p) = fX(x − 1; n − 1, p) fX(1; 1, p) +
fX(x; n − 1, p) fX(0; 1, p). Write out a narrative description of the meaning of this equation.
4.23. Work exercise 4.21 using the Poisson distribution to approximate the binomial.
4.24. Pool the data of exercise 4.21 so that a single estimate is obtained for p of the binomial
distribution. Compute the probability of 20 rainy days in the 2-month period of July-August. Compare
this probability to the one computed in part b of exercise 4.21. Which answer would you prefer?
4.25. Using the data of exercise 4.21, what is the probability that the sixth wet day of August
occurs on August 29, 30, or 31?
4.26. Show that for the Poisson process the time for n occurrences follows the gamma distribution.
(Hint: Use the method of derived distributions to find the distribution of the time to 2 occurrences.
Using the distribution of the time to 2 occurrences the method of derived distributions can be used
to get the time to 3 occurrences. This process can then be repeated until a pattern emerges.
Induction could also be used by showing that if the time for n − 1 occurrences is given by equation 4.20
by substituting n − 1 for n then the time for n occurrences is given by equation 4.20. Also, the time
for 1 occurrence is given by equation 4.19, which is the same as equation 4.20 with n = 1.)
5. Normal Distribution
THE MOST widely used and most important continuous probability distribution is the
Gaussian, or normal distribution. The normal distribution has been widely used because of its
early connection with the "Theory of Errors" and because it has certain useful mathematical
properties. Many statistical techniques such as analysis of variance and the testing of certain
hypotheses rely on the assumption of normality. The errors involved in incorrectly assuming
normality (purposely or unknowingly) depend on the use under consideration. Many statistical
methods derived under the assumption of normality remain approximately valid when moderate
departures from normality are present and as such are said to be robust.
The very name "normal" distribution is misleading in that it implies that random variables
that are not normally distributed are abnormal in some sense. The Central Limit Theorem indicates
the conditions under which a random variable can be expected to be normally distributed. In a
strict theoretical sense, most hydrologic variables cannot be normally distributed because the
range on any random variable that is normally distributed is the entire real line (−∞ to +∞). Thus
nonnegative variables such as rainfall, streamflow, reservoir storage, and so on, cannot be strictly
normally distributed. However, if the mean of a random variable is 3 or 4 times greater than its
standard deviation, the probability of a normal random variable being less than zero is very small
and can in many cases be neglected.
GENERAL NORMAL DISTRIBUTION
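The normal distribution is a 2-parameter distribution whose density function is

$$p_X(x) = \frac{1}{\theta_2\sqrt{2\pi}}\, e^{-(x-\theta_1)^2/(2\theta_2^2)}, \qquad -\infty < x < \infty$$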
Fig. 5.1. Normal distributions with same mean and different variances.
Fig. 5.2. Normal distributions with same variance and different means.
In examples 3.3 and 3.5 it was shown that if either the method of moments or the method of
maximum likelihood is used to estimate the two parameters of this distribution, the result is θ1 = μ
and θ2² = σ², where μ and σ² are the mean and variance of X, respectively. For this reason the
normal distribution is generally written as

$$p_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}, \qquad -\infty < x < \infty \tag{5.1}$$
REPRODUCTIVE PROPERTIES
If a random variable X is N(μ, σ²) and Y = a + bX, the distribution of Y can be shown to
be N(a + bμ, b²σ²). Furthermore, if Xi for i = 1, 2, ..., n, are independently and normally
distributed with mean μi and variance σi², then Y = a + b1X1 + b2X2 + ... + bnXn is normally
distributed with

$$\mu_Y = a + \sum_{i=1}^{n} b_i \mu_i \tag{5.2}$$

and

$$\sigma_Y^2 = \sum_{i=1}^{n} b_i^2 \sigma_i^2 \tag{5.3}$$
Any linear function of independent normal random variables is also a normal random variable.
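Equations 5.2 and 5.3 translate directly into a few lines of code. The sketch below is illustrative only: the function name and the numerical inputs are hypothetical, and the inputs are assumed to describe independent normal variables.

```python
def linear_combination_moments(a, coefficients, means, variances):
    """Mean and variance of Y = a + sum(b_i * X_i) for independent X_i.

    Follows equations 5.2 and 5.3; if the X_i are normal, Y is normal
    with these two parameters.
    """
    mean_y = a + sum(b * m for b, m in zip(coefficients, means))
    var_y = sum(b * b * v for b, v in zip(coefficients, variances))
    return mean_y, var_y


# Hypothetical case: the average of 4 independent observations from N(10, 4),
# written as Y = 0 + (1/4)X1 + (1/4)X2 + (1/4)X3 + (1/4)X4.
mean_y, var_y = linear_combination_moments(0.0, [0.25] * 4, [10.0] * 4, [4.0] * 4)
print(mean_y, var_y)  # mean stays at 10, variance shrinks to 4/4 = 1
```

Note that the variances add with squared coefficients, so a difference X1 − X2 has variance σ1² + σ2², not σ1² − σ2².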
Example 5.1. If xi is a random observation from the distribution N(μ, σ²), what is the
distribution of X̄ = (x1 + x2 + ... + xn)/n?
Solution: X̄ is a linear function of the xi given by X̄ = (x1 + x2 + ... + xn)/n. From equations 5.2 and
5.3 and the reproductive properties of the normal distribution, X̄ is normally distributed with mean

$$\mu_{\bar X} = \sum_{i=1}^{n} \frac{1}{n}\,\mu = \mu$$

and variance

$$\sigma_{\bar X}^2 = \sum_{i=1}^{n} \frac{1}{n^2}\,\sigma^2 = \frac{\sigma^2}{n}$$
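STANDARD NORMAL DISTRIBUTION
If X is N(μ, σ²) and Z = (X − μ)/σ, the random variable Z will be N(0, 1). This is a special case
of a + bX with a = −μ/σ and b = 1/σ. The random variable Z is said to be standardized (has
μ = 0 and σ² = 1) and N(0, 1) is said to be the standard normal distribution. The standard normal
distribution is given by

$$p_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \qquad -\infty < z < \infty \tag{5.5}$$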
Fig. 5.3. Standard normal distribution (μ = 0, σ² = 1).
Figure 5.3 shows the standard normal distribution which along with the transformation Z =
(X − μ)/σ contains all of the information shown in figures 5.1 and 5.2. Both pZ(z) and PZ(z) are
widely tabulated. Most tables utilize the symmetry of the normal distribution so that only
positive values of Z are shown. Tables of PZ(z) may show prob(Z < z), prob(0 < Z < z), or
prob(−z < Z < z). Care must be exercised when using normal probability tables to see what values are
tabulated. The table of PZ(z) in the appendix gives prob(Z < z). There are many routines
programmed into computer software to evaluate the normal pdf and cdf. Some approximations for
the standard normal distribution are given below.
A table of PZ(z) shows that 68.26% of the normal distribution is within 1 standard deviation
of the mean, 95.44% within 2 standard deviations of the mean, and 99.74% within 3 standard
deviations of the mean. These are called the 1, 2, and 3 sigma bounds of the normal distribution.
The fact that only 0.26% of the area of the normal distribution lies outside the 3 sigma bound
demonstrates that the probability of a value less than μ − 3σ is only 0.0013 and is the
justification for using the normal distribution in some instances even though the random variable under
consideration may be bounded by X = 0. If μ is greater than 3σ, the chance that X is less than
zero is often negligible (this is not always true, however).
Example 5.2. Compare the 1, 2, and 3 sigma bounds under the assumption of normality and
under no distributional assumptions using Chebyshev's inequality.
Solution: The 1, 2, and 3 sigma bounds of N(μ, σ²) contain 68.26, 95.44, and 99.74% of the
distribution. Thus, the probability that X deviates more than σ, 2σ, and 3σ from μ is 0.3174,
0.0456, and 0.0026, respectively.
Chebyshev's inequality states that prob(|X − μ| > kσ) < 1/k². This corresponds to a
probability that X deviates more than σ, 2σ, and 3σ from μ of less than 1.00, less than 0.25, and
less than 0.11, respectively.
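The comparison in example 5.2 can be reproduced without tables. The sketch below evaluates the standard normal cdf through the error function in the Python standard library; `normal_cdf` is an illustrative helper name, and the computed values differ from table-based values only in the last digit.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative probability prob(Z < z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


outside_normal = {}
chebyshev_bound = {}
for k in (1, 2, 3):
    # two-sided probability of falling more than k standard deviations from the mean
    outside_normal[k] = 2.0 * (1.0 - normal_cdf(k))
    # distribution-free upper bound from Chebyshev's inequality
    chebyshev_bound[k] = 1.0 / k**2
    print(f"k={k}: normal {outside_normal[k]:.4f}, Chebyshev bound {chebyshev_bound[k]:.4f}")
```

The gap between the two columns is the price of making no distributional assumption: at 3 standard deviations the Chebyshev bound is roughly 40 times the normal probability.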
Example 5.3. As an example of using tables of the normal distribution, consider a sample drawn
from an N(15, 25). What is the prob(15.6 < X < 20.4)?
Solution: The desired probability is

$$\text{prob}(15.6 < X < 20.4) = \int_{15.6}^{20.4} \frac{1}{5\sqrt{2\pi}}\, e^{-(x-15)^2/50}\, dx$$

However, this integral is difficult to evaluate. Making use of the standard normal distribution, we
can transform the limits on X to limits on Z and then use standard normal tables:

$$z_1 = \frac{15.6 - 15}{5} = 0.12, \qquad z_2 = \frac{20.4 - 15}{5} = 1.08$$

From the standard normal table PZ(1.08) = 0.860 and PZ(0.12) = 0.548. The desired
probability is 0.860 − 0.548, or 0.312.
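The table lookup in example 5.3 can be mirrored in code. This is a minimal sketch that assumes the cdf is evaluated with the error function rather than read from a table; `prob_between` is an illustrative name.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative probability prob(Z < z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_between(lower: float, upper: float, mean: float, variance: float) -> float:
    """prob(lower < X < upper) for X distributed N(mean, variance)."""
    std = math.sqrt(variance)
    z_lower = (lower - mean) / std   # transform the limits on X to limits on Z
    z_upper = (upper - mean) / std
    return normal_cdf(z_upper) - normal_cdf(z_lower)


p_example = prob_between(15.6, 20.4, mean=15.0, variance=25.0)
print(f"prob(15.6 < X < 20.4) = {p_example:.3f}")
```

The second parameter of N(15, 25) is the variance, so the standard deviation used in the transformation is 5, not 25.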
Let y = −ln(2p), where p = PZ(z). For 0.005 < PZ(z) < 0.5, an approximation for z is given by
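The expression for z did not survive in this copy of the text. As a hedged illustration only, the sketch below uses a widely quoted rational approximation of this form (commonly attributed to Derenzo); it is an assumption that this is the expression intended here. The variable y is taken as −ln(2p) so that it is positive for p < 0.5, and the function name `approx_z` is hypothetical.

```python
import math


def approx_z(p: float) -> float:
    """Approximate standard normal deviate z with prob(Z < z) = p, for 0 < p < 0.5.

    Assumed form: z = -sqrt( ((4y + 100)y + 205) y^2 / (((2y + 56)y + 192)y + 131) )
    with y = -ln(2p). The negative root is taken because p < 0.5 implies z < 0.
    """
    if not 0.0 < p < 0.5:
        raise ValueError("this approximation is only used for 0 < p < 0.5")
    y = -math.log(2.0 * p)
    numerator = ((4.0 * y + 100.0) * y + 205.0) * y * y
    denominator = ((2.0 * y + 56.0) * y + 192.0) * y + 131.0
    return -math.sqrt(numerator / denominator)


for p in (0.025, 0.10, 0.25):
    print(f"p = {p:.3f}  z = {approx_z(p):.4f}")
```

For p greater than 0.5 the symmetry of the normal distribution can be used: compute z for 1 − p and change its sign.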
An approximation for PZ(z) for positive values of z is given by

$$P_Z(z) = 1 - 0.5 \exp\left[-\frac{(83z + 351)z + 562}{703/z + 165}\right]$$

Of course, for negative values of z, PZ(z) for the absolute value of z can be obtained and then
PZ(z) = 1 − PZ(|z|).
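Example 5.4. Use a normal approximation to determine prob(10.5 < X < 20.4) if X is
distributed N(15, 25).
Solution: The limits on Z are (10.5 − 15)/5 = −0.90 and (20.4 − 15)/5 = 1.08. From the
approximation for PZ(z),

$$P_Z(1.08) = 1 - 0.5 \exp\left[-\frac{(83(1.08) + 351)(1.08) + 562}{703/1.08 + 165}\right] = 0.860$$

Similarly PZ(0.90) = 0.816, so PZ(−0.90) = 1 − 0.816 = 0.184 and the desired probability is
0.860 − 0.184 = 0.676.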
Comment: Often, in solving problems of this type, it is useful to sketch a normal distribution and
then shade in the area corresponding to the desired probability. For this problem the sketch would
be as in figure 5.4.
Fig. 5.4. Prob(−0.9 < Z < 1.08).
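The approximation used in example 5.4 can be coded in a few lines and compared with an error-function evaluation. This is a sketch under the assumption that the approximation has the form 1 − 0.5 exp[−((83z + 351)z + 562)/(703/z + 165)]; the function names are illustrative.

```python
import math


def approx_normal_cdf(z: float) -> float:
    """Rational-exponential approximation to prob(Z < z)."""
    if z < 0.0:
        # symmetry: P(z) = 1 - P(|z|) for negative z
        return 1.0 - approx_normal_cdf(-z)
    if z == 0.0:
        return 0.5  # limiting value; avoids dividing by zero in 703/z
    exponent = ((83.0 * z + 351.0) * z + 562.0) / (703.0 / z + 165.0)
    return 1.0 - 0.5 * math.exp(-exponent)


def exact_normal_cdf(z: float) -> float:
    """Reference value from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mean, std = 15.0, 5.0
z_low, z_high = (10.5 - mean) / std, (20.4 - mean) / std
p_approx = approx_normal_cdf(z_high) - approx_normal_cdf(z_low)
p_exact = exact_normal_cdf(z_high) - exact_normal_cdf(z_low)
print(f"approximate {p_approx:.4f}, reference {p_exact:.4f}")
```

For these limits the approximation and the reference value agree to about three decimal places.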
 
Example 5.5. Repeat example 3.7 assuming the Kentucky River data are normally distributed.
Solution: Since X is assumed normal, X̄ is N(μ, 22,322²/n). Therefore, Z = (X̄ − μ)/(22,322/√n)
is N(0, 1).
From the problem statement |X̄ − μ| ≤ 10,000. So n must be determined so that

$$\text{prob}\left(|Z| \le \frac{10{,}000}{22{,}322/\sqrt{n}}\right) = 0.95$$

From the standard normal table it is seen that 95% of the normal distribution is enclosed by
−1.96 < Z < 1.96. From this n is calculated as

$$1.96 = \frac{10{,}000}{22{,}322/\sqrt{n}}, \qquad n = \left[\frac{1.96(22{,}322)}{10{,}000}\right]^2 = 19.1$$

or at least 19 observations are required to be 95% sure that X̄ is within 10,000 cfs of μ if X is
N(μ, 22,322²).
Comment: By assuming normality, the required minimum number of observations has been
reduced from 100 to 19. The Law of Large Numbers has placed a lower limit on n without
knowledge of the distribution of X. The price for this ignorance of the distribution of X is seen to be
very great if in fact X is normally distributed.
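The two sample sizes compared in the comment can be reproduced as follows. This is a sketch: the normal-theory value uses z = 1.96 as in the example, and the distribution-free value assumes the Chebyshev-type bound n ≥ σ²/(d²(1 − confidence)), which is taken here as the form behind the figure of 100 from example 3.7. The function names are illustrative.

```python
import math

SIGMA = 22_322.0     # standard deviation of the peak flows, cfs
HALF_WIDTH = 10_000.0  # required closeness of the sample mean to the true mean, cfs
CONFIDENCE = 0.95


def n_normal(sigma: float, half_width: float, z: float) -> float:
    """Sample size from the normal distribution of the sample mean."""
    return (z * sigma / half_width) ** 2


def n_distribution_free(sigma: float, half_width: float, confidence: float) -> float:
    """Sample size from a Chebyshev-type bound (assumed form, see lead-in)."""
    return sigma**2 / (half_width**2 * (1.0 - confidence))


n_norm = n_normal(SIGMA, HALF_WIDTH, 1.96)
n_free = n_distribution_free(SIGMA, HALF_WIDTH, CONFIDENCE)
print(f"normal theory: n = {n_norm:.1f}; distribution-free: n = {n_free:.1f}")
```

The normal-theory value evaluates to about 19.1, which the text reports as 19 observations; the distribution-free value is about 99.7, which rounds up to 100.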
In practice, if the Xi are identically and independently distributed, n does not have to be very
large for Sn to be approximated by a normal distribution. If interest lies in the central part of the
distribution of Sn, values of n as small as 5 or 6 will result in the normal distribution producing
reasonable approximations to the true distribution of Sn. If interest lies in the tails of the
distribution of Sn, as it often does in hydrology, larger values of n may be required.
As stated above, the Central Limit Theorem is of limited value in hydrology since most
hydrologic variables are not the sum of a large number of independently and identically
distributed random variables. Fortunately, under some very general conditions it can be shown that if Xi
for i = 1, 2, ..., n is a random variable independent of Xj for j ≠ i and E(Xi) = μi and Var(Xi)
= σi², then the sum Sn = X1 + X2 + ... + Xn approaches a normal distribution with E(Sn) =
Σ μi and Var(Sn) = Σ σi² (sums over i = 1 to n) as n approaches infinity (Thomas 1971). One
condition for this generalized Central Limit Theorem is that each Xi has a negligible effect on the
distribution of Sn (i.e., there cannot be one or two dominating Xi's).
This general theorem is very useful in that it says that if a hydrologic random variable is the
sum of n independent effects and n is relatively large, the distribution of the variable will be
approximately normal. Again, how large n must be depends on the area of interest (central part or
tail of the distribution) and on how good an approximation is needed.
Example 5.6. In the last chapter the gamma distribution for integer values of n was derived
as the sum of n exponentially distributed random variables. The mean and variance of the
exponential distribution are given as 1/λ and 1/λ², respectively. The Central Limit Theorem
gives the mean and variance of the sum of n values from the exponential distribution as n/λ
and n/λ² for large n. This agrees with the mean and variance of the gamma distribution.
In chapter 6, the coefficient of skew of the gamma distribution is given as 2/√n, which
approaches zero as n gets large. Thus, the sum of n random variables from an
exponential distribution is a gamma distribution which approaches a normal distribution (with γ
approaching 0) as n gets large.
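The convergence described in example 5.6 can be examined numerically. The sketch below compares the exact gamma cdf for integer n (closed-form sum) with its normal approximation at the mean and at two standard deviations above the mean; the choice of those two points and the function names are illustrative assumptions.

```python
import math


def erlang_cdf(t: float, n: int, rate: float) -> float:
    """Exact prob(S_n <= t) for the sum of n exponential variables with parameter rate."""
    partial = sum((rate * t) ** k / math.factorial(k) for k in range(n))
    return 1.0 - math.exp(-rate * t) * partial


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def compare(n: int, rate: float = 1.0):
    """Return (skew, exact cdf at mean, exact upper tail, normal upper tail)."""
    mean, std = n / rate, math.sqrt(n) / rate
    skew = 2.0 / math.sqrt(n)
    center_exact = erlang_cdf(mean, n, rate)            # normal approximation gives 0.5
    tail_exact = 1.0 - erlang_cdf(mean + 2.0 * std, n, rate)
    tail_normal = 1.0 - normal_cdf(2.0)
    return skew, center_exact, tail_exact, tail_normal


for n in (5, 30, 100):
    skew, center, tail, tail_norm = compare(n)
    print(f"n={n:3d} skew={skew:.3f} cdf(mean)={center:.4f} "
          f"tail exact={tail:.4f} tail normal={tail_norm:.4f}")
```

The printed rows show both the central value and the tail probability moving toward their normal counterparts as n grows, with the tail converging more slowly in relative terms.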
because the mean of the data is 66,540 cfs and the standard deviation is 22,322 cfs. This integral
is easily evaluated using standard normal tables as 0.0322.
An approximation to the relative frequency in a class interval can also be made by using
equation 2.25b.
Class mark xi (cfs)    Expected relative frequency fxi
 25,000                0.0316
 35,000                0.0659
 45,000                0.1122
 55,000                0.1564
 65,000                0.1783
 75,000                0.1663
 85,000                0.1270
 95,000                0.0793
105,000                0.0405
115,000                0.0169
Sum                    0.9744
For the first class interval, Δxi = 10,000, zi = (25,000 − 66,540)/22,322 = −1.8609, pZ(zi) =
0.0706 (from equation 5.5), and σ is estimated by s = 22,322.
Similar calculations for each of the class intervals are shown in table 5.1, with the results plotted
in figure 5.5. The sum of the expected relative frequencies is not 1 because the entire range of the
normal distribution was not covered.
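The expected relative frequencies can be regenerated from the class marks. The sketch below assumes ten class marks from 25,000 to 115,000 cfs in steps of 10,000 cfs (inferred from the first class mark and the class width quoted above) and evaluates fxi = Δxi pZ(zi)/s for each class.

```python
import math

MEAN = 66_540.0    # sample mean of the peak flows, cfs
STD = 22_322.0     # sample standard deviation, cfs
WIDTH = 10_000.0   # class interval width, cfs

# Assumed class marks: 25,000, 35,000, ..., 115,000 cfs
class_marks = [25_000.0 + WIDTH * i for i in range(10)]


def standard_normal_pdf(z: float) -> float:
    """Equation 5.5, the standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


expected = []
for x in class_marks:
    z = (x - MEAN) / STD
    # relative frequency = class width times normal density of X at the class mark
    expected.append(WIDTH * standard_normal_pdf(z) / STD)

for x, f in zip(class_marks, expected):
    print(f"{x:9,.0f}  {f:.4f}")
print(f"Sum        {sum(expected):.4f}")
```

Because the classes stop at 20,000 and 120,000 cfs, the expected frequencies sum to slightly less than 1, as noted in the text.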
Binomial Distribution
It was stated in chapter 4 that if X is a binomial random variable with parameters n1 and
p and Y is a binomial random variable with parameters n2 and p, then Z = X + Y is a
binomial random variable with parameters n = n1 + n2 and p. Extending this to the sum of
several binomial random variables, the Central Limit Theorem would indicate that the normal
distribution approximates the binomial distribution if n is large. Thus, as n gets large the
distribution of

$$Z = \frac{X - np}{\sqrt{np(1 - p)}}$$

approaches a N(0, 1). This is sometimes known as the De Moivre-Laplace limit theorem (Mood
et al. 1974).
Table 5.2. Corrections for approximating a discrete random variable by a continuous random variable
Discrete    Continuous
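Example 5.7. X is a binomial random variable with n = 25 and p = 0.3. Compare the binomial
and the normal approximation to the binomial for evaluating the prob(5 < X ≤ 8).
Solution: The exact probability from the binomial distribution is

$$\text{prob}(5 < X \le 8) = \sum_{x=6}^{8} \binom{25}{x} (0.3)^x (0.7)^{25-x} = 0.483$$

Using the normal approximation with mean np = 7.5 and variance np(1 − p) = 5.25, the
probability is determined as prob(5.5 < X < 8.5), which is 0.476. Therefore, the exact probability of
0.483 is approximated by the normal to be 0.476 for an n of 25.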
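Example 5.7 can be verified with a short calculation. The sketch below sums the binomial terms for x = 6, 7, 8 and applies the half-unit continuity correction to the normal approximation; the helper names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


n, p = 25, 0.3
# exact: prob(5 < X <= 8) = prob(X = 6) + prob(X = 7) + prob(X = 8)
exact = sum(binomial_pmf(x, n, p) for x in (6, 7, 8))

mean = n * p
std = math.sqrt(n * p * (1.0 - p))
# continuity correction: the discrete values 6..8 map to the interval 5.5 to 8.5
approx = normal_cdf((8.5 - mean) / std) - normal_cdf((5.5 - mean) / std)

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

With unrounded z values the approximation evaluates to about 0.477, close to the 0.476 obtained from tables with z rounded to two decimals.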
Example 5.8. Work example 4.11 using the normal approximation for the negative binomial.
Solution: The desired probability is prob(39.5 < X < 40.5). Using the standard normal
distribution, the limits on Z are
This compares favorably with the 0.0206 computed using the negative binomial.
Poisson Distribution
The sum of two Poisson random variables with parameters λ1 and λ2 is also a Poisson
random variable with parameter λ = λ1 + λ2. Extending this to the sum of a large number of
Poisson random variables, the Central Limit Theorem indicates that for large λ, the Poisson may
be approximated by a normal distribution. In this case the distribution of

$$Z = \frac{X - \lambda}{\sqrt{\lambda}}$$

approaches an N(0, 1). Since the Poisson is the limiting form of the binomial and the binomial
can be approximated by the normal, it is no surprise that the Poisson can also be approximated by
the normal.
Continuous Distributions
Many continuous distributions can be approximated by the normal distribution for certain
values of their parameters. For instance, in example 5.6, it was shown that for large n the gamma
distribution approaches the normal distribution. To make these approximations one merely
equates the mean and variance of the distribution to be approximated to the mean and variance of
the normal and then uses the fact that

$$Z = \frac{X - \mu}{\sigma}$$

is N(0, 1) if X is N(μ, σ²). Not all continuous distributions can be approximated by the normal
and for those that can the approximation is only valid for certain parameter values. Things to
look for are parameters that produce near zero skew, symmetry, and tails that asymptotically
approach pX(x) = 0 as X approaches large and small values. Again, it is emphasized that
approximations in the tails of the distributions may not be as good as in the central region of the
distribution.
Exercises
5.1. Consider sampling from a normal distribution with a mean of 0 and a variance of 1. What is
the probability of selecting (a) an observation between 0.5 and 1.5? (b) an observation outside the
interval −0.5 to +0.5? (c) 3 observations inside and 2 observations outside the interval of 0.5 and
1.5? (d) 4 observations inside the interval 0.5 to 1.5 exactly two of which are not in the interval
0.5 to 1.0?
5.2. What is the probability of selecting an observation at random from an N(100,2500) that is
(a) less than 75? (b) equal to 75?
5.3. For the Kentucky River data of table 2.1, what is the probability of a peak flow exceeding
100,000 cfs if the peaks are assumed to be normally distributed?
5.4. Construct the theoretical distribution for the data of exercise 2.2 if it is assumed that the data
are normally distributed. From a visual comparison with the data histogram, would you say the
data are normally distributed?
5.5. Work exercise 4.1 using the normal approximation to the binomial and plot the results on the
histogram developed for exercise 4.1.
5.8. A sample of 150 observations has a mean of 10,000, a standard deviation of 2,500 and is
normally distributed. Plot a frequency histogram showing the number of observations expected
in each interval.
5.9. The appendix contains a listing of the annual runoff from Cave Creek watershed near Fort
Spring, Kentucky. What is the probability that the true mean annual runoff is less than 14.0 in. if
one can assume the true variance is 22.56 in.²? What other assumptions are needed?
5.10. Random digits are the numbers 0, 1, 2, ..., 9 selected in such a fashion that each is equally
likely (i.e., has probability 1/10 of being selected). An experiment is performed by selecting 5
random digits, adding them together and calling their sum X. The experiment is repeated 10
times and X is calculated. What is the probability that X is less than 21.5? (Exercise 13.9 requires
that this experiment be carried out.)
5.11. Plot the individual terms of the Poisson distribution for λ = 2. Approximate the Poisson
by the normal and plot the normal approximations on the same graph.
5.13. Assume the data of exercise 4.21 are normally distributed. (a) Within each month what is
the probability of 10 or more rainy days? (b) What is the probability of 20 or more rainy days in
the July-August period? (c) What is the difference in assuming the data are normally
distributed, and in assuming the data are binomially distributed and approximating the binomial with
the normal?
5.14. Plot the observed frequency histogram and the frequency histogram expected from the
normal distribution for the annual peak flows for the following rivers. Discuss how well the normal
approximates the data in terms of the coefficient of variation and skewness. (Note: data are in the
appendix or may be obtained from the Internet.)
5.15. The occurrence of rainstorms is sometimes considered to be a Poisson process so that the
time between rainstorms is exponentially distributed. If for a certain locality the mean of this
exponential distribution is 10 days, what is the probability that the elapsed time for 15 storms to
occur will exceed 120 days?
5.16. Lane and Osborn (1973) present the following data for the mean number of days with more
than 0.10 inches of precipitation at Tombstone, Arizona. If the occurrence of more than 0.10
inches of rain in any month can be considered as an independent Poisson process, what is the
probability of fewer than 30 days with more than 0.10 inches of rain in one year at Tombstone?
Jan. 2 July 7
Feb. 2 Aug. 7
Mar. 2 Sept. 3
Apr. 1 Oct. 2
May 0 Nov. 2
June 2 Dec. 2

Total 32
5.17. An experimenter is measuring the water level in an experimental towing channel. Because
of waves and surges, a single measurement of the water level is known to be inaccurate. Past
experience indicates the variance of these measurements is 0.0025 ft². How many independent
observations are required to be 90% confident that the mean of all the measurements will be
within 0.02 feet of the true water level?
5.18. At a certain location the annual precipitation is approximately normally distributed with
a mean of 45 in. and a standard deviation of 15 in. Annual runoff can be approximated by
R = 7.5 + 0.5P where R is annual runoff and P is annual precipitation. What is the mean and
variance of annual runoff? What is the probability that the annual runoff will exceed 20 in.?
5.19. Plot a frequency distribution for a mixture of two normal distributions. Use as the first
distribution an N(0, 1) and as the second an N(1, 1). Use as values for the mixing parameter 0.2,
0.5, and 0.8.
6. Continuous Probability Distributions
THERE ARE many continuous probability distributions in addition to the normal
distribution. This chapter covers some of these distributions, methods for estimating their parameters,
properties of the distributions, and potential applications for them. Further discussion on
distribution selection is contained in chapter 7. Other books may be consulted for more detailed treatment
of the various distributions (Kececioglu, 1991). Rao and Harned (2000) is particularly applicable
to hydrology.
UNIFORM DISTRIBUTION
If a continuous random process is defined over an interval α to β and the probability of an
outcome of this process being in a subinterval of α to β is proportional to the length of the
subinterval, the process is said to be uniformly distributed over the interval α to β (figure 6.1). The
probability density function for the continuous uniform distribution is

$$p_X(x) = \frac{1}{\beta - \alpha} \qquad \text{for } \alpha < x < \beta$$

and the cumulative distribution function is

$$P_X(x) = \frac{x - \alpha}{\beta - \alpha} \qquad \text{for } \alpha < x < \beta$$
The mean and variance of the uniform distribution are E(X) = (β + α)/2 and Var(X) = (β − α)²/12.
The skewness is zero since the distribution is symmetrical about the mean. The method of
moments yields the following estimators for the parameters α and β:

$$\hat\alpha = \bar x - \sqrt{3}\, s, \qquad \hat\beta = \bar x + \sqrt{3}\, s$$

The method of maximum likelihood when applied to the uniform distribution results in
the estimators for α and β being the smallest and largest sample values, respectively. That this
is the case can be seen by writing out the likelihood function and then selecting those values of
α and β (within the constraints that α < X < β for all X) that maximize the function.
The uniform distribution finds its greatest application as the distribution of PX(x) for all
continuous probability density functions. That is, prob(PX(X) < y) = y for 0 < y < 1, so PX(X) is
uniformly distributed over the interval 0 to 1 for any continuous probability distribution. This fact
is used in generating random observations from some probability distributions.
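The use of the uniform distribution to generate random observations can be sketched with the exponential distribution, whose cumulative distribution 1 − exp(−λx) inverts in closed form. The choice of the exponential, the rate value, and the function names below are illustrative assumptions, not part of the text.

```python
import math
import random


def exponential_cdf(x: float, rate: float) -> float:
    return 1.0 - math.exp(-rate * x)


def exponential_inverse_cdf(u: float, rate: float) -> float:
    """Solve u = 1 - exp(-rate * x) for x, with 0 <= u < 1."""
    return -math.log(1.0 - u) / rate


RATE = 2.0                 # hypothetical exponential parameter
rng = random.Random(0)     # seeded so the run is repeatable

# Each uniform(0, 1) value is treated as a cumulative probability and
# mapped back to the x value that has that cumulative probability.
sample = [exponential_inverse_cdf(rng.random(), RATE) for _ in range(10_000)]
sample_mean = sum(sample) / len(sample)
print(f"sample mean = {sample_mean:.3f} (theoretical mean = {1.0 / RATE:.3f})")
```

The same idea applies to any distribution whose cumulative distribution can be inverted, either analytically or numerically.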
Example 6.1. Use the method of moments to estimate the parameters of the uniform distribution
based on the following sample: 1, 4, 3, 4, 5, 6, 7, 6, 9, 5. What are the maximum likelihood
estimators for this sample?
116 CHAPTER 6
Solution: By method of moments

x̄ = 5.00 and s = 2.21
By maximum likelihood
α̂ = 1.00 (smallest sample value)
β̂ = 9.00 (largest sample value)
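The estimates for example 6.1 can be computed directly from the listed sample. The sketch below assumes the method of moments estimators x̄ ∓ √3 s, which follow from the mean (α + β)/2 and variance (β − α)²/12 of the uniform distribution, and uses the n − 1 form of the sample standard deviation.

```python
import math
import statistics

sample = [1, 4, 3, 4, 5, 6, 7, 6, 9, 5]

mean = statistics.mean(sample)
std = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)

# method of moments: match the mean and variance of the uniform distribution
alpha_mom = mean - math.sqrt(3.0) * std
beta_mom = mean + math.sqrt(3.0) * std

# maximum likelihood: smallest and largest sample values
alpha_ml = min(sample)
beta_ml = max(sample)

print(f"mean = {mean:.2f}, s = {std:.2f}")
print(f"moments: alpha = {alpha_mom:.2f}, beta = {beta_mom:.2f}")
print(f"max likelihood: alpha = {alpha_ml}, beta = {beta_ml}")
```

The method of moments interval is narrower than the observed range here, so it places zero probability on the two observed extremes, 1 and 9; the maximum likelihood estimates can never do that.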
TRIANGULAR DISTRIBUTION
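The triangular distribution shown in figure 6.2 is given by

$$p_X(x) = \begin{cases} \dfrac{2(x-\alpha)}{(\beta-\alpha)(\delta-\alpha)} & \alpha \le x \le \delta \\[2ex] \dfrac{2(\beta-x)}{(\beta-\alpha)(\beta-\delta)} & \delta \le x \le \beta \end{cases} \tag{6.6}$$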
It is unlikely that any natural hydrologic process would exactly follow a triangular distribution.
The distribution may be a reasonable approximation to the actual but unknown distribution of
some hydrologic quantities. The triangular distribution has been used in simulation studies
involving bounded random variables whose central tendencies are known.
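The mean, variance, and coefficient of skew of the triangular distribution are

$$E(X) = \frac{\alpha + \beta + \delta}{3}$$

$$\text{Var}(X) = \frac{\alpha^2 + \beta^2 + \delta^2 - \alpha\beta - \alpha\delta - \beta\delta}{18}$$

$$\gamma = \frac{\sqrt{2}\,(\alpha + \beta - 2\delta)(2\alpha - \beta - \delta)(\alpha - 2\beta + \delta)}{5\,(\alpha^2 + \beta^2 + \delta^2 - \alpha\beta - \alpha\delta - \beta\delta)^{3/2}}$$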
Fig. 6.2. Triangular distribution (here y is δ of equation 6.6).
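The parameter δ gives the mode of the triangular distribution. If δ is known, the parameters
α and β may be estimated based on the method of moments as

$$\hat\alpha = \frac{-B - \sqrt{B^2 - 4AC}}{2A}, \qquad \hat\beta = \frac{-B + \sqrt{B^2 - 4AC}}{2A}$$

where
A = 3
B = 3δ − 9x̄
C = 9x̄² − 9δx̄ + 3δ² − 18s²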
Some special cases of the triangular distribution yield the following estimators:
Mode    α̂    β̂
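The method of moments estimators for a known mode can be sketched as the two roots of a quadratic with the coefficients A, B, and C listed above. This form is derived here from the mean (α + β + δ)/3 and variance (α² + β² + δ² − αβ − αδ − βδ)/18 of the triangular distribution and is an assumption about how the text's estimators are written; the function name is illustrative.

```python
import math


def triangular_moment_estimates(mean: float, variance: float, mode: float):
    """Estimate the lower and upper limits of a triangular distribution.

    Both limits satisfy 3x^2 + (3*mode - 9*mean)x + C = 0, so the smaller
    root is the lower limit and the larger root is the upper limit.
    """
    a = 3.0
    b = 3.0 * mode - 9.0 * mean
    c = 9.0 * mean**2 - 9.0 * mode * mean + 3.0 * mode**2 - 18.0 * variance
    discriminant = b * b - 4.0 * a * c
    if discriminant < 0.0:
        raise ValueError("moments are not consistent with the stated mode")
    root = math.sqrt(discriminant)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)


# Check with a known case: limits 0 and 10 with mode 3 have
# mean 13/3 and variance 79/18, so the estimator should return (0, 10).
alpha_hat, beta_hat = triangular_moment_estimates(13.0 / 3.0, 79.0 / 18.0, 3.0)
print(f"alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}")
```

With sample data, the sample mean and variance take the place of the population values, and the estimated limits should be checked to confirm that the assumed mode lies between them.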
EXPONENTIAL DISTRIBUTION
In chapter 4 it was shown that the exponential distribution arises as the probability
distribution of the time between occurrences of events of a Poisson process. Among other things, the
exponential distribution has been used as the distribution of the time between rainfall events in
stochastic precipitation models. The exponential density function is given by

$$p_X(x) = \lambda e^{-\lambda x}, \qquad x > 0,\; \lambda > 0$$

with mean 1/λ and variance 1/λ². The coefficient of skew is a constant, 2, indicating the
exponential is skewed to the right for all values of λ. The curve labeled η = 1 in figure 6.4 is an
exponential distribution with λ = 1. Examples 3.2 and 3.4 demonstrated that when either the
method of moments or maximum likelihood is used for parameter estimation, the result is

$$\hat\lambda = 1/\bar x \tag{6.15}$$
Example 6.2. Haan and Johnson (1967) studied the physical characteristics of depressions in
north-central Iowa. The data tabulated below show the number of depressions falling into
various classes based on the surface area of the depression. Plot a relative frequency histogram of the
data. Superimpose on the histogram the best fitting exponential distribution. Estimate the
probability that a depression selected at random will have an area greater than 2.25 acres.
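Area (acres)    No. of depressions
0.0 to 0.5      106
0.5 to 1.0       36
1.0 to 1.5       18
1.5 to 2.0        9
2.0 to 2.5       12
2.5 to 3.0        2
3.0 to 3.5        5
3.5 to 4.0        1
4.0 to 4.5        4
4.5 to 5.0        5
5.0 to 5.5        2
5.5 to 6.0        6
6.0 to 6.5        3
6.5 to 7.0        1
7.0 to 7.5        1
7.5 to 8.0        1
Total           212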
Solution: The relative frequencies are computed by dividing the number of depressions in each
class by the total number of depressions. The best fitting exponential is estimated by using
equation 6.15 to estimate the exponential parameter $\lambda$. $\bar{X}$ is calculated from equation 3.16 as 1.27
acres. Then $\hat{\lambda} = 1/\bar{X} = 0.787$. The expected relative frequency in each class is then calculated
from equation 2.25b as

$$f_{x_i} = \Delta x_i \, p_X(x_i)$$

where $x_i$ is the midpoint of the class interval, $\Delta x_i = 1/2$, and $p_X(x_i)$ is the exponential distribution
of area given by

$$p_X(x_i) = \hat{\lambda} e^{-\hat{\lambda} x_i}$$

Therefore

$$f_{x_i} = (1/2)\,0.787\, e^{-0.787 x_i}$$

For example, for the second class interval

$$f_{0.75} = 0.393\, e^{-0.787(0.75)} = 0.22$$

The observed fraction of depressions with areas in excess of 2.25 acres is 31/212, or 0.146.
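The calculations of example 6.2 can be sketched in Python as follows. This is a minimal illustration, not the author's procedure: the class midpoints are assumed to be the values shown on the axis of figure 6.3 (class width 1/2 acre), the variable and function names are hypothetical, and the unrounded mean is used, so the results differ slightly from values based on the rounded mean of 1.27 acres.

```python
import math

# Class midpoints (acres) and observed counts for example 6.2.
# Assumption: midpoints are those on the axis of Fig. 6.3, class width 1/2 acre.
midpoints = [0.25 + 0.5 * i for i in range(16)]
counts = [106, 36, 18, 9, 12, 2, 5, 1, 4, 5, 2, 6, 3, 1, 1, 1]
width = 0.5

n_total = sum(counts)
# Grouped-data mean: sum of midpoint * relative frequency.
mean_area = sum(x * c for x, c in zip(midpoints, counts)) / n_total
lam = 1.0 / mean_area  # moment / maximum likelihood estimate of lambda


def expected_rel_freq(x):
    """Expected relative frequency of the class centred on x (width * pdf)."""
    return width * lam * math.exp(-lam * x)


def prob_exceed(x):
    """prob(X > x) = exp(-lambda * x) for the exponential distribution."""
    return math.exp(-lam * x)


observed = [c / n_total for c in counts]
expected = [expected_rel_freq(x) for x in midpoints]

print(f"mean area = {mean_area:.3f} acres, lambda = {lam:.3f}")
for x, o, e in zip(midpoints, observed, expected):
    print(f"{x:5.2f}  observed {o:.3f}  expected {e:.3f}")
print(f"prob(area > 2.25 acres) = {prob_exceed(2.25):.3f}")
```

The fitted exceedance probability can then be compared with the observed fraction 31/212 quoted above.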
Fig. 6.3. Observed and expected (according to the exponential distribution) number of depressions
in various size categories for example 6.2. (Horizontal axis: area in acres, class midpoints 0.25
to 7.75.)
GAMMA DISTRIBUTION
The distribution of the sum of n exponentially distributed random variables each with
parameter $\lambda$ is a gamma distribution with parameters $\eta = n$ and $\lambda$. In general, $\eta$ does not have to
be an integer. A comprehensive treatment of the gamma distribution and other distributions in the
gamma family of distributions is given by Bobee and Ashkar (1991). The gamma density function
is given by

$$p_X(x) = \frac{\lambda^{\eta} x^{\eta - 1} e^{-\lambda x}}{\Gamma(\eta)} \qquad x, \lambda, \eta > 0$$

where $\Gamma(\eta)$ is the gamma function, which has the properties

$$\Gamma(\eta + 1) = \eta\, \Gamma(\eta) \qquad \text{for } \eta > 0$$

$$\Gamma(\eta) = \int_0^{\infty} t^{\eta - 1} e^{-t}\, dt \qquad \text{for } \eta > 0$$

The mean, variance and coefficient of skew for the gamma distribution are

$$E(X) = \eta/\lambda \qquad \mathrm{Var}(X) = \eta/\lambda^2 \qquad \gamma = 2/\sqrt{\eta}$$

The gamma distribution is positively skewed with $\gamma$ decreasing as $\eta$ increases. Plots of the
distribution for various values of $\eta$ and $\lambda$ are shown in figure 6.4. A wide variety of shapes
ranging from reverse J-shaped for $\eta < 1$ to single peaked with the peak (mode) at $x = (\eta - 1)/\lambda$ for
$\eta > 1$ can be produced by the gamma density function. Changing $\lambda$ and holding $\eta$ constant
changes the scale of the distribution, whereas changing $\eta$ and holding $\lambda$ constant changes the
shape of the distribution. Thus, $\lambda$ and $\eta$ are sometimes known as scale and shape parameters.
The cumulative gamma distribution is

$$P_X(x) = \int_0^x \frac{\lambda^{\eta} t^{\eta - 1} e^{-\lambda t}}{\Gamma(\eta)}\, dt$$

Some computer spreadsheets will evaluate $P_X(x)$ for the gamma distribution.
[Fig. 6.4. Gamma pdf.]
where $\bar{X}_G$ is the sample geometric mean and $\psi(x) = d \ln \Gamma(x)/dx$ is the psi function. Thom
(1958) has proposed an approximate relationship based on the truncation of a series expansion of
the maximum likelihood estimator for $\eta$ given by

$$\hat{\eta} = \frac{1 + \sqrt{1 + 4y/3}}{4y} - \Delta\hat{\eta} \qquad (6.24)$$

where $y$ is $\ln \bar{X} - \overline{\ln X}$, $\Delta\hat{\eta}$ is a correction term arising because of the truncation, and $\overline{\ln X}$ is the
mean natural logarithm of the observations. Table 6.1 contains the values of $\Delta\hat{\eta}$ for $\hat{\eta}$ ranging
from 0.2 to 5.6. For $\hat{\eta} > 5.6$ the correction is negligible (as it is anyway for many practical
situations regardless of the value of $\hat{\eta}$). The procedure for finding the correction factor is to assume
that $\hat{\eta}$ is equal to the first term of equation 6.24 and use the $\Delta\hat{\eta}$ from table 6.1 corresponding to
this initial estimate for $\hat{\eta}$.
Table 6.1. Correction factor for the maximum likelihood estimator for the parameter $\eta$ of the
gamma distribution
The parameter $\lambda$ is then estimated by

$$\hat{\lambda} = \hat{\eta}/\bar{X} \qquad (6.25)$$
Thom (1958) states that for $\eta < 10$ the method of moments produces unacceptable estimates
for both $\lambda$ and $\eta$. For $\eta$ near 1 the method of moments uses only 50% of the sample information
for estimating $\lambda$ and only 40% for $\eta$. This means the maximum likelihood estimators
would do as well with one half the number of observations.
Greenwood and Durand (1960) present the following rational fraction approximations for
the maximum likelihood estimators

$$\hat{\eta} = (0.5000876 + 0.1648852y - 0.0544274y^2)/y \qquad (6.26)$$

for $0 \le y \le 0.5772$ and

$$\hat{\eta} = \frac{8.898919 + 9.059950y + 0.9775373y^2}{y(17.79728 + 11.968477y + y^2)} \qquad (6.27)$$

for $0.5772 \le y \le 17.0$, where

$$y = \ln \bar{X} - \overline{\ln X}$$

$\lambda$ is then estimated from equation 6.25. Greenwood and Durand (1960) state that the maximum
error in equation 6.26 is 0.0088% and in equation 6.27 is 0.0054%.
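The approximate maximum likelihood procedure can be sketched in Python as below. This is a hedged illustration only: the coefficients are those usually quoted for the Greenwood and Durand (1960) rational fractions, the Thom function returns only the first (uncorrected) term of his approximation because the table 6.1 correction is not available here, and all function names are hypothetical.

```python
import math


def log_mean_gap(data):
    """y = ln(arithmetic mean) - mean of ln(x); requires every x > 0."""
    n = len(data)
    mean = sum(data) / n
    mean_log = sum(math.log(x) for x in data) / n
    return math.log(mean) - mean_log


def eta_thom_uncorrected(y):
    """First term of Thom's approximation (no table correction applied)."""
    return (1.0 + math.sqrt(1.0 + 4.0 * y / 3.0)) / (4.0 * y)


def eta_greenwood_durand(y):
    """Rational-fraction approximation to the ML estimate of eta.

    Coefficients assumed from the commonly quoted form of the
    Greenwood and Durand (1960) approximation.
    """
    if 0.0 < y <= 0.5772:
        return (0.5000876 + 0.1648852 * y - 0.0544274 * y * y) / y
    if 0.5772 < y <= 17.0:
        num = 8.898919 + 9.059950 * y + 0.9775373 * y * y
        den = y * (17.79728 + 11.968477 * y + y * y)
        return num / den
    raise ValueError("y is outside the range covered by the approximation")


def fit_gamma(data):
    """Return (eta_hat, lambda_hat) with lambda_hat = eta_hat / sample mean."""
    y = log_mean_gap(data)
    eta = eta_greenwood_durand(y)
    lam = eta / (sum(data) / len(data))
    return eta, lam


# Hypothetical sample, used only to show the call pattern.
sample = [12.1, 9.8, 20.3, 15.6, 11.2, 17.9, 8.4, 14.0]
print(fit_gamma(sample))
```

The estimates returned by `fit_gamma` are not corrected for small-sample bias.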
Equations 6.24-6.27 produce estimates for $\eta$ and $\lambda$ that have a slight asymptotic bias. For
small samples the bias may be appreciable (Shenton and Bowman 1970). Bowman and Shenton
(1968) present the following approximate relationship for estimating the bias in the parameter $\eta$
when equations 6.24-6.27 are used.

$$E(\hat{\eta} - \eta) \cong \frac{3\eta - 0.677 + 0.111/\eta + 0.032/\eta^2}{n - 3} \qquad \text{for } n \ge 4 \text{ and } \eta \ge 1 \qquad (6.29)$$

where $E(\hat{\eta} - \eta)$ is the bias in $\hat{\eta}$, with an error of less than 1.4%. The result of using this relationship
for estimating the bias in $\hat{\eta}$ for a sample size n from a gamma distribution having a population
parameter of $\eta = 2$ is shown in figure 6.5. In practice, equation 6.29 can be used to correct $\hat{\eta}$ for
bias. If the population $\eta$ were known, there would of course be no need for estimating $\eta$.
Bowman and Shenton (1968) suggest that the bias in $\hat{\eta}$ can be approximated from

$$E(\hat{\eta} - \eta) \cong 3\hat{\eta}/n$$

which yields the bias-corrected estimate

$$(n - 3)\hat{\eta}/n \qquad (6.30)$$
The gamma distribution has been widely used in hydrology (Bobee and Ashkar 1991).
Rainfall probabilities for durations of days, weeks, months, and years have been estimated by the
gamma distribution.
[Fig. 6.5. Bias in $\hat{\eta}$ as a function of sample size n for $\eta = 2$.]
Example 6.3. The annual water yield for Cave Creek near Fort Spring, Kentucky (USGS #
03288500) is shown in the following table. Estimate the parameters of the gamma distribution for
these data using both the method of moments and the method of maximum likelihood. Assuming
the data follow a gamma distribution, estimate the probability of an annual water yield
exceeding 20.00 inches.
Annual Runoff Annual Runoff
Year (inches) Year (inches)
Thus, the maximum likelihood estimators are $\hat{\lambda} = 0.485$ and $\hat{\eta} = 7.107$. These estimates
may be corrected for bias using either equation 6.29 or 6.30. If 6.30 is used

$$\hat{\eta} = (18 - 3)(7.107)/18 = 5.922$$

If $\eta = 5.922$ is substituted into equation 6.29, the result is $E(\hat{\eta} - \eta) = 1.141$, which is in
good agreement with the 1.185 produced by equation 6.30. The final estimate for $\eta$ is now
$\hat{\eta} = 5.922$ and $\hat{\lambda} = \hat{\eta}/\bar{X} = 0.404$. Using the method of moments the parameter estimates are
$\hat{\eta} = 9.513$ and $\hat{\lambda} = 0.649$, whereas the maximum likelihood estimates are $\hat{\eta} = 5.922$ and $\hat{\lambda} =
0.404$. Following the recommendation of Thom (1958), the latter estimates will be used in
estimating the probability of an annual water yield in excess of 20.00 inches.

$$P_X(20.00) = \int_0^{20.00} \frac{\lambda^{\eta} x^{\eta - 1} e^{-\lambda x}}{\Gamma(\eta)}\, dx = 0.824$$

Thus $1 - P_X(20.00)$ is 0.176, which is the desired probability. The prob(yield > 20.00) = 0.176
if the annual water yield follows a gamma distribution with parameters $\eta = 5.922$ and $\lambda =
0.404$. In these calculations Microsoft Excel 97 was used to evaluate the gamma distribution.
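The bias correction and the exceedance probability of example 6.3 can be checked with a short Python sketch. It is an illustration under stated assumptions: the bias formula is the reading of equation 6.29 that reproduces the 1.141 quoted in the example, the simple correction is taken as $(n - 3)\hat{\eta}/n$ because that reproduces the quoted 5.922, and the cumulative gamma distribution is evaluated with a standard power series rather than a spreadsheet function. Function names are hypothetical.

```python
import math


def gamma_cdf(x, eta, lam, tol=1e-12, max_terms=10_000):
    """P_X(x) for the density lam**eta * x**(eta-1) * exp(-lam*x) / Gamma(eta).

    Uses the power series for the regularized lower incomplete gamma function:
    P(eta, z) = z**eta * exp(-z) / Gamma(eta) * sum_k z**k / (eta (eta+1) ... (eta+k)).
    """
    if x <= 0.0:
        return 0.0
    z = lam * x
    term = 1.0 / eta
    total = term
    for k in range(1, max_terms):
        term *= z / (eta + k)
        total += term
        if term < tol * total:
            break
    return total * math.exp(-z + eta * math.log(z) - math.lgamma(eta))


def bias_eq_6_29(eta, n):
    """Approximate bias E(eta_hat - eta); assumed form of equation 6.29."""
    return (3.0 * eta - 0.677 + 0.111 / eta + 0.032 / eta**2) / (n - 3)


n = 18                 # number of observations in example 6.3
eta_ml = 7.107         # uncorrected maximum likelihood estimate from the text
eta_corrected = (n - 3) * eta_ml / n   # simple bias correction (assumed form)
lam_corrected = 0.404  # value reported in the text for eta_hat / X-bar

p_exceed = 1.0 - gamma_cdf(20.0, eta_corrected, lam_corrected)

print(f"corrected eta      = {eta_corrected:.3f}")
print(f"bias from eq. 6.29 = {bias_eq_6_29(5.922, n):.3f}")
print(f"prob(yield > 20)   = {p_exceed:.3f}")
```

The printed probability should be close to the 0.176 reported in the example; small differences reflect rounding of the parameter estimates.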
Comment: If the moment parameter estimates had been used, the resulting probability would
have been 0.132, which is reasonably close to 0.176. This is because $\hat{\eta}$ is reasonably close to the
10.00 that Thom (1958) suggested is the smallest value of $\eta$ for which the method of moments
results in good parameter estimates. For these data $C_s = 2/\sqrt{\hat{\eta}} = 0.82$, so that the distribution is
moderately skewed to the right. If the normal distribution had been used to estimate prob(X >
20.00), the result would have been 0.126, which again is a reasonable approximation. However,
if the annual water yield with a return period of 100 years or a 1% chance of being exceeded is
evaluated by the gamma with $\eta = 5.922$ and $\lambda = 0.404$ and by the normal with $\mu = 14.56$ and
$\sigma = 4.75$, the results are 32.2 inches and 25.6 inches, again showing the sensitivity of estimates
of rare events to the distributional assumption even though in the main body of the distribution the
agreement is good.
Generally, 18 observations are not enough to make reliable probability estimates or to
determine the proper probability distribution to use. It is a small enough number, however, that
one can follow through all of the needed calculations for this example in a short time on a desk
calculator. The fact that the gamma and normal estimates differ greatly for these data at large return
periods does not mean the gamma (or the normal) is a better approximation for the data. This
question will be taken up later. Exercise 6.21 should be consulted for another approximate
solution to this example.
LOGNORMAL DISTRIBUTION
The Central Limit Theorem was used in deriving the general result that if a random variable
X is made up of the sum of many small effects, then X might be expected to be normally
distributed. Similarly, if X is equal to the product of many small effects, that is if $X = X_1 X_2 \cdots X_n$, then
the logarithm of X, ln X, can be expected to be normally distributed. This can be seen by letting
Y = ln X so that $Y = \ln(X_1 X_2 \cdots X_n) = \ln X_1 + \ln X_2 + \cdots + \ln X_n$. Because the $X_i$ are random
variables, the $\ln X_i$ are also random variables and Y = ln X is a random variable made up from
the sum of many other random variables. From the Central Limit Theorem, Y can be expected to
be normally distributed with mean $\mu_y$ and variance $\sigma_y^2$.

$$p_Y(y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left[-\frac{(y - \mu_y)^2}{2\sigma_y^2}\right] \qquad (6.31)$$

Because Y = ln X

$$p_X(x) = p_Y(y)\left|\frac{dy}{dx}\right| = \frac{p_Y(y)}{x}$$

and

$$p_X(x) = \frac{1}{x\sqrt{2\pi\sigma_y^2}} \exp\left[-\frac{(\ln x - \mu_y)^2}{2\sigma_y^2}\right] \qquad x > 0 \qquad (6.32)$$

Note that equation 6.31 gives the distribution of Y as a normal distribution with mean $\mu_y$
and variance $\sigma_y^2$. Equation 6.32 gives the distribution of X as the lognormal distribution with
parameters $\mu_y$ and $\sigma_y^2$. Y = ln X is normally distributed while X is lognormally distributed.
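The parameters $\mu_y$ and $\sigma_y^2$ can be estimated by $\bar{Y}$ and $S_y^2$ in the usual manner by first
transforming all of the $x_i$'s to $y_i$'s by

$$y_i = \ln x_i$$

then

$$\bar{Y} = \sum y_i / n$$

and

$$S_y^2 = \frac{\sum y_i^2 - n\bar{Y}^2}{n - 1}$$

with all of the summations from 1 to n. If a digital computer is used the above equations are
easily applied. $\bar{Y}$ and $S_y^2$ may be determined without taking the logarithms of all of the data from

$$\bar{Y} = \frac{1}{2} \ln\left[\frac{\bar{X}^2}{C_v^2 + 1}\right] \qquad S_y^2 = \ln(C_v^2 + 1)$$

where $C_v$ is the coefficient of variation of the original data ($C_v = S_x/\bar{X}$). These relationships are
not general results but depend on data being lognormally distributed.
The mean, variance, and coefficient of variation of the lognormal distribution are

$$\mu_x = \exp(\mu_y + \sigma_y^2/2) \qquad \sigma_x^2 = \mu_x^2[\exp(\sigma_y^2) - 1] \qquad C_v = [\exp(\sigma_y^2) - 1]^{1/2}$$

and the coefficient of skew is $\gamma = 3C_v + C_v^3$.
Thus, the lognormal distribution is positively skewed with the skew decreasing as the coefficient
of variation decreases. Based on the properties of the normal distribution, the skewness of the
logarithms of lognormal data is zero.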
Tables of the standard normal distribution can be used to evaluate the lognormal distribution.
From equation 6.32 we have $p_X(x) = p_Y(y)/x$. But $p_Y(y)$ is a normal density function. From
equation 5.7, $p_Y(y) = p_Z(z)/\sigma_y$ or

$$p_X(x) = \frac{p_Z(z)}{x\,\sigma_y} \qquad (6.41)$$

The prob(X ≤ x) is equal to the prob(Y ≤ y) because Y = ln X is a monotonic, single-valued
function. Since Y is normally distributed, prob(Y ≤ y) = prob(Z ≤ z) where

$$z = \frac{y - \mu_y}{\sigma_y} = \frac{\ln x - \mu_y}{\sigma_y}$$

Therefore, standard normal tables can be used with the proper transformations to evaluate $p_X(x)$
and $P_X(x)$ for the lognormal distribution.
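The transformations described above can be sketched in Python, with the standard library's `statistics.NormalDist` standing in for the standard normal tables. The relations used for the parameters (variance of the logarithms equal to $\ln(C_v^2 + 1)$ and the corresponding mean) are the standard lognormal results and are stated here as assumptions; the function names are hypothetical, and the numerical inputs are the illustrative mean of 50,000 and standard deviation of 25,000 from exercise 6.13.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def lognormal_params_from_moments(mean_x, cv_x):
    """Return (mu_y, sigma_y) of Y = ln X from the mean and Cv of X."""
    var_y = math.log(cv_x**2 + 1.0)
    mu_y = 0.5 * math.log(mean_x**2 / (cv_x**2 + 1.0))
    return mu_y, math.sqrt(var_y)


def lognormal_pdf(x, mu_y, sigma_y):
    """p_X(x) = p_Z(z) / (x * sigma_y) with z = (ln x - mu_y) / sigma_y."""
    z = (math.log(x) - mu_y) / sigma_y
    return STD_NORMAL.pdf(z) / (x * sigma_y)


def lognormal_cdf(x, mu_y, sigma_y):
    """prob(X <= x) = prob(Z <= z)."""
    z = (math.log(x) - mu_y) / sigma_y
    return STD_NORMAL.cdf(z)


def lognormal_quantile(p, mu_y, sigma_y):
    """Value x with prob(X <= x) = p, found by transforming the normal quantile."""
    z = STD_NORMAL.inv_cdf(p)
    return math.exp(mu_y + z * sigma_y)


mu_y, sigma_y = lognormal_params_from_moments(mean_x=50_000.0, cv_x=0.5)
x100 = lognormal_quantile(0.99, mu_y, sigma_y)  # magnitude exceeded with probability 0.01
print(f"mu_y = {mu_y:.4f}, sigma_y = {sigma_y:.4f}")
print(f"prob(X <= 50,000) = {lognormal_cdf(50_000.0, mu_y, sigma_y):.3f}")
print(f"0.99 quantile     = {x100:,.0f}")
```

The same quantile function applies to return-period calculations, since an event with return period T has a nonexceedance probability of 1 - 1/T.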
Certain reproductive properties of the lognormal follow directly from the reproductive
properties of the normal distribution. For example, if X is lognormally distributed then $Y = aX^b$
is lognormally distributed with

$$\mu_{\ln Y} = \ln a + b\,\mu_{\ln X}$$

and

$$\sigma^2_{\ln Y} = b^2 \sigma^2_{\ln X}$$
and
Two special cases of the above are if Z = XY and Z = X/Y with X and Y being independently
and lognormally distributed, then Z is lognormally distributed with its mean and variance
easily determined from equations 6.43 and 6.44.
Because of its simplicity, its ready availability in tables for its evaluation, and the fact that
many hydrologic variables are bounded by zero on the left and positively skewed, the lognormal
distribution has received wide usage in hydrology.
Example 6.4. Use the lognormal distribution and calculate the expected relative frequency for
the third class interval of the data in table 5.1.
The evaluation of $p_X(x)$ from equation 6.41 requires estimates for $\mu_y$ and $\sigma_y$. These are
estimated from equations 6.35 and 6.36.
or the expected relative frequency in the interval 40,000 to 50,000 according to the lognormal
distribution is
Example 6.5. Assume the data of table 5.1 follow the lognormal distribution. Calculate the
magnitude of the 100-year peak flow.
Solution: The 100-year peak flow corresponds to a prob(X > x) of 0.01. X must be evaluated
such that $P_X(x) = 0.99$. This can be accomplished by evaluating Z such that $P_Z(z) = 0.99$ and
then transforming to X. From the standard normal tables the value of Z corresponding to $P_Z(z)$ of
0.99 is 2.326. From equation 6.37
Example 6.6. Assume that the time between rains follows an exponential distribution with a
mean of 4 days. Also assume that the time between rains is independent from one rain to the next.
Irrigators may be interested in the maximum time between rains. Over a period of 10 rains, what
is the probability that the maximum time between rains exceeds 8 days?
Solution: 10 rains means 9 interrain periods, or n = 9. From equation 6.45 the probability that
the maximum interrain time is less than 8 days is

$$[P_X(8)]^9 = (1 - e^{-8/4})^9 = (0.865)^9 = 0.271$$

Therefore, the probability that the maximum interrain time will be greater than 8 days is
1 - 0.271 = 0.729.
Comment: The probability density function for the maximum interrain time, y, is from equation 6.46

$$p(y) = n[P_X(y)]^{n-1} p_X(y) = n(1 - e^{-y/4})^{n-1} (1/4) e^{-y/4}$$

This distribution is plotted in figure 6.6 for various values of n. Note that for even moderately
large n, the probability is very high that the extreme value (longest interrain time) will be from
the tail of the parent (exponential) distribution.
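A short Python sketch of example 6.6 follows. It assumes the usual result for the largest of n independent, identically distributed values (cumulative distribution equal to the parent cumulative distribution raised to the power n); the function names are hypothetical. Because the code carries full precision, its results agree with the hand calculation only to within rounding.

```python
import math


def cdf_max_exponential(y, mean, n):
    """prob(largest of n independent exponential values <= y)."""
    return (1.0 - math.exp(-y / mean)) ** n


def pdf_max_exponential(y, mean, n):
    """Density of the largest of n independent exponential values."""
    lam = 1.0 / mean
    parent_cdf = 1.0 - math.exp(-lam * y)
    parent_pdf = lam * math.exp(-lam * y)
    return n * parent_cdf ** (n - 1) * parent_pdf


# Example 6.6: mean interrain time 4 days, 10 rains -> 9 interrain periods.
p_less = cdf_max_exponential(8.0, mean=4.0, n=9)
p_more = 1.0 - p_less
print(f"prob(max interrain time < 8 days) = {p_less:.3f}")
print(f"prob(max interrain time > 8 days) = {p_more:.3f}")

# Density of the maximum for several sample sizes, evaluated at y = 8 days.
for n in (1, 2, 5, 9, 20):
    print(n, round(pdf_max_exponential(8.0, 4.0, n), 4))
```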
Fig. 6.6. Distribution of the largest sample value from a sample of size n from an exponential
distribution.
Frequently the parent distribution from which the extreme is an observation is not known
and cannot be determined. If the sample size is large, use can be made of certain general
asymptotic results that depend on limited assumptions concerning the parent distribution to find the
distribution of extreme values. Much of the work on extreme value distributions is due to
Gumbel (1954, 1958). Three types of asymptotic distributions have been developed based on
different (but not all) parent distributions. The types are:
a. Type I: parent distribution unbounded in direction of the desired extreme and all
moments of the distribution exist (exponential type distributions).
b. Type II: parent distribution unbounded in direction of the desired extreme and not all
moments of the distribution exist (Cauchy type distributions).
c. Type III: parent distribution bounded in the direction of the desired extreme (limited
distributions).
Interest may exist in either the distribution of the largest or smallest extreme values.
Examples of parent distributions falling under the various types are:
c. Type II extreme value largest or smallest: Cauchy distribution (Hahn and Shapiro
1967; Thomas 1971)
d. Type III extreme value largest: beta distribution (Hahn and Shapiro 1967; Gibra 1973;
Benjamin and Cornell 1970)
$$p_X(x) = \frac{1}{\alpha} \exp\left[\mp\frac{x - \beta}{\alpha} - \exp\left(\mp\frac{x - \beta}{\alpha}\right)\right]$$

where the - applies for maximum values and the + for minimum values. The parameters $\alpha$ and
$\beta$ are scale and location parameters with $\beta$ being the mode of the distribution. The type I for
maximum and minimum values are symmetrical with each other about $\beta$. Figure 6.7 is a plot of the
distributions for $\alpha = 3,897$ and $\beta = 7,750$.
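The mean and variance of the extreme value type I distribution are

$$E(X) = \beta + \gamma_e \alpha \quad \text{(maximum)}$$

$$E(X) = \beta - \gamma_e \alpha \quad \text{(minimum)}$$

$$\mathrm{Var}(X) = \frac{\pi^2 \alpha^2}{6} \quad \text{(both)}$$

where $\gamma_e$ is Euler's constant having a value of 0.577216.
[Fig. 6.7. Type I extreme value distributions for the largest and smallest values; horizontal axis
X (1000s).]
The skewness coefficient is

$$\gamma = 1.1396 \quad \text{(maximum)}$$

$$\gamma = -1.1396 \quad \text{(minimum)}$$

With the transformation $y = (x - \beta)/\alpha$ the density becomes

$$p_Y(y) = \exp[\mp y - \exp(\mp y)]$$

where the - applies for the maximum values and the + for the minimum values. The cumulative
distribution is

$$P_Y(y) = \exp[-\exp(-y)] \quad \text{(maximum)}$$

$$P_Y(y) = 1 - \exp[-\exp(y)] \quad \text{(minimum)} \qquad (6.53)$$

The designation "double exponential" distribution follows from these equations. The cumulative
distributions for maximum and minimum values are related by

$$P_{\min}(y) = 1 - P_{\max}(-y)$$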
The parameters of the type I extreme value distribution can be estimated in a number of
ways. Lowery and Nash (1970) compared several methods and concluded that the method of
moments was as satisfactory as other methods. If the method of moments is used, the estimators
are

$$\hat{\alpha} = \frac{\sqrt{6}}{\pi} S$$

and

$$\hat{\beta} = \bar{X} - \gamma_e \frac{\sqrt{6}}{\pi} S \quad \text{(maximum)}$$

$$\hat{\beta} = \bar{X} + \gamma_e \frac{\sqrt{6}}{\pi} S \quad \text{(minimum)}$$

The maximum likelihood estimators (Lowery and Nash 1970) can be determined by a
simultaneous solution to the equations
Unfortunately, these equations cannot be easily solved explicitly for $\hat{\alpha}$ and $\hat{\beta}$, so that a numerical
solution is required.
The type I extreme value distribution for maximums has been used to define the "mean
annual flood". The probability that an observation from this distribution will exceed the mean of the
distribution is $1 - P_Y(y)$ where $P_Y(y)$ is evaluated from equation 6.53 for $y = (\mu - \beta)/\alpha$. Since
$\mu = E(X) = \beta + \gamma_e \alpha$ (equation 6.48), we simply have that $y = \gamma_e$ and $P_Y(y) = 0.5703$. The
probability of a value in excess of the mean is $1 - P_Y(y) = 0.4297$. The return period of a flood
equal in magnitude to the mean is

$$T = \frac{1}{1 - P_Y(y)} = 2.33 \text{ years}$$

Often the "mean annual flood" refers to a flood with a return period of 2.33 years.
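The 2.33-year result can be reproduced with a few lines of Python. The sketch assumes the method of moments estimators for the type I maximum distribution in their usual form (scale equal to $\sqrt{6}\,S/\pi$, location equal to the mean minus Euler's constant times the scale); the sample mean and standard deviation below are hypothetical, and the return period of the mean does not depend on them.

```python
import math

EULER_GAMMA = 0.577216  # Euler's constant


def gumbel_max_moment_estimates(mean, std):
    """Method of moments estimates (alpha, beta) for the type I maximum."""
    alpha = math.sqrt(6.0) * std / math.pi
    beta = mean - EULER_GAMMA * alpha
    return alpha, beta


def gumbel_max_cdf(x, alpha, beta):
    """P_X(x) = exp(-exp(-(x - beta) / alpha)) for the type I maximum."""
    return math.exp(-math.exp(-(x - beta) / alpha))


def return_period(p_nonexceed):
    """Return period in years for an annual nonexceedance probability."""
    return 1.0 / (1.0 - p_nonexceed)


# Hypothetical annual flood statistics, used only for illustration.
mean_flood, std_flood = 10_000.0, 5_000.0
alpha, beta = gumbel_max_moment_estimates(mean_flood, std_flood)

p_mean = gumbel_max_cdf(mean_flood, alpha, beta)
T_mean = return_period(p_mean)
print(f"alpha = {alpha:.1f}, beta = {beta:.1f}")
print(f"P(X <= mean) = {p_mean:.4f}, return period of the mean = {T_mean:.2f} years")
```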
The parameters of the Weibull distribution can be estimated by the method of moments by
substituting the sample mean and variance for the population mean and variance respectively in
equations 6.62 and 6.63 and then solving the two equations simultaneously for $\hat{\alpha}$ and $\hat{\beta}$.
The maximum likelihood estimates can be determined by letting
and
Either method of parameter estimation is difficult. Exercise 6.18 provides a method for
simplifying the solution of the moment equations.
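Fig. 6.8. Examples of extreme value type III minimum (Weibull) density curves.
The Weibull probability density function can range from a reverse-J with $\alpha < 1$, to an
exponential with $\alpha = 1$ and to a nearly symmetrical distribution (figure 6.8) as $\alpha$ increases. If the
lower bound on the parent distribution is not zero, a displacement parameter must be added to the
type III extreme value distribution for minimums so that the density function becomes

$$p_X(x) = \alpha (x - \varepsilon)^{\alpha - 1} (\beta - \varepsilon)^{-\alpha} \exp\{-[(x - \varepsilon)/(\beta - \varepsilon)]^{\alpha}\} \qquad (6.68)$$

The cumulative distribution is $P_X(x) = 1 - e^{-y}$ with $y = [(x - \varepsilon)/(\beta - \varepsilon)]^{\alpha}$, so that
tables of $e^{-y}$ can be used to determine $P_X(x)$. Equation 6.68 is sometimes known as the 3-parameter
Weibull distribution, or as the bounded exponential distribution.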
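The mean and variance of the three parameter Weibull distribution are

$$E(X) = \varepsilon + (\beta - \varepsilon)\,\Gamma(1 + 1/\alpha) \qquad (6.70)$$

and

$$\mathrm{Var}(X) = (\beta - \varepsilon)^2 [\Gamma(1 + 2/\alpha) - \Gamma^2(1 + 1/\alpha)] \qquad (6.71)$$

The coefficient of skew is again given by equation 6.64. Through algebraic manipulation,
equations 6.70 and 6.71 can be put in the form (Gumbel 1958)

$$\beta = \mu + \sigma A(\alpha) \qquad (6.72)$$

$$\varepsilon = \beta - \sigma B(\alpha) \qquad (6.73)$$

where

$$A(\alpha) = [1 - \Gamma(1 + 1/\alpha)]\, B(\alpha) \qquad (6.74)$$

$$B(\alpha) = [\Gamma(1 + 2/\alpha) - \Gamma^2(1 + 1/\alpha)]^{-1/2} \qquad (6.75)$$

The moment estimates for $\alpha$, $\beta$, and $\varepsilon$ can now be obtained by 1) solving equation 6.64 for
$\hat{\alpha}$, 2) solving 6.74 and 6.75 for $A(\hat{\alpha})$ and $B(\hat{\alpha})$, 3) solving 6.72 for $\hat{\beta}$, and 4) solving 6.73 for $\hat{\varepsilon}$.
Table 6.2 can be used to simplify the calculations.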
Example 6.7. The minimum annual daily discharges on a stream are found to have an average of
125 cfs, a standard deviation of 50 cfs, and a coefficient of skew of 1.4. Using both the type III
minimum and the type I minimum extreme value distributions, evaluate the probability of an
annual minimum flow being less than 100 cfs.
Type I minimum
Comment: The results of applying these two distributions to this problem are very different. This
should be expected as it is a situation where the type I for minimums would not be expected to
apply because there would be a lower bound and because the coefficient of skew was given as 1.4
whereas the coefficient of skew for the type I minimum is -1.1396.
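The two calculations of example 6.7 can be sketched numerically in Python. This is an illustration under assumptions rather than the tabular procedure of the text: the type I minimum is fitted by moments, the type III (3-parameter Weibull) shape parameter is found by bisection on the standard Weibull skew expression instead of table 6.2, and the functions A and B are taken in the form that makes the fitted mean and standard deviation match the sample values. All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.577216  # Euler's constant


def type1_min_cdf(x, mean, std):
    """prob(X <= x) for a type I minimum distribution fitted by moments."""
    alpha = math.sqrt(6.0) * std / math.pi
    beta = mean + EULER_GAMMA * alpha
    return 1.0 - math.exp(-math.exp((x - beta) / alpha))


def weibull_skew(alpha):
    """Coefficient of skew of the Weibull distribution with shape alpha."""
    g1, g2, g3 = (math.gamma(1.0 + k / alpha) for k in (1, 2, 3))
    return (g3 - 3.0 * g1 * g2 + 2.0 * g1**3) / (g2 - g1**2) ** 1.5


def solve_alpha(skew, lo=0.5, hi=3.0, iters=100):
    """Bisection for alpha; the skew decreases as alpha increases on [lo, hi].

    Only valid for a target skew between weibull_skew(hi) and weibull_skew(lo).
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weibull_skew(mid) > skew:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def type3_min_fit(mean, std, skew):
    """Moment estimates (alpha, beta, epsilon) of the 3-parameter Weibull."""
    alpha = solve_alpha(skew)
    g1 = math.gamma(1.0 + 1.0 / alpha)
    g2 = math.gamma(1.0 + 2.0 / alpha)
    b = 1.0 / math.sqrt(g2 - g1**2)   # B(alpha), assumed form
    a = (1.0 - g1) * b                # A(alpha), assumed form
    beta = mean + std * a
    eps = beta - std * b
    return alpha, beta, eps


def type3_min_cdf(x, alpha, beta, eps):
    """prob(X <= x) for the 3-parameter Weibull with lower bound eps."""
    if x <= eps:
        return 0.0
    return 1.0 - math.exp(-(((x - eps) / (beta - eps)) ** alpha))


alpha, beta, eps = type3_min_fit(mean=125.0, std=50.0, skew=1.4)
p_type3 = type3_min_cdf(100.0, alpha, beta, eps)
p_type1 = type1_min_cdf(100.0, mean=125.0, std=50.0)
print(f"type III: alpha = {alpha:.3f}, beta = {beta:.1f}, epsilon = {eps:.1f}")
print(f"prob(Q < 100 cfs): type III = {p_type3:.3f}, type I = {p_type1:.3f}")
```

Because the shape parameter is solved numerically rather than read from a table, the type III result may differ slightly from a hand calculation.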
Discussion
The theory on which the extreme value distributions depend is not as strong as the Central
Limit Theorem for the normal distribution. More assumptions concerning the underlying or
parent distribution must be made and the rate of convergence to an asymptotic extreme value
distribution may be rather slow. However, the extreme value distributions do provide a
connection between observed extreme events and models that may be used to evaluate the probabilities
of future extreme events.
The conditions under which the various extreme value distributions arise are such that for
many parent distributions (lognormal, gamma) the distribution of maximum values and the
distribution of minimum values are not of the same type. The minimum values from a lognormal
would be expected to follow the type III distribution while the maximum values would follow a
type I distribution.
Various types of extreme value distributions are related. The logarithms of a random
variable that follows a type III minimum are distributed as the type I minimum extreme value
distribution. Chow (1954) has shown that if the coefficient of variation of the type I maximum extreme
value distribution is 0.364, the distribution is practically the same as the lognormal distribution
with the same coefficient of variation and coefficient of skew (1.139).
The Gumbel distribution is obtained when $\kappa = 0$. For $|\kappa| < 0.3$, the general shape of the GEV is
similar to the Gumbel extreme value distribution with some differences in the right tail. The
parameters $\xi$, $\alpha$, and $\kappa$ are location, scale, and shape parameters. For $\kappa > 0$, the distribution has
a finite upper bound at $\xi + \alpha/\kappa$ and corresponds to the extreme value type III distribution for
maximums that are bounded on the right. The moments of the GEV are
for $\kappa > 0$ then $x < \xi + \alpha/\kappa$; for $\kappa < 0$ then $x > \xi + \alpha/\kappa$
and may be estimated by equations 3.71. The L-moments, $\lambda_i$, may then be estimated by
equations 3.74.
The parameters of the GEV in terms of L-moments are:
where
where $P_X(x_p)$ is the cdf of X. In chapter 7 an example of the use of the GEV for flood frequency
analysis is given.
BETA DISTRIBUTION
A distribution that has both an upper and lower bound is the beta distribution. Generally, the
beta distribution is defined over the interval 0 to 1. It can, however, be transformed to any interval a
to b. If the limits of the distribution are unknown, they become parameters of the distribution,
making it a 4-parameter rather than a 2-parameter distribution. The beta density function is given by

$$p_X(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)} \qquad 0 < x < 1$$

The function $B(\alpha, \beta) = \int_0^1 x^{\alpha - 1}(1 - x)^{\beta - 1}\, dx$ is called the beta function. The beta function is
related to the gamma function by

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$

The beta function is tabulated. The mean and variance of the beta distribution are

$$E(X) = \frac{\alpha}{\alpha + \beta} \qquad \mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$$

The mean and variance can be used to get the moment estimators for $\alpha$ and $\beta$.
PEARSON DISTRIBUTIONS
Karl Pearson (Elderton 1953) has proposed that frequency distributions can be represented by
By choosing appropriate values for the parameters, equation 6.90 becomes a large number of
families of distributions including the normal, beta, and gamma distributions.
The Pearson type III has found application in hydrology especially as the distribution of
logarithms of flood peaks. This distribution can be written

$$p_X(x) = p_0 (1 + x/\alpha)^{\alpha/\delta} e^{-x/\delta} \qquad (6.91)$$

with the mode at $x = 0$. The lower bound of the distribution is $x = -\alpha$. The difference in the
mean and mode is $\delta$ and the value of $p_X(x)$ at the mode is $p_0$. It can be shown that the Pearson type
III is the same as the 3-parameter gamma distribution. By shifting equation 6.91 so that the mode
is at $x = \alpha$ and the lower bound is at $x = 0$, we have

$$p_X(x) = p_0\, e^{-(x - \alpha)/\delta} (x/\alpha)^{\alpha/\delta}$$

The gamma distribution has the mode at $(\eta - 1)/\lambda$ and the mean at $\eta/\lambda$. Thus $\alpha = (\eta - 1)/\lambda$
and $\delta = \eta/\lambda - (\eta - 1)/\lambda = 1/\lambda$. The value of $p_X(x)$ at the mode for the gamma distribution is

$$p_0 = \frac{\lambda (\eta - 1)^{\eta - 1} e^{-(\eta - 1)}}{\Gamma(\eta)}$$
Chi-Square Distribution
If Z is a standardized, normally distributed random variable, $Z = (X - \mu)/\sigma$, then

$$Y = \sum_{i=1}^{n} Z_i^2$$

where Y is the sum of squares of n random values of Z and has a chi-square distribution with n
degrees of freedom. The chi-square distribution is a special case of the gamma distribution when
$\lambda = 1/2$ and $\eta$ is a multiple of 1/2. The distribution thus has a single parameter $\nu = 2\eta$ known as the
degrees of freedom. The expression for the distribution is

$$p_X(x) = \frac{x^{\nu/2 - 1} e^{-x/2}}{2^{\nu/2}\, \Gamma(\nu/2)} \qquad x > 0$$

The mean and variance of the distribution are $\nu$ and $2\nu$, respectively.
The parameter $\nu$ is usually known in any application of the chi-square to statistical testing.
Equation 6.95 produces the moment estimator for $\nu$ as $\hat{\nu} = \bar{X}$. In figure 6.4, the curve labeled
$\lambda = 1/2$ is a chi-square distribution with $\nu = 6$. The coefficient of skew for the chi-square
distribution is $2\sqrt{2/\nu}$. The cumulative chi-square distribution is contained in the appendix in the
form
The t Distribution

$$\mathrm{Var}(T) = \frac{n}{n - 2} \qquad \text{for } n > 2$$
Example 6.8. A sample of size 8 from a normal distribution results in $\bar{X} = 12.7$ and $s^2 = 9.8$.
What is the probability that $\bar{X}$ is in error by more than 1.0?
The desired probability is the area to the right of t = 0.904. By interpolation in the t table,
this value is found to be 0.198. By symmetry, the area to the left of -0.904 is 0.198. The desired
probability is 0.198 + 0.198 = 0.396.
If a standard normal distribution had been used rather than a t distribution, it would have been
necessary to find prob(|Z| > 0.904). This probability can be found from the standard normal table
to be 0.366. Thus, even for a sample as small as 8, the normal is a reasonable approximation.
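The comparison in example 6.8 can be checked numerically. The Python sketch below assumes the statistic is $t = 1.0/(s/\sqrt{n})$ with n - 1 = 7 degrees of freedom and replaces the t table with Simpson integration of the t density; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def t_pdf(t, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1.0 + t * t / dof) ** (-(dof + 1) / 2)


def t_two_sided_tail(t, dof, steps=2000):
    """prob(|T| > t), using Simpson's rule on the density over [0, t].

    steps must be even for Simpson's rule.
    """
    h = t / steps
    s = t_pdf(0.0, dof) + t_pdf(t, dof)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central = 2.0 * s * h / 3.0  # prob(-t < T < t), by symmetry
    return 1.0 - central


n, s2, err = 8, 9.8, 1.0
t_stat = err / math.sqrt(s2 / n)                 # error expressed in standard errors
p_t = t_two_sided_tail(t_stat, n - 1)            # t distribution, 7 degrees of freedom
p_z = 2.0 * (1.0 - NormalDist().cdf(t_stat))     # standard normal approximation

print(f"t = {t_stat:.3f}")
print(f"prob(|T| > t) = {p_t:.3f}, prob(|Z| > t) = {p_z:.3f}")
```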
The F Distribution
If U is a chi-square variate with $\nu_1 = m$ degrees of freedom and V is a chi-square variate
with $\nu_2 = n$ degrees of freedom and U and V are independent, then

$$F = \frac{U/m}{V/n}$$

has an F distribution with $\nu_1 = m$ and $\nu_2 = n$ degrees of freedom (m and n are known as the
numerator and denominator degrees of freedom, respectively). The F distribution is given by
TRANSFORMATIONS
Often a transformation can be made in an attempt to arrive at a probability distribution that
will describe the data. Common transformations are logarithmic transformations, translations
along the x axis, and nth power transformations (for example, n = 1/2, 2, and 3).
We have already made one application of the logarithmic transformation to get the lognormal
distribution from the normal distribution. Other distributions can be transformed by means
of this transformation as well. Benson (1968) and an Interagency Subcommittee on Water Data
(1982) discuss the use of the log-Pearson type III distribution for flood frequencies.
Translations are especially useful in the case of bounded distributions. We made use of a
translation in deriving the 3-parameter extreme value type III for minimums from the
corresponding 2-parameter distribution. In general, a translation is accomplished by subtracting a location
parameter, $\varepsilon$, from the random variable. For example
The fact that $\gamma$ must now be used to estimate $\varepsilon$ means that for small samples accuracy is lost,
because $\gamma$ is based on the third sample moment. As shown earlier, the 3-parameter gamma is the
same as the Pearson type III distribution.
Sangal and Biswas (1970) have used the 3-parameter lognormal distribution obtained by
fitting a normal distribution to the logarithms of $(X - \varepsilon)$ where $\varepsilon$ is a parameter that must be
estimated from the data. They found for 10 Canadian rivers that the 3-parameter lognormal
distribution fit the observed distribution of peak flows. They also state that the Gumbel extreme
value distribution is a special case of the 3-parameter lognormal distribution.
The three parameter lognormal is given by

$$p_X(x) = \frac{1}{(x - \varepsilon)\,\sigma_y \sqrt{2\pi}} \exp\left[-\frac{(\ln(x - \varepsilon) - \mu_y)^2}{2\sigma_y^2}\right]$$

where $\mu_y$ and $\sigma_y^2$ are the mean and variance of $\ln(X - \varepsilon)$.
Stidd (1953) and Kendall (1967) discuss transforming variables by $Y = X^{1/n}$ and then fitting a
normal distribution to Y. They discuss this transformation in terms of precipitation probabilities.
Exercises
6.1. Show that the mean of the uniform distribution is $(\beta + \alpha)/2$ and the variance is
$(\beta - \alpha)^2/12$.
6.3. What are the skewness and coefficient of variation for the exponential distribution?
6.4. Fit the gamma distribution to the data of exercise 2.2. Plot the expected relative frequency
according to the gamma distribution on the plot of exercise 2.2.
6.6. Fit the lognormal distribution to the Kentucky River data of table 2.1. Is this a good
approximation for the data?
6.9. Repeat exercise 6.8 using the type I extreme value distribution for minimums.
6.12. Show that the exponential distribution is memoryless [i.e., show that
$\mathrm{prob}(X \ge t + \tau \mid X > \tau) = \mathrm{prob}(X > t)$].
6.13. Plot the probability density function and the cumulative probability distribution for the
lognormal distribution with $\mu_x = 50,000$ and $\sigma_x = 25,000$.
6.14. Plot the theoretical distribution of the largest value selected from a normal distribution
with $\mu = 4$ and $\sigma = 4$ for sample sizes of n = 2, 5, 9, and 33. Compare the results with those of
example 6.6.
6.15. Derive expressions analogous to equations 6.45 and 6.46 for the smallest of n
independently and identically distributed random variables.
6.17. Assume that during month 1 the mean and standard deviation of the monthly rainfall are
0.750 and 0.433 inches, respectively. Similarly, during month 2 the mean and standard deviation
of monthly rainfall are 3.000 and 0.866 inches, respectively. Assume monthly rainfall amounts
can be approximated by the gamma distribution and that rainfall in month 2 is independent of
rainfall in month 1. What is the probability of receiving more than 3 inches of rain during the
two-month period?
6.18. Show that for the 2-parameter Weibull distribution the parameter $\alpha$ is a function only of the
coefficient of variation. Using this fact, describe a procedure for estimating $\alpha$ and $\beta$ of the
distribution.
6.19. If peak discharge, q, is lognormally distributed with mean $\mu_q$ and variance $\sigma_q^2$, what is the
probability distribution of stage S? Assume stage and discharge are related by $q = aS^b$.
6.20. Work exercise 6.19 assuming the peak discharges are distributed as the type I extreme
value distribution.
6.21. In example 6.3 let $\hat{\eta}$ be approximated by 6.0. Calculate $\hat{\lambda}$ from equation 6.25 and then
evaluate the prob(yield > 20.0) by using the equation following equation 6.21. Compare the
results with those of example 6.3.
6.22. Use the method of moments to estimate the parameters of the 3-parameter lognormal
distribution for the North Llano River near Junction, Texas. What is the return period of a mean
annual flow of 273 cfs or more?
6.23. Calculate the return period associated with an annual runoff of 0.500 inches for Walnut
Gulch near Tombstone, Arizona (Data in Appendix C). Assume (a) lognormal distribution, (b)
gamma distribution, (c) extreme value type I, (d) normal distribution.
6.24. Assume the data of exercise 4.10 are distributed as a 2-parameter exponential distribution.
Estimate the parameters of this distribution and prepare a table comparing the observed and
expected number of floods over the 100-year period.
7. Frequency Analysis
ONE OF the earliest and most frequent uses of statistics in hydrology has been that of
frequency analysis. Early applications of frequency analysis were largely in the area of flood flow
estimation. Today nearly every phase of hydrology is subjected to frequency analysis. Although
most of the discussion in this chapter centers on flood flows or peak flows, the techniques are
generally applicable to a wide range of problems including runoff volumes, low flows, rainfall
events of various kinds, water quality parameters, measures of ground water levels and flows,
and many other environmental variables. The statistical and mathematical manipulations
discussed in this chapter do not depend on the units of measurement or the quantity measured. The
assumptions that are made, however, must be carefully compared to the situation under study.
The goal of a frequency analysis is to estimate the magnitude of an event having a given
frequency of occurrence or to estimate the frequency of occurrence of an event having a given
magnitude. The frequency is often stated in terms of a return period, T, in years, or a probability
of occurrence in any one year, p. Other terminology commonly used includes the estimation of a
"quantile" or "percentile" of the probability distribution of the quantity of interest. The 100pth
percentile is simply the event having a probability, p, of occurring. The term "quantile" is used in
a similar manner. The 90th quantile is the same as the 90th percentile. The 100pth percentile or the
100pth quantile is the value, $x_p$, of the random variable X satisfying

$$P_X(x_p) = p$$

Hydrologic frequency analysis can be made with or without making any distributional
assumptions. The procedure to be followed in either case is much the same. If no distributional
assumptions are made, the observed data are plotted on any kind of paper (not necessarily
probability paper) and judgment used to determine the magnitude of past or future events for various
return periods. If a distributional assumption is made, the magnitude of events for various return
periods is selected from the theoretical "best-fit" line according to the assumed distribution. If an
analytical technique is used, the data should still be plotted so that one can get an idea of how
well the data fit the assumed analytical form and to spot potential problems.
PROBABILITY PLOTTING
Once data for a frequency analysis have been selected, they must be carefully scrutinized to
ensure all of the observations are all valid representations of the hydrologic characteristic under
consideration. For example, in a flood frequency data set consisting of the annual maximum flow,
it is possible that the lower values are merely flows somewhat above the flows for the remainder
of the year but do not truly represent high flows or flood flows. In such a case, some truncation
of low flows might be instituted with the analysis done on the truncated data set and adjusted to
the full record length.
After accepting the data as valid, basic statistics (mean, variance, skewness) of the data
should be computed and the data plotted as a probability plot. Plotting probability density functions
and cumulative probability distributions on arithmetic paper has already been discussed. In
general, when the cumulative distribution function, Px(x), is plotted on arithmetic paper versus
the value of X, a straight line does not result. To get a straight line on arithmetic paper, Px(x)
would have to be given by the expression Px(x) = ax + b or px(x) = a, the uniform distribution.
Thus, if the cumulative distribution of a set of data plots as a straight line on arithmetic paper, the
data follow a uniform distribution. Probability paper can be developed so that any cumulative
distribution can be plotted as a straight line. Generally, the scaling of the probability axis required
to produce a straight line is unique to each probability distribution. The scaling of
the probability axis may even have to change as the parameters of a particular distribution
change. Constructing probability paper is a process of transforming the probability scale so that
the resulting cumulative curve is a straight line. Many types of probability paper are commercially
available, including paper for the normal, lognormal, exponential, certain cases of the
gamma, extreme value (type I), Weibull, and chi-square distributions.
A few computer software packages provide for plotting using a normal distribution probability
scale. Some of the packages will plot probability directly whereas others use the Z transformation
of the normal distribution. The resulting plots are similar. When the Z transformation
is used, the probability associated with the plotted Z values must be independently determined.
The most common probability paper has a normal probability scale and either an arithmetic
(normal probability paper) or logarithmic (lognormal probability paper) scale. Normally distributed
data will plot as a straight line on normal probability paper and lognormally distributed data
will plot as a straight line on lognormal probability paper. One way to determine if data might be
from a normal or lognormal distribution is to plot the data on normal and lognormal probability
paper and visually determine if a straight line is obtained.
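The Z transformation described above can be sketched in a few lines of Python. This is only an illustration: the function name z_for_exceedance is an assumption of the sketch, and the probability is taken to be an exceedance probability, prob(X > x). Plotting magnitude (or its logarithm) against z on arithmetic axes then plays the role of normal (or lognormal) probability paper.

```python
from statistics import NormalDist

def z_for_exceedance(p_exceed):
    """Standard normal variate z such that prob(Z > z) = p_exceed.

    Hypothetical helper for illustration; p_exceed must lie strictly
    between 0 and 1.
    """
    if not 0.0 < p_exceed < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - p_exceed)

# Positions of a few tick marks on a normal probability scale,
# labeled by exceedance probability in percent.
for pct in (1, 10, 50, 90, 99):
    print(pct, round(z_for_exceedance(pct / 100.0), 3))
```

Equal steps in z correspond to unequal steps in probability, which is why the tick marks on normal probability paper crowd together near 50% and spread out in the tails.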
A probability plot is a plot of a magnitude versus a probability. Determining the probability
to assign a data point is commonly referred to as determining the plotting position. For a
population, determining the plotting position is merely a matter of determining the fraction of
the data values less (greater) than or equal to the value in question. Thus the smallest (largest)
population value would plot at 0 and the largest (smallest) population value would plot at 1.00.
Assigning plotting positions to sample data is not as straightforward. Generally, a sample will
not contain the smallest or largest value of the unknown population. Thus, plotting positions of
0 and 1 should be avoided for sample data unless one has additional information on the population
limits.
Plotting position may be expressed as a probability from 0 to 1 or a percent from 0 to 100.
Which method is being used should be clear from the context. In some discussions of
probability plotting, especially in hydrologic literature, the probability scale is used to denote
prob(X > x) or 1 - Px(x). In this book we will adopt this convention. The reason for this is that
the return period, Tx(x), is 1/prob(X > x) = 1/(1 - Px(x)), or the reciprocal of the probability
scale. One can always transform the probability scale from 1 - Px(x) to Px(x) or even Tx(x)
if desired.
Probability plotting of hydrologic data requires that individual observations or data points
be independent of each other and that the sample data be representative of the population (unbiased).
Some common types of sample data are complete duration series, annual series, partial
duration series, and extreme value series.
The complete duration series consists of all available data. An example would be all the
available daily flow data for a stream. This particular data set would most likely not have independent
observations. Complete duration series data are rarely subjected to a standard frequency
analysis because they likely contain significant serial correlation. Since what is generally of
interest are rare events, often only the largest or smallest event over a period of time, generally
one year, is selected. Such a series is known as the annual series. The data in table 2.1 are an
annual series.
The partial duration series consists of all values above (below) a certain base. All peak flows
above 40,000 cfs in the Kentucky River, Salvisa, Kentucky, would represent a partial duration
series. This series may have more or fewer values in it than the annual series. For example, there
would be 9 years that would not have contributed any data to a partial duration series with a base
of 40,000 cfs for the data in table 2.1; however, some years may have more than one peak above
the base. The partial duration series is also known as the 'peaks over threshold' series.
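The difference between the two series can be sketched as follows. The (year, flow) pairs are made-up numbers, the function names are assumptions of this sketch, and the peaks are simply taken to be independent; a real analysis would need a rule for separating peaks that belong to the same flood.

```python
def annual_series(records):
    """Largest flow in each year, from an iterable of (year, flow) pairs."""
    best = {}
    for year, flow in records:
        if year not in best or flow > best[year]:
            best[year] = flow
    return best

def partial_duration_series(records, base):
    """All (year, flow) pairs whose flow is above the base."""
    return [(year, flow) for year, flow in records if flow > base]

# Made-up peak flows (cfs), for illustration only.
peaks = [(1990, 52000), (1990, 41000), (1991, 38000),
         (1992, 47000), (1992, 45000), (1992, 30000)]

print(annual_series(peaks))                   # one value per year
print(partial_duration_series(peaks, 40000))  # every peak above the base
```

In this toy record 1991 contributes nothing to the partial duration series, while 1990 and 1992 each contribute two peaks.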
The annual series and the partial duration series approach one another for long return
periods. Beard (1974) has shown that the relationship between annual series and partial duration
series flood peaks varies throughout the United States and recommends the use of empirically derived,
regionalized relationships. Frequently, the annual series and the partial duration series are
combined so that the largest (smallest) annual value plus all independent values above (below)
some base are used. For periods of record longer than about 10 years, the annual series and the
partial duration series give very similar results.
The extreme value series consists of the largest (smallest) observation in a given time interval.
The annual series is a special case of the extreme value series with the time interval being
one year.
FREQUENCY ANALYSIS 153
Regardless of the type of sample data used, the plotting position can be determined in the
same manner. Gumbel (1958) states the following criteria for plotting position relationships:
1. The plotting position must be such that all observations can be plotted.
2. The plotting position should lie between the observed frequencies of (m - 1)/n and m/n
where m is the rank of the observation beginning with m = 1 for the largest (smallest) value
and n is the number of years of record (if applicable) or the number of observations.
3. The return period of a value equal to or larger than the largest observation and the return
period of a value equal to or smaller than the smallest observation should converge toward n.
5. The plotting position should have an intuitive meaning, be analytically simple, and be easy to use.
Several plotting position relationships are presented in Chow (1964) and Singh (1992). A general
plotting position relationship is given by
$$\mathrm{pp} = \frac{m - a}{n + b} \qquad (7.2)$$
where a and b are constants (Adamowski 1981). Some of the most common relationships for
plotting positions are shown in Table 7.1. Unless specifically stated to the contrary, the Weibull
relationship is used in the remainder of this book. Benson (1962a), in a comparative study of
several plotting position relationships, found on the basis of theoretical sampling from extreme
Cunnane (1978)
of the distribution, so care must be exercised. The quantity 1 - Px(x) represents the probability
of an event with a magnitude equal to or greater than the event in question. When the data are
ranked from the largest (m = 1) to the smallest (m = n), the plotting positions correspond to
1 - Px(x). If the data are ranked from the smallest (m = 1) to the largest (m = n), the plotting
position formulas are still valid; however, the plotting position now corresponds to the
probability of an event equal to or smaller than the event in question, which is Px(x). Probability
paper may contain scales of Px(x), 1 - Px(x), Tx(x), or a combination of these.
Plotting data on probability paper results in an empirical distribution of the data. As an
example of probability plotting, consider the data in table 2.1. The steps in plotting these data are:
1. Rank the data from the largest (smallest) to the smallest (largest) value. If two or more observations
have the same value, several procedures can be used for assigning a plotting position.
The procedure adopted here is to assume they have different values and assign each a unique
rank. For example, in the data of Table 7.2, the value of 82,900 is assigned a rank of both 22
and 23 since it occurs twice in the data set.
3. Select the type of probability paper to be used. Normal probability paper is used in this
example.
The data of Table 2.1 are ranked and the plotting positions calculated based on the Weibull
relationship in Table 7.2. Figure 7.1 presents the plotted data.
[Figure 7.1: plotted data of table 2.1. Horizontal axis: exceedance probability, percent.]
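The ranking and plotting position steps can be sketched as below. The sketch assumes a general relationship of the form pp = (m - a)/(n + b), which gives the Weibull relationship m/(n + 1) when a = 0 and b = 1; the function name and the sample flows are illustrative only. Tied values simply receive consecutive ranks, in line with the procedure adopted above.

```python
def plotting_positions(values, a=0.0, b=1.0):
    """Rank values from largest to smallest and assign plotting positions.

    Returns a list of (value, rank, pp, return_period) tuples, where
    pp = (m - a) / (n + b) estimates prob(X > x) and the return period
    is 1 / pp. The defaults a = 0, b = 1 give the Weibull relationship.
    """
    ranked = sorted(values, reverse=True)
    n = len(ranked)
    rows = []
    for m, x in enumerate(ranked, start=1):
        pp = (m - a) / (n + b)
        rows.append((x, m, pp, 1.0 / pp))
    return rows

# Made-up annual peaks (cfs); the repeated value gets two different ranks.
flows = [82900, 61000, 82900, 105000, 47500]
for x, m, pp, t in plotting_positions(flows):
    print(x, m, round(pp, 4), round(t, 2))
```

Because the data are ranked from largest to smallest, the plotting positions are exceedance probabilities, matching the convention adopted in this chapter.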
Historical Data
Occasionally, flood information outside of the systematic flow record is available from historical
sources such as newspaper reports, earlier flood investigations, or from paleohydrologic
investigations. Such data contain valuable information that should not be ignored in a frequency
analysis. Bulletin 17B of the United States Water Resources Council (1981) demonstrates computing
the plotting position of the historical observations on the basis of the historical record
length. Likewise, the plotting position of the systematic data is computed on the basis of the
historic record length, except that the rank used in the calculation is adjusted by a factor, W,
depending on the historic record length, H, the number of historic flows, Z, and the length of the
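systematic record, N. These are related by
$$W = \frac{H - Z}{N}$$
The adjusted rank, $\tilde{m}$, is
$$\tilde{m} = m \quad \text{for } 1 \le m \le Z$$
$$\tilde{m} = Wm - (W - 1)(Z + 0.5) \quad \text{for } Z + 1 \le m \le Z + N$$
with m being the unadjusted rank of the total record (systematic plus historic).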
Thus, if 20 years of systematic data and 2 historic observations larger than any values in the
systematic record are available from a 50-year period preceding the systematic record, the plotting
position for the 2 largest values would be 1/71 = 0.014 and 2/71 = 0.028. The weighting
factor would be
$$W = \frac{70 - 2}{20} = 3.40$$
The remaining plotting positions would be calculated from the adjusted rank given by
$$\tilde{m} = 3.40m - (3.40 - 1)(2 + 0.5) = 3.40m - 6$$
The adjusted rank is then used in the plotting position relationship (equation 7.2). Thus, for
m = 3 (the largest systematic flow observation), the plotting position using the Weibull plotting
position relationship would be [3.40(3) - 6]/71, or 0.0592, and for m = 22 (the smallest value)
the plotting position would be [3.40(22) - 6]/71, or 0.9690. This compares to plotting positions
of 1/21, or 0.0476, and 20/21, or 0.9523, respectively, if the historic data had been ignored. If
the historic data had simply been used to augment the systematic record without using the
weighting factor, the plotting positions for these two events would have been 1/23, or 0.0435,
and 2/23, or 0.0870, respectively. Clearly, a plotting position of 0.0435 assigns too high a
probability of occurrence to the largest systematic value. Knowledge that there were 48 years
with no flows larger than the two historical events has been ignored in this latter case. It is also
apparent that the weighting procedure adjusts the plotting position toward a more frequent
occurrence for the largest systematic value thus taking into account the fact that two flows greater
in magnitude than the largest systematic flow occurred.
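The numbers in this example can be checked with a short sketch. It assumes the Bulletin 17B adjusted rank, W m - (W - 1)(Z + 0.5), for the systematic observations and the Weibull relationship with the historic record length H in place of n; the function name is hypothetical and the expression should be checked against Bulletin 17B before being relied on.

```python
def historic_plotting_positions(n_systematic, n_historic, historic_length):
    """Weibull plotting positions adjusted for historic information.

    Assumes W = (H - Z) / N, an adjusted rank of W*m - (W - 1)*(Z + 0.5)
    for the systematic values, and pp = adjusted rank / (H + 1).
    Returns W and a dict mapping unadjusted rank m to plotting position.
    """
    N, Z, H = n_systematic, n_historic, historic_length
    W = (H - Z) / N
    positions = {}
    for m in range(1, Z + N + 1):
        if m <= Z:
            m_adj = float(m)                        # historic peaks keep their rank
        else:
            m_adj = W * m - (W - 1.0) * (Z + 0.5)   # systematic values are spread out
        positions[m] = m_adj / (H + 1)
    return W, positions

# 20 systematic years, 2 historic peaks, 70-year historic period.
W, pp = historic_plotting_positions(20, 2, 70)
print(round(W, 2), round(pp[1], 3), round(pp[3], 4), round(pp[22], 4))
```

The sketch reproduces the weighting factor of 3.40 and the plotting positions quoted above for m = 1, 3, and 22.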
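Bulletin 17B also suggests the flow statistics be computed by weighting the contribution of
the systematic record to the various statistics by the factor W. Thus the adjusted mean is
$$\tilde{X} = \frac{W \sum X + \sum X_z}{H}$$
where the X represents the systematic record and Xz the historic data. Similarly, the variance and
skew can be determined from
$$\tilde{s}^2 = \frac{W \sum (X - \tilde{X})^2 + \sum (X_z - \tilde{X})^2}{H - 1}$$
$$\tilde{C}_s = \frac{H}{(H - 1)(H - 2)} \cdot \frac{W \sum (X - \tilde{X})^3 + \sum (X_z - \tilde{X})^3}{\tilde{s}^3}$$
If a log based distribution such as the lognormal or log Pearson III is being used, the X's and Xz's
would be based on logarithms.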
158 CHAPTER 7
Outliers
When probability plots of hydrologic data are made, frequently one or two extreme events
are present that appear to be from a different population because they plot far off of the line
defined by the other points. For example, it is entirely possible that a 100-year event is contained
in 10 years of record. If this is the case, assigning a normal plotting position of 1/11 to this value
would not be reflective of its true return period. Unfortunately, the true return period is not
known. The treatment of these "outliers" is an unresolved and controversial question. The fact
that this occurs frequently in hydrologic data should not be surprising.
Using methods discussed in chapter 4, the probability of at least one occurrence of an n-year
event in a k-year record can be calculated as 1 - (1 - 1/n)^k. For example, the probability of at
least one occurrence of a 100-year event in a 32-year record is 1 - 0.99^32, or 0.275. If we have
four independent 32-year records, we expect one to contain at least one 100-year event. This is
the case even though the 100-year event is from the same population as the other 31 events in the
32-year record.
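A minimal sketch of this calculation follows; the function name is an assumption, and the years are taken to be independent, as the expression requires.

```python
def prob_at_least_one(return_period, record_years):
    """Probability of at least one event with the given return period
    occurring in a record of the given length, assuming independent years:
    1 - (1 - 1/n)^k.
    """
    return 1.0 - (1.0 - 1.0 / return_period) ** record_years

# 100-year event in a 32-year record and in a 10-year record.
print(round(prob_at_least_one(100, 32), 3))
print(round(prob_at_least_one(100, 10), 3))
```

Even a 10-year record has nearly a 1 in 10 chance of containing a 100-year event, which is why apparent outliers are common in short hydrologic records.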
Bulletin 17B suggests that outliers can be identified from
where XH and XL are threshold values for high and low outliers and Kn can be approximated from
where XT is the magnitude of the event having a return period T and KT is a frequency factor. This
relationship comes about by writing any X as
$$X = \bar{X} + \Delta X$$
and then stating that ΔX, the deviation from the mean, is the product of the standard deviation s
and a frequency factor K.
$$X_T = \bar{X} + s K_T \qquad (7.11)$$
KT depends on the probability distribution being used and the return period.
Recalling that Cv = s/X̄, equation 7.11 takes on the form of equation 7.9. Chow (1951,
1964) presents the frequency factors for many different types of frequency distributions.
Equation 7.9 can also be used to construct the probability scale on plotting paper so that the
distribution corresponding to KT plots as a straight line. The use of frequency factors is equivalent
to using the method of moments for estimating the parameters of a pdf.
Normal Distribution
For the normal distribution it can easily be shown that KT is the standardized normal
variate Z. The standard normal distribution, along with equation 7.9, can be used to determine
the magnitude of normally distributed events corresponding to various probabilities. For
example, the magnitude of a 20-year peak flow for the data of table 2.1 can be determined by
calculating
and
The 20-year event corresponds to a prob(X > x) of 0.05, so the probability of an event less
than the 20-year event is 0.95. The value of Z corresponding to a probability of 0.95 is found
from standard normal tables to be 1.645. Thus
$$X_{20} = \bar{X} + 1.645\,s = 103{,}209 \text{ cfs}$$
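The same calculation can be sketched for any return period. The sample mean and standard deviation below are placeholders rather than the table 2.1 statistics, and normal_quantile is a hypothetical name; the only relationships used are KT = Z and XT = X̄ + s KT.

```python
from statistics import NormalDist

def normal_quantile(mean, std_dev, return_period):
    """Magnitude of the T-year event for normally distributed data.

    The frequency factor is the standard normal variate with
    non-exceedance probability 1 - 1/T.
    """
    p_nonexceed = 1.0 - 1.0 / return_period
    k_t = NormalDist().inv_cdf(p_nonexceed)
    return mean + std_dev * k_t

# Placeholder statistics (cfs), for illustration only.
print(round(normal_quantile(1000.0, 200.0, 20), 1))
```

With a mean of 0 and a standard deviation of 1 the function returns the frequency factor itself, about 1.645 for the 20-year event.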
Lognormal Distribution
For the lognormal distribution, the magnitude of a flow with a given return period can be determined
by recalling that the logarithms of the flow are normally distributed. The data are first
converted to their natural logarithms by Y = ln(X). The mean and standard deviation of the logarithms
are then determined. XT is then given by
$$X_T = \exp(\bar{Y} + s_y K_n) \qquad (7.12)$$
where Ȳ and sy are based on the natural logarithms of X, and Kn is from the standard normal
distribution.
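1. Transform the n annual flood magnitudes, Xi, to their logarithmic values, Yi (i.e., Yi = log10 Xi
for i = 1, 2, ..., n).
2. Compute the mean of the logarithms, Ȳ.
3. Compute the standard deviation of the logarithms, sy.
4. Compute the coefficient of skewness of the logarithms, Cs (equation 7.13).
5. Compute
$$\log_{10} X_T = \bar{Y} + K_T s_y$$
where KT is obtained from table 7.3. Note that this relationship is identical to equation 7.9 except
the logarithms are used.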
This method has as a special case the lognormal distribution when Cs = 0. For short periods
of record, the skew coefficient calculated from equation 7.13 may not be a reliable estimate of the
population skew coefficient and it may be desirable to replace it with a regionalized coefficient
(Beard 1962, 1974; Benson 1968). Figure 7.2 contains regionalized skew coefficients of annual
streamflow maximum logarithms computed by the U.S. Geological Survey.
Table 7.3a. KT values for positive skew coefficients, Pearson type III distribution
Table 7.3b. KT values for negative skew coefficients, Pearson type III distribution
The frequency factors of table 7.3 can be used for the Pearson type III distribution in the
same manner as for the log Pearson type III. The actual data values rather than their logarithms
would then be used.
Approximate values of KT for the Pearson type III distribution are given by
$$K_T = \frac{2}{C_s}\left\{\left[\left(K_n - \frac{C_s}{6}\right)\frac{C_s}{6} + 1\right]^3 - 1\right\} \qquad (7.14)$$
where Kn is the standard normal deviate (Interagency Advisory Committee on Water Data
1982). Because of certain limitations on this approximation, the use of the table for KT is recommended.
Obviously, the use of analytic approximations for KT for any of the distributions
makes the calculations for flows of various return periods quite easy using spreadsheets or other
computer software. Table 7.4 contains the maximum percent error in equation 7.14 as compared
to table 7.3. Note that a 1% error in KT does not translate directly to a 1% error in flow. For
example, when the log Pearson type III is used in example problem 7.2, the 100-year flow is
estimated at 29,719 cfs. The skewness of the logarithms was 0.296, so use of equation 7.14 has
a maximum error of 0.09%. With such an error, KT would be 1.0009 × 2.542, or 2.544, and
the resulting flow estimate would be 29,752 cfs, which represents a difference of 0.11% from
the estimate using the table value. This is a very small error when one considers the uncertainties
present in estimates of this kind. Often interpolation has to be done in table 7.3, which may
introduce more error than the use of equation 7.14. Only for Cs < 2.5 is the error in KT > 2%
for T of 50, 100, and 200 years.
Fig. 7.2. Generalized skew coefficients of annual maximum stream flow logarithms.
Table 7.4. Errors in the use of equation 7.14 for estimating KT, log Pearson distribution
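As noted above, an analytic approximation for KT is easy to program. The sketch below assumes the Wilson-Hilferty form used in Bulletin 17B, KT = (2/Cs){[(Kn - Cs/6)(Cs/6) + 1]^3 - 1}; the function name is hypothetical, and the table values remain the recommended source.

```python
from statistics import NormalDist

def pearson3_kt(skew, return_period):
    """Approximate Pearson type III frequency factor.

    Assumes the Wilson-Hilferty approximation with the standard normal
    deviate K_n taken at non-exceedance probability 1 - 1/T.
    """
    k_n = NormalDist().inv_cdf(1.0 - 1.0 / return_period)
    if abs(skew) < 1e-8:
        return k_n                      # zero skew reduces to the normal case
    c = skew / 6.0
    return (2.0 / skew) * (((k_n - c) * c + 1.0) ** 3 - 1.0)

# Skew of 0.296 and a 100-year return period, as in the discussion above.
print(round(pearson3_kt(0.296, 100), 3))
```

For a skew of 0.296 and T = 100 the sketch gives a value close to the 2.542 quoted in the text, and for zero skew it returns the standard normal deviate, consistent with the lognormal distribution being a special case.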
For the extreme value type I distribution, KT is given by
$$K_T = -\frac{\sqrt{6}}{\pi}\left\{\gamma_e + \ln\left[\ln\left(\frac{T_X(x)}{T_X(x) - 1}\right)\right]\right\} \qquad (7.15)$$
where γe is Euler's constant (0.577216) and Tx(x) is the desired return period of the quantity
being calculated. Potter (1949) presents some curves that simplified the application of the
extreme value type I. Kendall (1959) presents the frequency factors shown in table 7.5 for the
extreme value type I distribution. The values computed from equation 7.15 are equivalent to an
infinite sample size in table 7.5.
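A sketch of the infinite-sample frequency factor follows. It assumes the usual extreme value type I (Gumbel) form, KT = -(sqrt(6)/pi){γe + ln[ln(T/(T - 1))]}; the function and constant names are hypothetical, and finite-sample factors such as those of table 7.5 are not reproduced.

```python
import math

EULER_GAMMA = 0.577216  # Euler's constant, as quoted in the text

def ev1_kt(return_period):
    """Extreme value type I frequency factor for an infinite sample size."""
    t = float(return_period)
    if t <= 1.0:
        raise ValueError("return period must exceed 1")
    return -(math.sqrt(6.0) / math.pi) * (
        EULER_GAMMA + math.log(math.log(t / (t - 1.0)))
    )

for t in (2, 10, 100):
    print(t, round(ev1_kt(t), 3))
```

The factor for the 2-year event is slightly negative because the median of this positively skewed distribution lies below its mean.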
GENERAL CONSIDERATIONS
Many proponents (and opponents) of one analytical form or another for flood flow frequencies
have come to the fore over the past few decades. The proponents claim that some particular
method is superior to some other method and "prove" their claim by a few rationalizations and
some case studies. The fact remains that these rationalizations involve questionable assumptions.
There is no direct theoretical connection between any analytical form of the frequency distribution
and the underlying mechanisms governing flood flows except through the limit theorems.
The primary consideration in selecting a particular analytical form for the frequency distribution
is that the distribution "fit" the observed data (Anderson 1967; Benson 1968).
Benson (1968) reported on the results of a study by a work group consisting of 18 representatives
from 12 federal agencies of the U.S. government. This group studied 6 methods of flood
frequency analysis on 10 streams located throughout the United States. The records on these
streams ranged in length from 40 to 97 years with an average of 55 years. The drainage areas
ranged from 16.4 to 36,800 square miles. The six methods of analysis consisted of 1) the gamma
distribution, 2) Gumbel distribution, 3) Gumbel distribution using the logarithms of the data, 4)
lognormal distribution, 5) log Pearson type III distribution, and 6) Hazen's method. The computational
procedures used were much like those presented in this book. The Hazen method consists
of using an equation like equation 7.8 along with a table of empirically derived frequency factors
that are a function of the return period and the coefficient of skew (Hazen 1930). Large differences
were produced by the 6 different methods, especially at long return periods. The results
showed that the lognormal, log Pearson type III, and Hazen methods were about equally good.
The group suggested that the log Pearson type III be used unless there was a good reason to use
some other method. This recommendation was made even though the group realized that "there
are no rigorous statistical criteria on which to base a choice of method". Benson's (1968) report
states that the study showed that "the range of uncertainty in flood analysis, regardless of the
method used, is still quite large" and that many questions concerning it remain unresolved.
In a follow-up study, Beard (1974) examined flood peaks from 300 stations scattered throughout
the United States. Several probability distributions were tried, including the log Pearson type III,
lognormal, Gumbel's extreme value distribution, and the 2- and 3-parameter gamma distributions.
Beard concluded that only the lognormal and log Pearson type III with a regionalized skew
coefficient were not greatly biased in estimating future flood frequencies. He stated that the latter
distribution produced somewhat more consistent results but that "... regardless of the methodology
employed, substantial uncertainty in frequency estimates from station data will exist ...".
In selecting a particular analytical form for a frequency curve, one may be tempted to select
a distribution with a large number of parameters. Generally, the more parameters a distribution
has, the better it will adapt to a set of data. However, for the sample size usually available in
hydrology, the reliability in estimating more than 2 or 3 parameters may be quite low. Thus, a
compromise must be made between flexibility of the distribution and reliability of the parameters.
Recognizing the short record lengths often available for frequency analysis, methods of augmenting
natural data by synthetic data are being developed. In some cases the rainfall record pertaining
to a watershed is much longer than its streamflow record. In this event it may be possible to
calibrate a deterministic streamflow model to the watershed and then use the long rainfall record to
generate a long synthetic streamflow hydrograph. This synthetic hydrograph can then be combined
with existing data into a single frequency analysis. In the absence of rainfall records, it may be possible
to transfer records from a nearby station or to stochastically generate a series of rainfall data.
These data could then be used with the calibrated deterministic model to augment natural streamflow
data. One might consider weighting the natural data more than the augmented data in the final frequency
analysis. Regression and correlation techniques might be used to relate peak flows to rainfall
or to peaks from nearby gages, and this relationship could then be used to extend the available record.
It was because of the many factors and uncertainties that are involved in the selection of a
probability distribution to use in flood frequency determinations, that several agencies of the U.S.
Federal government developed the guidelines published as "Guidelines for Determining Flood
Flow Frequency," commonly known as Bulletin 17B (Interagency Advisory Committee on Water
Data, 1982). Bulletin 17B has become a standard for flood frequency analysis of annual flood
peak discharges.
The developers of Bulletin 17B recognized that "there is no procedure or set of procedures
that can be adopted which, when rigidly applied to the available data, will accurately define that
flood potential of any given watershed. Statistical analysis alone will not resolve all flood frequency
problems." The basic Bulletin 17B approach is to use the log Pearson type III distribution
as explained above. Because this distribution is a 3-parameter distribution, the coefficient of
skew is used when estimating the parameters by the method of moments.
The skew coefficient is sensitive to extreme flood values and thus difficult to estimate from
small samples typically available for many hydrologic studies. Figure 7.2 presents a map of generalized
skew coefficients for the logs of peak flows taken from Bulletin 17B of the Interagency
Committee. The station skew coefficient calculated from observed data and generalized skew coefficients
can be combined to improve the overall estimate for the skew coefficient. Under the assumption
that the generalized skew is unbiased and independent of the station skew, the mean
square error (MSE) of the weighted estimate is minimized by weighting the station and generalized
skew in inverse proportion to their individual mean square errors according to the equation
(Tasker 1978):
$$G_w = \frac{MSE_{\bar{G}}\,G + MSE_G\,\bar{G}}{MSE_{\bar{G}} + MSE_G} \qquad (7.16)$$
where Gw is the weighted skew coefficient, G is the station skew (from equation 7.13), Ḡ is the
generalized skew (from figure 7.2), MSE_Ḡ is the mean square error of the generalized skew, and
MSE_G is the mean square error of the station skew. MSE_Ḡ is taken as a constant, 0.302, when the
generalized skew is estimated from figure 7.2. MSE_G can be estimated from (Wallis, Matalas, and
Slack 1974):
$$MSE_G = 10^{A - B \log_{10}(N/10)} \qquad (7.17)$$
where
A = -0.33 + 0.08|G| if |G| ≤ 0.90 (7.18)
  = -0.52 + 0.30|G| if |G| > 0.90
B = 0.94 - 0.26|G| if |G| ≤ 1.50 (7.19)
  = 0.55 if |G| > 1.50
N = record length
It is recommended that if the generalized and station skew differ by more than 0.5, the data
and flood producing characteristics of the watershed should be examined and possibly greater
weight given to the station skew.
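The weighting can be sketched as below. The sketch assumes the Bulletin 17B expressions: an inverse mean square error weighting of the two skews, MSE of the station skew equal to 10^(A - B log10(N/10)), with A = -0.33 + 0.08|G| for |G| ≤ 0.90 and -0.52 + 0.30|G| otherwise, and B = 0.94 - 0.26|G| for |G| ≤ 1.50 and 0.55 otherwise. Function names and the input values are illustrative only.

```python
import math

def mse_station_skew(station_skew, record_length):
    """Approximate mean square error of the station skew (assumed
    Bulletin 17B form)."""
    g = abs(station_skew)
    a = -0.33 + 0.08 * g if g <= 0.90 else -0.52 + 0.30 * g
    b = 0.94 - 0.26 * g if g <= 1.50 else 0.55
    return 10.0 ** (a - b * math.log10(record_length / 10.0))

def weighted_skew(station_skew, generalized_skew, record_length,
                  mse_generalized=0.302):
    """Weight the station and generalized skews in inverse proportion
    to their mean square errors."""
    mse_g = mse_station_skew(station_skew, record_length)
    return ((mse_generalized * station_skew + mse_g * generalized_skew)
            / (mse_generalized + mse_g))

# Illustrative inputs: station skew 0.296, generalized skew 0.22, 53 years.
print(round(mse_station_skew(0.296, 53), 4))
print(round(weighted_skew(0.296, 0.22, 53), 3))
```

With a 53-year record the station skew has the smaller mean square error, so the weighted value lies closer to the station skew than to the generalized skew.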
CONFIDENCE INTERVALS
Any stream flow record is but a sample of all possible such records. How well the sample
represents the population depends on the sample size and the underlying population probability
distribution, which is unknown. Both the form and parameters of the underlying distribution
must be estimated. If a second sample of data were available, certainly different estimates would
result for the parameters of the distribution even if the same distribution were selected. Different
parameter estimates will obviously result in different return period flow estimates. If many
samples were available, many estimates could be made of the distribution parameters and
consequently many estimates could be made of return period flows, say Q100. One could then
examine the probabilistic behavior of these estimates of Q100. The fraction of the Q100's that fell
between certain limits could be determined.
In actuality, we have just one sample of data from which to make estimates of QT. Statistical
procedures are available for estimating confidence intervals about estimated values of QT that
will give a measure of uncertainty associated with QT. Confidence limits give a probability that
the confidence limits contain the true value for QT. A 90% confidence limit indicates that 90% of
the time intervals so calculated will contain the true value for QT.
Letting LT and UT be the lower and upper confidence limits
$$L_T = \bar{X} + s_x K_{T,L} \qquad U_T = \bar{X} + s_x K_{T,U} \qquad (7.20)$$
where X̄ and sx are the sample mean and standard deviation and KT,L and KT,U are the lower and
upper confidence coefficients. If a distribution like the log Pearson type III distribution is used, X̄
and sx are based on the logarithms of the data and LT and UT are the logarithms of the confidence
limits.
Approximations for KT,L and KT,U based on large samples and the noncentral t distribution are
$$K_{T,L} = \frac{K_T - \sqrt{K_T^2 - ab}}{a} \qquad K_{T,U} = \frac{K_T + \sqrt{K_T^2 - ab}}{a} \qquad (7.21)$$
where
$$a = 1 - \frac{Z_c^2}{2(n - 1)} \qquad (7.22a)$$
$$b = K_T^2 - \frac{Z_c^2}{n} \qquad (7.22b)$$
In these relationships, KT is the frequency factor of equation 7.9 and Zc is the standard normal
deviate with cumulative probability c = 50 + α/2 if the confidence level α is expressed as a percent. If α is 90%,
then c is 95%. The sample size is n. Confidence limits can be placed on frequency curves plotted
on probability paper by making calculations such as above for several values of T.
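A sketch of the confidence coefficient calculation is given below. It assumes the large-sample approximations a = 1 - Zc^2/[2(n - 1)], b = KT^2 - Zc^2/n, and KT,L and KT,U = [KT -/+ sqrt(KT^2 - ab)]/a; the function name is hypothetical and the confidence level is entered as a fraction rather than a percent.

```python
import math
from statistics import NormalDist

def confidence_coefficients(k_t, n, level=0.90):
    """Lower and upper confidence coefficients for frequency factor k_t,
    sample size n, and a two-sided confidence level given as a fraction.
    """
    z_c = NormalDist().inv_cdf(0.5 + level / 2.0)
    a = 1.0 - z_c ** 2 / (2.0 * (n - 1))
    b = k_t ** 2 - z_c ** 2 / n
    root = math.sqrt(k_t ** 2 - a * b)
    return (k_t - root) / a, (k_t + root) / a

# K_T = 2.542 with n = 53 observations and 90% confidence limits.
k_lower, k_upper = confidence_coefficients(2.542, 53)
print(round(k_lower, 3), round(k_upper, 3))
```

The confidence limits themselves follow by substituting the two coefficients for KT in the frequency factor relationship, on logarithms if a log based distribution is being used.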
TREATMENT OF ZEROS
Most hydrologic variables are bounded on the left by zero. A zero in a set of data that is being
logarithmically transformed requires special handling. One solution is to add a small constant
to all of the observations. Another method is to analyze the nonzero values and then adjust the
relation to the full period of record. This method biases the results as the zero values are essentially
ignored. A third and theoretically more sound method would be to use the theorem of total
probability (equation 2.10).
$$\mathrm{prob}(X \ge x) = \mathrm{prob}(X \ge x \mid X = 0)\,\mathrm{prob}(X = 0) + \mathrm{prob}(X \ge x \mid X \ne 0)\,\mathrm{prob}(X \ne 0)$$
Because prob(X ≥ x | X = 0) is zero for x > 0, this reduces to
$$\mathrm{prob}(X \ge x) = \mathrm{prob}(X \ne 0)\,\mathrm{prob}(X \ge x \mid X \ne 0)$$
In this relationship, prob(X ≠ 0) would be estimated by the fraction of nonzero values and
prob(X ≥ x | X ≠ 0) would be estimated by a standard analysis of the nonzero values with the
sample size taken to be equal to the number of nonzero values. This relation can be written as a
function of cumulative probability distributions.
$$1 - P_X(x) = k\,[1 - P_X^*(x)]$$
or
$$P_X(x) = 1 - k + k\,P_X^*(x) \qquad (7.23)$$
where Px(x) is the cumulative probability distribution of all X (prob(X ≤ x | X ≥ 0)), k is the
probability that X is not zero, and Px*(x) is the cumulative probability distribution of the nonzero
values of X (i.e., prob(X ≤ x | X ≠ 0)). This type of mixed distribution with a finite
probability that X = 0 and a continuous distribution of probability for X > 0 was discussed in
chapter 2. Jennings and Benson (1969) have demonstrated the applicability of this approach to
analyzing flood flow frequencies with zeros present.
Equation 7.23 can be used to estimate the magnitude of an event with return period Tx(x) by
solving first for Px*(x) and then using the inverse transformation of Px*(x) to get the value of X.
For example the 10-year event with k = 0.95 is found to be the value of X satisfying
$$P_X^*(x) = \frac{P_X(x) - 1 + k}{k} = \frac{0.90 - 0.05}{0.95} = 0.895$$
Note that it is possible to generate negative estimates for Px*(x) from equation 7.23. For
example, if k = 0.25 and Px(x) = 0.50, the estimated Px*(x) is
$$P_X^*(x) = \frac{0.50 - 0.75}{0.25} = -1.0$$
This merely means that the value of X corresponding to Px(x) = 0.50 is zero. This makes sense
because Px(x) = 0.50 corresponds to the 2-year flow, or the flow equaled or exceeded every
other year. If only 25% or 1/4 of the annual flows are greater than zero, then the flow exceeded
every other year must be zero.
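The adjustment for zeros can be sketched as follows, assuming the relationship Px(x) = 1 - k + k Px*(x) between the distribution of all values and that of the nonzero values. Both function names are hypothetical, and nonzero_quantile stands for whatever inverse distribution has been fitted to the nonzero data.

```python
def conditional_cdf(p_all, k):
    """Solve P_X = 1 - k + k * P_X* for P_X*, the cumulative probability
    within the nonzero values; k is the fraction of nonzero values."""
    return (p_all - 1.0 + k) / k

def quantile_with_zeros(p_all, k, nonzero_quantile):
    """Value of X with cumulative probability p_all when zeros are present.

    nonzero_quantile is a user-supplied function returning the quantile
    of the nonzero values for a given cumulative probability.
    """
    p_star = conditional_cdf(p_all, k)
    if p_star <= 0.0:
        return 0.0          # the requested event falls among the zeros
    return nonzero_quantile(p_star)

print(round(conditional_cdf(0.90, 0.95), 3))   # 10-year event, k = 0.95
print(conditional_cdf(0.50, 0.25))             # negative: the 2-year flow is zero
```

Returning zero whenever the conditional probability is not positive encodes the interpretation given above for negative estimates of Px*(x).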
Example 7.1. Seventy-five years of peak flow data are available from an annual series; 20 of
the values are zero; and the remaining 55 values have a mean of 100 cfs, a standard deviation
of 35.1 cfs, and are lognormally distributed. (a) Estimate the probability of a peak exceeding
125 cfs. (b) Estimate the magnitude of the 25year peak flow.
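Solution:
(a) From equation 7.23, 1 - Px(125) = k[1 - Px*(125)], with k = 55/75 = 0.733.
Px*(125) can be evaluated by solving equation 7.12 for KN and then using the table for the
normal distribution to get the desired probability.
From equations 6.35 and 6.36
$$s_y = [\ln(C_v^2 + 1)]^{1/2} = [\ln(0.351^2 + 1)]^{1/2} = 0.341$$
$$\bar{Y} = \ln \bar{X} - s_y^2/2 = \ln(100) - 0.341^2/2 = 4.547$$
$$K_N = \frac{\ln(125) - 4.547}{0.341} = 0.825$$
From a table of the standard normal distribution, this KN corresponds to a
prob(X ≤ x) of 0.795.
The conditional probability of a peak exceeding 125 cfs given that the peak is not zero is
1 - 0.795 = 0.205. The probability of a peak flow in any year exceeding 125 cfs is 0.733(0.205) = 0.15.
(b) The 25-year peak has Px(x) = 0.96, so from equation 7.23 Px*(x) = (0.96 - 0.267)/0.733 = 0.945.
The value of X corresponding to Px*(x) = 0.945 can be obtained from equation 7.12. ZN for
Px*(x) = 0.945 is 1.60. Therefore, X25 = exp(4.547 + 0.341 × 1.60) = 163 cfs.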
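The arithmetic of example 7.1 can be checked with the sketch below. It assumes the usual lognormal moment relationships, sy^2 = ln(Cv^2 + 1) and Ȳ = ln(X̄) - sy^2/2, together with the zero adjustment Px(x) = 1 - k + k Px*(x); the variable names are illustrative.

```python
import math
from statistics import NormalDist

n_total, n_zero = 75, 20
mean_x, std_x = 100.0, 35.1            # statistics of the 55 nonzero peaks (cfs)

k = (n_total - n_zero) / n_total       # fraction of nonzero values
cv = std_x / mean_x
s_y = math.sqrt(math.log(cv ** 2 + 1.0))     # standard deviation of ln(X)
y_bar = math.log(mean_x) - s_y ** 2 / 2.0    # mean of ln(X)
std_normal = NormalDist()

# (a) probability that the annual peak exceeds 125 cfs
k_n = (math.log(125.0) - y_bar) / s_y
p_exceed_nonzero = 1.0 - std_normal.cdf(k_n)
p_exceed = k * p_exceed_nonzero

# (b) 25-year peak flow
p_all = 1.0 - 1.0 / 25.0
p_star = (p_all - 1.0 + k) / k
x_25 = math.exp(y_bar + s_y * std_normal.inv_cdf(p_star))

print(round(p_exceed, 3), round(x_25, 1))
```

Small differences from the hand calculation are expected because the sketch does not round sy, Ȳ, or the normal deviate to table precision.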
       
Example 7.2. Table 7.6 contains annual peak flow data for Black Bear Creek near Pawnee,
Oklahoma, for the years 1945 through 1997.
(b) Plot the "best" fitting normal, lognormal, extreme value type I, and log Pearson type III
distributions on the plot of part a.
(c) Estimate the 100-year peak flow based on the four distributions of part b.
(d) Estimate the 90% confidence intervals on the log Pearson type III estimates.
(e) Estimate the 100-year peak flow using the log Pearson type III with a weighted skew
coefficient based on the station skew and the generalized skew coefficients.
Table 7.6. Annual peak flow data for Black Bear Creek near Pawnee, Oklahoma
Solution:
(a) The plotting positions are calculated by ranking the data from largest to smallest and
then using the relationship pp = m/(n + 1) where m is the rank and n is the number of
observations (53). Since the largest observation is 30,200 cfs, it is assigned a pp of 1/54,
or 0.0185. The second largest value is 19,200 cfs with a pp of 2/54 or 0.0370, and so
forth until the smallest value of 1,560 cfs with a pp of 53/54 or 0.9815. The data are
plotted in figure 7.3.
(b) The best fitting lines for the various distributions can be obtained by calculating several
points from equation 7.9. The basic statistics of the data are found to be
Data        ln of data
The next step is to determine the appropriate frequency factors for various return periods for
the four distributions. The frequency factor for the normal and lognormal distributions comes
from the standard normal distribution. KT for the extreme value and log Pearson distributions
comes from equations 7.15 and 7.14, respectively. Sample calculations follow for a return period
of 20 years.
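A minimal sketch of the 20-year calculation for two of the four distributions is given here. It assumes that equation 7.9 has the usual frequency factor form $x_T = \bar{x} + K_T s_x$ and that equation 7.15 is the large-sample extreme value type I factor; both are assumptions, as are the function names. The mean and standard deviation are the values quoted for these data in example 7.3.

```python
import math
from statistics import NormalDist

def k_normal(T):
    """Normal frequency factor: standard normal z with exceedance probability 1/T."""
    return NormalDist().inv_cdf(1.0 - 1.0 / T)

def k_ev1(T):
    """Large-sample extreme value type I frequency factor (assumed form of equation 7.15)."""
    return -(math.sqrt(6.0) / math.pi) * (0.5772 + math.log(math.log(T / (T - 1.0))))

mean_q, sd_q = 6683.0, 5337.0   # cfs, as quoted in example 7.3
T = 20

q_normal = mean_q + k_normal(T) * sd_q   # assumed form of equation 7.9
q_ev1 = mean_q + k_ev1(T) * sd_q

print(round(k_normal(T), 3), round(q_normal))
print(round(k_ev1(T), 3), round(q_ev1))
```

For the lognormal distribution the same normal factor would be applied to the mean and standard deviation of the logarithms, and the result exponentiated.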
        N      LN     EV1    LP3
prob    flow   flow   flow   flow
Fig. 7.3b. Flood frequency curves for Black Bear Creek using the standard normal z and
logarithmic flow scales.
Fig. 7.3c. Flood frequency curves for Black Bear Creek using normal probability paper.
Fig. 7.3d. Flood frequency curves for Black Bear Creek using lognormal probability paper.
Normal distribution:
Lognormal distribution:
Figures 7.3a-d show the resulting plot of the data and the best fitting distributions. The four
plots all contain the same information but show different formats. The first two plots use the z
transformation and the second two plots use normal probability scales. Both arithmetic and
logarithmic scales are shown for flow. Note that the normal distribution plots as a straight line when
the arithmetic scale is used and the lognormal distribution plots as a straight line when the
logarithmic scale is used.
(c) The 100-year flow estimates are contained in the last line of the above table.
(d) The calculations of the confidence intervals are contained in the following table:
(4)        (5)
$K_{T,L}$  $K_{T,U}$
1.7499
0.1786
1.1122
1.6588
2.1360
2.2795
2.6994
3.0891
Fig. 7.4. Flood frequency curves for Black Bear Creek with confidence intervals.
Explanation of columns in above table: (1) Return period; (2) From equation 7.14; (3)
From equation 7.22b; (4) and (5) From equation 7.21; (6) and (7) Equation 7.20; (8) Exp(col(6));
(9) Exp(col(7)); (10) Last column of previous table.
The results are plotted in figure 7.4.
(e) Station skew, G, is 0.296. The generalized skew from figure 7.2 is 0.22. From
equations 7.16 to 7.18 we get
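A weighted skew of this kind can be sketched as follows. The sketch assumes that equations 7.16 to 7.18 follow the Bulletin 17B weighting, in which the station and generalized skews are weighted inversely to their mean square errors and the generalized (map) skew has a mean square error of 0.302. The function names are illustrative, and the skew values are used with the signs as printed above.

```python
import math

def station_skew_mse(g, n):
    """Approximate mean square error of the station skew (Bulletin 17B form, assumed)."""
    abs_g = abs(g)
    a = -0.33 + 0.08 * abs_g if abs_g <= 0.90 else -0.52 + 0.30 * abs_g
    b = 0.94 - 0.26 * abs_g if abs_g <= 1.50 else 0.55
    return 10.0 ** (a - b * math.log10(n / 10.0))

def weighted_skew(g_station, g_general, n, mse_general=0.302):
    """Weight the two skew estimates inversely to their mean square errors."""
    mse_station = station_skew_mse(g_station, n)
    return (mse_general * g_station + mse_station * g_general) / (mse_general + mse_station)

g_w = weighted_skew(0.296, 0.22, 53)   # 53 years of record
print(round(station_skew_mse(0.296, 53), 4), round(g_w, 3))
```

Because the weights are positive, the weighted skew always lies between the station skew and the generalized skew.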
Example 7.3. Estimate the 100-year flow for Black Bear Creek using the GEV distribution.
Solution:
From example problem 7.2, $\bar{x}$ = 6683, $s_x$ = 5337, and $C_s$ = 0.296.
From equations 3.71
$$\varepsilon = \lambda_1 + \frac{\alpha[\Gamma(1 + \kappa) - 1]}{\kappa} = 4108$$
Fig. 7.5. Estimated 100-year flow as a function of truncation level for Black Bear Creek.
Treatment of Zeros section. Figure 7.5 shows the estimated 100-year peak flow for Black Bear
Creek using data from example problem 7.2, the lognormal distribution, and various truncation
levels. For example, if a truncation level of 3000 cfs is selected, the 11 values less than 3000
would be truncated and the k in equation 7.23 would be (53 - 11)/53, or 0.79.
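The effect of truncation on the probability used for the retained values can be sketched as below. The sketch assumes equation 7.23 has the form prob(X > x) = k[1 - P*(x)], where P*(x) is the distribution fitted to the values above the truncation level; the function name is illustrative.

```python
from statistics import NormalDist

n_total, n_truncated = 53, 11
k = (n_total - n_truncated) / n_total   # fraction of values above the truncation level

def conditional_nonexceedance(T, k):
    """P*(x) required so that the unconditional exceedance probability is 1/T,
    assuming prob(X > x) = k * [1 - P*(x)]."""
    return 1.0 - (1.0 / T) / k

p_star = conditional_nonexceedance(100, k)
z = NormalDist().inv_cdf(p_star)        # standard normal deviate for a lognormal fit

print(round(k, 2), round(p_star, 4), round(z, 3))
```

The 100-year flow would then follow from the mean and standard deviation of the logarithms of the 42 retained values, which are not given here.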
1. Is the type of storm that is likely to produce the 100-year flow represented in the observed
sample?
2. Is the contributing area of the basin the same for extreme floods as it is for small ones?
3. Are there ponds and reservoirs that may discharge at high rates during rare floods and not
during smaller flows? What is the possibility of a dam breach and what would be the resulting
flow?
4. Are the channel flow and storage characteristics the same for extreme flows as they are for
smaller flows?
5. Are land use and soil characteristics such that flows from rare storms may relate to
precipitation in a manner different from more common storms?
6. Are there seasonal effects such that rare floods are more likely to occur in a different season
than the more common floods?
7. Is the rare flood represented in the sample of data? If so how is it treated? Is it assigned a
return period of 15 years where in fact its return period may be much greater than that?
8. Are changes going on within the basin that may cause change in the hydrologic response of
the basin to rainstorms?
9. Are there climatic changes occurring that may influence flood flow frequencies?
These last few paragraphs paint a discouraging picture for flood frequency analysis. That
need not be the case as long as one does not discard hydrologic knowledge in the process.
Often, the questions posed can be answered in such a way as to make the statistical analysis
valid. At other times, when problems with the statistical procedures are recognized,
adjustments can be made in the resulting flow estimates to more accurately reflect the hydrology of
the situation.
Hydrologic frequency analysis should be used as an aid in estimating rare floods.
Sometimes the estimates made on the basis of the statistical frequency analysis can be taken as the
final estimate. Sometimes the statistical estimate may need to be adjusted to better reflect the
hydrology of the situation.
It should be kept in mind that other hydrologic estimation techniques suffer from some of
the same difficulties as do the statistical techniques. For example, if a hydrologic model is being
employed, the parameters of the model must be estimated in some way. This is generally done on
the basis of observed data from the basin in question, from observed data from a similar basin, or
from so-called physical relationships such as Manning's equation, infiltration parameters, and so
forth, and a set of accompanying tables. Regardless of how the parameters are estimated, the
same type of questions regarding these estimates and the nature of the hydrologic model itself
must be answered as outlined above for frequency analysis estimates. We cannot substitute
mathematical and empirical relationships for hydrologic knowledge any more than we can substitute
statistics for hydrologic knowledge.
Based on this discussion, one might conclude that the magnitude of rare events should not
be estimated because the estimates may be so uncertain. Generally, however, this is not one of the
options available. An estimate must be made. Hydrology must not be ignored in making this
estimate. Statistical, modeling, or empirical flow estimates should be made and then adjusted, if
required, to reflect the hydrologic situation. This is not to say a factor of safety is to be applied.
Adjustments should be based on hydrology, not rules of thumb.
REGIONAL FREQUENCY ANALYSIS
Regional flood frequency analysis has three major components, namely, delineation of
homogeneous regions, determination of appropriate probability density functions (or frequency
curves) of the observed data, and the development of a regional flood frequency model (i.e., a
relationship between flows of different return periods, basin characteristics, and climatic data).
Historical Development
Weldu (1995) reviewed several articles on regional flow estimation. The earliest approach
to the regionalization problem was to use empirical equations relating flood flow to drainage area
within a particular region (Benson 1962c). The formulas were based on few data for a particular
region and contain one or more constants whose values are empirically determined. Such a
formula, in generalized form, is
$$Q = C A^n$$
where Q is the flow, C is a coefficient related to the region, A is the drainage area, and n is an
empirically determined exponent. The above
equation, although simple to derive and apply, does not address the frequency of the flow, and the
effects of variations in precipitation or topography on the flows are not accounted for. The
various "culvert formulas" used by railroad and highway engineers, such as the Talbot formula
(AISI 1967), are of this general type. The Talbot formula is widely used and is denoted by
One major weakness in this type of empirical formula is that the coefficients will remain
constant only within regions in which other hydrologic factors vary little, which implies that the
regions must of necessity be fairly small.
Statistical Methods
Other methods of regionalization include the application of statistical techniques to hydrologic
data. Statistics provides a means of reducing a mass of data to a few useful and meaningful figures.
The distribution of the data could be represented by a probability density function or a curve that
defines the frequency of values of the variable. Statistical procedures may also provide methods of
relating dependent variables to one or more independent variables through regression analysis.
Most applications of statistical techniques require a considerable amount of data. The value
of the analysis is directly related to the quantity and quality of the data that are available. Often,
hydrologic estimates are required at locations where there is little or no data. The design of a
bridge opening or culvert, for example, on one of the many streams for which there is no data
may be required. Regionalization is an attempt to use data from locations in the same region as
the point of interest to make hydrologic estimates at the point of interest.
Regional flood frequency models have extensively been used in hydrology for transferring
data from gaged to ungaged sites. Two such regionalization procedures, namely the index-flood
and regression-based methods, have evolved over the years and have extensively been used in
regional flood frequency analysis.
This treatment will focus on flood frequency analysis. The goal is to estimate flood flows of
various return periods for streams and locations where there is little or no data.
Frequency Distributions
After a homogeneous region has been identified, the next stage in the specification of the
regional flood frequency model is the choice of appropriate frequency distribution(s) to represent
the observed data. The distributions most commonly used in hydrology are normal, lognormal,
Gumbel extreme value distribution (type I), and log Pearson type III. The U.S. Water Resources
Council (1982) conducted studies involving comparison among different probability distribution
functions and their recommendation was to use the log Pearson type III as the basic distribution
for defining the annual flood series. The Council also recommended that this distribution be fitted
to sample data using the method of moments. In a more detailed study, the U.K. Natural
Environment Research Council (1975) found that 3-parameter distributions such as the log
Pearson type III and the generalized extreme value distribution (GEV) fit data from
35 annual flood series better than the 2-parameter distribution functions.
The log Pearson type III (LP III) distribution has extensively been used in flood frequency
analysis since its favorable recommendation by the Water Resources Council in 1976. The
frequent use of the LP III attracted a number of detailed mathematical and statistical studies
regarding its role in flood frequency analysis. Various alternative fitting techniques for the LP III
distribution have been suggested by Matalas and Wallis (1973) and Condie (1977). These
researchers carried out comparisons between the method of moments and the method of
maximum likelihood, and concluded that the latter method yielded solutions that are less biased than
the method of moment estimates. Bobee (1975) and Bobee and Robitaille (1977) suggested using
moments of the original data instead of using moments of the logarithmic values.
Nozdryn-Plotnicki and Watt (1979) studied the method of moments, the method of maximum likelihood,
and the procedure proposed by Bobee (1975), and found that none of the methods was superior
to the others and concluded that the method of moments was the best because of its
computational ease.
An important step in a regional flood frequency analysis is to ensure that the data that are
being used are of good quality. The data must be representative of the region and they must be
representative of the long-term flood characteristics of the region. Data on the physical
characteristics of the catchments and any other data that are used must be of good quality. There are no
regional flood frequency techniques that can overcome faulty data.
After collecting and screening the data, the first step is to fit various pdfs to the observed
peak flow data at locations where sufficient data exist. Once all of the available data are fit to the
candidate distributions, assumptions and statistical tests must be made in an effort to select the
distribution that best describes each data set. This selection of pdfs is based on probability plots
of observed data along with the fitted distributions. Statistical tests such as the chi-square test and
the Kolmogorov-Smirnov test discussed in chapter 8 may be made. Personal judgment based on
the probability plots is also used.
Once the best fitting pdf is selected for each data set, that pdf can be used to estimate the
peak flow for various return periods. The pdf which best fits the data for the majority of the
stations or locations included in the study is generally used for all locations.
Several options are now available for the next phase of the analysis:
1. Develop a relationship between the peak flows of various return periods and measurable
characteristics of the catchments producing the flows ($Q_T = f(X)$).
2. Develop a relationship between parameters of the pdf that best fits a majority of the flow data
and measurable characteristics of the catchments producing the flows ($\theta = f(X)$).
3. Develop a dimensionless flood frequency curve for the region plus a relationship between
some index flood for each catchment and measurable characteristics of the catchments ($Q_T/Q_I$
vs T and $Q_I = f(X)$).
Regression-Based Procedures
All three of the options mentioned above require relationships with measurable
characteristics from the catchments for which flow data are available. Characteristics that might be included
in the analysis include precipitation variables, such as mean annual rainfall and 24-hour rainfalls
for various return periods. Physical characteristics such as catchment area, land slopes, stream
lengths, stream slopes, and land use might be included. Soils information such as permeability
and water holding capacities can be used. There are also a large number of geomorphic
parameters such as drainage density, catchment shape factors, and measures of elevation changes that
might be included.
The result of this data collection effort will be a matrix of data having n observations on m
catchments. Therefore X is an m × n matrix. The n observations on each catchment come about
from making a single measurement or observation on each of the n characteristics included in the
analysis. The m represents the number of catchments in the study. Thus, a study that involved
30 catchments and 12 characteristics on each catchment would produce a data matrix having
30 rows (one for each catchment) and 12 columns (one for each characteristic).
A regional flood frequency approach, in addition to the m × n data matrix of independent
variables, will include an m × p data matrix of dependent variables which are the peak flow
estimates for the various return periods. With return periods of 2, 5, 10, 25, 50, and 100 years, p
will be 6. With 30 catchments, a 30 × 6 matrix of dependent variables, where the rows are the
catchments and the columns correspond to the various return periods, will result.
Multiple regression techniques can now be used to relate the dependent variables to the
independent variables based on the 30 observations on hand. Regression based on the regional data
and based on logarithms of the regional data can be investigated. Through the estimation process
based on multiple regression, the independent variables that are not useful in predicting the
dependent variables can be eliminated. The goal is to find if the peak flow for the various return
periods can be estimated based on a small subset of the original n catchment characteristics.
Although not always possible, it is desirable to use the same subset of independent variables for
predicting each of the p dependent variables. This will help to ensure a consistent set of
predictions for various return periods on a particular catchment.
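A regression on the logarithms of the regional data can be sketched with ordinary least squares as below. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the catchment characteristics (area and slope), the data values, and the function names are all hypothetical, and the flows are generated from a known power law so that the fit can be checked.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_log_linear(X, q):
    """OLS fit of ln(q) = ln(b0) + b1 ln(X1) + b2 ln(X2) + ...; returns [b0, b1, b2, ...]."""
    rows = [[1.0] + [math.log(v) for v in row] for row in X]
    y = [math.log(v) for v in q]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve(xtx, xty)
    return [math.exp(coef[0])] + coef[1:]

# Hypothetical data matrix: one row per catchment, columns are area and slope.
X = [[10.0, 50.0], [40.0, 8.0], [90.0, 25.0], [200.0, 5.0], [350.0, 15.0], [20.0, 30.0]]
# Synthetic 100-year flows from Q = 100 * area^0.7 * slope^0.3 (illustration only).
q100 = [100.0 * a ** 0.7 * s ** 0.3 for a, s in X]

b = fit_log_linear(X, q100)
print([round(v, 4) for v in b])
```

The same function would be applied once per return period, ideally with the same subset of characteristics each time.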
Using multiple regression to estimate the magnitude of a flood event that will occur on
average once in T years, denoted by QT, by using physical and climatic characteristics of the
watershed has a long history (Benson 1962c, 1964; Benson and Matalas 1967; Thomas and
Benson 1970). Sauer (1974) developed regional equations relating flood frequency data for
unregulated streams in Oklahoma to basin characteristics through multiple linear regression
techniques. Similar studies have been done throughout the United States (Jennings et al. 1993).
The Hydrology Committee of the U.S. Water Resources Council (1981) investigated numerous
methods of estimating peak flows from ungaged watersheds and found that the results obtained
using regional regression compared favorably with more complex watershed models.
A logarithmic transformation of the $Q_T$, physiographic, and climatic data may be required to
linearize the regression model and to satisfy other assumptions of regression analysis. The
relationship most commonly used is of the form
$$Q_T = b_0 X_1^{b_1} X_2^{b_2} \cdots X_n^{b_n}$$
where $X_1, X_2, \ldots, X_n$ represent the basin and climatic data, and $b_0, b_1, b_2, \ldots, b_n$ are the
regression parameters. Regression parameters may be estimated using ordinary least squares
(OLS), weighted least squares (WLS), or generalized least squares (GLS). OLS does not account
for unequal variances in flood characteristics or any correlations that may exist between
streamflows from nearby stations. To overcome these deficiencies in the OLS method, Tasker
(1980) proposed the use of WLS regression with the variance of the errors of the
observed flow characteristics estimated as an inverse function of the record length. Using a
weighting function of
where N is the number of stations, $t_0$ and $t_1$ are constants, and $n_i$ is the record length of station i,
Tasker (1980) reported that the WLS produced a smaller expected standard error of predictions
than the OLS. Using Monte Carlo simulation, Stedinger and Tasker (1985) demonstrated that
the WLS and GLS provide more accurate estimates of regression parameters than the OLS. A
major drawback of the WLS and GLS is the need to estimate the covariance matrix of the
residual errors. The covariance matrix of the residual errors is a function of the precision with which
the model can predict the streamflow values.
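The idea of weighting stations by record length can be sketched as follows. The weighting function itself is not reproduced here, so the form used below is an assumption: each station's error variance is taken as c0 + c1/n_i, and the weights are the reciprocals of these variances scaled to sum to the number of stations. The constants, data, and function names are hypothetical.

```python
def record_length_weights(n_years, c0, c1):
    """Weights inversely proportional to an assumed error variance c0 + c1/n_i,
    scaled so that they sum to the number of stations."""
    raw = [1.0 / (c0 + c1 / n) for n in n_years]
    scale = len(raw) / sum(raw)
    return [r * scale for r in raw]

def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b

# Hypothetical stations: ln(area), ln(Q_T), and record length in years.
ln_area = [2.3, 3.7, 4.5, 5.3, 5.9]
ln_q = [7.1, 8.0, 8.4, 9.2, 9.3]
years = [12, 45, 30, 60, 18]

w = record_length_weights(years, c0=0.05, c1=1.5)
a, b = wls_line(ln_area, ln_q, w)
print([round(v, 3) for v in w], round(a, 3), round(b, 3))
```

Stations with longer records receive larger weights, so the fitted line is pulled toward the flow estimates that are known more precisely.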
Estimating a peak flow for some return period on an ungaged catchment now becomes an
exercise in applying the appropriate regression equation to the ungaged catchment. The required
catchment characteristics are used in the appropriate prediction equations to estimate the peak
flow.
Regional frequency analysis using option 2 is very similar to option 1 except the dependent
variables in the regression analysis are the parameters or some function of the parameters of the
pdf selected to represent the flood peak flows. If a lognormal distribution is used, there will be 2
dependent variables, the mean and standard deviation of the logarithms of the flows. If the log
Pearson type III is used, there will be 3 dependent variables.
[Figure: abscissa, return period (yrs.); symbol +, median of 18 stations.]
Again, it is desirable to use the same set of independent variables to predict all of the
parameters of the selected pdf. This is because the parameters will most likely be correlated. Using
the same set of independent variables helps ensure that one maintains a consistent relationship
among the parameters of the pdf.
Estimating peak flows for an ungaged catchment consists of using the derived prediction
equations to estimate the parameters of the flow frequency pdf. These parameters are then used
in the pdf to estimate flow magnitudes for the desired return periods.
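For the lognormal case, the two steps (predict the parameters, then use them in the pdf) can be sketched as below. The regional prediction equations and their coefficients are hypothetical placeholders; only the final step, a lognormal quantile from the mean and standard deviation of the logarithms, follows from the text.

```python
import math
from statistics import NormalDist

def predicted_parameters(area_sq_mi):
    """Hypothetical regional equations for the mean and standard deviation of ln(Q)."""
    mean_ln = 5.0 + 0.7 * math.log(area_sq_mi)
    sd_ln = 0.9 - 0.05 * math.log(area_sq_mi)
    return mean_ln, sd_ln

def lognormal_flow(mean_ln, sd_ln, T):
    """Flow with return period T years for a lognormal distribution."""
    z = NormalDist().inv_cdf(1.0 - 1.0 / T)
    return math.exp(mean_ln + z * sd_ln)

mean_ln, sd_ln = predicted_parameters(50.0)   # ungaged catchment of 50 square miles
for T in (2, 10, 100):
    print(T, round(lognormal_flow(mean_ln, sd_ln, T)))
```

Predicting both parameters from the same catchment characteristic, as here, is one way to keep the relationship between them consistent.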
Index-Flood Method
Another widely used statistical procedure in regional flood frequency analysis is the
index-flood method. This method, first described by Dalrymple (1960), involves the derivation and use of
a dimensionless flood frequency distribution applicable to all basins within a homogeneous region.
the right-hand side of this equation is nothing but the standard deviation of the normalized flows.
The index-flood method regained popularity in the late 1970s and early 1980s following
the introduction of the probability weighted moments (PWM), a generalization of the usual
moments of a probability distribution (Greenwood et al. 1979). Greis and Wood (1983) reported
that improved regional estimates of flood quantiles were obtained by applying the PWM rather than
conventional methods such as the method of moments and maximum likelihood estimation.
Parameter estimation by PWM requires the calculation of moments $M_{i,j,k}$ defined as
$$M_{i,j,k} = E[X^i F^j (1 - F)^k] = \int_0^1 [x(F)]^i F^j (1 - F)^k \, dF$$
where i, j, and k are real numbers and X is a random variable with distribution function F(x),
where F(x) = Prob(X ≤ x). $M_{1,0,0}$ is identical to the conventional moment about the origin and
the probability weighted moments corresponding to $M_{1,0,k}$, or $M_k$, are denoted as
$$M_k = M_{1,0,k} = E\{X[1 - F(x)]^k\}$$
All higher-order PWMs are linear combinations of the ranked observations $x_{(1)} \le \dots \le x_{(n)}$,
which is an indication that PWM estimators are subject to less bias than ordinary moments.
Ordinary moment estimators such as variance ($s^2$) and coefficient of skewness ($C_s$) involve
squaring and cubing of observations respectively, with a potential to give greater weight to outliers,
resulting in a substantial bias and variance. However, one major weakness of the PWM is that it
cannot be used to estimate parameters for those distributions which cannot be expressed in
inverse form, such as LP III.
Consider K sites with flood records $X_i(k)$ for i = 1, 2, ..., n, and k = 1, 2, ..., K. Normalize the
$X_i(k)$ by dividing the observations at a site by the mean of the observations at that site.
1. At each site compute the three L-moments $\lambda_1(k)$, $\lambda_2(k)$, and $\lambda_3(k)$ of the normalized
observations using the probability weighted moments (PWM) estimators. The L-moments are linear
combinations of the ranked observations.
where $x_{(j)}$ is the jth-order statistic of the normalized observations with $x_{(1)}$ the smallest and
$x_{(n)}$ the largest.
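A sketch of this step is given below. It uses the widely published unbiased PWM estimators $b_0$, $b_1$, $b_2$ computed from the ascending order statistics, with $\lambda_1 = b_0$, $\lambda_2 = 2b_1 - b_0$, and $\lambda_3 = 6b_2 - 6b_1 + b_0$. Whether these are exactly the estimators intended here is an assumption, and the function names and flow values are hypothetical.

```python
def normalize_by_mean(values):
    """Divide each observation at a site by the site mean."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]

def sample_l_moments(values):
    """First three sample L-moments from unbiased PWM estimators (assumed form)."""
    x = sorted(values)            # x[0] is the smallest, x[n - 1] the largest
    n = len(x)
    b0 = sum(x) / n
    b1 = sum((j - 1) / (n - 1) * x[j - 1] for j in range(2, n + 1)) / n
    b2 = sum((j - 1) * (j - 2) / ((n - 1) * (n - 2)) * x[j - 1]
             for j in range(3, n + 1)) / n
    return b0, 2.0 * b1 - b0, 6.0 * b2 - 6.0 * b1 + b0

# Hypothetical annual peaks (cfs) at one site.
peaks = [3200.0, 5400.0, 1800.0, 9100.0, 4300.0, 2700.0, 12500.0, 3900.0]
lam1, lam2, lam3 = sample_l_moments(normalize_by_mean(peaks))
print(round(lam1, 4), round(lam2, 4), round(lam3, 4))
```

Because the observations are divided by the site mean first, the first L-moment of the normalized data is 1.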
$$\lambda_r^R = \frac{\sum_{k=1}^{K} w_k [\lambda_r(k)/\lambda_1(k)]}{\sum_{k=1}^{K} w_k} \quad \text{for } r = 2, 3$$
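The regional average above can be sketched directly. The choice of weights is not stated here, so the record lengths used as weights below are an assumption, and the site values and function name are hypothetical.

```python
def regional_l_moment_ratios(site_l_moments, weights):
    """Weighted regional averages of lambda_r(k) / lambda_1(k) for r = 2 and 3.

    site_l_moments: list of (lambda_1, lambda_2, lambda_3) tuples, one per site.
    weights: one weight w_k per site (record lengths are assumed here).
    """
    total_w = sum(weights)
    lam2_r = sum(w * l2 / l1 for w, (l1, l2, l3) in zip(weights, site_l_moments)) / total_w
    lam3_r = sum(w * l3 / l1 for w, (l1, l2, l3) in zip(weights, site_l_moments)) / total_w
    return lam2_r, lam3_r

# Hypothetical L-moments of normalized flows at three sites, with record lengths.
sites = [(1.0, 0.30, 0.06), (1.0, 0.40, 0.10), (1.0, 0.35, 0.07)]
record_lengths = [20, 60, 40]

lam2_r, lam3_r = regional_l_moment_ratios(sites, record_lengths)
print(round(lam2_r, 4), round(lam3_r, 4))
```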
For sites without flow records on which to estimate $\lambda_1(k)$, a regional regression could be used to
develop an equation of the form
and 24-hour rainfall depths converted to a partial duration series by using the factors shown in
table 7.7. For example, if the 5-year partial duration series value estimated from the maps is 2.00
inches, the corresponding annual series depth would be 0.96(2.00) or 1.92 inches. For return
periods greater than 10 years, the conversion factor is essentially unity.
The 2-year rainfall amounts were determined by plotting on log-log paper the return period
versus the rainfall depth using the California plotting position formula (Table 7.1), drawing a
smooth curve through the points, and reading the 2-year value.
The 100-year rainfall amounts were determined by using the type I extreme value
distribution for selected stations with long rainfall records. The ratio of the 100-year to the 2-year
rainfall amount was then determined for these stations and a map prepared showing the value of this
ratio. The 100-year rainfall amounts for the stations with short records were estimated by the
100-year to 2-year ratio.
The rainfall depths for other return periods were determined by plotting the 2-year and
100-year depths on special paper, connecting the points by a straight line, and reading off the
desired rainfall depths. The spacing of the return periods along the abscissa of this special paper
was empirical from 1 to 10 years based on freehand plotting of partial duration series data and
theoretical according to the type I extreme value distribution from 20 to 100 years. The transition
between 10 and 20 years is smoothed by hand from the type I values.
The rainfall depths for durations other than 1 hour or 24 hours were obtained by plotting the
1-hour and 24-hour values on a second special paper and connecting the points with a straight
line. This diagram was obtained empirically from an analysis of records from 200 first-order U.S.
Weather Bureau stations. The depth of rainfall for the 30-minute duration is obtained by
multiplying the 1-hour value by 0.79.
From these analyses, curves called depth (or intensity)-duration-frequency curves can be
prepared. Data from the maps in TP 40 can be used to determine depth-duration-frequency
(DDF) relationships for locations where actual data do not exist. Often, in developing DDF
curves, the interpolation from the maps of TP 40 may result in rather rough plots. The curves can
be smoothed by using an empirical smoothing equation. One such equation is
$$D = \frac{K T F^x}{(T + b)^n}$$
where D is the depth, T is the duration, and F is the frequency of the rainfall. The coefficients K,
x, b, and n may be estimated using nonlinear regression techniques. Figure 7.7 shows the results
of such an analysis for Stillwater, Oklahoma, based on TP 40 data.
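Once the coefficients are available, evaluating the smoothing equation is straightforward, as the sketch below shows. The coefficient values are hypothetical and are not the fitted values for Stillwater; F is taken as the return period in years and T as the duration in hours, which is an assumption about units.

```python
def ddf_depth(duration_hr, return_period_yr, K, x, b, n):
    """Rainfall depth from D = K * T * F**x / (T + b)**n."""
    return K * duration_hr * return_period_yr ** x / (duration_hr + b) ** n

# Hypothetical coefficients, for illustration only.
coeffs = dict(K=1.5, x=0.2, b=0.3, n=0.8)

for T in (0.5, 1.0, 6.0, 24.0):
    depths = [ddf_depth(T, F, **coeffs) for F in (2, 10, 100)]
    print(T, [round(d, 2) for d in depths])
```

In practice the four coefficients would be estimated by nonlinear regression against the depths read from the maps.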
[Fig. 7.7: abscissa, duration (hrs).]
Rainfall data for longer durations, such as weeks or months, can be analyzed by using the
gamma distribution. Barger and Thom (1949) have shown the gamma distribution applicable to
rainfall data. Barger, Shaw, and Dale (1959), Friedman and Janes (1957), Strommen and
Horsfield (1969), and Mooley and Crutcher (1968) are among those who have used the gamma
distribution for rainfall.
By using equation 7.23, it can be seen that the probability of a rainfall R exceeding x is
given by
$$\text{prob}(R > x) = k[1 - P^*(x)]$$
where k is the probability of rain or the proportion of time intervals with rainfall and $P^*(x)$ is the
cumulative probability distribution of rain given that R ≠ 0. Often the gamma distribution is used
for rainfall data. The parameters of the gamma distribution generally are determined by using
equations 6.18 and 6.19. Bridges and Haan (1972) have presented a technique for determining
the reliability of rainfall estimates from the gamma distribution based on simulation studies.
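The calculation can be sketched with a gamma distribution for the nonzero rainfall amounts. The sketch assumes the relationship prob(R > x) = k[1 - P*(x)] and evaluates the gamma cumulative distribution with the standard series for the incomplete gamma function; the parameter values are hypothetical and the function names are illustrative.

```python
import math

def gamma_cdf(x, shape, rate):
    """Gamma cumulative distribution via the series for the regularized
    lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    t = rate * x
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-15 * total and n < 10000:
        term *= t / (shape + n)
        total += term
        n += 1
    return total * math.exp(-t + shape * math.log(t) - math.lgamma(shape))

def prob_rain_exceeds(x, k, shape, rate):
    """prob(R > x) = k * [1 - P*(x)], with P* a gamma distribution."""
    return k * (1.0 - gamma_cdf(x, shape, rate))

# Hypothetical weekly rainfall: rain in 60% of weeks, gamma shape 1.4, rate 0.9 per inch.
for depth in (0.5, 1.0, 2.0, 4.0):
    print(depth, round(prob_rain_exceeds(depth, 0.6, 1.4, 0.9), 4))
```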
Exercises
7.1. Assume that daily rainfall on rainy days follows an exponential distribution. The average
daily rainfall on rainy days is 0.3 inches. If 30% of all days are rainy, what is the probability that
on some future day, the amount of rainfall received will exceed 1.00 inch? Assume daily rainfalls
are independent.
7.2. Derive a table of frequency factors for the exponential distribution corresponding to T = 2,
5, 10, 20, 50, and 100 years.
7.3. Select several streams in a single locality and prepare a plot of the ratio of the T-year flood
to the mean annual flood (as in figure 7.6).
7.4. An analysis of 50 years of data showed that the probability of a flood peak exceeding 90,000
cfs on a certain river was 02. During a 10-year period 2 such peaks occurred. If the original
estimate of the probability of this exceedance was correct, what is the probability of getting 2
such exceedances in 10 years?
7.5. Forty years of peak streamflow data are available. All but one of the data points indicate that
a lognormal distribution with $\bar{x}$ = 125,000 cfs and $s_x$ = 50,000 describes the data very nicely.
The one outlier is equal to 285,000 cfs. What is the probability that an event of 285,000 cfs or
greater could occur in the 40-year period if the flood peaks truly follow the lognormal
distribution with $\bar{x}$ and $s_x$ as given?
7.6. Select a set of data consisting of 20 or more independent observations. Plot these data on
normal probability paper using several of the plotting position relationships contained in table 7.1.
7.7. Compute the 100-year peak flow for the annual series data of example 7.2 assuming the data
follow the gamma distribution.
7.8. Prepare a plot on log-log paper of low flow frequency-volume-duration for Cave Creek near
Fort Spring, Kentucky. Plot volume in inches as the ordinate, duration in months (use 1, 2, 3, 6,
and 12 months) as the abscissa and use as curve parameters frequency (use 2, 5, 10, and 25
years).
7.9. Work exercise 7.8 for maximum flow frequency-volume-duration on Cave Creek.
7.10. Plot the annual runoff data for Walnut Gulch near Tombstone, Arizona, on normal and
lognormal probability paper. Does either of these distributions appear to "fit" the data?
7.11. Plot on normal probability paper the annual runoff data for (a) Piscataquis River near
Dover-Foxcroft, Maine, (b) North Llano River near Junction, Texas, and (c) Spray River, Banff,
Canada. Is there any apparent relationship between the curvature (or lack of it) and the skewness?
7.12. Work exercise 7.11, only plot the data on lognormal probability paper.
7.13. For the Piscataquis River near Dover-Foxcroft, Maine, estimate the 100-year annual flow
assuming the data follow the (a) normal distribution, (b) lognormal distribution, (c) Pearson type
III distribution, (d) log Pearson type III distribution, (e) extreme value distribution.
7.14. Work exercise 7.13 for the 100-year annual flow on the North Llano River near Junction,
Texas.
7.15. Work exercise 7.13 for the 100-year annual flow on the Spray River, Banff, Canada.
7.16. In reference to exercises 7.13, 7.14, and 7.15, which distribution would you expect to give
the "best" estimate for the 100-year flow on each of the three rivers? Discuss in terms of the
means, variances, coefficient of variation, and skewness.
7.17. Plot the annual peak discharge of Walnut Gulch near Tombstone, Arizona, on lognormal
probability paper. Draw in what you consider the best fitting straight line. Estimate the mean and
variance of the data from this plot.
7.18. Plot the suspended sediment load data for the Green River at Munfordville, Kentucky on
normal and lognormal probability paper. Draw in the best fitting straight line.
7.19. Use the lognormal distribution to estimate the 25-year runoff volume for July on Walnut
Gulch near Tombstone, Arizona. Plot the data on lognormal probability paper and draw in the
theoretical best fitting straight line.
8. Confidence Intervals
and Hypothesis Testing
IN CHAPTER 3, parameter estimation was discussed in general terms. In chapters 4, 5, and
6 specific methods for estimating the parameters of certain probability distributions were
discussed. Again, it should be recalled that parameter estimates are called statistics, are functions
of the sample (random) values, and are themselves random variables. Parameter estimates have
associated with them probability distributions.
Thus far we have discussed methods of getting point estimates for parameters and certain
properties of these point estimates. The possible errors in these point estimates due to inherent
variability in random samples of data have not been discussed. This chapter considers the
reliability of parameter estimates and the testing of hypotheses regarding population parameters.
Hypothesis testing and confidence interval estimation may be classed as parametric or
nonparametric depending on whether or not assumptions are made regarding the probability
distribution of the observations and/or the parameters under consideration. Parametric and
nonparametric tests have certain assumptions in common. They both rely on independence in
the observations and randomness of the sample. They both require samples of data to be
representative of the situation under analysis. Parametric statistics deal with actual values of
observations while nonparametric methods often rely on the ranking or relative position of
data values.
The use of parametric statistics is frequently criticized because of deviations from the
distributions assumed by a particular test. One of the consequences of deviating from the
assumed distribution is that the level of significance of the test is no longer exact. This may be a
serious problem, but in most cases is not. Generally, the selection of the level of significance is
somewhat arbitrary. Early statisticians used 5 and 10%, so everybody uses 5 and 10%! If one
doesn't know how to select a level of significance, it makes little sense to be overly concerned if
the level of significance is unknown due to deviations from distributional assumptions. What is
purported to be an exact test becomes an approximate test, but that is often the nature of
hydrologic analysis. Uncertainty abounds! An approximate test provides information to the decision
maker just as does a "so-called" exact test and is certainly better than no test at all. Several
papers are available indicating that nonparametric procedures are nearly as good as parametric
procedures for some tests when distributional assumptions are met and are superior when
distributional assumptions are not met (Helsel and Hirsch 1992).
In any application of hypothesis testing or confidence interval estimation, it must be kept in
mind that assumptions must be made concerning the data and the process under study. It is
unlikely that in an actual application the assumptions will be exactly met. Again, if the
assumptions are not fully met, then the tests or confidence intervals become approximate.
If we reject the hypothesis that two streams have different BOD loadings, we do not
necessarily believe their BOD loadings are exactly the same. It would be rare indeed to have two
natural streams that have identical BOD loadings or any other quantifiable characteristic. We know
before we run the test, indeed before we collect any data, that the BOD loadings are not precisely
the same on two streams.
What we are really concerned with is whether the BOD loadings are "significantly"
different. In statistical jargon, we are assessing whether the difference we detect in BOD is of such a
magnitude that it cannot be attributed to chance if the BOD loadings in the two streams are in fact
the same and meet the conditions of the test.
For example, consider a situation where the BOD level on two streams is sampled. Assume
that on each of the streams the true distribution of BOD is N(4, 1) and the BOD in the two streams
is uncorrelated. These are strong assumptions that we can never verify completely. If we could,
then statistical testing would be superfluous. It is hypothesized that the BOD levels are the same
in the two streams. The investigator decides to sample each of the streams and declare the BOD
levels different if the samples from the two streams differ by more than 1 mg/l. What is the
probability that an error will be made?
The error that might be made is to declare the BOD in the two streams different when, in fact,
they are, unknown to the investigator, the same. Since the BOD level is actually N(4, 1), the
difference in two independent samples is N(0, 2). The probability of selecting a random number
from an N(0, 2) that is larger in absolute value than one is the probability of making an error with
the test. Since the test statistic, the observed difference, has an N(0, 2) pdf, the standardized Z value
corresponding to a difference in excess of the absolute value of one is $(1 - 0)/\sqrt{2} = 0.707$. The
probability of Z exceeding 0.707 in absolute value for a standard normal distribution is 0.48. There
is a 48% chance of rejecting the hypothesis even though it is true.
If the investigator thinks this probability of an error is too great, the appropriate value for the
test statistic consistent with the acceptable error probability can be determined. For example, if
the investigator wants to be 90% confident of not concluding the streams are different when in
fact they are not, the cutoff value for Z is such that $\mathrm{prob}(Z > z_c) = 0.05$, which corresponds
to Z = 1.645. Then the actual difference is computed from $(d - 0)/\sqrt{2} = 1.645$ or d = 2.33.
Therefore, the streams would be considered not significantly different unless the absolute value of
the difference in the samples from the streams exceeded 2.33 mg/l.
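The two numbers in this example can be checked numerically. The following is a minimal Python sketch rather than part of the original text. It assumes only the setup stated above (two independent N(4, 1) samples and a decision threshold of 1 mg/l) and uses `statistics.NormalDist` from the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()          # standard normal distribution, N(0, 1)

# The difference of two independent N(4, 1) samples has mean 0 and variance 2.
sd_diff = sqrt(1.0 + 1.0)

# Probability of |difference| > 1 mg/l when the two streams are in fact the same.
z = (1.0 - 0.0) / sd_diff
alpha = 2.0 * (1.0 - std_normal.cdf(z))

# Cutoff d leaving 0.05 in each tail, i.e. a 10% chance of a false "different".
d = std_normal.inv_cdf(0.95) * sd_diff

print(f"z = {z:.3f}, alpha = {alpha:.2f}, cutoff d = {d:.2f} mg/l")
# about: z = 0.707, alpha = 0.48, cutoff d = 2.33 mg/l
```

Raising the threshold from 1 to 2.33 mg/l is what reduces the chance of a false alarm from 48% to 10%.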
If the BOD distribution on one stream was N(3, 1) and on the other N(4, 1), the distribution
of BOD would have been truly different on the two streams. The distribution of the difference in
BOD would be N(1, 2). The probability of getting a difference in excess of $\pm 1$ would be the
probability of a value below $-1$ or above 1 from an N(1, 2). Again, using the standard normal distribution,
this probability can be found to be 0.58 (0.50 for values above 1 plus 0.08 for values below $-1$,
which lies $2/\sqrt{2} = 1.414$ standard deviations below the mean). In this case, the BOD distributions
are different yet there is a 42% chance of erroneously concluding they are not different.
What becomes apparent is that there is always a chance of making an error in statistical
tests of hypotheses. The first part of the example demonstrates how one could wrongly conclude
a difference when none existed and the second part shows how one could fail to detect a
difference when one does exist. These two errors are rejecting a true hypothesis, known as a Type I
error, or accepting a false hypothesis, known as a Type II error.
The probabilities of a Type I and a Type II error are usually denoted by $\alpha$ and $\beta$, respectively.
In this example when the true situation was no difference, $\alpha$ was 0.48. In the situation where there
was a difference, $\beta$ was 0.42.
CONFIDENCE INTERVALS
A parameter $\theta$ is estimated by $\hat{\theta}$. The statistic $\hat{\theta}$ is a random variable having a probability
distribution. If $\hat{\theta}$ can take on any value in some continuous range, then $\mathrm{prob}(\theta = \hat{\theta})$ is zero.
Rather than a point estimate for $\theta$, it may be more desirable to get an interval estimate such that
the probability that this interval contains $\theta$ can be specified. Such an interval is known as a
confidence interval. This statement may be written

    $\mathrm{prob}(L < \theta < U) = 1 - \alpha$    (8.1)

where L and U are the lower and upper confidence limits, so that the interval from L to U is the
confidence interval and $1 - \alpha$ is the confidence level, or confidence coefficient. Note that in
equation 8.1, $\theta$ is not a random variable. One does not say that the probability that $\theta$ is between
L and U is $1 - \alpha$ but that the probability is $1 - \alpha$ that the interval L to U contains $\theta$. The
difference in these two interpretations is subtle but based on the fact that $\theta$ is a constant while L and U
are random variables.
Mood et al. (1974) discuss a general method for determining confidence intervals. Ostle
(1963) presents expressions for the confidence intervals for many different statistics. In the
discussion to follow, a procedure known as the method of pivotal quantities for determining
confidence limits will be illustrated. This method consists of finding a random variable V that is a
function of the parameter $\theta$ but whose distribution does not involve any other unknown
parameters. Then $v_1$ and $v_2$ are determined such that

    $\mathrm{prob}(v_1 < V < v_2) = 1 - \alpha$    (8.2)

This inequality is then manipulated so that it is in the form of equation 8.1 where U and L are
random variables depending on V but not $\theta$.
Mean of a Normal Distribution
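As an example of using equation 8.2, the confidence intervals on the mean of a normal
distribution will be determined. We have shown that the quantity

    $(\bar{X} - \mu_x)/s_{\bar{x}}$    (8.3)

has a t distribution with $n - 1$ degrees of freedom, where n is the number of observations used
to compute $\bar{X}$ and $s_{\bar{x}} = s_x/\sqrt{n}$. Using equation 8.2 we have

    $\mathrm{prob}(t_{\alpha/2,\,n-1} < (\bar{X} - \mu_x)/s_{\bar{x}} < t_{1-\alpha/2,\,n-1}) = 1 - \alpha$

which, because $t_{\alpha/2,\,n-1} = -t_{1-\alpha/2,\,n-1}$, can be rearranged into the form of equation 8.1 with limits

    $l = \bar{x} - t_{1-\alpha/2,\,n-1}\, s_{\bar{x}}$,    $u = \bar{x} + t_{1-\alpha/2,\,n-1}\, s_{\bar{x}}$    (8.4)

Because $\bar{X}$ and $s_{\bar{x}}$ are both random variables, L and U are random variables as well, with
estimates l and u given by equation 8.4. Note that the assumption that the observations are
normally distributed was made.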
Example 8.1. The sample mean and standard deviation of the Kentucky River data contained in table 2.1
have been calculated as $\bar{x}$ = 66,540 and $s_x$ = 22,322. What are the 95% confidence limits on the
mean assuming the sample is from a normal population?
Solution: From equation 8.4 with n = 99,

    $l = 66{,}540 - t_{0.975,\,98}\,(22{,}322/\sqrt{99})$,    $u = 66{,}540 + t_{0.975,\,98}\,(22{,}322/\sqrt{99})$

With $t_{0.975,\,98}$ taken as approximately 1.99, these limits are 62,076 and 71,004.
Thus, we can say that we are 95% confident that the interval 62,076 to 71,004 contains the true
population mean.
Comment: If a 90% confidence interval is calculated, it is found to be 62,817 to 70,263. Thus, the
90% confidence interval is shorter than the 95% confidence interval but our degree of confidence
that the interval contains $\mu_x$ has decreased from 95% to 90%.
If a second independent sample of peak flows on the Kentucky River near Salvisa were
available, this sample would have a different mean and variance. In this case, the 95% confidence
intervals would be different as well. If many samples were available and the 95% confidence
limits were calculated for each, 95% of the confidence intervals would contain the true population
mean while 5% would not if the data were actually from a normal distribution. The $100(1 - \alpha)$%
confidence interval on the mean can be made as small as desired by increasing the sample size.
This is because $s_{\bar{x}}$ decreases as the sample size is increased. An increase in the reliability of the
sample mean comes at the expense of an increase in the sample size. Unfortunately, in many
hydrologic problems the sample size is fixed. For a normal distribution, equation 8.4 provides a
means for determining the sample size required in order to estimate $\mu_x$ within a given reliability.
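If the population variance of the normal distribution is known, then the pivotal quantity in
equation 8.3 becomes $(\bar{X} - \mu_x)/\sigma_{\bar{x}}$, with $\sigma_{\bar{x}} = \sigma_x/\sqrt{n}$, which has a standard normal
distribution. The confidence limits then become

    $l = \bar{x} - z_{1-\alpha/2}\,\sigma_{\bar{x}}$,    $u = \bar{x} + z_{1-\alpha/2}\,\sigma_{\bar{x}}$    (8.5)

where $z_{1-\alpha/2}$ is the value of Z from the standard normal distribution such that the area to the right
of Z is $\alpha/2$.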
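The interpretation of a confidence level as a long-run frequency (about 95% of intervals computed from repeated samples contain the true mean) can be illustrated by simulation. The sketch below is illustrative only: it uses the known-variance limits, $\bar{x} \pm z_{1-\alpha/2}\,\sigma_x/\sqrt{n}$, and it borrows the sample values of example 8.1 as if they were the true population mean and standard deviation, which is purely an assumption for the demonstration. The function names, seed, and number of trials are hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(sample, sigma, conf=0.95):
    """Confidence limits on the mean of a normal population with known sigma."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    half_width = z * sigma / sqrt(len(sample))
    xbar = fmean(sample)
    return xbar - half_width, xbar + half_width


def coverage(mu, sigma, n, trials, conf=0.95, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma, conf)
        hits += lower <= mu <= upper
    return hits / trials


# Assumed "true" values, borrowed from example 8.1 for illustration only.
frac = coverage(mu=66_540, sigma=22_322, n=99, trials=2000)
print(f"fraction of 95% intervals containing the true mean: {frac:.3f}")
```

Each simulated sample gives a different interval; it is the interval that varies from sample to sample, not the population mean.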
Equations 8.4 and 8.5 are based on the assumption that the underlying population of the
random variable X has a normal distribution. Only through the Central Limit Theorem can these
relations be applied to nonnormal distributions. Confidence limits calculated by these
relationships for the means of random samples from nonnormal populations are only approximate with
the approximation improving as the sample size increases. If these approximations are not
satisfactory, other methods are available (Ostle 1963; Mood et al. 1974).
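which is in the form of equation 8.1. Thus, the confidence limits on $\sigma_x^2$ are

    $l = (n - 1)s_x^2/\chi^2_{1-\alpha/2,\,n-1}$,    $u = (n - 1)s_x^2/\chi^2_{\alpha/2,\,n-1}$    (8.6)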
Again, equations 8.6 are strictly valid only if X is from a normal distribution and
approximate for X from a nonnormal distribution, with the approximation improving as the sample
size increases.
Fig. 8.2. Confidence limits on a chi-square distribution.
The chi-square distribution is not symmetrical so that $s_x^2 - l$ is not equal to $u - s_x^2$. As the
sample size and, thus, the degrees of freedom of the chi-square distribution increase, the
distribution approaches a symmetrical distribution so that the upper and lower confidence limits are
nearly the same distance from $s_x^2$. This is illustrated in figure 8.2.
Example 8.2. Determine the 90% confidence limits on the variance for the situation described in
example 8.1.
Solution:
The 90% confidence limits on the standard deviation are found (by taking the square
roots of the above limits) to be 20,001 to 25,331 cfs.
Comment: In the preceding two examples the confidence limits on the mean and variance of a
normal distribution were calculated. If the joint confidence limits on $\mu_x$ and $\sigma_x^2$ are desired, they
cannot be computed separately as was done in these examples. Mood et al. (1974) discuss the
estimation of joint confidence intervals.
The analogous results would hold for any one-sided, lower or upper, confidence limit.
value as the mean and a variance of $\{-nE[\partial^2 \ln p_X(x; \theta)/\partial\theta^2]\}^{-1}$.
Using this information, it is possible to construct confidence intervals and joint confidence
intervals for the parameters of these distributions. The book by Mood et al. (1974) should be
consulted for the procedures to be used.
HYPOTHESIS TESTING
Often the acceptability of statistical models can be judged without actually making any
statistical tests. This would be the case when observed data is predicted very closely by the model
or when observed data deviates very greatly from the model. On the other hand, a common
occurrence is for the observed data to deviate some from the model but not enough for one to
state that the model is obviously inadequate. In this latter situation one must determine whether
the deviations represent true inadequacies in the model, or whether the deviations are chance
variations from the true model.
The general procedure to be followed in making statistical tests is
7. Determine if the calculated value of the test statistic falls in the rejection region of the
distribution of the test statistic.
Table 8.1. Errors in hypothesis testing
For many statistical tests, steps 2-4 have been completed and may be found in a wide
variety of statistics books. For many of the tests that a hydrologist might like to make, adequate test
statistics and their distributions have not been determined, largely because of restrictive
assumptions. Nonparametric tests relieve this problem to some extent.
It is not possible to develop tests that are absolutely conclusive. All of the tests have a
possibility of two kinds of error: rejecting a true hypothesis (Type I error) or accepting a false
hypothesis (Type II error). Table 8.1 depicts the two types of errors. The probability of a Type I
error is denoted by $\alpha$ and the probability of a Type II error by $\beta$. The significance level is defined
as $100\alpha$ (in percent). In testing hypotheses, the probability of a Type I error can be
specified; however, the probability of a Type II error is not known unless the true parameter values
being tested are known. In general, as the value of $\alpha$ decreases, the magnitude of $\beta$ increases.
As an example, assume we select an observation $x_0$ at random from a normal distribution
with variance $\sigma_0^2$ and hypothesize that the distribution has a mean $\mu_0$. The test statistic could be $x_0$
itself, which has a normal distribution with unknown mean and variance $\sigma_0^2$. If the hypothesis is
true (something that is not known or the test would not be made), the distribution of the test
statistic would be a normal distribution with mean $\mu_0$ and variance $\sigma_0^2$ and would appear as in
figure 8.3. If it is decided to accept the hypothesis if $x_0$ is within 2 standard deviations of $\mu_0$ and
reject the hypothesis otherwise, the critical region or rejection region would be the shaded area in
figure 8.3. From the properties of the normal distribution, it is known that 95.44% of the area of
the normal curve is within 2 standard deviations of the mean, so the critical region occupies 4.56%
of the area. It is also apparent that there is a 4.56% chance that $x_0$ will be in the critical region and
the hypothesis rejected even though it is true. Thus, by definition $\alpha = 0.0456$, or there is a 4.56%
chance of making a Type I error due to random variation in the $x_0$ selected. It is more common to
specify $\alpha$ and from this information determine the critical region. For example, if one wanted $\alpha$ to
be 0.10, then the critical region would be $|(x_0 - \mu_0)/\sigma_0| > 1.645$, where 1.645 is the value of the
standard normal distribution such that the area outside the limits $-1.645$ to 1.645 is 0.10.
Fig. 8.3. Critical region (the tails of the normal curve beyond $\mu_0 - 2\sigma_0$ and $\mu_0 + 2\sigma_0$).
In order to evaluate $\beta$, the true parameter values must be known. Again, consider selecting
a single value $x_0$ from a normal population with variance $\sigma_0^2$ and an unknown mean. Let the
hypothesis be that $\mu = \mu_0$ and the alternative be $\mu \neq \mu_0$. If $\mu$ actually equals $\mu_1$, then the
situation depicted in figure 8.4 would exist and there is a $100\beta$% chance that $x_0$ will fall in the
acceptance region of $N(\mu_0, \sigma_0^2)$ and thus a Type II error will be committed. From figure 8.4 it can be seen
that as $\alpha$ is increased, $\beta$ will decrease. It can also be seen that the nearer $\mu_1$ is to $\mu_0$, the greater
will be $\beta$. This is because it is increasingly difficult to tell the difference between the two
distributions. In practice it is not possible to determine the magnitude of $\beta$ because it is a function of the
unknown population mean $\mu_1$. Example 8.3 shows how $\beta$ can be evaluated if $\mu_1$ is known. Of course, $\mu_1$
would not be known or else one would not hypothesize $\mu = \mu_0$.
Example 8.3. Assume a single observation is selected from a normal distribution with mean
$\mu_1 = 7$ and variance $\sigma_0^2 = 9$. It is hypothesized that $\mu = \mu_0 = 5$. If the test is conducted at the
10% significance level, what is $\beta$?
Solution:
Reference should be made to figure 8.5. If $x_u$ is the boundary of the upper critical region, we have
$(x_u - 5)/3 = 1.645$, or $x_u = 9.9355$. $A_u$ is the area of a normal distribution with mean 7 and
variance 9 to the left of 9.9355, for which $z_u = (9.9355 - 7)/3 = 0.978$.
The area to the left of $z_u = 0.978$ from a standard normal distribution is 0.8365. Similarly, if $x_l$
is the boundary of the lower critical region, we have $(x_l - 5)/3 = -1.645$, or $x_l = 0.0645$. $A_l$ is the
area of a normal distribution with mean 7 and variance 9 to the left of 0.0645. $z_l = (0.0645 - 7)/3$
or $z_l = -2.31$. $A_l = 0.0104$. Now $\beta = A_u - A_l$ or $\beta = 0.8365 - 0.0104 = 0.8261$. Thus, the
probability of accepting the hypothesis that $\mu = 5$ when in fact $\mu = 7$ is 0.8261 when $\alpha$ is 0.10. The
probability of a Type II error is 0.8261.
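The calculation in example 8.3 generalizes to any true mean, which is how a curve of Type II error probability against the true mean can be traced. The following Python sketch is an illustration, not the author's procedure: the function name `type_ii_error` and its arguments are hypothetical, and it assumes the two-sided test on a single normal observation described above. Because it uses exact normal probabilities rather than table lookups, it returns about 0.826 for the example, which agrees with 0.8261 to within table rounding.

```python
from statistics import NormalDist


def type_ii_error(mu1, mu0, sigma, alpha):
    """Probability that a single observation from N(mu1, sigma^2) falls in the
    acceptance region of a two-sided test of mu = mu0 at level alpha."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lower = mu0 - z_crit * sigma          # boundary of the lower critical region
    upper = mu0 + z_crit * sigma          # boundary of the upper critical region
    true_dist = NormalDist(mu1, sigma)
    return true_dist.cdf(upper) - true_dist.cdf(lower)


beta = type_ii_error(mu1=7, mu0=5, sigma=3, alpha=0.10)
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")    # beta is about 0.826

# Type II error probability and power for a range of true means.
for mu1 in range(-10, 21, 5):
    b = type_ii_error(mu1, mu0=5, sigma=3, alpha=0.10)
    print(f"mu1 = {mu1:4d}   beta = {b:.3f}   power = {1 - b:.3f}")
```

When the true mean equals the hypothesized mean the function returns $1 - \alpha$, the probability of correctly accepting a true hypothesis.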
If calculations such as those contained in example 8.3 are carried out for various values of
$\mu_1$, a curve relating $\beta$ to $\mu_1$ can be constructed. Such a curve is shown in figure 8.6. Figure 8.6
shows the $\beta$ curve for $\alpha = 0.05$ and $\alpha = 0.10$. Curves such as shown in figure 8.6 are often
called operating characteristic (OC) curves.
Figure 8.6 verifies the earlier statements that $\beta$ increases as $\alpha$ decreases and $\beta$ increases as
the true mean, $\mu_1$, approaches the hypothesized mean, $\mu_0$. In fact, as $\mu_1$ gets close to $\mu_0$, the
probability of accepting $\mu = \mu_0$ when $\mu = \mu_1$ is true gets very large. This may not be a serious
problem in practice because we may not care, for instance, whether $\mu$ is 5 or 5.5.
Fig. 8.6. Probability of a Type II error as a function of the true mean, $\mu_1$, for example 8.3.
Fig. 8.7. Example power curve (power = $1 - \beta$ plotted against $\mu_1$).
The quantity $1 - \beta$ is called the power of a test. Ideally, we would like the power to be large
for all values of $\mu_1$. In fact in testing a hypothesis, we would like $\alpha$ to be small and the power to be
large. Figure 8.7 shows that the power of a test is a function of $\alpha$ and the true parameter values. The power
of a test is also a function of the test itself. For instance, we could have chosen as our test statistic
$x_0 + 3$ and then rejected the hypothesis if $x_0 + 3$ fell in the critical region. Figure 8.7 compares the
power of this test with the test that rejected the hypothesis if $x_0$ fell in the critical region.
Figure 8.7 shows that for certain values of $\mu_1$, the $x_0 + 3$ test is more powerful than the $x_0$
test. Ideally, we would like to use the test that was the most powerful over the entire range of the
unknown parameter. Such a test is known as a uniformly most powerful test. Unfortunately,
uniformly most powerful tests do not exist in many situations.
Selecting which test to use comes down to the purposes of the test and the consequences of
making an error. In our example, if accepting the hypothesis $\mu = 5$ when in fact $\mu > 5$ is a very
serious error, whereas accepting it if $\mu < 5$ is of little consequence, we might prefer the $x_0 + 3$
test because it is more powerful in the region $\mu > 5$. If the consequence of an error depended only
on the magnitude of the error, the $x_0$ test might be preferred.
From the above discussion, it should be apparent that the selection of $\alpha$ and the type of test
to be used depends on the problem at hand. Mood et al. (1974) discuss these concepts in more
theoretical terms. The level of significance, $\alpha$, is usually chosen to be 0.10, 0.05, or 0.01. In
theory, $\alpha$ should be based on the problem at hand. In practice, $\alpha$ is generally arbitrarily selected.
Many tests of hypothesis are of the type $\theta = \theta_1$ versus the alternative $\theta \neq \theta_1$. Accepting
such a hypothesis as true does not mean that one strictly feels that $\theta = \theta_1$ but rather that $\theta$ is not
significantly different from $\theta_1$. For example, if we calculate the mean of a random sample and then
accept the hypothesis that the true mean is 5, we may not believe that the true mean is exactly 5 but
rather the true mean is not significantly different from 5. What constitutes a significant difference
has been defined by the type of test used and the level of significance. Furthermore, a statistically
significant difference and a physically significant difference are not the same. For example, if
$\hat{\theta} = 4.0$ is an estimate for $\theta$ and a test of hypothesis shows $\hat{\theta}$ is not significantly different from
zero, it does not mean $\theta = 0$ should be used in some physical analysis if this physical analysis is
sensitive to differences in $\theta$ of this order of magnitude. A physically significant difference depends
on the problem being studied.
The following is a discussion of several common tests of hypotheses. The hypothesis to be
tested is denoted by $H_0$ and the alternative hypothesis by $H_a$. For the tests that follow to be
correct statistical tests, the assumptions involved in developing the test statistic must not be violated.
A primary assumption is that the statistics are estimated based on a random sample. In practice,
at least some of the assumptions are generally violated with the result that the tests are only
approximate tests. This approximation is manifest in the fact that the actual level of significance
will not be equal to $100\alpha$%. The fact that these tests are often approximate because of assumption
violations does not render the tests of no value. It is the analyst who must make the decision, not a
statistical test following some prescribed procedure. The analyst may, however, put less weight on a
statistical test in arriving at a decision if the violations of the assumptions of the statistical test are of
concern.
and

    $\bar{x} > \mu_1 + t_{1-\alpha,\,n-1}\, s_x/\sqrt{n}$    for $\mu_1 < \mu_2$    (8.12)
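$H_0$ is rejected if

    $|z| = \left| \dfrac{\bar{x} - \mu_0}{\sigma_x/\sqrt{n}} \right| > z_{1-\alpha/2}$

    $t = \dfrac{\bar{x} - \mu_0}{s_x/\sqrt{n}}$

$H_0$ is rejected if

    $|t| = \left| \dfrac{\bar{x} - \mu_0}{s_x/\sqrt{n}} \right| > t_{1-\alpha/2,\,n-1}$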
This test cannot be applied to every set of data. The assumption has been made that the
observations are from a normal distribution.
  
Example 8.4. The annual runoff for Cave Creek near Fort Spring, Kentucky, for the period 1953
to 1970, has a mean of 14.65 inches and a standard deviation of 4.75 inches. Test the hypothesis
that the mean annual runoff is 16.5 inches.
Solution: The testing procedures we have available to us are all based on the assumption of
normality. If we assume the annual runoff is normally distributed, we can use equation 8.14 to
test $H_0$: $\mu = 16.5$ versus $H_a$: $\mu \neq 16.5$.
There are 18 observations. The test statistic is $t = (14.65 - 16.5)/(4.75/\sqrt{18}) = -1.65$.
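For example 8.4, the statistic $t = (\bar{x} - \mu_0)/(s_x/\sqrt{n})$ can be evaluated directly from the summary values given. The sketch below is a hedged illustration: the significance level $\alpha = 0.05$ is an assumption (the level used in the original solution is not shown here), and the critical value 2.110 is the standard tabled value of $t_{0.975,\,17}$ entered by hand, since the Python standard library has no t distribution. The function name is hypothetical.

```python
from math import sqrt


def one_sample_t(xbar, s, n, mu0):
    """t statistic for testing that a normal population mean equals mu0."""
    return (xbar - mu0) / (s / sqrt(n))


# Cave Creek annual runoff, 1953 to 1970: 18 observations.
t_stat = one_sample_t(xbar=14.65, s=4.75, n=18, mu0=16.5)

# Assumed significance level alpha = 0.05; tabled t value for 17 degrees of freedom.
t_crit = 2.110
reject = abs(t_stat) > t_crit

print(f"t = {t_stat:.2f}, critical value = {t_crit}, reject H0: {reject}")
# t is about -1.65, so under this assumed alpha the hypothesis is not rejected.
```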
In this case, Z has a standard normal distribution, so the rejection region is $|z| > z_{1-\alpha/2}$.
If the variances of the two normal distributions are equal but unknown, the $H_0$: $\mu_1 - \mu_2 = \delta$
versus $H_a$: $\mu_1 - \mu_2 \neq \delta$ is tested by calculating the statistic
Again, note that these two tests are based on sample normality. For large samples, the Central
Limit Theorem may enable us to use these tests as approximate tests for nonnormal samples.
Gibra (1973), Ostle (1963) and others discuss testing the $H_0$: $\mu_1 - \mu_2 = \delta$ versus $H_a$: $\mu_1 - \mu_2 \neq \delta$
when sampling from two normal populations with unknown and unequal variances. Ostle
recommends the following approximate procedure. Compute the test statistic
The hypothesis is rejected if
where
    $w_1 = s_1^2/n_1$
Otherwise $H_0$ is rejected.
where $s_1^2$ is the larger sample variance. F has an F distribution with $n_1 - 1$ and $n_2 - 1$
degrees of freedom, where $n_1$ is the sample size for the sample having the larger variance and $n_2$ is
the sample size for the sample with the smaller variance. $H_0$ is rejected if
and
$H_0$ is rejected if
In this test, $H_a$ is that the $\sigma_i^2$ are not all equal. This means that at least one $\sigma_i^2$ is different from
the others.
The test is known as Bartlett's test for homogeneity of variances. Homogeneity of
variance is also known as homoscedasticity.
Example 8.5. For the preceding example, test the hypothesis that the variance is 36.00.
Solution: The assumption of normality is used. The test is based on equation 8.18 using $\alpha = 0.05$.
hypothesized relative frequency curve. The second method was to plot the data and the
hypothesized distribution as a cumulative probability distribution on appropriate paper and judge as to
whether or not the hypothesized distribution adequately describes the plotted points. Statistical
tests corresponding to these visual tests will be discussed. In the following discussion, the
hypothesis being tested is that the data are from a specified probability distribution.
where k is the number of class intervals, and $O_i$ is the observed and $E_i$ the expected (according to
the distribution being tested) number of observations in the ith class interval. The distribution
of $\chi_c^2$ is a chi-square distribution with $k - p - 1$ degrees of freedom, where p is the number of
parameters estimated from the data. The hypothesis that the data are from the specified
distribution is rejected if

    $\chi_c^2 > \chi^2_{1-\alpha,\,k-p-1}$
Example 8.6. As an example of using the chi-square test, consider the Kentucky River data of
table 2.1 and test the hypothesis that the data are from a normal distribution. The observed and
expected numbers in each class interval are obtained by multiplying the relative frequency by 99,
which is the number of observations. Table 8.2 shows the calculation of $\chi_c^2$. The degrees of
freedom is $k - 3$, or 7, since two parameters ($\mu_x$ and $\sigma_x^2$) were estimated for the normal
distribution. Comparing $\chi_c^2$ of 4.98 with $\chi^2_{0.90,\,7} = 12.0$, it is concluded that the normal distribution
cannot be rejected for these data for $\alpha = 0.10$. If $\chi_c^2$ had exceeded $\chi^2_{1-\alpha,\,k-p-1}$, the hypothesis that the
normal distribution describes the data would be rejected.
In constructing table 8.2 the expected number in a class interval is based on $n[P_X(x_i) - P_X(x_{i-1})]$
for all intervals except the first and last ones. For the first interval the expected number
is $nP_X(x_1)$ and for the last interval $n[P_X(\infty) - P_X(x_{k-1})]$. In these expressions $x_i$ represents the
right boundary of the ith class.
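The bookkeeping for expected class counts and the goodness-of-fit statistic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of table 8.2: the class boundaries and observed counts below are hypothetical values chosen only so that the counts total 99, the function names are invented for the sketch, and the statistic is the usual sum of $(O_i - E_i)^2/E_i$ over classes. The mean and standard deviation are the sample values quoted in example 8.1.

```python
from statistics import NormalDist


def expected_counts(right_boundaries, n, dist):
    """Expected numbers per class: n times the probability of each class.

    right_boundaries holds x_1 ... x_(k-1); the first class runs from minus
    infinity to x_1 and the last class from x_(k-1) to plus infinity.
    """
    cumulative = [dist.cdf(x) for x in right_boundaries] + [1.0]
    expected, previous = [], 0.0
    for p in cumulative:
        expected.append(n * (p - previous))
        previous = p
    return expected


def chi_square_statistic(observed, expected):
    """Sum over classes of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


dist = NormalDist(mu=66_540, sigma=22_322)           # fitted normal distribution
boundaries = [40_000, 60_000, 80_000, 100_000]       # hypothetical class boundaries
observed = [12, 30, 32, 18, 7]                       # hypothetical counts, total 99

expected = expected_counts(boundaries, n=99, dist=dist)
chi2_c = chi_square_statistic(observed, expected)
print([round(e, 1) for e in expected], round(chi2_c, 2))
```

The computed statistic would then be compared with a tabled chi-square value with $k - p - 1$ degrees of freedom, here $5 - 2 - 1 = 2$.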
Comment: By examining table 8.2 and equation 8.21, it is apparent that the chi-square
goodness-of-fit test is quite sensitive in the tails of the assumed distribution. Because of this many
statisticians recommend that classes be combined if the expected number in a class is less than 3 (or 5).
If the criterion of 5 is used, the first two classes and the last two classes must be combined. The
calculation of $\chi_c^2$ is then as shown in table 8.3, and the $\chi_c^2$ value is reduced to 3.62. The degrees of
freedom are reduced to 5.
Perhaps a better way of conducting the chi-square goodness-of-fit test is to define the class
intervals so that under the hypothesis being tested the expected number of observations in each
class interval is the same. This means that the class intervals will be of unequal width and that the
interval widths will be a function of the distribution being tested.
Example 8.7. A chi-square test for normality of Kentucky River data using 10 class intervals
each having the same expected frequency can be conducted as follows.
Ten class intervals means that the expected relative frequency or probability in each interval
is 0.1. The class boundaries can be determined by solving the inverse of the cumulative
distribution. For instance, the boundaries of the 4th class interval are given by the values of x satisfying
$P_X(x) = 0.3$ and $P_X(x) = 0.4$.
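Solving the inverse of the cumulative distribution for equally probable classes is straightforward to sketch in Python. The example below is illustrative: it assumes a normal distribution with the sample mean and standard deviation quoted in example 8.1 (66,540 and 22,322), and the function name is hypothetical. `NormalDist.inv_cdf` from the standard library supplies the inverse cumulative distribution.

```python
from statistics import NormalDist


def equal_probability_boundaries(dist, k):
    """Interior class boundaries giving k classes of probability 1/k each."""
    return [dist.inv_cdf(i / k) for i in range(1, k)]


dist = NormalDist(mu=66_540, sigma=22_322)   # assumed fitted normal distribution
bounds = equal_probability_boundaries(dist, k=10)

# The 4th class lies between the solutions of P(x) = 0.3 and P(x) = 0.4.
fourth_class = (bounds[2], bounds[3])
print([round(b) for b in bounds])
print(f"4th class: {fourth_class[0]:,.0f} to {fourth_class[1]:,.0f} cfs")
# roughly 54,800 to 60,900 cfs
```

With 99 observations each of the 10 classes has an expected count of 9.9, so no classes need to be combined.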
Table 8.4. Chi-square test based on equal expected numbers per class interval
Table 8.4 contains the data for conducting the chi-square test based on 10 class intervals
having equal expected numbers of observations (99/10 or 9.9) in each interval. In this case, $\chi_c^2$
is 10.60, which is less than $\chi^2_{0.90,\,7}$ of 12.02. The hypothesis is, again, not rejected.
1. Let Px(x) be the completely specified theoretical cumulative distribution function under the
null hypothesis.
2. Let $S_n(x)$ be the sample cumulative distribution function based on n observations. For any
observed x, $S_n(x) = k/n$, where k is the number of observations less than or equal to x.
3. Determine the maximum deviation, D, defined by $D = \max |P_X(x) - S_n(x)|$.
4. If, for the chosen significance level, the observed value of D is greater than or equal to the
critical tabulated value of the Kolmogorov-Smirnov (KS) statistic, the hypothesis is rejected.
Critical values of the Kolmogorov-Smirnov test statistic are included in the appendix.
This test can be conducted by calculating the quantities Px(x) and Sn(x) at each observed
point, or by plotting the data as in figures 7 . 3 and
~ d and selecting the greatest deviation on the
probability scale of a point from the theoretical line. If the latter approach is used, care must be
taken to select the largest deviation on the probability scale, which is not necessarily linear. The
largest deviation of the empirical distribution from the known distribution is sought. The
empirical distribution gives $\mathrm{prob}(X \le x)$ and is thus a step function with steps at each data point.
Rather than falling on a data point, the largest deviation may be at a point where the probability
takes a step change. Example 8.8 and figure 8.8 illustrate the determination of this maximum
deviation, which in this case occurs just prior to X = 18 and has a value of 0.29 on the
probability scale. In this case, the known distribution is an exponential distribution. There are eight data
points. The critical value for the KS statistic with $\alpha = 0.10$ is 0.411. Thus, the hypothesis that
the data are from this particular distribution cannot be rejected.
When few observations are available, it is very difficult to use a statistical test to find an
appropriate distribution for the data. In figure 8.8 only 8 observations are available. Obviously,
the chi-square test cannot be used because adequate data for grouping are not available. As
already seen, the KS test is insensitive for a sample of this size since it requires a large deviation
to reject the hypothesis with this small sample. If the KS test is used to test the hypothesis
that these data came from a uniform distribution or a normal distribution, these hypotheses could
not be rejected either. With small samples, the power of the KS test is not very great and the
probability of failing to reject a false hypothesis, a Type II error, is great.
Example 8.8. Consider the data values 18, 29, 45, 56, 50, 40, 20, 10. Test the hypothesis that the
data are from an exponential distribution with known mean of 33.5.
The critical value is 0.411 for n = 8 and $\alpha = 0.10$. The hypothesis cannot be rejected.
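The maximum deviation for example 8.8 can be reproduced with a short Python sketch. It is offered as an illustration with a hypothetical function name, `ks_statistic`. Because the empirical distribution is a step function, the sketch compares the hypothesized cumulative probability at each data point with the step heights both just before and just after that point. It assumes there are no tied observations, which holds for these data.

```python
from math import exp


def ks_statistic(data, cdf):
    """Largest absolute deviation between the empirical step function and a
    completely specified cumulative distribution function."""
    xs = sorted(data)
    n = len(xs)
    d_max = 0.0
    for i, x in enumerate(xs, start=1):
        p = cdf(x)
        # Step height is (i - 1)/n just before x and i/n at x and after.
        d_max = max(d_max, abs(i / n - p), abs((i - 1) / n - p))
    return d_max


data = [18, 29, 45, 56, 50, 40, 20, 10]
mean = 33.5                                   # known mean of the exponential


def exponential_cdf(x):
    return 1.0 - exp(-x / mean)


d = ks_statistic(data, exponential_cdf)
critical = 0.411                              # tabled value for n = 8, alpha = 0.10
print(f"D = {d:.2f}, reject: {d >= critical}")   # D is about 0.29
```

The largest deviation occurs just before x = 18, where the step function is still at 1/8 while the exponential distribution has already reached about 0.42.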
Note that for the Kolmogorov-Smirnov test, $P_X(x)$ is a completely specified cumulative
probability distribution. That is, no parameters of the distribution are estimated from the
observed data. Crutcher (1975) points out that when parameters must be estimated to specify
$P_X(x)$, the Kolmogorov-Smirnov test is conservative with respect to the Type I error. That is, if
the critical value is exceeded by the test statistic obtained from the observed values, the
hypothesis is rejected with considerable confidence. Crutcher (1975) presents a table of critical values
for sample sizes of 25 and 30 as well as infinitely large samples for the exponential, gamma,
normal, and extreme value distributions when parameters of these distributions must be estimated.
In general, these critical values are smaller than the values given in the Kolmogorov-Smirnov
table in the appendix.
Conover (1980) discusses Lilliefors's extension of the KS test to the normal distribution
with mean and variance estimated from the data (Lilliefors, 1967) and the exponential
distribution with mean estimated from the data (Lilliefors, 1969). The tests are conducted as with the
KS except that the critical values are smaller. Conover (1980) presents tables for the required
critical values. Based on data in Conover, letting KS represent the critical value of the
Kolmogorov-Smirnov statistic and L represent the critical value for the Lilliefors test, the
approximation L = a + bKS can be used where a and b are given in the following table for 4 to
30 observations. For n greater than 30 the approximation $L = c/\sqrt{n}$ from Conover (1980) yields
reasonable estimates for the critical values.
Distribution    $\alpha$    a    b    c
Solution: The calculated mean is 33.5, so the observed maximum deviation is 0.29, as before. If
the calculated mean had been other than 33.5 the values for $P_X(x)$ would change. The critical D
based on the Lilliefors test using the exponential distribution and the approximation above is
computed with $\alpha = 0.10$. The tabled value for KS is 0.411. Therefore L is found to be 0.003 +
0.780(0.411) or 0.324. The hypothesis cannot be rejected.
Example 8.10. Test the hypothesis that the Kentucky River peak flow data are normally distrib
uted. Use the KolmogorovSmirnov test.
Solution: The data are plotted in figure 8.9. The maximum deviation between the best fitting
line, $P_X(x)$, and the plotted points, $S_n(x)$, on the probability scale is about 0.074 at X = 55,200
cfs (table 8.5). Because the test for normality is being done and the mean and variance are
estimated from the data, Lilliefors's approach is used. For $\alpha = 0.10$ and $n = 99$, the critical value
for the Lilliefors statistic is $0.805/\sqrt{99}$ or 0.081. Table 8.5 shows the calculations needed to find
the maximum deviation. The maximum deviation is the maximum value in the columns under
Fig. 8.9a. Normal probability plot of Kentucky River data on annual flow.
Fig. 8.9b. Lognormal probability plot of Kentucky River data on annual flow.
Table 8.5. continued
87100 0.82
87200 0.83
88900 0.84
89400 0.85
91500 0.86
92500 0.87
93700 0.88
94300 0.89
96100 0.9
98400 0.91
99100 0.92
101000 0.93
105000 0.94
107000 0.95
111000 0.96
112000 0.97
115000 0.98
144000 0.99
Max dev
$S(x) - P_X(x)$ and $S(x - 1) - P_X(x)$. Because the largest value is less than the critical value, the
hypothesis of a normal distribution cannot be rejected.
Other tests for normality include the Shapiro-Wilk test (Conover 1980) and a test based
on the correlation between the standardized Z-value associated with the plotting position and the
values of the observations (Helsel and Hirsch, 1992). The critical correlation values are shown in
figure 8.10. The test is credited to Looney and Gulledge (1985). For this test the Weibull plotting
position should not be used.
Table 8.6 shows the Kentucky River data, the plotting positions calculated from the
Cunnane (1978) relationship, and the Z-values associated with the plotting positions. The
correlation between the flow and the Z-values is 0.989. Figure 8.10 shows that for 99 observations and
$\alpha = 0.05$, the critical correlation value is about 0.98. Thus, the hypothesis of normality cannot
be rejected. It is interesting to note that the correlation between the logarithms of the values and
the Z-values is 0.993, indicating that a hypothesis of lognormality cannot be rejected either.
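The correlation test described above can be sketched as follows. This is an illustration under stated assumptions, not the calculation behind table 8.6: the plotting position is assumed to be the commonly quoted Cunnane form (i - 0.4)/(n + 0.2), the flow values are hypothetical, and the function names are invented for the example. The critical correlation still has to be read from figure 8.10.

```python
import math
from statistics import NormalDist

def cunnane_positions(n):
    """Plotting positions (i - 0.4) / (n + 0.2) for ranks i = 1..n (assumed Cunnane form)."""
    return [(i - 0.4) / (n + 0.2) for i in range(1, n + 1)]

def pearson_r(u, v):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

def probability_plot_correlation(data, log_transform=False):
    """Correlation between the ordered observations and their standard normal Z-values."""
    values = sorted(math.log(x) for x in data) if log_transform else sorted(data)
    z_values = [NormalDist().inv_cdf(p) for p in cunnane_positions(len(values))]
    return pearson_r(values, z_values)

# Hypothetical peak flows (thousands of cfs), for illustration only.
flows = [31.2, 45.9, 52.3, 60.1, 66.4, 71.8, 80.5, 93.7, 110.2, 144.0]
r_normal = probability_plot_correlation(flows)
r_lognormal = probability_plot_correlation(flows, log_transform=True)
print(f"normal: {r_normal:.3f}, lognormal: {r_lognormal:.3f}")
```

Normality is rejected when the computed correlation falls below the critical correlation for the sample size and significance level in use.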
Example 8.11. The following data are from two independent samples. Test the hypothesis that the
underlying distribution is the same for both samples.
X = 2.25, 2.63, 3.09, 3.47, 3.76, 4.01, 4.14, 5.51, 6.10, 6.33
Y = 5.37, 5.60, 6.33, 8.90
Table 8.6. Kentucky River data
The maximum deviation is 0.70. From Conover (1980), with $\alpha = 0.10$, the critical test
value is 13/20, or 0.65. Thus, one can reject the hypothesis that the two samples are from the
same distribution. Conover (1980) can be consulted for more details on this test and for a
companion one-sided test.
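The maximum deviation quoted in example 8.11 can be reproduced with a minimal sketch of the two-sample comparison of empirical distribution functions. The function names are hypothetical; the data and the critical value 13/20 are those given in the example.

```python
from bisect import bisect_right

def ecdf(sorted_sample, x):
    """Empirical distribution function: fraction of the sample less than or equal to x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)

def two_sample_max_deviation(x, y):
    """Largest absolute difference between the two empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)
    return max(abs(ecdf(xs, v) - ecdf(ys, v)) for v in xs + ys)

X = [2.25, 2.63, 3.09, 3.47, 3.76, 4.01, 4.14, 5.51, 6.10, 6.33]
Y = [5.37, 5.60, 6.33, 8.90]

d = two_sample_max_deviation(X, Y)
critical = 13 / 20          # from Conover (1980) at alpha = 0.10, as quoted in the example
print(round(d, 2), d > critical)
```

The deviation of 0.70 occurs at 4.14, where seven of the ten X values have been passed and none of the Y values.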
Exercises
8.1. A sample of 20 random observations produced a mean of 145 and a variance of 30. What
are the 95% confidence intervals on the mean assuming a normal distribution if (a) the true
variance is estimated by 30; (b) the true variance is 30. Discuss the reason you feel that the
confidence intervals computed for part (a) are wider than for part (b).
8.2. What are the 95% confidence intervals on the variance for the samples of exercise 8.1?
8.3. Test the hypothesis that the true mean of the data producing the sample whose properties are
given in exercise 8.1 is 165.
8.4. Discuss any connection between hypothesis testing and confidence intervals that you can
discern. What are the differences?
8.5. Assuming the data are normally distributed, test the hypothesis that the mean peak discharge
on the Kentucky River near Salvisa (table 2.1) for the period 1895-1916 is different from that for
the period 1939-1960.
8.7. Using the data of table 2.1, test the hypothesis that the variances of the peak discharges are
the same for the three periods 1895-1916, 1917-1938, and 1939-1960.
8.8. Test the hypothesis that the mean monthly rainfall for September and October are the same
on the Walnut Gulch watershed near Tombstone, Arizona. What assumptions did you make? Are
these assumptions reasonable?
8.10. Test the hypothesis that the difference in the mean monthly rainfall on Walnut Gulch near
Tombstone, Arizona, for September and October is 0.50 inches. Discuss the validity of the
assumptions that are made.
8.11. Test the hypothesis that monthly rainfall in October on the Walnut Gulch watershed near
Tombstone, Arizona, is normally distributed.
8.12. Test the hypothesis that annual rainfall on the Walnut Gulch watershed near Tombstone,
Arizona, is normally distributed.
8.13. Comment on the results of exercises 8.11 and 8.12 in terms of the Central Limit Theorem.
8.14. Would the plotting position relationship used in exercise 7.6 have any effect on the results
of a test for normality on the data set you selected?
8.16. Use the Kolmogorov-Smirnov test to test the three sets of data plotted in
exercise 7.11 for normality.
8.17. Use the Kolmogorov-Smirnov test to test the three sets of data plotted in
exercise 7.12 for lognormality.
8.20. What distribution do you think would fit the data of exercise 2.2? Use the chi-square test
to evaluate your assertion.
8.21. The following are experimentally determined values of Manning's n for plastic pipe as
determined by Haan (1965). Test the hypothesis that the mean value of n is different from the
recommended design value of 0.0090.
9. Simple Linear Regression
NOTATION
IN THIS chapter an upper case letter will represent a variable, a lower case letter will represent
the difference between a variable and its mean, and a subscript will be used to denote a particular
value for the variable. Thus $Y$ represents a variable which may take on values $Y_1$, $Y_2$, $Y_3$, and so on.
$\bar{Y}$ is the mean of $Y$, $y = Y - \bar{Y}$, and $y_i = Y_i - \bar{Y}$. Parameters are denoted by Greek letters and a
corresponding English letter is used to denote an estimate for the parameter. Thus $\alpha$ is a parameter
estimated by $a$ ($\hat{\alpha} = a$). The lower case letter $e$ will be used to denote the difference between an
observed value of $Y$ and its predicted value $\hat{Y}$. Thus $Y - \hat{Y} = e$ and $Y_i - \hat{Y}_i = e_i$. All summations
in this chapter will run from 1 to $n$ unless otherwise specified, where $n$ is the number of observations
on $Y$ and $X$.
SIMPLE REGRESSION
Possibly the most common model used in hydrology is based on the assumption of a linear
relationship between two variables. Generally, the objective of such a model is to provide a
means of predicting or estimating one variable, the dependent variable, from knowledge of a
second variable, the independent variable. The statistical procedure used for determining a linear
relationship between two variables is known as regression. Often the term regression is reserved
for use when all of the X variables being considered are random variables. In this book liberties
will be taken and the term applied whether or not the X variables are random variables. As used
in this chapter, dependent and independent are not the same as dependence or independence of
random variables. Here, dependent means that the variable may be expressed as a (linear)
[Figure 9.1: plot against annual precipitation (in.) showing the data, the mean, the regression line $\hat{Y} = a + bX$, the 95% CI on the regression line, and the 95% CI on an individual predicted value.]
Fig. 9.1. Annual rainfall-runoff relation for Cave Creek.
function of a second variable known as the independent variable. Obviously, if the variables are
strictly independent in a statistical sense, one variable would give no information about the other.
Figure 9.1 shows a situation where it may be desirable to find a linear relationship between the
annual runoff, $Y$, and the annual precipitation, $X$, for Cave Creek near Lexington, Kentucky. The
annual runoff is the dependent variable and the annual precipitation is the independent variable.
The data used in constructing figure 9.1 are contained in table 9.1.
Table 9.1. Annual precipitation and runoff for Cave Creek, near Lexington, Kentucky
Does the model $Y = \alpha + \beta X + \varepsilon$ adequately represent the relationship between $Y$ and $X$? For what
values of $\alpha$ and $\beta$ is the representation the best? Here $\varepsilon$ is the difference between $Y$ and $\alpha + \beta X$.
In looking at the question of the "best" straight line, a criterion for judging "bestness" is
needed. One intuitive criterion would be to estimate $\alpha$ and $\beta$ by $a$ and $b$ so as to minimize the
deviation $e_i$ between the observed values of $Y$, $Y_i$, and the predicted values of $Y$, $\hat{Y}_i$. In this way,
values for $a$ and $b$ would be sought that minimize the sum

$$\sum e_i = \sum (Y_i - \hat{Y}_i) \tag{9.2}$$
Closer scrutiny of equation 9.2 reveals that it is not desirable to minimize the sum in an algebraic
sense because that would be equivalent to finding an $a$ and $b$ such that $\sum e_i$ is $-\infty$.
Another criterion might be to find an $a$ and $b$ such that $\sum e_i$ is zero. The fallacy with this can
be seen by considering two points. If the line $\hat{Y} = a + bX$ goes through the two points, then $\sum e_i$
would be zero; however, the sum is also zero for any line that overpredicts one point by the same
amount that it underpredicts the second point. Thus, there is an infinity of lines such that
$\sum e_i = 0$, and an additional restriction or criterion is needed to select a single line.
The $\sum e_i$ may be positive or negative. A criterion that is not sign dependent is needed. Such
a criterion might be to minimize $\sum |e_i|$ or to minimize $\sum e_i^2$. Since absolute values are difficult to
work with mathematically, the second criterion is generally selected. Thus it is desired to
estimate $\alpha$ and $\beta$ by $a$ and $b$ such that $\sum e_i^2$ is a minimum. Denoting this sum by $M$, we have

$$M = \sum e_i^2 = \sum (Y_i - a - bX_i)^2$$
This sum can be minimized with respect to a and b by taking the partial derivatives of M
with respect to a and b and setting the resulting equations equal to zero.
These equations can then be written in the following form, known as the normal equations.
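A sketch of the standard least squares derivation, written in the notation of this chapter, is given below. It is offered as a reconstruction rather than a transcription: the displayed forms are the usual textbook results, and their correspondence with equations 9.4 through 9.7 is inferred from the way the surrounding text cites those equations.

```latex
\begin{align*}
M &= \sum e_i^2 = \sum (Y_i - a - bX_i)^2 \\
\frac{\partial M}{\partial a} &= -2 \sum (Y_i - a - bX_i) = 0 \\
\frac{\partial M}{\partial b} &= -2 \sum X_i (Y_i - a - bX_i) = 0 \\
\intertext{Rearranging gives the normal equations}
\sum Y_i &= na + b \sum X_i \\
\sum X_i Y_i &= a \sum X_i + b \sum X_i^2 \\
\intertext{whose solution is}
b &= \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2}
   = \frac{\sum x_i y_i}{\sum x_i^2} \\
a &= \bar{Y} - b\bar{X}
\end{align*}
```

The first derivative condition is the statement that the residuals sum to zero, and the expression for $a$ is the statement that the fitted line passes through the point of means.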
Equations 9.6 and 9.7 provide estimates $a$ and $b$ such that $\sum e_i^2$ is a minimum. Because
the procedure is based on minimizing the error sum of squares, $\sum e_i^2$, the estimates $a$ and $b$ are
commonly called least squares estimates. Equation 9.4 indicates that this solution also satisfies
$\sum e_i = 0$. Equation 9.7 indicates that the line $\hat{Y} = a + bX$ goes through the point $Y = \bar{Y}$ and $X = \bar{X}$.
The line $\hat{Y} = a + bX$ is commonly known as the regression line of $Y$ on $X$. The procedure
of determining $a$ and $b$ is known as simple regression. The term "simple" regression is used when
only one independent variable is involved, as opposed to multiple regression when several
independent variables are involved. The parameter estimates, $a$ and $b$, are known as the regression
coefficients.
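The least squares calculation can be illustrated with a short sketch. The function name is hypothetical and the precipitation and runoff values are invented for illustration; they are not the Cave Creek data of table 9.1. The deviation form of the slope is used, which is one common way of writing the solution of the normal equations.

```python
def simple_regression(x, y):
    """Least squares intercept a and slope b for the line Y-hat = a + b*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross products of deviations from the means.
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum_xy / sum_xx
    a = y_bar - b * x_bar      # forces the line through (x_bar, y_bar)
    return a, b

# Hypothetical annual precipitation and runoff (in.), for illustration only.
precip = [30.0, 34.5, 38.2, 41.0, 44.6, 47.3, 50.1, 54.8]
runoff = [6.1, 9.8, 10.2, 14.9, 13.7, 17.5, 19.0, 22.4]

a, b = simple_regression(precip, runoff)
residuals = [yi - (a + b * xi) for xi, yi in zip(precip, runoff)]
print(f"a = {a:.3f}, b = {b:.3f}, sum of residuals = {sum(residuals):.2e}")
```

Apart from rounding, the residuals sum to zero and the fitted line passes through the point of means, which are the two properties noted in the text.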
Equations 9.6 and 9.7 show that $a$ and $b$ are functions of the sample values of $Y$ and $X$. If
another sample of observations were obtained and $a$ and $b$ were estimated from this sample,
different estimates would result. We have already seen that

$$e_i = Y_i - \hat{Y}_i = Y_i - a - bX_i$$

Similarly

$$\varepsilon_i = Y_i - \alpha - \beta X_i$$

Thus, $e_i$ represents the deviation between an observed $Y_i$ and its predicted value $\hat{Y}_i$ based on the
regression equation estimated from the particular sample of data at hand. $\varepsilon_i$ represents the
deviation between an observed $Y_i$ and the assumed true but unknown relation between $Y$ and $X$ given
by $Y = \alpha + \beta X$.
Example 9.1. Determine the regression coefficients for the data plotted in figure 9.1.
Solution: The data required for solving equations 9.6 and 9.7 are contained in table 9.2. The
equation used to calculate $b$ would depend on the method of calculation. If a small desk
calculator is used, the first of equations 9.6 might be employed. If an electronic calculator or computer
is used, the latter of equations 9.6 might be employed. Generally, less roundoff error will result
if the latter form of equation 9.6 is used. In practice, readily available software would be used.
13.26
3.31
15.17
15.50
14.22
21.20
7.70
17.64
22.91
18.89
12.82
11.58
15.17
10.40
18.02
16.25
Total 234.04
Average 14.63
Comment: The last two columns of table 9.2 contain $\hat{Y}_i$ and $Y_i - \hat{Y}_i$. Note that except for
rounding errors, the mean of the $\hat{Y}_i$ equals $\bar{Y}$, $\sum (Y_i - \hat{Y}_i) = \sum e_i = 0$, and $\bar{e} = 0$.
or

$$Y_i - \hat{Y}_i = (Y_i - \bar{Y}) - (\hat{Y}_i - \bar{Y})$$
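Through algebraic manipulations, it can be shown that

$$\sum Y_i^2 = n\bar{Y}^2 + \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$

The total sum of squares, $\sum Y_i^2$, has been partitioned into three components. These three
components are:
1. $n\bar{Y}^2$, the sum of squares due to the mean
2. $\sum (Y_i - \hat{Y}_i)^2 = \sum e_i^2$, the sum of squares of deviations from regression or the residual sum of
squares
3. $\sum (\hat{Y}_i - \bar{Y})^2$, the sum of squares due to regression
Moving the mean term to the left-hand side, and noting that $\sum (\hat{Y}_i - \bar{Y})^2 = b \sum x_i y_i$, gives

$$\sum y_i^2 = \sum e_i^2 + b \sum x_i y_i$$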
Therefore, the total sum of squares corrected for the mean is made up of two components:
the sum of squares of deviations from regression (also known as the error or residual sum of squares)
and the sum of squares due to regression. The larger the sum of squares due to regression in
comparison to the residual sum of squares, the more of the total sum of squares corrected for the mean
is explained by the regression equation. The ratio of the sum of squares due to regression to the total
sum of squares corrected for the mean can be used as a measure of the ability of the regression line
to explain variations in the dependent variable. This ratio is commonly denoted by $r^2$ and may be
written in a number of ways.
$r^2$ is called the coefficient of determination. If the regression equation perfectly predicts every
value of $Y_i$, then $e_i$ would be zero for every $i$ and $\sum e_i^2$ would be zero. Under these conditions,
equation 9.11 states that $\sum y_i^2 = \sum (\hat{Y}_i - \bar{Y})^2$, so that from equation 9.13 $r^2$ is seen to be one. On
the other hand, if the regression equation explains none of the variation in $Y$, then $\sum e_i^2$ will equal
$\sum y_i^2$ and $\sum (\hat{Y}_i - \bar{Y})^2$ will be zero. Under this condition $r^2$ will be zero. Thus, the range in
possible values for $r^2$ is from 0 to 1. The closer $r^2$ is to 1, the better the regression equation "fits" the
data. $r^2$ is the fraction of the total sum of squares about the mean that is explained by the regression
equation.
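The partition of the total sum of squares and the resulting coefficient of determination can be checked numerically. The sketch below uses a hypothetical function name and invented data; it computes the three components described above and the ratio of the regression sum of squares to the total sum of squares corrected for the mean.

```python
def partition_sum_of_squares(x, y):
    """Split sum(Y^2) into mean, residual, and regression components and compute r^2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    y_hat = [a + b * xi for xi in x]

    ss_total = sum(yi ** 2 for yi in y)                            # sum of Y_i^2
    ss_mean = n * y_bar ** 2                                       # due to the mean
    ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of e_i^2
    ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)         # due to regression
    r_squared = ss_regression / (ss_total - ss_mean)
    return {"total": ss_total, "mean": ss_mean, "residual": ss_residual,
            "regression": ss_regression, "r_squared": r_squared}

# Hypothetical data, for illustration only.
precip = [30.0, 34.5, 38.2, 41.0, 44.6, 47.3, 50.1, 54.8]
runoff = [6.1, 9.8, 10.2, 14.9, 13.7, 17.5, 19.0, 22.4]
parts = partition_sum_of_squares(precip, runoff)
print({name: round(value, 3) for name, value in parts.items()})
```

A perfect linear relation leaves no residual sum of squares and gives a coefficient of determination of one.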
From equations 9.6 and 9.13 we can write

$$r = b\,\frac{s_X}{s_Y} \tag{9.14}$$

Because $0 \le r^2 \le 1$, we have $-1 \le r \le 1$. The sign on $r$ is identical to the sign on $b$ because $s_X$
and $s_Y$ are always positive. From equation 9.14 it can be seen that $r$ may also be written as

$$r = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \sum y_i^2}}$$

which would be equal to the sample correlation coefficient if $X$ and $Y$ were both random
variables. In fact, $r$ is commonly called the correlation coefficient and can be shown to
be equal to the correlation between $Y$ and $\hat{Y}$. Correlation is discussed in more detail in
chapter 11.
Example 9.2. What percent of the variation in $Y$ is accounted for by the regression of example 9.1?
Solution:
Thus, 66% of the variation in $Y$ is explained by the regression equation. The remaining 34% of
the variation is due to unexplained causes.
The positive square root of the Var($\varepsilon$) is known as the standard error of the regression equation.
An unbiased estimate (Graybill 1961) for Var($\varepsilon$) is $s^2$ calculated from

$$s^2 = \frac{\sum e_i^2}{n - 2}$$

The least squares estimation procedure produces estimates for $\alpha$ and $\beta$ such that the standard
error of the regression equation is a minimum.
Another way to look at the coefficient of determination is to write equation 9.13 as

$$r^2 = 1 - \frac{(n - 2)\,s^2}{(n - 1)\,s_Y^2} \tag{9.18}$$

Therefore, if the estimated standard error of the regression equation is nearly equal to the
standard deviation of $Y$, $r^2$ will be close to zero and the regression equation is of little value in
explaining variation in $Y$.
Figure 9.3 depicts the relationships among the pdfs of $X$, $Y$, and $\varepsilon$ in a linear regression.
What is of interest is the spread or variance in the pdf of $\varepsilon$, $s^2$, in comparison to that of $Y$, $s_Y^2$. The
smaller $s^2$ is in comparison to $s_Y^2$, the greater $r^2$ is and the stronger the linear relationship
between $Y$ and $X$ is. This is stated mathematically by equation 9.18.
Example 9.3. Is there reason to believe the residuals of example 9.1 are not normally
distributed?
Solution:
95% of the $e_i$ should be between $-2s$ and $+2s$, or between $-5.94$ and $+5.94$. An inspection of
table 9.2 shows that none of the 16 observations is outside this interval. The number of
observations is not sufficient to determine if the $e_i$ are $N(0, \sigma^2)$; however, there is not sufficient
evidence to reject this possibility.
Inferences on Regression Coefficients
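In order to place confidence intervals on $\alpha$ and $\beta$ and to test hypotheses concerning them, it
is necessary to know the Var($a$) and Var($b$), which will be designated as $\sigma_a^2$ and $\sigma_b^2$ and estimated
by $s_a^2$ and $s_b^2$. $s_a^2$ and $s_b^2$ can be estimated from

$$s_a^2 = s^2\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2}\right] \qquad s_b^2 = \frac{s^2}{\sum x_i^2}$$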
and

$$t = \frac{b - \beta_0}{s_b}$$
Example 9.4. Compute the 95% confidence intervals on $\alpha$ and $\beta$ and test the hypothesis that
$\alpha = 0$ and the hypothesis that $\beta = 0.500$ for the regression of example 9.1.
Solution:
$$s_a = s\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2}\right]^{1/2}$$
Since $|t| < t_{0.975,14}$, we cannot reject $H_0$. The slope is not significantly different from 0.5.
Comment: The significance of the overall regression can be evaluated by testing $H_0$: $\beta = 0$.
Under this hypothesis
Because $|t| > t_{0.975,14}$ we reject $H_0$. The regression equation explains a significant amount of the
variation in $Y$.
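The inference steps for the regression coefficients can be sketched as below. The function name and data are hypothetical, and the standard errors use the usual simple regression forms, s_b = s / sqrt(sum of x_i^2) and s_a = s * sqrt(1/n + X-bar^2 / sum of x_i^2), with s^2 based on n - 2 degrees of freedom. The critical t value is supplied by the user from a table; 2.447 is the tabled 0.975 quantile for 6 degrees of freedom, matching the eight invented observations.

```python
import math

def coefficient_inference(x, y, t_crit, alpha_0=0.0, beta_0=0.0):
    """t statistics and confidence intervals for the intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum_xx
    a = y_bar - b * x_bar
    # Standard error of the regression, based on n - 2 degrees of freedom.
    s = math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    s_a = s * math.sqrt(1.0 / n + x_bar ** 2 / sum_xx)
    s_b = s / math.sqrt(sum_xx)
    return {
        "a": a, "b": b, "s_a": s_a, "s_b": s_b,
        "t_a": (a - alpha_0) / s_a,          # tests H0: alpha = alpha_0
        "t_b": (b - beta_0) / s_b,           # tests H0: beta = beta_0
        "ci_a": (a - t_crit * s_a, a + t_crit * s_a),
        "ci_b": (b - t_crit * s_b, b + t_crit * s_b),
    }

# Hypothetical data, for illustration only (n = 8, so 6 degrees of freedom).
precip = [30.0, 34.5, 38.2, 41.0, 44.6, 47.3, 50.1, 54.8]
runoff = [6.1, 9.8, 10.2, 14.9, 13.7, 17.5, 19.0, 22.4]
T_CRIT = 2.447   # tabled t for the 0.975 quantile with 6 degrees of freedom

result = coefficient_inference(precip, runoff, T_CRIT, beta_0=0.0)
slope_is_significant = abs(result["t_b"]) > T_CRIT
print(result["ci_b"], slope_is_significant)
```

Testing a slope of zero in this way is the test of the overall significance of the regression described in the comment above.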
Mood and Graybill (1963) give $\mathrm{Cov}(a, b) = -\sigma^2 \bar{X} / \sum x_i^2$. Therefore
Equation 9.25 indicates that the variance of $\hat{Y}_k$ depends on the particular value of $X$ at which
the variance is being determined. The Var($\hat{Y}_k$) is a minimum when $X_k = \bar{X}$ and increases as $X_k$
deviates from $\bar{X}$.
Confidence limits on the regression line are now given by
where $\hat{Y}_k = a + bX_k$ and $s_{\hat{Y}_k}$ is given by equation 9.26. Because $s_{\hat{Y}_k}$ increases as $x_k$, or $X_k - \bar{X}$,
increases in magnitude, the confidence intervals are the narrowest at $X_k = \bar{X}$ and widen as $X_k$ deviates
from $\bar{X}$.
The confidence limits on an individual predicted value of $Y$ would be wider than the
confidence interval on the regression line since for an individual $Y$, the Var($\varepsilon$) or $\sigma^2$ would have
to be added to the Var($\hat{Y}_k$). Thus the variance of an individual predicted value of $Y$ would be

$$\mathrm{Var}(\hat{Y}_k) + \sigma^2 = \sigma^2\left[1 + \frac{1}{n} + \frac{x_k^2}{\sum x_i^2}\right]$$

and the square root of its estimate would be substituted for $s_{\hat{Y}_k}$. The confidence limits on a future
predicted value of $Y$ are the same as those for an individual predicted value of $Y$.
Example 9.5. Calculate the 95% confidence limits for the regression line of example 9.1.
Calculate the 95% confidence interval for an individual predicted value of Y for the same
problem.
where the $-$ applies to the lower limit, $l$, and the $+$ to the upper limit, $u$. Similarly, the 95%
confidence interval on an individual predicted value of $Y$ is given by
By substituting various values of Xk into these equations, the desired confidence limits are
obtained. These intervals are plotted in figure 9.1.
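The two kinds of interval can be compared with a short sketch. It assumes the usual forms for simple regression: the standard error of the line at X_k is s * sqrt(1/n + x_k^2 / sum of x_i^2), and for an individual predicted value a further 1 is added inside the square root. The function name, the data, and the tabled t value of 2.447 (6 degrees of freedom) are illustrative assumptions.

```python
import math

def regression_intervals(x, y, x_k, t_crit):
    """Predicted value at x_k with half-widths for the line and for an individual value."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum_xx
    a = y_bar - b * x_bar
    s = math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    leverage = 1.0 / n + (x_k - x_bar) ** 2 / sum_xx
    s_line = s * math.sqrt(leverage)               # standard error of the regression line
    s_individual = s * math.sqrt(1.0 + leverage)   # adds Var(eps) for an individual value
    return a + b * x_k, t_crit * s_line, t_crit * s_individual

# Hypothetical data, for illustration only.
precip = [30.0, 34.5, 38.2, 41.0, 44.6, 47.3, 50.1, 54.8]
runoff = [6.1, 9.8, 10.2, 14.9, 13.7, 17.5, 19.0, 22.4]
T_CRIT = 2.447   # tabled t for the 0.975 quantile with 6 degrees of freedom

for x_k in (35.0, 42.5, 55.0):
    y_hat, half_line, half_individual = regression_intervals(precip, runoff, x_k, T_CRIT)
    print(f"X = {x_k}: {y_hat:.2f} +/- {half_line:.2f} (line), +/- {half_individual:.2f} (individual)")
```

Both half-widths are smallest at the mean of X, and the interval for an individual value is always the wider of the two.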
Confidence Intervals on Standard Error
Confidence intervals may be placed on $\sigma^2$ by noting that the quantity $(n - 2)s^2/\sigma^2$ is
distributed as a chi-square distribution with $n - 2$ degrees of freedom. Thus, confidence limits on
$\sigma^2$ are given by
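One common way of writing these limits follows directly from the chi-square statement. The notation is an assumption: $\chi^2_{q,\,n-2}$ is taken here to mean the value below which a fraction $q$ of the chi-square distribution with $n - 2$ degrees of freedom lies, in line with the quantile notation used for $t$ in this chapter.

```latex
\[
P\!\left[\chi^2_{\alpha/2,\,n-2} \le \frac{(n-2)\,s^2}{\sigma^2} \le \chi^2_{1-\alpha/2,\,n-2}\right] = 1 - \alpha
\quad\Longrightarrow\quad
\frac{(n-2)\,s^2}{\chi^2_{1-\alpha/2,\,n-2}} \;\le\; \sigma^2 \;\le\; \frac{(n-2)\,s^2}{\chi^2_{\alpha/2,\,n-2}}
\]
```

Taking square roots of both limits gives the corresponding interval on the standard error $\sigma$.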
EXTRAPOLATION
The extrapolation of a regression equation beyond the range of $X$ used in estimating $\alpha$ and
$\beta$ is discouraged for two reasons. First, as can be seen from figure 9.1 and equation 9.27, the
confidence intervals on the regression line become very wide as the distance from $\bar{X}$ is increased.
Second, the relation between $Y$ and $X$ may be nonlinear over the entire range of $X$ and only
approximately linear for the range of $X$ investigated. A typical example of this is shown in
figure 9.4.
GENERAL CONSIDERATIONS
Many authors discuss several different linear models depending on the assumptions made
concerning $Y$, $X$, and $\varepsilon$ (Graybill 1961; Benjamin and Cornell 1970; Mood and Graybill 1963).
These different models revolve around whether $X$ (or $\mathbf{X}$ in multiple regression) is a random or
nonrandom variable, whether measurement errors are made on $Y$ and/or $X$, the distribution of $X$
if $X$ is a random variable, and the joint distribution of $Y$ and $X$ if $X$ is a random variable.
[Figure 9.4: sketch with a curve labeled "True relation".]
2. $Y$ and $X$ are both random variables having a joint distribution, the conditional distribution of
$Y$ is $N(\alpha + \beta X, \sigma^2)$, and the marginal distribution of $X$ is independent of $\alpha$, $\beta$, and $\sigma^2$.
It turns out that under either of the above conditions, the procedures given in this chapter are
valid for tests of hypotheses and confidence interval estimation at a specified level of significance.
Graybill (1961) points out that the power of the tests is not the same for the two conditions.
If $X$ is a fixed variable measured without error and $\varepsilon_i$ is independently and identically
distributed $N(0, \sigma^2)$; or $Y$ and $X$ are from a bivariate normal distribution and are measured
without error; or $Y$ and $X$ are from a bivariate nonnormal population with the conditional
distribution of $Y$ being $N(\alpha + \beta X, \sigma^2)$ and the marginal distribution of $X$ independent of $\alpha$, $\beta$, and $\sigma^2$;
then the least squares estimates of $\alpha$, $\beta$, and $\sigma^2$ are also maximum likelihood estimators. The least
squares estimates for the regression coefficients are unbiased.
If significant measurement errors are made on the $X$ variables, then complications arise. For
this situation reference can be made to Graybill (1961) or Johnston (1963). Certainly,
measurement errors are always present; however, if these errors are small relative to $X$, then the theory
presented in this chapter and chapters 10, 11, and 12 may still be applied.
The reason that measurement errors on $X$ cause problems can be seen by considering the
model $Y = \alpha + \beta X + \varepsilon$. If $Y$ and $X$ contain measurement errors, then $Y$ and $X$ are not observed.
What is observed is $Y^*$ and $X^*$, where

$$Y^* = Y + e_Y \qquad X^* = X + e_X$$

and $e_Y$ and $e_X$ are the measurement errors on $Y$ and $X$. Thus, the normal equations are solved
in terms of $Y^* = \alpha + \beta X^* + \varepsilon$, or $Y + e_Y = \alpha + \beta(X + e_X) + \varepsilon = \alpha + \beta X + \beta e_X + \varepsilon$.
Now if $e_X$ is small in comparison to $X$, this latter equation becomes $Y = \alpha + \beta X + \varepsilon - e_Y$, or
$Y = \alpha + \beta X + \varepsilon'$, which can be handled by the methods outlined in this chapter.
Recall that no distributional assumptions are required to get the least squares estimates for
$\alpha$ and $\beta$. The assumptions are involved when confidence intervals and tests of hypotheses are of
concern, or when it is desired to state that the least squares estimates for $\alpha$ and $\beta$ are also
maximum likelihood estimates. Johnston (1963) points out that the least squares estimates for $\alpha$ and
$\beta$ are biased if significant measurement errors are present on $X$.
One of the assumptions used in developing confidence intervals and tests of hypotheses was
that the $\varepsilon_i$ are independent. If $\varepsilon_i$ is correlated with $\varepsilon_{i+1}$, the least squares estimates of $\alpha$ and $\beta$ are
unbiased; however, the sampling variances of these estimates will be unduly large and will be
underestimated by the least squares formulas for variances, rendering the level of significance of tests of
hypotheses unknown. Also, the sampling variances on predictions made with the resulting
equation will be needlessly large. Correlation between $\varepsilon_i$ and $\varepsilon_{i+1}$ frequently arises when time series
data are being analyzed. This type of correlation is known as autocorrelation or serial correlation.
Johnston (1963) discusses least squares estimation procedures in the presence of autocorrelation.
Autocorrelation of errors is discussed in more detail in the next chapter of this book.
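A simple screening calculation for serial correlation is the lag-one correlation of the residuals taken in time order. The estimator below is one common form and is an assumption of this sketch; it is not a procedure taken from Johnston (1963), and the function name and residual series are hypothetical.

```python
def lag1_serial_correlation(residuals):
    """Lag-one serial correlation of residuals listed in time order."""
    n = len(residuals)
    e_bar = sum(residuals) / n
    numerator = sum((residuals[i] - e_bar) * (residuals[i + 1] - e_bar)
                    for i in range(n - 1))
    denominator = sum((e - e_bar) ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series, for illustration only.
alternating = [1.0, -1.0] * 5                                   # sign changes every step
persistent = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]  # slow drift

print(lag1_serial_correlation(alternating), lag1_serial_correlation(persistent))
```

Values near zero are consistent with independent errors, while values well away from zero suggest that the variance formulas of this chapter should not be taken at face value.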
In some situations the assumption of homoscedasticity [Var($\varepsilon_i$) = $\sigma^2$ for all $i$] is violated.
Quite commonly, Var($\varepsilon_i$) increases as $X$ increases. Such a situation is depicted in figure 9.5. Draper
and Smith (1966) and Johnston (1963) discuss least squares estimation under this condition.
Another point to be made concerning hypothesis testing in general is that a statistically
significant difference and a physically significant difference are two entirely different quantities.
For example, when $H_0$: $\beta = 0$ was tested in example 9.4, the conclusion was that the
regression line explained a significant amount of the variation in $Y$. This refers to a statistically
significant amount of the variation at the chosen level of significance. It means that, recognizing
an $\alpha$% chance of an error, the relationship $\hat{Y} = a + bX$ cannot be attributed to chance. It does
not imply a cause and effect relationship between $Y$ and $X$.
Looking at the confidence limits on the regression as plotted in figure 9.1 and the scatter of
the data, it can be seen that this simple relationship $\hat{Y} = a + bX$ leaves a lot to be desired in
terms of predicting annual runoff. Whether or not the derived relationship is usable depends on
the use to be made of the predicted values of $Y$ and not on the fact that $H_0$: $\beta = 0$ is rejected.
It may be that the standard error of the equation, $s$, is so large as to render the estimate made with
the equation in some particular application too uncertain to be used even though the equation is
explaining a statistically significant portion of the variability in the dependent variable.
Exercises
9.1. The following data are the maximum air and soil temperatures (bare soil at 2-inch depth)
recorded for the first 30 days of July 1973, at Lexington, Kentucky. Derive a linear relationship
via simple regression for predicting the maximum soil temperature from the maximum air
temperature. Estimate a and 3 for the resulting regression. Test the hypothesis that (a) the
intercept is 0, (b) the slope is 1, (c) the regression explains a significant amount of the variation in the
maximum soil temperature. Would you recommend using this relationship for predicting
maximum soil temperature?
Max Temp
Air Soil Air Soil Air Soil
9.2. The asterisks following the soil data in exercise 9.1 indicate days on which rainfall occurred.
Using only these rainfall days, work exercise 9.1.
9.3. Calculate the regression coefficients in the relationship $Q_s = a + bQ$ where $Q_s$ is the annual
suspended sediment load and $Q$ is the annual water discharge for the Green River at Munfordville,
Kentucky. Calculate the standard error of the regression equation and the correlation coefficient.
Plot the data along with the 95% confidence intervals on the regression line. Is this a usable
prediction equation?
9.4. Show that the correlation coefficient in simple regression is equivalent to the correlation
between $Y$ and $\hat{Y}$.
9.5. Calculate the regression equation for the data of table 9.1 considering the runoff as the
independent variable and the precipitation as the dependent variable. Rearrange the resulting
equation to be in the form of the prediction equation of example 9.1. Does the resulting
regression equation agree with the regression equation in example 9.1? Should it agree? Why?
Which equation should be used?
9.6. A technique used by hydrologists to detect changes in the hydrologic response of a watershed
is to examine mass curves for changes in slope. A mass curve is a plot of the accumulation over
time of one variable versus the accumulation over time of a second variable. The data below are
the annual runoff and precipitation for Thorne Creek experimental watershed in Pulaski County,
Virginia. It is thought that there was a change in the hydrologic characteristics of this watershed
during the 11-year period of study. Plot the accumulated precipitation as the abscissa and the
accumulated runoff as the ordinate. Does there appear to be a change in the rainfall-runoff
relationship? During what year? Calculate the slope of the regression lines describing the data
both before and after the apparent change. Test the hypothesis that these slopes are not
significantly different.
9.7. Occasionally it is desirable to restrict the intercept of a simple regression to 0, thus
requiring the regression line to pass through the origin. Derive the normal equation for the slope in this
case. Use the resulting equation to calculate the slope of the line describing the data plotted for
exercise 9.6. Neglect the apparent change in the slope for this problem (i.e., use all of the data to
estimate $b$ in the equation accumulated runoff = $b$ [accumulated precipitation]).
where $Y$ is a dependent variable, $X_1, X_2, \ldots, X_p$ are independent variables, $\beta_1, \beta_2, \ldots, \beta_p$ are
unknown parameters, and $\varepsilon$ is an error component. This model is linear in the parameters, $\beta_j$.
and
where $Y_i$ is the $i$th observation on $Y$ and $X_{i,j}$ is the $i$th observation on the $j$th independent variable.
Equations 10.2 can be written
When the model is written in the form of equation 10.5, it is easy to see that $\mathbf{Y}$ is an $n \times 1$
vector of observations on the dependent variable, $\mathbf{X}$ is an $n \times p$ matrix made up of $n$ observations
on each of $p$ independent variables, and $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown parameters. For equation
10.4 to have an intercept term, it is necessary that $X_{i,1} = 1$ for all $i$. $\beta_1$ is then the intercept. In the
following development, it is assumed that $X_{i,1} = 1$ for $i = 1$ to $n$.
The model discussed in chapter 9,
In matrix notation
Differentiating this expression with respect to $\hat{\boldsymbol{\beta}}$ and setting the partial derivative equal to
zero results in
which represents the normal equations. The solution of equation 10.7 is obtained by
premultiplying by $(\mathbf{X}'\mathbf{X})^{-1}$.
$\boldsymbol{\beta}$ can be estimated by
or we have the result that
The $\mathbf{X}'\mathbf{X}$ matrix plays an important role in estimating $\boldsymbol{\beta}$ and in the variance of the $\hat{\beta}_i$'s.
The $\mathbf{X}'\mathbf{X}$ matrix is made up of the sums of squares and cross products of the independent variables.
For the $p \times p$ matrix $\mathbf{X}'\mathbf{X}$ to be inverted, its rank must be $p$. That is, no row or column can be a
linear function of any combination of the other rows and columns. If this occurs, it is known as
multicollinearity.
If we define $z_{i,j}$ to be $(X_{i,j} - \bar{X}_j)/s_j$ and let $\mathbf{Z} = [z_{i,j}]$, then $\mathbf{Z}'\mathbf{Z}/(n - 1)$ is a $p \times p$ correlation
matrix, $\mathbf{R} = [R_{i,j}]$, where $R_{i,j}$ is the correlation coefficient between the $i$th and $j$th independent
variables. By definition, $R_{i,j} = 1$ for $i = j$. If $|R_{i,j}| = 1$ for some $i \ne j$, then the $i$th independent
variable is a linear function of the $j$th independent variable and the rank of $\mathbf{X}'\mathbf{X}$ will be less than
$p$. This means that an independent variable cannot be a (perfect) linear function of any other
independent variable. Furthermore, for the rank of $\mathbf{X}'\mathbf{X}$ to be $p$, an independent variable cannot
be linearly dependent on any linear function of the remaining independent variables. For
example, if $p$ is 4 and $X_2 = aX_1 + bX_3 + c$, then $X_2$ is a linear function of $X_1$ and $X_3$ so that the rank
of $\mathbf{X}'\mathbf{X}$ would be at most 3. If there is near linear dependence in $\mathbf{X}$, the calculation of $(\mathbf{X}'\mathbf{X})^{-1}$
may involve roundoff errors and loss of significance leading to nonsensical estimates for $\boldsymbol{\beta}$
(Draper and Smith 1966).
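The role of the rank of X'X can be illustrated with a small sketch that forms the normal equations X'X b = X'Y and solves them by Gaussian elimination. Everything here is an illustrative assumption: the function names are hypothetical, the data are invented, and elimination with partial pivoting is simply one convenient way of solving the system without a matrix library. In practice, well-established regression software would be used.

```python
def normal_equations(X, Y):
    """Return X'X and X'Y for a design matrix X (list of rows) and a response list Y."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, Y)) for j in range(p)]
    return xtx, xty

def solve_linear_system(A, rhs, tol=1e-10):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [value] for row, value in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < tol:
            # A vanishing pivot means X'X has rank less than p (multicollinearity).
            raise ValueError("X'X is singular: linearly dependent independent variables")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (M[i][n] - sum(M[i][c] * z[c] for c in range(i + 1, n))) / M[i][i]
    return z

def fit_multiple_regression(X, Y):
    """Least squares estimates from the normal equations X'X b = X'Y."""
    xtx, xty = normal_equations(X, Y)
    return solve_linear_system(xtx, xty)

# Hypothetical data generated from Y = 1 + 2*X2 + 3*X3; the first column of 1's is the intercept.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 5.0], [1.0, 4.0, 3.0], [1.0, 5.0, 8.0]]
Y = [9.0, 8.0, 22.0, 18.0, 35.0]
print([round(b, 6) for b in fit_multiple_regression(X, Y)])
```

If one column of X is an exact linear function of the others, the elimination meets a zero pivot and the sketch raises an error instead of returning meaningless coefficients.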
As in the case of simple regression, the total sum of squares can be partitioned into three parts.
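Draper and Smith (1966) demonstrate that equation 9.10 can be written in matrix notation as

$$\mathbf{Y}'\mathbf{Y} = n\bar{Y}^2 + (\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y}) + (\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} - n\bar{Y}^2)$$

so that the three components of the total sum of squares, $\mathbf{Y}'\mathbf{Y}$ or $\sum Y_i^2$, are:
1. $n\bar{Y}^2$, the sum of squares due to the mean
2. $\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} = \sum e_i^2$, the residual sum of squares
3. $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} - n\bar{Y}^2 = \sum (\hat{Y}_i - \bar{Y})^2$, the sum of squares due to regression

Source        Degrees of freedom    Sum of squares
Mean          $1$                   $n\bar{Y}^2$
Regression    $p - 1$               $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} - n\bar{Y}^2$
Residual      $n - p$               $\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y}$
Total         $n$                   $\mathbf{Y}'\mathbf{Y}$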
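Continuing the analogy with simple regression, define $\mathbf{e}$ as $\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}$. The estimation
procedure guarantees that $E(\mathbf{e}) = \mathbf{0}$. An unbiased estimate for the Var($\varepsilon_i$) or $\sigma^2$ is $s^2$ where

$$s^2 = \frac{\mathbf{e}'\mathbf{e}}{n - p} = \frac{(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n - p} = \frac{\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y}}{n - p}$$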
The standard error of the regression equation $\sigma$ is estimated by $s$. An expression for $R^2$ that
is analogous to equation 9.18 is
Again, this shows that if the regression equation is explaining a large part of the variation in $Y$,
the standard error of the equation will be significantly less than the standard deviation of $Y$.
Example 10.1. Benson (1962) studied flood frequencies on many streams in the northeastern
United States. The following table contains a partial listing of some of Benson's data. Using these
data: (a) Estimate the regression coefficients for the model

$$Q = \beta_1 + \beta_2 A + \beta_3 I$$

where $Q$ is the mean annual flood in thousands of cfs, $A$ is the watershed area in thousands
of square miles, and $I$ is the average annual maximum 24-hour rainfall depth in inches.
(b) Calculate $R^2$. (c) Calculate $\hat{Q}_i$ for each observation on the independent variables. (d) Calculate
$e_i$ for each $Q_i$.
Station No.    $Q$    $A$    $I$    $\hat{Q}$    $e$
Solution: To maintain consistency in notation, let $Y_i = Q_i$, $X_{i,1} = 1$, $X_{i,2} = A_i$, $X_{i,3} = I_i$. For this
problem $n = 14$ and $p = 3$. The column of data under $Q$ is the $14 \times 1$ vector $\mathbf{Y}$, a column of 1's
along with the data under $A$ and $I$ is the $14 \times 3$ matrix $\mathbf{X}$, and the $3 \times 1$ vector $\hat{\boldsymbol{\beta}}$ is made up of
$b_1$, $b_2$, and $b_3$. From equation 10.8, we have
$(\mathbf{X}'\mathbf{X})^{-1}$ is found to be

$$\begin{bmatrix} 3.71678 & -0.18094 & -1.37537 \\ -0.18094 & 0.02028 & 0.06124 \\ -1.37537 & 0.06124 & 0.52329 \end{bmatrix}$$
This means that 99% of the variation in $Y$ is explained by the regression equation
Values for $\hat{Q}$ contained in the above table were calculated from this relationship.
Values for $e_i$ were computed from
Source        Degrees of freedom    Sum of squares    Mean square
Mean          1                     6,606.381
Regression    2                     13,182.600        6,591.300
Residual      11                    171.090           15.554
Total         14                    19,960.071
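The entries of the table above can be cross-checked with a few lines of arithmetic. The sums of squares, n = 14, and p = 3 are the values given in example 10.1; the variable names are hypothetical, and R^2 is taken as the regression sum of squares divided by the total sum of squares corrected for the mean.

```python
# Sums of squares from the analysis of variance table of example 10.1.
ss_mean = 6606.381
ss_regression = 13182.600
ss_residual = 171.090
n, p = 14, 3

ss_total = ss_mean + ss_regression + ss_residual
r_squared = ss_regression / (ss_total - ss_mean)   # fraction of corrected total explained
ms_regression = ss_regression / (p - 1)            # mean square for regression
s_squared = ss_residual / (n - p)                  # residual mean square, estimate of sigma^2

print(round(ss_total, 3), round(r_squared, 3), round(ms_regression, 3), round(s_squared, 3))
```

The three components add to the tabled total, and the ratio of about 0.987 is the basis for the statement that 99% of the variation in Y is explained.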
If a large number of significant figures are not carried in computing the $(\mathbf{X}'\mathbf{X})^{-1}$ matrix,
significant errors can result. To demonstrate this, the elements of the $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{Y}$ matrices were
rounded to two decimal places, resulting in estimates for $\boldsymbol{\beta}$ of $b_1 = 1.10$, $b_2 = 12.24$, and
$b_3 = 5.28$. Computational problems of this type rarely arise when using well-established
computer routines unless there is near collinearity in the $\mathbf{X}$ matrix.
CONFIDENCE INTERVALS AND TESTS OF HYPOTHESES
As was the case in simple regression, in order to use some well-developed theorems on confidence intervals and tests of hypotheses in multiple regression, some assumptions must be made. All of the comments of chapter 9 regarding the assumptions in simple regression remain valid in multiple regression. The assumption will now be made that the $\varepsilon_i$ are identically, independently, and normally distributed with mean 0 and variance $\sigma^2$. That is, the $\varepsilon_i$ are iid $N(0, \sigma^2)$ (see General Comments section of chapter 9).
The variance of $\hat{\beta}_i$ is equal to the covariance of $\hat{\beta}_i$ with itself and is therefore $\sigma^2$ times the ith diagonal element of $(X'X)^{-1}$. The covariance of $\hat{\beta}_i$ with $\hat{\beta}_j$ is $\sigma^2$ times the i, jth element of $(X'X)^{-1}$. If we let $C = X'X$, then $C^{-1} = (X'X)^{-1}$
and
$H_0$, denote the full model as the model containing all p of the independent variables. Denote as the reduced model the model obtained by deleting the last k independent variables. The reduced model contains p − k independent variables. Now let
$Q_2$ = sum of squares due to regression on the full model with p − 1 degrees of freedom
$Q_1$ = residual sum of squares on the full model with n − p degrees of freedom
$Q_2^*$ = sum of squares due to regression on the reduced model with p − k − 1 degrees of freedom
The quantity
in p dimensional space. $\hat{\beta}$ is a p × 1 vector consisting of the estimates for $\beta$. The var($\hat{Y}_h$) is given by (Draper and Smith, 1966)
which can be estimated by replacing $\sigma^2$ with $s^2$. The confidence limits on $Y_h$ are given by
The confidence intervals on an individual predicted value of $Y_h$ are given by equations 10.21 where var($\hat{Y}_h$) is replaced by the variance of an individual predicted value of Y at $X_h$, which is given by $\sigma^2 (1 + X_h (X'X)^{-1} X_h')$.
Example 10.2. For the regression equation of example 10.1: (a) Test the hypothesis that the regression equation is not explaining a significant amount of the variation of Y. (b) Test the $H_0$: $\beta_2 = 0$. (c) Test the $H_0$: $\beta_3 = 0$. (d) Calculate the 95% confidence limits on $\beta_2$. (e) Calculate the 95% confidence limits on the regression line at the point A = 4,000 square miles and I = 2.0 inches. (f) Calculate the 95% confidence intervals on $\sigma^2$.
Solution:
The tabulated $F_{0.95,2,11}$ is 3.98. Therefore, $H_0$ is rejected. The regression equation does explain a significant amount of the variation in Y.
Because $|t| < t_{0.975,11}$, we cannot reject $H_0$. The mean annual maximum 24-hour rainfall depth does not explain a significant amount of the variation in the mean annual peak flow.
(d) The 95% confidence limits on $\beta_2$ are calculated from equations 10.15 as
(e) The 95% confidence limits on the regression line at $X_{2,h}$ = 4.00 and $X_{3,h}$ = 2.0 are determined from equation 10.21. The var($\hat{Y}_h$) is from equation 10.20.
(f) The 95% confidence intervals on $\sigma^2$ are calculated from equation 10.13
The 95% confidence intervals on $\sigma$ can be obtained by taking the square root of these limits to obtain 2.80 to 6.69.
Comment: The hypotheses $H_0$: $\beta_2 = 0$ and $H_0$: $\beta_3 = 0$ were both tested in this example as though the tests were independent. In fact, $\hat{\beta}_2$ and $\hat{\beta}_3$ are not independent. The cov($\hat{\beta}_2$, $\hat{\beta}_3$) can be determined from $C^{-1}_{2,3} s^2$ as 0.0612(15.554) = 0.9519. The correlation between $\hat{\beta}_2$ and $\hat{\beta}_3$ can be estimated from cov($\hat{\beta}_2$, $\hat{\beta}_3$)/($s_{\hat{\beta}_2} s_{\hat{\beta}_3}$) as 0.9519/(0.562 × 2.85) = 0.59. The test of $H_0$: $\beta_3 = 0$ is made relative to the full model that includes all of the $\beta$'s. The acceptance of $H_0$ implies that $\beta_3 = 0$ given that $\beta_1$ and $\beta_2$ are in the model. In general, if there are p $\beta$'s and $H_0$: $\beta_i = 0$ is tested for each of them, with the result that k of the hypotheses can be accepted, one cannot eliminate these k variables from the model on the basis of this test alone because each of the individual $H_0$: $\beta_i = 0$ assumes all of the other p − 1 $\beta$'s are still in the model. To eliminate k variables at once, the test must be based on equation 10.19.
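The arithmetic in this comment can be checked with a short Python sketch. It assumes only the quoted elements of $(X'X)^{-1}$ and the residual mean square from example 10.1; the function name and argument names are illustrative and not part of the original example.

```python
import math

# Elements of C^{-1} = (X'X)^{-1} and s^2 as quoted in examples 10.1 and 10.2.
c22 = 0.02028   # diagonal element for b2 (watershed area)
c33 = 0.52329   # diagonal element for b3 (rainfall depth)
c23 = 0.06124   # off-diagonal element linking b2 and b3
s2 = 15.554     # residual mean square

def coefficient_correlation(c_ii, c_jj, c_ij, s2):
    """Return (se_i, se_j, cov_ij, corr_ij) for two regression coefficients."""
    se_i = math.sqrt(c_ii * s2)      # standard error of coefficient i
    se_j = math.sqrt(c_jj * s2)      # standard error of coefficient j
    cov_ij = c_ij * s2               # covariance of the two estimates
    return se_i, se_j, cov_ij, cov_ij / (se_i * se_j)

se2, se3, cov23, corr23 = coefficient_correlation(c22, c33, c23, s2)
print(round(se2, 3), round(se3, 2), round(corr23, 2))
```

The covariance computed here uses the five-decimal element 0.06124 rather than the rounded 0.0612 used in the comment, so it differs slightly in the fourth decimal place.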
As an example of the application of equation 10.19, the $H_0$: $\beta_3 = 0$ will be tested. The ANOVA for the full model is contained in example 10.1. The reduced model is simply $Y = b_1 + b_2 X$ where X is the watershed area in thousands of square miles. Because this is a simple regression situation, we can compute the sum of squares due to regression from $b \sum x_i y_i$ where $b = \sum x_i y_i / \sum x_i^2$. The result of this calculation is the sum of squares due to regression for the reduced model, which is 13,182.60.
The test statistic from equation 10.19 is
The table value of $F_{0.95,1,11}$ = 4.84, so we fail to reject $H_0$: $\beta_3 = 0$. Note that this test is identical to the test conducted in part (c) of this example. From F and t tables it can be seen that $F_{1-\alpha,1,n-p} = t^2_{1-\alpha/2,n-p}$, so for the special case where k = 1 variable is being tested, equations 10.17 and 10.19 produce identical results.
Because $H_0$: $\beta_3 = 0$ was not rejected, the next logical step is to eliminate I from the model and consider only A. In so doing the resulting regression equation is
The dependence of the $\hat{\beta}$'s again is evident because the intercept is not the same as was obtained when rainfall depth was included in the model. This is a somewhat special example in that $\beta_2$ accounts for nearly all of the variation in Y, leaving virtually none of the variation to be explained by $\beta_3$. Again, one reason for this unusual situation is the units on Y and A and the proximity of all of the watersheds to each other, resulting in similar rainfalls on all of the watersheds. Unless the relationship between the dependent variable and an independent variable is quite strong, variability in the dependent variable due to variability in the independent variable cannot be detected if there is little variability in the independent variable.
where $\bar{X}_j$ and $s_j$ are the mean and standard deviation of the jth independent variable. Then define $Z = [z_{i,j}]$ so that the correlation matrix is
where $R_{i,j}$ is the correlation between the ith and jth independent variables. R is a symmetric matrix because $R_{i,j} = R_{j,i}$. We have already seen that if $R_{i,j} = 1$ for i ≠ j, then either variable i or variable j must be omitted from the model or else the X'X matrix cannot be inverted. If $R_{i,j}$ is close to unity (but not equal to unity), then X'X can be inverted and $\beta$ estimated, but the var($b_i$) or var($b_j$) may be very large. Tests of hypothesis on $\beta_i$ and $\beta_j$ may indicate that neither is significantly different from zero when in fact either $\beta_i$ or $\beta_j$ when used alone may be significantly different from zero. The problem here is that since $X_i$ and $X_j$ are nearly linearly
related, they both are attempting to explain the same thing in the linear model. By having both Xi
and Xj in the model, the part of the variation in Y that either would explain if used alone may be
split between them in such a fashion that neither is significant. In other words, the effect of one
explanatory factor (which may be reflected in either Xi or Xj) is being divided between two
correlated variables.
Retaining variables in a regression equation that are highly correlated (multicolinearity)
makes the interpretation of the regression coefficients difficult. Many times the sign of the
regression coefficient may be the opposite of what is expected if the corresponding variable is
highly correlated with another independent variable in the equation. Multicolinearity is discussed
below.
A common practice in selecting a multiple regression model (and one that is not necessarily
being advocated) is to perform several regressions on a given set of data using different
combinations of the independent variables. The regression that "best" fits the data is then
selected. A commonly used criterion for the "best" fit is to select the equation yielding the largest
value of R2.
Looking at equations 10.21, another and perhaps better criterion is apparent. The confidence
intervals on the regression line are a function of s, the estimated standard error. The line with the
smallest standard error will have the narrowest confidence intervals.
Often the two criteria of the largest $R^2$ and the smallest s give the same results, but not always. As more variables are added to a regression equation, the $R^2$ value can never decrease. Thus, from the standpoint of the $R^2$ criterion, one should use all of the available variables. This, however, makes a clumsy equation and one in which it is extremely difficult to place a meaningful interpretation on the coefficients.
As more variables are added to a regression equation, the standard error may get larger. This can be seen from equation 10.11. Every time a variable is added, n − p gets smaller, as does $Y'Y - \hat{\beta}'X'Y$. However, the numerator may not, and often does not, decrease proportionally to n − p, so that as variables are added s may actually increase. This is a tip-off that the added variables are not contributing significantly to the regression and can just as well be left out.
All of the variables retained in a regression should make a significant contribution to the regression unless there is an overriding reason (theoretical or intuitive) for retaining a nonsignificant variable. The variables retained should have physical significance. If two variables are equally significant when used alone but are not both needed, the one that is easiest to obtain or easiest to interpret should be used.
The number of coefficients estimated should not exceed 25-35% of the number of observations. This is a rule of thumb used to avoid "overfitting", whereby oscillations in the equation may occur between observations on the independent variables.
Thus far all decisions on which regression equation to use have been made by the investigator. In many cases this is the most reliable method of selecting a regression equation. Using computers, it is possible to perform many regressions on large sets of data. This has led to several formal procedures for selecting a regression equation. Two methods will be discussed here: all-possible-regressions and stepwise regression. For a discussion of some other techniques, reference should be made to Draper and Smith (1966) and Neter et al. (1996).
All-possible-regressions involves calculating regression equations having every possible combination of the X variables. If all of the equations are required to have an intercept term, $2^{p-1}$ regression equations would have to be calculated where p is the number of independent variables, one of which is always equal to one to produce the intercept term. Thus, if p = 4, 8 regression equations would be calculated (not an impossible task or a bad procedure); however, if p = 11, 1024 regressions would have to be calculated and examined. Thus, as p gets even moderately large, the number of regressions required becomes prohibitive and intelligent thought could eliminate many of them. When this many regressions are calculated, the probability of getting a significant regression by chance becomes large.
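The count of $2^{p-1}$ equations can be illustrated by enumerating the candidate models directly. The following Python sketch uses hypothetical variable names (X2, X3, X4) and always keeps the intercept; the count includes the intercept-only model.

```python
from itertools import combinations

def all_subsets_with_intercept(names):
    """Yield every subset of predictor names; the intercept is always kept."""
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            yield ("intercept",) + subset

def n_regressions(p):
    """Number of candidate equations when p counts the intercept column."""
    return 2 ** (p - 1)

predictors = ["X2", "X3", "X4"]   # hypothetical names; p = 4 with the intercept
models = list(all_subsets_with_intercept(predictors))

print(len(models), n_regressions(4), n_regressions(11))
```

Each tuple in `models` names the columns of one candidate regression, so fitting all of them is a loop over this list.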
One of the most commonly used procedures for selecting the "best" regression equation is stepwise regression. This procedure consists of building the regression equation one variable at a time by adding at each step the variable that explains the largest amount of the remaining unexplained variation. After each step all the variables in the equation are examined for significance and discarded if they are no longer explaining a significant amount of the variation. Thus, the first variable added is the one with the highest simple correlation with the dependent variable. The second variable added is the one explaining the largest variation in the dependent variable that remains unexplained by the first variable added. At this point the first variable is tested for significance and retained or discarded depending on the results of this test. The third variable added is the one that explains the largest portion of the variation that is not explained by the variables already in the equation. The variables in the equation are then tested for significance. This procedure is continued until all of the variables not in the equation are found to be insignificant and all of the variables in the equation are significant. This is a very good procedure to use but care must be exercised to see that the resulting equation is rational. An alternative stepwise procedure is to start with a full model and eliminate variables one at a time with the least significant variable being chosen for elimination. At each step the significance of all remaining variables is checked to ensure the retained variables are, in fact, more important than the eliminated ones.
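The adding half of the stepwise procedure can be sketched in plain Python as follows. This is a simplified illustration, not a full stepwise routine: it performs forward selection only and omits the re-testing and removal of variables already in the equation. The entry threshold `f_to_enter`, the variable names, and the data are all hypothetical, and singular (perfectly collinear) inputs are not handled.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def residual_ss(columns, y):
    """Residual sum of squares for least squares of y on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)   # normal equations X'X b = X'Y
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def forward_select(candidates, y, f_to_enter=4.0):
    """Add, one at a time, the variable giving the largest drop in residual SS."""
    selected, remaining = [], list(candidates)
    current = residual_ss([], y)
    while remaining:
        trials = {name: residual_ss([candidates[v] for v in selected + [name]], y)
                  for name in remaining}
        best = min(trials, key=trials.get)
        df = len(y) - (len(selected) + 2)    # n - p after adding the variable
        if df <= 0:
            break
        denom = trials[best] / df
        f_stat = float("inf") if denom == 0 else (current - trials[best]) / denom
        if f_stat < f_to_enter:              # partial F too small: stop adding
            break
        selected.append(best)
        remaining.remove(best)
        current = trials[best]
    return selected

# Synthetic data: y depends strongly on x1; x2 is unrelated noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, -0.15, 0.05]
y = [2.0 + 3.0 * a + e for a, e in zip(x1, noise)]
chosen = forward_select({"x1": x1, "x2": x2}, y)
print(chosen)
```

A fixed F threshold is used here only to keep the sketch short; in practice the entry test would use a tabulated F for the chosen level of significance and the current degrees of freedom.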
Of course, the real test of how good the resulting regression model is depends on the ability
of the model to predict the dependent variable for observations on the independent variables that
were not used in estimating the regression coefficients. To make a comparison of this nature, it is
necessary to randomly divide the data into two parts. One part of the data is then used to develop
the model and the other part to test the model. Unfortunately, in hydrologic applications there are
often not enough observations to carry out this procedure.
EXTRAPOLATION
The comments on extrapolation contained in chapter 9 relative to simple regression are
equally applicable to multiple regression. In multiple regression an additional problem arises. It
is sometimes difficult to tell the range of the data. In example 10.1, A ranges from 0.091 to 8.27
and I ranges from 1.7 to 3.2. Is the point A = 6.0 and I = 2.7 in the range of the data?
A plot of A and I is shown in figure 10.1. From this plot it is apparent that A and I do not
cover the entire range defined by 0.091 < A < 8.27 and 1.7 < I < 3.2. The point A = 6.0 and
I = 2.7 does not appear to be in the range of the data. In more than 2 dimensions it is much more
difficult to visualize the range of the data.
AUTOCORRELATED ERRORS
One of the assumptions that is made in linear regression is that the errors are independent.
This means that there should be no correlation between the errors at successive observations.
Correlation in the errors from one observation to the next is common in time series data, especially if the hydrologic system involves considerable storage. For example, if the dependent variable is the elevation of the ground water in a particular observation well on a monthly basis, it would not be uncommon that if this water level were underpredicted at a particular time step, it would tend to be underpredicted in the next time step. Correlation of this type is often called autocorrelation or serial correlation. The chapters in this book on Correlation and on Time Series Analysis deal with this topic as well. Neter et al. (1996) have a good treatment of regression when serial correlation is present.
It is important to note that what is of concern is autocorrelation in the error term of the regression model, not in the dependent or the independent variables. Often, but not always, autocorrelation in the dependent variable leads to autocorrelation in the error term of the regression model. Time series data such as daily or monthly streamflow, monthly ground water levels, and monthly reservoir levels generally have significant serial correlation, and regressions using these as dependent variables often have serial correlation in the error terms. The error term represents deviations between the predicted and observed values of the dependent variable. Serial correlation in the predicted variables can arise because the model predicts similar values from one time step to the next. It is only when overpredictions at one time step tend to follow overpredictions at the previous time step and underpredictions tend to follow underpredictions that serial correlation in the error term exists.
Serial correlation in the errors can be detected by examining a time series plot of the errors
and noting any patterns. Random scattering of the errors indicates a lack of serial correlation or
independence of the errors. Any pattern in the errors may be indicative of serial correlation. The
correlogram (chapter 14) of the errors can also be computed. A large first order serial correlation
indicates correlated errors.
Estimated regression coefficients in the presence of serial correlation in the errors are unbiased but their variances are incorrectly estimated, and thus the level of significance of hypothesis tests regarding these coefficients is unknown. The standard error of the regression equation is also affected so that hypothesis tests involving the standard error are also at an unknown level of significance.
Serial correlation may indicate that one or more important explanatory variables are missing from the regression equation. Serial correlation implies that
$$ e_t = \rho e_{t-1} + \varepsilon_t $$
where $e_t$ is the error at time t, $\rho$ is the serial correlation, and $\varepsilon_t$ is independent with mean zero. Neter et al. (1996) indicate that if $\varepsilon_t$ is iid $N(0, \sigma^2)$ then $e_t$ has a mean of 0 and a variance of $\sigma^2/(1 - \rho^2)$ where $\rho$ is the first order serial correlation between $e_t$ and $e_{t-1}$. This in turn implies that
we can perform a regression of $Y_t'$ versus $X_t'$ and eliminate the problem of serial correlation in the errors. Equation (10.25) requires that $\rho$ be known. It can be estimated by computing the first order serial correlation of the errors from the original equation involving $Y_t$ and the $X_{i,t}$'s from
In the event that $\varepsilon_t$ turns out to be iid $N(0, \sigma^2)$, standard tests of hypotheses can then be used to eliminate nonsignificant $\beta$'s. In equation (10.28) the $Y_{t-1}$ and $X_{i,t-1}$ are known as lagged variables.
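A minimal Python sketch of the two calculations described here follows. It assumes the transformed variables take the usual form $Y_t' = Y_t - \rho Y_{t-1}$ and $X_t' = X_t - \rho X_{t-1}$ (the treatment found in Neter et al.), and it estimates $\rho$ as the lag-one correlation of the residuals. The function names and the residual series are hypothetical.

```python
def lag1_correlation(e):
    """First order serial correlation of a residual series e."""
    mean = sum(e) / len(e)
    num = sum((e[t] - mean) * (e[t - 1] - mean) for t in range(1, len(e)))
    den = sum((v - mean) ** 2 for v in e)
    return num / den

def quasi_difference(series, rho):
    """Return series[t] - rho * series[t-1] for t = 1, ..., n-1."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]

# Hypothetical residuals that drift slowly, as storage-dominated errors might.
residuals = [1.0, 0.8, 0.5, 0.1, -0.3, -0.6, -0.4, 0.0, 0.4, 0.7]
rho_hat = lag1_correlation(residuals)

# The same transformation would be applied to Y and to each X before refitting.
y_prime = quasi_difference([1.0, 2.0, 3.0], 0.5)
print(round(rho_hat, 2), y_prime)
```

Note that one observation is lost in the transformation because the first value has no predecessor.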
Lagged variables can often represent changes in storage. We know from continuity
where I is inflow, O is outflow and $\Delta S$ is the change in storage for a particular hydrologic system. In many systems of areal extent A, $(Y_t - Y_{t-1})A$ may be proportional to the change in storage from time t − 1 to t. A prediction of $Y_t$ might be based on the difference in inflow and outflow from t − 1 to t and $Y_{t-1}$.
$$ \bar{I} - \bar{O} = (Y_t - Y_{t-1})A $$
$$ \frac{I_t + I_{t-1}}{2}\,\Delta t - \frac{O_t + O_{t-1}}{2}\,\Delta t = (Y_t - Y_{t-1})A $$
An exact test is not available, but Durbin and Watson have obtained lower and upper bounds $d_L$ and $d_U$ such that values of D outside these bounds lead to a decision: the hypothesis cannot be rejected if $D > d_U$, and the hypothesis is rejected if $D < d_L$. If $d_L < D < d_U$, the test is inconclusive. Tables of $d_L$ and $d_U$ are contained in the appendix for various values of n, p, and for levels of significance equal to 0.05 and 0.01.
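The decision rule can be sketched in Python as below. The sketch assumes D is the usual Durbin-Watson statistic, the sum of squared successive differences of the residuals divided by the sum of squared residuals, and that the hypothesis being tested is zero serial correlation against positive serial correlation. The bounds passed in the example are placeholders and are not taken from the appendix tables.

```python
def durbin_watson(e):
    """Durbin-Watson statistic for a residual series e, ordered in time."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(v ** 2 for v in e)
    return num / den

def dw_decision(d, d_lower, d_upper):
    """Apply the bounds test for positive serial correlation."""
    if d > d_upper:
        return "cannot reject"
    if d < d_lower:
        return "reject"
    return "inconclusive"

# Slowly drifting residuals give a small D; alternating residuals give a large D.
d_drift = durbin_watson([1.0, 0.8, 0.5, 0.1, -0.3, -0.6, -0.4, 0.0, 0.4, 0.7])
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Placeholder bounds for illustration only; real values come from the tables.
print(round(d_drift, 2), dw_decision(d_drift, d_lower=0.9, d_upper=1.3))
```

For a test of negative serial correlation, the same function could be called with 4 − D in place of D.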
Neter et al. (1996) indicate that a test for negative serial correlation can be done by using as a test statistic 4 − D. The test is then the same as for positive serial correlation. Helsel and Hirsch (1992) indicate that the Durbin-Watson statistic requires the data to be evenly spaced in time.
Corrective Action
When serial correlation in the errors is detected, the first step should be to determine if some
important explanatory variable is missing from the regression equation. Often in hydrology the
serial correlation is a result of storage in the system. In this case, a measure of this storage may
need to be included as a predictor variable. In other cases some function of time may correct the
problem.
Aggregating data over longer time periods may reduce or eliminate serial correlation. As the
time between observations increases, the dependence of one observation on another can be
expected to decrease. At large enough time intervals, independence may be achieved.
As indicated earlier, the inclusion of lagged variables, both on the dependent and the independent variables, may help reduce serial correlation. The chapter on time series modeling should be consulted for more on this topic.
MULTICOLINEARITY
In multiple linear regression it is unfortunate that the predictor variables in X are called "independent" variables. This terminology reflects that Y is being predicted as a function of X. Thus Y has been termed the "dependent" variable because Y is thought to depend on X. By extension, the X's have become known as the "independent" variables because they are what Y is dependent upon. Independence has a special meaning in statistics that differs from the above, as we have seen. We know that if all of the X's in X are mutually independent, then the correlation matrix computed from X will be a diagonal matrix with ones on the diagonal and zeros elsewhere.
In most natural sciences where "independent" variables are measured values from uncontrolled experimentation, it is rare to achieve true statistical independence. Some level of correlation almost always exists among the predictor variables. These correlations among the independent variables are often called multicolinearities. Much of the discussion on multicolinearity comes from Neter et al. (1996). Generally the term multicolinearity is reserved for the case when rather strong correlations exist within the X matrix.
As the name implies, multiple linear regression attempts to exploit linear relationships between Y and X to develop a prediction or descriptive equation for Y. If two X variables, say $X_1$ and $X_2$, are perfectly linearly related, then $r_{1,2} = 1$. Furthermore, all of the information relative to a linear relationship between Y and $X_2$ will be contained in the relationship between Y and $X_1$. In other words, nothing is gained by including both $X_1$ and $X_2$ in a linear regression with Y. As a matter of fact, there is not a unique linear relationship between Y and $X_1$ and $X_2$ if $X_1$ and $X_2$ are perfectly correlated. Each of the relationships will predict the same value of Y for all $X_1$ and $X_2$ pairs that follow the linear relationship between $X_1$ and $X_2$.
When $X_1$ and $X_2$ are perfectly correlated, the residuals of the regressions Y on $X_1$, Y on $X_2$, and Y on $X_1$ and $X_2$ will all be exactly the same since the same information, in a linear sense, will be contained in all three regressions. (Note the brief mention of multicolinearity following equation 10.8.)
If we now relax the requirement that $X_1$ and $X_2$ are perfectly correlated to requiring that they be "highly" correlated, an approximation to what is discussed above results. Now the residual sum of squares of regressions of Y on $X_1$, Y on $X_2$, and Y on $X_1$ and $X_2$ will be nearly the same depending on the strength of the linear relationship between $X_1$ and $X_2$. Thus, if a regression of Y on $X_1$ is performed followed by a regression of Y on $X_1$ and $X_2$, the reduction in the residual sum of squares brought about by the addition of $X_2$ will be small because very little information in a linear sense is added to the regression.
What may happen if both $X_1$ and $X_2$ are included is that the linear effects between Y and $X_1$ or $X_2$ may be split between $X_1$ and $X_2$ in such a fashion that the regression coefficients do not make physical sense. For example, they may have the wrong sign. Furthermore, the individual regression coefficients may test nonsignificant on both $X_1$ and $X_2$ even though the overall regression is significant.
By splitting the importance of either $X_1$ or $X_2$ among both $X_1$ and $X_2$, the variances of the regression coefficients on $X_1$ and $X_2$ become larger, indicating increased sampling variability relative to these coefficients. Again, this is brought about by splitting the effect of one important linear relationship among two (or more) variables that are closely linearly related. Substantial changes in the values for the regression coefficients upon the addition or removal of a variable from a regression equation are an indication that multicolinearity may be present.
Having both $X_1$ and $X_2$ in the regression equation will not cause prediction problems as long as the predictions are confined to the region of $X_1$ and $X_2$ defined by the original data sets. This means that values used for $X_1$ and $X_2$ for prediction must exhibit the same near linear relationship as did the original values used in estimating the regression coefficients.
Multicolinearity is not restricted to correlations between pairs of X variables. It also includes correlation between any one of the X's and any linear combination of any of the remaining X's. Obviously, correlations between pairs of X's are easily detected from the correlation matrix of X. Correlations with linear functions of several X's are not always easily detected. One way to identify the possibility of an X being correlated with a linear combination of the other X's is to compute the regression of $X_i$ on $X^*$, where $X^*$ is X with $X_i$ removed. The multiple $R^2$ can be examined and used as an indication of multicolinearity. This procedure can be carried out for all of the $X_i$'s, i = 2, ..., p.
A summary of what has been indicated about the effect of multicolinearity is:
1. Multicolinearity in itself does not inhibit the predictive ability of a regression model provided
the prediction is made within the regions of the independent variables used in deriving the
regression coefficients.
3. Individual regression coefficients may be hard to interpret in terms of their impact on Y. They may even have the wrong sign. Thus, even though the overall equation makes a valid prediction, the contributions of the individual X variables may not be decipherable.
4. The values for individual regression coefficients may change substantially upon the addition
or deletion of an X variable that involves multicolinearity.
Detection of Multicolinearity
Some general indications of the possible presence of multicolinearity that have been identified are:
4. Large changes in the values of regression coefficients upon the addition or deletion of a variable from the regression equation.
Possibly the most commonly used formal method for detecting multicolinearity is through the use of the Variance Inflation Factor, VIF, defined as
$$ \mathrm{VIF} = \frac{1}{1 - R_i^2} $$
where $R_i^2$ is the multiple coefficient of determination between $X_i$ and all of the other X's in the regression equation. When $R_i^2$ is zero, then $X_i$ is linearly independent of the other X's and the VIF is one. If $R_i^2 = 1$, then the Var($\hat{\beta}_i$) and the VIF are unbounded. Large values of VIF indicate the presence of multicolinearity. The exact value of VIF at which multicolinearity is declared depends on the individual investigator. Some use a value of 5 and others 10. A VIF of 10 corresponds to an $R_i^2$ of 0.90 and a VIF of 5 corresponds to an $R_i^2$ equal to 0.80.
Some will compute an average VIF over all p − 1 regression coefficients and declare that if this average VIF is "considerably" larger than one, multicolinearity is indicated.
Some statistical packages will compute the VIF. Some statistical packages use an indicator called the tolerance, which is 1/VIF. Thus, a VIF of 10 corresponds to a tolerance of 0.1 and a VIF of 5 corresponds to a tolerance of 0.2.
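The correspondence between $R_i^2$, VIF, and tolerance is simple enough to verify directly. The Python sketch below uses illustrative function names; the $R_i^2$ value itself would come from regressing $X_i$ on the other X's.

```python
def vif(r2):
    """Variance inflation factor for a predictor whose regression on the
    other predictors has coefficient of determination r2."""
    if r2 >= 1.0:
        return float("inf")   # perfect linear dependence: VIF is unbounded
    return 1.0 / (1.0 - r2)

def tolerance(r2):
    """Tolerance is the reciprocal of the VIF, which equals 1 - r2."""
    return 1.0 - r2

for r2 in (0.0, 0.80, 0.90):
    print(r2, round(vif(r2), 2), round(tolerance(r2), 2))
```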
Table 10.4. Correlation matrix for data of Haan and Read (1970)
Table 10.5. Regression analysis of data of Haan and Read (1970) (10 independent variables)
Analysis of Variance
Regression
Residual
Total corrected for mean
R = 0.98
Std. Error = 0.69
Variable   b   $s_{b_i}$   t
Constant
Precipitation
A
S
L
P
di
Rs
F
R,
Since the correlation matrix is symmetrical, it is customary to show only the diagonal elements
and the elements either above or below the diagonal.
The mean and standard deviation of runoff are 16.55 and 1.93 inches, respectively. Table 10.5 contains the results of the multiple regression of runoff on all 9 of the independent variables. Because an intercept term was included, p is equal to 10. In the ANOVA table, the sum of squares for the mean and the total sum of squares are not shown. Instead the total sum of squares corrected for the mean is given. The F that is given is the calculated F for the overall regression equation (from equation 10.18) used in testing the hypothesis that the regression does not explain a significant amount of the variation in Y. Because $F_{0.95,9,3}$ is 8.81, this hypothesis is rejected.
The lower part of table 10.5 contains the estimated regression coefficients, the standard errors of the regression coefficients, and the calculated t (equation 10.17) used in testing $H_0$: $\beta_i = 0$. The only b's with calculated t's greater than 2.0 are those based on precipitation, P, and R,. If all of the variables except these three and the intercept are eliminated at one time, the regression shown in table 10.6 results. In going to the second regression, $R^2$ has been reduced from 0.97 to 0.91, the F increased to 28.7, and the standard error has remained unchanged. All of the regression coefficients with the exception of the intercept are now significantly different from zero at the one percent level of significance since $t_{0.995,9}$ is 3.25.
Table 10.6. Regression analysis of data of Haan and Read (1970) (4 independent variables)
Analysis of Variance
Variable   b   $s_{b_i}$   t
The t test used to test the hypothesis that $\beta_i = 0$ makes the test assuming that all of the other $\beta$'s are still in the equation. Thus, when a decision is made to eliminate more than one variable, the t's are unreliable and the F test using equation 10.19 should be used. This test determines if several variables are simultaneously making a significant contribution to explaining the variation in the dependent variable. As an illustration of the use of equation 10.19, the hypothesis that $\beta_A = \beta_S = \beta_L = \beta_{d_i} = \beta_{R_s} = \beta_F = 0$ will be tested. For this example n = 13, p = 10, k = 6, $Q_2$ = 43.45, $Q_2^*$ = 40.64, and $Q_1$ = 1.44. The F calculated from equation 10.19 is 0.98. Since $F_{0.95,6,3}$ = 8.94, it is concluded that the variables A, S, L, $d_i$, $R_s$, and F are not significant.
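The calculation can be reproduced with a few lines of Python. The sketch assumes that equation 10.19 has the usual extra-sum-of-squares form, $F = [(Q_2 - Q_2^*)/k] / [Q_1/(n - p)]$, an assumption that reproduces the value of 0.98 quoted above. The function and argument names are illustrative.

```python
def partial_f(q2_full, q2_reduced, q1_full, n, p, k):
    """F statistic for testing whether k coefficients are simultaneously zero.

    q2_full    : regression sum of squares, full model
    q2_reduced : regression sum of squares, reduced model
    q1_full    : residual sum of squares, full model (n - p degrees of freedom)
    """
    extra_ms = (q2_full - q2_reduced) / k     # mean square due to the k variables
    residual_ms = q1_full / (n - p)           # residual mean square, full model
    return extra_ms / residual_ms

f_calc = partial_f(43.45, 40.64, 1.44, n=13, p=10, k=6)
f_table = 8.94   # tabulated F(0.95; 6, 3) quoted in the text

print(round(f_calc, 2), "not significant" if f_calc < f_table else "significant")
```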
The resulting prediction model is
Variables included
of 3 variables in the regression equation and each combination would produce a different s, $R^2$, and F.
where
Standard regression techniques can now be used to estimate $\alpha'$ and $\beta'$ for equation 10.33 and $\alpha$ and $\beta$ estimated from equations 10.34. Two important points should be noted. First, the estimates of $\alpha$ and $\beta$ obtained in this way will be such that $\sum (Y_i' - \hat{Y}_i')^2$ is a minimum and not such that $\sum (Y_i - \hat{Y}_i)^2$ is a minimum. Second, the error term on equation 10.33 is additive ($Y' = \alpha' + \beta'X' + \varepsilon'$), implying that it is multiplicative on equation 10.31 ($Y = \alpha X^{\beta} \varepsilon$). These errors are related by $\varepsilon' = \ln \varepsilon$. The assumptions used in hypothesis testing and confidence intervals must now be valid for $\varepsilon'$ and the tests and confidence intervals made relative to the transformed model.
In some situations the logarithmic transformation makes the data conform more closely to the regression assumptions. For example, if the data plot as in figure 10.3, a logarithmic transformation may make the assumption of constant variance on the error more realistic.
The normal equations for a logarithmic transformation are based on a constant percentage error along the regression line, whereas the standard regression is based on a constant absolute error along the regression line. For example, the difference between $Y_i$ = 200 and $Y_i$ = 100 on an arithmetic scale is 100 times as large as the difference between $Y_i$ = 2 and $Y_i$ = 1. However, on a logarithmic scale ln 200 − ln 100 = 5.29832 − 4.60517 = 0.69315, which is the same as ln 2 − ln 1 = 0.69315 − 0.000 = 0.69315. In a situation of this type, the standard regression procedure would attempt to fit the point at Y = 100 in order to minimize $\sum (Y_i - \hat{Y}_i)^2$ at the expense of the point $Y_i$ = 1 because its contribution to $\sum (Y_i - \hat{Y}_i)^2$ is small. The logarithmically transformed model would give equal percentage weight to both points.
Fig. 10.3. Example of the effect of a logarithmic transformation on the error variance.
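The equal-ratio property and the fitting of a power model through its logarithmic form can be sketched in Python as follows. The model $Y = \alpha X^{\beta}$ is fitted by simple least squares on ln Y versus ln X; the function names and the data are hypothetical, and the data are constructed to follow the model exactly so that the fit recovers the assumed parameters.

```python
import math

def fit_line(x, y):
    """Simple least squares; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return ybar - slope * xbar, slope

def fit_power_model(x, y):
    """Fit Y = alpha * X**beta by regressing ln Y on ln X."""
    a_prime, b_prime = fit_line([math.log(v) for v in x],
                                [math.log(v) for v in y])
    return math.exp(a_prime), b_prime   # alpha = exp(alpha'), beta = beta'

# Equal ratios give equal differences on the logarithmic scale.
diff_large = math.log(200) - math.log(100)
diff_small = math.log(2) - math.log(1)

# Hypothetical data lying exactly on Y = 3 X^0.5.
x = [1.0, 2.0, 4.0, 8.0]
y = [3.0 * v ** 0.5 for v in x]
alpha, beta = fit_power_model(x, y)
print(round(diff_large, 5), round(diff_small, 5), round(alpha, 3), round(beta, 3))
```

With real data the fitted line minimizes the squared errors of ln Y, not of Y, which is the first of the two points noted above for the transformed model.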
The above discussion can be extended to the model
can be transformed to
$\ln Y = \ln \alpha + \beta X$
Yevjevich (1972a) lists several possible transformations. Whatever the transformation, it
must be remembered that the principles of and assumptions regarding least squares apply to the
transformed model, not the original model.
where I is an indicator variable. Using this approach, the data would be coded such that I would
be 0 for one of the years (say 1991) and 1 for the other year. The resulting equation would then
effectively be
Y =a + bX for 1991
Y = (a + c) + bX for 1992
Thus, the slopes for the two regressions are the same, but the intercepts are a function of
year. The advantage of using the indicator variable is that all of the data are used to estimate a
common slope for the two lines. If two independent regressions were done, the slopes would
likely be different.
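Indicator variables can be used to generate two lines having different slopes but a common
intercept using the model
$Y = a + bX + cIX$
which gives
$Y = a + bX$ for 1991
$Y = a + (b + c)X$ for 1992
Two lines having different slopes and different intercepts are generated by the model
$Y = a + bX + cI + dIX$
which gives
$Y = a + bX$ for 1991
$Y = (a + c) + (b + d)X$ for 1992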
Obviously, the use of indicators uses extra degrees of freedom and thus requires more data for
parameter estimation.
The use of indicator variables can be extended to produce three lines. For three equally
spaced lines having a common slope, the appropriate model is
$Y = a + bX + cI$
where values of -1, 0, and 1 are used for I. The resulting models are
$Y = (a - c) + bX$ for $I = -1$
$Y = a + bX$ for $I = 0$
$Y = (a + c) + bX$ for $I = 1$
Three unequally spaced lines can be generated using the model
$Y = a + bX + cI_1 + dI_2$
with the indicator variables coded as shown below. The resulting three equations are given in the last column.
Line    I1    I2    Resulting equation
1       0     0     Y = a + bX
2       0     1     Y = (a + d) + bX
3       1     0     Y = (a + c) + bX
Three lines with different slopes and intercepts can be generated from the model
$Y = a + bX + cI_1 + dI_2 + eI_1X + fI_2X$
using the same values for the indicator variables as above, with the result
$Y = a + bX$ for line 1
$Y = (a + d) + (b + f)X$ for line 2
$Y = (a + c) + (b + e)X$ for line 3
Occasionally, it is desirable to fit a line through a set of data such that the line has a definite
break in its slope at some fixed point X = C. Figure 10.5 shows such a situation.
A regression of the form
$Y = a + bX + c(X - C)I$
will accomplish this where I = 0 for $X \le C$ and I = 1 for $X > C$. The resulting equations are
$Y = a + bX$ for $X \le C$
$Y = (a - cC) + (b + c)X$ for $X > C$
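The two resulting equations above are consistent with a single model of the form Y = a + bX + c(X - C)I; that form is inferred here from the resulting equations rather than quoted. The following Python sketch evaluates it and confirms that the two segments meet at X = C. The coefficient values are hypothetical and serve only as an illustration.

```python
def broken_line(x, a, b, c, break_point):
    """Evaluate Y = a + b*X + c*(X - C)*I with I = 0 for X <= C and I = 1 for X > C."""
    indicator = 1.0 if x > break_point else 0.0
    return a + b * x + c * (x - break_point) * indicator

# Hypothetical coefficients and break point, for illustration only.
a, b, c, C = 2.0, 0.5, 1.5, 10.0

at_break_left = broken_line(C, a, b, c, C)        # first segment at X = C
at_break_right = (a - c * C) + (b + c) * C        # second-segment equation at X = C
beyond_break = broken_line(12.0, a, b, c, C)      # model value for X > C
second_segment = (a - c * C) + (b + c) * 12.0     # same point from the X > C equation

print(at_break_left, at_break_right)
print(beyond_break, second_segment)
```

Because the extra term is proportional to (X - C), it vanishes at the break, so the slope changes from b to b + c without a jump in Y.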
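Three slopes can be accomplished using a model of the form
$Y = a + bX + c(X - C_1)I_1 + d(X - C_2)I_2$
where $C_1$ and $C_2$ ($C_1 < C_2$) are the values of X at which the slope changes and the indicator
variables have values given by
$I_1 = 1$ for $X > C_1$ and 0 otherwise
$I_2 = 1$ for $X > C_2$ and 0 otherwise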
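Finally, a line with a change in slope and a jump or discontinuity as shown in figure 10.6 can
be estimated using the model
$Y = a + bX + c(X - C)I_1 + dI_2$
where $I_1$ and $I_2$ are 1 for $X > C$ and 0 for $X \le C$. The resulting equations are
$Y = a + bX$ for $X \le C$
$Y = (a - cC + d) + (b + c)X$ for $X > C$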
LOGISTIC REGRESSION
Frequently, one must be able to classify a variable into one of two possible classes. For
example, in looking at ground water for drinking one might want to class the water as acceptable
(Y = 1) or unacceptable (Y = 0). Based on a set of independent variables, it may be desired to
determine the probability, p, that water from a particular well is acceptable for drinking. This is
equivalent to determining prob(Y = 1).
A regression model for classifying the binary variable Y as a 0 or 1 might be written
$$Y_i = \beta' X_i + \epsilon_i \qquad (10.43)$$
then
$$E(Y_i) = \beta' X_i$$
so that
$$p_i = \text{prob}(Y_i = 1) = E(Y_i) = \beta' X_i$$
A major difficulty with this regression model is that the assumptions of ordinary least
squares regression are violated in that the error term is not iid $N(0, \sigma^2)$. Neter et al. (1996) show
that the $\epsilon_i$ are not normal since they can take on only values of $1 - \beta' X_i$ when $Y_i = 1$ and $-\beta' X_i$
when $Y_i = 0$. They also show that $\sigma^2_{\epsilon_i}$ is $(\beta' X_i)(1 - \beta' X_i)$, indicating a nonconstant variance.
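Experience has shown that often p or E(Y) is related to $\beta' X_i$ in a sigmoidal fashion as in
figure 10.7. Such a function can be expressed as
$$p = \frac{\exp(\beta' X)}{1 + \exp(\beta' X)}$$
Defining the odds, Od, as the ratio of the probability that Y = 1 to the probability that Y = 0,
one obtains
$$Od = \frac{p}{1 - p} = \exp(\beta' X)$$
and
$$p' = \ln(Od) = \ln\left(\frac{p}{1 - p}\right) = \beta' X \qquad (10.51)$$
The transformation to p' is sometimes called a logit transformation. As p goes from 0 to 1, p'
goes from $-\infty$ to $\infty$.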
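The logit transformation and its inverse can be sketched in Python as follows. The function names `logit` and `inverse_logit` are illustrative choices, not names from the text.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(p_prime):
    """Sigmoidal curve that maps any real p' back to a probability."""
    return 1.0 / (1.0 + math.exp(-p_prime))

for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    p_prime = logit(p)
    odds = p / (1.0 - p)
    print(f"p = {p:.2f}  odds = {odds:8.4f}  p' = {p_prime:+.4f}  "
          f"back-transformed = {inverse_logit(p_prime):.2f}")
```

The loop shows p' moving from large negative to large positive values as p moves from near 0 to near 1, with p' = 0 at p = 0.5 (even odds).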
Equation 10.51 provides an alternative to equation 10.43. Neter et al. (1996) present details
of maximum likelihood estimation of the $\beta$'s for equation 10.51. The procedure is known as
logistic regression, with equation 10.51 being the logistic model. Some statistical packages
contain routines for carrying out the computations involved in logistic regression. The programs
result in estimates, $b_i$, for the $\beta$'s and the standard error of the estimate for $b_i$, $s_{b_i}$.
A test of the hypothesis that $\beta_i = 0$ versus $\beta_i \ne 0$ is made by computing
$$z_c = \frac{b_i}{s_{b_i}}$$
The hypothesis is rejected if $|z_c| > z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ is the standard normal variate and $\alpha$ is
the level of significance. Most logistic regression computer programs provide estimates of $s_{b_i}$.
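As a minimal sketch of this coefficient test, the Python function below forms the ratio of an estimated coefficient to its standard error and compares it with the two-sided standard normal critical value. The function name is hypothetical; the values 0.022 and 0.180 are the coefficient and standard error quoted for X3 in example 10.3, and the 5% significance level is an assumption.

```python
from statistics import NormalDist

def coefficient_z_test(b, s_b, alpha=0.05):
    """Two-sided z test of H0: beta = 0 using z_c = b / s_b."""
    z_c = b / s_b
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z_c, z_crit, abs(z_c) > z_crit

z_c, z_crit, reject = coefficient_z_test(0.022, 0.180)
print(f"z_c = {z_c:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

Here the computed ratio is far below the critical value, so the hypothesis of a zero coefficient is not rejected.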
The significance of the overall model is tested by formulating the hypothesis that $\beta = 0$ and
using a chi-square test based on p - 1 degrees of freedom, where p is the number of coefficients
estimated. The chi-square value is based on the likelihood ratio. Again, the calculated chi-square
value is generally provided by a logistic regression program. If the calculated chi-square exceeds
the table value, then the hypothesis that $\beta = 0$ is rejected.
Neter et al. (1996) and Helsel and Hirsch (1992) discuss other aspects of logistic regression
including evaluating the overall ability of the model to correctly classify observed values of the
dependent variable.
The predicted value, $E(P_i)$, represents the predicted probability that $Y_i$ should be set equal
to 1. One classification scheme is to use the logistic regression model to estimate $E(P_i)$ and set
$Y_i = 0$ if $E(P_i) < 0.5$ and $Y_i = 1$ if $E(P_i) \ge 0.5$. Such a decision rule is appropriate if it is equally
likely that the actual outcome is 0 or 1 and it is desired to have equal probability of incorrectly
classifying the outcome.
A decision rule can also be arrived at in other ways. For instance, if some observations had
been set aside and not used to estimate the model, various decision rules might be evaluated and
the one selected that performs the best in classifying these holdout observations. The same
scheme could be used if there were no holdout observations by using the model to classify the
observations used to develop the model. Obviously, this latter procedure would not independently
evaluate the model because the same observations are used to evaluate as were used to
develop the model.
Example 10.3. In a certain locality wetland areas are thought to be impacted by groundwater
pumpage. By examining a wetland, ecologists can determine if a wetland is impacted or not. By
looking at certain bioindicators such as fungi lines on trees, the normal water level for a wetland
may be determined. Water level records can be used to estimate the median water level. The
distance to the nearest pumping water well is also known. It is desired to develop a model for
classifying a wetland as impacted (Y = 0) or not impacted (Y = 1) based on the difference in the
median water level and the normal water level, X2, and the distance to the nearest pumping well, X3.
Solution: A logistic regression model of the form of equation 10.49 is fit to the data shown in
table 10.8. The results of the logistic regression are shown in table 10.9. Table 10.9 shows that
the overall regression is significant ($\chi^2 = 40.14$) but that $\hat{\beta}_3$ on $X_3$ is not significant
($z_c = 0.022/0.180 = 0.12$). A second logistic regression was computed eliminating $X_3$ with the
0.004
0.005
0.007
0.007
0.014
0.015
0.016
0.019
0.02 1
0.022
0.025
0.029
0.037
0.038
0.045
0.054
0.062
0.082
0.207
0.390
0.658
0.304
0.123
0.028
0.008
0.003
0.001
0.001
0.000
results shown in table 10.10. The overall regression is significant ($\chi^2 = 40.12$), and both $\hat{\beta}_1$ and $\hat{\beta}_2$ are
significantly different from zero. The resulting model is
Classification Table
Predicted
Actual 0 1 Total
0 Count
Row percent
Column percent
1 Count
Row percent
Column percent
Total Count
Row percent
Column percent
Percent correctly classified = 89.74.
This model correctly classified 35 of the 39 observations. Figure 10.8 shows the data and the
resulting model. Using a probability level of 0.5, E(Y) = 0.5, for classification, the value of $X_2$
that the model equation 10.53 specifies as the division between impacted and unimpacted
wetlands is
$$X_2 = \frac{7.09}{3.44} = 2.06 \text{ feet}$$
If an E(Y) = 0.78 is used as a cutoff for impact evaluation, only 2 of the wetlands would be
misclassified. The true test of the model would be how it performs on an independent data set.
Exercises
10.1. Use the matrix methods of this chapter to work example 9.1.
10.3. Use the matrix methods of this chapter to work example 9.4.
10.4. Use the matrix methods of this chapter to work example 9.5. Calculate the confidence
interval for the point X equals 50.0 inches of rainfall.
10.6. Use the data in table 10.2 to develop a prediction equation for annual runoff using the
model $Y = \alpha X_1^{\beta_1} X_2^{\beta_2} \cdots X_p^{\beta_p}$. Would you prefer this equation over the one contained in table
10.6? Compare the equations in terms of the confidence interval on the regression lines at the
mean values of the variables contained in the respective equations.
10.8. Derive the normal equations that minimize $\sum (Y_i - \hat{Y}_i)^2$ for the model $Y = aX^b$. Suggest
a method for solving these equations.
10.9. The relationship between stage and discharge (rating curve) for many streams has been
found to follow an equation of the type $Q = aS^b$, where Q is the discharge and S is the stage.
Using the following data from the Cumberland River at Cumberland Falls, Kentucky, derive such
a rating curve. Test the hypothesis that b = 1.5.
10.10. The data in table 10.11 is a partial listing of the data used by Benson (1964) in a study of
floods in the Southwest. Derive a prediction equation for Q,, the mean annual flood, in terms of
the remaining variables. Consider both the models given by equation 10.1 and by the multiple
regression extension of equation 10.24.
where $s_{X,Y}$ is the sample covariance between X and Y, and $s_X$ and $s_Y$ are the sample standard
deviations of X and Y, respectively. Figure 3.5 and the accompanying description discussed some
typical values for $r_{X,Y}$ and their meaning. Here it was emphasized that 1) $r_{X,Y}$ can range from -1
to 1; 2) $r_{X,Y} = \pm 1$ implies a perfect linear relationship between X and Y; 3) $r_{X,Y} = 0$ implies
linear independence but leaves room for other types of dependence; and 4) if X and Y are
independent, then $r_{X,Y} = 0$.
In chapters 9 and 10 the concept of correlation was extended to give a measure of the
strength of the linear relationship between a random variable Y and a second variable which was
a linear function of one or more X variables, each of which may or may not be a random variable.
Throughout the text many of the results that have been developed have included the
assumption that the random variables were independent or that the sample being analyzed was
composed of random observations. A random observation simply means that every possible
element in the sample space has an equal chance of being selected during any trial.
Random variables may be either uncorrelated ($r_{X,Y} = 0$) or correlated ($r_{X,Y} \ne 0$). Even when
sampling from uncorrelated populations, it would be rare for the sample correlation coefficient to
be exactly zero. More likely it will deviate from zero due to chance. Thus, statistical tests are
needed to evaluate whether the deviation of the sample correlation coefficient from zero may be
ascribed to chance or whether the deviation is too large to attribute to chance.
If successive observations in a time series of hydrologic data are correlated, this must be
taken into account in any inferences made about the data or in attempts to model the process that
produced the data. Again, a procedure is required for determining if the sampled elements from
a time series can be considered as random. These and other properties of correlation are the
subject of this chapter.
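has a t distribution with n - 2 degrees of freedom, where n is the sample size. Thus, to test
$H_0$: $\rho = 0$, the test statistic is calculated from equation 11.3 and $H_0$ is rejected if $|t| > t_{1-\alpha/2,\,n-2}$.
If n is moderately large (n > 25), then the quantity W is approximately normally distributed
with mean $\omega$ and variance $(n - 3)^{-1}$ where
$$W = \frac{1}{2} \ln\left[\frac{1 + r}{1 - r}\right] = \text{arctanh } r$$
and
$$\omega = \frac{1}{2} \ln\left[\frac{1 + \rho}{1 - \rho}\right] = \text{arctanh } \rho$$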
To test the hypothesis $H_0$: $\rho = \rho^*$ against the alternative $H_a$: $\rho \ne \rho^*$ for $\rho^*$, a known
constant, the quantity
$$z = (W - \omega^*)\sqrt{n - 3} \qquad (11.6)$$
where $\omega^* = \text{arctanh } \rho^*$, can be considered to be normally distributed with a mean of zero and a variance of one. If
$|z| > z_{1-\alpha/2}$ (Z is the standard normal variable), $H_0$ is rejected.
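The test just described can be sketched in Python using the arctanh transformation, whose variance is 1/(n - 3). The function name and the sample values (r = 0.70, n = 30, hypothesized value 0.50) are hypothetical and are not taken from the text; the 5% significance level is also an assumption.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, rho_star, n, alpha=0.05):
    """Test H0: rho = rho_star using W = arctanh(r), which has variance 1 / (n - 3)."""
    w = math.atanh(r)                    # transformed sample correlation
    omega_star = math.atanh(rho_star)    # transformed hypothesized correlation
    z = (w - omega_star) * math.sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z, abs(z) > z_crit

# Hypothetical sample: r = 0.70 from n = 30 pairs, tested against rho* = 0.50.
z, reject = fisher_z_test(0.70, 0.50, 30)
print(f"z = {z:.3f}, reject H0: {reject}")
```

Even though 0.70 looks well above 0.50, a sample of 30 is not enough to reject the hypothesis at the 5% level in this illustration, which shows how wide the sampling variation of r is for moderate n.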
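Confidence limits on $\rho$ can be estimated from
$$l = \tanh\left(W - \frac{z_{1-\alpha/2}}{\sqrt{n - 3}}\right), \qquad u = \tanh\left(W + \frac{z_{1-\alpha/2}}{\sqrt{n - 3}}\right) \qquad (11.7)$$
To test the hypothesis $H_0$: $\rho_1 = \rho_2 = \cdots = \rho_k = \rho^*$ using k independent sample correlation
coefficients $r_i$, each based on $n_i$ observations, note that the quantity
$$\chi^2 = \sum_{i=1}^{k} (n_i - 3)(W_i - \omega^*)^2$$
has a chi-square distribution with k degrees of freedom. $H_0$ is rejected if $\chi^2 > \chi^2_{1-\alpha,k}$. Rejection
of the hypothesis implies that at least one of the $\rho_i$'s is not equal to $\rho^*$.
The hypothesis $H_0$: $\rho_1 = \rho_2 = \cdots = \rho_k$ (all correlation coefficients are equal) is tested by
noting that
$$\chi^2 = \sum_{i=1}^{k} (n_i - 3)(W_i - \bar{W})^2 \qquad (11.9)$$
where $\bar{W}$ is the mean of the $W_i$ weighted by $(n_i - 3)$, has a chi-square distribution with k - 1 degrees of freedom.
$H_0$ is rejected if $\chi^2 > \chi^2_{1-\alpha,k-1}$. Rejection of this hypothesis implies that at least one of the $\rho_i$'s is
not equal to the other $\rho_j$'s for $i \ne j$.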
If the hypothesis that all of the correlation coefficients are equal is not rejected, it may be
desirable to calculate a "best" combined estimate $\bar{r}$ of the common correlation $\rho$ ("best" means
weighted with inverse variance). Such an estimate is given by
$$\bar{r} = \tanh \bar{W} \qquad (11.11)$$
where $\bar{W} = \sum (n_i - 3) W_i / \sum (n_i - 3)$.
Example 11.1. Burges and Johnson (1973) present the following sample correlation coefficients
for monthly flow volumes for the Sauk River in Washington and Arroyo Seco in California. In the
following table $r_j$ represents the sample correlation coefficient between the monthly flow
volumes in months j and j - 1. Assume the coefficients are based on 30 observations each and
that the parent populations are all bivariate normal (Burges and Johnson actually used the
lognormal distribution in their study). 1) Test the hypothesis that p, for the Sauk River is equal to
0.50. 2) Compute the 95% confidence limits for $\rho_8$ of the Sauk River. 3) Test the hypothesis that
p, on Arroyo Seco is zero. 4) Test the hypothesis that on each of the streams all of the monthly
correlation coefficients are equal. 5) Assume the hypothesis in part 4 is accepted for the Sauk
River and estimate an average correlation coefficient for the Sauk River.
Month j Sauk River Arroyo Seco
October
November
December
January
February
March
April
May
June
July
August
September
Solution:
1) H,: p, = 0.5 for Sauk River
From equation 11.6
where
2) The 95% confidence limits on p, for the Sauk River are calculated from equation 11.7 as
4) The test, H,: all pj are equal, is tested by using equation 11.9.
i ri Wi = arctanh ri ri Wi = arctanh ri
5) An average correlation coefficient for the Sauk River is calculated from equation 11.11.
where $\bar{W} = 0.585$.
Comment: In parts 4 and 5 of this problem several simplifications were made in the summations
since ni was equal to 30 for all i; in general this cannot be done. In part 5 an overall average
correlation coefficient was calculated. Since in part 4 it was shown that the correlations for the
various months are significantly different, the utility of an overall average correlation is suspect.
Graybill (1961) presents the exact probability distribution of r and states that for small
samples, the exact distribution should be used in hypothesis testing. References to tables that aid
in hypothesis testing for small samples and examples of their use are also given.
Again, it is emphasized that the above tests are based on a random sample from multivariate
normal distributions. Even under these conditions, only the test of $H_0$: $\rho = 0$ conducted using
equation 11.3 is "exact". The other tests are approximate, with the approximation improving as
the sample size increases.
For nonnormal populations, it may be possible to transform the variables to a normal
situation and then apply the above tests to the transformed data. If a transformation of a
nonnormal random variable is not possible or not desired, then the above tests must be considered as
approximate, with the approximation becoming poorer as the coefficient of skew of the random
variables increases.
SERIAL CORRELATION
It is not uncommon to find in a time series of hydrologic data that an observation at one time
period is correlated with the observation in the preceding time period. Such correlation is termed
serial correlation or autocorrelation. By definition, the elements of a sample of data possessing
serial correlation are not random elements. A serially correlated sample of size n contains less
information about a process than a completely random sample of size n. In a serially correlated
sample, part of the information contained in each observation is already known through its
correlation with the preceding observation.
Such correlation can also exist between an observation at one time period and an
observation k time periods earlier for k = 1, 2, .... In this discussion of serial correlation, it is
assumed that observations are equally spaced in time and that the statistical properties of the
process do not change with time (stationary process). The population serial correlation
coefficient is denoted by $\rho(k)$ (and frequently called the autocorrelation coefficient) where k is the lag
or number of time intervals between the observations being considered. The sample serial
correlation coefficient will be given by r(k). The sample serial correlation coefficient for a sample of
size n is given by
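$$r(k) = \frac{\sum_{i=1}^{n-k} x_i x_{i+k} - \dfrac{\sum_{i=1}^{n-k} x_i \sum_{i=1}^{n-k} x_{i+k}}{n - k}}{\left[\sum_{i=1}^{n-k} x_i^2 - \dfrac{\left(\sum_{i=1}^{n-k} x_i\right)^2}{n - k}\right]^{1/2} \left[\sum_{i=1}^{n-k} x_{i+k}^2 - \dfrac{\left(\sum_{i=1}^{n-k} x_{i+k}\right)^2}{n - k}\right]^{1/2}} \qquad (11.12)$$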
From equation 11.12 it is seen that r(0) is unity. That is, the correlation of an observation
with itself is 1. Equation 11.12 also demonstrates that as k increases, the number of pairs of
observations used in estimating r(k) decreases because all of the summations contain n  k
terms. Serial correlation should only be estimated for k considerably less than n.
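A minimal Python sketch of the lag-k sample serial correlation follows. It assumes the usual form in which every sum runs over the n - k overlapping pairs, consistent with the properties noted above (r(0) = 1 and all summations containing n - k terms). The function name and the short series are made up for illustration.

```python
import math

def serial_correlation(x, k):
    """Lag-k sample serial correlation computed from the n - k overlapping pairs."""
    n = len(x)
    m = n - k                      # number of pairs entering every summation
    lead = x[:m]                   # x_i     for i = 1, ..., n - k
    lag = x[k:]                    # x_(i+k) for i = 1, ..., n - k
    cross = sum(a * b for a, b in zip(lead, lag)) - sum(lead) * sum(lag) / m
    ss_lead = sum(a * a for a in lead) - sum(lead) ** 2 / m
    ss_lag = sum(b * b for b in lag) - sum(lag) ** 2 / m
    return cross / math.sqrt(ss_lead * ss_lag)

# Made-up annual values used only to exercise the function.
flows = [12.0, 15.0, 14.0, 18.0, 21.0, 19.0, 16.0, 13.0, 17.0, 20.0]
for k in range(0, 4):
    print(k, len(flows) - k, round(serial_correlation(flows, k), 3))
```

The second printed column is the number of pairs used at each lag; it shrinks as k grows, which is why estimates at lags approaching n are unreliable.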
If $\rho(k) = 0$ for all $k \ne 0$, the process is said to be a purely random process. This indicates that
all of the observations in a sample will be independent of each other. Chapter 14, Yevjevich
(1972b), Matalas (1966, 1967b), Julian (1967), and others treat hydrologic time series in more detail.
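Anderson (1942) has proposed a test of significance for the serial correlation coefficient for
a circular, normal, stationary time series. A circular series is one that closes on itself so that $x_n$ is
followed by $x_1$. Under these assumptions
$$r(k) = \frac{\sum_{i=1}^{n} x_i x_{i+k} - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n} \qquad (11.13)$$
where $x_{n+j} = x_j$.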
Although the assumption of a circular series is unrealistic, values of r(k) from equation
11.13 will not differ greatly from those calculated from equation 11.12 if n is large in comparison
to k. Under these conditions r(k) will be approximately normally distributed with mean
$-1/(n - 1)$ and variance $(n - 2)/(n - 1)^2$ if $\rho(k) = 0$. The confidence limits on $\rho(k)$ are then
estimated by
$$l = \frac{-1 - z_{1-\alpha/2}\sqrt{n - 2}}{n - 1}, \qquad u = \frac{-1 + z_{1-\alpha/2}\sqrt{n - 2}}{n - 1} \qquad (11.14)$$
If the calculated r(k) falls outside these confidence limits, the hypothesis that $\rho(k)$ is zero
[$H_0$: $\rho(k) = 0$ versus $H_a$: $\rho(k) \ne 0$] is rejected.
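The limits implied by the stated mean, -1/(n - 1), and variance, (n - 2)/(n - 1)^2, can be computed with the short Python sketch below. The function name is illustrative, and the record length n = 18 is an assumption made only for the demonstration.

```python
import math
from statistics import NormalDist

def anderson_limits(n, alpha=0.05):
    """Limits for r(k) when rho(k) = 0: mean -1/(n-1), variance (n-2)/(n-1)**2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    center = -1.0 / (n - 1)
    half_width = z * math.sqrt(n - 2) / (n - 1)
    return center - half_width, center + half_width

lower, upper = anderson_limits(18)      # n = 18 is an assumed record length
print(f"{lower:.3f} < r(k) < {upper:.3f}")

# Width of the interval for longer records.
for n in (50, 100, 400):
    lo_n, hi_n = anderson_limits(n)
    print(n, round(lo_n, 3), round(hi_n, 3))
```

The interval is not symmetric about zero because the expected value of r(k) under the null hypothesis is slightly negative, and it narrows only slowly (roughly as one over the square root of n) as the record lengthens.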
Example 11.2. Frequently, in the analysis of runoff volumes, one finds there is significant serial
correlation caused by storages on the watershed. Appendix C contains a listing of the monthly
and annual runoff volumes for Cave Creek near Lexington, Kentucky. Test the hypothesis that
$\rho(1) = 0$ for the annual runoff volumes.
Solution: This solution assumes $\alpha = 0.05$ and is based on equation 11.14, and therefore assumes
that the annual runoff is normally distributed and is a stationary time series. Furthermore, $\rho(1)$ is
estimated from equation 11.13 assuming that the series is circular [in this case this is equivalent
to assuming $x_{n+1} = x_1$ in calculating r(1)].
Since $-0.520 < r(1) < 0.402$, $H_0$: $\rho(1) = 0$ is not rejected.
Comment: From the width of the confidence interval, it is apparent that the above test is not very
powerful for small samples. A sample of around 400 observations would be required to reject $H_0$:
$\rho(k) = 0$ if r(k) = 0.1.
Matalas (1967b) has suggested that for hydrologic data r(1) tends to be greater than zero due
to persistence caused by storage. If r(1) is found to be less than zero, it is in many cases difficult
to explain hydrologically. In this case one might take r(1) as equal to zero.
Matalas and Langbein (1962) state that in an autocorrelated series, each observation
represents part of the information contained in the previous observation. They discuss stationary
time series having $r(1) \ne 0$ and r(i) = 0 for i = 2, 3, .... They state that n observations of a
nonrandom series having r(1) > 0 give only as much information (measured in terms of a
variance) about the mean as some lesser number, $n_e$, of observations in a purely random time
series.
This lesser number of observations is called the effective number of observations and is
given by
$$n_e = \frac{n}{\dfrac{1 + r(1)}{1 - r(1)} - \dfrac{2 r(1)\left[1 - r(1)^n\right]}{n\left[1 - r(1)\right]^2}} \qquad (11.15)$$
If r(1) = 0, then $n_e = n$. If r(1) > 0, then $n_e < n$. Equation 11.15 is expressed graphically as
figure 11.1. As an example, a 50-year record for which r(1) = 0.2 contains only as much
information about the mean as a 33-year record with r(1) = 0. Note that if n is large or r(1)
small, the second term in the denominator of equation 11.15 can be neglected with little loss
in accuracy.
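The 50-year example can be reproduced with a short Python sketch. It assumes that equation 11.15 has the two-term denominator commonly given for a first-order serially correlated series; that form is an assumption here, chosen because it reproduces the 33-year figure quoted above. The function name is illustrative.

```python
def effective_sample_size(n, r1):
    """Effective number of independent observations for lag-one serial correlation r1.

    Assumed form: n / [ (1 + r1)/(1 - r1) - 2*r1*(1 - r1**n) / (n*(1 - r1)**2) ].
    """
    first_term = (1.0 + r1) / (1.0 - r1)
    second_term = 2.0 * r1 * (1.0 - r1 ** n) / (n * (1.0 - r1) ** 2)
    return n / (first_term - second_term)

full = effective_sample_size(50, 0.2)
approximate = 50 / ((1.0 + 0.2) / (1.0 - 0.2))   # second term neglected
print(round(full, 1), round(approximate, 1))
```

The two printed values differ by well under one observation, which illustrates the remark that the second term in the denominator can often be neglected.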
Fig. 11.1. Relation between n and $n_e$ for various values of $\rho(1)$ (after Matalas and Langbein 1962).
CORRELATION AND REGIONAL ANALYSIS
Matalas and Langbein (1962), Yevjevich (1972a), Alexander (1954), and others
demonstrate that the information relative to estimating the regional mean contained in data from n
stations in a region having an average interstation correlation of $\bar{\rho}$ is equivalent to the information
contained in n' uncorrelated stations in the region where n' is given by
$$n' = \frac{n}{1 + \bar{\rho}(n - 1)} \qquad (11.16)$$
As n gets large, n' approaches $1/\bar{\rho}$. For a $\bar{\rho}$ of 0.2, the maximum information about the
regional mean contained in n stations could not exceed the information contained in 5
uncorrelated stations.
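The limiting behavior can be illustrated with a few lines of Python. The sketch assumes the relation n' = n / [1 + rho(n - 1)], a form consistent with the stated limit of 1/rho as n becomes large; the function name is illustrative.

```python
def equivalent_stations(n, rho_bar):
    """Number of uncorrelated stations carrying the same information about the
    regional mean as n stations with average interstation correlation rho_bar
    (assumed form: n / (1 + rho_bar * (n - 1)))."""
    return n / (1.0 + rho_bar * (n - 1))

for n in (5, 20, 100, 1000):
    print(n, round(equivalent_stations(n, 0.2), 2))
```

With an average correlation of 0.2, the printed values climb toward, but never reach, 5 equivalent stations no matter how many stations are added.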
From a consideration of equation 11.16, it seems it would be logical to establish relatively
few independent hydrologic stations in a region rather than several correlated stations. However,
by the very concept of a hydrologic region, the hydrologic characteristics may be correlated.
Correlation within a region can be exploited to yield improved estimates of a particular
hydrologic variable at a point through correlation with another hydrologic variable at that point
or a similar characteristic at another point. For instance, let Y and X represent two random
hydrologic variables having no serial correlation for which $n_1$ and $n_1 + n_2$ observations,
respectively, are available. Also consider that Y and X are correlated with a correlation coefficient of
$r_{YX}$. Now, the record on Y can be extended by using the correlation between Y and X. This
relation is merely a simple regression considering Y as the dependent and X the independent
variable. The relation is developed based on the $n_1$ common observations. From equation 9.15 it can
be shown that the regression between Y and X is given by
$$\hat{y} = r_{YX} \frac{s_Y}{s_X} x \qquad (11.17)$$
where $r_{YX}$ is the estimate for $\rho_{YX}$ and y and x are deviations from their respective means. Now $n_2$
estimates of Y can be computed from equation 11.17 based on the $n_2$ observations on X not
common to the observations on Y. Let $\bar{Y}_1$ and $\bar{Y}_2$ represent the mean of Y based on the original
$n_1$ observations and the $n_2$ estimated observations, respectively. A new weighted mean for Y
based on $n_1 + n_2$ observations can now be computed from
$$\bar{Y} = \frac{n_1 \bar{Y}_1 + n_2 \bar{Y}_2}{n_1 + n_2} \qquad (11.18)$$
For the $n_2$ additional observations to improve the estimate of $\bar{Y}$, it is necessary that $r_{YX}$ be greater
than $1/(n_1 - 2)$ (Matalas and Langbein 1962).
If the random variables Y and X contain significant serial correlation, the situation is
somewhat more complex. Matalas and Langbein (1962), Matalas and Rosenblatt (1962), and
Yevjevich (1972a) contain treatments of this case. In general, serial correlation serves to decrease
the information relative to the mean while cross-correlation tends to improve information
relative to the mean.
CORRELATION AND CAUSE AND EFFECT
At this point it should be apparent that a high correlation between two variables does not
necessarily imply that there is a causeandeffect relation between the variables. The fact that the
monthly flows on adjacent small streams are correlated does not mean that changes in the
monthly flow of one stream cause a corresponding change in the other stream. More likely both
changes are caused by the same external factors operating on both watersheds.
Again, it is emphasized that independent variables are uncorrelated and correlated variables
are not necessarily related through cause and effect. The dependence in correlated variables is a
stochastic dependence and not a physical or causeandeffect dependence. Dependence and
correlation are linear properties. Dependence among variables may be strong and nonlinear in the
presence of a nonsignificant (linear) correlation coefficient.
SPURIOUS CORRELATION
Spurious correlation is any apparent correlation between variables that are in fact
uncorrelated. Spurious correlation can arise due to clustering of data. For example, in figure 11.2,
the correlation of Y with X within either of the data clusters is near zero. When the data from both
clusters are used to calculate a single correlation coefficient, this correlation is found to be quite
high. This is spurious correlation. Figure 11.3 shows a plot of Y versus X where both $Y_i$ and $X_i$ are
random variables obtained by adding 11 to a random observation from a standard normal
distribution. For a sufficiently large sample $r_{X,Y}$ would be zero. If both $Y_i$ and $X_i$ are divided by yet
a third random observation $Z_i$, obtained in the same manner as $X_i$ and $Y_i$, and the correlation
between $Y_i/Z_i$ and $X_i/Z_i$ computed, for a sufficiently large sample the correlation will be near 0.5.
Figure 11.4 is a plot of $Y_i/Z_i$ versus $X_i/Z_i$. Figure 11.4 indicates that $X_i$ furnishes information
useful in estimating $Y_i$ when in fact $Y_i$ and $X_i$ are uncorrelated. The correlation between $Y_i/Z_i$ and
$X_i/Z_i$ is spurious.
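The ratio experiment described above is easy to repeat numerically. The Python sketch below draws three independent sets of values, each equal to 11 plus a standard normal deviate, and compares the correlation of X with Y against the correlation of X/Z with Y/Z. The sample size and random seed are arbitrary assumptions, and the helper name `correlation` is illustrative.

```python
import math
import random

def correlation(u, v):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

random.seed(1)          # arbitrary seed so that the run is repeatable
n = 20000               # arbitrary, but large enough to show the effect clearly
x = [11.0 + random.gauss(0.0, 1.0) for _ in range(n)]
y = [11.0 + random.gauss(0.0, 1.0) for _ in range(n)]
z = [11.0 + random.gauss(0.0, 1.0) for _ in range(n)]

r_xy = correlation(x, y)
r_ratio = correlation([a / c for a, c in zip(x, z)],
                      [b / c for b, c in zip(y, z)])
print(f"r(X, Y) = {r_xy:.3f}   r(X/Z, Y/Z) = {r_ratio:.3f}")
```

The first correlation should be close to zero and the second close to 0.5, even though X and Y were generated independently; the shared divisor alone produces the apparent relationship.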
Fig. 11.2. Spurious correlation due to data clustering.
Fig. 11.4. Spurious correlation introduced by dividing 2 random variables by a common third
random variable.
Pearson (1896-1897) investigated the spurious correlation that can arise between ratios. Let
$Y = X_1/X_2$ and $Z = X_3/X_4$. The correlation between Y and Z, $r_{YZ}$, was found to be a function of
the variances, covariances, and means of the X's. Pearson's derivation assumed that the X's were
normally distributed and that the coefficient of variation of each X was small enough so that its
third and higher powers could be neglected. Reed (1921) arrived at the same results without
specifying the parent distribution of the X's. Pearson's general formula is
$$r_{YZ} = \frac{r_{13} C_1 C_3 - r_{14} C_1 C_4 - r_{23} C_2 C_3 + r_{24} C_2 C_4}{\left(C_1^2 + C_2^2 - 2 r_{12} C_1 C_2\right)^{1/2} \left(C_3^2 + C_4^2 - 2 r_{34} C_3 C_4\right)^{1/2}} \qquad (11.19)$$
where $r_{ij}$ is the correlation between $X_i$ and $X_j$, and $C_i$ is the coefficient of variation of $X_i$.
Chayes (1949) and Benson (1965) considered many special cases of equation 11.19. For
example, if $X_2 = X_4$, $r_{12} = r_{13} = r_{34} = 0$, $r_{24} = 1$, and $C_1 = C_2 = C_3 = C_4$, equation 11.19
reduces to $r_{YZ} = 0.5$, which is the case shown in figure 11.4. Benson (1965) produced a table
(Table 11.1) showing many special cases of ratio and product correlations.
Spurious correlation can arise in hydrology when dimensionless terms or standardized
variables are used. Benson (1965) presents several examples of possible spurious correlation in
hydrology.
Exercises
11.1. Calculate the first-order serial correlation coefficients for the sediment load and annual
discharge data for the Green River at Munfordville, Kentucky. Test the hypothesis that these two
correlations are equal. Discuss the assumptions you have made and how they affect the validity
of the tests you have made.
11.2. Calculate the correlation between the sediment load and annual discharge for the Green
River at Munfordville, Kentucky. Test the hypothesis that this correlation is equal to 0.50.
11.4. Calculate the first-order serial correlation coefficient for the Spray River, Banff, Canada.
Test the hypothesis that the first-order serial correlation is zero.
11.5. Work exercise 11.4 for the Piscataquis River near Dover-Foxcroft, Maine.
11.6. If the annual runoff from the Spray River, Banff, Canada, is normally distributed, how
many independent observations would provide as much information relative to estimating the
mean annual runoff as do the 45 years of actual record?
11.7. Work exercise 11.6 for the Piscataquis River near Dover-Foxcroft, Maine, and its 54 years
of record.
Table 11.1. Special cases of ratio and product correlations (Benson 1965).
11.8. The following data were collected on two streams in southeastern Kentucky. Use the data
to extend the peak flow record of Cave Branch through 1972. Estimate the average peak flow for
the entire record plus estimated record for Cave Branch. Is this estimated average an
improvement over an estimate based on the actual observed record of Cave Branch?
NOTATION
In this chapter an uppercase underlined letter will denote a matrix and a lowercase
underlined letter will denote a column vector. Thus $\underline{Z}$ could be an n × p matrix made up of p n × 1
column vectors $\underline{z}_j$ for j = 1, 2, ..., p.
PRINCIPAL COMPONENTS
Often, when data are collected on p variables, these p variables are correlated. This
correlation indicates that some of the information contained in one variable is also contained in some of
the other p - 1 variables. The objective of principal components analysis is to transform the p
original correlated variables into p uncorrelated, or orthogonal, components. These components
are linear functions of the original variables. Such a transformation can be written
$$\underline{Z} = \underline{X}\,\underline{A} \qquad (12.1)$$
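The total system variance, V, is defined as the sum of the variances of the original variables
and can be estimated as
$$V = \text{Trace } \underline{S} = \sum_{i=1}^{p} s_{i,i}$$
The jth component is
$$\underline{z}_j = \underline{X}\,\underline{a}_j \qquad (12.4)$$
where $\underline{z}_j$ is an n × 1 and $\underline{a}_j$ a p × 1 column vector. $\underline{z}_j$ can also be written $\underline{z}_j = [z_{i,j}]$ where
$$z_{i,j} = \sum_{k=1}^{p} x_{i,k} a_{k,j} \qquad (12.5)$$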
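The variance of $\underline{z}_j$ is found from
$$\text{Var}(\underline{z}_j) = \underline{a}_j' \underline{S}\,\underline{a}_j$$
The first principal component, $\underline{z}_1$, is thus defined by the vector $\underline{a}_1$ that maximizes the variance of $\underline{z}_1$,
$$\text{Var}(\underline{z}_1) = \underline{a}_1' \underline{S}\,\underline{a}_1 \qquad (12.7)$$
subject to the constraint that $\underline{a}_1' \underline{a}_1 = 1$. This is a normalizing constraint without which there would be no unique solution.
Equation 12.7 can be maximized by using the Lagrangian multiplier $\lambda_1$ to introduce the
constraint $\underline{a}_1' \underline{a}_1 = 1$. Let
$$Q = \underline{a}_1' \underline{S}\,\underline{a}_1 - \lambda_1 (\underline{a}_1' \underline{a}_1 - 1)$$
Q is maximized by differentiation.
$$\frac{\partial Q}{\partial \underline{a}_1} = 2 \underline{S}\,\underline{a}_1 - 2 \lambda_1 \underline{a}_1 = \underline{0} \quad \text{or} \quad (\underline{S} - \lambda_1 \underline{I})\,\underline{a}_1 = \underline{0} \qquad (12.8)$$
For the solution of equation 12.8 to be other than the trivial solution $\underline{a}_1 = \underline{0}$ we must have
$$|\underline{S} - \lambda_1 \underline{I}| = 0 \qquad (12.9)$$
This is a classical characteristic value problem. $\lambda_1$ is called the characteristic root and $\underline{a}_1$ the
characteristic vector of $\underline{S}$. Equation 12.9 has p solutions for $\lambda_1$. This is easily seen by considering the
special case of $\underline{S}$ to be a 2 × 2 matrix in which case equation 12.9 becomes
$$\begin{vmatrix} s_{1,1} - \lambda_1 & s_{1,2} \\ s_{2,1} & s_{2,2} - \lambda_1 \end{vmatrix} = (s_{1,1} - \lambda_1)(s_{2,2} - \lambda_1) - s_{1,2} s_{2,1} = 0$$
which is a quadratic in $\lambda_1$ and therefore has two roots.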
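Premultiplication of equation 12.8 by $\underline{a}_1'$ results in
$$\lambda_1 = \underline{a}_1' \underline{S}\,\underline{a}_1 = \text{Var}(\underline{z}_1)$$
so that the variance of $\underline{z}_1$ is maximized by taking $\lambda_1$ as the largest characteristic root of $\underline{S}$. A similar
argument for the second component, with the added requirement that $\underline{z}_2$ be uncorrelated with $\underline{z}_1$, leads to
$$(\underline{S} - \lambda_2 \underline{I})\,\underline{a}_2 = \underline{0} \qquad (12.14)$$
from which it follows that $\underline{a}_2$, the coefficients of the second largest principal component, are the
coefficients of the characteristic vector associated with the second largest characteristic root of
$\underline{S}$. Premultiplying equation 12.14 by $\underline{a}_2'$ also results in $\lambda_2 = \underline{a}_2' \underline{S}\,\underline{a}_2 = \text{Var}(\underline{z}_2)$. In general, the jth
principal component of the p-variate sample $\underline{X}$ is the linear function $\underline{z}_j = \underline{X}\,\underline{a}_j$ where $\underline{a}_j$ are the
elements of the characteristic vector associated with the jth largest characteristic root of $\underline{S}$.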
Z'Z = (XA)'(XA)
  = A'X'XA = (n  1)  A'SA.
 It can
be easily shown (see equations 12.2412.28) that A'SA   is a diagonal p X p matrix with the ith
diagonal element equal to A,. This matrix may also be written as  A'SA =  D, where D,
 is the
diagonal matrix whose diagonal elements are the characteristic roots of S.
E is an orthogonal matrix, then the trace of E'FE
One property of matrices is that if    equals
the trace of 
F. Therefore
 = Trace(AISA)
Trace (D,)   =V
= Trace(S)
However
The sum of the characteristic roots which equals the sum of the variances of the principal com
ponents also equals the total system variance.
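The equality between the sum of the characteristic roots and the total system variance can be checked by hand for a 2 × 2 covariance matrix, where the roots follow from a quadratic. The Python sketch below does this for a hypothetical matrix; the function name and the matrix entries are assumptions made for illustration.

```python
import math

def characteristic_roots_2x2(s11, s12, s22):
    """Roots of (s11 - L)(s22 - L) - s12**2 = 0 for a symmetric 2 x 2 matrix."""
    trace = s11 + s22
    determinant = s11 * s22 - s12 * s12
    discriminant = math.sqrt(trace * trace - 4.0 * determinant)
    return (trace + discriminant) / 2.0, (trace - discriminant) / 2.0

# Hypothetical covariance matrix S = [[4, 2], [2, 3]].
lam1, lam2 = characteristic_roots_2x2(4.0, 2.0, 3.0)
total_variance = 4.0 + 3.0      # V = Trace(S)

print(round(lam1, 4), round(lam2, 4), round(lam1 + lam2, 4))
print("fraction of V explained by first component:", round(lam1 / total_variance, 3))
```

The two roots sum to the trace of the matrix, and the ratio of the larger root to the trace gives the share of the total variance carried by the first component.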
1. $\underline{z}_i$ and $\underline{z}_j$ are uncorrelated for $i \ne j$
From item 4 above, it can be seen that the fraction of the total variance accounted for by the jth principal component is $\lambda_j/V$. In many situations the first q components account for a large fraction (say 90% or more) of the system variance, indicating that the last $p - q$ components are not needed in terms of explaining variance. Many times these last $p - q$ components are discarded with the effect that the problem has been reduced from one of dealing with an n × p matrix X containing correlation to dealing with an n × q (q < p) matrix Z that is orthogonal.
The question of how many components are needed to satisfactorily explain the system variance or what part of the total system variance should be explained is an unresolved one. Morrison (1967) suggests that only the first 4 or 5 components should be extracted since later components will be difficult to physically interpret in terms of the problem at hand. Unfortunately, there are no statistical tests that can be used to determine the significance of a component. The sampling theory of principal components is not well developed, especially when the components are extracted from the correlation matrix rather than the covariance matrix as in later examples.
The covariance between the original variables, X, and the principal components, Z, is given by
Example 12.1. Consider the data in table 10.2. Let X be a 13 × 3 matrix made up of 13 observations on A (mi²), S (%), and L (ft). Compute the principal components of X based on the covariance matrix. Compute the correlation between the variables and the components.
Solution: 
S is computed from equation 12.2.
which simplifies to
The solutions to this cubic equation are
Solving these three equations simultaneously for a,,,, a , , , and a,,, results in
Using the constraint that a:, + a;,, + a:,, = 1, the solution is a,,, = 030, a2,, = .999 and
= .020. Similarly, for A, and A, we get
Thus,
The values for the principal components can now be calculated from
where X is composed of deviations from the mean; that is, deviations of A, S, and L from their means.
The correlation matrix between the variables and the components can be computed from equation 12.17. For example, the correlation between $x_2$ and $z_1$ is
$$\mathrm{Cor}(x_2, z_1) = \lambda_1^{1/2}a_{2,1}/s_2 = 155.963^{1/2}(0.999)/155.769^{1/2} = 0.9995$$
Example 12.1 illustrates that using the S matrix in a principal component analysis presents some problems if the units of the X variables differ greatly. In example 12.1, the magnitudes of the observations associated with the second variable were much greater than those associated with the other two variables. Consequently, the variance of $x_2$ was much greater than either Var($x_1$) or Var($x_3$). $x_2$ accounted for 100 Var($x_2$)/Trace S or 96.4% of the system variance. This means that the first principal component is merely a restatement of $x_2$. This can also be seen from the fact that the correlation between $x_2$ and $z_1$ is 1.000.
In most hydrologic studies the problem of noncommensurate units on the X's has been handled by standardizing the X's through the transformation $(X_{i,j} - \bar{X}_j)/s_j$. The covariance matrix of the standardized variables becomes the correlation matrix S = R, as can be seen from equation 12.2. The principal components analysis is then done on R. The total system "variance" now becomes Trace R = p because R has 1's on the diagonal.
The characteristic roots and vectors are determined from
The correlation between the ith standardized variable and the jth component (equation 12.17)
reduces to
These correlations are sometimes called factor loadings. The factor loadings can be used to attach
physical significance to the components. If a particular component is highly correlated with 1,2,
or 3 variables, then the component is a reflection of these variables. For example, in a study of
watershed geomorphic factors, it might be found that a component is highly correlated with the
average stream slope and the basin relief ratio. This being the case, that particular component
might be termed a measure of watershed steepness.
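As a rough illustration of a principal components analysis on R, the sketch below extracts the characteristic roots and vectors of a 3 × 3 correlation matrix with cyclic Jacobi rotations and then forms the factor loadings as $a_{i,j}\lambda_j^{1/2}$. Several things here are assumptions rather than material from the text: the Jacobi routine is simply one convenient way to solve the characteristic value problem, the function name `jacobi_eigen` is illustrative, and the correlation magnitudes are those quoted in example 13.2 with the two correlations involving the second variable assumed to be negative.

```python
import math

def jacobi_eigen(S, sweeps=50, tol=1e-24):
    """Characteristic roots and vectors of a symmetric matrix by cyclic
    Jacobi rotations. Returns the roots (largest first) and a matrix A
    whose columns are the matching unit characteristic vectors."""
    n = len(S)
    M = [row[:] for row in S]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(M[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if M[p][q] == 0.0:
                    continue
                d = M[q][q] - M[p][p]
                if d == 0.0:
                    theta = math.copysign(math.pi / 4.0, M[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * M[p][q] / d)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # M <- M J (update columns p and q)
                    mkp, mkq = M[k][p], M[k][q]
                    M[k][p], M[k][q] = c * mkp - s * mkq, s * mkp + c * mkq
                for k in range(n):      # M <- J' M (update rows p and q)
                    mpk, mqk = M[p][k], M[q][k]
                    M[p][k], M[q][k] = c * mpk - s * mqk, s * mpk + c * mqk
                for k in range(n):      # V <- V J (accumulate vectors)
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda j: M[j][j], reverse=True)
    roots = [M[j][j] for j in order]
    A = [[V[i][j] for j in order] for i in range(n)]
    return roots, A

# Correlation matrix; the signs of the (1,2) and (2,3) terms are assumed.
R = [[1.0, -0.1713, 0.8958],
     [-0.1713, 1.0, -0.2059],
     [0.8958, -0.2059, 1.0]]

roots, A = jacobi_eigen(R)
p = len(R)
fractions = [lam / p for lam in roots]          # Trace R = p
loadings = [[A[i][j] * math.sqrt(roots[j]) for j in range(p)]
            for i in range(p)]

print("characteristic roots:", [round(lam, 4) for lam in roots])
print("percent of 'variance':", [round(100 * f, 2) for f in fractions])
for i, row in enumerate(loadings, start=1):
    print("variable", i, "loadings:", [round(v, 3) for v in row])
```

The sign of each characteristic vector, and therefore of each column of loadings, is arbitrary; only the magnitudes and the relative signs within a column carry meaning.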
 

Solution:
$\lambda_3 = 0.1035$
In this formulation, $z_1$ accounts for 100(1.9692)/3 or 65.64% of the system "variance" whereas $z_2$ and $z_3$ account for 30.91% and 3.45%, respectively.
The corresponding characteristic vectors are
Since component 1 is highly correlated with both area and length, this component might be called a "size" component. Likewise, component 2 might be called a slope component. In terms of explaining the "variance" of R, component 3 could be eliminated because it explains only 3.45% of the variance and is not correlated with any of the variables. We cannot eliminate any variables, however, because component 1 is strongly dependent on $X_1$ and $X_3$ whereas component 2 depends on $X_2$.
In terms of explaining the variance of R, we have reduced our problem from one of considering a 13 × 3 matrix X with correlations to a 13 × 2 matrix Z without correlations (assuming $z_3$ is discarded).
The values for the components are computed from
where
thus
where $Y_i$ is the ith observation on Y, $\bar{Y}$ is the mean of Y, $X_{i,j}$ is the ith observation on the jth variable, and $\bar{X}_j$ and $s_j$ are the mean and standard deviation of the jth variable. Centering Y is not necessary. It eliminates the need for an intercept and simplifies notation. The matrix of principal components, Z, is determined from Z = XA with A being a p × p matrix whose jth column is $a_j$, the characteristic vector computed from equation 12.18 with $R = X'X/(n - 1)$.
The regression model is
where Y is an n × 1 vector whose elements are the n observations of the centered dependent variable, and Z is an n × p matrix whose elements, $Z_{ij}$, represent the ith value of the jth principal component.
$\beta$ is estimated from equation 10.8 as
where $z_j$ is an n × 1 vector whose elements are the n values of the jth principal component.
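so that
$$Z'Z = [z_i'z_j]$$
From equation 12.4 we have
$(Z'Z)^{-1}$ is therefore
$$\mathrm{Cov}(\hat\beta_i, \hat\beta_j) = 0 \quad \text{for } i \ne j$$
$$\mathrm{Var}(\hat\beta_j) = \frac{\sigma^2}{(n - 1)\lambda_j} \quad \text{for } i = j$$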
Thus $\hat\beta_i$ is independent of $\hat\beta_j$ for $i \ne j$. The independence of the $\hat\beta$'s is a result of the orthogonality of the principal components. Since the $\hat\beta$'s are independent, the t-test given by equation 10.17 can be repeatedly applied to test hypotheses on the $\beta$'s from a single regression equation. Furthermore, the numerical value for the $\hat\beta$'s retained in the regression will not be altered by eliminating any number of the other $\hat\beta$'s. This is the distinct advantage of having an orthogonal matrix of independent variables.
A second advantage of having independent $\hat\beta$'s is that the interpretation of the $\hat\beta$'s in terms of the independent variables is greatly simplified. Thus, if some hydrologic meaning can be attached to a component through an examination of the factor loadings, hydrologic significance can also be attached to the $\hat\beta$'s. Unfortunately, in most hydrologic applications of principal components analysis, a clear and distinct interpretation of the principal components has not been possible. This, in turn, means the hydrologic significance of the $\hat\beta$'s is unclear as well.
Some authors (DeCoursey and Deal 1974) state that yet another advantage for using regression on principal components as compared to normal multiple regression is that the resulting regression coefficients are more stable when applied to a new set of data because the coefficients are fitted on the basis of only statistically significant orthogonal components. This could imply that using an equation based on regression on principal components for prediction on a sample not included in the equation development would have a smaller standard error on this sample than would a normal multiple regression equation. If this is the case, it would be an important advantage for the regression on principal components technique. An adequate demonstration of this hypothesis needs to be developed, however.
A disadvantage of using principal components in a regression analysis is that even if all but one of the components are eliminated, all of the original variables (the X's) must still be measured because each component is a function of all of the X's (equation 12.4).
In reporting the results of a regression on principal components, it is generally desirable to
transform the resulting regression equation into an equation in terms of the original X variables.
This can be done since $y_i = Y_i - \bar{Y}$, the $\hat\beta$'s are known constants, $z_{i,j} = \sum_{k=1}^{p} a_{k,j}x_{i,k}$, and $x_{i,j} = (X_{i,j} - \bar{X}_j)/s_j$. Thus equation 12.22 becomes
where the $\beta^*$'s are constants. If only q (q < p) components are retained in the final regression equation, and the components are rearranged so that the first q components are retained, the first summation in equation 12.33 would run from 1 to q; however, the second summation would still run from 1 to p. This means the summation in equation 12.34 would run from 1 to p. It also means that even though the equation contains only q components, all p of the original variables must be measured to predict Y.
Some of the original X variables can be eliminated from the analysis before any regressions are performed by examining the factor loadings and eliminating variables that are not highly correlated with any of the components. The remaining X variables are then resubmitted to a principal components analysis with the multiple regression being performed on the new components. This procedure has the advantage of reducing the number of variables that must be measured to use the resulting regression equation. It has the disadvantage of eliminating X variables rather arbitrarily (there is no statistical test for the significance of the factor loadings) without ever having them in a position to determine their usefulness in explaining the variation in the dependent variable, Y.
In many applications of regression on principal components, the last $p - q$ components are discarded before the regression is performed. The number of retained components, q, is selected so that a large proportion of the variance of X is accounted for. This procedure reduces the number of coefficients that must be estimated but runs the risk of eliminating a component that may explain a significant amount of the variation in Y even though it explains little of the variance of X.
Equation 12.31 gives $\hat\beta_j = a_j'X'Y/[(n - 1)\lambda_j]$, whereas equation 12.32 gives $\mathrm{Var}(\hat\beta_j) = \sigma^2/[(n - 1)\lambda_j]$. The statistical significance of $\hat\beta_j$ is tested using equation 10.17 with $\beta_0 = 0$. Thus the test statistic is
There is no reason to believe before the regression is performed that this test statistic will be nonsignificant for small values of $\lambda_j$ (i.e., for the last $p - q$ components). Therefore, the regression should be performed on all of the components and then the components that prove to be nonsignificant can be eliminated.
The value of the test statistic given by equation 12.35 can be shown to be proportional to the correlation between Y and $z_j$ as follows:
Therefore
or the significance of the jth component is directly proportional to its correlation with the dependent variable. Equation 12.38 can be used to test the significance of the jth component.
At this point, it should be noted that if a dependent variable Y is regressed on p principal components extracted from a p × p correlation matrix and then transformed via equation 12.33, the results are identical to those that would be obtained by a direct regression of Y on the original p variables. This is because multiple regression is a linear operation and the principal components are independent linear functions of the original variables that explain all of the variance of the variables.
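The equivalence between regression on all p components and direct regression can be checked on a small case. The sketch below uses p = 2, for which the characteristic roots of R are $1 + r$ and $1 - r$ with vectors $(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$. The data values are hypothetical and the variable names are illustrative; none of them come from the text.

```python
import math
import statistics

def standardize(v):
    """Return (v - mean) / s using the sample standard deviation."""
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(x - m) / s for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical observations (not data from the text)
X1 = [2.0, 3.1, 4.5, 5.2, 6.8, 7.1, 8.4, 9.9]
X2 = [1.5, 2.9, 3.2, 5.8, 5.1, 7.7, 7.2, 9.0]
Y = [3.2, 4.1, 5.9, 7.4, 8.0, 9.9, 10.1, 12.8]

n = len(Y)
x1, x2 = standardize(X1), standardize(X2)
y_mean = statistics.mean(Y)
y = [v - y_mean for v in Y]                      # centered dependent variable

r = dot(x1, x2) / (n - 1)                        # correlation of X1 and X2
lam = (1.0 + r, 1.0 - r)                         # characteristic roots of R
c = 1.0 / math.sqrt(2.0)
A = [[c, c], [c, -c]]                            # columns are a_1 and a_2

# Principal components Z = XA
z1 = [c * (a + b) for a, b in zip(x1, x2)]
z2 = [c * (a - b) for a, b in zip(x1, x2)]

# Because Z'Z is diagonal, each coefficient is estimated on its own
beta = [dot(z1, y) / dot(z1, z1), dot(z2, y) / dot(z2, z2)]

# Coefficients on the standardized X's recovered from the components
b_from_pc = [A[0][0] * beta[0] + A[0][1] * beta[1],
             A[1][0] * beta[0] + A[1][1] * beta[1]]

# Direct least squares on the standardized X's (2 x 2 normal equations)
g1, g2 = dot(x1, y) / (n - 1), dot(x2, y) / (n - 1)
b_direct = [(g1 - r * g2) / (1.0 - r * r), (g2 - r * g1) / (1.0 - r * r)]

print("z1'z2 =", dot(z1, z2))
print("from components:", b_from_pc)
print("direct regression:", b_direct)
```

Dropping the second component leaves `beta[0]` unchanged, which is the practical consequence of the orthogonality of Z, but the back-transformed equation would then no longer match the direct regression.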
MULTIVARIATE MULTIPLE REGRESSION
Occasionally, it is desirable to predict several dependent variables from the same set of independent variables. Such a situation might be predicting the mean annual flood, 10-year peak flow, and 25-year peak flow for a setting where it is desirable to maintain the correlation among the dependent variables. This can be accomplished using a multivariate extension of multiple regression. The prediction model would be
where $\hat\beta$ and Y are partitioned into q p × 1 vectors. Furthermore
demonstrating that the solution to equations 12.40 is equivalent to q multiple regressions each involving the same X but a different vector of dependent variables. Tests of hypotheses concerning $\beta_j$ can be made using the procedures set forth in chapter 10.
In multivariate regression as in multiple regression, one commonly has a large number of independent variables, not all of which are important in predicting the q dependent variables. If q separate multiple regressions are performed and independent variables eliminated using the procedures of chapter 10, it would be unlikely that the resulting equations would contain the same set of independent variables.
If the multivariate regression model is used, all q of the prediction equations will contain the same set of independent variables. Press (1972) presents a procedure for testing the hypothesis that $\beta_i = \beta_i^*$ where $\beta_i$ is a 1 × q vector made up of the coefficients associated with the ith independent variable for each of the q dependent variables and $\beta_i^*$ is a 1 × q vector of constants. To test that the ith independent variable was not significant would be equivalent to the test that $\beta_i = 0$. Thus, a procedure is available for eliminating variables from the regression to produce a useable model.
One distinct advantage in using the same independent variables for estimating several dependent variables is that the correlation structure of the dependent variables is preserved. DeCoursey (1973) used such an approach to derive prediction equations for the 2-, 5-, 10-, and 25-year peak flows on watersheds in Oklahoma. In situations like this it is highly desirable to retain the observed correlations among the dependent variables in the resulting prediction equations. In the case of flood flows, if this is not done it might be possible to have equations that are inconsistent and predict, say, a 10-year peak to be greater than the 25-year peak flow.
Another place where retention of the correlation structure among a set of dependent variables is important is in estimating the parameters descriptive of runoff hydrographs. Rice (1967) discusses this application of multivariate, multiple regression in simultaneously estimating the runoff volume, peak discharge, and a base time parameter for runoff hydrographs based on data presented by Reich (1962). Rice states that even though three separate regressions produce slightly better fits to the original pool of data, the multivariate solution might be more effective in predicting hydrographs for storms on watersheds not included in the original data sample.
CANONICAL CORRELATION
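Canonical correlation examines the relationship between two sets of variables. Consider the n × p matrix X with covariance matrix $\Sigma$. Partition X and $\Sigma$ so that
$$X = [Y, Z]$$
where Y is $n \times p_1$ and Z is $n \times p_2$ and
$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$
where $\Sigma$ is $p \times p$, $\Sigma_{11}$ is $p_1 \times p_1$, $\Sigma_{12}$ is $p_1 \times p_2$, $\Sigma_{21}$ is $p_2 \times p_1$, and $\Sigma_{22}$ is $p_2 \times p_2$ with $p_1 + p_2 = p$ and $p_1 \le p_2$. In this formulation $\Sigma_{11} = \mathrm{Var}(Y)$, $\Sigma_{22} = \mathrm{Var}(Z)$, and $\Sigma_{12} = \Sigma_{21}' = \mathrm{Cov}(Y, Z)$.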
Canonical correlation investigates the correlation between Y and Z. Linear functions of Y and Z are formed and then the correlation between these linear functions determined. Define
The variances of $U_i$ and $V_i$ are
$\mathrm{Var}(U_i) = \mathrm{Var}(a_i'Y') = a_i'\Sigma_{11}a_i$ and
Therefore
Unfortunately, r is not symmetric, so the determination of the resulting $p_2$ values of the $\lambda$'s may require special computing techniques. $\lambda_i$ is numerically equal to the square of the correlation between $U_i$ and $V_i$. For convenience the $\lambda_i$ are arranged so that $\lambda_1 > \lambda_2 > \cdots > \lambda_{p_2}$.
The hope is that $\lambda_1$ is sufficiently large that other $\lambda$'s can be dropped so that attention can be focused on $U_1$ and $V_1$, which are vectors, rather than Y and Z, which are matrices. Of course, there will be a $U_i$ and a $V_i$ associated with each of the $\lambda_i$'s. If $\lambda_2$ is sufficiently large, two vectors on U and two vectors on V may have to be considered.
The vector used to transform Z to V is found by determining the eigenvectors of
CLUSTER ANALYSIS
The main objective of a regional flood frequency analysis is to develop regional regression models which can be used to estimate flow characteristics at ungaged stream sites. Hydrologic data from several gaging stations in hydrologically homogeneous regions are collected and analyzed to obtain estimates of the regression parameters. Identification of these hydrologically homogeneous regions is a vital component in any regional frequency analysis. One method used to identify these regions is a multivariate statistical procedure known as cluster analysis.
Cluster analysis is a method used to group objects with similar characteristics. Two clustering
methods are used for this purpose. The first type of procedures is known as hierarchical methods, and
they attempt to group objects by a series of successive mergers. The most similar objects are first
grouped and as the similarity decreases, all subgroups are progressively merged into a single cluster.
The second type of procedures is collectively referred to as nonhierarchical clustering techniques
and, if required, can be used to group objects into a specified number of clusters. The clustering
process starts from an initial set of seed points, which will form the nuclei of the final clusters.
The most commonly used similarity measure in cluster analysis is the Euclidean distance, defined by:
$$D_{ij} = \left[\sum_{k=1}^{p}(z_{ik} - z_{jk})^2\right]^{1/2} \qquad (12.47)$$
where $D_{ij}$ is the Euclidean distance from site i to site j, p is the number of variables included in the computation of the distance (i.e., the basin and climatic variables) and $z_{ik}$ is a standardized value for variable k at site i.
In many applications the variables describing the objects to be clustered (discharges, watershed areas, stream lengths, etc.) will not be measured in the same units. It is reasonable to assume that it would not be sensible to treat, say, discharge measured in cubic meters per second, area in square kilometers, and stream length in kilometers as equivalent in determining a measure of similarity. The solution suggested most often is to standardize each variable to unit variance prior to analysis. This is done by dividing the variables by the standard deviations calculated from the complete set of objects to be clustered. The standardization process eliminates the units from each variable and reduces any differences in the range of values among the variables.
To get a feel for how cluster analysis works, consider six precipitation stations and their
associated annual precipitation in mm:
station 1 2 3 4 5 6
precipitation 1000 1200 600 700 500 1100
It is desired to see if these stations can be grouped into homogeneous groups based on the average annual precipitation.
The first thing that is done is to standardize the precipitation values. For this set of data, the mean is 850 and the standard deviation is 288. Table 12.1 contains the data and results. Equation 12.47 is used to calculate $D_{ij}$. For example, $D_{1,2}$ is $\sqrt{(0.52 - 1.21)^2}$ which equals $|0.52 - 1.21|$, or 0.69. The results for all of the $D_{ij}$ are shown in Section A of table 12.1.
The next step is to find the minimum value of the similarity measure, $D_{ij}$. This value is seen to be 0.35. The value 0.35 appears several times. The pair (3, 4) was arbitrarily chosen as the first similar pair. Section B of table 12.1 contains the $D_{ij}$ values from Section A except for the (3, 4) row. This row contains the minimum of $D_{3,j}$ and $D_{4,j}$ for j = 1, 2, 5, and 6. For example, $D_{3,1}$ is 1.39 and $D_{4,1}$ is 1.04. Therefore, the (3, 4), 1 entry in Section B is 1.04. Other values in the (3, 4) row are similarly determined.
Again, the minimum entry in Section B is found to be 0.35 corresponding to the (1,6) pair. Thus
(1,6) is clustered as in Section C and entries for Section C are determined from Section B in the same
manner as entries in Section B were determined from Section A. The next step results in (1,6) and 2
being clustered to form (1, 2, 6). This is followed by (3, 4) being clustered with 5 to form (3,4, 5).
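The merging steps just described can be sketched in a few lines of Python. This is an illustrative reading of the procedure, assuming that the distance between two groups is the minimum $D_{ij}$ over their members (as in the construction of Section B) and that ties are broken by whichever pair is encountered first; the function names are not from the text.

```python
import math
import statistics

def standardize(values):
    """Standardize with the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def distance(zi, zj):
    """Euclidean distance between two sites (equation 12.47)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(zi, zj)))

def cluster(points, n_clusters):
    """Merge the closest groups until n_clusters remain.
    Group-to-group distance is the minimum distance between members."""
    clusters = [[i + 1] for i in range(len(points))]   # stations numbered from 1
    history = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(points[i - 1], points[j - 1])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        history.append((round(d, 2), merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters), history

precip = [1000, 1200, 600, 700, 500, 1100]       # annual precipitation, mm
z = [(v,) for v in standardize(precip)]          # one variable per station
d12 = distance(z[0], z[1])
final_clusters, history = cluster(z, 2)

# Third station changed to 1800 mm, as in table 12.3
z_third = [(v,) for v in standardize([1000, 1200, 1800, 700, 500, 1100])]
third_clusters, third_history = cluster(z_third, 2)

print(round(d12, 2), final_clusters, third_clusters)
```

Because several pairs tie at 0.35, the order of the early merges here may differ from the order chosen in the text, but the two final groups are the same.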
Table 12.2 is similar to table 12.1 except that the value of precipitation for the third station is changed from 600 to 1050 mm. Carrying through the analysis as was done for table 12.1 results in forming the clusters (4, 5) and (1, 2, 3, 6).
In table 12.3, the third station value is changed to 1800 mm. The cluster results are (1, 2, 4, 5, 6) and 3. In all of these analyses, the $D_{ij}$ entry is a measure of the similarity that exists. For example, in table 12.3, the $D_{ij}$ values of 0.22 indicate strong similarity. The value of 0.44 shows that stations 4 and 5 are not as similar as are stations 1, 2, and 6. The value 0.67 shows that the clusters (4, 5) and (1, 2, 6) are less similar than either 4 and 5 or 1, 2, and 6. Finally, the value 1.33 shows that 3 is not very similar to the cluster (1, 2, 4, 5, 6).
Table 12.1. First cluster analysis of rainfall data
Table 12.3. Third cluster analysis of rainfall data
Clustering may stop when there is a significant jump in the similarity measure. In table 12.3 one might conclude with three clusters, (1, 2, 6), (4, 5), and (3), or with two clusters, (1, 2, 4, 5, 6) and 3.
Table 12.4 extends the analysis to consideration of two measures of the stations being considered, precipitation and potential evapotranspiration. Again, Section A was constructed from equation 12.47. For example, the $D_{1,2}$ entry is calculated from standardized values as
D I 2 = d (  0 . 1 1  0.33)~+ (1.21  1.21)~or 2.47. The analysis is completed based on
Section A in the same manner as for tables 12.1-12.3. Here a satisfactory clustering doesn't appear to exist. It looks as though 2 and 6 might be clustered but possibly the other stations cannot be clustered.
Table 12.5 is based on the ratio of precipitation to potential evapotranspiration. Using this system measure, 2, 4, and 6 certainly form a cluster. Depending on the purpose of the analysis, one might conclude that (1, 3) and (2, 4, 5, 6) represent the final clustering.
Exercises
12.1. Calculate the correlation matrix for the first two variables contained in the tables of exercise 10.8.
12.2. Calculate the characteristic values and characteristic vectors associated with the correlation matrix of exercise 12.1.
12.3. Compute the numerical values of the principal components of the data in the first two
columns of the table in exercise 10.8 (based on the correlation matrix).
12.4. (a) Work exercise 12.1 using the first three variables. (b) Work exercise 12.2 based on the
first three variables. (c) Work exercise 12.3 based on the first three variables.
12.5. (a) Work exercises 12.1, 12.2, and 12.3 based on the covariance matrix. (b) Work exercise
12.4 based on the covariance matrix.
12.6. Work exercise 12.4 using all of the variables in the table of exercise 10.8 except Q,. (Note:
Don't try this without a computer; life is too short!)
12.7. Calculate the factor loadings for the data of (a) exercise 12.2, (b) exercise 12.4, or (c) exercise 12.5.
Table 12.5. Cluster analysis of rainfall-evaporation data (columns: Ratio, z)
13. Data Generation
Chapter 15 discusses several stochastic models that have been found useful in hydrology. Stochastic models contain random components. These random components contain random elements. If a stochastic model is to be used to generate hydrologic data, methods must be available for generating the random elements of the models.
A random element is usually thought of as an element selected in a fashion such that each element in the population has an equal chance of being selected. If the sample results from choosing a number at random from a population of numbers in such a fashion that each number in the population has an equal chance of being selected, the procedure is equivalent to sampling from a uniform distribution. More generally, a random element can be selected from any probability distribution as long as the elements are independent of each other. This chapter first sets forth techniques for generating random samples from probability distributions. Next, a method for generating a multivariate random sample that preserves the correlations between the variates is presented. Finally, several possible areas of application for data generation methods are discussed.
In any application of data generation methods, it must be kept firmly in mind that data generation cannot improve or overcome faulty data. At best, one can generate a set of data having statistical properties equal to the properties of the sample used in estimating the population parameters. In addition to this, data generated stochastically is subject to the same sampling errors as natural data. As a matter of fact, data generation has been widely used to study sampling errors, an application discussed later in the chapter.
1. Select a random number $R_u$ from a uniform distribution in the interval (0, 1).
2. Set $P_Y(y) = R_u$.
3. Solve for y.
Step 3 in this procedure is known as obtaining the inverse transform of the probability
distribution.
Fig. 13.1. Procedure for generating a random observation from a probability distribution.
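The procedure can be sketched in Python for any distribution whose cumulative distribution can be inverted analytically. The example below uses a 3-parameter Weibull written with shape `alpha`, scale `beta`, and location `eps`; this parameterization is an assumption made for illustration and may not match the form used in this chapter, and the function names are likewise illustrative.

```python
import math
import random

def weibull_cdf(y, alpha, beta, eps):
    """Assumed 3-parameter Weibull cdf: shape alpha, scale beta, location eps."""
    if y <= eps:
        return 0.0
    return 1.0 - math.exp(-((y - eps) / beta) ** alpha)

def weibull_inverse(r_u, alpha, beta, eps):
    """Inverse transform: the y that satisfies P_Y(y) = R_u."""
    return eps + beta * (-math.log(1.0 - r_u)) ** (1.0 / alpha)

def generate(n, alpha, beta, eps, seed=1):
    """Steps 1 to 3 repeated n times."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - R_u is never zero inside the log
    return [weibull_inverse(rng.random(), alpha, beta, eps) for _ in range(n)]

# Hypothetical parameter values
alpha, beta, eps = 2.0, 10.0, 5.0
sample = generate(1000, alpha, beta, eps)

# Round-trip check: the cdf of the generated value returns the uniform number
round_trip = [weibull_cdf(weibull_inverse(u, alpha, beta, eps), alpha, beta, eps)
              for u in (0.1, 0.5, 0.9)]
print(min(sample), max(sample), round_trip)
```

Swapping in another distribution only requires replacing the inverse function, which is why the method is attractive whenever step 3 has a closed form.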
As an example, consider the Weibull distribution with
and
By substituting $R_u$ for $P_Y(y)$, random values of Y from the 3-parameter Weibull distribution can be generated from
For some distributions it is not possible to solve equation 13.1 explicitly for y. That is, an
analytic inverse transform cannot be found. The normal and gamma distributions are examples
of this. Fortunately, in the case of the normal distribution, numerically generated tables of
standard random normal deviates are widely available. A standard random normal deviate is a
random observation from a standard normal distribution. Random observations for any normal
distribution can be generated from the relationship
$$y = \mu + \sigma R_N$$
where $R_N$ is a standard random normal deviate and $\mu$ and $\sigma$ are the parameters of the desired normal distribution of Y. Computer routines are available for generating standard random normal deviates.
For some distributions, relationships with other distributions can be used in the generating process. For example, a gamma variate with integer values for $\eta$ has been shown to be the sum of $\eta$ exponential variates each with parameter $\lambda$. Therefore, gamma variates with integer values for $\eta$ can be generated by summing $\eta$ values generated from an exponential distribution.
Whittaker (1973) discusses a method for generating random gamma variates with any shape parameter $\eta$. Because the gamma distribution is closed under addition, a gamma random variable with any shape parameter can be constructed if one with a shape parameter in the interval $0 < \eta < 1$ can be constructed. Let $R_{u1}$, $R_{u2}$, and $R_{u3}$ be independent uniform random variables on (0, 1). Define $S_1$ and $S_2$ by
Then Y has a gamma distribution with shape parameter $\eta$ and scale parameter $\lambda$.
This procedure requires the generation of at least 3 uniform random variables. If $S_1 + S_2 > 1$, then $R_{u1}$ and $R_{u2}$ are rejected and new values generated. The probability that $S_1 + S_2 \le 1$ is given as $\pi\eta(1 - \eta)\,\mathrm{cosec}(\pi\eta)$ and has a minimum of $\pi/4$ at $\eta = 1/2$ and is symmetric about this value.
To generate a gamma variate with $\eta > 1$, a gamma variate with an integer shape parameter and a gamma variate with a shape parameter < 1 can be added as long as the scale parameter, $\lambda$, is held constant. For example, to generate a gamma random variate with $\eta = 3.6$ and any $\lambda$, a gamma variate with $\eta = 3$ and $\lambda$ can be added to a gamma variate with $\eta = 0.6$ and $\lambda$.
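A possible implementation of this decomposition is sketched below. The fractional part uses the commonly cited rejection form $S_1 = R_{u1}^{1/\eta}$, $S_2 = R_{u2}^{1/(1-\eta)}$, $Y = -[S_1/(S_1 + S_2)]\ln(R_{u3})/\lambda$; this form is assumed here and should be checked against Whittaker (1973) before use. The parameter $\lambda$ is treated as in the exponential relationship $y = -\ln(R_u)/\lambda$, so the mean of the generated variate is $\eta/\lambda$. Function names are illustrative.

```python
import math
import random

def acceptance_probability(eta):
    """Probability that S1 + S2 <= 1: pi * eta * (1 - eta) * cosec(pi * eta)."""
    return math.pi * eta * (1.0 - eta) / math.sin(math.pi * eta)

def gamma_fractional(eta, lam, rng):
    """Gamma variate with 0 < eta < 1 (assumed rejection form, see lead-in)."""
    while True:
        # 1 - rng.random() lies in (0, 1], which keeps the logs finite
        s1 = (1.0 - rng.random()) ** (1.0 / eta)
        s2 = (1.0 - rng.random()) ** (1.0 / (1.0 - eta))
        if 0.0 < s1 + s2 <= 1.0:
            break                      # otherwise reject and draw again
    r_u3 = 1.0 - rng.random()
    return -(s1 / (s1 + s2)) * math.log(r_u3) / lam

def gamma_variate(eta, lam, rng):
    """Sum of an integer-shape part (exponentials) and a fractional-shape part."""
    whole = int(math.floor(eta))
    frac = eta - whole
    y = sum(-math.log(1.0 - rng.random()) / lam for _ in range(whole))
    if frac > 0.0:
        y += gamma_fractional(frac, lam, rng)
    return y

rng = random.Random(2024)
sample = [gamma_variate(3.6, 2.0, rng) for _ in range(20000)]
sample_mean = sum(sample) / len(sample)
print("sample mean:", sample_mean, "expected:", 3.6 / 2.0)
print("acceptance probability at eta = 0.5:", acceptance_probability(0.5))
```

Since the acceptance probability never falls below $\pi/4$, on average fewer than 1.3 pairs of uniform numbers are needed per fractional variate.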
Table 13.1 presents a summary of some analytical methods for generating observations from
selected common probability distributions. The table is modified from Hahn and Shapiro (1967).
Computer routines are available for generating random numbers from many different probability
distributions.
Where analytical inverse transforms cannot be found, numerical procedures can be employed. One numerical method is to select a random number between 0 and 1 and then numerically integrate equation 13.1 along the x-axis until the accumulated integral equals the selected random number. At this point y would be equal to the value of x that had been reached.
A second numerical method, and one that would be faster if a large number of random observations were needed, would be to numerically integrate equation 13.1 starting at the extreme left of the distribution. The integration would proceed to the right in small increments along the x-axis until the accumulated integral was sufficiently close to 1. At each step of the integration, the value of x and the accumulated integral would be saved in the form of a table. The generation process would then consist of selecting a random number in the interval (0, 1), entering the table with this random number considered as an accumulated integral, and finding the corresponding value of x. This value of x would then be set equal to the desired random variate y.
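The table-based method can be sketched as follows. The trapezoidal rule, the linear interpolation between table entries, the rescaling of the table so that its last entry is exactly 1, and the integration limits of ±6 for the standard normal density are all choices made for this illustration rather than details given in the text.

```python
import bisect
import math
import random

def build_table(pdf, x_min, x_max, steps):
    """Integrate the pdf from the extreme left in small increments and save
    each x together with the accumulated integral."""
    dx = (x_max - x_min) / steps
    xs, cum = [x_min], [0.0]
    for i in range(1, steps + 1):
        x0 = x_min + (i - 1) * dx
        x1 = x_min + i * dx
        cum.append(cum[-1] + 0.5 * (pdf(x0) + pdf(x1)) * dx)   # trapezoid
        xs.append(x1)
    total = cum[-1]
    return xs, [c / total for c in cum]      # rescale so the table ends at 1

def lookup(xs, cum, r_u):
    """Enter the table with r_u as an accumulated integral and return x."""
    i = bisect.bisect_left(cum, r_u)
    if i == 0:
        return xs[0]
    if i >= len(xs):
        return xs[-1]
    c0, c1 = cum[i - 1], cum[i]
    if c1 == c0:
        return xs[i]
    return xs[i - 1] + (r_u - c0) / (c1 - c0) * (xs[i] - xs[i - 1])

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

xs, cum = build_table(normal_pdf, -6.0, 6.0, 2400)
median = lookup(xs, cum, 0.5)
upper = lookup(xs, cum, 0.975)

rng = random.Random(7)
deviates = [lookup(xs, cum, rng.random()) for _ in range(5000)]
print(median, upper, sum(deviates) / len(deviates))
```

The table is built once, so each additional random variate costs only a binary search and one interpolation.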
Example 13.1. Generate 22 observations from an exponential distribution with $\lambda = 2$. Plot the observations on semilogarithmic (probability) paper. Estimate $\lambda$ from the observations.
Solution: The 22 observations are generated from the relationship $y = -\ln(R_u)/\lambda$ where $R_u$ is a randomly selected number in the interval 0 to 1. The values of Y so generated are shown below. Figure 13.2 is a plot of the resulting numbers along with the lines describing the exponential distribution with parameter $\lambda = 2$ and with parameter $\hat\lambda = 1.718$ calculated as $\hat\lambda = 1/\bar{y}$.
Comment: This problem illustrates the random variations possible when sampling from a probability distribution. As the sample size increases, $\hat\lambda$ should approach $\lambda$ and the plotted points will lie more nearly on the line describing the exponential distribution with $\lambda = 2$.
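The effect of sample size noted in the comment can be seen by repeating the generation with larger samples. The sketch below is illustrative only: the seed and the sample sizes other than 22 are arbitrary, and the generated values will not reproduce the 22 observations of the example.

```python
import math
import random

def exponential_sample(n, lam, rng):
    """Generate n observations from y = -ln(R_u) / lam.
    1 - rng.random() is used for R_u; it is also uniform and is never zero."""
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

def estimate_lambda(sample):
    """Estimate lam as the reciprocal of the sample mean."""
    return len(sample) / sum(sample)

rng = random.Random(13)
lam = 2.0
estimates = {}
for n in (22, 2000, 100000):
    estimates[n] = estimate_lambda(exponential_sample(n, lam, rng))

for n, lam_hat in estimates.items():
    print(n, round(lam_hat, 3))
```

With only 22 observations an estimate as far from 2 as the 1.718 found in the example is not unusual, since the standard error of the estimate is roughly $\lambda/\sqrt{n}$.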
In hydrologic frequency analysis, the data represent a sample from an unknown population.
Thus, uncertainty as to the proper frequency distribution exists as well as uncertainty in the
values for population parameters.
Sum = 12.806
$$x = zA'$$
where z is a 1 × p vector consisting of a single value for each of the p uncorrelated components. The mean and variance of the jth principal component are 0 and $\lambda_j$, respectively. This equation can be used to generate standardized normally distributed random variables that preserve their correlation. The components are uncorrelated so that a random value of z is generated as $z = (z_1, z_2, \ldots, z_p)$ where $z_j$ is a random observation from a normal distribution with mean zero and variance $\lambda_j$. Postmultiplying by $A'$ then produces x. n values for x can be generated by repeating this process n times. Then
1. Assume the data are multivariate normal and generate a sample of the required size of correlated, multivariate N(0, 1) observations, $x_{ij}$, having the desired correlation structure, $R_1$.
2. Compute the cumulative standard normal probability, $P_x(x_{ij})$, of each $x_{ij}$.
3. Transform the $P_x(x_{ij})$ to $y_{ij}$ by substituting $P_x(x_{ij})$ into the desired inverse cumulative distribution for the corresponding $y_{ij}$ to find vectors having the correct pdfs. Note that this last transformation guarantees that the y's will have the correct pdfs. Because all of the transformations are nonlinear, the correlation matrix, $R_2$, for the generated y's will only approximately preserve the original correlation matrix. In other words, $R_2$ will not be exactly equal to $R_1$.
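The three steps can be sketched for two variables, where the characteristic roots of the correlation matrix are $1 + \rho$ and $1 - \rho$ with vectors $(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$. In the sketch the first variable is kept normal and the second is made exponential. The value of $\rho$, the sample size, and the function names are hypothetical; the mean, standard deviation, and exponential mean are borrowed from the properties listed in example 13.2 purely for scale.

```python
import math
import random

def correlated_normals(n, rho, rng):
    """Step 1: x = zA' with z_j ~ N(0, lambda_j), for a 2 x 2 correlation matrix."""
    lam = (1.0 + rho, 1.0 - rho)
    c = 1.0 / math.sqrt(2.0)
    A = ((c, c), (c, -c))                       # columns are a_1 and a_2
    out = []
    for _ in range(n):
        z = (rng.gauss(0.0, math.sqrt(lam[0])), rng.gauss(0.0, math.sqrt(lam[1])))
        out.append(tuple(z[0] * A[i][0] + z[1] * A[i][1] for i in range(2)))
    return out

def normal_cdf(x):
    """Step 2: cumulative standard normal probability."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def correlation(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

rng = random.Random(99)
rho = 0.6                                        # hypothetical target correlation
x = correlated_normals(20000, rho, rng)

mu1, sigma1 = 3.173, 2.113                       # normal marginal
exp_mean = 16.462                                # exponential marginal, lam = 1/mean
y1, y2 = [], []
for x1, x2 in x:
    y1.append(mu1 + sigma1 * x1)                 # normal: linear transformation
    p2 = min(max(normal_cdf(x2), 1e-12), 1.0 - 1e-12)
    y2.append(-exp_mean * math.log(1.0 - p2))    # step 3: inverse exponential cdf

r_x = correlation([a for a, _ in x], [b for _, b in x])
r_y = correlation(y1, y2)
print("R1 correlation:", round(r_x, 3), "R2 correlation:", round(r_y, 3))
```

Comparing `r_x` with `r_y` shows how much of the target correlation survives the nonlinear transformation of the second variable.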
Example 13.2. Generate a sample of 20 observations from the 3-variate normal distribution having the properties $\mu_1 = 3.173$, $\mu_2 = 16.462$, $\mu_3 = 2.566$, $\sigma_1 = 2.113$, $\sigma_2 = 12.481$, $\sigma_3 = 1.150$, $\rho_{1,2} = 0.1713$, $\rho_{1,3} = 0.8958$ and $\rho_{2,3} = 0.2059$.
Solution: This correlation structure corresponds to the correlation matrix in example 12.2. The procedure is to first generate 20 observations from a 3-variate standard normal distribution having the desired correlation structure by 20 applications of equation 13.8. The matrix A is contained in example 12.2. A 1 × 3 vector z is generated as $(z_1, z_2, z_3)$ where $z_i$ is a random observation from a normal distribution with mean 0 and variance $\lambda_i$. The $\lambda_i$ are obtained from example 12.2 as 1.9692, 0.9273, and 0.1035. The 1 × 3 vector $x = (x_1, x_2, x_3)$ is then computed from equation 13.8. Finally, a 1 × 3 vector $y = (y_1, y_2, y_3)$ is computed as $y_i = x_i\sigma_i + \mu_i$. This process is repeated 20 times, generating the required 20 values for y. The following matrix contains the resulting 20 observations on Y.
DATA GENERATION 329
Sample size   Mean 1  Mean 2  Mean 3    SD 1    SD 2    SD 3  r(1,2)  r(1,3)  r(2,3)
Population      3.17   16.46    2.57    2.11   12.48    1.15    0.17    0.90    0.21
20              3.79   12.83    2.90    1.52   13.14    0.94    0.13    0.89    0.03
200             3.12   17.10    2.58    2.22   11.39    1.20    0.13    0.91    0.15
999             3.13   15.48    2.54    2.08   12.38    1.12    0.20    0.90    0.23
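The generation scheme of example 13.2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the program used for the example: it assumes NumPy is available, obtains the eigenvalues and eigenvectors from `numpy.linalg.eigh` rather than from example 12.2, and uses the correlations exactly as printed above (a sign lost in reproduction would change R but not the method). The function name `generate_correlated_normal` and the seed are illustrative choices.

```python
import numpy as np

def generate_correlated_normal(mu, sigma, R, n, rng):
    """Generate n observations with means mu, standard deviations sigma,
    and population correlation matrix R, using x = z A'."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Eigenvalues (lambda) and eigenvectors (columns of A) of R.
    lam, A = np.linalg.eigh(np.asarray(R, dtype=float))
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative round-off
    # Each row of z holds independent N(0, lambda_k) values.
    z = rng.standard_normal((n, mu.size)) * np.sqrt(lam)
    x = z @ A.T                    # standardized, correlated values
    return x * sigma + mu          # y_i = x_i * sigma_i + mu_i

mu = [3.173, 16.462, 2.566]
sigma = [2.113, 12.481, 1.150]
R = [[1.0, 0.1713, 0.8958],
     [0.1713, 1.0, 0.2059],
     [0.8958, 0.2059, 1.0]]

rng = np.random.default_rng(seed=1)   # seed chosen only for repeatability
y = generate_correlated_normal(mu, sigma, R, n=20, rng=rng)
print(y.mean(axis=0), y.std(axis=0, ddof=1))
print(np.corrcoef(y, rowvar=False))
```

As in the table, the statistics of a sample of 20 can differ noticeably from the population values, and the agreement should improve as n grows.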
Example 13.3. Generate a sample of 20 observations having the properties given in example
13.2 except assume that the distributions of the random variables $X_1$, $X_2$, and $X_3$ are normal,
exponential, and lognormal, respectively. Note that $X_2$ can only be approximately exponential
because $\mu_2$ is not exactly equal to $\sigma_2$.
Solution: The first steps are the same as for example 13.2. The Y matrix is then transformed
to a cumulative probability matrix P, and P is transformed to the required X by finding the
values of $x_{ij}$ corresponding to $p_{ij}$ based on an inverse transformation using the appropriate
probability distribution. For $X_1$ the distribution is normal, so $X_1 = Y_1$. For $X_2$ the distribution is
exponential, so $x_{i2}$ is found by inverting the cumulative exponential distribution at $p_{i2}$.
For $X_3$, the mean and standard deviation of the logarithms are computed from equations
6.30 and 6.31. The value of $\ln(X_3)$ is then obtained as the inverse of the normal distribution having the
determined logarithmic mean and standard deviation. Finally, $X_3$ is the antilogarithm of the
resulting normally generated value. A part of the indicated matrices is shown here:
As an example, consider $p_{1,2}$. This is the probability that $Y_2 < -5.56$ if $Y_2$ is
$N(16.462, 12.481^2)$. The value of this probability is 0.038829. In general, $p_{ij}$ is the probability that
$Y_j < y_{ij}$ if $Y_j$ is $N(\mu_j, \sigma_j^2)$. To transform $p_{ij}$ to the actual $x_{ij}$, one finds the value of $x_{ij}$
satisfying $\mathrm{prob}(X_j < x_{ij}) = p_{ij}$, where the probability is based on the appropriate pdf. For example,
$x_{1,2}$ is the value at which the cumulative exponential distribution equals $p_{1,2}$.
The value of $x_{1,3}$ is generated from a lognormal distribution. Using equations 6.35 and 6.36,
the mean and standard deviation of the logarithms are found to be 0.85083 and 0.42782. The value
of the standard normal distribution having a cumulative probability of 0.610928 is 0.281739.
Thus, $x_{1,3}$ is given as $x_{1,3} = \exp[0.85083 + 0.42782(0.281739)] = 2.64$.
This procedure was also carried out generating 1000 observations with the results shown
below:
Figure 13.3 shows histograms of the resulting simulations of the 1000 observations. The
histograms indicate the data generally follow the desired distributions.
Fig. 13.3. Frequency histograms for data generated for example 13.3.
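The probability transformation of example 13.3 can be sketched with the Python standard library alone. The sketch rests on several assumptions that are not stated in the text: the exponential distribution is parameterized by its mean, the logarithmic moments come from the usual lognormal moment relations, and the first generated value of the second variable is taken as -5.56, since the quoted probability of 0.038829 is only consistent with a negative value. All function names are illustrative.

```python
import math
from statistics import NormalDist

def to_probability(y, mu, sigma):
    """Cumulative probability of y under N(mu, sigma^2)."""
    return NormalDist(mu, sigma).cdf(y)

def inverse_exponential(p, mean):
    """Value whose cumulative exponential probability is p."""
    return -mean * math.log(1.0 - p)

def lognormal_log_moments(mean, sd):
    """Mean and standard deviation of ln(X) for a lognormal X."""
    var_ln = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - 0.5 * var_ln, math.sqrt(var_ln)

def inverse_lognormal(p, mean, sd):
    """Value whose cumulative lognormal probability is p."""
    mu_ln, sd_ln = lognormal_log_moments(mean, sd)
    return math.exp(mu_ln + sd_ln * NormalDist().inv_cdf(p))

# Second variable: normal value -> probability -> exponential value.
p12 = to_probability(-5.56, 16.462, 12.481)
x12 = inverse_exponential(p12, 16.462)

# Third variable: probability -> lognormal value.
x13 = inverse_lognormal(0.610928, 2.566, 1.150)
print(round(p12, 4), round(x12, 3), round(x13, 2))
```

Because each column is transformed separately, the marginal distributions are exact while the correlations among the transformed columns are only approximately those of the normal sample.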
1. Synthetic hydrologic traces do not provide a mechanism for overcoming biased or faulty data.
McMahon et al. (1972) discuss the use of simulated streamflow in reservoir design. Burges
and Linsley (1971) investigated the influence of the number of traces used in determining the
frequency distribution of reservoir stage. In their study, they generated inflows from both an
annual and a monthly, normal, Markov model. They found that, in general, fewer traces were
required to define the storage distribution when the monthly model was used than when the annual
model was used. They also found that about 1000 traces should be used to determine the storage
frequency distribution when the annual model is used.
Hahn and Shapiro (1967) discuss evaluating system performance by Monte Carlo simulation.
Benjamin and Cornell (1970) discuss using simulation to derive the probability distribution
of a random variable that is a function of other random variables. Smart (1973) discusses the use
of simulation to determine relationships between certain parameters of random geomorphological
models. Shreve (1970) used simulation to generate a sample of topologically random channel
networks. Fiering (1961) discusses simulation in reservoir design. Fiering and Jackson (1971)
develop models for simulating streamflow.
A widely used application of data generation has been in the general area of uncertainty,
reliability, and risk analysis. Data generation is used to examine a large number of outcomes or
possibilities from a system from which probabilistic statements can be made. Chapter 16 should
be consulted for more detail on this topic.
The stochastic nature of quantities estimated from stochastic models can be investigated
using data generation techniques. The design of any water resources system is dependent upon
estimates of hydrologic quantities. These estimates are based on some type of stochastic model
whether it be a flood frequency curve or a comprehensive river basin simulation model. One of
the first steps in developing design estimates is the selection of the stochastic model to be used.
Regardless of what stochastic model is finally selected, the parameters of this model must be
estimated from historical data. Because the parameters are functions of random variables (the
historical data), the parameters themselves are random variables. Furthermore, the design estimate that
is arrived at using the model is a random variable because it is dependent on the model parameters.
As an example, consider the design capacity of a reservoir required to meet a given criterion.
This capacity might be determined based on an available historical streamflow record. If a
different historical streamflow record were available and were used to determine the required
capacity, the estimate based on this historical record would differ from the estimate based on the
original historical record. The estimated design would be a random variable because it is a function
of the available streamflow record and streamflow is a random variable. Intuitively, if two
extremely long streamflow records were used, one would expect less difference in the estimated
reservoir capacity than if two short streamflow records were used. Furthermore, one would expect
the estimated capacity based on the long record to more closely approximate the "true" capacity
than the estimate based on a short record.
In general, the variance of a parameter estimate is a decreasing function of the sample size.
The larger the sample, the smaller the variance of the parameter estimate. This, in turn, implies
that the variance of the design estimate will decrease as the sample size increases. The difference
between a design estimate and its true population value may be thought of as a prediction error.
A general procedure for determining the probability distribution of prediction errors as a
function of sample size is presented in Haan (1972b). The procedure assumes that the correct
stochastic model is being employed. The procedure is as follows:
1. Estimate the parameters of the stochastic model and assume these estimates are equal to the
population values.
2. Simulate k independent sets of data of the type being studied with the model using the assumed
population parameters. Each set of data consists of n observations or years of record.
3. Reestimate the parameters of the model being used from the n simulated observations for each
of the k data sets. This results in k parameter sets.
4. Estimate the desired quantity, Q (mean annual runoff, 50-year peak flow, 90-day low flow,
etc.) with the model using each of the k parameter sets. This will result in k estimates for Q.
5. Look at the probability distribution, $P_Q(q)$, of the k estimates for Q and determine the
probability of an individual estimate being outside some acceptable limits. If Q* represents
the estimate of Q and $Q_l$ and $Q_u$ are the lower and upper limits, then the probability that Q*
will be outside the desired interval is given by
7. Select the record length that gives an acceptable probability (sufficiently low) of Q* falling
outside the interval $Q_l$ to $Q_u$.
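The steps above can be sketched for the simplest possible case. The sketch assumes a normal model for annual runoff and takes Q to be the mean, so that re-estimating the parameter and estimating Q coincide. The population values (370 mm and 120 mm), the limit d, and the function name are hypothetical, and the probability is taken as the empirical fraction of estimates outside the limits rather than from a fitted distribution.

```python
import numpy as np

def prob_outside_limits(mu, sigma, n, k, d, rng):
    """Fraction of k simulated estimates of Q (here the mean) that
    miss the assumed population value by more than d."""
    # Step 2: k independent data sets of n observations each.
    data = rng.normal(mu, sigma, size=(k, n))
    # Steps 3 and 4: re-estimate the parameter and the quantity Q.
    q_star = data.mean(axis=1)
    # Step 5: proportion of estimates outside mu - d to mu + d.
    return float(np.mean(np.abs(q_star - mu) > d))

rng = np.random.default_rng(seed=7)   # seed chosen only for repeatability
for n in (10, 15, 25, 50):
    p = prob_outside_limits(mu=370.0, sigma=120.0, n=n, k=2000, d=50.0, rng=rng)
    print(n, round(p, 3))
```

Running the loop over several record lengths gives the kind of information needed to select n: the shortest record for which the printed probability is acceptably low.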
This procedure can be applied to many types of stochastic models. Haan (1972a) presents an
illustration of its use in conjunction with the Thomas and Fiering (1962) streamflow simulation
model (chapter 15). In this example, a set of population parameters was assumed for the model
and k = 100 sets of observations were generated. Each set of observations consisted of n years of
record. The process was carried out for n = 10, 15, 25, and 50 years. Using each set of observations
and each record length, the parameters of the Thomas and Fiering streamflow model were
estimated, and, from these parameters, the mean annual flow was determined by simulation. This
gave 100 estimates for the mean annual flow for each of the 4 record lengths. A probability
distribution (normal distribution) was fit to the 100 estimates for the mean annual flow. The probability
that a single estimate of the mean annual flow would deviate more than a given amount,
d, from the population mean annual runoff of 371.6 mm was evaluated. This entire process was
repeated 3 times, giving a total of 300 simulated traces.
Table 13.2. Probability of error greater than d in mean annual runoff for problem described
by Haan (1972a)
n, in d in millimeters
years 6.40 12.70 25.40 38.10 50.80 76.20 101.60
The results of this analysis, presented in table 13.2, show the expected result that as the
number of years of record increases, the probability of making an error greater than a given
value decreases. For example, for this particular stream, there is a probability of 0.22 of missing
the true mean annual runoff by more than 50.8 mm if 10 years of record are available, whereas
the probability is only 0.03 if 50 years of data are available for parameter estimation.
This procedure for estimating prediction error probabilities requires that the population
parameters for the stochastic model be known. Because this is rarely the case in hydrology, these
parameters must be estimated from all of the available information. Obviously, these estimated
parameters will not equal the population parameters, but when used as population values along
with the above simulation technique, they will yield estimates of error probabilities that can serve
as a guide in determining how much data is needed to ensure an acceptably low probability of
making an unacceptable error with the stochastic model.
Exercises
13.1. Without using a table of standard normal deviates, generate 20 observations from a normal
distribution with a mean of 100 and a variance of 100. What are the mean and variance of the 20
observations?
13.2. Select 100 observations from a normal distribution with mean 0 and variance 1 (use a table
of standard normal deviates). Plot a histogram of these observations. Test the hypothesis that
these are from a normal distribution using the $\chi^2$ test and the Kolmogorov-Smirnov test. Why do
the mean and variance of the data not equal 0 and 1, respectively?
13.3. Generate 20 observations from an exponential distribution with $\lambda = 0.5$. (a) Test the
hypothesis that the observations are normally distributed. (b) Test the hypothesis that the observations
are exponentially distributed.
13.4. Generate independent samples of size 10, 20, 30, 50, and 100 from an N(0, 1). Plot the
observations on probability paper using one plot for each sample. Repeat this entire process 5 times.
Study the resulting probability plots in an attempt to develop a "feel" for the scatter that one can
expect when sampling from a frequency distribution. (This might be undertaken as a class project with
each student working through a sample of 10, 20, 30, 50, and 100. The results can then be shared.)
13.5. Any number of variations of exercise 13.3 can be worked using different initial
distributions, test distributions, parameter values, and sample sizes. Some variations should be
used to assist in developing a "feel" for the scatter present in random samples from frequency
distributions and for the discriminatory power of the chi-square and Kolmogorov-Smirnov tests.
13.6. Write a computer program for generating random observations from a gamma distribution
for integer values of $\eta$. Generate independent sets of size 10, 20, 30, 40, 50, and 100 observations
using $\eta = 2$ and $\lambda = 1.5$. Test the hypothesis that these generated values are from a (a) gamma
distribution, (b) normal distribution, (c) exponential distribution.
13.7. Repeat example 13.2 for samples of size 20, 200, and 999. Why are your results not
identical to those of example 13.2?
13.8. Weekly rainfall during a particular week of the year at a weather station is thought to
follow a gamma distribution with $\eta = 2$ and $\lambda = 1.5$. If 25 years of data are available for
estimating the parameters of the gamma distribution, what is the probability that the estimated 50-year
weekly rainfall based on the estimated gamma parameters will be in error by more than
0.5 inches?
DEFINITIONS
A sequence of observations collected over time on a particular variable is a time series. A time
series can be composed of a quantity either observed at discrete times, averaged over a time interval,
or recorded continuously with time. An ensemble of time series is a set of several time series
measuring the same variable. A single time series is called a realization. Thus, an ensemble is
Fig. 14.1. Time series containing stochastic and several types of deterministic components: (a) stochastic, (b) stochastic + trend, (c) stochastic + periodic, (d) stochastic + jump.
made up of several realizations. A time series may be composed of only deterministic events,
only stochastic events, or a combination of the two. Most generally, a hydrologic time series will
be composed of a stochastic component superimposed on a deterministic component. For
example, the series composed of average daily temperature at some point would contain seasonal
variation (the deterministic component) plus random deviations from the seasonal values (the
stochastic component). The deterministic components may be classified as a periodic component,
a trend, a jump, or a combination of these. Figure 14.1 shows typical stochastic time series with
various types of deterministic components.
Trends in a hydrologic time series can result from gradual natural or human induced
changes in the hydrologic environment producing the time series. Changes in watershed
conditions over a period of several years can result in corresponding changes in streamflow
characteristics that show up as trends in time series of streamflow data. Urbanization on a large
scale may result in changes in precipitation amounts that show up as trends in precipitation
(Huff and Changnon 1973). Climatic changes or shifts may introduce trends into hydrologic
time series.
Jumps in time series may result from catastrophic natural events such as earthquakes or
large forest fires that may quickly and significantly alter the hydrologic regime of an area.
Anthropogenic changes such as the closure of a new dam or the beginning or cessation of
pumping of ground water may also cause jumps in certain hydrologic time series. Astronomic
cycles are generally responsible for periodicities in natural hydrologic time series. Annual cycles
are often apparent in streamflow, precipitation, evapotranspiration, groundwater level, soil
moisture, and other types of hydrologic data. Weekly cycles may be present in water use data such
as industrial, domestic, or irrigation demands. The latter time series will often contain both
annual and weekly periodicities. Salas-LaCruz and Yevjevich (1972) and Yevjevich (1972c)
discuss periodicities and trends in hydrologic data in more detail.
The time scale of time series may be either discrete or continuous. A discrete time scale
would result from observations at specific times with the times of the observations separated
by $\Delta t$ or from observations that are some function of the values that actually occurred during
$\Delta t$. Most hydrologic time series fall in this latter category. Examples would be the average
monthly flow in a stream ($\Delta t$ = 1 month), annual peak discharge ($\Delta t$ = 1 year), and daily
rainfall ($\Delta t$ = 1 day).
A continuous time scale results when data are recorded continuously with time such as the
stage at a stream gaging location. Even when a continuous time scale is used for collecting the
data, the analysis is usually done by selecting values at specific time intervals. For example, rain
gage charts are usually analyzed by reading the data at selected times (e.g., every 5 minutes) or at
"break points" (here $\Delta t$ is not a constant).
In this chapter it will be assumed that the data are available at discrete times evenly spaced
$\Delta t$ time units apart. Even though the discussion centers around a time scale concept, a distance or
space scale can be used as well. For example, the width of a stream along a certain reach might
be a stochastic process where the width would be the random variable and distance along the
reach the "time".
The random variable described by the time series may be discrete or continuous. A sequence
of 0's and 1's denoting rainless and rainy days would be a discrete stochastic process with a
discrete time scale. The amount of daily rainfall would be a continuous stochastic process with a
discrete time scale ($\Delta t$ = 1 day). Thus a time series may be composed of either discrete or
continuous random variables on discrete or continuous time scales.
A stochastic process can be represented by X(t). The probability density function of X(t) is
denoted by $p_X(x; t)$, which describes the probabilistic behavior of X(t) at the specified time, t. If
the properties of a time series do not change with time, the series is called stationary. For a
stationary series $p_X(x; t_1)$ equals $p_X(x; t_2)$, where $t_1$ and $t_2$ represent any two different possible times.
If $p_X(x; t_1)$ and $p_X(x; t_2)$ are not equal, the series is termed nonstationary. Of the series shown in
figure 14.1, only that given in 14.1a can possibly be stationary. If the deterministic component is
removed from 14.1b, c, and d, they too might be stationary.
The properties of a time series can be obtained based on a single realization over a time
interval or based on several realizations at a particular time. The properties based on a time
interval of a single realization are known as time average properties. The properties based on several
realizations at a given time are known as the ensemble properties. If the time average properties
and the ensemble properties are the same, the time series is said to be ergodic.
Figure 14.2 shows several possible realizations for a continuous stochastic process on a
continuous time scale. The time average over the time interval 0 to T of the ith realization is given by
TIME SERIES 339
Fig. 14.2. Several realizations of a stochastic process.
For a realization on a discrete time scale, the time average would be determined from
where n is the total number of equally spaced points at which Xi(t) was observed.
The ensemble average at time t is given by

If the process is such that $\bar{X}(t) = \bar{X}(t + \tau)$ for all values of t and $\tau$, the process is said to be
stationary in the mean, or first-order stationary.
The ensemble covariance of X(t) and $X(t + \tau)$ is given by
If the covariance given by equation 14.5 is independent of t but dependent on $\tau$ (the lag), the
time series is stationary in the covariance. If $\tau = 0$, equation 14.5 gives the variance of the series.
Stationarity in the covariance implies stationarity of the variance. If a series is stationary in the
mean and in the covariance, the series is said to be second-order stationary, or weakly stationary.
340 CHAPTER 14
If a series is stationary in the covariance but not in the mean, the term weakly stationary or
second-order stationary should not be used. For many hydrologic applications, one is satisfied with
second-order stationarity. If a process is second-order stationary and $p_X(x; t)$ is a normal distribution,
the process can be shown to be stationary.
Bendat and Piersol (1966) state that in actual practice, random data representing stationary
physical phenomena are generally ergodic. For ergodic random processes, the time average
mean, as well as all other time average properties, equals the ensemble averaged value. Thus the
properties of a stationary random phenomenon can be measured properly, in most cases, from a
single observed time history record.
Generally, only one realization of a stochastic process is available. More than one realization
can be obtained by breaking the single realization into several shorter series. Unfortunately,
most hydrologic records are so short that breaking them into even shorter series may not be
practical. If the statistical properties of the parts of a time series are not significantly different from
one another, the series is said to be self-stationary.
For a single realization the mean is determined from equation 14.1 or 14.2. The covariance
can be determined by
TREND ANALYSIS
A common deterministic component in a time series is a trend. A trend is a tendency for
successive values to be increasing (or decreasing) over time. Changing hydrologic conditions can
introduce trends into a hydrologic time series. Urbanization may contribute to increased peak flows
or runoff volumes. Increased demands on groundwater may result in declining groundwater levels
or declining base flows in streams. Climate change may result in changes over time in rainfall,
temperature, and other climatic variables, which in turn may alter streamflows and groundwater levels.
If the data meet the assumptions of regression, simple linear regression can be used to test
for the presence of a linear trend and multiple linear regression may be used to test for trends
[Figure: time series plot; horizontal axis is Year, 1890 to 1990.]
where X(t) is the total annual rainfall in year t. The slope of 0.028 has a standard error of 0.0267.
The calculated t statistic for testing the hypothesis that the slope is zero is 1.05, which is clearly
not significant, indicating that the hypothesis cannot be rejected. Figure 14.4 is a probability
plot of the residuals indicating the assumption of normality is valid. The first-order serial
correlation coefficient of the residuals is 0.083, indicating a lack of serial correlation in the residuals.
Thus, the regression assumptions are satisfied and the conclusion of no trends is supported.
Fig. 14.4. Normal probability plot of residuals of time series regression of Stillwater, Oklahoma,
annual rainfall.
To illustrate the hazards of using a short hydrologic record to evaluate trends, the data for
Stillwater rainfall from 1978 through 1987, shown in figure 14.5, were investigated. The resulting
relationship was
The standard error on the slope was 0.489 and the calculated t for testing for zero slope was
3.36, an indication that one must reject the hypothesis of no trend. Figure 14.6 shows a probability
plot of the residuals for this regression. The first-order serial correlation coefficient for the
residuals was 0.03. Again, the assumptions of regression are not violated.
Fig. 14.6. Normal probability plot of residuals of a portion of the Stillwater, Oklahoma, annual
rainfall.
The Stillwater rainfall example illustrates that short periods of apparently nonstationary
data may be embedded in a longer stationary data series. Concluding nonstationarity from the
short series and projecting data either forward or backward in time based on this conclusion can
clearly lead to erroneous projections.
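The slope test used on the Stillwater record can be sketched as follows. This is a minimal illustration assuming NumPy and the ordinary least squares formulas for simple linear regression. The 10-year series is hypothetical (the Stillwater values are not reproduced here), and the function name `linear_trend_test` is an illustrative choice.

```python
import numpy as np

def linear_trend_test(t, x):
    """Least squares slope of x on t, its standard error, and the t statistic
    for the hypothesis that the slope is zero."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    n = t.size
    sxx = np.sum((t - t.mean()) ** 2)
    slope = np.sum((t - t.mean()) * (x - x.mean())) / sxx
    intercept = x.mean() - slope * t.mean()
    resid = x - (intercept + slope * t)
    s2 = np.sum(resid ** 2) / (n - 2)      # residual variance
    se = np.sqrt(s2 / sxx)                 # standard error of the slope
    return slope, se, slope / se

# Hypothetical 10-year series, not the Stillwater record.
years = np.arange(1978, 1988)
rain = np.array([30.1, 28.4, 33.0, 31.2, 36.5, 34.8, 38.9, 37.1, 41.0, 39.6])
slope, se, t_stat = linear_trend_test(years, rain)
print(round(slope, 3), round(se, 3), round(t_stat, 2))
```

The t statistic is simply the slope divided by its standard error, which is how the value of 1.05 for the long record follows from 0.028 and 0.0267. It is compared with Student's t for n - 2 degrees of freedom.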
If the data under consideration do not meet the assumptions of regression as set forth in
chapters 9 and 10, conclusions based on regression are approximate with an unknown level of
confidence associated with statistical tests. Helsel and Hirsch (1992), Salas (1993), and others
discuss the use of nonparametric tests for trends. Nonparametric tests do not depend on distributional
assumptions regarding the data and residuals but are generally based on relative ranks of
data points. Conover (1971) presents many nonparametric statistical procedures.
Salas describes the Mann-Kendall nonparametric test for trends in the series X(t) for t = 1,
2, ..., N. Each value in the series X(t') for t' = t + 1, t + 2, ..., N is compared to X(t) and assigned
a score z(k) given by
if there are few values of z(k) = 0 (ties). In hydrologic data such as rainfall totals, streamflow
rates or volumes, groundwater levels, and so forth, one would expect very few exact ties. If the
data series is on a discrete time scale, ties might be more common. In that event, Salas (1993) or
Helsel and Hirsch (1992) should be consulted.
The hypothesis of no trend is rejected if $|u_c| > z_{1-\alpha/2}$, where z is from the standard normal
distribution and $\alpha$ is the level of significance.
Example 14.1. The annual rainfall for Stillwater for the period 1978-1987 is given below. Use
the Mann-Kendall test for a significant trend in the data.
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
Sums
Values of z(k) are determined by constructing N - 1 columns of z(k) values with the first value
of z(k) in column j occupying the (j + 1)st position. Thus, in column 3 the first value of z(k)
occupies the 4th position. The value of z(k) is determined by the assignment rules given as
equations 14.8. Consider the third column. By applying equations 14.8, t = 3 and X(3) = 34.03. The
first entry in the third column is in the 4th row and is -1 because t' = t + 1 = 4, and X(3) = 34.03
is less than X(4) = 35.72.
S is simply the sum of all N(N - 1)/2 or 45 values of z(k) and is equal to -25. The value
of m is +1 since S < 0.
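The scoring and summation in example 14.1 can be sketched in Python. The sketch follows the scoring described in the example, in which a later value larger than the earlier one scores negatively, so a rising series gives a negative S. The variance N(N - 1)(2N + 5)/18 and the correction m are the usual no-ties forms and are assumed here. The series is hypothetical and the function name is illustrative.

```python
import math

def mann_kendall(x):
    """Return S and the standardized statistic u for the series x,
    scoring z = +1 when X(t) > X(t'), -1 when X(t) < X(t'), 0 for ties."""
    n = len(x)
    s = 0
    for t in range(n - 1):
        for t_prime in range(t + 1, n):
            diff = x[t] - x[t_prime]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0   # no-ties variance (assumed form)
    m = 1 if s < 0 else (-1 if s > 0 else 0)   # continuity correction
    return s, (s + m) / math.sqrt(var_s)

# Hypothetical rising series of N = 10 values, not the Stillwater data.
series = [29.2, 31.5, 34.0, 35.7, 33.1, 38.4, 36.9, 41.2, 40.3, 43.8]
s, u = mann_kendall(series)
print(s, round(u, 3), abs(u) > 1.96)
```

Under these assumptions, a sum of -25 with N = 10 gives a variance of 125 and a standardized value of about -2.15, which would exceed 1.96 in absolute value at the 5% level.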
The trend in a set of data may be removed by subtracting the trend line from each data point.
If the data follow the assumptions of linear regression, the detrended data X'(t) would be given by
Nonparametric estimates of a and b may also be obtained. Helsel and Hirsch (1992) suggest
that the slope may be estimated from
for t' < t and t' = 1, 2, ..., n - 1; t = 2, 3, ..., n. The intercept is estimated from
$$\hat{a} = X(t)_{med} - \hat{b}\, t_{med} \qquad (14.14)$$
Example 14.2. Compute the nonparametric estimates for the slope and intercept for the Stillwater
rainfall data of example 14.1.
Solution:
Values for a and b are estimated from equations 14.12 and 14.14. The median of the values in the
above table is 1.70667, which is the estimate for b. The median of the values in the column
labeled t is 1982.5 and in the column labeled X(t) is 34.88. Therefore
$$\hat{a} = 34.88 - 1.70667(1982.5) = -3348.59$$
For example, for t = 1980 the estimate is -3348.59 + 1.70667(1980), or 30.61. Figure 14.7
shows the resulting nonparametric regression line. The nonparametric estimate of the detrended
data, X'(t), is again given by equation 14.12 using the nonparametric estimates for the slope and
intercept.
Fig. 14.7. Nonparametric regression line on part of the Stillwater, Oklahoma, annual rainfall.
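The nonparametric line of example 14.2 can be sketched as follows, assuming NumPy. The slope is taken as the median of all pairwise slopes and the intercept as the median of X(t) minus the slope times the median of t, as described above. The series is hypothetical, and `median_slope_line` is an illustrative name.

```python
import numpy as np

def median_slope_line(t, x):
    """Median of pairwise slopes and the matching intercept
    a = median(x) - b * median(t)."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    slopes = [(x[j] - x[i]) / (t[j] - t[i])
              for i in range(t.size - 1) for j in range(i + 1, t.size)]
    b = float(np.median(slopes))
    a = float(np.median(x) - b * np.median(t))
    return a, b

# Hypothetical series, not the Stillwater data.
years = np.arange(1978, 1988)
rain = np.array([30.1, 28.4, 33.0, 31.2, 36.5, 34.8, 38.9, 37.1, 41.0, 39.6])
a, b = median_slope_line(years, rain)
detrended = rain - (a + b * years)   # trend line subtracted from each point
print(round(a, 2), round(b, 4))
```

Because medians are used throughout, a single unusually wet or dry year has much less influence on the fitted line than it would in least squares regression.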
JUMPS
Jumps or abrupt changes in the mean of a time series may be detected using equations 8.15
through 8.17 if the time point at which a jump is suspected is known and the necessary distributional
assumptions (normality) are valid. If the distributional assumptions are not met, these tests become
approximate with an unknown level of significance. The degree of approximation depends on the
severity of the deviation from normality. For highly skewed data, the approximation could be quite
poor. The procedure in making the test is to divide the time series into two subseries at the point of
the suspected jump with $n_1$ and $n_2$ observations in the subseries, where $n_1 + n_2 = n$, the total number
of observations. A test of the hypothesis that $\mu_1 = \mu_2$ for the two subseries is then made.
For data that do not meet the assumptions associated with the parametric tests, a nonparametric
test for the hypothesis $\mu_1 = \mu_2$ is available. Conover (1971, 1980) and Salas (1993) present
the Mann-Whitney test for the equality of means.
The entire sample is ranked with $R_i$ being the rank of the ith observation in the series for
i = 1 to n. The quantity
are computed. If $|T| > z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ is the standardized normal z value with probability
$z > z_{1-\alpha/2}$ equal to $\alpha/2$, the hypothesis of equality of means is rejected. Conover (1971, 1980)
should be consulted if the data have many ties or groupings of equal values.
Example 14.3. Below are annual flow data for Beaver Creek in western Oklahoma. The data are
plotted in figure 14.8. It has been hypothesized that after the 28th year, the flow regime has
distinctly changed. Test the hypothesis that the flow for years 1-28 differs from the flow for years
29-46 using (a) a normality assumption and (b) a nonparametric approach.
Solution:
(a) If a normality assumption is made, the test statistic comes from equation 8.17. The mean
and standard deviation of the first 28 observations are 83,684 and 75,046, and for observations
29-46 they are 33,046 and 31,759, respectively. Using equation 8.17, the t statistic is calculated as
which would indicate the two parts of the records have different means.
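The calculation in part (a) can be sketched from the summary statistics alone. The sketch assumes that equation 8.17 is the two-sample t statistic with a pooled variance estimate. If the equation has a different form, the numerical result will differ. The function name `pooled_t` is illustrative.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with a pooled variance estimate."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))

# Summary statistics quoted in part (a): years 1-28 and years 29-46.
t_stat = pooled_t(83684.0, 75046.0, 28, 33046.0, 31759.0, 18)
print(round(t_stat, 2), "with", 28 + 18 - 2, "degrees of freedom")
```

The statistic is compared with Student's t for $n_1 + n_2 - 2$ degrees of freedom at the chosen level of significance.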
AUTOCORRELATION
One method of characterizing correlation within a time series over time is the autocorrelation
function, $\rho(\tau)$, given by
For $\tau = 0$, equation 14.17 indicates that $\rho(0) = 1$ because $\mathrm{Cov}(X(t), X(t + \tau)) = \mathrm{Cov}(X(t), X(t)) = \mathrm{Var}(X(t))$.
From figure 14.2 it can be seen that for small values of $\tau$ the covariance term would be
positive because for the most part like signs are being multiplied ($X(t) - \bar{X}$ and $X(t + \tau) - \bar{X}$
have the same sign for small $\tau$). As $\tau$ increases, a point is reached where the covariance, and thus
$\rho(\tau)$, may become negative. Some authors call $\mathrm{Cov}(X(t), X(t + \tau))$ the autocorrelation function.
In keeping with the terminology established earlier in this book, $\mathrm{Cov}(X(t), X(t + \tau))$ will be
called the autocovariance.
A plot of the autocorrelation function against the lag $\tau$ is called a correlogram. For random
data such as shown in figure 14.1a, the correlogram would appear as in figure 14.9a. In the case
of data containing a cyclic and stochastic component such as shown in figure 14.1c, the correlogram
would be cyclic as in figure 14.9b, where p is the period of the cycle.
Correlograms are useful in determining if successive observations are independent. If the
correlogram indicates a correlation between X(t) and $X(t + \tau)$, the observations cannot be
assumed to be independent. The autocorrelation function may thus be said to indicate the "memory"
of a stochastic process. When $\rho(\tau)$ becomes zero, the process is said to have no memory for
what occurred prior to time $t - \tau$. In practice, $\rho(\tau)$ should be zero for large $\tau$ for most random
processes. If $\rho(\tau)$ for large $\tau$ exhibits a pattern that is not zero, it may be an indication of a
deterministic component. For example, if the correlogram appears as in figure 14.9b, it indicates the
data contain a periodic component.
Fig. 14.9. Typical correlograms: (a) Random process. (b) Random process superimposed on a
periodic process.
A hydrologic time series representing a process involving significant storage is likely to
have values at time t + 1 that are correlated with values at the previous time t. The correlation
may extend over several time increments so that X(t + k) is correlated with X(t) k time units
earlier. Daily flows in a stream and daily, monthly, and possibly annual groundwater levels are
examples of hydrologic time series that often exhibit correlation over time. Annual maximum peak
flow is an example of a time series that is unlikely to exhibit correlation over time.
For a discrete time scale, the autocorrelation function becomes $\rho(k)$, where k is the lag or
number of time intervals separating X(t) and $X(t + \tau)$. The relationship between $\tau$ and k is given by
$$\tau = k\,\Delta t$$
where $\Delta t$ is the length of the time interval (e.g., 1 day, 1 month, 1 year, etc.). If $\rho(k) = 0$ for all
$k \neq 0$, the process is said to be a purely random one. This indicates that the observations are
linearly independent of each other. If $\rho(k) \neq 0$ for some $k \neq 0$, the observations k time increments
apart are dependent in a statistical sense and the process is referred to as simply a random one. If
a time series is nonstationary, $\rho(k)$ will not be zero for all $k \neq 0$ because of the deterministic
element, even if the random element is itself a purely random time series (Matalas 1967b).
Unless the deterministic element is removed, one cannot determine to what extent nonzero values
of $\rho(k)$ are affected by the deterministic element.
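The population autocorrelation function, $\rho(k)$, may be estimated by $r(k)$, which is given by

$$r(k) = \frac{\sum_{i=1}^{n-k} X_i X_{i+k} - \frac{1}{n-k}\sum_{i=1}^{n-k} X_i \sum_{i=1}^{n-k} X_{i+k}}{\left[\sum_{i=1}^{n-k} X_i^2 - \frac{1}{n-k}\left(\sum_{i=1}^{n-k} X_i\right)^2\right]^{1/2}\left[\sum_{i=1}^{n-k} X_{i+k}^2 - \frac{1}{n-k}\left(\sum_{i=1}^{n-k} X_{i+k}\right)^2\right]^{1/2}}$$

with $X_i = X(t_i)$, $X_{i+k} = X(t_i + k\,\Delta t)$, and n the total number of observations. Some authors use
the terminology autocorrelation function for $\rho(k)$ or $\rho(\tau)$ and serial correlation function for $r(k)$
or $r(\tau)$. This distinction is not made in this text.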
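The estimator of the serial correlation can be sketched in a few lines of Python. This is only an illustration: it assumes the estimator is the ordinary product-moment correlation between the n − k overlapping pairs $(X_i, X_{i+k})$, and the function names `sample_autocorrelation` and `correlogram` and the cosine test series are hypothetical.

```python
import numpy as np

def sample_autocorrelation(x, k):
    """Lag-k serial correlation r(k) from the n - k overlapping pairs (X_i, X_{i+k})."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return 1.0
    m = x.size - k                      # number of pairs available at lag k
    if m < 2:
        raise ValueError("lag is too large for the length of the series")
    lead, lag = x[:m], x[k:]            # X_i and X_{i+k}
    num = np.sum(lead * lag) - lead.sum() * lag.sum() / m
    den = (np.sqrt(np.sum(lead**2) - lead.sum() ** 2 / m)
           * np.sqrt(np.sum(lag**2) - lag.sum() ** 2 / m))
    return num / den

def correlogram(x, max_lag):
    """r(k) for k = 0, 1, ..., max_lag."""
    return np.array([sample_autocorrelation(x, k) for k in range(max_lag + 1)])

# Illustrative series: a pure cosine with a period of 12 time units.
t = np.arange(240)
x_demo = np.cos(2 * np.pi * t / 12)
r_demo = correlogram(x_demo, 24)
```

For this deterministic series the correlogram repeats with the period of the cosine, so `r_demo[12]` is 1 and `r_demo[6]` is −1 to within rounding.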
For any observed series, it is unlikely that $r(k)$ will be exactly zero. If $r(k)$ differs from zero
by more than is expected by chance, then the observations k time periods apart cannot be assumed
independent. Procedures are available for testing the hypothesis that $\rho(k) = 0$ and for placing
confidence intervals on $\rho(k)$ (see chapter 11). Often, computer programs that generate correlograms
also compute confidence intervals such that if the computed autocorrelation falls outside the
confidence interval for a particular value of k, a hypothesis of $\rho(k) = 0$ would be rejected.
PERIODICITY
Autocorrelation analyzes a time series in the time domain. It provides information on the
behavior of the series over time, especially with regard to the memory of the process or how the
process at one instance of time is dependent on, or related to, the process at some prior time. An
alternate analytic approach is to examine the series in the frequency domain. With this approach,
an attempt is made to quantify the variability in the series in terms of repeating patterns having
fixed periods or, what is equivalent, fixed frequencies. The variance of the process is partitioned
among all possible frequencies so that the predominant frequencies can be identified. Let
$X_t = X_i = X(t = i\,\Delta t)$ for i = 1 to n. That is, the X's are equally spaced in the time domain.
We can express $X_t$ as a Fourier series

$$X_t = a_0 + \sum_{i=1}^{q}\left[a_i \cos(2\pi f_i t) + b_i \sin(2\pi f_i t)\right]$$

The maximum value for the number of terms in the series, q, is given by q = (n − 1)/2 if n
is odd and q = n/2 if n is even. The frequency, $f_i = i/n$, represents the ith harmonic of the
fundamental frequency 1/n.
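The coefficients a and b can be estimated from

$$a_0 = \frac{1}{n}\sum_{t=1}^{n} X_t = \bar{X}$$

$$a_i = \frac{2}{n}\sum_{t=1}^{n} X_t \cos(2\pi f_i t) \qquad b_i = \frac{2}{n}\sum_{t=1}^{n} X_t \sin(2\pi f_i t) \qquad i = 1, 2, \ldots, q$$

and, for n even,

$$a_q = \frac{1}{n}\sum_{t=1}^{n} (-1)^t X_t \qquad b_q = 0$$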
or
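The periodogram, $I(f_i)$, is defined as

$$I(f_i) = \frac{n}{2}\left(a_i^2 + b_i^2\right) \qquad i = 1, 2, \ldots, q$$

For a discrete time series, the angular frequency, $\omega_i$, is equal to $2\pi f_i$ or $2\pi i/n$.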
The variance of $X_t$ is given by
Because $a_0$ is a constant,
= 1 otherwise
Similarly,
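$$\mathrm{Var}(X_t) = \frac{1}{2}\sum_{i=1}^{q}\left(a_i^2 + b_i^2\right) \qquad n \text{ odd}$$

Let $g(f_i) = I(f_i)/(n\sigma^2)$ so that

$$\sum_{i=1}^{q} g(f_i) = 1 \qquad n \text{ odd}$$

The function $g(f_i)$ is the spectral density function representing the fraction of the variance of
$X_t$ associated with $f_i$. Recall that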
For this function f is 1/12 cycles per time unit and the wavelength is 12 time units. A software
package, NCSS 2000 (1998), was used to make the calculations used in generating the plots of
figure 14.10. The frequency axis of the periodogram is actually 2π/f in this plot. Note that for
this deterministic function the correlogram reflects the function exactly.

Fig. 14.11. Sum of 3 cosines (top) and its correlogram (middle) and spectral density (bottom).
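The periodogram calculation can be illustrated with a short Python sketch. It assumes the usual least squares Fourier coefficients at the harmonic frequencies $f_i = i/n$, a periodogram ordinate of $(n/2)(a_i^2 + b_i^2)$, and the convention $n a_q^2$ for the last ordinate when n is even; the function name `spectral_density`, the random seed, and the noisy cosine are hypothetical choices for illustration only.

```python
import numpy as np

def spectral_density(x):
    """Return harmonic frequencies f_i = i/n and g(f_i), the variance fraction at each."""
    x = np.asarray(x, dtype=float)
    n = x.size
    q = n // 2 if n % 2 == 0 else (n - 1) // 2
    t = np.arange(1, n + 1)
    freqs = np.arange(1, q + 1) / n
    angle = 2.0 * np.pi * np.outer(freqs, t)
    a = (2.0 / n) * (np.cos(angle) @ x)       # cosine coefficients a_i
    b = (2.0 / n) * (np.sin(angle) @ x)       # sine coefficients b_i
    periodogram = (n / 2.0) * (a**2 + b**2)   # I(f_i)
    if n % 2 == 0:
        # At f = 1/2 the cosine term is (-1)^t and the sine term vanishes
        # (assumed even-n convention: I(1/2) = n * a_q^2).
        a_q = np.sum(np.cos(np.pi * t) * x) / n
        periodogram[-1] = n * a_q**2
    variance = np.mean((x - x.mean()) ** 2)   # population form (divide by n)
    return freqs, periodogram / (n * variance)

# Illustrative series: cosine with period 12 plus normal noise of variance 0.5.
rng = np.random.default_rng(0)
k = np.arange(240)
x_noisy = np.cos(2 * np.pi * k / 12) + rng.normal(0.0, np.sqrt(0.5), k.size)
freqs, g = spectral_density(x_noisy)
peak_frequency = freqs[np.argmax(g)]
```

With these assumptions the values of g sum to 1, and the largest ordinate falls at the harmonic nearest the cosine frequency of 1/12 cycles per time unit.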
The function for figure 14.11 is
$z_t = a + bt$ and $w_t = z_t - z_{t-1}$,
then
where the terms in $\phi_1, \ldots, \phi_p$ represent the pth-order AR and $a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$ represents the
qth-order MA. Most hydrologic applications never get any more complex than an ARIMA(2, 1, 2).
where $a_t$ represents an unobserved white noise series. The $a_t$ are identically and independently
distributed random variables (iid rvs) with a mean $\mu_a = 0$ and variance $\sigma_a^2$, and $z_t$ is a stationary
time series with zero mean. A mean term can be added later if necessary.
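MA(1)
A first-order moving average process, MA(1), is given by

$$z_t = a_t - \theta_1 a_{t-1}$$

For notational convenience let $\gamma_k = \mathrm{Cov}(z_t, z_{t-k})$. Some properties of a MA(1) series are:

$$E(z_t) = 0$$

$$\gamma_0 = \sigma_a^2\left(1 + \theta_1^2\right)$$

$$\gamma_1 = -\theta_1\sigma_a^2 \qquad \gamma_k = 0 \text{ for } k > 1$$

$$\rho_1 = \frac{\gamma_1}{\gamma_0} = \frac{-\theta_1}{1 + \theta_1^2}$$

Note that this last equation presents a way of estimating $\theta_1$. We can estimate $\rho_1$ by $r_1$ and
equate

$$r_1 = \frac{-\hat{\theta}_1}{1 + \hat{\theta}_1^2}$$

and solve for $\hat{\theta}_1$. This is a moment estimator and is not very efficient.
For $\hat{\theta}_1$ to be real, $4r_1^2$ must be less than 1. This implies that $r_1^2$ must be less than 1/4 or $-1/2 \le r_1 \le 1/2$.
For a MA(1) process, $\rho_1$ must lie between ±1/2. When $\rho_1 = -0.5$, $\theta_1 = 1$. When $\rho_1 = +0.5$, $\theta_1 = -1$.
Therefore, $-1 \le \theta_1 \le 1$. Because of the randomness of a sample, it is possible for $|r_1| > 1/2$, but if that
occurs it brings into question the appropriateness of a MA(1) as a descriptor of the process.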
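As an illustration of the moment estimator, the following Python sketch solves the quadratic that results from equating $r_1$ to the lag-one autocorrelation of a MA(1). It assumes the sign convention $z_t = a_t - \theta_1 a_{t-1}$, so that $\rho_1 = -\theta_1/(1 + \theta_1^2)$, and it returns the root with absolute value no greater than 1; the function name `ma1_moment_estimate` is hypothetical.

```python
import math

def ma1_moment_estimate(r1):
    """Moment estimate of theta_1 from r1, assuming rho_1 = -theta_1 / (1 + theta_1**2)."""
    if r1 == 0:
        return 0.0
    disc = 1.0 - 4.0 * r1 * r1
    if disc < 0:
        # |r1| > 1/2: no real solution, which questions the MA(1) description
        raise ValueError("|r1| exceeds 1/2; a MA(1) may not be appropriate")
    # r1 * theta**2 + theta + r1 = 0; keep the root with |theta| <= 1
    return (-1.0 + math.sqrt(disc)) / (2.0 * r1)

theta_example = ma1_moment_estimate(-0.4)   # 0.5, to within rounding
```

The limiting cases agree with the text: $r_1 = -0.5$ gives an estimate of 1 and $r_1 = +0.5$ gives −1.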
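Once again, the expressions for $\rho_1$ and $\rho_2$ provide a means for estimating $\theta_1$ and $\theta_2$; however,
the two simultaneous equations may be difficult to solve.
MA(q)
The general result for $\rho_k$ for a MA(q) can now be written as

$$\rho_k = \frac{-\theta_k + \theta_1\theta_{k+1} + \theta_2\theta_{k+2} + \cdots + \theta_{q-k}\theta_q}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2} \qquad k = 1, 2, \ldots, q$$

Note the numerator for $\rho_q$ is $-\theta_q$, and $\rho_k = 0$ for k > q. This can be used to identify the order
of a MA(q) process.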
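Autoregressive Processes
An autoregressive process of order p is given by

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t$$

AR(1)
A first-order autoregressive process is given by

$$z_t = \phi_1 z_{t-1} + a_t$$

Taking the variance of both sides, with $a_t$ independent of $z_{t-1}$, gives $\gamma_0 = \phi_1^2\gamma_0 + \sigma_a^2$ or

$$\gamma_0 = \frac{\sigma_a^2}{1 - \phi_1^2}$$

This last equation shows that $|\phi_1| < 1$ or else $\gamma_0$ would be less than zero, which is not
possible. $\phi_1$ must be inside the unit circle.
For k = 1

$$\gamma_1 = \phi_1\gamma_0$$

For k = 2

$$\gamma_2 = \phi_1\gamma_1 = \phi_1^2\gamma_0$$

For k = k

$$\gamma_k = \phi_1^k\gamma_0 \qquad \text{so that} \qquad \rho_k = \phi_1^k$$

Because $|\phi_1| < 1$, $\rho_k$ exponentially decays toward zero. For $\phi_1$ positive, $\rho_k$ is positive. For
$\phi_1$ negative, $\rho_k$ alternates between positive and negative values.
For k = 1, $\phi_1 = \gamma_1/\gamma_0 = \rho_1$ and can be estimated by $r_1$. Another way to estimate $\phi_1$ is
through linear regression using the model

$$z_t = \phi_1 z_{t-1} + a_t$$

which is analogous to $Y = a + bX + \varepsilon$. Because $z_t$ has a mean of 0 and $\mathrm{Var}(z_t) = \mathrm{Var}(z_{t-1})$,
regression will result in a = 0 and b = $r_1$.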
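AR(2)
A second-order autoregressive process is given by

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + a_t$$

with the property

$$E(z_t z_{t-k}) = \phi_1 E(z_{t-1} z_{t-k}) + \phi_2 E(z_{t-2} z_{t-k}) \qquad k > 0$$

It can be noted that $E(z_{t-1} z_{t-k})$ is the same as $E(z_t z_{t-(k-1)})$, and $E(z_{t-2} z_{t-k})$ is the same as
$E(z_t z_{t-(k-2)})$ so that

$$\gamma_k = \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2}$$

Dividing by $\gamma_0$

$$\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2}$$

AR(p)
A general pth-order autoregressive process is given by

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t$$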
For an AR(1) process, the Yule-Walker solution is $\rho_1 = \phi_1$. For an AR(2) process, the
Yule-Walker relationships are

$$\rho_1 = \phi_1 + \phi_2\rho_1$$

$$\rho_2 = \phi_1\rho_1 + \phi_2$$

To get an estimate of $\phi_1$ and $\phi_2$, the $\rho_1$ and $\rho_2$ are replaced by $r_1$ and $r_2$. In general, the
Yule-Walker equations are solved to estimate the $\phi$'s.
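Solving the Yule-Walker equations amounts to solving a small linear system. The sketch below assumes the general form $\rho_k = \phi_1\rho_{k-1} + \cdots + \phi_p\rho_{k-p}$ for k = 1 to p, with $\rho_0 = 1$ and $\rho_{-k} = \rho_k$; the function name `yule_walker` and the sample values $r_1 = 0.6$ and $r_2 = 0.5$ are hypothetical.

```python
import numpy as np

def yule_walker(r):
    """Estimate phi_1..phi_p from sample autocorrelations r = [r_1, ..., r_p]."""
    r = np.asarray(r, dtype=float)
    p = r.size
    rho = np.concatenate(([1.0], r))                 # rho[0] = 1
    # p x p matrix whose (i, j) element is rho(|i - j|)
    lag = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return np.linalg.solve(rho[lag], r)

phi_ar2 = yule_walker([0.6, 0.5])   # AR(2) example
phi_ar1 = yule_walker([0.7])        # AR(1): phi_1 equals r_1
```

For the AR(2) case this reproduces the closed-form solution $\phi_1 = r_1(1 - r_2)/(1 - r_1^2)$ and $\phi_2 = (r_2 - r_1^2)/(1 - r_1^2)$.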
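Autoregressive Moving Average Models ARMA(p, q)
A general ARMA(p, q) model is given by

$$z_t = \phi_1 z_{t-1} + \cdots + \phi_p z_{t-p} + a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$$

ARMA(1, 1)
An ARMA(1, 1) model is given by

$$z_t = \phi_1 z_{t-1} + a_t - \theta_1 a_{t-1}$$

for which

$$\rho_k = \frac{\gamma_k}{\gamma_0} = \frac{(1 - \theta_1\phi_1)(\phi_1 - \theta_1)}{1 - 2\theta_1\phi_1 + \theta_1^2}\,\phi_1^{k-1} \qquad \text{for } k \ge 1$$

$\rho_k$ decays exponentially as k increases. The damping factor is $\phi_1$. The decay starts at $\rho_1$ rather
than $\rho_0 = 1$, as for the AR(1).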
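For AR(p)

$$\hat{\sigma}_a^2 = \left(1 - \hat{\phi}_1 r_1 - \hat{\phi}_2 r_2 - \cdots - \hat{\phi}_p r_p\right)s^2$$

For AR(1)

$$\hat{\sigma}_a^2 = \left(1 - r_1^2\right)s^2 \qquad \text{since } \hat{\phi}_1 = r_1$$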
For MA(q)
For ARMA(1, 1)
AR Models
For all practical purposes, the least squares estimates for the $\phi$'s can be obtained by solving
the Yule-Walker equations.
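MA Models
MA(1)
The MA(1) model can be rearranged as

$$a_t = z_t + \theta_1 a_{t-1} \qquad (A)$$

Similarly,

$$a_{t-1} = z_{t-1} + \theta_1 a_{t-2}$$

so that

$$a_t = z_t + \theta_1 z_{t-1} + \theta_1^2 a_{t-2}$$

This process can be continued to get

$$a_t = z_t + \theta_1 z_{t-1} + \theta_1^2 z_{t-2} + \cdots$$

From equation A, calculate $a_i$ for i = 1 to n by taking $a_0$ equal to its expected value of zero. Then

$$S(\theta) = \sum_{t=1}^{n} a_t^2$$

The procedure is to search the interval $(-1 < \theta < 1)$ for the $\theta$ that minimizes $S(\theta)$. Cryer (p. 133)
discusses a Gauss-Newton procedure based on a linear approximation and recursive differentiation.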
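The search for $\theta$ can be sketched as a simple grid search in Python. This is a minimal illustration, not the Gauss-Newton procedure: it assumes the recursion $a_t = z_t + \theta a_{t-1}$ with $a_0 = 0$, and the function names, the grid spacing of 0.01, and the simulated series with $\theta_1 = 0.6$ are hypothetical.

```python
import numpy as np

def ma1_sum_of_squares(z, theta):
    """S(theta) = sum of a_t**2 with a_t = z_t + theta * a_{t-1} and a_0 = 0."""
    a_prev = 0.0
    total = 0.0
    for z_t in z:
        a_t = z_t + theta * a_prev
        total += a_t * a_t
        a_prev = a_t
    return total

def ma1_grid_search(z, step=0.01):
    """Search -1 < theta < 1 on a regular grid for the minimum of S(theta)."""
    thetas = np.arange(-0.99, 0.99 + step / 2.0, step)
    ss = np.array([ma1_sum_of_squares(z, th) for th in thetas])
    best = int(np.argmin(ss))
    return thetas[best], ss[best]

# Simulated MA(1) series, z_t = a_t - 0.6 a_{t-1}, for illustration only.
rng = np.random.default_rng(1)
noise = rng.normal(size=2001)
z_sim = noise[1:] - 0.6 * noise[:-1]
theta_hat, s_min = ma1_grid_search(z_sim)
```

A finer grid or a one-dimensional optimizer could replace the fixed grid; the grid is used here only because it mirrors the search described in the text.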
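MA(q)
We must simply generalize the above procedure. Now we have a multivariate search

$$a_t = z_t + \theta_1 a_{t-1} + \theta_2 a_{t-2} + \cdots + \theta_q a_{t-q}$$

with $a_0 = a_{-1} = a_{-2} = \cdots = a_{-q} = 0$. The objective is to find the set of $\theta$'s that minimizes $\sum a_t^2$.
ARMA(1, 1)
Consider $a_t = a_t(\phi_1, \theta_1)$ and minimize $S(\phi_1, \theta_1) = \sum a_t^2$. We can write

$$a_t = z_t - \phi_1 z_{t-1} + \theta_1 a_{t-1}$$

To get $a_1$ we have a startup problem, namely $z_0$. We could set $z_0 = E(z_t)$. Another way is to
simply start the sum at t = 2 and thus avoid the $z_0$ problem.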
ARMA(p, q)
with $a_p = a_{p-1} = \cdots = a_{p+1-q} = 0$ and minimize to get the $\theta$'s and $\phi$'s.
14.1. Let $X(k) = \cos(2\pi k/12) + \varepsilon$ where $\varepsilon$ is an independent random observation from a normal
distribution with a mean of zero and a variance of 0.5. Compute and plot the correlogram and
spectral density function for this process by generating values for X(k) for k = 0 to 200. Compare
the results with figure 14.10.
14.3. The following data represent the years of eruptions of the volcano Aso for the period 1229
to 1962. Let k equal the eruption number beginning with 1229 so that k = 1 for 1239, k = 2 for
1240, and so on. Let X(k) be the number of years since the last eruption so that X(1) = 10, X(2)
= 1, X(3) = 25, and so forth. Compute the correlogram and spectral density function for X(k).
Is there any apparent pattern to the eruptions?
Years of eruptions of the volcano Aso for the period 1229 to 1962
14.4. The following data represent an unusual phenomenon in that they are observations of a true
time series from the geologic past. The Eocene lake deposits of the Rocky Mountains consist
of thinly laminated dolomitic oil shales hundreds of feet thick. It has been established that the
laminations are varves, or layered deposits caused by seasonal climatic changes in the lake
basins. By measuring the thickness of these laminations, a record of the annual change in the rate
of deposition through the lake's history is obtained (Davis 1973). Compute and plot the correlogram
and spectral density function for these data. Discuss any apparent patterns. The data are presented
columnwise.
Thickness (mm) of successive varves of a section through the Green River Oil Shale
14.5. Compute and plot the correlogram and spectral density function for the weekly precipitation
at Ashland, Kentucky, for the week of March 17. Discuss any apparent patterns. (Data in
appendix.)
14.6. Work exercise 14.5 using the monthly precipitation for Walnut Gulch near Tombstone,
Arizona. (Data in the appendix.)
14.7. For a MA(1) process with $a_t$ iid N(0, 1) and values of $\theta_1$ = 0.9, 0.4, and 0.9:
(b) Plot $z_t$ vs t.
14.8. For the following MA(2) processes, assume $a_t$ is iid N(0, 1). For each case:
(d) Based on population $\theta$'s, calculate $\rho_1$ and $\rho_2$. Compare to r(1) and r(2).
14.9. Assume an AR(1) process with $a_t$ iid N(0, 1). For $\phi_1$ = 0.9 and 0.4
14.10. Assume an AR(2) process with $a_t$ iid N(0, 1) and $\phi_1$ = 0.5 and $\phi_2$ = 0.3
Even during the 100-year life of a project, an observed historical flow sequence of 10 to 25
years in length will not repeat itself. In all cases the designer would agree that the worst flood
(or drought) on record is not the worst possible flood (or drought). The use of historical records
alone can only approximate the risk involved. That is, if a design is made based on a historical
record, chances are the design would be adequate if the historical record repeated itself.
However, we know that the historical record will not repeat itself. There is, thus, a certain risk
that the design will be inadequate for the unknown flow sequence that the system will actually
experience. This latter point can be illustrated by considering the design of a facility that might
have a 5-year life, say a small, temporary boat dock. For this design assume that 100 years of
flow records are available. The 100 years of record would provide 20 independent 5-year flow
sequences. A proposed design could then be evaluated on 20 independent flow sequences equal
in length to the design life of the facility. If it is found that the design is adequate in 15 of the
sequences and inadequate in the remaining 5, then one would estimate that for some future
5-year period there would be a probability of 0.25 (5/20) that the design would be inadequate.
If the design is adopted, a risk of 0.25 exists. A risk of 0.25 may be unacceptable. For this case,
consider that an acceptable risk is 0.05. This means the design should be increased and reevaluated
until it proves inadequate in only 1 of the 20 observed 5-year sequences. Generally, the
design life of a water resources project exceeds the length of available record so that a risk
evaluation using this procedure is not possible.
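The boat dock calculation can be written as a short Python sketch. Everything specific here is hypothetical: the function name `empirical_failure_risk`, the adequacy rule (a flow capacity of 30 units), and the made-up 100-year record in which 5 of the 20 sequences contain one flow above that capacity.

```python
import numpy as np

def empirical_failure_risk(flows, design_life, is_adequate):
    """Fraction of non-overlapping design-life sequences in which the design fails."""
    flows = np.asarray(flows, dtype=float)
    n_seq = flows.size // design_life
    if n_seq == 0:
        raise ValueError("record is shorter than the design life")
    sequences = flows[: n_seq * design_life].reshape(n_seq, design_life)
    failures = sum(1 for seq in sequences if not is_adequate(seq))
    return failures / n_seq

# Made-up 100-year record: five years exceed the hypothetical capacity of 30.
record = np.full(100, 10.0)
record[[2, 27, 51, 78, 93]] = 50.0
risk = empirical_failure_risk(record, 5, lambda seq: seq.max() <= 30.0)   # 5/20 = 0.25
```

The same function shows why the approach breaks down for long-lived projects: with a design life longer than the record, no complete sequence is available.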
In other cases it may be desirable to know the severity of a shortage. For instance, one might
be looking at a system to control the thermal pollution from a power plant. It may be that the
design requirement is to affect the natural water temperature by less than 5°C. During low flows, the
ratio of the volume of heated water discharge to the volume of natural flow may be such that it is
difficult to keep the overall temperature rise to less than 5.5°C. In this case, it would be desirable
to know the magnitude as well as the frequency of failing to meet the design standard since a 6°C
temperature rise would be less damaging than a 15°C temperature rise.
The approach outlined above assumes that there is some probabilistic mechanism underlying
the generation of streamflows and that this mechanism is sufficiently stable that it can be
considered stationary. It is also assumed that the sample in hand is a representative one.
An even better approach to determining risk probabilities would be through operations
(analytic or Monte Carlo) on the underlying, exact probability distribution or distributions that
the natural hydrologic process follows. Of course, this type of information is never available and
in practice must be approximated. Even if the exact distribution were known, its parameters would
have to be estimated from an observed record (sample) and would not equal the population
parameters. Thus, to overcome the objections of design evaluation based on a single (and many
times short) flow record, a data generation scheme or stochastic model is needed.
A stochastic model is a probabilistic model having parameters that must be obtained from
observed data. Stochastic streamflow models, for example, do not convert rainfall to runoff
through theoretical or empirical relationships as do deterministic models, but use the information
in past or historical streamflows. Stochastic streamflows are neither historical flows nor predictions
of future flows, but they are representative of possible future flows in a statistical sense.
Stochastically generated data can be used in evaluating risk probabilities, provided a "satisfactory"
stochastic model is available.
It should be noted that a stochastic model depends heavily on the assumptions of stationarity
and representativeness. The effects of watershed changes, for example, cannot be evaluated. On the
other hand, deterministic models might be able to simulate a changing hydrology as the basin
changes, but remember the future rainfall problem. Thus, one approach to watershed modeling
is to stochastically generate rainfall and use a deterministic model to convert the rainfall to
streamflow.
In developing a stochastic model, it is assumed that the data are the result of a random
process or one that involves chance. One cannot precisely state what the data values will be at
any particular future time, but one will be able to make statements of probability concerning
future data values. In looking over past data records it is apparent that streamflows are not completely
random with no constraints, but do possess certain recognizable features. For instance, if
the average annual flow has been around 15 inches for a long period of time, it is unlikely that it
will suddenly change to 25 inches unless the watershed is altered in some fashion. If the flows
have tended to be between 10 and 20 inches per year with only an occasional yearly total outside
these limits, the model should not produce a large number of flows outside these limits. Thus, the
model should preserve the overall mean and spread or variance of the data.
Further, it may be noted that there is some degree of persistence in that low flows tend to follow
low flows and high flows tend to follow high flows. A streamflow model should retain this
property. From this it can be seen that historical records certainly guide us in model development.
It is not the purpose of this chapter to promote stochastic hydrologic models. Rather, some
of the most prominent models are discussed. There is a very rapidly expanding literature on
stochastic hydrologic models. No attempt is made to cover all of the models currently in the
literature or to discuss all of the features of the models that are covered in the chapter.
In selecting a stochastic model it is important to be able to state what characteristics of the
phenomenon being modeled are important and what characteristics are unimportant. For example,
if streamflow is being modeled, the following is a partial listing of the questions that must be
considered.
2. Are annual peaks sufficient or will other peaks occurring during the year be important?
3. Is the time during the year when the peak occurs important?
5. Is the simultaneous occurrence of a peak flow and some other event important?
10. Is the dependence of the flow in one time period on the flow in previous time periods
important?
11. Is it sufficient to model the mean flow for a period? Is the variance important too? What
about the skewness?
12. Is the relationship of the flow on one stream and that on nearby streams of concern?
14. Is there evidence of trends or jumps in the flow record? Will there be trends or jumps in the
future? Is it important to model these features?
16. What is the quality and quantity of data available for model selection and parameter
estimation?
17. Given the available historic data, can the model parameters be estimated with sufficient
accuracy?
The answers to these and many other questions must be obtained before a model can be
applied to a particular problem. Of course, the answers to these questions depend heavily on the
use that is to be made of the model. Generally, it is desirable to select the simplest model that will
provide necessary information.
If one is considering the design of a reservoir to be located on a single stream to provide
irrigation water, quite likely a model that is capable of producing synthetic monthly streamflows
could be used. If a more complex model that considers daily flows or peak flows is used, it may
be necessary to sacrifice accuracy of monthly flow simulation to obtain accuracy of daily flows
or peaks. In any case, the more complex model would be more expensive to develop and test, and
certainly more expensive to use in developing long synthetic streamflow traces. On the other
hand, if the proposed reservoir is to also provide flood control benefits, then estimates of flood
peaks and possibly shorter duration flow volumes would be required.
The design of an irrigation reservoir illustrates the case where joint probabilities or joint
hydrologic time series may be required. It may be that the land to be irrigated and the watershed
supplying water to the reservoir are subjected to similar climatic conditions. Thus, the irrigation
demand may be the greatest at precisely the same time that flows into the reservoir are the lowest,
and vice versa. Neglecting this possible correlation between supply and demand can result in
an underdesigned reservoir. If the demand and supply are independent, designing for the highest
demand at a time of the lowest supply may be overdesigning, because the joint occurrence of
these events may have a very low probability.
There is no substitute for a thorough knowledge of the problem to be solved and the features
of the problem that must be reproduced by the simulation model. It is relatively easy to
develop a simulation model for a problem by making unrealistic simplifying assumptions. It is
difficult to develop a model for use in solving a problem as it really exists. It is generally better
to develop an approximate solution to the real problem than an exact solution to an unreal
problem.
In chapter 14 it was stated that a hydrologic time series may contain trends or jumps. If a
historical record contains trends or jumps and it is desired to use this record for the estimation of
the parameters of a stochastic model, it is necessary to be able to separate the deterministic and
stochastic components of the historical record. Once the deterministic component is removed, the
stochastic component can be used for parameter estimation. The model that is developed may
have to incorporate both deterministic and stochastic components. This is especially apparent in
cases where trends are present. If the trend is expected to continue into the period being modeled,
the trend component must be present in the model that is developed. Trends can generally be
modeled by a polynomial equation of the type

$$T_p(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots$$

where $T_p(t)$ represents the trend in the parameter p as a function of time t and $\beta_0, \beta_1, \beta_2, \ldots$ are
coefficients that may be estimated by multiple regression. The order of the polynomial can also
be tested by determining the highest order trend having a regression coefficient that is significantly
different from zero.
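Fitting and removing a polynomial trend can be sketched with ordinary least squares. The example assumes a trend of the form $\beta_0 + \beta_1 t + \beta_2 t^2 + \cdots$; the function name `fit_polynomial_trend` and the noise-free linear series are hypothetical, and the significance test on the highest-order coefficient is not shown.

```python
import numpy as np

def fit_polynomial_trend(t, x, order):
    """Least squares fit of a polynomial trend; returns coefficients and detrended series."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    design = np.vander(t, order + 1, increasing=True)    # columns 1, t, t**2, ...
    beta, *_ = np.linalg.lstsq(design, x, rcond=None)
    residual = x - design @ beta                          # stochastic component
    return beta, residual

# Illustrative series with an exact linear trend of 0.3 units per time step.
years = np.arange(50)
series = 10.0 + 0.3 * years
beta_hat, detrended = fit_polynomial_trend(years, series, order=1)
```

The detrended series returned here is what would be passed on to the parameter estimation for the stochastic component.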
Jumps in a hydrologic time series may be identified by computing the mean value of the parameter
of interest during the two time periods on either side of the jump. These two means are
then tested to see if they are significantly different from each other. The exact time at which a
jump occurs cannot be easily identified from the data alone because of the presence of stochastic
variation. In this case, a review of the data gathering procedure and factors affecting the variable
under study should be undertaken in an attempt to identify possible causes for the jump and the
time that these factors became important.
Many stochastic models require the estimation of a large number of parameters. Again, the
limited hydrologic data that are available at a point may be inadequate to estimate these parameters.
A regional approach to parameter estimation may help in this situation, provided regional
data are available (Benson and Matalas, 1967; Stedinger et al., 1994; Helsel and Hirsch, 1992; also
chapters 7 and 10).
Two classical stochastic models, the Bernoulli process and the Poisson process, and some
of their potential applications in hydrology, were discussed in chapter 4. The remainder of this
chapter is devoted to other selected models that appear frequently in the hydrologic literature.
known. Stochastic generation from a model of this type merely amounts to generating a sample
of random observations from a univariate probability distribution.
This type of model might be appropriate for generating a synthetic record of flood peaks.
Problems with the method are the uncertainty as to the proper probability distribution to use and
the uncertainty in the parameter values of the probability distribution. These two types of
uncertainty exist in all stochastic models to some extent. The larger the sample for estimating the
model parameters and testing the derived model, the less will be these uncertainties. Regional data
can also be used in some situations to assist in distribution selection and parameter estimation.
Another slightly more advanced application of a purely random model might be in generating
sequences of point storm rainfall amounts. The time between storms might be modeled as an
independent Poisson or Bernoulli process (Lane and Osborn 1973) and the amount of rain as a
gamma variable. The model could be made more complex by assuming the distribution parameters
are a function of the time of year or that the parameters of the gamma distribution depend on
the generated time since the last storm.
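A purely random rainfall generator of this kind can be sketched as follows. The sketch assumes a Bernoulli process for storm occurrence on each day and a gamma distribution for the depth on wet days; the function name `generate_daily_rainfall` and all parameter values (wet-day probability 0.2, gamma shape 0.8, scale 0.5) are hypothetical and are not taken from any data set.

```python
import numpy as np

def generate_daily_rainfall(n_days, p_wet, shape, scale, rng):
    """Independent daily rainfall: Bernoulli occurrence, gamma-distributed depth."""
    wet = rng.random(n_days) < p_wet                 # Bernoulli trial for each day
    depth = np.zeros(n_days)
    depth[wet] = rng.gamma(shape, scale, size=int(wet.sum()))
    return depth

rng = np.random.default_rng(42)
rain = generate_daily_rainfall(3650, p_wet=0.2, shape=0.8, scale=0.5, rng=rng)
wet_fraction = np.mean(rain > 0)
```

Seasonal variation could be added by letting `p_wet`, `shape`, and `scale` depend on the day of the year, as the text suggests.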
Whether or not a process can be considered as a purely random process may be indicated by
its correlogram, or spectral density. If r(k) is not significantly different from zero for k greater
than zero, or if the spectral density function oscillates randomly with no apparent peaks, the
process may be a purely random process. The difficulties in selecting the proper probability density
function and in parameter estimation remain, however.
where $X_i$ is the value of the process at time i, $\mu_x$ is the mean of X, $\rho_x(1)$ is the first-order serial
correlation, and $\varepsilon_{i+1}$ is a random component with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$. This model states
that the value of X in one time period is dependent only on the value of X in the preceding time
period plus a random component. It is also assumed that $\varepsilon_{i+1}$ is independent of $X_i$. The variance
of X is given by $\sigma_x^2$ and can be shown to be related to $\sigma_\varepsilon^2$ by

$$\sigma_\varepsilon^2 = \sigma_x^2\left[1 - \rho_x^2(1)\right] \qquad (15.2)$$

If the distribution of X is $N(\mu_x, \sigma_x^2)$, then the distribution of $\varepsilon$ is $N(0, \sigma_\varepsilon^2)$. Random values of
$X_{i+1}$ can now be generated by selecting $\varepsilon_{i+1}$ randomly from an $N(0, \sigma_\varepsilon^2)$ distribution. If t is
N(0, 1), then $t\sigma_\varepsilon$ is $N(0, \sigma_\varepsilon^2)$. Thus, a model for generating X's that are $N(\mu_x, \sigma_x^2)$ and follow
the first-order Markov model is

$$X_{i+1} = \mu_x + \rho_x(1)(X_i - \mu_x) + t_{i+1}\sigma_x\sqrt{1 - \rho_x^2(1)} \qquad (15.3)$$

The procedure for generating a value for $X_{i+1}$ is to estimate $\mu_x$, $\sigma_x$, and $\rho_x(1)$ by $\bar{X}$, $S_x$, and
$r_x(1)$, respectively, and then select a $t_{i+1}$ at random from an N(0, 1) distribution and calculate $X_{i+1}$
based on $\bar{X}$, $S_x$, $r_x(1)$, and $X_i$. The first value of $X_i$, $X_1$, might be selected at random from an
$N(\mu_x, \sigma_x^2)$. To eliminate the effect of $X_1$ on the generated sequence, the first 50 or 100 generated
values might be discarded.
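The generation procedure can be sketched in Python. The sketch assumes the normal first-order Markov form $X_{i+1} = \mu_x + \rho_x(1)(X_i - \mu_x) + t_{i+1}\sigma_x[1 - \rho_x^2(1)]^{1/2}$; the function name `generate_markov_normal`, the warm-up length of 100 values, and the parameter values (mean 15, standard deviation 3, serial correlation 0.5) are hypothetical.

```python
import numpy as np

def generate_markov_normal(n, mean, sd, rho1, rng, warmup=100):
    """Generate n values from a normal first-order Markov model, discarding a warm-up."""
    total = n + warmup
    x = np.empty(total)
    x[0] = rng.normal(mean, sd)                    # starting value from N(mean, sd**2)
    t = rng.standard_normal(total)                 # N(0, 1) deviates
    scale = sd * np.sqrt(1.0 - rho1**2)            # standard deviation of the random component
    for i in range(total - 1):
        x[i + 1] = mean + rho1 * (x[i] - mean) + t[i + 1] * scale
    return x[warmup:]                              # drop values influenced by the start

rng = np.random.default_rng(3)
flows = generate_markov_normal(20000, mean=15.0, sd=3.0, rho1=0.5, rng=rng)
r1_generated = np.corrcoef(flows[:-1], flows[1:])[0, 1]
```

For a long generated sequence the sample mean, standard deviation, and lag-one serial correlation should be close to the values supplied, which is a convenient check on any implementation.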
Equation 15.3 has been widely used for generating annual runoff from watersheds (Fiering
and Jackson 1971). Since t is N(0, 1), it is possible to generate values of X that are less than zero.
If this occurs it is generally recommended that the negative X be used to generate the next value
for X and then discarded. This procedure will result in a slight bias. If the occurrence of negative
X's is common in the generation process, it may indicate that X is not normally distributed. In
this event, some other distribution of $\varepsilon$ must be used. Equation 15.3 generates normally distributed
X's with a mean of $\mu_x$, a variance of $\sigma_x^2$, and first-order serial correlation of $\rho_x(1)$. Serial
correlation is common in hydrology, and, depending on the use to be made of the model, may be
quite important. Note that if $\rho_x(1) = 0$, equation 15.3 reduces to the independent process of
selecting a random observation from $N(\mu_x, \sigma_x^2)$. On the other hand, if $\rho_x(1) = 1$, equation 15.3
is completely deterministic in that $X_{i+1}$ is completely specified by $X_i$ ($X_{i+1} = X_i$).
For a first-order Markov process, the lag k serial correlation, $\rho_x(k)$, is given by

$$\rho_x(k) = \rho_x^k(1) \qquad (15.4)$$

Thus, the correlogram exponentially decays from $\rho_x(0) = 1$ to $\rho_x(\infty) = 0$ according to equation
15.4. If an observed correlogram has this property, the Markov model may be an appropriate
generating model.
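Equation 15.3 can be applied to the logarithms of data through the transformation
$Y_i = \ln(X_i)$. The generation model is given by

$$Y_{i+1} = \mu_y + \rho_y(1)(Y_i - \mu_y) + t_{i+1}\sigma_y\sqrt{1 - \rho_y^2(1)} \qquad (15.5)$$

where $\mu_y$, $\sigma_y$, and $\rho_y(1)$ refer to the mean, standard deviation, and first-order serial correlation of
the logarithms of the original data. Generation by equation 15.5 preserves the mean, variance,
coefficient of skew, and first-order serial correlation of the logarithms of the original data, but not
of the data itself. Matalas (1967) suggests a procedure for using a first-order Markov model on
the logarithms that preserves the mean, variance, skewness, and first-order serial correlation of the
original data. The procedure is based on the transformation $Y_i = \ln(X_i - a)$ with the parameters
of equation 15.5 related to the parameters of X through the following equations:

$$\mu_x = a + \exp\left(\frac{\sigma_y^2}{2} + \mu_y\right) \qquad (15.6)$$

$$\sigma_x^2 = \exp\left[2\left(\sigma_y^2 + \mu_y\right)\right] - \exp\left(\sigma_y^2 + 2\mu_y\right) \qquad (15.7)$$

$$\gamma_x = \frac{\exp\left(3\sigma_y^2\right) - 3\exp\left(\sigma_y^2\right) + 2}{\left[\exp\left(\sigma_y^2\right) - 1\right]^{3/2}} \qquad (15.8)$$

$$\rho_x(1) = \frac{\exp\left[\sigma_y^2\rho_y(1)\right] - 1}{\exp\left(\sigma_y^2\right) - 1} \qquad (15.9)$$

In these equations, $\mu_x$, $\sigma_x^2$, $\gamma_x$, and $\rho_x(1)$ refer to the mean, variance, coefficient of skew, and
first-order serial correlation of the original data and are estimated by $\bar{X}$, $s_x^2$, $C_{sx}$, and $r_x(1)$,
respectively. The quantities $\mu_y$, $\sigma_y$, $\rho_y(1)$, and a are estimated from equations 15.6-15.9 and then
used in equation 15.5 to generate values for $Y_{i+1}$. $X_{i+1}$ is then calculated from

$$X_{i+1} = \exp(Y_{i+1}) + a \qquad (15.10)$$

The X's generated in this fashion have the same mean, variance, skewness, and first-order serial
correlation as the sample used to estimate $\mu_x$, $\sigma_x^2$, $\gamma_x$, and $\rho_x(1)$.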
The procedure that is recommended for estimating $\mu_y$, $\sigma_y$, $\rho_y(1)$, and a is to solve equation
15.8 for $s_y$. Equation 15.9 then yields $r_y(1)$, equation 15.7 yields $\bar{Y}$, and equation 15.6 yields a.
Equation 15.1 can be used to generate X's that are distributed approximately gamma with
mean $\bar{X}$, variance $s_x^2$, and skewness $C_{sx}$ (Thomas and Fiering 1963). The procedure is to define $\gamma_\varepsilon$
as the skewness of the random component, $\varepsilon$. $\gamma_\varepsilon$ is estimated by
where $t_{i+1}$ is a random value from an N(0, 1). $X_{i+1}$ is then generated by
with the resulting generated X's being approximately gamma distributed with mean $\bar{X}$, variance
$s_x^2$, first-order serial correlation $r_x(1)$, and skewness $C_{sx}$.
The first-order Markov model (equation 15.1) is also known as the first-order autoregressive
model because $\rho_x(1)$ is equal to the regression coefficient $\beta$ that would be obtained with a
regression using Y as $X_{i+1}$ and X as $X_i$.
with n equal to the number of years of data, and $X_{i,j}$ the data value in the jth season of the ith year.
Similarly, $\sigma_{x,j}^2$ is estimated by $s_{x,j}^2$, $\gamma_{x,j}$ is estimated by $C_{sx,j}$, and $\rho_{x,j}(1)$ is estimated by $r_{x,j}(1)$.
Note that $\rho_{x,j}(1)$ is the first-order serial correlation between values in successive seasons. If
monthly streamflow is being considered, $\rho_{x,4}(1)$ would be the first-order serial correlation
between flows in months 4 and 5. $\rho_{x,j}(1)$ would be estimated by
where
In equation 15.15 there are some notational problems when j = m. In this case, j + 1 should be
taken as 1 because the first season follows the mth season (January follows December, for example).
With this notation, the multiseason, first-order Markov model for normally distributed flows
becomes

$$X_{i,j+1} = \mu_{x,j+1} + \rho_{x,j}(1)\frac{\sigma_{x,j+1}}{\sigma_{x,j}}\left(X_{i,j} - \mu_{x,j}\right) + t_{i,j+1}\sigma_{x,j+1}\sqrt{1 - \rho_{x,j}^2(1)} \qquad (15.17)$$
In any application, the population parameters are estimated by the corresponding sample
statistics. The subscript notation of equation 15.17 again has problems in that $X_{i,j+1}$ is really
equal to $X_{i+1,1}$ when j = m. For instance, if a monthly model is considered, then $X_{i,13}$ (or the 13th
monthly value in the ith year) is actually $X_{i+1,1}$ (or the first monthly value in the next year). $t_{i,j+1}$
is again a random observation from an N(0, 1). Values generated by equation 15.17 are thus the
sum of the mean for the season plus a regression coefficient times the deviation from its mean of
the value in the previous period plus a random component that is normally distributed with mean
zero and variance $\sigma_{x,j+1}^2\left[1 - \rho_{x,j}^2(1)\right]$.
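The seasonal generation scheme can be sketched in Python. The sketch assumes the model takes the form $X_{i,j+1} = \mu_{j+1} + \rho_j(1)(\sigma_{j+1}/\sigma_j)(X_{i,j} - \mu_j) + t_{i,j+1}\sigma_{j+1}[1 - \rho_j^2(1)]^{1/2}$, with the first season following the mth season. The function name `generate_seasonal_markov`, the warm-up of 5 years, and the four-season parameter values are hypothetical.

```python
import numpy as np

def generate_seasonal_markov(n_years, mean, sd, rho, rng, warmup_years=5):
    """Seasonal first-order Markov generation for normally distributed flows.

    mean[j] and sd[j] are the seasonal mean and standard deviation; rho[j] is the
    serial correlation between season j and the season that follows it.
    """
    mean = np.asarray(mean, dtype=float)
    sd = np.asarray(sd, dtype=float)
    rho = np.asarray(rho, dtype=float)
    m = mean.size
    total = (n_years + warmup_years) * m
    x = np.empty(total)
    x[0] = rng.normal(mean[0], sd[0])
    t = rng.standard_normal(total)
    for step in range(total - 1):
        j = step % m                      # current season
        nxt = (j + 1) % m                 # next season; wraps from m back to 1
        slope = rho[j] * sd[nxt] / sd[j]  # regression coefficient
        x[step + 1] = (mean[nxt] + slope * (x[step] - mean[j])
                       + t[step + 1] * sd[nxt] * np.sqrt(1.0 - rho[j] ** 2))
    return x[warmup_years * m:].reshape(n_years, m)   # rows: years, columns: seasons

season_mean = np.array([10.0, 20.0, 30.0, 15.0])
season_sd = np.array([2.0, 4.0, 6.0, 3.0])
season_rho = np.array([0.5, 0.6, 0.4, 0.3])
rng = np.random.default_rng(11)
synthetic = generate_seasonal_markov(5000, season_mean, season_sd, season_rho, rng)
```

With m seasons the routine needs the 3m parameters discussed in the text: a mean, a standard deviation, and a serial correlation for each season.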
The first-order Markov model can also be generalized to a seasonal model for gamma variates
(Fiering and Jackson 1971) by generalizing equations 15.11, 15.12, and 15.13; equation
15.11 becomes
where $t_{i,j}$ is a random value from an N(0, 1). Equation 15.13 becomes identical to equation 15.17
with population parameters replaced by their estimates except $\varepsilon_{i,j+1}$ is used in place of $t_{i,j+1}$. The
resulting $X_{i,j}$ will be distributed almost gamma. Because skewness varies from season to season,
the representation is not statistically pure (Fiering and Jackson 1971). This is because the sum of
gamma variates is not gamma unless the scale parameter, $\lambda$, is the same.
Equation 15.17 can be applied to the logarithms of the original data. In this case, $X_{i,j}$ would
refer to the logarithm of the value in the ith year and jth season. The parameters of the model
would also be based on the logarithms. The model used in this way would preserve the mean,
variance, skewness, and first-order serial correlation of the logarithms of the data, but not of the
data itself. Equations 15.3 and 15.17 have been widely used in hydrology. Equation 15.17 is
sometimes known as the Thomas-Fiering model because of the early work of these two
researchers with the model (Thomas and Fiering 1962, 1963; Fiering 1967). The model in the
form of equation 15.17 requires that many parameters be estimated. For each season the mean,
variance, and first-order serial correlation must be estimated. This results in estimating 3m
parameters (a monthly model requires one to estimate 36 parameters). This large number of
parameters requires considerable data. The technique based on data generation given in chapter
13 can be used for evaluating the effect of the length of record available for parameter estimation
on the reliability of the Thomas-Fiering model or other stochastic models.
The X's might represent actual data values or their natural logarithms. In the case of a normal
model, the random element becomes

$$\varepsilon_{i+1} = t_{i+1}\sigma_x\sqrt{1 - R^2}$$

where $\sigma_x^2$ is the variance of X; $R^2$ is the multiple coefficient of determination between $X_{i+1}$ and $X_i$,
$X_{i-1}$, ..., $X_{i-m+1}$; $t_{i+1}$ is a random observation from N(0, 1); and the $\beta$'s are multiple regression
coefficients.
The multilag model permits one to incorporate linear influences on data in one period re
flected by data in several preceding periods. The regression coefficients, p, can be estimated by
normal multiple regression means. The question of how many lags to include can also be ana
lyzed by the methods of multiple regression devoted to determining whether or not a particular
"independent" variable is important.
One difference between this model and the multiple regression procedures is that the
number of observations available for parameter estimation, n*, changes as the number of lags
changes. If there are n total observations, then there are n − 1 observations available for
estimating $\rho_X(1)$ of the first-order Markov model. If two lags are considered (m = 2), then there
are only n − 2 observations available for parameter estimation. In general, for an mth-order
Markov model, there are n* = n − m observations for parameter estimation. What this means is
that multiple regression techniques are not strictly applicable because the sample size and
variables involved in the regressions change as the number of lags changes. For instance, $R^2$ may
actually increase if the number of lags included decreases because the data set involved in the
regression has changed. Generally, it is recommended that if a kth-order lag is included, then all
lags up to k also be included. For example, if a third-order lag is included, then the first- and
second-order lags should be included as well.
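The bookkeeping described above can be sketched in Python. This is only an illustration under stated assumptions: the helper name `fit_multilag`, the use of NumPy least squares, and the synthetic series are not from the text. It shows how the usable sample size n* = n − m shrinks as lags are added, so that $R^2$ values for different m come from different data sets.

```python
import numpy as np

def fit_multilag(x, m):
    """Least-squares fit of X[i+1] on X[i], ..., X[i-m+1] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    n_star = n - m                       # observations usable for estimation
    y = x[m:]                            # responses X[i+1]
    # Column k is the series lagged k+1 steps behind the response.
    lags = np.column_stack([x[m - k - 1:n - k - 1] for k in range(m)])
    design = np.column_stack([np.ones(n_star), lags])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - resid.var() / y.var()     # multiple coefficient of determination
    return beta, r2, n_star

# Synthetic first-order series, used only to exercise the function.
rng = np.random.default_rng(0)
x = np.zeros(200)
for i in range(1, 200):
    x[i] = 0.6 * x[i - 1] + rng.standard_normal()

for m in (1, 2, 3):
    beta, r2, n_star = fit_multilag(x, m)
    print(f"m={m}  n*={n_star}  R2={r2:.3f}")
```

Because each value of m uses a different set of responses, the printed $R^2$ values are not strictly comparable, which is the caution raised in the paragraph above.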
The conditional probability, prob($X_t = a_j \mid X_{t-1} = a_i$), gives the probability that the process at time
t will be in "state" j given that at time t − 1 the process was in "state" i. Equation 15.22 says that
this conditional probability is independent of the "states" occupied at times prior to t − 1. A state
is simply a subdivision of the process $X_t$ into some interval. Thus, if $X_t$ represents the depth of
rainfall on day t, one state might be defined as no rainfall, another as between 0.00 and 0.05
inches of rainfall, and so forth.
The prob($X_t = a_j \mid X_{t-1} = a_i$) is commonly called the one-step transition probability. That is,
it is the probability that the process makes the transition from state $a_i$ to state $a_j$ in one time period
or one step. The prob($X_t = a_j \mid X_{t-1} = a_i$) is usually written as $p_{i,j}(t)$, indicating the probability of
a step from $a_i$ to $a_j$ at time t. If $p_{i,j}(t)$ is independent of t ($p_{i,j}(t) = p_{i,j}(t + \tau)$ for all t and $\tau$), then
the Markov chain is said to be homogeneous. In this event
Higher-order Markov chains can be defined to represent stochastic processes such that the
value of the process at time t is dependent on its value in several immediately preceding time periods.
Thus an nth-order Markov chain is one in which
The treatment of Markov chains in this text will be limited to first-order homogeneous Markov
chains.
If a process is divided into m states, then $m^2$ transition probabilities must be defined. However,
at each step the process must either remain in state i or proceed to one of the other m − 1
states. Thus
With this restriction, an m-state Markov chain requires that m(m − 1) transition probabilities
(parameters) be estimated. The remaining m $p_{i,j}$'s can be determined from equation 15.25. The $m^2$
transition probabilities can be represented by the m × m matrix P given by
Equation 15.25 states that the elements in any row of P must sum to unity. A matrix having
this property is said to be a stochastic matrix. Some authors define P as
$P = [p_{i,j}]' = [p_{j,i}]$. Under
this definition the columns of P sum to unity. The definition given by equation 15.23 will be used
in this treatment.
The transitional probability matrix P can be estimated from observed data by tabulating the
number of times the observed data went from state i to state j, $n_{i,j}$. Then an estimate for $p_{i,j}$
would be
Considerable data may be required to get accurate estimates of $p_{i,j}$ if $p_{i,j}$ is small. This is because
in an observed set of data, $n_{i,j}$ may be uncharacteristically high or low if $p_{i,j}$ is close to zero and
the sample is small.
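The tabulation described above can be sketched as follows. The sketch assumes the usual row-normalized estimator, $\hat{p}_{i,j} = n_{i,j} / \sum_j n_{i,j}$; the function name, the 0-based state coding, and the short wet/dry record are hypothetical and serve only to illustrate the counting.

```python
import numpy as np

def estimate_transition_matrix(states, m):
    """Count i -> j transitions and row-normalize (states coded 0..m-1)."""
    counts = np.zeros((m, m))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # States that were never left have no estimate; report NaN rather than guess.
    with np.errstate(invalid="ignore", divide="ignore"):
        p_hat = np.where(row_totals > 0, counts / row_totals, np.nan)
    return counts, p_hat

# Made-up record: 0 = dry day, 1 = wet day.
sequence = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
counts, p_hat = estimate_transition_matrix(sequence, 2)
print(counts)
print(p_hat)
```

With only nine observed transitions, a single additional wet-to-wet day would change the second row noticeably, which illustrates why small samples give unreliable estimates of small $p_{i,j}$.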
The more states that a process is divided into, the less accurate will be the estimates for $p_{i,j}$.
For example, if a daily rainfall model is being considered, one might like to have 10 states to adequately
represent the possible amounts of rainfall. However, 10 states require the estimation of
90 transition probabilities. This, in turn, requires a large amount of data.
Once P is known, all that is required to determine the probabilistic behavior of the
Markov chain is the initial state of the chain. In the following, the notation $p_j^{(n)}$
means the probability
that the chain is in state j at step or time n. The 1 × m vector $p^{(n)}$ has elements $p_j^{(n)}$.
Thus
Under this definition $p^{(0)}$ is the initial probability vector. $p^{(1)}$ is then given by
$$p^{(1)} = p^{(0)} P$$
where $P^n$ is the nth power of P. In general
$$p^{(n)} = p^{(0)} P^n \qquad (15.31)$$
For a proof of these relationships, reference should be made to any number of books on probability
or stochastic processes (see for instance Bailey 1964; Feller 1957; Brieman 1969).
As the Markov chain advances in time, $p_j^{(n)}$ becomes less and less dependent on $p^{(0)}$.
That is to say, the probability of being in state j after a large number of steps becomes independent
of the initial state of the chain. A point is reached where $p^{(n)} = p^{(n+m)}$ for a sufficiently
large n. From equation 15.31 we then get, for a sufficiently large n, that $P^n = P^{n+m}$. When this
occurs the chain is said to have reached a steady state. Under steady state conditions
$p^{(n)} = p^{(n+m)}$ and can thus be denoted simply as p. The 1 × m vector p can be thought of as
giving the probabilities of being in the various states after a large number of steps. Under
steady state conditions
$$p P = p \qquad (15.33)$$
One can therefore calculate the steady state probabilities simply by computing $P^n$ for a large n.
Example 15.1. Consider a 2-state, first-order Markov chain for a sequence of wet and dry days.
Let state 1 be a dry day and state 2 be a wet day. Assume the transitional probability matrix to be
$$P = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix}$$
Thus the probability of a dry day following a wet day is given by $p_{2,1}$ as 0.5. Evaluate:
Solution:
$P^{16}$ is assumed to be the steady state n-step transitional probability matrix because the elements are not
changing much and the two rows are identical. Thus, $p_2^{(100)} = p_2^{(16)} = 0.1667$. The probability of
rain on any day in the distant future is 0.1667. For this to be true, an analysis of rainfall records
should show that 16.67% of the days are wet. This serves as a check on P.
Comment: Another check on the steady state probabilities is to see if equation 15.33 is valid.
This demonstrates that p = (0.8333, 0.1667) is the steady state probability vector. See example
15.2 for further comment on this.
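The $P^n$ check in example 15.1 is easy to reproduce numerically. In the sketch below the matrix entries are inferred from the values quoted in the example ($p_{2,1} = 0.5$, rows summing to one, and a steady state wet-day probability of 0.1667), so they should be read as an assumption rather than a transcription.

```python
import numpy as np

# Inferred from p21 = 0.5 and the quoted steady state (0.8333, 0.1667).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

P16 = np.linalg.matrix_power(P, 16)   # 16-step transition matrix
p = P16[0]                            # either row approximates the steady state

print(np.round(P16, 4))
print("p P == p:", np.allclose(p @ P, p, atol=1e-6))
```

Both rows of `P16` agree to four decimals, and the product of p with P returns p, which is the condition for a steady state.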
 
Example 15.2. Consider a Markov chain model for the amount of water in storage in a reservoir.
Let state 1 represent the nearly full condition, state 2 an intermediate condition, and state 3 the
nearly empty condition. Assume that the transition probability matrix is given by
Note that it is not possible to pass directly from state 1 to state 3 or from state 3 to state 1 without
going through state 2. Over the long run, what fraction of the time is the reservoir level in
each of the states?
Solution: The fraction of time spent in each state is given by p. Equation 15.33 can be used to
determine p. Examination of equation 15.33 shows that if p is a solution, so is $\lambda p$ for any scalar $\lambda$. Therefore,
If we let $p_1 = 1$, the first of these equations gives $p_2 = 3$. With $p_2 = 3$, the last of the equations
gives $p_3 = 6/7$. Therefore, one solution of pP = p is p = (1, 3, 6/7). Since $\sum p_i$ must equal 1, p
can be scaled so that p = (0.2059, 0.6176, 0.1765). This solution can be substituted into equation
15.33 to verify that it is, in fact, a solution. Another check would be to compute $P^n$ for large n and
show that $P^n = (p\ p\ p)'$. Thus, over the long run the reservoir is nearly full 20.59% of the time,
nearly empty 17.65% of the time, and in the intermediate state 61.76% of the time.
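The direct solution of pP = p can also be set up as a small linear system. The sketch below is illustrative only: the matrix entries are an assumption, chosen so that transitions between states 1 and 3 are impossible and the solution matches the p = (1, 3, 6/7) scaling quoted above; they are not taken from the example itself. One of the redundant steady state equations is replaced by the condition that the probabilities sum to one.

```python
import numpy as np

# ASSUMED illustrative matrix, consistent with the quoted solution.
P = np.array([[0.4, 0.6, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.7, 0.3]])

def steady_state(P):
    """Solve p P = p together with sum(p) = 1."""
    m = P.shape[0]
    A = P.T - np.eye(m)      # (P' - I) p' = 0
    A[-1, :] = 1.0           # replace one redundant equation by sum(p) = 1
    b = np.zeros(m)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

p = steady_state(P)
print(np.round(p, 4))
print(np.round(np.linalg.matrix_power(P, 64), 4))   # rows approach p
```

The second print statement is the $P^n$ check mentioned in the solution: for large n every row of $P^n$ approaches p.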
Comment: This problem illustrates the direct solution via equation 15.33 for p. The $P^n$ approach
for determining p can, however, be used. For this example $P^8$ and $P^{16}$ can be found to be (4 significant
figures):
Data generation from a Markov chain requires only a knowledge of the initial state and the
transitional probability matrix P. To determine the state at time 2, a random number is selected
between zero and one. If the chain is currently in state i and this random number is between
$\sum_{j=1}^{n-1} p_{i,j}$ and $\sum_{j=1}^{n} p_{i,j}$ for n =
1, 2, ..., m, the next state is taken as state n. Example 15.3 illustrates this procedure.
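The cumulative-probability rule above can be sketched in a few lines. The transition matrix is the same assumed illustrative matrix used for the reservoir sketch (not a transcription of the example), states are coded 0 to m − 1 internally, and the function name and random seed are arbitrary.

```python
import numpy as np

# ASSUMED illustrative reservoir matrix (states: 0 = nearly full, 1 = intermediate, 2 = nearly empty).
P = np.array([[0.4, 0.6, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.7, 0.3]])

def simulate_chain(P, start, steps, rng):
    """Generate a state sequence by comparing uniform numbers with row-wise cumulative sums."""
    cum = np.cumsum(P, axis=1)
    last = P.shape[0] - 1
    state, path = start, []
    for _ in range(steps):
        u = rng.random()                                   # uniform on [0, 1)
        nxt = int(np.searchsorted(cum[state], u, side="right"))
        state = min(nxt, last)                             # guard against round-off in the last column
        path.append(state)
    return path

rng = np.random.default_rng(42)
path = simulate_chain(P, start=0, steps=10, rng=rng)
print([s + 1 for s in path])                               # report states as 1, 2, 3
```

Each run with a different seed gives another equally likely trace; long traces can be tabulated to compare the observed state frequencies with the steady state probabilities.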
Example 15.3. Assume that the reservoir of example 15.2 is nearly full at t = 0. Generate a
sequence of 10 possible reservoir levels corresponding to t = 1,2, ..., 10.
and
Markov chains have been used in hydrology for modeling rainfall (Gabriel and Neumann
1962; Pattison 1964; Bagley 1964; Grace and Eagleson 1966; Hudlow 1967). Lloyd (1967) presents
a discussion of the application of Markov chains to reservoir theory.
Some of the difficulties in using Markov chains in hydrology are:
2. Determining the intervals of the variable under study to associate with each state.
3. Assigning a number to the magnitude of an event once the state is determined (i.e., how much
rainfall should be assigned given that the chain moved to state 3 and that state 3 encompasses all
rainfalls between 1 and 2 inches).
4. Estimating the large number of parameters involved in even a moderate size Markov chain
model. A chain with 5 states has 20 parameters to estimate. If seasonality is encountered and
4 seasons are needed, 80 parameters are required.
5. Handling situations where some transitions are dependent on several previous time periods
while others are dependent on only one prior time period. Hudlow (1967) found the dry-dry
transition for hourly rainfall showed a sixth-order Markov dependence while a first-order
dependence was adequate for the other transitions.
Woolhiser, Rovey, and Todorovic (1973) discuss an n-day rainfall model in which the transition
from wet to dry days is based on a 2-state Markov chain and the amount of rain on rainy
days is exponentially distributed. Haan et al. (1976) describe a 7-state Markov chain model of
daily rainfall in which the amount of rain in each state is assumed uniformly distributed except
for the last state, in which a shifted exponential distribution is used.
Carey and Haan (1976) present a modified Markov chain daily rainfall simulation model in
which the transitional probabilities are replaced by a continuous probability distribution. That is,
given that the system is in state i on day n in season k, then the probability distribution of the
amount of rain on day n + 1 is given by
where $X_n$ is the amount of rain on day n, $p_{i,1}^{k}$ is the probability of no rain on day n + 1 given $X_n$
was in state i and season k, and $P_X(x \mid i, k)$ is the cumulative probability distribution of rainfall on
day n + 1 given that rain occurs on day n + 1 and that $X_n$ was in state i and season k. Hence, to
each state in each season there is a corresponding distribution function of the form of equation
15.35. The parameters $p_{i,1}^{k}$
are estimated as $f_{i,1}^{k} / f_i^{k}$ where $f_{i,1}^{k}$ is the historical frequency of transition
from state i to state 1 (no rain) in season k and $f_i^{k}$ is the total number of occurrences in state
i and season k. The parameters of each distribution $P_X(x \mid i, k)$ are determined from historical data
using the set of observations [$X_{n+1} \mid X_{n+1} > 0$, $X_n$ in i, season k].
Synthetic traces of daily rainfall are generated from equation 15.35 in the following manner:
2. Generate a uniform random number, R, from the interval (0, 1).
For Kentucky rainfall, Carey and Haan (1976) used 3 states and 12 seasons. Gamma
distributions were used for $P_X(x \mid i, k)$. Furthermore, for a given season, the same gamma distribution
could be used for all 3 states. Thus, for each season 5 parameters (2 parameters of the
gamma distribution and 3 values for $p_{i,1}^{k}$) had to be estimated, or a total of 60 parameters. This
compares with 505 parameters when the Markov chain approach of Haan et al. (1976) was used
[a 7 × 7 transition probability matrix for each of 12 seasons plus an exponential parameter,
12(7 × 6) + 1 = 505]. When simulated rainfall for these two models was compared to historical
rainfall at 7 Kentucky locations, the Carey-Haan model proved superior.
Exercises
15.1. Develop a stochastic model for generating a sequence of numbers that could represent the
years between eruptions of the Volcano Aso (see exercise 14.3). Use the model to generate a
series of 100 possible times (years) between eruptions. Compare the correlogram and spectral
density functions for the generated and observed sequences.
15.2. Assume that the time (days) between rains follows a Poisson distribution with a mean of
2 days. Further assume that the amount of rain (inches) on rainy days follows a gamma distribution
with a mean of 1 inch and a variance of 0.50 inch. Simulate 1 year of rainfall using this model.
15.3. Use the first-order Markov model to generate 100 years of annual runoff (inches) for Cave
Creek near Fort Spring, Kentucky. (Basic data in Appendix.)
15.4. Generate a random sample of size 100 from a gamma distribution with $\eta$ = 3.5 and $\lambda$ =
2.5. Plot the observed and expected relative frequencies.
15.5. The following data are presented by Burges and Johnson (1973) for the Sauk River in
Washington. Based on this data and the first-order, seasonal, lognormal Markov model, generate
50 years of streamflow data. Compute and plot the correlogram and spectral density function for
the generated data.

Month xj Sx,j Yxj Month xj SX,~ Yxj
15.6. Generate 100 years of monthly streamflow data for Cave Creek near Fort Spring, Kentucky,
using the seasonal first-order normal Markov model. Compare the correlogram and spectral
density function of the simulated and observed data. (Basic data are in Appendix.)
15.7. Write out and explain how a model such as described by equations 15.20 and 15.21 can be
used as a higher-order, multiseason Markov model. Apply the model to Cave Creek near Fort
Spring, Kentucky (Appendix for data), using a second-order, monthly Markov model.
15.8. Use the first-order, normal Markov model to generate 100 years of annual runoff for the
Spray River near Banff, Canada. Compare the correlogram and spectral density functions for the
observed and simulated data.
15.9. Use equation 15.29 to show the individual generating equations for a 2-site model in terms
of $\rho_{1,2}(0)$, $\rho_1(1)$, $\rho_2(1)$, and $x_i$.
15.11. Generate 1 year of rainfall letting the sequence of wet and dry days be defined by the
Markov chain of example 15.1 and the amount of rainfall on a rainy day by a gamma distribution
with a mean of 1 and a variance of 0.50 inches.
15.12. Generate a succession of 200 water level states for the situation described in examples
15.2 and 15.3. What fraction of the time was the reservoir level in each of the three states? How
does this compare to the predicted results of example 15.2?
16. Probabilistic Methods
for Uncertainty, Risk,
and Reliability Analysis
WATER RESOURCES and environmental engineering systems deal with the extremely
complex nature of the physical, chemical, biological, and socioeconomic processes. While
designing or analyzing the performance of a given system, most often a mathematical model is
used to describe the interrelationships and interactions among its component processes. Typically,
hydrologic and water quality models are complex and might be written in a generic form as
where O represents the outputs being modeled, I represents the inputs to the model such as rainfall,
temperature, and so on, P represents the parameters required by the model, t represents time,
and e represents errors associated with the modeling process.
One axiom of stochastic processes is that any function of a random variable is itself a
random variable. Thus, if any of the variables in I or P are uncertain and known only in a probabilistic
sense or if the nature of the functional relationships in the model is uncertain, then O
is also uncertain and can be known only in a probabilistic sense. The design and analysis of hydrologic,
hydraulic, and environmental projects are subject to uncertainty because of inherent
uncertainty in natural systems, a lack of understanding of the causes and effects in various
physical, chemical, and biological processes occurring in natural systems, and insufficient data.
This chapter was written by Dr. Aditya Tyagi, formerly a graduate research assistant in the
Biosystems and Agricultural Engineering Department of Oklahoma State University, Stillwater,
Oklahoma, and currently a water resources engineer with CH2M Hill, Austin, Texas.
As a result of these uncertainties, the performance of a project will also be uncertain. The presence
of uncertainties brings into question conventional deterministic design practices due to
their inability to account for possible variations of system responses. The issues involved in the
design and analysis of water resources and environmental engineering systems under uncertainty
are multidimensional. Therefore, quantification of system uncertainties is imperative in
order to design or operate a project successfully. Reliability, risk, and uncertainty analysis are
therefore becoming increasingly important in modeling and designing water resources infrastructure
and decision support systems. In some cases, uncertainty analysis is mandatory, particularly
when critical decisions involve potentially high levels of risk. A systematic quantitative
uncertainty analysis provides insight into the level of confidence warranted in model
estimates and in understanding judgements associated with modeling processes. It may also
play an illuminating role in identifying how robust the conclusions about model results are and
help target data gathering efforts.
It is apparent that considerable work may be involved in gathering the data required to characterize
the uncertainty in each parameter and the parameters as a whole. Before making any data
collection effort, it would be wise to investigate the importance of various parameters to the
process being modeled. If a parameter has little impact on the output of a model, there is no need
to spend a great deal of time and money estimating that parameter or worrying about uncertainty
in that parameter. Sensitivity analysis is used to measure the importance of a parameter.
SENSITIVITY ANALYSIS
Sensitivity analysis is the study of how the variation in the output of a model can be apportioned,
qualitatively or quantitatively, to different sources of variation, and how the given model
depends upon the information fed into it. It ranks model parameters based on their contribution
to overall error in model predictions. While carrying out sensitivity analysis, selection of an efficient
sensitivity analysis method is critical.
where S is the absolute sensitivity (output units/input units), $S_r$ is the relative sensitivity (dimensionless),
O represents a particular output, and P represents a particular input parameter. Graphically,
the terms in these relationships are shown in figure 16.1.
Fig. 16.1. Definitions for numerical derivatives.
Most hydrologic and water quality models are a collection of algorithms and not a continuous
function of the parameters in the usual sense of function; thus numerical derivatives are used
to approximate the partial derivatives of equation 16.2. The numerical derivatives may be
approximated as
where $\Delta P$ is the amount a parameter is perturbed from its base value (this is generally taken as
10% or 15% of P). The numerical derivatives are calculated about base parameter values. The relative
sensitivity coefficients are dimensionless and thus can be compared across parameters,
whereas the absolute sensitivity coefficients are affected by units of output and input and cannot
be directly compared across noncommensurate parameters.
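A numerical sensitivity calculation of this kind can be sketched as follows. Several things here are assumptions rather than content of the text: the definitions $S = \partial O / \partial P$ and $S_r = (\partial O / \partial P)(P/O)$ are the common ones and are taken to correspond to equation 16.2, a central difference is used for the derivative, and the Hazen-Williams head loss is written in its commonly quoted SI form with coefficient 10.67. The function and argument names are hypothetical.

```python
def head_loss(lam, Q, C, D, L=1500.0):
    """Hazen-Williams head loss (m), commonly quoted SI form (assumed), times a model factor lam."""
    return lam * 10.67 * L * Q**1.852 / (C**1.852 * D**4.87)

def sensitivities(model, base, name, frac=0.10):
    """Absolute (S) and relative (Sr) sensitivity of model output to one parameter."""
    dP = frac * base[name]                       # perturbation, e.g. 10% of the base value
    up = dict(base); up[name] += dP
    dn = dict(base); dn[name] -= dP
    O = model(**base)
    S = (model(**up) - model(**dn)) / (2 * dP)   # central-difference estimate of dO/dP
    Sr = S * base[name] / O                      # dimensionless
    return S, Sr

base = {"lam": 1.0, "Q": 0.915, "C": 130.0, "D": 0.305}
for name in base:
    S, Sr = sensitivities(head_loss, base, name)
    print(f"{name:>3}: S = {S:12.3f}   Sr = {Sr:7.3f}")
```

For a power-function model the exact relative sensitivity equals the exponent of the parameter, so the output is linear in `lam` (relative sensitivity 1 for any perturbation size), while the strongly nonlinear dependence on `D` makes the 10% estimate differ from the exact value of −4.87.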
Most hydrological and environmental engineering models are complex and contain a large
number of parameters. The disadvantage of determining model response to one parameter at a
time (performing the traditional sensitivity analysis) is that it requires considerable computation
and provides information about only one point in the parameter space. In some cases, this type of
sensitivity analysis may be misleading as such combinations of inputs would be unlikely in the
real world. To overcome this problem, global sensitivity analysis may be used.
Fig. 16.2. Monte Carlo simulation.
the parameters, perhaps in a Monte Carlo simulation (MCS) form. The MCS process is illustrated
in figure 16.2 and discussed in detail in the subsequent section, Uncertainty Methods.
If several parameters are simultaneously and independently varied, then the multiple
regression of output O on all parameters, $P_i$, is
where $b_i$ represents regression coefficients. Normalized sensitivity indices can be obtained for
each variable in equation 16.4 by subtracting its mean and dividing by its estimated standard
deviation. The normalized regression model is
where $\bar{O}$ and $s_O$ are the mean and standard deviation of simulated output O, and $\bar{P}_i$ and $s_{P_i}$ are the mean
and standard deviation of the ith parameter. By equating equation 16.4 with equation 16.5, the
relationship between the standardized coefficient and unnormalized multiple regression
coefficient is
where $\beta_i$ is the normalized sensitivity index of the ith parameter. It can be shown that $\beta_i$ is the correlation
coefficient between simulated output O and generated parameter $P_i$.
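The regression-based indices can be sketched with a small Monte Carlo experiment. The sketch assumes the usual standardization, $\beta_i = b_i s_{P_i} / s_O$, as the relationship between the two sets of coefficients; the linear test model, parameter distributions, and sample size are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Three independent, hypothetical parameters with different spreads.
P = np.column_stack([rng.normal(10.0, 1.0, n),
                     rng.normal(5.0, 2.0, n),
                     rng.normal(0.0, 0.5, n)])
# Hypothetical model output with a small random error term.
O = 2.0 + 1.5 * P[:, 0] - 0.8 * P[:, 1] + 0.1 * P[:, 2] + rng.normal(0.0, 0.5, n)

# Unnormalized multiple regression of O on all parameters.
X = np.column_stack([np.ones(n), P])
b, *_ = np.linalg.lstsq(X, O, rcond=None)

# Normalized sensitivity indices and, for comparison, simple correlations.
beta = b[1:] * P.std(axis=0, ddof=1) / O.std(ddof=1)
corr = np.array([np.corrcoef(P[:, i], O)[0, 1] for i in range(P.shape[1])])

print("beta:", np.round(beta, 3))
print("corr:", np.round(corr, 3))
```

Because the parameters are generated independently, the normalized indices and the simple correlation coefficients agree closely; the agreement would break down if the parameters were correlated.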
   
Example 16.1. The head loss, $h_f$ (m), in a pipe is given by the Hazen-Williams equation as
Compute the sensitivity coefficients (equation 16.2) assuming L as constant (1500 m) and mean
values of $\lambda_m$, Q, C, and D are 1.0 (dimensionless), 0.915 (m³/s), 130 (SI units), 0.305 (m),
respectively.
Solution: Output, $h_f$, is calculated by substituting base values (mean values) in the given
functional relation of $h_f$ as
Analytical partial derivatives of $h_f$ with respect to various uncertain parameters are determined and
absolute sensitivity coefficients are evaluated by substituting mean values in the resulting expressions.
The calculation is presented in table 16.1, which indicates that D is the most sensitive parameter.
Parameter, $P_i$
Example 16.2. For the preceding example, determine S and $S_r$ using numerical approximation.
Solution: First, the output value O at the base values of the input parameters is found to be
535.29. Then, assuming $\Delta P$ = 10%, parameters are perturbed about their base values. The calculation
is presented in table 16.2.
For nonlinear models, the error in sensitivity coefficients depends upon the magnitude of the perturbation
($\Delta P$) and the nonlinearity of the model response with respect to different parameters.
Table 16.3 presents the effect of the magnitude of perturbation on relative sensitivity coefficients.
Table 16.3 demonstrates that when the functional relationship is linear with respect to a
parameter, there is no impact of magnitude of $\Delta P$ on $S_r$. The inexactness of $S_r$ increases with
Output at perturbed
Perturbed input values input values Sensitivity coefficients
Example 16.3. Use the global sensitivity analysis method for the preceding example. Assume
the parameters are independent with $\lambda_m$ normally distributed with a coefficient of variation (CV)
of 0.12, and Q, C, and D lognormally distributed with CVs of 0.10, 0.15, and 0.05, respectively.
Solution: MCS is used to generate 5000 random observations for each of the uncertain parameters.
The value of $h_f$ is calculated for each of the 5000 sets of parameters. Based on these data the
following regression equation ($R^2$ = 0.89) is obtained.
Uncertainty Analysis
The main objective of uncertainty analysis is to assess the statistical properties of model outputs
as a function of stochastic input parameters. In water resources engineering projects, design
quantities and model outputs are functions of several parameters, not all of which can be quantified
with absolute accuracy. The task of uncertainty analysis is to determine the uncertainty features of
the model outputs as a function of uncertainties in the model itself and in the stochastic parameters
involved. It provides a formal and systematic framework to quantify the uncertainty associated with
the model outputs. Furthermore, it offers the designer useful insights regarding the contribution of
each stochastic parameter to the overall uncertainty of the model outputs. Such knowledge is essential
in identifying the important parameters to which more attention should be given to improve
assessment of their values and then reduce the overall uncertainty in the model output. Quantitative
characterization of uncertainty provides an estimate of the degree of confidence that can be placed
on the analysis and findings.
As an example, water quality models are formulated to describe both observed conditions
and predict planning scenarios that may be substantially different from observed conditions.
Planning and management activities such as checking basinwide water quality for regulatory
compliance, waste load allocation, and so forth, require the assessment of hydrologic, hydraulic,
and water quality conditions beyond the range of observed data. These inadequacies in model
parameters or inputs force water quality modelers to characterize the impacts of parameter
uncertainties quantitatively so that appropriate decisions regarding water pollution abatement
programs can be made. The most complete and ideal description of uncertainty is the pdf of the
quantity subject to uncertainty. However, in most practical problems, a pdf is very difficult, if
not impossible, to derive precisely. In most situations, the main objective of uncertainty analysis
is to evaluate the first and second moments of a model output in terms of input random
variables.
uncertain parameters that are difficult to accurately quantify. An accurate reliability assessment
of such models would help the designer build more reliable systems and aid the operator in making
better maintenance and scheduling decisions.
The reliability of a system can be most realistically measured in terms of probability. The
failure of a system can be considered as an event in which the demand or loading, L, on the system
exceeds the capacity or resistance, R, of the system so that the system fails to perform satisfactorily
for its intended use. The objective of reliability analysis is to ensure that the probability
of the event (R < L) throughout the specified useful life is acceptably small. The risk, $P_f$, defined
as the probability of failure, can be expressed as (Ang and Tang 1984; Yen et al. 1986)
$$P_f = P(R < L) \qquad (16.7)$$
where P denotes the probability function. Equation 16.7 can be rewritten in terms of the performance
function Z as
$$P_f = P(Z < 0)$$
where Z is defined alternatively as
$$Z = R - L$$
where $p_{R,L}(r, l)$ is the joint pdf of R and L; c is the lower bound of R; and a and b are the lower and
upper bounds of L, respectively. The resistance, R, and load, L, are random variables given as
where U is the vector representing input parameters of the model representing R, and V is the
vector representing input parameters of the model representing L. In some problems L may be a
deterministic quantity representing a hydrologic/hydraulic/environmental target level such as
peak discharge; volume; contaminant concentration in soil, water, or air; minimum dissolved
oxygen in a stream; critical cancer risk; and so on. Alternatively, by using the performance variable
Z defined in equations 16.9, 16.10, and 16.11, the risk can be written as
where $p_Z(z)$ is the pdf of Z. The pdf of Z is unknown, or difficult to obtain. In most cases the
exact distribution of Z may not be required, as any of several distributions can be used to make a
decision if correct information about the moments of $p_Z(z)$ is available.
where $X = (X_1, X_2, \ldots, X_n)$, a vector containing n random variables. In FOA, a Taylor series
expansion of the model output is truncated after the first-order term
where $X_0 = (X_{10}, X_{20}, \ldots, X_{n0})$, a vector representing the expansion points. In FOA applications
to water resources and environmental engineering, the expansion point is commonly the mean
value of the basic variables. Thus, the expected value and variance of Y are
where $\sigma_Y$ is the standard deviation of Y; $\bar{X} = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_n)$, a vector of mean values of the
input basic variables. If the basic variables are statistically independent, the expression for
Var(Y) becomes
where $C_0$ and $r_i$ are constants and the $X_i$'s are independent stochastic input random variables. The
first-order mean of the model output, $\mu_Y$, can be written as
where $\mu_{X_i}$ is the mean of $X_i$. The first-order variance of the multiplicative form, $\sigma_Y^2$, can be
approximated as
where $CV_{X_i}$ is the coefficient of variation of $X_i$. Dividing equation 16.24 by the square of equation
16.23, the approximate coefficient of variation of Y, $CV_Y$, can be evaluated as
Another form of interest is the additive form obtained when two or more power functions
are added. The general additive form is written as:
The approximate mean of Y is given as
So $CV_Y$ can be evaluated by
A third functional form is the combination of multiplicative and additive forms. This form
is obtained when two or more multiplicative forms having common power function(s) are added.
The general form can be represented as
For evaluating the mean and variance of combined forms of Y such as equation 16.30, the mean
and variance of the additive part must be determined first using equation 16.27 and equation
16.28. Next, equations 16.23, 16.24, and 16.25 are used to determine the mean, variance, and CV
of Y by treating the combined form as a multiplicative form, with the additive part taken as a multiplicative
component with known mean and variance.
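The two-stage procedure for combined forms can be sketched as below. The formulas coded here are the standard first-order results for power functions of independent variables and are assumed to correspond to equations 16.23 to 16.28: for an additive form $Y = \sum C_i X_i^{r_i}$, the mean is $\sum C_i \mu_{X_i}^{r_i}$ and the variance is $\sum C_i^2 r_i^2 \mu_{X_i}^{2 r_i} CV_{X_i}^2$; for a multiplicative form $Y = C_0 \prod X_i^{r_i}$, the mean is $C_0 \prod \mu_{X_i}^{r_i}$ and $CV_Y^2 = \sum r_i^2 CV_{X_i}^2$. The function names are hypothetical, and the numbers are the compound-channel values of example 16.5 later in this chapter.

```python
import math

def foa_additive(coefs, means, cvs, exponents):
    """First-order mean and variance of Y = sum(C_i * X_i**r_i), independent X_i."""
    mean_y = sum(c * m**r for c, m, r in zip(coefs, means, exponents))
    var_y = sum((c * r * m**r * cv) ** 2
                for c, m, cv, r in zip(coefs, means, cvs, exponents))
    return mean_y, var_y

def foa_multiplicative(c0, means, cvs, exponents):
    """First-order mean and CV of Y = C0 * prod(X_i**r_i), independent X_i."""
    mean_y = c0 * math.prod(m**r for m, r in zip(means, exponents))
    cv_y = math.sqrt(sum((r * cv) ** 2 for cv, r in zip(cvs, exponents)))
    return mean_y, cv_y

# Stage 1: additive part, 296.9/n_c + 1.2/n_b.
mean_psi, var_psi = foa_additive([296.9, 1.2], [0.034, 0.068], [0.17, 0.38], [-1, -1])
cv_psi = math.sqrt(var_psi) / mean_psi

# Stage 2: treat the additive part as one factor of Q = lambda_m * psi * S**0.5.
mean_q, cv_q = foa_multiplicative(1.0, [1.0, mean_psi, 0.005],
                                  [0.15, cv_psi, 0.25], [1, 1, 0.5])
print(mean_psi, var_psi, cv_psi)
print(mean_q, cv_q, cv_q * mean_q)
```

The first stage reproduces the mean (8750) and variance (about 2203785) of the additive part given in example 16.5; the second stage is only a sketch of how the remaining steps would proceed under the same formulas.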
To estimate the reliability of a system, it is typically assumed that Z is normally distributed.
Using $p_Z(z)$ to be a normal distribution with its parameters E[Z] and $\sigma_Z$ determined by FOA,
equations 16.8 and 16.12 are used to determine the risk and reliability of a given system.
An alternative method to define a system reliability is the reliability index, $\beta$, which is defined
as the reciprocal of the coefficient of variation of Z, given as
$$\beta = \frac{E[Z]}{\sigma_Z} \qquad (16.31)$$
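Under the normality assumption, the risk follows directly from the reliability index, since $P_f = P(Z < 0) = \Phi(-\beta) = 1 - \Phi(\beta)$ with $\beta = E[Z]/\sigma_Z$. A minimal sketch using only the Python standard library is given below; the function names are hypothetical, and the value $\beta$ = 0.79 is the one quoted in example 16.6.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def risk_from_moments(mean_z, sd_z):
    """Reliability index and risk P(Z < 0), assuming Z is normally distributed."""
    beta = mean_z / sd_z
    risk = 1.0 - normal_cdf(beta)
    return beta, risk

# Any mean and standard deviation with ratio 0.79 give the same index.
beta, risk = risk_from_moments(0.79, 1.0)
print(f"beta = {beta:.2f}, risk = {risk:.4f}, reliability = {1 - risk:.4f}")
```

Because the result depends on the normal tail, it inherits the weakness of the normality assumption, particularly for small risks.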
The great advantage of FOA is its simplicity, requiring knowledge of only the first two statistical
moments of the basic variables and simple sensitivity calculations about selected central
values. FOA is an approximate method that may suffice for many applications (Ku 1966), but the
method does have several theoretical and/or conceptual shortcomings (Melching 1992a; Cheng
1982). The main weakness of the FOA method is that it is assumed that a single linearization of
the system performance function at the central values of the basic variables is representative of
the statistical properties of system performance over the complete range of basic input variables.
The accuracy of the estimates is influenced in part by the degree of nonlinearity in the functional
relationship, and the importance of higher-order terms which are truncated in the Taylor series
expansion (Burn and McBean, 1985). In applying FOA in risk and reliability analyses, it is generally
assumed that the performance function is normally distributed, which is seldom true. Any
attempt to characterize the tails of the actual distribution based on an assumption of normality is
likely to result in an inexact answer (Burn and McBean, 1985).
Example 16.4. Determine the first-order mean and standard deviation of the head loss function
given in example 16.1. Use the same mean and CV values as given in the preceding examples.
Solution: The FOA estimate for the mean, $\mu_{h_f}$, is calculated using equation 16.23 as
Example 16.5. Using Manning's equation, the flow in a compound channel is given as
$$Q = \lambda_m \left( \frac{Y_c}{n_c} + \frac{2 Y_b}{n_b} \right) S^{0.5}$$
where $\lambda_m$ is the model correction factor to account for model uncertainty, with mean and CV
values of 1.0 and 0.15, respectively. $Y_c$ and $Y_b$ represent the section factors ($Y_i = A R^{2/3}$) for the
main channel and overbank sections, respectively. Consider the section factors to be deterministic
($Y_c = 296.9$ m$^{8/3}$ and $Y_b = 0.6$ m$^{8/3}$), and $n_c$, $n_b$, and $S$ to be random variables with mean values
of 0.034, 0.068, and 0.005 and CV values of 0.17, 0.38, and 0.25, respectively. Determine the
mean and standard deviation of Q.
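Solution: Substituting values of $Y_c$ and $Y_b$, the expression for flow is rewritten as
$$Q = \lambda_m \left( 296.9\, n_c^{-1} + 1.2\, n_b^{-1} \right) S^{0.5} = \lambda_m\, \phi\, S^{0.5}$$
where $\phi$ is a dummy variable representing the additive form $\phi = 296.9\, n_c^{-1} + 1.2\, n_b^{-1}$. The
first-order mean and standard deviation of $\phi$ are calculated from equations 16.27 and 16.28 as
$$\hat{\mu}_\phi = \sum_{i=1}^{2} c_i\, \mu_{x_i}^{r_i} = 296.9(0.034)^{-1} + 1.2(0.068)^{-1} = 8750$$
and
$$\hat{\sigma}_\phi^2 = \sum_{i=1}^{2} c_i^2 r_i^2\, \mu_{x_i}^{2 r_i}\, CV_{x_i}^2 = 296.9^2(-1)^2(0.034)^{2(-1)}(0.17)^2 + (1.2)^2(-1)^2(0.068)^{2(-1)}(0.38)^2 = 2203785.2$$
so that $\hat{\sigma}_\phi = 1484.5$.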
Now, $Q = \lambda_m\, \phi\, S^{0.5}$ can be considered as a multiplicative form, and equations 16.23 and 16.25 can
be used to determine the overall FOA mean and CV of Q. The FOA estimate for the mean, $\hat{\mu}_Q$, is
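The arithmetic of example 16.5 can be checked with a few lines of Python. This is a minimal sketch that assumes independent basic variables and component power functions of the form $c\,x^r$. The helper name `foa_additive_power` is hypothetical. The first-order mean of Q is taken here simply as the flow expression evaluated at the mean values, which is how a first-order mean is defined.

```python
import math

def foa_additive_power(terms):
    """FOA mean and variance of phi = sum(c * x**r) for independent x.

    Each term is a tuple (c, r, mean_x, cv_x).
    """
    mean = sum(c * mu ** r for c, r, mu, cv in terms)
    var = sum((c * r) ** 2 * mu ** (2 * r) * cv ** 2 for c, r, mu, cv in terms)
    return mean, var

# Main channel and overbank terms of phi = 296.9/n_c + 1.2/n_b
terms = [
    (296.9, -1.0, 0.034, 0.17),
    (1.2, -1.0, 0.068, 0.38),
]
mu_phi, var_phi = foa_additive_power(terms)
sd_phi = math.sqrt(var_phi)

# First-order mean of Q = lambda * phi * S**0.5 evaluated at the mean values
mean_lambda, mean_slope = 1.0, 0.005
mu_q = mean_lambda * mu_phi * mean_slope ** 0.5

print(round(mu_phi, 1), round(var_phi, 1), round(sd_phi, 1), round(mu_q, 1))
```

The mean and variance of the additive form reproduce the values 8750 and 2203785.2 obtained in the solution.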
Example 16.6. For a storm sewer, peak runoff, $Q_L$, is given by the rational formula as:
The definition and statistical characteristics of uncertain variables are listed in table 16.4.
Determine the risk.
To determine the first-order mean and standard deviation of Z, first the means and CVs of $Q_C$ and
$Q_L$ are determined using multiplicative formulas as
Using these estimates, the mean and standard deviation of Z are determined considering it as an
additive form. Using equation 16.27
Table 16.4. Statistical data for storm sewer design
Using equation 16.31, the reliability index, $\beta$, is 0.79 and the corresponding risk (equation 16.16,
assuming a normal distribution) is
where z is the standard normal variate defined as $z = (X - \mu_x)/\sigma_x$, and $\Phi(z)$ is the standard
normal cumulative distribution function.
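When Z is assumed normal, the risk $P(Z < 0)$ follows directly from the reliability index as $\Phi(-\beta) = 1 - \Phi(\beta)$. The short Python sketch below evaluates this for $\beta = 0.79$ using only the standard library. The function names are illustrative.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def risk_from_reliability_index(beta):
    """Risk P(Z < 0) for a normally distributed Z with beta = E[Z] / sigma_Z."""
    return std_normal_cdf(-beta)

beta = 0.79
risk = risk_from_reliability_index(beta)
reliability = 1.0 - risk

print(round(risk, 3), round(reliability, 3))
```

For $\beta = 0.79$ this gives a risk of about 0.215 and a reliability of about 0.785 under the normality assumption.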
Example 16.7. For the preceding example, determine the risk using the following definition of
the performance function, $Z = Q_C/Q_L - 1$.
where $\psi = Q_C/Q_L$, also known as the safety factor. The first-order mean and standard deviation of $\psi$
are determined using formulas corresponding to multiplicative forms.
where $\rho_i$ is the correlation coefficient between the ith parameter and the output as defined in equation
16.6.
Another simulation technique similar to MCS is Latin hypercube sampling (LHS), in
which a stratified sampling approach is used. In LHS the probability distribution of each basic
variable is subdivided into m nonoverlapping intervals, each with equal probability (1/m).
Random values of the basic variables are simulated such that each interval is sampled only once. The
order of selection of the intervals is randomized, and the model is executed m times with a
random combination of values, one from each interval of each basic variable. The output statistics
and distributions may then be approximated from the sample of m output values. McKay et al.
(1979) have shown that the stratified sampling procedure of LHS converges more quickly than the
equidistribution sampling employed in MCS. The main shortcoming of this stratification scheme
is that it is one-dimensional and does not provide good uniformity properties on a k-dimensional
unit hypercube (Diwekar and Kalagnanam 1997). Apart from reducing the computational effort to
some extent, LHS has the same problems that are associated with MCS.
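The stratification described above can be sketched in a few lines of Python. The sketch assumes independent basic variables and works on the unit hypercube. Each coordinate is then mapped to its own distribution through the inverse CDF. The function name `latin_hypercube` and the normal distribution parameters are illustrative assumptions.

```python
import random
from statistics import NormalDist

def latin_hypercube(m, k, rng):
    """Return m points in the k-dimensional unit hypercube.

    Along every dimension, each of the m equal-probability intervals
    receives exactly one point.
    """
    columns = []
    for _ in range(k):
        strata = list(range(m))
        rng.shuffle(strata)  # randomize the order in which intervals are used
        columns.append([(s + rng.random()) / m for s in strata])
    return [tuple(col[i] for col in columns) for i in range(m)]

rng = random.Random(1)
m, k = 10, 2
sample = latin_hypercube(m, k, rng)

# Map the first coordinate to a normal variable (illustrative mean and
# standard deviation) by evaluating its inverse CDF.
x = [NormalDist(mu=1.0, sigma=0.15).inv_cdf(u[0]) for u in sample]

for point in sample:
    print(tuple(round(u, 3) for u in point))
```

Because the intervals of each variable are shuffled independently, the pairing between variables is random, which is the one-dimensional character of the stratification noted above.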
Example 16.8. Using MCS, determine the mean and variance of head loss for example 16.3.
Also determine the contribution due to each input parameter.
Solution: Using MCS, means and standard deviations of head loss are determined for different
numbers of simulations as shown in figure 16.3. The MCS estimates for the mean and standard
deviation of head loss with 20,000 simulations were obtained as 595.5 m and 270.2 m, respectively.
Equation 16.32 and the regression of example 16.3 show that $\lambda_m$, Q, C, and D contributed 7.9,
20.8, 39.3, and 32.0 percent, respectively, of the overall variance.
[Figure 16.3. MCS estimates of the mean and standard deviation of head loss plotted against the number of simulations (0 to 20,000).]
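The MCS procedure of example 16.8 can be sketched in Python. The sketch applies it to the compound-channel flow of example 16.5 rather than to the head loss function, which is defined elsewhere in the chapter. The distributions are those assumed in example 16.9 (normal model correction factor, uniform roughness coefficients, lognormal slope). The variables are taken as independent, and the seed and number of simulations are arbitrary choices.

```python
import math
import random
import statistics

def simulate_flows(n_sim, seed=12345):
    """Monte Carlo sample of Q = lambda * (296.9/n_c + 1.2/n_b) * S**0.5."""
    rng = random.Random(seed)
    # A uniform variable with mean mu and CV has half-width sqrt(3) * CV * mu
    half_c = math.sqrt(3.0) * 0.17 * 0.034
    half_b = math.sqrt(3.0) * 0.38 * 0.068
    # Lognormal parameters of S from its mean (0.005) and CV (0.25)
    var_ln_s = math.log(1.0 + 0.25 ** 2)
    mu_ln_s = math.log(0.005) - 0.5 * var_ln_s
    flows = []
    for _ in range(n_sim):
        lam = rng.gauss(1.0, 0.15)
        n_c = rng.uniform(0.034 - half_c, 0.034 + half_c)
        n_b = rng.uniform(0.068 - half_b, 0.068 + half_b)
        slope = rng.lognormvariate(mu_ln_s, math.sqrt(var_ln_s))
        flows.append(lam * (296.9 / n_c + 1.2 / n_b) * math.sqrt(slope))
    return flows

flows = simulate_flows(100_000)
mean_q = statistics.fmean(flows)
sd_q = statistics.pstdev(flows, mu=mean_q)

print(round(mean_q, 1), round(sd_q, 1))
```

With a sample of this size the estimates should fall close to the mean and standard deviation of about 633 and 168 m$^3$/s reported for the same problem in example 16.9. The exact figures depend on the seed.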
where $E[\cdot]$ is the expectation operator, and $\mu_{Y_i}$ is the mean of the ith power function.
Equation 16.35 shows that the output uncertainty of a multiplicative model is governed by the
most uncertain component function. Using the additive form (equation 16.26), the mean of Z, $\mu_Z$,
is given by
where c and r are constants. The FOA estimate for the mean (Benjamin and Cornell 1970), $\hat{\mu}_Y$, is
$$\hat{\mu}_Y = c\, \mu_x^{r}$$
The FOA estimate for the variance of Y (Benjamin and Cornell 1970), $\hat{\sigma}_Y^2$, is
$$\hat{\sigma}_Y^2 = c^2 r^2\, \mu_x^{2r}\, CV_x^2$$
These estimates for $\mu_Y$ and $\sigma_Y^2$ contain errors. The exact value of any moment can be computed as
$$\text{Exact value} = \frac{\text{FOA estimate}}{1 - \varepsilon(\cdot)}$$
where $\varepsilon(\cdot)$ is the relative error in a moment estimated using FOA. Analytical relationships for
$\varepsilon(\cdot)$ in FOA estimates for the means and the variances of component functions were developed
(Tyagi 2000) for generic power and exponential functions using five common distributions.
These analytical expressions can be used as a guide for judging the suitability of the FOA by
determining the relative errors in the most sensitive parameters. Further, when the relative error is
more than the acceptable error, these analytical relationships enable one to correct FOA
estimates for means and variances of model components to their true values. Using these corrected
values of means and variances for model components, one can determine the exact values of the
mean and variance of an overall model output. Tables 16.5 and 16.6 present the developed
expressions for $\varepsilon(\hat{\mu}_Y)$ and $\varepsilon(\hat{\sigma}_Y^2)$ for a generic power function ($Y = cX^r$). Similarly, tables 16.7
and 16.8 present the developed expressions for $\varepsilon(\hat{\mu}_Y)$ and $\varepsilon(\hat{\sigma}_Y^2)$ for a generic exponential
function ($Y = be^{cX}$).
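The correction step can be illustrated for the mean of a power function of a uniformly distributed variable. The Python sketch below uses the two roughness terms of example 16.5 and assumes uniform distributions for both roughness coefficients. The relative error expression in the code is derived here from the raw moments of a uniform variable, so it should be checked against table 16.5 before use. The function names are illustrative.

```python
import math

def foa_mean_power(c, r, mu):
    """FOA mean of Y = c * X**r, evaluated at the mean of X."""
    return c * mu ** r

def rel_error_mean_power_uniform(r, cv):
    """Relative error of the FOA mean of Y = c * X**r for uniform X.

    Derived from E[X**r] of a uniform variable (an assumption of this
    sketch). The expression is singular at r = -1, so r is shifted to
    -0.9999 there.
    """
    if abs(r + 1.0) < 1e-12:
        r = -0.9999
    w = math.sqrt(3.0) * cv
    exact_over_foa = ((1 + w) ** (r + 1) - (1 - w) ** (r + 1)) / (2 * w * (r + 1))
    return 1.0 - 1.0 / exact_over_foa

# (c, r, mean, CV) for the main channel and overbank terms
components = [(296.9, -1.0, 0.034, 0.17), (1.2, -1.0, 0.068, 0.38)]

corrected_mean = 0.0
for c, r, mu, cv in components:
    foa = foa_mean_power(c, r, mu)
    eps = rel_error_mean_power_uniform(r, cv)
    corrected_mean += foa / (1.0 - eps)  # exact value = FOA estimate / (1 - eps)

print(round(corrected_mean, 1))
```

The corrected mean of the additive form comes out near 9019.9, compared with the uncorrected first-order value of 8750.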
To further simplify the correction procedure, these analytical relationships have been
presented graphically by Tyagi (2000). The relative error plots show where FOA estimates are
acceptable and where they are unacceptable and need to be corrected. In specific situations, a given
Table 16.5. Generalized relative error in FOA predicted mean of a power function
Uniform
Lognormal
Gamma
Exponential
Note:
(1) To avoid singularity at $r = -1$, r should be taken as $-0.9999$.
(2) To avoid singularity at $r = -2$, r should be taken as $-1.9999$.
Table 16.6. Generalized relative error in FOA predicted variance of a power function
Uniform
$$\varepsilon(\hat{\sigma}_Y^2) = 1 - \frac{12(2r+1)\, r^2 (r+1)^2\, CV_x^4}{2\sqrt{3}\, CV_x (r+1)^2 \left[ (1+\sqrt{3}\,CV_x)^{2r+1} - (1-\sqrt{3}\,CV_x)^{2r+1} \right] - (2r+1) \left[ (1+\sqrt{3}\,CV_x)^{r+1} - (1-\sqrt{3}\,CV_x)^{r+1} \right]^2}$$
Symmetrical triangular
$$\varepsilon(\hat{\sigma}_Y^2) = 1 - \frac{36(2r+1)\, r^2 (r+1)^2 (r+2)^2\, CV_x^6}{3(r+1)(r+2)^2\, CV_x^2 \left[ (1+\sqrt{6}\,CV_x)^{2r+2} + (1-\sqrt{6}\,CV_x)^{2r+2} - 2 \right] - (2r+1) \left[ (1+\sqrt{6}\,CV_x)^{r+2} + (1-\sqrt{6}\,CV_x)^{r+2} - 2 \right]^2}$$
Lognormal
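Table 16.7. Generalized relative error in FOA predicted mean of an exponential function
Uniform
Symmetrical triangular
Normal
Gamma
$$\varepsilon(\hat{\mu}_Y) = 1 - \left(1 - c\,\mu_x CV_x^2\right)^{1/CV_x^2} \exp(c\,\mu_x)$$
Exponential
$$\varepsilon(\hat{\mu}_Y) = 1 - (1 - c\,\mu_x) \exp(c\,\mu_x)$$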
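Table 16.8. Generalized relative error in FOA predicted variance of an exponential function
Uniform
$$\varepsilon(\hat{\sigma}_Y^2) = 1 - \frac{12\, c^4 \mu_x^4 CV_x^4\, e^{2\sqrt{3}\, c \mu_x CV_x}}{\left(e^{2\sqrt{3}\, c \mu_x CV_x} - 1\right) \left[ \left(\sqrt{3}\, c \mu_x CV_x + 1\right) + e^{2\sqrt{3}\, c \mu_x CV_x} \left(\sqrt{3}\, c \mu_x CV_x - 1\right) \right]}$$
Symmetrical triangular
$$\varepsilon(\hat{\sigma}_Y^2) = 1 - \frac{72\, c^6 \mu_x^6 CV_x^6\, e^{2\sqrt{6}\, c \mu_x CV_x}}{\left(e^{\sqrt{6}\, c \mu_x CV_x} - 1\right)^2 \left[ \left(3 c^2 \mu_x^2 CV_x^2 - 2\right) \left(e^{2\sqrt{6}\, c \mu_x CV_x} + 1\right) + 2 e^{\sqrt{6}\, c \mu_x CV_x} \left(3 c^2 \mu_x^2 CV_x^2 + 2\right) \right]}$$
Normal
$$\varepsilon(\hat{\sigma}_Y^2) = 1 - \frac{c^2 \sigma_x^2}{\exp(c^2 \sigma_x^2) \left[ \exp(c^2 \sigma_x^2) - 1 \right]}$$
Gamma
$$\varepsilon(\hat{\sigma}_Y^2) = 1 - \frac{c^2 \mu_x^2 CV_x^2 \exp(2 c \mu_x)}{\left(1 - 2 c \mu_x CV_x^2\right)^{-1/CV_x^2} - \left(1 - c \mu_x CV_x^2\right)^{-2/CV_x^2}}$$
Exponential
$$\varepsilon(\hat{\sigma}_Y^2) = 1 - \frac{c^2 \mu_x^2 \exp(2 c \mu_x)}{(1 - 2 c \mu_x)^{-1} - (1 - c \mu_x)^{-2}}$$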
function may be very nonlinear (represented either by a very large or very small exponent of a
power function). These situations can be identified and dealt with by using the relative error plots.
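The relative error expressions can also be evaluated directly rather than read from plots. As a hedged illustration, the Python sketch below takes the normal-distribution case for the variance of an exponential function $Y = be^{cX}$, for which the relative error is $1 - c^2\sigma_x^2 / \{\exp(c^2\sigma_x^2)[\exp(c^2\sigma_x^2) - 1]\}$, and confirms that dividing the FOA variance by $1 - \varepsilon$ recovers the exact (lognormal) variance. The parameter values are arbitrary and the function names are illustrative.

```python
import math

def foa_var_exponential(b, c, mu, sigma):
    """First-order variance of Y = b * exp(c * X): (dY/dX at mu)**2 * sigma**2."""
    return (b * c * math.exp(c * mu)) ** 2 * sigma ** 2

def exact_var_exponential_normal(b, c, mu, sigma):
    """Exact variance of Y = b * exp(c * X) for normal X (Y is lognormal)."""
    s2 = (c * sigma) ** 2
    return b ** 2 * math.exp(2.0 * c * mu + s2) * (math.exp(s2) - 1.0)

def rel_error_var_exponential_normal(c, sigma):
    """Relative error of the FOA variance for normal X."""
    s2 = (c * sigma) ** 2
    return 1.0 - s2 / (math.exp(s2) * (math.exp(s2) - 1.0))

# Arbitrary illustrative values (assumptions, not from the text)
b, c, mu, sigma = 2.0, 0.5, 3.0, 0.8

foa_var = foa_var_exponential(b, c, mu, sigma)
eps = rel_error_var_exponential_normal(c, sigma)
corrected_var = foa_var / (1.0 - eps)
exact_var = exact_var_exponential_normal(b, c, mu, sigma)

print(round(eps, 4), round(corrected_var, 3), round(exact_var, 3))
```

For these values the FOA variance is about 21 percent too low, and the corrected value matches the exact one to rounding error.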
In the absence of knowledge of the complete pdf, but with the mean and variance known
exactly, bounds on the probability that an output random variable lies within given
limits can be obtained using the Chebyshev inequality (equation 3.77), which states that
where t is a constant. In example 16.7, the safety factor is defined as $\psi = R/L = Q_C/Q_L$. Then,
the greatest lower bound of the system reliability [here, the reliability is $P(\psi \ge 1)$] is given (Huang 1986) as
Considering example 16.7, the greatest lower bound of the probability of safety is 0.347; that is,
the reliability is at least 0.347. This result may be used as a reference value.
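For reference, the general two-sided Chebyshev inequality guarantees $P(|X - \mu_x| < t\sigma_x) \ge 1 - 1/t^2$ for any distribution with finite variance. The Python sketch below evaluates this bound, which may be written differently from equation 3.77 and is not the bound of Huang (1986) used above. It also compares the bound with the exact probability for an exponential variable with unit mean, chosen only as an illustration.

```python
import math

def chebyshev_lower_bound(t):
    """Lower bound on P(|X - mu| < t * sigma), valid for any distribution."""
    if t <= 0:
        raise ValueError("t must be positive")
    return max(0.0, 1.0 - 1.0 / t ** 2)

bounds = {t: chebyshev_lower_bound(t) for t in (1.0, 1.5, 2.0, 3.0)}

# Exact probability for an exponential variable with mean 1 and sigma 1:
# P(|X - 1| < 2) = P(X < 3) = 1 - exp(-3)
exact_exponential = 1.0 - math.exp(-3.0)

for t, bound in bounds.items():
    print(t, round(bound, 3))
print(round(exact_exponential, 3))
```

The bound at t = 2 is 0.75, while the exact probability for the exponential case is about 0.95. This shows why such distribution-free bounds are conservative and serve mainly as reference values.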
Example 16.9. Estimate the mean and standard deviation of Q given in example 16.5 assuming a
normal distribution for $\lambda_m$, a uniform distribution for $n_c$ and $n_b$, and a lognormal distribution for S.
Solution: To determine exact values of the mean and standard deviation of Q, FOA
estimates for the mean and variance of the component power functions are first computed using equations
16.38 and 16.39. Then, using the relative error functions corresponding to the given distributions from
tables 16.5 and 16.6, the FOA estimates are corrected. The calculation procedure is presented in
table 16.9.
Using equation 16.35 and the corrected means of the individual power functions from
table 16.9, the exact mean of the additive form, $\mu_\phi$, is 9019.9 m$^3$/s. Similarly, using equation 16.36
and the corrected variances of the component power functions, $\sigma_\phi$ is 1586.2 m$^3$/s. The corresponding
$CV_\phi$ is 0.176. Now, treating Q as a multiplicative form with $\lambda_m$, $\phi$, and $S^{0.5}$ as its components with
known means and CV values, $\mu_Q$ is 632.99 m$^3$/s from equation 16.31, and $CV_Q$ is 0.265 from
equation 16.33. Thus, $\sigma_Q$ is 167.73 m$^3$