
THIRD EDITION

An Introduction to
STATISTICAL PROBLEM SOLVING
in
Geography

J. Chapman McGrew, Jr.
Salisbury University

Arthur J. Lembo, Jr.
Salisbury University

Charles B. Monroe
University of Akron

Graphics by John Patrick Soderstrom

WAVELAND PRESS, INC.
Long Grove, Illinois

For information about this book, contact:
Waveland Press, Inc.
4180 IL Route 83, Suite 101
Long Grove, IL 60047-9580
(847) 634-0081
info@waveland.com
www.waveland.com

Copyright © 2014 by J. Chapman McGrew, Jr., Arthur J. Lembo, Jr., and Charles B. Monroe

10-digit ISBN: 1-4786-1119-7


13-digit ISBN: 978-1-4786-1119-6

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without permission in writing from the publisher.

Printed in the United States of America

7 6 5 4 3 2 1
Contents

Preface v
Acknowledgments vii

PART I: BASIC STATISTICAL CONCEPTS IN GEOGRAPHY 1
1 Introduction: The Context of Statistical Techniques 3
  1.1 The Role of Statistics in Geography 4
  1.2 Examples of Statistical Problem Solving in Geography 8
2 Geographic Data: Characteristics and Preparation 21
  2.1 Selected Dimensions of Geographic Data 21
  2.2 Levels of Measurement 23
  2.3 Measurement Concepts 25
  2.4 Basic Classification Methods 27

PART II: DESCRIPTIVE PROBLEM SOLVING IN GEOGRAPHY 37
3 Descriptive Statistics and Graphics 39
  3.1 Measures of Central Tendency 39
  3.2 Measures of Dispersion and Variability 44
  3.3 Measures of Shape or Relative Position 52
  3.4 Selected Issues: Spatial Data and Descriptive Statistics 53
4 Descriptive Spatial Statistics 62
  4.1 Spatial Measures of Central Tendency 63
  4.2 Spatial Measures of Dispersion 69

PART III: THE TRANSITION TO INFERENTIAL PROBLEM SOLVING 75
5 Basics of Probability and Discrete Probability Distributions 77
  5.1 Basic Probability Processes, Terms, and Concepts 77
  5.2 The Binomial Distribution 81
  5.3 The Geometric Distribution 83
  5.4 The Poisson Distribution 85
6 Continuous Probability Distributions and Probability Mapping 93
  6.1 The Normal Distribution 93
  6.2 Probability Mapping 96
7 Basic Elements of Sampling 101
  7.1 Sampling Concepts 101
  7.2 Types of Probability Sampling 108
  7.3 Spatial Sampling 111
8 Estimation in Sampling 117
  8.1 Basic Concepts in Estimation 117
  8.2 Confidence Intervals and Estimation 121
  8.3 Geographic Examples of Confidence Interval Estimation 126
  8.4 Sample Size Selection 134

PART IV: INFERENTIAL PROBLEM SOLVING IN GEOGRAPHY 139
9 Elements of Inferential Statistics 141
  9.1 Terms and Concepts in Hypothesis Testing: One-Sample Difference of Means Test 142
  9.2 One-Sample Difference of Proportions Test 148
  9.3 Selected Issues in Inferential Testing 150
10 Two-Sample and Dependent-Sample (Matched-Pairs) Difference Tests 155
  10.1 Two-Sample Difference of Means Tests 155
  10.2 Two-Sample Difference of Proportions Test 163
  10.3 Dependent-Sample (Matched-Pairs) Difference Tests 168
11 Three-or-More-Sample Difference Tests: Analysis of Variance Methods 174
  11.1 Analysis of Variance (ANOVA) 174
  11.2 Kruskal-Wallis Test 177
  11.3 Example Applications in Geography 178
12 Categorical Difference Tests 187
  12.1 Goodness-of-Fit Tests 187
  12.2 Contingency Table Analysis 195

PART V: INFERENTIAL SPATIAL STATISTICS 203
13 General Issues in Inferential Spatial Statistics 205
  13.1 Types of Spatial Pattern 205
  13.2 The Concept of Autocorrelation 206
  13.3 Who Is My Neighbor? 208
14 Point Pattern Analysis 210
  14.1 Nearest Neighbor Analysis 210
  14.2 Quadrat Analysis 216
15 Area Pattern Analysis 222
  15.1 Join Count Analysis 223
  15.2 Moran's I Index (Global) 229
  15.3 Moran's I Index (Local) 233

PART VI: STATISTICAL RELATIONSHIPS BETWEEN VARIABLES 237
16 Correlation 239
  16.1 The Nature of Correlation 240
  16.2 Association of Interval-Ratio Variables 242
  16.3 Association of Ordinal Variables 247
17 Simple Linear Regression 252
  17.1 Form of Relationship in Simple Linear Regression 253
  17.2 Strength of Relationship in Simple Linear Regression 256
  17.3 Residual or Error Analysis in Simple Linear Regression 258
  17.4 Inferential Use of Regression 261
  17.5 Example: Simple Linear Regression: Lake Effect Snow in Northeastern Ohio 264
18 Examples of Multivariate Problem Solving in Geography 269
  18.1 The Basics of Multiple Regression 269
  18.2 The Basics of Cluster Analysis 277

PART VII: EPILOGUE 287
19 Problem Solving and Policy Determination in Practical Geographic Situations 289
  19.1 Geographic Problem Solving and Policy Situations 289
  19.2 Answers to Geographic Problems and Policy Situations 301

Appendix: Statistical Tables 303
Index 315
Preface

When contemplating a third edition of this text, we reexamined an interrelated set of questions that we have been asking for years:

• What is the most effective way to introduce statistical methods and techniques to undergraduate and beginning graduate geography students?
• How can we best demonstrate the usefulness and practical benefits of statistics to geography students in a way that holds their interest and generates enthusiasm?
• Can students who believe they are weak in math and statistics achieve a strong fundamental competency in the use of statistical methods?

We believe that these questions can all be answered successfully if the statistical procedures are presented in the context of solving real-world geography problems, problems that already interest geography students. As much as possible, our focus centers on actual geographic problems and issues. Previous editions of this text already pointed strongly in this direction, but we are convinced more than ever that "geographic problem-solving" is the best guiding principle and contextual framework through which to present statistical techniques.

As a result, in this third edition you will see even more emphasis placed on the development of descriptive statements and statistical hypotheses that relate directly to contemporary problems in geography. Chapter 1 illustrates this "geography-first" approach by discussing several contemporary spatial patterns, all presented in the context of the scientific research process. Right from the beginning, students are exposed to critical geographic issues including the distribution of obesity levels across the United States, the global variation of life expectancy levels, the timing of the last spring frost across the southeastern U.S., and the recent changes in population patterns for both states and counties in the U.S. These issues continue to appear as geographic variables to illustrate new statistical procedures as they are introduced throughout the text.

With attention centered on real-world geographic problem-solving using statistics, our text presents more than 50 map examples throughout the following pages. It is worthwhile comparing this text to others that deal with statistics in geography, where maps are virtually nonexistent. We believe that geography students will learn how to use statistics more effectively when presented with real, contemporary problems and maps.

Generally speaking, what must a geography student learn to become skilled in applying statistics to solve real-world problems? While something of an oversimplification, two essential skills should be mastered. First, when presented with a geographic situation or problem to solve, you must be able to select the correct statistical technique (or set of techniques) that allow you to approach that situation or problem in the most effective and productive way. Second, when you have results from a statistical analysis (the computer output, for example), you must fully and properly interpret that output to reach the correct conclusions. Simply stated, you must fully understand how to interpret the results of statistical analysis and use that understanding to recommend appropriate geographic policies and plans.

What began as a fresh coat of paint on what we considered a good book ended with a complete rewrite over a 3-year period. While the general chapter-by-chapter organization remains relatively unchanged, most of the examples are different. In addition, a number of new statistical methods are presented for the first time. More than a decade has passed since the second edition was published, and many aspects of statistical problem-solving in geography have changed dramatically. The significant advances in Geographic Information Sciences (GIS), for example, have altered the ways geographers conduct research. Consequently, we have placed greater emphasis on a variety of spatial statistics that focus entirely on geographic patterns. In fact, an entire segment of the book (Part Five) is now devoted to inferential spatial statistics, including more detailed discussions of spatial autocorrelation and related concepts such as variograms. Also presented for the first time is a separate chapter dealing specifically with two key multivariate techniques: multivariate regression models and cluster analysis. Quite simply, it is fair to say that this is a virtual rewrite of the second edition.

In addition to maintaining the successful general organization of the previous editions, we also keep several features that characterized the earlier editions. We continue to stress the importance of written narratives that clearly explain each statistical technique in ways that undergraduate and beginning graduate students in geography can understand. This is accomplished without compromising or oversimplifying the statistical integrity of the material. We also continue to be enthusiastic advocates of the exploratory, investigative approach. Therefore, in many parts of the text you will see the calculation of descriptive summaries and graphics used as intermediate steps in a multi-stage geographic research process.

In some cases, we continue to use inferential statistics as exploratory and descriptive tools to reinforce the basic concept of learning as much as possible about a situation in geography. More specifically, we demonstrate how the magnitudes of multiple test statistic values can be compared to gain geographic insights, even though the statistical test is not being used to make formal probabilistic statements.

Also, in several instances, we "pair together" a slightly stronger parametric inferential test (such as ANOVA) with a slightly weaker non-parametric test (such as Kruskal-Wallis). This "pairing" allows a researcher to gather as much useful information as possible about a geographic problem or situation from appropriate statistical techniques. For example, in some cases it may be uncertain if the data fully meet all of the required assumptions needed to run a stronger parametric test. If both a parametric and corresponding nonparametric test are run, we can examine the differences in test statistic results and better evaluate the comparative advantages and disadvantages each test offers.

We expect you will use a statistical software package to do most of the analysis, rather than making extensive calculations manually. Nevertheless, when presenting many of the statistical techniques, we feel the knowledge and understanding gained through showing the basic steps in the calculation procedure will help you better understand the goals and objectives of the technique.

A CD is available with this text. The CD contains virtually all of the major data sets we use in the various problems and examples. For additional information related to the CD, please contact Waveland Press through their website at www.waveland.com.

Finally, upon review of the final product we realize that too much information is included to cover everything in a single undergraduate course in statistics and geography. We defer to an instructor's best judgment as to the proper pace to follow when using the text in a course. Nonetheless, we encourage instructors to strive for a positive student experience in the hope that an appreciation for the role of statistics in geography will encourage the student to become a lifetime learner in the discipline. One possible strategy for instructors wishing to reduce the volume of material presented in a single course is to consider using certain chapters in more advanced courses. For instance, the inferential spatial statistics in Part V (chapters 13-15) might be introduced in an undergraduate GIS course. Similarly, the multiple regression and cluster analysis material in chapter 18 could be presented to students in an advanced statistical applications course in geography.
Acknowledgments

Many people have helped with the development of the third edition of this book. As we point out in the preface, this edition is fundamentally a thorough rewrite of the second edition. Of particular significance are the many new examples and considerable strengthening of the maps and graphic support.

In this context, certain individuals deserve special recognition. John Patrick Soderstrom created or revised earlier versions of all the figures and tables that appear in the book. This was a complicated and demanding task, which Patrick performed at the highest standard of excellence. The success of this book will be due in part to the exemplary quality of his graphics and ancillary material.

Also deserving particular mention is Dr. Kevin A. Butler. In the earlier, formative stages of book development and organization, Kevin played a critical role. He created the original ideas for a number of the geographic problems new to this edition, including: the linear directional mean of hurricane and tropical storm tracks across Puerto Rico; the standard deviational ellipse of anti-shipping (pirating) activities off the east coast of Africa; nearest neighbor analysis of public services in Toronto, Canada; and Moran's I analysis of racial and ethnic groups in Cleveland, Ohio.

A number of faculty members commented critically on aspects of book organization and development of geographic examples. Particular thanks are extended to Dr. Darren Parnell, who graciously provided original data for last spring frost dates at various weather stations in the Southeastern United States.

Several students and former students at Salisbury University also contributed to the creation of geographic examples, data collection, or statistical analysis. Particularly, we would like to recognize the following individuals: Danielle Bruner, Court McGrew, and Denise Tweedale.

The personnel at Waveland Press have been incredibly supportive throughout the entire process. Don Rosso, an Editor and the Production Manager at Waveland, has always been available if we had a question or issue that needed to be addressed, and he has constantly maintained an enthusiastic and positive attitude about our project. Perhaps as important as anything else, he encouraged us to keep improving the manuscript, even if it meant lengthening the amount of time involved. Producing the finest product has always been his stated goal. Dakota West has done a fine job in editing the manuscript and preparing it for distribution. We thank them for their efforts.

Finally, we would be remiss to overlook the support of our respective families. This project took more time, and was substantially more work, than any of us originally estimated. Without the support of our families, we could not have devoted the time required to produce a quality textbook.
PART I

BASIC STATISTICAL CONCEPTS IN GEOGRAPHY

1
Introduction
The Context of Statistical Techniques

1.1 The Role of Statistics in Geography


1.2 Examples of Statistical Problem Solving in Geography

Geography is an integrative spatial science that attempts to explain and predict the spatial distribution and variation of human activity and physical features on earth's surface. Geographers study how and why things differ from place to place, as well as how spatial patterns change through time. Of particular interest to geographers are the relationships between human activities and the environment and the connections between people and places. People have been interested in such geographic concerns for thousands of years. Early Greek writers such as Eratosthenes and Strabo emphasized the earth's physical structure, its human activity patterns, and the relationships between them.

These traditions remain central to the discipline of geography today. Contemporary geography continues to be an exciting discipline that attempts to solve a variety of problems and issues from spatial and ecological perspectives. The spatial perspective focuses on patterns and processes on the earth's surface, and the ecological perspective focuses on the complex web of relationships between living and nonliving elements on the earth's surface (Geography Education Standards Project, 1994).

The geographer starts by asking where questions. Where are things located on the earth's surface? Where are features distributed on the physical or cultural landscape? What spatial patterns are observable, and how do phenomena vary from location to location? Historically, geographers focused on trying to answer these questions. In fact, the popular image of the discipline of geography probably remains almost exclusively focused on the location of places.

The recent dramatic increase in the use of Geographic Information Sciences (GIS) technology and the Internet make the assembly of relevant data for traditional geographic analysis even easier. Not only are GIS software packages easier than ever to use, but an almost endless supply of data are now collected and distributed by stakeholders worldwide. Nevertheless, although GIS and the Internet provide additional new sets of data and technologies, it is still the responsibility of geographers to evaluate this spatial information properly and to make intelligent decisions. Unfortunately, unprecedented access to geographic data without good quantitative training can often lead to suboptimal (or even poor) decision-making. The accomplished GIS technician therefore needs to properly understand statistical problem solving in geography.

Clearly, professional geographers are no longer content to limit themselves to spatial or locational description. After a spatial pattern is described and the where questions are adequately answered, attention shifts to why questions. Why does a particular spatial pattern exist? Why does a locational pattern vary in a specific observable way? What spatial or ecological processes have affected a pattern? Why are these processes operating? As such questions are answered, or as speculation occurs about why a spatial pattern has a particular distribution, we gain a better understanding of the processes that create the pattern. Sometimes we find different variables related spatially, thus providing insights into underlying spatial processes. In other instances, geographers try to determine if circumstances differ between
locations or regions and seek to understand why such differences exist.

In recent decades, geographers are increasingly concerned with the practical applications and policy implications of this spatial information. Frequently, we ask what to do questions that involve the development of spatial policies and plans. More geographers now want to be active participants in both public and private decision-making. Some questions geographers might explore include: What type of policy might best achieve more equal access for urban residents to city services and facilities? What should be done to recommend a balance of wetlands protection and economic development in a fragile environment?

Geography is now a problem-solving discipline, and we are concerned with applying spatial knowledge and understanding to the problems facing the world today. Noted geographer Risa Palm once stated, "[G]eography involves the study of major problems facing humankind such as environmental degradation, unequal distribution of resources and international conflicts. It prepares one to be a good citizen and educated human being" (Association of American Geographers, n.d.).

1.1 THE ROLE OF STATISTICS IN GEOGRAPHY

The term statistics is generally defined as the collection, classification, presentation, and analysis of numerical data. Statistical techniques and procedures are applied in all fields of academic research. In fact, whenever data are collected and summarized or whenever numerical information is analyzed or scientific research conducted, statistics are needed for sound analysis and interpretation of results.

Beyond the widespread application of statistics in academic research, use of statistics is also seen in many aspects of everyday life. Just consider for a moment this sampling of common applications of statistics:

1. In sports, statistics are frequently cited when measuring both individual and team performance. The success of a team (and the job security of the coach!) is often gauged by the team's winning percentage. Baseball fans are well-versed in such individual player statistics as batting average, earned run average, and slugging percentage. It is not uncommon to see a well-performing pitcher removed during a critical time in the game (to the casual baseball fan, this removal may seem inexplicable) because a left-handed or right-handed batter is coming to the plate. The manager has very likely based his decision to remove the pitcher on a statistical analysis of the pitcher's performance in past similar situations. Goalies in hockey may be benched if their "goals against average" gets too high. Quarterbacks in professional football are evaluated by their quarterback rating, a multifaceted descriptive statistic with a number of components that few fans can understand.

2. Statistical procedures are used routinely in political polling and opinion gathering. When statistical sampling is applied properly, certain characteristics about a statistical population can be inferred based solely on information obtained from the sample. Political opinion polling is commonplace before any major election, and exit polling of a sample of voters is used to predict the election results as quickly as possible.

3. Statistics are widely used in market analysis. Businesses constantly scrutinize all aspects of consumer behavior and details of consumer purchasing patterns because a fuller understanding of consumer actions and spending often translates into greater profit. Major business decisions, such as where to advertise and which markets are best to introduce a new product, depend on sample statistics. Many of us are familiar with the ratings systems used to estimate the relative popularity of various television programs. These decisions, involving literally millions of dollars of television advertising and variable advertising rates, rely on well-designed statistical samples of the viewing population.

4. Most of us use a wide variety of statistical measures and procedures in making personal financial decisions. Whether purchasing a home, buying auto or life insurance, trying to develop a workable budget, or setting up an investment or savings plan, an understanding of different monetary terms and statistical measures is indispensable. Knowledgeable reading of stock trends and financial data requires an understanding of various statistical measures. Two common examples are price-earnings ratio (the closing price of stock divided by the company's earnings per share for the latest twelve-month period) and yield (current annual dividend rate divided by the closing price of a stock, expressed as a percentage).

5. Weather is an everyday practical concern for many of us. We want to know which coat to wear in the morning or whether we will need an umbrella. Perhaps a winter storm system is approaching and we want to know the likelihood of getting measurable snow as the system moves through. A farmer may want to know the probability of getting sufficient precipitation over the next two weeks as an aid to scheduling spring planting. Weather forecasting models based on statistics attempt to provide answers to such questions, often using probability estimates from similar historical weather situations.

Geographers use statistics in numerous ways. Statistical analysis benefits geographic investigation by helping answer the where, why, and what-to-do questions posed in the introductory discussion. Among many general applications, the use of statistics allows the geographer to:
• describe and summarize spatial data,
• make generalizations concerning complex spatial patterns,
• estimate the likelihood or probability of outcomes for an event at a given location,
• use limited geographic data (sample) to make inferences about a larger set of geographic data (population),
• determine if the magnitude or frequency of some phenomenon differs from one location to another, and
• confirm that an actual spatial pattern matches some expected pattern.

Here are several typical examples:

1. Medical geographers attempt to identify patterns across the landscape to determine if a phenomenon is clustered, random, or dispersed. For instance, a health organization might look for cancer clusters to see whether a pattern of cancer occurrence is concentrated in a specific part of an area. Similarly, law enforcement officials often try to identify patterns of crime within a city, and ecologists try to determine if the distribution of diseased trees in a forest follows a specific pattern.

2. Economic geographers might want to determine if there is a negative correlation between monthly apartment rental prices and the distance from a large college or university. That is, the further the distance from the school, the lower the monthly rental price. This relationship identifies a variation of a basic principle in economic geography called distance decay.

3. Natural resource planners may want to confirm that the size of fish in one lake is significantly different than the size of fish in another lake, assuming an experimental growth hormone was introduced in one of the two lakes. Similarly, an agronomist may want to see if different fertilizer treatments on various experimental plots produce different crop yields.

4. Planners may want to determine if average home sale prices in one neighborhood are higher than average home sale prices in another neighborhood. When determining recipients for a community grant, an urban planner might want to know if a particular community has a larger percentage of children under the age of 18 than the state percentage.

The use of statistics must be placed within the context of a general scientific research process. Virtually all geographers recognize the overall importance of statistics in research. Some may view the application of statistical methods as an essential element of any scientific geographic research, whereas others view statistical methods as one of many approaches that geographers can apply. Whatever one's particular perspective, however, the general methodological shift or "revolution" in the geographer's view of the world occurred in the late 1950s and early 1960s. During that time, our discipline began to move from qualitative descriptions of the spatial distribution and variation of human and physical features to quantitative analyses of the same features. The application of quantitative methods (including a variety of statistical techniques) now serves as a fundamental methodological approach or paradigm for geographic research.

The geographic research process and the roles of statistics in that process are summarized in a multiple-step organizational framework (figure 1.1). Very generally, the left column of the figure lists the series of steps that lead to the formulation of inferential hypotheses. The right column diagrams additional possible steps involved in scientific research after an inferential hypothesis is stated. Clearly, formulating hypotheses is central to the research process. In many different ways statistical techniques are often involved in geographic research both before and after hypotheses are generated.

While the sequence of tasks outlined in figure 1.1 is typical of the process many geographers follow when conducting research, it is not the only mode of geographic research. Moreover, this framework should not be viewed as a rigid series of steps, but rather as a general, flexible guide used in geographic research.

The research process generally begins when we identify a worthwhile geographic problem to investigate. To recognize a productive research problem, the geographer must have background knowledge and experience in the area studied. There is simply no substitute for having a strong background in the appropriate branch of the discipline.

A hypothesis is an unproven or unsubstantiated general statement concerning the problem under investigation. Sometimes you have identified a possible problem or research area, but are not yet ready to formulate a hypothesis. Perhaps more information is needed, or preliminary questions related to the problem require an answer. In these cases, one or more descriptive statements can be formulated. Geographers often create such statements by collecting location-based data and presenting summaries of this information using graphical procedures, descriptive statistics, and maps. During the descriptive phase of the investigation, statistical analysis can provide quantitative summaries or numerical descriptions of the data, allowing the geographer to explore the data through these methods to identify useful hypotheses. The information gathered allows us to draw conclusions about some research questions, develop a model of the spatial situation, or possibly proceed further and generate inferential hypotheses about a population based on the collection of sample data. It is important to note that we must always have sample data if we are to make inferences about a population. In other cases, we may have sufficient background knowledge or information from previous research (such as a review of the literature) to readily develop infer-
[Figure 1.1: flowchart]
Steps preceding hypothesis formation: Problem → Develop questions to investigate problem → Collect and prepare data → Process descriptive data (maps, graphics, and descriptive statistics) → Reach conclusions (may include model development) or Formulate hypothesis (sometimes based on model).
Steps following hypothesis formation: Collect and prepare sample data → Test hypothesis (inferential statistics) → Hypothesis not confirmed or Hypothesis confirmed → Develop or refine model, law, or theory → Incorporate results into spatial policies and plans.

FIGURE 1.1
The Role of Statistics in the Geographic Research Process
Chapter I Introduction: The Context of Statistical Techniques 7

ential hypotheses. Often, hypotheses are formulated using a model, which is a simplified replication of the real world. A well-known model in geography is the spatial interaction model that predicts the amount of movement expected between two places as a function of their populations and the distance that separates them.

Clearly, different research scenarios are possible. In some cases, it may not be realistic or practical to go any further than the collection and descriptive processing of a set of data. For example, we would be limited to descriptive statistics and exploratory techniques if it were impractical or impossible to collect sample data for the problem variable being examined. In other cases, when it is realistic and practical to collect sample data or analyze a set of sample data that has already been collected by someone else, it may be appropriate to formulate inferential hypotheses and the research process then continues down the right column of figure 1.1. We will examine these possibilities in more detail later.

At this point in the research process, several options are possible: (1) Incorporate the research findings into actual or recommended spatial policies and plans. (2) Further refine the results into a spatial model predicting what is likely to occur under various scenarios. If a hypothesis is repeatedly verified as correct under a variety of circumstances (perhaps at various locations or times) it may gradually take on the stature of a law. If various laws are combined, they constitute a theory. (3) If a hypothesis is tested and found partially or completely incorrect, we may need to return to an earlier step in the geographic research process. With only a partially validated hypothesis, it might be best for us to conduct further exploratory and descriptive analysis-collecting additional information or refining the original hypothesis. If a hypothesis is proven totally wrong, we might need to return to the initial question development step.

Descriptive statistics provide a concise numerical or quantitative summary of the characteristics of a variable or data set. Descriptive statistics describe-usually with a single number-some important aspect of the data, such as the "center" or the amount of spread or dispersion. For most geographic problems, using such descriptive summary measures is superior to working directly with a large group of values. Descriptive statistics allow us to work efficiently and communicate effectively.

Replacing a set of numbers with a summary measure necessarily involves the loss of information. Various descriptive statistics are available, each with different advantages and limitations. It is important to select the descriptive measure whose characteristics seem appropriate to the geographic problem being analyzed. In many geographic situations, the objective is to minimize the loss of relevant information when moving from individual observations to descriptive measures.

The purpose of inferential statistics is to create and test hypotheses about a statistical population based on information obtained from a sample of that population. In the context of inferential statistics, a statistical population is the total set of information or data under investigation in a geographic study. A sample is a clearly identified subset of the observations in a statistical population. If we are to make proper inferences about the population based on the collection of sample data, then the sampled subset must be representative of the entire statistical population from which it is drawn.

The most basic element in statistics is data or numerical information. We often use groups of data, referred to as a data set and presented in tabular format (table 1.1). A data set consists of observations, variables, and data values. The elements or phenomena under study for which information (data) are obtained or assigned are often referred to as observations. Observations are sometimes called individuals or cases. Geographers use many types of observations in their research. Some are spatial locations such as cities or states, and others are non-spatial items such as people or households. A property or characteristic of each observation that can be measured, classified, or counted is called a variable because its values vary among the set of observations. The resulting measurement, code, or count of a variable for each observation is a data value.

The data set in table 1.1 is typical-several variables are chosen from a set of observations. Variables such as population in millions and mobile phone subscribers per 100 inhabitants are presented in the columns of the
TABLE 1.1
A Typical Set of Geographic Data

                    Population in Millions   Population Per     Infant            Mobile Phone Subscribers
Country             (mid-2010)               Square Kilometer   Mortality Rate*   Per 100 Inhabitants
Canada              34.1                     3                  5.1               66
Chile               17.1                     23                 8.3               88
Guatemala           14.4                     132                34                109
Kenya               40.0                     69                 52                42
United Kingdom      62.2                     256                4.7               126
Vietnam             88.9                     268                15                80

*Annual number of deaths of infants under age one per thousand live births.
Source: Population Reference Bureau (PRB), World Population Data Sheet, 2010
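The vocabulary of data sets (observations, variables, data values) maps naturally onto a small data structure. A minimal Python sketch using the values of table 1.1 (the dictionary layout and variable names are my own, not from the text):

```python
# Table 1.1 as a small data set: observations (countries) are keys,
# variables are the fields of each record, and each field entry is a data value.
data_set = {
    "Canada":         {"pop_millions": 34.1, "density": 3,   "imr": 5.1, "mobile_per_100": 66},
    "Chile":          {"pop_millions": 17.1, "density": 23,  "imr": 8.3, "mobile_per_100": 88},
    "Guatemala":      {"pop_millions": 14.4, "density": 132, "imr": 34,  "mobile_per_100": 109},
    "Kenya":          {"pop_millions": 40.0, "density": 69,  "imr": 52,  "mobile_per_100": 42},
    "United Kingdom": {"pop_millions": 62.2, "density": 256, "imr": 4.7, "mobile_per_100": 126},
    "Vietnam":        {"pop_millions": 88.9, "density": 268, "imr": 15,  "mobile_per_100": 80},
}

# A single data value: one variable measured for one observation.
canada_pop = data_set["Canada"]["pop_millions"]

# A descriptive statistic summarizes one variable across all observations
# with a single number - here, the "center" (mean) of the population variable.
populations = [row["pop_millions"] for row in data_set.values()]
mean_pop = sum(populations) / len(populations)
```

Replacing the six individual population values with the single mean illustrates both the efficiency and the information loss that descriptive summaries involve.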
8 Part I Basic Statistical Concepts in Geography

TABLE 1.2

Number of Interprovincial Migrants, Canada, 2001-2006

Province of Destination
Province of
Origin                   NL     PE     NS     NB     QC     ON      MB     SK     AB      BC      YT     NT     NU
Newfoundland & Lab.             455    4255   1350   975    9060    705    600    11355   2220    30     610    400
Prince Edward Island     365           1525   1165   475    2125    105    55     1345    470     0      50     10
Nova Scotia              3635   1935          6290   3445   19450   1200   675    12625   5960    120    465    235
New Brunswick            2895   1325   8000          6750   11395   985    570    7760    2165    40     195    100
Quebec                   760    420    2665   5345          52765   1815   950    9750    10070   195    285    180
Ontario                  10160  2680   19245  11200  44535          11125  6150   49455   56035   545    1090   580
Manitoba                 1750   205    1275   805    1800   13975          5855   19590   11455   150    235    225
Saskatchewan             240    90     830    375    1220   7060    5670          37430   10700   130    480    85
Alberta                  4115   635    5295   3175   5890   29795   7750   16635          62795   750    1655   195
British Columbia         1385   500    4330   1685   7880   38120   6580   6995   72685           1375   820    215
Yukon Territory          55     0      125    35     135    355     100    145    1455    1480           110    20
Northwest Territories    135    45     300    110    125    900     385    290    3105    1165    290           185
Nunavut                  280    10     185    30     310    780     165    105    305     195     35     365

Total In-migration       25775  8300   48035  31570  73555  185785  36585  38930  226870  164710  3665   6360   2430
Total Out-migration      32020  7690   56040  42185  85200  212705  57330  64315  138690  142580  4010   7040   2770
Net Migration            -6245  610    -8005  -10615 -11645 -26920  -20745 -25385 88180   22130   -345   -680   -340

Source: Statistics Canada
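The in-migration, out-migration, and net migration rows of table 1.2 are simple column and row sums of the origin-destination matrix. A minimal Python sketch using only a three-province subset of the table (so the totals differ from the full-table totals; the dictionary layout is my own):

```python
# Subset of table 1.2: flows[origin][destination] = migrants, 2001-2006.
flows = {
    "ON": {"AB": 49455, "BC": 56035},
    "AB": {"ON": 29795, "BC": 62795},
    "BC": {"ON": 38120, "AB": 72685},
}
provinces = list(flows)

# Out-migration: row sum (all moves leaving each origin province).
out_migration = {p: sum(flows[p].values()) for p in provinces}

# In-migration: column sum (all moves arriving at each destination province).
in_migration = {p: sum(flows[o].get(p, 0) for o in provinces) for p in provinces}

# Net migration: arrivals minus departures for each province.
net_migration = {p: in_migration[p] - out_migration[p] for p in provinces}
```

Because every migrant leaves exactly one province and arrives in exactly one other, the net migration values of a closed system must sum to zero, which is a useful consistency check on any origin-destination table.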

table and observations (Canada and the United Kingdom, for example) are displayed in the rows of the table. A value or unit of data inside the table represents the magnitude of a single variable for a particular observation. Examples of data values are Canada's mid-2010 population of 34.1 million, Kenya's infant mortality rate of 52, and Vietnam's subscription rate of 80 mobile phones per 100 inhabitants.

An alternative type of geographic data matrix or array is shown in table 1.2. This data set depicts multiple data values for a single variable-the number of Canadian interprovincial migrants from 2001 to 2006. We are often interested in the description and analysis of such spatial interaction (origin-destination) data. It is interesting to note the single largest data value is the number of people moving from British Columbia to Alberta (with 72,685 migrants). This certainly confirms the recent employment opportunities in energy development and shale oil mining. Maybe not too surprisingly, not a single person moved from Yukon to Prince Edward Island (or vice versa). The only two provinces with substantial positive net migration values are British Columbia and Alberta, and the provinces with the largest negative net migration values are Ontario and Saskatchewan. Clearly, the "westward movement" is still alive and well in Canada!

1.2 EXAMPLES OF STATISTICAL PROBLEM SOLVING IN GEOGRAPHY

Four different illustrations of the geographic research process are now introduced. In two of these examples the data of the problem variable encompass an entire population (all country-level data globally or all state-level data in the United States). Because sample data are not utilized in these examples, the statistical analysis later in the text is limited to the development of descriptive statements and summaries. These circumstances apply to the exploration of the global pattern of life expectancy from country to country and to the examination of recent population growth rates for states and counties in the United States.

In the other two example problems, suitable sample data are available, making it possible later in the text to create inferential hypotheses and conduct inferential tests as well as explore descriptive statements. Samples are available with obesity rate patterns across the United States and the timing of the last spring frost in a portion of the Southeastern U.S.

All four of these examples have interesting location-based patterns which are the result of complex spatial processes that are not completely understood. The upcoming figures and graphic displays provide some introductory information about where a topic or variable of interest is distributed and how the magnitude of the values seems to vary from place to place. This is often an excellent starting point for developing geographic summary statements and hypotheses. However, the figures do not explain why these particular spatial patterns exist-this is where descriptive statements and inferential hypotheses enter the geographic research process. Many of the descriptive statements and hypotheses are explored in later chapters as we consider each of the example problems in greater detail.

Example: Obesity Levels for States of the United States

Obesity is a growing concern in the United States and other developed nations. According to the Centers for Disease Control and Prevention (CDC), American society has become increasingly "obesogenic," characterized by environments that promote greater consumption of less healthy food and discourage physical activity. Obesity is identified as a leading cause of coronary heart disease, certain types of cancer, and type-2 diabetes. Many other health risks (including high total cholesterol, liver and gallbladder disease, and even reproductive and mental health conditions) are also associated with obesity. As concerned citizens, we should recognize that the aforementioned health problems have far-reaching societal consequences, including rising health care costs and lower worker productivity. The CDC estimates that the overall annual medical costs related to obesity could be as high as $147 billion.

In response to this rapidly growing epidemic, many individuals and organizations are attempting to take prescriptive action. The CDC's Division of Nutrition, Physical Activity, and Obesity (DNPAO) provides funding to states, with the expressed goal to: "Create policy [italics added] and environmental changes ... to address obesity and other chronic diseases through a variety of nutrition and physical activity strategies." Other policymakers are attempting to address the problem through legislation that is considered overly intrusive by some-banning children's meals at fast food restaurants, placing a surcharge or "fat tax" on certain foods and beverages, or requiring restaurant menus to provide nutritional information on the food they serve. Recent White House initiatives attempt to confront the problem of childhood obesity with programs sponsored by First Lady Michelle Obama, with the hope that these programs will increase the amount of exercise children get and better educate families about healthy food choices.

Over the last few decades, the CDC's Behavioral Risk Factor Surveillance System (BRFSS) collected obesity-related data through monthly telephone interviews. The CDC reports that "obesity rates for all population groups-regardless of age, sex, race, ethnicity, socioeconomic status, educational level, or geographic region-have increased markedly."

Since 1995, the BRFSS has gathered fully comparable data from all states, and the results show a rapid and frightening trend toward obesity. Obesity is defined using the Body Mass Index (BMI), a measure of an adult's weight in relation to his or her height, specifically by dividing weight in pounds (lbs) by height in inches (in) squared and multiplying by a conversion factor of 703.

Example: Weight = 150, Height = 5' 5" (65")
Calculation: [150 / (65)²] × 703 = 24.96

Of course, a comparable index is available in the metric system, using kilograms and meters (or centimeters). For adults 20 years old and older, BMI is interpreted similarly for all ages and for both men and women. CDC has created standard weight status categories associated with BMI ranges:

obese: BMI ≥ 30.0;
overweight: BMI 25.0 to 29.9;
neither overweight nor obese: BMI ≤ 24.9

Using these three standard weight status categories, we can display the rapid nationwide trend toward obesity. The prevalence and trend data from 1995 to 2010 are shown in both numeric (table 1.3) and graphic form (figure 1.2) with bar and pie charts. A bar chart displays the distribution of a variable whose observations are allocated to categories, while a pie chart shows the entire data set divided into pie-slice shaped portions of the whole. To emphasize the rapid change over this 15-year period, only the 1995 and 2010 bar and pie charts are shown in figure 1.2.

Looking at the nationwide obesity trends certainly illustrates the rapid changes in obesity levels in the United States. Geographers, however, are specifically interested in analyzing the geographic patterns and trends of obesity in the United States, not just nationwide trends. By looking at a map showing the percent of adults who are obese from state to state (figure 1.3), we can determine where the highest and lowest levels of obesity exist, and then further explore why the pattern has the distribution it does.

Only Colorado has a prevalence of adult obesity less than 20% (2009 data). Thirty-three states have obesity levels of at least 25% and nine of those states have 30% or more of their adult population classified as obese. Looking at figure 1.3, it appears that the states with the highest obesity levels are located mostly in the mid-South (especially in the Mississippi River Valley) and in Appalachia. The very highest obesity rate (34.4%) is in Mississippi. Conversely, states with the lowest levels of obesity seem to form two distinct clusters-one in the western and Rocky Mountain states and another in the northeastern states, from New Jersey and New York north through much of New England.

How can we now proceed through the scientific research process? More specifically, in what ways can we use statistics to help answer various where and why questions associated with this spatial distribution of obesity levels? It certainly seems that there is a distinct regional patterning associated with obesity, but how can it best be confirmed and how can we use statistics to help approach the question of why the obesity level pattern varies regionally in the way that it does?

The general geographic research procedure (figure 1.1) recommends starting with the following sequence of steps: develop questions to investigate regarding the obesity problem, collect and prepare data to answer these questions, and then process these data through the use of maps, graphic summaries, and descriptive statistics. The following is an

TABLE 1.3
Percent of Total U.S. Population by Weight Status Category: 1995 to 2010

Year    Neither Overweight or Obese    Overweight    Obese
1995    47.9                           35.5          15.9
2000    42.9                           36.7          20.1
2005    38.5                           36.7          24.4
2010    35.3                           36.2          27.6

Source: Centers for Disease Control (CDC)
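The BMI formula and the CDC weight status categories can be sketched directly in Python (the function names here are illustrative, not from the text):

```python
def bmi(weight_lbs, height_in):
    """Body Mass Index from weight in pounds and height in inches:
    weight divided by height squared, times the conversion factor 703."""
    return weight_lbs / height_in ** 2 * 703

def weight_status(bmi_value):
    """CDC standard weight status categories associated with BMI ranges."""
    if bmi_value >= 30.0:
        return "obese"
    if bmi_value >= 25.0:
        return "overweight"
    return "neither overweight nor obese"

# The worked example from the text: 150 lbs at 5'5" (65 inches).
example_bmi = bmi(150, 65)   # about 24.96
```

A BMI of 24.96 falls just below the 25.0 cutoff, so the example individual is classified as neither overweight nor obese.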


example of a straightforward descriptive question and descriptive statement that seems worthy of investigation:

"Do southern and Appalachian states have higher obesity levels than other states outside these regions?" Visually, it appears that the logical answer to this question is yes. We can rephrase this question as a descriptive statement:

• "Southern and Appalachian states have higher obesity levels than states outside these regions."

Using similar logic, we can create other examples of descriptive statements:

• "States with higher obesity levels have a higher percentage of population that is black (African American)." This statement is correct to the degree that there is a higher percentage of black population in southern states compared to the rest of the country.

• "States with lower income (or education) levels have higher obesity levels." This descriptive statement will be confirmed to the extent that there is a state-level association between low income or education and high obesity rate.

• "States in which a lower percentage of the population is physically active will have higher obesity levels."

• "States in which a higher percentage of the population eats a 'healthy and nutritious' diet will have lower obesity rates."

• "States in which a higher percentage of the population has type-2 diabetes will have higher obesity levels."

• "States having higher obesity levels in 2000 will have larger increases in the percentage of their population that is obese from 2000-2010 than will states having lower obesity levels in 2000." If this descriptive statement is correct, it would suggest that the level of "obesity inequality" is increasing and that the "obesity gap" between high and low obesity rate states is widening, not narrowing.

Descriptive questions and statements such as these can often be definitively answered without too much difficulty,

[Figure: paired bar and pie charts of the percent of U.S. population in each weight status category (neither overweight nor obese, overweight, obese) for 1995 and 2010.]

FIGURE 1.2
Bar and Pie Charts of Total U.S. Population by Weight Status Category: 1995 and 2010
Source: Centers for Disease Control and Prevention (CDC)

simply because data are already available and are summarized at the state level. This may allow us to make "conclusive" descriptive summary statements, and perhaps develop geographic models to predict obesity levels based on some explanatory factor or variable, such as being physically active or eating a healthy and nutritious diet.

To proceed further in the geographic research process and formulate inferential hypotheses, it is necessary to collect, prepare, and analyze sample data. Only then can we use inferential statistics to their ultimate or maximum potential. Keep in mind that the following are just a few representative examples from many inferential hypotheses that one could create using obesity along with some of these other variables. In many cases, it is most appropriate to create inferential hypotheses that are applicable at a disaggregated level, using smaller areas or spatial units such as counties, or using individual observations. Using disaggregated data allows us to take proper samples for subsequent inferential projection to entire populations.

Here are a few examples showing how the use of disaggregated levels might work with obesity patterns and some of the variables taken from the descriptive statement list:

• Take two representative samples of individuals-one sample from the southern and Appalachian states and another from states outside this region. Based on these sample data, we can formulate an inferential hypothesis, such as: "The percent of population that is obese in the southern and Appalachian states is higher than the percentage of population that is obese in other states outside this region." Notice that this first inferential hypothesis is worded in a similar fashion as the first descriptive statement, but the inferential hypothesis is fundamentally different because it is evaluated based on sample data. In this case, we can use an inferential statistical test to determine whether we should reject or not reject (accept) the hypothesis and estimate the probability or likelihood that the correct conclusion was reached.

• Select a random sample of individuals from two states-one state having a relatively high percentage of the population that is black (say Alabama) and another state having a relatively low percentage of the population that is black (say South Dakota). Based on this sample data, we can create an inferential hypothesis: "The percentage of Alabama adult population that is obese is higher than the percentage of South Dakota adult population that is obese."

• Select a random sample of counties from each of the four major census regions (Northeast, South, Midwest, and West) and record the county obesity rate for multiple dates (1995, 2002, and 2009, for example). An inferential hypothesis suitable for testing this sample data is: "There are significant differences in obesity rates between the four census regions."
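The two-sample logic behind these hypotheses can be sketched with Python's standard library. All populations, proportions, and names below are hypothetical illustrations (not BRFSS data):

```python
import random

random.seed(42)  # make the draw reproducible

# Hypothetical adult "populations" of obesity indicators (1 = obese, 0 = not):
# 32% obese in region A, 24% obese in region B.
region_a = [1] * 320 + [0] * 680
region_b = [1] * 240 + [0] * 760

# Simple random samples of 100 individuals from each region.
sample_a = random.sample(region_a, 100)
sample_b = random.sample(region_b, 100)

# Sample proportions obese - the statistics an inferential test would compare.
p_a = sum(sample_a) / len(sample_a)
p_b = sum(sample_b) / len(sample_b)
difference = p_a - p_b
```

An inferential test (introduced in later chapters) asks whether a difference this large is likely under the hypothesis of no regional difference, given the sampling variability of proportions.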

[Figure: choropleth map of the United States with 2009 obesity rates by state classed 18.6 to 21.8, 21.9 to 24.9, 25.0 to 28.1, 28.2 to 31.2, and 31.3 to 34.4 percent.]

FIGURE 1.3
Percent of Adults Classified as Obese by State, 2009
Source: Centers for Disease Control and Prevention (CDC)


• Select a random sample of individuals from the nationwide population. For each person sampled, indicate whether the person is "physically active" or "physically inactive" (of course, these terms must be defined in a consistent and useful manner) and also whether the person is classified as obese or not. An inferential hypothesis based on this sample data is: "The number of physically active people that are actually obese (observed to be obese) is significantly lower than the number of physically active people that are expected to be obese if no relationship exists between obesity and the level of physical activity."

Of course, unless we actually analyze these descriptive statements and inferential hypotheses, it is difficult or impossible to determine whether any of them are correct or incorrect. Suppose, however, we discover through descriptive analysis that states where a higher percentage of the population eats a "healthy and nutritious diet" generally have lower obesity rates. Further suppose we determine through inferential analysis that a sample of the population from a state having a healthy and nutritious diet has a significantly lower obesity rate than a sample of the population from a state having an unhealthy and less nutritious diet. These findings would send a clear signal that policies improving diet and nutrition levels should help lower obesity levels in places that currently have poor dietary and nutritional habits. In other words, it now becomes reasonable for us to incorporate spatial policies and plans in those places needing help to lower their obesity levels. Stated in yet another way, we can now prescribe what to do at any particular location, and thereby help policymakers in a meaningful way. Without this analysis, policymakers are simply basing decisions on gut reactions or "educated guesses." Sadly, many policy decisions are made in exactly this way-without strong scientific rationale. (Several of the descriptive statements and inferential hypotheses just discussed are explored and analyzed in later chapters using various statistical techniques.)

Example: Life Expectancy at the Country and World Region Levels

Life expectancy at birth is one of the most frequently used variables to measure the overall health of a country. This is not surprising, as length of life is the cumulative result of so many interrelated economic, social, political, and environmental influences. From cradle to grave, many favorable circumstances must operate for the typical or average citizen of a country to have a long life expectancy. Conversely, it seems logical that many individuals in countries plagued with famine, disease, war, and minimal health services and personnel will have relatively short life expectancies.

Numerous organizations (including the World Health Organization [WHO] of the United Nations and the United States Central Intelligence Agency) collect life expectancy data and monitor changes over time. In addition, many private organizations such as the Kellogg Foundation and the Bill and Melinda Gates Foundation spend billions of dollars in humanitarian relief to alleviate human suffering in countries suffering from low life expectancy. Countless studies have shown the overall spatial distribution and pattern of life expectancy worldwide, confirming exactly where life expectancies are lowest and then trying to understand why they are lowest in those places. On the basis of this information, one hopes to determine what to do, perhaps developing more effective medical and nutritional health care delivery systems and devising better financial policies that help raise life expectancies.

Once again, an excellent place for us to start is with a map showing the overall pattern of life expectancy at birth for the world's countries (figure 1.4 shows the 2009 pattern). Life expectancy at birth is defined as the average number of years a live newborn infant can expect to live under current mortality levels (Population Reference Bureau, PRB). The life expectancy pattern seems to show considerable regional differences and spatial variation from country to country. Some countries have very low average life expectancies, only in the mid to high forties. Other countries have average life expectancies in the low eighties.

Following the general geographic research process, and using the life expectancy map as a starting point, we can generate a series of basic descriptive statements:

• "More developed countries (MDCs) and regions have longer life expectancies than less developed countries (LDCs) and regions."

• "Countries on the continent of Africa have shorter life expectancies than countries on other continents." This observation seems to hold particularly for sub-Saharan Africa.

• "Countries with higher incidence rates of tuberculosis (per 100,000 people per year) have shorter life expectancies." Of course, we could substitute many other cause-specific or infectious diseases for tuberculosis (cholera, HIV/AIDS, malaria, measles, mumps, poliomyelitis, yellow fever, etc.).

• "Countries with a higher percentage of the population using improved drinking water sources have longer life expectancies." Here again, we could substitute other risk factors for good drinking water (improved sanitation, level of alcohol consumption, prevalence of smoking a tobacco product, etc.).

• "Countries with a higher physician density (per 10,000 people) have longer life expectancies." Other measures of health workforce and infrastructure could be substituted for physician density (environmental and public health worker density, hospital bed density, etc.).

• "Countries with lower adult literacy rates have shorter life expectancies." We could substitute other demographic and social measures indirectly related to life expectancy for literacy rate (percent population living in urban areas, total fertility rate per woman, net percentage primary school enrollment rate, gross national income per capita measured in Purchasing Power Parity [PPP] in international dollars, etc.).
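One simple way to screen descriptive statements like these is to compute a Pearson correlation coefficient between the two country-level variables. A minimal Python sketch for the physician-density statement, using hypothetical illustrative values (not WHO data):

```python
# Hypothetical country-level values for illustration only.
density = [2, 5, 12, 25, 33, 40]       # physicians per 10,000 people
life_exp = [52, 58, 66, 74, 79, 81]    # life expectancy at birth (years)

# Pearson correlation coefficient r, computed from deviations about the means.
n = len(density)
mean_x = sum(density) / n
mean_y = sum(life_exp) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(density, life_exp))
var_x = sum((x - mean_x) ** 2 for x in density)
var_y = sum((y - mean_y) ** 2 for y in life_exp)
r = cov / (var_x * var_y) ** 0.5   # r near +1 is consistent with the statement
```

A strongly positive r supports the descriptive statement, though (as the text emphasizes) a correlation alone does not explain why the spatial pattern exists.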

[Figure: world choropleth map of life expectancy at birth, classed No Data, 45.0 to 54.9, 55.0 to 64.9, 65.0 to 74.9, and 75.0 to 84.9 years.]

FIGURE 1.4
Life Expectancy at Birth (in Years), 2009
Source: World Health Organization (WHO), World Health Statistics, 2010

• "Countries with a higher per capita total expenditure on health (expressed in PPP-international dollars) have longer life expectancies." In economics, PPP determines how much money is needed to purchase the same goods and services in two countries, which is then used to calculate an implicit foreign exchange rate. Using that PPP rate, an amount of money therefore has the same purchasing power in different countries. The actual nature of this relationship, with countries identified by world region, is shown in figure 1.5. In this figure, the assignment of countries to world region follows the system used by the Population Reference Bureau and the source of data is World Health Statistics 2010, published annually by the WHO.

Figure 1.5 is an example of a scatterplot, a graphic technique that shows the relationship between two quantitative variables. On a typical scatterplot, every observation has a value on both variables and is plotted on a case-by-case basis. The scatterplot shown in figure 1.5 indicates there may be a rather intriguing and complex relationship between these two variables. The appearance of three distinct stages as per capita health expenditures increase may suggest a "life expectancy transition" to population geographers.

1. Countries in the "Initial Stage" (Stage 1) are spending little money on health. Per capita total expenditure on health is generally less than $300 annually (PPP-international dollars) and average life expectancies are quite low, ranging in years from the low 40s to the upper 60s. The countries with the very lowest life expectancies (below 60) are virtually all in sub-Saharan Africa. Somewhat better are several of the poorest Asian and Latin American nations (Bhutan, Haiti, Laos, Nepal, Turkmenistan) with life expectancies in the low- to mid-60s but still with annual per capita total health expenditures below $300. Somewhat surprisingly, perhaps, there is quite a range of life expectancy values among these countries, even though the amount of money spent on health care per person is about the same.

2. Countries in the "Transition Stage" (Stage 2) are experiencing concurrent or simultaneous changes in both life expectancy and per capita total expenditure on health. All of the countries spending between $300 and $1500 per person annually seem to have life expectancies ranging from the low 70s to the upper 70s. Several world regions and subregions are well represented in this group, including countries from North Africa, Latin America and the Caribbean, Asia, and Eastern Europe and the Russian Federation. Although a lot of "scatter" appears among the countries in this stage, a rough general relationship seems to exist. Countries with lower total expenditures on health per capita (in the $300 to $700 range) generally have life expectancies


[Figure: scatterplot with life expectancy at birth (years, roughly 40 to 85) on the vertical axis and per capita total expenditures on health care, 2008 (PPP int. $, 0 to 8000) on the horizontal axis. Countries are plotted with symbols by world region (Africa, Latin America/Caribbean, Asia, North America, Europe, Oceania) and labeled with three-letter country codes; dashed curves mark Stage 1 (Initial), Stage 2 (Transition), and Stage 3 (Final).]

FIGURE 1.5
Scatterplot Showing Relationship Between Life Expectancy at Birth and Total Health Care Expenditures Per Capita, Selected Countries, 2008 (Note: Regional Assignment of Countries Based on Population Reference Bureau)
Source: World Health Organization (WHO), World Health Statistics, 2010
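The three-stage reading of the scatterplot amounts to a threshold classification on health expenditure. A Python sketch using the $300 and $1500 cutoffs and the country examples quoted in the surrounding text (the function name is illustrative):

```python
def expenditure_stage(per_capita_health_ppp):
    """Assign a country to a 'life expectancy transition' stage
    using the $300 and $1500 per capita expenditure thresholds."""
    if per_capita_health_ppp < 300:
        return 1   # Initial stage
    if per_capita_health_ppp <= 1500:
        return 2   # Transition stage
    return 3       # Final stage

# (life expectancy, per capita health expenditure in PPP int. $), from the text.
examples = {
    "Dominican Republic": (72, 411),
    "Romania": (73, 592),
    "Argentina": (75, 1322),
    "Croatia": (76, 1398),
    "Australia": (81, 3357),
    "France": (81, 3709),
}
stages = {c: expenditure_stage(spend) for c, (le, spend) in examples.items()}
```

Within Stage 3 the classification deliberately ignores life expectancy, reflecting the text's observation that beyond the $1500 threshold additional spending shows little or no relationship to longevity.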

in the low 70s (e.g., Dominican Republic-life expectancy 72, health expenditures per capita $411; Romania-LE 73, $592). Countries with somewhat higher expenditures on health per capita (in the $1100 to $1500 range) mostly have slightly longer life expectancies in the mid- to upper-70s (e.g., Argentina-LE 75, $1322; Croatia-LE 76, $1398).

3. Countries in the "Final Stage" (Stage 3) are all classified as "more developed" by the United Nations. This category includes all of Western Europe, the United States, Canada, Japan, Australia, and New Zealand. In these countries, life expectancies are in the very high 70s (78 or 79) or very low 80s (80 to 83). However, there seems to be little or no relationship between life expectancy and total health care expenditures per capita. For countries that spend more than $1500 per capita, life expectancies differ by only a very small amount. Beyond this $1500 threshold, it does not seem to matter how much money is spent on health care. By far, the largest total expenditure on health care per capita is in the United States, with $7285, yet life expectancy is only 78 years. Contrast U.S. performance with Japan (the country with the world's longest life expectancy at 83 years, but a total health care expenditure of only $2696) or any other more developed country (e.g., Australia-LE 81, $3357; Canada-LE 81, $3900; or France-LE 81, $3709).

As curious geographers, we could create a number of interesting inferential hypotheses to examine the precise nature of this relationship between life expectancy and per capita total health care expenditures. To what extent are there differing cultural or social values between countries with regard to spending a lot of additional money to increase life expectancy only very slightly? If samples of elderly seniors were followed in different MDCs, would the United States health care delivery system spend more money per person to support very expensive end-of-life care? By contrast, do other MDCs have more triage-based decision-making policies that discourage or do not allow certain expensive end-of-life medical procedures? Is there a spatial (regional) pattern for countries having lower life expectancies than predicted given their total health expenditure levels?

To proceed further with the general research process for the life expectancy problem will require the availability of comparable sample data from country to country at the global level. Unfortunately, there do not seem to be directly comparable, sample-based life expectancy figures for many of the very poor less developed countries. Certainly there are life
Chapter I • Introduction: The Context of Statistical Techniques 15

expectancy estimates for every country, and excellent descriptive statements can be made about the spatial pattern of these estimates; however, the actual "in-the-field" sampling procedures used to get these estimates vary significantly from country to country. For example, sample sizes and sample proportions will almost certainly vary from country to country. In-field data collection procedures will necessarily be quite different from one country to another, especially in poorer LDCs. A multiplicity of economic, political, and environmental circumstances will likely make it literally impossible to collect comparable life expectancy sample data for all of the world's countries. We may also have no idea whatsoever about the amount of error introduced in samples of life expectancy estimates from some countries. Even if background information regarding sample characteristics exists, reports or documents may not provide this information (this limitation seems to be the case in the WHO annual World Health Statistics reports). All of these problems and issues make the use of inferential test statistics a risky proposition with uncertain results and questionable conclusions. Therefore, no sample data dealing with life expectancy are analyzed in this text.

However, if appropriate sample data were available, it would certainly be possible for us to create any number of inferential hypotheses:

• Take two representative samples of individuals-one sample from MDCs and another sample from LDCs. Based on these sample data, test the inferential hypothesis: "The life expectancy of those individuals sampled from MDCs is longer than the life expectancy of those individuals sampled from LDCs." Analysis of this hypothesis leads to a statistical conclusion that the life expectancy for the entire population of all MDCs is longer than the life expectancy for the entire population of all LDCs.

• Select a random sample of individuals from two countries-one sample from an MDC (e.g., Norway) and another sample from an LDC (e.g., Egypt). Based on these sample data, the following inferential hypothesis is reasonable: "The average life expectancy of those individuals sampled from Norway is longer than the average life expectancy of those individuals sampled from Egypt." If a significantly longer life expectancy is confirmed for Norway than Egypt, using an inferential statistical test on the samples from these two countries, then it is reasonable to infer that life expectancy is longer for the entire population of Norway than for the entire population of Egypt.

• Select a random sample of countries whose total annual expenditure on health per capita is high (say $1,500 or more per capita) and another random sample of countries whose annual total expenditure on health per capita is low (say $300 or less). Based on these two samples, a possible inferential hypothesis is: "Countries with a high total annual expenditure on health per capita have longer life expectancies than countries with a low total annual expenditure on health per capita."
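The first hypothesis above calls for a two-sample comparison. One standard form of the arithmetic involved is the pooled-variance two-sample t statistic, sketched below with entirely hypothetical life expectancy values standing in for real survey data.

```python
import statistics

def two_sample_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic (equal-variance form)."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Pooled estimate of the common variance
    pooled = ((na - 1) * statistics.variance(sample_a) +
              (nb - 1) * statistics.variance(sample_b)) / (na + nb - 2)
    return (mean_a - mean_b) / (pooled * (1 / na + 1 / nb)) ** 0.5

# Entirely hypothetical life expectancies (years) for individuals
# sampled from an MDC and an LDC -- not real survey data
mdc = [81, 79, 84, 78, 82, 80, 83, 77]
ldc = [63, 70, 58, 66, 61, 69, 64, 61]

t = two_sample_t(mdc, ldc)
print(round(t, 2))  # a large positive t supports the MDC > LDC hypothesis
```

A t value well above the critical value for the chosen significance level would lead us to reject the hypothesis of equal life expectancies; two-sample difference tests of this kind are taken up formally in chapter 10.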

Example: Timing of the Last Spring Frost in the Southeast United States

The length of the growing season in any agricultural area is of critical importance in determining the spatial patterns of different agricultural crops. Systematic changes and natural variations from year to year in growing season length will have profound effects on the management of agricultural systems. Especially crucial is the timing of the last spring frost, for that date strongly influences the timing of all subsequent management activities, from planting to harvesting. In moderate climates with distinct seasonal changes (such as the Southeastern United States), a later-than-normal spring frost can have numerous adverse socioeconomic impacts on both vegetable crops and citrus.

Given the potential impact of the last spring freezes on the agricultural industry across the Southeast United States, it is not surprising that a great deal of research is conducted over a wide variety of crops to determine the spatial patterns and temporal trends of the growing season in this area. One recent study is unique because it examines the characteristics of severe late spring frost events in new, innovative ways. This study (along with more recent updates) collected late spring frost data from a sample of weather stations distributed throughout a large portion of the Southeast U.S. from 1950 to 2010. The author has graciously allowed the use of some of these sample data, which provide us with another practical example of the geographic research process.

The spatial pattern of average last spring frost dates is shown (figure 1.6) for a selected sample of 76 weather stations distributed across a large portion of the Southeastern United States. From this map pattern a set of research hypotheses can be proposed. Since the last spring frost data represent a sample of weather stations in this region, these hypotheses are inferential in nature. This will allow statistical prediction and estimation of last spring frost dates for any location within this study region, not just the original sample of weather stations.

At the general regional level, the following inferential hypotheses are worth considering:

• "Weather stations with more northerly latitudes (further from the equator) have average last spring frost dates later in the spring than weather stations with less northerly latitudes (closer to the equator)."

• "Weather stations with higher elevations above sea level have average last spring frost dates later in the spring than weather stations at lower elevations."

• "Weather stations further from the coast (the Atlantic Ocean and Gulf of Mexico) have average last spring frost dates later in the spring than weather stations closer to the coast."

These three inferential hypotheses suggest that latitude, elevation, and distance to the coast are all potential influences on the timing of the last spring frost date. In chapter 6 we return to this topic and look at the effectiveness of a regional probability map showing the likelihood of a late spring frost occurring on or after April 1. Then, in the final part of the book (chapters 16 and 17), we examine the spatial relationships between these three "explanatory variables" and last spring frost and explore the effectiveness of some predictive models.
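The latitude hypothesis above is, at heart, a claim about correlation. A minimal sketch of how such an association could be summarized with a Pearson correlation coefficient follows; the station values are invented for illustration and are not Parnell's sample data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented stations: latitude (degrees N) and average last spring frost
# date expressed as day-of-year -- illustrative values only
latitude = [30.4, 31.2, 32.5, 33.7, 34.9, 35.8, 36.6]
frost_doy = [58, 66, 75, 81, 90, 97, 103]

r = pearson_r(latitude, frost_doy)
print(round(r, 3))  # r near +1 is consistent with the latitude hypothesis
```

Correlation of this kind, and the regression models built on it, are the subject of chapters 16 and 17.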


[Figure 1.6 appears here: a map of selected weather stations across the Southeast United States, with average last spring frost dates classed as February 16 to March 16, March 17 to April 3, April 4 to April 13, and April 14 to April 28; a kilometer scale bar and north arrow are shown.]

FIGURE 1.6
Average Date of Last Spring Frost, Selected Weather Stations in Southeast United States, 1950 to 2010
Source: Parnell, 2013

Example: Patterns of Population Change in the Contiguous United States

Suppose we are interested in analyzing state-level growth trends in the United States over the last few decades (figure 1.7). The fastest growing states over a recent 30-year period (1980-2010) appear to be generally located in the South and West, with consistency from decade to decade. Conversely, many of the states which have experienced slow growth (less than 15%) or even population loss (as is the case with a few states during the period from 1980 to 1990) are located in the Northeast and Midwest. However, further examination of the maps reveals exceptions to this rule. For example, during the 1980s Wyoming lost population while New Hampshire witnessed relatively high growth (above 15%). In addition, the growth rates for many states vary quite a bit from decade to decade, indicating possible changes in the pattern of economic growth and employment opportunities or changes in people's preferences for different places at different times.

Given these maps of state-level population growth, we might ask why these spatial patterns exist. Why do these growth rate patterns vary in the ways that they do? What factors can we suggest to help explain the nature of these spatial distributions? What spatial processes might have been operating and what policies (if any) might alleviate any problems caused by excessive population growth or excessive population loss?

As with previous geographic examples in this chapter, a set of descriptive statements is developed:

• "Southern and western states with warmer climates have higher population growth rates than northeastern and midwestern states having colder climates." Many people feel that climatic amenities such as a warmer winter and associated environmental and recreational considerations influence the spatial pattern of population change in the United

[Figure 1.7 appears here: three maps showing percent population change by state for 1980-1990, 1990-2000, and 2000-2010, with classes for Population Loss, 0.1% to 15.0%, 15.1% to 30.0%, and 30.1% to 66.0%.]

FIGURE 1.7
Percent Population Change by State, 1980 to 2010
Source: United States Bureau of the Census
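The maps in this example are built from one simple quantity: percentage population change between two census counts. A minimal sketch of that definition, using invented counts rather than actual census figures:

```python
def percent_change(pop_start, pop_end):
    """Percentage population change between two census counts."""
    return 100.0 * (pop_end - pop_start) / pop_start

# Hypothetical state counts for 1980 and 1990
print(round(percent_change(4_000_000, 4_600_000), 1))  # -> 15.0 (moderate growth)
print(round(percent_change(4_000_000, 3_900_000), 1))  # -> -2.5 (population loss)
```

Class breaks such as 15.0% and 30.0% in figure 1.7 simply partition this computed value into the mapped categories.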


States. Some states in the South and West-including Florida, Texas, Arizona, and California-are considered to be in the "sunbelt" because of their warm and sunny climates. Conversely, the northeastern and midwestern states that are either losing population or experiencing slow growth are sometimes classified as being in the "Snowbelt" or "Frostbelt." The implication is that people are fleeing the cold winters and snow for year-round warm weather and outdoor recreation opportunities. We have a problem, however: how are we going to operationally define "warmer climate" for a state? After all, a state is an area, not a point like a weather station, and some states have wildly variable climates within their borders.

• "States with healthier and stronger economies will have higher population growth rates than states whose economies are weaker and less vibrant." Economic factors undoubtedly influence growth in a number of ways. A key task, of course, will be to define what is meant by "healthier" or "stronger" economy, so that it can be measured and analyzed. In surveys examining why people change residence, respondents frequently cite job opportunities and related economic reasons. Over recent decades, many more new jobs were created in the southern and western states than in the Northeast and Midwest. The region containing the states of Pennsylvania, Ohio, Michigan, Indiana, and Illinois is sometimes called the "Rustbelt" because of its traditional economic reliance on heavy industry and manufacturing-sectors of the U.S. economy that have been suffering a relative decline for several decades. There is some recent compelling evidence, however, that this spatial pattern is changing again.

• "States attracting larger numbers of immigrants will have higher population growth rates than states that attract few immigrants." Many migrants from Asia and Latin America have recently settled in California, Florida, and Texas (among other places), contributing to higher growth rates in these states. However, many immigrants have also settled in Illinois and New York-states that have experienced little recent change in population size.

• "States with smaller populations and lower population densities have higher growth rates than states with larger populations and higher population densities." Geographers studying recent migration trends sometimes suggest that low-density residential areas such as rural regions and small towns are increasingly attractive residential destinations for many Americans. To the extent that this trend toward nonmetropolitan growth or decentralization of the population is true, states with smaller populations and lower population densities should have higher growth rates.

• "States with ocean-oriented, water-based recreation and economic activity have higher population growth rates than states lacking these amenities." To the degree this is the case, rapidly growing coastal (peripheral) communities should contrast with stagnant or declining interior (heartland) locations.

A decidedly different set of descriptive statements might emerge if we look at a county-level map of population change, such as the map of percentage change from 2000 to 2010 (figure 1.8). The spatial pattern of growth now seems to be much more precise and complicated than at the state level, and this allows us to speculate specifically about what spatial processes might be operating that influence growth patterns.

[Figure 1.8 appears here: a map showing percent population change by county, 2000-2010, with classes for Population Loss, 0.0% to 15.0%, 15.1% to 30.0%, and 30.1% to 109.2%.]

FIGURE 1.8
Percent Population Change by County, 2000 to 2010
Source: United States Bureau of the Census, 2010

For example, visually contrast the growth rate patterns of Pennsylvania and Colorado. The map indicates that the many counties in Pennsylvania mostly grew at about the same rate from 2000 to 2010, with relatively little difference from county to county. In Colorado, however, you can see considerable variation in growth rates during the decade from county to county. What spatial processes related to growth are operating to cause only small differences in growth rate among Pennsylvania counties but much larger differences among Colorado counties? Perhaps you can create a descriptive statement suggesting some possible reason or reasons.
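The Pennsylvania/Colorado contrast is a statement about variability, which a dispersion measure such as the standard deviation captures directly. The sketch below uses invented county growth rates chosen only to mimic the contrast described here:

```python
import statistics

# Invented county growth rates (percent change, 2000-2010) for two
# states -- illustrative values, not actual census figures
pennsylvania = [2.1, 3.4, 1.8, 2.9, 3.1, 2.5, 2.0, 3.3]
colorado = [-4.2, 1.5, 12.8, 30.1, 8.7, -2.3, 22.4, 5.9]

# Similar central tendencies can hide very different spreads; the
# standard deviation summarizes county-to-county variation in each state
sd_pa = statistics.stdev(pennsylvania)
sd_co = statistics.stdev(colorado)
print(round(sd_pa, 2), round(sd_co, 2))
```

A much larger standard deviation for Colorado than for Pennsylvania would quantify the visual impression from the map; descriptive measures of dispersion are covered in chapter 3.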

As stated when introducing these four geographic examples at the beginning of this section, we are going to explore many of the descriptive statements and inferential hypotheses in later chapters. Each of these topics and problem settings (obesity, life expectancy, last spring frost, and population change) deserves further analysis, and they all lend themselves to various types of statistical procedures as we continue.

KEY TERMS

bar chart, 9
data, data set, data value, 7
data matrix (array), 8
descriptive statement, 5
descriptive statistic, 7
geography, 3
hypothesis, 5
inferential hypothesis, 5
inferential statistic, 7
law, 7
model, 7
observation, 7
pie chart, 9
sample, 7
scatterplot, 13
statistical population, 7
statistics, 4
theory, 7
variable, 7

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Identify the types of questions geographers ask.
2. Understand the general importance and uses of statistics.
3. List the potential applications of statistics in geography.
4. Understand the role of statistics in the geographic research process.
5. Explain the role of statistics both before and after hypothesis formulation.
6. Formulate possible descriptive statements and inferential hypotheses when presented with a spatial pattern or locational data set.
7. Distinguish between questions, hypotheses, laws, theories, and models in geography.
8. Explain the typical organization or relationship of observations, variables, and data values in geographic data sets.
9. Explain the basic difference between descriptive statistics and inferential statistics.
10. Distinguish between a statistical population and a sample.
11. Describe or summarize a given set of data graphically with a bar chart, pie chart, and scatterplot.

REFERENCES AND ADDITIONAL READING

Abler, R. F., J. S. Adams, and P. R. Gould. Spatial Organization: The Geographer's View of the World. Englewood Cliffs, NJ: Prentice-Hall, 1971.
Amedeo, D. and R. G. Golledge. An Introduction to Scientific Reasoning in Geography. New York: John Wiley and Sons, 1975.
American Statistical Association. "What is Statistics?" "What Do Statisticians Do?" Accessed January 9, 2014. http://www.amstat.org/careers/
Association of American Geographers. Geography: Today's Career for Tomorrow. (informational brochure, discontinued). Washington, DC: AAG, no date.
Association of American Geographers. Practicing Geography: Careers for Enhancing Society and the Environment. Upper Saddle River, NJ: Prentice Hall, 2012.
Burton, I. "The Quantitative Revolution and Theoretical Geography." The Canadian Geographer 7 (1963): 151-62.
Cobb, G. "Reconsidering Statistics Education: A National Science Foundation Conference." Journal of Statistics Education (1993). Accessed January 9, 2014 via www.amstat.org/publications/jse/v1n1/cobb.html
Geography Education Standards Project. Geography for Life: National Geography Standards 1994. Washington, DC: National Geographic Research and Exploration, 1994. (2nd ed. now available, 2012). www.ncge.org/geography-for-life
Haring, L. L. and J. F. Lounsbury. Introduction to Scientific Geographic Research. 4th ed. Dubuque, IA: Wm. C. Brown, 1992.
Hubbard, R. "Assessment and the Process of Learning Statistics." Journal of Statistics Education (1997). Accessed January 9, 2014 via http://www.amstat.org/publications/jse/v5n1/hubbard.html

Kuhn, T. S. The Structure of Scientific Revolutions. 3rd ed. Chicago, IL: University of Chicago Press, 1996.
Montello, D. R. and P. C. Sutton. An Introduction to Scientific Research Methods in Geography and Environmental Studies. 2nd ed. Thousand Oaks, CA: Sage Publications, 2012.
Parnell, D. B. A Climatology of Frost Extremes Across the Southeast United States, 1950-2009. Unpublished manuscript, Salisbury, MD: Salisbury University, 2013.
If you are interested in exploring further the geographic examples mentioned in this chapter, the following are good places to start. For obesity-related data and information, see the Centers for Disease Control and Prevention (CDC): www.cdc.gov. For life expectancy data and information regarding many other health-related variables worldwide, see the World Health Organization (WHO), especially the annual World Health Statistics publications: www.who.int/en/ and www.who.int/gho/publications/world_health_statistics/en/. A wide variety of climatological data (including last spring frost information) is available at the National Climatic Data Center (NCDC): www.ncdc.noaa.gov. The sample data set we use for last spring frost analysis in the southeast U.S. is provided by Darren Parnell of the Department of Geography and Geosciences at Salisbury University, Maryland. For information concerning population growth and change patterns, see the United States Census Bureau: www.census.gov. Specifically see the series of Census Briefs dealing with population-related issues: www.census.gov/prod/cen2010. Another excellent source for demographic data is the Population Reference Bureau (PRB): www.prb.org. Especially useful is their DataFinder, from which you can browse a variety of both U.S. and International topics: www.prb.org/DataFinder.aspx.
Geographic Data
Characteristics and Preparation

2.1 Selected Dimensions of Geographic Data


2.2 Levels of Measurement
2.3 Measurement Concepts
2.4 Basic Classification Methods

Before performing statistical processing and analysis we must first understand a number of characteristics of spatial data. One needs to understand how variables are organized as well as how data are arranged. In this chapter, basic concepts are introduced that provide the background information needed to characterize data before statistical techniques are applied.

Questions about data and variables arise early in the scientific research process. When identifying an appropriate geographic research problem and formulating meaningful hypotheses, you must make decisions about the sources of available data, the method of collecting data, and the variables to include in the analysis. In section 2.1, these dimensions of geographic decision-making are discussed.

Several measurement issues are considered before conducting any statistical analysis. Variables may be organized and displayed in various ways, and different levels of measurement are used depending on the geographic problem. The characteristics of nominal, ordinal, and interval-ratio measurement scales are reviewed in section 2.2. In section 2.3 we discuss potential measurement errors and address the issues of precision, accuracy, validity, and reliability.

Basic methods of data classification are reviewed in section 2.4, emphasizing the goals and purposes of classification, including its importance in geographic research. The two fundamental classification strategies of subdivision and agglomeration are defined and several specific operational methods or rules of classification are applied to a set of spatial data.

2.1 SELECTED DIMENSIONS OF GEOGRAPHIC DATA

In the scientific research process, questions about data arise almost immediately. When trying to identify an appropriate geographic research problem or formulate a hypothesis, questions like these usually emerge: What sources of data are available? Which method(s) of data collection should be used? What type of data will be collected and then analyzed statistically? The various aspects of a geographic problem must be carefully considered to ensure that when data are collected and analyzed, you can answer research questions effectively and reach meaningful conclusions. The dimensions of geographic data discussed here include sources of information, methods of data collection, and selected characteristics of data that distinguish geographic research problems.

A simple distinction is made between primary and secondary data sources. Primary data are acquired directly from the original source. If you are collecting primary data, then you are probably obtaining this information "in the field." Primary data collection is often quite time-consuming and generally involves making decisions about sample design to acquire a representative set of data.

Secondary (or archival) data are generally collected by some organization or government agency and can be used by the geographer. Because the data are already collected and are probably organized in an accessible, convenient form--such as a written report, DVD, database file, or download from the internet--secondary sources are generally less expensive and time-consuming to use than primary sources of data. In addition, many of the problems associated with sampling and survey design may not be experienced with archival sources. Secondary sources are often very comprehensive, including a census or total enumeration from a very large population. Duplicating these efforts with primary data collection is virtually impossible for the researcher.

Potential difficulties can also occur with secondary data sources, such as the data being improperly collected, organized, or summarized. Errors may have occurred in the editing and collating of data, especially if the information was obtained from a number of different original sources. Information from secondary sources is not always measured properly, resulting in other potential problems. Finally, an old axiom states that data immediately become out-of-date the moment they are collected.

Several basic procedures are used to collect geographic data. If primary data are necessary for the study, sampling will almost certainly be part of the data collection process and may require the design of survey questionnaires. The options for primary data collection include direct observation, field measurement (especially in physical geography research), mail questionnaires, personal interviews, and telephone interviews.

To select the appropriate method of data collection, you must evaluate the nature of the research problem very carefully. Even if a suitable method is selected, problems in survey design are common. When using a survey to collect data, each question must be properly worded, all possible responses to a question must be considered in advance, and the sequence of questions must be determined. In fieldwork, logistic problems often occur, or special arrangements are necessary. Preliminary site reconnaissance may not reveal all the difficulties. A more detailed discussion of data collection methods is found in chapter 7, "Basic Elements of Sampling."

Other dimensions or characteristics of data help distinguish geographic research problems. Some studies are considered explicitly spatial because the locations or placement of the observations or units of data are themselves directly analyzed. For example, a geographer responsible for selecting potentially profitable locations for a new retail store might calculate the "center of gravity" or average location of certain "target" households in the area. In another application, a biogeographer might analyze the spatial pattern of a sample of diseased trees in a national forest to determine whether these trees are randomly distributed throughout the area or clustered in certain places. In both of these examples, the data are spatially explicit because the locations of the observations or units of data are analyzed directly. An important set of spatial statistics is used to investigate these problems. Later in the book, a variety of descriptive spatial statistics are discussed (chapter 4), and in chapters 13-15 issues such as autocorrelation are introduced and several inferential spatial statistics are applied to point and area patterns to test for randomness and measure key aspects of spatial distributions.

Other geographic studies are implicitly spatial. An implicitly spatial situation exists when the observations or units of data represent locations or places, but the locations themselves are not analyzed directly. For example, a geomorphologist might wish to determine if significant differences occur in alluvial fan development in two different basins in the basin-and-range region of the American Southwest. A random sample of alluvial fans from each of the basins could be taken and relevant aspects of alluvial fan development compared. In a suburban neighborhood, a geographer may be investigating the relationship between the assessed valuation of homes and their ages. In both of these examples, the observations (alluvial fans in a basin or homes in a suburban neighborhood) obviously have locations on the earth's surface, but the locational pattern itself is not under scrutiny. Looking ahead, a two-sample difference test might be appropriate in the alluvial fan study (chapter 10) while a correlation analysis may be needed to study the suburban housing question (chapter 16).

Another important dimension of geographic research is the contrast between individual-level and spatially aggregated data sets. In some geographic problems each data value represents an individual element or unit of the phenomenon under study. In other problems, each "value" entered into the statistical analysis is a summary or spatial aggregation of individual units of information for a particular place or area. The best way to see this distinction is by discussing an example. Suppose a population geographer is researching current fertility patterns in Nigeria. One approach would be to collect a set of individual-level data, perhaps through personal interviews of a random sample of Nigerian women. Another possible approach would be to obtain birth rate estimates from officials in each of Nigeria's administrative divisions (21 states and 1 territory) and use these 22 spatially aggregated values as data units to estimate the nationwide fertility pattern.

Using spatially aggregated data raises special issues. Geographers must always be extremely cautious when trying to transfer results or apply conclusions "down" from larger areas to smaller areas or from smaller areas to individuals. If conclusions are derived from the analysis of data spatially aggregated for large areas, it may not be valid to reach conclusions about smaller areas or individuals. In the study of Nigerian fertility patterns, for example, valid spatially aggregated conclusions can be drawn

about the degree of acceptance or rejection of family-planning programs in each of these states by using birth rate estimates from the administrative divisions. However, taking these aggregate conclusions down to the level of the individual family will almost certainly result in deductive errors. Even in the Nigerian state with the lowest birth rate, a number of families will not be practicing any form of birth control. This invalid transfer of conclusions from spatially aggregated analysis to smaller areas or to the individual level is known as the ecological fallacy.

Conversely, taking individual-level data and aggregating it to larger spatial units is generally not a problem. In fact, Nigerian officials probably collected data from individuals and selected villages, then aggregated that information to obtain state-level estimates. Some of the effects of level of spatial aggregation on descriptive statistics are discussed in chapter 3.

Variables in a data set are characterized as either discrete or continuous. A discrete variable has some restriction placed on the values the variable can assume. A continuous variable has an infinitely large number of possible values along some interval of a real number line. In general, discrete data are the result of counting or tabulating the number of items, and potential values are limited to whole integers. Continuous data are the result of measurement, and values can be expressed as decimals. Examples of discrete variables include the number of households in a county with cell phones, the number of immigrants currently living in a city, the number of survey respondents in favor of a local bond issue, and the number of active volcanoes in a country. Examples of continuous variables include inches of precipitation at a weather station collected over a period of time, total area under irrigation in a country, distance traveled by a family on its annual vacation, and average wind speed at the summit of a mountain.

The "rounding-off" of data values must not be confused with the distinction between discrete and continuous data. For example, the elevation of a mountain is almost always expressed to the nearest foot or meter. Representation of elevation as a whole number may give the impression that this variable is discrete. However, since elevation can be measured more precisely than the nearest foot or meter, it is considered a continuous variable.

When discussing probability distributions, the distinction between discrete and continuous data is important. Geographic problems with discrete variables often require the application of different probability distributions than problems with continuous variables. Several practical geographic examples of both discrete and continuous distributions are presented in chapters 5 and 6.

Variables in a set of data are either quantitative or qualitative. If a variable is quantitative, the observations or responses are expressed numerically; that is, units of data are assigned numerical values. On the other hand, if a variable is qualitative, each observation or response is assigned to one of two or more categories. Suppose an agricultural geographer asked 80 farmers to identify their primary cash crop and received the following responses: 43 farmers said corn, 28 said wheat, and 9 said barley. It might be tempting to conclude that these are quantitative data, since 43, 28, and 9 are clearly numerical values. However, the variable responses (corn, wheat, barley) are nonnumeric. The values of 43, 28, and 9 are not the raw data, but rather frequency counts of observations assigned to the nonnumerical categories, making this an example of qualitative (or categorical) data. Other examples of qualitative variables are type of land use, gender (male or female), political party affiliation, religious preference, and climate type. Special types of statistical tests are discussed in chapter 12 to handle qualitative (categorical) data.

2.2 LEVELS OF MEASUREMENT

Just as the organization of variables determines how data can be analyzed statistically, so too does the related issue of variable measurement. That is, the level of measurement of units of data is considered when selecting an appropriate statistical technique to solve a geographic problem. Several different levels of measurement are described (table 2.1), and different statistical procedures are appropriate to each.

Nominal Scale

The simplest scale of measurement for variables is the assignment of each value or unit of data to one of at least two qualitative classes or categories. In nominal scale classification of variables, each category is given some name or title, but no assumptions are made about any relationships between categories, only that they are different. Values are "different" if they are assigned to different categories or "similar" if assigned to the same category. Thus, problems using variables placed on a nominal scale are considered categorical (qualitative).

An urban planner could assign each parcel of land in a city to one of several nominal land use categories (residential, commercial-retail, industrial, recreation-open space, etc.) without inferring that residential land use is "greater than" recreation-open space or "less than" industrial. In fact, the only necessary conditions for a proper nominal scale classification of variables are that the categories are exhaustive (every value or unit of data can be assigned to a category) and mutually exclusive (it is not possible to assign a value to more than one category because the categories do not overlap). Oftentimes, cartographers utilize an "other" category to ensure that the categories are exhaustive and mutually exclusive.

Geographers create nominal variables in many ways: individuals are classified by religious affiliation (Baptist, Catholic, Methodist, Presbyterian, etc.) or
24 Part I ▲ Basic Statistical Concepts in Geography
TABLE 2.1

Summary: Levels of Measurement

Nominal: Each value or unit of data is assigned to one of at least two categories or qualitative classes; no assumptions are made about relationships between categories, only that they are "different."

Ordinal: Values themselves are placed in some rank order.

  Strongly ordered: Each value or unit of data is given a particular position in a rank order sequence; that is, each value is assigned its own particular rank.

  Weakly ordered: Each value or unit of data is assigned to a category, and the categories are then rank ordered.

Interval: Each value or unit of data is placed on a measurement scale, and the interval between any two units of data on this scale can be measured; the origin or zero starting point is assigned arbitrarily (that is, the origin does not have a "natural" or "real" meaning).

Ratio: Each value or unit of data is placed on a measurement scale, and the interval between any two units of data on this scale can be measured; the origin or zero starting point is "natural" or nonarbitrary, making it possible to determine the ratio between values.
political party (Democrat, Republican, Independent); cities are classified by primary economic function (manufacturing, retail, mining, transportation, tourism, etc.); counties are organized by primary type of home heating fuel (fuel oil, utility gas, coal, wood, electricity, etc.); and countries are distinguished by predominant language family (Indo-European, Afro-Asiatic, Sino-Tibetan, Ural-Altaic, etc.).

If a variable has only two categories, a special subset of nominal classification is used. This sort of dichotomous (binary) assignment is used when no greater degree of qualification is necessary or possible. For example, each person could be assigned a "1" if they attend a private elementary school and a "0" if they do not. Although this type of information is clearly limited, statistical analysis on the frequency counts of the number of values or individuals in the categories can be conducted. Many geographic variables have only "yes-no" or "presence-absence" data available.

Variables that are organized nominally limit the types of numerical analysis that can be applied. Nevertheless, appropriate statistical techniques exist and are presented in chapter 12.

Ordinal Scale

The next higher level of measurement involves placement of values in rank order to create an ordinal scale variable. The relationship between observations takes on a form of "greater than" and "less than." With data in rank order, more quantitative distinctions are possible than with nominal (qualitative) scale variables.

Geographers can easily identify examples of ordinal scale variables. An important distinction needs to be made, however, between a strongly-ordered ordinal variable and a weakly-ordered ordinal variable. When each value or unit of data is given a particular position in a rank order sequence, the variable is considered strongly-ordered. The city ranking schemes that occasionally appear in newspapers or magazines, such as the "10 best places to live" or "50 best American cities," are typical examples of this sort of survey. A popular publication is the Places Rated Almanac, which provides a ranking of U.S. cities based on the aggregate compilation of many variables. Since each city is assigned its own particular rank, these "preference rankings" are examples of strongly-ordered ordinal variables. Other examples of strongly-ordered variables include the ranking of countries by gross national product per capita or the ranking of states in terms of dollars spent per resident on higher education. Realize, however, that even strongly-ordered data are limited in the assumptions that can be made. For example, the fact that a city is ranked second in a list of 100 best places to live does not imply that the city is twice as good as a city ranked fourth.

By contrast, in a weakly-ordered variable, the values are placed in categories, and the categories themselves are rank ordered. Suppose one is constructing a choropleth map showing the percentage of population change in each county of the United States from 2005 to 2010. To depict the population change cartographically, suppose six ordinal categories are selected (15% increase or more, 12% to 14.9% increase, and so on), and each of the more than 3,000 counties nationwide is assigned to one of these six categories. When frequency counts of counties are made for each category, the variable is weakly- rather than strongly-ordered. It is "weak" or "incomplete" in the sense that two counties assigned to the same category on the map cannot be distinguished, even though in reality the counties almost certainly have different population change values.
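The distinction between strongly- and weakly-ordered variables can be sketched in a few lines of code. The county names and percent-change values below are hypothetical, and the class limits follow the choropleth example above:

```python
# Hypothetical percent population change for six counties
change = {"A": 18.2, "B": 3.1, "C": 13.5, "D": -2.4, "E": 13.5, "F": 7.9}

# Strongly ordered: each county receives its own rank (1 = largest change)
ordered = sorted(change, key=change.get, reverse=True)
ranks = {county: position + 1 for position, county in enumerate(ordered)}

# Weakly ordered: counties fall into rank-ordered classes; members of the
# same class cannot be distinguished from one another
def change_class(pct):
    if pct >= 15.0:
        return "15% increase or more"
    if pct >= 12.0:
        return "12% to 14.9% increase"
    return "below 12% increase"

classes = {county: change_class(pct) for county, pct in change.items()}
```

Counties C and E receive distinct positions in the strongly-ordered ranking but land in the same weakly-ordered class, illustrating the "incompleteness" described above.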
As with nominally scaled variables, specific statistical tests are designed to deal with both strongly- and weakly-ordered ordinal variables. Some of these techniques will be discussed in chapters 10, 11, and 12.

Interval and Ratio Scales

With variables measured on either an interval or ratio scale, you can determine the magnitude of the difference between values. That is, the length of an interval between any two units of data can be measured on the scale. This means that not only is the relative position of each value known (a value is above or below another), but also known is the magnitude of difference of each observation from all other observations on the measurement scale.

Interval and ratio measurement scales are distinguished by the way in which the origin or zero starting point is determined. With interval scale measurement, the origin or zero starting point is assigned arbitrarily. The Fahrenheit and Celsius scales used in the measurement of temperature are two widely known interval scales. The placement of the zero degree point on both of these measurement scales is arbitrary. With the Fahrenheit scale, the zero degree mark is the lowest temperature attained with a mixture of ice, water, and common salt, whereas with the Celsius scale the zero degree point corresponds to the melting point of ice.

In ratio scale measurement, by contrast, a natural or non-arbitrary zero is used, making it possible to determine the ratio between values. If Montreal, Canada receives 40 inches of annual precipitation and Chihuahua, Mexico receives only 10 inches, the ratio between these two measures is easily calculated (40/10 = 4). Furthermore, it is correct to conclude that Montreal received four times as much precipitation as Chihuahua. Zero inches of rainfall represents a natural or non-arbitrary zero.

It should be noted that "ratio-type" statements cannot be made with interval-scale variables. For example, since zero is an arbitrary value on the Fahrenheit scale, 60 degrees Fahrenheit cannot be considered twice as warm as 30 degrees. Many other variables of interest to geographers are measured on a ratio scale, including distance, area, and such demographic and socioeconomic variables as infant mortality rate and median family income.

Observations from the same variable can be expressed at different measurement scales, depending on how they are measured, organized, and displayed. For example, a state resource planner interested in the type of energy used in homes across the state could use data collected and measured in several different ways. Data could be collected at the individual household level and organized nominally by the primary type of energy used (number of homes in the state whose principal energy source is coal, utility gas, fuel oil, etc.). Alternatively, data might be available as county-level summaries. Counties in the state could then be arranged in a strongly-ordered sequence by the percent of households in each county using natural gas (with the county having the highest percentage of households using natural gas assigned rank one, and so on). A choropleth map of the county values could display a weakly-ordered graphic representation of the percentage of households using natural gas. As yet another alternative, the number of households per county using natural gas could be organized and displayed on a ratio scale.

2.3 MEASUREMENT CONCEPTS

As proficiency improves when working with descriptive and inferential statistics, it becomes tempting to believe that the analysis is truly error free and that the geographic research problem is solved. Results from statistical analysis often seem very exact and definitive. If the data are analyzed on the computer, the results are nicely displayed and the same outcome is obtained if the data are resubmitted. It may seem appropriate to accept the answers as error-free automatically. However, an error-free result cannot be guaranteed just because a set of data is submitted correctly into some statistical software package. In fact, several interrelated sources of measurement error can operate separately or in combination to produce problems for the geographer.

Precision

Precision refers to the level of exactness associated with measurement. Precision is often associated with the calibration of a measuring instrument, such as a rain gauge. As a frontal system moves through an area, suppose that the amount of rainfall is recorded by two standard rain gauges having different calibration systems. On the coarsely calibrated gauge, the amount of rainfall might be estimated as somewhere between 1.2 and 1.3 inches. However, the more finely calibrated gauge provides a more precise estimate between 1.26 and 1.27 inches.

In many geographic problems, the issue of spurious precision must be considered. The computer (or calculator) output will often provide statistical outcomes with six or more decimal places, even when the input data are in integer form. Reporting seemingly precise statistics based on less precise input is a misleading but relatively commonplace occurrence. Unless confidence in such a level of measurement precision is warranted, it should be avoided.

Accuracy

The concept of accuracy refers to the extent of system-wide bias in the measurement process. It is quite possible for measurement to be very precise, yet inaccurate. Return to the rain gauge example for a moment. Suppose another more finely calibrated (more precise) rain gauge is used, but the gauge was not calibrated properly. The person reading the gauge estimates the amount of rainfall to be 1.19 inches rather than the actual rainfall total of about 1.26 or 1.27 inches, resulting in an inaccurate reading. Unfortunately, discovering systematic bias in a measurement instrument is often quite difficult.

To understand the relationship between precision and accuracy, consider this "target analogy" showing results from successive firings of a gun at a target (figure 2.1). In case 1, the five bullet holes are closely clustered (precise) and centered on the middle of the target (accurate), making this the best of the four alternatives presented. Cases 2 and 3 are both flawed: the inaccuracy of the bullet holes in case 2 and the imprecision of the holes in case 3 result in different types of errors. Case 4 appears to have the most severe problems, with an inaccurate systematic bias toward the upper left corner of the target as well as considerable scatter of bullet holes that also makes the results imprecise. Clearly, you must take care to distinguish between the interrelated concepts of precision and accuracy when working with any data.

FIGURE 2.1
The Measurement Concepts of Precision and Accuracy: The Target Analogy
[Four targets: case 1 precise and accurate; case 2 precise but inaccurate; case 3 imprecise but accurate; case 4 imprecise and inaccurate]

Validity

In many geographic problems, the spatial distribution or locational pattern being analyzed is the result of complex processes. In such situations it is understandably difficult to express the "true" or "appropriate" meaning of that concept through the measurement of any simple variable or set of variables. Validity addresses measurement issues related to the nature, meaning, or definition of a concept or variable. The discipline of geography is saturated with multifaceted, complex variables, such as "level of physical inactivity," "environmental quality," "economic well-being," "quality of education," "nutritious diet," and "quality of life." To express the true meaning of such concepts is often not possible, so we find it necessary to create operational definitions that can serve as indirect or surrogate measures. The question then becomes whether the operational definition is valid. For example, a geographer studying the spatial pattern of "quality of education" in a metropolitan area might evaluate elementary schools on the basis of "average student score on the California Achievement Test (CAT)" and evaluate high schools by "percentage of graduates who subsequently go to college." Clearly, the concept "quality of education" involves much more than what is reflected by these operational definitions, and their validity must be questioned. In this case, it should be asked whether "average CAT score" is a fully valid, somewhat valid, or invalid measure of "quality of education" received.

Admittedly, the degree of validity in a geographic problem may prove difficult (even impossible) to determine. Consequently, this question is often ignored and problems with validity are sometimes assumed away as being inconsequential. A good geographic study involving complex variables will discuss the degree of validity of any operational definitions used in the analysis.

Reliability

A final measurement concept of concern in many geographic problems is reliability. When data are collected over time or when changes in spatial patterns are analyzed over time, you should question the consistency and stability of the data. For example, if repeated or replicate samples of water are taken from the same set of locations over time, a consistent and well-defined method of sampling is necessary or the results of subsequent statistical analysis may be unreliable. Suppose you want to examine the spatial pattern of poverty across the United States: does it vary more today than it did 50 or 100 years ago? A reliable and consistent measure of poverty is needed to answer this question. It could be that poverty was defined differently 50 or 100 years ago.

Reliability problems often occur when using international data. Fully comparable and totally consistent methods of collecting data rarely exist from country to country. A developing nation has fewer resources (e.g., personnel, money) than one that is more developed, and sources of measurement error inevitably affect the data collection process. International comparative statistics are often unreliable. Even within the same country or region, locational variations in data collection and processing methods can render data unreliable. Data collection procedures can change drastically from one time period to the next, again resulting in unreliable data. Some of these issues were discussed in the life expectancy example in chapter 1.
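The rain-gauge discussion above can be made concrete with a short numeric sketch. Here bias (a systematic offset from the true value) stands in for accuracy, and the spread of repeated readings stands in for precision; the readings and the assumed "true" rainfall value are hypothetical:

```python
from statistics import mean, pstdev

true_rainfall = 1.265  # assumed "true" rainfall, in inches

# Hypothetical repeated readings from two gauges
precise_inaccurate = [1.19, 1.19, 1.20, 1.19, 1.19]   # miscalibrated gauge
imprecise_accurate = [1.10, 1.43, 1.25, 1.31, 1.21]   # crude but unbiased

def bias(readings):
    """Accuracy: systematic offset of the readings from the true value."""
    return mean(readings) - true_rainfall

def spread(readings):
    """Precision: variability among the repeated readings."""
    return pstdev(readings)

# The first gauge is precise (small spread) but inaccurate (large bias),
# like case 2 of the target analogy; the second is the reverse, like case 3.
```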
One way to assess the degree of reliability of a measurement instrument is to compare at least two applications of the data collection method used at different times. This "test-retest" procedure is one way to evaluate the reliability of IQ test scores, medical diagnoses, and SAT scores, for example. Reliability checks need to be used more frequently in behavioral geography, particularly when analyzing the results from a survey or questionnaire that contains individual attitudes and opinions, which may be uncertain and subject to frequent change.

2.4 BASIC CLASSIFICATION METHODS

Although we have already introduced categories and classification, we have not examined the methods describing how to classify data into categories and reasons why one method of classification is better than another. In this section, the purposes and importance of classification in geographic research are explained and basic classification methods are reviewed.

Geographers regularly face the problem of deciding how to classify or group spatial data. Classification is used for several important reasons. Classification schemes organize, simplify, and generalize large amounts of information into effective or meaningful categories, bringing relative order and simplicity to complexity. As a result, communication is enhanced, detailed spatial information is better understood, and complex spatial patterns are represented more clearly. Maps created with properly classified data result in more effective graphic communication. Classification is also an integral part of the scientific research process, helping in the formation of hypotheses and guiding further investigation.

In classification, values are organized according to their degree of similarity. That is, similar values are generally placed in the same category and dissimilar values are generally placed in different categories. The result is a grouping of data that seems to minimize the amount of fluctuation or dispersion of values within the same category and maximizes the dispersion of values between different categories. Whatever the specific method of classification, however, the resulting categories should be both mutually exclusive and exhaustive.

Many specific classification methods are available, and each method approaches these general goals from a slightly different perspective. No matter what specific method of classification is used, some information is invariably lost when large amounts of information are simplified and generalized. Information is also lost if individual-level values are spatially aggregated and only the aggregated data are available. Similarly, information is lost if values are classified and only the classified data are available. In fact, spatially aggregated data are simply individual-level data classified by location.

At the most fundamental level, classification uses one of two conceptual strategies. The first of these is subdivision (sometimes called logical subdivision). At the start of the subdivision process, all units of data in a population are grouped together. Then, through a series of steps or iterations, individual values are allocated to an appropriate subdivision using carefully defined criteria. This strategy "works down" by disaggregating all values into logically subdivided classes. Most practical geographic examples of subdivision are hierarchical, with multiple levels of subdivision, depending on the problem or situation. A clear and consistent set of rules is always needed to assign values to the proper category at each stage of the subdivision procedure. The characteristics or values associated with each category are defined before the classification procedure begins.

The subdivision strategy of classification is illustrated with two examples. The U.S. Geological Survey leads a

TABLE 2.2

A Portion of the National Land Cover Database System

1. Water: 11 Open Water; 12 Perennial Ice/Snow
2. Developed: 21 Low Intensity Residential; 22 High Intensity Residential; 23 Commercial/Industrial/Transportation
3. Barren: 31 Bare Rock/Sand/Clay; 32 Quarries/Strip Mines/Gravel Pits; 33 Transitional
4. Forested Upland: 41 Deciduous Forest; 42 Evergreen Forest; 43 Mixed Forest
5. Shrubland: 51 Shrubland
6. Non-Natural Woody: 61 Orchards/Vineyards/Other
7. Herbaceous Upland Natural/Semi-Natural Vegetation: 71 Grasslands/Herbaceous
8. Herbaceous Planted/Cultivated: 81 Pasture/Hay; 82 Row Crops; 83 Small Grains; 84 Fallow; 85 Urban/Recreational Grasses
9. Wetlands: 91 Woody Wetlands; 92 Emergent Herbaceous Wetlands

Source: United States Geological Survey (USGS), USGS Land Cover Institute
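The hierarchy in table 2.2 lends itself to a simple subdivision rule: the leading digit of a Level II code identifies its Level I class. A minimal sketch (the function name is ours, not part of the NLCD itself):

```python
# NLCD Level I classes from table 2.2, keyed by leading digit
LEVEL_I = {
    1: "Water", 2: "Developed", 3: "Barren", 4: "Forested Upland",
    5: "Shrubland", 6: "Non-Natural Woody",
    7: "Herbaceous Upland Natural/Semi-Natural Vegetation",
    8: "Herbaceous Planted/Cultivated", 9: "Wetlands",
}

def level_one(level_two_code):
    """Subdivision rule: the leading digit of a two-digit Level II code
    (e.g., 42 Evergreen Forest) places it in its Level I category."""
    return LEVEL_I[level_two_code // 10]

print(level_one(42))  # Forested Upland
print(level_one(82))  # Herbaceous Planted/Cultivated
```

Because every Level II code begins with the digit of exactly one Level I class, the rule yields categories that are mutually exclusive and exhaustive, as required above.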
partnership of Federal agencies that has created a Multi-Resolution Land Characteristics (MRLC) Consortium. One of the many products of this Consortium is the National Land Cover Database (NLCD) classification system, which serves as the definitive Landsat-based, 30-meter resolution, land cover database for the Nation. The Level I and Level II classes are shown in table 2.2. This classification system supports a wide variety of applications that seek to assess ecosystem status and health, understand the spatial patterns of biodiversity, predict effects of climate change, and develop land management policy.

The North American Industry Classification System (NAICS) can also be used to illustrate subdivision. Applied throughout the United States, Canada, and Mexico, this hierarchical system classifies establishments (businesses) by what they produce or supply and allows for comparability of data among the three trading partners of the North American Free Trade Association (NAFTA). In the United States, data on business activities are summarized and reported by NAICS groupings every five years in the Economic Census. The NAICS classification subdivides all production or service activity according to numerical code, ranging from a two-digit general economic sector grouping at the top level of the hierarchy to a more specific six-digit breakdown of activity at the lowest level. A small portion of NAICS is shown in table 2.3.

The second conceptual classification strategy is agglomeration. With this general approach, each observation in a population or data set is separate and distinct from others at the start of the classification process. The agglomeration procedure then "works up" by allocating values into classes according to well-defined grouping criteria. Agglomeration is accomplished when similar values are combined into the same category and dissimilar values are placed in different categories. The agglomeration method of grouping is the conceptual opposite of subdivision. This classification strategy is very important in the geographic research process, and data for many geographic problems are summarized numerically or graphically using agglomeration.

Beyond these general conceptual strategies of logical subdivision and agglomeration, a variety of specific operational procedures or rules are applied in practical classification. These operational methods are not necessarily "pure" logical subdivision or agglomeration, but may contain elements of both. The remainder of this section focuses on the first four of the six simple operational methods or rules of classification that can be applied in geography (table 2.4). The standard deviation breaks and Jenks natural breaks methods will be discussed in the next chapter.

1. Equal intervals based on range. The range is simply the difference in magnitude between the largest and smallest values in an interval-ratio set of data. To determine class breaks (the values that separate one class from another), the range is divided into the desired number of equal-width class intervals. The procedure is

TABLE 2.3

A Portion of the North American Industry Classification System (NAICS)

2-digit sectors:
11 Agriculture, Forestry, Fishing and Hunting
21 Mining, Quarrying, and Oil and Gas Extraction
22 Utilities
23 Construction
31-33 Manufacturing
42 Wholesale Trade
44-45 Retail Trade
48-49 Transportation and Warehousing
51 Information
52 Finance and Insurance
53 Real Estate and Rental and Leasing
54 Professional, Scientific, and Technical Services
55 Management of Companies and Enterprises
56 Administrative and Support and Waste Management and Remediation Services
61 Educational Services
62 Health Care and Social Assistance
71 Arts, Entertainment, and Recreation
72 Accommodation and Food Services
81 Other Services (except Public Administration)
92 Public Administration

Example subdivision within 31-33 Manufacturing:
311 Food Mfg.
312 Beverage and Tobacco Product Mfg.
  3121 Beverage Mfg.
    31211 Soft Drink and Ice Mfg.
    31212 Breweries
    31213 Wineries
    31214 Distilleries
  3122 Tobacco Mfg.
313 Textile Mills
314 Textile Product Mills
315 Apparel Mfg.

Source: United States Bureau of the Census
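Because NAICS codes nest by numeric prefix, a detailed industry can be rolled up to each of its broader groupings simply by truncating digits. A minimal sketch (the helper function is ours, and codes are treated as strings):

```python
def parent_codes(code):
    """All broader NAICS groupings of a code, from the 2-digit sector
    downward, obtained by truncating the numeric prefix."""
    return [code[:n] for n in range(2, len(code))]

# 31212 Breweries rolls up through 3121 Beverage Mfg. and
# 312 Beverage and Tobacco Product Mfg. to sector 31 (within
# the 31-33 Manufacturing grouping of table 2.3)
print(parent_codes("31212"))  # ['31', '312', '3121']
```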
TABLE 2.4

Brief Description of Selected Single-Variable Classification Methods

Equal intervals based on range: Class breaks determined by dividing the range (difference between the lowest- and highest-valued units of data) into the desired number of equal-width class intervals.

Equal intervals not based on range: Based on convenience or practical considerations; rounded-off class breaks and class interval widths arbitrarily selected.

Quantile breaks: Equally divide the total number of values into the desired number of classes; two commonly used divisions are quartiles (4 categories) and quintiles (5 categories).

Natural breaks (Single Linkage Method): Place units of data in rank order, identify "natural breaks" or separations between adjacent ranked values, and locate class breaks in the largest of these natural breaks. Iterative process if the single-linkage version is used, with the largest natural break selected as the first class break location, the next largest natural break selected second, and so on until the desired number of classes is created.

Standard Deviation Breaks*: After calculating the mean and standard deviation of all the observations, class breaks are determined by dividing the data at rounded-off standard deviation intervals (z-values) such as 1.0 or 0.5. This method is often used when a data set is normally distributed.

Natural Breaks (Jenks Method)*: Class breaks are determined by "natural" groupings inherent in the data. This is an iterative process that attempts to minimize the variation of observations within classes and maximize the variation between classes.

* Discussion of this classification method is delayed until Chapter 3, in which the terms mean, standard deviation, and variance are formally defined.

easy to use and results in class intervals of equal width, which may be an advantage for some applications. However, because all class breaks are derived from the two units of data with the most extreme values, results are sometimes misleading. To maintain the equal width of each class, the class break numbers are not usually rounded-off. The number of values in each category may also vary considerably. This may be an advantage or a disadvantage, depending on the purposes of your classification and the goals of the analysis.

2. Equal intervals not based on range. This method of classification also designates class breaks to create equal interval classes, but the exact range is not used to select the class breaks. Instead, a convenient or practical interval width is selected arbitrarily, based on rounded-off class break values. Units of data are then assigned to the categories. This method of classification is preferred for constructing a frequency distribution, histogram, or ogive to represent the data graphically (as we will see in the next chapter). Many institutions and government agencies use this method to map complex spatial patterns. The convenient class break values generally result in maps that are easy to understand and interpret. The number of values in each category could vary widely, which may be either an advantage or disadvantage, depending on your classification goals.

3. Quantile breaks. This method approaches classification from a somewhat different perspective. The total number of values is divided as equally as possible into the desired number of classes. Two frequently used alternatives are the division of data into quartiles (four categories) or quintiles (five categories). The allocation of an equal number of values to each category is often an advantage in choropleth (area pattern) mapping, particularly if an approximately equal area on the map is desired for each category. However, the possible disadvantages with quantile breaks classification should be evaluated before deciding to use this method. Class breaks are frequently not convenient rounded-off values, and class interval widths are almost never equal for different categories. If a large number of observations are clustered relatively close together (a frequent occurrence with geographic data), these similarly sized values are likely to be "split unnaturally" by a class break to keep an equal number of values in each category.

4. Natural breaks. Yet another perspective is the natural breaks method of classification. There are many methods of natural breaks, but the most elementary is known as the single linkage approach. The logic is to identify natural breaks in the data and separate values into different classes based on these breaks. The process is done iteratively, with the largest gap or separation between adjacent values on a number line selected as the first class break location. The next largest gap is selected second, and so on until the desired number of classes has been created. With this classification process, similar
30 Part I .A. Basic Statistical Concepts in Geography

values are kept together in the same category, dissimilar values are separated into different categories, and gaps in the data are incorporated directly in the grouping procedure. This method will highlight extreme values, often placing unusually large positive or negative observations into their own unique categories. Depending on your research problem, highlighting extreme values may (or may not) be a primary goal of the classification. Another common consequence of natural breaks is the clustering of large numbers of values into one or two categories. Again, you must decide whether this is an advantage or disadvantage.

To illustrate how each of these methods works, 2009 state-level adult obesity rates are classified. Even a simple list of 51 ranked obesity rates (Washington, DC is included) is too detailed and cumbersome for direct study and interpretation (table 2.5). This is particularly true if one goal is to illustrate the spatial pattern of obesity in a choropleth map, with each state allocated to one of several classes.

Selecting the ideal number of classes is very important in choropleth mapping. If data are allocated into too few categories, key details in the spatial pattern are likely to be lost. On the other hand, if too many categories are used, the map reader becomes overwhelmed with detail and could miss certain important generalizations of the spatial pattern. Your decision on the number of classes to use always involves the trade-off between effective generalization and communication of sufficient detail. Five classes are used in this obesity example; many cartographers consider this number reasonable for choropleth mapping of geographic patterns.

Each simple method of classification is applied to the percent of adults defined as obese for each area. For consistency of discussion and interpretation, exactly five classes and four class breaks are used with each method. The results are shown on a series of choropleth maps (figure 2.2, cases 1-4). On each choropleth map the class intervals and class breaks derived from application of each classification scheme are shown in the legend.

Application of the various classification methods results in dramatically different obesity pattern maps. It is important to realize that such startling visual differences on choropleth maps are expected from one method of classification to another, even when all the maps are created from the same data set. These sharp visual contrasts do not mean that any of the classification methods are inaccurate or biased. Rather, this situation illustrates the distinctiveness of the goals in classification. It seems that very different impressions regarding the "reality" of the situation emerge with application of different classification methods.

Perhaps the greatest visual contrast exists between the maps of quantile breaks and natural breaks (figure 2.2, cases 3 and 4). The objective in using quantile breaks is to

TABLE 2.5
Ranked Obesity Levels by State, 2009 (in percent)

State Percent Obese State Percent Obese


Colorado 18.6 Illinois 26.5
Washington, DC 19.7 Delaware 27.0
Connecticut 20.6 Georgia 27.2
Massachusetts 21.4 Nebraska 27.2
Hawaii 22.3 Pennsylvania 27.4
Vermont 22.8 Iowa 27.9
Oregon 23.0 North Dakota 27.9
Montana 23.2 Kansas 28.1
New Jersey 23.3 Texas 28.7
Utah 23.5 Wisconsin 28.7
New York 24.2 Ohio 28.8
Idaho 24.5 North Carolina 29.3
Minnesota 24.6 South Carolina 29.4
Rhode Island 24.6 Indiana 29.5
Wyoming 24.6 Michigan 29.6
Alaska 24.8 South Dakota 29.6
California 24.8 Missouri 30.0
Virginia 25.0 Arkansas 30.5
New Mexico 25.1 Alabama 31.0
Florida 25.2 West Virginia 31.1
Arizona 25.5 Oklahoma 31.4
New Hampshire 25.7 Kentucky 31.5
Maine 25.8 Tennessee 32.3
Nevada 25.8 Louisiana 33.0
Maryland 26.2 Mississippi 34.4
Washington 26.4
• Obesity is defined as a body mass index (BMI) of 30 or greater. BMI is calculated from a person's weight and height and provides
a reasonable indicator of body fatness and weight categories.
Source: Centersfor DiseaseControland Prevention(CDC)
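Two of the simple methods described above can be sketched directly from the ranked values in table 2.5 (a sketch; the function names are our own, and the equal-interval breaks are shown unrounded beyond two decimals):

```python
# Equal intervals based on range, and the quantile-breaks allocation rule,
# applied to the 51 ranked state-level obesity rates of table 2.5.
RATES = [
    18.6, 19.7, 20.6, 21.4, 22.3, 22.8, 23.0, 23.2, 23.3, 23.5,
    24.2, 24.5, 24.6, 24.6, 24.6, 24.8, 24.8, 25.0, 25.1, 25.2,
    25.5, 25.7, 25.8, 25.8, 26.2, 26.4, 26.5, 27.0, 27.2, 27.2,
    27.4, 27.9, 27.9, 28.1, 28.7, 28.7, 28.8, 29.3, 29.4, 29.5,
    29.6, 29.6, 30.0, 30.5, 31.0, 31.1, 31.4, 31.5, 32.3, 33.0,
    34.4,
]

def equal_interval_breaks(values, k):
    """Class breaks dividing the exact range into k equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [round(lo + width * i, 2) for i in range(1, k)]

def quantile_class_sizes(n, k):
    """Allocate n observations as evenly as possible across k classes."""
    return [n // k + (1 if i < n % k else 0) for i in range(k)]

print(equal_interval_breaks(RATES, 5))      # [21.76, 24.92, 28.08, 31.24]
print(quantile_class_sizes(len(RATES), 5))  # [11, 10, 10, 10, 10]
```

Note how neither result lands on convenient values: the equal-interval breaks are not round numbers at the data's tenth-of-a-percent precision, and 51 areas cannot be split into five equal groups. Both of these operational difficulties are discussed below for cases 1 and 3.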
Chapter 2 .A. Geographic Data: Characteristics and Preparation 31

[Figure 2.2, case 1: choropleth map of 2009 obesity rates classified by equal intervals based on range. Legend classes: 18.6 to 21.7; 21.8 to 24.9; 25.0 to 28.1; 28.2 to 31.2; 31.3 to 34.4]

[Figure 2.2, case 2: choropleth map of 2009 obesity rates classified by equal intervals not based on range. Legend classes: 17.0 to 20.9; 21.0 to 24.9; 25.0 to 28.9; 29.0 to 32.9; 33.0 to 36.9]

FIGURE 2.2
Percent of Adults Considered Obese, by State, 2009: Several Classification Methods
Source: Centers for Disease Control and Prevention (CDC), 2010

[Figure 2.2, case 3: choropleth map of 2009 obesity rates classified by quantile breaks. Legend classes: 18.6 to 24.1; 24.2 to 25.4; 25.5 to 27.3; 27.4 to 29.5; 29.6 to 34.4]

[Figure 2.2, case 4: choropleth map of 2009 obesity rates classified by natural breaks (single linkage method). Legend classes: 18.6; 19.7; 20.6 to 21.4; 22.3 to 33.0; 34.4]

FIGURE 2.2 (continued)
Percent of Adults Considered Obese, by State, 2009: Several Classification Methods
Source: Centers for Disease Control and Prevention (CDC), 2010

allocate an equal number of values to each category. Applying quantiles, Mississippi, the state with the highest percent of adults considered obese (34.4%), is grouped with many other states. The overall impression from the quantile map is that high obesity rates can be found in many parts of the South, Midwest, and Appalachia, and there is no particular visual focus on Mississippi.

By contrast, the goal in natural breaks is to separate those places that are very different in value from all other places. The mapped result is sharply different from quantiles, as Mississippi's high obesity rate is highlighted graphically. In fact, the dominant impression from the natural breaks map is that most of the United States has comparable obesity rates, with the exception of a few extreme observations such as Mississippi, Colorado, Connecticut, and Massachusetts.

In this example problem, minor operational difficulties are encountered with each of the simple methods of classification. In addition, elements of subjectivity seem to enter the classification process no matter which procedure we use, and trade-offs or compromises are sometimes necessary. One limitation is that the obesity data are expressed in tenths of a percent. Of course, this is a reasonably precise degree of measurement, but it causes application problems with some methods of classification. As a result of this precision level, some ties exist in the data set. For example, three states are reported with a 24.6% adult obesity rate. Obviously these three states do not all have an exact rate of 24.6%, so they are not really tied, but the full reality of the situation is not known. The Centers for Disease Control recognizes that more precise estimates of obesity rates (say, to the nearest hundredth) could be provided, but they would create meaningless, spurious precision.

The first method (figure 2.2, case 1) is equal intervals based on range. The range from lowest obesity in Colorado to highest obesity in Mississippi is 15.8%, and five exactly equal interval categories cannot be created at this precision level over a range of 15.8 units of data without modifying the data in some way. The closest one can get to equal intervals is the classification shown in the map legend, where the first four categories have an interval of 3.2 units, and the last category has an interval of 3.1 units (see the legend in figure 2.2, case 1).

The second method (case 2) is equal intervals not based on range. Again, without creating spurious precision it is not possible to create five exactly equal interval categories where the lower bound of the first category is 18.6% and the upper bound of the fifth category is 34.4%. The solution shown on the map uses a class interval of 4.0% obesity, with the first category "anchored" on 17.0% and the last category on 36.9%. While these class breaks are convenient, a disconcerting feature is that no state has an obesity rate as low as 17.0% or as high as 36.9%. If the class interval width is changed from 4.0% or the number of categories is changed to something other than five, a variation of this same problem will still exist. Nevertheless, this is a fully valid and objective application of equal intervals not based on range, working within the constraints of the data.

The third method (case 3) is quantile breaks. Tied observations create a minor problem here as well. It is not possible to allocate exactly ten states to each of the five categories without splitting multiple states that are tied. Also, the inclusion of Washington, DC, makes such a perfect division impossible. A valid operational rule would be to minimize the disparities between groups, creating a classification that is as close as possible to having an equal number of observations in each category. One such allocation is shown on the map, where the maximum number of areas in any category is 11 and the minimum number is 10 (figure 2.2, case 3).

The final method (case 4) is the natural breaks, single linkage method. Here, the lack of precision once again generates a problem. The single largest natural break in the data set is 1.4% obesity units, separating Mississippi (34.4%) from Louisiana (33.0%). If the classification process were to stop at this juncture, there would be no problem, but the resultant map would only have two categories-one category of size 50 containing all places with obesity rates of 33.0% or less, and another category of size 1 containing only Mississippi. A two-category choropleth map will reveal almost none of the actual spatial pattern and is clearly not the best strategy. Also, a condition in this problem was to create classification schemes having exactly five categories, so this objective would also not be attained.

Proceeding iteratively using the single linkage approach, the next largest natural break in the data set is 1.1 units, separating Colorado (18.6%) from Washington, DC (19.7%). The next largest natural break is 0.9 units, and this occurs twice in the data set-the separation of Washington, DC (19.7%) from Connecticut (20.6%) and the separation of Massachusetts (21.4%) from Hawaii (22.3%). After this iteration, exactly five distinct categories are created, as shown in figure 2.2, case 4.

A much better way of showing this entire step-by-step process of natural breaks single linkage is graphically with a dendrogram (figure 2.3). Many statistical software packages are now capable of creating nice-looking dendrograms. The first subdivision occurs when Mississippi separates from all the others at a distance (difference) of 1.4 variable units. This is clearly seen at the top-right of figure 2.3. The second subdivision occurs when Colorado separates from all others at a distance of 1.1 units, and so on. The horizontal line added to the dendrogram at a distance below 0.9 but above 0.8 shows the grouping of areas if the process is stopped with exactly five categories. Scan along that line and you can identify the following five categories-Mississippi, Colorado, Washington, DC, a Connecticut-Massachusetts pair, and a final large category containing the other 46 states.
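The iterative gap-splitting just described is easy to automate. The following sketch (the function name is our own) reproduces the four breaks and five categories identified above from the table 2.5 values:

```python
# Natural breaks, single linkage: place the k-1 class breaks at the k-1
# largest gaps between adjacent ranked values. Gaps are rounded to one
# decimal to match the data's tenth-of-a-percent precision.
RATES = [
    18.6, 19.7, 20.6, 21.4, 22.3, 22.8, 23.0, 23.2, 23.3, 23.5,
    24.2, 24.5, 24.6, 24.6, 24.6, 24.8, 24.8, 25.0, 25.1, 25.2,
    25.5, 25.7, 25.8, 25.8, 26.2, 26.4, 26.5, 27.0, 27.2, 27.2,
    27.4, 27.9, 27.9, 28.1, 28.7, 28.7, 28.8, 29.3, 29.4, 29.5,
    29.6, 29.6, 30.0, 30.5, 31.0, 31.1, 31.4, 31.5, 32.3, 33.0,
    34.4,
]

def single_linkage_classes(values, k):
    values = sorted(values)
    # Gap i separates values[i] from values[i + 1] on the number line.
    gaps = [(round(values[i + 1] - values[i], 1), i)
            for i in range(len(values) - 1)]
    # The k-1 largest gaps become the class break locations.
    break_positions = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    classes, start = [], 0
    for pos in break_positions:
        classes.append(values[start:pos + 1])
        start = pos + 1
    classes.append(values[start:])
    return classes

classes = single_linkage_classes(RATES, 5)
for c in classes:
    print(f"{c[0]} to {c[-1]} ({len(c)} areas)")
```

Reading the largest gaps in decreasing order (1.4, 1.1, 0.9, 0.9) also gives the subdivision distances that appear at the top of the dendrogram in figure 2.3.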

[Figure 2.3: dendrogram of the single linkage process. The vertical axis shows the distance (difference) between classes, from 0.0 to 1.4 obesity percentage units; a dashed horizontal line just below 0.9 marks the five-category subdivision.]

FIGURE 2.3
Dendrogram: State Obesity Rates, 2009

Despite the impressive dendrogram, the resulting map is not very satisfying geographically, and the process creates some other problems as well. First of all, a few extreme values in the data set (often called outliers) are identified, but the overall map contains very little information about the nationwide obesity pattern. Furthermore, the physical area of one of these outliers (Washington, DC) is so small it is not really visible on the map, and both Connecticut and Massachusetts are just barely noticeable.

Another fairly serious difficulty is seen by looking at the map legend. Three of the five categories are single states or areas and the fourth category contains only two members (Connecticut and Massachusetts). As a result, the classification system does not appear to be exhaustive, creating an unfavorable situation since every potential observation value cannot be assigned to a category. For instance, where would one assign an obesity level of 21.7? In point of fact, this is only a theoretical problem, since all areas can actually be assigned to one of these categories. However, a map reader looking at these class values on the map legend could easily get confused and there could be possible misunderstandings. In other words, although this is a properly applied single linkage classification, it is an awkward and idiosyncratic solution as well.

What should you conclude about these disparities among classification methods? Depending on the method used, outcomes can be quite different, even though the same data set is used and the same number of classes created. The visually distinctive choropleth maps in figure 2.2 illustrate the substantial effects of various classification decisions on spatial patterning and the resultant conclusions that can be drawn about these spatial patterns. The logical conclusion is for you to recognize that any observed spatial pattern (map) is a function of the specific classification method applied and that using a different method of classification will likely result in a visually distinctive map.

In addition to the four simple classification methods presented in this chapter, a number of other classification methods will be introduced later in the text. As footnoted in table 2.4, the standard deviation breaks and Jenks natural breaks methods are discussed in chapter 3. These methods are not discussed here, because the statistics of mean, standard deviation, and variance have not yet been formally defined. Classification techniques will appear again toward the end of the text (chapter 17), when the multivariate classification method of cluster analysis is applied to aspects of the obesity and life expectancy examples introduced in the first chapter.

KEY TERMS

classification strategies:
  subdivision, 27
  agglomeration, 28
discrete and continuous variables, 23
dendrogram, 33
ecological fallacy, 23
exhaustive and mutually exclusive categories, 23
explicitly spatial and implicitly spatial data, 22
individual-level and spatially aggregated data, 22
measurement concepts:
  precision, 25
  accuracy, 25
  validity, 26
  reliability, 26

measurement scales:
  nominal, 23
  ordinal, 24
  interval, 25
  ratio, 25
methods of classification (selected):
  equal intervals based on range, 28
  equal intervals not based on range, 29
  quantile breaks, 29
  natural breaks (single linkage), 29
operational definition, 26
outlier, 34
primary and secondary (archival) data, 21
quantitative and qualitative (categorical) variables, 23
spurious precision, 25
strongly-ordered and weakly-ordered ordinal variables, 24

REFERENCES AND ADDITIONAL READING

Blalock, H. M., Jr. Conceptualization and Measurement in the Social Sciences. Beverly Hills, CA: Sage Publications, 1982.
Brewer, C. A. Designing Better Maps: A Guide for GIS Users. Redlands, CA: ESRI Press, 2005.
Dramowicz, E. and K. Dramowicz. "Choropleth Mapping with Exploratory Data Analysis." Directions Magazine, 2004. Article 718. www.directionsmag.com
Jenks, G. F. and R. C. Coulson. "Class Intervals for Statistical Maps." International Yearbook of Cartography 3 (1963): 119-34.
Monmonier, M. S. and H. J. de Blij. How to Lie with Maps. 2nd ed. Chicago: University of Chicago Press, 1996.
Robinson, Arthur H., Joel L. Morrison, Phillip C. Muehrcke, A. Jon Kimerling, and Stephen C. Guptill. Elements of Cartography. 6th ed. New York: John Wiley and Sons, Inc., 1995.
Sokal, R. R. "Numerical Taxonomy." Scientific American 215, No. 6 (1966): 106-16.
Stauffer, C. L. "Making Maps: The Untold Story." Population Today 30 (2002): 3, 6.
MAJOR GOALS AND OBJECTIVES
If you have mastered the material in this chapter, you
should now be able to:
1. Categorize geographic variables and data sets on a variety of dimensions (e.g., primary or secondary [archival], explicitly or implicitly spatial, individual-level or spatially aggregated, discrete or continuous, categorical [qualitative] or quantitative variables).
2. Identify the measurement scale (nominal, ordinal,
interval, or ratio) of a variable and understand the
characteristics of each measurement scale.
3. Describe the measurement concepts of precision,
accuracy, validity, and reliability. When presented
with data from any particular geographic problem or
situation, evaluate the variable in terms of these poten-
tial sources of measurement error.
4. Explain the general logic associated with each of the
two conceptual classification strategies (subdivision
and agglomeration).
5. Understand the specific procedures, as well as both
the advantages and disadvantages associated with
each of the operational classification methods (equal
intervals based on range, equal intervals not based on
range, quantile breaks, and natural breaks-single
linkage). When given a set of geographic data, select
an appropriate classification method.
6. Describe or summarize a given set of data graphically
with a dendrogram.
PART II

DESCRIPTIVE PROBLEM SOLVING IN GEOGRAPHY
Descriptive Statistics and Graphics

3.1 Measures of Central Tendency


3.2 Measures of Dispersion and Variability
3.3 Measures of Shape or Relative Position
3.4 Selected Issues: Spatial Data and Descriptive Statistics

A basic distinction between descriptive and inferential statistics is made in chapter 1. Recall that the overall goal of descriptive statistics is to provide a concise and easily understood summary of the characteristics of a particular data set. For most geographic problems, such quantitative or numerical summary measures are clearly superior to working with unsummarized raw data. With these easily understood and widely used descriptive statistics, geographers can effectively communicate the characteristics of the data.

Some of the advantages of summarizing spatial information were demonstrated in the discussion of basic classification methods (section 2.4). In this chapter, using basic descriptive statistics and related graphics, we show how numerical or quantitative summary measures and complementary graphics can "describe" a data set.

We can summarize a data set in several different ways:

• Measures of central tendency-numbers that represent the center or typical value of a frequency distribution, such as mode, median, and mean (section 3.1).

• Measures of dispersion-numbers that depict the amount of spread or variability in a data set, such as range, interquartile range, standard deviation, variance, and coefficient of variation (section 3.2).

• Measures of shape or relative position-numbers that further describe the nature or shape of a frequency distribution, such as skewness, which indicates the amount of symmetry of a distribution, or kurtosis, which describes the degree of flatness or peakedness in a distribution (section 3.3).

Choosing the proper descriptive statistic and related graphics display for a particular geographic problem depends partly on the level of measurement-whether the data are nominal, ordinal, or interval-ratio. In addition, different calculation procedures are applied if the data are grouped (weighted) or ungrouped (unweighted).

You must be cautious when applying descriptive statistics to spatial or locational data. The way in which a geographic problem is structured can affect the resulting descriptive statistics. Issues related to the structure of statistical problems and the resultant impact on the magnitude of descriptive statistics are discussed in section 3.4, including: (1) the effects of external boundary line delineation and study area location on the values of descriptive measures; (2) the effects of altering internal subarea boundaries within the same overall study area-a component of the so-called modifiable areal units problem (MAUP) known as the grouping or zoning problem; and (3) the impact on descriptive statistics when using different levels of spatial resolution or spatial acuity-another component of the MAUP known as the scale or aggregation problem.

3.1 MEASURES OF CENTRAL TENDENCY

The central or typical value of a data set is described numerically in several different ways. Each measure of central tendency has advantages and disadvantages, and the logic underlying the calculation procedure for each measure is different. To select the most appropriate mea-

39
40 Part II .A. Descriptive Problem Solving in Geography

TABLE 3.1
Annual Precipitation for Washington, DC: A Ranked 40-Year Record (in inches)

26.87   35.20   39.86   45.62
26.94   35.38   40.21   46.02
28.28   35.96   40.54   47.73
29.48   36.02   41.11   47.90
31.56   36.65   41.34   48.02
32.78   36.83   41.44   50.50
33.07   36.99   41.46   51.17
33.62   38.15   41.94   51.97
34.98   39.34   43.30   54.29
35.09   39.62   43.53   57.54

Source: National Climatic Data Center (NCDC)

sure in a particular geographic situation requires an understanding of each measure and its characteristics. The discussion in this section is limited to three widely used measures of central tendency: mode, median, and mean. Annual precipitation data from Washington, DC collected over a 40-year period are used to present the calculation procedures and related graphics (table 3.1).

Mode
The mode is simply the value that occurs most frequently in a set of ungrouped data. When nominal data are used, the mode would be the category containing the largest number of observations. With ordinal or interval-ratio data grouped into classes, the category with the largest number of observations is defined as the modal class. The midpoint of the modal class interval is the crude mode. A mode can be calculated for data at all levels of measurement. For nominal data, however, the mode is the only available descriptive measure of central tendency.

While the mode is sometimes a useful measure of central tendency, it often does not provide a practical result. For example, the mode would not be an appropriate measure for large data sets having few or no tied observations. In the Washington, DC annual precipitation data set, no value occurs more than once over the 40-year sample period, making the mode ineffective. In fact, a large number of tied values is not likely to occur in most geographic situations where data are interval or ratio scale. However, if the precipitation data are grouped (table 3.2) the modal class is 35-39.99 (with 12 values) and the crude mode of 37.5 is the midpoint of this modal class interval.

This is an appropriate time to introduce some graphics that are often associated with grouped data: (1) the histogram (relative frequency histogram) and related frequency polygon, and (2) the ogive (cumulative frequency distribution). Graphic summaries of grouped data are constructed using several different formats, but they all share some common characteristics. Usually the frequency of values is shown on the vertical (Y) axis and the range of data is presented on the horizontal (X) axis. Information may be shown as absolute frequencies (actual frequency counts) or as relative frequencies (percentages or probabilities).

In a histogram, the frequency of values is shown as a series of vertical bars, one for each class of values. For the Washington, DC annual precipitation, both table 3.2 and figure 3.1 are constructed using the equal intervals not based on range classification technique. With this procedure, the class breaks occur at convenient, rounded-off positions, and widths of class intervals are uniform. Since the precipitation data for Washington, DC are continuous and extend from a low of 26.87 inches to a high of 57.54 inches, the histogram is constructed using a series of five-inch intervals from 25 inches to 60 inches.

A frequency polygon is similar to a histogram, except the vertical position of each class is shown as a point rather than a bar. If the values have been categorized into groups or classes, the single point for displaying the frequency is placed at the midpoint of the class interval. The points are then connected as straight lines to produce the frequency polygon. In figure 3.1, the frequency polygon is superimposed on the histogram.

Instead of displaying the absolute frequency count on the vertical axis, a histogram or frequency polygon is easily converted to relative frequency. For example, by dividing the individual frequency values by the sum of all frequencies for the data set, the diagrams would display

TABLE 3.2
Classification of Washington, DC Precipitation Data Using Equal Intervals Not Based on Range Method

                        Absolute    Cumulative Absolute    Relative     Cumulative Relative
Category   Interval     Frequency   Frequency              Frequency    Frequency

1          25-29.99      4            4                    0.100        0.100
2          30-34.99      5            9                    0.125        0.225
3          35-39.99     12           21                    0.300        0.525
4          40-44.99      9           30                    0.225        0.750
5          45-49.99      5           35                    0.125        0.875
6          50-54.99      4           39                    0.100        0.975
7          55-59.99      1           40                    0.025        1.000
TOTAL                   40                                 1.000
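The frequency columns of table 3.2, together with the measures of central tendency developed in the remainder of this section, can be verified with a short sketch (the variable names are our own):

```python
# Rebuilds the frequency columns of table 3.2 from the 40 precipitation
# values of table 3.1, then computes the grouped and ungrouped measures
# of central tendency used in this section (modal class and crude mode,
# median, unweighted mean, and the table 3.4 weighted mean).
PRECIP = [
    26.87, 26.94, 28.28, 29.48, 31.56, 32.78, 33.07, 33.62, 34.98, 35.09,
    35.20, 35.38, 35.96, 36.02, 36.65, 36.83, 36.99, 38.15, 39.34, 39.62,
    39.86, 40.21, 40.54, 41.11, 41.34, 41.44, 41.46, 41.94, 43.30, 43.53,
    45.62, 46.02, 47.73, 47.90, 48.02, 50.50, 51.17, 51.97, 54.29, 57.54,
]
n = len(PRECIP)

# Absolute, cumulative, and relative frequencies for 5-inch classes, 25-60.
rows, cumulative = [], 0
for lower in range(25, 60, 5):
    freq = sum(lower <= x < lower + 5 for x in PRECIP)
    cumulative += freq
    rows.append((lower, freq, cumulative, freq / n, cumulative / n))

# Modal class = class with the largest frequency; crude mode = its midpoint.
modal_lower = max(rows, key=lambda r: r[1])[0]
crude_mode = modal_lower + 2.5

# Median: midpoint of the two middle ranks (ranks 20 and 21 for n = 40).
ranked = sorted(PRECIP)
median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2

# Unweighted mean (equation 3.2) and weighted mean from class midpoints
# and class frequencies (equation 3.4, the table 3.4 worktable).
mean = sum(PRECIP) / n
weighted_mean = sum((lower + 2.5) * f for lower, f, *_ in rows) / n

print(crude_mode, round(median, 2), round(mean, 2), weighted_mean)
```

Running the sketch reproduces the values discussed in the text: a modal class of 35-39.99 with crude mode 37.5, a median of 39.74 inches, and unweighted and weighted means of 39.96 and 40.25 inches.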
Chapter 3 .A. Descriptive Statistics and Graphics 41

[Figure 3.1: histogram, with the frequency polygon superimposed, for the seven five-inch precipitation classes (25-29.99 through 55-59.99 inches). Frequency (0 to 12) is on the vertical axis and annual precipitation (in inches) on the horizontal axis.]

FIGURE 3.1
Histogram and Frequency Polygon for Washington, DC 40-Year Annual Precipitation Data
Source: National Climatic Data Center (NCDC)

the frequency percentages or proportions for each class. This change would not affect the general shape of the graphic, only the scale of values along the vertical axis.

Another useful method for displaying data in a relative frequency format is a cumulative frequency diagram, or ogive (figure 3.2). This graphic aggregates frequencies from class to class and displays cumulative frequencies at each position. By starting at the lowest value and cumulating higher values, this technique is equivalent to presenting the number of values that are "less than or equal to" each value or class along the horizontal axis. The result often produces an ogive having a logistic-curve shape. In addition to displaying cumulative absolute frequencies, the data is also shown as cumulative relative frequencies (or cumulative proportions)-see the right vertical axis label on figure 3.2.

[Figure 3.2: cumulative frequency polygon rising from 0 at 25 inches to 40 at 60 inches. Cumulative absolute frequency (0 to 40) is on the left vertical axis, cumulative relative frequency (0 to 1.00) on the right vertical axis, and annual precipitation (in inches) on the horizontal axis.]

FIGURE 3.2
Cumulative Frequency Polygon (Ogive) for Washington, DC 40-Year Annual Precipitation Data

Median
The median is the middle value from a set of ranked observations and is therefore the value with an equal number of data units both above it and below it. With an odd number of observations, the middle value is unique and defines the median. With an even number of observations, the median is defined as the midpoint of the values of the two "middle ranks." The Washington, DC data has 40 values, so the two middle values are rank 20 (39.62 inches of precipitation) and rank 21 (39.86 inches), and the median is their midpoint (39.74 inches). The median is therefore not affected by extreme high or low values.

Mean
The mean (also called the arithmetic mean or the average) is the most widely used measure of central tendency. It is usually the most appropriate measure when using interval or ratio data. The arithmetic mean (X̄) is the sum of a set of values divided by the number of observations in the set. In standard statistical notation, the mean is defined as follows:

    X̄ = ( Σ(i=1 to n) Xi ) / n    (3.1)

where X̄ = mean of variable X
      Xi = value of observation i
      Σ = summation symbol (upper case sigma)
      n = number of observations

It is generally understood that summation is over all n observations, so the symbols above and below sigma are usually omitted:

    X̄ = ΣXi / n    (3.2)

The calculation of mean annual precipitation for the Washington, DC example data is shown in table 3.3.

In many geographic problems, a sample mean must be differentiated from a population mean. Using conventional notation, lower-case n refers to sample size and upper-case N to population size. Population characteristics are customarily defined using Greek letters, but sample measures are not. The formula for a population mean (μ, the Greek letter mu) with N values is:

    μ = ΣXi / N    (3.3)

Because it is possible to draw many samples of size n from a large population of size N, a different-magnitude

TABLE 3.3
Worktable for Calculating Arithmetic Mean of Washington, DC Precipitation Data

Observation i   Precipitation Xi
1               26.87
2               26.94
3               28.28
...             ...
38              51.97
39              54.29
40              57.54
TOTAL           1598.30

    X̄ = ΣXi / n = (26.87 + 26.94 + ... + 57.54) / 40 = 1598.30 / 40 = 39.96

TABLE 3.4
Worktable for Calculating Weighted Mean of Washington, DC Precipitation Data

Class           Class           Class
interval j      midpoint Xj     frequency Fj    Xj Fj
25-29.99        27.5             4               110.0
30-34.99        32.5             5               162.5
35-39.99        37.5            12               450.0
40-44.99        42.5             9               382.5
45-49.99        47.5             5               237.5
50-54.99        52.5             4               210.0
55-59.99        57.5             1                57.5
TOTAL                           40              1610.0

    X̄w = ΣXj Fj / n = 1610.0 / 40 = 40.25
sample mean may result from each sample drawn. The population mean is a fixed value, however. In chapter 8 the distinction between samples and populations is explored further when applying sample information to estimate population characteristics.

The mean can also be calculated for grouped data. In some practical situations, computing a grouped or weighted mean is the only viable option. Suppose, for example, the only information available is in summary form, like the histogram and frequency polygon for the Washington, DC annual precipitation data shown in figure 3.1. Perhaps the source data from which the histogram has been constructed (like the National Climatic Data Center data in table 3.1) is not available. In other practical situations, it may be most effective first to classify a set of data, then to use the classified (grouped) information to calculate the weighted mean. Generally, if a graphic representation such as a histogram is part of the analysis, the conventionally used method of classification is equal intervals not based on range, which is how the data in table 3.2 and figure 3.1 are classified.

Whatever the specific circumstance, a weighted mean is calculated from only the class intervals and class frequencies presented in table 3.2 and figure 3.1, using this formula:

X̄w = ΣXj fj / n    (3.4)

where X̄w = weighted mean
      Xj = midpoint of class interval j
      fj = frequency of class interval j
      k = number of class intervals
      n = total number of values = the sum of fj for j = 1 to k

To permit this calculation, two related assumptions are made about the distribution of values within each category or class interval: (1) without any information to the contrary, one assumes the data are evenly distributed within each class interval; therefore, (2) the best summary representation of the values in each interval is the class midpoint. Quite simply, the class midpoint is located exactly midway between the extreme values that identify the class interval. So, for example, the class midpoint of the interval from 25 to 29.99 is 27.5, which also happens to be the mean of those two values (table 3.4).

A quick comparison of the unweighted and weighted means for the Washington, DC precipitation data reveals that the unweighted mean is only slightly smaller than the weighted mean (the unweighted mean is 39.96, whereas the weighted mean is 40.25). In general, weighted and unweighted means should differ only slightly, depending on the overall effect of using class midpoints as representative summary values for each category. For example, if the majority of original values in most of the categories happen to fall below the class midpoints, the weighted mean will be higher than the corresponding unweighted mean.

Selecting the Proper Measure of Central Tendency

Deciding which measure of central tendency to use depends both on the geographic application and certain key characteristics of the data set. On the surface, the mean seems to have certain advantages. It is the most widely applied and generally understood measure of central tendency. In addition, it is always affected to some extent by a change or modification of any data set value. This is not always true of either the mode or the median. The mean is also an element in many tests of statistical inference when estimates of population parameters are made from sample statistics. When working with interval-ratio data, the mean is often the measure of choice.
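The weighted mean of equation 3.4 can be computed directly from the class midpoints and frequencies of table 3.4. A minimal sketch in Python:

```python
# Class midpoints (Xj) and class frequencies (fj) from table 3.4
midpoints = [27.5, 32.5, 37.5, 42.5, 47.5, 52.5, 57.5]
frequencies = [4, 5, 12, 9, 5, 4, 1]

n = sum(frequencies)  # total number of values (40)

# Weighted mean: sum of midpoint-times-frequency products, divided by n
weighted_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
# 1610.0 / 40 = 40.25, matching table 3.4
```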
Chapter 3 .A. Descriptive Statistics and Graphics 43

Selection of the "best" measure of central tendency may depend on other characteristics of the distribution of the data, as the following situations illustrate:

• If a distribution is unimodal (having a single distinct mode) and symmetric (evenly balanced around a single distinct mode or central vertical line), then all three "centers" are likely to be similarly located and no advantage accrues to any statistic (figure 3.3, case 1).

• If a frequency distribution is not completely symmetric about a central line, it is said to be asymmetric or skewed, and the various measures of central tendency are positioned at different places along the distribution. Additional discussion and a formal definition of skew are provided in section 3.3. With slight positive skew (figure 3.3, case 2), the median is pulled slightly to the right toward the positive tail of the distribution. The location of the mean is affected even more significantly, positioned even further out the positive tail than the median. With large negative skew (figure 3.3, case 3), the magnitude of displacement of the median and, especially the mean, is even more dramatic.

• If a frequency distribution is bimodal (having two distinct modes) or multimodal (with more than two modes), the mean may be located on the distribution at a place that would not be considered "typical." In figure 3.4, neither the mean nor median depicts useful central locations; providing a separate value for mode A and mode B is probably more informative.

• If a frequency distribution contains one or more extreme or atypical values (outliers), the mean is heavily influenced by these values and its effectiveness as a measure of centrality may be reduced. The existence of an extreme value or values is actually a form of skew.

A simple example using the incomes of seven households illustrates the sensitivity of the mean to a single outlier (table 3.5). The mean is ineffective as a measure of central tendency in this example due to the income of a single atypical wealthy household. The mode is too low an income to represent a useful "central" value from this frequency distribution. The median provides the most suitable measure in this situation. In fact, with variables such as income that often contain a large amount of skewness, the median is generally the descriptive statistic of choice.

FIGURE 3.3
Measures of Central Tendency Placed on Symmetric and Asymmetric (Skewed) Frequency Distributions
(Case 1: unimodal and symmetric distribution; Case 2: slight positive skewness; Case 3: large negative skewness)

FIGURE 3.4
Measures of Central Tendency Placed on Bimodal Frequency Distribution

TABLE 3.5
Sensitivity of the Mean to a Single Extreme Value (Outlier)

Household Income    Descriptive Statistics
$31,000
$31,000             Sum = $659,000
$32,000
$36,000             Mode = $31,000
$37,500
$42,500             Median = $36,000
$449,000            Mean = $659,000 / 7 = $94,142.86
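The sensitivity of the three measures to the single outlier in table 3.5 can be reproduced with Python's statistics module:

```python
import statistics

# Household incomes from table 3.5; $449,000 is the outlier
incomes = [31000, 31000, 32000, 36000, 37500, 42500, 449000]

mode = statistics.mode(incomes)      # 31,000: too low to be a useful "center"
median = statistics.median(incomes)  # 36,000: robust to the single outlier
mean = statistics.mean(incomes)      # about 94,142.86: pulled upward sharply
```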

3.2 MEASURES OF DISPERSION AND VARIABILITY

Simple Measures of Dispersion

The three measures of central tendency-mode, median, and mean-specify the center of a distribution, but provide no indication of the amount of spread or variability in a set of values. Dispersion can be calculated in several different ways. The level of measurement and nature of the frequency distribution determine the most appropriate dispersion statistic.

The simplest measure of variability is the range, already defined in the previous chapter as the difference between the largest and smallest values in an interval-ratio set of data. Because it is derived solely from the two most extreme or atypical values and ignores all other values, the range is often a misleading measure. The range measured for the 40 annual precipitation values in the Washington, DC example is (57.54 - 26.87) = 30.67 inches.

If data are grouped, the range is defined as the difference between the upper value in the highest-numbered class interval and the lower value in the lowest-numbered class interval. With the grouped precipitation data, these values are (59.99 - 25) = 34.99 inches. As with the ungrouped range, the grouped range can be a deceptive measure of dispersion. Given only the range, the degree of clustering or dispersion of the values between these two extremes is unknown.

To obtain more information, you can examine certain intervals, portions, or percentiles within a frequency distribution. If data are divided into equal portions or percentiles (quantiles), then the range of values within any quantile can be calculated and graphed. Although any logical subdivision is possible, data are often classified into quartiles (fourths), quintiles (fifths), or deciles (tenths). The median, which is the 50th percentile, may also be one of these subdivisions.

When data are divided into four portions or quartiles, the interquartile range (IQR) is defined as the difference between the 25th percentile value (the lower quartile) and the 75th percentile value (the upper quartile). The interquartile range thus encompasses the "middle half" of the data. In the Washington, DC precipitation example, the interquartile range is (43.53 - 35.20) = 8.33 inches.

A boxplot (or box-and-whiskers plot) is a very useful graphic extension of the interquartile range concept that is particularly effective when comparing groups and displaying outliers. Many different statistics are shown on a boxplot, and you can gain additional insights by comparing multiple boxplots on the same diagram. For each group, the typical boxplot will display the median line, the interquartile range box (encompassing the middle 50% of the data), an upper whisker (extending to the maximum data point within 1.5 box-heights from the top of the IQR box), a lower whisker (extending to the minimum data point within 1.5 box-heights from the bottom of the IQR box), and outliers (all observations beyond the upper or lower whisker). Figure 3.5 is a generalized boxplot diagram with these statistics labeled.

FIGURE 3.5
Generalized Diagram of a Boxplot
1 - Interquartile range box. Top line = Q3 (third quartile): 75% of the data are less than or equal to this value. Middle line = Q2 (median): 50% of the data are less than or equal to this value. Bottom line = Q1 (first quartile): 25% of the data are less than or equal to this value.
2 - Upper whisker. Extends to the maximum data point within 1.5 box heights from the top of the box.
3 - Lower whisker. Extends to the minimum data point within 1.5 box heights from the bottom of the box.
4 - Outlier (*). Observation that is beyond the upper or lower whisker.

The value of comparing multiple boxplots on the same diagram becomes clear when we look at 40 years of annual precipitation data from Buffalo, NY, St. Louis, MO, and San Diego, CA (figure 3.6). The median line values for Buffalo and St. Louis appear almost identical (in fact, the medians are 39.7 inches and 37.78 inches, respectively), while the median line value for San Diego is much less (9.29 inches). You can also see from a quick visual examination of the boxplots that the interquartile range box for San Diego is nearly as large as that for Buffalo, even though Buffalo's annual precipitation is about four times that of San Diego. The individual point distribution (or individual value plot) for each of the three cities also seems to confirm that San Diego has almost as much dispersion or variation in annual precipitation from year to year as Buffalo and St. Louis. We will look at additional descriptive statistics and measures for the annual precipitation data of these three cities as we continue through this chapter.

Continuing our discussion of dispersion measures-other dispersion statistics are based on the individual deviations of each observation from the mean value. The difference between each value and the mean is calculated and these deviations are used as the building blocks to measure dispersion. An individual deviation (di) is calculated as follows:

di = Xi - X̄    (3.5)

where di = deviation of value i from the mean
      Xi = value of observation i
      X̄ = mean of variable X
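The boxplot statistics just described (quartiles, IQR, whisker fences, outliers) can be computed directly. A minimal sketch with a small hypothetical data set; note that quartile conventions vary slightly between software packages:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100]  # hypothetical values; 100 is an outlier

# Quartiles: statistics.quantiles with n=4 returns [Q1, Q2, Q3]
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # interquartile range: the "middle half" of the data

# Whisker fences at 1.5 x IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond either fence are plotted individually as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```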

FIGURE 3.6
Comparative Boxplots of Annual Precipitation Data for Three Cities
(annual precipitation in inches for Buffalo, NY, St. Louis, MO, and San Diego, CA, with median, upper quartile, lower quartile, and interquartile range labeled)
Source: National Climatic Data Center (NCDC)

The average deviation or mean deviation is based on the mean of the set of individual deviations. That is, the mean deviation (m) for a sample of values is

m = Σ|Xi - X̄| / n    (3.6)

where |Xi - X̄| equals the absolute value of the difference between Xi and X̄.

Some units of data are larger than the mean, making the difference (Xi - X̄) a positive number. Conversely, values below the mean make (Xi - X̄) a negative number. However, the sum of the deviations about a mean is always zero. To avoid the problem of negative differences offsetting positive differences, the absolute value of each individual deviation is calculated. With absolute values, negative deviations are converted to positive ones. For example: |-2| = |2| = 2.

Another important property associated with the mean is its "least squares" character. Not only is the sum of deviations (that is, unsquared deviations) of all observations about a mean always zero, but the sum of squared deviations of all observations about a mean is the minimum sum possible. That is, the sum of squared deviations about a mean is less than the sum of squared deviations about any other number. This important attribute is called the least squares property of the mean, and can be summarized:

Σ(Xi - X̄)² = minimum    (3.7)

Standard Deviation and Variance

The least squares property of the mean carries over into the most common measure of variability or dispersion-standard deviation. Given a sample of values, the standard deviation (s) is defined as:

s = √[Σ(Xi - X̄)² / (n - 1)]    (3.8)

In the numerator of equation 3.8, the deviation of each value from the mean is squared and all of the squared deviations are summed, thus incorporating the least squares property into this measure of dispersion. Note also that the squaring process removes the problem of negative deviations offsetting positive deviations in a more effective way than absolute values. This least squares sum is then divided by (n - 1) to reduce sampling bias. Finally, to reverse the effect of squaring, the square root of this quotient is taken.

For a population, the standard deviation formula is written:

σ = √[Σ(Xi - µ)² / N]    (3.9)

where σ = lowercase Greek sigma

Note the different denominators of the two standard deviation formulas. The denominator of a sample is n - 1, but the denominator becomes N when dealing with a population. If the standard deviation is being calculated for a large sample (n > 30), the difference between n and (n - 1) is small, and the difference between s and σ will also be small. However, with a small sample (n < 30), the true population standard deviation (σ) is slightly underestimated if division is by n. When dividing by (n - 1), s becomes larger, correcting the underestimation problem. The incorporation of this sample size correction provides a better estimate of the true population standard deviation for smaller samples.

The variance of a data set, defined as the square of the standard deviation, provides a measure of the average squared deviation of a set of values around the mean. The magnitude of the variance statistic is generally a very large number and is sometimes difficult to interpret. Nevertheless, it is an extremely important descriptive measure that has many applications in inferential statistics. The formulas for sample variance (s²) and population variance (σ²) are as follows:

s² = Σ(Xi - X̄)² / (n - 1)    (3.10)

σ² = Σ(Xi - µ)² / N    (3.11)
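Equations 3.6 and 3.8 through 3.11 can be verified numerically. A minimal sketch with hypothetical values, cross-checked against Python's statistics module:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample of values
n = len(data)
mean = sum(data) / n  # X-bar = 5.0

# Mean deviation (equation 3.6): average absolute deviation from the mean
mean_dev = sum(abs(x - mean) for x in data) / n

# Sample standard deviation (equation 3.8): divide the sum of squared
# deviations by (n - 1) to reduce sampling bias
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Population standard deviation (equation 3.9): divide by N
sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

# Variances (equations 3.10 and 3.11) are simply the squares
s2, sigma2 = s ** 2, sigma ** 2

# The standard library implements the same two formulas
assert math.isclose(s, statistics.stdev(data))
assert math.isclose(sigma, statistics.pstdev(data))
```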

Both standard deviation and variance are fundamental building blocks in statistical analysis. When discussing sampling procedures and estimation in chapter 8, these descriptive measures are used as the theoretical basis for making probabilistic statements. Confidence intervals are placed around estimates derived from samples to predict population parameters. These two statistics are also integral components in inferential hypothesis testing. For example, variance is used in analysis of variance (ANOVA), a statistical technique that determines whether multiple (more than two) samples differ significantly from one another (chapter 11).

A final issue regarding standard deviation and variance requires clarification-the procedural question of whether to use a more convenient computational formula or a more cumbersome and time-consuming definitional formula. Formulas based on the definition of standard deviation are valuable because they show how variability is a direct function of individual deviations (Xi - X̄) or (Xi - µ). However, if it is necessary to calculate standard deviation manually without computer assistance, more efficient computational formulas are available. These alternative formulas, based on algebraic manipulations of the definitional formulas, are summarized in table 3.6. Since variance is the square of the standard deviation, it is calculated by simply removing the square root symbol from each formula in table 3.6.

TABLE 3.6
Definitional and Computational Formulas for Standard Deviation of Sample and Population

                  Definitional formula          Computational formula
Sample (s)        √[Σ(Xi - X̄)² / (n - 1)]      √[(ΣXi² - nX̄²) / (n - 1)]
Population (σ)    √[Σ(Xi - µ)² / N]             √[(ΣXi² / N) - µ²]

TABLE 3.7
Worktable for Calculating Sample Standard Deviation of Washington, DC Precipitation Data

Observation i    Xi         Xi - X̄    (Xi - X̄)²    Xi²
1                26.87      -13.09     171.28        722.00
2                26.94      -13.02     169.46        725.76
3                28.28      -11.68     136.36        799.76
...
38               51.97       12.01     144.30        2700.88
39               54.29       14.33     205.42        2947.40
40               57.54       17.58     309.14        3310.85
TOTAL            1598.30               2142.36       66,006.44

n = 40    n - 1 = 39    X̄ = ΣXi / n = 1598.30 / 40 = 39.96

Using the definitional formula for sample standard deviation:

s = √[Σ(Xi - X̄)² / (n - 1)] = √(2142.36 / 39) = 7.41

Using the computational formula for sample standard deviation:

s = √[(ΣXi² - nX̄²) / (n - 1)] = √[(66,006.44 - 40(39.96)²) / 39] = 7.40
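The algebraic equivalence of the definitional and computational formulas in table 3.6 is easy to confirm. A minimal sketch with a hypothetical subset of values (unlike the hand calculation in table 3.7, the mean is not rounded here, so the two results agree exactly):

```python
import math

data = [26.87, 26.94, 28.28, 51.97, 54.29, 57.54]  # hypothetical subset of values
n = len(data)
mean = sum(data) / n

# Definitional formula: sum of squared deviations from the mean
s_def = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Computational formula: sum of squares minus n times the squared mean
s_comp = math.sqrt((sum(x * x for x in data) - n * mean ** 2) / (n - 1))
```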

The calculation procedures for both the definitional and computational formulas of sample standard deviation are demonstrated in table 3.7 using the Washington, DC precipitation data. The statistical parity of the definitional and computational formulas is easily seen.

For calculating standard deviation and variance from grouped data, the same assumptions are made as when calculating a weighted mean. That is, the values are assumed to be evenly distributed within each class interval, making the midpoint the best representation of the middle of that class interval. As with the weighted average, the weighted standard deviation (sw) and variance (sw²) formulas use these class midpoints (Xj) and class frequency counts (fj). Although a definitional formula for weighted standard deviation exists, table 3.8 uses the simpler computational formula for calculating this index with the Washington, DC precipitation data.

TABLE 3.8
Worktable for Calculating Weighted Standard Deviation of Washington, DC Precipitation Data

Class interval    Class midpoint Xj    Class frequency fj    Xj fj     (Xj)² fj
25-29.99          27.5                 4                     110.0     3025.00
30-34.99          32.5                 5                     162.5     5281.25
35-39.99          37.5                 12                    450.0     16875.00
40-44.99          42.5                 9                     382.5     16256.25
45-49.99          47.5                 5                     237.5     11281.25
50-54.99          52.5                 4                     210.0     11025.00
55-59.99          57.5                 1                     57.5      3306.25
TOTAL                                  40                    1610.0    67050.00

Using the computational formula for sample standard deviation:

sw = √[(Σ(Xj)²fj - ((ΣXj fj)² / n)) / (n - 1)] = √[(67050 - ((1610)² / 40)) / 39] = 7.59

The weighted standard deviation (sw = 7.59) is similar in magnitude to the unweighted standard deviation (s = 7.40). A slight difference between the two descriptive statistics is expected given the contrasting assumptions that underlie the calculation procedures.

Now that you have been introduced to the basic descriptive measures of dispersion-standard deviation and variance-it is an appropriate time to explain the standard deviation breaks and Jenks natural breaks methods of classification that were cited in table 2.4. Once again using the 2009 state-level obesity rate data set, we can now compare the patterns generated by these two additional classification methods with the other four classification methods shown in figure 2.2.

The standard deviation breaks method determines class break values after calculating the mean and standard deviation of the data set. Rounded-off standard deviation intervals such as 1.0 or 0.5 are utilized to divide observations into categories. This method of classification is most effective when a data set is normally distributed. Note: we will discuss the normal distribution in chapter 6.

If we want an even number of categories, the mean itself is used as a class break value, and the resultant map will clearly show which areas have values above and below the mean. On the other hand, if we want an odd number of categories, the "middle category" would be centered on the mean. All four of the choropleth maps showing state-level obesity rates (figure 2.2) used exactly five categories, so we will again use a five-category classification with the standard deviation breaks method, to allow consistent comparisons of classification methods. The result is shown in figure 3.7.

The Jenks method of natural breaks determines class breaks by finding "natural" groupings inherent in the overall distribution of data values. Each observation is assigned to a class based on its location relative to all other observations in the data set. Using an iterative (step-by-step) algorithm or procedure, the Jenks method identifies class breaks that minimize the variation in values within classes and maximize the variation in values between classes. The computational specifics of this iterative algorithm are too lengthy and complex to show in an introductory textbook, but many statistical software and GIS packages include the Jenks method among their set of univariate (single variable) classification options. The result of applying the Jenks method of natural breaks to the 2009 obesity data is shown in figure 3.8.

Coefficient of Variation

In geographic research, it is extremely valuable to compare directly the amount of variability in different spatial patterns to see which has the greatest spatial variation. In other problems it may be useful to compare variability in some phenomenon as it changes over time. For example, a climatologist might want to compare the variability in annual rainfall at different meteorological stations to learn which locations have the greatest variation from year to year. An economic development planner might be interested in comparing household income in several different counties to learn which region has the greatest internal variation and income inequality. The general issue of income inequality itself is now a key political and social issue. As a geographer, you might want to look at various aspects of the distribution and variation of income. Asking these investigative, comparative questions allows you to explore practical problems in a highly productive manner, often leading to better policy decisions.

Using standard deviation or variance to compare locations or regions directly is often inappropriate because they are both absolute measures. That is, their values depend on the size or magnitude of the units from which they are calculated. A data set with large numbers

FIGURE 3.7
Percent of Adults Classified as Obese, by State, 2009: Standard Deviation Breaks
(2009 obesity rates; mean = 26.65%, standard deviation = 3.47%; classes: less than 21.5% (less than -1.5s); 21.5% to 24.9% (-1.5s to -0.5s); 25.0% to 28.3% (-0.5s to 0.5s); 28.4% to 31.8% (0.5s to 1.5s); greater than 31.8% (greater than 1.5s))
Source: Centers for Disease Control and Prevention (CDC), 2010

FIGURE 3.8
Percent of Adults Classified as Obese, by State, 2009: Jenks Natural Breaks
(2009 obesity rates; classes: 18.6% to 21.4%; 21.5% to 25.2%; 25.3% to 27.4%; 27.5% to 30.5%; 30.6% to 34.4%)
Source: Centers for Disease Control and Prevention (CDC), 2010
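The standard deviation breaks classification behind figure 3.7 can be sketched directly. A minimal illustration with hypothetical rate values, using five classes with breaks at the mean plus or minus 0.5 and 1.5 standard deviations:

```python
import statistics

def sd_break_values(values):
    """Class breaks at mean -1.5s, -0.5s, +0.5s, and +1.5s (five classes)."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [mean + k * s for k in (-1.5, -0.5, 0.5, 1.5)]

def classify(x, breaks):
    """Return a class index 0-4: the number of break values x exceeds."""
    return sum(x > b for b in breaks)

rates = [20, 22, 24, 26, 28, 30, 32]  # hypothetical obesity rates (%)
breaks = sd_break_values(rates)
classes = [classify(x, breaks) for x in rates]
```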

(e.g., in the millions) is described with a large average, standard deviation, and variance. Conversely, analysis of a data set containing small numbers results in small absolute descriptive measures. Clearly, direct comparison of averages, standard deviations, and variances is limited across data sets with different magnitudes.

To resolve this problem and compare the spatial variation of two or more geographic patterns, a relative measure of dispersion called the coefficient of variation or coefficient of variability (CV) is used. The CV is simply the standard deviation expressed relative to the magnitude of the mean:

CV = s / X̄    or    CV = (s / X̄)(100)    (3.12)

The coefficient of variation is such a simple method with such great potential, it is surprising that it is so underutilized by geographers and statisticians. The CV is usually expressed as a proportion or percentage of the mean and may be used with either sample or population data. Dividing the standard deviation by the mean removes the influence of the magnitude of the data and allows direct comparison of relative variability in different data sets.

Only ratio-scale data should be used to calculate the coefficient of variation. Data measured at the interval scale are not appropriate because the interval metric has an arbitrary zero. As a result, numeric manipulations involving multiplication or division are meaningless. Given that the coefficient of variation is a ratio of the standard deviation to the mean, the coefficient of variation value has no meaning when calculated from interval scale data.

TABLE 3.9
Descriptive Statistics Using 40 Years of Annual Precipitation Data for Three Cities

City             Mean     Standard deviation (absolute)    Coefficient of variation (relative)
Buffalo, NY      40.38    4.90                             12.14
St. Louis, MO    39.26    8.12                             20.68
San Diego, CA    10.23    3.98                             38.91

Using 40 years of annual precipitation data from three weather stations-Buffalo, NY, St. Louis, MO, and San Diego, CA (table 3.9)-sharp contrasts between basic descriptive statistics are evident. The absolute descriptive statistics for Buffalo and St. Louis are fairly similar, but still have some notable differences. The mean annual average precipitation values are very close (Buffalo with 40.38 inches and St. Louis with 39.26 inches). However, St. Louis has a notably larger absolute amount of variability in precipitation from year to year (with a standard deviation of 8.12) than Buffalo (whose standard deviation is only 4.90). Situational factors related to climate characteristics account for the different standard deviation levels of these two cities. St. Louis is found in the middle of the continent (subject to changeable continental influences and extremes as well as being far from any large body of water), whereas Buffalo is close to a consistent source of moisture with prevailing winds coming off Lake Erie. By the way, you should notice that these absolute descriptive statistics confirm the graphic evidence shown in the boxplots of these three cities (figure 3.6).

Of the three cities, however, semiarid San Diego has by far the greatest degree of relative variability in precipitation from year to year. Although the standard deviation for San Diego is only 3.98 inches (the smallest of the three cities), that magnitude of variability is relatively large when compared to San Diego's rather low average annual precipitation of 10.23 inches. The much higher coefficient of variation for San Diego clearly reveals that the amount of precipitation occurring there from year to year fluctuates much more dramatically than in Buffalo or St. Louis. Those of you interested in climatology would start to explain this phenomenon by citing the semiarid climate and the variable air and water currents that affect rainfall patterns in southern California, especially this far to the south. Therefore, one expects some degree of consistency in rainfall in Buffalo and St. Louis, while San Diego experiences greater fluctuations in rainfall in proportion to the mean.
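The relative comparison in table 3.9 is straightforward to reproduce with equation 3.12. A minimal sketch using the published mean and standard deviation for each city (small rounding differences from the table can occur because the inputs here are already rounded):

```python
# (mean, standard deviation) of annual precipitation in inches, from table 3.9
stations = {
    "Buffalo, NY":   (40.38, 4.90),
    "St. Louis, MO": (39.26, 8.12),
    "San Diego, CA": (10.23, 3.98),
}

# CV = (s / mean) * 100, a relative (unitless) measure of dispersion
cv = {city: 100 * s / mean for city, (mean, s) in stations.items()}
# San Diego has the largest CV despite having the smallest standard deviation
```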

Example: Maps of Annual Precipitation Statistics by U.S. Climate Division

The comparative annual precipitation data from Buffalo, St. Louis, and San Diego may have whetted your appetite for more descriptive statistics and climatology. What if this brief three-city example is expanded to encompass all of the conterminous United States (excluding the climatic outliers of Alaska and Hawaii)? Can you fashion a couple of descriptive statements related to the spatial distribution and variation of precipitation across the country? Where in the U.S. do you think the absolute measures (mean, standard deviation) of annual precipitation are the highest or the lowest? It might be a bit more difficult to hypothesize about where the relative measure (coefficient of variation) is highest or lowest. Fortunately, we can investigate these questions and reach definitive conclusions because plentiful data are available from the National Climatic Data Center (NCDC).

Located in Asheville, North Carolina, NCDC collects "normals" (means) and standard deviations of precipitation and other key climatic indicators (temperature, heating degree days, cooling degree days) for thousands of weather stations across the United States. Since 1931, NCDC has consolidated this voluminous amount of weather station data into climate division summaries. The conterminous U.S. has 344 climate divisions, plus climate divisions in Alaska and Hawaii. According to the NCDC website, "climatic divisions are regions within each

Mean

Precipitation
Mean
'--__.I 5.08 to 23.61
D 23.62 to 42.15
D 42.16 to 60.68
- 60.69 to 79.22
- 79.23 to 97.77

Precipitation
Standard Deviation
'--__.I 1.77 to 4 .80
I 4.81 to 1.81
~
'----'I 1 .82 to 1o.83
- 10.84 to 13.85
- 13.86 to 16.87

Precipitation
Coefficient of Variation
'----'I 8.1 to 16.1
D 16.2 to 23.4
D 23.5 to 30.1
- 30.8 to 38.1
- 38.2 to 45.4

FIGURE 3.9
Maps of Annual Preciprtation Statistics by U.S. Climate Division, 1971-2000
Source: National Climatic Data Center (NCDC)

state that have been determined to be reasonably climatically homogeneous."¹ These divisions were established to encourage research in hydrology, agriculture, biogeography and other disciplines that sometimes require data averaged over an area of a state rather than just at selected points (weather stations).

NCDC derived the climate division summaries by using information from all weather stations that consistently report precipitation within a climate division. The number of reporting stations within a climate division varies from year to year, but this variation was ignored by the agency in the computation of the mean and standard deviation values.

The top portion of figure 3.9 shows mean annual precipitation for each of the 344 climate divisions in the conterminous United States. Mean annual precipitation ranges from 5.08 inches in the very arid southwestern corner of Arizona to 97.77 inches on the northwestern coast of Washington. Relatively low mean annual precipitation values are found throughout the western third of the United States, with climate divisions in several states (Arizona, California, Nevada, Utah, and Wyoming) averaging less than 10 inches per year in precipitation. The highest mean annual precipitation values (above 60 inches per year) are concentrated in the Pacific Northwest, either along the coast of Washington and Oregon or in the Cascade Mountains of those two states. A second concentration of high mean annual precipitation values is found in climate divisions along the Gulf of Mexico coastline-from Louisiana, through Mississippi and Alabama, to Florida.

In general, we should expect climate divisions having the lowest mean annual precipitation to also have the lowest standard deviation values from year to year, and this seems to be the case, looking at the middle portion of figure 3.9. Many of the same climate divisions in Arizona, Idaho, Nevada, Utah, and Wyoming with low average annual precipitation also have low standard deviation values from year to year. Conversely, the climate divisions having the greatest annual precipitation also generally have the greatest absolute variation in precipitation from year to year (see the areas in the Pacific

The bottom portion of figure 3.9 diagrams the spatial pattern of the coefficient of variation. This map illustrates the amount of relative variation in annual precipitation from year to year for each climate division. Recall that the coefficient of variation is simply the standard deviation divided by the mean. A large CV for a climate division would therefore indicate that annual precipitation fluctuates a great deal from year to year relative to the average annual precipitation, while a small CV would indicate very little fluctuation in precipitation from year to year relative to the average annual precipitation.

Being a relative descriptive measure, the coefficient of variation pattern differs markedly from the absolute descriptive measures (mean and standard deviation) and provides a greatly expanded array of geographic insights regarding the spatial patterns of annual precipitation across the United States. Of the 344 climate divisions in the conterminous U.S., only 11 of them have CV values greater than 30. This region of highly variable and uncertain annual precipitation from year to year is located almost entirely in the state of California, with adjacent climate divisions in southwestern Arizona and southern Nevada. In fact, the entire state of California has highly variable precipitation from year to year-all 7 of California's climate divisions have CV values greater than 30. This is clearly not just a function of low average annual precipitation in arid locations resulting in large precipitation fluctuations from year to year. The climate division along the northern California coast has a substantial average annual precipitation of 41.25 inches, but a very high CV value of 31.3.

What climatological processes are operating to cause such variability in precipitation from year to year throughout California but almost nowhere else in the United States? Current climatological thinking suggests the answer is to be found by associating Pacific Coast precipitation levels with El Niño and La Niña cycles. The term "El Niño" refers to the dramatic warming of sea-surface temperatures in the eastern and central tropical Pacific Ocean. Historic monitoring of these warming events suggests they occur in an irregular cycle, generally ranging from 2 to 7 years and lasting from as little as 6 months
Northwest and along the Gulf of Mexico}. to as long as 4 years. During an El Nino event, increased precip•
NCDC has produced excellent multi-color maps of annual itation is expected in California and adjacent portions of the
precipitation by climate division for both mean and standard southwest. This is due to more southerly zonal storm tracks.
deviation. With multi-color mapping (as opposed to shades of The "cool" event counterpoint of El Nino is La Nina. During
gray}, spatial patterns have the potential to be shown more La Nina events, the polar and Pacific jet streams shift north•
precisely, and NCDC personnel are effectively able to use a ward and increased precipitation is diverted into the Pacific
larger number of map categories or classes. In fact, the NCDC Northwest due to more northerly zonal storm tracks.
map of annual precipitation normals by climate division con• Recent studies have shown that heavy precipitation years
tains fifteen categories and the corresponding standard devi• in southern California (especially in Climate Division 6, which
ation map has fourteen categories. In addition, NCDC represents the southern California coast from Santa Barbara
cartographers are able to use composite classification meth• south to San Diego} are strongly associated with moderate•to-
ods, with class intervals that vary by the magnitude of annual strong El Nino events, while years with below-normal precipi•
precipitation, thereby allowing them to highlight extremes to tation are generally associated with moderate-to-strong La
some extent while maintaining equal interval categories Nina occurrences. These relationships seem to be strongest in
throughout the middle portions of the classification scheme. the San Diego area.
By contrast, the maps shown in our text are limited to five Another phenomenon worth noting in the coefficient of
categories (using shades of gray}, and we use equal intervals variation portion of figure 3.9 is the general west·to•east
based on range throughout our classification scheme. It is cer• decline in CV values across the U.S.What descriptive hypothe•
tainly a worthwhile exercise to compare the NCDC maps ses would you propose to account for this decline in precipita•
online with the maps shown in figure 3.9. tion variation as we more eastward?
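The coefficient of variation used throughout this comparison is easy to compute directly. Below is a minimal sketch with invented annual precipitation series (hypothetical numbers, not the actual climate division records):

```python
def coefficient_of_variation(values):
    """CV = (standard deviation / mean) * 100, using the population formula."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return (variance ** 0.5 / mean) * 100.0

# Hypothetical annual precipitation series (inches) for two climate divisions
stable = [40, 42, 38, 41, 39]    # little year-to-year fluctuation
volatile = [15, 55, 25, 70, 35]  # large year-to-year fluctuation

print(round(coefficient_of_variation(stable), 1))    # -> 3.5
print(round(coefficient_of_variation(volatile), 1))  # -> 50.0
```

Both series have the same mean (40 inches), so the very different CV values isolate the difference in relative year-to-year variability, which is exactly why the CV map reveals a pattern the mean and standard deviation maps cannot.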

1. "U.S. Climate Normals 1971-2000 Products," last modified August 20, 2008, http://www.ncdc.noaa.gov/oa/climate/normals/usnormalsprods.html
52 Part II .A. Descriptive Problem Solving in Geography

3.3 MEASURES OF SHAPE OR RELATIVE POSITION

Skewness and Kurtosis

In addition to the coefficient of variation, two other relative measures, skewness and kurtosis, further describe the nature or character of a frequency distribution. As already mentioned, skew measures the degree of symmetry in a frequency distribution by determining the extent to which the values are evenly or unevenly distributed on either side of the mean. Kurtosis measures the flatness or peakedness of a data set. Like the coefficient of variation, geographers underutilize these indices, yet they can sometimes provide important descriptive insights about a frequency distribution and offer considerable potential in spatial research.

Introducing the concept of moments of a distribution about the mean provides a more complete understanding of skewness and kurtosis. The first moment is the sum of individual deviations about the mean and must equal zero:

First moment = Σdᵢ = Σ(Xᵢ − X̄) = 0    (3.13)

The second moment of a frequency distribution is the numerator in the expression that defines variance:

Second moment = Σ(Xᵢ − X̄)²    (3.14)

Skewness involves use of the third moment of a frequency distribution. One commonly used measure of relative skewness contains the third standardized moment in the numerator:

Skewness = Σ(Xᵢ − X̄)³ / (n·s³)    (3.15)

The denominator of this expression contains the cubed standard deviation, which effectively standardizes the third moment. This allows you to compare the amount of relative skewness in different frequency distributions directly.

If a frequency distribution is symmetric, with a roughly equal number of values on either side of the mean, the distribution has little or no skew. If a value in a distribution is greater than the mean, its cubed deviation is positive. However, if a value is less than the mean, it will produce a negative cubed deviation. In a symmetric distribution, these positive and negative cubed deviations counterbalance each other and the sum (the third moment) is close to zero. In a distribution with a longer tail to the left, large negative cubed deviations cause the sum of all deviations (the third moment) to be negative. In this case, the resultant distribution is said to be negatively skewed. In a distribution with a longer tail to the right, large positive cubed deviations will dominate the sum, resulting in a positively skewed distribution.

Another measure of skewness, known as Pearson's coefficient, is based on a comparison of the mean and median:

Pearson's Skewness = 3(X̄ − median) / s    (3.16)

With a unimodal and symmetric distribution, the mean and median have similar values, and the skewness coefficient is close to zero (figure 3.3, case 1). When the mean is greater than the median (as in case 2), positive skew generally results, and when the mean is less than the median (as in case 3), the skew measure is generally negative. Division by the standard deviation provides a standardization of the Pearson's skewness values, allowing direct comparisons.

Kurtosis measures the fourth standardized moment of a frequency distribution using the following formula:

Kurtosis = Σ(Xᵢ − X̄)⁴ / (n·s⁴)    (3.17)

If a large proportion of all values are clustered in one part of the distribution, it will have a "pointed" or "peaked" appearance, a high level of kurtosis, and be considered leptokurtic (figure 3.10). In a data set having low kurtosis (a "flat" or platykurtic distribution), values are dispersed more evenly over many different portions of the distribution. Some distributions are considered mesokurtic, having a "moderate" or "bell-shaped" appearance, which is neither very peaked nor very flat.

FIGURE 3.10
Different Levels of Kurtosis
[Curves comparing a leptokurtic (peaked) distribution with a platykurtic (flat) distribution]

The interpretation of kurtosis is enhanced by comparing the peakedness of a distribution to that of a normal probability distribution. Although the importance of the normal curve is discussed in more detail in chapter 6, it is worth mentioning here because kurtosis formulas assign characteristic values to a normal distribution. In equation 3.17, a leptokurtic distribution has a kurtosis greater than 3.0, a normal distribution is mesokurtic,
Chapter 3 .A. Descriptive Statistics and Graphics 53

with a kurtosis value of 3, and a platykurtic distribution has a value less than 3. Some computer software packages use equation 3.18, subtracting 3 from the quotient, so that a normal distribution has zero kurtosis. With this alternative, platykurtic distributions are negative, mesokurtic distributions are close to zero, and leptokurtic distributions are positive.

Kurtosis = Σ(Xᵢ − X̄)⁴ / (n·s⁴) − 3    (3.18)

Exploratory Comparison of Skewness and Kurtosis Values

Can the examination of skew and kurtosis values provide any useful insights about the comparative nature of spatial patterns? Geographers sometimes want to compare descriptive statistics for a problem variable at different locations during a single time period. They may ask such questions as: "During this period of time, which locations had the greatest relative variability? ... the greatest amount of skew? ... the most leptokurtic distribution?" We can then continue the investigation by asking why these observed spatial patterns of skew or kurtosis occurred. By exploring data sets more thoroughly through the application of descriptive statistics, geographers will gain a better understanding of the problem variable.

You must proceed with caution, however, as recent research shows both skew and kurtosis statistics appear very dependent on sample size. While skew and kurtosis provide fully valid descriptive measures of a data set, they may provide very imprecise population parameter or probability model estimates. In fact, in many cases even sample sizes as large as several hundred fail to give good estimates of the true probability model skewness and kurtosis. We will discuss techniques for estimating population parameters from sample statistics in later chapters.

Given these limitations, we can still make some valid descriptive statements regarding the 40 years of annual precipitation data from Buffalo, St. Louis, and San Diego. In table 3.10, intercity comparisons about precipitation distributions are made for skew and kurtosis. All three cities show slightly positive skew. A quick glance at their boxplots (figure 3.6) suggests a slight amount of skew in each case, because the upper quartile portion of the interquartile range is larger than the lower quartile portion. In addition, there seems to be more dispersion of individual values plotted above the interquartile range compared to the dispersion of individual values plotted below the interquartile range. These observations indicate a slightly higher chance of these cities getting an annual precipitation well above their averages than getting an annual precipitation well below their averages. You might speculate a bit about whether these facts concerning year-to-year differences in annual precipitation have any practical economic implications, such as future irrigation needs or flood management protection strategies.

The kurtosis values are all very close to zero, and don't seem to reveal anything meaningful about the flatness or peakedness of the distributions. We might gain some geographic insight, however, if we extended the data collection to hundreds of cities across the United States: would any non-random kurtosis pattern be revealed?

TABLE 3.10
Additional Descriptive Statistics Using 40 Years of Annual Precipitation Data for Three Cities

City            Mean    Standard deviation    Skewness    Kurtosis
Buffalo, NY     40.38    4.90                 0.74         0.15
St. Louis, MO   39.26    8.12                 0.47        -0.25
San Diego, CA   10.23    3.98                 0.59        -0.25

3.4 SELECTED ISSUES: SPATIAL DATA AND DESCRIPTIVE STATISTICS

Geographers need to recognize several potential complications associated with the analysis of spatial or location-based data. Aspects of these problems and issues are also discussed to some extent later in the text (for example, in chapter 16 we examine how correlations can be dramatically affected by the spatial structure of a data set). However, we concentrate here solely on how spatial data can affect the value and magnitude of various descriptive statistics. Problems addressed in this section include how each of the following can affect the summary statistics you calculate from a set of data: (1) alteration of the external boundary of a study area; (2) modification of internal (subarea) boundaries; and (3) change in the level of spatial resolution by using a different scale or level of aggregation.

As an introductory text, our discussion of these complications is deliberately limited to a few basic comments and rather simplistic examples. If you are interested in following up on these ideas, many intermediate and advanced references are available; a few of these are identified at the end of this chapter. Sadly, these issues are often not discussed sufficiently outside the discipline. However, a basic understanding of the nature of these problems is essential for conducting geographic research.

Impact of External Boundary Delineation

The location and spatial extent of the overall study area boundary (external boundary) is likely to have a significant impact on the magnitude of any summary descriptive statistics that you calculate. Let's look at a simple example.

Suppose you are assisting local planners who are interested in analyzing the number and spatial pattern of

persons below the poverty level in the Wicomico County, Maryland area (Wicomico County is the home county of Salisbury University). There are several alternatives for the external study area boundary, including: (1) all of Wicomico County; (2) the Salisbury/Wicomico Metropolitan Planning Organization (S/WMPO) area boundary; and (3) the incorporated area of Salisbury (figure 3.11).

Suppose you have detailed locational information regarding the number and distribution of people below the poverty level (perhaps down to the census tract area), so any of these three external study area boundaries is feasible. It is generally known that dense clusters of poverty are located near the Salisbury city center, in multiple directions. If you analyze only those census tracts within the incorporated area of Salisbury, you suspect the average number (and percentage) of people below the poverty level per census tract will be higher than if you analyze all census tracts in the S/WMPO area or all census tracts in the County.

Other absolute measures whose values are partially a function of the mean, such as standard deviation and variance, will also be affected by the external study area boundary selected. These effects may be complicated and rather hard to predict. For example, is absolute variability in poverty levels greatest at the city, metro planning area, or county level? The answer to this question is not obvious, but the standard deviation and variance values will likely differ depending on the study area boundary selected. It is clear, therefore, that absolute descriptive statistics should be evaluated primarily in relation to a particular external study area boundary.

Modifiable Areal Unit Problem: Modification of Internal Subarea Boundaries

Descriptive statistics can also be influenced significantly by using alternative subdivision or regionalization schemes within the same overall study area. This is an important component of the Modifiable Areal Unit Problem (MAUP), often referred to as the grouping or zoning problem. In many geographic problems, the external study area may be fixed, but the internal subarea boundaries may be drawn in many different ways, resulting in different descriptive statistical summaries when the subarea data is analyzed.

D E L A w A R E
Dorchester

Mardela

Sussex
Wicomico

Willards

M A

N
Wicomico
A
Worcester

CJ Jncorporaled
Areas

Somerset
CJ S/WMPO PlanningArea Boundary
0 2,$

Kilometers ' LJ WicomicoCounty,Maryland

FIGURE 3.11
Alternative External Study Area Boundaries: Wicomico County, MD and Salisbury/Wicomico Metropolrtan Planning Organization (S/WMPO)
Area Boundary
Source:Salisbury/Wicomico
MetropolitanPlanningOrganization,IncorporatedAreas and BoundaryMap (2000)
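The boundary-delineation effect described above can be sketched in a few lines. The tract counts below are invented for illustration; they are not actual Wicomico County figures:

```python
from statistics import mean, pstdev

# Hypothetical persons below the poverty level per census tract (invented data)
city_tracts = [900, 850, 800]        # dense poverty clusters near the city center
outer_tracts = [200, 150, 100, 120]  # tracts added when the boundary expands

county_tracts = city_tracts + outer_tracts

# The per-tract average drops when the external boundary is widened
print(mean(city_tracts), mean(county_tracts))

# The standard deviation also changes, though less predictably
print(pstdev(city_tracts), pstdev(county_tracts))
```

With these made-up numbers the city-only mean is far higher than the county-wide mean, echoing the point that absolute descriptive statistics are meaningful only relative to a particular external study area boundary.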

The MAUP grouping or zoning problem is illustrated with a very simple example. Suppose the shaded area in figure 3.12 represents the location of a high concentration of a particular demographic group. In both case 1 and case 2 the external boundary (study area) is the same, the location of the demographic group within that external study area boundary is the same, and the number of internal subareas in the overall study area is the same (6 subareas in each case). However, the positioning of the internal subarea boundaries seems to create the perception that the group is more "integrated" (case 1) or more "segregated" (case 2) within the study area. When basic descriptive statistics (mean, percentage, standard deviation, etc.) are calculated using these alternative subdivision schemes, dramatically different results are likely to occur. Again, caution is advised when interpreting descriptive statistics. It is likely that summary measures will have fully valid comparative interpretations only if calculated over the same area and subarea configuration.

FIGURE 3.12
Perceived Effect of Alternative Internal Subarea Boundaries
[Two sketches of the same study area and shaded group location, each divided into six internal subareas: case 1 (group appears more "integrated") and case 2 (group appears more "segregated")]

A practical example further illustrates the difficulties that the grouping or zoning component of the MAUP can uncover. Consider an actual experimental farm in upstate New York. Agronomists took 72 soil samples and analyzed them in a lab to determine numerous chemical concentrations for each sample. In figure 3.13 we show the chemical concentrations of phosphorous (P) and potassium (K) in parts per million (ppm) at each sample point in the farm field. Phosphorus is a macronutrient essential for crop production as it stimulates early plant growth, gives plants a good and vigorous start, and allows the crop to mature more quickly. Potassium is associated with the movement of water, nutrients, and carbohydrates in plant tissue. If potassium is lacking, growth is stunted and yields are reduced. Therefore, an agronomist may be interested in understanding the relationship between potassium and phosphorous throughout a field to better enact crop management strategies.

The phosphorous level appears above each of the 72 sample point symbols, while the potassium level appears below each sample point symbol. The overall average phosphorous level (across all 72 sample points) is 5.6 with a variance of 21.77, and the overall average potassium level (again, across all 72 sample points) is 69 with a variance of 1056. Each sample point symbol indicates whether the concentration values are above or below the field-wide averages (see the legend in figure 3.13 for details). A glance at figure 3.13 illustrates that both the potassium and phosphorous values generally increase from north to south. Noticeable differences in pattern are apparent, however. The "above average potassium zone" includes all sample points in the southeast corner of the field, while the "above average phosphorous zone" does not. Other slight differences can be noted where the "above average" zones do not coincide.

Agricultural studies often segment a study area (field) into different subareas or regions, allowing them to collect fewer samples by associating the sample averages with each subarea or region. The grouping or zoning component of MAUP describes the general potential problem that different regionalization or internal subdivision schemes may yield different statistical summaries. Given that the underlying geographic patterns of potassium and phosphorous are generally similar, will different regionalization schemes result in significantly different summary statistics?

For this example, we created three different regionalization patterns, and constructed scatterplots showing the relationships between phosphorous and potassium. For each example there were six groups, and each group contained 12 sample points.

The regionalization schemes were chosen to extract different geographic patterns, to the extent that they exist at all. In figure 3.14, case 1 is designed to pick up east/west variation, case 2 is designed to pick up north/south variation, and case 3 is a compromise, attempting to capture both east/west and north/south variability. The figure shows all three regionalization schemes and the mean and variance for both potassium and phosphorous in each scheme.

While the means in each regionalization scheme remain the same (simply because we are computing the mean of the means), the variances differ. The variance relates to the actual regionalization or internal subarea boundary scheme, thus presenting a real interpretive challenge. If we look at a series of scatterplots of the values for potassium and phosphorous overall and in each of the three cases, the results can be interpreted very differently. Figure 3.15 shows each scatterplot along with a basic interpretation.

When the regionalization schemes (internal subarea boundaries) are modified, we notice there are no differences in averages; however, there are significant differences in variances. The greater problem introduced with the different areal units is that one can obtain diametrically opposed interpretations of the relationship between potassium and phosphorous. Oddly enough, even when using the same set of underlying sample data, we can identify a possible negative relationship between potassium and phosphorous in case 1, but in case 2 we might conclude there is a positive relationship. And remember, in each instance, we are using exactly the same spatial pattern of data! Only the regionalization scheme has changed. It would appear that case 3 most closely matches the overall data, probably because this regionalization scheme sought to accommodate the variation in both north/south and east/west directions.
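The "mean of the means stays fixed while the variance of the regional means changes" behavior is easy to demonstrate. The values below are invented for illustration, not the actual soil-sample data:

```python
from statistics import mean, pvariance

# Twelve hypothetical point values (invented, not the 72 soil samples)
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# Two ways of grouping the same 12 points into 3 regions of 4 points each
rows    = [values[0:4], values[4:8], values[8:12]]    # contiguous blocks
striped = [values[0::3], values[1::3], values[2::3]]  # interleaved points

for scheme in (rows, striped):
    region_means = [mean(region) for region in scheme]
    # The mean of the regional means is identical in both schemes (because
    # the groups are equal-sized), but the variance of the regional means
    # depends entirely on how the internal boundaries are drawn.
    print(mean(region_means), pvariance(region_means))
```

With equal-sized groups the grand mean is unchanged by regrouping, while the between-region variance swings from large (when each region captures a different part of the gradient) to small (when every region samples the whole range), which is the zoning effect shown in figure 3.14.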

FIGURE 3.13
Farm Field Showing the Spatial Patterns of Phosphorus Levels (above) and Potassium Levels (below) for 72 Sample Points
Source: Lembo, A., Lew, M., Laba, M., Baveyye, P. 2006
[Grid of 72 sample point symbols, each labeled with its phosphorus value (above the symbol) and potassium value (below the symbol) and coded into four classes: phosphorus above/below average crossed with potassium above/below average. Overall means: phosphorus 5.6, potassium 69.0]
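The four-class coding used in figure 3.13 simply compares each sample to the field-wide averages. A minimal sketch with invented (P, K) pairs (not the actual field samples):

```python
from statistics import mean

# Hypothetical (phosphorus, potassium) readings at four sample points
samples = [(1.0, 35.3), (2.1, 39.9), (9.1, 81.0), (12.8, 112.0)]

p_bar = mean(p for p, _ in samples)  # field-wide phosphorus average
k_bar = mean(k for _, k in samples)  # field-wide potassium average

# Each point gets a pair of booleans: (P above average?, K above average?)
labels = [(p > p_bar, k > k_bar) for p, k in samples]
print(labels)
```

For these invented readings the classification splits the points into the "both below" and "both above" classes; in the real field the interesting points are the ones where the two labels disagree, since those mark places where the P and K zones do not coincide.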

So what should a geographer conclude? This real-world situation certainly illustrates the potential magnitude of problems associated with the grouping and zoning component of MAUP, and clearly demonstrates why it is considered one of the more vexing problems in geography. This example also demonstrates that understanding the nature of the underlying data may actually help the geographer determine the most appropriate regionalization (internal subarea boundary) scheme to use. This is an important point. The use of GIS tools to overlay and aggregate data may lead to embarrassing results if the geographer does not understand the data being analyzed.

Modifiable Areal Unit Problem: Change in Scale or Level of Spatial Aggregation

When analyzing data, you can often choose from many different levels of spatial aggregation or scale. Many socioeconomic variables, for example, are often available at the block, census tract, enumeration district, election district, county, planning region, and state levels. When the same data are aggregated at different spatial levels or scales, the resultant descriptive statistics will vary, sometimes in a systematic, predictable fashion and sometimes in an uncertain way.

Population size is a basic variable, readily available at virtually any level of aggregation. In this simple example, we present population figures for two dates (1980 and 2010) at three different levels of aggregation (state, census division [CD], and census region [CR]). Figure 3.16 shows all three scales: 50 states aggregated to 9 census divisions, then 9 census divisions subsequently aggregated to 4 census regions. Many of the basic descriptive statistics (both absolute and relative) introduced in this chapter are calculated for both dates and all three scales or levels of aggregation (table 3.11).

As you might expect, the mean and standard deviation values for each date increase in magnitude as the scale increases in level of aggregation. For example, in 1980 the average state population was 4,442,080 with a standard deviation of 4,699,160. When these 50 state populations were aggregated to nine census divisions, the average CD population was 25,171,710 with a standard deviation of 11,839,680. Further aggregating the nine CDs to four census regions resulted in an average CR population of 56,636,400 with standard deviation of 14,065,900. Clearly, by aggregating data into larger subareas, the magnitude of the absolute descriptive measures increases.

In contrast to the absolute measures, the columns of table 3.11 showing relative measures are more challeng-

FIGURE 3.14
Three Regionalization Schemes (Different Internal Subarea Boundaries) for the Same Field (Same External Study Area Boundary)
Source: Lembo, A., Lew, M., Laba, M., Baveyye, P. 2006
[Each case divides the same 72 sample points into 6 regions of 12 points. In every case the potassium mean is 69 and the phosphorus mean is 5.6. Potassium variance: case 1 = 1083, case 2 = 325, case 3 = 413. Phosphorus variance: case 1 = 22.8, case 2 = 12.7, case 3 = 10.9]

FIGURE 3.15
Scatterplots Showing Various Relationships Between Potassium and Phosphorus: Overall Field Summary and Three Different Internal Subarea Boundary Schemes
Source: Lembo, A., Lew, M., Laba, M., Baveyye, P. 2006
[Four scatterplots of phosphorus (P) level against potassium (K) level, annotated as follows:]
Overall field data: As K increases, P also increases. We also see a large spread of the data, indicating lots of variation.
Case 1: As K increases, P decreases. This is the direct opposite of what the overall field data shows.
Case 2: As K increases, P also increases. This is a pattern similar to the overall field data. However, the points are almost in a straight line, and don't seem to show a large variation.
Case 3: As K increases, P also increases. This is a similar pattern to the overall field data. In addition, the spread of the points shows greater variability, which is similar to the overall field data.
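The sign flip seen in figure 3.15 (a positive point-level relationship turning negative after regional averaging) can be reproduced with a deliberately contrived toy data set. The numbers below are invented for illustration and are not the field samples:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Contrived (K, P) sample points in two regions (invented values)
region1 = [(67, 3), (73, 9)]
region2 = [(68, 2), (74, 8)]
points = region1 + region2

k = [pt[0] for pt in points]
p = [pt[1] for pt in points]
print(pearson_r(k, p) > 0)  # point-level relationship is positive

def centroid(region):
    ks, ps = zip(*region)
    return (sum(ks) / len(ks), sum(ps) / len(ps))

mk, mp = zip(*(centroid(r) for r in (region1, region2)))
print(pearson_r(list(mk), list(mp)) < 0)  # region-mean relationship is negative
```

Here the strong positive trend inside each region dominates the raw correlation, but the two regional means happen to line up negatively, so the same underlying points support opposite conclusions depending on the zoning scheme.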

ing to interpret geographically. For both 1980 and 2010, notice that the magnitude of the coefficient of variation decreases as the level of spatial aggregation increases. Why would this occur? At the state level, population figures vary greatly, from very large states like California and New York to very small states like Alaska and Vermont. Therefore the standard deviation of these 50 state populations is large relative to the average of those populations. As we shift to the census division scale, and especially to the highly aggregated census region scale, the population averages become much larger, but the standard deviation increases at a much slower pace, making the ratio of standard deviation to mean (that is, the coefficient of variation) a smaller number.

The interpretive value of skew and kurtosis seems very limited in this example. This is not surprising, since these measures of shape or relative position generally become useful only when the sample size is quite large. However, the skew values for both 1980 and 2010 are slightly positive, suggesting that the positive (right) tail of

TABLE 3.11
Descriptive Statistics of Total Population by Date and Level of Spatial Aggregation
(Mean and standard deviation are absolute measures, in 1000s; coefficient of variation, skewness, and kurtosis are relative measures.)

1980             Mean       Standard deviation   Coefficient of variation   Skewness   Kurtosis
State            4,442.08    4,699.16            105.79                     2.15        5.45
Census Division  25,171.71   11,839.68           47.04                      0.15       -1.94
Census Region    56,636.40   14,065.90           24.84                      0.89        0.06

2010             Mean       Standard deviation   Coefficient of variation   Skewness   Kurtosis
State            6,053.83    6,823.98            112.72                     2.65        8.82
Census Division  34,305.60   16,093.91           46.91                      0.23       -1.45
Census Region    77,186.38   25,867.93           33.51                      1.56        2.81

Source: United States Bureau of the Census
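The two relative patterns discussed around table 3.11 can be sketched with invented numbers (not the census figures): for this made-up example, aggregation reduces the coefficient of variation, and the moment-based skewness of equation 3.15 is positive when a few large values form a right tail:

```python
from statistics import mean, pstdev

def cv(values):
    """Coefficient of variation, in percent (population standard deviation)."""
    return pstdev(values) / mean(values) * 100

def skewness(values):
    """Third standardized moment, one common convention (equation 3.15)."""
    n, m, s = len(values), mean(values), pstdev(values)
    return sum((x - m) ** 3 for x in values) / (n * s ** 3)

# Hypothetical populations (millions) for six "states" forming two "divisions"
states = [1, 2, 3, 10, 20, 30]
divisions = [1 + 2 + 3, 10 + 20 + 30]  # aggregate by summing member states

print(cv(states) > cv(divisions))  # relative variability shrinks after aggregation
print(skewness(states) > 0)        # a long right tail gives positive skew
```

Note that the CV does not have to fall under aggregation in general; it does here (and in table 3.11) because summing members smooths out the relative spread faster than it inflates the mean-to-deviation ratio.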

FIGURE 3.16
U.S. States, Census Divisions, and Census Regions
Source: United States Bureau of the Census
[Maps of the conterminous United States showing the nine census divisions (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, Pacific) and the four census regions (Northeast, Midwest, South, West)]

the distribution extends a bit more from the mean than the negative (left) tail of the distribution. This relationship is very weak, however; at most it suggests that larger population states (e.g., California, Texas, and Florida) are growing faster than many small population states.

The kurtosis values are also generally limited in interpretive value. However, the large positive state-level kurtosis values for both 1980 and 2010 seem to indicate somewhat leptokurtic (peaked) distributions. If you constructed histograms of these data sets, you would probably see the many states with rather low populations creating a peak in the histogram, with a few large population states on the positive tail. The result would represent a rather "peaked" distribution. The census division and census region kurtosis values do not seem to reveal any geographic insights.

KEY TERMS

absolute and relative measures, 47, 49
average deviation (mean deviation), 45
bimodal, multimodal, unimodal, 43
boxplot (upper and lower whisker), 44
class midpoint, 42
coefficient of variation (coefficient of variability), 47
histogram (frequency polygon), 40
individual point distribution (individual value plot), 44
interquartile range, 44
Jenks natural breaks method of classification, 47
kurtosis (leptokurtic, mesokurtic, platykurtic), 52
least squares property of the mean, 45
measures of central tendency:
  mode, 40
  median, 41
  mean (arithmetic mean), 41
modal class and crude mode, 40
Modifiable Areal Unit Problem (MAUP):
  grouping or zoning problem, 54
  scale or level of aggregation problem, 53
ogive, 41
outliers, 43
quantiles, 44
skewness (negative and positive skew), 52
standard deviation, 45
standard deviation breaks method of classification, 47
symmetric, 43
variance, 45
weighted mean, 42

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Define the basic descriptive measures of central tendency (mode, median, and mean), explain the advantages and disadvantages of each, and select the most appropriate measure given a particular frequency distribution or particular geographic situation.
2. Define the basic descriptive measures of dispersion (range, quantiles, mean deviation, standard deviation, variance) and explain the characteristics of each.
3. Recognize whether a geographic problem or situation requires weighted or unweighted descriptive statistics.
4. Understand the concept of relative variability and its value in comparing different spatial patterns directly by using the coefficient of variation.
5. Understand the characteristics, as well as both the advantages and disadvantages, associated with the classification methods of standard deviation breaks and Jenks natural breaks.
6. Define the measures of shape or relative position (skewness and kurtosis) and recognize their potential value in descriptive analysis in geography.
7. Recognize the potential effects of locational data on descriptive statistics. Possible influences include: (a) the impact of external boundary delineation, (b) the Modifiable Areal Unit Problem (MAUP), and (c) the spatial aggregation or scale problem.

REFERENCES AND ADDITIONAL READING

Bailey, T. C. and A. C. Gatrell. Interactive Spatial Data Analysis. London: Longman, 1995.
Lembo, A., Lew, M., Laba, M., and Baveye, P. "Use of Spatial SQL to Assess the Practical Significance of the Modifiable Areal Unit Problem." Computers and Geosciences, Vol. 32, No. 2 (2006): 270-274.
Openshaw, S. The Modifiable Areal Unit Problem. Concepts and Techniques in Modern Geography (CATMOG), No. 38. Norwich: Geo Books, 1984.
von Hippel, P. T. "Mean, Median, and Skew: Correcting a Textbook Rule." Journal of Statistics Education, Vol. 13, No. 2 (2005).
Wheeler, D. J. "Problems with Skewness and Kurtosis, Part Two." Quality Digest Daily, Manuscript 231, 2011.

Looking at other introductory textbooks in quantitative geography and spatial analysis can be a valuable experience. Other texts may be simpler, more difficult, or roughly comparable, but whatever their level of difficulty, they will provide different insights. Expect some "symbol shock," as each statistics text will likely use somewhat different statistical notation. A few introductory textbooks are listed here.

Burt, J. E., G. M. Barber, and D. L. Rigby. Elementary Statistics for Geographers. 3rd ed. New York: Guilford Press, 2009.
Clark, W. A. V. and P. L. Hosking. Statistical Methods for Geographers. New York: John Wiley and Sons, 1986.
Earickson, R. and J. Harlin. Geographic Measurement and Quantitative Analysis. New York: Macmillan, 1994.
Ebdon, D. Statistics in Geography: A Practical Approach. 2nd ed. Blackwell Publishing, 1991.
Griffith, D. A. and C. G. Amrhein. Statistical Analysis for Geographers. Englewood Cliffs, NJ: Prentice-Hall, 1991.
Rogerson, P. A. Statistical Methods for Geography: A Student's Guide. 3rd ed. Thousand Oaks, CA: Sage Publications, Inc., 2010.
Taylor, P. J. Quantitative Methods in Geography: An Introduction to Spatial Analysis. Prospect Heights, IL: Waveland Press, 1983.

In addition to textbooks on quantitative methods written primarily for geographers, dozens of introductory statistics textbooks are currently in print. A few popular introductory statistics texts are listed here.

Bluman, A. Elementary Statistics: A Step by Step Approach. 6th ed. New York: McGraw-Hill, 2007.
De Veaux, R. D., P. F. Velleman, and D. E. Bock. Stats: Data and Models. 2nd ed. Boston, MA: Pearson Education, Inc., 2008.
Freedman, D., R. Pisani, and R. Purves. Statistics. 4th ed. New York: W. W. Norton and Co., 2007.
Kirk, R. E. Statistics: An Introduction. 5th ed. Belmont, CA: Thomson Learning, 2008.
Peck, R., C. Olsen, and J. Devore. Introduction to Statistics and Data Analysis. 3rd ed. Belmont, CA: Brooks/Cole, 2008.
Urdan, T. C. Statistics in Plain English. 3rd ed. New York, NY: Routledge Academic, 2010.
Descriptive Spatial Statistics

4.1 Spatial Measures of Central Tendency


4.2 Spatial Measures of Dispersion

In the preceding chapter, a variety of basic descriptive statistics were examined, including the mean, standard deviation, and coefficient of variation. To summarize point patterns, a set of descriptive spatial statistics has been developed that are areal or locational equivalents to these nonspatial measures (table 4.1). Since geographers are particularly concerned with the analysis of locational data, these descriptive spatial statistics, appropriately referred to as geostatistics, are often applied to summarize point patterns (and line patterns) and to describe the degree of spatial variability of some phenomena. Geostatistics can also provide a useful summary of an areal pattern on a choropleth map, if each area on the map can be operationally represented by a point.

Spatial measures of central tendency like central feature, mean center, median center (Euclidean median), and Linear Directional Mean (LDM) are examined in section 4.1. Each of these measures has characteristic properties and a set of practical geographic applications. Section 4.1 also contains some discussion about the concept of "distance." The most important absolute measure of spatial dispersion is standard distance, the spatial equivalent to standard deviation. Standard distance is discussed in section 4.2, in addition to the concept of relative dispersion and the Standard Deviational Ellipse (SDE).

Only the most fundamental measures of geographic distributions are discussed in this chapter. We have deliberately kept things simple and straightforward at this stage, with the focus exclusively on elementary descriptive spatial statistics.

TABLE 4.1
Selected Descriptive Statistics: Traditional (Nonspatial) and Spatial*

                            Measures of center                 Measures of                          Measures of
                            and central tendency               absolute dispersion                  relative dispersion
Traditional (nonspatial)    mode                               range                                coefficient of variation
measures                    median                             interquartile range                  skewness
                            mean                               mean deviation (average deviation)   kurtosis
                                                               standard deviation
                                                               variance
Spatial measures
  Point                     central feature                    standard distance                    relative distance
                            mean center                        standard deviational ellipse (SDE)
                            median center (Euclidean median)
  Line                      linear directional mean (LDM)

* Contents of table limited to only those descriptive statistics discussed in chapters 3 and 4.


However, a variety of inferential spatial statistics have become increasingly important with the explosive development of geographic information sciences (GIS) in recent years. These issues will be discussed later, especially in chapters 13 through 15.

It is important to recognize that even simple descriptive measures of a geographic distribution are usually calculated with a GIS software package. Spreadsheets such as Excel and statistical software packages such as Minitab, SPSS, and SAS can also be used for calculation, but the procedures may be more cumbersome. As geographers, we now have more options and computational flexibility than ever before.

4.1 SPATIAL MEASURES OF CENTRAL TENDENCY

Consider the scatterplot of seven points shown in figure 4.1. These points might represent any spatial distribution of interest to geographers; the only stipulation is that the phenomenon can be displayed graphically as a set of points in a two-dimensional Cartesian coordinate system. The directional orientation of the coordinate axes and location of the origin are often both arbitrary.

Central Feature

The central feature is the feature (point or area) whose total distance to all other features in the study area is the shortest. If we are evaluating a point pattern, the central point is the point with the lowest total distance to all other points. Similarly, if evaluating an area pattern, the central area is the area whose total distance to all other areas is the shortest. In either case, this is one simple way of defining the most accessible feature. Only the central point calculation procedure is shown here.

The worktable for calculating the central point is summarized in table 4.2. Notice that the definition of "distance" in this example is Euclidean (or straight-line) distance. The Euclidean distance (dij) separating points i and j is defined by the Pythagorean theorem as:

    dij = √[(Xi - Xj)² + (Yi - Yj)²]    (4.1)

To refresh your memory, examine figure 4.2, which illustrates the Pythagorean theorem that states: "In a right-angled triangle, the square of the hypotenuse is equal to the sum of the squares of the other two sides." If the hypotenuse represents the Euclidean distance separating two points (for example, points "A" and "B" in figure 4.1), the distance between those two points is easily calculated as part of the calculation for the central point (table 4.2).

The matrix in table 4.2 shows Euclidean distances for all pairs of points and we can see that point "D" is the one whose total distance from all other points is the lowest with 9.97. By definition, this makes point "D" the central point.

FIGURE 4.1
Graph of Locational Coordinates and Central Point
(Locational coordinates: A (2.8, 1.5), B (1.6, 3.8), C (3.5, 3.3), D (4.4, 2.0), E (4.3, 1.1), F (5.2, 2.4), G (4.9, 3.5); the central point is D.)

TABLE 4.2
Worktable for Calculating Central Point
Distance matrix between points in figure 4.1

          A      B      C      D      E      F      G    Total distance
A       0      2.59   1.93   1.68   1.55   2.56   2.90       13.21
B       2.59   0      1.96   3.33   3.82   3.86   3.31       18.87
C       1.93   1.96   0      1.58   2.34   1.92   1.41       11.14
D       1.68   3.33   1.58   0      0.91   0.89   1.58        9.97
E       1.55   3.82   2.34   0.91   0      1.58   2.47       12.67
F       2.56   3.86   1.92   0.89   1.58   0      1.14       11.95
G       2.90   3.31   1.41   1.58   2.47   1.14   0          12.81

Point "D" has the lowest total distance from all other points: 9.97
Example, Euclidean distance from point A to point B:
    √[(2.8 - 1.6)² + (1.5 - 3.8)²] = 2.59

Mean Center

The mean was discussed as an important measure of central tendency in the previous chapter. If this concept of central tendency is now extended to the same set of seven points as shown in figure 4.1, the mean center (or average location) can be determined.
64 Part II .A. Descriptive Problem Solving in Geography

Given a coordinate system with the X and Y coordinates of each point determined, the mean center is simply calculated by separately averaging those X and Y values, as follows:

    Xc = ΣXi / n   and   Yc = ΣYi / n    (4.2)

where: Xc = mean center of X
       Yc = mean center of Y
       Xi = X coordinate of point i
       Yi = Y coordinate of point i
       n = number of points in the distribution

The mean center coordinates (Xc = 3.81 and Yc = 2.51) are calculated in table 4.3 and the mean center location is shown in figure 4.3.

Just as the nonspatial mean is strongly affected by an outlier or small number of extreme values, the mean center is influenced in a similar way. Suppose, for example, that one additional point with coordinates (15, 13) is included in the previous example (figure 4.4). The mean center location shifts dramatically from (3.81, 2.51) to (5.21, 3.82), the latter being a coordinate position having larger X and Y coordinates than any of the other seven points. Thus, while the mean center represents an average location, it may not represent a "typical" or "central" location.

The mean center may be considered the center of gravity of a point pattern or spatial distribution. Perhaps the most widely known application of the mean center is the decennial calculation of the geographic "center of population" by the U.S. Bureau of the Census. This is the point where a rigid map of the country would balance if equal weights (each representing the location of one person) were situated on it. Over the last two centuries, the westward movement of the U.S. population has continued without significant interruption, as reflected in the concomitant westward shift of the center of population (figure 4.5).

In the previous seven-point examples, each point is given an equal weight in the central point and mean center calculations; that is, each point is equally important statistically. In many geographic applications, however, points of a spatial distribution could be assigned differential weights. These weights are analogous to frequencies in the calculation of grouped statistics, such as the weighted mean. The points might represent cities and the frequencies the number of people, or the points could be retail store locations and the frequencies could be volume of sales per store.

FIGURE 4.2
Calculation of Euclidean Distance (dij) from Point i to Point j
(Pythagorean theorem: a² + b² = c². In a right-angled triangle, the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Euclidean (straight-line) distance: dij = √[(Xi - Xj)² + (Yi - Yj)²].)

TABLE 4.3
Worktable for Calculating Mean Center*

Point     Xi      Yi
A         2.8     1.5
B         1.6     3.8
C         3.5     3.3
D         4.4     2.0
E         4.3     1.1
F         5.2     2.4
G         4.9     3.5

n = 7    ΣXi = 26.7    ΣYi = 17.6

Xc = ΣXi / n = 26.7 / 7 = 3.81    Yc = ΣYi / n = 17.6 / 7 = 2.51

Mean center coordinates: (3.81, 2.51)
* See figure 4.3 for graph of locational coordinates and mean center.

FIGURE 4.3
Graph of Mean Center
(The seven locational coordinates with the mean center plotted at (3.81, 2.51).)
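The central point (table 4.2) and the mean center (table 4.3, equation 4.2) are simple enough to verify with a short Python sketch. The code below is illustrative only, using the seven coordinates of figure 4.1; the function names are our own, not from any particular GIS package.

```python
from math import hypot

points = {"A": (2.8, 1.5), "B": (1.6, 3.8), "C": (3.5, 3.3), "D": (4.4, 2.0),
          "E": (4.3, 1.1), "F": (5.2, 2.4), "G": (4.9, 3.5)}

def total_distance(name):
    """Sum of Euclidean distances from one point to all others (table 4.2)."""
    x0, y0 = points[name]
    return sum(hypot(x0 - x, y0 - y) for x, y in points.values())

def mean_center(pts):
    """Separately average the X and Y coordinates (equation 4.2)."""
    n = len(pts)
    return sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n

central = min(points, key=total_distance)
print(central, round(total_distance(central), 2))   # D 9.97

xc, yc = mean_center(list(points.values()))
print(round(xc, 2), round(yc, 2))                   # 3.81 2.51

# A single outlier drags the mean center sharply toward itself (figure 4.4)
print(mean_center(list(points.values()) + [(15, 13)]))
```

The last line reproduces the outlier effect discussed above: adding the point (15, 13) pulls the mean center out to roughly (5.21, 3.82).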

The weighted mean center is defined as follows:

    Xwc = ΣfiXi / Σfi   and   Ywc = ΣfiYi / Σfi    (4.3)

where: Xwc = weighted mean center of X
       Ywc = weighted mean center of Y
       fi = frequency (weight) of point i

Each point in figure 4.3 is now assigned a weight (figure 4.6) and the locational coordinates of the weighted mean center are calculated (table 4.4). These coordinates are somewhat different from the coordinates of the comparable unweighted mean center. The weighted mean center is heavily affected by the relatively large frequency (20) associated with point "B". Gravitation of the center toward a point with an unusually heavy weight will occur even if that point is located peripherally within the spatial distribution.

The mean and mean center both have an important characteristic of locational significance. Recall from the discussion on mean deviation (section 3.2) that the sum of deviations of all observations about a mean is zero.

FIGURE 4.4
How an Outlier Might Affect Mean Center Location
(Adding the new point (15, 13) shifts the mean center from (3.81, 2.51) to (5.21, 3.82).)

FIGURE 4.6
Graph of Point Locations, Weights (in Parentheses), and Weighted Mean Center
(Weights: A (5), B (20), C (8), D (4), E (6), F (5), G (3); the weighted mean center is at (3.10, 2.88).)

FIGURE 4.5
Geographic Center of U.S. Population, 1790 to 2010
(The center of population has shifted steadily westward, from Maryland in 1790 to Missouri in 2010.)
Source: United States Bureau of the Census, 2010
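Equation 4.3 is a one-line extension of the mean center. A minimal Python sketch, using the weights shown in figure 4.6 (the function name is illustrative):

```python
# (coordinate, weight) pairs for points A through G of figure 4.6
weighted_points = [((2.8, 1.5), 5), ((1.6, 3.8), 20), ((3.5, 3.3), 8),
                   ((4.4, 2.0), 4), ((4.3, 1.1), 6), ((5.2, 2.4), 5),
                   ((4.9, 3.5), 3)]

def weighted_mean_center(pts):
    """Frequency-weighted averages of the X and Y coordinates (equation 4.3)."""
    total_w = sum(w for _, w in pts)
    xwc = sum(w * x for (x, _), w in pts) / total_w
    ywc = sum(w * y for (_, y), w in pts) / total_w
    return xwc, ywc

xwc, ywc = weighted_mean_center(weighted_points)
print(round(xwc, 2), round(ywc, 2))   # 3.1 2.88
```

The heavy weight of 20 on point B pulls the center from (3.81, 2.51) to (3.10, 2.88), matching the worktable in table 4.4.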
66 Part II .A. Descriptive Problem Solving in Geography

In addition, the sum of squared deviations of all observations about a mean is the minimum sum possible. That is, the sum of squared deviations about a mean is less than the sum of squared deviations about any other number:

    X:  min Σ(Xi - X)²    (4.4)

This important attribute is called the least squares property of the mean.

The mean center is spatially analogous to the mean and has the same least squares property as the mean. That is:

    (Xc, Yc):  min Σ[(Xi - Xc)² + (Yi - Yc)²]    (4.5)

In a Cartesian coordinate system based on location, deviations such as (Xi - Xc) and (Yi - Yc) are, in fact, distances between points and can be expressed by the Euclidean distance we discussed earlier in the chapter:

    dic = √[(Xi - Xc)² + (Yi - Yc)²]    (4.6)

Thus, the mean center is the location that minimizes the sum of squared distances to all points. This characteristic makes the mean center an appropriate center of gravity for a two-dimensional point pattern, just as the mean is the center of gravity along a one-dimensional number line.

TABLE 4.4
Worktable for Calculating Weighted Mean Center*

          Locational coordinates    Weight     Weighted coordinates
Point       Xi        Yi             fi         fiXi       fiYi
A           2.8       1.5             5          14.0        7.5
B           1.6       3.8            20          32.0       76.0
C           3.5       3.3             8          28.0       26.4
D           4.4       2.0             4          17.6        8.0
E           4.3       1.1             6          25.8        6.6
F           5.2       2.4             5          26.0       12.0
G           4.9       3.5             3          14.7       10.5

n = 7    Σfi = 51    ΣfiXi = 158.1    ΣfiYi = 147.0

Xwc = ΣfiXi / Σfi = 158.1 / 51 = 3.10    Ywc = ΣfiYi / Σfi = 147.0 / 51 = 2.88

Weighted mean center coordinates: (3.10, 2.88)
* See figure 4.6 for graph of point locations, frequencies, and weighted mean center.

Median Center (Euclidean Median)

For many geographic applications, another measure of "center" is more useful than the mean center. Often, it is more practical to determine the central location that minimizes the sum of unsquared, rather than squared, distances. This location, which minimizes the sum of Euclidean distances from all other points in a spatial distribution to that central location, is called the median center (Xe, Ye) or Euclidean median. Mathematically, this location minimizes the sum:

    min Σ √[(Xi - Xe)² + (Yi - Ye)²]    (4.7)

Unfortunately, determining coordinates of the Euclidean median is complex methodologically. Computer-based iterative algorithms (step-by-step procedures) must be used to reach a solution. These algorithms evaluate a sequence of possible coordinates and gradually converge on a suitable (often optimal) location for the median center.

A weighted median center is a logical extension of the simple (unweighted) median center, and the same types of algorithmic procedures locate the weighted Euclidean median. The coordinates of the weighted Euclidean median (Xwe, Ywe) will minimize the expression:

    min Σ fi √[(Xi - Xwe)² + (Yi - Ywe)²]    (4.8)

The weights or frequencies may represent population, sales volume, or any other feature appropriate to the spatial problem.

The location of the weighted Euclidean median is important to geographers for several practical reasons. For example, a classical problem in economic geography is the so-called Weber problem, which seeks to determine the "best" location for an industry. The optimal location minimizes the total cost of transporting the raw material to the factory and the finished product to the market. The weighted Euclidean median is the location that minimizes these transportation costs.

Perhaps the most extensively developed general application for the median center in geography is public and private facility location. Often an important goal in facility location is minimizing the average distance traveled per person to reach a designated or assigned facility. This efficiency-based objective is equivalent to minimizing the aggregate or total distance people must travel to use the service systemwide. The Euclidean median achieves this goal.

Consider, for example, the problem of locating the site for an urban fire station based on a predicted pattern of fires for the region. Using the past or present pattern of fires as a reasonable estimate of future fires, the optimal central location for the station could be defined as the site that minimizes the total (and hence, average) distance traveled by the fire equipment to reach fires. That location is determined by the Euclidean median.
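The iterative procedure described above can be sketched with Weiszfeld's algorithm, a standard scheme for the (weighted) Euclidean median in which each step replaces the current estimate with a distance-weighted average of the points. This is an illustrative sketch that assumes the points are not all coincident; production solvers add convergence tests and other safeguards.

```python
from math import hypot

def euclidean_median(pts, weights=None, iters=100):
    """Approximate the (weighted) Euclidean median with Weiszfeld's
    iterative algorithm (equations 4.7 and 4.8)."""
    w = weights or [1.0] * len(pts)
    # start the iteration from the mean center
    x = sum(px for px, _ in pts) / len(pts)
    y = sum(py for _, py in pts) / len(pts)
    for _ in range(iters):
        num_x = num_y = den = 0.0
        for (px, py), wi in zip(pts, w):
            d = hypot(x - px, y - py)
            if d < 1e-12:        # iterate sits on a data point; skip it
                continue
            num_x += wi * px / d
            num_y += wi * py / d
            den += wi / d
        x, y = num_x / den, num_y / den
    return x, y

# Three collinear points: the Euclidean median is the middle point (1, 0)
print(euclidean_median([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]))
```

For the weighted Euclidean median of equation 4.8, pass a list of weights; the same iteration applies.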

In another application, suppose location analysts for an exclusive women's apparel chain wish to select an accessible site for a new store. Further, suppose that market analysis indicates that the demographic group most likely to shop in the store is women aged 45-65 who are members of households with incomes greater than $120,000. From the compilation of census tract information in the designated trade area, each tract could be weighted by the number of women having these age and income characteristics. The weighted median center will designate the location that minimizes the total (and average) distance traveled by these women to reach the potential store site.

Extending this procedure to the simultaneous location of multiple facilities within a spatial pattern of demand is known as the "location-allocation" problem or the "multiple facility location" problem. Suppose, for example, city health care planners wish to locate a set of neighborhood medical centers to provide selected types of remedial health care. Not only must a set of medical centers be located, but the potential clientele must be allocated to an appropriate facility, creating "catchment districts" or zones for each center. Problems such as these can be extremely complex and challenging, with both theoretical issues and practical applications receiving considerable attention from geographers.

In recent years, facility location strategies and modeling techniques have become much more sophisticated. The application of Geographic Information Sciences (GIS) to locational decision-making has become routine and expected, and is the procedural method of choice in market analysis and the optimal siting of businesses. Several references at the end of this chapter are worth examining for further information on this subject.

Up to this point, all of the descriptive measures of center have used Euclidean (straight-line) distance. This has been the only definition of "distance." In some geographic situations, however, this may not be the most appropriate way to proceed. For example, in an urban geography study of commercial activity locations (e.g., grocery stores, drug stores, etc.), none of the measures of center considered so far may be very accurate. When travel is confined to a grid or rectilinear street pattern, for example, a straight-line or Euclidean measure of distance underrepresents the actual travel distances and could result in a misplaced center or inaccurately provide a lower measure of dispersion for the facilities being evaluated.

Fortunately, other measures of distance are available. Recall that the formula measuring the Euclidean distance (dij) of point i (Xi, Yi) from point j (Xj, Yj) is:

    dij = √[(Xi - Xj)² + (Yi - Yj)²]

If this Euclidean distance metric is now "generalized" to allow non-Euclidean distance measurement, the result is a general distance metric:

    dij = [ |Xi - Xj|^k + |Yi - Yj|^k ]^(1/k)    (4.9)

When the general distance metric (k) equals 2, the formula is conventional Euclidean distance. When k equals 1, however, we are measuring distances in which movement is restricted to a rectangular or grid system. The term Manhattan distance describes the restrictive movement typical of travel in the New York City borough of Manhattan. Measuring the distance between points i and j in Manhattan space, where k equals one, gives:

    dij = |Xi - Xj| + |Yi - Yj|    (4.10)

If you want to travel by car between Columbus, Ohio and Chicago, Illinois, neither Euclidean distance nor Manhattan distance will be very accurate or realistic (figure 4.7). No major highway or highways allow anything close to straight-line travel between these two cities, so Euclidean distance is a rather poor measure of the effort required (travel time, comfort level while traveling, etc.) to get from the origin to the destination. Following a Manhattan distance route such as the one shown in figure 4.7 is clearly impractical and probably not possible. However, one of many possible network distance routes (traveling along interstate highways through Indianapolis, Indiana) seems reasonable from both travel time and travel convenience perspectives. For many spatial interaction models in geography, some type of network distance measure may be the best alternative for minimizing interaction effort or cost.

Linear Directional Mean

The linear directional mean (or simply directional mean) identifies the typical or general (mean) direction for a set of lines. That is, the overall trend of a set of lines is determined by calculating the mean angle of the lines. This is easily illustrated by showing the linear directional mean (LDM) as a graphic summary of a set of line data (figure 4.8).

In some geographic applications you might be interested only in the orientation of the lines. For example, a seismologist working with urban planners in the greater San Francisco area will be concerned only with the mean orientation of the San Andreas and other fault lines. Clearly, the concept of direction does not apply to a fault line. (Although opposing tectonic plates along a transverse fault are certainly moving in different directions!) You could calculate the directional mean of many transportation systems, such as a trail system in a national forest, or a road or railroad system in a region. Since movement can occur in both directions, our only interest would be the average orientation of the system.

In many other geographic applications the direction of each line is of importance and is factored into the measurement. For example, researchers might find it useful to determine the mean direction of such natural phenomena as tornado or hurricane tracks. Environmental planners might be interested in the seasonal migration paths of various deer or elk, or the mean migratory path of Canadian geese in the Atlantic flyway.

Many geographic phenomena change over time, and some applications could focus on temporal changes (possible cycles of change) in mean direction. Does the directional mean of winter storms crossing a particular location (a city in the Midwest, for example) change from decade to decade in any cyclical fashion? Have the seasonal migration paths of elk in the Greater Yellowstone Ecosystem changed due to encroaching development? Also, climatologists have long been interested in looking for repeating cycles that might be associated with some natural phenomenon.

The formula for the linear directional mean (LDM) is:

    LDM = arctan [ (Σ i=1..n sin θi) / (Σ i=1..n cos θi) ]    (4.11)

where θi represents the direction of a set of linear features from a single origin.

When using this application, GIS software is essentially calculating the angle of the resultant vector, the angle that results when each line in the analysis is treated as a unit vector (length of one unit).

FIGURE 4.8
Linear Directional Mean (LDM) Summarized from Data
(A set of line data and its LDM summary arrow.)

FIGURE 4.7
Measuring "Distance" Between Columbus, Ohio and Chicago, Illinois
(Euclidean distance: 274 miles; network distance along interstate highways through Indianapolis, Indiana: 358 miles; Manhattan distance: 405 miles.)
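Equations 4.9 and 4.10 reduce to a one-line function. The sketch below is illustrative, contrasting straight-line and grid-constrained distance for points A and B of figure 4.1; note that the figure 4.7 mileages come from an actual road network, which neither formula reproduces.

```python
def general_distance(p, q, k=2):
    """Generalized distance metric (equation 4.9).
    k=2 gives Euclidean distance; k=1 gives Manhattan distance (4.10)."""
    return (abs(p[0] - q[0]) ** k + abs(p[1] - q[1]) ** k) ** (1 / k)

a, b = (2.8, 1.5), (1.6, 3.8)   # points A and B from figure 4.1
print(round(general_distance(a, b, k=2), 2))   # 2.59 (straight line)
print(round(general_distance(a, b, k=1), 2))   # 3.5 (grid travel)
```

As expected, the grid-constrained (k = 1) distance is longer than the straight-line (k = 2) distance between the same two points.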

The manual calculation of the LDM is nicely illustrated by geographers Jay Lee and David Wong in Statistical Analysis with ArcView GIS.

The island territory of Puerto Rico, like many Caribbean countries, is vulnerable to hurricanes. Typically, tropical depressions and storms form over Western Africa and are carried westward by the trade winds to become hurricanes in the Caribbean region. The linear directional mean of all hurricane and tropical storm tracks that have crossed Puerto Rico from 1852 to 2007 is shown in figure 4.9. An interesting research hypothesis could be framed around the question of whether the LDM angle changes in any statistically significant way from one time period to another. Since the data are available from the National Climatic Data Center, many interesting research questions could be posed and analyses conducted to pursue this area of inquiry.

4.2 SPATIAL MEASURES OF DISPERSION

Standard Distance

Just as the mean center serves as a locational analogue to the mean, standard distance is the spatial equivalent of standard deviation (table 4.1). Standard distance measures the amount of absolute dispersion in a point pattern. After the locational coordinates of the mean center have been determined, the standard distance statistic incorporates the straight-line or Euclidean distance of each point from the mean center. In its most basic form, standard distance (SD) is written as follows:

    SD = √[ ( Σ(Xi - Xc)² + Σ(Yi - Yc)² ) / n ]    (4.12)

If equation 4.12 is modified algebraically, the number of required computations can be reduced considerably.

FIGURE 4.9
Linear Directional Mean: Hurricane and Tropical Storm Tracks Crossing Puerto Rico
Source: ESRI Resources Help, Linear Directional Mean
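Equation 4.11, used for summaries like the hurricane tracks in figure 4.9, can be sketched in a few lines. The version below uses atan2 rather than a plain arctangent of the ratio, since atan2 resolves the quadrant of the resultant vector (a ratio alone cannot distinguish opposite directions). This is an illustrative sketch; GIS implementations add further conventions, for example doubling angles for orientation-only data such as fault lines.

```python
from math import atan2, sin, cos, radians, degrees

def linear_directional_mean(angles_deg):
    """Direction of the resultant vector when each line is treated as a
    unit vector (equation 4.11); result is in degrees, 0-360."""
    s = sum(sin(radians(a)) for a in angles_deg)
    c = sum(cos(radians(a)) for a in angles_deg)
    return degrees(atan2(s, c)) % 360

print(round(linear_directional_mean([80, 90, 100]), 1))   # 90.0
print(round(linear_directional_mean([0, 90]), 1))         # 45.0
```

Because the lines are summed as unit vectors, each line contributes equally regardless of its length; a length-weighted variant would scale each sine and cosine by the line's length.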

[Note: in practical situations, it is most likely that all y Standard distance


computations will be done using a GIS package.] 4

(4.13)
3
------------- ___________ Q,.--,,

Using the same point pattern as in the earlier examples Meancenter _4- __ _/
(figure 4.1), standard distance is now calculated (table 2 (3.81, 2.51)
4.5) and shown as the radius of a circle whose center is
eA
the mean center (figure 4.10).
Like standard deviation, standard distance is 1 E
strongly influenced by extreme or peripheral locations.
Because distances about the mean center are squared,
outliers or atypical points have a dominating impact on X
the magnitude of the standard distance. 1 2 3 4 5 6
Weighted standard distance is appropriate for those
geographic applications requiring a weighted mean cen- FIGURE 4.10
ter. The definitional formula for weighted standard dis- Graph of Point Locations, Mean Center, and Standard Distance

tance (S wv) is:

2 A weighted standard distance can be computed


I/;(X;-X., )2+ If,(Y;-Y., ) using the same point pattern as before (table 4.6). A
SwD = If,
(4.14)
moderate disparity exists between the relative magni-
tudes of the unweighted and weighted standard distances
This may be rewritten in simpler form for computation as: (1.54 vs. 1.70). This difference can be primarily explained
by point B. Because this point is distant from the mean
center and exerts a proportionally greater influence on
(4.15) the standard distance measure with its larger weight,
point B causes the weighted standard distance to be
larger than the unweighted standard distance.

TABLE 4.5
Worktable for Calculating Standard Distance*

Point    Xi     Yi     Xi²      Yi²
A        2.8    1.5     7.84     2.25
B        1.6    3.8     2.56    14.44
C        3.5    3.3    12.25    10.89
D        4.4    2.0    19.36     4.00
E        4.3    1.1    18.49     1.21
F        5.2    2.4    27.04     5.76
G        4.9    3.5    24.01    12.25

n = 7    ΣXi² = 111.55    ΣYi² = 50.80
X̄c = 3.81    Ȳc = 2.51    X̄c² = 14.52    Ȳc² = 6.30

SD = √[(111.55/7 − 14.52) + (50.80/7 − 6.30)] = 1.54

*See figure 4.10 for graph of locational coordinates, mean center, and standard distance.

TABLE 4.6
Worktable for Calculating Weighted Standard Distance

Point    fi    Xi     Xi²      fi(Xi)²    Yi     Yi²      fi(Yi)²
A         5    2.8     7.84      39.20    1.5     2.25      11.25
B        20    1.6     2.56      51.20    3.8    14.44     288.80
C         8    3.5    12.25      98.00    3.3    10.89      87.12
D         4    4.4    19.36      77.44    2.0     4.00      16.00
E         6    4.3    18.49     110.94    1.1     1.21       7.26
F         5    5.2    27.04     135.20    2.4     5.76      28.80
G         3    4.9    24.01      72.03    3.5    12.25      36.75

Σfi = 51    Σfi(Xi)² = 584.01    Σfi(Yi)² = 475.98

From earlier calculation of weighted mean center:
X̄wc = 3.10    Ȳwc = 2.88    X̄wc² = 9.61    Ȳwc² = 8.29

S_WD = √[(584.01/51 − 9.61) + (475.98/51 − 8.29)] = 1.70
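The worktable arithmetic in tables 4.5 and 4.6 can be verified with a short script (a sketch, not from the text; it applies equations 4.13 and 4.15 using the rounded mean centers from the worktables and only Python's standard library):

```python
from math import sqrt

# Point pattern from tables 4.5 and 4.6: (x, y, weight f)
pts = [(2.8, 1.5, 5), (1.6, 3.8, 20), (3.5, 3.3, 8), (4.4, 2.0, 4),
       (4.3, 1.1, 6), (5.2, 2.4, 5), (4.9, 3.5, 3)]

# Mean centers computed earlier in the chapter, rounded to two decimals
xc, yc = 3.81, 2.51     # unweighted mean center
xwc, ywc = 3.10, 2.88   # weighted mean center

n = len(pts)
sum_f = sum(f for _, _, f in pts)

# Equation 4.13: computational formula for standard distance
sd = sqrt(sum(x**2 for x, _, _ in pts) / n - xc**2 +
          sum(y**2 for _, y, _ in pts) / n - yc**2)

# Equation 4.15: computational formula for weighted standard distance
swd = sqrt(sum(f * x**2 for x, _, f in pts) / sum_f - xwc**2 +
           sum(f * y**2 for _, y, f in pts) / sum_f - ywc**2)

print(f"{sd:.2f} {swd:.2f}")   # 1.54 1.70
```

Both results match the worktables, confirming that point B's large weight pulls the weighted standard distance above the unweighted value.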
Chapter 4 .A. Descriptive Spatial Statistics 71

Standard Deviational Ellipse

While standard distance is a useful measure of dispersion in a point pattern, it does not take the orientation of the pattern into account. Points may be dispersed differently along the X or Y axis, potentially revealing an elliptical trend in the point locations. To capture this trend, the standard distance is calculated separately for the X and Y coordinates using the following formulas:

SDx = √[Σ(Xi − X̄c)²/n]      SDy = √[Σ(Yi − Ȳc)²/n]      (4.16)

SDx represents the average distance points vary from the mean center on the X axis and SDy represents the average distance points vary from the mean center on the Y axis. Construction of the standard deviational ellipse begins with creating a cross of dispersion (figure 4.11). To calculate the cross of dispersion, a line of length SDx is extended in both directions from the mean center parallel to the X-axis. Similarly, a line of length SDy is extended in both directions from the mean center parallel to the Y-axis. An ellipse is constructed which encompasses this cross of dispersion. A trigonometric function is used to calculate the angle of rotation for the ellipse. The ellipse is rotated about the mean center so that the distances between the points and both arms of the cross of dispersion are minimized.

FIGURE 4.11
Standard Deviational Ellipse
[Figure: a cross of dispersion with arms of length SDx and SDy centered on the mean center (X̄c, Ȳc), the enclosing ellipse, and its angle of rotation]

TABLE 4.7
Summary Statistics for Standard Deviational Ellipse: Anti-Shipping Activity off the East African Coast

Year    # of incidents    X̄c       Ȳc       SDx (miles)    SDy (miles)    Rotation
2009    114               52.53    −1.00     379.7          604.7          45.4
2010    230               52.06    −1.58     415.9          956.3          56.3

Source: National Geospatial-Intelligence Agency (NGA)

The standard deviational ellipse methodology is applied to point patterns representing anti-shipping activities off the east coast of Africa in 2009 and 2010 (figure 4.12). Anti-shipping activities are hostile actions against ships or mariners and are sometimes referred to as piracy. Motivated by political and economic unrest in the region, the number of incidents of piracy has increased in recent years. The locations of these hostile acts are recorded by the National Geospatial-Intelligence Agency. There were 114 recorded incidents in 2009 and 230 incidents in 2010. Given the large number of incidents, it would be difficult even for the trained geographer to discern meaningful patterns from such a large number of points.

The standard deviational ellipse is calculated separately for the 2009 and 2010 incidents. This allows us to explore changes in the magnitude of dispersion and orientation of the pattern from one year to the next. The values summarizing the standard deviational ellipses are shown in table 4.7 and the ellipses themselves are shown in figure 4.12. The mean centers and X-axes standard distances for the two distributions are very similar. However, the standard distances for the Y-axes differ by over 350 miles, indicating a more dispersed pattern for the 2010 incidents. One possible explanation for this shift is that the pirates have developed the capability to operate at greater distances from the east coast of Africa. The orientation of the ellipses for both years is remarkably similar. The elongated ellipses suggest a corridor of piracy extending from the coast of Tanzania into the Indian Ocean. In fact, the standard deviational ellipses follow major shipping lanes which extend from the large port of Dar es Salaam, Tanzania, to the Middle East and Asia.

Although the standard deviational ellipse is a useful method for identifying the magnitude of dispersion and orientation of a point pattern, two procedural issues must be considered. First is the delineation of the study area boundary. In many geographic problems, a political or administrative boundary defines the study area. If this boundary is itself elongated it may impose an artificial spatial trend on the data. You must consider carefully whether the pattern of the standard deviational ellipse is the result of some underlying spatial process or an artifact of the study area boundary. The second issue is spatial outliers. Both the mean center and standard distance will be impacted by points which are located far from the center of the distribution. The presence of spatial outliers can produce artificially elongated standard deviational ellipses.

FIGURE 4.12
Standard Deviational Ellipse: Anti-shipping Activity off the East African Coast, 2009 and 2010
[Figure: map of anti-shipping events off Somalia, Kenya, and Tanzania, with mean centers and standard deviational ellipses for 2009 and 2010; scale bar 0–375–750 kilometers]
Source: National Geospatial-Intelligence Agency (NGA)

The Relative Dispersion Problem

The coefficient of variation (standard deviation divided by the mean) is the nonspatial measure of relative dispersion (table 4.1). Unfortunately, a perfect spatial analogue to the coefficient of variation does not exist for measuring relative dispersion. Although it seems logical to divide the standard distance by the mean center to produce a relative dispersion index, this procedure will not provide meaningful results.

The problem of measuring relative dispersion can easily be demonstrated. Consider a situation where the spatial statistics for the same point pattern are calculated twice using different positions for the coordinate system (figure 4.13). In case 1, the X coordinates of each point are three units lower and the Y coordinates one unit lower than in case 2. Notice, however, that the shift in the coordinate system does not affect standard distance. As a measure of absolute dispersion, standard distance remains unchanged. Conversely, the coordinates of the mean center will change whenever the coordinate system is shifted. Because coordinate system location affects the mean center but not standard distance, a relative dispersion metric based on the ratio of these measures will be meaningless.

Despite these difficulties, some logical estimate of relative dispersion is necessary for spatial measurement. Consider the three point patterns in regions A, B, and C (figure 4.14, case 1). The distribution of points in each region has the same amount of absolute dispersion and the same standard distance. However, in small region A, the points have a high degree of relative dispersion, whereas they have a low relative dispersion in region C because the region is larger. The point patterns in regions D and E (figure 4.14, case 2) appear to have the same amount of relative dispersion. However, the point pattern in region D has a larger standard distance (absolute dispersion) than region E because of its larger size.
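The coordinate-shift experiment of figure 4.13 is easy to replicate numerically. The sketch below (not from the text) shifts the seven-point pattern from table 4.5 by three X units and one Y unit and recomputes both statistics:

```python
from math import sqrt

def mean_center(pts):
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

def standard_distance(pts):
    # Definitional formula: square root of the mean squared distance
    # of the points about their mean center
    xc, yc = mean_center(pts)
    return sqrt(sum((x - xc)**2 + (y - yc)**2 for x, y in pts) / len(pts))

case1 = [(2.8, 1.5), (1.6, 3.8), (3.5, 3.3), (4.4, 2.0),
         (4.3, 1.1), (5.2, 2.4), (4.9, 3.5)]
case2 = [(x + 3, y + 1) for x, y in case1]   # shift every point, as in figure 4.13

print([round(c, 2) for c in mean_center(case1)])  # [3.81, 2.51]
print([round(c, 2) for c in mean_center(case2)])  # [6.81, 3.51]
print(abs(standard_distance(case1) - standard_distance(case2)) < 1e-9)  # True
```

The mean center moves with the axes while the standard distance does not, which is why a ratio of the two cannot serve as a relative dispersion measure.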

FIGURE 4.13
Arbitrary Placement of Coordinate Axes and Resultant Descriptive Spatial Statistics
[Figure: the same seven-point pattern plotted under two coordinate systems. Case 1: X̄c = 3.81, Ȳc = 2.51, SD = 1.54. Case 2: X̄c = 6.81, Ȳc = 3.51, SD = 1.54.]

FIGURE 4.14
Comparisons of Absolute and Relative Point Pattern Dispersion
[Figure: Case 1 shows regions A, B, and C with the same absolute dispersion but decreasing relative dispersion from region A to region C; Case 2 shows regions D and E with the same relative dispersion but decreasing absolute dispersion from region D to region E.]
To derive a descriptive measure of relative spatial dispersion, the standard distance of a point pattern must be divided by some measure of regional magnitude. This divisor cannot be the mean center. One possible divisor is the radius (rA) of a circle with the same area as the region being analyzed. A measure of relative dispersion, called relative distance (RD), can now be defined:

RD = SD / rA      (4.17)

This relative distance measure allows direct comparison of the dispersion of different point patterns from different areas, even if the areas are of varying sizes.

KEY TERMS
absolute and relative dispersion in a point pattern, 69
central feature (point or area), 63
Euclidean (straight-line) distance, 67
general distance metric, 67
geographic "center of population", 65
geostatistics, 62
linear directional mean, 67
Manhattan distance, 67
mean center (weighted mean center), 63
median center (Euclidean median), 66
network distance, 67
relative distance, 73
standard deviational ellipse, 71
standard distance (weighted standard distance), 69

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Understand the concept of central tendency as applied in a spatial context and explain the distinctive characteristics of the central point, mean center, and Euclidean median.
2. Define the spatial measures of dispersion (standard distance and relative distance) and recognize their potential applications in geographic problem solving.
3. Recognize when a point pattern in a study region has a large or small absolute dispersion and a large or small relative dispersion.
4. Identify the potential limitations and locational issues associated with the application of descriptive spatial statistics.
5. Understand the concept of "distance metric" and distinguish between Euclidean distance, Manhattan distance, and other distance metrics under various spatial conditions.
6. Explain the goals of the linear directional mean and standard deviational ellipse and suggest practical applications of each.

REFERENCES AND ADDITIONAL READING

Abler, R. F., J. S. Adams, and P. R. Gould. Spatial Organization: The Geographer's View of the World. Englewood Cliffs, NJ: Prentice-Hall, 1971.
Birkin, M., G. Clarke, M. Clarke, and A. Wilson. Intelligent GIS: Location Decisions and Strategic Planning. New York: John Wiley and Sons, 1996.
Davies, R. and D. Rogers. Store Location and Store Assessment Research. New York: John Wiley and Sons, 1984.
Drezner, Z., ed. Facility Location: A Survey of Applications and Methods. Heidelberg: Springer Verlag, 1996.
Ebdon, D. Statistics in Geography. 2nd ed. Oxford: Basil Blackwell, 1985.
ESRI. "Linear Directional Mean (Spatial Statistics)." Accessed January 9, 2014. http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000017000000
Lee, J. and D. W. S. Wong. Statistical Analysis with ArcView GIS. New York: Wiley, 2001.
Mennecke, B. E. "Understanding the Role of Geographic Information Technologies in Business: Application and Research Directions." Journal of Geographic Information and Decision Analysis 1 (1997): 44–68.
Mitchell, A. The ESRI Guide to GIS Analysis. Vol. 2: Spatial Measurements and Statistics. Redlands, CA: ESRI Press, 2005.
National Climatic Data Center. Search www.ncdc.noaa.gov for hurricane and tropical storm tracks.
National Geospatial-Intelligence Agency (NGA). Search the NGA website for standard deviational ellipse data. www1.nga.mil/
Smith, M. J., M. F. Goodchild, and P. A. Longley. Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools. Leicester: Matador, imprint of Troubador Pub. Ltd., 2009.
Taylor, P. J. Quantitative Methods in Geography: An Introduction to Spatial Analysis. Boston: Houghton Mifflin, 1977.
PART III

THE TRANSITION TO INFERENTIAL PROBLEM SOLVING

5  Basics of Probability and Discrete Probability Distributions

5.1 Basic Probability Processes, Terms, and Concepts
5.2 The Binomial Distribution
5.3 The Geometric Distribution
5.4 The Poisson Distribution

As geographers, we are vitally interested in exploring and studying spatial patterns found on the earth's physical and cultural landscapes. We develop descriptions and explanations of existing patterns and try to understand the processes that create these distributions. In many cases, we make predictions about future occurrences of geographic patterns and suggest spatial policies. In short, the core of geographic problem solving is the description, explanation, and prediction of geographic patterns and processes.

In earlier chapters, we have focused on ways to classify, describe, and otherwise summarize spatial data. Much of the remainder of the book presents various inferential statistical methods for problem solving in geography. The chapters in this part of the book (chapters 5-8) deal with probability and sampling and serve as a transition to the inferential statistical techniques that follow.

Chapter 5 provides an introduction to probability and the important role it plays in geographic problem solving. The first section discusses the nature of geographic patterns and the processes that produce them. Included here are geographic examples of deterministic, random, and stochastic processes. This section also introduces basic terms and concepts of probability, including definitions and rules for using probability in geographic analysis.

Sections 5.2 through 5.4 cover certain discrete probability distributions that are important in geographic problem solving. The binomial distribution, discussed in section 5.2, concerns probability of events having only two possible outcomes and is particularly useful for studying the likelihood of multiple events or trials. Section 5.3 introduces the geometric distribution, which is applied under similar circumstances as the binomial, but is concerned specifically with the timing of the next event. The Poisson distribution, presented in section 5.4, is used to analyze patterns that are random over time or space. Examples of both spatial and temporal pattern analysis with Poisson are shown in this section. The normal distribution, which is of central importance to all researchers, is the primary topic in the next chapter.

5.1 BASIC PROBABILITY PROCESSES, TERMS, AND CONCEPTS

Real-world processes that produce physical or cultural patterns on the landscape are often complex and are not usually totally identifiable. Two general categories are used to describe the nature of these processes: deterministic and probabilistic. Deterministic processes create patterns with total certainty, since the outcome can be exactly specified with 100% likelihood. Because of uncertainty and complexity in human behavior and decision making, virtually no cultural processes are completely deterministic, but some physical processes established through scientific principles fall into this category. For example, the length of time insolation (solar radiation)
78 Part III ▲ The Transition to Inferential Problem Solving

strikes a point on the earth's surface is determined by both latitude and day of the year. Given these two components, the exact length of daylight and darkness can be determined at any location.

The second category, probabilistic processes, concerns all situations that cannot be determined with complete certainty. Given the complex character of most geographic situations, virtually all location-based problems fall into this category. Therefore, while the number of hours of sunlight at a particular location and time is deterministic, the amount of solar energy actually reaching the ground is probabilistic because the amount of cloud cover and particulate matter that absorbs and reflects incoming solar energy as it passes through the atmosphere is changing all the time.

Probabilistic processes can be subdivided into two useful categories. With random processes, the same probability is assigned to all outcomes because each outcome has an equal chance of occurring. Typical examples of random processes include drawing a card from a deck, rolling a die, or tossing a coin. Such situations are considered examples of maximum uncertainty. In a geographic context, an ornithologist might develop a GIS model to predict breeding bird occupancy within a wildlife refuge by assigning all nesting locations an equal random probability for occupancy. With stochastic processes, the likelihood or probability of any particular outcome can be specified and not all outcomes are equal. For example, an ornithologist may assign a greater probability that a bird will select a nesting location based on how far it is from the edge of the refuge or whether the location is shielded by a dense tree canopy.

One way to illustrate the nature of stochastic processes is to look at the Los Angeles Department of Water and Power's (LADWP) forecasting of serviceability in a water distribution network after an earthquake. The probability of a pipe breaking entails numerous complex factors such as the severity of the earthquake (as measured by peak ground velocity), the pipe material and age, and the characteristics of the soil through which the pipe runs. The combination of these factors determines the likelihood that an individual pipe may break, allowing LADWP to estimate the number and location of broken pipes throughout the system. Because the dynamics of infrastructure performance is complex, particularly when predicting future events, identifying an individual pipe's integrity after an earthquake cannot be specified with certainty. Under these conditions, the agency can only specify the likelihood or probability of each pipe's integrity in the system.

When dealing with this level of uncertainty, a stochastic model is often run numerous times, with each trial yielding a slightly different result. An overall estimate of system serviceability is then made based on all the independent model runs. This approach is known as Monte Carlo simulation and is frequently used when there is significant uncertainty in the model parameters.

Some spatial patterns are almost totally unpredictable. For example, the precise location of a tornado touchdown within a region is partly the result of random meteorological processes. Sometimes the size of the study area affects the degree of randomness. The number of tornadoes next year within Kansas can be estimated as a stochastic probability; however, if the focus is narrowed to a small area within Kansas, the exact location of a particular tornado touchdown is random or very close to random.

The study of probability focuses on the occurrence of an event, which can usually result in one of several possible outcomes. Once all possible outcomes have been considered for an event, probability represents the likelihood of a given result or the chance that an outcome actually takes place. Suppose a simple experiment is conducted by rolling a single standard die having six faces: 1, 2, 3, 4, 5, and 6. What is the likelihood of rolling a 6 with a single toss of the die? In this experiment, the event is the roll of the die, and the possible outcomes are the 6 sides of the die. In effect, rolling the die is taking a sample observation from the infinitely large population of all rolls of that die. The die could be rolled 20 times to collect a sample size of 20. If the die is not "loaded," each outcome or face of the die has an equal chance of occurring and the likelihood of rolling a 6 is one out of six or .167. Of course, the same probability also exists that a 1, 2, 3, 4, or 5 will be thrown. Collecting sample data of outcomes from such experiments (when the sample is only a portion of the population) and then calculating the probabilities of different outcomes is the basis for statistical inference.

Probability can be thought of as relative frequency: the ratio between the absolute frequency for a particular outcome and the frequency of all outcomes:

P(A) = F(A) / F(E)      (5.1)

where P(A) = probability of outcome A occurring
      F(A) = absolute frequency of outcome A
      F(E) = absolute frequency of all outcomes for event E

Probabilities can also be interpreted as percentages when the denominator of equation 5.1 is converted to 100. In the example of the die roll, the probability of a 6 (.167) can also be described as an outcome that has a 16.7% likelihood of occurring.

Some of the basic terms and concepts of probability can be illustrated with a simple geographic example. Suppose a day is classified as wet (W) if measurable precipitation (defined as at least .01 of an inch) falls during a 24-hour period, while the day is termed dry (D) if measurable precipitation does not occur. This is a valid classification, with clearly defined operational definitions of wet and dry. The two categories are mutually exclusive, since it is not possible to assign a particular day to more
Chapter 5 ▲ Basics of Probability and Discrete Probability Distributions 79

FIGURE 5.1
Relative Frequency of Wet and Dry Days from a 100-Day Period
[Figure: bar graph of relative frequency P(X); wet days (daily precipitation ≥ 0.01 inches) = .38, dry days (daily precipitation < 0.01 inches) = .62]

than one category because the categories of wet and dry do not overlap. By keeping a record of wet and dry days over a 100-day period, absolute frequencies of precipitation can be determined and relative probabilities calculated from the data (fig. 5.1). In this example, 62 days are categorized as dry and 38 as wet. The probability of a wet day occurring P(W) is

P(W) = number of wet days / total days = 38 / 100 = .38      (5.2)

Similarly, the probability of a dry day occurring P(D) is

P(D) = number of dry days / total days = 62 / 100 = .62      (5.3)

Thus, a 38% chance exists that a day will have measurable precipitation and a 62% chance that a day will be dry.

Several rules of probability guide the decision maker. The maximum probability for any outcome is 1.0, indicating total certainty or perfect likelihood of a particular occurrence. The lowest probability for any outcome is 0.0, suggesting no chance of occurrence. Most outcomes have probabilities falling between these maximum and minimum values. Thus, for any outcome (A),

0.0 ≤ P(A) ≤ 1.0      (5.4)

Another straightforward definition concerns the complement of an outcome. If (A) contains all those events in which a particular outcome occurs, then the complement of (A), written as (Ā), includes all those events where that particular outcome does not occur. The sum of the probabilities of an outcome and its complement is 1.0:

P(A) + P(Ā) = 1.0      (5.5)

In the precipitation example, the probability of a wet day is .38, and the probability of a dry day is .62. Since these are complementary outcomes and they are mutually exclusive, their probabilities must total 1.0:

P(W̄) = P(D)      (5.6)

so

P(W) + P(W̄) = P(W) + P(D) = 1.0      (5.7)

Let's return to the example of the die for a moment. With a single toss of the die, it is known that the probability of rolling a 6 is 1/6 or .167. What is the probability of tossing the die twice in succession and getting a 6 both times? These two events, the first roll of the die and the second roll of the die, are statistically independent events. Statistical independence exists when the probability of one event occurring is not influenced or affected by whether another event has occurred. The chance of rolling a 6 on the second roll is not at all influenced by what happened on the first roll. Many of the inferential statistics discussed later in the book are based on this property of statistical independence. This means that many of the sampling methods developed in the following chapters must be designed to incorporate this assumption of independence.

If events are independent, the simple multiplication rule of probability is applied. For independent events A and B, this rule states

P(A and B) = P(A) • P(B)      (5.8)

Therefore, if event A is the first roll of the die and event B is the second roll of the die,

P(A and B) = P(A) • P(B) = 1/6 • 1/6 = 1/36      (5.9)

That is,

P(A and B) = P(A) • P(B) = .167 • .167 = .02778      (5.10)

There is one chance in 36 (somewhat less than a 3% chance) of tossing a die twice in succession and getting a 6 with both tosses.

The multiplication rule of probability can be extended using similar logic. For example, if a die is rolled three times, the likelihood of rolling a 6 on the first roll (event A), another 6 on the second roll (event B), and yet another 6 on the third roll (event C) is

P(A and B and C) = P(A) • P(B) • P(C) = 1/6 • 1/6 • 1/6 = 1/216      (5.11)

That is,

P(A and B and C) = P(A) • P(B) • P(C) = .167 • .167 • .167 = .00463      (5.12)

The chance of tossing three 6s in succession on a die is very low (less than one-half of one percent).

Let's consider how these basic probability rules can be applied to a set of U.S. immigration data for fiscal year 2010 (table 5.1). These data show the number of persons obtaining legal permanent residence by country of birth and state destination. Only the five largest countries of birth and five largest state destinations are shown in the table, but they account for a significant amount of the total immigration. Of all persons obtaining legal permanent residence in the United States in 2010, 37.5% were born in one of these five countries and 58.4% located in one of these five states.

Some categories in the table are mutually exclusive. For example, some immigrants initially settled in California while others first settled in New Jersey. These are mutually exclusive (non-overlapping) categories because no single person can initially live in more than one state. Similarly, no single person can have more than one country of birth. Within certain portions of the table, however, the categories are not mutually exclusive. For example, a single immigrant could certainly be from Mexico and settle in Florida and thus be included in both of these (non-mutually exclusive) categories. In fact, 3,113 such individuals exist.

Suppose your task is to calculate the probability of selecting an immigrant from China or India at random from a list of all 1,042,625 persons obtaining legal permanent residence. The concept of randomness requires that each individual on this very lengthy list has an equal chance of being selected, and the selection is not biased in any way. Since these occurrences do not overlap, the addition rule for mutually exclusive events can be applied. In general form, this rule is as follows:

P(A or B) = P(A) + P(B)      (5.13)

where P(A or B) = probability of event A or event B
      P(A) = probability of event A
      P(B) = probability of event B

If event A is the selection of a person from China and event B is the selection of a person from India, then the probability of selecting an immigrant from one or the other of these countries is:

if P(A) = 70,863 / 1,042,625 = .0680      (5.14)

and P(B) = 69,162 / 1,042,625 = .0663      (5.15)

then P(A or B) = .0680 + .0663 = .1343      (5.16)

Using similar logic for more than two events, the addition rule for mutually exclusive events can be generalized. For example, with three mutually exclusive events A, B, and C, the addition rule becomes

P(A or B or C) = P(A) + P(B) + P(C)      (5.17)

If we identify event A as the selection of an immigrant from China, event B as the selection of an immigrant from India, and event C as the selection of an immigrant from the Philippines, then the probability of selecting a person from any one of these three countries is their sum, according to the addition rule:

if P(A) = 70,863 / 1,042,625 = .0680      (5.18)

and P(B) = 69,162 / 1,042,625 = .0663      (5.19)

and P(C) = 58,173 / 1,042,625 = .0558      (5.20)

then P(A or B or C) = P(A) + P(B) + P(C) = .0680 + .0663 + .0558 = .1901      (5.21)

When randomly selecting a single individual from all persons obtaining legal permanent residence status during fiscal 2010, the probability of choosing a person from China, India, or the Philippines is .1901 (19.01%).

Suppose your task is to determine the probability of selecting a person from Mexico or an immigrant settling in California, or both. These are certainly not mutually exclusive events, as many Mexicans obtained legal permanent residence in California in 2010. In fact, according to table 5.1, that number is known to be 50,645. For this situation, an addition rule for non-mutually exclusive events must be applied:

P(A or B) = P(A) + P(B) − P(A and B)      (5.22)

TABLE 5.1
Persons Obtaining Legal Permanent Residence Status by Country of Birth and U.S. State Destination, Fiscal Year 2010
[Table: rows are the five largest countries of birth (Mexico, China, India, Philippines, Dominican Republic), listed in order by row; columns are the five largest state destinations (CA, NY, FL, TX, NJ), listed in order by column, plus a TOTAL (all states) column. Row totals: Mexico 139,120; China 70,863; India 69,162; Philippines 58,173; Dominican Republic 53,870. Column totals: CA 208,446; NY 147,999; FL 107,276; NJ 58,920. Grand total obtaining legal permanent status: 1,042,625. Cell values cited in the text include Mexico–CA 50,645; Mexico–FL 3,113; Dominican Republic–NY 26,249. The remaining cell values are illegible in this reproduction.]
Source: U.S. Department of Homeland Security, Yearbook of Immigration Statistics, 2010

If event A is a person from Mexico, and event B is a person settling in California, then the following is known:

if P(A) = 139,120 / 1,042,625 = .1334    (5.23)

and P(B) = 208,446 / 1,042,625 = .1999    (5.24)

and P(A and B) = 50,645 / 1,042,625 = .0486    (5.25)

then P(A or B) = P(A) + P(B) - P(A and B)
               = .1334 + .1999 - .0486 = .2847    (5.26)

Note that the 50,645 Mexicans settling in California have been counted twice; those immigrants are included in the numerators of both P(A) and P(B). Therefore, this outcome P(A and B) needs to be subtracted to avoid double counting or duplication. Thus, if a single person is randomly selected from a list of all 1,042,625 persons obtaining legal permanent residence in the United States in fiscal year 2010, there is a .2847 probability (28.47% chance) of selecting someone from Mexico, or someone settling in California, or both.

Suppose you already know that a particular individual has settled in New York (event B), and the probability that this individual is from the Dominican Republic (event A) is to be calculated. In this case, determining the conditional probability P(A | B) is the appropriate strategy:

P(A | B) = P(A and B) / P(B)

where P(A and B) = 26,249 / 1,042,625 = .0252

and P(B) = 147,999 / 1,042,625 = .1419

then P(A | B) = .0252 / .1419 = .1774

Many other rules of probability exist for more complex applications. However, because this discussion of probability is only introductory, you should examine the sources listed at the end of the chapter for more thorough treatments of the subject.

The probability of outcomes in certain problems follows consistent or typical patterns. Such patterns, called probability distributions, relate closely to frequency distributions, such as the histogram and ogive diagrams shown earlier. In a frequency distribution, the number of occurrences appears on the vertical axis; in a probability distribution, the probability of occurrence is displayed on the vertical axis. In both cases, the horizontal axis shows the actual outcomes, occurrences, or values of the variable being studied.

Recall that variables are termed discrete or continuous depending on whether the values occur as distinct whole numbers (discrete) or decimal values (continuous). Probability distributions for discrete outcomes are termed discrete probability distributions, whereas those for outcomes that can occur at an infinite number of points are termed continuous probability distributions. In this chapter, discussion is focused on three widely used discrete distributions: binomial, geometric, and Poisson.

5.2 THE BINOMIAL DISTRIBUTION

The binomial is a discrete probability distribution associated with events having only two possible outcomes. Binary outcomes are often described in a zero-one, yes-no, or success-failure format, and many geographic situations fit a binary framework. Consider these examples: a location either has measurable precipitation over a 24-hour period or it does not; a person is either a male or a female; a river is either above or below flood stage; a respondent to an opinion poll either favors or opposes an issue. In each of these situations, only one of two outcomes is possible, assuming that an undecided or uncertain result is not possible. The general shape of the binomial distribution is shown in figure 5.2.

FIGURE 5.2
The Binomial Distribution

The binomial distribution is especially useful in examining probabilities from multiple events or trials, each of which has only two possible outcomes. Geographers are often interested in evaluating sequences of these events or trials: phenomena that either occur or don't occur at some location through time. In many of these situations, the probability of a single event is determined quite easily, perhaps using historical data. For example, the probability of a river in Bangladesh reaching flood stage during a given year may be .40. Thus, on average, flooding occurs four years out of ten. With this information, the binomial distribution may be used to determine the probability of multiple flood events occurring over other time periods (perhaps 15 years out of the next 30). The assumption is that a river reaching flood stage during a given year is an event independent of whether the river reached flood stage in other years.

When using binomial probabilities, the focus is on one of the two possible outcomes, termed the given outcome (X). The binomial distribution is shown mathematically as follows:

P(X) = [n! / (X!(n - X)!)] p^X q^(n-X)    (5.27)

where n = number of events or trials
      p = probability of the given outcome in a single trial
      q = 1 - p, the probability of the other outcome in a single trial
      X = number of times the given outcome occurs within the n trials
      n! = n factorial: if n > 0, n! = n(n-1)(n-2)...(2)(1); if n = 0, n! = 1
           (for example, 5! = (5)(4)(3)(2)(1) = 120)

To summarize, the binomial distribution is applicable for those geographic problems in which the following conditions are met:

• The objective is to determine the probability of multiple (n) independent events or trials, and the number of trials is fixed.
• Each event or trial has two possible outcomes, one termed the given outcome (X), with associated probability p, and a complementary or other outcome having probability q = 1 - p.
• The probabilities p and q must remain stable or consistent over the duration of the study period or over successive trials through time; that is, the probability of a given outcome is the same for each trial.

The best way to illustrate the characteristics and practical uses of the binomial probability is with a geographic example. Suppose a potential vegetable grower is seeking a location to start a business. One of the key variables in site selection is the probability of adequate precipitation at that site, thereby reducing the necessity of expensive irrigation. Suppose the potential grower has learned from other growers in the immediate region that at least three inches of precipitation are needed during the growing season to avoid irrigation. The grower estimates that irrigation can be afforded only one year in five to make a sufficient profit. Precipitation data collected for a potential site show that in 21 of the last 25 years rainfall exceeded three inches during the growing season. The historical record therefore suggests that the probability of a given year requiring supplemental irrigation is 4/25 or 0.16.

What is the probability that the grower can meet the requirement of having to irrigate only one year in five or not having to irrigate at all? Over the five-year period the grower faces six possible outcomes, from none of the five years requiring irrigation up to all five years requiring irrigation. However, only two of the six outcomes (no years and one year) would result in a profitable situation for the grower.

TABLE 5.2
Critical Values and Binomial Probabilities of Suitable Outcomes for Vegetable Grower

Critical Values:
  X = 0, 1, 2, 3, 4, 5
  n = 5 years
  p = 4/25 = .16
  q = (1 - p) = .84
  n! = (5)(4)(3)(2)(1) = 120

When X = 0:  P(0) = [5!(.16)^0(.84)^5] / [0!(5!)] = [120(1)(.418)] / [1(120)] = .418
When X = 1:  P(1) = [5!(.16)^1(.84)^4] / [1!(4!)] = [120(.16)(.498)] / [1(24)] = .398
When X = 2:  P(2) = [5!(.16)^2(.84)^3] / [2!(3!)] = [120(.026)(.593)] / [2(6)] = .152
When X = 3:  P(3) = [5!(.16)^3(.84)^2] / [3!(2!)] = [120(.004)(.706)] / [6(2)] = .029
When X = 4:  P(4) = [5!(.16)^4(.84)^1] / [4!(1!)] = [120(.001)(.84)] / [24(1)] = .003
When X = 5:  P(5) = [5!(.16)^5(.84)^0] / [5!(0!)] = [120(.000)(1)] / [120(1)] = .000

FIGURE 5.3
Probability of Vegetable Grower Needing Irrigation for 0 to 5 Years
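The table 5.2 calculations can be reproduced with a few lines of code. This is a sketch in Python (used here only for illustration); `math.comb` supplies the n! / (X!(n - X)!) term of equation 5.27.

```python
from math import comb

# Binomial probabilities (equation 5.27) for the vegetable grower:
# n = 5 growing seasons, p = .16 chance a season needs irrigation.
n, p = 5, 0.16

def binomial_prob(x: int, n: int, p: float) -> float:
    """P(X = x) for a binomial with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

probs = [binomial_prob(x, n, p) for x in range(n + 1)]
print([round(pr, 3) for pr in probs])  # -> [0.418, 0.398, 0.152, 0.029, 0.003, 0.0]

# Suitable outcomes: irrigation needed in at most one of the five years.
print(round(probs[0] + probs[1], 3))
```

Summing the unrounded P(0) and P(1) gives about .817; the 81.6% quoted in the text comes from adding the rounded table values .418 and .398.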

Each of the other four outcomes would be too costly. Thus, the probability that the grower will meet the profit requirement is found by summing the probabilities for the two suitable outcomes. The critical values and calculations of the binomial probabilities for all possible outcomes are shown in table 5.2. The binomial probabilities for the problem are also listed in table 5.3 (with suitable or unsuitable outcome) and plotted in figure 5.3.

TABLE 5.3
Binomial Probabilities of Vegetable Grower Needing Irrigation Over a Five-Year Period

Number of years out of five    Binomial probability    Type of outcome
0                              .418                    Suitable
1                              .398                    Suitable
2                              .152                    Unsuitable
3                              .029                    Unsuitable
4                              .003                    Unsuitable
5                              .000                    Unsuitable
TOTAL                          1.000

Total probability of suitable outcomes = .816
Total probability of unsuitable outcomes = .184

The vegetable grower would have a .418 probability of needing no irrigation and a .398 chance of needing one year of irrigation during the next five years. By adding the binomial probabilities for no years and one year of irrigation (.418 + .398), we conclude that the grower has an 81.6% chance of meeting the profitability requirement at this potential site during the five-year interval. This level of risk may or may not be acceptable to the grower.

5.3 THE GEOMETRIC DISTRIBUTION

The geometric distribution is a special case of the binomial distribution. While the binomial distribution determines the probability associated with a number of given outcomes or events ("successes"), the narrower focus of the geometric distribution is on the exact number of consecutive trials necessary to observe the given outcome or event for the first time. For example, a geographer might be interested in the timing of the next category three hurricane at a particular location. As another example, you might want to estimate the probabilities associated with the timing of the next major earthquake (say, above 7.0 on the Richter scale) along the San Andreas Fault in the San Francisco area. Questions like these could be asked: "What is the probability that the next big quake in this area will occur after 2020?" or "I've just signed a yearly contract for an apartment located very close to the San Andreas Fault and want to know the likelihood of the next major earthquake occurring next year." The general form of a geometric distribution is shown in figure 5.4.

FIGURE 5.4
The Geometric Distribution

In conclusion, the geometric distribution applies to geographic problems in which the following conditions are met:

• The objective is to determine the number of consecutive independent trials needed to observe a given event (success) for the first time. Alternatively, the objective is to determine the number of failures that will occur before the first success.
• As with the binomial distribution, each event or trial has two possible outcomes (success/failure), each with a stable probability of occurrence.
• The geometric probability is "memoryless." That is, if a success has not yet occurred in a sequence of trials, the number of past trials in the sequence does not change the probability distribution of the number of trials that remain before a success occurs. For example, if an unbiased coin has been tossed ten consecutive times and all tosses have been heads, the probability of tails on the next toss remains unchanged at .50. Quite simply, the past doesn't influence the future (a characteristic that many gamblers fail to grasp!).

The usefulness of the geometric distribution is best shown with an actual geographic application. The Susquehanna River Basin drains 27,510 square miles of New York, Pennsylvania, and Maryland, where about 4 million people live (fig. 5.5). The Basin has a history of flooding. In fact, the main stem of the Susquehanna has experienced 15 major floods since record-keeping began around 1800, with the latest incident occurring quite recently (table 5.4). This averages out to a major flood every 14 years.

Some past flooding has been catastrophic. The flooding from tropical storm Agnes in 1972 caused record-breaking damage in the Susquehanna River

Basin. Seventy-two people died and damage topped $2.8 billion, about $14.3 billion in today's dollars. It was the nation's most costly disaster until Hurricane Katrina struck New Orleans and the Gulf Coast in 2005. Future flooding of the Susquehanna has the potential to be even more devastating than past events. Of the 1,400 communities in the Basin, an estimated 1,160 still have residents who live in flood-prone areas. Obviously, flood prediction, management, and protection are of utmost concern for these people.

The geometric distribution can be applied to determine probabilities associated with the timing of the next major flood along the Susquehanna. It seems logical to want flood probability information for each single year in the future and also a cumulative probability estimate (cumulative distribution function, or CDF) showing when the next major flood will most likely occur.

According to the historical record, the probability a major flood will occur in the first year of analysis is p = .14. It is assumed that the timing of each major flood is an independent event, not influenced by the timing of previous floods: certainly a reasonable assumption. [You might ask if this is really a correct assumption: if climate change is significant, the frequency of flood events could be changing!] If events are truly independent, the probability of a flood not occurring in the first year is (1 - p) = .86.

TABLE 5.4
Major Floods in the Susquehanna River Basin Since 1800

1810                              1964
1865                              1972  Tropical Storm Agnes
1889                              1975  Tropical Storm Eloise
1894                              1996  Basinwide Flash Flood
1935                              2004  Tropical Storm Ivan
1936  St. Patrick's Day           2006
1946                              2011  Tropical Storm Lee
1955  Hurricanes Connie & Diane

Source: Susquehanna Flood Forecast and Warning System

FIGURE 5.5
The Susquehanna River Basin
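Using p = .14 and the assumption of independence, the year-by-year and cumulative flood probabilities developed in the next few paragraphs (and tabulated in table 5.5) can be sketched in a few lines. This is an illustrative Python sketch, not part of the original analysis.

```python
# Geometric probabilities for the timing of the next major flood,
# with p = .14 from the historical record (15 floods in roughly 200 years).
p = 0.14

def year_prob(k: int, p: float) -> float:
    """P(next flood occurs in year k): (k - 1) flood-free years, then a flood."""
    return (1 - p) ** (k - 1) * p

def by_year_prob(k: int, p: float) -> float:
    """P(next flood occurs in year k or sooner): 1 - (1 - p)**k."""
    return 1 - (1 - p) ** k

print([round(year_prob(k, p), 4) for k in range(1, 6)])  # -> [0.14, 0.1204, 0.1035, 0.089, 0.0766]
print(round(by_year_prob(5, p), 4))                      # -> 0.5296
print(round(1 / p, 2))                                   # mean waiting time in years -> 7.14
```

These values match the probability mass function and CDF columns of table 5.5; the long-run mean of about 7.14 years is simply 1/p.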

The probability of a flood occurring in the second year (if none occurred in the first year) is (1 - p)(p) = (.86)(.14) = .1204. The probability of a flood occurring in the third year (if no flood occurred in years one or two) is (1 - p)(1 - p)(p) = (1 - p)^2 (p) = (.86)(.86)(.14) = .1035. These expressions are applications of the multiplication rule of probability for independent events (see formula 5.8). Continuing this logic, the probability of a flood occurring in year k (if no flood occurred in any prior year 1, 2, 3, ..., k - 1) is (1 - p)^(k-1) (p) = (.86)^(k-1) (.14). This last expression corresponds to (k - 1) consecutive years of no flood, followed by a flood in year k.

The application of the geometric probability distribution to predict the likelihood of possible flooding in the Susquehanna River Basin over the next 10 years can be summarized both graphically and in table form. The regular decline in flood probability is shown directly in figure 5.6 and in the middle column of table 5.5. Year 1 has the greatest likelihood of the next major flood occurrence (.14), year 2 the second largest probability of the next flood, and so on. The rightmost column of table 5.5 lists the cumulative distribution function (CDF) probabilities. If past flooding frequencies continue into the future, then the next major flood has more than a 50% chance of occurring (52.96%, to be precise) sometime in the next five years. For residents on a Susquehanna River Basin floodplain, this must be viewed as an alarmingly short time period.

FIGURE 5.6
Geometric Distribution: Timing of the Next Major Flood on the Susquehanna River

TABLE 5.5
Geometric Distribution: Probabilities Associated with the Next Major Flood in the Susquehanna River Basin

Number of       Probability the next major flood       Cumulative probability that the next
trials          will occur this year                   major flood will occur by this year
(years) k       (probability mass function)            (cumulative distribution function)
                P(X = k) = (1 - p)^(k-1) p             1 - (1 - p)^k

1               (.86)^0 (.14) = .14                    1 - (.86)^1 = .14
2               (.86)^1 (.14) = .1204                  1 - (.86)^2 = .2604
3               (.86)^2 (.14) = .1035                  1 - (.86)^3 = .3639
4               (.86)^3 (.14) = .0890                  1 - (.86)^4 = .4530
5               (.86)^4 (.14) = .0766                  1 - (.86)^5 = .5296
6               (.86)^5 (.14) = .0659                  1 - (.86)^6 = .5954
7               (.86)^6 (.14) = .0566                  1 - (.86)^7 = .6521
8               (.86)^7 (.14) = .0487                  1 - (.86)^8 = .7008
9               (.86)^8 (.14) = .0419                  1 - (.86)^9 = .7427
10              (.86)^9 (.14) = .0360                  1 - (.86)^10 = .7786

Mean = 1 / p = 1 / .14 = 7.14
Variance = (1 - p) / p^2 = (1 - .14) / .14^2 = 43.877
Standard Deviation = ((1 - p) / p^2)^0.5 = ((1 - .14) / .14^2)^0.5 = 6.624

The average amount of time until the next major flood is (1 / p) years: that is, (1 / .14) = 7.14. However, over the last couple of centuries the sequencing of floods has been highly variable. No major floods occurred over a 55-year period from 1810 to 1865, and Basin residents were given another lengthy break from flooding between 1894 and 1935. Of course, major floods can also strike in rapid succession. Serious flooding events devastated the region in consecutive years (1935 and 1936), and over the last decade or so there have been three notable incidents (2004, 2006, and 2011). This irregular sequencing of floods is reflected by the rather large standard deviation of 6.624 years. How should this standard deviation value be interpreted? Recall that standard deviation measures the "typical" or "standard" difference of an observation from the mean. Since the mean of this geometric distribution is 7.14 years and the standard deviation is 6.624 years, considerable variability can be expected in the amount of time that will elapse from one flood incident to the next. This high level of uncertainty and unpredictability must be very frustrating to emergency preparedness personnel and local government officials who are constantly fighting to get funding for flood protection. In times of economic difficulties, it is much easier to argue for budgetary needs that are more predictable, such as police and fire protection or education.

5.4 THE POISSON DISTRIBUTION

Some probability problems in geography involve the study of events that occur repeatedly and randomly over either time or space. For example, the placement of calls to an emergency response dispatcher might be considered random over a short period of time. At certain spatial levels, multiple occurrences of weather-related phenomena (e.g., thunderstorms, tornadoes, and hurricanes) may occur with little spatial predictability. Geographers also study various cultural entities whose patterns may be the result of random processes. In instances where events occur repeatedly and at random (i.e., independent of past or future events), the Poisson probability distribution can be used to analyze how frequently an outcome occurs during a certain time period or across a particular area. Other geographic applications of Poisson involve the analysis of existing frequency count data to determine if a random distribution exists.

Poisson probabilities are calculated using this general formula:

P(X) = (e^(-λ) λ^X) / X!    (5.28)

where X = frequency of occurrence
      λ = mean frequency
      e = base of the natural logarithm (approximately 2.71828)
      X! = X factorial

The general form of the Poisson distribution is a function of lambda (λ), the mean frequency. Figure 5.7 shows the shape of the distribution with mean frequencies of 3 and 10.

FIGURE 5.7
The Poisson Distribution

In summary, the Poisson distribution is appropriate for geographic problems that meet the following criteria:

• The objective is to determine how frequently an outcome occurs where events occur repeatedly and at random over time or across space; that is, data are counts of events. In other instances, the objective is to analyze a set of existing frequency count data to determine if it is randomly distributed through time, or to analyze a point pattern to determine if it is randomly distributed across space.
• The time period being analyzed is divided into discrete units (e.g., years), or the study area being analyzed is divided into discrete areal subdivisions (e.g., quadrats).
• The frequency distribution of the occurrence of multiple events by discrete time unit, or the frequency distribution of a point pattern by quadrat, is estimated under the assumption that a random process is operating. Alternatively, an existing frequency distribution is analyzed to determine if it has been generated by a process operating randomly through time or across space.

As with the binomial and geometric distributions, the most effective way to demonstrate the usefulness of the Poisson distribution is by looking at practical geographic applications.

Analyzing Geographic Data through Time

This first application involves snowfall prediction over time: specifically, predicting the average seasonal frequency of separate and unique "crippling heavy snowfalls" (which we define as a snowfall event of at least 10 inches) in the very snowy city of Flagstaff, Arizona.

The city of Flagstaff is located in north central Arizona. At 6,900 feet, surrounded by ponderosa pine forest and south of mountains that reach well over 12,000 feet, Flagstaff enjoys a four-season climate. However, at this elevation, Flagstaff receives a lot of snow, with a seasonal average of 97 inches.

Both positive and negative outcomes are associated with very heavy snowfall incidents. In addition, important budget and management policy decisions must be made before the winter season even starts. Flagstaff is a winter sports area, and substantial economic activity depends on adequate snowfall. A successful season is determined by the quality and quantity of snowfall, the timing of seasonal precipitation, its amount, and the temperature. Conversely, very heavy snowfalls can be economically crippling in a number of ways. Extremely high seasonal totals are possible, as in the winter of 1973, when nearly 200 inches fell. Local public officials must consider the likelihood of crippling snow events that have the

potential to paralyze transportation systems and jeopardize safe mobility.

Obviously, planners need to budget for major snow events in advance. For example, how much annual money should be allocated to road salt purchase, and how many snow removal vehicles should be at the ready (in inventory)? Should planners stay with the current number of snow-removal vehicles (which can handle a 5-inch snowstorm adequately), or should two additional vehicles be purchased, which public service officials estimate would be sufficient to handle most 10-inch events? Local planners routinely face such snow-related decisions that require them to balance cost with level of preparedness.

If we examine the snowfall record over a recent 30-year period (1971-2000), Flagstaff experienced 54 separate crippling heavy snowfall events of at least 10 inches: an average of 1.8 events per winter. With this mean frequency value of λ = 1.8, we can calculate all Poisson probabilities.

Since both e and λ are constants, the expression e^(-λ) in equation 5.28 is also a fixed value (e^(-λ) = 2.718^(-1.8) = .16533). Using this constant e^(-λ) value, Poisson probabilities are calculated for a series of heavy snowfall frequencies (X = 0, 1, 2, 3, 4) and presented in table 5.6.

If heavy snowfalls occur randomly with a mean of 1.8 per winter, the likelihood of no such snowfall during a given winter is .16533. Extending this logic to a period of 100 winters, about 16 or 17 of those winters should be free of very heavy snowstorms. The most likely occurrence is one very heavy snowfall per winter, which should occur with a probability of .29759, or almost 30 winters out of 100 (fig. 5.8). Amazingly, there is more than a 7% chance (probability of .072315) that a future winter in Flagstaff will have 4 different crippling heavy snow events in which over 10 inches of snow will fall!

TABLE 5.6
Poisson Distribution: Expected Number of Days per Winter with Heavy Snowfall (Greater than 10 Inches), Flagstaff, Arizona

Critical Values:
  X = frequency of occurrence = 0, 1, 2, 3, ..., 6
  λ = mean frequency per winter = 1.8 (historical record shows 54 separate heavy snowfall events over 30 years)
  e = 2.718 (base of the natural logarithm)
  e^(-λ) = 2.718^(-1.8) = .16533

When X = 0:  P(0) = (2.718^(-1.8))(1.8^0) / 0! = (.16533)(1) / 1 = .16533
When X = 1:  P(1) = (2.718^(-1.8))(1.8^1) / 1! = (.16533)(1.8) / 1 = .29759
When X = 2:  P(2) = (2.718^(-1.8))(1.8^2) / 2! = (.16533)(3.24) / 2 = .26783
When X = 3:  P(3) = (2.718^(-1.8))(1.8^3) / 3! = (.16533)(5.832) / 6 = .16070
...
When X = 6:  P(6) = (2.718^(-1.8))(1.8^6) / 6! = (.16533)(34.012) / 720 = .00781

FIGURE 5.8
Poisson Distribution: Expected Number of Days per Winter with Heavy Snowfall (Greater than 10 Inches), Flagstaff, Arizona

Analyzing Geographic Data over Space

The United States Department of the Interior maintains the Federal Fire Occurrence website, which has over 650,000 wildfire records collected by the six Federal land management agencies between 1980 and 2011. Users can query, research, and download these data for statistical and other geographic analyses.

In 2012 a fire in the Gila National Forest, New Mexico, became the largest wildfire in the state's history, burning over 200,000 acres. Suppose, for example, foresters in New Mexico are concerned about the frequency of very large wildfires (greater than 20,000 acres burned). Because of the uncertain nature of human influences, land use patterns, and weather events, the number of large wildfires varies greatly from year to year. Due to the financial pressures associated with fighting very large fires, foresters may want to know the probability of experiencing more than 3 fires exceeding 20,000 acres in a given year.
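The foresters' question can be answered directly from equation 5.28 once the mean annual frequency is known. This is a Python sketch, offered only as an illustration; the mean of 1.06 fires per year is derived in the discussion that follows from the 1980-2010 record.

```python
from math import exp, factorial

# Poisson sketch (equation 5.28) for the foresters' question: with a mean
# of 1.06 very large wildfires per year, how likely are more than 3 in a year?
lam = 1.06

def poisson_prob(x: int, lam: float) -> float:
    """P(X = x) = e**(-lam) * lam**x / x!"""
    return exp(-lam) * lam**x / factorial(x)

# P(X > 3) is the complement of P(X = 0, 1, 2, or 3).
p_more_than_3 = 1 - sum(poisson_prob(x, lam) for x in range(4))
print(round(p_more_than_3, 3))  # -> 0.023
```

Under the Poisson model, a year with more than three very large wildfires should therefore occur only about twice per century.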

The current occurrence of wildfires is independent of past or future occurrences and can be considered random. Therefore, Poisson is the valid probability model for describing wildfire frequency across the State of New Mexico.

During the 30 years between 1980 and 2010, 19 years were completely free of very large wildfires, while 9 wildfires occurred in 1 of the 30 years. Since a total of 32 large wildfires occurred during the period, the mean number of large wildfires per year was 1.06.

The wildfire probability can be calculated for any frequency of occurrence. For example, since the data cover 30 years, and 19 of these years did not have a very large wildfire, the resultant probability of no large wildfires occurring in a year is 19/30 or 0.63. Thus, given this set of data, New Mexico has a 63% chance of avoiding a very large wildfire (greater than 20,000 acres burned) in any given year.

If the process producing the pattern is truly random, the probabilities of occurrence will follow the Poisson distribution. The mean frequency is the average number of large wildfires per time period or geographic area. Knowing only this mean value is sufficient to allow all Poisson probabilities to be calculated:

P(X) = (e^(-λ) λ^X) / X!    (5.28)

where X = frequency of occurrence
      λ = mean frequency
      e = base of the natural logarithm (approximately 2.71828)
      X! = X factorial

Like the binomial distribution, Poisson requires discrete or integer outcomes (X) representing the frequency of occurrence. Unlike the binomial, however, the Poisson distribution is not binary. In the wildfire example, discrete outcomes represent the different numbers (frequency counts) of wildfires occurring annually. The mean outcome or average number of wildfires per year (λ) is 1.06. Since both e and λ are constants, the expression e^λ in equation 5.28 is also a fixed value (e^λ = 2.72^1.06 = 2.88). The exponential function on a calculator or computer is needed for this calculation. Using this constant e^λ value, Poisson probabilities are calculated for a series of wildfire frequencies (X = 0, 1, 2, 3, ..., 9) and presented in table 5.7.

If large wildfires occur randomly with a mean of 1.06 per year, the likelihood of no large wildfire over 20,000 acres occurring in a given year is 0.35. In other words, extending this logic to a period of 100 years, about 35 of those years should be free of very large wildfires. The most likely occurrence is one very large wildfire per year, which should occur with a probability of .37, or about 37 years out of 100 (fig. 5.9). A geographer can then compare the expected probabilities with the observed probabilities in table 5.7 to determine whether the occurrence of large wildfires is a series of random events.

FIGURE 5.9
Poisson (Expected) Probabilities for Number of Very Large Wildfires in New Mexico per Year

TABLE 5.7
Observed and Expected (Poisson) Frequencies for Number of Very Large Wildfires in New Mexico per Year

Number of very        Observed           Total        Observed probability    Poisson          Expected
large wildfires       frequency          wildfires    of occurrence           probabilities    frequencies
per year              (years)                                                                  (years)
0                     19                 0            0.63                    0.35             10.5
1                     4                  4            0.13                    0.37             11.1
2                     2                  4            0.07                    0.19             5.7
3                     2                  6            0.07                    0.07             2.1
4                     1                  4            0.03                    0.02             0.6
5                     1                  5            0.03                    0.00             0.0
9                     1                  9            0.03                    0.00             0.0
TOTAL                 30                 32           1.00 (100%)             1.00 (100%)      30.0

Source: U.S. Geological Survey, Federal Wildland Fire Occurrence Website
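The expected columns of table 5.7 can be reproduced by multiplying each Poisson probability by the 30 years of record. This is an illustrative Python sketch; small differences from the table arise because the table works with two-decimal probabilities.

```python
from math import exp, factorial

# Reproducing the expected (Poisson) columns of table 5.7:
# lambda = 1.06 large wildfires per year over a 30-year record.
lam, years = 1.06, 30

def poisson_prob(x: int, lam: float) -> float:
    """P(X = x) = e**(-lam) * lam**x / x! (equation 5.28)."""
    return exp(-lam) * lam**x / factorial(x)

for x in range(5):
    pr = poisson_prob(x, lam)
    # frequency count, Poisson probability, expected years out of 30
    print(x, round(pr, 2), round(pr * years, 1))
```

For X = 0 and X = 1 this reproduces the 0.35 and 0.37 probabilities in table 5.7; the expected counts differ slightly from the table's 10.5 and 11.1 only because the table multiplies the rounded probabilities by 30.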



In a spatial application of Poisson, the geographic region under study is divided into spatial areas, usually a series of regular-sized square cells known as quadrats. The number of occurrences of an item or phenomenon being studied is recorded for each of the quadrats covering the study area. Given the mean number of occurrences per quadrat or cell across the region, the Poisson distribution shows the probability of a quadrat containing a certain frequency of occurrences.

We can now ask the question: is this set of very large wildfires across New Mexico randomly distributed? To calculate the expected Poisson pattern of large wildfire occurrences that would be found under an assumption of randomness, a set of quadrats is placed over the study area (fig. 5.10).

Because the southern boundary of New Mexico is somewhat irregular, some of the quadrats near the border would lie outside the study area. Therefore, the quadrats were placed in a manner to maximize the number of quadrats that would fit in the state and also cover the greatest number of points. Of the 326 wildfires, only 10 are not included in a quadrat.

Which cells should be included in the analysis? A procedural method must be selected that is as objective, consistent, and unbiased as possible. One logical way to proceed would be to include only those cells with more

FIGURE 5.10
Spatial Pattern of Very Large Wildfires (over 20,000 Acres) in New Mexico, 1980 - 2009 (two panels: the wildfire pattern without and with quadrats superimposed)
Source: U.S. Geological Survey, Federal Wildland Fire Occurrence Website

TABLE 5.8
Observed Frequency of Very Large Wildfires per Cell in New Mexico

Number of very large    Observed       Total     Observed probabilities
wildfires per cell      frequencies    points    of wildfires per cell
0                       50             0         0.410
1                       23             23        0.189
2                       14             28        0.115
3                       7              21        0.057
4                       7              28        0.057
5                       2              10        0.016
6                       3              18        0.025
7                       2              14        0.016
8                       2              16        0.016
9                       3              27        0.025
10                      1              10        0.008
11                      1              11        0.008
13                      2              26        0.016
14                      1              14        0.008
16                      2              32        0.016
19                      2              38        0.016
TOTAL                   122            316       1.00 (100%)
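The totals and observed probabilities in table 5.8 can be checked with a short script (a Python sketch, included here only for illustration).

```python
# Checking the totals and observed probabilities in table 5.8.
# Keys: very large wildfires per cell; values: number of cells.
observed = {0: 50, 1: 23, 2: 14, 3: 7, 4: 7, 5: 2, 6: 3, 7: 2,
            8: 2, 9: 3, 10: 1, 11: 1, 13: 2, 14: 1, 16: 2, 19: 2}

cells = sum(observed.values())                # number of quadrats analyzed
points = sum(k * n for k, n in observed.items())  # total wildfires counted
print(cells, points)                  # -> 122 316
print(round(observed[0] / cells, 3))  # observed probability of an empty cell -> 0.41
print(round(points / cells, 2))       # mean wildfires per cell -> 2.59
```

The mean of 2.59 wildfires per cell is the λ used for the expected (Poisson) frequencies in table 5.9.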

than half of their area in New Mexico. If Poisson probabilities for the wildfire pattern are determined in this fashion, 122 quadrats (containing 316 of the 326 large wildfire locations on the map) are analyzed, representing almost 97% of the large wildfires that occurred in New Mexico over the 30-year period.

The frequency of wildfires is determined for each cell by counting the number of points inside each quadrat. This is easily accomplished using the overlay operations in a GIS. The observed frequencies for wildfire occurrence by cell range from a low of 0 to a high of 19 (table 5.8). In those cells experiencing a very large wildfire, the most frequent occurrence or mode is one point per quadrat, found in 23 cells. The average cell frequency (λ), the total number of wildfires (316) divided by the number of cells (122), is 316/122 or 2.59 (table 5.9). Thus, for this set of quadrats, an average cell contains 2.59 points (wildfires).

The probability of very large wildfire occurrence under an assumption of randomness is determined from the Poisson equation (equation 5.28) using the mean cell frequency. The calculations for zero, one, and two wildfires per cell are shown in table 5.9, and figure 5.11 shows the Poisson probabilities graphically. To compute the expected cell frequencies, each Poisson probability is multiplied by the total number of cells. For example, since the Poisson probability of a cell having 4 wildfires is .14, 14% of the 122 cells or 17 cells should have 4 wildfires (table 5.10). The largest frequency expected for a random pattern of wildfires is 2 points per cell, which should occur in 30.69 cells of the study area. This maximum expected value is consistent with the mean cell frequency of 2.59 points per quadrat.

TABLE 5.9
Critical Values and Poisson Probabilities of Expected Wildfire Frequency per Cell for Three Outcomes

Critical Values:
X = 0, 1, 2, 3, 4, ..., 17, 18
N = 122 cells
f = total frequency = 316 wildfires
λ = mean frequency per cell = f/N = 316/122 = 2.59
e = exponential (approximately 2.72)

e^λ = 2.72^2.59 = 13.35

P(0) = 2.59⁰ / (13.35(0!)) = 1 / 13.35 = .07

P(1) = 2.59¹ / (13.35(1!)) = 2.59 / 13.35 = .19

P(2) = 2.59² / (13.35(2!)) = 6.7081 / 26.7 = .25

The observed frequencies do not appear to match the calculated Poisson frequencies. Major discrepancies occur between the observations. Notably, under a random distribution we would expect 9 cells to have zero fires, and 30 cells to have two fires. These expected frequencies are very different from the 50 observed frequencies with zero fires and the 14 observed frequencies with two fires. Inferential statistics can be used to determine whether the expected and observed frequencies are significantly different. This procedure is discussed in section 13.1, where the focus is inferential statistics for point patterns.

[Figure: bar chart of the Poisson probabilities P(X), from 0.00 to 0.30, for the observed numbers of wildfires per cell (X = 0 through 19)]

FIGURE 5.11
Poisson (Expected) Probabilities for Number of Very Large Wildfires per Cell

Chapter 5 ▲ Basics of Probability and Discrete Probability Distributions 91

TABLE 5.10
Observed and Expected Frequencies (Poisson) for Wildfires per Cell

Number of very large    Observed       Expected probabilities    Expected
wildfires per cell      frequencies    of wildfires per cell     frequencies
 0                       50            0.075007743                 9.150945
 1                       23            0.19428235                 23.70245
 2                       14            0.251611568                30.69661
 3                        7            0.217238403               26.50309
 4                        7            0.140670769               17.16183
 5                        2            0.072872071                 8.890393
 6                        3            0.031458435                 3.837929
 7                        2            0.011640358                 1.420124
 8                        2            0.003768804                 0.459794
 9                        3            0.001084647                 0.132327
10                        1            0.000280941                 0.034275
11                        1            6.61531E-05                 0.008071
13                        2            2.84499E-06                 0.000347
14                        1            5.26356E-07                 6.42E-05
16                        2            1.47137E-08                 1.8E-06
19                        2            4.39774E-11                 5.37E-09

TOTAL                   122            0.999986                  121.998251

Even without using inferential statistics, table 5.10 seems to show that the observed and expected frequencies are quite different. Therefore, the distribution of large wildfires throughout New Mexico is not random. This finding makes intuitive sense, since some areas of the state are devoid of forests and would have no chance of having a forest fire. Also, different management practices might prohibit human interaction with some forests, while encouraging the use of other forests for activities such as camping. If human activity contributes to wildfire frequency, then excluding people from certain areas will have an influence on the locations of large wildfires.

The geographer must make several methodological decisions when using quadrats to examine spatial point patterns. In addition to deciding how to handle quadrats partially outside the study area, researchers must consider the important issue of quadrat size. How would the Poisson probabilities have differed if more quadrats of smaller size had been placed over the pattern of wildfires? How does a decision to include quadrats only partially inside the study area affect results?

KEY TERMS
binomial distribution, 87
complement, 85
cumulative distribution function, 91
deterministic processes, 83
discrete and continuous probability distributions, 87
event and outcome, 84
factorial, 88
geometric distribution, 89
Monte Carlo simulation, 84
mutually exclusive, 84
Poisson distribution, 91
probabilistic processes, 84
randomness and random processes, 84
relative frequency, 84
rules of probability, 85
statistical independence, 85
stochastic processes, 84

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Understand the general nature of deterministic and probabilistic processes in geography.
2. Recognize the distinction between stochastic and random probabilistic processes.
3. Explain the concept of randomness and recognize the characteristics of a random experiment.
4. Explain the probability concept of relative frequency and recognize applications in geography.
5. Understand the basic rules of probability.
6. Understand the characteristics of the binomial, geometric, and Poisson probability distributions and identify potential applications of each in geography.

REFERENCES AND ADDITIONAL READING

Bulmer, M. G. Principles of Statistics. Edinburgh: Oliver and Boyd, 1967.
Evans, M. J. and J. S. Rosenthal. Probability and Statistics: The Science of Uncertainty. 2nd ed. New York: W. H. Freeman and Co., 2009.

Hereford, R. Climate Variation at Flagstaff, Arizona, 1950-2007. U.S. Geological Survey Open-File Report 2007-1410, 17 pp., 2007.
National Climatic Data Center. United States Climate Normals, 1971-2000. Contains Flagstaff, AZ annual snowfall data.
Rogerson, P. A. Statistical Methods for Geography: A Student's Guide. 3rd ed. Thousand Oaks, CA: Sage Publications, Inc., 2010.
Susquehanna Flood Forecast and Warning System. www.susquehannafloodforecasting.org. Search "flood-history" for list of past major floods.
U.S. Department of Homeland Security. Yearbook of Immigration Statistics: 2010. Washington, D.C.: U.S. Department of Homeland Security, Office of Immigration Statistics, 2011. Data source for immigration data in table 5.1.
U.S. Geological Survey. Federal Fire Occurrence website. http://wildfire.cr.usgs.gov/firehistory/index.html
Unwin, D. Introductory Spatial Analysis. London: Methuen, 1981.
PART V
INFERENTIAL SPATIAL STATISTICS

General Issues in Inferential Spatial Statistics

13.1 Types of Spatial Pattern
13.2 The Concept of Autocorrelation
13.3 Who Is My Neighbor?

Geographers often examine spatial patterns on the earth's surface that are produced by physical or cultural processes. These patterns represent the spatial distribution of a variable across a study area. Sometimes geographic variables are displayed as point patterns with dot maps. In chapter 4, descriptive spatial statistics (or geostatistics), such as the mean center, standard distance, and standard deviational ellipse were introduced to summarize point patterns. In other instances, explicit spatial patterns representing data summarized for a series of subareas within a larger study region can be displayed effectively using choropleth maps.

Whether data are presented as points or areas, geographers often want to describe and explain an existing pattern. In this chapter, we focus on several general issues associated with inferential spatial statistics. A recurring theme is the analysis of a sample point pattern or sample area pattern to determine if the overall or "global" arrangement is random or nonrandom. When testing a point or area pattern for randomness, the question is whether the population pattern from the sample was generated by a spatially random process. A secondary theme has to do with "local" inferential spatial statistics that identify and map clusters ("hot spots" and "cold spots") of points or areas within a larger area that have a concentration of very high (or very low) data values.

13.1 TYPES OF SPATIAL PATTERN

Geographers often want to compare an existing spatial pattern to a particular theoretical pattern. Spatial patterns may appear clustered, dispersed, or random (fig. 13.1). In case 1, both the point and area patterns have a clustered appearance. On the point pattern map, the density of points appears to vary significantly from one part of the study area to another, with many points concentrated in the northwest

[Figure: paired point patterns and area patterns for Case 1 (clustered), Case 2 (dispersed), and Case 3 (random)]

FIGURE 13.1
Types of Point and Area Patterns
206 Part V ▲ Inferential Spatial Statistics

portion of the area. Perhaps the points represent sites of tertiary economic activity (retail and service functions), which often cluster around a location with high accessibility and high profit potential, such as a highway interchange. On the clustered area pattern map, the shaded subareas could represent political precincts where a majority of registered voters are Democrat, with Republican majority precincts not shaded. Such a clustered pattern would likely occur if there is a distinctly nonrandom spatial distribution of voters by income, race, or ethnicity in the region.

Other spatial patterns seem evenly dispersed or regular. The set of points in case 2 (fig. 13.1) appears uniformly distributed across the study area, perhaps suggesting that a systematic spatial process produced the locational pattern. One hypothesis in classical central place theory, for example, is that settlements are uniformly distributed across the landscape to best serve the needs of a dispersed rural population. The area pattern in case 2 exhibits a regular or alternating type of spatial arrangement. This pattern could represent county populations in the same region where the central place distribution of settlements is hypothesized. Shaded counties could have above average populations while counties not shaded are below average.

The spatial patterns in case 3 (fig. 13.1) appear random in nature, with no dominant trend toward either clustering or dispersion. A random point or area pattern logically suggests that a spatially random (Poisson) process is operating to produce the pattern. In each of these examples we use the term appear, based on our subjective opinion of the distribution of the data. However, the New Mexico wildfire distribution example from chapter 5 illustrates a process that can be statistically evaluated to determine if the wildfire pattern is random or clustered.

In most geographic problems, a point or area pattern will not provide a totally clear indication that the pattern is clustered, dispersed, or random. Rather, many real-world patterns show a combination of these arrangements with tendencies from "purely" random toward either clustered or dispersed.

13.2 THE CONCEPT OF AUTOCORRELATION

In 1970, geographer Waldo Tobler stated that "Everything is related to everything else, but near things are more related than distant things." This quote became known as Tobler's Law, but, more importantly, it illustrates the phenomenon of spatial autocorrelation. Spatial autocorrelation simply measures the degree to which a geographic variable is correlated with itself through space. Like traditional correlation discussed in chapter 16, spatial correlation is measured as either being positive, negative, or non-existent.

Positive spatial autocorrelation means that geographic objects that are near one another tend to be similar. That is, high value features tend to be located near other high value features, medium value features near other medium value features, and so on. Negative spatial autocorrelation means that geographic objects near one another tend to have sharply contrasting values (a high value located adjacent to a low value, for example). Most geographic phenomena have some measure of positive spatial autocorrelation. Nearby cities tend to get similar amounts of rainfall; expensive homes tend to be located near other expensive homes.

One can visualize spatial autocorrelation through use of a variogram. A variogram is a type of scatterplot that displays the differences in values between geographic locations against the differences in the distance between the geographic locations. The Y axis records the average variance in values among a set of geographic objects, but for reasons beyond our discussion, the actual measure used is half the variance (termed semi-variance). The X axis shows the average distance among the objects, which is termed the range. For simplicity, the distances are aggregated into more easily understood groups like 100 miles, 200 miles, etc. By plotting a variogram you can determine the average difference in values among locations that are 200 miles apart, 500 miles apart, and any other logical distance that you would like. A generalized example of a variogram is shown in figure 13.2.

[Figure: generalized variogram curve plotting semi-variance against distance between geographic locations, with the nugget labeled on the Y axis]

FIGURE 13.2
An Example of a Variogram

In this example, the Y axis represents the average difference in values between geographic locations, while the X axis represents the difference in distances between geographic objects. You can see that geographic locations near one another tend to have smaller differences, while geographic locations with larger differences are generally further apart. Rather than showing the individual points, a curved line is used to represent the nature of the relationship. While beyond the scope of our discussion, in advanced geostatistics the mathematical properties of the best-fitting curve are used to estimate values at unknown locations. The first part of the variogram (left portion of the X axis) illustrates the change in average differences in data values as the distances become greater, while the flat portion of the variogram (right portion of the X axis) shows there is no change in the average differences in data values when the locations are very far apart. In other words, the flat portion of the graph suggests no relationship between the average difference in values for pairs of geographic objects and their distance from one another, while the earlier portion of the graph shows that the values become less similar as the distances among the locations increase.

Chapter 13 ▲ General Issues in Inferential Spatial Statistics 207

The range represents the distance at which the differences in values of geographic locations are no longer correlated, while the sill represents the average difference in value where there is no relationship between location and value. Typically, if two objects are in the same location, they should have the same value and thus no difference. However, whenever samples are taken, some degree of uncertainty always occurs due to measurement error, the inherent properties of the measuring device, or some other reason. The nugget represents the degree of uncertainty when measuring values for geographic locations that are very close to one another.

For our purposes, we do not address the mathematical derivation of the variogram (you can learn more about variograms in some of the more advanced readings listed at the end of this chapter). Rather than going into statistical detail, we illustrate the variogram concept with a short example. In chapter 1, an example shows the Last Spring Frost (LSF) date for selected weather stations in the Southeast United States. Naturally, we would expect the LSF date for two nearby weather stations to be similar, while stations further separated from one another would have less similarity.

This relationship is illustrated perfectly by the set of observations and curve that "best fits" through those observations (fig. 13.3). Between 0 and 400 miles it appears that if the distance separating a pair of weather stations is large, then their dates of last spring frost are likely to be quite different as well. Furthermore, this relationship between distance and differences in LSF date appears linear. Notice, however, what happens with pairs of weather stations located more than 400 miles apart: the scatterplot of points disperses around the best fit curve and appears random. This means there is no relationship between the LSF dates of weather stations when they are more than 400 miles apart. Therefore, the range is said to represent the distance at which the LSF dates are no longer spatially autocorrelated; in this example, about 400 miles.

[Figure: scatterplot variogram of LSF date differences against distance in miles (0 to 800), with the best-fit curve, the nugget, and a range of about 400 miles labeled]

FIGURE 13.3
Variogram of Last Spring Frost (LSF) Date

The concept of spatial autocorrelation is critical in analyzing geographic phenomena for a number of reasons. First, by just pushing a button, GIS software allows users to analyze the relationship between points no matter how far apart they are. However, in our example we have concluded that there is no relationship in LSF values between weather station pairs that are more than 400 miles apart. Therefore, using a GIS or statistical process that assumes a relationship among the data may have limited use if the points are more than 400 miles apart. In short, to make intelligent observations about spatial data, we must understand the spatial structure of the data.

Second, and more importantly, the presence of spatial autocorrelation presents a real challenge when performing statistical analysis. Most inferential statistics assume that observations are independent of one another. Unfortunately, we have just demonstrated that LSF dates are actually spatially correlated (and thus dependent) with one another for weather stations less than 400 miles apart. Why is this important? Simply stated, confidence intervals calculated from spatially autocorrelated data will likely be different than confidence intervals calculated from data that are not spatially autocorrelated. Recall from chapter 8 that sample size is an important component in the computation of standard error; the greater the sample size, the better our estimates. In the case of spatially autocorrelated data, imagine a geographer sampling 30 objects, with five of those objects very close to one another. Essentially, selecting five nearby observations (assuming spatial autocorrelation within that distance) is almost like selecting the same observation five times. While you may believe you have 30 independent observations, in reality (since five of them are spatially autocorrelated), it is probably more practical to recognize that you essentially have only 25 independent observations. Therefore, the value of n in the confidence interval

equation is too high (you will be erroneously making n equal to 30 rather than 25), thus giving you a smaller standard error than warranted. Also, if the five spatially autocorrelated values are similar, then the standard deviation for the entire dataset will also most likely be lower, yielding an even lower standard error. Therefore, you must take extreme care when interpreting results for inferential analysis when the data exhibit spatial autocorrelation.

A final characteristic of autocorrelation worth mentioning here is that it can be measured either globally or locally. For instance, a social geographer may be interested in determining if the distributions of various racial and ethnic groups in a city have different levels of spatial autocorrelation. Perhaps it is hypothesized that one racial or ethnic group has a higher positive autocorrelation across the entire area of the city than another. That is, spatial autocorrelation is being used as a comparative global measure of segregation or integration. It is "global" in the sense that a single spatial autocorrelation value is derived for each group across the entire study area, telling us the degree to which the overall spatial pattern of each group is clustered, random, or dispersed. In chapter 15, we apply a measure of global autocorrelation to several racial and ethnic groups in Cleveland, Ohio.

By contrast, local autocorrelation methods compare each geographic object with its surrounding neighbors, thereby measuring whether the values in the immediate neighborhood are clustered, random, or dispersed. Many global methods for measuring spatial autocorrelation have been modified to provide local measures. Also in chapter 15, we evaluate the spatial pattern of obesity levels for Pennsylvania counties and identify particular counties having significant local spatial autocorrelation.

For the last half century, geographers and statisticians have continued to study methods to identify and account for the amount of spatial autocorrelation among variables. It is not unlike correlation over time. For instance, during a long recession, a company's sales for one year may be very similar to the sales in a previous year. Economists apply advanced statistical formulas to account for this correlation over time. Geographers are now applying similar principles to account for correlation over space.

13.3 WHO IS MY NEIGHBOR?

In the famous parable of the Good Samaritan, a man asked "who is my neighbor?" Geographers studying spatial autocorrelation have recently been asking the same question. Geographers have different ways of defining the term "neighbor." One approach is to define neighbor in a binary way; that is, two geographic objects are either neighbors or they are not. One acceptable criterion for this binary relationship can be represented as a form of geographic association called adjacency. For example, Pennsylvania is adjacent to New York because they share a common boundary. However, neither state is adjacent to California. Another approach is to define "neighbor" using a distance threshold (or cut-off distance) measure. For example, if we establish a distance threshold of 300 miles, New York City and Boston would be considered neighbors, while Boston and Los Angeles would not.

"Neighbor" may also be defined as a continuous variable, by measuring the strength of "neighborliness" between two objects as a function of the distance separating them. This logic is comparable to spatial interaction models. For example, a simple inverse-distance weight (1/distance) could establish a measure of proximity. Using this model, the inverse distance weight of New York City and Boston is 1/189 miles, or .005, while the inverse distance weight of Boston and Los Angeles is 1/2588 miles, or .0004. In this case, the "neighborliness" or interaction measure is 12 times stronger for New York City and Boston than it is for Boston and Los Angeles. Yet another measure of neighborly interaction is inverse-distance squared (1/distance²).

However neighbors are defined, geographers typically represent them in the form of a weight, where wᵢⱼ is the weight between geographic objects i and j. The weight can be either 0 or 1 for a binary representation or a ratio such as 1/distance for a continuous function.

KEY TERMS
autocorrelation (spatial autocorrelation), 206
clustered, random and dispersed patterns, 205, 206
global and local measures of autocorrelation, 208
neighbor definitions: adjacency, distance threshold, inverse-distance, inverse-distance squared, 208
range, sill, and nugget in a variogram, 207
variogram, 206

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Understand the concepts of clustering, dispersion, and randomness in point and area patterns.
2. Recognize situations in which point and area patterns appear clustered, dispersed, or random.
3. Explain why it is important to measure spatial autocorrelation.
4. Explain the basic purpose of a variogram and understand the terminology associated with a variogram.
5. Identify some ways in which the term "neighbor" is defined in spatial statistics.
6. Explain the basic difference between "global" and "local" measures of spatial association.
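The semi-variance plotted in a variogram (section 13.2, figure 13.2) can be estimated from sample data by averaging half the squared difference in values over all location pairs falling in each distance bin. The sketch below is illustrative only; the station coordinates and values are invented for the example, and straight-line distance is assumed.

```python
from itertools import combinations
from math import hypot
from collections import defaultdict

# Hypothetical stations: (x, y, value), e.g., a last spring frost measure
stations = [(0, 0, 10.0), (1, 0, 11.0), (0, 2, 13.0),
            (3, 1, 15.0), (4, 4, 20.0), (6, 5, 22.0)]

def semivariogram(points, bin_width):
    """Average semi-variance, 0.5*(zi - zj)^2, for each distance bin."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (x1, y1, z1), (x2, y2, z2) in combinations(points, 2):
        d = hypot(x2 - x1, y2 - y1)
        b = int(d // bin_width)          # index of the distance bin
        sums[b] += 0.5 * (z1 - z2) ** 2
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

gamma = semivariogram(stations, bin_width=2.0)
# Semi-variance generally grows with distance until it levels off at the sill
```

Plotting the returned bin averages against the bin distances gives the empirical curve that figure 13.2 generalizes.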
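The neighbor definitions of section 13.3 (binary distance-threshold neighbors and continuous inverse-distance weights) can be sketched directly; the city-pair distances below are the ones quoted in the text, and the function names are illustrative, not from any particular GIS package.

```python
# Straight-line distances in miles between city pairs (from the text)
distances = {("New York City", "Boston"): 189,
             ("Boston", "Los Angeles"): 2588}

def threshold_weight(pair, cutoff=300):
    """Binary weight: 1 if the pair lie within the cutoff distance, else 0."""
    return 1 if distances[pair] <= cutoff else 0

def inverse_distance_weight(pair, power=1):
    """Continuous weight: 1/distance (power=1) or 1/distance^2 (power=2)."""
    return 1 / distances[pair] ** power

w_nyb = inverse_distance_weight(("New York City", "Boston"))   # about .005
w_bla = inverse_distance_weight(("Boston", "Los Angeles"))     # about .0004
```

With a 300-mile threshold, New York City and Boston are neighbors (weight 1) while Boston and Los Angeles are not (weight 0), matching the discussion above.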

REFERENCES AND ADDITIONAL READING

Included below are a number of references that refer to geostatistics (spatial statistics) in general. Many of these sources also cover the point pattern and area pattern techniques discussed in chapters 14 and 15.

Anselin, Luc. "Local Indicators of Spatial Association-LISA." Geographical Analysis, Vol. 27, No. 2 (1995): 93-115.
Bailey, T. C. and A. C. Gatrell. Interactive Spatial Data Analysis. London: Longman, 1995.
Barnes, R. Variogram Tutorial. Golden Software Inc. http://www.goldensoftware.com/variogramTutorial.pdf
Cliff, A. and J. Ord. Spatial Processes: Models and Applications. London: Pion, 1981.
Ebdon, D. Statistics in Geography: A Practical Approach. Oxford: Basil Blackwell, 1985.
Goodchild, M. Spatial Autocorrelation, CATMOG No. 47. Norwich, England: Geo Books, 1988.
Griffith, D. A. Spatial Autocorrelation: A Primer. Washington, DC: Association of American Geographers, 1988.
Griffith, D. and C. Amrhein. Statistical Analysis for Geographers. Englewood Cliffs, NJ: Prentice Hall, 1991.
Isaaks, E. H. and R. M. Srivastava. An Introduction to Applied Geostatistics. Oxford: Oxford University Press, 1989.
Lee, J. and D. W. S. Wong. Statistical Analysis with ArcView GIS. New York: Wiley, 2001.
Mitchell, A. The ESRI Guide to GIS Analysis. Vol. 2: Spatial Measurements and Statistics. Redlands, CA: ESRI Press, 2005.
Odland, J. Spatial Autocorrelation. Vol. 9, Scientific Geography Series. Beverly Hills, CA: Sage, 1988.
Ripley, B. Spatial Statistics. New York: Wiley, 1981.
Point Pattern Analysis

14.1 Nearest Neighbor Analysis
14.2 Quadrat Analysis

This chapter presents methods for analyzing point analyze the points. The study would determine whether
patterns that are distributed across an entire study region the trees were randomly distributed throughout the study
(that is, they are global methods of analysis). Such spatial area or clustered in certain portions of the forest. The
patterns are characterized by observations appropriately result of this analysis would provide some guidance con-
displayed as a series of points or dots on a map and are cerning the most appropriate type of treatment (e.g.,
very common in geographic studies. For example, urban widespread aerial spraying versus concentrated treat-
geographers plot the location of settlements as points on ment from the ground).
a map. Economic geographers study the spatial pattern For many geographic studies that begin with analysis
of retail activities by mapping store locations as dots on of a variable having locations on a dot map, a primary
the map. Physical geographers show glacial features like objective is to determine the form of the pattern of
drumlins as a series of points, while crime analysts may points. The nature of the point pattern might reveal infor-
plot the location of burglaries as a series of dots. mation about the underlying processes that produced the
In some problems, the spacing of individual points distnbution. In addition, a series of point patterns of the
within the overall point pattern is important, in which same variable recorded at different times can help deter-
case a statistical procedure known as "nearest neighbor mine temporal changes in the locational process. In
analysis" is applied. For example, an urban geographer short, point pattern analysis offers quantitative tools for
might analyze the existing configuration of fire stations examining a spatial arrangement of point locations on
in a city to determine whether the pattern is random or the landscape as represented by a conventional dot map.
more dispersed than random. Suppose one of the goals
in the provision of fire service is to provide an equitable
or dispersed distribution of fire stations throughout the 14.1 NEARESTNEIGHBOR
ANALYSIS
city so that no one lives or works extremely far from the Nearest neighbor analysis (NNA) is a widely used
nearest fire station. The geographer might suggest the sit- procedure to determine the spatial arrangement of a pat-
ing of new stations or relocation of existing facilities to tern of points within a study area. The distance of each
meet this goal. The proposed new configuration of fire point to its "nearest neighbor" is measured and the
stations could then be analyzed to determine if it were mean nearest neighbor distance for all points is deter-
more dispersed than the existing pattern. mined. The spacing within a point pattern can be ana-
In other instances, the nature of the overallpoint pattern lyzed by comparing this observed mean distance to
is important, and a statistical test known as "qua drat some expected mean distance, such as that for a random
analysis" is often appropriate. For example, a biogeogra- (Poisson) distribution.
pher might select a random sample of trees in a national The nearest neighbor technique was originally devel-
forest to determine which trees are diseased, plot the oped by biologists interested in studying the spacing of
location of each diseased tree as a point on a map, and plant species within a region. They measured the dis-

210
Chapter 14 • Point Pattern Analysis 211

tance separating each plant from its nearest neighbor of the same species and determined whether this arrangement was organized in some non-random manner or was the result of a random process. Geographers have applied the technique in numerous research problems, including the study of settlements in central place theory, economic functions within an urban region, and the distribution of earthquake epicenters in an active seismic region. In all applications, the objective of nearest neighbor analysis is to describe the pattern of points within a study area and make inferences about the underlying process.

The nearest neighbor methodology is illustrated using the example of seven points shown in figure 14.1. A coordinate system is created and the horizontal (X) and vertical (Y) positions of the points recorded (table 14.1). For each of the points, the nearest neighbor (NN) is determined as the point closest in straight-line (Euclidean) distance. The distances to each nearest neighbor (NND) are then calculated from the coordinates or measured from the map. From the set of nearest neighbor distances, the mean nearest neighbor distance (NND) is determined using the basic formula for the mean:

    NND = ΣNND / n    (14.1)

where n = number of points

The mean nearest neighbor distance of the seven-point pattern is 2.67 distance units. This mean nearest neighbor distance provides an index of spacing for the set of points. However, the usefulness of this descriptive index comes from comparing the index value for an observed pattern to the results produced from certain distinct or theoretical point distributions. This objective is analogous to the situation discussed in section 12.1, where an observed frequency distribution is compared to a perfectly normal frequency distribution.

In NNA, mean nearest neighbor distances can be calculated for three distinct theoretical point arrangements: perfectly random, perfectly dispersed, and perfectly clustered. In each case, the spacing index is determined for an area containing a certain density of points. For example, if points are arranged in a random spatial pattern, the mean nearest neighbor distance (NND_R) would be determined as follows:

    NND_R = 1 / (2√Density)    (14.2)

where NND_R = mean nearest neighbor distance in a random pattern
      Density = number of points (n) / Area

Since the study area displayed in figure 14.1 has a dimension of 10 by 10 units, the area represented is 100 square units. The corresponding density of points is therefore 7/100 or .07. For an area with this point density, a random arrangement of points within the study area should produce mean nearest neighbor distance:

    NND_R = 1 / (2√.07) = 1.89

TABLE 14.1
Coordinates and Nearest Neighbor Information for Example*

Point     X     Y    NN    NND
A        1.3   0.9   E    3.94
B        3.2   4.4   C    2.00
C        3.3   6.4   B    2.00
D        5.6   3.8   E    1.36
E        4.8   2.7   D    1.36
F        8.1   7.4   G    4.21
G        9.4   3.4   D    3.82
SUM                      18.69

where NN = nearest neighbor
      NND = nearest neighbor distance

    NND = ΣNND / n = 18.69 / 7 = 2.67

* See figure 14.1 for graph of point locations.

FIGURE 14.1
Location of Points for Nearest Neighbor Problem
[scatterplot of points A through G on a 10 by 10 coordinate grid]
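The nearest neighbor and distance columns of table 14.1 can be reproduced with a short script. The sketch below (Python; the variable names are ours) finds each point's nearest neighbor by brute-force Euclidean comparison, using the coordinates from the table:

```python
import math

# Point coordinates from table 14.1
points = {"A": (1.3, 0.9), "B": (3.2, 4.4), "C": (3.3, 6.4),
          "D": (5.6, 3.8), "E": (4.8, 2.7), "F": (8.1, 7.4),
          "G": (9.4, 3.4)}

# For each point, find the nearest neighbor (NN) and its distance (NND)
table = {}
for label, xy in points.items():
    nn, nnd = min(((other, math.dist(xy, pt))
                   for other, pt in points.items() if other != label),
                  key=lambda pair: pair[1])
    table[label] = (nn, round(nnd, 2))

# Mean nearest neighbor distance (equation 14.1)
mean_nnd = sum(d for _, d in table.values()) / len(points)
print(table["A"])            # ('E', 3.94)
print(round(mean_nnd, 2))    # 2.67
```

With only seven points the all-pairs search is instantaneous; for large point sets a spatial index (such as a k-d tree) would normally replace the brute-force loop.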
212 Part V.A. Inferential Spatial Statistics

If the arrangement of points shows maximum dispersion or a perfectly uniform pattern, the mean nearest neighbor distance (NND_D) would be determined as:

    NND_D = 1.07453 / √Density    (14.3)

Since this pattern represents the most dispersed or separated arrangement of points, it also serves as the maximum value for the nearest neighbor index. The mean nearest neighbor distance for a dispersed pattern whose point density matches that in figure 14.1 is:

    NND_D = 1.07453 / √.07 = 4.06

The pattern of spacing most distinct from dispersion or uniformity is "clustering." When all points lie at the same position, the pattern shows maximum clustering of points, and each nearest neighbor distance is zero. Therefore, in a perfectly clustered pattern, the mean nearest neighbor distance (NND_C) is also zero, representing the lowest possible value for the index:

    NND_C = 0    (14.4)

FIGURE 14.2
Continuum of R Values in Nearest Neighbor Analysis
[R = 2.149: perfectly dispersed; R = 1.5: more dispersed than random; R = 1.0: random; R = 0.5: more clustered than random; R = 0.0: perfectly clustered, with all points at the same location]
Since perfectly clustered and perfectly dispersed patterns provide the extreme spacing arrangements for a set of points, the nearest neighbor index offers a useful method to measure the spacing of locations within an observed point pattern. However, the mean nearest neighbor distance is an absolute (as opposed to relative) index, dependent upon the units used to measure distance. Therefore, direct comparison of results from different problems or different regions is difficult. Although the minimum value of the nearest neighbor index is always zero (a perfectly clustered pattern), the maximum value corresponding to a perfectly dispersed pattern is not constant, but a function of the point density. To overcome this complication, a standardized nearest neighbor index (R) is often used. This index is found by dividing the mean nearest neighbor distance (NND) by the corresponding value for a random distribution with the same point density:

    R = NND / NND_R    (14.5)

With the standardized index, a perfectly clustered pattern produces an R value of 0.0, a perfectly random distribution 1.0, and a perfectly dispersed arrangement generates the maximum R value of 2.149 (fig. 14.2). Thus, an actual point pattern can be measured for relative spacing along a continuous scale from perfect clustering to perfect dispersal. For the set of points from figure 14.1, the standardized nearest neighbor index is:

    R = 2.67 / 1.89 = 1.41

Nearest Neighbor Analysis

Primary Objective: Determine whether a random (Poisson) process has generated a point pattern

Requirements and Assumptions:
1. Random sample of points from a population
2. Sample points are independently selected

Hypotheses:
H0: NND = NND_R (point pattern is random)
HA: NND ≠ NND_R (point pattern is not random)
HA: NND > NND_R (point pattern is more dispersed than random)
HA: NND < NND_R (point pattern is more clustered than random)

Test Statistic:
    Zn = (NND − NND_R) / σ_NND
Chapter 14 • Point Pattern Analysis 213

This spacing pattern is moderately dispersed, since it lies between a perfectly dispersed distribution and a random one.

In addition to its use as a descriptive index of point spacing, the nearest neighbor methodology can also be used to infer results from a sample of points to the population from which the sample was drawn. A difference test can be used to determine if the observed nearest neighbor index (NND) differs significantly from the theoretical norm (NND_R), which would occur if the points were randomly arranged. The expectation is that a Poisson process operates over space to produce a random pattern of points. The null hypothesis is that no difference exists between observed and random nearest neighbor values. The test statistic (Zn) follows a format similar to other difference tests discussed in chapters 9 through 12:

    Zn = (NND − NND_R) / σ_NND    (14.6)

where σ_NND = standard error of the mean nearest neighbor distances

The standard error for the nearest neighbor test can be estimated with the following formula:

    σ_NND = 0.26136 / √(n(Density))    (14.7)

where n = number of points

Either a directional (one-tailed) or non-directional (two-tailed) approach can be applied, depending on the form of the alternate hypothesis (HA). If the problem suggests a clear rationale for the actual point pattern being either dispersed or clustered (as opposed to the null hypothesis of randomness), a one-tailed approach is warranted. However, without an underlying reason or theoretical expectation that the null hypothesis should be rejected on one side as opposed to the other, a two-tailed non-directional approach is best.
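Equations 14.2 and 14.5 through 14.7 can be bundled into one small function. The sketch below (Python; the `nn_test` name is ours) takes a mean nearest neighbor distance, the number of points, and the point density, and returns the random expectation, the R index, the Zn statistic, and a one-tailed p-value from the standard normal distribution. The demonstration values are the police-facility figures used later in this section (NND = 3.63 km, n = 20, density = 0.0312):

```python
import math
from statistics import NormalDist

def nn_test(mean_nnd, n, density):
    """Nearest neighbor difference test (equations 14.2, 14.5, 14.6, 14.7)."""
    nnd_r = 1 / (2 * math.sqrt(density))       # expected NND for a random pattern
    r = mean_nnd / nnd_r                       # standardized index R
    sigma = 0.26136 / math.sqrt(n * density)   # standard error of the mean NND
    z = (mean_nnd - nnd_r) / sigma             # test statistic Zn
    p_one_tailed = 1 - NormalDist().cdf(abs(z))
    return nnd_r, r, z, p_one_tailed

nnd_r, r, z, p = nn_test(3.63, 20, 0.0312)
print(round(nnd_r, 2), round(r, 2), round(z, 2), round(p, 3))  # 2.83 1.28 2.42 0.008
```

Carrying the unrounded Zn through gives a one-tailed p of about 0.0079; the worktable's 0.0078 comes from using the rounded Zn = 2.42.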

Example: Nearest Neighbor Analysis - Community Services in Toronto, Canada

The nearest neighbor methodology is now applied to point patterns representing community service sites within Toronto, Ontario. Some public services are best located in a highly dispersed pattern to provide relatively equal service distance for all parts of the region. Such a concern for spatial equity is especially important for emergency services such as police and fire. Other community activities may be sited to offer the region a high degree of efficiency or cost-effectiveness, with less concern for equal spacing of services within the region. Many nonemergency services exhibit more clustered arrangements in their point patterns, possibly reflecting financial rather than spatial constraints.

Four community services in Toronto are selected for a nearest neighbor analysis of their spacing. Two of the services, police and fire, provide emergency protection from sites of their facilities to various locations within the region. The other two services, elementary education and voting locations, provide nonemergency services for persons who travel to these locations. The facility sites for the four services are shown in figure 14.3.

TABLE 14.2
Worktable for Nearest Neighbor Analysis: Police Service Facilities in Toronto, Ontario

H0: NND = NND_R (point pattern is random)
HA: NND ≠ NND_R (point pattern is not random)

Calculate mean nearest neighbor distance:
    NND = ΣNND / n = 72.68 / 20 = 3.63
where n = number of points

Calculate random nearest neighbor distance:
    NND_R = 1 / (2√Density), where Density = n / Area
    NND_R = 1 / (2√0.0312) = 1 / (2 × 0.1766) = 2.83

Calculate standardized nearest neighbor index:
    R = NND / NND_R = 3.63 / 2.83 = 1.28

Calculate test statistic:
    Zn = (NND − NND_R) / σ_NND, where Density = n / Area
    σ_NND = 0.26136 / √(20(0.0312)) = 0.26136 / 0.7899 = 0.3309
    Zn = (3.63 − 2.83) / 0.3309 = 0.80 / 0.3309 = 2.42    (p = 0.0078)

A nearest neighbor analysis is applied to each pattern to measure the spacing of service sites. The calculation procedure for analyzing the spacing of police facilities is summarized in table 14.2. The nearest neighbor distances (in kilometers) for each of the 20 sites are shown in figure 14.4. The mean nearest neighbor distance for police stations in Toronto is 3.63 kilometers. If the stations had been distributed randomly across the city, the mean spacing would have been 2.83 kilometers (table 14.2). To hold the influence of point density constant and allow more useful comparisons of nearest neighbor results, a standardized nearest neighbor index (R) is calculated. This ratio of NND to NND_R for police stations in Toronto (R = 1.28) suggests that the actual spacing of points is more dispersed than random. Although dispersal of sites is expected for an emergency public service, does the R value of

1.28 differ significantly from 1.0, the result for a random pattern of points?

Equations 14.6 and 14.7 are used to test for a significant difference between an observed nearest neighbor index and the corresponding index value for a random spacing of points. The null hypothesis is that the observed and random nearest neighbor distances are equal. In this problem, a one-tailed (directional) test is logical, since police facilities are hypothesized to have a dispersed locational pattern. Therefore, the alternate hypothesis is that the observed nearest neighbor index is larger than the corresponding random value. The resulting test statistic for police locations in Toronto (Zn = 2.42; p = .0078) indicates that you can reject the null hypothesis with more than 99% certainty and conclude that the spacing of police stations is more dispersed than random.

Inferential interpretations in this problem should be made only with extreme care. It could be argued that the existing pattern of locations is a "natural" sample from the many possible locations where police stations could have been sited. If this argument is not accepted, the inferential assumption of an independently selected random sample of points is not met. If this is the case, you might want to focus primarily on a descriptive comparison of nearest neighbor indices (R-values) for the police stations and the other community services.

TABLE 14.3
Nearest Neighbor Values for Selected Public Facilities in Toronto, Ontario

Public Facility        NND    Density    NND_R     R       Zn        p
Fire                   2.06    .1278      1.40    1.47     8.21    .0000
Police                 3.63    .0312      2.83    1.28     2.42    .0153
Elementary schools     0.59    .7358      0.58    1.02     0.95    .3439
Voting locations       0.27   2.4351      0.32    0.84   -12.23    .0000
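The NND_R and R columns of table 14.3 follow directly from each service's NND and density, so they are easy to verify in code. A quick check (Python; the input values are copied from the table) recomputes both columns. Note that recomputing from unrounded intermediates can shift a final digit: elementary schools come out near 1.01 rather than the table's 1.02, which divides the rounded 0.59 by the rounded 0.58.

```python
import math

# NND (km) and point density for each service, from table 14.3
services = {"Fire": (2.06, 0.1278),
            "Police": (3.63, 0.0312),
            "Elementary schools": (0.59, 0.7358),
            "Voting locations": (0.27, 2.4351)}

results = {}
for name, (nnd, density) in services.items():
    nnd_r = 1 / (2 * math.sqrt(density))    # equation 14.2
    r = nnd / nnd_r                         # equation 14.5
    results[name] = (round(nnd_r, 2), round(r, 2))

for name, (nnd_r, r) in results.items():
    print(f"{name}: NND_R = {nnd_r}, R = {r}")
```

Because R already standardizes for density, these four values can be compared directly even though the services differ enormously in how many facilities they have.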

FIGURE 14.3
Location of Selected Public Facilities in Toronto, Ontario
[four dot maps of facility sites: elementary schools, police facilities, voting locations, and fire service facilities, each showing the inner harbor and a kilometer scale]
Source: City of Toronto Planning Division

FIGURE 14.4
Nearest Neighbor Distances for Police Service Facilities in Toronto, Ontario
[map of the 20 labeled police facilities with the nearest neighbor distance, in kilometers, for each site]

FIGURE 14.5
Continuum of R Values in Nearest Neighbor Analysis: Toronto, Ontario
[R scale from 0.0 (perfectly clustered) to 2.149 (perfectly dispersed), with Fire = 1.47, Police = 1.28, Education = 1.02, and Voting = 0.84 marked]

The corresponding nearest neighbor values and test statistic results for each of the four service patterns are shown in table 14.3. The public services are ranked in decreasing size of the R index (from more dispersed to more clustered). The pattern of fire stations in Toronto is the most dispersed (R = 1.47), followed by police stations (R = 1.28), thus supporting the assumption of an even distribution of emergency services to meet an equity requirement. The third most dispersed service is elementary education facilities (R = 1.02). However, statistical evidence (p = 0.34) indicates that this dispersion is not significantly different from a random pattern.

The most clustered spacing pattern among the four services is voting locations (R = 0.84), the only service where the random nearest neighbor distance exceeds the observed average nearest neighbor distance. Unlike emergency services, which are located to provide relatively equal response times for all parts of a region, voters must utilize a specific voting location based on their place of residence. This suggests that areas with higher population densities would require additional voting locations, producing a more clustered spatial pattern. This clustered pattern is reflected in a lower R index and is evident in the areas just north of Toronto's inner harbor.

Spacing of facilities for the four selected services in Toronto confirms the expected arrangements of emergency and nonemergency services (fig. 14.5). As hypothesized, the pattern of emergency fire services exhibits the most dispersed spacing, followed closely by the pattern of police facilities. Also as expected, voting locations in Toronto are more clustered than random. Although education services visually appear more dispersed than random, statistical evidence suggests that the spacing does not differ significantly from a random pattern.

These results are clearly influenced by the way in which the problem is structured. A critical issue when using the nearest neighbor procedure is the delimitation of the study area boundary. In many research problems (like the analysis of public services in Toronto), a political boundary defines the study area. In other problems, where no formal or logical boundary exists for enclosing the point locations, you must designate an operational boundary or territorial limit for the region under study. The position of the boundary does not directly affect the distance between nearest neighbors or the mean nearest neighbor index. However, boundary position does affect both the area of the study region and the point density, factors that determine the random nearest neighbor distance. Therefore, specification of the study area boundary influences the outcome of a point pattern analysis. Ideally, a functional study area boundary should be defined that is consistent with the type of points being analyzed. Some researchers suggest that the boundary should be defined just beyond the outermost points within the study area.

Another issue related to the study area boundary is the specification of nearest neighbors at the edge of a region. In some actual situations, the nearest neighbor for a point located near the boundary may actually lie outside the study area. Perhaps the nearest fire station or school lies across the border at a shorter distance than the nearest neighbor lying inside the study area. Therefore, results are influenced by how nearest neighbor distances are handled near the study area boundary. Administrative boundaries do not necessarily offer the best approach to delimiting the study area in point pattern problems.

To adapt to various situations, geographers have expanded the NNA concept to include not just the single nearest neighbor, but n nearest neighbors (where n represents all nearby neighbors, and the definition of "nearby" varies depending on the specific situation). If the complete distribution of all point-pair distances in a spatial pattern is considered, then we have a classical k nearest neighbor (often called a Ripley's K function). In evaluating more than one nearest neighbor, you simply determine the mean distance from each point to the n closest points in the distribution. For example, we can determine the average distance of the closest four police stations to any other police station in Toronto.

The K function is a true second order statistic that evaluates the distribution of points over the full set of distances in the point set. That is, for each point the K function determines how many points are within distance d. Because distance is a continuous variable, the K function typically breaks the data up into distance intervals. For example, we might determine how many police stations are within 1 km, 2 km, 3 km, ... and 10 km of each other and compare that to an expected number of police stations from a theoretical distribution of a completely random set of points. The purpose is to determine if a set of points (services or facilities) is clustered at multiple distances.

14.2 QUADRAT ANALYSIS

An alternate methodology for studying the spatial arrangement of point locations is quadrat analysis. Rather than focusing on the spacing of points within a study area, quadrat analysis examines the frequency of points occurring in various parts of the area. A set of quadrats (usually square cells) is superimposed on the study area, and the number of points in each cell is determined. By analyzing the distribution of cell frequencies, the point pattern arrangement within the study area can be described.

Whereas nearest neighbor analysis uses the average spacing of the closest points, quadrat analysis considers the variability in the number of points per cell (fig. 14.6). If each of the quadrats contains the same number of points (case 1), the pattern would show no variability in frequencies from cell to cell and would be perfectly dispersed. By contrast, if a wide disparity exists in the number of points per cell for the set of quadrats examined (case 2), the variability of the cell frequencies would be large, and the pattern would display a clustered arrangement. In a third alternative (case 3), the variability of cell frequencies is moderate, and the pattern of points would reflect a random or near random spatial arrangement.

FIGURE 14.6
Quadrat Analysis: The Relationship of Cell Frequency Variability and Point Pattern
[Case 1: perfectly dispersed, no variance; Case 2: highly clustered, large variance; Case 3: random, moderate variance]

However, the absolute variability of the cell frequencies cannot be used as an effective descriptive measure of the point pattern because it is influenced by the density of points, that is, the mean number of points per cell. This relationship is directly analogous to the influence of the mean on the standard deviation of a variable. Recall from chapter 3 that the coefficient of variation is used for meaningful comparisons of relative variability between distributions. In quadrat analysis, an index known as the variance-mean ratio (VMR) standardizes the degree of variability in cell frequencies relative to the mean cell frequency:

    VMR = VAR / MEAN    (14.8)

where VMR = variance-mean ratio
      VAR = variance of the cell frequencies
      MEAN = mean cell frequency

The mean and variance are the basic descriptive statistics that summarize the central tendency and variability of a

variable. Since the data from quadrat analysis are usually summarized as frequency counts (number of points in each quadrat), formulas for the weighted mean and weighted variance (weighted standard deviation squared) are usually applied (see tables 3.4 and 3.8).

Interpretation of the variance-mean ratio offers insight into the spatial pattern of points within the study area. In a dispersed distribution of points, for example, the cell frequencies will be similar, and variance of the cell frequency counts will be very low. In an extreme case (such as case 1 in fig. 14.6), if each quadrat contains exactly the same number of points, the variance will be zero, making the variance-mean ratio zero. Conversely, if a point pattern is highly clustered, with most cells containing no points and a few cells having many points, the variance of cell frequencies will be large relative to the mean cell frequency. This type of spatial pattern will produce a large value for the variance-mean ratio.

If the set of points is randomly arranged across the cells of the study area, an intermediate value of variance will occur. For a perfectly random point pattern, the variance of the cell frequency is equal to the mean cell frequency. To understand this result, recall from section 5.4 that the Poisson distribution is used to describe the frequency of values for a randomly generated spatial or temporal pattern. In the Poisson distribution, the mean frequency equals the variance of the frequencies. Therefore, when using the variance-mean ratio to describe spatial point patterns, a result close to one (variance equals mean) suggests that the distribution has a random arrangement.

FIGURE 14.7
Two Point Patterns with the Same Variance-mean Ratio, but Different Visual Impressions
[two study areas, each divided into four quadrats containing eight points in total]

Quadrat Analysis

Primary Objective: Determine whether a random (Poisson) process has generated a point pattern

Requirements and Assumptions:
1. Random sample of points from a population
2. Sample points are independently selected

Hypotheses:
H0: VMR = 1 (point pattern is random)
HA: VMR ≠ 1 (point pattern is not random)
HA: VMR > 1 (point pattern is more clustered than random)
HA: VMR < 1 (point pattern is more dispersed than random)

Test Statistic:
    χ² = VMR(m − 1)

FIGURE 14.8
Quadrat Analysis: The Relationship of Chi-square Value and p
[Case 1: clustered, χ² is large, p is low (close to 0); Case 2: random, χ² is intermediate, p is intermediate (not close to 0 or 1); Case 3: dispersed, χ² is small, p is high (close to 1)]

Since quadrat analysis measures dispersion related to the frequencies of points in the cell boundaries and not the explicit location of points in relationship to one another, you must be careful when interpreting the VMR. For example, both case 1 and case 2 in figure 14.7 have the same VMR. There are eight points and four quadrats. However, the actual distribution of the points in relationship to one another in the large study area (ignoring the four internal quadrats) would suggest that case 1 has a

clustered pattern of points, while the distribution of points in case 2 is dispersed.

In addition to its use as a descriptive index, the variance-mean ratio can also be applied inferentially to test a distribution for randomness. The test statistic is chi-square, defined as a function of both the VMR and the number of cells (m):

    χ² = VMR(m − 1)    (14.9)

The null hypothesis for this test is expressed as no difference between the observed distribution of points and a distribution of points resulting from a random process (that is, VMR = 1). Either a directional (one-tailed) or non-directional (two-tailed) approach can be used, depending on the test objective of the alternate hypothesis.

Rejection of the null hypothesis can occur if a point pattern is more clustered than random (VMR > 1) or more dispersed than random (VMR < 1). The difference test determines whether an observed VMR value differs significantly from a theoretical random value (VMR = 1.0). The interpretation of the chi-square test statistic and corresponding p-value reflects these two possibilities. A large VMR generally produces a larger chi-square value (with a small p-value), suggesting greater variability of the cell frequencies and a clustered arrangement of points. Conversely, a small VMR produces a smaller chi-square value, indicating lower variability of cell frequencies and a more dispersed distribution of points. In such instances, the corresponding p-value is larger. Intermediate chi-square values result from variance-mean ratios closer to one, suggesting a lower probability of rejecting the null hypothesis and therefore more likelihood that the distribution of points is actually random. Thus, a continuum of p-values from 0 to 1 indicates transition from clustered to random to dispersed (fig. 14.8). Either very high or very low p-values suggest rejection of the null hypothesis, while intermediate values do not support rejection.
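The test in equation 14.9 can be sketched in code. Because the chapter reports an upper-tail probability for the chi-square statistic, the p-value below uses the Wilson-Hilferty normal approximation to the chi-square distribution; that approximation is our choice (made to keep the sketch standard-library only), not part of the chapter's procedure. With m = 120 cells and VMR = 0.904, the values from the New Mexico random-point example, it gives χ² ≈ 107.6 and a large p near the chapter's reported 0.77:

```python
from statistics import NormalDist

def vmr_chi_square(vmr, m):
    """Chi-square statistic for a variance-mean ratio (equation 14.9).

    Returns (chi2, p_upper), where p_upper approximates
    P(chi-square >= chi2) with m - 1 degrees of freedom using the
    Wilson-Hilferty cube-root transformation to a standard normal.
    """
    df = m - 1
    chi2 = vmr * df
    z = ((chi2 / df) ** (1 / 3) - (1 - 2 / (9 * df))) / ((2 / (9 * df)) ** 0.5)
    return chi2, 1 - NormalDist().cdf(z)

chi2, p = vmr_chi_square(0.904, 120)
print(round(chi2, 1), round(p, 2))
```

A library routine such as SciPy's chi-square survival function would give the exact tail probability; the approximation here is accurate to roughly two decimal places for large degrees of freedom.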

Example: Quadrat Analysis - Distribution of Wildfires in New Mexico

The example of wildfire distribution (section 5.4) is re-examined here using quadrat analysis. The objective now is to describe the pattern of points within the study area by analyzing the distribution of cell frequencies. The resulting frequency pattern indicates whether the wildfire starting points have a dispersed, random, or clustered arrangement. The chi-square difference test is then applied to determine whether the observed pattern differs significantly from a theoretical random pattern.

In this analysis, wildfire locations are assumed to be independent random events. This assumption is justified if the sample is considered a "natural" sample over one time period. If a natural sampling process is assumed, this particular pattern of wildfire starting points represents a random sample from an infinite number of sample patterns that could have occurred during a given time period.

As discussed in section 5.4, 122 cells covering New Mexico have at least half of their area in the state. This set of 122 quadrats excludes only 116 of the over 16,000 recorded wildfire starting points for the state over the entire 30-year time period. One of our objectives is to determine if the distributions of wildfires in New Mexico are different over the last three decades: 1980s, 1990s, and 2000s.

In addition to an analysis of wildfire distributions for all three decades, we also focus attention on a random set of 2,500 points. Using GIS processing, a random set of 2,500 points is generated, and yields a mean frequency of 20.9 wildfires per cell (2,500 wildfires in 120 cells) (fig. 14.9). The variability in cell frequencies around these means largely determines the nature of the point pattern. For this set of quadrats, the variance of the random points is 18.9 per cell. The two summary statistics produce a variance-mean ratio of .904, suggesting a random distribution (table 14.4).

FIGURE 14.9
GIS-generated Random Set of 2,500 Points Superimposed on 120 Cells in New Mexico
[dot map of the random points within the state's quadrat grid]

A difference test is applied to determine whether the VMR of .904 of our computer-generated random set of points does not differ significantly from that of a theoretical random pattern (VMR = 1). The null hypothesis assumes no difference between the observed and expected VMRs. Selection of a directional (one-tailed) or non-directional (two-tailed) test depends on whether a priori evidence exists to suggest a point pattern that is more clustered than random or more

FIGURE 14.10
Point Patterns for Wildfires in New Mexico: 1980s, 1990s, and 2000s
[four dot maps of wildfire starting locations: the decade of the 1980s, the decade of the 1990s, the decade of the 2000s, and all decades (1980s, 1990s, and 2000s) combined]
Source: U.S. Geological Survey, Federal Wildland Fire Occurrence Website
dispersed than random. Because we generated a fictitious set print• across New Mexico has relegated forested areas to an
of random points, there is no a priori evidence suggesting the ever-smaller portion of the state. Quite simply, there are fewer
pattern should be clustered (or dispersed), therefore a non• places where a wildfire could start, and these places are
directional, two-tailed difference test is appropriate. Con• increasingly clustered on public lands such as national parks,
versely, an alternate hypothesis that assumes, a priori, a clus• national forests, and Bureau of Land Management lands.
tered or dispersed point pattern would require a directional,
one-tailed difference test. TABLE 14.4
As could be expected, the resulting chi-square test statistic
Worktable for Quadrat Analysis: New Mexico
of 107 produces a rather large p•value of 0.77 (table 14.4).Thus,
Wildfires (2,500 Random Points)
we have very strong confidence that the nul I hypothesis should
be accepted and that our computer-generated pattern is, in fact, random. Figure 14.10 shows the point pattern distribution for wildfires in New Mexico for the decades of 1980-1989, 1990-1999, and 2000-2009.

A cursory view of the maps seems to show that the distribution of wildfire starting points is very highly clustered over all three decades. This makes intuitive sense, as many areas in New Mexico are desert, and only a few areas are heavily forested. In cases like this, a VMR will almost certainly describe a highly clustered distribution. However, even though the "natural pattern" of forested areas would logically result in a clustered distribution of wildfires, there is still merit in applying the VMR technique to compare multiple decades (and detect any changes in the overall clustering pattern of wildfires). Table 14.5 shows the VMR results for each decade.

You can see that each decade exhibits a highly clustered distribution, but as we said, that is not surprising given the distribution of forests, deserts, and human settlements in the state. However, the decades of 1990-99 and 2000-09 are dramatically more clustered than the decade of 1980-89. The increasing clustering of wildfire starting points could possibly be explained by the fact that expansion of "the human foot-

Hypotheses:
H0: VMR = 1 (point pattern is random)
HA: VMR ≠ 1 (point pattern is not random)

Calculate mean cell frequency:

    mean cell frequency = n / m

where: n = number of points
       m = number of cells

    MEAN = 2,509 / 120 = 20.9

Calculate variance of cell frequencies:

Number of points per cell (x)   Number of cells (f)   Total points (fx)   fx²
11     1     11     121
12     2     24     288
13     2     26     338
14     4     56     784
15     3     45     675
16     4     64     1,024
17     9     153    2,601
18     8     144    2,592
19     15    285    5,415
20     10    200    4,000
21     10    210    4,410
22     11    242    5,324
23     9     207    4,761
24     13    312    7,488
25     3     75     1,875
26     2     52     1,352
27     3     81     2,187
28     8     224    6,272
32     1     32     1,024
33     2     66     2,178
SUM    120   2,509  54,709

    VAR = [Σfx² - (Σfx)²/m] / (m - 1)

where: f = number of cells containing x wildfires
       x = number of wildfires per cell

    VAR = [54,709 - (2,509)²/120] / (120 - 1) = (54,709 - 52,459) / 119 = 18.9

Calculate variance-mean ratio:

    VMR = VAR / MEAN = 18.9 / 20.9 = 0.904

Calculate test statistic:

    χ² = VMR(m - 1) = 0.904(119) = 107.6   (p = 0.17)

FIGURE 14.11
New Mexico Wildfires from 2000 to 2009 Overlain with a Larger Quadrat Size
[Map of wildfire locations in New Mexico; north arrow; scale bar 0-300 kilometers]
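The worked calculation above can be reproduced with a short script. This is an illustrative sketch, not part of the text; the frequency table is transcribed from the worked example for the computer-generated random pattern.

```python
# Sketch (not the book's code): variance-mean ratio (VMR) for quadrat counts.
# Mapping is points-per-cell -> number of cells, from the frequency table above.
freq = {11: 1, 12: 2, 13: 2, 14: 4, 15: 3, 16: 4, 17: 9, 18: 8, 19: 15,
        20: 10, 21: 10, 22: 11, 23: 9, 24: 13, 25: 3, 26: 2, 27: 3,
        28: 8, 32: 1, 33: 2}

m = sum(freq.values())                             # number of cells: 120
n = sum(x * f for x, f in freq.items())            # number of points: 2,509
sum_fx2 = sum(f * x * x for x, f in freq.items())  # 54,709

mean = n / m                                       # 20.9
var = (sum_fx2 - n * n / m) / (m - 1)              # 18.9
vmr = var / mean                                   # 0.904: close to 1, i.e., random
chi2 = vmr * (m - 1)                               # ~107.6, compared with chi-square, df = m - 1
```

A VMR near 1 (here 0.904) is consistent with a random pattern; the chi-square comparison with m - 1 degrees of freedom formalizes that judgment.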
Chapter 14 • Point Pattern Analysis 221

TABLE 14.5
Variance-Mean Ratios (VMRs) for Wildfire Distribution in New Mexico

Distribution                         Variance-Mean Ratio
Computer generated random sample     .904
1980-1989                            20,381
1990-1999                            32,645
2000-2009                            30,818

Does the size of quadrats used to cover the study area influence results from quadrat analysis? To examine this question, the 2000-09 wildfire distribution is evaluated using a cell design with quadrat areas four times as large. This cell structure produces 37 larger quadrats covering 7,177 wildfires for the decade (fig. 14.11). To be consistent, boundary cells whose area is mostly (more than half) outside New Mexico are eliminated.

The larger cell network produces fewer quadrats with higher frequencies per quadrat and a larger mean cell frequency of 211 wildfires per quadrat. The variance of the cell frequencies about this mean is also higher at 81,203; since the variance remains much larger than the mean, the pattern once again displays a highly clustered arrangement. Standardizing this variance to the mean produces a variance-mean ratio of 15,070, much lower than the value found with the smaller quadrat size (VMR = 30,818), although still highly significant. You might want to speculate about why the VMR for the larger quadrats is considerably smaller than the VMR for the smaller quadrats.

Although quadrat analysis offers a useful approach to studying the spatial arrangement of points, several procedural issues must be considered. Different cell sizes produce varying levels of mean point frequency and variance per cell. Moreover, studies have shown that if the point pattern is held constant but the quadrat size is decreased (i.e., the number of quadrats is increased), the variance of the point frequencies usually declines faster than the mean. However, this is not always the case, as we can see from our example. Thus, different cell sizes in a single problem can generally be expected to produce different values of the variance-mean ratio.

Although more research needs to be conducted on optimal quadrat size, two guidelines can be offered. When visual examination of a point pattern reveals distinct clusters, the size of quadrats should match that of the clusters. This cell size allows the results from quadrat analysis to reflect more accurately the visual impression of the pattern. Other researchers have suggested a general "rule of thumb" that cell size should be equal to twice the average area per point. In other words, the mean frequency of points should be close to 2.0 per quadrat. This guideline approximates the cell structure used with the 122 cells covering the New Mexico forest fire example. Taylor (1977) presents a more detailed discussion of cell size in quadrat problems.

Another issue in the use of quadrat analysis also requires some attention. Like nearest neighbor analysis, delimitation of the study area can influence the results of a study. In the use of quadrat analysis, a way to handle cells around the study area boundary needs to be determined. Cells on the boundary will inevitably contain area both inside and outside the study region. In the New Mexico wildfire problem, cells with more than half of their area outside the study region were eliminated. Alternative decision rules can be used, and these may change the results obtained from the analysis. No matter what decision rule is applied in a particular problem, the geographer needs to be consistent in handling the boundary issue.

KEY TERMS

nearest neighbor analysis, 210
point pattern analysis, 210
quadrat analysis, 216
standardized nearest neighbor index, 212
variance-mean ratio (VMR), 216

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Identify the types of geographic problems or situations for which nearest neighbor analysis is appropriate.
2. Identify the types of geographic problems or situations for which quadrat analysis is appropriate.
3. Explain the logic and procedures used in nearest neighbor analysis and explain how they differ from those used in quadrat analysis.

REFERENCES AND ADDITIONAL READING

Boots, B. N. and A. Getis. Point Pattern Analysis. Newbury Park, CA: Sage, 1988.
Getis, A. "Temporal Analysis of Land Use Patterns with Nearest Neighbor and Quadrat Methods." Annals, Association of American Geographers. Vol. 54 (1964): 391-399.
Taylor, P. J. Quantitative Methods in Geography. Boston: Houghton Mifflin, 1977.
Thomas, R. An Introduction to Quadrat Analysis. CATMOG No. 12. Norwich, England: Geo Books, 1977.

For those interested in looking further at the distribution of public services in Toronto, Canada, see the City of Toronto Planning Division website: www.toronto.ca/planning. For those interested in the spatial distribution of wildfire occurrences, see the Department of the Interior, U.S. Geological Survey, Federal Fire Occurrence website: http://wildfire.er.usgs.gov/firehistory/data.html
Area Pattern Analysis

15.1 Join Count Analysis
15.2 Moran's I Index (Global)
15.3 Moran's I Index (Local)
This chapter presents several methods for analyzing area patterns. Some of these methods are global in nature, encompassing an entire study area, while one method (a variation of Moran's I) analyzes a local region. Two statistical procedures, the join count statistic and Moran's I Index, are discussed in more detail. The join count statistic basically examines the structure of all subarea connections (joins) within the larger study area. Each particular subarea is either "joined" or "not joined" to a number of adjacent subareas, and the overall pattern of this subarea joining structure is analyzed. The Moran's Index analyzes ordinal or interval/ratio area patterns to determine if they are significantly autocorrelated.

For example, an economic development planner might be interested in evaluating changing spatial patterns of poverty in Appalachia. Through area pattern analysis of a chronological series of county-level choropleth maps showing Appalachian poverty, the spatial distribution of poverty can be evaluated over time to determine whether it is becoming more dispersed or clustered. In another example, an urban geographer could analyze a map pattern depicting the number of existing home sales by census tract to determine the degree to which sales are spatially concentrated in certain portions of the city. If a nonrandom clustering of high turnover rates is found in certain segments of the city, attention could be focused on why the sales exhibit such spatial patterns.

The goals and objectives in area pattern analysis are similar to those in point pattern analysis. Just as nearest neighbor and quadrat analysis examine the random or nonrandom nature of point patterns, descriptive and inferential procedures are available to analyze area patterns. A choropleth or area pattern map is considered clustered if adjacent or nearby areas tend to have highly similar values or scores. Alternatively, if the values of adjacent or nearby areas tend to be dissimilar, then the spatial pattern is considered more dispersed than random (refer back to fig. 13.1).

Area pattern analysis is appropriate for studying many practical problems in geography. It is particularly valuable when examining the way in which area patterns of a specific variable change over time. For example, suppose a medical geographer is concerned with the diffusion of influenza in a metropolitan area. If the number of cases is reported by census tract for several time periods, the influenza morbidity rate could be displayed on a series of choropleth maps and the degree of non-randomness in the patterns analyzed. If the incidence of the disease becomes more dispersed over successive time periods, researchers would be very concerned because the dispersion indicates that the influenza is spreading more widely throughout the metro area. It could also mean a significant number of cases were appearing in other areas not close to existing areas of high incidence. Conversely, if the morbidity rate patterns were becoming more clustered over successive time periods, then a closer examination of those areas with higher incidence of influenza would be appropriate. Such knowledge could lead to the implementation of effective strategies for disease control.

Suppose a political geographer wants to study the spatial pattern of voter registration rates across a number of precincts within a community. If registration rates have been mapped by precinct before each of the last several elections, a useful comparative analysis of area patterns is possible. If the pattern is becoming more clustered over successive time periods, a different registration strategy may be necessary. An analysis of area pattern trends might also provide insights into demographic or economic variables that seem to be related to the voter registration rate.

Join count analysis is used when areas on a map are represented in binary form, such as yes/no, presence/absence, or above/below a particular threshold. For example, each census tract in an influenza morbidity rate map could be classified as having either an above average or below average incidence of influenza. Moran's I is used when areas on a map are represented in ordinal or interval/ratio form, with the actual values or ranks of values. For example, the census tracts might indicate the actual influenza morbidity rate, or a ranking of census tracts from the highest to lowest morbidity rate.

15.1 JOIN COUNT ANALYSIS

The join count is the basic organizational statistic for the analysis of area patterns. A "join" is operationally defined as two areas sharing a common edge or boundary. The procedures involved in calculating a join count statistic are relatively straightforward. The fundamental building block is the number of joins in the pattern and the nature of the join structure in the study area. Figure 15.1 illustrates the simplest situation: a binary classification of data and a two-category choropleth map. Many geographic variables are binary in nature, or can be downgraded effectively to binary without sacrificing important locational information. This figure shows the join structure for the subareas in the study region. For example, area A is joined to two other areas, B and D. The number of joins associated with each area is listed adjacent to the join structure map.

FIGURE 15.1
Join Structure and Examples of Area Patterns Appearing Clustered, Dispersed, and Random
[Join structure map with the number of joins per area: A 2, B 3, C 2, D 4, E 4, F 4, G 3, H 6, I 3, J 3, K 3, L 1; three example patterns: Case 1: Clustered, Case 2: Dispersed, Case 3: Random]

To comply with accepted practice, each area can be identified as either "black" or "white." It is then possible to refer to the observed number of "black-white," "black-black," and "white-white" joins in the pattern. In case 1 of figure 15.1, the area pattern appears clustered, with relatively few black-white (dissimilar) joins and a rather large number of black-black and white-white (similar) joins. In case 2, a dispersed (checkerboard-like) pattern has more black-white joins and fewer joins of similar-category areas. With random patterning (case 3), the result is an intermediate number of both similar and dissimilar joins. These situations are all reflected in table 15.1.

TABLE 15.1
Join Counts Associated with Area Patterns

Case            Total joins*   Black-white joins (dissimilar)   Black-black joins   White-white joins   Total similar
1 (clustered)   19             5                                7                   7                   14
2 (dispersed)   19             15                               0                   4                   4
3 (random)      19             12                               4                   3                   7
* See figure 15.1 for map of join structure and area patterns.

To determine if a join count distribution has been generated by a random process, the number of black-white joins that would be expected from a theoretical random arrangement of binary observations must be calculated. That is, the expected number of black-white joins in a purely random pattern is compared with the one actually observed. If the observed join count is significantly different from the expected join count, the observed pattern can be described as nonrandom.

How do you determine the expected number of black-white joins? The locational context in which the overall study area pattern is being evaluated suggests two
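The bookkeeping behind table 15.1 (tallying similar and dissimilar joins from a join list) can be sketched in a few lines. The five-area map and its colors below are hypothetical, chosen for brevity; they are not the twelve-area pattern of figure 15.1.

```python
# Sketch (hypothetical data, not figure 15.1): classify each join in a binary
# area pattern as dissimilar (black-white) or similar (black-black, white-white).
color = {"A": "black", "B": "black", "C": "white", "D": "black", "E": "white"}
joins = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "E"), ("D", "E")]

bw = sum(color[i] != color[j] for i, j in joins)             # black-white joins
bb = sum(color[i] == color[j] == "black" for i, j in joins)  # black-black joins
ww = sum(color[i] == color[j] == "white" for i, j in joins)  # white-white joins
# Every join falls into exactly one category, so bw + bb + ww == len(joins).
```

With these hypothetical inputs the tally is 2 dissimilar and 3 similar joins; substituting a real join structure and classification reproduces a row of table 15.1.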
224 Part V .A. Inferential Spatial Statistics

possible approaches for generating expected join counts. Some area pattern analyses may be conducted under the hypothesis of free sampling. Free sampling should be used if the researcher can determine the probability of an area being either black or white based on some theoretical idea or by referring to a larger study area. Once the probability of an area being black or white has been determined, the expected number of black-white joins in a random pattern having those probabilities can also be determined. For example, in the influenza morbidity rate problem, suppose each census tract throughout the metropolitan area is classified as having either above average incidence of the disease (black) or below average incidence (white). If the pattern of disease is being analyzed in only one portion of the metropolitan area, logic would suggest that the probability of above- or below-average incidence of influenza should be typical or representative of that found throughout the metro area, and the free sampling hypothesis would be appropriate.

The non-free sampling hypothesis is used when no reference to a larger study area or general theory is appropriate. In this case, only the study region itself is considered in the analysis, and the expected number of black-white joins is estimated from a random patterning of only those subareas in the study region. For instance, a political geographer studying the spatial pattern of community voter registration rates by precinct may not have any reason to believe this pattern is typical or representative of any larger area (such as the state in which the community is located). Therefore, if it is hypothesized that the area pattern has resulted from the unique characteristics found within the study area, then a non-free sampling hypothesis should be applied.

In many geographic problems, you will find it difficult to conclude with any degree of confidence that a subarea pattern is typical of a larger area. If this is the case, use of the expected number of black-white joins from a non-free sampling hypothesis frees you from having to make any restrictive assumption. For the sake of brevity, we will only show the slightly simpler calculation procedure for area pattern analysis in a non-free sampling context.

In a non-free sampling context the observed number of black-white joins is compared with the number of such joins expected in a purely random arrangement. This will help you determine if the area pattern has been generated by a random process. The null hypothesis is that the observed number of black-white joins is equal to the expected number of black-white joins in a purely random area pattern. If reason exists to hypothesize that the observed area pattern is non-random in a certain direction, then a one-tailed alternate hypothesis is appropriate. For instance, an area pattern could be hypothesized either as more clustered than random or more dispersed than random. Without an a priori reason to hypothesize a clustered or dispersed pattern, a two-tailed alternate hypothesis is selected, and the conclusion will be that an area pattern is random or nonrandom.

The test statistic (Zb) for non-free sampling is:

    Zb = (OBW - EBW) / σBW      (15.1)

where: OBW = observed number of black-white joins
       EBW = expected number of black-white joins
       σBW = standard error of the expected number of black-white joins

In non-free sampling, the expected number of black-white joins is calculated by incorporating the observed number of black areas and white areas directly into the equation:

    EBW = 2JBW / [N(N - 1)]      (15.2)

where: EBW = expected number of black-white joins
       J = total number of joins
       B = number of black areas
       W = number of white areas
       N = total number of areas (black plus white) = B + W

The expected number of black-white joins in the study region is determined by asking how many black-white joins would occur from a theoretical random pattern containing the same number of black and white areas as the observed area pattern. That is, the question becomes how many black-white joins would be generated from a random arrangement of the observed number of black and white areas in the area pattern.

Certain generalizations can be made concerning the expected number of black-white joins. Given a particular total number of areas (N = B + W), the value of the denominator in equation 15.2 is constant. If most of the areas in the observed pattern are either black or white, then their product (BW) in the numerator will be a smaller number, producing a smaller expected number of black-white joins. Conversely, if about the same number of black and white areas is found in the observed pattern, then their product will be a larger number, producing a larger expected number of black-white joins. These results seem logical, for a pattern with a large number of black areas and very few white areas (or vice versa) would not be expected to have a large number of black-white joins. The expected number of black-white joins is maximized when an equal number of black and white areas occur in the overall pattern.

Not all randomly generated patterns will contain exactly the same number of black-white joins, so a measure of the amount of variability expected as a result of sampling must be included. The standard error of expected black-white joins is:

    σBW = √( EBW + ΣL(L - 1)BW / [N(N - 1)]
             + 4[J(J - 1) - ΣL(L - 1)]B(B - 1)W(W - 1) / [N(N - 1)(N - 2)(N - 3)] - EBW² )      (15.3)

where: σBW = standard error of the expected number of black-white joins
       J = total number of joins
       ΣL = total number of links = 2J
       B = number of black areas
       W = number of white areas
       N = total number of areas (black plus white) = B + W

Note: ΣL = 2J. If areas A and B are joined, then A is "linked" to B and B is "linked" to A, making the sum of all links twice the number of joins.

Join Count Analysis (Binary Categories)

Primary Objective: Determine whether a random process has generated a binary (two-category) area pattern

Requirements and Assumptions:
1. Each area is assigned to one of two categories
2. Each pair of areas must be defined as either adjacent ("joined") or nonadjacent (not "joined") in a consistent manner

Hypotheses:
H0: OBW = EBW (area pattern is random)
HA: OBW ≠ EBW (area pattern is not random)
HA: OBW > EBW (area pattern is more dispersed than random)
HA: OBW < EBW (area pattern is more clustered than random)

Test Statistic:
Zb = (OBW - EBW) / σBW

The area pattern may now be analyzed with the non-free sampling hypothesis (equation 15.1 with supplementary equations 15.2 and 15.3). If the observed number of black-white joins is larger than the expected number of black-white joins, then Zb will be a positive value, indicating a pattern more dispersed than random. A relatively large number of black-white (dissimilar) joins will occur in a dispersed pattern (fig. 15.1). Conversely, if the observed number of black-white joins is smaller than the expected number of such joins, then Zb will be negative, and the pattern will appear more clustered than random. The greater the magnitude of Zb (either positive or negative), the greater the likelihood that the area pattern being analyzed is not random. Just like other Z-score difference tests, this Zb test statistic for area pattern analysis indicates the number of standard deviations separating the expected random black-white join count from the observed value.

Example: Join Count Analysis - Obesity Distribution Patterns across the United States: 1995, 2002, 2009

Returning to the discussion of obesity, we have shown previously that the obesity rate in the United States has increased each year. However, while the nation suffers a high obesity rate, the geographic distribution of highly obese states may not be evenly distributed. That is, only certain portions of the country may exhibit high levels of obesity. A geographer might be interested in determining if the areas of high obesity in the United States are clustered, random, or dispersed in order to tailor a specific approach for combating the epidemic. You also might want to know if the overall spatial pattern of obesity is becoming more dispersed, random, or clustered over time. Figure 15.2 shows binary maps of obesity in the conterminous United States (excluding Alaska and Hawaii, since they are obviously not "joined" to any other state) for three different years: 1995, 2002, and 2009. Black areas represent the states that are above the national mean obesity level and white areas represent states below this level. As shown in table 15.2, the number of states with obesity levels above the national mean is relatively similar for each year.

Across the conterminous United States there are 212 links and 106 joins (recall from equation 15.3 that ΣL = 2J); table 15.3 shows the state linkage pattern needed for the area pattern analysis. This general state linkage pattern serves all three dates of analysis (1995, 2002, and 2009), although the detailed calculation procedure which follows is shown only for the 1995 pattern.

TABLE 15.2
Number of States with Obesity Levels Above the National Average

Year   States with obesity above the national average   Number of black-white joins
1995   26                                               21
2002   27                                               32
2009   25                                               25
FIGURE 15.2
State Obesity Levels in the United States: 1995, 2002, and 2009
[Three binary state maps, one for each year; black = above national average obesity rate, white = below national average]
Source: Centers for Disease Control and Prevention (CDC)

TABLE 15.3
State Linkage Pattern

State Name       Number of links (L)   L - 1   L(L - 1)
Alabama          4    3    12
Arizona          5    4    20
Arkansas         6    5    30
California       3    2    6
Colorado         7    6    42
Connecticut      3    2    6
Delaware         3    2    6
Florida          2    1    2
Georgia          5    4    20
Idaho            6    5    30
Illinois         5    4    20
Indiana          4    3    12
Iowa             6    5    30
Kansas           4    3    12
Kentucky         7    6    42
Louisiana        3    2    6
Maine            1    0    0
Maryland         4    3    12
Massachusetts    5    4    20
Michigan         2    1    2
Minnesota        4    3    12
Mississippi      4    3    12
Missouri         8    7    56
Montana          4    3    12
Nebraska         6    5    30
Nevada           5    4    20
New Hampshire    3    2    6
New Jersey       3    2    6
New Mexico       5    4    20
New York         5    4    20
North Carolina   4    3    12
North Dakota     3    2    6
Ohio             5    4    20
Oklahoma         6    5    30
Oregon           4    3    12
Pennsylvania     6    5    30
Rhode Island     2    1    2
South Carolina   2    1    2
South Dakota     6    5    30
Tennessee        8    7    56
Texas            4    3    12
Utah             6    5    30
Vermont          3    2    6
Virginia         5    4    20
Washington       2    1    2
West Virginia    5    4    20
Wisconsin        4    3    12
Wyoming          6    5    30
TOTAL            212        856

In this example, a "black-white" or dissimilar join occurs when a state with an obesity rate above the national mean shares a common boundary with a state having an obesity rate below the national mean. Conversely, a similar join occurs when two states with obesity rates above (or below) the national mean share a common boundary. On the 1995 map, 21 of the 106 joins are dissimilar, while the other 85 joins are similar. Our research task is to determine if the observed number of dissimilar joins is significantly different from the number of dissimilar joins expected in a random obesity level pattern.

The worktable for the 1995 join count analysis is presented in table 15.4. The expected number of dissimilar joins in a purely random pattern is calculated by incorporating the observed number of states with obesity levels above and below the national mean directly into the equation. The result indicates that nearly 54 (53.75) dissimilar joins are expected in a random pattern.

To determine whether this contrast between the number of observed dissimilar joins (21) and the number of expected dissimilar joins (53.75) is statistically significant, the standard error of expected dissimilar joins (σBW) must be incorporated into the test statistic. The standard error of expected dissimilar joins is 4.92 (table 15.4). The resultant test statistic value (Zb) is -6.66, with an associated p-value of .000. These statistics indicate a highly significant tendency toward clustering of states having obesity rates above the national mean.

Using a similar statistical test procedure, analyses of the 2002 and 2009 obesity patterns (fig. 15.2) have also been generated. The results of these analyses are summarized in table 15.5, and an interesting trend emerges across the United States. In all three years, the distribution of states with obesity levels above the national mean is very highly clustered. However, 1995 appeared to have the most clustering over the 15-year period.

The knowledgeable geographer should be able to offer some tentative hypotheses about why the obesity rate in the United States was initially most clustered in 1995, became a bit less clustered in 2002, and once again became more clustered in 2009. First, the overall patterns are all highly clustered (examine fig. 15.2 and table 15.5). Generally speaking, for all three dates, similar observable regional patterns are evident. Most states in the New England, Rocky Mountain, and Pacific census divisions have below national mean obesity rates most of the time. Conversely, most states in the South and Midwest census regions have above national mean obesity rates most of the time. These dominant regional patterns would explain the continuing high levels of obesity clustering. Perhaps we are over-emphasizing the importance of these changes. After all, the obesity patterns are all extremely clustered.

When there are exceptions to this overall pattern, measurable differences in the level of clustering seem to emerge. The most clustered pattern is that of 1995 (Zb = -6.66) with only 21 black-white or dissimilar joins. All of the New England, Rocky Mountain, and Pacific states have below national average obesity rates. Georgia is one of only a few notable outliers on the 1995 pattern. If Georgia's obesity rate was just a little higher, it would have been above the national mean, and the national pattern would have been even more clustered. Perhaps some additional insights might emerge if we took a closer look at obesity pattern changes within Georgia. Maybe the state obesity level was below the national mean back in

1995 because of the rapid growth in the Atlanta area (where obesity levels are generally lower than in the rest of Georgia). As the rapid growth in Atlanta has slowed a bit in recent years, perhaps this demographic change allowed Georgia to slip back above the national obesity level.

The least clustered of the three patterns (although still very highly clustered) is 2002 (Zb = -4.33). On the 2002 map, there are more notable outliers (Florida, Maine, Virginia, Minnesota, California, and Oregon), which produce more (32) dissimilar joins (table 15.2).

Are these "cluster shifts" significant in any practical policy context? That is, do these country-wide join count analyses direct us to any particular new obesity reduction plan? The answer is probably no, at least not directly. But the newly emerging outlier states that don't fit the general regional pattern are worth exploring further. In addition to the Georgia situation just discussed, we might hypothesize why the entire Rocky Mountain and Pacific census divisions are now less obese than the national average. What (if anything) distinguishes the people of the West from much of the rest of the country? Will these regional differences become stronger as time goes by?

TABLE 15.5
Results from the Join Count Analysis: Trends in Obesity Level Clustering: 1995, 2002, and 2009

Year   Zb      p-value
1995   -6.66   0.000
2002   -4.33   0.000
2009   -5.89   0.000

TABLE 15.4
Worktable for Join Count Analysis: Obesity in the Conterminous United States, 1995

H0: OBW = EBW (area pattern is random)
HA: OBW ≠ EBW (area pattern is not random)

number of "black" states: B = 26
number of "white" states: W = 22
N = B + W = 48     J = 106     ΣL(L - 1) = 856

    EBW = 2JBW / [N(N - 1)] = 2(106)(26)(22) / [48(47)] = 121,264 / 2,256 = 53.75

    σBW = √( EBW + ΣL(L - 1)BW / [N(N - 1)]
             + 4[J(J - 1) - ΣL(L - 1)]B(B - 1)W(W - 1) / [N(N - 1)(N - 2)(N - 3)] - EBW² )

        = √( 53.75 + (856)(26)(22) / [48(47)]
             + 4[(106)(105) - 856](26)(25)(22)(21) / [48(47)(46)(45)] - 53.75² )

        = √( 53.75 + 489,632 / 2,256 + 4[11,130 - 856](300,300) / 4,669,920 - 2,889.06 )

        = √( 53.75 + 217.04 + 2,642.58 - 2,889.06 ) ≈ 4.92

    Zb = (OBW - EBW) / σBW = (21 - 53.75) / 4.92 = -32.75 / 4.92 = -6.66   (p = 0.000)
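The worktable's arithmetic (equations 15.1 through 15.3 with the 1995 values from tables 15.3 and 15.4) can be checked with a few lines of code. This is an illustrative sketch, not the book's own procedure; small rounding differences from the printed intermediate values are expected.

```python
import math

# Sketch: non-free sampling join count test for the 1995 obesity pattern.
J = 106            # total joins among the 48 conterminous states
sum_LL1 = 856      # sum of L(L - 1) across all states (table 15.3)
B, W = 26, 22      # states above ("black") and below ("white") the national mean
N = B + W          # 48
O_bw = 21          # observed black-white (dissimilar) joins in 1995

# Equation 15.2: expected dissimilar joins in a random pattern
E_bw = 2 * J * B * W / (N * (N - 1))              # 53.75

# Equation 15.3: standard error of the expected dissimilar joins
variance = (E_bw
            + sum_LL1 * B * W / (N * (N - 1))
            + 4 * (J * (J - 1) - sum_LL1) * B * (B - 1) * W * (W - 1)
              / (N * (N - 1) * (N - 2) * (N - 3))
            - E_bw ** 2)
sigma_bw = math.sqrt(variance)                    # about 4.9

# Equation 15.1: Z test statistic
z = (O_bw - E_bw) / sigma_bw                      # about -6.7: highly clustered
```

A Z value this far below zero means the observed dissimilar-join count sits more than six standard errors under its random expectation, matching the worktable's conclusion of strong clustering.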

15.2 MORAN'S I INDEX (GLOBAL)

The Moran's Index (I), developed by Australian statistician Patrick Moran in 1948, is a popular technique for quantifying the level of spatial autocorrelation in a set of geographic areas, with a value assigned to each area. In contrast to nearest neighbor analysis, which examines patterns in the location of points, or the join count analysis, which only evaluates a binary areal representation of the data, Moran's Index takes into account the geographic locations (points or areas) in addition to the actual attribute values to determine if areas are clustered, randomly located, or dispersed across the overall study area. The index is positive when nearby locations have similar attribute values (clustered), negative when nearby geographic locations have dissimilar attribute values (dispersed), and approximately zero when attribute values are randomly dispersed throughout the study area.

Geographers have applied the Moran's Index in numerous research problems, such as exploring the spatial distribution of income, rates of disease, and the racial and ethnic composition of urban neighborhoods. In all applications, the objective of Moran's technique is to identify significant spatial patterns within a study area. This technique is appropriate for both point and polygon (area) features, and attribute values must be measured on an ordinal or interval/ratio scale.

Moran's Index proceeds by identifying each unique pair of neighboring locations in the study area. The relationship between neighboring locations is shown for a simplified study region comprising four areas as shown in figure 15.3. Each area is identified by a letter, and the magnitude of the area's attribute value appears in parentheses.

The general form of Moran's Index for areas is shown in equation 15.4, and the mathematical equivalent, as presented in Ebdon (1988), appears in equation 15.5:

    I = [(number of areas) x (sum of cross-products for all contiguous pairs (i, j))]
        / [(number of joins) x (variance of the area attribute values)]      (15.4)

    I = n Σ(xi - x̄)(xj - x̄) / [J Σ(x - x̄)²]      (15.5)

where: n = the number of areas
       x = an area attribute value
       x̄ = the mean of all area attribute values
       xi and xj = the values of contiguous pairs
       Σ(xi - x̄)(xj - x̄) = the sum of cross-products for all contiguous pairs
       J = the number of joins
       Σ(x - x̄)² = the variance of the attribute values

FIGURE 15.3
Areas and Attribute Values for Moran's I
[2 x 2 study region: B (80) and D (10) in the top row, A (70) and C (20) in the bottom row; Mean = (70 + 80 + 20 + 10) / 4 = 45]
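Equation 15.5 can be sketched in code using the four areas of figure 15.3. The join list below (rook-style adjacency for the 2 x 2 layout, including pair A-C) is an assumption made for illustration; only pairs AB, BD, and CD are discussed explicitly in the text.

```python
# Sketch of equation 15.5 for the four areas of figure 15.3.
# Assumed rook adjacency in the 2x2 layout: A-B, A-C, B-D, C-D.
values = {"A": 70, "B": 80, "C": 20, "D": 10}
joins = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

n = len(values)                                   # number of areas: 4
J = len(joins)                                    # number of joins
mean = sum(values.values()) / n                   # 45

# Sum of cross-products of deviations over all contiguous pairs
cross = sum((values[i] - mean) * (values[j] - mean) for i, j in joins)
# Sum of squared deviations of the attribute values
ss = sum((v - mean) ** 2 for v in values.values())

moran_I = (n * cross) / (J * ss)                  # near zero: mixed similar/dissimilar pairs
```

Under this assumed join structure the positive cross-products (AB, CD) and negative ones (AC, BD) nearly cancel, so the index lands close to zero, which is the behavior the surrounding discussion of cross-products describes.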

The deviation values for each pair of neighboring locations or contiguous pairs (i and j) are multiplied together and summed, Σ(x_i - x̄)(x_j - x̄), to create a weighted cross-product. When a neighboring pair of areas both have larger values than the study area mean, their cross-product will be positive (this is the case, for example, with pair AB in fig. 15.3). The cross-product of a neighboring pair will also be positive if both have smaller values than the study area mean (which is the case with pair CD). Positive cross-products therefore indicate that areas with similar attribute values (either high or low) are clustered. In addition, the larger the deviations from the study area mean, the larger the absolute magnitude of the cross-product.

Conversely, when one area of a neighboring pair is larger than the study area mean, while the other area is smaller, the cross-product will be negative (as with pair BD). A negative cross-product therefore indicates dispersion. Once again, the larger the deviations from the study area mean, the larger the absolute magnitude of the cross-product.

It is important to note that the Moran's Index of autocorrelation explicitly takes into account the locations of areas relative to one another. In the procedure as discussed here, the key factor is whether two areas are joined (making them a neighboring pair) or not. We are using simple adjacency weighting or contiguity to determine the degree of autocorrelation: only those areas sharing a common boundary are included in the calculations. For instance, areas A and C share a boundary and are included in the calculations, while areas A and D do not share a boundary and are not included.

Other spatial weighting schemes are available to calculate Moran's Index, but these are not shown here. One alternative method is inverse distance. With this option, all pairs of geographic areas are included in the analysis, but nearby areas carry more weight in the calculation. If this method were used in our analysis, the cross-product for areas B and C would heavily influence the Moran's Index value (by virtue of their closeness or proximity). By contrast, since areas A and D are relatively far apart, their cross-product value would be given a lighter weight.

To summarize, the numerator of equations 15.3 and 15.4 represents the sum of all contiguous or adjacent cross-products multiplied by the number of areas in the study region. A positive sum indicates that areas with similar values (either high or low) tend to be located nearby each other, which suggests spatial clustering among the subareas. A negative sum indicates that areas with high attribute values tend to be located near features with low attribute values, which indicates a dispersed pattern. Sometimes the positive cross-products are balanced by the negative cross-products and their overall sum will be close to zero, which indicates a spatially random pattern. The weighted cross-product numerator is divided by the variance and the sum of the spatial weights (represented as either "joins" or a distance-based weight). This rescales Moran's Index so that it ranges from -1.00 to 1.00. A Moran's Index value less than zero indicates a dispersed pattern of area attribute values, an Index near zero indicates a random pattern, and an Index greater than zero indicates a clustered pattern. Mathematically, the Moran's Index is similar to a correlation coefficient, as we shall see in the next chapter.

While Moran's technique provides an index for the level of spatial autocorrelation for a set of areas, the question becomes how different from zero (randomness) must the Index be in order to indicate significant dispersion or clustering? To answer this question, the Index is usually converted to a Z-score using equation 15.6. A p-value is then calculated which allows you to determine the significance of the difference between the observed Moran's Index and the corresponding expected Index value for a purely random distribution of values across the same number of areal units.

The null hypothesis is that the area values are arranged in a completely random spatial pattern, and that spatial autocorrelation is not present within the study area. The Z-score is computed as:

Z = (I - E_I) / √Var_I     (15.6)

where: E_I = -1 / (n - 1)

Moran's Index
Primary Objective: Identify significant spatial patterns within a study area
Requirements and Assumptions:
1. Minimum of 30 geographic features
2. Attribute values measured on an ordinal or interval/ratio scale
Hypotheses:
H0: Attribute values are randomly distributed across area features in the study area
HA: Attribute values are not randomly distributed across features in the study area
Test Statistic:
I = n Σ(x_i - x̄)(x_j - x̄) / (J Σ(x - x̄)²)
Interpretation (assuming a significant p-value):
I < 0: observed pattern is dispersed
I = 0: observed pattern is random
I > 0: observed pattern is clustered
Chapter 15 .A. Area Pattern Analysis 231
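The conversion in equation 15.6 can be sketched in a few lines of Python. This is an illustration only: the function names and the p-value helper are ours, not from the text, and the p-value assumes the Z-score follows a standard normal distribution.

```python
import math

def moran_z(I, n, var_I):
    """Convert an observed Moran's I to a Z-score (equation 15.6).

    E_I = -1 / (n - 1) is the expected index under the null
    hypothesis of complete spatial randomness.
    """
    e_I = -1.0 / (n - 1)
    return (I - e_I) / math.sqrt(var_I)

def two_tailed_p(z):
    """Two-tailed p-value for a Z-score, using the standard normal CDF."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Four-area example from figure 15.3: I = -0.21, n = 4, Var_I = 0.136
z = moran_z(-0.21, 4, 0.136)   # about 0.33
p = two_tailed_p(z)            # about 0.74 -> do not reject H0
```

With a p-value this large, the observed index cannot be distinguished from complete spatial randomness, which matches the conclusion drawn from the worktable later in the chapter.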

Similar to the free and non-free sampling techniques mentioned with join count analysis and binary data, the variance for Moran's I may be calculated under the assumption of normality or randomization. Normality is similar to free sampling in that the observed values of a variable are taken randomly from a normally distributed population, whereas randomization is similar to non-free sampling in that the values are taken directly from the data set. The variance under the normality assumption is computed as:

Var_I = (n²J + 3J² - n ΣL²) / (J²(n² - 1))

where: n = the number of objects
J = the number of joins, and
ΣL² = the sum of the squared number of joins for each individual area

Variance under the randomization assumption is computed as:

Var_I = [n(J(n² + 3 - 3n) + 3J² - n ΣL²) - k(J(n² - n) + 6J² - 2n ΣL²)] / [J²(n - 1)(n - 2)(n - 3)]

where: k represents the kurtosis for the variable x

If the p-value is not significant, then you should not reject the null hypothesis, as we cannot confidently say that the observed pattern is different from complete spatial randomness. This conclusion is equivalent to a statement that the study area does not exhibit significant spatial autocorrelation. However, if the p-value is significant and the Z-score is positive, we can confidently reject the null hypothesis and say that similar area attribute values (either high or low) are clustered in the study area. If the p-value is significant and the Z-score is negative, we

TABLE 15.6
Worktable for Moran's I: Four-area Example Pattern

H0: no spatial autocorrelation in the data
HA: spatial autocorrelation exists

Joins and area values:

Area   L    L²     x    (x - x̄)   (x - x̄)²
A      2    4     70      25         625
B      3    9     80      35       1,225
C      3    9     20     -25         625
D      2    4     10     -35       1,225

Joins (J) = 5    ΣL² = 26    x̄ = 180/4 = 45    Σ(x - x̄)² = 3,700

Join     x_i   (x_i - x̄)   x_j   (x_j - x̄)   (x_i - x̄)(x_j - x̄)
number
1         70       25       80       35               875
2         70       25       20      -25              -625
3         20      -25       80       35              -875
4         80       35       10      -35            -1,225
5         10      -35       20      -25               875
                                           SUM =     -975

E_I = -1/(n - 1) = -1/(4 - 1) = -0.33

I = n Σ(x_i - x̄)(x_j - x̄) / (J Σ(x - x̄)²) = 4(-975) / 5(3,700) = -0.21

Var_I = (n²J + 3J² - n ΣL²) / (J²(n² - 1)) = ((4² × 5) + (3 × 5²) - (4 × 26)) / (5²(4² - 1)) = 51/375 = 0.136

Z = (-0.21 - (-0.33)) / √0.136 = 0.326

p = 0.74
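The worktable arithmetic can be verified with a short Python sketch. The area labels, attribute values, and join list come directly from figure 15.3; the variable names are ours.

```python
# Verification of the table 15.6 worktable: four areas, five joins.
values = {"A": 70, "B": 80, "C": 20, "D": 10}
joins = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]

n = len(values)
J = len(joins)
mean = sum(values.values()) / n                            # 45
total_var = sum((v - mean) ** 2 for v in values.values())  # 3,700

# Weighted cross-product: sum of (x_i - mean)(x_j - mean) over all joins
cross = sum((values[i] - mean) * (values[j] - mean) for i, j in joins)  # -975

I = (n * cross) / (J * total_var)                          # about -0.21

# Variance under the normality assumption; L counts the joins touching each area
L = {a: sum(a in pair for pair in joins) for a in values}
sum_L2 = sum(l ** 2 for l in L.values())                   # 26
var_I = (n**2 * J + 3 * J**2 - n * sum_L2) / (J**2 * (n**2 - 1))  # 51/375 = 0.136
```

Each intermediate quantity reproduces the corresponding entry in the worktable, which is a useful sanity check before applying the index to a real data set.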

can confidently reject the null hypothesis and say that the area attribute values are more dispersed across the study area than what we would expect by random chance.

The Moran's I computation procedure is presented in table 15.6 using the simplified study area having four subareas and five joins, as shown in figure 15.3. Calculations are only shown using the normality assumption. In this example, a Moran's I of -.21 and a p-value of .74 indicates that we do not have enough evidence to reject the null hypothesis and should therefore conclude that the study area does not exhibit significant spatial autocorrelation.

Example: Moran's I Index (Global) — Distribution of Racial and Ethnic Groups in Cleveland, Ohio Census Block Groups

The Moran's Index is applied at the census block group level to explore the spatial distribution of racial and ethnic groups within Cleveland, Ohio (fig. 15.4). The objective is to determine the extent to which various racial and ethnic populations are dispersed, randomly located, or clustered. Such a concern is important because while progress has been made toward providing equal access to housing and other important public services for all racial and ethnic groups, historical patterns of segregation still exist, particularly in older, industrial American cities. Among other concerns, urban geographers have demonstrated a strong link between patterns of segregation and higher rates of crime and poverty.

Three racial groups and one ethnic group in Cleveland, Ohio are selected to assess the magnitude of spatial autocorrelation using Moran's Index. As reported in the 2010 Census, Cleveland is racially diverse, with 53.3% of the population reporting their race as black, 33.4% white (non-Hispanic), 1.8% Asian, and 10.0% reporting their ethnicity as Hispanic/Latino. A global Moran's analysis is applied to each group to determine the degree to which areas with similar (either high or low) racial/ethnic composition cluster together across the city. Simple adjacency or contiguity is the method used to determine the magnitude of spatial autocorrelation. The null hypothesis in each case is that the values representing the percentage of each racial/ethnic group are arranged in a completely random spatial pattern. The Moran's indices and test statistic results for each of the four racial/ethnic groups are shown in table 15.7.

FIGURE 15.4
Distribution of Various Racial and Ethnic Groups by Census Block Group in Cleveland, Ohio
[Four maps — Asian, Black, White, Hispanic — classed as 0.0% to 25.0%, 25.1% to 50.0%, 50.1% to 75.0%, and 75.1% to 100.0%]
Source: United States Bureau of the Census

TABLE 15.7
Moran's I Values and Test Statistics for Selected Racial and Ethnic Groups in Cleveland, Ohio Census Block Groups

Group       Moran's Index      Z      p-value
Asian           0.35         18.39     0.000
Black           0.58         64.6      0.000
White           0.64         65.4      0.000
Hispanic        0.81         65.7      0.000

In this table, the groups are ranked in increasing magnitude on the Moran's index (that is, increasing levels of clustering since all of the Z values are positive). The distribution of Asians is the least clustered. With an index value of 0.35, we can ask whether that value differs significantly from the expected value of -0.002 for a purely random distribution across the same number of census block groups. The resulting test statistics for percent Asian population (Z = 18.39, p = .000) indicate that you can reject the null hypothesis with near 100% certainty and conclude that the spatial pattern of Asians in Cleveland is non-random. Indeed, with large positive test statistics and p-values of .000, we can conclude that all groups are significantly more clustered than random. In summary, we can conclude that a high level of residential segregation (and spatial autocorrelation) currently exists in Cleveland.

Strongly stated inferential conclusions in this type of problem should be made with extreme care. The underlying null hypothesis of complete spatial randomness is very restrictive. Arguably, almost any spatial distribution that would be of interest to a geographer will show some measurable level of spatial autocorrelation. A more reasonable approach to the interpretation of these results is to consider them descriptive summaries of the patterns within the study area. Rather than a strict inferential interpretation, it might be better to focus on a relative comparison of the magnitude of the Moran's indices across the four racial/ethnic groups. Such a comparison of Moran's values across groups is valid in this instance because all analyses are conducted for the same study area, using similar variables (all related to racial/ethnic composition), and making the same assumptions regarding adjacency and contiguity when creating hypotheses.

15.3 MORAN'S I INDEX (LOCAL)

In addition to looking at global measures of spatial autocorrelation, geographers may also be interested in assessing whether spatial autocorrelation exists at a local level. As a global indicator of spatial autocorrelation, Moran's I may miss certain subtleties within a geographic dataset. That is, while the overall spatial arrangement might indicate that no global spatial autocorrelation exists, there may in fact be pockets (typically called hotspots) of spatial autocorrelation not picked up by the global measure. To resolve this issue, geographers have adapted global methods of spatial autocorrelation like Moran's I to include a local component. This concept, described by geographer Luc Anselin (1995) and others, is termed Local Indicators of Spatial Association (LISA). LISA statistics quantify the similarity of each geographic observation with an identified group of geographic neighbors, based on one of the geographic weighting schemes discussed earlier. It is often used to identify local clusters (geographic locations where adjacent or nearby areas have similar values) or spatial outliers (geographic locations that are different from adjacent or nearby areas). Therefore, rather than a single measure of spatial autocorrelation, the LISA methods determine individual measures for each geographic entity as follows:

I_i = [(x_i - x̄) / S_i²] Σ_{j=1, j≠i}^{n} w_ij (x_j - x̄)     (15.7)

where: S_i² = Σ_{j=1, j≠i}^{n} (x_j - x̄)² / (n - 1)
x_i = the value for a particular geographic entity
x_j = the value for the neighboring geographic entity
x̄ = the average of all attributes
w_ij = the spatial weights

The overall computation of the local Moran index is beyond the scope of this book, but a short example will help illustrate its usefulness.

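Equation 15.7 can also be sketched in Python. This is a minimal illustration assuming simple binary contiguity (w_ij = 1 for areas sharing a boundary, 0 otherwise), applied to the four-area example of figure 15.3; the function and variable names are ours, not from the text.

```python
# Sketch of the local Moran's I_i of equation 15.7 with binary weights:
#   I_i = (x_i - mean) / S_i^2 * sum over neighbors j of (x_j - mean)
# where S_i^2 = sum over j != i of (x_j - mean)^2, divided by (n - 1).
def local_moran(values, neighbors):
    n = len(values)
    mean = sum(values.values()) / n
    result = {}
    for i, x_i in values.items():
        s2 = sum((x_j - mean) ** 2
                 for j, x_j in values.items() if j != i) / (n - 1)
        lag = sum(values[j] - mean for j in neighbors[i])  # w_ij = 1 for neighbors
        result[i] = (x_i - mean) / s2 * lag
    return result

# Four-area example from figure 15.3 (A joins B and C; B and C join everything)
values = {"A": 70, "B": 80, "C": 20, "D": 10}
neighbors = {"A": ["B", "C"], "B": ["A", "C", "D"],
             "C": ["A", "B", "D"], "D": ["B", "C"]}
I_local = local_moran(values, neighbors)
# A positive I_i flags an area similar to its neighbors (local cluster);
# a negative I_i flags a local outlier.
```

Unlike the single global index, this returns one value per area, which is what allows maps such as figure 15.6 to highlight individual counties.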
Example: Moran's I Index (Local) — Distribution of Obesity in Pennsylvania Counties

Consider the pattern of obesity levels in Pennsylvania counties shown in figure 15.5. You can definitely see a pattern of high obesity in the southwest corner of the Commonwealth, while the southeastern corner has lower obesity values. However, due to the overall configuration, the Moran's I statistic for the Commonwealth is a single measure, computed as .69 with a p-value of .25, indicating no spatial autocorrelation at the global (statewide) level. However, computation of the local Moran's I value provides a p-value for each county. Figure 15.6 shows the counties with a Z-score above 2.0 or less than -2.0 (p = 0.00). Those counties with high positive Z-scores indicate places whose obesity level is very similar to neighboring counties, suggesting local clustering. It is not all that surprising to see the metropolitan Philadelphia counties highlighted as an urban region with a cluster of low obesity level counties. A pair of adjacent counties east of Pittsburgh (Cambria County and Blair County) with smaller urban areas (Johnstown and Altoona, respectively) is also identified as very similar to neighboring counties in a region of high obesity levels. Finally, Fayette County in southwest Pennsylvania has a large negative Z-score indicating that it is an outlier in obesity level when compared to its neighboring counties. This pattern of local spatial autocorrelation values should raise other basic geographic questions about why the pattern has the specific distribution and variation characteristics shown on the two maps.

FIGURE 15.5
Distribution of Obesity Levels in Pennsylvania Counties, 2010
[County-level map classed as less than 25.0%, 25.0% to 26.9%, 27.0% to 28.9%, 29.0% to 31.9%, and greater than 32.0%]
Source: Centers for Disease Control and Prevention (CDC)

FIGURE 15.6
Pennsylvania Counties with Significant Local Spatial Autocorrelation
[County-level map; scale 0–100 kilometers]

KEY TERMS

area pattern analysis, 222
free and non-free sampling, 231
join count analysis, 223
Moran's I index (global and local), 229, 233

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:
1. Understand the purposes of area pattern analysis.
2. Define and explain join structure and the join count statistic in area pattern analysis.
3. Distinguish between the contexts or problem settings in which the hypotheses of free sampling and non-free sampling are best used.
4. Explain the basic objectives of Moran's index of autocorrelation.
5. Understand the different goals of global and local autocorrelation.

REFERENCES AND ADDITIONAL READING

Anselin, Luc. "Local Indicators of Spatial Association—LISA." Geographical Analysis, Vol. 27, No. 2 (1995): 93–115.
Bailey, T. C. and A. C. Gatrell. Interactive Spatial Data Analysis. London: Longman, 1995.
Ebdon, D. Statistics in Geography: A Practical Approach. Oxford: Basil Blackwell, 1985.
Griffith, D. and C. Amrhein. Statistical Analysis for Geographers. Englewood Cliffs, NJ: Prentice Hall, 1991.
Lee, J. and D. W. S. Wong. Statistical Analysis with ArcView GIS. New York: Wiley, 2001.
Mitchell, A. The ESRI Guide to GIS Analysis. Vol. 2: Spatial Measurements and Statistics. Redlands, CA: ESRI Press, 2005.
Ripley, B. Spatial Statistics. New York: Wiley, 1981.

For state-level obesity data and county-level obesity data (for Pennsylvania or any other state), see the Centers for Disease Control and Prevention at www.cdc.gov. For spatial distributions of racial and ethnic groups by census block group or by census tract for Cleveland, Ohio or any other metropolitan area, see the U.S. Bureau of the Census at www.census.gov.
PART VI

STATISTICAL RELATIONSHIPS BETWEEN VARIABLES
Correlation

16.1 The Nature of Correlation
16.2 Association of Interval-Ratio Variables
16.3 Association of Ordinal Variables

One of the more important concerns in geographic analysis is the exploration and analysis of relationships between variables that have spatial patterns and variability. Many geographic studies get started by noting the similarities and differences between two mapped variables and then measuring the statistical degree of relationship between the sets of data from which the map patterns were drawn. However, just making a visual comparison of maps to estimate their association is subjective because only a general impression of similarity or contrast between variables is gained. Two people can view the same maps or examine two sets of data and interpret their association very differently.

Correlation analysis provides a precise, quantitative set of statistical methods to measure both the direction and strength of association between a pair of spatial variables. In section 16.1, the nature of correlation analysis is discussed. The scatterplot is highlighted as a graphic tool used to examine both the direction and strength of association. We have already used scatterplots to gain insights about the nature of relationship between variables in different geographic problems. For example, in chapter 1 we looked at the association between life expectancy and total health care expenditures per capita for many of the world's countries. The scatterplot nicely illustrated that the United States spends much more per capita on health care, but without any associated lengthening of life expectancy. Just by viewing the scatterplot, we were also able to gain insights about a possible "life expectancy transition" that seems to occur as a country goes through the economic development and modernization process. In chapter 10 we again used a scatterplot to look at the relationship between state obesity level in 2000 and change in obesity level from 2000 to 2010. Although the relationship did not seem very strong (the points were rather "scattered" in the scatterplot), many of the states in the South and Appalachia that had high obesity levels in 2000 continued to increase their obesity levels from 2000 to 2010. In section 16.1, the concepts of direction of relationship and strength of association are both further clarified using simple scatterplot illustrations.

The most widely used index of correlation, Pearson's correlation coefficient, is applied only to interval-ratio data and discussed in section 16.2. To illustrate the calculation process, we return to the example first presented in chapter 1 hypothesizing a relationship between the date of last spring frost and latitude for a random sample of weather stations in the southeastern United States. A second geographic example of Pearson's index measures the direction and strength of relationship between immigrant growth rates and cost of living index (COLI) values in the largest metro areas of the United States.

Other geographic studies involve the use of ordinal or rank-order data. Section 16.3 presents Spearman's correlation index—the most frequently used coefficient for measuring the direction and strength of association between two ordinal variables. In a "real-data" example illustrating the Spearman's calculation procedure, we explore the relationships among various factors used to rank "America's top states for business" and look for nonrandom patterns among those factors.

240 Part VI .A. Statistical Relationships between Variables

16.1 THE NATURE OF CORRELATION

We have repeatedly seen that geographic investigation often begins with a graphic display of the data, such as a scatterplot. We have also learned that both direction and strength of association between variables can be viewed and roughly estimated by looking at a scatterplot. To clarify further, if the general trend of points is from lower left to upper right (fig. 16.1, case 1), the direction of association is positive. In a positive or direct relationship, a larger value in one variable generally corresponds to a larger value in the second variable. Alternatively, a smaller value in the first variable usually coincides with a smaller value in the second variable. Such a correspondence will result in a positive correlation.

In geography, a positive correlation is found with many variables related to population size. For example, the association between population size and number of retail functions in a sample of settlements is very likely to exhibit a positive or direct relationship. As demonstrated in central place theory, settlements with more people at higher levels of the urban hierarchy generally contain more retail establishments.

The direction of relationship is negative if the general trend of points in the scatterplot is from upper left to lower right (fig. 16.1, case 2). When two variables have a negative or inverse correlation, a larger value in the first variable is generally associated with a smaller value in the second variable and a smaller value in the first variable usually corresponds with a larger value in the second variable.

A negative correlation between two variables can be illustrated by applying the general principle of "distance decay." In such relationships, some phenomenon or idea declines with increasing distance from a source or origin. For example, using pollution data from a sample of monitoring sites, the level of air pollution and distance downwind from the pollution source would probably exhibit a negative or inverse relationship. The further downwind from a pollution point, the more likely the level of exposure to the pollutant will be lower.

Some variables may not exhibit either a positive or negative relationship. If the pattern of points in the scatterplot is random (fig. 16.1, case 3), little or no association exists between the two variables. In such examples, the values of one variable are not associated in any systematic way with the values of the other. This is sometimes referred to as a neutral relationship.

FIGURE 16.1
Generalized Scatterplots Showing Directional Relationships
[Case 1: Positive; Case 2: Negative; Case 3: Neutral]

In geographic research, variables may show no relationship or pattern on a scatterplot. Suppose you are studying housing characteristics in a city and want to determine if an association exists between house size (square feet of living area) and the age of the home. It is possible that no significant relationship exists between these two variables. Depending on continuing complex changes in the housing market and constantly changing demographics in the population, there may be no overriding rationale for expecting a correlation between house size and age of unit.

The strength of association between two variables is roughly estimated by examining the amount of point spread in a scatterplot (fig. 16.2). If the point pattern is tightly packed, as in a "pencil-shaped" pattern, the relationship is said to be strong (case 2). However, if the points are more widely spread or dispersed, as in a "football-shaped" pattern, the association is moderate to weak (case 3). The strongest relationship, or perfect correlation, occurs when the points are in a straight line or linear pattern (case 1) and the weakest association occurs when the points are distributed with very little or no pattern, as in a circular arrangement (case 4).

Any two variables can be correlated, and the strength and direction of relationship calculated. How-
Chapter 16 .A. Correlation 241

ever, extreme caution must be used when evaluating or interpreting correlations. The existence of a relationship or association between variables does not necessarily imply that one variable is the "cause" and the other variable is the "effect" or "result" of that cause. One of the most famous statistical phrases you will hear is "correlation does not imply causation." Oftentimes, researchers erroneously assume that if two variables have a high correlation, then they must have a functional relationship that makes them dependent upon one another.

However, in many examples of correlation (and in all cases when applying the companion technique of regression, which follows in the next chapter) the two variables play different roles. The variable of interest that you hope to explain or predict should be placed on the Y axis. This Y variable is sometimes called the response (or dependent) variable. The explanatory, predictor (or independent) variable should be placed on the X axis. The X variable is used to account for, "explain," or "predict" variation in the Y variable.

When we constructed the scatterplot showing the relationship between life expectancy and total health care expenditures per capita for selected countries (fig. 1.5), we made certain that life expectancy values were plotted on the Y axis, as this was the response or dependent variable whose variation we were trying to account for or explain. Total health care expenditures per capita for each country were plotted on the X axis, as this was the explanatory, predictor, or independent variable being used to explain that life expectancy value.

Correlation treats the X and Y variables symmetrically. That is, the magnitude of correlation of X with Y is the same as the magnitude of correlation of Y with X. However, the companion technique with correlation, regression, takes the statistical analysis further, making predictions or estimates about the magnitude of the response variable given the value of the explanatory or predictor variable. This extra step necessitates distinguishing between the roles of the X (causal) and Y (effect) variables.

As a humorous "cause-effect" example, a correlation analysis may indicate there is a strong negative correlation between the number of ice cream cones eaten in a community and the number of flu cases. That is, when a large quantity of ice cream cones is eaten, the area seems to have fewer cases of flu. If one assumes that correlation does imply causation, we might conclude that the best way to fight the flu is to eat more ice cream cones! Obviously, another confounding factor is the weather. In the summer, when the seasonal effects of flu are minimal, more people eat ice cream because it is a refreshing treat on a hot day. In the winter, when more people catch the flu, it is often too cold to enjoy ice cream, so less ice cream is consumed. Therefore, it is erroneous to conclude that ice cream consumption causes a reduction of flu cases.

While our ice cream and flu example is obvious, other correlations may not be, such as a correlation of toxic release inventory locations and minority housing, or the application of a single fertilizer on an agricultural plot and the volume of crop production. We must be careful to assess properly the relationship between two variables before attributing any influence of one variable on another.

Statisticians have defined various indices, called "correlation coefficients," to measure the direction and strength of relationships. Most of these coefficients are constructed to have a maximum value of +1.0, which indicates perfect positive or direct correlation between variables. A minimum value of -1.0 represents a perfect negative or inverse correlation, and a value of 0.0 denotes the total lack of correlation between variables. More precisely than a scatterplot, a correlation coefficient can indicate both the direction of the relationship (positive or negative) and the strength of association.

As in other areas of inferential statistical analysis, the level of measurement (nominal, ordinal, or interval-ratio) largely determines which index of correlation is applied to a problem. The most commonly used correlation coefficients are Pearson's product-moment for interval or ratio data and Spearman's rank-order for ordinal or ranked data. As we discussed in chapter 12, indices that measure association of nominal (categorical) data, such as measuring the strength of association or "cross-tabulation" in a contingency table, are also important.

FIGURE 16.2
Generalized Scatterplots Showing Strength of Association
[Case 1: Perfect association; Case 2: Strong association; Case 3: Weak association; Case 4: No association]

16.2 ASSOCIATION OF INTERVAL-RATIO VARIABLES

The most powerful and widely used index to measure the association or correlation between two variables is Pearson's product-moment correlation coefficient. In fact, when the term "correlation" appears in the literature, it is often assumed that the writer is referring to Pearson's correlation index. To use this statistical measure, data must be of interval or ratio scale. It is also assumed that the variables have a linear relationship. In addition, if the index is used in an inferential rather than descriptive manner, both variables should be samples derived from normally distributed populations.

Pearson's correlation coefficient relates closely to the statistical concept of covariation: the degree to which two variables "covary" (vary together or jointly). If the values of the two variables covary in a similar manner, the data contain a large covariation, and the two variables will show strong correlation. Alternatively, if the paired values of the variables show little consistency in how they covary, the correlation will be weak.

The concept of covariation and its relationship to correlation can be better understood by comparing a set of scatterplots, each having four quadrants produced from the mean values of the X (horizontal) and Y (vertical) variables (fig. 16.3). In each diagram, the mean values and total variation in the two variables are held constant. This allows a direct comparison of the relative covariation, which differs on each plot.

To understand covariation, we should first look at the deviations of X (X − X̄) and Y (Y − Ȳ) from their respective means. These deviations are the basic building blocks of standard deviation and standardized scores. The X and Y deviations of each data value (matched pair) are multiplied together and summed for the set of values to produce

    CV_XY = Σ(X − X̄)(Y − Ȳ)    (16.1)

where CV_XY = covariation between X and Y
      (X − X̄) = deviation of X from its mean (X̄)
      (Y − Ȳ) = deviation of Y from its mean (Ȳ)

Mathematically, covariation is analogous to the important concept of total variation, the sum of the squared deviations from the mean, an integral component of analysis of variance and regression:

    TV_X = Σ(X − X̄)(X − X̄) = Σ(X − X̄)²    (16.2)

    TV_Y = Σ(Y − Ȳ)(Y − Ȳ) = Σ(Y − Ȳ)²    (16.3)

where TV_X = total variation in X
      TV_Y = total variation in Y

The point patterns in figure 16.3 represent different examples of covariation. In case 1, virtually all points are in quadrants I and III. In quadrant I, the deviations of X (X − X̄) and Y (Y − Ȳ) are both positive since each X and Y value is greater than its respective mean. In quadrant III, where most other points in case 1 are located, the X and Y deviations are both negative since all X and Y values are less than their respective means. Because the X and Y values for each point in quadrants I and III covary in the same direction from their means, the product of the two deviations will produce a positive result for each point. When these individual products are summed according to equation 16.1, the resultant value will be large and positive. Thus, case 1 represents an example of a large positive covariation. As seen earlier, this scatterplot also corresponds to a positive or direct correlation between the two variables.

FIGURE 16.3
Generalized Scatterplots Showing the Relationship of Covariation to Correlation
(Case 1: high positive covariation, with points concentrated in quadrants I and III; case 2: high negative covariation, with points concentrated in quadrants II and IV; case 3: low covariation, with points dispersed across all four quadrants)
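Equations 16.1 through 16.3 can be sketched in a few lines of Python. The function names and the small data set below are invented for illustration only:

```python
def covariation(x, y):
    """CV_XY = sum of (X - Xbar)(Y - Ybar) over all pairs (equation 16.1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

def total_variation(v):
    """TV = sum of squared deviations from the mean (equations 16.2 and 16.3)."""
    v_bar = sum(v) / len(v)
    return sum((vi - v_bar) ** 2 for vi in v)

# A made-up data set in which Y tends to rise with X,
# so the covariation is large and positive (compare case 1 in figure 16.3).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 7]
print(round(covariation(x, y), 1))    # 10.0
print(round(total_variation(x), 1))   # 10.0
print(round(total_variation(y), 1))   # 13.2
```

Reversing the order of y (so large X pairs with small Y) would flip the sign of the covariation, matching case 2 in figure 16.3.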
Chapter 16 ▲ Correlation 243

The situation is different for the plot in figure 16.3, case 2, where most points lie in quadrants II and IV. In quadrant IV, deviations in X are positive (X values are greater than X̄). However, deviations in Y are negative because values of Y are less than Ȳ. A reversed situation occurs in quadrant II, where deviations in X are negative (X values are less than X̄) and deviations in Y are positive. In case 2, the product of the deviations will be negative because the values of the X and Y variables in both quadrants II and IV covary in opposite directions from their means. When the products from the X and Y deviations are summed, the resultant covariation will be large and negative. The example of covariation in case 2 represents two variables that have a large inverse correlation.

In the third diagram (fig. 16.3, case 3) points are scattered in a random pattern, with nearly equal dispersal of points in each quadrant. Some of the X and Y deviations will be positive and others negative. When the deviations are multiplied together for each unit of data, both positive and negative products will occur. When these products are summed for all points, the values generally cancel each other out and produce a covariation close to zero. This low covariation corresponds to a very small correlation, suggesting little or no relationship between the two variables.

Pearson's correlation coefficient (r) can be expressed mathematically in several different ways:

1. with deviations from the mean and standard deviations (equation 16.4)
2. with X and Y values transformed to Z-scores (equation 16.5)
3. with the original values of the X and Y variables (equation 16.6)

All of these formulas produce an equivalent result. Of course we usually calculate correlations using a spreadsheet or computer software program; however, these alternatives are shown to give you further insights into the calculation procedures associated with correlation.

Based on the conceptual definition, Pearson's correlation is expressed as the ratio of the covariance in X and Y to the product of the standard deviations of the two variables:

    r = [Σ(X − X̄)(Y − Ȳ) / N] / (S_X S_Y)    (16.4)

where r = Pearson's correlation coefficient
      N = number of paired data values
      S_X, S_Y = standard deviation of X and Y, respectively

If the data are standardized or converted to Z-score form (section 6.1), an alternative formula exists to calculate the Pearson's correlation coefficient:

    r = Σ(Z_X Z_Y) / N    (16.5)

where Z_X = X variable transformed to Z-score = (X − X̄) / S_X
      Z_Y = Y variable transformed to Z-score = (Y − Ȳ) / S_Y
      N = number of paired data values

Using equation 16.5, the correlation between two variables is equal to the sum of the product of Z-scores for each data value divided by the number of paired values. This formula is valid because Z-scores take into account deviations from the mean and the standard deviation.

If one wishes to use the original values of the X and Y variables directly, a computational formula is used to derive Pearson's correlation coefficient:

    r = [ΣXY − ((ΣX)(ΣY) / N)] / [√(ΣX² − (ΣX)² / N) · √(ΣY² − (ΣY)² / N)]    (16.6)

Although this formula appears more complex than the previous equations, it uses only the original data, and no prior calculation of means, standard deviations, or deviations from the mean is needed.

Pearson Correlation Analysis

Primary Objective: Determine if an association exists between two variables

Requirements and Assumptions:
1. Random sample of paired variables
2. Variables have a linear association
3. Variables are measured at interval or ratio scale
4. Variables are bivariate normally distributed

Hypotheses:
H0: ρ = 0
HA: ρ ≠ 0 (two-tailed)
HA: ρ > 0 (one-tailed) or
HA: ρ < 0 (one-tailed)

Test Statistic:

    t = r√(n − 2) / √(1 − r²)

In addition to its use as a descriptive index of the strength and direction of association, correlation can also be used to infer results from a sample to a population. The sample correlation coefficient (r) is the best estimator of the population correlation coefficient (ρ). In this
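The equivalence of equations 16.4 through 16.6 is easy to demonstrate in Python. The data set below is invented for illustration, and the helper functions follow the chapter's convention of dividing by N (not N − 1) in the standard deviations:

```python
from math import sqrt

def pearson_deviations(x, y):
    """Equation 16.4: covariance divided by the product of standard deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
    return cov / (s_x * s_y)

def pearson_zscores(x, y):
    """Equation 16.5: mean of the products of paired Z-scores."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / n

def pearson_computational(x, y):
    """Equation 16.6: computational formula using only raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sx2 = sum(xi ** 2 for xi in x)
    sy2 = sum(yi ** 2 for yi in y)
    return (sxy - sx * sy / n) / (sqrt(sx2 - sx ** 2 / n) * sqrt(sy2 - sy ** 2 / n))

x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [3.0, 5.0, 4.0, 9.0, 11.0]
r1 = pearson_deviations(x, y)
r2 = pearson_zscores(x, y)
r3 = pearson_computational(x, y)
assert abs(r1 - r2) < 1e-9 and abs(r2 - r3) < 1e-9  # all three formulas agree
print(round(r1, 3))  # 0.921
```

The computational version never forms the means or deviations explicitly, which is why it was historically favored for hand and calculator work.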

application, the null hypothesis states that no correlation exists in the populations of the two variables (H0: ρ = 0). Since the populations of the variables are assumed to have a linear association (see boxed insert), the null hypothesis of no correlation is equivalent to stating that X and Y are independent.

As in most other inferential tests, the alternate hypothesis (HA) can be stated as directional (one-tailed) or non-directional (two-tailed). If you use a one-tailed approach, you should have a logical basis for expecting the correlation to be either positive or negative. When no rationale exists for the direction of correlation between two variables, a two-tailed alternate hypothesis should be used.

The most common test statistic is

    t = r / s_r    (16.7)

where s_r = standard error of the correlation estimate

    s_r = √[(1 − r²) / (n − 2)]    (16.8)

The test statistic for Pearson's r can therefore be rewritten as

    t = r√(n − 2) / √(1 − r²)    (16.9)
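Equations 16.7 through 16.9 translate directly into a short Python check. The values r = 0.849 and n = 76 are taken from the frost-date example later in this chapter:

```python
import math

def standard_error(r, n):
    """Standard error of the correlation estimate (equation 16.8)."""
    return math.sqrt((1 - r ** 2) / (n - 2))

def t_statistic(r, n):
    """Test statistic for Pearson's r (equation 16.9)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.849, 76
t = t_statistic(r, n)
# Equations 16.7 and 16.9 are algebraically identical: t = r / s_r
assert abs(t - r / standard_error(r, n)) < 1e-12
print(round(t, 2))  # 13.82
```

With 74 degrees of freedom, a t value near 13.8 lies far in the tail of the t distribution, which is why the example reports p = 0.000.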

Example: Pearson's Correlation Coefficient - Strength of Relationship between Date of Last Spring Frost and Latitude in Southeastern United States

Pearson's product-moment correlation coefficient is now applied to one of the research hypotheses presented in chapter 1. After looking at the spatial pattern of average spring frost dates for a sample of seventy-six weather stations distributed across the southeastern United States (fig. 1.6), we suggested the following hypothesis: "Weather stations with more northerly latitudes (further from the equator) have average last spring frost dates later in the spring than weather stations with less northerly latitudes (closer to the equator)." In general, the pattern shown in figure 1.6 seems to suggest that latitude is associated with average last spring frost date, but it is impossible to tell how strong that association might be just by looking at the map. Pearson's correlation provides us with the direction of association (positive or negative), the strength of association (from +1.0 to -1.0), and the associated p-value indicating the probability that this association is statistically significant.

FIGURE 16.4
Scatterplot Showing Relationship Between Average Date of Last Spring Frost and Latitude, Selected Weather Stations in the Southeastern U.S.
(X axis: northern latitude in decimal degrees, 29.0° to 38.0°; Y axis: average date of last spring frost in Julian calendar days, 40 to 120)

If the data meet the requirements and assumptions for Pearson's correlation, we are now in a position to test this hypothesis statistically. Both variables, latitude and average date of last spring frost, are measured on an interval-ratio scale, and we have a random sample of weather stations from this region. We tested samples of both variables (using the Kolmogorov-Smirnov normality test) and concluded statistically that they had been taken from normally distributed populations. Finally, we looked at a scatterplot of latitude and average date of last spring frost for these seventy-six weather stations, and concluded that the form or nature of the relationship appears to be linear (fig. 16.4). It seems that we have met all the requirements and assumptions for running a Pearson's correlation.

Notice that the scatterplot displays latitude on the X axis (as the explanatory or independent variable) and average date of last spring frost on the Y axis (as the response or dependent variable). The date of average last spring frost is expressed using a Julian date calendar (for non-leap years) rather than a regular calendar. In a Julian date calendar, January 1 is day 1, February 1 is day 32, March 1 is day 60, and so on. This conversion to Julian date is done to allow the computation of summary statistics such as mean and standard deviation.

The calculation procedure for Pearson's correlation is summarized in table 16.1. (Not all 76 weather stations are listed in the table, but a complete listing is available on the CD that accompanies this book.) We are showing the computational formula that uses the original data directly, so you can best see how the correlation coefficient is derived. Of course, any of the mathematical expressions for Pearson's could be used, and the resulting correlation coefficient would be the same.

The resulting r value of 0.849 indicates that a strong positive association exists between latitude and average date of last spring frost and that the association is statistically significant (p = 0.000). However, we often want to know more about the form or nature of this association. For example, how useful (precise) is latitude in predicting the average date of last spring frost? Given the latitude of a weather station in the southeastern United States (but not among the 76 in this random sample), can we predict the date of average last spring frost at that weather station, and estimate the precision of that prediction? These questions will be answered in the next chapter, which deals with simple linear regression.
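The Julian-day conversion described above can be reproduced with Python's standard library. The example dates match the text (and the Troy, AL footnote of table 16.1, where day 74 corresponds to March 15); the year 2001 is used only because it is a non-leap year:

```python
from datetime import date

def julian_day(month, day, year=2001):
    """Day-of-year in a non-leap year: Jan 1 = 1, Feb 1 = 32, Mar 1 = 60, ..."""
    return date(year, month, day).timetuple().tm_yday

print(julian_day(1, 1))   # 1
print(julian_day(2, 1))   # 32
print(julian_day(3, 1))   # 60
print(julian_day(3, 15))  # 74
```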

TABLE 16.1
Worktable for Correlation: Latitude and Average Date of Last Spring Frost

H0: no association exists between latitude and average date of last spring frost
HA: a positive association exists between latitude and average date of last spring frost

                                  Xi *        Yi **
                                          Average date of
     Weather station       Latitude   last spring frost       Xi²         Yi²        XiYi
 1   Troy, AL               31.80°         74.41          1,011.24    5,536.85    2,366.24
 2   Union Springs, AL      32.02°         82.27          1,025.28    6,768.35    2,634.29
 3   Talledega, AL          33.42°         97.83          1,116.90    9,570.71    3,269.48
 ...
 76  Farmville, VA          37.33°        110.62          1,393.53   12,236.78    4,129.44

 N = 76   ΣXi = 2,578.31   ΣYi = 6,797.59   ΣXi² = 87,736   ΣYi² = 626,720   ΣXiYi = 232,506

    r = [ΣXiYi − ((ΣXi)(ΣYi)/n)] / [√(ΣXi² − (ΣXi)²/n) · √(ΣYi² − (ΣYi)²/n)]

      = [232,506 − ((2,578.31)(6,797.59)/76)] / [√(87,736 − (2,578.31)²/76) · √(626,720 − (6,797.59)²/76)]

      = 1,896.87 / (√266.495 · √18,730.15) = 0.849    (p = 0.000)

    t = r√(n − 2) / √(1 − r²) = 0.849√74 / √(1 − 0.849²) = 13.82    (p = 0.000)

 *  Latitude expressed in hundredths of a degree (for Troy, AL, 31°48′ is the same as 31.80°)
 ** Average date of last spring frost expressed in Julian calendar days (non-leap year)
    (for Troy, AL, 74.41 ≈ 74, which is the same as March 15)
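The arithmetic in table 16.1 can be replicated from the five summary sums alone, using the computational formula (equation 16.6); no station-by-station data are needed:

```python
from math import sqrt

# Summary values reported in table 16.1 (n = 76 weather stations)
n = 76
sum_x, sum_y = 2578.31, 6797.59   # sum of Xi (latitude), sum of Yi (frost date)
sum_x2, sum_y2 = 87736, 626720    # sum of Xi^2, sum of Yi^2
sum_xy = 232506                   # sum of Xi * Yi

# Computational formula for Pearson's r (equation 16.6)
numerator = sum_xy - (sum_x * sum_y) / n
denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
r = numerator / denominator
print(round(r, 3))  # 0.849

# Test statistic (equation 16.9)
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
print(round(t, 2))  # 13.82
```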

Example: Pearson's Correlation Coefficient - Measuring the Strength of Relationship between Percent Growth of Foreign-born Population (2000-2010) and the Cost of Living Index (COLI) in America's Largest Metropolitan Areas

Recall that in chapter 11 we used both ANOVA and the Kruskal-Wallis test to explore descriptively the differences from one census region to another in the rate of immigration growth in large metropolitan areas from 2000 to 2010. We discovered that rates of immigrant growth in metro areas vary considerably by census region. Confirming the results from several studies by The Brookings Institution, large cities in the South have experienced much higher percentages of foreign-born population growth than large cities in the other three census regions. These descriptive findings were shown statistically (through ANOVA and Kruskal-Wallis), in mapped form (fig. 11.3), and with census region boxplots (fig. 11.4).

A logical question for us to ask is why immigrants have been particularly attracted to metro areas of the South over the last decade. Naturally, the answer is complex and multifaceted. One possible explanation is that a large number of immigrants are now attracted to places where the cost of living is fairly low, coupled with the perception by potential immigrants that realistic employment opportunities exist in these cities.

For the 100 largest metropolitan areas in the United States, data are available for both the percent growth in foreign-born population and cost of living index. These data are graphed on a scatterplot (fig. 16.5), with the 2010 Cost of Living Index (COLI) placed on the X axis as a possible explanatory (independent) variable and the 2000-2010 percent growth of foreign-born population on the Y axis as the response (dependent) variable. Notice the "funnel-like" or "cone-shaped" distribution shown in the scatterplot. The variation of the points clearly changes dramatically given the magnitude of the COLI. This violates the requirement that there is a bivariate normal distribution (see the Pearson correlation analysis boxed inset). There appears to be much more variability of percent growth in foreign-born between metro areas having a COLI near 100 and much less variability when metro area COLI values are considerably larger than 100. There are now two valid reasons for applying only a descriptive correlation analysis (with no inferential analysis): the data are not random samples and the distribution of variables is not bivariate normal.

When a Pearson's correlation analysis is applied to this data set, the result is a correlation coefficient of -.356, indicating a relatively weak negative relationship between the COLI in 2010 and percent growth in foreign-born population from 2000 to 2010 for the 100 largest metropolitan areas of the United States.

From a geographic perspective, additional exploratory analysis seems warranted. The scatterplot depicts a fairly weak negative relationship between the two variables. If we look

FIGURE 16.5
Scatterplot Showing Relationship Between Percent Growth of Foreign-born Population (2000 to 2010) and Cost of Living Index (2010), America's 100 Largest Metropolitan Areas
(X axis: cost of living index for 2010, 100 to 225; Y axis: percent growth in foreign-born population from 2000 to 2010, 0 to 140; labeled points include Scranton, PA; Baltimore, MD; Washington, DC; Bridgeport, CT; Honolulu, HI; New York, NY; San Jose, CA; and San Francisco, CA)
Source: Brookings Institution, Immigrants in 2010 Metropolitan America: A Decade of Change

closer, however, a more specific descriptive statement can be directed toward the minority of the 20 or so metro areas with COLI values above 110: "Those large metro areas that currently have a higher-than-national-average COLI (say 110 and above) have generally experienced a low percent growth in foreign-born population (less than 55%) over the last decade."

The scatterplot nicely confirms this descriptive statement. In fact, of the 20 metro areas with COLI values above 110, only Baltimore, MD has a foreign-born growth rate above 55%. Despite having a relatively high COLI of 120, Baltimore has a 2000-2010 foreign-born growth rate of nearly 72%. Clearly, metro areas with high cost of living indices do not show high rates of immigrant growth.

As we look to the future, we might ask whether this immigrant dispersal to less-expensive metro areas (such as Scranton, PA) will continue. Will we see other "new frontiers" for immigrant settlement in the future? In what ways might possible changes in immigration law (such as altering drastically the conditions for gaining citizenship) affect the future spatial pattern of foreign-born residency choice?

16.3 ASSOCIATION OF ORDINAL VARIABLES

In geographic problems with data in ranked form, Spearman's rank correlation coefficient (r_s) is the measure of choice to determine the strength of association between two variables. It is appropriate when variables are measured on an ordinal (ranked) scale or interval-ratio data have been converted to ranks. Spearman's correlation coefficient is also sometimes appropriate when an assumption of Pearson's correlation is clearly not met. For example, Spearman's is likely the better choice if samples are drawn from highly skewed or severely non-normal populations. The statistical power ("power efficiency") of Spearman's correlation is nearly equal to that of Pearson's r.

Similar to Pearson's correlation, r_s ranges from a maximum of +1.0 for perfect positive or direct correlation to a minimum of -1.0 for perfect negative or inverse correlation. When no association exists between variables, r_s equals 0.0. Spearman's correlation coefficient measures the degree of association between two sets of ranks using the following equation:

    r_s = 1 − (6Σd²) / (N³ − N)    (16.10)

where d = difference in ranks of variables X and Y for each paired data value
      Σd² = sum of the squared differences in ranks
      N = number of paired data values

Similar to other statistical tests using ordinal data, the presence of tied rankings influences the Spearman's coefficient. However, the effect of ties on the resultant correlation index will be significant only when the proportion of tied rankings to total number of values sampled is very large. In these instances, a correction factor for ties needs to be applied. As a general rule, the correction factor is not necessary if the number of tied rankings is less than 25% of the total number of pairs.

Spearman's correlation coefficient is sometimes used as a descriptive index of association between two ordinal variables. However, when the paired variables represent a sample drawn at random from a population of bivariate data values, an r_s value can be tested for statistically significant difference from 0. The sample correlation coefficient (r_s) is the best estimator of the population correlation coefficient (ρ_s). In these applications, the null hypothesis states that no relationship exists between the two variables in the population (H0: ρ_s = 0). Confirmation of the null hypothesis is equivalent to affirming independence between the X and Y variables.

Either a one- or two-tailed test is used, depending on the form of the alternate hypothesis. When the Z distribution is used, the test statistic is determined by the Spearman correlation and the sample size:

    Z = r_s √(n − 1)    (16.11)

Spearman Rank Correlation Analysis

Primary Objective: Determine if an association exists between two variables

Requirements and Assumptions:
1. Random sample of paired variables
2. Variables have a monotonically increasing or decreasing association
3. Variables are measured at ordinal scale or downgraded from interval/ratio to ordinal

Hypotheses:
H0: ρ_s = 0
HA: ρ_s ≠ 0 (two-tailed)
HA: ρ_s > 0 (one-tailed) or
HA: ρ_s < 0 (one-tailed)

Test Statistic:

    Z = r_s √(n − 1)
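Equations 16.10 and 16.11 can be sketched in Python. The function assumes the values have already been converted to ranks (with average ranks assigned to any ties), and the five-pair data set is invented for illustration:

```python
import math

def spearman_rs(ranks_x, ranks_y):
    """Spearman's rank correlation (equation 16.10) from two lists of ranks."""
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d2) / (n ** 3 - n)

def z_statistic(rs, n):
    """Large-sample test statistic (equation 16.11): Z = r_s * sqrt(n - 1)."""
    return rs * math.sqrt(n - 1)

# Made-up ranks for five paired observations (no ties)
ranks_x = [1, 2, 3, 4, 5]
ranks_y = [2, 1, 4, 3, 5]
rs = spearman_rs(ranks_x, ranks_y)   # sum of d^2 = 4, so r_s = 1 - 24/120 = 0.8
print(rs)
print(z_statistic(rs, len(ranks_x)))
```

If the two rank orders were identical, every d would be 0 and r_s would equal +1.0; a completely reversed ordering yields r_s = -1.0.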

Example: Spearman's Rank Correlation Coefficient - Exploring Relationships among Various Factors Used to Rank "America's Top States for Business"

Each year CNBC has conducted a study to determine "America's Top States for Business." An annual study from 2011 ranks all 50 states using 43 specific measures allocated into 10 broad categories of competitiveness. These measures are developed with input from business groups such as the National Association of Manufacturers and the Council on Competitiveness. The idea is to compare states by their own standard: the selling points they use to attract business. Each of the 10 categories is weighted by how frequently they are cited in state economic development marketing materials. The 10 categories and their weightings are: cost of doing business (350); workforce (350); quality of life (350); infrastructure and transportation (325); economy (300); education (225); technology and innovation (225); business friendliness (200); access to capital (100); and cost of living (50).

As part of the analysis, each of the 50 states is ranked on each of the 10 general weighted categories of competitiveness. CNBC then totaled up the points each state received to create an overall ranking. The overall ranking of each state was the headline feature CNBC emphasized on their television programs and internet publications. These overall rankings are summarized in figure 16.6. A quick glance at the map doesn't seem to reveal a strong spatial pattern, although many of the states in the South and Midwest census regions seem to have above average rankings. For your information, the top 10 states for overall business competitiveness in 2011 are: Virginia, Texas, North Carolina, Georgia, Colorado, Massachusetts, Minnesota, Utah, Iowa, and Nebraska.

If the state rankings for each pair of categories are correlated using Spearman's correlation coefficient, the resulting matrix of correlations allows direct comparison of the relative direction and strength of any two factors of business competitiveness (table 16.2). The Spearman's correlations have a wide range, from strong negative (-.517 for Quality of Life with Cost of Living) to very strong positive (+.874 for Technology and Innovation with Access to Capital). To illustrate the Spearman's calculation procedure, the Infrastructure and Transportation factor is correlated with the Technology and Innovation factor, resulting in r_s = .670 (table 16.3).

From a spatial perspective, we may want to ask which state-level factors of business competitiveness vary the most by census region. To answer this question statistically, the appropriate test is Kruskal-Wallis (the non-parametric analysis of variance). Kruskal-Wallis is the test of choice because the various factors of business competitiveness are calibrated by ordinal rank, and more than two groups are being compared for differences (four census regions). Table 16.4 summarizes the Kruskal-Wallis results. We are limited to a descriptive presentation of these results: no inferential (probabilistic) analysis is valid because we do not have a sample of states taken from each region (rather,

FIGURE 16.6
CNBC's 2011 Overall Ranking of States for Business Competitiveness
(Map legend: ranks 1-10, excellent overall ranking; ranks 11-20, good; ranks 21-30, average; ranks 31-40, below average; ranks 41-50, poor overall ranking)
Source: CNBC, America's Top States for Business

we have the entire population of all 50 states). Therefore, the computer-calculated p-value of 0.091 has no valid inferential interpretation. However, a glance at the overall census region rankings in table 16.4 clearly shows that the Midwest Census Region emerges as distinctly better than the other three regions with regard to business competitiveness. As a geographer, you should now ask why the Midwestern states are showing better business competitiveness rankings.

With regard to the 10 different factors used in the business competitiveness rankings, some factors show large differences from one census region to another, while other factors show only small differences across the regions of the country. The top portion of table 16.5 indicates that education, cost of living, and quality of life vary greatly from one region of the U.S. to another. The single most dramatic regional difference is in education. The Northeastern states

TABLE 16.2
Matrix of Correlation Coefficients: Factors Used to Rank Top States for Business

                        Cost of
                        doing                Quality            Infrastructure     Technology               Business       Access
                        business  Workforce  of life   Economy  & transportation   & innovation  Education  friendliness  to capital
Workforce                 0.235
Quality of life          -0.346    -0.222
Economy                   0.077    -0.187     0.365
Infrastructure
& transportation          0.188     0.191    -0.414    -0.379
Technology
& innovation             -0.245    -0.109    -0.068    -0.304     0.670
Education                -0.141    -0.418     0.426     0.272    -0.029             0.352
Business friendliness     0.274     0.464     0.098     0.191     0.107             0.040         0.096
Access to capital        -0.299    -0.145     0.090    -0.197     0.592             0.874         0.405      0.060
Cost of living            0.756     0.410    -0.517     0.075     0.336            -0.211        -0.313      0.211         -0.260

TABLE 16.3
Spearman's Correlation Example: State Ranks on Infrastructure-Transportation and Technology-Innovation, 2011

                 Ranked                          Ranked                  Difference
State*           infrastructure-transportation   technology-innovation   (d)          d²
Alabama               24.0                            33.0                 -9          81
Alaska                47.0                            43.0                  4          16
Arizona               11.0                            18.0                 -7          49
Arkansas              40.5                            44.0                 -3.5        12.25
California             7.0                             1.0                  6          36
Colorado              26.5                            14.0                 12.5       156.25
...
Vermont               49.5                            40.5                  9          81
Virginia              11.0                            11.0                  0           0
Washington            18.0                             5.0                 13         169
West Virginia         44.0                            47.0                 -3           9
Wisconsin             22.0                            21.0                  1           1
Wyoming               38.0                            50.0                -12         144

                                                                          SUM       6,867.5

* Data listed is for only 12 states

    r_s = 1 − 6(6,867.5) / (50³ − 50) = 1 − 41,205 / 124,950 = 1 − 0.3298 = .670
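The calculation at the bottom of table 16.3 can be checked directly from the reported sum of squared rank differences:

```python
# Reported values from table 16.3: n = 50 states, sum of d^2 = 6,867.5
n = 50
sum_d2 = 6867.5

# Spearman's rank correlation (equation 16.10)
rs = 1 - (6 * sum_d2) / (n ** 3 - n)
print(round(rs, 3))  # 0.67
```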

TABLE 16.4
Kruskal-Wallis: Analysis of Differences Between States by Census Region for Overall Rankings of Business Competitiveness

                Number of states
Census region   in census region   Median rank   Average rank      Z
Northeast              9               30            29.3          0.87
South                 16               30.5          26.3          0.26
Midwest               12               14            16.6         -2.43
West                  13               31            30.1          1.33
Overall               50               25.5

H = 6.46 (p = 0.091)

TABLE 16.5
Kruskal-Wallis: Largest and Smallest Differences by Census Region, Factors Related to Business Competitiveness

Factors of business competitiveness with largest differences by census region

                           Median rank
Census region   Education    Cost of living   Quality of life
Northeast          4.0            42.0             11.0
South             32.5            13.5             39.5
Midwest           18.5            16.5             22.5
West              39.0            33.0             16.0

                H = 26.42      H = 25.71        H = 24.38
                (p = 0.000)    (p = 0.000)      (p = 0.000)

Factors of business competitiveness with smallest differences by census region

                           Median rank
                             Technology        Access to
Census region   Economy      & innovation      capital
Northeast         19.0            21.0             17.0
South             29.0            31.5             30.0
Midwest           22.5            23.0             27.5
West              31.0            26.0             25.0

                H = 2.15       H = 2.21         H = 3.32
                (p = 0.543)    (p = 0.531)      (p = 0.345)

are all evaluated as very strong in education (the median ranking of the Northeast Census Region is 4.0, and the top five states nationwide are Massachusetts, New Jersey, New York, Pennsylvania, and Vermont). By contrast, states in the South and West generally rank low in education (with median ranks of 32.5 and 39.0, respectively).

The cost-of-living and quality-of-life factors also show substantial differences in rankings from one census region to another. However, the spatial patterns are quite different. With regard to cost of living, the South does best (a least-expensive median rank of 13.5) while the Northeast does worst (a most-expensive median rank of 42.0). Just the opposite occurs with quality of life, as states in the South are poorly ranked (median rank of 39.5) and states of the Northeast are highly ranked (median rank of 11.0).

The bottom portion of table 16.5 shows that the three most-evenly distributed factors of business competitiveness (with relatively small differences in ranking between census regions) are technology and innovation, economy, and access to capital. With respect to these dimensions of business competitiveness, it does not seem to matter much which part of the U.S. is being examined; the opportunities for business success are about the same. Once again, as geographers we should now ask why these factors are rather evenly distributed around the country.

KEY TERMS

correlation analysis, 239
covariation, 242
direction of correlation: positive, negative, random or neutral, 240
Pearson's product-moment correlation coefficient, 242
predictor, explanatory (or independent) variable, 241
response (or dependent) variable, 241
Spearman's rank correlation coefficient, 247
strength of association, 240

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Explain the nature and purposes of correlation analysis.
2. Distinguish the various directional relationships between variables (positive, negative, and neutral) and among different strengths of association between variables (perfect, strong, weak, and none).
3. Explain the concept of covariation, and understand how scattergrams can depict the relationship between covariation and correlation.
4. Create a geographic research problem or situation for which Pearson's correlation coefficient would be appropriate.
5. Create a geographic research problem or situation for which Spearman's rank correlation coefficient would be appropriate.

REFERENCES AND ADDITIONAL READING

Abler, R., J. Adams, and P. Gould. Spatial Analysis. Englewood Cliffs, NJ: Prentice-Hall, 1971.
Burt, J. E. and G. M. Barber. Elementary Statistics for Geographers. 2nd ed. New York: Guilford Press, 1996.
Ebdon, D. Statistics in Geography: A Practical Approach. Oxford: Basil Blackwell, 1985.
Falk, R. and A. D. Well. "Many Faces of the Correlation Coefficient." Journal of Statistics Education, 1997. Accessed January 10, 2014. http://www.amstat.org/publications/jse/v5n3/falk.html
Rodgers, J. L. and W. A. Nicewander. "Thirteen Ways to Look at the Correlation Coefficient." The American Statistician 42 (1988): 59-66.

If you are interested in further exploring the geographic examples mentioned in this chapter, the following are good places to start. The most complete source for last spring frost data is the National Climatic Data Center (NCDC) at www.ncdc.noaa.gov. The immigration growth data for the largest metropolitan areas comes from the Brookings Institution at www.brookings.edu. If you want to learn more about the Cost of Living Index (COLI), see various publications provided by the Bureau of Labor Statistics at www.bls.gov. Detailed discussions of issues and problems related to the COLI and related indices are found at The Council for Community and Economic Research website, www.coli.org.
Simple LinearRegression

17 .1 Form of Relationship in Simple Linear Regression


17 .2 Strength of Relationship in Simple Linear Regression
17 .3 Residual or Error Analysis in Simple Linear Regression
17.4 Inferential Use of Regression
17 .5 Example: Simple Linear Regression-Lake Effect Snow in Northeastern Ohio

In the previous chapter Pearson's correlation coefficient was discussed as the index for computing the degree of association between variables measured on an interval/ratio scale. In correlation analysis, you should not automatically assume there is a functional or causal relationship between the two variables. A correlation can be computed for any two variables, as long as the correlation index is consistent with the level of measurement for the data.

Geographers often work in research areas where associations between variables need to be explored in more detail. The assumption or hypothesis may be that one variable influences or affects another or that a functional relationship ties one variable to another. For these geographical problems, regression analysis is a useful statistical procedure that supplements correlation.

A classic problem studied by geographers is the spatial relationship between level of precipitation in an agricultural region and the population density the area can support. It is hypothesized that the amount of moisture available at locations within a region influences or affects the density pattern of farm population in the region. Data could be collected for the two variables at various sites in the region, and regression used to answer questions about how population density relates to precipitation level. The nature or form of this relationship can be explored and the strength of the relationship determined. Assuming that the relationship is not exact or perfect, regression allows sources of error in the relationship to be examined and helps uncover additional variables that may influence the geographic pattern. The problem differs from simple correlation analysis because a functional relationship is expected between the variables, and the nature of that relationship needs to be explored more fully. Simply stated, regression should not be used unless a clear rationale or model links one variable to another, as is the case with population density and rainfall.

Regression is applied successfully in all areas of geography. For example, a medical geographer could examine the relationship between the number of physicians located in the counties of a state and the income level of persons residing in these areas. Regression analysis could be used to predict or estimate the number of county physicians based on a county's income profile. An environmental geographer may wish to examine the form of relationship between the acidity level in various locations in a chain of lakes and distance from a point pollution source. Regression analysis could predict acidity level in a lake based on distance from the pollution source. A political geographer might want to compare the strength of votes for a political party and the educational, financial, or racial composition of voters in the wards of a city. Based on the socioeconomic composition of a city ward, regression could then be applied to predict political party voting strength. A cultural geographer surveying current attitudes at various locations could apply regression to predict the level of support for a controversial issue like "women's right to choose" as related to such socioeconomic characteristics as occupation, income, religious belief, or education at those locations.

Simple linear regression (also known as bivariate regression) examines the influence of one variable on another and is discussed in sections 17.1 through 17.4. Just as recommended in correlation, the variable supplying the influence or effect is called the explanatory, predictor, or independent variable and is placed on the X axis. The variable receiving the influence or effect is termed the response or dependent variable and is placed on the Y axis. In regression terminology, the response variable is hypothesized to be affected or influenced by the explanatory or predictor variable. The regression sections include a detailed examination of the form or nature of relationship between latitude and average date of last spring frost, so we are continuing the discussion of this association from the correlation chapter.

The most common application of regression is the identification of linear relationships between variables. In linear form, changes in values of the variables are constant across the range of the data. In these instances, the pattern of points from a scatterplot approximates a straight line. Real-world relationships, however, sometimes show variables that are related in curvilinear ways where variable changes are not constant across all values. In this chapter regression is applied only to a linear relationship.

17.1 FORM OF RELATIONSHIP IN SIMPLE LINEAR REGRESSION

Like correlation, simple linear regression measures how one variable relates to another. The basic question posed by regression is: "What is the form or nature of the relationship between the variables under study?" As with correlation, the association between two variables is easily visualized by constructing a scatterplot, but now the scatterplot is supplemented with a "best-fitting" regression line (and equation) that best replicates or models the relationship.

We continue now with the example in which latitude is associated with the average date of last spring frost in the southeastern United States. Recall that data for these two variables are plotted for 76 weather stations in the study area to produce a scatterplot (fig. 16.4). We always recommend that you look at a scatterplot before proceeding with any regression analysis. If the scatterplot does not appear reasonably linear, a simple linear regression model may not be valid. Pearson's correlation analysis results in a strong positive association of 0.849 (table 16.1). Generally, the greater the latitude north of the equator for a weather station, the later the average date of the last spring frost for that station.

By calculating a "best-fitting" line and plotting it over a scatterplot of points, we can describe the pattern of points more objectively. Although an infinite number of lines could be drawn to summarize the points in a scatterplot, the "best-fitting" line is always defined as the unique line that minimizes the sum of squared vertical distances between each data point and the line. The best-fitting line is therefore often referred to as the least-squares regression line (fig. 17.1), a name that reflects exactly this minimization.

No other line can be generated where the sum of the squared distances between the points and the line (measured vertically) is smaller than that calculated for the least-squares line. This line defines the best estimate of the relationship between the explanatory variable (X) and the response variable (Y). It also serves as a predictive model by generating estimates of the response variable using both the values of the explanatory variable and knowledge of the relationship which connects the two variables.

In a simple linear regression with explanatory variable (X) and response variable (Y), the least-squares regression line is denoted using the general definition of a line:

    Y = a + bX    (17.1)

In addition to the two variables, the equation contains two constants or parameters (a is the Y-intercept and b is the slope), which are calculated from the actual set of data. These values uniquely define the equation and establish the position of the best-fitting line on the scatterplot. The equation of the least-squares line for the average date of last spring frost example is:

    Y = -152.01 + 7.1171X

where:  X = latitude (expressed in hundredths of a degree)
        Y = average date of last spring frost (expressed by Julian date)
        a = -152.01 (the Y-intercept)
        b = 7.1171 (the slope)

FIGURE 17.1
The Objective of Least-squares Regression (objective: minimize Σdᵢ²)
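The least-squares objective pictured in figure 17.1 (minimize Σdᵢ²) is easy to verify numerically. The following sketch is our own illustration, not part of the original text, and the data are hypothetical: it fits a line with the standard least-squares formulas and confirms that nudging either parameter away from the least-squares solution only increases the sum of squared vertical distances.

```python
# Sketch of the least-squares idea behind equation 17.1.
# The data below are hypothetical and purely illustrative
# (they are NOT the 76-station frost data from the text).

def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical distances between points and the line Y = a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares_fit(x, y):
    """Intercept a and slope b that minimize the sum of squared errors."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

x = [30.0, 31.5, 33.0, 34.5, 36.0]   # hypothetical latitudes
y = [62.0, 75.0, 80.0, 95.0, 104.0]  # hypothetical Julian frost dates

a, b = least_squares_fit(x, y)
best = sum_squared_errors(x, y, a, b)

# Any other line gives a larger sum of squared vertical distances:
assert best <= sum_squared_errors(x, y, a, b + 0.5)   # perturbed slope
assert best <= sum_squared_errors(x, y, a - 2.0, b)   # perturbed intercept
```

The same closed-form fit is what any statistical package computes when it draws the least-squares line on a scatterplot.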
Part VI ▲ Statistical Relationships between Variables

The best-fitting regression line is added to the 76-point scatterplot in figure 17.2. The constant a represents the expected value of Y when the value of X is zero; that is, the Y-intercept indicates where the regression line crosses the Y axis. In this example, the value of a is -152.01, and is not shown on figure 17.2. At first glance, this result doesn't seem to make any sense. How can the Julian date be a negative number? The answer is: it can't! So how did we get this seemingly impossible number? It should be noted that a is the predicted best estimate of the Y value (average date of last spring frost) when the X value (latitude) is zero. Obviously, we are not interested in knowing the last spring frost date on the equator! We are interested in predicting the last spring frost only in the southeastern United States, roughly between the latitudes of 29 and 37 degrees north. This range of latitude from 29 to 37 degrees north is formally termed the domain of X. Zero degrees latitude (the equator) is clearly far outside the domain of our study area, so the regression model is not valid for that value of X.

Whether the X variable represents a study area (like the range of latitude) or is a non-spatial explanatory variable (such as per capita total expenditures on health care or percent change in obesity level), it is always true that a regression model is not fully valid outside the domain of X. If you attempt to extrapolate beyond this domain, you are venturing into dangerous territory because you are assuming that the linear relationship between X and Y will continue beyond even the most extreme observations of the predictor or explanatory variable, X. A similar problem exists if the explanatory variable is temporal. Extrapolating predictions into the unknown future is a risky business (as any investor in the stock market will readily admit)!

The other constant in the regression equation, b, represents the slope (tangent) of the best-fit line. This value, also called the regression coefficient, shows the absolute change of the line in the Y (vertical) direction associated with an increase of 1 in the X (horizontal) direction. The slope reveals how the response variable will change given a unit increase in the explanatory or predictor variable.

Both the sign and magnitude of b offer useful information about the simple linear relationship. The sign (+ or -) of the slope determines the direction of relationship between the two variables (fig. 17.3). If the slope is positive (b > 0), the line trends upward from low values of X to high values of X (case 1). On the other hand, if the slope is negative (b < 0), the line trends downward (case 2). This interpretation of direction is equivalent to that for the correlation coefficient. When the relationship between the explanatory and response variables is direct, b will be positive and the line trends up from left to right. However, with an inverse relationship, the value of b is negative, and the line trends down. When no relationship exists, the value of b is zero, and the line parallels the X axis (case 3).

In absolute terms, the magnitude of the parameter b indicates the flatness or steepness of the regression line when moving from lower to higher values of X. When b is large (regardless of sign), the change in Y is large relative to a unit increase in X. In this situation, the slope of the regression line tends to be steep. When b is small, the opposite interpretation occurs. The change in the response variable is small when compared to a unit increase in the explanatory variable, and the line has a flatter slope.

FIGURE 17.2
Best-fitting Regression Line Placed on Scatterplot: Latitude and Average Date of Last Spring Frost
(Line: Y = -152.01 + 7.1171X; horizontal axis: northern latitude in decimal degrees, 29.0° to 38.0°; vertical axis: average date of last spring frost in Julian days, 40 to 120)

In the last spring frost example, the calculated slope or b value is 7.1171. Since the regression coefficient is positive, the least-squares line slopes upward from left to right on the scatterplot, indicating the last spring frost date will be later (larger or higher Julian date) if the latitude of the weather station is further north (as shown in fig. 17.2). We can now be very specific in our predictions. For each unit increase in X (one degree further north latitude), the average date of last spring frost will increase by 7.1171 Julian days. This means our simple linear regression model predicts that a weather station located one degree further north latitude than another weather station will have a last spring frost date approximately one week later in the spring.

The a and b parameters for the best-fitting regression line are calculated:

    b = [n ΣXY - (ΣX)(ΣY)] / [n ΣX² - (ΣX)²]    (17.2)

    a = (ΣY - b ΣX) / n    (17.3)

where:  ΣX = sum of the values for variable X
        ΣY = sum of the values for variable Y
        ΣX² = sum of the squared values for variable X
        ΣXY = sum of the product of corresponding X and Y values
        n = number of observations

FIGURE 17.3
Interpretation of Slope in Simple Linear Regression
(Case 1: positive or direct relationship, Y = a + bX with b > 0; Case 2: negative or inverse relationship, Y = a - bX with b < 0; Case 3: no relationship, Y = a with b = 0)
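One consequence of equation 17.3 worth noting: since a = (ΣY - bΣX)/n = Ȳ - bX̄, the best-fitting line always passes through the point of means (X̄, Ȳ). A short numeric check of equations 17.2 and 17.3 (our own sketch; the data are hypothetical, for illustration only):

```python
# Verify that equations 17.2 and 17.3 force the least-squares line
# through the point of means (x̄, ȳ). Hypothetical illustrative data.
x = [29.5, 31.0, 32.5, 34.0, 35.5, 37.0]
y = [60.0, 73.0, 77.0, 90.0, 99.0, 108.0]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # equation 17.2
a = (sum_y - b * sum_x) / n                                   # equation 17.3

x_bar, y_bar = sum_x / n, sum_y / n
# The fitted line evaluated at x̄ returns ȳ exactly:
assert abs((a + b * x_bar) - y_bar) < 1e-9
```

This property explains why the slope is computed first: once b is known, the intercept simply anchors the line at the point of means.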

TABLE 17.1
Simple Linear Regression Example for Latitude and Average Date of Last Spring Frost

                               X:          Y: Average date of
 Weather station               Latitude    last spring frost      Xᵢ²         XᵢYᵢ
 1   Troy, AL                  31.80°          74.41             1,011.24    2,366.24
 2   Union Springs, AL         32.02°          82.27             1,025.28    2,634.29
 3   Talladega, AL             33.42°          97.83             1,116.90    3,269.48
 ...
 76  Farmville, VA             37.33°         110.62             1,393.53    4,129.44
 Σ                           2,578.31       6,797.59            87,736     232,506

b = [n ΣXY - (ΣX)(ΣY)] / [n ΣX² - (ΣX)²]
  = [76(232,506) - (2,578.31)(6,797.59)] / [76(87,736) - (2,578.31)²] = 7.1171

a = (ΣY - b ΣX) / n = [6,797.59 - 7.1171(2,578.31)] / 76 = -152.01
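The arithmetic in table 17.1 can be reproduced directly from the printed column totals. A quick sketch (note that the totals are rounded, so the computed a and b agree with the text only to rounding):

```python
# Reproduce the slope and intercept in table 17.1 from the printed totals.
n = 76
sum_x = 2578.31   # ΣX   (latitude)
sum_y = 6797.59   # ΣY   (Julian date of last spring frost)
sum_x2 = 87736    # ΣX²
sum_xy = 232506   # ΣXY

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # equation 17.2
a = (sum_y - b * sum_x) / n                                   # equation 17.3

# Because the printed totals are rounded, the results match the text
# closely but not exactly: b ≈ 7.118 (text: 7.1171), a ≈ -152.03 (text: -152.01).
assert abs(b - 7.1171) < 0.01
assert abs(a + 152.01) < 0.25
```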
If you ever have to do calculations manually (hopefully not), the slope or regression coefficient is calculated first, then used to compute the Y-intercept or a value. Table 17.1 shows the intermediate steps used to calculate a and b for the average date of last spring frost example.

The two parameters uniquely define the least-squares regression line that best summarizes the relationship between the explanatory and response variables. This line can be placed on the scatterplot by calculating and plotting two points that lie on the line and then connecting them with a straight line. The two points can be determined by selecting any two values of X, substituting them into the regression equation, and calculating the corresponding values of Y. Fortunately, most statistical software packages will plot the regression line for you.

17.2 STRENGTH OF RELATIONSHIP IN SIMPLE LINEAR REGRESSION

For any realistic application, the scatterplot of points will not be depicted perfectly by the least-squares line of regression. All points will not lie exactly on the line, thus indicating that the predictor variable cannot fully explain the response variable. This "error" in regression analysis can be traced to several sources. Since geographers study patterns on the earth's surface that are frequently produced by complex processes, it is unreasonable to expect a single predictor (independent) variable in a regression model to account fully for the variation in the response variable. Even when multiple influences are considered, some portion of most real-world patterns is attributable to additional known variables not included in the regression model, other unknown variables, or unpredictable, random occurrences. The inability to measure or operationalize variables in an accurate or valid fashion may be another source of error. Despite these problems, we must determine the strength of relationship between variables if we want to evaluate the explanatory ability of the regression model.

The issue of strength in regression analysis can be viewed in both conceptual and practical terms. In simple linear regression, a single explanatory variable (X) is used to explain or account for variation in the response variable (Y). The ability of this explanatory variable to account for the variation in Y provides a measure of strength or level of explanation. To understand this process from another perspective, the strength of relationship in regression is determined by the amount of deviation between the points on the scatterplot and the position of the best-fit line. In general, the closer the set of points to the regression line, the stronger the linear relationship between the variables. Determining the strength of relationship between two variables in regression is the same as measuring the relative ability of the explanatory variable to account for variation in the response variable.

FIGURE 17.4
Bucket and Sponge Analogy in Simple Linear Regression
(Bucket filled with total variation in Y (Σy²); explained variation absorbed by sponge (Σyₑ²); unexplained variation not absorbed by sponge (Σyᵤ²))
Source: Abler, R., Adams, J., and Gould, P. 1971

Abler, Adams, and Gould (1971) created a useful analogy involving a bucket and sponge to illustrate the simple linear regression process (fig. 17.4). A bucket full of water represents the total variation in the response variable, Y. A sponge denotes the explanatory or predictor variable (X) which will be used to explain variation in Y. When the sponge is dipped into the bucket and removed, some of the water (symbolizing variation in Y) is absorbed. This represents the amount of variation in Y that can be explained by X. Although some water is absorbed by the sponge, some of it remains in the bucket. This residual amount is the portion of variation in Y that cannot be explained by X.

The sponge analogy can be developed further. The central issue in measuring the strength of relationship is to determine the relative ability of the sponge to remove water from the bucket. If the sponge is "super absorbent," it will remove a large proportion of the water. A "less absorbent" sponge removes less water. Calculating the ratio of volume of water removed by the sponge to the total volume of water originally in the bucket provides a strength index for the sponge.

The bucket and sponge analogy translates directly into regression terminology. The total variation in Y, the response variable, is represented by the original volume of water in the bucket. This provides a measure of the variation available for explanation by the predictor variable:

    Σy² = Σ(Y - Ȳ)²    (17.4)

where:  Σy² = total variation in Y

Note that Y ("large y") symbolizes values of the response variable, whereas y ("small y") denotes the deviation of each value of the response variable from its mean: (Y - Ȳ). Because the calculation of total variation
involves summing the squared deviations from the mean, the term is often referred to as the total sum of squares. An alternative formula for calculating the total sum of squares is:

    TSS = ΣY² - (ΣY)²/n    (17.5)

where:  TSS = total sum of squares

As illustrated in the bucket and sponge analogy, the total variation in the response variable can be broken into two parts, the explained variation and the unexplained variation:

    Σy² = Σyₑ² + Σyᵤ²    (17.6)

where:  Σyₑ² = explained variation
        Σyᵤ² = unexplained variation

The explained variation, also called the "explained sum of squares," is the amount of variation that can be accounted for by the explanatory or predictor variable X. In the analogy, it is the amount of water absorbed by the sponge. The unexplained or "residual" variation is the portion of total variation in the response (dependent) variable that cannot be accounted for by the explanatory variable. It is analogous to the water not removed by the sponge. Because unexplained variation relates directly to analysis of residuals or error in regression, this concept will be discussed more fully in the next section.

The explained variation is the ratio of the square of the covariation between X and Y to the variation in X:

    Σyₑ² = (Σxy)² / Σx²    (17.7)

where:  Σxy = covariation of X and Y
        Σx² = total variation of X

As discussed in section 16.2, the covariation between X and Y indicates how the two variables vary together (covary) and is used to help interpret the correlation coefficient. If two variables tend to covary consistently in the same direction, covariation (and the correlation coefficient) is high and positive (see fig. 16.4, case 1). If they vary systematically, but in opposite directions, the covariation and resulting correlation is high and negative (see fig. 16.4, case 2).

Although the direction of association represented by the sign of the covariation term is important in correlation analysis, it can be ignored when using covariation to
TABLE 17.2
Worktable for Calculating Strength of Relationship for Latitude and Last Spring Frost Example

Total variation:
    Σy² = ΣY² - (ΣY)²/n = 626,720 - (46,207,229 / 76) = 18,730.15

Explained variation:
    Σxy = ΣXY - (ΣX)(ΣY)/n = 232,506 - (2,578.31)(6,797.59)/76 = 1,896.87
    Σx² = ΣX² - (ΣX)²/n = 87,736 - (2,578.31)²/76 = 266.495
    Σyₑ² = (Σxy)²/Σx² = (1,896.87)²/266.495 = 13,501.625

Coefficient of determination:
    r² = Σyₑ²/Σy² = 13,501.625/18,730.15 = 0.721  (p = 0.000)
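The arithmetic in the worktable above can be reproduced in a few lines. This sketch uses the printed (rounded) totals from tables 17.1 and 17.2, so the results agree with the worktable only to rounding:

```python
# Reproduce the strength-of-relationship worktable (table 17.2) from
# the printed totals, adding ΣY² = 626,720.
n = 76
sum_x, sum_y = 2578.31, 6797.59
sum_x2, sum_y2, sum_xy = 87736, 626720, 232506

total_var = sum_y2 - sum_y ** 2 / n        # Σy²  ≈ 18,730
covariation = sum_xy - sum_x * sum_y / n   # Σxy  ≈ 1,896.9
var_x = sum_x2 - sum_x ** 2 / n            # Σx²  ≈ 266.5
explained_var = covariation ** 2 / var_x   # Σyₑ² ≈ 13,502 (equation 17.7)
r_squared = explained_var / total_var      # equation 17.8

assert abs(total_var - 18730.15) < 1
assert abs(covariation - 1896.87) < 0.5
assert abs(r_squared - 0.721) < 0.005
```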
determine explained variation. Remember that regression is concerned with the ability of the explanatory variable to account for variation in the response variable. By squaring the covariation in equation 17.7, the influence of a negative covariation is eliminated.

When the squared covariation (numerator of equation 17.7) is large relative to the total variation in X (denominator of equation 17.7), the explained variation is also large. In such cases, the points of the scatterplot will tend to lie close to the regression line and the explanatory variable X will account for more of the variation in the response variable Y. This is equivalent to selecting an absorbent sponge.

On the other hand, if the explanatory and response variables do not covary systematically in a scatterplot, the resulting covariation measure (and correlation coefficient) will be low (see fig. 16.4, case 3). In this situation, the ratio of the squared covariation to the total variation in X will also be low, producing a smaller amount of explained variation. When the level of covariation of the X and Y variables is weak, the points tend to scatter more widely about the regression line and the general strength of relationship is weak.

The amount of explained variation is an absolute measure calculated in units of the response variable and is comparable to the volume of water removed by the sponge in the analogy. A more useful way to express the strength of a regression relationship is with a relative index that is not tied to the units of measurement. Termed the coefficient of determination or r², the relative strength index for regression is simply the ratio of the explained variation to the total variation in Y, and measures the proportion of variation in the response variable that is accounted for by the explanatory variable:

    r² = Σyₑ² / Σy²    (17.8)

The index is often multiplied by 100 for ease of interpretation. In this form, the strength index ranges from 0 to 100 and can be interpreted as the percentage of variation in the response variable accounted for by the explanatory variable (or in the bucket-sponge analogy, the percentage of water removed by the sponge).

The worktable concerning the strength of relationship for the average date of last spring frost example is shown in table 17.2. The coefficient of determination (r² = .721) indicates that latitude accounts for 72.1% of the variation in average date of last spring frost across this portion of the southeast. The remaining 27.9% of last spring frost variation is not explained by the latitude variable in this regression model.

Although the coefficient of determination (r²) is closely related to the correlation coefficient (r), the two indices have different purposes and interpretations. The correlation coefficient shows the direction and level of association between any two variables and does not imply a functional or causal relationship. The coefficient of determination, on the other hand, is used as a regression index to measure the degree of fit of the points to the regression line or the ability of the explanatory variable to account for variation in the response variable. As a result, the use of r² requires a logical rationale for the existing relationship and the specification of explanatory and response variables.

17.3 RESIDUAL OR ERROR ANALYSIS IN SIMPLE LINEAR REGRESSION

Residual analysis provides additional information about the variation in the response variable that cannot be explained by the explanatory variable. Because geographic relationships seldom allow perfect explanations of the response variable, points will only rarely lie on the regression line in the scatterplot. The amount of deviation of each point from the regression line is termed the absolute residual. It represents the vertical difference between the actual and predicted values of Y, expressed in units of the variable itself:

    RES = Y - Ŷ    (17.9)

where:  RES = residual
        Y = actual value of the response variable
        Ŷ = predicted regression line value of Y

The predicted values of Y are generated from the regression line. Each Ŷ value represents a best estimate of the response variable produced from the a and b parameters that define the regression line. By substituting any value of X (for example, X₁) into the regression equation, the corresponding predicted value (Ŷ₁) is calculated:

    Ŷ₁ = a + bX₁    (17.10)

For example, Farmville, VA (weather station #76) is located at 37° 20' north latitude (37.33), therefore:

    Ŷ₇₆ = -152.01 + 7.1171(37.33) = 113.67

This predicted value is quite close to the actual average date of last spring frost calculated directly from the data (110.62). The regression model predicts Farmville will have its last spring frost on about April 24 (Julian date 114), whereas the actual date of last spring frost (averaging about 111) is April 21. The model slightly overestimates the actual last spring frost date by about 3 days and Farmville has an absolute residual of -3.055 (as taken from the computer output).
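The Farmville prediction and residual can be checked with the rounded coefficients quoted in the text; because a and b are rounded, the residual computed this way (about -3.051) differs slightly from the computer output value of -3.055:

```python
# Predicted value and residual for Farmville, VA (station #76),
# using the fitted line from the text (equations 17.9 and 17.10).
a, b = -152.01, 7.1171

def predict(x):
    """Ŷ = a + bX for the latitude / last-spring-frost model."""
    return a + b * x

x_farmville = 37.33   # latitude in decimal degrees
y_farmville = 110.62  # actual average date of last spring frost (Julian)

y_hat = predict(x_farmville)
residual = y_farmville - y_hat  # equation 17.9: RES = Y - Ŷ

assert abs(y_hat - 113.67) < 0.01
assert residual < 0  # negative residual: the model overestimates, point below line
assert abs(residual + 3.05) < 0.01
```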
By using the parameters of the line and the values of the explanatory and response variables, absolute residuals can also be calculated without having to compute the predicted Ŷ values directly:

    RES = Y - (a + bX)    (17.11)

Why is so much attention given to calculation of the absolute residuals from regression? The reason is simple: we can gain two general insights when examining residual values associated with a matched pair of data values. First, the size or magnitude of each residual provides the absolute amount of error associated with that data value. For some values the residual or error is small and the explanatory variable accurately predicts the value of the response variable. In other cases the residual is large, indicating poor prediction of the response variable. The smaller the absolute magnitude of a residual, the smaller the vertical distance on the scatterplot separating the point from the regression line and the less error associated with the data value.

Second, the direction of residuals from the line is important. For some data values the actual magnitude of Y exceeds the predicted magnitude and the residual is positive (RES > 0). In these instances, the point lies above the regression line on the graph and the model underestimates the actual value of the response variable. For other units of data the predicted magnitude of Y exceeds the corresponding actual magnitude, the residuals are negative (RES < 0), and the points lie below the line. In these cases, the regression model has overestimated the actual value of Y.

The insights that can be gained from analysis of absolute residuals are illustrated by examining two weather stations from the average date of last spring frost example: Hawkinsville, GA (weather station #20) and Talladega, AL (weather station #3). Using equation 17.11 and the data values listed in tables 16.1 and 17.1, residuals for these weather stations are calculated as follows:

Hawkinsville, GA:
    RES₂₀ = Y₂₀ - (a + bX₂₀)
          = 75.22 - [-152.01 + (7.1171)(33.28)]
          = 75.22 - 84.85 = -9.63

Talladega, AL:
    RES₃ = Y₃ - (a + bX₃)
         = 97.83 - [-152.01 + (7.1171)(33.42)]
         = 97.83 - 85.84 = 11.99

Hawkinsville, GA (located about 40 miles southeast of Macon, GA and 120 miles southeast of Atlanta, GA) is a weather station having an absolute residual that is fairly large and negative (-9.63). Positioned below the regression line on the scatterplot with a predicted Ŷ value of 84.85, the simple linear regression model overestimates the average date of last spring frost by nearly 10 days. The date of average last spring frost (derived from the data) is about Julian day 75 (March 16), while the model predicts that the average date of last spring frost should be Julian day 85 (March 26) at that latitude.

Located about 45 miles east of Birmingham, AL, the weather station at Talladega has a fairly large positive residual (11.99). With an actual average last spring frost date of 97.83, the point representing Talladega is well above the predicted regression line value of 85.84. That is, while the model predicts the date of last spring frost should be about March 27 (Julian day 86), the average date derived from the data is April 8 (Julian date 98).

These residual analyses lead to some obvious questions. Why is the last spring frost in Hawkinsville occurring about 10 days earlier than predicted, and why is the last spring frost in Talladega occurring about 12 days later than predicted? Are residuals of this magnitude typical of the 76 weather stations distributed around the region?

Latitude is certainly not a perfect predictor of when the last spring frost will occur at a weather station. We already know this to be the case since latitude accounts for only 72.1% of the variation in average last spring frost dates among the 76 sample weather stations. The simple regression model using only latitude to predict last spring frost date contains a set of residuals (bits of unexplained variation or errors in the model) that leaves 27.9% of the total variation in the response variable unexplained.

To explore further, we should take the next logical geographic step and map the residuals. This allows us to examine the spatial pattern of unexplained variation in last spring frost dates. If the spatial pattern of residuals is random (if there is no discernible non-random pattern to the residuals) then it may be difficult to suggest another explanatory variable beyond latitude to add to the model. Hopefully, however, there is a non-random pattern to the residuals that suggests another explanatory variable which could be added to our model. If so, then we can perhaps build a multiple regression model (see chapter 18) with more than one predictor variable. The generalized pattern of absolute (and standardized) residuals is shown in figure 17.5.

Of the 76 weather stations included in this problem, only 22 have a residual whose absolute value is greater than 8.41 days, and these are highlighted in figure 17.5. We mention the specific value of 8.41 because that is the standard error of the estimate (an index of relative error) which defines the typical distance separating a point from the best-fitting regression line on the scatterplot. The standard error of the estimate is in fact the standard deviation of the residual values. We will see shortly how this value is calculated. The simple linear regression model using latitude to predict the average date of last spring frost has a standard error of 8.41 days. With some weather stations the regression model predicts the last spring frost date
quite a bit earlier than it actually occurs (like Talladega, AL). With other stations (such as Hawkinsville, GA) the model predicts the last spring frost date much later than it actually occurs. With many other stations (such as Farmville, VA) the predicted date of last spring frost is much closer to the date that the frost occurs. If we express the absolute residual of a weather station in terms of the number of standard errors it is from the value predicted by the regression model, then it is the standardized residual. The last three columns in table 17.3 show the predicted date of last spring frost, the absolute residual, and the standardized residual for some of the weather stations (with the sign of the residual indicating whether the point is above or below the best-fitting regression line).

The standard error of the estimate can be calculated in several ways, using the total error, the residuals, or the actual and predicted values of Y:

    SE = √[Σyᵤ² / (n - 2)] = √[Σ(RES)² / (n - 2)] = √[Σ(Y - Ŷ)² / (n - 2)]    (17.12)

where:  SE = standard error of the estimate
        n - 2 = degrees of freedom

Calculation of total and relative error for the last spring frost example illustrates the different uses of these regression indices. The unexplained variation or residual sum of squares can be calculated from the residuals or from the actual and predicted Y values:

    Σyᵤ² = Σ(RES)² = Σ(Y - Ŷ)² = 5233

It can also be calculated from the total and explained variation:

    Σyᵤ² = Σy² - Σyₑ² = 18730 - 13497 = 5233

These calculations show that of the 18730 units of total variation in the response or dependent variable, 5233 units of variation (average date of last spring frost) are left unaccounted for by the single explanatory (independent)
[Map of the southeastern U.S. study region (Gulf of Mexico at lower left; scale bar 0-500 kilometers, north arrow). Each weather station is symbolized by the magnitude of its standardized (absolute) residual: less than -1.00 (below -8.41 days), -1.00 to 1.00 (-8.41 to 8.41 days), or greater than 1.00 (above 8.41 days).]
FIGURE 17.5
Pattern of Residuals: Regression Model Using Latitude to Predict Average Date of Last Spring Frost
Chapter 17 .A. Simple Linear Regression 261

variable of latitude. To see how much of this total error is associated with a typical weather station in the southeast, the standard error of the estimate that we were discussing earlier is now calculated:

SE = √(Σy_e² / (n - 2)) = √(5233 / 74) = 8.41

The standard error of the estimate produces standardized residual values by converting absolute residuals into relative residuals. This simple procedure is analogous to generating Z-scores for a distribution:

SRES = RES / SE    (17.13)

where: SRES = standardized residual

In this form, standardized residuals relate the magnitude of each residual to the size of the typical residual, represented by the standard error. They can be interpreted as the typical amount of error associated with a value, measured in standard error units.

Look again at the map of residuals (fig. 17.5). The map displays the spatial pattern of error in the regression model when the single explanatory variable of latitude is used to predict the date of last spring frost across the region. All weather stations with a standardized residual more than one standard error from the predicted value (either positive or negative) are highlighted. Do you see any non-random pattern in the very large positive or very large negative residuals?

Many of the largest negative residuals are located in weather stations relatively close to the Atlantic coast. That is, many coastal communities seem to have last spring frost dates that occur more than a week earlier in the spring than predicted by the regression model. Conversely, almost all of the largest positive residuals are weather stations located well inland, often in mountainous areas where the upper piedmont merges with the foothills of the Blue Ridge Mountains (places like western North Carolina and northern Georgia).

Clearly, the spatial patterns of the largest residuals are not random. As we consider making improvements to the regression model (hopefully increasing the magnitude of the coefficient of determination (r²) above .721), distance from the coast and elevation both seem to be excellent candidates for additional explanatory or predictor variables. In fact, we will statistically examine these two variables (both individually and collectively) in the next chapter.

17.4 INFERENTIAL USE OF REGRESSION

Sometimes the results from simple linear regression can be tested for significance and results from a sample inferred to the population from which the sample was drawn. This can be done in several ways. For example, inferences can be made concerning the two parameters that define the regression line, the slope and the Y-intercept.
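The standard error figure of 8.41 used throughout this discussion follows directly from the residual sum of squares and the degrees of freedom. As a quick check, here is a short Python sketch (illustrative only, not part of the text):

```python
import math

ess = 5233                      # residual (unexplained) sum of squares, from the text
n = 76                          # number of weather stations in the sample
se = math.sqrt(ess / (n - 2))   # standard error of the estimate (equation 17.12)

print(round(se, 2))             # 8.41
```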

TABLE 17.3
Regression Model Parameters for Selected Weather Stations: Latitude and Last Spring Frost Example
(values may differ slightly from computer output due to rounding)

                             X:          Y: Average       Ŷ: Predicted date of
                                         date of last     last spring frost       Absolute         Standardized
     Weather station         Latitude    spring frost     (from regression model) residual (RES)   residual (SRES)
 1   Troy, AL                31.80°       74.41            74.31                    0.10             0.01
 2   Union Springs, AL       32.02°       82.27            75.88                    6.39             0.76
 3   Talladega, AL           33.42°       97.83            85.84                   11.99             1.43
 ...
76   Farmville, VA           37.33°      110.62           113.66                   -3.04            -0.37

a = -152.01
b = 7.1171

Analysis of variance (ANOVA) (provided by computer output):

Total variation in the response or dependent variable = 18,730
Explained variation (regression model sum of squares) = 13,497
Unexplained variation (residual sum of squares) = 5,233
R-squared (proportion (percent) of total variation in the response or dependent variable accounted for by the explanatory or predictor variable) = 13,497/18,730 = 0.721 (72.1%)
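The RES and SRES columns of table 17.3 can be reproduced from the fitted coefficients alone: predict each station's frost date, take the residual, and divide by the standard error of 8.41 days (equation 17.13). This Python sketch is illustrative, not from the text, and last digits may differ slightly from the computer output because of rounding:

```python
a, b, SE = -152.01, 7.1171, 8.41   # intercept, slope, standard error (days)

# (station, latitude X, observed date of last spring frost Y) from table 17.3
stations = [("Troy, AL", 31.80, 74.41),
            ("Union Springs, AL", 32.02, 82.27),
            ("Talladega, AL", 33.42, 97.83)]

table = {}
for name, x, y in stations:
    predicted = a + b * x          # point on the best-fitting regression line
    res = y - predicted            # absolute residual, in days
    table[name] = (round(predicted, 2), round(res, 2), round(res / SE, 2))

print(table)
```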

Inferential testing can also be applied to the coefficient of determination, measuring the statistical significance of the strength of the regression model. However, no matter which inferential procedure is chosen, a stringent set of assumptions apply.

Regression has more assumptions than the other inferential procedures discussed in the text. All of the assumptions that apply to Pearson's correlation also apply to regression. One basic requirement is that both variables must be measured on an interval or ratio scale (neither ordinal nor categorical variables can be used). Both latitude and date of last spring frost meet this condition. Also, random samples are required, using one of two modeling schemes. In a fixed-X model, the investigator preselects certain values of the explanatory or independent variable (X), perhaps as part of a controlled experiment. Sample values for the dependent variable (Y) are then derived using a component of randomness. In the random-X model, sample values for both the explanatory and response (dependent) variables are chosen at random. For the last spring frost problem, the weather stations are chosen randomly from throughout the southeastern U.S. study region.

Since the simple linear regression model places a best-fitting line through a scatterplot, the variables are assumed to have a linear relationship. If the association between variables is not reasonably straight, then the simple linear model should not be used. In some cases one or both of the variables can be transformed to create a reasonably straight scatterplot, but discussion of transformations is beyond the scope of this book. With regard to the latitude and last spring frost example problem, we have already confirmed from the scatterplot in the correlation chapter (fig. 16.4) that the relationship between variables is linear.

Another important requirement is the equal variance assumption. Probably the simplest way to check whether this assumption is met is to plot either the absolute or standardized residual values against the predicted values of Y. There should be a similar amount of scatter about the line whatever the magnitude of the predicted value. That is, this scatterplot should not have any notable features. There should be no significant direction, bend, or shape to the residual plot. Also, there should be no extreme or extraordinary points (no outliers or outlier groups) on the residual plot. The scatterplot of residuals against predicted date of last spring frost (figure 17.6) appears very nondescript and featureless, suggesting the equal variance assumption has been met. Meeting these requirements also makes it virtually certain that variables X and Y are themselves normally distributed. Fortunately, most statistical packages provide the option to plot the residuals against the predicted Y-values.

Both the Y-intercept and the slope of the sample regression model can be tested for statistical significance. Inferential testing of the regression parameters considers the Y-intercept (α) and slope (β), which define the population regression line relating the independent and dependent variables:

Y = α + βX    (17.14)

The sample Y-intercept (a) and sample slope (b) are the best estimators of their respective population parameters.


[Scatterplot: standardized residuals (vertical axis, -2 to 2) plotted against the predicted value of Y, the average date of last spring frost (horizontal axis, 60 to 120).]

FIGURE 17.6
Scatterplot of Residuals Against Predicted Values of Y (Average Date of Last Spring Frost)

Inferential testing of the sample regression line parameters tells us the likelihood that the Y-intercept and slope are statistically significant. Even if the Y-intercept lies well outside the domain of X, testing its significance is still worthwhile. In the latitude and last spring frost example, the Y-intercept is -152.01, the associated t-value is -8.68, and the p-value is 0.000. The slope is 7.1171, with associated t-value of 13.81 and p-value of 0.000. These figures are found in the "analysis of variance" portion of the computer output. We conclude that the regression model contains parameters that are statistically significant. That is, if we collected data from all weather stations in the southeastern U.S., we would find a significant nonrandom slope and Y-intercept between latitude and date of last spring frost.

Inferential significance tests can also be applied to the population coefficient of determination (ρ²), using the sample coefficient of determination (r²) as the best estimator. Analysis of variance (or the F statistic) is used to evaluate the significance of r². In this context, the null hypothesis is that the population coefficient of determination is not significantly greater than zero (H0: ρ² not > 0), and the alternate hypothesis is the converse (HA: ρ² > 0).

Recall that the total variation in the dependent variable is equal to the sum of two components, the explained variation and the unexplained variation (equation 17.6). It is also true that the coefficient of determination is the ratio of explained variation to total variation in Y (equation 17.8). Thus, returning to the bucket and sponge analogy, testing the significance of r² to determine if it is significantly greater than zero is equivalent to testing whether the sponge (explained variation) removes a significant amount of water (total variation) from the bucket.

These components of variation in Y may be expressed in terms of the "sum of squares," illustrating the direct equivalence between regression and analysis of variance:

Σy²   = TSS (total sum of squares)
Σy_r² = RSS (regression or explained sum of squares)
Σy_e² = ESS (error, residual, or unexplained sum of squares)

Therefore: TSS = RSS + ESS, and

r² = Σy_r² / Σy² = RSS / TSS
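The sum-of-squares identity, the r² ratio, and the F-ratio reported in the ANOVA output can all be checked from the three variation components of the frost example. This Python sketch is illustrative (not from the text); note that, like the text, it computes F from the rounded r² of 0.721:

```python
tss = 18730     # total sum of squares (total variation in Y)
rss = 13497     # regression (explained) sum of squares
ess = 5233      # error / residual (unexplained) sum of squares
n = 76          # number of weather stations

assert tss == rss + ess             # TSS = RSS + ESS

r2 = round(rss / tss, 3)            # coefficient of determination, 0.721
f_ratio = r2 * (n - 2) / (1 - r2)   # F statistic (see equation 17.15)
print(r2, round(f_ratio, 2))        # 0.721 191.23
```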
Simple Linear Regression Analysis

Primary Objective: Determine if an independent (predictor) variable (X) accounts for a significant portion of the total variation in a dependent (response) variable (Y)

Requirements and Assumptions:
1. Variables are measured on interval or ratio scale
2. Fixed-X model: Values of independent variable (X) chosen by the investigator, and values of dependent variable (Y) randomly selected for each X
   Random-X model: Values of both X and Y randomly selected
3. Variables have a linear association
4. For every value of X, the distribution of residuals (Y - Ŷ) should be normal, and the mean of the residuals should equal zero
5. For every value of X, the variance of residual error is equal (homoscedastic)
6. The value of each residual is independent of all other residual values (no autocorrelation)

Hypotheses:
H0: ρ² = 0
HA: ρ² ≠ 0

Test Statistic:
F = r²(n - 2) / (1 - r²)

TABLE 17.4
Summary Results of Simple Linear Regression: Latitude and Average Date of Last Spring Frost

Dependent (response) variable is average date of last spring frost.
Independent (predictor) variable is latitude.
The regression equation is:
average date of last spring frost = -152 + 7.12 (latitude)

Variable    Coefficient   SE (coefficient)   t-ratio   p-value
Intercept   -152.01       17.50              -8.68     0.000
Latitude       7.1171      0.5152            13.81     0.000

SE = 8.41    r² = 0.721 (72.1%) (coefficient of determination)    r = 0.849 (84.9%) (simple correlation coefficient)

Analysis of Variance:
Source             DF   Sum of squares   Mean square   F-ratio   p-value
Regression model    1   13,502           13,502        191.23    0.000
Residual error     74    5,228               71
TOTAL              75   18,730

The F statistic from ANOVA is expressed in terms of the coefficient of determination:

F = r²(n - 2) / (1 - r²)    (17.15)

This F statistic is the square of the t statistic used in the previous chapter to test the significance of r, the simple correlation coefficient (equation 16.9). In the last spring frost example the significance testing results in the following:

F = r²(n - 2) / (1 - r²) = .721(76 - 2) / (1 - .721) = 191.23
p = 0.000

It can be concluded that the explanatory or independent variable (latitude) accounts for a significant (non-zero) amount of the total variation in the response or dependent variable (average date of last spring frost). In conclusion, the summary results of this simple linear regression of latitude as a predictor of last spring frost data are presented in table 17.4.

17.5 EXAMPLE: SIMPLE LINEAR REGRESSION - LAKE EFFECT SNOW IN NORTHEASTERN OHIO

Example: Simple Linear Regression-Lake Effect Snow in Northeastern Ohio

One of the mechanisms influencing precipitation on land areas is the presence of nearby large bodies of water. These water bodies provide an important source of moisture that produces sizable amounts of precipitation when brought over a land area by prevailing winds. However, as distance from the body of water increases, the water's influence decreases, and precipitation levels generally decline. Thus, areas adjacent to water tend to have higher amounts of precipitation than do areas farther away.

An interesting example of this land-water relationship occurs in regions adjacent to the southern and eastern shores of the Great Lakes in New York, Pennsylvania, Ohio, and Indiana. The presence of the Great Lakes and a strong northwesterly wind flow during the winter combine with other

[Diagram: colder Arctic air flows across the warm lake, picking up heat and moisture; in the warmer air above, clouds form beneath a capping inversion, and snow falls over the cool land downwind of the lake.]
FIGURE 17.7
Formation of Lake Effect Snow

influences to create an important regional climatic phenomenon called "lake effect snow" (fig. 17.7). Using correlation-regression analysis, the spatial pattern of snowfall in extreme northeastern Ohio along the shore of Lake Erie is examined.

Snowfall data from this region seems to indicate that distance from Lake Erie strongly affects the levels of average annual snowfall. To test this relationship, a systematic sample of 38 locations is taken from an isoline map showing average annual snowfall amounts for a set of counties in extreme northeastern Ohio. For each of the 38 sample points, the magnitude of two variables is recorded: (1) average snowfall, interpolated from the map of isohyets, and (2) straight-line distance from Lake Erie. When these data are graphed, the following scatterplot emerges (fig. 17.8). Note that distance from Lake Erie is the independent (predictor) variable (placed on the X axis) and average annual snowfall is the dependent (response) variable (placed on the Y axis). In general, these two variables seem to indicate a negative correlation: as distance from the Lake increases, average annual snowfall seems to decrease.

However, the overall pattern of scatterplot points is clearly not linear. You can see that average annual snowfall amounts for the sample points closest to the Lake are slightly lower than snowfall levels a bit farther from the Lake. A more complicated curvilinear relationship seems to exist. Perhaps a simple linear regression model should not be applied, as we might be violating an important statistical test assumption. Before deciding what to do, we need to take a closer look at the climatological processes which seem to be operating in this lake effect snow region.

During the early winter, the water temperature of Lake Erie is warmer than adjacent land areas. As cold air currents from the northwest pass over this relatively warmer water, the air is heated and can hold more moisture. As the moisture-laden air continues moving onto cooler land surfaces south and east of the Lake, it is cooled, the moisture often condenses, and precipitation occurs. Immediately adjacent to the Lake, the air temperature is often warm enough for the precipitation to fall as rain. Somewhat further from the Lake (about 7-10 miles) the air is frequently cold enough to create snowfall.

Strictly speaking, the straightforward application of Pearson's correlation and simple linear regression may not be the optimal methodology for this complete curvilinear set of data. Some alternate approaches could be used to continue the investigation. One alternative is to apply a nonlinear correlation model matching the curvilinear pattern on the scatterplot. While fully valid, this methodology is beyond the introductory level of this text. Another possibility would be to convert the nonlinear data into linear form by transforming one or both of the coordinate axes and then apply Pearson's correlation to this transformed data. Data transformation has been used effectively in many geographic studies, but also has a level of statistical sophistication beyond the introductory level. Also, a transformation does not seem fully appropriate in this problem setting because the nonlinear portion of

[Scatterplot: average annual snowfall in inches (vertical axis, 50 to 110) plotted against distance from Lake Erie in miles (horizontal axis, 0 to 40).]

FIGURE 17.8
Original Study Area: Scatterplot Showing Relationship Between Average Annual Snowfall and Distance from Lake Erie


the scatterplot is located only in that narrow band of distances from zero to about six miles from Lake Erie.

Yet another alternative is to eliminate that portion of the study area less than about six miles from the Lakeshore. If we select this alternative, the remaining 33 sample points in the revised study region generate a scatterplot that shows a linear relationship between distance from the Lake and amount of snowfall. This strategy allows us to not only meet the linearity assumption of Pearson's correlation and simple linear regression, but also can be justified on a climatological basis, when the domain is restricted to a range from six to forty miles from the Lake.

We are now ready to apply correlation-regression analysis to evaluate the relationship of distance from Lake Erie and average annual snowfall. The location of 33 sample points in the revised study area is shown in figure 17.9, along with the latitude and longitude values (expressed both in degrees-seconds and in degrees-hundredths). Superimposed on this sample point grid is the isohyet pattern of average annual snowfall. Table 17.5 provides all pertinent data for each of the 33 sample points needed to run the correlation and regression analyses. Finally, figure 17.10 shows the 33 sample points in the revised study region (notice their linear alignment in the domain range of six to forty miles), along with the best-fitting regression line and its formula.

The summary results of the correlation-regression analysis are provided in table 17.6. The regression equation (Y = a + bX) is: [snowfall = 114.6 - 1.719 (distance from Lake Erie)]. The Y-intercept is 114.6 inches (correct for the simple linear regression model but not appropriate for direct model interpretation, since zero miles from Lake Erie is beyond the domain of the study area). The regression model slope is -1.719, predicting that for each additional mile of distance from Lake Erie, average annual snowfall will decrease by 1.719 inches.

Both the Y-intercept and the slope are significantly different from zero at all levels of confidence. The t-ratio value for the Y-intercept is 43.26 with an associated p-value of 0.000, and the t-ratio value for the slope is -15.13 with an associated p-value of 0.000.

The standard error of the regression model is 5.80. This indicates that the typical difference between actual average annual snowfall and the predicted average annual snowfall (that is, the typical residual) is 5.80 inches of snowfall. This standard error can be interpreted as how precisely the simple linear regression model predicts average annual snowfall in this region.

[Map of the revised northeastern Ohio study area: 33 systematically sampled point locations with superimposed isolines of average annual snowfall (60, 70, and 80 inches labeled); latitude lines marked from 41.90° (41°54' N) to 41.00° (41°00' N); scale bar 0-40 kilometers, north arrow.]
FIGURE 17.9
Revised Northeastern Ohio Study Area: Average Annual Snowfall, in Inches
Source: Kent, R. B. (editor), 1992

TABLE 17.5
Revised Study Area: Northeastern Ohio Lake Effect Snow Data

Systematic        X: Distance from     Y: Average annual
sample of         Lake Erie            snowfall
observations      (in miles)           (in inches)          Latitude    Longitude
 1                  7.7                104                  41.75       -81.05
 2                  9.9                102                  41.75       -80.95
 3                 11.2                 99                  41.75       -80.85
 4                 12.9                 92                  41.75       -80.75
 5                  9.9                 95                  41.60       -81.45
 6                  8.3                102                  41.60       -81.35
 7                 10.9                105                  41.60       -81.25
 8                 13.6                101                  41.60       -81.15
 9                 15.8                 96                  41.60       -81.05
10                 17.1                 90                  41.60       -80.95
11                 18.8                 82                  41.60       -80.85
12                 20.6                 78                  41.60       -80.75
13                 11.4                 86                  41.45       -81.45
14                 14.9                 85                  41.45       -81.35
15                 17.5                 82                  41.45       -81.25
16                 20.1                 80                  41.45       -81.15
17                 22.8                 78                  41.45       -81.05
18                 25.0                 71                  41.45       -80.95
19                 26.3                 69                  41.45       -80.85
20                 28.5                 67                  41.45       -80.75
21                 18.8                 68                  41.30       -81.45
22                 21.0                 67                  41.30       -81.35
23                 25.4                 66                  41.30       -81.25
24                 28.0                 64                  41.30       -81.15
25                 30.7                 62                  41.30       -81.05
26                 32.4                 60                  41.30       -80.95
27                 34.6                 59                  41.30       -80.85
28                 36.4                 58                  41.30       -80.75
29                 26.7                 60                  41.15       -81.45
30                 28.5                 59                  41.15       -81.35
31                 32.0                 58                  41.15       -81.25
32                 35.5                 57                  41.15       -81.15
33                 37.7                 57                  41.15       -81.05
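As a cross-check on the fitted coefficients, the slope and intercept can be recomputed from the 33 (distance, snowfall) pairs in table 17.5 with the usual least-squares formulas, b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b·x̄. This Python sketch is illustrative, not part of the text; the results should agree with the reported values of about 114.6 and -1.72:

```python
# (distance from Lake Erie in miles, average annual snowfall in inches)
# for the 33 systematic sample points in table 17.5
data = [(7.7, 104), (9.9, 102), (11.2, 99), (12.9, 92), (9.9, 95),
        (8.3, 102), (10.9, 105), (13.6, 101), (15.8, 96), (17.1, 90),
        (18.8, 82), (20.6, 78), (11.4, 86), (14.9, 85), (17.5, 82),
        (20.1, 80), (22.8, 78), (25.0, 71), (26.3, 69), (28.5, 67),
        (18.8, 68), (21.0, 67), (25.4, 66), (28.0, 64), (30.7, 62),
        (32.4, 60), (34.6, 59), (36.4, 58), (26.7, 60), (28.5, 59),
        (32.0, 58), (35.5, 57), (37.7, 57)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)  # covariation
sxx = sum((x - mean_x) ** 2 for x, _ in data)            # variation in X

b = sxy / sxx            # slope: roughly -1.72 inches of snowfall per mile
a = mean_y - b * mean_x  # intercept: roughly 114.6 inches
print(round(a, 1), round(b, 2))
```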

TABLE 17.6
Summary Results of Simple Linear Regression: Distance from Lake Erie and Average Annual Snowfall

Dependent (response) variable is average annual snowfall.
Independent (predictor) variable is distance from Lake Erie.
The regression equation is:
snowfall = 114.6 - 1.719 (distance from Lake Erie)

Variable    Coefficient   SE (coefficient)   t-ratio   p-value
Intercept   114.582       2.648              43.26     0.000
Slope        -1.7192      0.1137            -15.13     0.000

SE = 5.80    r² = 0.881 (88.1%) (coefficient of determination)    r = -0.9385 (simple correlation coefficient)

Analysis of Variance:
Source             DF   Sum of squares   Mean square   F-ratio   p-value
Regression model    1   7,699.1          7,699.1       228.82    0.000
Residual error     31   1,043.1             33.6
TOTAL              32   8,742.2

The coefficient of determination (r²) is .881 and the correlation coefficient (r) is -.9385. The strongly negative correlation coefficient and coefficient of determination make it clear that a very high percentage (88.1%) of the variation in average annual snowfall across this study area can be accounted for by distance from Lake Erie.

In the ANOVA portion of table 17.6 we can see the actual amounts of variance explained and not explained by the regression model. From a total variation of 8742.2, 88.1% (7699.1) is accounted for by the model, while 1043.1 is not explained and is therefore considered residual error. This leads to an extremely large F-ratio value of 228.82, with an associated p-value of 0.000. We can be virtually 100% confident that the simple linear regression model is statistically significant.

Suppose you own an undeveloped piece of property in this northeastern Ohio region, and are planning to build a retirement home at this location. Before proceeding with construction, you want to know how much snowfall to expect each winter. With no official weather station nearby, the best estimate available is from this regression model. If your property is 20 miles (as the crow flies) from Lake Erie, what is the predicted average annual snowfall?

The best-fitting regression line equation is: Y = 114.582 - 1.7192X. Since your property is 20 miles from the Lake, X = 20. Substituting into the equation gives: Y = 114.582 - 1.7192(20) = 80.20. You can expect about 80 inches of snowfall each winter on your property.
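The retirement-home prediction above is a single substitution into the fitted equation. As a sketch (Python, illustrative only):

```python
a, b = 114.582, -1.7192    # fitted intercept and slope from table 17.6

distance = 20              # miles from Lake Erie, as the crow flies
snowfall = a + b * distance
print(round(snowfall, 2))  # 80.2 inches of snowfall expected per winter
```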

[Scatterplot of the 33 revised-study-area points: average annual snowfall in inches (vertical axis, 50 to 110) against distance from Lake Erie in miles (horizontal axis, 10 to 40), with the best-fitting regression line Y = 114.6 - 1.719X drawn through the points.]
FIGURE 17.10
Best-fitting Regression Line Placed on Scatterplot: Distance from Lake Erie and Average Annual Snowfall


One important issue in this geographic research problem is the lack of detailed site-specific snowfall data. This multi-county region has very few official weather stations that are monitoring snowfall. All that is available is a generalized isoline map taken from an atlas. It would certainly be preferable to measure snowfall amounts directly at monitoring stations throughout the study region, but this strategy is not practical for us. Because point-specific data are not available, the precision of the snowfall figures (and hence the quality of the regression model) is only as good as the procedure used to place the isolines on the map in the atlas, which we are not able to assess.

KEY TERMS

absolute and relative residual, 258
coefficient of determination, 258
domain of X, 254
equal variance assumption, 262
explained and unexplained variation in Y, 256
least-squares regression line, 253
linear relationship assumption, 262
normally-distributed variables assumption, 262
predictor, explanatory (or independent) variable, 253
residual mapping, 259
response (or dependent) variable, 253
simple linear regression (bivariate regression), 253
slope (regression coefficient), 253
standard error of the estimate, 259
standardized residual, 260
total sum of squares, 257
total variation in Y, 256
Y-intercept, 253

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:
1. Explain the nature and purposes of regression analysis.
2. Recognize geographic problems for which regression analysis is a more useful statistical procedure than correlation.
3. Understand how the form or nature of relationship between variables is measured with a least-squares regression line.
4. Distinguish between positive (direct), negative (inverse), and no relationship between variables and identify the corresponding slope of the regression line.
5. Explain how strength of relationship between variables is measured, including an understanding of total variation, explained and unexplained variation, covariation, and the coefficient of determination.
6. Explain the importance of residual analysis in regression for geographic research (including an explanation of the value of residual maps).
7. List (and briefly explain) the assumptions that must be met when using regression for inferential problem solving.

REFERENCES AND ADDITIONAL READING

Abler, R. F., J. S. Adams, and P. R. Gould. Spatial Organization: The Geographer's View of the World. Englewood Cliffs, NJ: Prentice Hall, 1971.
Burt, J. E., G. M. Barber, and D. L. Rigby. Elementary Statistics for Geographers. 3rd ed. New York: Guilford Press, 2009.
Draper, N. R. and H. Smith. Applied Regression Analysis. 3rd ed. New York: Wiley, 1998.
Ebdon, D. Statistics in Geography: A Practical Approach. Oxford: Basil Blackwell, 1985.
Ferguson, R. Linear Regression in Geography. CATMOG No. 15. Norwich, England: Geo Abstracts, 1976.
Montgomery, D. C., E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis. 5th ed. New York: Wiley, 2012.

The two examples from this chapter deal with last spring frost data and annual snowfall data as related to lake-effect snow. Data for both of these climatic variables are found at the National Climatic Data Center (NCDC), www.ncdc.gov. Both of these climatological topics are widely discussed in physical geography and climatology textbooks. The source for lake effect snow in northeastern Ohio is: Kent, R. B., ed. Region in Transition: An Economic and Social Atlas of Northeastern Ohio. Akron: The University of Akron Press, 1992.
Examples of Multivariate Problem-Solving in Geography

18.1 The Basics of Multiple Regression
     Example: Predicting the Date of Last Spring Frost in the Southeastern United States
     Example: Predicting Obesity Levels for Counties of the U.S.
     Example: Predicting Life Expectancy for Countries of the World
18.2 The Basics of Cluster Analysis
     Example: Grouping African Countries Based on Factors Influencing Life Expectancy
     Example: Grouping States using Basic Health Indicators
     Example: Grouping Alabama Counties using Basic Health Indicators

18.1 THE BASICS OF MULTIPLE REGRESSION

Recall at the start of the text we identified geography as an integrative science that attempts to explain and predict the spatial distribution and variation of human activity and physical features on the earth's surface. The research process in geography entails asking a number of where, why, and what-to-do questions about spatial patterns, their underlying spatial process, and possible spatial policy and planning strategies that might result. Multiple regression is certainly one of the most popular statistical tools that geographers use to answer these research questions, because most realistic problems in geography are complex and involve multiple causes.

Our primary intentions here are to introduce the basic objectives and principles of multiple regression and demonstrate how geographers develop and apply multiple regression models. Think of this as an initial exposure to multiple regression from a geographer's perspective. We highlight the geographic problem-solving capabilities of multiple regression and emphasize how a well-designed multiple regression model can help better explain (and predict) why a particular spatial pattern exists. Quite simply, how can you use multiple regression effectively to solve geographic problems and make spatial policy decisions?

Please note that we have given less attention to the statistical procedures and technical details of multiple regression itself. We are not suggesting that these procedures and details of multiple regression are unimportant. Rather, we are more concerned that you get a basic feel for how multiple regression can help you make spatial predictions and solve problems. Many intermediate and advanced textbooks provide additional statistical details on multiple regression; some of these are referenced at the end of this chapter.

Fortunately, the basic logic and principles of multiple regression are very similar to those found in simple linear regression. The underlying objective is the same: find a least-squares solution whose coefficients minimize the sum of the squared residuals. That is, find a regression equation that minimizes the amount of unexplained variation. The difference is that we have more coefficients in the multiple regression model. More specifically, the model contains one or more additional predictor (independent) variables, which can account for more variation of the dependent variable. Think back to the bucket and sponge analogy from the last chapter. Multiple regression
Part VI • Statistical Relationships between Variables

is equivalent to using additional sponges to absorb a greater volume of water from the bucket.

As in simple linear regression, the selection of independent variables for multiple regression is an important step in applying this statistical method properly. Since use of regression suggests a functional relationship between variables, each independent variable needs to be evaluated carefully to ensure that it shows a logical relationship to the dependent variable.

Multivariate regression is similar in structure to simple linear regression analysis. The statistical technique measures the direction and strength of the functional relationships linking the independent variables to the dependent variable, and the remaining error is analyzed. Multiple regression results also provide information on the absolute and relative ability of each predictor variable to explain the dependent (response) variable, while holding the effect of the other predictor variables constant.

The assumptions and requirements for multiple regression are very similar to those which apply in simple linear regression. These conditions are discussed in the previous chapter (section 17.4). The assumptions include: a linear relationship between each predictor variable and the dependent variable, normality of variables, equal variance of residuals for different values of the predictor variables, independence of errors in the underlying regression model, and random samples drawn from identifiable populations. It is important to check that these various assumptions and conditions are met, at least to a reasonable extent. However, to save space in presenting the following multiple regression examples, many details and various checks of model validity have been completed but are not shown.

Multicollinearity is a problem that often arises when using multiple regression. The term applies to situations in which different predictor or explanatory variables are not independent, but are correlated with one another. When multicollinearity is present, the regression coefficients may be both imprecise and inaccurate, and it becomes difficult to assign change in the response (dependent) variable to any one of the predictor variables. It is always a good idea to calculate all correlation coefficients between variables (that is, calculate the correlation matrix) before proceeding with the multiple regression analysis. As a general guideline, whenever the correlation between two predictor variables is above 0.8, you should recognize that collinearity could be an issue. If a particular correlation coefficient is above 0.95, collinearity becomes a serious problem, and one of the two highly correlated predictors should probably be removed from the analysis.

Reviewing quickly, in simple linear regression a best-fitting line is generated to represent the prevailing trend of points on a two-dimensional scatterplot:

Y = a + bX     (18.1)

In those multivariate problems having two predictor variables, data values are located in three-dimensional space and a best-fitting plane can be derived:

Y = a + b1X1 + b2X2     (18.2)

For problems having three or more independent variables, the same procedure is applied algebraically in multi-dimensional space:

Y = a + b1X1 + b2X2 + ... + bnXn     (18.3)

where:  Y = dependent (response) variable
        X1, ..., Xn = independent (predictor) variables
        a = Y-intercept or constant
        b1, ..., bn = regression coefficients

In both the simple linear and multivariate equations, the constant or a value shows the value of Y when the values of all X variables are zero. The major difference between the equations lies in the regression coefficients. In the two-variable problem, one regression coefficient or b-value is produced to show the influence of a single independent (predictor) variable on the dependent (response) variable. In the multivariate case, however, a regression coefficient (bi) is calculated for each independent variable (Xi). Each coefficient indicates the absolute influence of that independent variable on the dependent variable. Like the simple regression model, each regression coefficient shows the change in Y associated with a unit change in the given X variable.

However, because multiple regression uses more than one predictor variable to explain a response variable, determining the relative importance of each variable is a major concern. At first, it may appear that the regression coefficients could be used directly to measure the importance of the independent variables. However, because the magnitude of the measurement units influences these parameters, they cannot be used directly. This situation is analogous to the problems encountered when using the standard deviation to compare relative levels of variation between two or more variables. As discussed in chapter 3, when variables are measured on different scales, the coefficient of variation provides a better measure for comparing the relative variation between variables. In multiple regression, a valid comparison of the relative ability of each independent variable to explain the variation in the dependent variable is accomplished by computing standardized regression coefficients. These coefficients serve as relative indices of strength, allowing a direct comparison of the influence of each independent variable in accounting for the variation in the dependent variable. Among the set of independent variables, the one with the largest standardized regression coefficient produces the strongest relationship to the dependent variable.
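The standardization just described can be sketched in a few lines of code. The data below are hypothetical, and the rescaling rule (multiply each slope b_k by s_Xk / s_Y) is the conventional formula for standardized coefficients; treat this as an illustrative sketch rather than one of the chapter's worked examples.

```python
import numpy as np

def standardized_coefficients(X, y):
    """Fit OLS with an intercept, then rescale each slope by
    s_Xk / s_Y so predictors in different units are comparable."""
    A = np.column_stack([np.ones(len(y)), X])     # design matrix with intercept
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    slopes = coefs[1:]                            # drop the intercept
    return slopes * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Hypothetical data: x1 is measured in large units and x2 in tiny units,
# so the raw slopes (about 0.005 and 30) are not directly comparable
rng = np.random.default_rng(1)
x1 = rng.normal(0, 100, 200)     # large measurement units
x2 = rng.normal(0, 0.1, 200)     # small measurement units
y = 0.005 * x1 + 30.0 * x2 + rng.normal(0, 0.5, 200)
betas = standardized_coefficients(np.column_stack([x1, x2]), y)
print(betas.round(3))            # on a common scale, x2 dominates
```

After standardization, the second predictor shows the larger influence even though its raw slope and raw units differ wildly from the first predictor's.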
Chapter 18 • Examples of Multivariate Problem-Solving in Geography

Multiple Regression Analysis

Primary Objective: Determine if multiple (2 or more) independent (predictor) variables (X1, X2, ..., Xn) account for a significant portion of the total variation in a dependent (response) variable (Y)

Requirements and Assumptions:
1. Variables are all measured on an interval or ratio scale
2. All independent (predictor) variables have a linear association with the dependent (response) variable and with one another
3. Samples of all variables taken from normally-distributed populations
4. For all independent (predictor) variables, the variance of residuals is equal for all values of the dependent (response) variable (homoscedasticity)
5. Residual errors are independent of one another in the underlying regression model (no autocorrelation)

Hypotheses:
H0: ρ² = 0
HA: ρ² ≠ 0

In our earlier discussion of simple linear regression, we discussed two different indices that provide absolute and relative measures of strength. Explained variation measures the total level of variation in the dependent variable that is accounted for by the single independent variable. This explained variation is then converted to a relative index called the "coefficient of determination" or r². Similarly, in multivariate regression, a relative index called the coefficient of multiple determination (R²) is calculated:

R² = Σŷ² / Σy²     (18.4)

where:  R² = coefficient of multiple determination
        Σŷ² = explained variation
        Σy² = total variation in Y

This index measures the ratio of the variation explained by the set of independent variables relative to the total variation in the dependent variable. When the R² index is multiplied by 100, the result is interpreted as the percentage of total variation explained.
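Equation 18.4 translates directly into code. The data below are hypothetical, and the adjustment formula 1 - (1 - R²)(n - 1)/(n - k - 1) is the standard one (the chapter reports adjusted values in its examples but does not print the formula, so treat it as an assumption of this sketch):

```python
import numpy as np

def multiple_r2(y, y_hat, k):
    """Coefficient of multiple determination (eq. 18.4) plus the
    adjusted version, which penalizes each added predictor."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    total = np.sum((y - y.mean()) ** 2)           # total variation in Y
    explained = np.sum((y_hat - y.mean()) ** 2)   # explained variation
    r2 = explained / total
    n = len(y)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

# Hypothetical two-predictor fit, just to exercise the formula
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1.0, 80)
A = np.column_stack([np.ones(80), X])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
r2, adj = multiple_r2(y, A @ coefs, k=2)
print(round(r2, 3), round(adj, 3))
```

Note that the adjusted value is always slightly below the unadjusted R² whenever at least one predictor is used and the fit is imperfect.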

Example: Predicting the Date of Last Spring Frost in the Southeastern United States

In the previous chapter we developed a simple regression model using latitude as the only predictor (independent) variable of the average date of last spring frost in the southeastern United States. The model is quite effective. As summarized in table 17.4, the regression equation is: (predicted average date of last spring frost) = -152 + 7.12 (latitude).

The coefficient of determination (r²) is 0.721, indicating that latitude accounts for 72.1% of the total variation in (Y), the average date of last spring frost.

At that time, based on the characteristics of the non-random map pattern of residuals, we suggested that both distance from the coast and elevation seem to be excellent candidates for additional explanatory or predictor variables. What follows is a summary table showing a multiple regression model predicting the average date of last spring frost as a function of all three predictor (independent) variables: latitude, distance from the coast, and elevation (table 18.1). The regression equation is:

(predicted average date of last spring frost) =
    -115 + 5.77 (latitude) + 0.00415 (elevation) + 0.0461 (distance from coast)

This multiple regression model meets the least-squares criterion in the same way as the simple regression model: the sum of the squared residuals is minimized. However, interpretation of the regression coefficients is somewhat different than in simple regression.

• Each additional degree of latitude is associated with an increase of about 5.77 days (later in the spring) for the last spring frost (keeping elevation and distance from coast constant).

• Each additional foot of elevation is associated with an increase of about 0.00415 days (later in the spring) for the last spring frost (keeping latitude and distance from coast constant). By extension, a 1,000-foot elevation increase is associated with an increase of about 4.15 days (later in the spring), keeping the other predictor variables constant.

• Each additional mile of distance from the coast is associated with an increase of about 0.0461 days (later in the spring) for the last spring frost (keeping latitude and elevation constant). By extension, a 100-mile increase in distance from the coast is associated with an increase of about 4.61 days (later in the spring), keeping the other predictor variables constant.

The coefficient of multiple determination (R²) is 0.829, indicating that 82.9% of the total variation in average date of last spring frost is explained or accounted for by the three predictor (independent) variables. The adjusted R² is 0.822, just slightly less than the unadjusted coefficient. This adjusted formulation takes into account the number of independent variables being used, and can decrease if additional variables are not statistically significant. The adjusted R² will increase (over the unadjusted R²) only when a new variable improves the model more than would be expected by chance. Therefore, the adjusted R² is the best unbiased estimator of the contribution of a set of independent (predictor) variables to the explanation of the dependent (response) variable. For this multiple regression model seeking to explain variation in the average date of last spring frost, the results are substantially better than the 72.1% variation explained by latitude alone in the simple regression model.

The residuals (components of the unexplained variation) have a standard deviation of 6.67 days (SE = 6.67). This index (also called the standard error of the residuals) provides us with an indication of how precise the prediction of average date of last spring frost can be with this regression model. Stated a bit differently, the typical prediction of last spring frost date is off target by almost a week (6.67 days). The standard error in the simple linear regression was 8.41 (table 17.4). Using the multiple regression model, we have reduced the standard error of prediction by nearly two days.

The standard errors for each of the regression model coefficients are relatively small compared to the absolute values of the coefficients themselves. For example, the standard error of the intercept is 14.95 while the intercept value is -114.52, the standard error of the latitude slope is 0.4546 while the latitude coefficient is 5.7666, and so on. This suggests that the coefficient estimates are quite precise, and this is reflected by the relatively large t-ratio values and the statistically significant p-values (all below 0.05).

The F-ratio in the ANOVA portion of table 18.1 is a measurement of the likelihood that all of the regression model coefficients are zero. That is, the F-ratio is measuring the statistical significance of the overall regression model itself. If the regression model parameters were zero, then the null hypothesis would not be rejected, and the conclusion would be that there is no relationship between average date of last spring frost and the three predictors. Also, if the null hypothesis were not rejected, the magnitude of the F-ratio would be about one. Since F is very large (116.48), we can easily reject the null hypothesis (p-value = 0.000). Clearly, a statistically significant relationship exists between last spring frost date and all three of the predictor variables.
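As a quick arithmetic check, the F-ratio just quoted can be recomputed from the sums of squares and degrees of freedom reported in table 18.1:

```python
def f_ratio(ss_regression, df_regression, ss_residual, df_residual):
    """Overall-model F: mean square regression over mean square residual."""
    return (ss_regression / df_regression) / (ss_residual / df_residual)

# Sums of squares and degrees of freedom from table 18.1
print(round(f_ratio(15530.0, 3, 3199.9, 72), 2))   # 116.48, as reported
```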
Do we now have the best multiple regression model? It is impossible to fully answer this question, as there is no single way to define "best." One model might be more suitable for some purposes, while another alternative seems better or more useful for other tasks. Keep in mind that all realistic models are imperfect (wrong) to some extent, because every regression model has some unexplained variation.

We would like to build a parsimonious model that includes as few predictor variables as possible, while also trying not to leave out any independent variables that are potentially important. This is the delicate balance or tradeoff that is placed at the very center of how to construct a good regression model.

The key question in this example seems to be whether to include both elevation and distance to coast in the model or whether to drop one of these independent variables. From the correlation matrix that we ran prior to the regression analysis, we found that all three independent variables correlate significantly (p = 0.000) with average date of last spring frost (latitude 0.849; elevation 0.599; and distance to coast 0.643).

A separate correlation analysis shows that elevation and distance to coast are highly correlated (0.737) with each other, creating a potential multicollinearity problem. Since these two predictors are redundant, one of them should be removed to improve the model's parsimony and efficiency.

Returning to the three-predictor regression model summary (table 18.1), the bottom portion of the table provides the sequential sum of squares showing the contribution of each independent variable in accounting for variation in the dependent variable. Since the contribution of elevation (1,584.7) is much larger than the contribution of distance to coast (448.7), the results strongly suggest that we should include only latitude and elevation in the revised model (table 18.2).

The two independent variables account for 80.0% of the total variation in average date of last spring frost. This compares very favorably with the other regression models we have examined:

• Latitude only: r² = 0.721 (72.1% of total variation explained by model)
• Latitude and elevation: adjusted R² = 0.800 (80.0% of total variation explained by model)
• Latitude, elevation, and distance from coast: adjusted R² = 0.822 (82.2% of total variation explained by model)

If we eliminate distance from the coast from the model, very little of the explained variation is lost (a reduction from 0.822 very slightly down to 0.800), the collinearity problem between elevation and distance from the coast is eliminated, and the model becomes more parsimonious and efficient. In short, this model looks like the best alternative.

TABLE 18.1
Predicting Average Date of Last Spring Frost: Three-predictor Multiple Regression Model

Dependent (response) variable is average date of last spring frost.
Independent (predictor) variables are latitude, elevation, and distance from coast.

The regression equation is:
  avg. date of last spring frost = -115 + 5.77 (latitude) + 0.00415 (elevation) + 0.0461 (dist from coast)

Variable              Coefficient   SE (coefficient)   t-ratio   p-value
Intercept             -114.52       14.95              -7.66     0.000
Latitude              5.7666        0.4546             12.68     0.000
Elevation             0.004145      0.001910           2.17      0.033
Distance from coast   0.04608       0.01450            3.18      0.002

SE = 6.67    R² = 0.829 (82.9%)    R² (adjusted) = 0.822 (82.2%)

Analysis of Variance:
Source             DF   Sum of squares   Mean square   F-ratio   p-value
Regression model   3    15,530.0         5,176.7       116.48    0.000
Residual error     72   3,199.9          44.4
TOTAL              75   18,729.9

Source of regression model   Sequential SS
Latitude                     13,496.6
Elevation                    1,584.7
Distance to coast            448.7
TOTAL                        15,530.0

TABLE 18.2
Predicting Average Date of Last Spring Frost: Two-predictor Multiple Regression Model

Dependent (response) variable is average date of last spring frost.
Independent (predictor) variables are latitude and elevation.

The regression equation is:
  avg. date of last spring frost = -122 + 6.08 (latitude) + 0.0083 (elevation)

Variable    Coefficient   SE (coefficient)   t-ratio   p-value
Intercept   -121.66       15.67              -7.76     0.000
Latitude    6.0832        0.4704             12.93     0.000
Elevation   0.008304      0.001475           5.63      0.000

SE = 7.07    R² = 0.805 (80.5%)    R² (adjusted) = 0.800 (80.0%)

Analysis of Variance:
Source             DF   Sum of squares   Mean square   F-ratio   p-value
Regression model   2    15,081.4         7,540.7       150.87    0.000
Residual error     73   3,648.6          50.0
TOTAL              75   18,729.9

Source of regression model   Sequential SS
Latitude                     13,496.6
Elevation                    1,584.7
TOTAL                        15,081.4
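Once a final model is chosen, prediction is simple arithmetic. Here is a sketch using the two-predictor equation from table 18.2; the coefficients are the rounded values as printed in the text, and the station below is hypothetical, so the result is only approximate:

```python
def predicted_last_frost(latitude, elevation_ft):
    """Two-predictor equation from table 18.2 (rounded coefficients as
    printed); the result is the predicted day of year of last frost."""
    return -122 + 6.08 * latitude + 0.0083 * elevation_ft

# Hypothetical station: 35 degrees N latitude, 1,000 feet elevation
print(round(predicted_last_frost(35.0, 1000.0), 1))   # about day 99 (early April)
```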

Example: Predicting Obesity Levels for Counties of the U.S.

Back in chapter 11, we examined the spatial pattern of county obesity levels using analysis of variance. Selecting a stratified random sample of two hundred counties, proportionally represented across the four major census regions of the United States (excluding Alaska and Hawaii), we discovered highly significant differences in obesity rates between census regions (F = 22.54 and p = 0.000).

Using this same sample of two hundred counties, we now present a multiple regression model that can account for the spatial variation in these county obesity rates. Then, because the sample contains a randomization component, we will be able to test the regression coefficients inferentially and use the model to predict the obesity level in any county in the conterminous U.S. with a known degree of precision.

From several different sources, we identified a large number of potential independent (predictor) variables that have been associated with obesity in past research. From that set of possible predictors, we selected seven variables that seem to provide a representative mix for input into an initial exploratory regression model. For the sample of 200 counties, we calculated the correlation matrix of these seven predictors (table 18.3).

Notice that all seven predictor variables are significantly correlated with obesity level (see the first column of table 18.3). However, many of these predictors are significantly correlated themselves, suggesting that multicollinearity will likely have a substantial effect on the regression model.

The summary results of the multiple regression model are shown in table 18.4. The combined set of seven independent variables accounts for 65.0% of the total variation in obesity. The model is moderately successful in predicting obesity rates, but there seems to be some potential for reducing the number of independent variables without sacrificing too much explained variation. Two important related questions are: which variables should be removed and in what order? If we look at the bottom segment of table 18.4, we get some excellent guidance on how to proceed. The "source of regression model, sequential sum of squares" list tells how much each of the seven variables contributes to the total amount of explained variance in the model. The comparative importance of the variables is quite clear: (1) by a large margin, the percentage of county adult population that is physically inactive is the largest "explainer" of variance in obesity, with a sum of squares of 1,341.03 (out of a total of 1,658.62); (2) a strong second is percentage of county population aged 20 years or older having diabetes (205.32 out of 1,658.62); (3) a clear third place goes to percentage of county population of Hispanic origin (95.03 out of 1,658.62); (4) the other four predictor variables all contribute very small amounts to the total explained variation in the model.

If we were to show the entire multiple regression process, we would remove or eliminate only one predictor at a time from the model. This is because each predictor affects how all the others contribute to the model. This can be accomplished easily by having the computer software conduct what is known as a stepwise regression, which (in this case) removes one independent variable at a time from the model.
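The remove-one-at-a-time idea can be sketched without statistical software. The code below is a simplified stand-in for stepwise regression: it uses adjusted R² as the drop criterion (actual stepwise routines typically use t- or F-tests), and the data are synthetic, so this is an illustration of the logic rather than a reproduction of the county analysis:

```python
import numpy as np

def adj_r2(X, y):
    """OLS fit with an intercept; return adjusted R^2 for these predictors."""
    A = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coefs
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    n, k = X.shape
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def backward_eliminate(names, X, y):
    """Remove one predictor per pass while removal improves adjusted R^2."""
    names = list(names)
    while len(names) > 1:
        current = adj_r2(X, y)
        # adjusted R^2 after deleting each remaining column in turn
        trials = [adj_r2(np.delete(X, j, axis=1), y) for j in range(len(names))]
        best = int(np.argmax(trials))
        if trials[best] <= current:
            break                      # no single removal helps; stop
        X = np.delete(X, best, axis=1)
        names.pop(best)
    return names

# Hypothetical data: x3 is unrelated noise, a natural removal candidate
rng = np.random.default_rng(3)
x1, x2, x3 = rng.normal(size=(3, 150))
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1.0, 150)
kept = backward_eliminate(["x1", "x2", "x3"], np.column_stack([x1, x2, x3]), y)
print(kept)
```

Because each removal changes every remaining coefficient, the loop refits the model after every deletion, which mirrors the one-variable-at-a-time caution in the text.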


For brevity, we don't present these six successive steps of variable removal. However, after running stepwise regression, the "best" model that resulted contains the three predictors just mentioned (table 18.5).

This looks like a very good model. Even though four of the seven predictors have been eliminated, 63.4% of the total variation in county obesity rate is still explained. In addition, very little explained variation is lost while the complexity of the model is sharply reduced, allowing us to focus attention on those few factors that seem most related to obesity.

As a closing comment, perhaps you noticed something unusual in this analysis: the percentage of county population that is Hispanic negatively correlates with obesity, although the absolute magnitude of the correlation is smaller than any other predictor. This contradicts most research that looks at factors related to obesity, which often cites higher than national average obesity rates for Hispanics. In addition, why is percentage Hispanic staying in the regression model when its correlation to obesity is so low?

The multiple regression output (not shown in table 18.5) provides a list entitled "unusual observations." The computer software results identify seven out of the nearly 200 counties that have "large leverage." "Leverage" is a measure of an observation's ability to alter the regression model with just a slight change in the magnitude of the dependent (response) variable, in this case the percentage of adults classified as obese. In other words, observations (counties) with large leverage can have an inordinate amount of influence on the

TABLE 18.3
Matrix of Correlation Coefficients: Factors Used in County Obesity Regression Model

Cell contents: Pearson's correlation (p-value beneath)

            %Obese08   %PhIn08   %Diab08   %NC1MI   %LI1MI   PovRate08   %Black08
%PhIn08     0.732
            0.000
%Diab08     0.744      0.765
            0.000      0.000
%NC1MI      0.526      0.582     0.696
            0.000      0.000     0.000
%LI1MI      0.485      0.598     0.616     0.813
            0.000      0.000     0.000     0.000
PovRate08   0.512      0.561     0.590     0.655    0.663
            0.000      0.000     0.000     0.000    0.000
%Black08    0.397      0.333     0.458     0.379    0.099    0.416
            0.000      0.000     0.000     0.000    0.171    0.000
%Hisp08     -0.367     -0.259    -0.280    -0.153   -0.067   0.199       -0.112
            0.000      0.000     0.000     0.033    0.356    0.006       0.120

%Hisp08     Percentage of county resident population that is of Hispanic origin, 2008.
%PhIn08     Percentage of county adult population that reported no leisure-time physical activity, 2008. [physical inactivity]
%Diab08     Percentage of county population aged 20 years or older having diabetes, 2008.
%NC1MI      Percentage of housing units in a county that have no car and are more than one mile from a supermarket or large grocery store. [no car, 1 mile plus]
%LI1MI      Percentage of county total population that is low income and lives more than one mile from a supermarket or large grocery store. [low income, 1 mile plus]
PovRate08   Percentage of county residents with household income below the poverty threshold, 2008.
%Black08    Percentage of county resident population that is of non-Hispanic, black, or African American origin, 2008.
%Obese08    Percentage of age-adjusted persons aged 20 years or older classified as obese, where obese is defined as a Body Mass Index (BMI) ≥ 30, 2008. (see written narrative in chapter 1 for more details)
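A correlation matrix like table 18.3 can be screened automatically for troublesome predictor pairs. The sketch below uses synthetic data built so that two predictors are strongly related (much as the two food-access measures correlate at 0.813 in the table); the 0.8 and 0.95 cutoffs are the rule-of-thumb guidelines from section 18.1:

```python
import numpy as np

def flag_collinear(names, X, warn=0.8, severe=0.95):
    """Scan the predictor correlation matrix and report pairs whose
    absolute correlation exceeds the rule-of-thumb cutoffs."""
    R = np.corrcoef(X, rowvar=False)       # columns are variables
    flags = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(R[i, j]) >= warn:
                level = "serious" if abs(R[i, j]) >= severe else "possible"
                flags.append((names[i], names[j], round(float(R[i, j]), 3), level))
    return flags

# Synthetic stand-ins for four predictors; the two food-access
# measures are constructed to be strongly related
rng = np.random.default_rng(0)
no_car = rng.normal(5, 2, 200)
low_inc = 0.9 * no_car + rng.normal(0, 0.5, 200)   # built-in correlation
inactive = rng.normal(25, 5, 200)
diabetes = rng.normal(10, 2, 200)
X = np.column_stack([inactive, diabetes, no_car, low_inc])
flags = flag_collinear(["inactive", "diabetes", "no_car_1mi", "low_inc_1mi"], X)
print(flags)
```

Only the deliberately related pair is flagged; the independent predictors fall well below the 0.8 guideline.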

overall regression model result, if the value of their dependent variable is even slightly altered. Six of the seven counties with very high leverage are also counties with a very high percentage of the total population being Hispanic: Zavala, TX (93.9%); Dimmit, TX (86.2%); Santa Cruz, AZ (82.8%); El Paso, TX (82.2%); Valencia, NM (58.3%); and Taos, NM (55.8%). In short, these few "Hispanic-majority" counties have a great deal of influence on the results of the overall regression model.

TABLE 18.4
Predicting County Obesity Levels: Seven-predictor Multiple Regression Model

Dependent (response) variable is percent of adult county population classified as obese.
Independent (predictor) variables are %PhIn08, %Diab08, %NC1MI, %LI1MI, PovRate08, %Black08, and %Hisp08.

The regression equation is:
  %Obese08 = 16.2 + 0.215 (%PhIn08) + 0.573 (%Diab08) - 0.123 (%NC1MI) - 0.0061 (%LI1MI)
             + 0.120 (PovRate08) + 0.0113 (%Black08) - 0.0599 (%Hisp08)

Variable    Coefficient   SE (coefficient)   t-ratio   p-value
Intercept   16.191        1.005              16.10     0.000
%PhIn08     0.21521       0.04679            4.60      0.000
%Diab08     0.5729        0.1295             4.42      0.000
%NC1MI      -0.1235       0.1173             -1.05     0.294
%LI1MI      -0.00612      0.02808            -0.22     0.828
PovRate08   0.11982       0.04024            2.98      0.003
%Black08    0.01132       0.01473            0.77      0.443
%Hisp08     -0.05994      0.01314            -4.56     0.000

SE = 2.14    R² = 0.663 (66.3%)    R² (adjusted) = 0.650 (65.0%)

Analysis of Variance:
Source             DF    Sum of squares   Mean square   F-ratio   p-value
Regression model   7     1,658.62         236.95        51.90     0.000
Residual error     185   844.62           4.57
TOTAL              192   2,503.24

Source of regression model   Sequential SS
%PhIn08                      1,341.03
%Diab08                      205.32
%NC1MI                       0.67
%LI1MI                       1.69
PovRate08                    7.95
%Black08                     6.93
%Hisp08                      95.03
TOTAL                        1,658.62

TABLE 18.5
Predicting County Obesity Levels: Three-predictor Multiple Regression Model

Dependent (response) variable is percentage of adult county population classified as obese.
Independent (predictor) variables are %PhIn08, %Diab08, and %Hisp08.

The regression equation is:
  %Obese08 = 15.6 + 0.248 (%PhIn08) + 0.673 (%Diab08) - 0.0386 (%Hisp08)

Variable    Coefficient   SE (coefficient)   t-ratio   p-value
Intercept   15.6350       0.8843             17.68     0.000
%PhIn08     0.24777       0.04495            5.51      0.000
%Diab08     0.6727        0.1110             6.06      0.000
%Hisp08     -0.03860      0.01142            -3.38     0.001

SE = 2.185    R² = 0.640 (64.0%)    R² (adjusted) = 0.634 (63.4%)

Analysis of Variance:
Source             DF    Sum of squares   Mean square   F-ratio   p-value
Regression model   3     1,600.94         533.65        111.78    0.000
Residual error     189   902.29           4.77
TOTAL              192   2,503.24

Source of regression model   Sequential SS
%PhIn08                      1,341.03
%Diab08                      205.32
%Hisp08                      54.59
TOTAL                        1,600.94
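The leverage figures behind the "unusual observations" list come from the diagonal of the hat matrix H = A(A'A)^-1 A'. The sketch below is a generic illustration on synthetic data; the 2(k + 1)/n cutoff is a common rule of thumb, not necessarily the one the software used:

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix, where A is the design matrix with an
    intercept; h_ii measures how strongly observation i can pull the
    fitted surface toward itself."""
    A = np.column_stack([np.ones(len(X)), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T
    return np.diag(H)

# Hypothetical three-predictor data with one extreme county in X-space
rng = np.random.default_rng(4)
X = rng.normal(50, 10, size=(60, 3))
X[0] = [120.0, 120.0, 120.0]        # far outside the main data cloud
h = leverages(X)
cutoff = 2 * (3 + 1) / 60           # rule of thumb: 2(k + 1) / n
print(h[0] > cutoff, int(np.argmax(h)) == 0)
```

The extreme observation carries by far the largest leverage, exactly the behavior the example attributes to the Hispanic-majority counties. (A useful self-check: the leverages always sum to the number of fitted parameters, here 4.)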



Example: Predicting Life Expectancy for Countries of the World

One of the four illustrations of the geographic research process introduced in chapter 1 dealt with life expectancy at the country and world region levels. After looking at the global pattern of life expectancy (fig. 1.4), a series of descriptive statements and inferential hypotheses were suggested for possible future analysis. In addition, a possible "life expectancy transition" was discussed. Of course, many different factors are directly or indirectly related to life expectancy. In World Health Statistics, 2010, the World Health Organization lists well over 100 "Global Health Indicators" organized into nine major categories.

From this complexity, we select different combinations of possible independent (predictor) variables as candidates for a series of multiple regression models. Our exploratory analysis results in a variety of different models. A full discussion of these models is well beyond the scope of this text, but we briefly present some model results that influence the exploratory research process.

One of the five-predictor models does quite well in accounting for variability in life expectancy, with a coefficient of determination (R²) of 0.833 (83.3%). In this model, three independent variables focus on health risk and causes of morbidity (percentage of population using improved drinking water sources, percentage of population using improved sanitation, and percentage prevalence of HIV among adults aged 15-49). The other two independent variables represent measures of health expenditure (total expenditure on health as a percentage of gross domestic product and per capita total expenditure on health in international dollars). Of the 119 countries with complete data (many less-developed countries did not have HIV estimates), the analysis identifies only 9 countries as "unusual observations" having either a very high absolute residual or very high leverage. Eight of these countries are in Africa, and the ninth country is the United States. The eight African countries all have high health-risk numbers (combinations of bad water, poor sanitation, or HIV). The United States has very great leverage in this model because of its extremely high per capita health-care expenditures (look again at fig. 1.5).

With this model result, we decide to try another model containing only the three risk and morbidity factors. Our thinking is that countries in Africa seem to be rather distinctive when compared to the rest of the world. This judgment is confirmed, as a three-predictor model focusing just on risk and morbidity factors accounts for 79.9% of the total variation in life expectancy globally (table 18.6). As health standards like clean water, proper sanitation, and control of HIV improve in African countries, global life expectancies will likely converge dramatically, and other predictive regression models for life expectancy will no doubt emerge.

At the beginning of this chapter we reference the "what to do" questions that might inform policy decisions. The regression results in table 18.6 indicate that access to clean water (GoodWater%) is the largest contributor (SS = 9,052.3) to the overall model. Additionally, the coefficient for GoodWater% indicates that a nearly 4% increase nationwide in access to clean water accounts for an additional nationwide year of life expectancy. Similarly, a 1% reduction in HIV% cases results in an additional year of life expectancy. With this information in mind, policy makers can evaluate different strategies for improving life expectancy, such as whether to invest in HIV education or preventive care.

TABLE 18.6
Predicting Country Life Expectancy: Three-predictor Multiple Regression Model

Dependent (response) variable is life expectancy in years.
Independent (predictor) variables are GoodWater%, GoodSanit%, and HIV%.

The regression equation is:
  LifeExp = 37.0 + 0.275 (GoodWater%) + 0.127 (GoodSanit%) - 0.887 (HIV%)

Variable     Coefficient   SE (coefficient)   t-ratio   p-value
Intercept    36.995        2.771              13.35     0.000
GoodWater%   0.27486       0.04636            5.93      0.000
GoodSanit%   0.12654       0.02631            4.81      0.000
HIV%         -0.8868       0.1079             -8.22     0.000

SE = 5.007    R² = 0.804 (80.4%)    R² (adjusted) = 0.799 (79.9%)

Analysis of Variance:
Source             DF    Sum of squares   Mean square   F-ratio   p-value
Regression model   3     11,823.5         3,941.2       157.22    0.000
Residual error     115   2,882.8          25.1
TOTAL              118   14,706.4

Source of regression model   Sequential SS
GoodWater%                   9,052.3
GoodSanit%                   1,078.0
HIV%                         1,693.2
TOTAL                        11,823.5
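The policy arithmetic in the closing paragraph follows from the reciprocals of the table 18.6 slopes; a quick check of the "nearly 4%" figure:

```python
# Slope coefficients as printed in table 18.6
B_GOODWATER = 0.275   # years of life expectancy per +1% clean-water access
B_HIV = -0.887        # years of life expectancy per +1% HIV prevalence

# Predictor change associated with one additional year of life
# expectancy, holding the other predictors constant
water_needed = 1 / B_GOODWATER        # % more clean-water access
hiv_reduction = 1 / abs(B_HIV)        # % less HIV prevalence
print(round(water_needed, 1), round(hiv_reduction, 1))   # 3.6 1.1
```

A roughly 3.6% nationwide gain in clean-water access, or about a 1.1% drop in HIV prevalence, corresponds to one additional year of life expectancy under this model.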

18.2 THE BASICS OF CLUSTER ANALYSIS

The term cluster analysis refers to a set of multivariate methods that group or classify a set of observations so that similar observations are placed in the same group and dissimilar observations are placed in different groups. Synonymous with numerical taxonomy or typological analysis, clustering is a common technique of statistical data analysis in many different fields.

Cluster analysis is not one specific algorithm or procedure, but rather a general goal to be accomplished (assign similar observations to the same group while assigning dissimilar observations to different groups). Procedures vary depending on what constitutes a cluster and how best to identify those clusters. The possibilities are extensive: well over 100 different "clustering models" or algorithms are in active use. Some of the most basic considerations in cluster analysis include:

• Approach: the underlying cluster model approach can be either agglomerative or divisive. This parallels the two conceptual strategies of classification (agglomeration and subdivision) discussed in chapter 2.
• Level of measurement: different cluster analysis procedures can be used on interval/ratio, ordinal, and nominal (categorical) data.
• Measure of distance: options include Euclidean, squared Euclidean, Manhattan, Pearson's, and squared Pearson's. Euclidean and Manhattan distances have already been defined. Pearson's methods calculate the square root of the sum of squared distances divided by variances. This method is used when observations need standardizing.
• Linkage method: linkage alternatives include single, average, median, centroid, complete, Ward's, and McQuitty's. The linkage method selected determines how distance between clusters is defined. Basically, a linkage rule is needed when calculating inter-cluster distances for multiple observations in a cluster.

In geographic applications, the "observations" being allocated to clusters are often locations (countries, states, provinces, etc.) for which there are data on multiple variables. Very often our task is to reduce the complexity of the data through creation of multi-variable clusters of observations. Hence, cluster analysis comprises a set of data-reduction techniques. Just as factor analysis (another important multivariate statistical procedure) combines a large number of individual variables into a smaller number of "variable-groups," cluster analysis reduces a large number of locations to a smaller number of "location-groups" or "location-types."

Since cluster analysis is a complex multivariate technique, it is useful to summarize the basic goal and procedures of this statistical method in written narrative form, without using any specific formulas or statistical formulations. Following the explanation, we will look at some geographic examples. We start with a set of locations (for example, states of the United States or counties in Alabama), for which we have multiple variables of data (e.g., measures of health such as level of obesity, level of diabetes, and level of physical inactivity). Our basic goal is to group (cluster) together those locations that are most similar (that is, close to one another in multi-dimensional space: with regard to level of obesity, level of diabetes, and level of physical inactivity) into the same category, while placing locations that are most different (dissimilar) from one another into different categories. Furthermore, the process of grouping or clustering locations considers all variables simultaneously (making this a "multivariate" technique).

Let us suppose that we want to group the states of the United States into five categories or classes based on three health indicators (obesity, diabetes, and physical inactivity). With three variables, we are working in three-dimensional space, with each dimension representing a variable. Using this technique, states will be combined with other states (or groups of states) having similar values of the three health indicators so that ultimately, we can generate a choropleth map that shows each state allocated to one of these five categories of "health level." This will allow us to see the overall spatial pattern of the most healthy and least healthy states. Let us further suppose we want to use an agglomerative approach to clustering, starting with each state in its own separate group, and iteratively (step-by-step) group states that are "closest together" (most similar) in this three-dimensional space. We can illustrate this iterative procedure graphically by using a dendrogram and specifying a "cut-off" line to allocate states to their proper category (cluster).

The cluster analysis procedure also calculates some useful summary descriptive statistics. For each cluster, we can determine its mean value on each variable and calculate a cluster centroid that indicates the center of the cluster in three-dimensional space. This allows us to graph the centroids in three-dimensional space and see the distance separating cluster centroids. In effect, the closer the centroids in the three-dimensional space, the more similar the clusters (groups) are based on the variables used for the clustering.

Cluster analysis is always descriptive in nature; no inferential hypotheses are being tested. In other words, we are not trying to infer population parameters from sample statistics. In addition, we are not attempting here to provide a comprehensive coverage of clustering procedures. Our main objective is to illustrate how cluster analysis is useful in the investigative exploration of spatial data sets. Those interested in studying these procedures in more detail should examine some of the chapter references.
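The agglomerative procedure described above can be sketched in a few lines of code. The toy data and the stop-at-k stopping rule below are our own illustrative assumptions (a real analysis would use a statistical package), but the merging logic (Euclidean distance plus average linkage) follows the narrative:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two observations in variable space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2):
    """Mean pairwise distance between the observations of two clusters."""
    return sum(euclidean(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(points, k):
    """Start with each observation in its own cluster; repeatedly merge the
    two closest clusters (by average linkage) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest inter-cluster distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical observations: (obesity%, diabetes%, inactivity%) for six "states"
data = [(33, 11, 32), (34, 12, 33), (28, 9, 25),
        (29, 10, 26), (41, 15, 34), (40, 15, 35)]
groups = agglomerate(data, k=3)
```

With k = 3, the three obvious pairs of similar "states" end up in the same groups; lowering k continues the merging, which is exactly what moving the dendrogram cut-off line does.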
278 Part VI • Statistical Relationships between Variables

Example: Grouping African Countries Based on Factors Influencing Life Expectancy

As we developed multiple regression models related to the global distribution and variation of life expectancy, the distinctiveness of African countries quickly emerged. We discovered that access to safe drinking water, access to improved sanitation, and HIV prevalence among adults were all important life expectancy predictors. We further discovered that nearly all of the "unusual observations" (that is, large residuals and observations with lots of leverage) were countries in Africa.

[Map legend for figure 18.1: very good rates of access to improved water and sanitation, average prevalence of HIV; average rates of access to improved water and sanitation, high prevalence of HIV; rather low rates of access to improved water and sanitation, low prevalence of HIV; missing data (not available). Scale bar: 0 to 2,500 kilometers.]

FIGURE 18.1
Cluster Analysis Map of Africa: Based on Three Predictors of Life Expectancy-Access to Good Water and Sanitation and Prevalence of HIV
Source: World Health Organization (WHO), World Health Statistics, 2010

Therefore, a logical next step in the exploratory geographic research process is to look at the spatial distribution and variation of African countries themselves with regard to these three independent variables. Since we want to group or classify observations over multiple variables simultaneously, a cluster analysis procedure is appropriate.

We selected the three-cluster solution, since it seems to have a regional pattern that is suitable for interpretation (fig. 18.1). Altogether, 46 African countries have complete data across all three variables. Most of the countries missing data lack information about HIV prevalence.

One group of countries (shown in the lightest shade of gray) has residents with the highest rates of access to safe drinking water and improved sanitation. In this group, the cluster centroid for safe drinking water is 91.8% of the population and the cluster centroid for access to improved sanitation is 77.1% (see table 18.7). HIV prevalence is about average for the continent at 5.3%. Most of the countries in this group are on the periphery of the African continent. Key countries include South Africa and Botswana in Southern Africa; and Egypt, Tunisia, Algeria, and Morocco in Northern (Mediterranean) Africa.

A second group of countries (medium shade of gray) is most notable for having a high HIV prevalence rate (above 10% on average). A number of countries in south central and central Africa are members of this group: Zimbabwe (Rhodesia), Zambia, Malawi, Burundi, Rwanda, and Uganda. In addition, several countries along the West African coast are in this cluster: Senegal, Cameroon, Gabon, Angola, and Namibia. These countries generally have average levels of access to safe drinking water and improved sanitation.

The third (and largest) group of countries (darkest shade of gray) is located in a wide swath across Central Africa, from the Atlantic Ocean to the Indian Ocean coast. Generally, HIV prevalence rates are low (average of 3% of the resident population), but access rates to safe drinking water and good sanitation are poor (even by African standards).

Why do these key factors related to life expectancy vary in the way they do across the African continent? More specifically, why do we see the non-random clustering pattern that is so evident in figure 18.1? Obviously, the spatial processes underlying these spatial patterns are very complex. Multiple interrelated factors are influencing each of these three life expectancy predictors. You might want to speculate further on additional predictor variables (maybe an income- or education-related measure, for example) that could be included in a "follow-up" exploratory or investigative geographic analysis.

TABLE 18.7
Cluster Analysis of African Countries: Based on Three Predictors of Life Expectancy-GoodWater%, GoodSanit%, and HIV%

Cluster Centroids:

Variable      Cluster 1 (9 countries)   Cluster 2 (13 countries)   Cluster 3 (24 countries)   Grand centroid (46 countries)
GoodWater%    91.7778                   73.2308                    59.3333                    69.6087
GoodSanit%    77.1111                   46.3077                    20.1250                    38.6739
HIV%           5.3444                   10.1000                     3.0292                     5.4804

Distance between cluster centroids


Cluster 1 and cluster 2 36.2692
Cluster 1 and cluster 3 65.6157
Cluster 2 and cluster 3 30.4741
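The distances between cluster centroids reported at the bottom of table 18.7 are ordinary Euclidean distances computed from the three centroid coordinates. They can be verified directly (a short sketch we added; the numbers come from the table):

```python
import math

def centroid_distance(c1, c2):
    """Euclidean distance between two centroids in
    (GoodWater%, GoodSanit%, HIV%) space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

# Cluster centroids from table 18.7
cluster1 = (91.7778, 77.1111, 5.3444)
cluster2 = (73.2308, 46.3077, 10.1000)
cluster3 = (59.3333, 20.1250, 3.0292)

print(round(centroid_distance(cluster1, cluster2), 4))  # 36.2692, as in the table
print(round(centroid_distance(cluster1, cluster3), 4))  # 65.6157
print(round(centroid_distance(cluster2, cluster3), 4))  # 30.4741
```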

Example: Grouping States Using Basic Health Indicators

Without a doubt, the most frequently examined topic in this book has been the spatial pattern of obesity and associated distributions of other health-related variables such as diabetes and physical inactivity across the United States. Another instructive way to explore obesity and its related variables is to examine the degree to which states cluster spatially when considering all three health-related variables simultaneously. At the county level, for example, we have constructed two crosstab maps and identified distinctive regional patterns of "very healthy" and "very unhealthy" counties, with definitions based on combined levels of obesity, diabetes, and physical inactivity (see fig. 18.2). Counties highlighted on these maps are either in the bottom 10% (or top 10%) of all counties in the United States as measured on all three of these variables.

Using the same three variables (levels of obesity, diabetes, and physical inactivity), we can also examine the "geography of health" pattern at the state level by applying a multivariate cluster analysis. Should we expect the cluster analysis to produce a strong regional patterning of "healthy" and

[Map legend for figure 18.2: Healthiest Counties (counties with the lowest rates of diabetes, obesity, and physical inactivity) = top 10% healthiest counties; Unhealthiest Counties (counties with the highest rates of diabetes, obesity, and physical inactivity) = bottom 10% unhealthiest counties.]
FIGURE 18.2
County Crosstab Map: Obesity, Diabetes, and Physical Inactivity
Source: Centers for Disease Control and Prevention (CDC)

"unhealthy" states? To what extent will the state-level pattern of "health" derived from cluster analysis replicate the county-level pattern of "health" derived from combining the top 10% or bottom 10% of counties on these same three variables? How close or distant are the cluster centroids of the selected model, and what additional information is provided by comparing centroid coordinates?

The multivariate cluster analysis process itself is effectively illustrated with a dendrogram (fig. 18.3) and a summary table (table 18.8). In this analysis, an "average linkage" method and a "squared-Euclidean" distance measure are used. "Average linkage" defines the distance between two clusters as the mean distance between an observation in one cluster and an observation in the other cluster. "Squared-Euclidean" is simply the square of Euclidean distance. Therefore, distances that are large under Euclidean distance will be considerably larger under the squared-Euclidean measure. Many sources recommend that when using an average linkage method, you should also use a squared-distance measure.

Certain states combine relatively early in the clustering process while others, of course, do not. For example, in this three-dimensional space (obesity, diabetes, and physical inactivity), the states of Iowa and Nebraska are close together; they are more similar in values of health indices and group or cluster at a distance of about 0.3 units in three-dimensional space (see the bottom center of fig. 18.3). Other states do not join a cluster until later in the process. For instance, Georgia does not join a group until a distance of about 1.3 units in

[Figure 18.3 is the dendrogram itself, with distance on the vertical axis (up to about 3.25) and a horizontal "five cluster cut-off" line marking where the tree is cut into five clusters.]
FIGURE 18.3
Dendrogram of State "Health": Based on Cluster Analysis of Levels of Obesity, Diabetes, and Physical Inactivity

TABLE 18.8
Cluster Analysis of U.S. States: Based on Three Predictors-Obesity%, Diabetes%, and PhysicallyInactive%, 2008

Cluster Centroids:

Variable              Cluster 1         Cluster 2          Cluster 3        Cluster 4       Cluster 5       Grand centroid
                      (least healthy)   (average health)   (very healthy)   (more healthy   (less healthy
                                                                            than most)      than most)
Obesity%              1.411             -0.141             -2.321           -0.953          0.710           0.000
Diabetes%             1.539             -0.326             -1.679           -0.827          0.776           0.000
PhysicallyInactive%   1.493             0.060              -1.605           -1.170          0.302           0.000

Distance between cluster centroids Distance between cluster centroids

Cluster 1 and cluster 2 2.818 Cluster 2 and cluster 4 1.556


Cluster 1 and cluster 3 5.821 Cluster 2 and cluster 5 1.413
Cluster 1 and cluster 4 4.275 Cluster 3 and cluster 4 1.670
Cluster 1 and cluster 5 1.579 Cluster 3 and cluster 5 4.342
Cluster 2 and cluster 3 3.059 Cluster 4 and cluster 5 2.738
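Unlike the raw-percentage centroids of table 18.7, the centroid values in table 18.8 are in standardized (z-score) units, which is why the grand centroid is 0.000 on every variable. The inter-centroid distances listed beneath the table are ordinary Euclidean distances in this standardized space. A quick check using the table's values (a sketch we added for illustration):

```python
import math

# Standardized cluster centroids from table 18.8:
# (Obesity%, Diabetes%, PhysicallyInactive%) in z-score units
cluster1 = (1.411, 1.539, 1.493)     # least healthy
cluster3 = (-2.321, -1.679, -1.605)  # very healthy (Colorado)
cluster5 = (0.710, 0.776, 0.302)     # less healthy than most

def distance(a, b):
    """Euclidean distance between two centroids in standardized space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(round(distance(cluster1, cluster5), 3))  # 1.579, matching the table
print(round(distance(cluster1, cluster3), 3))  # 5.821, matching the table
```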


three-dimensional space (see Georgia near the center of fig. 18.3). Recall that distance in the three-dimensional space reflects the degree of similarity between states defined by the three health indicators. Shorter distances imply more similarity.

After the entire clustering process is completed, the question becomes: "Where to cut the dendrogram?" We explored cutting the dendrogram at various places (this is a subjective decision) and settled on a five-cluster cut-off (fig. 18.3). This choice was selected because the map of this five-cluster option shows a strong regional patterning and also highlights the Colorado outlier (fig. 18.4). Notice how the dendrogram cut-off line divides the states into clusters that can then be directly placed on the map.

The "least healthy" region is a swath of states ranging from Texas and Oklahoma in the southwest, through the states of the lower Mississippi River valley, then northeast into the Appalachian states of Tennessee, Kentucky, and West Virginia (cluster 1 in table 18.8). We know this is the least healthy region because the cluster centroid numbers associated with cluster 1 are all large positive numbers (indicating high percentages of obesity, diabetes, and physical inactivity). At the other extreme, proudly isolated as a statistical outlier, is the "very healthy" state of Colorado (cluster 3 in the table).

All of the "more healthy than most" states are in the northern portion of the country, ranging from the Pacific Northwest (Washington, Oregon, Idaho), through Minnesota, then jumping to northern New England (Massachusetts, Vermont, New Hampshire, Maine) (cluster 4 in table 18.8). Most of the "less healthy than most" states are along the south Atlantic coast (Georgia, South Carolina, North Carolina) or in the traditionally industrial Midwest (Ohio, Michigan, Indiana) (cluster 5). The "large middle group" of average health states is widely dispersed nationwide (cluster 2).

We encourage you to explore these "health" patterns further. What other variables do you hypothesize might be spatially associated with these map patterns? Are some of the descriptive statements and inferential hypotheses mentioned in the obesity example of chapter 1 still relevant? Once again we see that conducting statistical analyses leads to further research questions, and further statistical analysis!

[Map legend for figure 18.4: very healthy (1 state: Colorado); more healthy than most (13 states); large middle group (21 states); less healthy than most (7 states); least healthy (9 states).]

FIGURE 18.4
Cluster Analysis Map of the United States: Based on Three Measures of "Health"-Levels of Obesity, Diabetes, and Physical Inactivity

Example: Grouping Alabama Counties Using Basic Health Indicators

We have seen that many of the very unhealthy counties in the United States are in Alabama (fig. 18.2). In fact, Alabama has 17 of the 103 counties meeting the criteria of being in the highest 10% (that is, the least healthy 10%) of all U.S. counties with regard to levels of obesity, diabetes, and physical inactivity. Furthermore, Alabama is one of the nine least healthy states as determined by cluster analysis at the state level (fig. 18.4). To what extent will we find a strong regional patterning of "healthy" and "unhealthy" counties within Alabama itself?

[Map of Alabama counties (figure 18.5), bordered by Tennessee to the north, Mississippi to the west, Georgia to the east, and Florida to the south, with Huntsville and Mobile labeled. Legend: very healthy counties; large middle group of counties; least healthy counties. Scale bar: 0 to 80 kilometers.]

FIGURE 18.5
Cluster Analysis Map of Alabama Counties: Based on Three Measures of "Health"-Levels of Obesity, Diabetes, and Physical Inactivity


Using multivariate cluster analysis, we selected the three-cluster solution as best, primarily because the clearest and most interpretable spatial pattern is associated with this option (fig. 18.5).

First, notice that all of the "least healthy" counties are located in a belt running west to east across the central part of the state. Knowledgeable historians and geographers might recognize that these counties are located along the old "cotton belt," with very fertile soils that were most conducive to the development of plantation-style cotton growing in the antebellum South. When this region was initially settled, large plantations were the rule and the overwhelming majority of the population were black slaves.

Even today, these counties have a predominantly black population and are also still quite rural and poor. Of course, the correlations between black, rural, and poor are relatively high in many parts of the South, and it has been widely documented that these segments of our society are relatively unhealthy. As always, keep in mind that "correlation is not causation."

At the other end of the spectrum, several "very healthy" counties form a cluster, even though they are widely dispersed from a locational perspective. These counties have a more suburban character than most other counties in Alabama. Baldwin County is adjacent to Mobile and also has considerable new development along the Gulf Coast. Shelby County is relatively close to Birmingham and Huntsville is located in Madison County. In east-central Alabama, Lee County's very healthy classification is probably attributable in large part to the City of Auburn and Auburn University. A large percentage of Lee County population is young, well-educated, and more likely to not be obese, not have diabetes, and be physically active.

Figure 18.6 shows all of the cluster centroid values as well as a three-dimensional plot of their respective locations. If you carefully examine figures 18.5 and 18.6, you can see that the cluster centroid of the four "very healthy" counties (cluster 2) has the lowest levels on all three variables (obesity%, diabetes%, and physical inactivity%). As expected, the "least healthy" set of 8 counties in figure 18.5 are together as cluster 3 in figure 18.6.

In these cluster analysis examples, we relate "poor health" to high levels of obesity, diabetes, and physical inactivity. We also suggest that "good health" is somehow associated with suburban living and proximity to a larger city. In addition, we suggest even more linkages between "good health," younger age, and higher level of education. All of these relationships are weakly supported but still strongly suggestive. As geographers, we must always be careful not to attach simple cause-effect labels to variables that happen to be spatially correlated or locations that happen to cluster together.

Nevertheless, as a result of these statistical analyses we are now more knowledgeable about factors associated with poor health. Clearly, with regard to health and obesity (or almost any other variable that has an interesting spatial pattern) it is worthwhile to apply the scientific research process and use of appropriate statistical techniques toward solving problems related to that variable.

[Figure 18.6 also includes a three-dimensional plot of the three cluster centroids, labeled C1 (55 counties), C2 (4 counties), and C3 (8 counties).]

Cluster Centroid Values:

Variable               Cluster 1         Cluster 2        Cluster 3         Grand centroid
                       (middle group,    (very healthy,   (least healthy,   (67 counties)
                       55 counties)      4 counties)      8 counties)
Obesity%               33.0              28.7             41.1              33.7
Diabetes%              11.5              9.9              15.2              11.9
Physical inactivity%   32.3              25.6             34.0              32.1

FIGURE 18.6
Centroid Plots of Alabama County Clusters: Based on Three Measures of "Health"-Levels of Obesity, Diabetes, and Physical Inactivity
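As a consistency check on figure 18.6, the grand centroid is just the count-weighted mean of the three cluster centroids (55, 4, and 8 counties, respectively). A quick illustration with the obesity values (the numbers are taken from the figure):

```python
# Cluster sizes and obesity% centroids from figure 18.6
sizes = [55, 4, 8]            # middle group, very healthy, least healthy
obesity = [33.0, 28.7, 41.1]  # cluster centroid obesity% values

# Count-weighted mean of the cluster centroids
grand = sum(n * x for n, x in zip(sizes, obesity)) / sum(sizes)
print(round(grand, 1))  # 33.7, the grand centroid obesity% in figure 18.6
```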

KEY TERMS

cluster analysis, 277
coefficient of multiple determination, 271
leverage, 274
multicollinearity, 270
multiple regression, 269
standardized regression coefficients, 270
stepwise regression, 273

REFERENCES AND ADDITIONAL READING

Draper, N. R. and H. Smith. Applied Regression Analysis. 3rd ed. New York: Wiley, 1998.
Everitt, B. S., S. Landau, et al. Cluster Analysis. 5th ed. New York: Wiley, 2011.
Griffith, D. A. and C. G. Amrhein. Statistical Analysis for Geographers. Englewood Cliffs, NJ: Prentice Hall, 1997.
Johnston, R. J. Multivariate Statistical Analysis in Geography: A Primer on the General Linear Model. London: Longman, 1987.
Montgomery, D. C., E. A. Peck and G. G. Vining. Introduction to Linear Regression Analysis. 5th ed. New York: Wiley, 2012.
Rogerson, P. A. Statistical Methods for Geography: A Student's Guide. 3rd ed. Thousand Oaks, CA: Sage Publications, Inc., 2010.

If going further with the geographic examples mentioned in this chapter, sources of data mentioned in earlier chapters remain useful, including: the Centers for Disease Control and Prevention for obesity and related health factors such as diabetes level and level of physical inactivity (www.cdc.gov); the World Health Organization and their World Health Statistics publications (www.who.int/en and www.who.int/gho/publications/world_health_statistics/en); and the World Bank for life expectancy and many related world development indicators (www.worldbank.org). To learn more about cluster analysis, relevant terminology, and different linkage methods, a good strategy is to examine various statistical software packages that include clustering techniques.

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Explain the basic purposes of multiple regression.
2. Define multicollinearity and explain why it can be a problem in regression modeling.
3. Explain why it is important to determine the relative importance of each predictor (independent) variable.
4. Explain what is measured by the coefficient of multiple determination.
5. Describe the characteristics of a "good" multiple regression model.
6. Explain the basic purposes of cluster analysis.
7. Define the term "cluster centroid."
PART VII

EPILOGUE
Problem Solving and Policy Determination
in Practical Geographic Situations

19.1 Geographic Problem Solving and Policy Situations


19.2 Answers to Geographic Problems and Policy Situations

19.1 GEOGRAPHIC PROBLEM SOLVING AND POLICY SITUATIONS

One of the primary goals of this textbook is to provide you with an introduction to the broad spectrum of statistical techniques that geographers use to approach and solve spatial problems. It is neither possible nor desirable to discuss all quantitative procedures available to geographers in a single introductory text. For example, we do not present intermediate and advanced multivariate techniques such as principal components factor analysis, canonical correlation, and discriminant analysis. Nor do we provide any specifics regarding more specialized procedures in geographic information sciences (GIS), such as Geary's C, the general G-statistic, geographically weighted regression, and nearest neighbor hierarchical clustering. If you are interested in learning more about these additional techniques, we encourage you to consult some of the more advanced references listed at the end of various chapters.

As suggested in the Preface, there are two important general skills you should master to have a basic competency in using statistical techniques to help solve practical geographic problems and conduct geographic research. First, whenever you encounter a geographic problem that grabs or captures your interest, you should be able to develop specific descriptive statements and inferential hypotheses related to that problem. Then you should be able to select the appropriate statistical technique(s) and graphic procedure(s) that allow you to best construct these descriptive statements and test the inferential hypotheses. We might summarize this first general skill as "asking the proper questions and selecting the appropriate statistical approaches."

Second, when presented with the results or computer-produced output of any statistical analysis, you should be able to interpret those results and outputs properly. The best way to learn this second general skill is to practice, practice, and practice. Unfortunately, no magical shortcut can provide you with this expertise. If you can explain the results contained in the output from a statistical software package in plain, simple English (and in the context of the specific variables used in your research problem), then you have successfully reached your goal.

The remainder of this chapter presents a series of geographic situations or scenarios and their solutions. With each scenario, you will be given enough information to make an informed decision regarding which statistical method or graphic technique to use. In some cases, you may be able to interpret the situation in more than one acceptable way, thereby allowing alternative correct answers. If you select an answer that is different from that listed in section 19.2, you should be able to explain and defend the logic of your alternative solution.

Perhaps the most difficult decisions involve scenarios that require use of an inferential statistic. Multiple steps may be required to identify the single best statistical test. Recall from chapter 9 (table 9.7) that the selection of an appropriate inferential test may require consideration of such dimensions as the following:

1. type of question (questions about differences, questions about similarities, questions about explicitly spatial relationships);

290 Part VII • Epilogue

2. level of measurement [nominal (categorical), ordinal, interval/ratio];
3. parameter (mean, proportion, etc.);
4. number of samples (one, two, three or more); and
5. relationship of samples (independent, dependent).

To help you choose the most appropriate inferential technique, the body of table 9.7 is replicated (table 19.1), with a filled-in framework that includes every inferential technique presented in chapters 9-18. It can be used as a decision-making guide for many of the geographic situations that follow. Remember that with many of these scenarios, enough information is provided to direct you to a single best answer, but other problems in the list may have more than one reasonable solution. Do not view the list of solutions as definitive; if you are innovative, you may identify valid alternatives that we have not listed that logically apply to the situation.

TABLE 19.1
Organizational Structure for Selection of Appropriate Inferential Test

Questions about differences

Level of measurement | One sample | Two samples | Dependent sample (matched-pairs) | Three or more (k) samples
Nominal scale (categorical variables) | Chi-square goodness-of-fit (χ²) | | |
Ordinal scale | Chi-square goodness-of-fit (χ²); Kolmogorov-Smirnov goodness-of-fit | Wilcoxon (Mann-Whitney) rank sum | Wilcoxon matched-pairs signed-ranks | Kruskal-Wallis analysis of variance by ranks
Interval/ratio scale | One-sample difference of means; one-sample difference of proportions | Two-sample difference of means; two-sample difference of proportions | Matched-pairs (t-test) | Analysis of variance (ANOVA)

Questions about similarities

Level of measurement | Strength and direction of relationship between two variables | Form or nature of relationship between one dependent and one independent variable | Form or nature of relationship between one dependent variable and two or more independent variables
Nominal scale (categorical variables) | Contingency analysis | |
Ordinal scale | Spearman rank correlation coefficient; contingency analysis | |
Interval/ratio scale | Pearson correlation coefficient | Simple linear regression | Multiple regression

Questions about explicitly spatial relationships

Level of measurement | Spatial techniques
Nominal scale (categorical variables) | Nearest neighbor (point data); Quadrat analysis (point data); Join count (area data)
Ordinal scale | Moran's I (not covered in text)
Interval/ratio scale | Moran's I (area data)

Chapter 19 • Problem Solving and Policy Determination in Practical Geographic Situations 291

1. Wanda Wolf, the head of scientific research at Redrock National Park, wants to determine the attitudes of park visitors regarding various Park activities and services. She decides to give a survey questionnaire to every 10th car entering the Park. What type of sample is Wanda planning to take?
2. A cartographer is studying the map interpretation abilities of students at a university. A random sample of 12 seniors and 16 freshmen (28 students in all) answer a set of questions related to the map, and the 28 scores are placed in a single, overall rank order. How would the cartographer test the hypothesis that seniors have better interpretive abilities?
3. A Manitoba farmer is concerned about the threat of hailstorms and wants to know whether he should purchase insurance. He has a 35-year record of hailstorm frequency and knows that anywhere from 0 to 4 storms have hit his farm in the past, with a mean frequency of 1.4 hailstorms per year. How can he calculate the likelihood of a hailstorm hitting his farm next year?
4. An economic geographer wants to examine the strength of relationship between population size and the number of retail establishments for a set of Australian cities. What technique should be applied?
5. An oceanographer wants to analyze the pattern of manganese nodules in a portion of the North Atlantic to determine if the pattern is more clustered than random. What technique should she use?
6. A small village in northern Ontario has a 112-year record showing the number of days in May with measurable precipitation (greater than 0.10 centimeters). Some years are totally dry, with no days of measurable precipitation, but during one very wet year, rainfall occurred 16 days of the month. The mean number of days with measurable rainfall is 3.2. What method should be used to calculate the likelihood of 5 or more days of rainfall for this upcoming May?
7. A soil scientist is interested in observing the spatial structure of pH values at different locations and their similarities with nearby locations. What kind of graph might he make to determine the distance at which the pH values are no longer similar?
8. As part of a recreation planning survey at a park, questions are asked of all persons visiting the park on a particular day. What type of sample is built into this sample design?
9. A regional planner obtains random samples of home costs from four Michigan counties. Suppose the home costs appear normal in each county and that prices seem to vary about the same amount from county to county. What test should the planner use to learn if the differences in home cost by county are significant?
10. A geographer wants to see if monthly rents for student housing vary according to the distance of the rental property from campus. She suspects the closer an apartment unit is to campus, the higher the rent. How can she test whether this distance decay effect is present?
11. Officials in Kamloops, British Columbia, know from a recent survey that 3.2% of their residents are "senior elderly," that is, age 80 and older. What will they be calculating if they try to estimate the likelihood that the true population proportion of senior elderly is between 3.0% and 3.4%?
12. A work-study student assigned to the administrative office at a university is trying to find out what activities a sample of students want in the soon-to-be-constructed student union. The only stipulation is that any student has an equal chance of being included in the sample. What type of sample should the work-study student take?
13. The Barranca Board of Education wants some information about local residents' attitudes toward spending more money on public education. Plans are to take a random sample of 285 voters, learn their political affiliation (Democrat, Republican, Independent) and their attitude about increasing public education expenditures (strongly agree, agree, neither agree nor disagree, disagree, strongly disagree). What technique should be used to determine if there is a relationship between political party and attitude toward greater public education funding?
14. A social services planner in Chicago, Illinois, is concerned about people's access to a set of service centers scattered around the metropolitan area and does not want anyone living too far from any of the centers. How could she test whether the pattern of these centers is more dispersed than random?
15. From a recent nationwide study it is known that the typical American watches 25 hours of television per week (with a population standard deviation of 5.6 hours). Suppose 50 residents of New Orleans, Louisiana, are randomly selected and their viewing hours monitored. What technique can be used to determine whether the viewing habits of New Orleans residents differ from the national mean?
16. A geographer is looking at two maps of Nashville, Tennessee. One map has a point indicating the location of every pawnshop, while the other map has a point representing every grocery store. What descriptive spatial measure should the geographer use to compare their levels of spatial dispersion?
17. A demographer has recorded the number of children per family in a sample of 22 households in the Indian state of Kerala. She suspects from a quickly
sketched histogram that this sample has been taken from an extremely skewed population. What inferential test can she use to test this suspicion?
18. Geographic consultant Pierre Portage is studying visitor activity patterns in the Quebec park system. Managers want to know if park attendance figures differ between June and July, using the same set of parks in each case. What technique should Pierre advise them to apply?
19. A geographer has two isoline maps of Pakistan: one showing mean annual precipitation and the other showing population density. Suppose the same random sample of points is selected from each map. Assume that both variables are from normally distributed populations. What technique will determine the degree to which the two map patterns are associated?
20. A physical geographer is preparing to take soil samples from an area that is 20% marshland. She suspects that environmental degradation is most severe in the marshland portion of the study area, so she wants to ensure that area is proportionally oversampled. What type of spatial point sampling should she apply?
21. Environmentalists are investigating the spatial extent of the thermal plume (higher water temperature) around a water discharge pipe at Calvert Cliffs nuclear power plant near Chesapeake Bay. Suppose personnel want to estimate surface water temperatures at various distances from the discharge point. What statistical procedure would be most appropriate?
22. A choropleth map of states in Mexico uses the following legend categories: 11.37 to 16.26; 16.27 to 21.36; 21.37 to 26.36; 26.37 to 30.14. What method of classification is being used?
23. An economic geographer is studying the spatial patterns of income in Melbourne, Australia, and it is known that the mean family income is $38,346. A sample of 15 Melbourne households reveals an average of $40,487. How should the geographer test to see if the sample of 15 families is representative of the entire population of Melbourne?
24. Suppose you have collected data on infant mortality rate and number of hospitals per 100,000 people from 50 less-developed countries. You want to look at a graphic relationship between these two sets of numbers. What graphic will you construct?
25. A climatologist is looking at modified climographs from 10 weather stations and is focusing on the plots of mean monthly precipitation for each of the stations. Some graphs have a very high spike (showing that almost all the annual precipitation falls in just a couple of months of the year), whereas other graphs indicate precipitation falls rather evenly throughout the year. What descriptive statistic should be used to measure and quantify these visual observations?
26. Steve "Smokey" Jones is a ranger in Plywood National Forest, and he has access to wildfire records over a 65-year period. Some years have no fires, but in one very dry year 16 fires burned fifty acres or more. He has discovered the mean annual number of such extensive fires is 1.8. How can Smokey calculate the odds of three or more wildfires in Plywood next year?
27. You would like to make a county-level choropleth map showing only the counties that have all three of these undesirable characteristics: very high infant mortality rate (worst 10% of all counties), very high suicide rate (worst 10% of all counties), and very high high-school dropout rate (worst 10% of all counties). What kind of map would you construct?
28. A geography student has been studying the relationship of distance from Lake Ontario and the amount of "lake effect snow." Some friends of hers live in this study area and have a cabin 10 miles from the lake. What technique should she apply to estimate how much snowfall they received last winter?
29. London, England, has a long, normally distributed record of annual precipitation. Someone wants to know the probability that more than 800 millimeters of precipitation will fall next year. As a step in this process, what statistic needs to be calculated?
30. A fluvial geomorphologist is studying surface runoff levels in two different environments: (1) a natural setting with no modification by human activity; and (2) an urban setting, highly modified through development. To test for significant differences in runoff, spatial random samples are taken from both settings. Suppose the runoff data are placed into a single overall rank order from high to low within the two sampled areas. What test should be used to see whether the urban runoff exceeds that in the natural setting?
31. A sample of residents is surveyed to determine the number of miles traveled per week not related to work. A year later, following a large increase in gasoline prices, the same individuals are sampled again to record possible changes in discretionary travel behavior. How could you test whether the change in gasoline price is related to decreased automobile use?
32. From a representative sample of Canadian voters, how would a political geographer determine if the percentage of voters favoring placement of a "strict quota" on annual immigration volume differs significantly by gender?
33. A cultural geographer wants to examine the strength of possible relationship between ethnic/racial group identification and type of restaurant most frequented. Suppose 300 people are surveyed,
and assigned to 1 of 5 ethnic/racial groups and 1 of 10 restaurant types. How would you test to determine if restaurant preferences differ significantly by ethnic/racial group?
34. A researcher wants to depict graphically the average snowfall amounts that are 10% likely to be exceeded across western and central Canada (British Columbia, Alberta, Saskatchewan, and Manitoba). The graphic is based on data from 85 weather stations in this region. What will the researcher be depicting graphically?
35. A regional planner is studying income inequalities across a set of counties in northern Appalachia. Median household income by county is known for both 2000 and 2010. How can the planner descriptively measure the relative dispersion of income among counties in 2000, and compare that measure with the 2010 result?
36. Suppose you have the complete historical record of major earthquakes (say 6.0 or higher on the Richter scale) for that portion of the San Andreas Fault on the upper San Francisco Peninsula. Let's suppose the confirmed record indicates the date of each of 22 such incidents over the last 3,000 years. What technique could you use to predict the date of the next major earthquake?
37. An economic geographer is consulting for Flor-Mart (the giant store chain), helping them better determine the demographic profile of their customers. She has an alphabetical master list of all credit card holders and wants every 20th cardholder to complete a detailed questionnaire. The five-percent sample results, with careful follow-up of non-respondents, will be very useful to FM management. What type of sampling is she suggesting?
38. Political geographer Gerry Mander has a metropolitan area map showing the 78 election districts. For each district, Gerry knows whether the majority of registered voters is Republican or Democrat. What technique should he use to learn if political party affiliation is more clustered than random?
39. Mean annual precipitation totals vary considerably across the State of California. Suppose a climatologist wants to examine the spatial structure of these precipitation totals in the state and the amount of similarity for nearby locations. What kind of graph could you draw to show the distance at which mean annual precipitation totals are no longer similar to one another?
40. The U.S. Department of Housing and Urban Development is attempting to reduce crime rates in various public housing projects. Data are collected from a sample of projects for a number of variables thought to be associated with crime, including: resident age, income, employment status, and participation in support programs. What test can be used to determine the relative influence of these various factors on crime?
41. Geoscientist Gary Gabbro is studying erosion rates of several soil types in a county. He eventually wants to use ANOVA to evaluate differences in erosion rates by soil type. He has taken systematic spatial samples from each, but is not sure if any of the underlying populations from which these samples have been taken is normally distributed (an ANOVA assumption). What test should he run before conducting an ANOVA?
42. Chinese agricultural officials are conducting a study of crop damage resulting from insect infestations in the lower Liao River Valley. It is hypothesized that one factor associated with the extent of insect damage is the amount of fertilizer used by different farmers. What test would be used to learn the strength of relationship between fertilizer use levels and the amount of insect damage?
43. An urban geographer wants to analyze a map pattern depicting the number of existing home sales by census block group in Detroit, Michigan. He suspects some blockbusting is occurring and wants to determine the extent of home sales clustering ("hot spots") in certain portions of the metro area. How should he proceed?
44. A random sample exit poll of Jefferson County voters resulted in the following opinions regarding a commercial-retail blue law that would prohibit businesses from being open on Sunday. In Drumlintown, 48% were in favor of such a law, while in the rest of the County, 54% were in favor. How would you test the hypothesis that support for the blue law is higher outside Drumlintown than in the city?
45. A geography student has a table listing all of the states, followed by a zero if the state has not passed legislation making English the official language, or followed by a one if it has. If a map is made using this information, what technique can she use to determine if the spatial pattern of states making English the official language is random?
46. A medical geographer wants to predict the number of physicians living in a city that has a population of 10,000 to 12,000, based on the median family income of the city. Suppose this relationship has been tested for other cities of this size, but not for this particular place. What statistical technique is appropriate?
47. A seismologist in New Zealand has been monitoring earthquakes in the country for years. Many years no "noticeable" quakes (above 3 on the Richter scale) occurred, but one year there were six such quakes. How will the seismologist calculate the
probability of more than one earthquake of this magnitude next year?
48. Suppose an urban planner in San Francisco, California, wants to determine if residents in rental units differ from residents in owner-occupied units with regard to the percentage favoring rent control legislation. If random samples of residents are taken from both types of residences, what statistical test should be used?
49. Transportation planners in Minneapolis, Minnesota, are concerned about the effect of weather on the ridership levels of public transit. The hypothesis is that days with "bad" weather conditions (rain, snow) tend to attract more riders than "good" days (dry, sunny). Given daily ridership data and knowing the overall population variability in these rates, how could the hypothesized difference in ridership levels be statistically validated?
50. A researcher working with a 50-year climatological record of Atlanta, Georgia, knows the number of days each year with temperatures above 90°F. When she constructs a histogram showing the distribution of these 50 values, it doesn't look at all symmetric. A few years had many more "very hot days" than the others. What descriptive statistic can she apply to measure this lack of symmetry?
51. Suppose you have applied the natural breaks (single linkage) method of classification to a set of province-level lung cancer rate data. You would like to examine the step-by-step classification process graphically. What graphic will you examine?
52. Experts handling snow-making equipment for Tamarac Lodge and Ski Resort suspect a relationship between overnight wind direction and the need to make snow during the late winter/early spring season. Using sample data on these variables from the last 10 years, how could they test this relationship?
53. In a list of 50 test scores, the mean score is 74.1 out of 100. What statistic should be calculated to determine how much the typical score varies from this mean?
54. From a random sample of trees in a national forest, a biogeographer wants to identify the pattern of diseased trees to see if they are randomly distributed or clustered. This will provide guidance concerning the most appropriate type of treatment (widespread aerial spraying or concentrated treatment of specific areas from ground level). What test should be used?
55. Jill Doakes, currently with the U.S. Bureau of the Census, wants to estimate the 2030 location of the geographic center of population. Given county-level population estimates for 2030, what descriptive statistic should she use?
56. The cost of a "market basket" of groceries is obtained from random samples of grocery stores in Boston, Atlanta, Minneapolis, and San Diego. The total cost of the same list of food items (same national brands and sizes) is recorded for each store in every city surveyed. What test should be applied to learn if the market basket costs between cities are different?
57. A region contains three major rock types: limestone, calcareous marl, and sandstone. Suppose similar-sized areas are sampled in each rock type, and the number of springs in each is tabulated. What test will determine if the density of springs varies by rock type?
58. A Japanese city has five major shopping centers. A random sample of 25 shoppers at each center is asked how much money they spent shopping in the center that day. What statistical test could be used to see if the typical shopper at the different centers spends differing amounts of money? Assume the pattern of expenditures at each center is very positively skewed.
59. Elmwood, Oklahoma, has two restaurants. Suppose a random sample of patrons at each restaurant is asked the distance they have traveled to eat (the distribution of distances is severely non-normal). What test is appropriate to discover if patrons of the two restaurants travel different distances?
60. A geographer plans to make a choropleth map of Texas counties, showing median family income. However, he wants to use a method of classification that will highlight the unusual counties (outliers), thereby allowing map viewers to quickly spot "extreme" counties. What method of classification should he use?
61. A climatologist is conducting a study of "lake effect snow" along the shores of Lake Ontario. She has collected annual snowfall data from numerous monitoring sites at varying distances from the Lake. What statistical technique should be used to measure the strength of relationship between annual snowfall and distance from the lake?
62. The housing department at Everbrown University has just completed a survey of 86 students, asking the distance of their off-campus housing unit from campus and the monthly rental cost of their unit. Given this information, what technique should housing personnel use to predict the monthly rent for a student living 1.5 miles from campus?
63. Let us suppose that there have been 12 confirmed tornado touch-downs in Ellis County, Kansas since 1900. What statistical technique would you recommend to predict the date of the next tornado touch-down in the County?
64. An earthquake in eastern Turkey resulted in the following number of deaths and injuries in 11 villages located at increasing distances from the epicenter: 520, 410, 320, 310, 50, 170, 210, 140, 100, 50, and 20.
That is, 520 deaths and injuries occurred in the village closest to the epicenter, 410 occurred in the next closest village, and so on. If no further information is available, how can you measure the strength of relationship between proximity to the epicenter and number of casualties and injuries?
65. A random sample of residents whose homes are on the floodplain of the Mississippi River is surveyed regarding their attitudes toward various flood management practices. After a flood has occurred, the same set of homeowners is resurveyed about these issues. What test should be used to determine if attitudes toward flood management policies have changed?
66. A 2012-2013 survey of winter guests at Tamarac Lodge estimates that 58% of the respondents did not ski during their visit. A consultant hypothesizes that an even larger percentage will not ski this upcoming winter. When the data becomes available, how will she test this hypothesis?
67. In previous years, the overall mean travel time for a boat trip down the Colorado River through Grand Canyon National Park was 144 hours. Because of river congestion and overuse, park personnel hypothesize the trip will take significantly more time this year. After the data becomes available, how can you determine if boat trips now take longer?
68. A glaciologist studying temperature variation in Greenland has proposed that temperatures are significantly warmer today than 300 years ago. Using 100 test sites around the subcontinent, he has measured the average temperature over the last year and estimated the average temperature at those sites 300 years ago from the chemical composition of ice bores. How should he test for warming?
69. A political geographer is interested in knowing if the strength of support for a Mississippi Republican candidate for the Senate in the last election was spatially autocorrelated. The candidate received 65% of the vote statewide. Each Mississippi county is identified as either above or below the statewide percentage and this pattern is plotted on a map. What statistical technique should be used?
70. A geographer believes that medical service facilities in a major urban area are clustered near high accessibility locations (such as primary commercial routes and highway interchanges). If each sampled medical service facility is plotted as a point on an area map, what technique could be applied to analyze this pattern?
71. A cartographer is studying the relative effectiveness of different map variations. The same map is produced in two distinct ways: five-color and five shades of gray. The same series of questions about the map are asked of 15 people using each of the two map variations, and the score for each participant is recorded. How can the cartographer test the hypothesis that higher average scores are received when the five-color version of the map is used?
72. An environmental planner has been collecting sulfur dioxide level readings from a sample of monitoring sites at different distances from a coal-burning utility power plant in the Ohio River Valley. What statistical technique can she use to predict the sulfur dioxide level at an unmonitored site 12 miles downwind from the power plant?
73. Texas is known for taking capital punishment seriously; more executions occur here than in any other state. However, a geographer is curious about whether attitudes toward capital punishment vary from one part of the state to another. As one segment of his study, random survey samples of residents are taken from El Paso (west Texas) and Beaumont (east Texas). In each city, he has estimated the percentage of respondents favoring the death penalty. What technique should be used to see if these attitudes differ statistically?
74. A hurricane has recently moved up the Atlantic coast of the United States, causing damage in major coastal cities. Suppose the amount of damage (in millions of dollars) is as follows (the cities are listed from south to north; the southernmost city is listed first and the northernmost city is listed last): 786, 451, 507, 410, 202, 51, 171, 35, 27, 58. How can these data be evaluated to learn if location northward or southward is significantly related to the amount of damage?
75. The agricultural extension agent in an Oregon county is concerned about the adverse effects of hailstorms on farms in her area. Hailstorm activity has been recorded over the last 55 years. Some years, no hailstorms occur anywhere in the county. However, in one extreme year, nine separate hailstorms occurred. How can she calculate the probability of the county experiencing exactly two hailstorms next year?
76. The Vermont Tourist Bureau wants to target its television advertising for the upcoming winter season. The Bureau director feels that visitors come equally from Boston, New York, and metropolitan regions in southern Canada. Many resort owners, however, believe that Boston is the clear leader for tourists. Using data samples from the previous season at the seven largest resorts in the state, how could this question be resolved?
77. The chair of a local environmental action group hypothesizes that a relationship exists between the volume of recycled material left along the curb for pickup and various homeowner characteristics, including: age, income, education, and a measure of
homeowner community involvement. What method should be used to predict the likelihood of a particular homeowner participating in the recycling program?
78. Suppose you have a set of data listing the total number of murders from automatic high-capacity guns per thousand people by state since the year 2000. You would like to display these data graphically, showing the location of each state on this distribution and also showing the interquartile range. What graphic will you construct?
79. Suppose a demographer for the Ontario provincial government is looking at both the number of migrants moving into economic development areas and the number of migrants leaving the same areas. How can she test whether the number of in- and out-migrants is statistically different?
80. A tornado-spotter wants to analyze the initial "touch-down" point pattern for tornadoes that initially touched down in Arkansas. The overall pattern appears somewhat clustered, but there is some disagreement. What inferential technique can be used to determine whether the pattern is more clustered than random?
81. Personnel at an Ontario Provincial Park want to locate a new campground on a riverfront site that will not be flooded too frequently. Budget managers advise a location that will be flooded no more frequently than once a decade. One possible site has flooded 8 times in the last century. What procedure should be used to determine the probability that this site will not be flooded too frequently in the next 30 years?
82. An urban geographer is part of a team planning to collect data from city residents. One action that has been suggested is to survey every home in selected city blocks (to save time and money). What sampling strategy would this be?
83. Hydrologists want to analyze the volume of material carried by two rivers (expressed in grams of solid material per liter of water). One river drains a predominantly agricultural area, while the other drains a predominantly forested area. To test for differences in volume of material carried, 12 random water samples are taken from both rivers, one sample a week through the summer months. What test should be used?
84. An economic development planner is interested in evaluating changing patterns of poverty in Appalachia. She has a series of county-level choropleth maps showing each county as either above regional median family income (shaded) or below regional median family income (unshaded). How can she test whether poverty is becoming more spatially concentrated over time?
85. The community of Sliding Stone, Pennsylvania, has five clearly identifiable neighborhoods. A student at the local university thinks lot sizes differ by neighborhood, so he selects a random sample of 10 lots from each neighborhood. It is assumed that lot sizes vary within each neighborhood by differing amounts. What statistical method should he use to verify his hypothesis?
86. A newspaper reporter for a small regional paper wants to determine if the local Labour Party candidate for Parliament has a similar level of support among voters in two adjacent towns. The proportion of a random sample of voters in each town favoring the candidate is recorded. How can the reporter discover if the support level for the candidate differs between the two towns?
87. Suppose there are two choropleth maps of Texas counties, one showing median family income in one of five ordinal categories, and another showing the percentage of the labor force in professional occupations classified into four ordinal categories. What technique should be used to learn if the two choropleth map patterns differ?
88. A private high school is situated in the center of a five-neighborhood region. Each neighborhood has about the same high school age population. A random selection of 100 students at the school resulted in the following counts: 22, 24, 13, 24, and 17 (that is, 22 students from neighborhood 1, 24 from neighborhood 2, and so on). What technique can be used to discover if student enrollments vary by neighborhood?
89. An urban geographer is examining citizens' attitudes toward growth in a rapidly expanding suburban area. The research design is set up to survey a random sample of 230 people at two different times. First, attitudes toward growth will be measured before construction of a major new regional shopping mall. Later, the same people will be resurveyed after the mall has been open a year. How can the geographer test whether respondents' attitudes toward growth have changed?
90. Researchers continue to explore the various factors that might be related to the homicide rate in a major city. Data are available by census tract for many variables, including age, income, education, and unemployment rate. What test could be used to estimate the homicide rate in a census tract not included in the original study?
91. Suppose you are studying historical changes in housing characteristics in your hometown. You want to see if house size (square feet of living area) and the age of the home are associated. What method will you use to see if a relationship exists?
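Problem 81 above can be read as a binomial problem (Chapter 5): 8 floods in 100 years gives an estimated yearly flood probability of p = 0.08, and "no more frequently than once a decade" over the next 30 years can be interpreted as at most 3 floods in 30 independent years. Under that reading (an interpretation for illustration, not the text's worked solution), a minimal sketch:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Problem 81: p = 8/100 floods per year; probability of at most 3 floods
# (i.e., no more than once a decade) over the next 30 years
p_acceptable = sum(binom_pmf(k, 30, 0.08) for k in range(4))   # about 0.784
```

Summing the mass function over the acceptable counts is the same cumulative logic used for the Poisson problems elsewhere in the chapter; only the distribution differs because here the number of trials (years) is fixed.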
92. In another study concerned about crop damage caused by insect infestations in the Liao River Valley of China, insecticide use levels seem to be critical. Officials want to predict the amount of crop damage expected on a farm not included in the original study, solely based on the level of insecticide use. What technique should be used?

93. A recreation geographer is testing the effectiveness of an attendance prediction model (number of visitors from a set of cities to a set of regional parks). It is hypothesized that attendance will be directly related to city population size and inversely related to distance from the city to the park. What statistical technique could be used to test the validity of this model?

94. Suppose you have data from the Centers for Disease Control (CDC) listing HIV rates for all counties in the United States. You want to construct a choropleth map based on these data, and your goal is to have about the same number of counties allocated to each of the five categories you plan to have on the map. What classification method are you going to use?

95. The World Health Organization (WHO) has collected data on tuberculosis infection rates from a sample of neighborhoods in Lagos, Nigeria. If these neighborhoods are classified as either "more polluted" or "less polluted" based on levels of industrial activity, how can WHO researchers test whether tuberculosis infection rates differ by pollution level of the neighborhood?

96. A cartography student has proposed a research paper project. She wants to contrast the map interpretation skills of male versus female students. Suppose map interpretation test scores from 18 males and 20 females are placed in an overall rank order, from highest score to lowest score. What technique should she use to learn if male and female map interpretation skills differ?

97. On the wall of the fire department headquarters in Chicago, Illinois, a large map shows the location of fire stations and recent large fires. Someone is using a ruler to measure the straight-line distance from a fire station to a fire. A visiting geographer says, "No, that is not the true distance!" What measure of distance should be used?

98. Health officials are concerned about the recent diffusion of influenza in Moscow, Russia. If the number of cases is known for each local administrative unit for several consecutive weeks, the morbidity rate for influenza could be displayed on a series of choropleth maps (one map for each week), and the degree of clustering, randomness, or dispersion could be analyzed for each map pattern. What technique would be used?

99. The SAT scores of 50 Middlebury students, 41 Salisbury students, and 83 Boysenbury students are randomly selected for analysis, and an overall ranking of scores is created. What statistical test should be used to learn if SAT scores between these three schools are different?

100. After completing a local pollution abatement program, Florence Floss wants to know if mean pollution levels from a sample of 10 sites are significantly lower than national Environmental Protection Agency standards. The national mean EPA standard and variance around that national standard are known from existing reports. What statistical procedure should Ms. Floss follow?

101. In Banff National Park, Alberta, some observers think the number of black bear sightings has increased dramatically. Park headquarters has data listing the number of sightings at each campground that existed 20 years ago. How could a resource manager test this historical data against a newly collected set of sightings from the same campgrounds to determine if any increase has occurred?

102. You suspect that teenage pregnancy rates vary considerably from state to state. After collecting the state-level teenage pregnancy data, you want to classify the information and display it on a choropleth (area pattern) map. However, you want to classify the data so that variation within each category is minimized while variation between categories is maximized (as much as possible). What simple classification do you propose?

103. Wolves are now being reintroduced into areas where they formerly thrived, but have recently been eliminated. Of course, this strategy is controversial, usually opposed by area farmers and ranchers. Suppose personnel from the Department of the Interior plan to release 24 wolves from a single departure point in Yellowstone National Park. Each wolf is tagged and released individually at one-hour intervals. You have the data showing the movement pattern of each wolf. How can you best summarize these data graphically to show the general direction of movement?

104. City planners want to determine the attitudes of residents concerning possible construction of a new public swimming pool. They want their proposed 5-percent sample to be representative of the entire community, and they decide to use number of children per household as an indicator variable. If the mean number of children per household is known community-wide, how should planners test to see if the 5-percent sample mean is representative of the community-wide mean?

105. Coral Reef is conducting a study of hazard perception in a community along the Susquehanna River. A questionnaire Coral has written lists three possible management strategies: strengthen the levee, re-channel a portion of the water from upstream, and relocate selected structures. The study protocol contains three distinct groups: people living in the 25-year flood zone, people living within the 100-year flood zone, and people living on higher ground. How can Ms. Reef test whether the preferred management strategy is related to the respondent's location?

298 Part VII ▲ Epilogue

106. A planner in Tampa, Florida has assisted in the design of a newly planned community project, which has on-site health care facilities. He hypothesizes that residents in the planned community are older than the metropolitan-wide mean. If the age statistics for all Tampa residents are known, what statistical technique should be applied to learn if planned community residents are older?

107. The United Nations Population Fund wants to evaluate the success of a family planning program in selected small villages in Somalia. Previous studies have shown that a 38% rate of effective contraceptive use after two years is a suitable international standard for success. How should one test whether the Somali program has been effective?

108. In Pennsylvania counties, unemployment rate data are available for 2011 and 2012. How can Terrance Tundra (a planner who works with Commonwealth labor statistics) measure the relative variability of 2013 county unemployment rates, to compare descriptively with the previous two years?

109. A climatologist has measured sulfur dioxide levels in the lower atmosphere repeatedly at seven different weather stations along a line of latitude across central Canada for an entire year. How can he test for differences in sulfur dioxide levels between stations?

110. A geographer with the World Bank is studying global economic development trends and has information regarding the percentage of labor force in "telecommunications" jobs for a sample of countries. He hypothesizes that countries in "upper," "middle," and "lower" income categories have different percentages of their labor force in this job sector. A potential problem is that some sample distributions appear to be from highly skewed populations. What statistical test should be used?

111. A geographer is studying total fertility rates in African countries and wants to correlate these data with other variables using Pearson's correlation coefficient. How should she test sample total fertility rate data from each African country to determine if it is from a normally distributed population and suitable for subsequent input into the Pearson's correlation test?

112. A spatial analyst for the Center for Climatological Research is examining multiple point pattern maps (one map pattern per year) showing the distribution of hailstorms across a five-county region in northeast Colorado. How can these data be analyzed to compare directly the amount of relative dispersion on each of the maps?

113. Complete historical climate records for Fairbanks, Alaska, indicated that 18.6% of the days have clear weather, with no appreciable cloudiness. Suppose a sample of 50 days in the last year is selected at random, and 26% of these days had clear weather. What test would you use to determine if the sample is typical or representative of the overall climate record?

114. In 2010, the Ontario Tourism Bureau compiled a "Top 10" list from a Province-wide sample survey, asking residents to rank cities in the Province by "livability." The Bureau wants to again ask a sample of Ontario residents to provide a current list. What statistical test should be used to measure the degree to which the two Top 10 lists are related?

115. Experts studying inflation rates for people throughout Europe hypothesize that people living in countries that currently belong to the European Union (EU) are enjoying lower inflation rates than people living in non-EU countries. Large random samples of individuals in EU and non-EU countries are taken, and the mean inflation rate is calculated for each group. What technique is used to test for significant difference?

116. The Chinook City Council feels that the turnover rate of land parcels is not equivalent across all types of land use. The Council decides to tabulate type of land use (residential, commercial-retail, industrial) against the timing of the last sale of that land parcel (within the last year, 1 to 5 years ago, 6 to 10 years ago, and more than 10 years ago). If they randomly select 200 land parcels from the City records, how should they determine if a relationship exists between land use and date of last sale?

117. A planner wants to examine the pattern of housing quality in a lower-income suburb of Birmingham, England. Suppose a random sample of housing units is taken from each of Birmingham's local election districts, and each district is then categorized as having a higher or lower number of substandard units than the metro-wide mean, and a binary map is drawn. The planner suspects that the substandard units are mostly concentrated near the city center. How can this hypothesis be tested?

118. Joe Doakes is working on a new "Atlas of Australia" and has been asked to construct an isoline map showing the likelihood that more than 20 inches of precipitation will occur at various weather stations across the country. What should Joe create to complete this task?
119. A geographer wants to learn if mean soybean yields per acre on three soil types are significantly different. From a sample of farms on each soil type, he collects the soybean yield values. It is assumed that the populations from which the samples are taken are normally distributed. What statistical test should he use?

120. Marcia Monadnock is the recreation activities planner at a ski resort in the Catskill Mountains. She thinks that weather conditions in New York City on the Fridays before a potential ski weekend influence the number of people who will actually ski that weekend. Specifically, she hypothesizes that if it is a "warm" Friday there will be fewer skiers, while if it is a "cold" Friday there will be more skiers. Using visitation data from a sample of winter weekends over the last decade, how can she determine if her hypothesis is correct?

121. Religious affiliation (Protestant, Catholic, Jewish, etc.) and voting preference (Democrat, Republican, and Independent) data are collected from a sample of 250 people in a key New York City election precinct. How can the hypothesis be tested that religious affiliation is related to voting preference?

122. Transportation planners in metropolitan Eskerville know that the mean commuting distance metro-wide is 5.4 miles (each way). A detailed study is made of 25 randomly selected commuters, and their mean commuting distance is 7.1 miles. How should planners test whether the sample of commuters is typical of all commuters in Eskerville?

123. A World Health Organization researcher in Kenya is studying the distribution of three medical problems: "river blindness," HIV infection, and an infectious form of "sleeping sickness." Random samples are collected from 100 people on a river floodplain and 125 people on an adjacent highlands area. How can the researcher determine whether the incidence of these medical problems is related to the type of physical location in which they live?

124. Fred Foehn lives 500 meters from an interstate highway. From his front yard, Fred has been measuring noise levels at noon on a systematic sample of days over the last year. At the same time, he also records the primary wind direction (north, south, east, or west). Fred hypothesizes that noise level is related to wind direction. How can he test this hypothesis?

125. Along a segment of bayfront beach, environmental researcher Donna Dune is studying selected measures of soil chemistry. She thinks that the human impacts on soil chemistry will be most severe in the tidal wetland portion of the beach, which encompasses about 15% of the total study area. Donna therefore decides to "oversample" in the tidal wetland area, taking almost 50% of her soil samples from that area. What type of spatial sampling design is she advocating?

126. You would like to summarize graphically the point distribution and variation of UFO (unidentified flying object) sightings in New Mexico since the year 2000. What graphic summaries would it be logical to display?

127. A vocational-technical community college is proposed in a community. A comprehensive citywide poll conducted two years ago indicated that 58% of all community residents were in favor of financing construction of this facility. Since then, economic difficulties may have altered residents' views on this issue, so a new sample survey has been proposed. How can community leaders learn if the view about financing this project has remained unchanged?

128. A corn farmer in central Nebraska has discovered that at least three inches of precipitation are needed during a critical three-week period in the growing season to avoid irrigation. If he has to irrigate, he will lose money that year. Historic data indicate that in 15 of the last 18 years precipitation has exceeded three inches during this time. How can the farmer determine if it is profitable to plant corn?

129. A health services provider is surveying attitudes regarding "freedom of choice" for women in continuing pregnancy across subareas of Cleveland, Ohio. She wants to predict the level of support for freedom of choice in a critical precinct of the City. Fortunately, she has information from previous precinct-level models (but not this particular precinct) demonstrating that several factors influence these attitudes, including: median age, percent Catholic, median household income, and median school years completed. What statistical procedure will she utilize?

130. The historical record at a site along the lower Brahmaputra River in Bangladesh indicates that major flooding occurs at least once during the rainy season 4 years out of 10 (40% of the time). How would you estimate the probability that major flooding will occur at least once in 6 of the next 10 years?

131. Georgianna Peach, a medical services provider in Savannah, Georgia, is concerned about attendance at a health care clinic in town. She wants to develop and test a model that will predict clinic use level as a function of population size within one mile, number of high school educated people within one mile, and number of households below poverty level within one mile. What statistical technique is she proposing?

132. Dilbert Doakes is a consultant for the Los Angeles Fire Department, and he has been asked to look at the spacing of fire stations throughout the city. Of particular concern are structures that may be too far from the nearest fire station to be protected adequately. What statistical technique should he use to help address this concern?

133. Suppose the mean age of New Zealand's population is 34.4 years. Raymond Yacht club (the new planning director of Wellington) wants to determine if Wellington has a typical age profile, with a mean age similar to the national figure. If the age of 80 Wellington residents is collected, what technique will he use?

134. According to a recent nationwide study, only 54% of all Americans can locate Mexico on a world map (this is a true fact!). From a sample of 60 students in a college geography course, suppose 63% are able to locate Mexico. What test should be used to learn if the sample of students is typical?

135. From samples of less-developed countries and more-developed countries, suppose data are collected on mortality rates caused by heart disease and other heart-related problems. What technique should be used to determine if there is a difference between LDC and MDC death rates as a result of these causes?

136. Coralie Ollis is a geographer doing some consulting for a major car insurance company. She has pulled a random sample of 375 accident claims from company files, and her hypothesis is that the number of insurance claims varies significantly by season of the year (spring, summer, fall, and winter). How will she test this hypothesis statistically?

137. The mayor of Concord, Kentucky received 60.8% of the vote in his successful election to that office. He is running for re-election, and a survey of 110 expected voters is planned. What technique will tell his campaign manager if the current level of support differs significantly from the first election?

138. A climatologist has 85 years of annual snowfall data for Flagstaff, Arizona. What inferential statistical procedure can be applied to learn if the distribution of annual snowfalls is normal?

139. Suppose an economic geographer is given a choropleth map of North Carolina counties, with each county classified in a binary fashion as either above or below the statewide mean regarding the percentage of population below the poverty level. She hypothesizes that the far western (Appalachian) counties will emerge as a poorer-than-average cluster. What technique should she use?

140. A staff cartographer is assigned to design a world map showing birth rates by country. The study director wants the following categories to be used: less than 9.9, 10.0-19.9, 20.0-29.9, 30.0-39.9, and 40.0 and greater. What method of classification is to be applied in this map design?

141. A transportation engineer records noise levels at two adjacent locations, both 200 yards from a major highway. The first location has an intervening noise barrier about 10 yards from the highway, while the second location does not have any intervening noise barrier. Based on random samples (taken at the same times at both locations), how can the engineer test whether there is a significant difference in noise levels at the two locations?

142. A political geographer is interested in determining the degree of spatial autocorrelation for illiteracy rates in a large U.S. city. Using school district boundary lines, he wants to know if school districts having high illiteracy rates are located near other school districts of high illiteracy. What kind of statistical test should he perform?

143. The U.S. Geological Survey wants to map the likelihood that a magnitude 5.0 or greater earthquake (Richter scale) will occur at various locations across southern California sometime in the next decade. What technique should be applied?

144. Urban geographers have long thought that home lot size is larger the greater the distance from the city center. What technique would you apply to confirm this hypothesis and predict the lot size of a home six miles from the city center?

145. A demographer is studying population growth rates within Nigeria to see if this variable is related spatially to religious affiliation. The demographer has collected a systematic spatial sample from 250 households across the country. From each household he has recorded their religious affiliation and the number of children. How should he now proceed statistically?

146. How would a geomorphologist test the relationship between type of bedrock material and level of surface soil acidity for a sample of sites in a river valley? That is, do different types of bedrock material support soils having different levels of acidity?

147. Legislators in Aspen, Colorado want to learn if year-round residents differ from seasonal residents with regard to the percentage favoring a law to strengthen the existing noise control ordinance. If random samples are taken from both types of residents, what test should be used?

148. From two small villages in the Scottish Highlands, a researcher has collected random samples of villager height. How can the researcher determine if residents of one village are significantly taller (or shorter) than residents of the other village?

149. Winter tourism levels may be increasing in the Canadian Rockies. A survey conducted in 2005 calculated the mean seasonal revenue (due to tourism) for 45 communities in this region. What statistical test would you use to learn if seasonal revenue from last year has increased since 2005 for these same communities?

150. A community in Indiana has six neighborhood parks. The local recreation planner takes a random sample of visitors to each park, asking them the distance from their home to the park. (The sample data are assumed to be from non-normal populations.) What test can be used to determine if park visitors travel different distances to the six neighborhood parks?

151. A university student in Toronto, Ontario is conducting research about attitudes toward a newly proposed rent control initiative. He knows that about 58% of Toronto residents live in rental units, and he wants to ensure that about 58% of the respondents to his survey live in rental units. What type of sampling design should he utilize?

152. A university student in Atlanta, Georgia wants to write a course research paper on the distribution of poverty in the metro area. She collects poverty rate data by census tract, and wants to determine if tracts with high poverty levels are located near other tracts of high poverty. What kind of test will she perform for her research paper?

19.2 ANSWERS TO GEOGRAPHIC PROBLEMS AND POLICY SITUATIONS

1. systematic sample
2. two-sample difference test-Wilcoxon/Mann-Whitney
3. Poisson probability
4. Pearson's correlation coefficient
5. quadrat analysis
6. Poisson probability
7. variogram
8. cluster sample
9. analysis of variance
10. Pearson's correlation coefficient
11. confidence interval
12. random sample
13. contingency analysis
14. nearest neighbor analysis
15. one-sample difference of means-Z
16. standard distance
17. Kolmogorov-Smirnov goodness-of-fit, normal
18. matched-pairs t
19. Pearson's correlation coefficient
20. stratified disproportional
21. simple linear regression
22. equal intervals based on range
23. one-sample difference of means-t
24. scatterplot
25. kurtosis
26. Poisson probability
27. contingency (crosstab) map, conditional coverage
28. simple linear regression
29. standard score
30. two-sample difference test-Wilcoxon/Mann-Whitney
31. matched-pairs t
32. two-sample difference of proportions
33. contingency analysis
34. probability map
35. coefficient of variation
36. geometric probability
37. systematic sample
38. join count analysis
39. variogram
40. multiple regression
41. Kolmogorov-Smirnov goodness-of-fit (normal)
42. Pearson's correlation coefficient
43. Moran's I-local
44. two-sample difference of proportions
45. join count analysis
46. simple linear regression
47. Poisson probability
48. two-sample difference of proportions
49. two-sample difference of means-Z
50. skew
51. dendrogram
52. chi-square or Pearson's correlation coefficient
53. standard deviation
54. quadrat analysis
55. weighted mean center
56. analysis of variance
57. chi-square goodness-of-fit-uniform
58. Kruskal-Wallis analysis of variance
59. two-sample difference test-Wilcoxon/Mann-Whitney
60. natural breaks-either single linkage or Jenks
61. Pearson's correlation coefficient
62. simple linear regression
63. geometric probability
64. Spearman's rank correlation coefficient
65. matched-pairs t
66. two-sample difference of proportions
67. one-sample difference of means-Z or t
68. matched-pairs t
69. join count analysis
70. quadrat analysis
71. two-sample difference of means-t
72. simple linear regression
73. two-sample difference of proportions
74. Spearman's rank correlation coefficient
75. Poisson probability
76. contingency analysis
77. multiple regression
78. boxplot
79. matched-pairs t
80. quadrat analysis
81. binomial probability
82. cluster sampling
83. two-sample difference of means
84. join count analysis
85. Kruskal-Wallis analysis of variance
86. two-sample difference of proportions
87. Spearman's rank correlation coefficient
88. chi-square goodness-of-fit uniform
89. matched-pairs t
90. multiple regression
91. Pearson's correlation coefficient
92. simple linear regression
93. chi-square goodness-of-fit proportional
94. quantile breaks
95. two-sample difference of proportions
96. two-sample difference test-Wilcoxon/Mann-Whitney
97. Manhattan or network
98. Moran's I
99. Kruskal-Wallis analysis of variance
100. one-sample difference of means
101. matched-pairs t
102. Jenks' method of natural breaks
103. linear directional mean
104. one-sample difference of means-Z or t
105. contingency analysis
106. one-sample difference of means-Z or t
107. one-sample difference of proportions
108. coefficient of variation
109. analysis of variance
110. Kruskal-Wallis analysis of variance
111. Kolmogorov-Smirnov goodness-of-fit, normal
112. coefficient of variation
113. one-sample difference of proportions
114. Spearman's correlation coefficient
115. two-sample difference of means
116. contingency analysis
117. join count analysis
118. probability map
119. analysis of variance
120. two-sample difference of means-t
121. contingency analysis
122. one-sample difference of means-t
123. contingency analysis
124. analysis of variance
125. disproportional stratified point sample
126. mean center and standard deviational ellipse
127. one-sample difference of proportions
128. binomial probability
129. multiple regression
130. binomial probability
131. multiple regression
132. nearest neighbor analysis
133. one-sample difference of means-Z
134. one-sample difference of proportions
135. two-sample difference of means-Z or t
136. chi-square goodness-of-fit uniform
137. one-sample difference of proportions
138. Kolmogorov-Smirnov goodness-of-fit, normal
139. join count analysis
140. equal interval not based on range
141. two-sample difference of means
142. Moran's I
143. probability map
144. simple linear regression
145. contingency analysis
146. chi-square goodness-of-fit uniform
147. two-sample difference of proportions
148. two-sample difference of means
149. matched-pairs t
150. Kruskal-Wallis analysis of variance
151. stratified proportional sample
152. Moran's I
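Several of the answers above call for a direct probability computation rather than a hypothesis test. As an illustration (a sketch added here, not part of the original answer key), the binomial answer to problem 130 can be checked numerically; reading the question as asking for exactly 6 flood years out of the next 10, with p = 0.4 in each independent year, is our interpretation, and the function name below is our own:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Problem 130: major flooding in any given rainy season with p = 0.4.
# Probability of exactly 6 flood years in the next 10:
prob = binomial_pmf(6, 10, 0.4)
print(round(prob, 4))  # 0.1115
```

The same function applies to problem 128 (precipitation exceeding three inches), where the observed 15 successes in 18 years supply the estimate of p.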
Appendix A
Statistical Tables

TABLE A
The Normal Table
(Figure: the standard normal curve, with the area A shaded between 0 and z.)

Note: To get A for a given value of Z, insert a decimal point before the four digits. For example, Z = 1.43 gives A = 0.4236.

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0000 0040 0080 0120 0160 0199 0239 0279 0319 0359
0.1 0398 0438 0478 0517 0557 0596 0636 0675 0714 0753
0.2 0793 0832 0871 0910 0948 0987 1026 1064 1103 1141
0.3 1179 1217 1255 1293 1331 1368 1406 1443 1480 1517
0.4 1554 1591 1628 1664 1700 1736 1772 1808 1844 1879
0.5 1915 1950 1985 2019 2054 2088 2123 2157 2190 2224

0.6 2257 2291 2324 2357 2389 2422 2454 2486 2517 2549
0.7 2580 2611 2642 2673 2704 2734 2764 2794 2823 2852
0.8 2881 2910 2939 2967 2995 3023 3051 3078 3106 3133
0.9 3159 3186 3212 3238 3264 3289 3315 3340 3365 3389
1.0 3413 3438 3461 3485 3508 3531 3554 3577 3599 3621

1.1 3643 3665 3686 3708 3729 3749 3770 3790 3810 3830
1.2 3849 3869 3888 3907 3925 3944 3962 3980 3997 4015
1.3 4032 4049 4066 4082 4099 4115 4131 4147 4162 4177
1.4 4192 4207 4222 4236 4251 4265 4279 4292 4306 4319
1.5 4332 4345 4357 4370 4382 4394 4406 4418 4429 4441

1.6 4452 4463 4474 4484 4495 4505 4515 4525 4535 4545
1.7 4554 4564 4573 4582 4591 4599 4608 4616 4625 4633
1.8 4641 4649 4656 4664 4671 4678 4686 4692 4699 4706
1.9 4713 4719 4726 4732 4738 4744 4750 4756 4761 4767
2.0 4772 4778 4783 4788 4793 4798 4803 4808 4812 4817

2.1 4821 4826 4830 4834 4838 4842 4846 4850 4854 4857
2.2 4861 4864 4868 4871 4875 4878 4881 4884 4887 4890
2.3 4893 4896 4898 4901 4904 4906 4909 4911 4913 4916
2.4 4918 4920 4922 4925 4927 4929 4931 4932 4934 4936
2.5 4938 4940 4941 4943 4945 4946 4948 4949 4951 4952

2.6 4953 4955 4956 4957 4959 4960 4961 4962 4963 4964
2.7 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974
2.8 4974 4975 4976 4977 4977 4978 4979 4979 4980 4981
2.9 4981 4982 4982 4983 4984 4984 4985 4985 4986 4986
3.0 4987 4987 4987 4988 4988 4989 4989 4989 4990 4990

3.1 4990 4991 4991 4991 4992 4992 4992 4992 4993 4993
3.2 4993 4993 4994 4994 4994 4994 4994 4995 4995 4995
3.3 4995 4995 4996 4996 4996 4996 4996 4996 4996 4997
3.4 4997 4997 4997 4997 4997 4997 4997 4997 4998 4998
3.5 4998 4998 4998 4998 4998 4998 4998 4998 4998 4998
The area, A, stays at 0.4998 until z = 3.62. From z = 3.63 to 3.90, A = 0.4999. For z > 3.90, A = 0.5000, to four decimal places.
From Leon F. Marzillier, Elementary Statistics © 1990 by Wm. C. Brown Publishers.
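The tabulated areas are values of the standard normal CDF shifted by 0.5, so any entry of Table A can be reproduced from the error function in Python's standard library. This is a quick check added here (the function name is our own), not part of the original table:

```python
import math

def normal_table_area(z):
    """Area under the standard normal curve between 0 and z,
    i.e., Phi(z) - 0.5, matching the four-digit entries of Table A."""
    return 0.5 * math.erf(z / math.sqrt(2.0))

# The note's worked example: Z = 1.43 gives A = 0.4236.
print(round(normal_table_area(1.43), 4))  # 0.4236
```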

303
304 Appendix ▲ Statistical Tables

TABLE B
Table of Random Numbers
31871 60770 59235 41702 89372 28600 30013 18266 65044 61045
87134 32839 17850 37359 27221 92409 94778 17902 09467 86757
06728 16314 81076 42172 46446 09226 96262 77674 70205 98137
95646 67486 05167 07819 79918 83949 45605 18915 79458 54009
44085 87246 47378 98338 40368 02240 72593 52823 79002 88190

83967 84810 51612 81501 10440 48553 67919 73678 83149 47096
49990 02051 64575 70323 07863 59220 01746 94213 82977 42384
65332 16488 04433 37990 93517 18395 72848 97025 38894 46611
42309 04063 55291 72165 96921 53350 34173 39908 11634 87145
84715 41808 12085 72525 91171 09779 07223 75577 20934 92047

63919 83977 72416 55450 47642 01013 17560 54189 73523 33681
97595 78300 93502 25847 19520 16896 69282 16917 04194 25797
17116 42649 89252 61052 78332 15102 47707 28369 60400 15908
34037 84573 49914 59688 18584 53498 94905 14914 23261 58133
08813 14453 70437 49093 69880 99944 40482 04254 62842 68089

67115 41050 65453 04510 35518 88843 15801 86163 49913 46849
14596 62802 33009 74095 34549 76634 64270 67491 83941 55154
70258 26948 60863 47666 58512 91404 97357 85710 03414 56591
83369 81179 32429 34781 00006 65951 40254 71102 60416 43296
83811 49358 75171 34768 70070 76550 14252 97378 79500 97123

14924 71607 74638 01939 77044 18277 68229 09310 63258 85064
60102 56587 29842 12031 00794 90638 21862 72154 19880 80895
33393 30109 42005 47977 26453 15333 45390 89862 70351 36953
92592 78232 19328 29645 69836 91169 95180 15046 45679 94500
27421 73356 53897 26916 52015 26854 42833 64257 49423 39440

26528 22550 36692 25262 61419 53986 73898 80237 71387 32532
07664 10752 95021 17030 76784 86861 12780 44379 31261 18424
37954 72029 29624 09119 13444 22645 78345 79876 37582 75549
66495 11333 81101 69328 84838 76395 35997 07259 66254 47451
72506 28524 39595 49356 92733 42951 47774 75462 64409 69116

09713 70270 28077 15634 36525 91204 48443 50561 92394 60636
51852 70782 93498 44669 79647 06321 04020 00111 24737 05521
31460 22222 18801 00675 57562 97923 45974 75158 94918 40144
14328 05024 04333 04135 53143 79207 85863 04962 89549 63308
84002 98073 52998 05749 45538 26164 68672 97486 32341 99419

89541 28345 22887 79269 55620 68269 88765 72464 11586 52211
50502 39890 81465 00449 09931 12667 30278 63963 84192 25266
30862 61996 73216 12554 01200 63234 41277 20477 71899 05347
36735 58841 35287 51112 47322 81354 51080 72771 53653 42108
11561 81204 68175 93037 47967 74085 05905 86471 47671 18456

From Leon F. Marzillier, Elementary Statistics © 1990 by Wm. C. Brown Publishers.
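A table of five-digit random number blocks in the style of Table B can also be produced in software; the sketch below uses Python's standard library (the function name and layout are our own, and a seeded software generator is pseudo-random rather than a reproduction of the published table):

```python
import random

def random_digit_table(rows=10, cols=10, block=5, seed=1):
    """Return a rows x cols grid of zero-padded random digit blocks,
    in the style of a printed random number table."""
    rng = random.Random(seed)  # fixed seed makes the table reproducible
    return [[f"{rng.randrange(10**block):0{block}d}" for _ in range(cols)]
            for _ in range(rows)]

for row in random_digit_table(rows=3):
    print(" ".join(row))
```

In practice the blocks are read in any consistent order (across rows, down columns, or diagonally) to draw a random sample, exactly as with the printed table.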

TABLE C
Student's t Distribution

Degrees of freedom

t 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0.1 0317 0353 0367 0374 0379 0382 0384 0386 0387 0388 0389 0390 0391 0391 0392 0392
0.2 0628 0700 0729 0744 0753 0760 0764 0768 0770 0773 0774 0776 0777 0778 0779 0780
0.3 0928 1038 1081 1104 1119 1129 1136 1141 1145 1148 1151 1153 1155 1157 1159 1160
0.4 1211 1361 1420 1452 1472 1485 1495 1502 1508 1512 1516 1519 1522 1524 1526 1528
0.5 1476 1667 1743 1783 1809 1826 1838 1847 1855 1861 1865 1869 1873 1876 1878 1881
0.6 1720 1953 2046 2096 2127 2148 2163 2174 2183 2191 2197 2202 2206 2210 2213 2215
0.7 1944 2218 2328 2387 2424 2449 2467 2481 2492 2501 2508 2514 2519 2523 2527 2530
0.8 2148 2462 2589 2657 2700 2729 2750 2766 2778 2788 2797 2804 2810 2815 2819 2823
0.9 2333 2684 2828 2905 2953 2986 3010 3028 3042 3054 3063 3071 3078 3083 3088 3093
1.0 2500 2887 3045 3130 3184 3220 3247 3267 3283 3296 3306 3315 3322 3329 3334 3339
1.1 2651 3070 3242 3335 3393 3433 3461 3483 3501 3514 3526 3535 3544 3551 3557 3562
1.2 2789 3235 3419 3518 3581 3623 3654 3678 3696 3711 3723 3734 3742 3750 3756 3762
1.3 2913 3384 3578 3683 3748 3793 3826 3851 3870 3886 3899 3910 3919 3927 3934 3940
1.4 3026 3518 3720 3829 3898 3945 3979 4005 4025 4041 4055 4066 4075 4084 4091 4097
1.5 3128 3638 3847 3960 4030 4079 4114 4140 4161 4177 4191 4203 4212 4221 4228 4235
1.6 3222 3746 3960 4075 4148 4196 4232 4259 4280 4297 4310 4322 4332 4340 4348 4354
1.7 3307 3844 4062 4178 4251 4300 4335 4362 4383 4400 4414 4426 4435 4444 4451 4458
1.8 3386 3932 4152 4269 4341 4390 4426 4452 4473 4490 4503 4515 4525 4533 4540 4546
1.9 3458 4026 4232 4349 4421 4469 4504 4530 4551 4567 4580 4591 4601 4609 4616 4622
2.0 3524 4082 4303 4419 4490 4538 4572 4597 4617 4633 4646 4657 4666 4674 4680 4686
2.1 3585 4147 4367 4482 4551 4598 4631 4655 4674 4690 4702 4712 4721 4728 4735 4740
2.2 3642 4206 4424 4537 4605 4649 4681 4705 4723 4738 4750 4759 4768 4774 4781 4786
2.3 3695 4259 4475 4585 4651 4694 4725 4748 4765 4779 4790 4799 4807 4813 4819 4824
2.4 3743 4308 4521 4628 4692 4734 4763 4784 4801 4813 4824 4832 4840 4846 4851 4855
2.5 3789 4352 4561 4666 4728 4767 4795 4815 4831 4843 4852 4860 4867 4873 4877 4882
2.6 3831 4392 4598 4700 4759 4797 4823 4842 4856 4868 4877 4884 4890 4895 4900 4903
2.7 3871 4429 4631 4730 4786 4822 4847 4865 4878 4888 4897 4903 4909 4914 4918 4921
2.8 3908 4463 4661 4756 4810 4844 4867 4884 4896 4906 4914 4920 4925 4929 4933 4936
2.9 3943 4494 4687 4779 4831 4863 4885 4901 4912 4921 4928 4933 4938 4942 4945 4948
3.0 3976 4523 4712 4800 4850 4880 4900 4915 4925 4933 4940 4945 4949 4952 4955 4958
3.1 4007 4549 4734 4819 4866 4894 4913 4927 4936 4944 4949 4954 4958
3.2 4036 4573 4753 4835 4880 4907 4925 4937 4946 4953 4958
3.3 4063 4596 4771 4850 4893 4918 4934 4946 4954
3.4 4089 4617 4788 4864 4904 4928 4943 4953
3.5 4114 4636 4803 4876 4914 4936 4950
3.6 4138 4654 4816 4886 4922 4943
3.7 4160 4670 4829 4896 4930 4950
3.8 4181 4686 4840 4904 4937
3.9 4201 4701 4850 4912 4943
4.0 4220 4714 4860 4919 4948
4.1 4239 4727 4869 4926 4953
4.2 4256 4739 4877 4932
4.3 4273 4750 4884 4937
4.4 4289 4760 4891 4942
4.5 4304 4770 4898 4946
4.6 4319 4779 4903 4950
4.7 4333 4788 4909
4.8 4346 4796 4914
4.9 4359 4804 4919
5.0 4372 4811 4923
5.8 4456 4858 4949
6.4 4507 4882
7.0 4548 4901
10.0 4683 4951
12.8 4752
31.9 4900
63.7 4950

TABLE C
(continued)

Degrees of freedom

t 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ∞
0.1 0392 0393 0393 0393 0394 0394 0394 0394 0394 0394 0395 0395 0395 0395 0398
0.2 0781 0781 0782 0782 0783 0783 0784 0784 0785 0785 0785 0785 0786 0786 0793
0.3 1161 1162 1163 1164 1164 1165 1166 1166 1167 1167 1168 1168 1168 1169 1179
0.4 1529 1531 1532 1533 1534 1535 1536 1537 1537 1538 1538 1539 1540 1540 1554
0.5 1883 1884 1886 1887 1889 1890 1891 1892 1893 1894 1894 1895 1896 1896 1915
0.6 2218 2220 2222 2224 2225 2227 2228 2229 2230 2231 2232 2233 2234 2235 2257
0.7 2533 2536 2538 2540 2542 2544 2545 2547 2548 2549 2550 2551 2552 2553 2580
0.8 2826 2829 2832 2834 2837 2839 2841 2842 2844 2845 2847 2848 2849 2850 2881
0.9 3097 3100 3103 3106 3108 3111 3113 3115 3116 3118 3120 3121 3122 3124 3159
1.0 3343 3347 3351 3354 3357 3359 3361 3364 3366 3367 3369 3371 3372 3373 3413
1.1 3567 3571 3575 3578 3581 3584 3586 3589 3591 3593 3595 3597 3598 3600 3643
1.2 3767 3772 3776 3779 3782 3785 3788 3791 3793 3795 3797 3799 3801 3802 3849
1.3 3945 3950 3954 3958 3962 3965 3968 3970 3973 3975 3977 3979 3981 3982 4032
1.4 4103 4107 4112 4116 4119 4123 4126 4128 4131 4133 4136 4138 4139 4141 4192
1.5 4240 4245 4250 4254 4258 4261 4264 4267 4269 4272 4274 4276 4278 4280 4332
1.6 4360 4365 4370 4374 4377 4381 4384 4387 4389 4392 4394 4396 4398 4400 4452
1.7 4463 4468 4473 4477 4481 4484 4487 4490 4492 4495 4497 4499 4501 4503 4554
1.8 4552 4557 4561 4565 4569 4572 4575 4578 4580 4583 4585 4587 4589 4590 4641
1.9 4627 4632 4636 4640 4644 4647 4650 4652 4655 4657 4659 4661 4663 4665 4713
2.0 4691 4696 4700 4704 4707 4710 4713 4715 4718 4720 4722 4724 4725 4727 4772
2.1 4745 4750 4753 4757 4760 4763 4766 4768 4770 4772 4774 4776 4777 4779 4821
2.2 4790 4794 4798 4801 4804 4807 4809 4812 4814 4816 4817 4819 4820 4822 4861
2.3 4828 4832 4835 4838 4841 4843 4846 4848 4850 4851 4853 4854 4856 4857 4893
2.4 4859 4863 4866 4869 4871 4874 4876 4877 4879 4881 4882 4884 4885 4886 4918
2.5 4885 4888 4891 4894 4896 4898 4900 4902 4903 4905 4906 4907 4908 4909 4938
2.6 4906 4910 4912 4914 4916 4918 4920 4921 4923 4924 4925 4926 4927 4928 4953
2.7 4924 4927 4929 4931 4933 4935 4936 4937 4939 4940 4941 4942 4943 4944
2.8 4938 4941 4943 4945 4946 4948 4949 4950 4951 4952 4953 4954 4955 4956
2.9 4950 4952 4954 4956 4957 4958 4960
Note: The column headed by ∞ is for use when df > 30. It coincides with column 1 of Table A.
For t > 2.9 and df > 16, A > 0.4950.

All missing entries in the table have A > 0.4950. The reason this value of A was chosen is that for A > 0.495, p-value < 0.005. Therefore, if the value
you want for A is missing, you may assume the significance level is 0.005 or better.

For t > 5.0, selected values for t are given for df = 1, 2, 3, to show when significance is reached at various significance levels.
For example, t = 7.0, df = 2 gives A = 0.4901, p-value = 0.0099 (< 0.01). Therefore, there is significance at the .01 level. A t-value of less than 7.0
would not be significant at the .01 level.
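The body of Table C lists A, the area under the t curve between 0 and t. These entries can be checked numerically from the t density alone; the sketch below (plain Python, not part of the text; the function names are illustrative) integrates the density with Simpson's rule:

```python
import math

def t_density(x, df):
    # Student's t probability density with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_area(t_value, df, steps=10000):
    # Simpson's rule for the area between 0 and t_value (Table C's A)
    h = t_value / steps
    total = t_density(0, df) + t_density(t_value, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return total * h / 3

print(round(t_area(2.0, 10), 4))  # 0.4633, the Table C entry for t = 2.0, df = 10
```

The one-tailed p-value corresponding to an entry is then 0.5 − A, and the two-tailed p-value is 1 − 2A.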

From Leon F. Marzillier, Elementary Statistics ©1990 by Wm. C. Brown Publishers.



TABLE D

Values of t for Selected Confidence Levels

Level of confidence
df 0.68 0.80 0.85 0.90 0.925 0.95 0.975 0.98 0.99 0.995

1 1.82 3.08 4.17 6.31 8.45 12.7 25.5 31.8 63.7 127
2 1.31 1.89 2.28 2.92 3.44 4.30 6.21 6.97 9.92 14.1
3 1.19 1.64 1.92 2.35 2.68 3.18 4.18 4.54 5.84 7.45
4 1.13 1.53 1.78 2.13 2.39 2.78 3.50 3.75 4.60 5.60
5 1.10 1.48 1.70 2.02 2.24 2.57 3.16 3.36 4.03 4.77
6 1.08 1.44 1.65 1.94 2.15 2.45 2.97 3.14 3.71 4.35
7 1.07 1.41 1.62 1.89 2.09 2.36 2.84 3.00 3.50 4.03
8 1.06 1.40 1.59 1.86 2.05 2.31 2.75 2.90 3.36 3.83
9 1.05 1.38 1.57 1.83 2.01 2.26 2.69 2.82 3.25 3.69
10 1.05 1.37 1.55 1.81 1.99 2.23 2.63 2.76 3.17 3.58
11 1.04 1.36 1.55 1.80 1.97 2.20 2.59 2.72 3.11 3.50
12 1.04 1.36 1.54 1.78 1.95 2.18 2.56 2.68 3.05 3.43
13 1.03 1.35 1.53 1.77 1.94 2.16 2.53 2.65 3.01 3.37
14 1.03 1.35 1.52 1.76 1.92 2.14 2.51 2.62 2.98 3.33
15 1.03 1.34 1.52 1.75 1.91 2.13 2.49 2.60 2.95 3.29
16 1.03 1.34 1.51 1.75 1.90 2.12 2.47 2.58 2.92 3.25
17 1.02 1.33 1.51 1.74 1.90 2.11 2.46 2.57 2.90 3.22
18 1.02 1.33 1.50 1.73 1.89 2.10 2.45 2.55 2.88 3.20
19 1.02 1.33 1.50 1.73 1.88 2.09 2.43 2.54 2.86 3.17
20 1.02 1.33 1.50 1.72 1.88 2.09 2.42 2.53 2.85 3.15
21 1.02 1.32 1.50 1.72 1.87 2.08 2.41 2.52 2.83 3.14
22 1.02 1.32 1.49 1.72 1.87 2.07 2.41 2.51 2.82 3.12
23 1.02 1.32 1.49 1.71 1.86 2.07 2.40 2.50 2.81 3.10
24 1.02 1.32 1.49 1.71 1.86 2.06 2.39 2.49 2.80 3.09
25 1.01 1.32 1.49 1.71 1.86 2.06 2.38 2.49 2.79 3.08
26 1.01 1.31 1.48 1.71 1.85 2.06 2.38 2.48 2.78 3.07
27 1.01 1.31 1.48 1.70 1.85 2.05 2.37 2.47 2.77 3.06
28 1.01 1.31 1.48 1.70 1.85 2.05 2.37 2.47 2.76 3.05
29 1.01 1.31 1.48 1.70 1.85 2.05 2.36 2.46 2.76 3.04

30 1.01 1.31 1.48 1.70 1.84 2.04 2.36 2.46 2.75 3.03
∞ 1.00 1.28 1.44 1.64 1.78 1.96 2.24 2.33 2.58 2.81

If you want to construct a 0.95 confidence interval estimate with df = 15, use t = 2.13. Use the last line (∞) when df > 30; its values coincide with the
normal table values.
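The t values in Table D can be recovered by inverting the central area of the t distribution. A minimal sketch in plain Python (illustrative function names; Simpson's rule plus bisection, not the method used to compute the printed table):

```python
import math

def t_density(x, df):
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def central_area(t_value, df, steps=2000):
    # area between -t and t = twice the Simpson's-rule area on [0, t]
    h = t_value / steps
    total = t_density(0, df) + t_density(t_value, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 2 * total * h / 3

def t_for_confidence(level, df):
    # bisect for the t whose central area equals the confidence level (Table D)
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_area(mid, df) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_for_confidence(0.95, 15), 2))  # 2.13, matching Table D at df = 15
```

This is the t multiplier used in a confidence interval of the form x̄ ± t(s/√n).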

[Figure: t distribution curve showing the central area 0.95 between t = -2.13 and t = 2.13]

From Leon F. Marzillier, Elementary Statistics ©1990 by Wm. C. Brown Publishers.



TABLE E
Critical Values of the F Distribution
α = 0.05
df1

df2 1 2 3 4 5 6 7 8 9 10 11 12

1 161 200 216 225 230 234 237 239 241 242 243 244
2 18.5 19.0 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.4 19.4 19.4
3 10.1 9.6 9.3 9.1 9.0 8.9 8.9 8.8 8.8 8.8 8.8 8.7
4 7.7 6.9 6.6 6.4 6.3 6.2 6.1 6.0 6.0 6.0 5.9 5.9
5 6.6 5.8 5.4 5.2 5.1 5.0 4.9 4.8 4.8 4.7 4.7 4.7
6 6.0 5.1 4.8 4.5 4.4 4.3 4.2 4.2 4.1 4.1 4.0 4.0
7 5.6 4.7 4.4 4.1 4.0 3.9 3.8 3.7 3.7 3.6 3.6 3.6
8 5.3 4.5 4.1 3.8 3.7 3.6 3.5 3.4 3.4 3.4 3.3 3.3
9 5.1 4.3 3.9 3.6 3.5 3.4 3.3 3.2 3.2 3.1 3.1 3.1
10 5.0 4.1 3.7 3.5 3.3 3.2 3.1 3.1 3.0 3.0 2.9 2.9
11 4.8 4.0 3.6 3.4 3.2 3.1 3.0 3.0 2.9 2.8 2.8 2.8
12 4.8 3.9 3.5 3.3 3.1 3.0 2.9 2.8 2.8 2.8 2.7 2.7
13 4.7 3.8 3.4 3.2 3.0 2.9 2.8 2.8 2.7 2.7 2.6 2.6
14 4.6 3.7 3.3 3.1 3.0 2.8 2.8 2.7 2.6 2.6 2.6 2.5
15 4.5 3.7 3.3 3.1 2.9 2.8 2.7 2.6 2.6 2.6 2.5 2.5
16 4.5 3.6 3.2 3.0 2.8 2.7 2.7 2.6 2.5 2.5 2.4 2.4
17 4.4 3.6 3.2 3.0 2.8 2.7 2.6 2.6 2.5 2.4 2.4 2.4
18 4.4 3.6 3.2 2.9 2.8 2.7 2.6 2.5 2.5 2.4 2.4 2.3
19 4.4 3.5 3.1 2.9 2.7 2.6 2.5 2.5 2.4 2.4 2.3 2.3
20 4.4 3.5 3.1 2.9 2.7 2.6 2.5 2.4 2.4 2.4 2.3 2.3
21 4.3 3.5 3.1 2.8 2.7 2.6 2.5 2.4 2.4 2.3 2.3 2.2
22 4.3 3.4 3.0 2.8 2.7 2.6 2.5 2.4 2.3 2.3 2.3 2.2
23 4.3 3.4 3.0 2.8 2.6 2.5 2.4 2.4 2.3 2.3 2.2 2.2
24 4.3 3.4 3.0 2.8 2.6 2.5 2.4 2.4 2.3 2.3 2.2 2.2
25 4.2 3.4 3.0 2.8 2.6 2.5 2.4 2.3 2.3 2.2 2.2 2.2
26 4.2 3.4 3.0 2.7 2.6 2.5 2.4 2.3 2.3 2.2 2.2 2.2
27 4.2 3.4 3.0 2.7 2.6 2.5 2.4 2.3 2.2 2.2 2.2 2.1
28 4.2 3.3 3.0 2.7 2.6 2.4 2.4 2.3 2.2 2.2 2.2 2.1
29 4.2 3.3 2.9 2.7 2.6 2.4 2.4 2.3 2.2 2.2 2.1 2.1
30 4.2 3.3 2.9 2.7 2.5 2.4 2.3 2.3 2.2 2.2 2.1 2.1
40 4.1 3.3 2.9 2.7 2.5 2.4 2.3 2.3 2.2 2.1 2.0 2.0
50 4.0 3.2 2.8 2.6 2.4 2.3 2.2 2.1 2.1 2.0 2.0 2.0
60 4.0 3.2 2.8 2.5 2.4 2.2 2.2 2.1 2.0 2.0 2.0 1.9
70 4.0 3.1 2.7 2.5 2.4 2.2 2.1 2.1 2.0 2.0 1.9 1.9
80 4.0 3.1 2.7 2.5 2.3 2.2 2.1 2.0 2.0 2.0 1.9 1.9
100 3.9 3.1 2.7 2.5 2.3 2.2 2.1 2.0 2.0 1.9 1.9 1.8
120 3.9 3.1 2.7 2.4 2.3 2.2 2.1 2.0 2.0 1.9 1.9 1.8
∞ 3.8 3.0 2.6 2.4 2.2 2.1 2.0 1.9 1.9 1.8 1.8 1.8
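Each entry of Table E is an upper-tail critical value: the F value leaving area α in the right tail for numerator df1 and denominator df2. These can be approximated numerically from the F density; a sketch in plain Python (illustrative names; valid for df1 ≥ 2, where the density is finite at zero):

```python
import math

def f_density(x, d1, d2):
    # F density with d1 numerator and d2 denominator degrees of freedom
    c = (math.gamma((d1 + d2) / 2) /
         (math.gamma(d1 / 2) * math.gamma(d2 / 2))) * (d1 / d2) ** (d1 / 2)
    return c * x ** (d1 / 2 - 1) * (1 + d1 * x / d2) ** (-(d1 + d2) / 2)

def f_cdf(x, d1, d2, steps=4000):
    # Simpson's rule on [0, x]; requires d1 >= 2 so the density is finite at 0
    h = x / steps
    total = f_density(1e-12, d1, d2) + f_density(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_density(i * h, d1, d2)
    return total * h / 3

def f_critical(alpha, d1, d2):
    # bisect for the x whose upper-tail area equals alpha (Table E entry)
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f_cdf(mid, d1, d2) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(f_critical(0.05, 4, 20), 1))  # 2.9, the Table E entry for df1 = 4, df2 = 20
```

In an ANOVA, a computed F exceeding this critical value leads to rejecting the null hypothesis at significance level α.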

[Figure: F distribution curve with the upper-tail 5% area shaded]

From Leon F. Marzillier, Elementary Statistics ©1990 by Wm. C. Brown Publishers.



TABLE E

(continued)
α = 0.05
df1

df2 15 20 24 30 40 50 60 75 100 120 ∞

1 246 248 249 250 251 252 253 253 253 253 254
2 19.4 19.4 19.4 19.4 19.4 19.5 19.5 19.5 19.5 19.5 19.5
3 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5
4 5.9 5.8 5.8 5.8 5.7 5.7 5.7 5.7 5.7 5.7 5.6
5 4.6 4.6 4.5 4.5 4.5 4.4 4.4 4.4 4.4 4.4 4.4
6 3.9 3.9 3.8 3.8 3.8 3.8 3.7 3.7 3.7 3.7 3.7
7 3.5 3.4 3.4 3.4 3.3 3.3 3.3 3.3 3.3 3.3 3.2
8 3.2 3.2 3.1 3.1 3.0 3.0 3.0 3.0 3.0 3.0 2.9
9 3.0 2.9 2.9 2.9 2.8 2.8 2.8 2.8 2.8 2.8 2.7
10 2.8 2.8 2.7 2.7 2.7 2.6 2.6 2.6 2.6 2.6 2.5
11 2.7 2.6 2.6 2.6 2.5 2.5 2.5 2.5 2.4 2.4 2.4
12 2.6 2.5 2.5 2.5 2.4 2.4 2.4 2.4 2.4 2.3 2.3
13 2.5 2.5 2.4 2.4 2.3 2.3 2.3 2.3 2.3 2.2 2.2
14 2.5 2.4 2.4 2.3 2.3 2.2 2.2 2.2 2.2 2.2 2.1
15 2.4 2.3 2.3 2.2 2.2 2.2 2.2 2.2 2.1 2.1 2.1
16 2.4 2.3 2.2 2.2 2.2 2.1 2.1 2.1 2.1 2.1 2.0
17 2.3 2.2 2.2 2.2 2.1 2.1 2.1 2.0 2.0 2.0 2.0
18 2.3 2.2 2.2 2.1 2.1 2.0 2.0 2.0 2.0 2.0 1.9
19 2.2 2.2 2.1 2.1 2.0 2.0 2.0 2.0 1.9 1.9 1.9
20 2.2 2.1 2.1 2.0 2.0 2.0 2.0 1.9 1.9 1.9 1.8
21 2.2 2.1 2.1 2.0 2.0 1.9 1.9 1.9 1.9 1.9 1.8
22 2.2 2.1 2.1 2.0 2.0 1.9 1.9 1.9 1.8 1.8 1.8
23 2.1 2.0 2.0 2.0 1.9 1.9 1.9 1.8 1.8 1.8 1.8
24 2.1 2.0 2.0 1.9 1.9 1.9 1.8 1.8 1.8 1.8 1.7
25 2.1 2.0 2.0 1.9 1.9 1.8 1.8 1.8 1.8 1.8 1.7
26 2.1 2.0 2.0 1.9 1.8 1.8 1.8 1.8 1.8 1.8 1.7
27 2.1 2.0 1.9 1.9 1.8 1.8 1.8 1.8 1.7 1.7 1.7
28 2.0 2.0 1.9 1.9 1.8 1.8 1.8 1.8 1.7 1.7 1.6
29 2.0 1.9 1.9 1.8 1.8 1.8 1.8 1.7 1.7 1.7 1.6
30 2.0 1.9 1.9 1.8 1.8 1.8 1.7 1.7 1.7 1.7 1.6
40 1.9 1.8 1.8 1.7 1.7 1.7 1.6 1.6 1.6 1.6 1.5
50 1.9 1.8 1.7 1.7 1.6 1.6 1.6 1.6 1.5 1.5 1.4
60 1.8 1.8 1.7 1.6 1.6 1.6 1.5 1.5 1.5 1.5 1.4
70 1.8 1.7 1.7 1.6 1.6 1.5 1.5 1.5 1.4 1.4 1.4
80 1.8 1.7 1.6 1.6 1.5 1.5 1.5 1.4 1.4 1.4 1.3
100 1.8 1.7 1.6 1.6 1.5 1.5 1.4 1.4 1.4 1.4 1.3
120 1.8 1.7 1.6 1.6 1.5 1.5 1.4 1.4 1.4 1.3 1.2
∞ 1.7 1.6 1.5 1.5 1.4 1.4 1.3 1.3 1.2 1.2 1.0

From Leon F. Marzillier, Elementary Statistics ©1990 by Wm. C. Brown Publishers.

TABLE E
(continued)
α = 0.01
df1

df2 1 2 3 4 5 6 7 8 9 10 11 12
1 4052 4999 5403 5625 5764 5859 5928 5981 6022 6056 6082 6106
2 98.5 99.0 99.2 99.2 99.3 99.3 99.3 99.4 99.4 99.4 99.4 99.4
3 34.1 30.8 29.5 28.7 28.2 27.9 27.7 27.5 27.3 27.2 27.1 27.0
4 21.2 18.0 16.7 16.0 15.5 15.2 15.0 14.8 14.7 14.5 14.4 14.4
5 16.3 13.3 12.1 11.4 11.0 10.7 10.4 10.3 10.2 10.0 10.0 9.9
6 13.7 10.9 9.8 9.2 8.8 8.5 8.3 8.1 8.0 7.9 7.8 7.7
7 12.2 9.6 8.4 7.8 7.5 7.2 7.0 6.8 6.7 6.6 6.5 6.5
8 11.3 8.6 7.6 7.0 6.6 6.4 6.2 6.0 5.9 5.8 5.7 5.7
9 10.6 8.0 7.0 6.4 6.1 5.8 5.6 5.5 5.4 5.3 5.2 5.1
10 10.0 7.6 6.6 6.0 5.6 5.4 5.2 5.1 5.0 4.8 4.8 4.7
11 9.6 7.2 6.2 5.7 5.3 5.1 4.9 4.7 4.6 4.5 4.5 4.4
12 9.3 6.9 6.0 5.4 5.1 4.8 4.6 4.5 4.4 4.3 4.2 4.2
13 9.1 6.7 5.7 5.2 4.9 4.6 4.4 4.3 4.2 4.1 4.0 4.0
14 8.9 6.5 5.6 5.0 4.7 4.5 4.3 4.1 4.0 3.9 3.9 3.8
15 8.7 6.4 5.4 4.9 4.6 4.3 4.1 4.0 3.9 3.8 3.7 3.7
16 8.5 6.2 5.3 4.8 4.4 4.2 4.0 3.9 3.8 3.7 3.6 3.6
17 8.4 6.1 5.2 4.7 4.3 4.1 3.9 3.8 3.7 3.6 3.5 3.4
18 8.3 6.0 5.1 4.6 4.2 4.0 3.8 3.7 3.6 3.5 3.4 3.4
19 8.2 5.9 5.0 4.5 4.2 3.9 3.8 3.6 3.5 3.4 3.4 3.3
20 8.1 5.8 4.9 4.4 4.1 3.9 3.7 3.6 3.4 3.4 3.3 3.2
21 8.0 5.8 4.9 4.4 4.0 3.8 3.6 3.5 3.4 3.3 3.2 3.2
22 7.9 5.7 4.8 4.3 4.0 3.8 3.6 3.4 3.4 3.3 3.2 3.1
23 7.9 5.7 4.8 4.3 3.9 3.7 3.5 3.4 3.3 3.2 3.1 3.1
24 7.8 5.6 4.7 4.2 3.9 3.7 3.5 3.4 3.2 3.2 3.1 3.0
25 7.8 5.6 4.7 4.2 3.9 3.6 3.5 3.3 3.2 3.1 3.0 3.0
26 7.7 5.5 4.6 4.1 3.8 3.6 3.4 3.3 3.2 3.1 3.0 3.0
27 7.7 5.5 4.6 4.1 3.8 3.6 3.4 3.3 3.1 3.1 3.0 2.9
28 7.6 5.4 4.6 4.1 3.8 3.5 3.4 3.2 3.1 3.0 3.0 2.9
29 7.6 5.4 4.5 4.0 3.7 3.5 3.3 3.2 3.1 3.0 2.9 2.9
30 7.6 5.4 4.5 4.0 3.7 3.5 3.3 3.2 3.1 3.0 2.9 2.8
40 7.3 5.2 4.3 3.8 3.5 3.3 3.1 3.0 2.9 2.8 2.7 2.7
50 7.2 5.1 4.2 3.7 3.4 3.2 3.0 2.9 2.8 2.7 2.6 2.6
60 7.1 5.0 4.1 3.6 3.3 3.1 3.0 2.8 2.7 2.6 2.6 2.5
70 7.0 4.9 4.1 3.6 3.3 3.1 2.9 2.8 2.7 2.6 2.5 2.4
80 7.0 4.9 4.0 3.6 3.2 3.0 2.9 2.7 2.6 2.6 2.5 2.4
100 6.9 4.8 4.0 3.5 3.2 3.0 2.8 2.7 2.6 2.5 2.4 2.4
120 6.8 4.8 4.0 3.5 3.2 3.0 2.8 2.7 2.6 2.5 2.4 2.3
∞ 6.6 4.6 3.8 3.3 3.0 2.8 2.6 2.5 2.4 2.3 2.2 2.2

[Figure: F distribution curve with the upper-tail 1% area shaded]

From Leon F. Marzillier, Elementary Statistics ©1990 by Wm. C. Brown Publishers.



TABLE E
(continued)
α = 0.01
df1

df2 15 20 24 30 40 50 60 75 100 120 ∞

1 6157 6209 6235 6261 6287 6302 6313 6323 6334 6339 6366
2 99.4 99.4 99.5 99.5 99.5 99.5 99.5 99.5 99.5 99.5 99.5
3 26.9 26.7 26.6 26.5 26.4 26.3 26.2 26.3 26.2 26.2 26.1
4 14.2 14.0 13.9 13.8 13.8 13.7 13.6 13.6 13.6 13.6 13.5
5 9.7 9.6 9.5 9.4 9.3 9.2 9.2 9.2 9.1 9.1 9.0
6 7.6 7.4 7.3 7.2 7.1 7.1 7.1 7.0 7.0 7.0 6.9
7 6.3 6.2 6.1 6.0 5.9 5.8 5.8 5.8 5.8 5.7 5.6
8 5.5 5.4 5.3 5.2 5.1 5.1 5.0 5.0 5.0 5.0 4.9
9 5.0 4.8 4.7 4.6 4.6 4.5 4.4 4.4 4.4 4.4 4.3
10 4.6 4.4 4.3 4.2 4.2 4.1 4.1 4.0 4.0 4.0 3.9
11 4.2 4.1 4.0 3.9 3.9 3.8 3.8 3.7 3.7 3.7 3.6
12 4.0 3.9 3.8 3.7 3.6 3.6 3.5 3.5 3.5 3.4 3.4
13 3.8 3.7 3.6 3.5 3.4 3.4 3.3 3.3 3.3 3.2 3.2
14 3.7 3.5 3.4 3.4 3.3 3.2 3.2 3.1 3.1 3.1 3.0
15 3.5 3.4 3.3 3.2 3.1 3.1 3.0 3.0 3.0 3.0 2.9
16 3.4 3.3 3.2 3.1 3.0 3.0 2.9 2.9 2.9 2.8 2.8
17 3.3 3.2 3.1 3.0 2.9 2.9 2.8 2.8 2.8 2.8 2.7
18 3.2 3.1 3.0 2.9 2.8 2.8 2.8 2.7 2.7 2.7 2.6
19 3.2 3.0 2.9 2.8 2.8 2.7 2.7 2.6 2.6 2.6 2.5
20 3.1 2.9 2.9 2.8 2.7 2.6 2.6 2.6 2.5 2.5 2.4
21 3.0 2.9 2.8 2.7 2.6 2.6 2.6 2.5 2.5 2.5 2.4
22 3.0 2.8 2.8 2.7 2.6 2.5 2.5 2.5 2.4 2.4 2.3
23 2.9 2.8 2.7 2.6 2.5 2.5 2.4 2.4 2.4 2.4 2.3
24 2.9 2.7 2.7 2.6 2.5 2.4 2.4 2.4 2.3 2.3 2.2
25 2.8 2.7 2.6 2.5 2.4 2.4 2.4 2.3 2.3 2.3 2.2
26 2.8 2.7 2.6 2.5 2.4 2.4 2.3 2.3 2.2 2.2 2.1
27 2.8 2.6 2.6 2.5 2.4 2.3 2.3 2.2 2.2 2.2 2.1
28 2.8 2.6 2.5 2.4 2.4 2.3 2.3 2.2 2.2 2.2 2.1
29 2.7 2.6 2.5 2.4 2.3 2.3 2.2 2.2 2.2 2.1 2.0
30 2.7 2.6 2.5 2.4 2.3 2.2 2.2 2.2 2.1 2.1 2.0
40 2.5 2.4 2.3 2.2 2.1 2.0 2.0 2.0 1.9 1.9 1.8
50 2.4 2.3 2.2 2.1 2.0 1.9 1.9 1.9 1.8 1.8 1.7
60 2.4 2.2 2.1 2.0 1.9 1.9 1.8 1.8 1.7 1.7 1.6
70 2.3 2.2 2.1 2.0 1.9 1.8 1.8 1.7 1.7 1.7 1.5
80 2.3 2.1 2.0 1.9 1.8 1.8 1.8 1.7 1.6 1.6 1.5
100 2.2 2.1 2.0 1.9 1.8 1.7 1.7 1.6 1.6 1.5 1.4
120 2.2 2.0 2.0 1.9 1.8 1.7 1.6 1.6 1.5 1.5 1.4
∞ 2.0 1.9 1.8 1.7 1.6 1.5 1.5 1.4 1.4 1.3 1.0

From Leon F. Marzillier, Elementary Statistics ©1990 by Wm. C. Brown Publishers.



TABLE F

p-values for χ²
Degrees of freedom Degrees of freedom

χ² 2 3 4 χ² 2 3 4 5 6 7
3.2 2019 7.2 0273 0658 1257 2062
3.3 1920 7.3 0260 0629 1209 1993
3.4 1827 7.4 0247 0602 1162 1926
3.5 1738 7.5 0235 0576 1117 1860
3.6 1653 7.6 0224 0550 1074 1797
3.7 1572 7.7 0213 0526 1032 1736
3.8 1496 7.8 0202 0503 0992 1676
3.9 1423 7.9 0193 0481 0953 1618
4.0 1353 8.0 0183 0460 0916 1562
4.1 1287 8.1 0174 0440 0880 1508
4.2 1225 8.2 0166 0421 0845 1456
4.3 1165 8.3 0158 0402 0812 1405
4.4 1108 8.4 0150 0384 0780 1355
4.5 1054 8.5 0143 0367 0749 1307 2037
4.6 1003 2035 8.6 0136 0351 0719 1261 1974
4.7 0954 1951 8.7 0129 0336 0691 1216 1912
4.8 0907 1870 8.8 0123 0321 0663 1173 1851
4.9 0863 1793 8.9 0117 0307 0636 1131 1793
5.0 0821 1718 9.0 0111 0293 0611 1091 1736
5.1 0781 1646 9.1 0106 0280 0586 1051 1680
5.2 0743 1577 9.2 0101 0267 0563 1013 1626
5.3 0707 1511 9.3 0096 0256 0540 0977 1574
5.4 0672 1447 9.4 0091 0244 0518 0941 1523
5.5 0639 1386 9.5 0087 0233 0497 0907 1473
5.6 0608 1328 9.6 0082 0223 0477 0874 1425
5.7 0578 1272 9.7 0078 0213 0456 0842 1379
5.8 0550 1218 9.8 0074 0203 0439 0811 1333 2002
5.9 0523 1166 2067 9.9 0071 0194 0421 0781 1289 1943
6.0 0498 1116 1991 10 0067 0186 0404 0752 1247 1886
6.1 0474 1068 1918
All missing entries on this page give p-values > 0.20.
See the next page for p-values corresponding to larger values of χ².
6.4 0408 0937 1712
6.5 0388 0897 1648
6.6 0369 0858 1586
6.7 0351 0821 1526
6.8 0334 0786 1468
6.9 0317 0752 1413
7.0 0302 0719 1359
7.1 0287 0688 1307

[Figure: χ² distribution curve with the upper-tail p-value area shaded to the right of χ²]

From Leon F. Marzillier, Elementary Statistics ©1990 by Wm. C. Brown Publishers.



TABLE F
(continued)

Degrees of freedom

χ² 2 3 4 5 6 7 8 9 10
11 0041 0117 0266 0514 0884 1386 2017
12 0074 0174 0348 0620 1006 1512 2133
13 0046 0113 0234 0430 0721 1118 1626 2237
14 0073 0156 0296 0512 0818 1223 1730
15 0047 0104 0203 0360 0591 0909 1321
16 0068 0138 0251 0424 0669 0996
17 0045 0093 0174 0301 0487 0744
18 0062 0120 0212 0352 0550
19 0042 0082 0149 0252 0403
20 0056 0103 0179 0293
21 0038 0071 0127 0211
22 0049 0089 0151
23 0062 0107
24 0043 0076
25 0053
26 0037
The missing entries in the top right-hand corner give p-values > 0.20. All other missing entries give p-values < 0.005.
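The p-values in Table F are upper-tail areas of the χ² distribution. For integer degrees of freedom they can be evaluated exactly in plain Python via the regularized upper incomplete gamma function and a standard recurrence (a sketch; the function name is illustrative):

```python
import math

def chi2_pvalue(x, df):
    """Upper-tail p-value for a chi-square statistic, as in Table F.
    Uses the recurrence Q(a+1, y) = Q(a, y) + y^a e^(-y) / gamma(a+1)."""
    y = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)          # Q(1, y) for even df
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y) for odd df
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q

print(round(chi2_pvalue(11, 4), 4))  # 0.0266, the Table F entry for chi-square = 11, df = 4
```

A computed p-value below the chosen significance level (e.g., 0.05) leads to rejecting the null hypothesis in a chi-square test.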

From Leon F. Marzillier, Elementary Statistics ©1990 by Wm. C. Brown Publishers.


Index

Absolute descriptive measures, 47, 57 selecting the proper measure, 42-43


Absolute dispersion in a point pattern, 69, 73 spatial, 63-69
Accuracy, 25-26 symmetric/asymmetric frequency distributions, 43
Addition rule for mutually exclusive/non-mutually exclusive Chi-square contingency analysis, 196-197, 199
events, 80 Chi-square distribution, 178
Adjacency, geographic association determined through, 208 Chi-square goodness-of-fit test, 188-192
Agglomeration strategy of classification, 28 proportional distribution, 191-192
Alternate hypothesis (HA), 143-145, 176 uniform distribution, 190
American Community Survey, 103 Choropleth mapping (operational methods/rules of classifica-
Analysis of variance (ANOVA), 174, 176-185 tion), 29-33
Archival data, 22 Class frequency/interval/midpoint, 42
Area pattern analysis, 222-235 Classical/traditional hypothesis testing
goals and objectives in, 222 accepting or rejecting null hypothesis, 143
join count analysis, 223-228 procedure for, 143-147
Moran's I Index (global), 229, 233-234 type I and type II errors in, 144
Moran's I Index (local), 235 Classification
Area patterns, in spatial autocorrelation, 205 Jenks natural breaks method, 48
Area sampling (quadrats), 111 standard deviation breaks method, 47-48
Arithmetic mean, calculation of, 42 Classification strategies
Artificial vs. natural sampling, 150 agglomeration, 28
Asymmetric (skewed) distribution, 43 operational methods/rules in, 28-30
Autocorrelation, spatial patterns in, 206-208. See also Spatial
autocorrelation Cluster analysis, basics of, 277-284
Average deviation/mean deviation, 45 Cluster centroid, 277, 279, 284
Cluster sampling, 109-110
Baker, A. M., 114 design sampling, 113-114
Bar charts, 9-10 Coefficient of determination, 258
Berry, B. J. L., 114 Coefficient of multiple determination (R2 ), 271
Between-group variation/variability calculation, 175-177 Coefficient of variation/variability, 47-50
Bimodal frequency distribution, 43 relative dispersion as, 72
Binomial probability distributions, 80-81, 83 Combination/hybrid sampling, 110-111
Bivariate regression. See Simple linear regression Conditional probability, 81
Boxplots (box-and-whiskers plots), 44-45 Confidence intervals
constructing, general procedure for, 122-126
Categorical data, 23 definition of, 118
Categorical difference tests, 187-201 finite population correction to formula for, 126
Center of gravity, 64 sampling error and, 135
Central area, 63 upper and lower bounds of, 122
Central feature (central point/central area), 63 Confidence interval estimation
Central limit theorem, 118, 151 equations for stratified samples, 125
Central point, 63 for the mean, 126
Central tendency measures, 39-43 for random/systematic samples, 124
bimodal frequency distribution, 43 relationship between confidence level and, 137
mean, 41-42 of stratified sample mean, I 31
median, 41 of stratified sample proportion, 133
mode, 40-41 of stratified sample total, 132


Confidence levels, 122 Dispersion measures, 44


values of t for, 307 absolute, 47, 57
Contingency (crosstab) map, 199-200 relative, 47, 52-53
Contingency analysis (chi-square), 196-197 spatial, 69-73
Contingency table analysis, 195, 197-200 standard deviation, 45
Continuous probability distributions, 81, 93-99 standard deviation and variance, 45-49
Continuous vs. discrete variables, 23 variance, 45-46
Correlation, 239-251 Disproportional (variable rate) sampling, 109
association of interval-ratio variables, 242 Disproportional stratified systematic aligned sample, 115
association of ordinal variables, 247-250 Distance decay, 5
correlation analysis, definition/purpose of, 239 Distance threshold, geographic association determined through,
nature of, 240-241 208
Pearson's product-moment correlation coefficient, 242-247
Ecological fallacy, 23
Spearman's rank correlation coefficient (r,), 247-250
Covariation, 242-243 Equal intervals
based on range, 28-29, 31, 33
Cross-tabulation/contingency tables, 196
not based on range, 29, 31, 33
Crude mode, 40
Error(s)
Cumulative distribution function (CDF), 85 sampling, 102-103,107, 119
Cumulative frequency polygon (ogive), 41 E (tolerable or acceptable sampling error), 135-136
Cumulative relative frequencies, Kolmogorov-Smirnov compari- Type I/Type II, 144-145
son of, 193 Estimation
basic concepts in, 117-119
Data confidence intervals and, 121-134
categorical, 23 of population mean (random/systematic sampling), 126-129
classification methods, 27-34. SeealsoClassification strategies of population total (random/systematic sampling), 129-130
disaggregated, 11 sample statistics as best point estimators of population param-
discrete vs. continuous, 23 eters, 118
implicitly spatial, 22 Estimation concepts
individual-level vs. spatially aggregated, 22-23 point estimation and interval estimation, 117-118
primary, 21-22 sampling distribution of a statistic, 118-119
secondary/archival, 22 Euclidean (straight-line) distance, 63-64, 68
Data collection, procedures/problems of, 107-108 Euclidean median (median center), 66-67
Data matrix/ array, 8 Events, 78-79
Data sets, 7-8 Explanatory predictor (independent variable), 241, 244-246
individual-level, 22 Explanatory variable, in simple linear regression, 253
spatially aggregated, 22 External boundary delineation, impact on descriptive statistics,
Data value(s) 53-54
definition of, 7
rounding off of, 23 Factorial function, 82
Degrees of freedom, 146, 305-306, 312-313 F-distribution, critical values of, 308-311
Dendrograms, 33-34 Finite population correction to confidence interval formula, 126
Dependent samples, 168 Free sampling/non-free sampling hypotheses, 224
inferential procedures for testing differences in, 168-171 Frequency distributions
matched-pairs t test for, 171 assymetric (skewed), 43
bimodal vs. multimodal, 43
Wilcoxon signed ranks test, 172
of population values, 118
Dependent variable, in simple linear regression, 253
skewness and kurtosis, 52-53
Dependent-sample (matched pairs) difference tests, 168-173
Frequency polygons, 40-41
Descriptive statements, 5, 10, 12-13, 16
Descriptive statistics, 4, 7 General distance metric, 67
measures of central tendency, 39-43 Geographic "center of population," 64-65
measures of dispersion and variability, 44-49 Geographic data
measures of shape/relative position, 52-53 characteristics and preparation of. See also Data, 21-34
spatial, 62-73 dimensions of, 21-23
spatial data's impact on, 53--00 Geographic problem solving and policy situations, 289-301
use in GIS-based total enumeration, 151 answers to problems, 301-302
Deterministic processes, 77-78 Geographic research/ analysis
Directional (one-tailed) alternate hypothesis, 143-144 analysis of variance (ANOVA), 178--183
Directional mean, 67-69 central limit theorem, 119-120
Discrete probability distributions, 81 Chi-square goodness-of-fit test
binomial, 80-81, 83 Chi-square goodnesS--Of-fit test
geometric, 83-85 proportional distribution, 191-192
Poisson, 85-91 uniform distribution, 189-191
Discrete vs. continuous variables, 23 cluster analysis, 278-284

coefficient of variance, 49-51 one-sample difference tests, 142


confidence interval estimation examples, 12~134 organizational structure for selecting appropriate test, 152-153
contingency table analysis, 197-200 parametric vs. nonparametric/distribution free tests, 153-154
join count analysis, 225-228 selecting a suitable test, 151-154
Kolmogorov-Smirnov normality test (normal distribution), Internal subarea boundaries, modification of, 54-57
194-195 Interquartile range (IQR), 44
Kruskal-Wallis test, 182-183 Interval estimation, 117-118
matched pairs t test, 170-171 Interval ratio measurement, 25
Moran's I (global/local), 232-234 Inverse distance, 230
multiple regression, 271-276 squared (J/distance 2), 208
nearest neighbor analysis, 213-216 weight (I/distance), 208
organizational framework for, 6
parametric/nonparametric tests for two-sample differences, Jenks method of natural breaks, 47-48
159, 162-163 Join count analysis/structure, 223-228
Pearson's product-moment correlation coefficient, 244-247
probability mapping, 97-99 Kolmogorov-Smirnov normality/ goodness of fit tests, 193-195
problem-solving examples, 8-19 Kruskal-Wallis (K-W) test, 177-178, 181-185
quadrat analysis, 218-221 Kurtosis and skewness, 52-53
simple linear regression, 252, 264-268
spatial measures of dispersion, 69-73 Laws, 7
spatial sampling, 111-115 Least squares property of the mean, 45
Spearman's rank correlation coefficient (r,), 248-250 Least-squares regression line, 253-254
through space and time, 8~91 Leptokurtic distribution/kurtosis, 52-53
two-sample difference of proportions test, 165-168 Level of confidence, 307
Wilcoxon matched-pairs test, 171-173 Leverage, 274
Geography Line sampling (traverses), 111
concept/ definition of, 3-4 Linear directional mean, 67-69
multivariate problem-solving in, 269-285 Local autocorrelation methods, 208
role of statistics in, 4-8 Logical subdivision strategy of classification, 27-28
Geometric probability distributions, 83-85
Geostatistics, 62 Manhattan distance, 67
GIS-based total enumeration, 150-151 Mann-Whitney Utests, 158-159, 162, 183
Global measure of segregation/integration, 208 Maps/mapping
Goodness-of-fit tests, 187-195 choropleth, 29-33
Chi-square, 188-195 dendrogram, 33-34
Kolmogorov-Smirnov, 193-195 descriptive statements generated from, 13, 1~18
Grouping/zoning problem, 54-57 probability, 96-100, 195
residual, 260, 262, 264-267
Histograms, 40-41, 120-121 scatterplot, 13-14, 58, 165
Hybrid sampling, 110-111, 114 Marginal totals in contingency analysis, 197
Hypotheses, 5-6 Marzillier, Leon F., 303-310
alternate(HA), 143-145, 176 Matched-pairs (dependent-sample) difference tests, 168-173
free sampling/non-free sampling in joint count analysis, 224 Matched-pairs t test, 168-171
Null {Ho), 143-147, 176 Mean (arithmetic mean/average), 41-42, 50
Hypothesis testing sample size selection and, I 35-136
classical/traditional, 142, 144, 147-148 weighted and unweighted, 42
P-value/Prob-value, 142, 147-148, 160, 162 Mean center, 63-66, 70
traditional, 142 Mean deviation/ average deviation, 45
Hypothesized mean, 143 Measurement concepts, 25-27
Measurement levels, 23-25
Implicitly spatial data, 22 interval and ratio scales, 25
Independent samples, 168 nominal scale, 23-24
Independent variables, in simple linear regression, 253 ordinal scale, 24-25
Individual point distribution/individual value plot, 44 selection of appropriate inferential test based on, 153
Individual-level vs. spatially aggregated data, 22-23 Measures of central tendency. SeeCentral tendency measures
Inferential hypotheses, 5-7, 15 Median, 41
Inferential spatial statistics, 7, 11-12 Median center, 66-67
autocorrelation, 206-208 Mesokurtic distribution/kurtosis, 52-53
elements of, 141-154 Modal class, 40
types of spatial patterns, 205-206 Mode, 40-41
Inferential testing, 150-154 Model, definition of, 7
artificial vs. natural samples in, 150 Modifiable Areal Unit Problem (MAUP)
general characteristics and assumptions in common, 150 change in scale/level of spatial aggregation, 55-57, 59-60
GIS-based total enumeration and, 150-151 modification of internal subarea boundaries, 54, 58

Monte Carlo simulation, 78 central feature, 63


Moran's I Index, 229-235 composite/hybrid sampling, 114
Multicollinearity, 270 median center (Euclidean median), 63-67
Multimodal frequency distribution, 43 Euclidean (straight-line) distance, 63-64, 68
Multiple regression nearest neighbor analysis, 210-216
basics of, 269-276 quadrat analysis, 216-221
simple linear regression vs., 269-270 random samples, 113
Multiplication rule of probability, 79 relative dispersion problem, 72-73
Multivariate analysis in spatial autocorrelation, 205
cluster analysis, 277-284 spatial sampling, 111-115
multiple regression analysis, 269-276 standard deviational ellipse, 71-72
Mutually exclusive events, addition rule for, 80 standard distance, 69-70
Poisson probability distribution, 85-91
Natural breaks, 29-30, 32-34 Population mean, estimate of, 126-131
Natural vs. artificial sampling, 150 Population parameters
Nearest neighbor analysis, 210-216 confidence interval equations for, 126
Negative skew, 52 and sample size, 119
Neighbors, geographical, ways of defining, 208 Population proportion, estimate of, 129-130, 134, 136
Network distance, 67 Population total, estimate of, 129, 131-134-136
Nominal variables, 23-24 Population, statistical, 101
Non-directional (two-tailed) alternate hypothesis, 143-144 Positive skew, 52
Nonparametric/ distribution free tests, 153-154 Precision, 25-26, 33
for two-sample differences, 159, 162-163 Predictor variable, in simple linear progression, 253
Normal distribution, 93-96 Predictor, explanatory (independent variable), 241, 244-246
standard score/normal deviate, 95-96 Primary data, 21-22
standard score/normal deviate, 94. See also Z-score
Probability
table of normal values, 94
basic processes, terms, and concepts, 77-81
Nugget, in spatial autocorrelation, 207
complement of an outcome, 79
Null hypothesis (Ho), 143-144 conditional, 81
in ANOVA, 176 deterministic processes, 78
deterministic processes, 78
regions of rejection and non-rejection of, 146
for mutually exclusive events, 78-80
rejection/non-rejection of, 145, 147
multiplication rule for statistically independent events, 79
mutually exclusive categories, 78-80
Observations, 7
probabilistic processes, 78
Ogive (cumulative frequency distribution), 40-41 as relative frequency, 78
One-proportion Z test, 148-149
rules of, 79-81
One-sample difference of means test, 144-145
statistical independence, 79-80
One-sample difference of proportions test, 148-149
One-tailed (directional) alternate hypothesis, 143-146 Probability distributions
continuous, 81, 93-99. SeealsoNormal distribution
Operational definitions, 26
discrete. SeeDiscrete probability distributions
Operational methods/rules of classification distinguishing between discrete and continuous data, 23
equal intervals based on range, 28-29, 31
equal intervals not based on range, 29 Probability plots/mapping, 96-100, 195
natural breaks, 29-30, 32 Probability sampling
cluster, 109-110
quantile breaks, 29
Ordinal-scale variables, weakly vs. strongly ordered, 24-25 combination/hybrid, 110-111
Outcomes data collection for, 107
complement of, 79 nonprobability sampling vs., 107
in binomial distributions, 82 proportional (constant-rate), 109
maximum/minimum probability of, 79 simple random, 108
relative frequency of, 78 stratified, 109
Outliers, 34, 43, 65 systematic, 108-109
Proportion, and sample size selection, 137
Parametric tests, 153-154, 159, 162-163 p-value(s)
Pearson correlation, 243 hypothesis testing, 142, 147-148, 160, 162
Pearson's coefficient, 52 for χ², 312-313
Pearson's product-moment correlation coefficient, 242-244
Pie charts, 9-10 Quadrat analysis, 111, 216-218, 220-221
Platykurtic distribution, 52-53 Qualitative vs. quantitative variables, 23
Platykurtic kurtosis, 52 Quantile breaks, 29-30, 32-33
Point estimation, 117-118 Quartiles, 44
Point pattern/point sample analysis, 210-221
absolute/relative point pattern dispersion, 73 Random numbers table, 304
center of gravity, 64 Random processes, 78
Index ▲ 319

Random/ systematic samples stratified/systematic point sampling, 109, 113


confidence interval equations for, 124 systematic point sampling, 113
estimate of population mean, 126-129 Scatterplots, 13-14, 58, 165, 206
estimate of population proportion, 129-130 Secondary data, 22
estimate of population total, 127-129 Significance level, 122, 145, 147
Randomization/randomness, 102, 108 Sill, in spatial autocorrelation, 207
correlation and, 240 Simple adjacency weighting/ contiguity, 230
Moran's I, 231 Simple linear regression, 252-268
random processes, 85-91 bucket-and-sponge analogy, 256
simple random point samples, 112-113 form of relationship in, 253-256
Range, in spatial autocorrelation, 206-207
Ratio measurement scales, 25 regression coefficient/slope, 254
Relative descriptive measures slope (regression coefficient), 253
coefficient of variation, 47 slope (tangent) of the best-fit line, 254
skewness and kurtosis, 52-53 Y-intercept, in least-squares regression line, 253
Relative dispersion in a point pattern, 72-73 geographic research/ analysis, 252, 263-268
Relative distance, 73 inferential use of, 261-264
Relative frequency, probability as, 78 equal variance assumption, 262
Reliability, 26-27 linear relationship of variables, 262
Residual mapping, 259-260, 262, 264-267 measuring both variables on interval/ratio scale, 262
Response (dependent) variable, 241, 244-246 normal distribution of X and Yvariables, 262
Rules of probability, 79-81 required random samples, 262
multiple regression compared with, 269-270
Sample size selection, 134-137 residual or error analysis in, 258
estimating population mean, 135-136 absolute residual, 258-259
estimating population proportion, 136-137 residual mapping, 259-260, 262, 264-267
estimating population total, 136 standard error of the estimate, 259
proportion, 137 standardized residual, 260-26 I
total, 136 strength of relationship in
Samples/sampling coefficient of determination, 258
artificial vs. natural, 150 explained/unexplained variation, 256-257
collecting, preparing and analyzing sample data, 11, 14-15 total sum of squares, 257
confidence intervals for sample types, 126 total variation in Y, 256
definition of, 7 Simple random point sample, 112-113
dependent, 168 Single linkage approach, 33
independent, 168 Skewed (asymmetric) distribution, 43
advantages of, 102 negative/positive skew, 52
American Community Survey, 103 negative/positive skew, 52
cluster, 110 Pearson's coefficient, 52
distribution of a statistic, 118-119 Slope, in simple linear regression, 255
error, 102-103, 107, 119 Spatial (point) sampling cluster design, 113
estimation in, 117-137 Spatial aggregation/scale, change in, 57-60
free and non-free, 224 Spatial application of Poisson, 87-91
hybrid/composite, 114 Spatial autocorrelation
oversampling/undersampling, 109 clustered, random, and dispersed area patterns, 205-206
point, line, and area, 111 global vs. local measurement of, 208
primary data, 22 Spatial data
probability. SeeProbability sampling external boundary delineation, 53-54
procedural steps for, 104-108, 114 impact on descriptive statistics, 53
sampled population/sampled area, 104 modification of internal subarea boundaries, 54-57
sampling fraction of a problem, 126 scale/level of spatial aggregation, 57-60
simple random point sample, 113 Spatial interaction model, 7
sources of sampling error, 102-103 Spatial measures of central tendency
spatial. See Spatial sampling central feature, 63
stratified. SeeStratified sampling linear directional mean, 67-69
two-stage design, 135 median center (Euclidean median), 66-67
Sampling concepts median center (Euclidean median), 66-67
sampled population/sampled area, 104-105 relative dispersion problem, 72-73
sampling frame, 104-105, 107 relative dispersion problem, 72-73
target population/target area, 104-105, 107 standard deviational ellipse, 71-72
Sampling types standard distance, 69-70
cluster design, 113-114 Spatial sampling, 111-115
simple random point sample, 112 hybrid (point) sampling designs, 114-115
simple random point sample, 113 hybrid (point) sampling designs, 114-115

stratified point sample, 113 t distribution, 305-306


systematic point sample, 113 Target population/target area, 104-105, 107
"two-stage" cluster approach, 113 Theory, definition of, 7
types of, 112 Tobler's Law, 206
Spatial statistics, inferential, 205-209 Two-sample difference of means tests, 155-159
Spatially aggregated data, 22-23 Wilcoxon Rank Sum W(Mann-Whitney U), 158-159, 162
Spearman's rank correlation coefficient (rs), 247-250 Z or t test, 156-158, 180, 182-183
Spurious precision, 25, 33 Two-sample difference of proportions test, 163-168
Standard deviation, 45-50, 119 Two-stage sampling design, 135
breaks method, 47-48 "Two-stage" cluster approach, 113
deviational ellipse, 71-72 Two-tailed (non-directional) alternate hypothesis, 143-146
Standard distance (weighted standard distance), 69-70 Type I/Type II errors, 144-145
Standard error of the mean, 119
Standard score (Z-value). See Z-score Unimodal distribution, 43
Standardized nearest neighbor index, 212 Unweighted mean, 42
Standardized regression coefficients, 270
Statistical population, 7 Validity, 26
Statistical tables, 303-310 Variability, measuring, 44
Statistical techniques, context of, 3-19 Variable(s)
Statistically independent events, probability of, 79-80 definition of, 7
Statistics discrete vs. continuous, 23
data/ data sets in, 7-8, 22 interval and ratio scale classification of, 25
definition of, 4 nominal scale classification of, 23-24
descriptive/ inferential, 7
ordinal scale classification of, 24-25
examples of problem solving in geographic research, 8-19 quantitative vs. qualitative, 23
role in geographic research process, 4-8
strength of association, 240
Stepwise regression, 273
strongly-ordered vs. weakly-ordered, 24-25
Stochastic processes, 78 Variance
Stratified sample(s)
standard deviation and, 45-49
confidence interval equations for, 125 Analysis of variance (ANOVA), 174, 176-185
designs, 109
in quadrat analysis, 216-218
estimate of population mean, 130-131
variance-mean ratio, 216-218, 221
estimate of population proportion, 134
Variograms, 206-207
estimate of population total, 131-134
mean, confidence interval estimate of, 131
proportion, confidence interval estimate of, 133 Weakly-ordered vs. strongly-ordered variables, 24
stratified point sampling, 113 Weighted mean center, 65-66
total, confidence interval estimate of, 132 Weighted mean, calculation of, 42
proportional (constant-rate) or disproportional (variable-rate), Weighted standard deviation, 47
109 Weighted standard distance, 69-70
systematic unaligned point sample, 114 Wilcoxon matched-pairs signed-rank test, 162, 169-173, 183
Strength of association, 240 Wilcoxon Rank Sum W(Mann-Whitney U) Tests, 158-159, 162
Strongly-ordered vs. weakly-ordered variables, 24 Within-group variation, 175, 177
Student's t distribution, 305-306
Subdivision strategy of classification, 27-28 Ztest for proportions, 148-149
Symmetric distribution, 43 Z-model/t-model one-sample difference of means test, 145
Systematic sampling, 108-109 Zoning/grouping problem, 54-57
confidence interval equations for, 124 Z-score, 94-95, 121, 144-145
point sampling, 113-114 Z-value (t-value) of a sample mean, 145
THIRD EDITION

Now in its third edition, this highly regarded text has firmly established itself as the definitive
introduction to geographical statistics. Assuming no reader background in statistics, the
authors lay out the proper role of statistical analysis and methods in human and physical
geography. They delve into the calculation of descriptive summaries and graphics, the use of
inferential statistics as exploratory and descriptive tools, ANOVA and Kruskal-Wallis tests,
different spatial statistics to explore geographic patterns, inferential spatial statistics, and
spatial autocorrelation and variograms.
The authors maintain an exploratory and investigative approach throughout, providing
readers with real-world geographic issues and more than 50 map examples. Concepts are
explained clearly and narratively without oversimplification. Each chapter concludes with a
list of major goals and objectives. An epilogue offers over 150 geographic situations, inviting
students to apply their new statistical skills to solve problems currently affecting our world.

Waveland Press, Inc.


www.waveland.com

ISBN 13: 978-1-4786-1119-6


ISBN 10: 1-4786-1119-7
Continuous Probability Distributions
and Probability Mapping

6.1 The Normal Distribution


6.2 Probability Mapping

6.1 THE NORMAL DISTRIBUTION

The most generally applied probability distribution for geographical problems is the normal distribution. When a set of geographical data is normally distributed, many useful conclusions can be drawn, and various properties of the data can be assumed. The normal distribution provides the basis for sampling theory and statistical inference, both of which are discussed in later chapters. The discussion in this section shows how probability statements are made from normally distributed data sets.

Although a normal distribution is fully described with a rather complex mathematical formula, it can be generally understood by a simple graph showing the frequency of occurrence on the vertical scale for the range of values displayed on the horizontal axis (fig. 6.1). Since the values on the horizontal axis are not restricted to integers, the normal curve is an example of a continuous distribution. The most striking feature of the normal curve is its symmetry; the lower (left-hand) and upper (right-hand) ends of the frequency distribution are balanced. This symmetric pattern of values in a normal distribution means that no skewness exists in the data. The central value of the data represents the peak or most frequently occurring value. In a normally distributed set of data, this position corresponds to all three measures of central tendency: the mean, median, and mode.

A normal curve is sometimes referred to as "bell-shaped" in appearance. This characteristic shape represents a frequency distribution with an intermediate amount of kurtosis. The tails of a normal distribution are the portions of the curve with the lowest frequencies located furthest from the center. Note that the frequency of values in a set of normally distributed data declines in a gradual manner in both directions away from the mean.

FIGURE 6.1
The Normal Distribution

Because of its particular shape and mathematical definition, the normal distribution is very useful in making probability statements about actual outcomes in many different situations. For example, if the amount of precipitation a location receives in a year is normally distributed over a series of years, the probability of this site receiving a given amount of precipitation can be calculated. The normal distribution provides the theoretical basis for sampling and statistical inference, a primary focus of the rest of the book.

The way in which areas are distributed under the normal curve provides the basis for making probability estimates. The total area under the normal curve represents 100% of the outcomes; therefore, we can determine the percentage of values within any portion of the curve along the horizontal axis. For example, due to the symmetry of the normal distribution, 50% of the values must lie under the curve and to the right of the central or mean value. Since the normal curve is also a probability distribution, a value taken from a normal distribution has a .50 probability of falling above the mean.

Given the symmetric form of the normal curve, it is clear that 50% of all values are greater than the mean. This methodology is used to determine percentages for other intervals under the normal curve. Using integral calculus, statisticians can calculate areas under the normal curve. A simple alternative is to use a table of normal values, showing the proportion of total area under any part of the normal curve. This table is derived mathematically from a theoretical normal distribution. In virtually all statistical software packages, the area under any portion of the normal curve is calculated automatically, given the relevant parameters.

To use the table of normal values, data must be standardized. On a standardized scale, each observation is assigned a standard score (also called a normal deviate), which indicates how many standard deviations separate a particular value from the mean of the distribution. Standard scores can be either positive or negative. For units of data greater than the mean, the corresponding standard scores are positive, and for values less than the mean, standard scores are negative. A standard score of plus 1 represents a value that is one standard deviation above the mean, whereas a score of minus 1 is one standard deviation below the mean. The mean corresponds to a standard score or normal deviate of 0. The larger the standard score (in either the positive or negative direction), the farther the value lies above or below the mean.

For any standard score, the table of normal values provides a probability that can be interpreted in two ways. First, the standard score gives the probability of a value falling between the mean and that standard score location for a set of normally distributed data. Second, multiplying the probability by 100, it shows the percentage of all values in the normal distribution that lie between the mean and the standard score. Thus, the table of normal values provides information to determine the area under the normal curve for any interval.

Consider these examples using the table of normal values (appendix, table A). The probability value associated with a standard score of 1.0 is .3413 (or 34.13%) (fig. 6.2). In a normally distributed set of data, approximately 34% of the values lie between the mean and one standard deviation above the mean. Since the normal curve is symmetric, 34.13% of the values also lie between the mean and one standard deviation below the mean. Therefore, by combining these two areas, approximately 68% of all values in a normal distribution lie within one standard deviation on either side of the mean.

FIGURE 6.2
Selected Areas of the Normal Curve

Similarly, the probability of a value lying between the mean and a standard score of 2 is .4772. Thus, almost 48% of values in a normally distributed set of data lie between the mean and two standard deviations above the mean, and somewhat more than 95% of the values are within two standard deviations on either side of the mean. The remaining 5% of the values are in the two tails of the distribution, where the standard score is either greater than plus 2.0 or less than minus 2.0.

Although the table of normal values provides probabilistic information on a standardized scale, data can be measured on various scales. To apply the normal probabilities to specific sets of data, values must first be converted from their original units of measurement (X) to a standardized (Z) scale without units of measurement. In this process, data values are represented by their relative position in comparison to the mean:

Zi = (Xi - X̄) / s     (6.1)

where Zi = Z-score or standard score for the ith value
      Xi = observation i
      X̄ = mean of the data
      s = standard deviation of the data

The numerator of equation 6.1 shows the deviation of observation i from the mean of the data set. This deviation is then divided by the standard deviation of the distribution. The resulting Z-value (or standard score) can be interpreted as the number of standard deviations above or below the mean an observation is located. If the value under consideration is greater than the mean, the deviation is positive and Z will be greater than zero, while if the observation is less than the mean, the deviation and resulting Z-score will be negative. If X equals the mean, there is no deviation from that value and the Z-score will be zero.

Any set of data can be converted from its original units of measurement into the corresponding set of standardized values. However, to estimate probabilities using the table of normal values, the original data must be normally distributed. If the data is not normal, it is invalid to use the table of normal values to estimate probabilities. Techniques for testing whether a sample has been drawn from a normally distributed population are presented in section 12.1.

The 40-year data set of annual precipitation for Washington, DC presented in earlier chapters is used again to make probability estimates. Probability questions can be stated in a couple of different ways. One general type of question we could ask is: "Given a particular data value, what is the corresponding probability?" A second general type of probability question we could ask is: "Given a particular probability value, what is the corresponding data value?" We'll now take a look at both of these approaches.

For the first approach, what is the probability of annual precipitation in Washington, DC exceeding 48 inches? These 40 years of data are normally distributed, with a mean of 39.95 inches and a standard deviation of 7.5 inches. We can estimate probabilities associated with a data value using a simple three-step procedure.

Step 1: Calculate the standard score:

The standard score corresponding to 48 inches is calculated, where X̄ = 39.95 and s = 7.5:

Zi = (Xi - X̄) / s = (48.0 - 39.95) / 7.5 = 8.05 / 7.5 = 1.07     (6.2)

Thus, a precipitation level of 48 inches is 1.07 standard deviations above the mean precipitation of 39.95 inches (fig. 6.3).

Step 2: Determine the probability from the normal table:

Using the table of normal values, the Z-score of 1.07 corresponds to a probability level of .3577. Thus, almost 36% of the values under the normal curve lie between the mean and 1.07 standard deviations. In other words, in about 36 years out of 100, precipitation in Washington, DC should fall between 39.95 and 48 inches.

Step 3: Evaluate the probability value:

Although the table of normal values always determines probabilities for areas under the curve in relation to the mean, the actual probability being sought may represent a different part of the curve. For this problem, the answer lies in the shaded portion of the curve above (to the right of) the Z-score for 1.07. Because the proportion of the total area under the normal curve above the mean is .5000, the correct answer is found by subtracting the probability in Step 2 from .5000 (.5000 - .3577 = .1423). Therefore, in 14 years out of 100, annual precipitation in Washington, DC should exceed 48 inches.

FIGURE 6.3
Probability of Annual Precipitation in Washington, DC Exceeding 48 Inches

For the second approach, what amount of annual precipitation in Washington, DC is likely to be exceeded with a probability of .90 (that is, 9 years out of 10)? To answer this question, the three-step methodology is altered.

Step 1: Determine the probability from the normal table:

As shown on figure 6.4, precipitation in Washington, DC will be exceeded 90% of the time at the position indicated by the shading. The total shaded area represents .90 of the total area under the curve, while the portion to the left of the mean is .40. Therefore, we must determine the Z-score corresponding to a probability of .40.

Step 2: Calculate the standard score:

According to the table of normal values, a probability of .40 lies between a Z-score of 1.28 (when Z = 1.28, the probability is .3997) and a Z-score of 1.29 (when Z = 1.29, the probability is .4015), but is considerably closer to Z = 1.28. Since the location is below (or less than) the mean, the value is minus 1.28. Thus, the precipitation that occurs at least 90% of the time (9 years out of 10) is 1.28 standard deviations below the mean.

FIGURE 6.4
Amount of Precipitation Exceeding Nine Years Out of Ten in Washington, DC
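As a quick check of the first question above (the probability of exceeding 48 inches), the three steps can be sketched in a few lines of Python. This is an illustrative aside, not part of the text's method: it evaluates the exact normal CDF with `math.erf` instead of looking up the rounded value in the table of normal values.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Washington, DC annual precipitation: mean 39.95 inches, s = 7.5 inches
z = (48.0 - 39.95) / 7.5                       # Step 1: standard score (about 1.07)
p_exceed = 1.0 - normal_cdf(48.0, 39.95, 7.5)  # Steps 2-3: area above 48 inches
```

The result, about .14, agrees with the table-based answer of .1423 to two decimal places; the tiny difference comes from using the unrounded Z-score.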

Step 3: Calculate the Precipitation Value (Xi):

Using equation 6.1, the precipitation value corresponding to a Z-score of minus 1.28 is determined:

-1.28 = (Xi - 39.95) / 7.5     (6.3)

and

Xi = -1.28(7.5) + 39.95 = 30.35 inches     (6.4)

Therefore, 9 years out of 10, precipitation in Washington, DC will likely exceed 30.35 inches, and 1 year out of 10, precipitation should be less than 30.35 inches.

In summary, the normal curve is a very useful probability distribution in geography. For normally distributed data, the proportions of total area under the normal curve are fixed, allowing a variety of probability statements to be made. The normal curve serves as the basis for sample estimation and inferential statistics, discussed in upcoming chapters. You might want to determine the likelihood of a future heavy snowfall or drought using precipitation or snowfall data from your hometown, similar to what we have done for Washington, DC.

6.2 PROBABILITY MAPPING

The previous section discussed how to make probabilistic statements about an event at a single location. We can extend this technique to calculate probability estimates for multiple locations distributed across a region. If probability data are plotted on a map, the resulting information represents a probability map or "probability surface," showing spatial variation in the variable under consideration.

One very important practical constraint must be carefully considered when constructing probability maps: the variable you are mapping must be continuously distributed across space. Since virtually no human phenomena are continuously distributed (e.g., population, number of murders), probability map applications are limited to natural or physical variables where a value can be measured at any location. Probability maps can be constructed for many spatial variables. The technique seems most directly applicable for the analysis of natural phenomena in climatology, meteorology, and environmental studies. For example, geographers could construct probability maps of atmospheric particulates, ozone levels, major winter storms, timing of killing frosts, or the pH levels of acid deposition. In some instances, probability mapping is extended to selected topics in human geography, such as disease, unemployment, or poverty. However, extreme care must be taken because virtually all variables in human geography are not distributed continuously over space.

Suppose annual precipitation data are collected for a set of cities in the United States, and the mean and standard deviation of values at each location are calculated. Assuming all data sets are normal, the level of precipitation exceeded 9 years out of 10 could be determined for each city using the technique discussed in the previous section. A probability map is produced by assigning each probability value to its map location and by connecting equal probability values with isolines. Just as contours show elevation patterns in an area, the probability surface would show the spatial pattern of precipitation probability. Many GIS software products have the capability to automatically generate isolines from a point pattern. However, selection of the actual probabilities is the job of an educated geographer.

You might be asking: what additional information is provided by a map of precipitation probability that is not provided by a simple annual precipitation map? Probability maps consider both the central tendency and variability of the data at each location. In fact, the key advantage of probability maps over maps of central tendency (e.g., a map of average annual precipitation) is their consideration of the variability of values at each location.

FIGURE 6.5
Comparison of Expected Precipitation in Two Cities with the Same Average Annual Precipitation, but Very Different Precipitation Variability

The following example illustrates the importance of variability in the construction of probability maps. Consider two cities, A and B. They have equal average annual precipitation (50 inches), but very different levels of variability around those averages (fig. 6.5). Note the contrast between city A and city B with regard to the minimum annual precipitation expected 9 years out of 10. City A has little precipitation variability from year to year and

the minimum expected precipitation is only slightly lower than the mean (43.6 vs. 50). By contrast, city B has great variability in precipitation over time, making the minimum expected precipitation considerably lower than the mean (24.4 vs. 50). When mapping the minimum annual precipitation expected 9 years out of 10 for many different cities, the resulting probability map might have a spatial pattern that differs greatly from that of average annual precipitation. Such maps may be more useful in identifying areas that are at greater risk for repeated flooding or drought. Finally, note that a conventional map of average annual precipitation is comparable to a probability map showing the amount of precipitation expected 5 years out of 10 or 50% of the time.
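The inverse calculation for Washington, DC and the two-city contrast in figure 6.5 can be sketched with Python's `statistics.NormalDist` (available since Python 3.8). This is an illustrative aside: the standard deviations assumed for cities A and B (5 and 20 inches) are not given in the text; they are hypothetical values chosen because they reproduce the figure's minimums of 43.6 and 24.4 inches.

```python
from statistics import NormalDist

def exceeded_with_prob(p, mu, sigma):
    """Value exceeded with probability p: the (1 - p) quantile of the normal."""
    return NormalDist(mu=mu, sigma=sigma).inv_cdf(1.0 - p)

# Washington, DC: precipitation exceeded 9 years out of 10 (text: 30.35 inches)
dc_min = exceeded_with_prob(0.90, mu=39.95, sigma=7.5)

# Cities A and B: equal 50-inch means, very different variability.
# The standard deviations (5 and 20 inches) are assumed for illustration;
# they reproduce the mapped minimums of 43.6 and 24.4 inches.
city_a_min = exceeded_with_prob(0.90, mu=50.0, sigma=5.0)   # about 43.6
city_b_min = exceeded_with_prob(0.90, mu=50.0, sigma=20.0)  # about 24.4
```

Because `inv_cdf` uses the exact Z of -1.2816 rather than the table's -1.28, the results come out a shade lower than the hand calculations (e.g., 30.34 versus 30.35 inches for Washington, DC).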

Example: Annual Heating Degree Day Pattern across the Conterminous United States

A heating degree day (HDD) is a surrogate measure for the amount of heating energy needed to produce "comfortable" indoor warmth for a home or business in cold or cooler climates. It is a widely used energy management index for measuring yearly heating energy demand. Generally, the heating requirements for a building at a specific location are directly proportional to the number of HDDs at that location. A cooling degree day (CDD) is the opposite of a heating degree day; the CDD measurement reflects the amount of air-conditioning energy needed to cool a structure during the warmer summer months.

There are many ways to calculate heating degree days: the more detailed the temperature data, the more accurate the HDD calculation. One widely used approximation method simply compares the average daily temperature at a site (which is the mean of the high and low temperatures for that particular day) with a base temperature. The average daily temperature is subtracted from a base temperature of 65 degrees Fahrenheit. If temperature is measured in degrees Celsius, a base temperature of 18.3 is used. If the HDD statistic is negative, no heating degree days are recorded, and the assumption is that heating will not be needed on that day. If the difference is positive, the HDD value represents the number of HDDs needed that day. These daily values totaled over a year define the annual number of heating degree days at that location.

We have collected 40-year records of annual heating degree day values from over 100 weather stations across the conterminous United States, excluding Alaska and Hawaii. These are all level-one weather stations, generally located at large city airports. Our goal was to have a spatial pattern of weather stations that is fairly evenly distributed and not clustered in just one part of the country.

Using the means and standard deviations calculated from the 40-year HDD data, we calculated the number of HDDs expected to be exceeded 1 year out of 10 using the standard score formula (equation 6.1). The need for this much energy for heating is fairly rare, as a winter this cold is expected to occur only once per decade. The calculation for Lexington, Kentucky indicates that a heating degree day level of 5,168.6 will be exceeded 10% of the time (fig. 6.6).
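The daily HDD rule and the Lexington calculation can be sketched in Python. This is a minimal illustration using the figure 6.6 values (X̄ = 4695, s = 370, table Z = 1.28) and the 65°F base; the example daily temperatures are hypothetical.

```python
def daily_hdd(high_f, low_f, base_f=65.0):
    """One day's heating degree days: the base temperature minus the daily
    mean temperature, with negative differences recorded as zero HDDs."""
    daily_mean = (high_f + low_f) / 2.0
    return max(base_f - daily_mean, 0.0)

cold_day = daily_hdd(40.0, 20.0)  # 65 - 30 = 35 HDDs on a cold day
warm_day = daily_hdd(80.0, 70.0)  # daily mean above the base, so 0 HDDs

# Lexington, KY: annual HDD level exceeded 1 year out of 10
# X = X-bar + Z * s, with X-bar = 4695, s = 370, Z = 1.28
critical_hdd = 4695.0 + 1.28 * 370.0  # 5168.6
```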
Following this procedure, values for all cities are calculated
and placed on a probability map of the United States, with
isolines (much like contours) drawn connecting locations hav•
X 4695 CriticalX
(Z) (0) (1.1a1 ing the same estimated number of HDDs 1 year out of 10. A
similar process could be followed for any other probability
Step 1: Calet.JlateXands: x= 4S95 level. For example, a probability map could show how many
s =370 HDDs occur 1 year out of 20 (.OSprobability) or 3 years out of
10 (.30 probability).
The probability map of HDDs (fig. 6.7) roughly reflects the
Step 2: Use Z tableto determinestandards.oote
spatial pattern of solar energy or heat received across the
when p =.10 Zwll be 1.28
United States. Since the map portrays a situation expected to
occur 1 year out of 10, it provides a reasonable spatial esti•
Step 3: Caloulatethe ctitlcalhealingday value(X) mate of the amount of energy needed for that once-in-a•
z = x-l1
s
-
then X = X + (Z)(s) decade cold winter.
= 4695 + (1.28X370) The general east-west trend of isolines emphasizes the
= 5168.6 (Critical X) importance of latitude on the distribution of heat. This influ•
ence is demonstrated by the regular south-to-north increase
Lex1n.gton,
KY wUIexceed5168.6 heatingd&greedaysoneyearout of ten. in heating degree days and winter heat need (e.g., from louisi•
ana to Minnesota).
Where isolines dip southward, weather station(s) at a higher
FIGURE 6.6
Calculation of Heating Degree Days Exceeded One Year Out of elevation probably influenced the pattern, since locations with
Ten for Lexington, KY higher elevations generally have cooler temperatures all year

(continued)
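The critical-value calculation shown in figure 6.6 can be reproduced in a few lines of code. This is an illustrative sketch, not part of the original text; it uses Python's standard-library `statistics.NormalDist` and the Lexington values from the figure (X̄ = 4695, s = 370).

```python
from statistics import NormalDist

# Lexington, KY: 40-year mean and standard deviation of annual HDDs
mean_hdd = 4695
sd_hdd = 370

# Step 2 (from the Z table): when p = .10, Z = 1.28
z_table = 1.28

# Step 3: critical X = X-bar + Z * s
critical_table = mean_hdd + z_table * sd_hdd
print(round(critical_table, 1))  # 5168.6, matching figure 6.6

# The same value from the inverse normal CDF, without table rounding
critical_exact = NormalDist(mu=mean_hdd, sigma=sd_hdd).inv_cdf(0.90)
print(round(critical_exact, 1))  # 5169.2
```

The small difference between the two results comes only from rounding Z to two decimal places in the printed table.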
98 Part III • The Transition to Inferential Problem Solving

FIGURE 6.7
Probability Map: Number of Heating Degree Days Expected One Year Out of Ten
(Isoline map of the conterminous United States; isolines labeled in HDDs, e.g., 2000, 3000, 5000; dots mark major U.S. cities; scale bar 0-500-1,000 kilometers.)
Source: National Climatic Data Center (NCDC)

long, but especially in the winter months. For instance, in figure 6.7 you can see a slight southward dip over the Appalachian Mountains of West Virginia and western Virginia.

Perhaps the dominant feature on the probability map is the higher HDD values found in much of the western third of the country. Compared to corresponding latitudes in the eastern two-thirds, HDD values throughout the Rocky Mountain region are somewhat higher. Again this is mostly due to the higher elevations in that area. You might also note the moderating influence of the Pacific Ocean along the California, Oregon, and Washington coastline, keeping the number of HDDs needed during winter cold spells lower than locations at comparable latitudes several hundred miles inland.

The heating degree day pattern becomes complicated in places with abrupt elevation changes. A steep contour gradient is especially notable in places like Arizona where the significant decline in elevation from Flagstaff to Phoenix results in a complex regional isoline pattern. Flagstaff has an elevation of nearly 7,000 feet and requires lots of winter heating energy. The 1-in-10-year annual heating degree day value of Flagstaff is 7,948. Phoenix has a lower elevation of about 1,100 feet, and the city's location in the Sonoran Desert results in a limited need for artificial heating, even in the winter months (the annual HDD value exceeded 1 year in 10 is only 1,623).

Interpretation of the HDD pattern nationwide illustrates a common problem associated with isoline mapping. The isoline pattern is influenced by the density and placement of control points from which the isolines are drawn. If data from additional weather stations were used, the isolines that make up the probability surface could be located more precisely.

We need to caution you about several problems associated with the HDD statistic. For a variety of reasons, the results are sometimes less accurate than many people believe, even though the underlying theory behind these measures is solid. For example, using a one-size-fits-all base temperature may be inaccurate because different structures heat to different temperatures and average heat gain will vary significantly from building to building, even with the same energy input. Also, most structures are heated to ideal comfort levels on an irregular and intermittent basis. Clearly, home heating and business heating needs fluctuate in different complicated ways (business hours versus overnight, weekdays versus weekends, etc.). Yet another problem has to do with properly monitoring internal temperatures that often vary considerably at different internal locations within a building. However, it is also true that a well-designed and carefully monitored HDD data set can be right on target and allow the implementation of an excellent energy conservation program.
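The isoline-mapping caveat above (a probability surface is only as good as the density and placement of its control points) can be illustrated with a small sketch. The book does not say which interpolation method underlies figure 6.7; inverse distance weighting (IDW) is one common choice, used here purely for illustration. The station coordinates and HDD values below are invented.

```python
import math

def idw_estimate(points, x, y, power=2):
    """Inverse-distance-weighted estimate at (x, y) from
    control points given as (px, py, value) triples."""
    num = den = 0.0
    for px, py, value in points:
        d = math.hypot(x - px, y - py)
        if d == 0:            # exactly on a control point
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical control points: (x, y, HDD value exceeded 1 year in 10)
stations = [(0, 0, 5169), (4, 0, 4600), (0, 3, 6200), (5, 4, 5500)]

est = idw_estimate(stations, 2, 2)
# An IDW estimate is a weighted average, so it always falls
# within the range of the surrounding control values.
assert min(v for _, _, v in stations) <= est <= max(v for _, _, v in stations)
```

Adding more stations to `stations` changes the estimated surface between control points, which is exactly why a denser network lets the isolines be drawn more precisely.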
Chapter 6 • Continuous Probability Distributions and Probability Mapping 99

Example: Last Spring Frost Pattern in the Southeast United States

Recall that one of the examples of geographic problem solving we looked at in chapter 1 dealt with the timing of the average last spring frost in the southeastern United States (see fig. 1.6). We now return to this data set in the context of drawing an isochronal probability map showing the likelihood of a late spring frost occurring at any particular location for any particular date that we want to select.

Using the means and standard deviations calculated from the date of last spring frost data, the probability of the last spring frost occurring on or after April 1 is calculated for each of the 76 weather stations. The calculation for Waycross, Georgia (fig. 6.8) shows there is a .1660 probability (that is, a 16.6% chance) that Waycross will have a late spring frost on or after April 1.

A probability map is usually effective in showing a general regional pattern, but is often less likely to be effective at portraying changes in probability over shorter distances or at the micro-level. We have already seen with the HDD application that the precision of probability map isoline placement is directly affected by the density of the point pattern from which the isolines are interpolated.

A similar problem occurs with the last spring frost probability map. The spatial pattern of average last spring frost dates across the southeast is complex (fig. 6.9). In places like western North Carolina and northern Georgia where the Piedmont plateau transitions into the Blue Ridge Mountains, average last spring frost dates vary considerably over relatively short distances, partly due to substantial elevation changes and associated local climatological factors. Frost prevalence is clearly affected by local topography. Cool air settles at the bottom of slopes because it is heavier than warm air. This allows frost pockets to form in valleys where the cool air is trapped. In general, however, higher elevations have colder temperatures. Combine these influences and you are likely to see frost damage at the bottom of slopes and on the higher-altitude hilltops, while the sloped hillsides are frost-free.

Many local conditions affect the timing and severity of frost. These include: (1) the presence of certain weather conditions (clear skies and calm winds) that sometimes allows development of an inversion layer (colder air near the ground with warmer air above this trapped cold layer); (2) compass orientation of the site (aspect); (3) the degree of inclination (slope); and (4) barriers to cool air drainage, such as narrow valleys and enclosed basins.

In addition, any frost climatology data from official weather stations are based on air temperature measurements taken about five feet above ground level, from weather station shelters or platforms. It is not unusual to discover that ground surface temperatures are 4-6°F lower than shelter readings.

In summary, the timing of the last spring frost at any particular site is the result of both general factors like those hypothesized in chapter 1 (latitude, elevation, and distance from the coast) and local climatological conditions such as those just discussed. The cumulative result of all these regional and localized factors is a probability map with isochrones changing direction frequently over fairly short distances. The spatial complexity of the last spring frost pattern will be discussed further when we construct regression models to predict the date for each of the weather stations.

FIGURE 6.8
Calculation of Probability that Last Spring Frost Occurs on or After April 1 in Waycross, GA
(Sketch of the normal curve with area .3340 between Z = 0 and Z = 0.97; X̄ = 75.89 at Z = 0, Julian day 91 at Z = 0.97.)

Step 1: Calculate X̄ and s (Julian dates): X̄ = 75.89, s = 15.63
(March 17 is Julian day 76; March 31 is Julian day 90; April 1 is Julian day 91.)
Step 2: Calculate the Z score when X = 91 (April 1):
Z = (91 - 75.89) / 15.63 = 0.97
Step 3: Determine the area under the normal curve associated with Z = 0.97. This area is .3340. Therefore the associated probability is .5000 - .3340 = .1660. There is a 16.6% chance of a last spring frost on or after April 1 in Waycross, GA.
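The Waycross calculation in figure 6.8 can also be checked with the standard library. This sketch uses the values given in the text (X̄ = 75.89, s = 15.63, April 1 = Julian day 91) and `statistics.NormalDist` in place of the printed Z table.

```python
from statistics import NormalDist

mean_day = 75.89   # mean Julian date of last spring frost, Waycross, GA
sd_day = 15.63
april_1 = 91       # April 1 is Julian day 91

# Step 2: standard score for April 1
z = (april_1 - mean_day) / sd_day
print(round(z, 2))  # 0.97

# Step 3: P(last frost on or after April 1) = area in the upper tail
p = 1 - NormalDist().cdf(z)
print(round(p, 3))  # 0.167
```

The text's .1660 comes from rounding Z to 0.97 before consulting the table; carrying the full Z gives about .167, a negligible difference at map scale.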



FIGURE 6.9
Probability Map: Likelihood of Last Spring Frost Occurring on or After April 1 in Portion of Southeast U.S.
(Isochronal map of the southeastern United States bordering the Gulf of Mexico; dots mark the weather stations, including Waycross, GA; scale bar 0-250-500 kilometers.)
Source: Parnell, 2013

KEY TERMS

normal distribution, 93
probability map, 96
standard score (normal deviate), 94

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Explain the nature of the normal distribution, know how to use the table of normal values, and identify potential applications in geography.
2. Understand the calculation and interpretation of standard (Z) scores.
3. Describe the technique of probability mapping and recognize the types of spatial variables for which probability maps can be constructed.

REFERENCES AND ADDITIONAL READING

Energy Lens, Energy Management Software. Degree Days - Handle with Care! www.energylens.com/articles/degree-days.
National Climatic Data Center (NCDC). Climatography of the United States, No. 20, 1971-2000. Source for heating degree data.
Parnell, D. B. A Climatology of Frost Extremes Across the Southeast United States, 1950-2009. Unpublished manuscript, Salisbury, MD: Salisbury University, 2013.
Ristinen, R. A. and J. P. Kraushaar. Energy and the Environment. 2nd ed. Hoboken, NJ: John Wiley and Sons, 2005.
Basic Elements of Sampling

7.1 Sampling Concepts
7.2 Types of Probability Sampling
7.3 Spatial Sampling

Sampling was discussed briefly in chapter 1 as an essential component of the scientific research process. In many aspects of geographic research and problem solving, statistical techniques are incorporated into that research procedure. In fact, a basic knowledge of sampling procedures and methodology is valuable, whatever the area of geographic study.

How does sampling fit into the overall scheme of statistical analysis and scientific research? Earlier chapters include some brief comments about the nature of these relationships. The ultimate aim of inferential statistics is to generalize about certain characteristics of a large group, based on information obtained from a portion (often a very small portion) of that large group. That is, characteristics regarding a statistical population (the universe of all individuals) are inferred from information obtained from a sample (a subset or portion of individuals selected from the population for detailed analysis). These concepts are further clarified in the chapter on probability, where the importance of randomness and random processes is discussed in the context of making generalized probability statements about a large population (the conceptual notion of an infinite number of die rolls) using only sample observations (such as a few rolls of a die).

This chapter introduces some basic sampling concepts and discusses the important roles of sampling in geographic problem solving and research. Without a properly drawn sample, valid generalizations or statistical inferences about the population may not be possible. Section 7.1 presents an overview of the advantages of sampling, the sources of sampling error, and the steps involved in a well-designed sampling procedure. The various types of probability sampling are reviewed in section 7.2. All probability samples contain an element of randomization, but simple random sampling is often not the best choice of sample design. We will examine some useful alternatives to simple random sampling, including systematic, stratified, cluster, and hybrid sample designs. Section 7.3 discusses the circumstances under which spatial sampling is necessary and reviews different types of spatial sampling. Special attention is given to point sampling designs.

7.1 SAMPLING CONCEPTS

Sampling is an essential skill in virtually all areas of geographic study. A biogeographer interested in the spatial pattern of environmental change associated with high-intensity recreation use in a national park cannot examine conditions everywhere in the park. A representative sample of study sites needs to be selected for detailed analysis of these human-environment relations. An urban geographer examining the variation of housing quality in a metropolitan area selects a sample of homes for detailed study. A behavioral geographer conducting research on natural hazards distributes questionnaires to a representative sample to learn public attitudes toward alternative flood-management policies along the flood plain of a river. A medical geographer concerned with neighborhood variations in the use of hospital emergency rooms as primary care centers uses sampling to select the neighborhoods and the hospitals to include in the study.

These examples illustrate the geographer's use of sampling in both spatial and nonspatial contexts. Spatial
sampling takes place when the biogeographer selects locations to examine environmental change in the national park. Likewise, the neighborhoods where the medical geographer chooses to examine patterns of hospital use also constitute a spatial sample. On the other hand, the geographer conducting the study on attitudes toward natural hazards along the river flood plain may have selected individuals from a nonspatial list of households in the area. The urban geographer conducting the study of housing quality could have taken a sample of homes from either a nonspatial list (e.g., tax rolls) or a spatial source (e.g., a map).

Advantages of Sampling

A variety of both practical and theoretical reasons make sampling preferable to complete enumeration or census of an entire population. These are some of the advantages of sampling:

• Sampling is a necessity in many geographic research problems. If the population being studied is extremely large (or even theoretically infinite), completing a total enumeration is not possible. If you are studying the reasons why families in the United States change residence, sampling is the only alternative: all those who move cannot possibly be contacted and surveyed individually. A geographer analyzing global changes in the nature and spatial extent of tropical rain forests will find it impossible to have total spatial coverage, since the number of locations in a rain forest is infinite.

• Sampling is an efficient and cost-effective method of collecting information. An appropriate amount of data concerning the population can be obtained and analyzed quickly with a sample. Not only is sample information collected in less time, but sampling also keeps expenditures lower and logistical problems to a minimum. The overall scale of effort (time, cost, personnel, logistics, etc.) becomes realistic with sampling as opposed to an examination of the entire population.

• Sampling can provide highly detailed information. In geographic problems where in-depth analysis is necessary, only a small number of individuals or locations can be included in the study. These few elements in the sample could then be closely scrutinized with the collection of a comprehensive set of information. A study of shopping behavior patterns, for example, might require numerous questions about the number of shopping trips, locations visited while shopping, attitudes about alternative stores, as well as demographic or socioeconomic information concerning household members. Such detailed analysis is obtained only through sampling.

• Sampling allows repeated collection of information quickly and inexpensively. Many geographic research problems require detailed information collected over a specific period of time or focus on spatial changes that occur rather quickly. For example, if you are studying the attitudes of citizens living in a barrier island community toward alternative coastal zone management strategies, you may want to follow a sample of residents as pertinent legislation moves through the political process. Their attitudes are likely to change over time, especially if a hurricane hits the island during the study period! An urban geographer analyzing the spatial pattern of growth may wish to focus on the views of residents in a neighborhood before, during, and after the construction of a nearby shopping mall. With dynamic situations such as these, sampling is required; information from all individuals in the population cannot possibly be collected without unreasonable effort and cost.

• Sampling can provide a high degree of accuracy. With sampling, an acceptable level of quality control is assured. Complete and accurate questionnaire returns are more likely to be obtained if a small number of well-trained personnel conduct all of the interviews. This procedure may provide more accurate results than a complete census requiring a larger number of personnel, some of whom may not be fully trained. The 2010 U.S. Census of Population once again illustrated the difficulties of acquiring accurate information from everyone. Considerable controversy developed because many people living in America were literally not counted.

Despite the many clear and obvious advantages of statistical sampling, the decision to use sampling (and how to use sampling) in a specific practical situation is often controversial with multifaceted implications. Real-world difficulties and political policy issues can arise concerning the implementation of sampling. In addition, many citizens are not aware of the advantages of sampling over total enumeration and will not be strong advocates of sampling.

Sources of Sampling Error

A central goal of sampling is to derive a truly representative set of values from a population. A representative sample accurately reflects the actual characteristics of the population without bias. To ensure an unbiased representative sample, an element of randomness is always incorporated into the sample design procedure. However, just having randomness built into a sampling plan does not guarantee an unbiased sample, for many other sources of sampling error are possible.

Unfortunately, it is usually not possible to know with absolute certainty whether a sample is totally representative. The very nature of sampling often means that everything cannot be known about the population containing the sample. Because only sample data are available, some uncertainty is associated with sample estimates, and some sampling error occurs. If you are taking a sample, you must try to minimize sampling error given various practical constraints such as cost or time.
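The sampling error just described can be made visible with a quick simulation. This sketch is illustrative only: the population values are synthetic, and the point is simply that the spread of sample means (one common measure of sampling error) shrinks roughly in proportion to the square root of sample size.

```python
import random
import statistics

random.seed(42)

# Synthetic population of 10,000 values (e.g., household incomes)
population = [random.gauss(50_000, 12_000) for _ in range(10_000)]

def spread_of_sample_means(n, trials=1_000):
    """Standard deviation of the sample mean across repeated
    simple random samples of size n."""
    means = [statistics.fmean(random.sample(population, n))
             for _ in range(trials)]
    return statistics.stdev(means)

small, large = spread_of_sample_means(25), spread_of_sample_means(400)
# 400 = 16 * 25, and sqrt(16) = 4: a 16-fold increase in sample
# size cuts the sampling error by only a factor of about 4.
print(small / large)  # roughly 4
```

This diminishing return is the precision side of the precision-versus-effort trade-off discussed in the next section.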

The measurement concepts of precision and accuracy help categorize the many sources of sampling error. The results from a small sample may not be very exact or precise. Increasing the sample size permits a more exact estimate (fig. 7.1, line A) and can reduce imprecision. Larger samples, however, are invariably more costly and time-consuming to obtain (fig. 7.1, line B). Satisfactory resolution of this difficult trade-off between sample precision and the effort required to sample is important in most real-world problems involving sampling.

Sampling inaccuracy is a more complex issue than sampling precision. Systematic bias can enter sampling in many different ways. Some inaccuracies are the result of problems with the sampling procedure itself. For example, the elements or individuals in a population could be "mismatched" with the set of elements or individuals from which a sample is taken. An urban geographer studying new home construction patterns might use a list of building permits as a source of data. If this list contains home renovations as well as new home construction, some of the renovations might be erroneously included in the new home construction sample.

Selecting an improper sampling design is another potential source of inaccuracy. The complexity of a problem might call for a more sophisticated approach, rather than a simple random design. In other sampling problems, the method of data collection may be inappropriate. For example, a geographer studying population mobility might decide to use a mail questionnaire to keep costs down when telephone interviews would provide more accurate results.

FIGURE 7.1
How Change in Sample Size Affects Level of Imprecision and Effort to Sample
(Graph with sample size on the x axis, from small to large. Line A, the amount of imprecision in the sample (inexactness of measurement), falls as sample size increases; line B, the amount of effort to sample (cost, time, etc.), rises.)

Other types of inaccuracy may result from some operational, logistic, or personnel problem not directly connected to the actual sampling procedure. Inconsistencies in collecting field data or interviewing could adversely affect the sample results. Errors could be made in the editing, recording, or tabulating of information. Even forces of nature beyond the control of the researcher could bias results. For example, if you are examining land use patterns along the flood plain of a large river you could find the entire project in jeopardy if severe flooding occurs during the study period.

Simply stated, the quality of statistical problem solving in geography depends heavily on samples that are properly designed and collected. It is important to develop procedures carefully to reduce sampling error to a tolerable or acceptable level. Most sources of error can be reduced substantially (and perhaps even avoided or eliminated) if all steps of the sampling procedure are carefully planned and evaluated before the full set of sample data is collected and analyzed.

The American Community Survey

The government of the United States has a long history of information gathering about Americans. Starting in 1790, as required by the U.S. Constitution and with funds authorized by Congress, the Census Bureau has conducted a national census every 10 years. One original mandated purpose was to determine the population size of each state, thereby allocating the proper number of congressional seats in the House of Representatives according to population size.

Over time, the amount of census information collected has increased dramatically, reflecting the growing need for data concerning the demographic and socioeconomic characteristics of our population. As we move into the 21st century, the "long form" of the census (given to a sample of housing units) has been replaced by the American Community Survey (ACS). The ACS has been described as the most substantial change in the decennial census in more than 60 years. The ACS is a nationwide monthly survey designed to provide communities with precise and accurate demographic, social, economic, and housing data annually. With full implementation of the ACS sampling process, we can now obtain a continual stream of updated information about localities, congressional districts, and states. It is hoped that this will revolutionize the way in which all levels of government plan and evaluate their programs. In addition, small businesses, large corporations, and individuals can utilize ACS data to estimate the sales potential of services and products and develop strategies for starting or expanding a business. For example, a restaurant could use demographic and economic data from the ACS to determine the best location and available workforce to start a new franchise. In addition to these benefits, the Federal government uses the survey to allocate more than $400 billion in annual funding to states and localities for education, health care, roads, and numerous other programs.

Steps in the Sampling Procedure

If faced with a geographic research problem in which the collection of sample information is necessary, you need to follow a number of steps (table 7.1 and fig. 7.2). Collecting sample data is not the first action taken. You must first anticipate and resolve various problems that might cause sampling error or other difficulties. Numerous safeguards or checks should be incorporated into the sampling procedure whenever necessary. The characteristics and relevant issues at each step of a well-designed sampling procedure are now summarized.

Step 1: As descriptive statements and inferential hypotheses are formulated in scientific research, variables in the problem must be defined both conceptually and operationally. In sampling, the population and area "targeted" for study must be defined. The target population is the complete set of individuals from which information is collected, whereas the target area is the entire region or set of locations from which information is gathered. Precise delineation of the target population and target area is not always a simple task. Suppose, for example, you are planning to study the spatial variation of public attitudes regarding several proposed revitalization projects in the central business district (CBD) of a large metropolitan area. What set of individuals should make up the target population? Where should you place the target area boundaries? Arguments could be made in support of any of the following alternatives:

• all city residents active in local civic groups
• all city landowners
• all city residents
• all city and suburban residents
• all area residents plus nonresident visitors

Step 2: After conceptually defining the target population and target area, an operational sampling frame(s) is created. A sampling frame is defined as the practical or operational structure that contains the entire set of elements from which the sample is drawn. This structure is sometimes a comprehensive list of individuals (nonspatial) and sometimes the boundary line delimiting the extent of the study area (spatial). Sometimes both spatial and nonspatial sampling frames are components of the same problem. A sampled population is the set of all individuals contained in the sampling frame from which the sample is actually drawn. A sampled area is the set of all locations within the study area boundary line that delimits the spatial sampling frame from which the sample is actually drawn.

TABLE 7.1
Steps in the Sampling Procedure

Step 1: Conceptually Define Target Population and Target Area
• Target population: the complete set of individuals from which information is to be collected.
• Target area: the entire region or set of locations from which information is to be collected.

Step 2: Designate Sampled Population and Sampled Area from Sampling Frame
• Sampling frame: the practical or operational structure that contains the entire set of elements from which the sample will actually be drawn.
• Sampled population: the set of individuals contained in the sampling frame, from which the sample is actually drawn.
• Sampled area: the set of all locations within the study area boundary line that delimits the spatial sampling frame, from which the sample is actually drawn.

Step 3: Select Sampling Design
• Probability sampling preferred over nonprobability sampling.
• Types of probability sampling include random, systematic, stratified, cluster, and hybrid designs.
• Spatial and nonspatial variations in sampling design exist.

Step 4: Design Research Instrument and Operational Plan
• Methods of data collection include direct observation, field measurement, mail questionnaire, personal interview, telephone interview.
• Establish protocols for handling all problems or situations that can be anticipated in the sampling procedure.
• Complete miscellaneous logistic and procedural tasks in the preparation of sample taking.

Step 5: Conduct Pretest
• Complete trial run or pilot survey of sample data collection method.
• Correct all discovered problems that could lead to sampling error.
• Pretest results may be used to determine sample size.

Step 6: Collect Sample Data
• Consistency in collection methods and procedures is essential.
• Ensure overall high level of quality control.
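The relationships among target population, sampling frame, sampled population, and sample (steps 1 through 3) can be made concrete with a toy sketch. Everything here is hypothetical: a notional population of 1,000 metropolitan residents, a frame (such as a utility customer list) that misses 100 of them, and `random.sample` standing in for whichever probability design is chosen in step 3.

```python
import random

random.seed(1)

# Target population (conceptual): all metropolitan-area residents.
target_population = [f"resident_{i}" for i in range(1, 1001)]

# Sampling frame (operational): e.g., a utility customer list that
# inevitably omits some residents (apartments, unlisted numbers, ...).
sampling_frame = target_population[:900]   # 100 residents never listed

# Sampled population: everyone in the frame who *could* be drawn.
sampled_population = set(sampling_frame)

# Step 3 onward: draw a simple random sample from the frame.
sample = random.sample(sampling_frame, 50)

# Every sampled individual comes from the frame...
assert all(person in sampled_population for person in sample)

# ...but the frame does not cover the whole target population.
missed = set(target_population) - sampled_population
print(len(missed))  # 100 residents could never be selected
```

The 100 unreachable residents are exactly the gap between target population and sampled population that the next paragraphs discuss.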

Why is it important that you distinguish between target population and sampled population in nonspatial sampling? The CBD revitalization example illustrates this distinction. Suppose the population you want to target is all metropolitan area residents, both in the city and in adjacent suburbs. Operationally, however, this target population is not listed in any single comprehensive source. We must construct a sampling frame that lists the intended target population as completely as possible.

Many possible sources of data could provide an operational definition for this sampling frame, yet each presents certain challenges. Customer lists from electric utilities would not include all households or all residents at those households listed. Some people in apartments or other group accommodations would be omitted from the sampling frame. Telephone listings will also not provide a complete enumeration of the target population. Lower-income residents without phones would be underrepresented in the sample, as would students in dormitories, residents in group quarters, and those with unlisted numbers. Landline phones and cell phones may not be available together on any list. The result is that the sampled population often cannot duplicate the target population because the sampling frame used to delineate the sampled population is not complete.

Suppose you want to update past studies concerning land use and land cover change in the "Greater Yellowstone Ecosystem" (GYE) (fig. 7.3) and would like to collect sample data from various locations within the area. The GYE contains Yellowstone and Grand Teton National Parks, 7 national forests, over 20 other federal and state jurisdiction areas (such as Wind River Indian

Step 1: Conceptually Define Target Population and Target Area
  Conceptually define target population | Conceptually define target area

Step 2: Designate Sampled Population and Sampled Area from Sampling Frame
  Operationally define a nonspatial sampling frame (e.g., list) that depicts the target population as closely as possible. | Operationally define a spatial sampling frame (e.g., map) that depicts the target area as closely as possible.
  The set of individuals or elements contained in the nonspatial sampling frame constitute the sampled population. | The set of coordinate locations that can be identified in the spatial sampling frame constitute the sampled area.

Step 3: Select Sampling Design

Step 4: Design Research Instrument and Operational Plan

Step 5: Conduct Pretest

Step 6: Collect Sample Data

FIGURE 7.2
Steps in the Sampling Procedure
(Flowchart; the paired entries above show the parallel nonspatial and spatial tracks through steps 1 and 2.)

FIGURE 7.3
The Greater Yellowstone Ecosystem
(Map spanning parts of Montana, Wyoming, Idaho, and Utah. Legend: municipalities, ecosystem boundary, U.S. national parks, U.S. national forests, Indian reservations. Labeled features include Yellowstone National Park, the Wind River Indian Reservation, and municipalities such as Bozeman, Livingston, Ennis, Red Lodge, Gardiner, Cooke City, Cody, St. Anthony, Idaho Falls, Dubois, and Afton; scale bar 0-50-100 kilometers.)
Source: Greater Yellowstone Coalition (Adapted from Greater Yellowstone Ecosystem Map)
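Selecting sample locations from a spatial frame like the GYE boundary can be sketched as drawing random coordinate pairs and keeping only those that are usable. Everything below is a stand-in for illustration: a rectangular study area instead of the real ecosystem boundary, and one blocked rectangle standing in for inaccessible private land.

```python
import random

random.seed(7)

# Stand-in spatial sampling frame: a bounding box, in km
XMIN, XMAX, YMIN, YMAX = 0.0, 100.0, 0.0, 150.0

def accessible(x, y):
    """Hypothetical access rule: one rectangular block of private
    land (60-80 km east, 40-70 km north) cannot be entered."""
    return not (60 <= x <= 80 and 40 <= y <= 70)

def random_point_sample(n):
    """Draw n random point locations, skipping inaccessible ones."""
    points = []
    while len(points) < n:
        x = random.uniform(XMIN, XMAX)
        y = random.uniform(YMIN, YMAX)
        if accessible(x, y):
            points.append((x, y))
    return points

sample = random_point_sample(30)
assert len(sample) == 30
assert all(accessible(x, y) for x, y in sample)
```

Skipping inaccessible locations keeps the fieldwork feasible, but, as the text warns, it also means the sampled area no longer matches the target area, which is a potential source of bias that must be weighed before continuing.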

Reservation), and surrounding private lands. The area supports threatened species such as grizzly bear and free-roaming populations of elk and bison. The surrounding private lands include a number of towns whose residents traditionally relied on natural resource industries such as farming, ranching, logging, and mining. In recent years, however, the local economies of communities in the GYE have shifted from primary sector activities to an economy based on land development, real estate, and recreational businesses.

With such a diverse and complex study area, a spatial sampling procedure is going to be difficult. Suppose your conceptual target area is the ecosystem boundary shown in figure 7.3. You would like the actual, operationally-defined sampled area to correspond as closely as possible to this target area. However, there are likely to be almost insurmountable problems with gaining access to all selected sample locations within the GYE. Entry onto some ranches, resorts, and other types of private property will surely be prohibited, and access may also be denied onto some public lands leased by lumber and energy companies. To what extent will the inability to sample from these portions of the target area bias the results and weaken the accuracy of the study? These questions must be carefully considered before continuing further.

An effective sampling frame is characterized by a target population that matches the sampled population and a target area that matches the sampled area as closely as possible. In the problem concerning CBD revitalization, what is the nature and probable extent of sampling error that will occur if a telephone survey (sampled population) is used to elicit opinions of all metropolitan area residents (target population)? In the Yellowstone example, how much bias is injected into the sampling procedure if a considerable portion of private land in the region is excluded from analysis? Whenever facing mismatches like these, only qualified or conditional statements are possible, and you must clearly explain the sources of systematic bias and estimate their magnitude.

Step 3: Selecting an appropriate sample design method is crucial to the success of the entire sampling procedure. Sampling design is the way in which individuals or locations are selected from the sampling frame.

A fundamental distinction can be made between nonprobability and probability sampling designs. Nonprobability sampling is subjective because the selection of sampled individuals or locations is based on personal judgment. These samples are therefore sometimes called "judgmental" or "purposive" samples. Criteria for sample selection might include:

• personal experience and background knowledge;
• convenience or better access to a nonrandom selection of individuals or locations;
• use of only those who volunteer or respond to the survey instrument, such as a questionnaire in a newspaper; and
• selection of a single nonrandom study area or sample region (case study) in which to conduct research.

Excellent descriptive results are sometimes derived from a nonprobability sampling procedure. For example, a case study approach may provide considerable insight on variable interaction within the study area. However, the underlying problem with all nonprobability approaches to sample selection is that no valid inferences can be made from the sampled elements to any wider group in the population or to any other areas. In other words, nonprobability sampling limits the generalizations that can be inferred from the research findings.

By contrast, the nature of probability sampling is more objective and is closely associated with scientific research. In a probability sample, each individual or location that could be selected from the sampling frame has a known chance (or probability) of being included in the actual sample. The advantage of probability sampling is that we can estimate the amount of sampling error. Because of the importance of probability sampling in geographic research, section 7.2 is devoted entirely to this topic.

Step 4: Many procedural matters must be resolved when sampling in real-world geographic situations. A key task is the selection and design of an appropriate measurement instrument. Possible formats for data collection include direct observation, field measurement (especially in areas of physical geography), mail questionnaire, personal interview, and telephone interview. To select the most appropriate format, the nature of the research problem must be evaluated carefully. Even after the method of data collection has been selected, problems in instrument design may remain. In survey design, each question must be carefully worded, all possible responses to a question must be considered in advance, and the questions must be sequenced properly. Questionnaire design is complicated because many different considerations are involved. For those interested in further details regarding questionnaire design, see the references at the end of the chapter.

Other procedural decisions and logistic arrangements may be necessary, depending on the nature of the sampling problem. Preliminary site reconnaissance may be necessary, and interviewers and other personnel may need to be hired and trained. If not anticipated, any number of difficulties can plague the researcher and even create bias in the sample, undermining its inferential power. Discussing the wide variety of circumstances that might arise is not possible, as every geographic problem is likely to have its own peculiarities. However, a list of some common problems includes:

• low response to a survey or questionnaire;
• unexpected response resulting from ambiguity or misunderstanding;

• environmental or political change in the sampled area; and
• unqualified or incompletely-trained interviewers.

Step 5: Before extensive time and money are expended on the main data collection effort, a small-scale pretest should be conducted to assess both the strengths and deficiencies in the research instrument design and operational plan. In a field study, the pretest should reveal problems with instrument calibration or malfunction, site access problems, and other logistical difficulties. Problems in a survey or questionnaire format may include non-response, improperly-worded questions, and poorly-sequenced questions. A quality pretest can identify and correct difficulties in research design and data collection.

A good pretest or pilot survey can also help determine the ultimate sample size that needs to be taken. Too small a sample will not yield meaningful results, whereas an unnecessarily large sample will waste time, money, and effort. Pretest information can help you estimate the sample size needed to achieve a certain level of precision in the main sample, and this topic is discussed in section 8.3.

Step 6: The collection of sample data is a major task, where high levels of quality control must be maintained and well-considered data management procedures carefully followed. Consistency in all aspects of data collection and processing is absolutely essential. In this step, the benefits of the careful design of the research instrument and planning (step 4) become evident, and the improvements made in pretesting (step 5) will expedite the major data collection effort significantly.

7.2 TYPES OF PROBABILITY SAMPLING

In probability sampling, each individual or item that could be selected from the sampling frame has a known chance (or probability) of being included in the sample. This important advantage occurs because a randomization component is incorporated in some known way into the sample design. If sample data are used in an inferential manner (either for estimation of population parameters or for hypothesis testing), then an element of randomness must be built into the sample design procedure. All types of probability sampling contain this characteristic of randomness and avoid the subjectivity of nonprobability sampling.

Simple Random Sampling

Randomization is fundamental to all probability sampling, and a simple random sample is the most basic way to generate an unbiased, representative cross-section of a population. In a simple random sample, every individual in the sampling frame has an equal chance of being included.

Traditionally, geographers and statisticians used a random number table to obtain random numbers for a sample. The process of using a random number table is tedious, and generally unnecessary, as random numbers are easily obtained using computer technology. As an example, desktop applications like Excel have a random function, =RAND(), built in, while computer languages like Visual Basic and JavaScript use rnd() and random(), respectively. These functions will return a random value between 0 and 1. Another option is to use web pages such as http://www.random.org/integers/ to generate random integers. Therefore, in this edition of our textbook we have skipped the discussion of the random number table and recommend more modern approaches.

As an example, suppose university personnel want to determine the detailed opinions of students regarding which activities to provide in a planned new student center. Because time is limited, only a small sample of students is surveyed. From a student body of 8,500, a simple random sample of 25 students is selected. Each student must have an equal chance of being chosen (that probability is 25/8,500 = .0029, or about 3 per 1,000). Since the target population is 8,500, each member of that population can be assigned a number between 1 and 8,500. Whether using a spreadsheet or a computer program, the random number generated between 0 and 1 can be multiplied by the target population (in this case 8,500). The formula in Excel would look like:

=RAND()*8500

This formula would be entered 25 times in the spreadsheet to obtain 25 randomly generated numbers. Of course the number calculated would be a floating point value (with decimal precision), so you can easily round the number off to obtain an integer value.

Systematic Sampling

Systematic sampling is a widely-used design that often simplifies the selection process. A systematic sample makes use of a regular sampling interval (k) between individuals selected for inclusion in the sample. A "1-in-k" systematic sample is generated by randomly choosing a starting point from among the first k individuals in the sampling frame, then selecting every kth individual from that starting point. For example, if a sample of 25 (n = 25) is taken from a population of 500 (N = 500), the sampling interval (k) would be N/n (k = 500/25 = 20; a "1-in-20" sample). Care must be taken to ensure that no nonrandom pattern or sequencing is present in the list from which every kth individual is selected. Otherwise, bias may be introduced into the sample process.

A systematic method of sampling is generally less cumbersome to apply than simple random sampling. Systematic sampling provides a relatively quick way to derive a large size sample and obtain more information at a reasonable cost. As a result, it is used in many practical contexts. Government agencies (such as the United States Census Bureau and Statistics Canada) and political polling firms (such as Gallup or Harris) routinely apply systematic samples when detailed analyses or follow-up studies are needed. To estimate the quality of a product coming off an assembly line, factory management could have a more detailed inspection of every 100th item. Every 20th visitor to a park could be asked to complete a survey regarding park services, or every 50th taxpayer in a city could be asked about alternative funding or planning projects. In market analysis, the editors of a magazine could send a detailed survey or questionnaire concerning existing and proposed features to every 10th subscriber.

Stratified Sampling

In many geographic sampling problems, the target population or target area is separated into different identifiable subgroups or subareas, called "strata." If sample units in the different strata are expected to provide different results, such "target subdivision" is logical. With stratified sample design, the effect of certain possible influences can be controlled. Taking a simple random sample from each class or stratum makes the fullest possible use of available information and increases the precision of sample estimates.

Stratified sample designs may be either proportional (constant-rate) or disproportional (variable-rate). In a proportional stratified sample, the percentage of the total population in each stratum matches the proportion of individuals actually sampled in that stratum as closely as possible. For example, suppose 20% of all residents in a city are apartment dwellers. Proportional representation of the apartment stratum in a stratified sample design requires that 20% of the sampled individuals be apartment dwellers. Suppose further that this housing study specifically involves a resident survey on rent control. Because apartment dwellers would be particularly affected by the proposed legislation and their opinions important to decision makers, city council members might want their views to be represented more heavily. If such a disproportional (or variable-rate) stratified sampling design is appropriate, the apartment resident stratum would be oversampled (with a larger than proportional sample size), and residents not living in apartments would be undersampled (with a smaller than proportional sample size). An analogous practical sampling strategy is to maintain proportional representation of both the apartment and non-apartment strata, but weight the apartment dweller responses more heavily when calculating sample statistics.

For the opinion survey on the new student center, a simple random sample would probably yield less precise results than would a well-designed stratified sample. A university has a diverse student population, and different student subgroups would likely have varying opinions on activities for a student center. Several stratified sample designs are possible (fig. 7.4). For example, those living in campus dormitories may have significantly different priorities than do commuters living off-campus (fig. 7.4, case 1). Also, the views of freshmen may differ from those of sophomores, juniors, and seniors (fig. 7.4, case 2). Stratifying both by place of residence and year in school might prove most useful, resulting in a composite sample design structure with eight strata (fig. 7.4, case 3). If the views of each student are considered equally important, a proportional stratification is applied. However, if the views of dormitory residents are considered more important, that group could be oversampled in a disproportional stratified sampling scheme. In both instances, if results are expected to vary by stratum, a stratified sample is preferable to a simple random sample of the same size. That is, stratification should be applied if background knowledge or logic suggests that it would be beneficial and practical.

[FIGURE 7.4 diagrams three stratification schemes for the student opinion survey. Case 1 stratifies by place of residence (campus dormitory; off-campus housing). Case 2 stratifies by year in school (freshman; sophomore; junior; senior). Case 3 stratifies by both place of residence and year in school, producing eight strata. Note: Random samples of varying sizes taken from each stratum.]

FIGURE 7.4
Alternative Stratified Sample Designs: Student Opinion Survey

Cluster Sampling

For some geographic problems, cluster sampling is most appropriate and may be more efficient or cost-effective than random, systematic, or stratified sampling.

A cluster sample is derived by first subdividing the target population or target area into mutually exclusive and exhaustive categories (fig. 7.5). An appropriate number of categories (clusters) are selected for detailed analysis through random sampling. Two alternatives exist:

• all individuals within each cluster are included in the sample, making a total enumeration or census within that cluster;
• a random sample of individuals is taken from each cluster.

The latter option is sometimes called a "two-stage" cluster sample because a random process is used twice—once to select clusters and then again to select sampled individuals within each cluster. The actual approach used depends on the circumstances and practical sampling conditions. The complete cluster sample is the composite of these selected clusters of observations.

Cluster sampling is generally an effective design option when practical or logistic problems make other choices more expensive, difficult, or time-consuming. In some situations, a complete sampling frame is not available, but partial information can be obtained. This might occur in an urban geography study where total enumerations are available for many city blocks, but no complete list of all city residents exists. A cluster could be defined as all (or many) homes on a city block. In other geographic problems, the population may be widely dispersed, resulting in high travel costs and major logistical problems. In these circumstances, non-cluster design options (e.g., simple random sample) would not be practical. The choice of a spatially-contiguous or concentrated cluster keeps the costs of obtaining the necessary sample to a minimum and permits an adequate sample size to be generated with reasonable effort.

A total enumeration cluster approach makes it easier to obtain large sample sizes (option 1 in fig. 7.5). In the student opinion survey, interviewing all students in a randomly-selected number of dormitories or sampling all students in a number of general education classes would result in an appropriately large sample size.

For cluster sampling to be most effective, however, the individuals within each cluster should be as different or heterogeneous as possible. This will make the cluster sample observations representative of the entire sampled population and avoid systematic bias. If the clusters are internally similar or homogeneous in nature, stratified sampling may be a better alternative. Therefore, the appropriateness of clusters in a geographic research problem must be evaluated carefully. In this context, the total enumeration cluster sample of a dormitory or general education class may not be appropriate because these clusters could be too homogeneous (i.e., a dorm may only house freshmen).

[FIGURE 7.5 is a flowchart: the population is divided into mutually exclusive and exhaustive categories; an appropriate number of categories (clusters) is randomly selected for detailed analysis; then either option (1), all individuals in selected clusters are sampled (total enumeration within clusters), or option (2), a random sample is taken of individuals in selected clusters (sampling within clusters; a "two-stage" cluster sample); finally, the selected clusters are combined to comprise the complete cluster sample.]

FIGURE 7.5
Steps in Cluster Sampling

Combination or Hybrid Sampling

Choosing a sampling design is seldom a simple, straightforward matter. Decision makers should consider cost, time, and convenience, as well as various practical problems unique to the specific situation. Common sense and experience are also important in selecting a sample design.

In many cases, the simplicity of a simple random approach offers an important advantage. However, practical circumstances may make the use of any simple type of sampling (random, systematic, stratified, or cluster) difficult or unwieldy. When practical conditions dictate, a combination sample or hybrid sample may be most appropriate. The following experience illustrates how practical realities influence the selection of sample design.

Some time ago, members of a geography department were asked to conduct a survey of air passengers enplaning at the nearby regional public airport. The survey had three general objectives: (1) establish the airport's service area; (2) determine selected passenger characteristics; and (3) provide various recommendations on how to improve or expand airport services. Since a complete census of all passengers was impractical, a survey sampling procedure was devised. Selection of a proper sample design was critical to ensure that the passengers being surveyed accurately represented the entire population of passengers enplaned at the airport. A number of statistical requirements and practical realities limited the sampling design choice:

• ensuring that each passenger was selected at random with known probability of inclusion;
• guaranteeing representative coverage over the different seasons of the year and different days of the week;
• working within economic constraints, to keep both the researchers' transportation costs and the number of hours required by airport personnel to administer the survey at a reasonable level; and
• selecting a convenient sample design that would produce appropriate accuracy, allow airport personnel to closely supervise the procedure, and result in minimal administrative errors.

As things developed, it turned out that a combination or hybrid sample design was best able to meet both the statistical requirements and practical constraints. A year-long survey period was chosen, and an initial survey date (cluster) was selected at random. All flights on this calendar day were sampled (total enumeration within clusters) rather than distributing questionnaires to individual passengers on scattered individual flights (sampling within clusters). Succeeding survey dates were systematically spaced every 13th day after the initial date. This sequencing produced a total of 28 survey days or clusters, seven in each season of the year, with each day of the week represented once each quarter. Thus, all seasons were proportionally represented, as were all weekdays and weekend days in each season. By systematically spacing the sample clusters every 13th day, we assumed that no bias was introduced for either the season of the year or day of the week. If a holiday happened (randomly) to be selected as a sample day, it was included in the study. In fact, to exclude a holiday purposely would have introduced an unwanted systematic bias into the analysis.

The overall result was a type of random systematic cluster sample. Randomness existed because the initial survey date was randomly chosen, surveying passengers every 13th day added a systematic component, and giving all passengers questionnaires on the selected survey dates created a cluster effect.

7.3 SPATIAL SAMPLING

The types of sample design we have considered up to this point have not been explicitly spatial. Under certain circumstances, however, spatial sampling is necessary. Spatial sampling is applied when using a map of a continuously-distributed variable (such as vegetative cover, soil type, or pH of surface water) and a sample of locations is being selected from this map. If you are conducting fieldwork and need to select sample site locations within a defined target area, then you are going to need to apply a spatial sampling procedure. The use of GIS technology makes it relatively easy to select objects randomly. For example, if you need to randomly select objects that are already mapped, you can apply the same technique used to develop random numbers. In this case, you could select features in a GIS dataset by randomly selecting the appropriate feature ID numbers (assuming the feature ID is labeled 1 to N). If on the other hand you need to identify site locations that are not currently mapped, you can randomly generate X, Y coordinates that correspond to the bounding box of your study area. The randomly generated X, Y coordinates can be placed in a list, imported into the GIS, and mapped. These locations can then be exported to a GPS device to allow navigation to the appropriate sites.

Spatial sampling from maps or other spatial sampling frames may involve point samples, line samples (traverses), or area samples (quadrats) (fig. 7.6). Of these three types of spatial sampling, geographers use point sampling most frequently, and the various concepts involving spatial sampling procedures are illustrated most easily through point sampling designs. Therefore, this section focuses on point sampling, and those wanting details on quadrat and traverse sampling are referred to more specialized texts and references.

[FIGURE 7.6 illustrates the three forms of spatial sampling: point sampling (points), line sampling (traverses), and area sampling (quadrats).]

FIGURE 7.6
Spatial Sampling: Points, Lines, and Areas

For all types of spatial sampling, a spatial sampling frame (such as a map) must be constructed. This frame must include a coordinate system that allows clear identification of locations within the sampled area. For point sampling from a map, designation of the (X, Y) Cartesian coordinate pair will identify a unique location for each point.

In a simple random point sample (fig. 7.7, case 1), two numbers (an X and Y coordinate pair) are selected to designate each point. To locate n sample points in a sampled area for a study, 2(n) random numbers must be drawn from a random numbers table or other source. The resulting pattern of points may be uneven, with some portions of the sampled area seemingly underrepresented, and others appearing overrepresented.

[FIGURE 7.7 shows five types of spatial point sampling. Case 1: simple random point sample. Case 2: systematic point sample (aligned), with a randomly selected starting point and a regular distance interval. Case 3: proportional stratified point sample over two strata (stratum 1, a non-wetland area; stratum 2, a wetland area). Case 4: disproportional stratified point sample over the same two strata. Case 5: random point sample within clusters (two-stage cluster sample).]

FIGURE 7.7
Types of Spatial Point Sampling

As an example, consider three different random distributions of 24 points on an agricultural field (fig. 7.8). Each sample distribution is clearly different, with some apparent clustering of points on parts of the field while other parts of the study area have few or no sample points. This spatial unevenness is to be expected, of course, because these are all random samples. In theory,
Chapter 7 • Basic Elements of Sampling 113

if we took an infinite number of 24-point samples, and plotted them all on the same map, we would have points rather evenly distributed over the entire field.

Systematic point sampling is a convenient way to avoid the problems possible with an uneven distribution of points across the study area. Systematic point sampling uses a "distance interval," similar to a regular sampling interval (k) from a list of individuals in a nonspatial context (fig. 7.7, case 2). A "starting point" is randomly selected, then all other points are located using the distance interval to space them evenly across the entire sampled area. Only one point must be randomly located; the subsequent placement of all other points in the systematic pattern is determined automatically by the distance interval. This approach has the advantage of offering representative, proportional coverage of the entire sampled area. Bias will enter the sample design when a spatial regularity or periodicity exists in the distribution of some phenomenon that happens to match the distance interval. Systematic point sampling is often used in geographic research, particularly when dealing with environmental and resource problems where data are continuously distributed across an area.

Just as in nonspatial cases, stratified point sampling in spatial sample design may be either proportional (constant-rate) or disproportional (variable-rate). The approach selected depends on the circumstances of the problem. Suppose you are investigating environmental degradation in a region that includes particularly vulnerable tidal and non-tidal wetlands. In figure 7.7 (cases 3 and 4), stratum 1 shows the non-wetland portion of the spatial sampling frame (comprising 60% of the total sampled area), while stratum 2 identifies wetlands (the other 40%). Suppose proportional representation of both strata is desired. This would dictate that 60% of the sample points be placed in non-wetland locations and 40% on wetland sites. If a total of 30 sample points is sufficient, 18 points should be in non-wetland locations, and 12 points in wetlands (fig. 7.7, case 3). Suppose instead that particular attention needs to be focused on possible environmental problems in the wetlands area. In this situation, the wetlands stratum needs more detailed monitoring or "oversampling," while maintaining adequate coverage of the non-wetland portion of the sampled area. The disproportional stratified design (fig. 7.7, case 4) shows twice the intensity of sample points (24 rather than 12) in the wetland stratum.

In spatial point sampling, a cluster design has the important advantage of keeping travel costs and other logistical problems to a minimum because the points of a cluster are located together (fig. 7.7, case 5). This feature makes point sampling within clusters particularly attractive in geographic projects having an extensive study area. In many field studies, a cluster approach may be the only practical alternative.

Although two methods of detailed analysis within clusters are available in nonspatial cluster sampling (fig. 7.5), only the "two-stage" cluster approach is appropriate in spatial cluster sampling. In the spatial sampling of a variable continuously distributed across an area, an infinite number of points could be selected in each subarea. As a result, total enumeration is virtually impossible in a spatial context.

[FIGURE 7.8 shows three different random distributions (cases 1, 2, and 3) of 24 sample points on the same agricultural field, each with a visibly different, uneven pattern.]

FIGURE 7.8
Three Different Random Samples of 24 Points on the Same Agricultural Field
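These point-sampling designs translate directly into code. The sketch below (in Python; the 100-by-100 field dimensions are illustrative, and the wetland stratum is simplified to a rectangular bounding box purely for the sake of the example) generates a simple random point sample like those in figure 7.8 and a proportional stratified point sample using the 60%/40% split from the wetland example:

```python
import random

def random_points(n, xmin, ymin, xmax, ymax, rng=random):
    """Simple random point sample: each point needs an X and a Y
    coordinate, so 2(n) random numbers are drawn for n points."""
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
            for _ in range(n)]

def stratified_points(total, strata, rng=random):
    """Stratified point sample: strata is a list of (share, bbox)
    pairs; each stratum receives round(share * total) points
    placed at random within its own bounding box."""
    sample = []
    for share, (xmin, ymin, xmax, ymax) in strata:
        sample += random_points(round(share * total),
                                xmin, ymin, xmax, ymax, rng)
    return sample

rng = random.Random(1)

# One random distribution of 24 points on a 100 x 100 field (cf. fig. 7.8).
field = random_points(24, 0.0, 0.0, 100.0, 100.0, rng)

# Proportional stratification from the wetland example: 60% non-wetland,
# 40% wetland, 30 points in total -> 18 and 12 points respectively.
strata = [(0.6, (0.0, 40.0, 100.0, 100.0)),   # stratum 1: non-wetland
          (0.4, (0.0, 0.0, 100.0, 40.0))]     # stratum 2: wetland
sample = stratified_points(30, strata, rng)
print(len(field), len(sample))   # 24 30
```

Changing the wetland allocation from 12 points to 24 while keeping the 18 non-wetland points would reproduce the disproportional design of figure 7.7, case 4.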

However, a cluster approach to point sampling is not always advisable. If the phenomenon being studied varies from one section of the sampled area to another, it may be necessary to ensure that some point locations are selected from all sections or subareas. Cluster sampling excludes substantial parts of the study area, and such uneven areal representation can be a disadvantage in many geographic problems. As a result, the practical advantages of convenience and reduction in travel cost often have to be balanced against unrepresentative spatial bias.

In addition to simple point sampling, a number of combination or hybrid point sampling designs are also possible (fig. 7.9). A composite sample design with its complex procedures is only worthwhile if it is likely to improve the accuracy of the sample in estimating the population. Perhaps the most widely used hybrid sampling model is the stratified systematic unaligned point sample. First suggested by Berry and Baker (1968), this approach includes the following steps (fig. 7.9, case 1):

1. Place a regular grid system over the entire study area, taking care to ensure that grid size is at an appropriate level of spatial resolution for the problem.
2. Select a random number for each row and column in the gridded area, where the row number specifies the horizontal position of points in that row and the column number specifies the vertical position of points in that column.
3. Locate a single point in each grid using the appropriate row and column positions.

This procedure provides proportional representation of all segments of the sampled area, yet avoids possible problems with regularities or periodicities in the spatial pattern that can be encountered with an aligned systematic point sample (fig. 7.7, case 2). The needed sample points can also be generated fairly easily. As a result of these advantages, a stratified systematic unaligned point sample is a good choice in many realistic geographic situations.
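Read as a recipe, the three steps translate into only a few lines of code. The sketch below (the function name and grid conventions are illustrative, not from the text) draws one random horizontal offset per row and one random vertical offset per column, then reuses them across the grid so each cell receives exactly one unaligned point:

```python
import random

def stratified_systematic_unaligned(n_rows, n_cols, cell_size, seed=None):
    """One point per grid cell, following the three steps in the text:
    each row gets one random horizontal offset and each column one
    random vertical offset, reused across the grid so points are
    evenly spread but never aligned."""
    rng = random.Random(seed)
    x_off = [rng.uniform(0, cell_size) for _ in range(n_rows)]   # one per row
    y_off = [rng.uniform(0, cell_size) for _ in range(n_cols)]   # one per column
    return [(c * cell_size + x_off[r], r * cell_size + y_off[c])
            for r in range(n_rows) for c in range(n_cols)]

points = stratified_systematic_unaligned(4, 4, cell_size=10.0, seed=1)
print(len(points))  # 16 points, one in each of the 16 cells
```

Each run with a different seed yields a new realization of the design.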

Case 1: Stratified Systematic Unaligned
Case 2: Disproportional Stratified Systematic Aligned
Case 3: Cluster Systematic
Case 4: Disproportional Stratified Cluster
FIGURE 7.9
Selected Types of Composite or Hybrid Point Sampling
Chapter 7 • Basic Elements of Sampling 115

Other combination or hybrid sampling models can be designed, but should be used only if they are likely to improve the accuracy or precision of the sampling problem. Geographers try to create models in such areas as climatology, resource management, and migration to replicate or duplicate reality as closely as possible. Similarly, in geographic sampling problems, the sample design must reflect the character of the situation. In this way, precise and accurate estimates can be obtained with reasonable effort. Other hybrid point sample design options include: "disproportional stratified systematic aligned" (fig. 7.9, case 2); "cluster systematic" (case 3); and "disproportional stratified cluster" (case 4).

Are any of these hybrid designs really feasible? Clearly the answer is yes. Suppose you are concerned with monitoring changes in both the spatial distribution and degree of intensity of nitrogen and phosphorus levels in a bay receiving agricultural runoff. In the narrow estuaries of the bay, runoff problems are likely to be more severe and will need careful monitoring. On the other hand, no portion of the bay should be left totally unmonitored. What type of spatial point sampling procedure should be used when locating monitor buoys in the bay? A disproportional stratified systematic aligned design might be the most practical alternative (fig. 7.10). Disproportional stratification places an appropriate density of nutrient monitoring stations in the necessary locations. Systematic placement allows efficient collection of water samples, since the buoys are aligned regularly for pick-up by boat. This sample design is "tailor-made" to fit the practical circumstances of this geographic problem.

FIGURE 7.10
Location of Nutrient Monitoring Stations on a Bay: Disproportional Stratified Systematic Aligned Sample
(Labels: "Areas where more intense monitoring is desired"; "• Sample point location")

KEY TERMS

basic types of sampling:
  simple random, 108
  systematic, 108
  stratified, 109
  cluster, 109
combination or hybrid sample designs, 110
oversampling and undersampling, 113
population, 101
probability and nonprobability sampling designs, 107
proportional and disproportional stratified sampling, 109
randomness, 102
sample, 101
sampled population and area, 104
sampling frame, 104
sources of sampling error, 102
stratified systematic unaligned point sample, 114
target population and area, 104
total enumeration within clusters and two-stage cluster sampling, 111
types of spatial sampling:
  simple random point sample, 112
  systematic point sample, 114
  stratified point sample, 113
  cluster point sample, 113

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Understand the various advantages of sampling in contrast to a complete enumeration or census of an entire population.
2. Identify possible sources of sampling error in a geographic problem or situation.
3. Explain the basic objectives and practical uses of the American Community Survey.
4. Understand the steps in the general sampling procedure, from the conceptual definition of target population and target area to the collection of sample data.
5. Define and distinguish between these sampling terms: target population and target area, sampled population and sampled area, and sampling frame.
6. Explain the characteristics of the important types of probability sampling, including simple random, systematic, stratified, cluster, and combination or hybrid designs.
7. Select an appropriate sampling design (one able to meet both the statistical requirements and practical circumstances) when presented with a "geographic situation" or practical problem that requires sampling.
8. Explain the characteristics (including both advantages and disadvantages) of the various types of simple and composite point sampling.

REFERENCES AND ADDITIONAL READING

American Community Survey (ACS). United States Census Bureau. www.census.gov/acs/www
American Statistical Association, Survey Research Methods Section. What is a Survey? How to Plan a Survey, How to Collect Survey Data. Accessed January 9, 2014. http://www.amstat.org/sections/srms/
Berry, B. J. L. and A. M. Baker. "Geographic Sampling," in Spatial Analysis: A Reader in Statistical Geography, edited by B. J. L. Berry and D. F. Marble. Englewood Cliffs, NJ: Prentice-Hall, 1968.
Cochran, W. G. Sampling Techniques. 3rd ed. New York: John Wiley and Sons, 1977.
Dixon, C. J. and B. Leach. Sampling Methods for Geographical Research. Concepts and Techniques in Modern Geography, CATMOG 17. Norwich, England: Geo Abstracts, 1978.
Greater Yellowstone Coalition. The Greater Yellowstone Ecosystem. www.greateryellowstone.org
Henry, G. Practical Sampling. Newbury Park, NJ: Sage, 1990.
Lohr, S. L. Sampling: Design and Analysis. Pacific Grove, CA: Brooks/Cole, 1999.
McGrew, J. C. and R. A. Rosing. Salisbury-Wicomico County Regional Airport Passenger Survey. Salisbury: Delmarva Advisory Council, 1979.
Parmenter, A. W., et al. "Land Use and Land Cover Change in the Greater Yellowstone Ecosystem: 1975-1995," Ecological Applications. Vol. 13, No. 3 (Jun. 2003): 687-703.
Salant, P. and D. A. Dillman. Practical Sampling. Newbury Park, NJ: Sage, 1990.
Scheaffer, R. L., W. Mendenhall, R. L. Ott, and K. G. Gerow. Elementary Survey Sampling. 7th ed. Boston: Brooks/Cole, 2012.
Thompson, S. K. Sampling. New York: John Wiley and Sons, 1992.
Williams, B. A Sampler on Sampling. New York: John Wiley and Sons, 1978.
8  Estimation in Sampling

8.1 Basic Concepts in Estimation


8.2 Confidence Intervals and Estimation
8.3 Geographic Examples of Confidence Interval Estimation
8.4 Sample Size Selection

The primary objective of sampling is to make inferences about the population from which a sample is taken. More specifically, sample statistics are used to estimate population parameters such as the mean, total, and proportion. When sample statistics represent a larger population accurately, they are considered unbiased estimators.

This chapter considers both the theory and practice of estimating from samples. The basic terminology and the theoretical concepts that underlie sample estimation are discussed in section 8.1. Distinction is made between point estimation and interval estimation, and the nature of the sampling distribution of a statistic is explained. Also discussed is the importance of the central limit theorem when inferring sample results to a population. Confidence intervals that indicate the level of precision for a population estimate can be determined for any desired sample statistic. Sections 8.2 and 8.3 include a variety of confidence interval equations. Material is organized by type of sample (random, systematic, and stratified) and by parameter (mean, total, and proportion). A series of examples illustrates the procedure for constructing confidence intervals around each parameter using the different sampling methods.

To save time and effort, geographers often want to know, before taking a full sample, how large a sample is needed for a particular research problem. Issues and methods that determine minimum necessary sample size before taking the full and complete sample are discussed in section 8.4. Examples illustrate how sample size is established for problems using the mean, total, and proportion.

8.1 BASIC CONCEPTS IN ESTIMATION

Point Estimation and Interval Estimation

In statistics, a basic distinction is made between point estimation and interval estimation. The concept of point estimation is relatively straightforward: a statistic is calculated from a sample and then used to estimate the corresponding population parameter. With probability sampling, the "best" (unbiased) point estimate for a population parameter is the corresponding sample statistic. Therefore, the best point estimate for the population mean (µ) is the sample mean (X̄); the best point estimate for the population standard deviation (σ) is the sample standard deviation (s); and so on (table 8.1).

Note that the denominator for calculating the sample standard deviation is (n − 1) rather than (n). This slight adjustment makes the sample standard deviation a less-biased estimator of the population standard deviation, particularly for smaller samples. With larger sample sizes (larger than 30 or so), the difference between dividing by (n) or dividing by (n − 1) is insignificant. However, using (n) with a smaller sample would result in an estimate that under-represents the magnitude of the true standard deviation in the population.

How precise are sample point estimators? How close (or distant) from the true population parameter is the calculated sample statistic? Because probability sampling involves some uncertainty, it is unlikely that a sample statistic will exactly equal the true population parameter. What can be determined, however, is the likelihood that
118 Part III .A. The Transition to Inferential Problem Solving

a sample statistic is within a certain range or interval of the population parameter. The determination of this range is the basis for interval estimation. A confidence interval or bound represents the level of precision associated with the population estimate. Its width is determined by: (1) the sample size; (2) the amount of variability in the population; and (3) the probability level or level of confidence selected for the problem. We will now explore these ideas in some detail.

TABLE 8.1
Sample Statistics as Best Point Estimators of Population Parameters

Descriptive statistic    Population parameter    Sample statistic*    Calculating formula
Mean                     µ                       X̄                    X̄ = (Σ xᵢ) / n
Standard deviation       σ                       s                    s = √[ Σ (xᵢ − X̄)² / (n − 1) ]
Total                    τ                       T̂                    T̂ = N(X̄) = N (Σ xᵢ) / n
Proportion               p                       p̂                    p̂ = x / n

* best point estimator
x = number of units sampled having a particular characteristic
n = total number of units sampled

FIGURE 8.1
Sampling Distribution of Sample Means and Frequency Distribution of Population Values

The Sampling Distribution of a Statistic

Suppose a random sample of size n is drawn from a population, and the mean of that sample (X̄) is calculated. Now suppose a second sample of size n (totally independent of the first sample) is drawn and its mean calculated. If this process is repeated for many similar-sized independent samples in a population, the frequency distribution of this set of sample means can be graphed (fig. 8.1). This curve is referred to as the sampling distribution of sample means.

A sampling distribution can be developed for any statistic, not just the mean. After many independent samples of size (n) are drawn from a population, the statistic of interest (i.e., mean, total, and proportion) is calculated for each sample, and the distribution of the sample statistics graphed. The resulting frequency distribution has a shape, a mean, and a certain amount of variability (as reflected by the standard deviation and variance). These general characteristics of sampling distributions are important, but for now the focus is on particular characteristics of sample means.

The Central Limit Theorem

When the sample statistic is the mean, its frequency distribution has a particular set of features that are of vital theoretical importance in sampling and general statistical inference. These features are summarized in the central limit theorem (CLT):

Central Limit Theorem
Suppose all possible random samples of size (n) are drawn from an infinitely large, normally-distributed population having a mean = µ and a standard deviation = σ. The frequency distribution of these sample means (X̄'s) will have:
1. a mean of µ (the population mean);
2. a normal distribution around this population mean; and
3. a standard deviation of σ / √n.
This particular set of characteristics applies only to the distribution of sample means and not to the distribution of other sample statistics.

Understanding how the central limit theorem works is essential for understanding why we can make inferences about a population using only a sample. Let's take a look at each of the three primary ideas in the CLT theoretically and then consider a simplified example.

The first idea states that the frequency distribution of sample means will have a mean of µ, the mean of the total population. Suppose we have a population of 1000 students and decide to sample five students to estimate the amount of the average student loan. One of our samples might include these student identification numbers (27, 252, 516, 717, and 927), allowing us to compute a sample mean, x̄₁. Suppose we decide to take another
Chapter 8 • Estimation in Sampling 119

sample of five students; this would yield yet another set of students for which we can calculate a sample mean, x̄₂. We now have two sample means, and can use them to generate a mean of both samples, (x̄₁ + x̄₂) / 2. Theoretically, we can continue to take an infinite number of random samples of five students, and the average of these samples would be equal to the population average. In an extreme example, it is possible that one of our samples selects five students with very high student loans. Therefore, x̄₁ would be much higher than the population average for student loans. However, it is equally possible that one of our samples selects five students with very low student loans, and thus x̄₂ would be much lower than the population average for student loans. If we averaged these two extreme samples, the result would be a number closer to the population mean.

The second idea states that the frequency distribution of sample means will be normally distributed around the true population mean. In reality, each sample of five students will likely have some students whose loan amount is larger than the population average, and some students whose loan amount is smaller than the population average. Therefore, the average of each sample of 5 students will tend to be closer to the mean. As we accumulate these five-student samples, we will have many more sample means that are closer to the population mean than farther away. If this frequency distribution is plotted, it will look increasingly normal in shape as the number of samples increases.

The third idea provides insight into the variability of the sample means. It states that the standard deviation of the sampling distribution of means is equal to the population standard deviation divided by the square root of the sample size, σ / √n. This measure of standard deviation is called the standard error of the mean (σx̄):

σx̄ = σ / √n     (8.1)

The standard error of the mean indicates how much a typical sample mean is likely to differ from the true population mean. Quite simply, standard error is a basic measure of the amount of sampling error in a problem.

Recall that the standard deviation is a descriptive statistic that measures the average amount that a single observation differs from the mean. In our example, we are measuring how much an individual sample of 5 students differs from the true population mean. This difference is dependent upon the amount of variability in the population and the size of the sample. As an illustration, suppose the student loans of all students range anywhere from $12,000 to $14,000. That is a small range of only $2,000. Any sample we take will tend to be close to the mean. However, if the range of student loans is between $0 and $215,000, it is very likely that a sample will be farther from the mean because we have a range of loan values of $215,000. Therefore, samples taken from populations with a small standard deviation have a lower standard error than samples taken from populations with a higher standard deviation.

Similarly, we know intuitively that larger samples tend to provide better estimates of the population than smaller samples. Thus, increasing the sample size will improve the estimate of the population parameter. Therefore, mathematically, an increase in n in the denominator will cause the standard error (sampling error) to be reduced.
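The effect of population variability on the standard error can be simulated directly. In this sketch, the two loan ranges from the text are modeled as uniform distributions (our simplifying assumption; the text does not specify a distribution):

```python
import random
import statistics

rng = random.Random(3)

def observed_se(low, high, n=5, trials=2000):
    """Standard deviation of many sample means (n = 5 students each)
    drawn from a uniformly distributed population of loan amounts."""
    pop = [rng.uniform(low, high) for _ in range(10_000)]
    means = [statistics.mean(rng.sample(pop, n)) for _ in range(trials)]
    return statistics.pstdev(means)

narrow = observed_se(12_000, 14_000)   # loans between $12,000 and $14,000
wide = observed_se(0, 215_000)         # loans between $0 and $215,000
print(round(narrow), round(wide))      # the wide-range population yields a far larger standard error
```

The narrow-range population produces sample means that cluster tightly, while the wide-range population produces a standard error roughly a hundred times larger, exactly as the text argues.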

Example: Sampling from an Experimental Agricultural Plot

While the discussion above makes intuitive sense from a mathematical standpoint, we are interested in determining if the concept actually works in a practical sense. To test the practicality of the central limit theorem, we present an illustration using some real world data.

The experimental agricultural plot shown in figure 8.2 contains 7240 yield values. The average yield was 123.1 bu/ac with a standard deviation of 38.5. Furthermore, the histogram portion of figure 8.2 shows that the data had a moderate skew of 1.63.

Suppose we didn't actually know the average and standard deviation for the population, but were only able to obtain samples. Therefore, to theoretically test the claims of the central limit theorem, we can generate 2000 random samples with each of these different sizes (5 yield locations, 10 yield locations, and 40 yield locations), and calculate the average, standard deviation, and skew. For the remainder of this discussion, keep in mind that we did not include an infinite number of samples, but rather capped our number of samples at 2000.

Table 8.2 shows the descriptive statistics for the original 7240 yield locations as well as the computed average and standard deviation of the respective sampling distributions using sizes of 5, 10, and 40 samples. You can see that as the sample size increased from 5 sample values (average yield of 122.5) to 40 sample values (average yield of 122.2), the average yield value for the sampling distribution was closer to the actual value of 122.1. Thus, even with skewed data, the theoretical sampling distribution has a mean of µ (in reality, we are slightly off because we only included 2000 samples).

Figure 8.3 shows a series of sampling distributions for the three tests. While the original data of 7240 points had a skew of 1.63, the sampling distribution of 2000 samples using 5 yield values reduced the skew to 1.09, and thus caused the sampling distribution to appear more normally distributed. We are looking at these changes in skew because a reduced skew (as related to larger sample size) is an indicator of a distribution closer to normal. Using a larger sample size of 10 yield values further reduced the skew to 0.61, and the sample size of 40 reduced the skew even more to 0.36. This series of figures

(continued)

TABLE 8.2
Descriptive Statistics for Population (Size 7,240) and Samples of Size 5, 10, and 40

Samples                  Mean     Standard Deviation    Theoretical standard error of the mean (σ/√n)
Original 7,240 values    122.1    38.5
5 sample values          122.5    17.9                  17.21
10 sample values         122.3    13                    12.174
40 sample values         122.2    6.8                   6.08

illustrate that as a larger sample size is used to estimate the average yield, the sampling distribution appears more normal, even if the original data was skewed to begin with!

Revisiting table 8.2 also shows the average yield, the actual computed standard deviation of the sampling distributions, and the theoretical standard error of the mean using the formula σ/√n. In our example, σ represents the population standard deviation of 38.5, and n represents the sample sizes we investigated (5, 10, and 40 respectively).

For the 2000 samples of 5 yield values, the theoretical standard error of the mean is computed as 17.2 = 38.5/√5, which is very close to the actual value of 17.9. Similarly, the theoretical standard error computed with 40 yield values is 6.08 = 38.5/√40, which is also similar to the actual value of 6.8. Once again, even with skewed data, the theoretical standard error of the mean is very similar to the actual standard deviation of the sampling distribution, especially when the sample size increases.

In reality, we will never take an infinite number of samples, nor will we even have the luxury of taking 2000 samples of 40 data points. Often the best we can do is to obtain a single sample of a certain number of data points. However, as our practical example shows, all samples will theoretically (and practically) fall within the normal sampling distribution. Therefore, if we only take a single sample, it is safe to assume that the sample will fall somewhere within the normal distribution. Further, given the size and standard deviation of the sample, we can reasonably determine the confidence interval of our sample (the probability that the true population mean falls somewhere within a specified range).
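The experiment is straightforward to replicate. The sketch below substitutes a synthetic right-skewed population for the actual 7,240 yield values (which are not reproduced here), but it shows the same behavior: the mean of the sampling distribution stays near µ and its spread approaches σ/√n as n grows:

```python
import random
import statistics

rng = random.Random(7)
# Synthetic right-skewed stand-in for the 7,240 yield values
population = [rng.expovariate(1 / 120) for _ in range(7240)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

for n in (5, 10, 40):
    # 2000 independent random samples of size n, as in the text's test
    means = [statistics.mean(rng.sample(population, n)) for _ in range(2000)]
    print(f"n={n:2d}  mean of 2000 sample means={statistics.mean(means):7.1f}  "
          f"observed SE={statistics.pstdev(means):6.1f}  sigma/sqrt(n)={sigma / n ** 0.5:6.1f}")
```

As in table 8.2, the observed standard deviation of each sampling distribution tracks the theoretical value σ/√n, and the agreement tightens as n increases.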

FIGURE 8.2
An Experimental Agricultural Plot and Histogram Showing Distribution of 7,240 Yield Values
Source: Lembo, A., Lew, M .. Laba, M., Baveyye, P. 2006

FIGURE 8.3
Histograms Showing Sampling Distributions of Size 5, 10, and 40 (skew 1.09, 0.61, and 0.36, respectively)

8.2 CONFIDENCE INTERVALS AND ESTIMATION

Suppose you want to place a confidence interval about a sample mean with 90% certainty that the interval range contains the actual population mean. The general formula for a confidence interval is:

X̄ ± Z σx̄     (8.2)

where X̄ = sample mean
      Z = Z-value from the normal table
      σx̄ = standard error of the mean

Because the central limit theorem establishes the normality of the distribution of sample means, Z-values from the normal table (appendix, table A) provide the proper probabilities for defining the confidence interval. Given a desired certainty of 90%, what is the corresponding Z-value or area under the normal curve? That is, what Z-value from the normal table is associated with the confidence interval having a 90% likelihood of containing the true population mean? The desired Z-value represents the situation where 90% of the total area under the curve is encompassed by the confidence interval and 45% of this total area is on either side of the mean (fig. 8.4).

If you look at the table of normal values in appendix A, you can see that when the area equals .4495, the Z-value is 1.64 and when the area equals .4505, the Z-value is 1.65. If we follow the convention of rounding up if midway between two numbers, a Z-value of 1.65 is used for this confidence interval estimate. Therefore, we can be 90% confident that the true population mean lies within the confidence interval defined by:

X̄ ± 1.65 σx̄

FIGURE 8.4
Standard Scores (Z-Values) Associated with a 90 Percent Confidence Interval
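For readers working in Python, the appendix-table lookup can be cross-checked with the standard library's normal distribution (a side calculation, not part of the text's procedure):

```python
from statistics import NormalDist

confidence = 0.90
alpha = 1 - confidence
# Two-tailed interval: alpha/2 = .05 of the area sits in each tail
z = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z, 4))  # 1.6449, which the text rounds up to 1.65
```

The same call with `confidence = 0.95` or `0.99` recovers the familiar 1.96 and 2.58 multipliers used later in table 8.3.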

The confidence interval contains expressions that are added to and subtracted from the mean to define the upper and lower bounds of the interval:

Upper bound = X̄ + 1.65 σx̄
Lower bound = X̄ − 1.65 σx̄

Recall that the entire area under the sampling distribution curve of figure 8.1 represents the set of all possible sample means that could be drawn from the original population. The shaded area in the center of the sampling distribution (fig. 8.4) shows the location of 90% of all sample means that could be drawn from the population. Suppose 10 actual samples are taken from this population and the location of each sample mean is plotted (fig. 8.5). Of the 10 sample means shown (x̄₁ to x̄₁₀), 9 of them (.90 probability) are within the confidence interval. This result is expected because the interval was constructed with a 90% chance of containing µ. Of course, when taking a single sample, the location of µ is not known. However, a .90 probability exists that the interval or bound placed around that single sample mean does contain the true population mean. Conversely, a .10 probability exists that the confidence interval placed around the single sample mean does not include µ. The probabilities that an unusually large sample mean (well above µ) or an unusually small sample mean (well below µ) could be drawn are represented graphically by the two unshaded tails of the sampling distribution. Note that by chance, the fifth sample mean (x̄₅) falls below the lower bound of the confidence interval around µ.

Several terms are used when making interval estimates in sampling. The confidence level refers to the probability that the interval surrounding a sample mean encompasses the true population mean. This confidence level probability is defined as 1 − α. The significance level refers to the probability that the interval surrounding a sample mean fails to encompass the true population mean. The significance level is denoted by α and equals the total sampling error. Because error is equally likely in either direction from µ, the probability of the sample mean falling into either tail of the distribution is α/2.

FIGURE 8.5
Distribution of Sample Means and the Confidence Interval Concept
(Confidence level (1 − α) = .90; significance level (α) = .10; number of standard deviations (Z) = 1.65. The confidence interval around x̄₅ fails to include µ.)

General Procedure for Constructing a Confidence Interval

The best way to learn about the construction of confidence intervals is to present a simple example in some detail. Suppose that a random sample of 50 commuters in a metropolitan area revealed that their average journey-to-work distance was 9.6 miles. Moreover, a recent study has determined that the standard deviation of journey-to-work travel distances for this metropolitan area is approximately 3 miles. What is the confidence interval around this sample mean of 9.6 that guarantees with 90% certainty that the true population mean is enclosed within that interval?

The confidence interval for µ is calculated from the following values:

• the sample mean (X̄ = 9.6)
• the population standard deviation (σ = 3)
• the sample size (n = 50)
• the Z-value associated with a 90% confidence level (Z = 1.65)

The single best estimate of the true population mean (µ) is the sample mean (X̄ = 9.6). This sample mean is the statistic around which the confidence interval is placed. From equation 8.2, the confidence interval is defined as

X̄ ± Z σx̄

Substituting the value of the standard error, the confidence interval equation becomes

X̄ ± Z (σ / √n)     (8.3)

Inserting all values for the journey-to-work problem: in most situations, if the population mean is unknown,
the standard deviation and variance of the population are
-
x+z r
vn
a
=9.6+1.65 =
3
v50
=9.6+0.70
also unknown. With an unknown population variance,
the sample variance (s2) provides the best estimator and is
the statistic inserted into the standard error formula.
This interval ranges from a lower bound of8.90 (9.6- .70) Recall the standard error formula (equation 8.1):
to an upper bound of 10.30 (9.6 +.70). It can now be con-
cluded with 90%,certainty that the mean journey-to-work
distance for all commuters is between 8.90 and 10.30miles.
When dealing with confidence intervals, various sit-
uations arise. Sometimes sample sizes are small and Squaring both sides to convert standard deviations into
often population standard deviations are unknown. vanances:
Proper decisions must be made about how to proceed
under these circumstances. a2
What level of confidenceslwuld be used? A confidence a-'=- (8.4)
, n
interval can be created for any desired level of confidence
(I - a). The most commonly accepted and widely used If the population variance (a 2) is unknown, the sample
confidence levels are .99, .95, and .90. The corresponding variance (s2) is substituted:
likelihoods of making a sampling error (also called the
significance level, a) when using these confidence levels
are .01, .05, and .10. These "conventional" levels of con- (8.5)
fidence are generally considered rigorous enough to guar-
antee acceptably low probabilities of sampling error.
The impact of various confidence levels on signifi- To get this expression into an effective form to place con-
cance levels, number of standard errors, and confidence fidence intervals (bounds) around point estimates, one
interval characteristics are now summarized for the jour- final step is needed-take the square root of both sides.
ney-to-work problem (table 8.3). Notice what happens
when the confidence level increases from .80 to .99. First, (8.6)
the significance level decreases from .20 to .01, reducing
the sampling error for the problem. Also, the number of
standard errors (Z) increases from 1.28 to 2.58, thereby This standard error formula is the expression used for the
widening the confidence interval range and width. Thus, many confidence interval equations that follow.
higher levels of confidence result in wider confidence What if the sample size is small? The journey-to-work
intervals and less precise estimates, but lower sampling example incorporated the standard normal deviate (Z)
error. The investigator has the responsibility for deciding into the confidence interval calculation. However, Z is
how to trade off confidence level against level of precision. valid only if the sample size is greater than 30. With
What if the populationstandarddeviation(a) and variance smaller samples, the confidence interval equation must be
2
( a ) are unknown? In the previous example, it was altered, and the standard Z-valuemust be replaced by the
assumed that the population standard deviation and vari- corresponding value from the student's t distribution. This
ance were known, while the population mean (µ) was modification is necessary because the sample standard
unknown. This assumption is unrealistic and impractical: deviation (s) is not always accurate when the sample size

TABLE 8.3
The Relationship Between Confidence Level and Confidence Interval for the Journey-to-Work Example

Confidence Level   Significance Level   Number of standard   Confidence Interval
(1 − α)            (α)                  errors (Z)           (X̄ ± Zσx̄)       Range            Width

.80                .20                  1.28                 9.6 ± .54        9.06 to 10.14    1.08
.90*               .10                  1.65                 9.6 ± .70        8.90 to 10.30    1.40
.95                .05                  1.96**               9.6 ± .83        8.77 to 10.43    1.63
.98                .02                  2.33                 9.6 ± 1.00       8.60 to 10.60    2.00
.99                .01                  2.58                 9.6 ± 1.09       8.51 to 10.69    2.18

* Calculations for (1 − α) = .90 are detailed in situation 1.
** A Z-value of 2.00 is often used in place of 1.96 for convenience.
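The trade-off summarized in table 8.3 can be reproduced in a few lines of Python. This is only a sketch: the standard error of the mean (about 0.42, inferred from the table's margins rather than stated in the text) is an assumption, and the standard normal quantile comes from the Python standard library.

```python
from statistics import NormalDist

x_bar = 9.6   # sample mean from the journey-to-work example
se = 0.42     # standard error of the mean implied by table 8.3 (assumed)

for conf in (0.80, 0.90, 0.95, 0.98, 0.99):
    # two-sided interval: place alpha/2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * se
    print(f"{conf:.2f}: Z = {z:.2f}, "
          f"interval {x_bar - margin:.2f} to {x_bar + margin:.2f}, "
          f"width {2 * margin:.2f}")
```

As the confidence level rises, Z grows and every interval widens, exactly the pattern shown in the table.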
124 Part III .A. The Transition to Inferential Problem Solving

is smaller than 30. In some situations, it may be impractical or impossible to have a sample size as large as 30. If the value of s appears to be inaccurate because of the small sample size, logic demands that a wider interval be presented as the confidence interval estimate for µ to avoid introducing bias. This wider interval allows an equivalent amount of confidence in the result that the true population mean lies within the interval.

Whenever the sample size is less than 30, the confidence interval formula is

X̄ ± t σ̂x̄    (8.7)

Like the normal distribution, the t distribution is symmetric and bell-shaped. The exact shape of the t distribution depends on the sample size: as n approaches 30, the value of t approaches the standard normal (Z) value. Two pieces of information are needed to use the t table (appendix, tables C and D):

TABLE 8.4
Confidence Interval Equations for Random and Systematic Samples*

Part 1: Population parameter — mean (µ)
  Best point estimate:  µ̂ = X̄ = ΣXᵢ/n
  Standard error of the point estimate (sampling error):  σ̂x̄ = √( (s²/n)((N − n)/N) ),  where s² = Σ(Xᵢ − X̄)²/(n − 1)
  Confidence interval (bound) around the point estimate:  X̄ ± Zσ̂x̄ = X̄ ± Z√( (s²/n)((N − n)/N) )

Part 2: Population parameter — total (τ)
  Best point estimate:  τ̂ = T = NX̄
  Standard error of the point estimate (sampling error):  σ̂T = √( N²(s²/n)((N − n)/N) )
  Confidence interval (bound) around the point estimate:  T ± Zσ̂T = T ± Z√( N²(s²/n)((N − n)/N) )

Part 3: Population parameter — proportion (p)
  Best point estimate:  p̂ = p = x/n
  Standard error of the point estimate (sampling error):  σ̂p = √( (p(1 − p)/(n − 1))((N − n)/N) )
  Confidence interval (bound) around the point estimate:  p ± Zσ̂p = p ± Z√( (p(1 − p)/(n − 1))((N − n)/N) )

* Exclude the finite population correction, (N − n)/N, from the confidence interval equations if n/N < .05. Replace Z with the corresponding t if n < 30.
N = population size; n = sample size; X̄ = sample mean; s² = sample variance. (In part 3 only: p = sample proportion.)
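The three estimators in table 8.4 translate almost line for line into code. The sketch below is mine, not the book's: the function names are invented, the normal quantile is used throughout (for n < 30 the tabled t value should replace it), and the fpc is dropped automatically when the sampling fraction is below .05.

```python
from math import sqrt
from statistics import NormalDist

def z_value(conf):
    """Number of standard errors for a two-sided confidence level."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)

def fpc(N, n):
    """Finite population correction; ignored when the sampling fraction < .05."""
    return (N - n) / N if n / N >= 0.05 else 1.0

def mean_ci(x_bar, s2, n, N, conf=0.95):
    """Table 8.4, part 1: confidence interval for the population mean."""
    margin = z_value(conf) * sqrt((s2 / n) * fpc(N, n))
    return x_bar - margin, x_bar + margin

def total_ci(x_bar, s2, n, N, conf=0.95):
    """Table 8.4, part 2: scale the mean interval by the population size."""
    lo, hi = mean_ci(x_bar, s2, n, N, conf)
    return N * lo, N * hi

def proportion_ci(p, n, N, conf=0.95):
    """Table 8.4, part 3: confidence interval for the population proportion."""
    margin = z_value(conf) * sqrt((p * (1 - p) / (n - 1)) * fpc(N, n))
    return p - margin, p + margin
```

With the larger sample from situation 1 below (X̄ = 1,821, s² = 36,864, n = 250, N = 8,306), `mean_ci(1821, 36864, 250, 8306, 0.90)` returns roughly (1801, 1841), matching the worked interval.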


Chapter 8 .A. Estimation in Sampling 125

• the desired significance level (α)—common levels are α = .10, .05, and .01.
• the number of degrees of freedom (df), where df is defined as one less than the sample size (df = n − 1). An explanation of degrees of freedom is included in chapter 9 in the context of inferential statistics.

For smaller sample sizes (30 or less), the t value will always be slightly larger than the Z-value at the corresponding level of significance, resulting in a slightly larger confidence interval, or bound, on the error of estimation. For example, when placing a 95% confidence interval (α = .05) around a sample statistic when n = 30,

TABLE 8.5
Confidence Interval Equations for Stratified Samples*

Part 1: Population parameter — mean (µ)
  Best point estimate:  µ̂ = X̄ = (1/N) Σᵢ₌₁ᵐ NᵢX̄ᵢ
  Standard error of the point estimate (sampling error):  σ̂x̄ = √( (1/N²) Σᵢ₌₁ᵐ Nᵢ²(sᵢ²/nᵢ)((Nᵢ − nᵢ)/Nᵢ) )
  Confidence interval (bound) around the point estimate:  X̄ ± Zσ̂x̄ = X̄ ± Z√( (1/N²) Σᵢ₌₁ᵐ Nᵢ²(sᵢ²/nᵢ)((Nᵢ − nᵢ)/Nᵢ) )

Part 2: Population parameter — total (τ)
  Best point estimate:  τ̂ = T = Σᵢ₌₁ᵐ NᵢX̄ᵢ = [N₁X̄₁ + … + NₘX̄ₘ]
  Standard error of the point estimate (sampling error):  σ̂T = √( Σᵢ₌₁ᵐ Nᵢ²(sᵢ²/nᵢ)((Nᵢ − nᵢ)/Nᵢ) )
  Confidence interval (bound) around the point estimate:  T ± Zσ̂T

Part 3: Population parameter — proportion (p)
  Best point estimate:  p̂ = p = (1/N) Σᵢ₌₁ᵐ Nᵢpᵢ = (1/N) Σᵢ₌₁ᵐ Nᵢ(xᵢ/nᵢ)
  Standard error of the point estimate (sampling error):  σ̂p = √( (1/N²) Σᵢ₌₁ᵐ Nᵢ²(pᵢ(1 − pᵢ)/(nᵢ − 1))((Nᵢ − nᵢ)/Nᵢ) )
  Confidence interval (bound) around the point estimate:  p ± Zσ̂p

* Exclude the finite population correction (N − n)/N and all (Nᵢ − nᵢ)/Nᵢ from the confidence interval equations if n/N < .05. Replace Z with the corresponding t if n < 30.
m = number of strata; subscript i refers to stratum i; Nᵢ = size of population stratum i; nᵢ = size of sample from stratum i; sᵢ² = variance of sample i; N = total population of all strata (N = N₁ + N₂ + … + Nₘ). (In part 3 only: pᵢ = sample proportion with characteristic of stratum i.)
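As a sketch (names and structure are mine), the stratified mean estimator in table 8.5 can be written with each stratum supplied as a tuple (Nᵢ, nᵢ, X̄ᵢ, sᵢ²); the per-stratum fpc is dropped when nᵢ/Nᵢ < .05, and the normal quantile stands in for t.

```python
from math import sqrt
from statistics import NormalDist

def stratified_mean_ci(strata, conf=0.95):
    """Table 8.5, part 1. strata: iterable of (N_i, n_i, mean_i, var_i)."""
    N = sum(Ni for Ni, _, _, _ in strata)
    x_bar = sum(Ni * mean for Ni, _, mean, _ in strata) / N
    # squared standard error of the stratified mean
    var = sum(
        Ni ** 2 * (v / ni) * ((Ni - ni) / Ni if ni / Ni >= 0.05 else 1.0)
        for Ni, ni, _, v in strata
    ) / N ** 2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * sqrt(var)
    return x_bar, x_bar - margin, x_bar + margin
```

Fed the two student strata from the geography-test example later in the chapter (table 8.6), it returns a mean near 54.11 with bounds near 52.21 and 56.01.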

Z = 1.96, while t = 2.04. This larger confidence interval for t is to be expected because the smaller the sample, the larger the sampling error and the greater the uncertainty that the sample precisely represents the population from which it is drawn.

What if the sample size is large relative to the population size? You may sometimes encounter the fortunate situation where your sample size is relatively large compared to the population size. This is a fortunate situation because having a relatively large sample drawn from a relatively small population allows you to add what is called the finite population correction (fpc) to the confidence interval formula. Including the fpc when calculating the confidence interval allows you to narrow the confidence interval width and derive a more precise sample estimate. Since narrowing the confidence interval width is always a good thing, the fpc should be included in the confidence interval formula whenever possible.

All of the confidence interval formulas are shown in tables 8.4 and 8.5, and the last portion of each confidence interval formula is the fpc component:

fpc = (N − n)/N    (8.8)

where: fpc = finite population correction
       N = population size
       n = sample size

The sampling fraction of a problem is defined as the ratio of sample size to population size: (n/N). Whenever this sampling fraction is large, the finite population correction should be included in the confidence interval equation. In general, the fpc value is included in population estimation equations when the ratio of sample size to population size exceeds 5% (n/N > .05). In the next section of this chapter (section 8.3) the only situation that includes the fpc is situation 6.

Whenever the sampling fraction is less than 5%, the fpc portion of the confidence interval formula should not be included, as it will have only a negligible effect. In the next section, the fpc is not included in the confidence interval equations for situations 1–5.

All of the components are now in place to calculate various confidence intervals. However, deciding which one to select depends on two factors:

• Different sample types require different confidence interval equations. The following discussion of confidence interval calculation will be limited to three sample types—random, systematic, and stratified.
• Different population parameters require different confidence interval equations. The following discussion will be restricted to three parameters—mean, total, and proportion.

These different sample types and population parameters are combined to create the six basic situations shown in tables 8.4 and 8.5 with regard to confidence intervals. Each of these basic situations is illustrated with a geographic example. Other situations focusing on additional sampling designs (such as cluster, composite, or hybrid samples) are discussed in advanced texts that cover sampling more extensively. If you want to find out more about sample estimation, refer to the readings at the end of the chapter.

8.3 GEOGRAPHIC EXAMPLES OF CONFIDENCE INTERVAL ESTIMATION

The following examples are all placed in a very practical geographic context. A realistic set of scenarios has been created utilizing representative values from the 2010 Census of Population and other reliable sources. Each of the following six "situations" is designed to be a practical problem or issue that could occur in any community. With each situation, the emphasis is on the geographic characteristics of the problem and related policy considerations. Following one of the main themes of the book, our hope is to demonstrate the strong linkages between informed geographic decision-making and the use of statistics.

According to the 2010 U.S. Census of Population, the fictitious city of Middletown (fig. 8.6) has a population of 20,340 and is the county seat of Clinton County (104,500 total population). For various administrative and economic reasons, Middletown has collected official statistics for the last few decades from each of their four districts: Northside (pop. 2,444), Central (pop. 9,498), Easton (pop. 4,910), and Southside (pop. 3,528). Just outside the Middletown city limits is the planned community of Parkwood Estates (pop. 438), whose residents will be deciding in the next election whether to approve or reject annexation into Middletown (fig. 8.7).

Situation 1: Random or Systematic Sample—Estimate of Population Mean

Like many other communities, a critical concern of many Middletown residents is the rapidly rising cost of household energy expenditures. According to a recent Residential Energy Consumption Survey (RECS) conducted by the U.S. Energy Information Administration, the average annual energy expenditure per household nationwide is nearly $2,000. Local officials suspect energy costs in Middletown may be even higher than the national average. Community officials and a number of citizen organizations have been supporting various federal and state voluntary initiatives to promote energy conservation and reduce energy consumption. These programs include improved (more energy-conserving) building codes, improved appliance standards and incentives, expanded

[FIGURE 8.6. The City of Middletown. Map showing the four districts (Northside, Central, Easton, and Southside), with the planned community of Parkwood Estates just outside the city limits; scale in kilometers.]

weatherization assistance, and retrofit incentives (with increased equipment standards). To date, however, these community efforts have had negligible results.

As part of a newly created local residential energy conservation program, Middletown officials want to estimate the average annual energy expenditures per household. The hope is that this information, along with more detailed monitoring of energy cost components (such as energy expense by specific appliance), will allow them to recommend specific energy reduction strategies for local households that will best reduce their overall energy costs.

To estimate and place bounds on the average annual energy expenditures per household, Middletown planners take a simple random or systematic sample of 25 households. From the sample data, the sample mean and variance are X̄ = $1,810 and s² = 38,416. They wish to determine the confidence interval that contains the true average annual energy expenses per household for Middletown, with 90% certainty (1 − α = .90). The relevant values for this problem are:

X̄ = 1,810    s² = 38,416    N = 8,306    n = 25    α = .10

Since n < 30, t rather than Z should be used in the confidence interval equation (the top portion of table 8.4). Since α = .10 and degrees of freedom = n − 1 = 24,

[FIGURE 8.7. The Planned Community of Parkwood Estates. Parcel map showing numbered residential parcels, buildings, water, and the adjacent country club and golf course; Middletown lies 2 kilometers away. Scale in meters.]

the value from the t table is 1.71. Because the sampling fraction (n/N) is less than .05, the finite population correction can be ignored. The confidence interval in this problem is calculated as

X̄ ± t√(s²/n) = 1,810 ± 1.71√(38,416/25) = 1,810 ± 67.03    (8.9)

Middletown planners are 90% certain that the true mean energy expenditures per household falls within the interval from 1,742.97 to 1,877.03.

This is a fairly narrow confidence interval, but it can be made even more precise if decision-makers think the additional precision will allow them to make even better energy policy recommendations to Middletown households. Two strategies are available to narrow the interval: (1) lower the confidence level from .90 and/or (2) increase the sample size above 25. The latter strategy requires more effort, but the larger sample size will permit the confidence interval to be narrowed.

Suppose the decision is made to increase the number of households surveyed from 25 to 250. If the level of confidence is kept at .90, the resulting interval should be considerably more precise. Suppose the mean and variance of this larger survey of households are calculated from the sample data as X̄ = 1,821 and s² = 36,864. Now, with n > 30, Z rather than t is used in the confidence interval formula. Also, with α = .10, the value from the normal table becomes Z = 1.65. Once again, the finite population correction (fpc) is not useful because the sampling fraction n/N is still less than .05. The revised confidence interval is calculated as

X̄ ± Zσ̂x̄ = X̄ ± Z√(s²/n) = 1,821 ± 1.65√(36,864/250) = 1,821 ± 20.04    (8.10)

With this larger sample, planners are now 90% certain that the true mean household energy costs in Middletown (µ) falls within the very narrow interval from $1,800.96 to $1,841.04.

Situation 2: Random or Systematic Sample—Estimate of Population Total

Middletown planners can now use the best point estimate of average annual energy expenditures per household to determine the total amount of energy costs for all residential households in their community. The best estimate of the population total (τ̂) is the sample total (T), which is the sample mean (X̄) multiplied by the population size (N):

τ̂ = T = NX̄    (8.11)

For the Middletown problem, if the smaller sample size (n = 25) is being used, the best estimate of the population total is

τ̂ = T = 8,306(1,810) = 15,033,860

Based on this best point estimate, Middletown households spend slightly more than $15 million annually to meet their collective residential energy needs. Using the equations in the middle portion of table 8.4, the confidence interval around the sample total can be determined by:

T ± Zσ̂T    (8.12)

As in situation 1, the sampling fraction is less than .05 and the fpc can be ignored. Also, remember that when the sample size is less than 30, t rather than Z should be used. Therefore, when α = .10 and t = 1.71, the confidence interval for the total household energy use in Middletown is calculated as

T ± t√( N²(s²/n) ) = 15,033,860 ± 1.71√( (8,306)²(38,416/25) ) = 15,033,860 ± 5,567,671

Planners are 90% certain that the true total amount of energy costs in all community residential households (τ) is found within the interval from $9,466,189 to $20,601,531. This is an extraordinarily wide confidence interval. The combination of a very small sample size of 25, coupled with a data set having considerable variability, has resulted in a generally unacceptable result, at least from a practical policy perspective.

Situation 3: Random or Systematic Sample—Estimate of Population Proportion

The recent nationwide meltdown in the real estate market has placed extraordinary pressures on communities to meet their financial obligations. As homeowners default on mortgages, the number of properties on the tax roll diminishes, along with the revenues generated by those taxes. Not only is housing foreclosure a problem, but many residential properties with a mortgage are now finding themselves "underwater" as the value of their homes has declined. This means many homeowners are now experiencing negative equity, as the amount owed on their mortgage is greater than the current value of their home.

The spatial pattern of foreclosures in the United States is very highly concentrated. According to RealtyTrac, more than half of the nation's foreclosures in 2008 took place in only 35 counties. This dramatic clustering is a sign that the financial crisis that devastated the U.S. economy probably began with collapsing home loans in only a few scattered places. A few of the 35 "crisis counties" were in the already depressed areas in and around Cleveland and Detroit. But nearly all of the most severely stressed places were concentrated in Southern California, Las Vegas, South Florida, Phoenix, and Washington. These are locations where home values impressively increased in the years 2000–2007, but then plummeted dramatically later in the decade.

The nationwide pattern of negative equity and "underwater" households (those who owe more in a mortgage than the value of their house) is much more widely dispersed. The most recent data available as this is written (the fourth quarter of 2011) estimates that over 11 million residential properties were in a condition of negative equity. That is nearly 23% of all residential properties carrying a mortgage. The states with the highest negative equity percentages are Nevada (where 61% of all mortgaged properties are underwater), Arizona (48%), Florida (44%), Michigan (35%), and Georgia (33%).

Though not in one of these "problem states," officials from fictitious Middletown are concerned that their percentage of households with mortgages currently "underwater" is greater than the national average. From the tax records and other sources, it is known that there are 8,306 occupied housing units in town. If Middletown follows the national average, slightly over 68% of all occupied housing units are owner-occupied and nearly 32% are renter-occupied. Middletown officials also know that about 70% of all owner-occupied housing units nationwide are carrying a mortgage.

Armed with this general background information, planners estimate that nearly 4,000 Middletown owner-occupied housing units have a mortgage. The goal is to take a random or systematic sample of these housing units to estimate what proportion are currently "underwater." Eighty-five owner-occupied housing units having a mortgage are sampled, and 25 of these are found to have negative equity.

The best point estimate of the population proportion (p) is the sample proportion (p̂), which is the number in the sample having the specified characteristic (x) divided by the total sample size (n):

p̂ = p = x/n    (8.13)

The best estimate of the proportion of Middletown owner-occupied households that are underwater is

p̂ = p = x/n = 25/85 = .294

Using the equations in the bottom portion of table 8.4, the confidence interval around this population estimate of the proportion is calculated as

p ± Z√( p(1 − p)/(n − 1) )    (8.14)

As in situations 1 and 2, the sampling fraction is less than .05 and the fpc can be ignored. To enclose the true population proportion with 95% confidence (1 − α = .95 and Z = 1.96), the confidence interval is:

.294 ± 1.96√( (.294)(.706)/(85 − 1) ) = .294 ± .096

Middletown planners can conclude with 95% certainty that p, the true proportion of owner-occupied housing units with a mortgage that currently has negative equity, is in the interval from .198 to .390. This is a very wide confidence interval, and would undoubtedly be a disappointing result for Middletown officials. Their choices for narrowing the confidence interval width are to increase the sample size, reduce the confidence level (perhaps to 80%), or both.

Situation 4: Stratified Sample—Estimate of Population Mean

Members of the Board of Education in fictitious Clinton County wish to evaluate the "geographic competencies" of all high school juniors in the County school district. One group of students has completed an introductory geography course taught in several of the County high schools, whereas another group of students has not. School officials suspect that the geographic competencies of these two groups of juniors are quite different. If a simple random sample is taken of all juniors, either of the groups could be over- or under-represented in the sample. Therefore, a stratified sample design with proportional representation of each group of students is appropriate.

To evaluate students' geographic skills, school officials decide to use the Secondary-Level Geography Test designed by the National Council for Geographic Education (NCGE). This test evaluates a student's basic geographic knowledge in three areas: geographic skills, physical geography, and human geography. Of 2,160 total juniors in the school district, 480 have completed the geography course and 1,680 have not. School officials have only 90 test booklets from NCGE and must restrict their sample to 90 students. The general procedure for estimating the mean from a stratified sample is shown in the top portion of table 8.5, and the worktable showing the best estimate and confidence interval for the test scores of all Clinton County students is shown in table 8.6.

Notice that the overall best estimate of the true population mean is the sample mean of 54.11. This 54.11 value is actually the weighted mean of the two sample means (64.36 for the 480 students who have taken the geography course and 51.18 for the 1,680 students who have not taken the course). Because proportional representation of these two student groups is necessary to have an unbiased sample, it is not surprising that the much larger group of students who did not complete the geography course would influence the overall result much more (with a heavier weighting) than the group of students completing the course. However, we can be assured that this stratified sample result is more likely to be closer to the true population mean (which is unknown) than would be a simple random sample of the same size.

Situation 5: Stratified Sample—Estimate of Population Total

The Middletown City Council has just held a public meeting presenting the results of the energy expenditures sample survey. Recall from situations 1 and 2 that best estimates and confidence intervals were placed around the average annual energy costs per household and the total annual energy costs for all residential units in Middletown. The vast majority of citizens attending the meeting expressed concern about the limited usefulness of the sample results.

A number of criticisms were aired, but several complaints seemed to predominate: (1) the sample of households was too small; (2) the best point estimates seemed

TABLE 8.6
Confidence Interval Estimate of Stratified Sample Mean: Secondary-Level Geography Test Results for Clinton County Students

Task: Estimate the mean test score of all high school juniors on the Secondary-Level Geography Test, based on a stratified sample, and place a 95% (α = .05) confidence interval (bounds) around the estimate.

Stratum 1 — Students who have taken introductory geography courses:      N₁ = 480    n₁ = 20    X̄₁ = 64.36    s₁² = 65.61
Stratum 2 — Students who have not taken introductory geography courses:  N₂ = 1,680  n₂ = 70    X̄₂ = 51.18    s₂² = 90.25

The best estimate of the true population mean (µ) is the stratified sample mean (X̄):

µ̂ = X̄ = (1/N) Σᵢ₌₁ᵐ NᵢX̄ᵢ = (1/2,160)[480(64.36) + 1,680(51.18)] = 54.11

The confidence interval around X̄ is:

X̄ ± Zσ̂x̄ = X̄ ± Z√( (1/N²) Σᵢ₌₁ᵐ Nᵢ²(sᵢ²/nᵢ)((Nᵢ − nᵢ)/Nᵢ) )

Exclude the finite population correction (Nᵢ − nᵢ)/Nᵢ, as nᵢ/Nᵢ < .05 for each stratum:

X̄ ± Z√( (1/N²) Σᵢ₌₁ᵐ Nᵢ²(sᵢ²/nᵢ) ) = 54.11 ± 1.96√( (1/2,160²)[480²(65.61/20) + 1,680²(90.25/70)] ) = 54.11 ± 1.90

With 95% certainty, the mean test score of all high school juniors, µ, is within the interval from 52.21 to 56.01.

to be reasonable, but the confidence intervals around those estimates were so wide that it seemed to make the predictions basically useless (this criticism was particularly aimed at the total expenditures of all households in town on energy needs); (3) some people were concerned that the sample very likely did not reflect the complexity of the types of housing units in Middletown. Citizens know that the city has a wide variety of housing types which probably have very different energy needs (single-family detached, single-family attached, large and small apartments, and even some mobile homes), and the simple random sample procedure used in the energy expenditures survey didn't account for this at all; and (4) other people were upset that the sample might not have adequately represented their neighborhood (recall that Middletown has four districts—Northside, Central, Easton, and Southside) and the different demographic profiles that exist in each district, which would certainly be reflected in various housing characteristics (year of construction, total square feet of floor space, owner-occupied versus renter-occupied, and so on). For example, one person living in the Central district pointed out that her neighborhood was mostly rental units, with lots of apartments and relatively few single-family homes of any kind. A resident in Easton mentioned that his district had the city's only large mobile home park.

TABLE 8.7
Confidence Interval Estimate of Stratified Sample Total: Energy Expenditures for Middletown Households

         Stratum 1    Stratum 2    Stratum 3    Stratum 4
         Northside    Central      Easton       Southside
Nᵢ       873          4,112        1,964        1,357
nᵢ       21           99           47           33
X̄ᵢ       1,910        1,680        1,840        1,885
sᵢ       194          173          192          181
sᵢ²      37,636       29,929       36,864       32,761

N = 8,306    n = 200    X̄ = 1,816    m = 4

Best point estimate:  τ̂ = T = Σᵢ₌₁ᵐ NᵢX̄ᵢ = 873(1,910) + 4,112(1,680) + 1,964(1,840) + 1,357(1,885) = 14,747,295

If 90% confidence level, then Z = 1.65 (no finite population correction):

T ± Zσ̂T = T ± Z√( Σᵢ₌₁ᵐ Nᵢ²(sᵢ²/nᵢ) )

= 14,747,295 ± 1.65√( (873)²(37,636/21) + (4,112)²(29,929/99) + (1,964)²(36,864/47) + (1,357)²(32,761/33) )
= 14,747,295 ± 1.65√( 136,588,750,000 + 511,145,280,000 + 302,527,720,000 + 182,819,050,000 )
= 14,747,295 ± 1.65√1,133,080,800,000
= 14,747,295 ± 1.65(1,064,463)
= 14,747,295 ± 1,756,364

With 90% certainty, total energy expenditures for all Middletown households is within the interval from $12,990,931 to $16,503,659.
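The point estimate in table 8.7 is easy to verify (a sketch; the household counts and means per district are taken from the worktable):

```python
# (N_i, mean_i) for Northside, Central, Easton, Southside (from table 8.7)
strata = [(873, 1910), (4112, 1680), (1964, 1840), (1357, 1885)]

total = sum(N_i * mean_i for N_i, mean_i in strata)
print(total)  # 14747295 — the best point estimate of total energy costs
```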

After considerable discussion, it was decided that if any practical policies were ever going to be implemented, another energy expenditures survey was needed. All steps in the sampling procedure were reexamined and a new sampling protocol was designed. A district-level proportional (constant-rate) stratified sample procedure was selected, with the number of households in each stratum (district) to be proportionally represented. It was also decided to increase the total sample size to 200 households. The general procedure for estimating the total energy costs from a stratified sample is shown in the middle portion of table 8.5, and the associated worktable calculations are summarized in table 8.7.

The best point estimate of total annual energy costs for all residential units in Middletown is $14,747,295. We can be 90% certain that the true amount of energy

TABLE 8.8
Confidence Interval Estimate of Stratified Sample Proportion: Parkwood Residents' Opinion Regarding Annexation into Middletown

                      Stratum 1     Stratum 2             Stratum 3
                      Apartments    Homes not on water    Homes on water
Nᵢ                    28            105                   42
nᵢ                    8             30                    12
Opinion regarding     Yes = 5       Yes = 22              Yes = 4
annexation (x)        No = 3        No = 8                No = 8

N = 175    n = 50    (x) Yes = 31, No = 19    m = 3

pᵢ = xᵢ/nᵢ:    p₁ = 5/8 = .625    p₂ = 22/30 = .733    p₃ = 4/12 = .333

p̂ = p = (1/N) Σᵢ₌₁ᵐ Nᵢpᵢ = (1/175)[28(.625) + 105(.733) + 42(.333)] = .620

If 90% confidence level, then Z = 1.65 (include fpc, as n/N > .05):

p ± Zσ̂p = p ± Z√( (1/N²) Σᵢ₌₁ᵐ Nᵢ²(pᵢ(1 − pᵢ)/(nᵢ − 1))((Nᵢ − nᵢ)/Nᵢ) )

= .620 ± 1.65√( (1/175²)[(784)(.03348)(.714) + (11,025)(.00675)(.714) + (1,764)(.0202)(.714)] )
= .620 ± 1.65√( (1/175²)(18.7413 + 53.1350 + 25.4418) )
= .620 ± 1.65√.00317
= .620 ± 1.65(.0563)
= .620 ± .0929

Confidence interval from .5271 to .7129.


consumed by all residential units in the community is within the confidence interval from $12,990,931 to $16,503,659. This is certainly a much more precise estimate than was obtained in situation 2, which was limited by the small and simple random sample.

Situation 6: Stratified Sample—Estimate of Population Proportion

Yet another issue is facing some of the citizens of Clinton County. The planned community of Parkwood Estates is located just outside the city limits of Middletown (figs. 8.6 and 8.7). Residents of Parkwood Estates will have the opportunity in the next election to decide whether they want to be annexed into the City of Middletown. Although both positive and negative attitudes have been expressed about this decision, no specific information has been collected from Parkwood residents regarding their opinions about annexation.

It is recommended that a survey of Parkwood residents be taken to ascertain their views about this question. Specifically, planners would like to know what proportion of Parkwood residents are in favor of annexation. The entire community contains 175 residential units—too many to interview with the limited amount of time, money, and personnel available. Based on knowledge of the area, it is known that the community has 28 apartment units (Karla Court), 105 homes on lots without direct water access, and 42 lots with direct water access. The water-accessible properties are located in a separate part of the development along the shoreline of Trudel Bay. It is felt that a proportional stratified sample is appropriate because it is likely that residents of these three portions of Parkwood Estates might have different opinions regarding annexation. It is common knowledge that the three groups have different demographic and economic profiles (income, age, square footage of property, etc.). Clearly it is important to ensure that none of the three groups is over- or under-represented in the sample.

There is enough time and personnel to collect opinions about annexation from a stratified sample of 50 households. The general procedure for estimating the proportion from a stratified sample is shown in the bottom portion of table 8.5, and the calculations are summarized in table 8.8.

Because our sample (50 households) is greater than 5% of the population (175 households), it is advantageous to include the finite population correction (fpc) in the confidence interval estimation. Remember, including the fpc when the sampling fraction is large will improve the precision of our sample estimates—a desirable outcome.

The results indicate that 62.0% of the Parkwood Estates households sampled are in favor of annexation into Middletown. However, the confidence interval around this best estimate is rather wide, ranging from 52.71% to 71.29%. The somewhat imprecise confidence interval is due to the divergent opinions regarding annexation between the three portions of the development. Unlike the other two strata, homeowners on the water (lakefront) seem to be opposed to annexation. Notice that only four of twelve households sampled are in favor of annexation.

Despite the differences of opinion within Parkwood Estates, we can be very confident that annexation into Middletown will be approved when put to the vote. The entire range of the 90% confidence interval is above 50% approval of annexation, so it is quite unlikely that annexation will be defeated. In fact, the lower end of the confidence interval is well above 50% (at 52.7%), and only 5% of the range of possible values is below the estimated confidence interval, so we can be well over 95% confident that annexation will be approved when voted upon by Parkwood residents.

8.4 SAMPLE SIZE SELECTION

In problems using sampling, geographers often want to determine the minimum sample size needed to make sufficiently precise estimates before the complete sample is actually taken. Taking a sample larger than necessary wastes both time and effort. Some of the major factors to consider in selecting sample size are:

• the type of sample (random, stratified, etc.);
• the population parameter being estimated (mean, total, proportion);
• the degree of precision (width of confidence interval that can be tolerated);
• the level of confidence to be obtained for the estimate.

As sample size increases, key trade-offs occur. At a particular confidence level (.95, for example), increasing the sample size provides greater precision and narrows the confidence interval width around the population estimate. Similarly, at a particular level of precision (e.g., estimating the population proportion within .03 of its true value), increasing the sample size will raise the level of confidence that the estimate is within the selected interval.

Unfortunately, a larger sample generally requires more time and effort. In many practical sampling problems, you are likely to have multiple conflicting objectives. You must take a sample large enough to achieve the desired precision level and confidence interval width, but simultaneously avoid taking too large a sample, which wastes time and effort and provides estimates more precise than necessary. When samples get much larger than needed, the extra effort yields smaller and smaller incremental improvements in precision. In other words, the extra effort is more costly and yields less "bang for the buck."

This section illustrates how an appropriate sample size is determined from random sampling for the three basic population parameters—mean (µ), total (τ), and proportion (p).
Chapter 8 .A. Estimation in Sampling 135
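The effect of the finite population correction described in the Parkwood example can be sketched for the simpler, unstratified case. The fpc form used below, sqrt((N − n)/(N − 1)), is one common convention and is an assumption here; the text's own interval (table 8.8) is computed stratum by stratum, so its numbers differ:

```python
import math

def se_proportion(p_hat, n, N=None):
    """Standard error of a sample proportion; when N is given, apply a
    finite population correction of sqrt((N - n) / (N - 1))."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Parkwood figures from the text: 50 of 175 households sampled, 62% in favor.
p_hat, n, N = 0.62, 50, 175
print(round(se_proportion(p_hat, n), 4))     # without fpc  -> 0.0686
print(round(se_proportion(p_hat, n, N), 4))  # with fpc     -> 0.0582
```

With n/N at roughly 29% - far above the 5% threshold mentioned in the text - the correction shrinks the standard error by about 15%, which is why including the fpc is worthwhile here.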

Sample Size Selection - Mean

Suppose your task is to determine the minimum sample size needed to place the population mean estimate within a desired confidence interval around the true population mean. After choosing a confidence interval having an acceptable range, the issue becomes finding the appropriate sample size needed to place the sample statistic within that range. Simply stated, how large a sample is needed to get within the stated range of the actual population mean?

Recall that the confidence interval for µ is X̄ ± Zσx̄, where σx̄ is the standard error of the mean (sampling error). Suppose E (for Error) represents the amount of sampling error we are willing to tolerate. That is, for any particular problem, a maximum acceptable difference separates the sample mean statistic from its population mean:

E = Zσx̄ = Z(σ/√n)    (8.15)

The magnitude of tolerable sampling error depends on the circumstances of the problem. In some applications, a considerable amount of error can be tolerated, making E a large number. In other instances, only a limited amount of error can be accepted, so E must be a smaller number. For example, a political party taking a survey many months before an election only wants a "rough estimate" of a candidate's popularity and is willing to accept an E value of large magnitude. Just a few days before the election, however, the party wants a more precise estimate of the expected election outcome, so a small-sized E value is desired. The tolerable sampling error (E) is equivalent to one-half the width of the confidence interval (fig. 8.8).

FIGURE 8.8
Relationship Between Confidence Interval and Tolerable (Acceptable) Sampling Error (E)

After some algebraic manipulation, the sample size (n) needed to estimate µ with a certain level of precision or tolerable error (E), at a chosen level of confidence (Z), can be expressed as

n = (Zσ/E)²    (8.16)

If the population standard deviation (σ) is unknown, the sample standard deviation (s) is substituted:

n = (Zs/E)²    (8.17)

The required minimum sample size is directly related to the desired level of confidence (Z) and variability in the sample (s), but inversely related to the degree of error (E) you are willing to tolerate. In the political example, this relationship suggests that sample size will need to be increased in the second polling taken just a few days before the election to ensure that sampling error (confidence interval range) is reduced to an acceptable level.

In most real-world examples, the population standard deviation is not known. The sample standard deviation value is usually substituted to determine an appropriate sample size. A value for s is best derived from a pretest or preliminary sample. However, s could also be obtained from some previous study, or, if no better alternative exists, s could be an "educated guess" based on past experience.

Since a population parameter is being estimated, at least 30 observations should be included in a pretest or preliminary sample. You should return to the general population to draw additional sample units if needed. These new units can then be combined with the original sample units to create a single larger sample that meets the size requirement. If this "return" procedure is used, it is considered a two-stage sampling design.

Example of Sample Size Selection - Mean

Middletown planners wish to estimate the mean annual energy expenditures per household and be 90% confident that their estimate will be within $50 of the true population mean (µ). What is the minimum number of households that must be surveyed to ensure this degree of precision at this selected confidence level?
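The relationship between tolerable error and sample size is easy to verify numerically. The sketch below uses illustrative values (not the Middletown figures) and confirms that the computed sample size drives the implied sampling error back down to the tolerated E:

```python
import math

def min_sample_size_mean(z, s, e):
    """Minimum n so that z * s / sqrt(n) <= e, i.e. n = (z*s/e)^2 rounded up."""
    return math.ceil((z * s / e) ** 2)

# Illustrative values only: 95% confidence (Z = 1.96),
# pretest standard deviation s = 120, tolerable error E = 25.
n = min_sample_size_mean(1.96, 120, 25)
print(n)                                # -> 89
print(1.96 * 120 / math.sqrt(n) <= 25)  # implied error is back within E -> True
```

Because E enters the formula squared, doubling the error you can tolerate cuts the required sample size by a factor of four - the trade-off the passage describes.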
136 Part III .A. The Transition to Inferential Problem Solving

Suppose a preliminary sample of 30 households is drawn, and s is calculated as $210. Since the population standard deviation is unknown, equation 8.17 is used:

n = (Zs/E)² = ((1.65)(210)/50)² = 48.02

Therefore, a random sample of at least 49 households should be taken to ensure the desired degree of precision at the 90% confidence level. It is always better to include an extra observation or two above the absolute minimum necessary sample size, so rounding up to 49 households is appropriate. Since this result was obtained from a preliminary sample of 30 households in a two-stage sampling design, only 19 additional households need be contacted to complete the study satisfactorily.

Sample Size Selection - Total

The minimum sample size needed to make an interval estimate of a population total within a certain tolerable error level (E) can also be determined. The confidence interval for τ is T̂ ± Zσ̂T̂, where σ̂T̂ is the sampling error or standard error of the total:

E = Zσ̂T̂ = Z√(N²(σ²/n))    (8.18)

Algebraic manipulation isolates (n), the minimum sample size needed to estimate τ with a selected level of tolerable error (E) at a chosen confidence level (Z):

n = (NZσ/E)²    (8.19)

If the population standard deviation (σ) is unknown, the sample standard deviation (s) is substituted:

n = (NZs/E)²    (8.20)

Once again, the population standard deviation is seldom known and must be estimated with the sample standard deviation, which generally requires a pretest or preliminary sample.

Example of Sample Size Selection - Total

Middletown planners also want to estimate the total amount of energy expenditures for all residential households in the community. Suppose they wish to estimate total energy costs within $500,000 and be 90% confident with that level of precision and sampling error (E). What is the minimum number of households that must be surveyed?

From a pretest or preliminary survey of 30 households, the standard deviation of the annual energy expenditures per household is calculated as $215. Because the population standard deviation is not known, the sample size needed to meet the requirements of the problem is calculated from equation 8.20:

n = (NZs/E)² = ((8,306)(1.65)(215)/500,000)² = 34.71

A random sample of at least 35 households should be taken to estimate the total energy cost within $500,000 and be 90% confident in that level of precision. Since 30 households were already surveyed in the preliminary survey, only another 5 households need to be contacted in the second stage of the survey design.

Sample Size Selection - Proportion

To estimate a population proportion within a certain allowable level of error (E), the minimum sample size can also be calculated in advance of full sampling. Again, a pretest or preliminary survey can be used to estimate the population proportion (p) from the sample proportion (p̂). The confidence interval for p is p̂ ± Zσ̂p, where σ̂p is the sampling error or standard error of the proportion:

E = Zσ̂p = Z√(p(1 − p)/n)    (8.21)

You can isolate the minimum sample size (n) algebraically as:

n = Z²p(1 − p)/E²    (8.22)

If the population proportion (p) is unknown, the sample proportion (p̂) is substituted:

n = Z²p̂(1 − p̂)/E²    (8.23)

Unlike the mean and total, it is possible to determine the minimum sample size needed to estimate a population proportion without taking a pretest or preliminary sample. Note that the numerator of equation 8.23 contains the product p̂(1 − p̂). Consider the range of values this product can take given different values of p̂:

p          .1    .2    .3    .4    .5    .6    .7    .8    .9
p(1 − p)   .09   .16   .21   .24   .25   .24   .21   .16   .09

The maximum product of .25 occurs when p = .5 and (1 − p) = .5. Inserting .25 into equation 8.23, the result is the largest minimum sample size needed under

conditions of maximum uncertainty and represents a "worst case scenario." This value, .25, will always provide a large enough minimum sample size, thereby making this a popular sampling strategy:

n = Z²(.5)(.5)/E² = (1/4)(Z²/E²)    (8.24)

Example of Sample Size Selection - Proportion

Middletown planners wish to estimate the proportion of owner-occupied housing units that are carrying a mortgage and have negative equity. They want to be 90% certain their sample statistic is within .10 (10%) of the true population proportion. To determine in advance the minimum sample size needed when estimating a population proportion (p), two options are possible:

(1) Preliminary survey taken. Suppose a preliminary survey of 30 housing units reveals that 30% of Middletown owner-occupied households with a mortgage are currently "underwater." Since the population proportion (p) is unknown, p̂ = .30 is used with equation 8.23 to determine the necessary sample size:

n = Z²p̂(1 − p̂)/E² = (1.65)²(.30)(.70)/(.10)² = 57.2

Thus, a random sample of at least 58 households should be taken to estimate the proportion of Middletown households that are in a condition of negative equity within .10 (10%) and be 90% confident in a result that precise. Since 30 households have already been surveyed, another 28 observations must be taken in the second stage of the survey design.

(2) Preliminary survey not taken. If no preliminary survey is taken, the maximum product p(1 − p) = (.5)(.5) = .25 may be used in a "worst case" situation (equation 8.24):

n = Z²p(1 − p)/E² = (1.65)²(.5)(.5)/(.10)² = 68.1

In this case, a random sample of at least 69 households should be taken to achieve the desired level of precision. A larger required minimum sample size should always be expected with this option.

KEY TERMS

bounds of a confidence interval (upper and lower), 122
central limit theorem, 118
confidence interval, 118
confidence level, 122
E (tolerable or acceptable sampling error), 135
finite population correction (fpc) and fpc rule, 126
interval estimation, 117
point estimation, 117
sampling distribution of sample means, 118
sampling fraction, 126
significance level, 122
standard error of the mean (sampling error), 119
two-stage sampling design, 135

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Understand the basic concepts in point estimation and interval estimation, including the sampling distribution of a statistic and the central limit theorem.
2. Recognize the conditions when it is appropriate to apply the finite population correction.
3. Explain the terms "confidence interval," "significance level," and "confidence level" and the relationship between them.
4. Understand the procedure for constructing a confidence interval (bounds around the point estimate) when appropriate sample statistics are provided.
5. Understand clearly the factors involved in selecting the exact equation used to place a confidence interval around a population estimate, including type of sample taken and population parameter being estimated by the sample statistic.
6. Know how to apply the appropriate equations to calculate confidence intervals around population estimates for various geographic problems.
7. Identify the factors that need consideration when selecting sample size for a geographic problem and know how to determine the appropriate minimum sample size when presented with a geographic situation.

REFERENCES AND ADDITIONAL READING

Many of the references from chapter 7 are also relevant here.

If you are interested in exploring further the geographic examples mentioned in this chapter, the following are good places to start.

• To examine the farm data presented in both chapter 3 and this chapter, see the Lembo, A., et al. reference in chapter 3.

• For more information about the Residential Energy Consumption Survey (RECS), a comprehensive nationwide survey on household energy use, see the U.S. Energy Information Administration website: www.eia.gov.

• RealtyTrac is a real estate information company and an online marketplace for foreclosed and defaulted properties in the U.S. (www.realtytrac.com). They constantly update the spatial patterns of foreclosures and negative equity properties.

• Additional information regarding geography competency tests is available at the National Council for Geographic Education (NCGE) website: http://ncge.org/aphg. In particular, it is worthwhile examining the Geography-for-Life material, with information about the recently updated geography standards, published by the National Educational Testing Service.

• Considerable data is available regarding housing vacancies and home ownership rates from the U.S. Census Bureau at: www.census.gov/housing/hvs
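As a closing cross-check, the chapter's four sample-size calculations (mean, total, and the two proportion cases) can be reproduced in a few lines, using Z = 1.65 for 90% confidence as in the worked examples:

```python
import math

Z = 1.65  # 90% confidence, as used in the chapter's examples

# Mean (eq. 8.17): s = $210, E = $50
n_mean = math.ceil((Z * 210 / 50) ** 2)

# Total (eq. 8.20): N = 8,306 households, s = $215, E = $500,000
n_total = math.ceil((8306 * Z * 215 / 500000) ** 2)

# Proportion with pretest (eq. 8.23): p_hat = .30, E = .10
n_prop = math.ceil(Z**2 * 0.30 * 0.70 / 0.10**2)

# Proportion, worst case (eq. 8.24): p(1 - p) = .25
n_worst = math.ceil(Z**2 * 0.25 / 0.10**2)

print(n_mean, n_total, n_prop, n_worst)  # -> 49 35 58 69
```

Each value matches the minimum sample size reported in the corresponding example (49, 35, 58, and 69 households).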
PART IV

INFERENTIAL
PROBLEM SOLVING
IN GEOGRAPHY
Elements of Inferential Statistics

9.1 Terms and Concepts in Hypothesis Testing: One-Sample Difference of Means Test
9.2 One-Sample Difference of Proportions Test
9.3 Selected Issues in Inferential Testing

In the last chapter we introduced various concepts concerning estimation in sampling. We learned that a primary objective of sampling is to infer some characteristic of the population based on statistics derived from a sample of that population. Sample statistics are used to make point estimates of population parameters such as the mean, total, and proportion. In addition, to determine the level of precision of these point estimates, a confidence interval, or bound, is placed around the sample statistic, making it possible to state the likelihood that a sample statistic is within a certain range or interval of the population parameter. In this chapter, these ideas are extended to a form of statistical inference known as hypothesis testing. Using these inferential procedures, we are able to reach statistical conclusions for a wide variety of problems.

The practical application and value of inferential statistics are best understood in the context of the scientific research process. As outlined and discussed in chapter 1 (fig. 1.1), the formulation of hypotheses and their testing through inferential statistics play a central role in the development of the science of geography. For example, hypothesis evaluation may lead to the refinement of spatial models and the development of laws and theories. In addition, conclusions from inferential testing often contribute toward the advancement of scientific research.

Methods of hypothesis testing are introduced in this chapter with examples of one-sample difference tests. A properly created sample is essential in the successful application of inferential statistics. We sometimes need to verify whether a particular sample is truly typical or representative of the population from which it is drawn.

If a sample is used in further analysis or research and does not represent the original population accurately, then future results obtained using this sample may be flawed. For example, suppose community planners want to sample the local population to determine the level of support for constructing a new vocational-technical school. Suppose the views of upper- and lower-income residents differ significantly on this issue. If the sample happens to represent upper-income residents disproportionately, survey results are likely to be flawed because the views of lower-income residents are under-represented. Therefore, a one-sample difference test needs to be used early in the planning process to verify that the sample opinions are truly representative, thereby avoiding subsequent problems.

In other cases, we might be interested in comparing known statistics from a local sample to population parameters that are available from some external source. The issue here is to show that the sample taken from the local population differs significantly (or does not differ significantly) from some given external population. For example, suppose you want to know whether pollution levels from a local sample of sites differ significantly from national or state EPA standards (i.e., the external population) for those pollutants. If the sample of local pollution levels meets the standards, then federal or state penalties are not imposed on the community.

If the population parameter and the sample statistic are not significantly different, you can be confident that the sample is truly representative of the target population. In such cases, the chosen sample is adequate for further analysis. On the other hand, if the population


parameter and sample statistic are significantly different, 9.1 lERMSANDCONCEPTS


IN HYPOTHESIS
the sample may be inaccurate or otherwise deficient.
Excessive sampling error may produce an unrepresenta- lESTING:ONE-SAMPLE
DIFFERENCE
OF
tive sample. Human error may occur in one of the steps MEANSlEST
taken to produce the sample.
Alternatively, a particular sample may represent the Classical or traditional hypothesis testing involves
situation described in the last chapter (fig. 8.5), where the a formal multi-step procedure that leads from the state-
mean of sample five {X 5} falls into a tail of the sampling ment of hypothesis to a conclusive statement (decision)
distribution of means. That illustration demonstrates regarding the hypothesis (table 9.1). Hypothesis testing
how a sample statistic calculated from a properly drawn allows us to make inferences about the magnitude of one
random sample has a 10% chance of being quite different or more population parameters, based on sample statis-
from the actual population parameter value. tics estimating those parameters. Hypotheses regarding a
An example helps clarify the need for testing the dif- population parameter are evaluated using sample infor-
ference between a sample and population mean. Suppose mation, and a conclusion is reached (at some preselected
local officials want to determine the attitudes of commu- significance level) about the hypotheses. Because of the
nity residents toward constructing a new public swim- nature of sampling, a measurable probability can always
ming pool. The advisory committee wants to select a 5% be assigned to the conclusions reached through statistical
sample of families at random for a telephone survey. hypothesis testing. This logic should seem very familiar,
However, because families with children are more likely as it is based directly on the sampling estimation and
to want this facility, the committee needs to make sure confidence interval themes we discussed in chapter 8.
that their sample accurately represents the family struc- Recall in the first scenario from the previous chapter
ture in the community. (random or systematic sample-estimate of population
One way to measure the success of the committee's mean) that planners in the city of Middletown estimated
objective is to determine the number of children in each the average annual energy expenditures per household.
family selected in the sample. Using these data, the More specifically, they estimated the mean energy costs
mean family size for the sample could be computed. per household from two random samples-a smaller
This sample mean could then be compared to the aver- sample having only 25 households and a larger sample of
age family size for the target population found in either 250 households. Both samples were taken from a popula-
recent census data for the community or perhaps in tion of 8,306 Middletown households, and a confidence
information from a school census. If the mean family interval was placed around both sample estimates. The
size for the sample is not significantly different than the smaller 25-household sample used the t table to estimate
mean for the target population, then the committee can the width of the confidence interval. This procedure
have confidence that their sample is truly representative resulted in a 90%, confidence interval of $67.03 placed
of family structure in the area and can confidently base around a mean household expenditure of $1,810. In the
other analysis off of this sample. The critical results of larger 250-household sample the mean household energy
the survey can then be analyzed with the knowledge that expenditure was $1,821 and the associated Z-value for
the sample is not biased. 90%, confidence produced a considerably more precise
In section 9 .1, two complementary methods of interval of $20.04:
hypothesis testing are presented. The classical-tradi-
tional approach provides a solid, logical foundation for smaller sample (n= 25): X + tif x = 1810+67.03
understanding hypothesis testing. The p-value (or prob-
value) method of hypothesis testing builds on this solid largersample (n=250): X+Zax =1821+20.04
foundation yet provides additional useful information
concerning the research problem. An example problem From the 2009 Residential Energy Consumption
from the previous chapter is used to illustrate these Survey conducted by the Energy Information Adminis-
methods. The one-sample difference of means t test is
applied as the hypothesis testing terminology and proce-
dure is introduced. TABLE 9.1
Section 9.2 introduces the one-sample difference of Steps in Classlcal/Traditional Hypothesis Testing
proportions test. Then in section 9.3 the discussion
focuses on the circumstances or geographic problems for Step 1: State null and alternate hypothesis
Step 2: Select appropriate statistical test
which inferential testing is appropriate. In addition, some Step 3: Select level of significance
issues that influence the selection of the proper inferen- Step 4: Delineate regions of rejection and nonrejeclion of
tial statistical test are briefly discussed. null hypothesis
Step 5: Calculate test statistic
Step 6: Make decision regarding null and alternate
hypothesis
Chapter 9 • Elements of Inferential Statistics 143

tration, Middletown officials have learned that the aver- tional and non-directional formats offer two possibilities.
age household in the United States annually spends In our example, we know that the nationwide average
nearly $2,000 (µ = 2000). They want to determine if aver- annual energy cost per household is $2,000. We also
age household energy costs in their community are simi- know that in the smaller Middletown sample (n = 25),
lar to (or different from) this national figure. To answer the average energy cost per household was estimated to
this question, a formal hypothesis testing procedure is be $1,810. If our goal is to determine whether the Mid-
established and appropriate terminology and concepts dletown average is different than the national average,
regarding statistical testing are introduced. we would state the alternate hypothesis in a non-direc-
tional form:
Steps in the Classical or Traditional Hypothesis
Testing Procedure Ho:µ= 2,000
Step 1: State the Null and Alternate Hypotheses: Two
complementary hypotheses of interest are the null HA:µ ,tc2,000

hypothesis (denoted Ho) and the alternate or alterna-


These statements hypothesize that mean annual energy
tive hypothesis (denoted HA)- Consider the formulation
of hypotheses concerning the mean of a population (µ). cost for Middletown households is no different from the
The typical claim is that µ is equal to some value, µ H (for national average of 2,000. Obviously, $1,810 is a differ-
hypothesized mean). This claim of equality is called the ent amount of money than $2,000. However, our consid-
null hypothesis, and takes the general form: eration in statistical testing is whether the magnitude of
the difference can be explained by the inherent natural
variability of the data. If the Middletown mean is close
to this population mean, the likely conclusion is that the
null hypothesis should not be rejected. Conversely, if the
The null hypothesis can also be stated:
Middletown mean is not close to the national popula-
tion mean, the likely conclusion is that Ho should be
rejected. By expressing HA in this form, the direction of
difference between household energy costs in Middle-
In the latter form, attention is focused on Hoas a state- town and the United States is not important. That is, the
ment of "no difference" between µ and µ H• which is the Middletown average annual household energy expendi-
case if(µ - µ 8 ) equals O (or null). The null hypothesis ture could be either greater than or less than the national
statement always includes the equal sign. We should note figure. If Ho is rejected, the only conclusion that can be
that even though the more "interesting" result might be to drawn is that the difference in household energy cost is
demonstrate that differences exist, the null hypothesis significant; no conclusion on the direction of that differ-
always focuses on the assumption that no differencesexist. ence is possible.
The converse of the null hypothesis is the alternate For some situations, the alternate hypothesis can pro-
hypothesis, HA. The alternate hypothesis expresses the vide more specific information about the hypothesized
conditions under which Ho is to be rejected and can be direction of difference:
viewed as a positive statement of difference. The two
hypotheses are mutually exclusive, for if Ho is rejected, Ho:µ= 2,000
HA is accepted. The alternate hypothesis takes one of two
forms, depending on how the research problem is struc-
tured. In some problems, the form of HA is non-direc-
tional ( or two-tailed), while in others it is directional (or
or
one-tailed), but HA always consists of an inequality indi-
cating the conditions under which Ho is rejected. So, for
example, each of the following is a valid form of HA:
In addition to determining whether a significant difference
HA:µ ,tcµ H (non-directional) exists, the direction of that difference (expressed in HA) is
also specified. IfHA: µ < 2,000, then rejection ofH 0 indi-
HA:µ< µH (directional) cates that the average annual energy cost of Middletown
households is significantlylower than the national average.
HA:µ> µH (directional) This would be a reasonable alternate hypothesis to test, for
example, if Middletown has a mild year-round climate or
The selection of a specific form of HA depends on an accessible location making energy transportation costs
how the hypothesized difference is stated. The direc- relatively low. If HA:µ> 2,000, then rejection ofH 0 indi-
144 Part IV .A. Inferential Problem Solving in Geography

cares that the average Middletown household energy cost TABLE 9.3
is significantly higher than the national average. This
Possible Decisions In a Court of Law
might be a logical alternate hypothesis if Middletown has
an energy-demanding climate or a relatively inaccessible True situation (unknown)
Decision
location resulting in high energy transport costs. from jury Not guilty Guilty
With classical hypothesis testing, Middletown plan-
Guilty Incorrect decision Correct decision
ners must decide which form of HA to use. The critical
issue is whether a prioriknowledge exists about any direc- Not guilty Correct decision Incorrect decision
tion of difference in energy use between Middletown and
households nationwide. If Middletown planners have no
preconceived rationale for believing their household not rejecting Ho when it is actually false (a Type II error).
energy cost is larger or smaller than the national average, In a court of law, when reasonable doubt exists about the
the non-directional format would be appropriate. guilt of a defendant, the jury reaches a verdict of not
In classical hypothesis testing the conclusion is to guilty. The defendant is not declared "innocent," but
either reject or not reject the null hypothesis. Becausethis rather the verdict is "not guilty." Similarly, if reasonable
decisionis basedon a singlesample, we can measure the doubt exists about rejecting the null hypothesis, you
chance or probability of making an incorrect decision or should not reject it. For this reason, the significance level
reaching a wrong conclusion. Error comes from two pos- (a) of most problems is kept relatively low (at a level
sible sources (table 9.2): such as .OSor .01), to minimize the chances of a serious
Type I Error: A decision could be made to reject the Type I error.
null hypothesis as false when it actually is true. For our
Middletown example, it could be concluded that the dif- Step 2: Select the Appropriate Statistical Test: The sec-
ference between the national average household energy ond step in classical hypothesis testing is selecting the
cost and the sample of Middletown households is signifi- appropriate statistical test for your problem. As will be
cant, when actually no difference exists. The likelihood of this sort of error occurring (Type I error) is equivalent to the significance level (α), discussed in chapter 8.

Type II Error: Conversely, a decision could be made to not reject the null hypothesis when it actually is false. For the Middletown example, it could be concluded that their household energy cost does not differ from the national average, when a significant difference really exists. The likelihood of this error (Type II) occurring is beta (β).

The logic of hypothesis testing operates on much the same principle as judicial decision making in a court of law (table 9.3). In hypothesis testing, the null hypothesis is presumed correct until rejected or proven otherwise. Similarly, in court, a defendant is presumed innocent until proven guilty. In the judicial system, convicting a person who is truly not guilty is considered a more serious error than freeing a guilty person. Similarly, in hypothesis testing, rejecting H0 as false when it is actually true (a Type I error) is considered more serious than failing to reject H0 when it is actually false (a Type II error).

TABLE 9.2  Possible Decisions in Classical/Traditional Hypothesis Testing

                                        Null hypothesis in reality
Decision from hypothesis testing        True                      False
Reject H0 as false                      Type I error              Correct decision
                                        (prob. = α)               (prob. = 1 − β)
Do not reject H0                        Correct decision          Type II error
                                        (prob. = 1 − α)           (prob. = β)

Step 2: Select the Appropriate Statistical Test: As explained near the end of this chapter, a logical and convenient way to categorize the many statistical tests available is by the type of question asked and the assumptions that are met. Because we are comparing our one sample against a population parameter, it is sufficient for now to state that the appropriate statistical test for the Middletown household energy cost problem is a one-sample difference of means test:

    Z or t = (X̄ − µ) / σ_X̄    (9.1)

where  Z or t = test statistic
       X̄ = sample mean
       µ = population mean
       σ_X̄ = σ / √n = standard error of the mean
       σ = population standard deviation
       n = sample size

You should select Z only if the sample size is greater than or equal to 30 and you know the population standard deviation (a fairly rare occurrence); select t if the sample size is less than 30 or if the population standard deviation is unknown (the usual situation), in which case you must substitute the sample standard deviation.

In the chapter introducing continuous probability distributions (chapter 6), the concept of a Z-score or standard score is presented. The one-sample difference of means test is structurally similar to the Z-score equation (table 9.4). The standardized Z-score of a value in a set of data may be interpreted as the number of standard deviations a value is above or below the mean.

Chapter 9 • Elements of Inferential Statistics 145
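As a small illustration (ours, not from the text), the Z-versus-t selection rule can be written out in Python; the function name is hypothetical:

```python
def choose_statistic(n, sigma_known):
    """Selection rule from the text: choose Z only when the sample size
    is at least 30 AND the population standard deviation is known;
    otherwise choose t (the usual situation)."""
    return "Z" if n >= 30 and sigma_known else "t"

# Even a large sample calls for t when sigma is unknown.
choose_statistic(250, sigma_known=False)  # 't'
```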

Using similar logic, the Z-model or t-model one-sample difference of means test statistic measures the number of standard errors a sample mean lies above or below the hypothesized population mean.

TABLE 9.4  Comparable Logic: Z-Score of an Observation and Z-Value (or t-Value) of a Sample Mean

Z-score of a value (i) in a set of data:
    z_i = (x_i − X̄) / s

Z-value or t-value of a sample mean in a frequency distribution of sample means:
    Z or t = (X̄ − µ) / σ_X̄,  where σ_X̄ = s / √n

For the Middletown household energy expenditure examples, the population standard deviation (σ) is not known (this is the usual circumstance). Recall from chapter 8 (table 8.1) that the sample standard deviation, s, is a proper estimator of σ when σ is unknown, but a slight adjustment is needed. Thus, substituting s for σ in equation 9.1 gives:

    Z = (X̄ − µ) / σ_X̄ = (X̄ − µ) / (s / √n)         (if n ≥ 30)    (9.2)

or

    t = (X̄ − µ) / σ_X̄ = (X̄ − µ) / (s / √(n − 1))   (if n < 30)    (9.3)

For illustrative purposes, suppose Middletown officials must use the smaller 25-household sample in which the average annual energy expenditure is $1,810 (X̄ = 1,810) because the larger sample of 250 households has not yet been collected. This means that t rather than Z is used.

Step 3: Select the Level of Significance: The next task in classical hypothesis testing is to place a probability on the likelihood of a sampling error. As mentioned earlier, committing a Type I error and rejecting a null hypothesis as false when it is actually true is generally considered serious. Therefore, the usual procedure is to select a fairly low significance level (α) such as .10 or .05. Then, the conclusion is specified in terms of the level of significance of the result. In classical hypothesis testing, a null hypothesis may be rejected at the .05 level, which is the same as saying the statistical test is significant at the .05 level. This would mean that there is a 5% chance that a Type I error has occurred, and it is only 5% likely (1 chance in 20) that the null hypothesis has been improperly rejected because of random sampling error.

For many geographic research problems, an extremely stringent significance level is not necessary. In general, the choice of significance level depends on the nature of the problem and the effects of the decision. With some problems, the consequences of sampling error may be severe, and in these instances, a very low significance level is required. With the Middletown household energy cost, using a commonly accepted, "conventional" significance level such as α = .05 provides sufficient precision.

Step 4: Delineate Regions of Rejection and Non-rejection of Null Hypothesis: Once a significance level has been selected, this value is used to create the regions of rejection and non-rejection of the null hypothesis (fig. 9.1). For our example, the total area in which H0 is rejected, as represented by the significance level, encompasses 5% (α = .05) of the area under the curve. This area of rejection can be distributed in one of two ways. In case 1, the alternate hypothesis is non-directional, so the shaded rejection area of H0 is distributed equally between the two tails of the curve. With this two-tailed format and α = .05, 2.5% (α/2 = .025) of the total area is in each of the rejection regions or tails of the distribution. In case 2, the alternate hypothesis is directional (one-tailed), so the shaded rejection region of H0 is placed entirely on one tail of the distribution. In this diagram, the rejection region happens to be on the right tail, but the placement of the rejection region depends on the form of HA. In both the two-tailed and one-tailed cases, the unshaded area under the curve delineates test statistic values where H0 is not rejected. This area of non-rejection is 95% (1 − α = .95) of the total area under the curve. When α = .05 and HA is non-directional, each of the two tails contains α/2 or .025 of the area under the curve (fig. 9.2). This leaves .475 of the area on each side of the distribution (1 − α = .95 in total) in the non-rejection region.

One-Sample Difference of Means Z or t Test

Primary Objective: Compare a random sample mean to a population mean for difference

Requirements and Assumptions:
1. Random sample
2. Population from which sample is drawn is normally distributed
3. Variable is measured at interval or ratio scale

Hypotheses:
H0: µ = µ_H (where µ_H is the hypothesized mean)
HA: µ ≠ µ_H (two-tailed)
HA: µ > µ_H (one-tailed) or
HA: µ < µ_H (one-tailed)

Test Statistic:
    Z or t = (X̄ − µ) / σ_X̄

If sample size ≥ 30, use Z; if sample size < 30, use t

146 Part IV • Inferential Problem Solving in Geography
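Equations 9.2 and 9.3 can be combined into one short Python sketch (an editorial illustration; `one_sample_test_stat` is our own name):

```python
import math

def one_sample_test_stat(xbar, mu, s, n):
    """Equations 9.2 and 9.3: substitute the sample standard deviation s
    for the unknown population value.
      Z = (xbar - mu) / (s / sqrt(n))      if n >= 30
      t = (xbar - mu) / (s / sqrt(n - 1))  if n < 30
    Returns the label of the statistic and its value."""
    if n >= 30:
        return "Z", (xbar - mu) / (s / math.sqrt(n))
    return "t", (xbar - mu) / (s / math.sqrt(n - 1))

label, value = one_sample_test_stat(52.0, 50.0, 6.0, 25)  # hypothetical sample
```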

[FIGURE 9.2  Normal Distribution Values Associated with a Significance Level (α) = .05: Two-Tailed Case. The critical values −2.06 and 2.06 enclose a non-rejection area of 1 − α = .95 (.475 on each side of µ), with α/2 = .025 in each tail.]

The next task is to determine the critical t-table values that delimit the boundaries separating the rejection and non-rejection regions. To do this precisely, we must further explain the concept of degrees of freedom, which was briefly introduced in chapter 8. The degrees of freedom indicate the number of values in a calculation that are free to vary. Suppose, for example, we know that the mean of a data set having four numbers is 10 and the numbers are 7, 8, 14, and some unknown number. To determine the mean, we add all the values and divide by the number of observations, resulting in the formula (7 + 8 + 14 + x) / 4 = 10, where x is the unknown number. With some algebra we can calculate that x = 11. In this simple example, the first three numbers are allowed to vary, but when they (and the mean, the known statistical parameter in this case) are fixed, the value of the fourth number cannot vary and has no freedom to move. This example therefore has 3 degrees of freedom.

Suppose now that we have a sample size of 20. If we know that the sample has a mean of 10, but do not know the values of any of the observations, then there are 19 degrees of freedom. All numbers must add up to a total of 200 (20 × 10). Once we know the magnitude of 19 of the 20 numbers, the magnitude of the 20th number is fixed (not free to vary), so we have 19 degrees of freedom.
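The four-number example can be verified with a line of arithmetic (a trivial sketch of the reasoning above):

```python
# Four numbers with mean 10; three are known (7, 8, 14). Fixing the mean
# forces the fourth value: (7 + 8 + 14 + x) / 4 = 10  ->  x = 40 - 29 = 11.
known = [7, 8, 14]
n, mean = 4, 10
x = mean * n - sum(known)      # the unknown number, 11
degrees_of_freedom = n - 1     # only n - 1 values were free to vary
```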
Degrees of freedom are a vital component when using the t-table to determine the critical test statistic value. There are many t-table distributions, and the one selected for a particular problem depends on the sample size. If our sample size is n, then the number of degrees of freedom is n − 1. For example, a sample of size 20 requires use of the t-table row with 19 degrees of freedom.

We return now to our problem of estimating average annual energy expenditure for Middletown households from a sample of 25. In the t-table (appendix, table D), the value that corresponds with a 0.95 level of confidence at 24 (n − 1) degrees of freedom is 2.06. Therefore, in this problem, a t-value of 2.06 defines the boundary separating the rejection and non-rejection regions, and a calculated t-value (test statistic) must have an absolute value of less than 2.06 to keep the null hypothesis from being rejected.

The following decision rule identifies and summarizes the regions of rejection and non-rejection of the null hypothesis for α = .05 and HA non-directional:

Decision rule:
If t < −2.06 or if t > 2.06, reject H0
Conversely, if −2.06 ≤ t ≤ 2.06, do not reject H0

[FIGURE 9.1  General Regions of Rejection and Non-rejection of Null Hypothesis: Significance Level (α) = .05. Case 1, two-tailed (non-directional format): rejection areas of α/2 = .025 in each tail and a non-rejection area of 1 − α = .95. Case 2, one-tailed (directional format): a single rejection area of α = .05 in one tail.]

Step 5: Calculate the Test Statistic: At this step of the hypothesis testing procedure, sample data are evaluated using the test statistic. In the Middletown sample, mean household annual energy cost was $1,810, with a sample standard deviation of $186. Substituting these sample

statistics into the difference of means t test gives the following:

    t = (X̄ − µ) / (s / √(n − 1)) = (1810 − 2000) / (186 / √24) = −5.00    (9.4)

The Middletown sample mean is 5.00 standard errors below the U.S. average. If household energy cost in Middletown had been more than the national norm of 2000, the calculated test statistic would have been positive. This calculated value is now compared with the critical t-table values determined in step 4 to reach a final decision.

Step 6: Make Decision Regarding Null and Alternate Hypotheses: All the information is now available to make a decision regarding rejection or non-rejection of the null hypothesis. The calculated test statistic is t = −5.00. The critical t-table values are −2.06 and 2.06. Since the calculated test statistic is not between the critical table values (−2.06 < t < 2.06), the null hypothesis should be rejected. The conclusion is that mean annual household energy cost in Middletown differs significantly from the mean annual household energy cost nationally. The steps in the classical hypothesis testing procedure for the Middletown example are summarized in table 9.5.

TABLE 9.5  Summary of Classical/Traditional Hypothesis Testing: Middletown Household Energy Cost Example

Step 1: H0: µ = 2000 and HA: µ ≠ 2000
Step 2: One-sample difference of means t test selected as test statistic
Step 3: α = .05
Step 4: If t < −2.06 or if t > 2.06, reject H0
        If −2.06 ≤ t ≤ 2.06, do not reject H0
Step 5: Calculate t (from random sample) = −5.00
Step 6: Since t < −2.06, reject H0

P-Value or Prob-Value Hypothesis Testing Procedure

The formal multistep procedure of classical hypothesis testing provides a logical basis and excellent theoretical underpinning for all inferential decision making. For example, all of the terminology just presented (null and alternate hypotheses, one-tailed and two-tailed tests, Type I and Type II errors, etc.) is fully valid and effectively carries over to the p-value method of hypothesis testing. However, the usefulness of the results from classical analysis is limited in some important ways. First, a specific significance level must be selected to delineate the regions of rejection and non-rejection of the null hypothesis. This a priori selection of α is often arbitrary and may lack a clear theoretical basis. A significance level of .05 or .10 is often chosen because these are the "conventional" probabilities commonly provided in statistical tables. Second, the final decision regarding the null and alternate hypotheses is binary in nature: either H0 is rejected or not rejected at that arbitrary significance level. This type of conclusion provides only limited information about the calculated test statistic. We rarely use classical hypothesis testing today for statistical problem solving; instead, the p-value approach is commonly employed in scientific research.

The more flexible p-value method of hypothesis testing provides additional valuable information. With this approach, the exact significance level associated with the calculated test statistic value is determined. That is, the p-value is the exact probability of getting a test statistic value of a given magnitude, if the null hypothesis of no difference is true. This can generally be interpreted as the probability of making a Type I error. Unnecessary pre-selection of a "standard" significance level (α) is avoided, and decisions to reject or not reject H0 at that arbitrary level need not be made.

In the Middletown household energy cost example, the decision rule was to reject the null hypothesis if t < −2.06 or t > 2.06 and to not reject H0 if −2.06 ≤ t ≤ 2.06. The calculated test statistic value of t = −5.00 led to the conclusion that the mean annual energy cost of Middletown households was significantly different from mean national household energy cost, and the null hypothesis was rejected.

This conclusion is of limited use. All that has been decided is that we can be at least 95% certain that the Middletown household annual average energy cost is different from the national average. However, knowing the exact significance level associated with the specific sample mean or calculated test statistic value of −5.00 would be more informative. If the null hypothesis had been rejected on the basis of the 1810 sample mean, what would be the exact significance level and likelihood that a Type I error had been made? The precise probability of making a Type I error cannot be determined with the classical approach.

We can assess the Middletown sample more informatively using the p-value approach. The critical values that separate the regions of rejection and non-rejection of the null hypothesis are now based on the location of the particular sample mean (X̄ = 1810) relative to the population mean (µ = 2000). The region of non-rejection is centered on µ = 2000, and the absolute difference between X̄ and µ, |X̄ − µ| = 190, is used to establish the width of the interval on either side of µ. Thus the non-rejection region has an upper bound of 2190 (µ + 190) and a lower bound of 1810 (µ − 190). The rejection regions occupy the extremes of the distribution, outside the upper and lower bounds of the confidence interval.

The proportion of the total area under the normal curve lying within the rejection region(s) represents
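Steps 4 through 6 for the Middletown example can be sketched in a few lines of Python (an editorial illustration using the critical value 2.06 from the t-table discussion; variable names are ours):

```python
import math

# Middletown sample: n = 25 households, mean $1,810, standard deviation $186;
# hypothesized national mean $2,000; critical t-value 2.06 (alpha = .05, df = 24).
xbar, mu, s, n = 1810, 2000, 186, 25
t_critical = 2.06

t_stat = (xbar - mu) / (s / math.sqrt(n - 1))   # equation 9.4, about -5.00
if t_stat < -t_critical or t_stat > t_critical:
    decision = "reject H0"
else:
    decision = "do not reject H0"
```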

the p-value, and most statistical software packages complete these steps:

1. The test statistic for t is calculated.
2. The probability or relative area under each tail of the normal curve is determined for that t-value.
3. The rejection area is determined by subtracting the probability (of step 2) from .5000.
4. This area is doubled if a non-directional (two-tailed) alternate hypothesis is used.

This procedure is completed with the Middletown example of mean annual household energy expenditure. The difference of means t test statistic is calculated:

    t = (X̄ − µ) / (s / √(n − 1)) = (1810 − 2000) / (186 / √24) = −5.00    (9.5)

This statistical test value is then used to determine a probability or area under the t-model curve. When t = −5.00 and the degrees of freedom are 24, the Student's t distribution table (appendix, table C) does not list a value because it is so far out in probability that it is beyond the table. You may safely assume that the p-value is much less than .005. Computer software does not have a problem with this issue and calculates the exact p-value of .000.

How can this p-value be interpreted? If the decision is made to reject H0, the significance level equals the p-value of .000, representing a .000% chance of a Type I error. In this situation, the likelihood of making an error is extremely low, so the clear and logical decision is to reject the null hypothesis. In the Middletown energy cost example, a p-value of .000 can also be interpreted this way: you could take 1000 independent random samples of 25 Middletown households, but not even 1 time out of 1000 would you expect to get a test statistic value having an absolute magnitude of 5.00, if there really is no difference between Middletown and national average annual household energy expenditures.

Looked at from another perspective, the p-value is a probabilistic measure of the belief or conviction that the decision not to reject the null hypothesis is correct. For example, a p-value relatively close to 1 indicates a high degree of trust in the validity of the null hypothesis, whereas a p-value very close to 0 suggests that little or no faith should be placed in the null hypothesis. In the Middletown example, the p-value of .000 indicates that the null hypothesis should be rejected, and the Middletown mean annual household energy cost of $1810 is significantly different from the nationwide average annual household energy cost of $2000.

Use of the p-value approach, however, does not provide an excuse to interpret results subjectively or avoid making decisions. Just because computers produce exact p-values instantaneously, you are still responsible for interpreting the statistical and geographic meaning of the p-values in a consistent and rational way. P-values report what the sample data reveal about the credibility of a null hypothesis, but still demand the same stringent rules of inference as required in classical hypothesis testing.

Since the p-value method offers many advantages over classical hypothesis testing, all statistical tests presented in the remainder of this text will use the p-value method. In the statistical analysis of all the geographic problems that follow, associated p-values will be reported without showing all steps in their manual calculation.

9.2 ONE-SAMPLE DIFFERENCE OF PROPORTIONS TEST

Hypotheses can also be formed to study the difference between a sample proportion and a population proportion. The test statistic for a difference of proportions test is sometimes called the Z test for proportions or one-proportion Z test. Like the difference of means Z test, the normal distribution is used.

In one of the examples from the last chapter (random or systematic sample estimate of proportion), Middletown planners estimated the proportion of all owner-occupied housing units in the community that are "underwater." Remember this term refers to homeowners experiencing negative equity, with the amount still owed on their mortgage being greater than the current value of their home. Eighty-five Middletown owner-occupied housing units with current mortgages were sampled, and 25 of them were "underwater," with negative equity. This is a sample proportion of .294.

In the last chapter, we were interested in placing a confidence interval around this best sample estimate. Now, we want to compare the Middletown proportion of owner-occupied housing units that are underwater with the national proportion. Therefore, a one-sample difference of proportions test is appropriate, since we have one sample proportion (from Middletown) to compare with a population (the nationwide proportion). RealtyTrac reports that, nationwide, 22.8% (.228) of all owner-occupied housing units carrying a mortgage were underwater in the fourth quarter of 2011 (p_H = .228). Middletown officials are concerned that the proportion of housing units underwater in their community (p) is greater than the national proportion. Since the direction of difference is stated, this is a one-tailed or directional alternate hypothesis. The corresponding null hypothesis would be that the Middletown proportion of housing units underwater is not greater than the nationwide proportion. Quite simply, we are asking if the Middletown sample proportion (.294) is greater than the national proportion (.228). That is:

    H0: p = p_H
    HA: p > p_H (directional)
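The four manual steps can be sketched with the standard normal curve, as the table-based procedure above does (an editorial illustration; statistical software would instead use the t distribution with 24 degrees of freedom, which gives a similarly tiny value):

```python
from statistics import NormalDist

def two_tailed_p(test_stat):
    """Steps 2-4 of the manual procedure: find the tail area beyond the
    absolute value of the test statistic, then double it for a
    non-directional (two-tailed) alternate hypothesis."""
    tail = 1 - NormalDist().cdf(abs(test_stat))  # area in one tail
    return 2 * tail

p_value = two_tailed_p(-5.00)  # far below .005, reported as .000
```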

The appropriate test statistic is the one-sample difference of proportions test:

    Z = (p − p_H) / σ_p    (9.6)

where σ_p = standard error of the proportion

The standard error of the proportion (σ_p) is the standard deviation of the sampling distribution of proportions:

    σ_p = √(p_H (1 − p_H) / n) = √((.228)(.772) / 85) = .0455    (9.7)

therefore:

    Z = (p − p_H) / σ_p = (.294 − .228) / .0455 = .066 / .0455 = 1.45

If the null hypothesis is rejected when Z = 1.45, the exact associated significance level and p-value is .0735 (fig. 9.3). This indicates a 7.35% chance that a Type I error has been made if H0 is rejected. This is a borderline situation. There is a 92.65% chance that Middletown does indeed have a greater proportion of housing units "underwater" than the national proportion, but it is not a totally definitive result. How Middletown officials decide to respond to this statistical outcome, and whether they decide to develop any housing policy or program, are questions they have to resolve for themselves.

[FIGURE 9.3  One-Sample Difference of Proportions Test: Probability that Middletown Proportion of Households "Underwater" Is Greater than the National Proportion. The sample proportion (.294) lies Z = 1.45 standard errors above the national proportion (.228); the area beyond it, the one-tailed p-value, is .0735 (with .4265 between Z = 0 and Z = 1.45).]

The various one-sample difference tests are related, which we can see in the similar structure of the one-sample test statistics (table 9.6). A general format is common to each of these difference tests. Note that the numerator of each test statistic is the difference between the sample statistic and the hypothesized population parameter value. The denominator is the standard deviation of the sampling distribution, which is also referred to as the standard error of the sample statistic.

TABLE 9.6  The Similar Structure of One-Sample Difference Tests

Difference Test                               Test Statistic
Difference of means,* large sample (n ≥ 30)   Z = (X̄ − µ) / σ_X̄ = (X̄ − µ) / (s / √n)
Difference of means,* small sample (n < 30)   t = (X̄ − µ) / σ_X̄ = (X̄ − µ) / (s / √(n − 1))
Difference of proportions                     Z = (p − p_H) / σ_p = (p − p_H) / √(p_H (1 − p_H) / n)

* Population standard deviation (σ) is unknown. Best estimate of σ:  s = √(Σ(x_i − X̄)² / (n − 1))

General Format of One-Sample Difference Test:

    Z or t = (sample statistic − hypothesized population parameter value) / (standard deviation of sample statistic, i.e., the standard error)

One-Sample Difference of Proportions Test

Primary Objective: Compare a random sample proportion to a population proportion for difference

Requirements and Assumptions:
1. Random sample
2. Variable is organized by dichotomous (binary) categories

Hypotheses:
H0: p = p_H (where p_H is the hypothesized proportion)
HA: p ≠ p_H (two-tailed)
HA: p > p_H (one-tailed) or
HA: p < p_H (one-tailed)

Test Statistic:
    Z = (p − p_H) / σ_p
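Equations 9.6 and 9.7 and the one-tailed p-value for the housing example can be sketched as follows (an editorial illustration; the normal-curve call reproduces the .0735 area shown in figure 9.3):

```python
import math
from statistics import NormalDist

n = 85
p_sample = 25 / n      # Middletown sample proportion, about .294
p_H = 0.228            # hypothesized (national) proportion

sigma_p = math.sqrt(p_H * (1 - p_H) / n)   # equation 9.7, about .0455
z = (p_sample - p_H) / sigma_p             # about 1.45
p_value = 1 - NormalDist().cdf(z)          # one-tailed (HA: p > p_H), about .0735
```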

9.3 SELECTED ISSUES IN INFERENTIAL TESTING

So far in this chapter, attention has been focused on a related set of one-sample difference tests. However, there are numerous other types of inferential tests that can be applied to help solve a wide variety of geographic problems. Before continuing, however, some basic questions need to be addressed: Under what general circumstances are inferential statistics appropriate? More specifically, can or should an inferential test be used when one has a total population (as in a GIS-based database) and not a sample from which to infer to a population? How does one determine which of the many available inferential techniques is appropriate and should be selected for a particular geographic problem? That is, what are the dimensions and choices that need to be made, and should the questions and answers be in any sequence? Under what circumstances should a parametric or nonparametric statistical test be chosen, and are there situations where it doesn't matter or situations where both might be applied?

All inferential techniques have certain general characteristics and assumptions in common. As stated earlier, the general goal of hypothesis testing is to make inferences about the magnitude of one or more population parameters based on sample estimates of those parameters. When applying any inferential test statistic, it is assumed that a random sample has been drawn from a population. Often, however, other unbiased types of sampling (such as systematic sampling) are also valid for inferential hypothesis testing. Recall from chapters 7 and 8 that an element of randomness must be included in the sample procedure, whatever specific type of sample is selected. If multiple samples are required for a particular problem, it is always assumed that each sample is drawn separately and independently. Suppose a geographer wishes to compare the average size of pebbles on two beaches. This would be done by taking a random sample of pebbles from one beach and a separate and independent random sample of pebbles from the other beach. (An exception to the assumption of independence occurs if a matched-pair or dependent-sample difference test is used, examples of which will be seen in chapter 10.)

Artificial versus Natural Samples and Inferential Testing

Beyond these general characteristics, geographers have actively debated the circumstances for appropriate use of inferential testing in problem solving. One dimension of the discussion centers on the difference between artificial and natural sampling. In an artificial sample, the investigator draws an unbiased, representative sample from a statistical population and then is able to infer certain characteristics about the population based on the sample data. No one questions the appropriateness of inferential techniques in geographic problems or situations in which a proper artificial sample has been taken.

Opinion differs on whether inferential statistics should be applied to geographic data sets not obtained from artificial sampling. Geographers often wish to analyze spatial patterns or data sets that comprise complete enumerations or total populations. In these situations, are inferential procedures ever appropriate? Some geographers suggest that inferential statistics may be permitted when using a data set considered a "natural sample." In natural sampling, the "natural" or real-world processes that produce the spatial pattern under analysis contain random components. Suppose a geographer wishes to analyze a data set showing the pattern of all hurricane "landfalls" along a coastline during the last century. The landfall site of a hurricane is partly a function of prevailing global water currents and wind patterns (nonrandom or systematic influences), but is also affected by a complex variety of meteorological processes associated with that particular hurricane (random influences). It is possible to argue then that the observed pattern of hurricane landfall sites is actually a "natural sample" from the population of all possible landfall locations that could have occurred.

The pattern of state-level 2000–2010 population change (shown in fig. 1.7) is another example of a natural sample. This choropleth map pattern is partly the result of such nonrandom or systematic factors as climate and job opportunities related to the location of natural resources, and partly the result of numerous individual and family decisions to migrate from one state to another (some of which could be random).

If inferential procedures are applied to natural samples such as these, the results must be interpreted with extreme caution. Descriptive summaries of natural samples are certainly appropriate, but you must take care to avoid making improper inferential statements. Issues regarding the application of inferential statistics to natural samples are quite complex and continue to generate considerable discussion and controversy in applied statistics. Those interested in pursuing these arguments further (particularly from the geographer's perspective) are directed to the references at the end of this chapter.

GIS-based Total Enumeration and Inferential Testing

Another related issue that geographers face when considering the use of inferential statistics, especially those making extensive use of GIS technology, is differentiating between a sample and a total enumeration within a data set. You will recall that the purpose of inferential statistics is to make inferences about a population based on a statistical sample. Inferential tests allow

us to understand the degree to which a sample reflects 2. Take a sample of the larger dataset to allow the use
the characteristics of the total population, taking sam- of inferential statistics. Oftentimes, GIS datasets
pling error into account. have tens or hundreds of thousands of records as in
Oftentimes, especially in GIS-based applications, the case of a countywide parcel map with assessed val-
you might already have what might be considered a total ues. A geographer can sample a smaller subset from
population. For instance, a county-wide GIS may have that dataset and perform inferential statistics. The
the assessed value of every property in the county. Simi- advantages of this approach are that you will have a
larly, a weather monitoring station will have the rainfall sample, the sampling error will be known theoretically
totals of every day of the year, and not just a sample of through application of the Central Limit Theorem,
days. Finally, some data like that illustrated for life expec- and inferential tests for differences can be conducted.
tancy may only include a single data value for each coun- While that solves the sampling problem, it raises other
try-in many cases it is unclear if these data represent a concerns regarding the wisdom in throwing out what
total enumeration or a sample. others would consider perfectly good data. In our own
When a total enumeration of data exists, one can review of this question we found that statisticians were
certainly compare the descriptive values. That is, the rather evenly split on the merit of this approach.
annual total rainfall in one city may be 1.24" higher than 3. Treat the population as though it is a sample and
the total rainfall in another city, or the average assessed perform inferential tests. One might argue that many
value of property in one community is $1,267 higher of the attributes stored in a GIS have inherent error
than the assessed value of property in another commu- built in. For example, the amount of rain captured at
nity. A non-geographic example is the test scores illustrated in table 8.6. In reality, the high school probably maintains records for every student's exam. Therefore, descriptive statistics regarding the average test scores can be computed using the entire population.

In cases where the total population is known, you might simply decide to evaluate the descriptive statistics and perform no additional tests. The simple reason for this is that the total population provides us with the actual descriptions, whether it be property values, rainfall, or test scores. Also, inferential statistics assume that there is a sample with a corresponding sampling error. Using the total population masked as a sample violates assumptions of the inferential tests.

Nonetheless, simply knowing that one city has 1.24 more inches of rainfall than another city, or that one community's property value is $1,267 more than another community's, may be unsatisfying for certain policy decisions. In many cases, the geographer is really asking whether the differences between two groups are explained by the natural variation in the data itself. For example, when the average property values are around $500,000 in two communities, a difference of $1,267 may not represent a significant difference. In this case, the differences in the values may simply be explained by the natural variation in the property values throughout both communities.

When faced with the common situation of a GIS dataset having what is considered the total population, you have a few reasonable options:

1. Ignore the use of inferential statistics and make extensive use of descriptive statistics. Remember that descriptive statistics can tell us a significant amount about one or more datasets. As previously discussed, the coefficient of variation (CV) provides great power in comparing two different datasets while also considering the variability in the data.

a monitoring station has some inherent error caused by wind direction or other environmental conditions, while the assessed value of a property may be subject to an assessor's apparent mood that particular day. Similarly, for some data, like life expectancy or obesity discussed in chapter 1, it is impossible to say that every person who has died in a country had their age recorded, or every person in a state has been evaluated for obesity.

Obviously there are many issues that a geographer using GIS data must consider when attempting to apply statistical inferences. At a minimum, you should feel comfortable exploring descriptive statistics. If you choose to apply an inferential test based on the above scenarios, take great care in considering the applicability of the test for the given data. In most cases, if an inferential test is used, you can evaluate the results to make thoughtful observations but refrain from a strict adherence to a p-value to represent statistical significance.

Selecting a Suitable Inferential Test

We need to know more than the general characteristics of a particular problem to select an appropriate inferential test. Because of the idiosyncratic nature of geographic problems, selecting the single best statistical procedure is often difficult. To complicate matters, the same geographic problem can be structured or organized in different ways, or data can be collected and correctly analyzed statistically in various ways. You should consider this discussion as providing general guidelines, not a rigid set of fixed rules.

However, determining the structure of a problem and the organization of data in that problem will almost certainly direct you to a particular technique or set of appropriate techniques. To help you make an appropriate selection, we provide a logical organizational framework
152 Part IV • Inferential Problem Solving in Geography

in which to place each of the many inferential techniques presented in this text.

The fundamental dimensions of our organizational framework are shown in table 9.7. The only specific inferential tests covered so far in the text are the one-sample difference tests presented earlier in this chapter, and they are placed in the appropriate cell of the table. At the present time, all of the other cells in this table are left blank. After we have presented a number of other inferential tests in the following chapters, this table is provided again in the final (epilogue) chapter with all inferential tests covered in the text placed in their proper location.

The columns of the table represent the type of question being asked or being investigated. At the most basic level, statistical inference is concerned with estimating differences, similarities, or relationships between a sample (or

TABLE 9.7
Organizational Structure for Selection of Appropriate Inferential Test

Questions about differences

| Level of measurement | One sample | Dependent sample (matched-pairs) | Two samples | Three or more (k) samples |
| Nominal scale (categorical variables) | | | | |
| Ordinal scale | | | | |
| Interval/ratio scale | One-sample difference of means test; one-sample difference of proportions test | | | |

Questions about similarities

| Level of measurement | Questions about strength and direction of relationship between two variables | Questions about form or nature of relationship between one dependent and one independent variable | Questions about form or nature of relationship between one dependent variable and two or more independent variables |
| Nominal scale (categorical variables) | | | |
| Ordinal scale | | | |
| Interval/ratio scale | | | |

Questions about explicitly spatial relationships

| Nominal scale (categorical variables) | |
| Ordinal scale | |
| Interval/ratio scale | |

(All cells other than the one-sample difference tests are left blank at this point in the text.)
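The selection logic embodied in table 9.7 can be sketched as a simple lookup keyed on the type of question and the level of measurement. This is purely an illustrative sketch (the dictionary, key strings, and function name are hypothetical, not from the text), with only the one-sample cell covered so far filled in:

```python
# Illustrative sketch of table 9.7 as a lookup keyed on
# (type of question, level of measurement). Only the cell covered
# so far in the text is populated; later chapters fill in the rest.
TABLE_9_7 = {
    ("one-sample difference", "interval/ratio"):
        "one-sample difference of means test; "
        "one-sample difference of proportions test",
}

def suggest_test(question: str, level: str) -> str:
    """Look up the inferential test placed in one cell of table 9.7."""
    return TABLE_9_7.get((question, level), "cell not yet covered in the text")

print(suggest_test("one-sample difference", "interval/ratio"))
print(suggest_test("two-sample difference", "ordinal"))
```

As later chapters introduce further tests, each would occupy its own (question, level) cell in the same way.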
Chapter 9 • Elements of Inferential Statistics 153

set of samples) and a hypothesized value of a population parameter (or set of parameters). However, problems are organized or structured differently according to the type of difference, similarity, or relationship being examined.

The first four columns in the top portion of table 9.7 refer to various types of questions about differences. For example, the problems considered earlier in this chapter were concerned with the magnitude of difference between a single sample and a hypothesized population parameter. These problems are collectively referred to as one-sample difference tests (first column in the top portion of table 9.7). In certain special cases, the data set under investigation consists of one sample of observations collected for two or more different variables or at two or more different time periods. In these circumstances, a dependent-sample (matched-pairs) difference test is appropriate (second column in the top portion of table 9.7). Other inferential tests consider the magnitude of difference between two or more independently drawn random samples, and hence these tests are known as two-sample, three-sample, or k-sample difference tests (columns 3 and 4 in the top portion of the table).

The columns in the middle portion of table 9.7 respectively refer to: questions about the strength and direction of relationship between two variables; questions about the form or nature of relationship between one dependent and one independent variable; and questions about the form or nature of relationship between one dependent variable and multiple (two or more) independent variables. Notice that the types of questions asked in the columns of the middle portion of the table are fundamentally different from those in the top portion of the table. Rather than asking if a sample is significantly different from a population, or if two or more samples are significantly different from one another, the focus in the middle segment of the table is on how similar (strong or weak) the relationships are between two or more variables and on a description of the form or nature of those relationships.

The bottom portion of the table considers questions about patterns and relationships that are explicitly spatial. Specially designed tests are available to analyze a point pattern or area pattern to determine if it is random or significantly different from random (more clustered or more dispersed than random). Geographers face a number of particular concerns with the statistical analysis of data that are explicitly spatial, and an entire section of the text (Part Five) deals with inferential spatial statistics.

The rows of table 9.7 represent different levels of measurement. Depending on the nature of the geographic problem under investigation, data may be scaled at the nominal (categorical), ordinal, or interval/ratio level of measurement. Selection of an appropriate inferential test depends in part on this level of measurement.

Statisticians traditionally divide inferential techniques into two categories: parametric and nonparametric. Inferential tests that require knowledge about population parameters and make certain assumptions about the underlying population distribution are termed parametric tests. For example, in the one-sample difference of means tests (Z or t) and the one-sample difference of proportions test (Z), population parameters such as μ, σ, and p are included in the test statistic formulas. A commonly applied assumption is that the population is normally distributed with mean μ and standard deviation σ. Other assumptions that apply for some multi-sample parametric techniques will be discussed as needed.

Another group of statistical tests requires no such knowledge about population parameter values and has fewer restrictive assumptions concerning the nature of the underlying population distribution. This group of tests is termed nonparametric or distribution-free. Parametric tests require sample data measured on an interval/ratio scale, because their test statistics use parameters such as the mean or standard deviation, descriptive statistics that can only be calculated from interval/ratio data. Recall that a mean or standard deviation should not be calculated from nominal or ordinal data. Nonparametric tests, on the other hand, do not require such "interval/ratio" statistics. Some nonparametric tests are specifically designed to be used with ordinal data, whereas others are designed to be applied effectively to nominal or categorical data (or combinations thereof). For data measured at a nominal or ordinal scale, only a nonparametric test can be applied. Different strategies may be possible, however, with data at the interval/ratio scale:

1. Run only a parametric test. This approach is appropriate if there is virtually no doubt that the requirements and assumptions needed to use the test have all been met.

2. Run only a nonparametric test. If there is reason to believe that one or more of the parametric test assumptions is moderately or severely violated, then the results from running the parametric test are likely to be invalid or inaccurate. Therefore, the set of interval/ratio data may be converted ("downgraded") to an ordinal or nominal scale, and a corresponding appropriate nonparametric test can be run.

3. Run both a parametric and a nonparametric test. If there is uncertainty about the degree to which requirements and assumptions are violated, or an assumption is violated only slightly (for example, a sample is drawn from an underlying population distribution that is only slightly non-normal), then it may be appropriate to run both a parametric and a comparable nonparametric test.

Additional insights may be gained by running both tests on the same data and comparing the resulting p-values. This "pairing" strategy may be particularly useful

when conducting an exploratory or investigative geographic analysis. This idea is demonstrated several times in later chapters.

The number of geographic problems that can be examined using inferential statistics is limitless. In the remainder of the text, inferential techniques are applied to a number of real-world geographic situations. Each inferential technique or set of techniques is presented using a common format:

1. Presentation of the rationale, purposes, and objectives of the technique. A set of appropriate geographic problems that could be solved using the technique is listed, and an overview of the key assumptions and required conditions is provided.

2. Presentation of basic formulas and computations associated with the technique.

3. Discussion of the example geographic problem under examination and the application of the inferential technique. This step has been considerably strengthened in this edition of the text, as we constantly strive to include more practical geography in support of the use of statistics.

4. Evaluation of the use of the technique for a particular geographic problem and discussion of both inferential issues and geographic factors that might have affected the test results.

KEY TERMS

artificial and natural samples, 150
classical (traditional) hypothesis testing, 142
degrees of freedom, 146
hypothesis testing, 141
hypothesized mean, 143
null and alternate (alternative) hypotheses, 143
one-sample difference of means test, 144
one-sample difference of proportions test, 148
one-tailed (directional) hypothesis, 143
parametric and nonparametric (distribution-free) tests, 153
p-value hypothesis testing, 147
rejection and non-rejection regions, 146
two-tailed (non-directional) hypothesis, 143
Type I and Type II errors, 144

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Explain null and alternate hypotheses, directional and nondirectional hypothesis testing formats, Type I and Type II errors, rejection and non-rejection regions, degrees of freedom, significance levels, and test statistics.

2. Recognize when it is appropriate to use a directional or nondirectional alternate hypothesis.

3. Understand and apply the steps necessary to test hypotheses using the classical/traditional approach.

4. Understand and apply the p-value approach to hypothesis testing and evaluate its advantages over the classical approach.

5. Know when to apply the difference of means Z or t test or the difference of proportions Z test in comparing a sample statistic to a population parameter.

6. Recognize the difference between artificial and natural sampling approaches.

7. Identify the statistical options available when faced with the situation of having a GIS dataset considered to be a total population.

8. Understand the issues involved in selecting the appropriate inferential statistical test. Issues include: the type of question investigated, the level of measurement used, and whether a parametric and/or nonparametric test is appropriate.

REFERENCES AND ADDITIONAL READING

Burt, J. E., G. M. Barber, and D. L. Rigby. Elementary Statistics for Geographers. 3rd ed. New York: Guilford Press, 2009.
Court, A. "All Statistical Populations are Estimated from Samples." Professional Geographer, Vol. 24 (1972): 160-161.
Gould, P. R. "Is Statistix Inferens the Geographical Name for a Wild Goose?" Economic Geography (Supplement), Vol. 46 (1970): 439-450.
Hays, A. "Statistical Tests in the Absence of Samples: A Comment." Professional Geographer, Vol. 37 (1985): 334-338.
Marzillier, L. F. Elementary Statistics. Dubuque, IA: Wm. C. Brown, 1990.
Maxwell, N. P. "A Coin-Flipping Exercise to Introduce the P-value." Journal of Statistics Education, Vol. 2, No. 1 (1994).
Meyer, D. R. "Geographical Population Data: Statistical Description, Not Statistical Inference." Professional Geographer, Vol. 24 (1972a): 26-28.
Meyer, D. R. "Samples and Populations: Rejoinder to 'All Statistical Populations are Estimated from Samples.'" Professional Geographer, Vol. 24 (1972b): 161-162.
Siegel, S. and N. J. Castellan. Nonparametric Statistics for the Behavioral Sciences. 2nd ed. New York: McGraw-Hill, 1988.
Summerfield, M. "Populations, Samples, and Statistical Inference in Geography." Professional Geographer, Vol. 35 (1983): 143-149.
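The "pairing" strategy described in this chapter (run a parametric test and a comparable nonparametric test on the same data, then compare the resulting p-values) can be sketched with a one-sample example. The rainfall values and the hypothesized mean of 40 inches are invented for illustration, and the Wilcoxon signed-rank test is used here only as one common nonparametric analogue of the one-sample t test:

```python
# A minimal sketch (invented data) of the "pairing" strategy:
# run a parametric one-sample t test and a comparable nonparametric
# test on the same data and compare the p-values.
from scipy import stats

# Hypothetical annual rainfall sample (inches) and hypothesized mean
rainfall = [38.2, 41.5, 36.9, 44.1, 39.8, 42.3, 37.5, 40.6, 43.0, 38.9]
mu0 = 40.0

# Parametric: one-sample difference of means t test
t_stat, p_t = stats.ttest_1samp(rainfall, popmean=mu0)

# Nonparametric analogue: Wilcoxon signed-rank test on deviations
# from mu0 (tests the median rather than the mean)
w_stat, p_w = stats.wilcoxon([x - mu0 for x in rainfall])

print(f"t test:   t = {t_stat:.3f}, p = {p_t:.4f}")
print(f"Wilcoxon: W = {w_stat:.1f}, p = {p_w:.4f}")
```

If the two p-values lead to the same conclusion, the inference is strengthened; a large discrepancy hints that a parametric assumption (for example, normality) may be violated.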
Two-Sample and Dependent-Sample (Matched-Pairs) Difference Tests

10.1 Two-Sample Difference of Means Tests
10.2 Two-Sample Difference of Proportions Test
10.3 Dependent-Sample (Matched-Pairs) Difference Tests

The previous chapter introduced the process of hypothesis testing using both the classical and p-value approaches. The focus in that chapter was on the application of one-sample difference tests, where a sample statistic was compared to a population parameter using both means and proportions. Geographers encounter many other situations where the objective is to determine whether a significant difference exists between two samples. If the sample statistics are significantly different, you can infer that the samples were drawn from truly different populations. However, if the sample statistics are not significantly different, it can be concluded that they were drawn from a single population.

This chapter is organized by the type of statistic being compared (mean or proportion) and the nature of the relationship between the samples (independent or dependent). Sections 10.1 and 10.2 expand on the methods discussed in the previous chapter for testing for differences between two independent samples. Independent samples occur when the items collected in the first sample are unrelated to (independent of) the items collected in the second sample. In section 10.1, differences between two sample means are tested for statistical significance using both parametric and nonparametric procedures. Section 10.2 considers the two-sample difference of proportions test, which is designed to compare two sample proportions for significant difference.

In other problems, the samples may not be independent. For example, you might measure each subject or location twice, before and after some type of treatment or event. The before and after measurements constitute a matched pair of observations and are considered dependent samples. In other geographic situations, two variables or indicators for the same sample of observations or locations are compared to see if the samples of matched pairs of observations are significantly different. Asking two types of questions of the same sample of participants could also constitute a matched pair. Parametric and nonparametric methods for comparing differences between dependent samples are examined in section 10.3.

10.1 Two-Sample Difference of Means Tests

Geographers use two primary methods to compare and test means from two independent samples for significant difference. The basic distinction separating the two methods is the requirements for parametric and nonparametric tests. If the sample data and population distributions meet the requirements of parametric testing, the appropriate independent-sample difference of means procedure would be either the Z or t test. For these methods, the sample data are measured on an interval/ratio scale and the samples are drawn from sufficiently large and normally distributed populations. Conversely, if the sample data are measured at the ordinal scale, or if the samples are too small or drawn from populations that are clearly not normal, a nonparametric procedure is

required. For two-sample difference of means problems format) or as HA: µI > µz (or HA: µI < µz), when the
that do not meet parametric requirements, the Wilcoxon direction of difference is hypothesized (one-tailed format).
Rank Sum W test or Mann-Whitney U test is used. Which of the two forms should be used in the pre-
ceding examples? In the water runoff problem, a one-
Two-SampleDifference of Means Z or t Test tailed (directional) format seems appropriate because
Suppose you want to compare surface water runoff natural surfaces in an undeveloped landscape logically
levels in two different environmental settings: (I) a natu- absorb more water, allowing more moisture to soak
ral landscape with little or no modification by human directly into the ground, resulting in a lower volume of
activity; and (2) an exurban setting partially modified water runoff. Conversely, precipitation falling on artifi-
with the development of several residential subdivisions cial surfaces in the exurban setting might not be absorbed
and their access roads. You might hypothesize that the as readily, resulting in highervolumes of runoff into gut-
presence of artificial surfaces like concrete and asphalt ters and drains along residential streets. In the subdivi-
affects the volume of surface water runoff, making the sion design problem, a two-tailed (non-directional)
amount different in the exurban versus natural settings. format is suggested if no logical, pre-established basis
To test for significant differences in runoff, you could exists for hypothesizing that either design type would
take spatially random samples measuring surface runoff produce a faster home value appreciation rate. In the
levels from both settings. The question being asked is problem examining patterns of Chinese fertility, a one-
whether the mean volume of runoff from the natural tailed format is appropriate because a particular direction
locations differs significantly from the mean volume of of difference makes sense; it is logical to hypothesize that
women in the coastal province have fewer children on
runoff at the exurban locations. If the two sample means
average than women in the more isolated interior prov-
are significantly different, you can logically infer that the
ince due to the increased one-child policy pressure placed
samples were taken from two distinct populations.
In another possible application, suppose you want to on coastal communities.
Similar to the one-sample difference of means proce-
determine if current home values differ between two
dure, the parametric two-sample difference of means test
adjacent residential subdivisions built at the same time
can take different forms depending on the size of the
and with similar construction costs, but with very differ-
ent design principles. One of the subdivisions used clus- samples and nature of the population variances. If the
populations are normally distributed with known vari-
ter zoning of homes on smaller lots, including such
ances and the sample sizes are large (n 1 and n2 greater
amenities as a tennis court, baseball diamond, playing than 30), the sampling distribution for the difference of
field/village common, and some open space. The other
means follows the normal (Z) distribution, and the test
subdivision used a conventional "cookie cutter" statistic is
approach to design with larger private lots and no com-
munity recreational amenities or open space. Twenty - -
years later, if a random sample of homes is taken from Z=X,-X2 (10.1)
each subdivision, you could determine if the properties in a 2,-2,
the cluster-zoned development appreciated in value at a
faster rate than those in the conventional subdivision. where X, = mean of sample 1
As another example, suppose you want to compare X 2 = mean of sample 2
fertility patterns in two different Chinese provinces. The a i,-i, = standard error of the difference of means
first is mostly an urban coastal province heavily exposed
to the one-child policies of the government over previous and
years, whereas the second is a rural interior province
with a large ethnic population that experienced Jess pres-
a 2,-2, -- (10.2)
sure from central government officials regarding the one-
child policy. Suppose independent random samples of
women aged 30 to 34 are chosen from each of the two Following the general format for difference tests, the
provinces, and the number of births for each woman numerator of the two-sample test statistic in equation
recorded. The question can then be asked: is there a sig- IO.I shows the actual or observed difference between the
nificant difference in the mean number of children per two sample means. Recall from chapter 8 that the sample
woman between these two provinces? mean (Jt}is the best estimate of the population mean
The null hypothesis when testing two independent (µ). Similarly, in this situation involving two samples, the
sample means for significant differences has the form H 0: difference of sample means (X, - X 2} is the best estimate
µ 1 = µ 2 , which is equivalent to Ho:µ 1 - µ 2 = 0. The alter- of the difference of population means (µ 1 - µ 2).
nate hypothesis is stated in two ways: HA: µ 1 * µ 2 , when The denominator of equation 10.1 is the standard
the direction of difference is not hypothesized (two-tailed error of the difference of means. This standard error

expression is an estimate of how much difference the two populations. In addition to its usefulness for prob-
between X, and X 2 is expected to occur due to sam- lems with unknown population variances, the t distribu-
pling and is considered a measure of the expected sam- tion also provides more accurate results than Z when one
pling error. Think of it this way-if two independent or both sample sizes are small (n Jess than 30).
random samples are taken from the very same popula- Two methods exist for using sample data to estimate
tion, the two sample means would likely be somewhat the standard error in the denominator of equation 10.3.
different, just because of the random effects of sampling. Selection of the appropriate method depends on the
Thus, the standard error of the difference of means is an assumed relationship between the two population vari-
estimate of the expected difference that should occur sim- ances. In the first instance, when the population variances
ply by chance. are assumed to be equal, (o} = a i), a pooled variance
What factors influence the size of the test statistic in estimate (PVE) is calculated as the weighted average of
equation 10.1?The magnitude of the actualorobserveddif- the two sample variances:
ference in sample means (the numerator) is divided by
the magnitude of the expecteddifference in sample means
(the denominator). If the actual difference in sample (10.4)
means is considerably greater than the expected differ-
ence in sample means, then the resultant test statistic Z or
tis large (either positive or negative) and not very close to The denominator of equation 10.4 represents the degrees
zero. A large test statistic leads to the inferential conclu- of freedom in the problem. When the two population
sion that the two sample means are more different than variances are unknown but assumed equal, the standard
could have happened due to chance if both samples are error estimate for equation 10.3 is written as
taken from the same population. The large difference
also leads to the complementary conclusion that the two
sample means are most likely taken from two distinct (10.S)
(different) populations, supporting the likelihood of the
alternate hypothesis. In problems where the population variances are
Conversely, if the actual difference in sample means assumed unequal(that is, if a 1 ~ a2 ), the pooled estimate is
(numerator of equation 10.1) is not much greater than not appropriate. In such cases, the sample variances are
the expected difference in sample means (denominator), substituted directly into the standard error portion of equa-
then the resultant test statistic is relatively small. The
small test statistic leads to the conclusion that the two
sample means are taken from the same population and
supports the validity of the null hypothesis. It is also pos-
sible for the actual difference in sample means to be Two-Sample
Difference
of Means
smaller than the expected difference in sample means, ZortTest
again resulting in the conclusion that the two sample
means are taken from the same population. Primary Objective: Compare two independent random
sample means for difference
In most geographic applications of a two-sample dif-
ference of means test, the test statistic for Z in equation Requirements and Assumptions:
JO.I cannot be used because the variances of the two 1. Two independent random samples
populations are unknown. In these situations, we must 2. Each population is normally distributed
estimate the standard error from the samplevariances 3. Variable is measured at interval or ratio scale
(s2 ), and the difference of means test will follow the tdis-
Hypotheses:
tribution. The test statistic for tis
Ho:µ,=p2
HA : µ, ;, JJ2 (two-tailed)
HA : µ, > p 2 (one-tailed) or
(10.3) HA : µ, < p 2 (one-tailed)

Test Statistic:
Although equation IO.3 has the same general appearance
as equation IO.I, the equations differ in the way the stan-
dard error (denominator) is estimated. In equation 10.1,
the standard error is derived directly from the known vari- If sample size ;;:30 and population variances are known,
ances of the two populations using equation IO.2. In use Z
If sample size < 30 and population variances are unknown,
equation 10.3, the standard error can only be estimated use t
from the variances calculated for the samples taken from
158 Part IV .A. Inferential Problem Solving in Geography

tion 10.2 as best estimates of the respectivepopulation vari- U statistic. Like the t test, the Wilcoxon and Mann-Whit-
ances. A separatevarianceestimate(SVE) is calculated: ney tests examine two independent samples for differ-
ences. Rather than using parameters like mean and
2 2 variance, however, these techniques use the ranks of sam-
SVE=CYx· ,- x·2 = !.i.+b. (10.6) ple observations to measure the magnitude of the differ-
ences in the ranked positions or locations between the two
sets of sample data. Although no form of distribution is
Since geographers rarely know population parame- assumed for the two populations, the procedure requires
ters (like µ or u), we must rely heavily on sample data to that the two distnbutions be similar in shape. This charac-
estimate population characteristics. This task supports teristic makes the Wilcoxon and Mann-Whitney tests
one of the central goals of inferential statistics discussed especially useful for problems with small samples drawn
earlier-the use of samples to produce unbiased esti- from populations that are not necessarily normal.
mates of population parameters. Thus, the method cho- To keep the explanation of the nonparametric two-
sen to estimate variance (pooled vs. separate) depends on sample difference tests simple, the discussion here
whether the population variances are assumed to be focuses on the Wilcoxon test procedure, which uses a
equal. In practice, researchers usually decide equality or "difference test" format similar to the two sample differ-
inequality of population variances by testing the corre- ence of means Z and t tests. The discussion includes a
sponding sample data. Sample variances are considered brief explanation of the direct relationships of the Mann-
the best estimators of population variances. Whitney and Wilcoxon test statistics.
To determine whether the variances of the two sam- In the Wilcoxon rank sum W test, the data from the
ples are significantly different, a test for equality of vari- two samples are combined and placed in a single ranked
ances is needed. A widely used approach is the F statistic set. When two or more values are tied for a particular
for the Levene test for equality of variances, derived by rank, the average rank value is assigned to each position.
computing a one-way analysis of variance (ANOVA) on The samples are then considered separately and the sum
the absolute deviations of each observation from its sam- of ranks ( W) calculated for each sample. Suppose the
ple mean. Details on the application of analysis of vari- sum of ranks for sample 1 is called W1, and the sum of
ance and the associated F test are discussed in the next ranks for sample 2 is called W2. If the two samples are
chapter. If the two sample variances are found to be sta- drawn from the same population, the ranks should be
tistically different, the population variances are assumed randomly mixed between the two samples, and the sum
to be unequal, and the separate variance method is used of ranks for each sample should be roughly equal when
to test for the difference in the means. However, if the
sample variances are not significantly different, the
pooled variance estimate is applied. Most computer pro-
grams provide the result of an equality of variance test, as
well as the t test statistic and p-values for both situations Wilcoxon
RankSumTest
(variances assumed to be equal and not equal).
Primary Objective: Compare two independent random
sample rank sums for difference
Wilcoxon RankSum W (Mann-WhitneyU)Tests
When testing the difference between two indepen- Requirements and Assumptions:

dent samples, the parametric two-sample difference of 1. Two independent random samples
2. Both population distributions have the same shape
means tests (Z or t) may not be appropriate for some geo-
3. Variable is measured at ordinal or downgraded from
graphic problems. Sometimes data are available only in interval/ratio scale to ordinal
ordinal or ranked form, making it impossible to calculate
sample means and sample variances. In other situations, Hypotheses:
you may be working with data measured at the interval- Ho : The distribution of measurements for the first
ratio level, but have reason to believe that the samples are population is equal to that of the second population
from populations that are not normally distributed. HA : The distribution of measurements for the first
population is not equal to that of the second
When populations exhibit moderate-to-severe deviation
population (two-tailed)
from normality, using a difference of means test raises HA : The distribution of measurements for the first
serious questions about the validity of such a parametric population is larger (or smaller) than that for the
procedure. When a parametric test is deemed inappropri- second population (one-tailed)
ate, a nonparametric test for two independent samples
Test Statistic:
often provides a better alternative.
The most widely used nonparametric alternatives for
the two-sample difference of means test are the Wilcoxon Sw
rank sum W test and the directly related Mann-Whitney
Chapter 10 • Two-Sample and Dependent-Sample (Matched-Pairs) Difference Tests 1S9

=
the sample sizes are equal. That is, if n 1 n2 , then W1
should be roughly equal to W2 . If the sample sizes are
not equal, but drawn from the same population, then
their respective sum of ranks should be proportionate to
their respective sample sizes. These findings would all W, ands,. represent the theoretical mean and standard
confirm the null hypothesis of no significant difference deviation of W;, respectively,and are determined totally by
between the two samples. However, if the sum of ranks the sample sizes. As shown in equation 10.9, only one
for the first sample ( W1) is quite different from the sum of standard deviation exists for W However, because we can
ranks for the second sample of similar size (W2), it is determine rank sums (W 1 and W2 ) for each sample, two
more likely that the two samples have been drawn from means can also be calculated using equation 10.8-one for
different populations, making confirmation of the null sample I (W,)and another for sample 2 (w2).
hypothesis Jess likely. The Wilcoxon statistic, W, is simply the value or
The Wilcoxon test uses a variation of the Z test to see magnitude of the sum of ranks of the group with the
if the sum of sample ranks is significantly different from smaller sample size. The Mann-Whitney statistic, U,
what it should be if the two samples are actually drawn which complements the Wilcoxon statistic, is determined
from the same population. The test statistic (Zw) for the by the number of times an observation from the group
two-sample Wilcoxon procedure is with the smaller sample size ranks lower than an observa-
tion from the group with the larger sample size. Because
the Mann-Whitney calculation is often more cumber-
(10. 7)
some than the Wilcoxon calculation, the Wilcoxon work-
table is the only one included in the geographic example.
Statistical software packages usually calculate both Wil-
=
where W; sum of ranks for sample i coxon and Mann-Whitney test statistics. Both test statis-
tic values are equivalent, in that they always provide the
-
W; =mean rank of W;=n;(n,+n2 2 +1) (10.8) same significance level (p-value) when applied to the
same set of data.
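To make the procedure concrete, the rank sum calculation in equations 10.7 through 10.9 can be sketched in a few lines of Python. This is an illustrative sketch only; the two small samples of participation rates are hypothetical values, not data from the text:

```python
def combined_ranks(values):
    """1-based ranks for all values, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + 1 + j + 1) / 2          # average of tied 1-based positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def wilcoxon_rank_sum_z(sample1, sample2):
    """Return the Wilcoxon W (rank sum of the smaller group) and its Zw statistic."""
    n1, n2 = len(sample1), len(sample2)
    ranks = combined_ranks(list(sample1) + list(sample2))
    w1, w2 = sum(ranks[:n1]), sum(ranks[n1:])
    w, n_small = (w1, n1) if n1 <= n2 else (w2, n2)   # smaller sample's rank sum
    w_bar = n_small * (n1 + n2 + 1) / 2               # equation 10.8
    s_w = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5       # equation 10.9
    return w, (w - w_bar) / s_w                       # equation 10.7

# Hypothetical participation rates for two small samples
w, z_w = wilcoxon_rank_sum_z([52, 61, 48, 70, 55], [47, 66, 58, 44, 63, 51])
```

For these eleven hypothetical values, W = 33 and Zw is about 0.55, far from any conventional rejection region; a statistical package's Wilcoxon or Mann-Whitney routine should report an equivalent p-value.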

Example: Two-Sample t and Wilcoxon Rank Sum - Female Labor Force Participation Rates, More-developed Countries versus Less-developed Countries

Many countries are experiencing profound changes in development as the result of an increasingly integrated global economy. A variety of economic, social, political, and demographic changes are taking place as less-developed countries attempt to advance through economic development and modernization. Measuring changes in modernization and development is a continuing and complicated process, potentially involving literally hundreds of different development indicators, as people across the globe seek to improve their standard of living and quality of life.

This rapidly changing situation offers geographers many opportunities for data analysis and research. An important target indicator involves women's progress towards equality in the global labor market. Have women made any recent advances in labor force participation? If so, what changes have occurred and where have they occurred? Suppose female labor force participation rate is selected as a target indicator for investigation. Does the participation level of women in the labor force vary by the economic level of the country in which they live, or is the relationship more complex? Are past disparities between men and women narrowing or widening with regard to job opportunities and the monetary and social gains of employment?

A recent (2012) World Development Report from the World Bank, Gender Equality and Development, examines a wide variety of problems and issues associated with gender differences and trends in employment. Over the past few decades, women have joined the labor market in increasing numbers; however, the global rate of change has been very gradual. Over the last thirty years the global rate of female labor force participation has increased only slightly, from 50.2% to 51.8%, indicating that nearly half of all adult women are neither employed nor seeking work.

The current spatial distribution of global female labor force participation shows a moderate to strong regional patterning (fig. 10.1). A great disparity in female labor force participation rates seems to exist across different less-developed regions. Several countries in Sub-Saharan Africa and East Asia have participation rates in excess of 70%. By contrast, various countries in the Middle East, North Africa, and South Asia have female participation rates below 40%.

To explore these spatial patterns of female labor force participation rates, we suggest the following process. Suppose we take two random samples of countries: one sample from more-developed countries (MDCs) and another from less-developed countries (LDCs). The same random samples of MDCs and LDCs are used for each of three years (1989, 1999, and 2009) so that we can evaluate trends in female labor force participation rates over the last couple of decades. We propose these two related inferential hypotheses: (1) "For each of
Part IV • Inferential Problem Solving in Geography

FIGURE 10.1
Female Labor Force Participation Rates, 2009 (world map; legend classes: 70.1%+, 60.1% to 70.0%, 50.1% to 60.0%, 40.1% to 50.0%, 0.1% to 40.0%, No Data)
Source: World Bank, World Development Report 2012: Gender Equality and Development

the three years, the average female labor force participation rate of MDCs is not different than the average female labor force participation rate of LDCs" and (2) "The more recent the year, the smaller the magnitude of difference between MDC and LDC female labor force participation rates." Taken together, these two hypotheses are attempting to conclude that there is no statistically significant difference between the average female labor force participation rates of MDCs and LDCs and that the LDCs are "catching up" or narrowing this already small disparity in female labor force participation as time goes by.

If we decide to run a parametric test, then the appropriate test would be the two-sample difference of means t test. (Note: we have more to say later about the appropriateness of selecting a parametric test for this particular problem.) Suppose a t test is applied to the sample data for each of the three years: 1989, 1999, and 2009. It is best to use a separate variance estimate (SVE), as there appears to be a much larger variation among LDC female labor force participation rates than among MDC rates. Table 10.1 summarizes all of the relevant sample statistics and basic steps involved in the t test calculations.

What information can we glean from the worktable, and what are the basic conclusions that can be drawn from this analysis? First, notice that for all three years the mean female labor force participation rate for the more-developed country sample is slightly larger than the corresponding mean for the less-developed country sample. For example, in 1989 the MDC sample mean is 50.2 and the LDC sample mean is 47.4. In all three years, however, the difference in means is quite small and found to be statistically insignificant. Again, looking at the 1989 results, the calculated t test statistic is 0.74 and the associated p-value is .462. Furthermore, as time progresses, the magnitude of the calculated t test statistic gets smaller and the associated p-value becomes larger. Using classical hypothesis testing terminology, the null hypothesis of "no difference" should not be rejected at any conventional significance level (such as .10 or .05) for any of the years. For example, the 1989 p-value of .462 provides the exact probability associated with making a Type I error. That is, if we conclude that there is a significant difference between the two sample means and reject the null hypothesis, then there is a .462 or 46.2% chance that the wrong or incorrect conclusion was reached. The logical decision in this case is to not reject (that is, accept) the null hypothesis.

The sequence of p-values (1989: .462; 1999: .528; 2009: .657) indicates that the magnitude of difference between the MDC and LDC means is decreasing over time. The hypothesis (educated guess) suggesting that the gap is narrowing between MDC and LDC female labor force participation rates appears correct.

Table 10.1 reveals yet another important finding: for each year of analysis, the standard deviation of the LDC sample is much larger than the standard deviation of the MDC sample. For instance, in 1989 the MDC standard deviation is only 10.0,

TABLE 10.1
Worktable for Two-Sample Difference of Means t Test: Female Labor Force Participation Rates (1989, 1999, and 2009)

Sample Statistics: 1989
                                        n     x̄       s
More-developed countries (sample 1)    15    50.2    10.0     t = 0.74
Less-developed countries (sample 2)    45    47.4    19.0     2-tailed p-value = .462

Sample Statistics: 1999
                                        n     x̄       s
More-developed countries (sample 1)    15    50.87    8.19    t = 0.64
Less-developed countries (sample 2)    45    48.8    17.2     2-tailed p-value = .528

Sample Statistics: 2009
                                        n     x̄       s
More-developed countries (sample 1)    15    52.60    7.82    t = 0.45
Less-developed countries (sample 2)    45    51.2    16.5     2-tailed p-value = .657

Separate variance estimate (SVE), shown for 2009:

σ̂x̄1-x̄2 = √(s1²/n1 + s2²/n2) = √((7.82)²/15 + (16.5)²/45) = √(4.077 + 6.05) = 3.18

t = (x̄1 - x̄2) / σ̂x̄1-x̄2 = (52.60 - 51.2) / 3.18 = 0.45

while the LDC standard deviation is nearly twice as large at 19.0. Furthermore, the difference in magnitude of standard deviation values widens considerably over time.

Let's explore the 2009 data further. Recall from chapter 3 that an individual value plot is a graph showing the distribution of all observations in a sample. We can display country-level observations in comparative individual value plots, with all sampled MDCs plotted in one column and all sampled LDCs in another (fig. 10.2). The 15 MDCs and 45 LDCs have nearly equal mean female labor force participation rates (52.6 and 51.2, respectively). However, the standard deviation of the MDC participation rates is less than half the standard deviation of the LDC rates (7.82 versus 16.5).

These individual value plots clearly show a much larger variation in LDC values as contrasted with MDC values. Several Sub-Saharan African and Southeast Asian LDCs in the 2009 sample have female labor force participation rates in excess of 70% (Ethiopia, Uganda, Malawi, Ghana, and Cambodia). At the other extreme, several LDCs in the Middle East and North Africa (Yemen, Lebanon, Jordan, Libya, and Tunisia) have less than 30% of females participating in the labor force.

Clearly, the social and economic status of women varies greatly across the less-developed world. The obvious next question you might ask in the geographic research process is why. Why are women in Sub-Saharan Africa and Southeast Asia heavily involved in the economies of their countries, while Middle Eastern and North African women are very lightly involved in (or even excluded from) labor force activities? The reasons are multiple, complex, and deeply rooted in the cultural heritage, values, and traditions of each country. Many individual as well as societal factors influence a woman's decision to enter the labor force. It is unknown how many women would like to work but have not done so due to cultural norms. The different rates of increase in women's participation in the work force from one LDC to another are probably linked to improved economic development, rising education among women, declining fertility, and different rates of change in various cultural and religious norms.


We need to make one final point that casts severe validity questions on this entire geographic example. The magnitude of certain factors related to female employment is virtually unknown in many countries. These factors include the extent of labor underutilization among females in many countries (both unemployed and working only part time). Also generally unknown is the degree to which women are working in the "informal economy" or in a situation of "vulnerable employment," where wages may not be monetary and women are working as unpaid family members without market economy benefits. The bottom line here is that data regarding female labor force participation rates may be nothing more than very rough guesses (even from reputable organizations such as The World Bank) subject to major differences in data collection from country to country.

The Wilcoxon rank sum W and Mann-Whitney U tests are now applied to the same female labor force participation rate data. "Pairing" these nonparametric tests with the parametric t test is an experiment. For one thing, we do not know whether the two populations from which samples were taken (MDCs and LDCs) are normally distributed. This is an important consideration, since normality of the two populations being tested is one of the assumptions associated with the parametric two-sample difference tests (Z and t). Also, the MDC sample size of 15 is rather small (certainly smaller than 30), so it is not possible to take full advantage of the central limit theorem.

The worktable for the Wilcoxon rank sum W test (table 10.2) summarizes the application of these tests on the female labor force participation rates for the 2009 data. In that year, the sum of ranks for the smaller MDC sample is 473.5 and for the larger LDC sample it is 1356.5. The Wilcoxon W is the sum of ranks of the smaller sample (that is, the MDC sample value of 473.5). Results are also shown for the standard deviation of W and the test statistic Zw, with a low test statistic value (Zw = 0.273) and associated p-value of .7912.

The results of the nonparametric two-sample difference tests (Wilcoxon and Mann-Whitney) are about what we should expect in relation to the corresponding two-sample t test. The p-values associated with the t tests are just slightly lower

TABLE 10.3
Comparing p-values of Parametric and Nonparametric Tests for Two-Sample Differences: Female Labor Force Participation Rates (1989, 1999, and 2009)

Year    t test p-value    Wilcoxon (Zw) p-value
1989        .462                .7071
1999        .528                .6756
2009        .657                .7912

TABLE 10.2
Worktable for Wilcoxon Rank Sum Test (Two Samples): Female Labor Force Participation Rates, 2009

Sample Statistics: 2009
                                        n     Sum of Ranks    Mean Rank
More-developed countries (sample 1)    15       473.5          31.57
Less-developed countries (sample 2)    45      1356.5          30.14

Wilcoxon W is the sum of ranks of the group with the smaller sample size:
W1 = 473.5

Mean rank of sample 1:
W̄1 = n1(n1 + n2 + 1) / 2 = 15(15 + 45 + 1) / 2 = 457.5

sw = standard deviation of W = √(n1 n2 (n1 + n2 + 1) / 12) = √(15(45)(61) / 12) = 58.575

Zw = (W1 - W̄1) / sw = (473.5 - 457.5) / 58.575 = 16 / 58.575 = 0.273

2-tailed p-value = .7912

Note: W1 - W̄1 is the actual difference and sw is the expected difference.



(stronger statistically) than the comparable Wilcoxon (Mann-Whitney) p-values. This finding is expected, given the slightly stronger statistical power of the parametric t test over its nonparametric equivalents.

For example, a quick comparison of the 2009 p-values from tables 10.1 and 10.2 (.791 for the nonparametric test versus .657 for the parametric t test) shows a moderately weaker conclusion for the nonparametric tests (table 10.3). Clearly, some information about the female labor force participation rate data is lost when this interval-ratio variable is downgraded to ordinal. However, both the parametric and nonparametric results are telling basically the same story: for all three dates, the p-values are much larger than conventional cut-off values (.10 or .05) for rejecting the null hypothesis and for concluding there is a significant difference between means. Thus, we must conclude there is no significant difference between MDC and LDC female labor force participation rates.

FIGURE 10.2
Individual Value Plots of Female Labor Force Participation Rates, 2009 (MDC and LDC samples plotted in separate columns; labeled high-end LDCs include Ethiopia, Uganda, Ghana, Malawi, and Cambodia; labeled low-end LDCs include Tunisia, Libya, Jordan, Lebanon, and Yemen)
Source: World Bank, World Development Report 2012: Gender Equality and Development

10.2 TWO-SAMPLE DIFFERENCE OF PROPORTIONS TEST

Geographers and other scientists often work with proportions as well as means, so it should not be surprising that statisticians have developed a two-sample difference of proportions test. Inferences concerning the difference of two population proportions are made by comparing the difference of two sample proportions, using logic (and a test structure) similar to that of the difference of means tests.

For example, a political geographer might want to learn if a difference exists in the President's approval rating among the voting population from two neighboring states or among male and female voters. Random samples of voters in each state could be polled and the proportion of potential voters giving the President a favorable rating tabulated. A two-sample difference of proportions test could then assess the likelihood that a statistically significant difference in approval ratings exists. A recreation planner could ask visitors at a regional park whether they approve of a proposed change in operating hours. If the park visitors were divided into two samples (weekday visitors versus weekend visitors), it is possible to apply a two-sample difference of proportions test to determine if the proportion of those who approve of the new hours differs between weekday and weekend visitors.

In these situations, where a dichotomous (binary) variable has only two possible outcomes or responses, the two-sample difference of proportions test is appropriate. For many problems, the variable analyzed fits this

requirement of exactly two categories or classes, such as male-female, approve-disapprove, or yes-no. In situations with more than two possible outcomes or responses, you can sometimes reduce three or more categories to two so that the difference of proportions test is appropriate across the entire data set. For example, a public survey measuring attitudes toward capital punishment might include the following possible responses: strongly favor, mildly favor, mildly oppose, and strongly oppose. If these four possible responses were "collapsed" to two (favor and oppose), then the data would be structured in a way that would allow a single two-sample difference of proportions test to be applied across the entire set of data. Then you could test whether the proportion of elderly (age 65 and above) favoring capital punishment differs from the proportion of young adults (age 20-34) sharing that opinion.

In problems testing for significant differences with a binary variable, one of the two categories is selected as the focus for the analysis. The proportion of the first sample in the focus category is termed p1, while the corresponding proportion of the second sample is termed p2. The objective of the difference test is to determine whether the proportion of population 1 (p1) having the focus attribute differs significantly from the corresponding proportion of population 2 (p2).

The null hypothesis for a two-sample difference of proportions test is written H0: p1 - p2 = 0. By expressing the null hypothesis this way, it relates directly to the difference of population proportions, and it is hypothesized that the difference is zero. It is possible to set the hypothesized difference as something other than zero, but that treatment is not covered here. As in other difference tests, the alternate hypothesis (HA) can take two forms. If no direction is hypothesized for the difference in proportion between the two samples, a two-tailed procedure is applied, where HA: p1 - p2 ≠ 0. A two-tailed procedure is used if one wishes to test whether the proportion of elderly adults in favor of capital punishment differs from the proportion of young adults in favor of capital punishment. On the other hand, if one wishes to test whether one of the sample proportions is larger or smaller than the other (specifying the direction of difference), a one-tailed test is appropriate. A one-tailed test is used for a hypothesis that a greater proportion of elderly favor capital punishment than do young adults. In these instances, the alternate hypothesis is HA: p1 - p2 > 0 if it is assumed that p1 is the larger of the two proportions.

The test statistic for the difference of proportions procedure (Zp) has a form similar to other two-sample difference tests:

Zp = (p1 - p2) / σp1-p2   (10.10)

where p1 = proportion of sample 1 in the category of focus
p2 = proportion of sample 2 in the category of focus
σp1-p2 = standard error of the difference of proportions

Just as a sample proportion is the best estimate of the corresponding population proportion, the sample difference of proportions (p1 - p2) is the best estimate of the population difference of proportions. The denominator of equation 10.10 represents the standard error of the difference in proportions and is estimated as follows:

σp1-p2 = √(p̂(1 - p̂)(n1 + n2) / (n1n2))   (10.11)

where p̂ = pooled estimate of the focus category for the population

The pooled estimate (p̂) is the proportion in the focus category if the two samples are combined into one sample. Operationally, the pooled estimate is the weighted proportion from the two samples:

p̂ = (n1p1 + n2p2) / (n1 + n2)   (10.12)

Two-Sample Difference of Proportions Test

Primary Objective: Compare two independent random sample proportions for difference

Requirements and Assumptions:
1. Two independent random samples
2. Variable is organized by dichotomous (binary) categories

Hypotheses:
H0: p1 = p2
HA: p1 ≠ p2 (two-tailed)
HA: p1 > p2 (one-tailed) or
HA: p1 < p2 (one-tailed)

Test Statistic:
Zp = (p1 - p2) / σp1-p2
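The three formulas combine into a short function. The sketch below is ours, not from the text, and the park-visitor counts are hypothetical numbers invented to mirror the recreation-planner example earlier in this section:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Zp for a two-sample difference of proportions (equations 10.10-10.12)."""
    p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)            # pooled estimate, eq. 10.12
    std_error = math.sqrt(p_pool * (1 - p_pool) * (n1 + n2) / (n1 * n2))  # eq. 10.11
    return (p1 - p2) / std_error                        # eq. 10.10

# Hypothetical park survey: 72 of 120 weekday visitors approve (p1 = .60),
# 75 of 150 weekend visitors approve (p2 = .50)
z_p = two_proportion_z(0.60, 120, 0.50, 150)
```

With these invented counts, Zp is about 1.64, so the two-tailed p-value sits just above .10 and the approval difference would not be declared significant at conventional levels.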

Example: Two-Sample Difference of Proportions - Changes in State-level Obesity Rates from 2000 to 2010

The two-sample difference of proportions test is illustrated by returning again to the obesity data provided by the Centers for Disease Control and Prevention (CDC). Their excellent sampling processes allow scientists to develop hypotheses from a sequence of independent samples at various spatial levels and over different time periods. One of many such possibilities is to develop hypotheses regarding changes in state-level obesity rates over time. We suggest the following hypotheses:

• "For each state in the United States, the proportion of the adult population classified as obese in 2010 is higher than the proportion of the adult population classified as obese in 2000." This hypothesis is testable inferentially using the two-sample difference of proportions test because the CDC has taken independent samples from the residents of each state for both 2000 and 2010.

• "States with higher obesity levels in 2000 have larger increases in the proportion of their population that is obese from 2000 to 2010 than states with lower obesity levels in 2000." This descriptive statement (from the obesity problem in chapter 1) suggests that the level of "obesity inequality" by location is increasing and that the "obesity gap" between high and low obesity level states is widening. Some exploratory results are presented, including a scatterplot showing the state-level relationship between the proportion of obese adults in 2000 and the magnitude of change in the proportion of adults classified as obese from 2000 to 2010.

From a purely descriptive perspective, every state in the United States (without exception!) experienced an increase from 2000 to 2010 in the proportion of adult population classified as obese by the CDC (table 10.4). The greatest increase in proportion occurred in Oklahoma, which went from .1920 (19.20%) obese in 2000 to .3091 (30.91%) in 2010, an incredible change in just one decade. A few other states also experienced over a .100 proportional increase (over a 10% increase) in obesity level over the decade, including Delaware, Florida, Indiana, Missouri, South Carolina, and Virginia. The states with the least proportional change in obesity were all western states: California .0256 (2.56%), Colorado .0510 (5.10%), and Hawaii .0610 (6.10%).

If you are looking for an easily explained spatial pattern in these 2000-2010 obesity level changes, you will probably be disappointed. The descriptive statement we presented earlier suggests that states with higher obesity levels in 2000 increased their obesity levels more dramatically from 2000 to 2010 than did states with lower obesity levels in 2000. Is there tangible evidence of this widening "obesity gap" over the last decade? To explore this contention, a scatterplot is constructed showing the obesity level for each state in 2000

FIGURE 10.3
Scatterplot of Proportion of State Adult Population Classified as Obese (2000) and Change in Obesity Level, 2000 to 2010 (x-axis: proportion of state adult population classified as obese, 2000; y-axis: change in obesity level, 2000 to 2010; points labeled by state ID)
Source: Centers for Disease Control and Prevention (CDC)


plotted against magnitude of proportional change in obesity from 2000 to 2010 (fig. 10.3). There appears to be no strong relationship between these two variables.

In other words, knowing the obesity level of a particular state in 2000 gives little or no clue as to that state's change in obesity level from 2000 to 2010. We calculated the direction and strength of this relationship statistically using correlation (discussed later in chapter 16), and found no evidence of a significant "widening obesity gap." At this point, it is sufficient to say that only a slight (and statistically insignificant) relationship exists between state obesity level in 2000 and the percent change in state obesity level from 2000 to 2010 (the Pearson's correlation is 0.196 and the associated p-value is .172).

TABLE 10.4
Proportion of State Adult Population Classified as Obese (2000 and 2010) and Change in Obesity Level from 2000 to 2010

State          State ID    Obesity Level in 2000       Obesity Level in 2010       Change in Obesity Level,
                           (proportion of adults       (proportion of adults       2000 to 2010
                           classified as obese)        classified as obese)
Alabama AL .2393 .3285 .0892
Alaska AK .2250 .2931 .0681
Arizona AZ .1800 .2670 .0870
Arkansas AR .2338 .3085 .0747
California CA .2188 .2444 .0256
Colorado CO .1633 .2143 .0510
Connecticut CT .1728 .2407 .0680
Delaware DE .1910 .2974 .1064
Florida FL .1875 .2913 .1039
Georgia GA .2256 .3016 .0761
Hawaii HI .1539 .2150 .0610
Idaho ID .1962 .2794 .0832
Illinois IL .2144 .2781 .0637
Indiana IN .2192 .3228 .1036
Iowa IA .2192 .2935 .0743
Kansas KS .2089 .3071 .0982
Kentucky KY .2353 .3306 .0953
Louisiana LA .2392 .3259 .0867
Maine ME .2068 .2829 .0761
Maryland MD .2095 .2885 .0790
Massachusetts MA .1747 .2578 .0831
Michigan Ml .2281 .3267 .0986
Minnesota MN .1786 .2569 .0783
Mississippi MS .2540 .3434 .0894
Missouri MO .2194 .3228 .1034
Montana MT .1949 .2568 .0619
Nebraska NE .2078 .2932 .0854
Nevada NV .1664 .2403 .0739
New Hampshire NH .1809 .2643 .0834
New Jersey NJ .1875 .2538 .0663
New Mexico NM .1969 .2603 .0634
New York NY .1817 .2492 .0675
North Carolina NC .2230 .2856 .0626
North Dakota ND .2067 .2922 .0855
Ohio OH .2292 .3098 .0806
Oklahoma OK .1920 .3091 .1171
Oregon OR .2108 .2753 .0645
Pennsylvania PA .2095 .3018 .0923
Rhode Island RI .1795 .2678 .0883
South Carolina SC .2282 .3312 .1030
South Dakota SD .2021 .3001 .0980
Tennessee TN .2284 .3183 .0899
Texas TX .2255 .3070 .0815
Utah UT .1782 .2470 .0688
Vermont VT .1855 .2479 .0624
Virginia VA .1894 .2970 .1076
Washington WA .1909 .2805 .0896
West Virginia WV .2386 .3231 .0845
Wisconsin WI .2172 .3059 .0887
Wyoming WY .1748 .2555 .0807
Source: Centers for Disease Control and Prevention (CDC)

The significance of these obesity level changes is generally (but not completely) confirmed statistically. A series of 50 two-sample difference of proportions tests are run, with a separate statistical test for each state. A representative two-sample difference of proportions procedure is shown for the State of Alabama in table 10.5. The obesity level for adults in Alabama increased from .2393 (23.93%) in 2000 to .3285 (32.85%) in 2010. Therefore the observed (or actual) difference of proportions, (p1 - p2), equals .0892. Unfortunately for the overall health of Americans, this is a fairly typical state-level increase in the proportion of adults who are obese.

To proceed with the Z test for difference of proportions, Zp = (p1 - p2) / σp1-p2, the denominator of the test statistic (that is, the standard error of the difference of proportions) must be calculated. To do this, the "pooled" proportion of the two sample proportions is calculated. This pooled proportion, p̂ = (n1p1 + n2p2) / (n1 + n2), is the weighted average of the two proportions (for Alabama, this value is .3081). Using a weighted average is best because the two sample sizes are not the same. For Alabama, the 2010 sample size is more than three times the 2000 sample size. The standard error of the difference of proportions can now be calculated. This standard error is the difference of proportions "expected" if the null hypothesis is true and there really is no difference in the two samples. The value of this expression, σp1-p2 = √(p̂(1 - p̂)(n1 + n2) / (n1n2)), for Alabama is .0113. The associated test statistic, Zp = (p1 - p2) / σp1-p2, is the ratio of the actual (observed) difference of proportions to the expected difference of proportions if the null hypothesis is true and both samples have been drawn from the same underlying

TABLE 10.5
Worktable for Two-Sample Difference of Proportions: Comparing Alabama Adult Rates (Proportion Obese), 2000 to 2010

HA: p1 - p2 > 0, if it is assumed (hypothesized) that population 1 has the larger of the two proportions

            Year    Total sampled (N)    Number classified as obese (X)    Sample proportion obese
Sample 2    2000         2,156                     516                           0.2393
Sample 1    2010         7,269                   2,388                           0.3285

p1 - p2 (observed difference of proportions) = 0.0892

Pooled proportion:
p̂ = (n1p1 + n2p2) / (n1 + n2) = (7269(.3285) + 2156(.2393)) / (7269 + 2156) = .3081 (weighted avg. of the two proportions)

Standard error of the difference of proportions:
σp1-p2 = √(p̂(1 - p̂)(n1 + n2) / (n1n2))
       = √(0.3081(0.6919)((7,269 + 2,156) / 7,269(2,156)))
       = √(0.3081(0.6919)(0.0006013))
       = .0113 (expected difference of proportions, if H0 is true)

Zp = (p1 - p2) / σp1-p2 = 0.0892 / 0.0113 = 7.89, with associated p-value = .0000


population. The observed difference of proportions is nearly eight times the expected difference of proportions: Z = (p1 − p2) / σ_{p1-p2} = .0892/.0113 = 7.89, and the associated p-value is .000. It is clear that the proportion of Alabama's adult population classified as obese has increased significantly from 2000 to 2010, and we can be virtually 100% confident that this is true.
The most significant statistical change in obesity level occurs in Florida, which has a Z-value of 15.29 (p = .000). The only state not having a p-value of .000 is California. However, by any reasonable standard, California's obesity level, with a Z-value of 3.14 (p = .001), also went up significantly. Some states had higher Z-values than you might expect simply due to the larger 2010 sample sizes taken by the CDC. [Note: remember the general statistical rule that, all other things being equal, the larger the sample size, the higher (and more statistically significant) the test statistic value.] This is certainly the case with Florida, which had a huge statewide sample of 33,638 in 2010: nearly double the size of any other state. The state of Washington ranked second in 2010 sample size with 18,571 respondents.
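The pooled-proportion calculation behind table 10.5 is easy to check in a few lines of Python. This is an illustration of our own, not part of the original worktable, and the function name is arbitrary:

```python
from math import sqrt

def two_sample_proportion_z(x1, n1, x2, n2):
    """Two-sample difference of proportions Z test with a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    # pooled proportion: weighted average of the two sample proportions
    p_pool = (x1 + x2) / (n1 + n2)
    # standard error of the difference of proportions, assuming H0 is true
    se = sqrt(p_pool * (1 - p_pool) * (n1 + n2) / (n1 * n2))
    return (p1 - p2) / se

# Alabama adults classified as obese: 2010 (sample 1) vs. 2000 (sample 2)
z = two_sample_proportion_z(2388, 7269, 516, 2156)
print(round(z, 2))  # 7.88
```

Working from the raw counts gives Z = 7.88; the worktable's 7.89 reflects rounding p1, p2, and the standard error to four and three decimal places before dividing.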

10.3 DEPENDENT-SAMPLE (MATCHED-PAIRS) DIFFERENCE TESTS

As geographers, we sometimes want to test whether two sets of sample values are significantly different, even if they are drawn from only one group of individuals or only one set of locations. For example, we may want to compare the number of migrants moving into a set of counties with the number of migrants leaving the same set of counties. The number of in-migrants and number of out-migrants could be determined for a sample of geographic areas and the differences tested for statistical significance. In another example, residents of an area are sampled to determine the miles traveled per week for purposes other than commuting. Following a major rise in the price of gasoline, which can occur during times of increased political tension in the Middle East, the same individuals are surveyed a second time to determine changes in their discretionary travel behavior. One might ask if the change in gasoline prices led to decreased automobile use for drivers within a particular region. In another example, the same sample of coastal zone residents could be surveyed both before and after a severe winter storm (nor'easter) to determine if attitudes toward coastal zone management change significantly. As a final example, suppose that average male life expectancy and average female life expectancy data are collected from a single random sample of villages in Mexico. Are male and female life expectancies in these Mexican villages different from one another?

In each of these examples, the data consist of one set of observations (locations or individuals) collected for two different variables or at two different time periods. At first, the difference of means t or Wilcoxon rank sum W might seem most appropriate to test differences for significance. However, as discussed in section 10.1, these two difference tests require two independent samples. In the examples from the previous paragraph, only one sample is drawn. In two of those examples, data are collected for two variables at one time period (in- and out-migrants; male and female life expectancy). In the other two examples, data are collected for one variable at two time periods (miles traveled before and after a price increase for gasoline; attitudes toward managing the coast before and after a winter storm). Thus, neither the two-sample t test nor the Wilcoxon W procedure is appropriate for these problems.

When two sets of data are collected for one group of observations, the samples are termed dependent, and a test of matched-pairs is the proper inferential procedure for examining differences between such samples. As the name implies, each observation or sample member has two values, known as a "matched pair." The differences in the set of matched-pairs are tested for statistical significance and results inferred to the population from which the dependent samples are drawn. In the example of life expectancy in the Mexican villages, two equal-sized samples are drawn, one male and the other female. However, the samples are not independent. In fact, average male and female life expectancies from a particular village are directly related to one another, since men and women at the same location are likely affected by many of the same economic, social, and environmental factors. The male and female life expectancies from each village are a matched pair, and it is appropriate to compare a sample of such matched-pairs of observations to see if they differ significantly.

Two common inferential procedures for testing differences in dependent samples are the matched-pairs t test and the Wilcoxon matched-pairs signed-ranks test. The matched-pairs t test is a parametric test requiring interval-ratio level data and a normally distributed population. Like the Wilcoxon W two-sample difference test, the related matched-pairs signed-ranks test is a distribution-free (that is, the population from which the sample is taken can have any distribution) nonparametric method that uses either ranked data directly or interval-ratio data downgraded to its ordinal equivalent.
Chapter 10 • Two-Sample and Dependent-Sample (Matched-Pairs) Difference Tests 169

to zero. However, in a situation where the matched-pair differences are large, d̄ is also large, suggesting a significant difference between the two sets of data.

The null hypothesis in the matched-pair problem states that the mean difference for all pairs in the population (δ) equals zero. The best estimate of the population matched-pair mean difference (δ) is the sample matched-pair mean difference (d̄). The alternative hypothesis is either nondirectional or directional, depending on a priori information about the expected matched-pair differences. The test statistic for the matched-pairs t (t_mp) is defined as follows:

    t_mp = d̄ / σ_d̄    (10.13)

where d̄ = mean of matched-pair differences (d)
      σ_d̄ = standard error of the mean difference

The numerator of equation 10.13 is the mean of matched-pair differences (d̄):

    d̄ = Σd_i / n    (10.14)

where d_i = difference for matched-pair i
      n = number of matched pairs

The difference (d_i) is found by subtracting the corresponding paired values of the second variable (Y) from those of the first variable (X):

    d_i = X_i − Y_i    (10.15)

The mean of the difference values (d̄) can also be calculated from the means of the two variables:

    d̄ = X̄ − Ȳ    (10.16)

Like other difference tests, the denominator of equation 10.13 contains a standard error measure. In this case, the standard error refers to the mean difference in matched pairs:

    σ_d̄ = s_d / √n    (10.17)

where

    s_d = √[ Σ(d_i − d̄)² / (n − 1) ]    (10.18)

The standard deviation of the matched-pair differences can also be derived from the computational formula (table 3.3):

    s_d = √[ (Σd_i² − (Σd_i)²/n) / (n − 1) ]    (10.19)

Matched-Pairs t Test
Primary Objective: Compare matched pairs from a random sample for difference
Requirements and Assumptions:
1. Random sample
2. Data are collected for two different samples or at two different time periods
3. Population is normally distributed
4. Variable(s) is (are) measured at interval or ratio scale
Hypotheses:
H0: δ = 0
HA: δ ≠ 0 (two-tailed)
HA: δ > 0 (one-tailed) or
HA: δ < 0 (one-tailed)
Test Statistic:
t_mp = d̄ / σ_d̄

Wilcoxon Matched-Pairs Signed-Ranks Test

In some geographic problems, a matched-pair (dependent-sample) test may be appropriate, but the sample data for analysis are measured at the ordinal level. In other situations, the sample sizes are too small for the reliable use of a parametric technique, or the sample data are not drawn from a normally distributed population. In the first instance, the parametric matched-pair t test cannot be applied because the measurement scale is not appropriate. In the other cases, the parametric test may produce biased results. The Wilcoxon signed-ranks test is the nonparametric equivalent for dependent-sample or matched-pair problems and is the appropriate procedure in these situations.

The Wilcoxon signed-ranks test uses matched-pair differences ranked from lowest (rank 1) to highest. The matched-pairs data can come either from direct ordinal measurement or from interval-ratio differences downgraded to ranks. The absolute difference between the two variables (rather than the positive or negative difference) is used to determine the rank for each matched pair. When the difference for any matched pair is zero, the data are ignored and the sample size reduced accordingly. When differences for matched pairs are tied for a particular rank position, the average rank is assigned to each such pair. The null hypothesis for the signed-ranks test states that the matched-pairs differences (in ranks) for the population from which the sample is drawn equal zero.

Wilcoxon Matched-Pairs Signed-Ranks Test
Primary Objective: Compare matched pairs from a random sample for difference
Requirements and Assumptions:
1. Random sample
2. Data are collected for two different variables or at two different time periods
3. Population can have any distribution (the test is distribution-free)
4. Variable(s) is (are) measured at ordinal scale or downgraded from interval/ratio scale to ordinal
Hypotheses:
H0: The ranked matched-pair differences of the populations are equal
HA: The ranked matched-pair differences of the populations are not equal (two-tailed)
HA: The ranked matched-pair differences of the populations are positive or negative (one-tailed)
Test Statistic:
Zw = [T − n(n + 1)/4] / √[n(n + 1)(2n + 1)/24]    when n > 10

Two sums can be calculated from the set of ranked matched pairs: Tp, the sum of ranks for positive differences (variable 1 greater than variable 2), and Tn, the sum of ranks for negative differences (variable 2 greater than variable 1). If the two variables measured for the single sample show very little difference, Tp should be approximately equal to Tn. However, for a problem in which the differences between the two variables are large, the disparity between Tp and Tn will also be large. In these situations, one of the rank sums (either the positive or negative differences) will be large and the other small.

The Wilcoxon test for dependent samples uses only one of the two possible rank sum values. The decision of which rank sum to test depends on whether the alternative hypothesis (HA) is directional (one-tailed) or nondirectional (two-tailed). If no direction of difference between the two variables is hypothesized, a two-tailed test is applied, and the smaller of Tp and Tn is chosen. In this instance, the hypothesis concerns only a difference between the two variables under study and not which variable is the largest.

The second possibility involves a directional hypothesis and a one-tailed procedure. In this case, the hypothesis states that either the positive or negative differences for the matched pairs are expected to dominate. The value of T corresponding to the smaller number of hypothesized differences (either positive or negative) is selected for testing. Thus, if more differences are expected to be positive, Tn, the sum of the negative differences, is used.

When the number of matched pairs exceeds ten, the rank sum (T) can be converted to a Z statistic (Zw) and tested using the distribution of normal values:

    Zw = [T − n(n + 1)/4] / √[n(n + 1)(2n + 1)/24]    (10.20)

where n = number of matched pairs (n > 10)
      T = rank sum

Example: Matched-Pairs t Test - Salmon Growth Rates with (and without) Growth Hormone

A biogeographer is interested in determining if a growth hormone can increase the size of farm-raised salmon faster than the normal (hormone-free) growth rate. Typically, farm-raised salmon require 30 months to reach market size. It is hoped the introduction of a growth hormone will decrease the time to 18 months. Obviously, if the farmer can ship the product to market sooner, profits can increase.

Normally, farm-raised salmon grow an average of .75kg per year. Therefore, even without the hormone, the farmer would expect the salmon to grow by an average of .75kg in a given year. The farmer therefore wants to learn if the added growth hormone will increase the size of the salmon significantly more than .75kg. To test this hypothesis, a random sample of 10 salmon are weighed, given the growth hormone, and weighed one year later. Because the farmer is monitoring the same fish before and after the hormone treatment, a matched-pairs procedure is applied. The findings are shown in table 10.6.

TABLE 10.6

Change in Weight of Sample of Salmon after Hormone Treatment

Salmon ID    Weight before             Weight after              Difference in weight
Number       hormone treatment (kg)    hormone treatment (kg)    after one year
1            5.2                       5.8                       0.6
2            5.3                       5.8                       0.5
3            5.8                       7.3                       1.5
4            4.1                       6.7                       2.6
5            4.8                       6.7                       1.9
6            5.2                       7.0                       1.8
7            4.7                       7.0                       2.3
8            4.9                       5.8                       0.9
9            4.9                       7.4                       2.5
10           4.5                       5.1                       0.6

As expected, every salmon gains weight. However, the farmer is interested in whether the growth rate is significantly more than expected (that is, more than .75kg). Therefore, the null hypothesis states that the change in salmon weight before and after the growth hormone is not significantly greater than .75kg, making a one-tailed test appropriate (table 10.7).
As shown in table 10.7, the calculated matched-pairs t value is 2.96, with a corresponding one-tailed p-value of .01. It seems clear that we should reject the null hypothesis and conclude that the growth rate of the salmon is significantly greater than .75kg. In fact, we can be over 99% confident that the differences in growth can be attributed to the growth hormone.

TABLE 10.7

Worktable for Dependent-Sample (Matched-Pairs) t Test: Salmon Weight Before and After Hormone Treatment

H0: δ = .75    HA: δ > .75

Calculate the average difference (d̄):

Σd = 15.2    Σd² = 29.18    n = 10    X̄ = 4.94    Ȳ = 6.46

where d = difference in weight before and after hormone treatment
      X = salmon weight before hormone treatment
      Y = salmon weight after hormone treatment

    d̄ = Σd_i / n = 15.2 / 10 = 1.52

Calculate the standard error of d̄:

    s_d = √[ (Σd_i² − (Σd_i)²/n) / (n − 1) ] = √[ (29.18 − (15.2)²/10) / 9 ] = 0.82

    σ_d̄ = s_d / √n = 0.82 / √10 = 0.26

Calculate the test statistic t_mp and p-value:

    t_mp = (d̄ − δ) / σ_d̄ = (1.52 − .75) / 0.26 = 2.96    p-value = .01 (one-tailed)
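The worktable arithmetic can be verified with a short Python sketch of our own (the function name is arbitrary). The differences are taken as after minus before so that d̄ matches the worktable's 1.52, and the standard error is written in the standard form s_d/√n with s_d computed using n − 1, which reproduces the worktable's values:

```python
from math import sqrt

def matched_pairs_t(before, after, delta0=0.0):
    """Matched-pairs t: (dbar - delta0) / (s_d / sqrt(n))."""
    d = [a - b for b, a in zip(before, after)]   # paired differences (gain)
    n = len(d)
    dbar = sum(d) / n
    # standard deviation of the differences (n - 1 denominator)
    s_d = sqrt(sum((di - dbar) ** 2 for di in d) / (n - 1))
    return (dbar - delta0) / (s_d / sqrt(n))

before = [5.2, 5.3, 5.8, 4.1, 4.8, 5.2, 4.7, 4.9, 4.9, 4.5]
after  = [5.8, 5.8, 7.3, 6.7, 6.7, 7.0, 7.0, 5.8, 7.4, 5.1]
t = matched_pairs_t(before, after, delta0=0.75)  # H0: mean gain = .75kg
print(round(t, 2))  # 2.96, matching table 10.7
```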

Example: Wilcoxon Matched-Pairs Test - Parkwood Estates Household Water Usage Levels before (and after) Water Conservation Program

Over the last few decades, the Environmental Protection Agency (EPA) has developed and updated nationwide "Water Conservation Plan Guidelines" for indoor residential water use monitoring and conservation. The members of the Water Conservation Board of Middletown (introduced in chapter 8) want to select a small sample of households in the Parkwood Estates development and provide them with information about the various water-saving features, strategies, and incentives (such as rebates for the purchase of low-flow appliances) discussed in the EPA "Guidelines." The Conservation Board members hope that households will significantly reduce their water consumption levels.

Twenty randomly selected households initially volunteer to participate in this multi-year study. Baseline indoor water consumption levels are determined before implementing the informational program for each of the following uses (showers, clothes washers, toilets, dishwashers, baths, leaks, faucets, and other domestic uses). The informational program is

implemented over a two-year period, to provide enough time for people to adjust their water-use habits and purchase water-saving devices if they wish. At the end of the program, the water consumption levels for all of the uses listed above are measured again.

Only 11 of the 20 households successfully complete the program with useable water consumption data. During the program implementation period, some people change residence, others have a major demographic change such as adding or subtracting a household member, and others decide they don't want to continue with the program. Further complicating the data, 2 of the 11 households actually completing the program are seasonal residents. While it is disappointing to lose almost half of the participants, one often sees large attrition rates in real-world sampling.

Because the Water Conservation Board is monitoring the same households before and after implementation of the water-saving information program, the appropriate type of statistical test is matched-pairs. However, the very small sample size, coupled with the likelihood that this sample of households has probably not been taken from a normally distributed population, makes the non-parametric Wilcoxon matched-pairs test the only logical choice. The "before-after" household water consumption data are shown in the top portion of table 10.8.

At first glance, the results seem inconsistent. A majority of the households (6 out of 11) show decreasing water use levels, and the average daily indoor water usage of the 11 households that actually complete the program decreases slightly, going from 152.8 gallons to 146.5 gallons. Are these modest results enough to conclude that less water is being used overall? A one-tailed Wilcoxon matched-pairs test is applied to answer this question (see the bottom portion of table 10.8). The calculated test statistic value from the Wilcoxon matched-pairs procedure is -1.245. Reading from the normal table, the corresponding p-value is approximately .1066. A p-value of this magnitude indicates there is only marginal justification for concluding that there is a decrease in water usage levels. If we conclude the water conservation program is a success in

TABLE 10.8

Worktable for Dependent-Sample Wilcoxon Signed-Ranks Test: Daily Indoor Water Usage by Parkwood Estates Households (in gallons)

Daily indoor water usage per household

Household     Before program         After program          Change in
ID Number     implementation (X)     implementation (Y)     water usage     Rank
1             174.2                  146.7                  -27.5           11
2             185.0                  174.5                  -10.5           7
3             57.2                   52.3                   -4.9            2
4             194.7                  177.6                  -17.1           9
5             143.4                  149.4                  +6.0            4
6             116.5                  123.8                  +7.3            5.5
7             72.1                   59.4                   -12.7           8
8             206.3                  212.2                  +5.9            3
9             226.8                  202.7                  -24.1           10
10            102.9                  104.7                  +1.8            1
11            201.3                  208.6                  +7.3            5.5

X̄ = 152.8    Ȳ = 146.5

Tn = sum of ranks of negative changes (less water used) = 47
Tp = sum of ranks of positive changes (more water used) = 19

Select the T which corresponds to the smaller number of hypothesized changes. Since most changes are expected to be negative (less water used), select Tp.

Calculate the test statistic Zw and p-value:

    Zw = [T − n(n + 1)/4] / √[n(n + 1)(2n + 1)/24]
       = [19 − 11(12)/4] / √[11(12)(23)/24]
       = (19 − 33) / √126.5
       = -1.245    corresponding p-value = .1066 (one-tailed)
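The tie-averaged ranking and Z conversion in table 10.8 can be reproduced with a short Python sketch (an illustration of our own; the changes are rounded so that tied values such as the two +7.3 gains compare equal despite floating-point noise):

```python
from math import sqrt

def wilcoxon_signed_ranks_z(before, after):
    """Large-sample Wilcoxon matched-pairs Z (n > 10), testing Tp as in table 10.8."""
    d = [round(a - b, 6) for b, a in zip(before, after)]  # matched-pair changes
    d = [x for x in d if x != 0]                          # zero differences are dropped
    n = len(d)
    # rank absolute changes from lowest (rank 1) to highest, averaging tied ranks
    abs_sorted = sorted(abs(x) for x in d)
    avg_rank = {v: sum(i + 1 for i, w in enumerate(abs_sorted) if w == v)
                   / abs_sorted.count(v)
                for v in set(abs_sorted)}
    t_pos = sum(avg_rank[abs(x)] for x in d if x > 0)     # Tp
    # convert the rank sum to Z (equation 10.20)
    return (t_pos - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)

before = [174.2, 185.0, 57.2, 194.7, 143.4, 116.5, 72.1, 206.3, 226.8, 102.9, 201.3]
after  = [146.7, 174.5, 52.3, 177.6, 149.4, 123.8, 59.4, 212.2, 202.7, 104.7, 208.6]
z = wilcoxon_signed_ranks_z(before, after)
print(round(z, 3))  # -1.245, matching table 10.8
```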

reducing water consumption, there is a 10.66% chance we are wrong and are making a Type I error. Certainly when looking at the very slight decrease in average household water usage, very little water is conserved overall. It is probably best to conclude that the water conservation program does not significantly reduce water usage levels.

Given the very limited success of the water conservation program, it is now incumbent upon the members of the Water Conservation Board to examine these disappointing results on a household-by-household basis. For example, which households (if any) actually purchased low-flow toilets? Did any households systematically check for drips and leaks in kitchen and bathroom sinks, bathroom showers, and bathroom toilets? Did any households install a low-flow showerhead or aerator/flow restrictor on kitchen and bathroom sinks? Did any households take advantage of Energy Star rebates for purchase of such water-saving devices? If some (or all) of these water conservation strategies were not implemented, what reasons for non-implementation were given by the household members? Only after these questions are answered can the Water Conservation Board decide what future programs and policies should be tried (if any) in Parkwood Estates.

In several respects, this is a flawed study. If this were an actual water usage survey, our recommendation would be to go back to the drawing board and redesign the entire research process. The sample itself is nonrandom and tainted in several ways: the participants are volunteers, not randomly selected; the overall sample size is very small; and 2 of the 11 households (with only seasonal occupancy and lower water consumption levels) clearly have different water usage patterns than the year-round resident households. You should be able to identify some other weaknesses or characteristics of this research design that could be strengthened.

Note: we present this example as a reminder that as human beings we often have our personal opinions about how we wish an experiment to conclude. Yet, as scientists, we are committed to following the scientific method and honestly reporting our findings. Members of the Water Conservation Board may be genuinely surprised by these results. Hopefully, however, they would recognize that aspects of the research design need strengthening. The goals and objectives of this study are certainly worthwhile and deserving of a better designed research process.

KEY TERMS

dependent-sample (matched-pairs) difference tests, 168
independent and dependent samples, 168
pooled variance estimate (PVE), 157
separate variance estimate (SVE), 158
two-sample difference of means Z or t test (parametric), 156
two-sample difference of proportions test, 163
Wilcoxon rank sum W (Mann-Whitney U) tests (non-parametric), 158

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:
1. Recognize those geographic problems or situations for which application of a two-sample difference test is appropriate.
2. Explain the basic difference between independent and dependent samples.
3. Understand the procedure for selecting the proper two-sample difference test based on level of data measurement.
4. Explain the types of geographic problems where a matched-pairs (dependent-sample) difference test is appropriate.

REFERENCES AND ADDITIONAL READING

If you are interested in further exploring the geographic examples mentioned in this chapter, the following are good places to start.
• For information regarding gender equality internationally, as well as data dealing with female participation in the labor force, see various World Bank publications (www.worldbank.org), including the 2012 World Development Report, Gender Equality and Development, and their extensive World Development Indicators data set. Another excellent source is the International Labour Organization (a specialized agency of the United Nations), www.ilo.org. Especially see the report by Sara Elder, Women in Labour Markets: Measuring Progress and Identifying Challenges, 2010.
• For state-level obesity data in the U.S., the Centers for Disease Control and Prevention (CDC) is once again recommended: www.cdc.gov.
• General sources for household energy consumption and expenditures are available from the Energy Information Administration (www.eia.gov). More specifically, see their latest Residential Energy Consumption Survey (2009).
• Detailed water conservation guidelines for households are available through the Environmental Protection Agency at www.epa.gov/WaterSense/pubs/guide.html.
Three-or-More-Sample Difference Tests
Analysis of Variance Methods

11.1 Analysis of Variance (ANOVA)


11.2 Kruskal-Wallis Test
11.3 Example Applications in Geography

Many geographic problems involve comparison of three or more independent samples for significant differences. The logic applied to these problems is an extension of the reasoning used with two-sample difference tests. If multiple (three or more) samples are taken from the same population, their sample means are expected to vary somewhat from one another just because of sampling variation. However, if multiple sample means vary considerably more than what is expected from ordinary sampling variability, then it can reasonably be concluded that the samples are drawn from at least two different populations.

Consider again the example used in section 10.1 to test for differences between two samples. In that application, a random sample of less-developed countries (LDCs) is compared to a second independent sample of more-developed countries (MDCs) with regard to female labor force participation rates. Suppose that the research design of this two-sample problem is expanded to test for differences between samples of countries from each of the six major world regions identified by the Population Reference Bureau. If this six-group classification scheme is used to test for difference of means, a two-sample difference of means test such as the t test or Wilcoxon rank sum W test is no longer appropriate.

Early in the previous chapter, another possible application of two-sample difference of means tests was suggested. In that example, a geographer is interested in comparing surface water runoff levels in two different environments: a natural setting with little or no landscape modification versus an exurban setting that had been partially modified by development, such as residential subdivisions. If this problem is expanded to three distinct landscapes whose mean surface water runoff levels are compared (e.g., totally natural setting, slightly modified landscape, and urban area substantially modified by human activity), then the t and Wilcoxon W tests are again not applicable.

The statistical procedures used to test three or more (multiple) samples for differences are analysis of variance (ANOVA) and the Kruskal-Wallis test. ANOVA is a parametric test that requires interval-ratio data drawn from normally distributed populations. In addition, the variances of all groups in the population are assumed equal. The equivalent nonparametric procedure is the Kruskal-Wallis one-way analysis of variance by ranks test. The Kruskal-Wallis test uses ordinal data directly, or interval-ratio data downgraded to ordinal if either the assumption of normality or equal variance is badly violated.

11.1 ANALYSIS OF VARIANCE (ANOVA)

Recall that the term "variance" is a descriptive statistic that measures the total amount of variability about the mean. Analysis of variance (ANOVA) involves the separation or partitioning of the total variance found in three or more groups or samples into two distinct components: (1) variability between the group or category means themselves; and (2) variability of the observations within each


group around its group mean. Even though the structure of ANOVA uses variation as the key descriptive statistic, the means of the samples are compared for significant differences, making ANOVA a three-or-more-sample difference of means test.

ANOVA is explained this way. If the variability between the group means is relatively large as contrasted with a relatively small amount of variability within each group around its group mean, then the statistical conclusion is likely that the different groups have been drawn from different populations. As a result, the null hypothesis would be rejected. The null hypothesis for ANOVA takes the form H0: µ1 = µ2 = ... = µk (where k is the number of independent random samples), so the null hypothesis asserts that all samples are drawn from the same population. Conversely, the alternate hypothesis takes the form HA: µ1 ≠ µ2 ≠ ... ≠ µk, asserting that at least one sample is drawn from a different population than the other samples.

The interpretation of ANOVA becomes more apparent when viewed graphically (fig. 11.1). The null hypothesis states that independent samples have been drawn from the same population, whereas the alternate hypothesis asserts that the different samples are from at least two separate and distinct populations. In case 1 of figure 11.1, only a small difference separates the three sample means (X̄1, X̄2, and X̄3), and when ANOVA is applied the likely decision is to not reject the null hypothesis. That is, the three sample means are inferred to be no more different than would be expected with three independent samples drawn from the same population. Conversely, in case 2, the seemingly large difference between the three sample means leads to the inferential conclusion that the different samples are taken from more than one separate and distinct population. The proper decision in this case is to reject the null hypothesis. Note that these decisions are all non-directional or two-tailed, since the concern is to find a difference between three or more samples. The presence of more than two samples makes it impossible to conduct a directional (one-tailed) test of difference.

Between-group variation focuses on how the sample mean of each group differs from the total or overall mean when all categories are grouped together. Quite simply, the total or overall mean is the weighted average of the individual group or sample means. Therefore, if the individual group or sample means differ significantly from the total or overall mean, the between-group variability is large. Sometimes the between-group variation is called the "explained variation" because variability attributed to differences between group or sample means is a measure of how much variation is explained by (or statistically dependent on) the group structure itself.

Within-group variation measures the variation of observations in each group or sample about the mean of that group or sample. That is, each sample has a mean and variance, and when these estimates of variability are totaled for all groups, the result is the total within-group variability. Internal variation within each of the groups is not "explained" by the grouping or categorization scheme and thus is considered "unexplained," "residual," or "error" variation.

Many different testing structures are available for ANOVA, but only a single-variable model called "one-way analysis of variance" is examined here. This testing model is called "one-way" because observations fall into different groups or samples based on their values for one variable. Other, more complicated ANOVA techniques incorporate two or more variables simultaneously in the statistical model. In the problem regarding the percentage of females participating in the labor force, for example, it might be useful to see if another variable such as national literacy rate also varies with female labor force participation. Perhaps literacy should be included with the level of development variable in the analysis.

More complex ANOVA models with multiple variables are often used in agricultural, medical, and psychological research, but less often in geography. In part, this is because geographers infrequently work with controlled experimental research designs containing equal sample sizes and other restrictive assumptions required by the more complex models. ANOVA testing structures having multiple variables are generally discussed only in non-introductory statistics texts that examine multivariate statistical techniques.

The basic goal of the ANOVA test is to determine which is more dominant or pronounced: between-group variability or within-group variability. The test statistic for ANOVA (F) incorporates these two sources of variation:

    F = MS_B / MS_W    (11.1)

where MS_B = between-group mean squares
      MS_W = within-group mean squares

Calculating Between-Group Variability

To calculate between-group mean squares (the numerator in equation 11.1), a three-step procedure is followed:

Step 1: Calculate the total or overall mean (X̄_T), which is defined as the weighted average of the individual group or sample means (X̄1, X̄2, ..., X̄k):

    X̄_T = Σ n_i X̄_i / N    (11.2)

where X̄_i = mean of sample i
      n_i = number of observations in sample i
      k = number of groups or samples
      N = total number of observations in all samples

Step 2: Calculate the between-group sum of squares (SS_B):

SS_B = Σ_{i=1}^{k} n_i (X̄_i − X̄_T)²    (11.3)

SS_B = (Σ_{i=1}^{k} n_i X̄_i²) − N X̄_T²    (11.4)

Equation 11.3 is the definitional formula of the between-group sum of squares, and equation 11.4 is the computational formula. Notice what is happening in equation 11.3. Within the parentheses, the total or overall mean is subtracted from each group mean. If any of these differences between total mean and individual group mean is large, it will make the entire expression (the between-group sum of squares) a larger value. This makes sense, for if an individual group mean is very different from the total or overall mean, it suggests a large amount of variability between the group means. In equation 11.3, each of these differences is squared and multiplied by the number of observations in that group (to give that group its proper weight). Finally, these weighted differences are summed across all k groups.

Step 3: Calculate the between-group mean squares (MS_B):

MS_B = SS_B / df_B = SS_B / (k − 1)    (11.5)
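The algebraic equivalence of the definitional formula (11.3) and the computational formula (11.4) is easy to check numerically. A minimal Python sketch (the group sizes and means below are made-up illustrative values, not data from the text):

```python
# Check that equations 11.3 and 11.4 give the same between-group
# sum of squares. Group sizes and means are illustrative values only.
n = [5, 8, 6]                  # n_i: observations per group
xbar = [10.0, 14.0, 11.5]      # group means

N = sum(n)                                             # total observations
xbar_t = sum(ni * xi for ni, xi in zip(n, xbar)) / N   # overall mean (eq. 11.2)

# Definitional formula (eq. 11.3): weighted squared deviations from the overall mean
ss_b_def = sum(ni * (xi - xbar_t) ** 2 for ni, xi in zip(n, xbar))

# Computational formula (eq. 11.4): sum of n_i * xbar_i^2 minus N * xbar_T^2
ss_b_comp = sum(ni * xi ** 2 for ni, xi in zip(n, xbar)) - N * xbar_t ** 2

assert abs(ss_b_def - ss_b_comp) < 1e-9
```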

FIGURE 11.1
Null and Alternate Hypotheses in Analysis of Variance (ANOVA)

Assertion of null hypothesis, H0: All samples drawn from the same population (µ1 = µ2 = µ3).
Assertion of alternate hypothesis, HA: At least one sample drawn from a different population (µ1 ≠ µ2 ≠ µ3).
Case 1: Small apparent difference between sample means. Likely decision: do not reject H0.
Case 2: Large apparent difference between sample means. Likely decision: reject H0.
Chapter 11 .A. Three-or-More-Sample Difference Tests: Analysis of Variance Methods 177

where df_B = between-group degrees of freedom

Note that the between-group mean squares is simply the between-group sum of squares divided by the between-group degrees of freedom (df_B), where df_B depends on the number of groups or samples in the problem.

Calculating Within-Group Variability

The calculation of within-group mean squares (the denominator in equation 11.1) also involves multiple steps:

Step 1: Calculate the within-group sum of squares (SS_W):

SS_W = Σ_{i=1}^{k} (n_i − 1) s_i²    (11.6)

Interpretation of equation 11.6 is straightforward. The variance in each group or sample (s_i²) is multiplied or "weighted" by its sample size (n_i − 1). Then all of the weighted variances are summed to obtain the total within-group sum of squares for the groups.

Step 2: Calculate the within-group mean squares (MS_W):

MS_W = SS_W / df_W = SS_W / (N − k)    (11.7)

The within-group mean squares is simply the within-group sum of squares divided by the within-group degrees of freedom (df_W), where df_W equals the total number of observations in all samples (N) minus the number of groups or samples (k).

When equation 11.1 is examined, the ANOVA test statistic, F, defines the ratio of the between-group mean squares (equation 11.5) to the within-group mean squares (equation 11.7). How can we interpret the magnitude of this ratio of mean squares?

Even if multiple independent samples are drawn from the same population, the set of sample means will fluctuate somewhat. This is to be expected because of the nature of sampling. This expected fluctuation is measured by the standard error of the estimate. Now consider what should occur if the null hypothesis is correct; the group or sample means should vary no more than expected if they are a set of independent random samples drawn from a single population. Therefore, when the null hypothesis is correct, the between-group variance would approximately equal the within-group variance. As a result, the F statistic will have a magnitude of about one.

However, if H0 is incorrect and the population means are significantly different, then the sample means estimating these population means will vary more than expected from simple random fluctuations of multiple samples from the same population. This will cause the between-group variance to be significantly larger than the within-group variance, and the F statistic will be greater than one.

Analysis of Variance (ANOVA)
Primary Objective: Compare three or more (k) independent random sample means for difference
Requirements and Assumptions:
1. Three or more (k) independent random samples
2. Each population is normally distributed
3. Each population has equal variance
4. Variable is measured at interval or ratio scale
Hypotheses:
H0: µ1 = µ2 = ... = µk
HA: µ1 ≠ µ2 ≠ ... ≠ µk
Test Statistic:
F = MS_B / MS_W

11.2 KRUSKAL-WALLIS TEST

The Kruskal-Wallis (K-W) test is the nonparametric equivalent of ANOVA. Kruskal-Wallis may be the most appropriate technique in cases where assumptions required for the use of the parametric ANOVA test (such as normality and equal population variances) are not fully met. Kruskal-Wallis is the nonparametric extension of the Wilcoxon rank sum W test to problems with three or more samples.

In the K-W test, values from all samples are combined into a single overall ranking (as in the Wilcoxon test). The rankings from each sample are summed and the mean rank of each sample (r̄_i) is then calculated. For example, in sample one the sum of ranks is R1 and the mean rank is r̄1 = R1/n1. In sample two, the sum of ranks is R2 and the mean rank is r̄2 = R2/n2, and so on. Because of the random nature of sampling, the sample mean ranks should differ somewhat, even if the samples are drawn from the same population. The K-W test examines whether the mean rank values are significantly different.

If the multiple (k) samples are from the same population, as asserted by the null hypothesis, their mean ranks should be approximately equal (r̄1 ≈ r̄2 ≈ ... ≈ r̄k). On the other hand, if the mean ranks differ more than is likely with chance fluctuations, it may be concluded that at least one of the samples comes from a different population and the null hypothesis is rejected.
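The whole ANOVA computation of section 11.1 (equations 11.2 through 11.7) can be collected into a short function. The sketch below uses small made-up samples and, as a cross-check, compares the result with SciPy's `f_oneway` (assuming SciPy is available):

```python
import numpy as np
from scipy import stats

def anova_f(samples):
    """ANOVA F statistic built from equations 11.2-11.7."""
    k = len(samples)
    n = np.array([len(s) for s in samples])
    xbar = np.array([np.mean(s) for s in samples])
    var = np.array([np.var(s, ddof=1) for s in samples])  # sample variances s_i^2
    N = n.sum()

    xbar_t = (n * xbar).sum() / N            # eq. 11.2: overall mean
    ss_b = (n * (xbar - xbar_t) ** 2).sum()  # eq. 11.3: between-group sum of squares
    ms_b = ss_b / (k - 1)                    # eq. 11.5: between-group mean squares
    ss_w = ((n - 1) * var).sum()             # eq. 11.6: within-group sum of squares
    ms_w = ss_w / (N - k)                    # eq. 11.7: within-group mean squares
    return ms_b / ms_w                       # eq. 11.1: F ratio

# Three small illustrative samples (values are made up)
groups = [[4.1, 5.0, 6.2, 5.5], [7.9, 8.4, 7.1, 9.0], [5.9, 6.4, 7.0]]
f_steps = anova_f(groups)
f_scipy, p = stats.f_oneway(*groups)
assert abs(f_steps - f_scipy) < 1e-8  # the step-by-step recipe matches SciPy
```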

The Kruskal-Wallis H test statistic is

H = [12 / (N(N + 1))] Σ_{i=1}^{k} n_i r̄_i² − 3(N + 1)    (11.8)

where N = total number of observations or values in all samples = (n1 + n2 + ... + nk)
      n_i = number of observations or values in sample i
      r̄_i = mean rank in sample i

In some geographic data sets, the ranks for a number of observations across the k samples may be tied. If this affects more than 25% of the values, an adjustment factor is included in the K-W test statistic. As a rule, statistical software packages calculate both the unadjusted and adjusted test statistic automatically. The correction factor for ties is

1 − ΣT / (N³ − N)    (11.9)

where T = t³ − t
      t = number of tied observations in the data
      N = total number of observations or values in all k samples

With the correction factor for ties, the H statistic is

H = { [12 / (N(N + 1))] Σ_{i=1}^{k} n_i r̄_i² − 3(N + 1) } / [ 1 − ΣT / (N³ − N) ]    (11.10)

This modification for ties increases the value of H, reduces the p-value, and improves the likelihood of discovering significant differences between samples.

Kruskal-Wallis Test
Primary Objective: Compare three or more (k) independent random sample mean ranks for difference
Requirements and Assumptions:
1. Three or more (k) independent random samples
2. Each population has an underlying continuous distribution
3. Variable is measured at ordinal scale or downgraded from interval/ratio scale to ordinal
Hypotheses:
H0: The populations from which the three or more (k) samples have been drawn are all identical
HA: The populations from which the three or more (k) samples have been drawn are not all identical
Test Statistic:
H = [12 / (N(N + 1))] Σ n_i r̄_i² − 3(N + 1)

The Kruskal-Wallis H distribution approximates the commonly used chi-square (χ²) distribution when k > 3 and/or at least one sample has a size (n_i) > 5. Thus, to determine a p-value for a calculated Kruskal-Wallis statistic, you would first convert H into chi-square and generate the p-value from this distribution. The sampling distribution of H is chi-square, where degrees of freedom equal k − 1. The chi-square distribution will be discussed in section 12.1.

11.3 EXAMPLE APPLICATIONS IN GEOGRAPHY

Example: Analysis of Variance (ANOVA)-Differences by Census Region in County-level Obesity Rates

In the first chapter, the spatial pattern of obesity across the United States was one of the four example problems used to illustrate the geographic research process. At that time, we looked at a state-level map showing the percentage of adults classified as obese in 2009 (fig. 1.3). That map reveals that the states with the highest obesity levels are mostly in the mid-South (especially in the Mississippi River Valley) and in Appalachia. Conversely, states with the lowest levels of obesity form two identifiable clusters: one in the West and Rocky Mountains and another in the Northeast, from New Jersey and New York up into New England.

If we disaggregate the obesity data to the county level, inferential hypotheses can be created and tested using a random sample of counties. From those sample county statistics, we can infer to the entire population of all counties in the United States. If a proportional stratified random sample of counties is taken from each of the four official census regions (Midwest, Northeast, South, and West), we can apply ANOVA to test whether there is a statistically significant difference in obesity levels from region to region.

Exactly 200 counties in the conterminous United States are selected using a stratified sample with each census region proportionally represented, resulting in these sample sizes: 65 counties from the Midwest, 14 from the Northeast, 90 from the South, and 31 from the West. Before applying ANOVA, we must determine if the samples are drawn from normally distributed populations and if each population has an equal variance. Testing of the sample data reveals that the samples are normally distributed but that they do not have equal variances. The Levene test for equality of variance results in a test statistic of 21.08 and a p-value of 0.000. This means that the sample variances are certainly unequal, and so a separate variance estimate is used in

the ANOVA analysis rather than a pooled variance estimate. All the major statistical software packages make this determination (pooled or separate variance estimate) automatically.

Table 11.1 summarizes the basic ANOVA procedure. As you might expect from our chapter 1 discussion and obesity map, the counties sampled from the South have the highest obesity rates, averaging 30.609% obese, while the West and Northeast have relatively lower obesity rates (under 26%). The statistical question concerns the likelihood or probability that such observed differences in obesity levels could occur if indeed no real difference exists from one census region to another. If these observed differences in mean obesity level are significantly more than expected

TABLE 11.1
Worktable for Analysis of Variance (ANOVA): Differences in County-Level Obesity Rates by Census Region

H0: µ1 = µ2 = µ3 = µ4
HA: µ1 ≠ µ2 ≠ µ3 ≠ µ4

Sample statistics
Census region     n     X̄        s
Midwest          65    28.780    1.544
Northeast        14    25.914    4.295
South            90    30.609    3.006
West             31    25.819    5.104
N = Σn = 200

Calculate between-group variability:

Step 1: Total or overall mean (X̄_T):

X̄_T = (Σ n_i X̄_i) / N = [65(28.780) + 14(25.914) + 90(30.609) + 31(25.819)] / 200 = 28.94

Step 2: Between-group sum of squares (SS_B):

SS_B = (Σ n_i X̄_i²) − N X̄_T²
     = [65(28.780)² + 14(25.914)² + 90(30.609)² + 31(25.819)²] − 200(28.94)²
     = 682.4

Step 3: Between-group mean squares (MS_B):

MS_B = SS_B / df_B = 682.4 / 3 = 227.5

Calculate within-group variability:

Step 1: Within-group sum of squares (SS_W):

SS_W = Σ (n_i − 1) s_i² = 64(1.544)² + 13(4.295)² + 89(3.006)² + 30(5.104)² = 1977.9

Step 2: Within-group mean squares (MS_W):

MS_W = SS_W / df_W = SS_W / (N − k) = 1977.9 / (200 − 4) = 10.1

Calculate test statistic (F) and associated p-value:

F = MS_B / MS_W = 227.5 / 10.1 = 22.54    (p = .000)
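Because equations 11.2 through 11.7 need only each group's size, mean, and standard deviation, the F statistic in the worktable above can be reproduced directly from the summary statistics (a sketch using the values reported in Table 11.1):

```python
import numpy as np

# Summary statistics from Table 11.1 (county obesity rates by census region)
n = np.array([65, 14, 90, 31])                     # Midwest, Northeast, South, West
xbar = np.array([28.780, 25.914, 30.609, 25.819])  # sample means
s = np.array([1.544, 4.295, 3.006, 5.104])         # sample standard deviations

N, k = n.sum(), len(n)
xbar_t = (n * xbar).sum() / N                      # overall mean (eq. 11.2), about 28.94
ms_b = (n * (xbar - xbar_t) ** 2).sum() / (k - 1)  # between-group mean squares
ms_w = ((n - 1) * s ** 2).sum() / (N - k)          # within-group mean squares
f = ms_b / ms_w                                    # about 22.5, matching the worktable
```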

from random sampling of a single population, then the result will be a large F statistic and a p-value close to zero.

The calculated F statistic of 22.54 has an associated p-value or significance level of .000. The logical conclusion is that at least one of the four census regions has an obesity level that differs from the other census regions. We are virtually 100% confident that it is the correct conclusion to reject the null hypothesis and conclude that county obesity levels differ by census region.

To gain further valuable insights (beyond ANOVA) about the precise differences between obesity rates from one census region to another, we follow two exploratory continuations. First we construct a set of boxplots showing the distribution of county obesity rates by census region (fig. 11.2), and then we conduct a set of six two-sample t tests to determine which specific pairs of census regions have significantly different obesity rates and which do not (table 11.2).

The boxplots nicely confirm the ANOVA results. The South clearly has the largest mean obesity level and the West has the largest standard deviation of obesity levels. The Midwest boxplot is the only one showing any outliers, and it has three. Two counties are large positive outliers with very high obesity levels of 33.4 (a positive outlier would be an obesity level more than 1.5 box-heights above the upper whisker for the census region). One of these is St. Louis city, Missouri, a densely populated urban area with a high percentage of black population, and the other is Charles Mix County, South Dakota, with the Yankton Indian Reservation occupying the eastern half of the county. To the extent that places with high percentages of African-American or Indian population have higher obesity levels, these positive outliers are expected. The low obesity level outlier (24.4% obese) is Hamilton County, Indiana, a suburban area immediately northeast of Indianapolis. If suburban places with relatively high income levels generally have low obesity levels, this negative outlier is also expected.

The series of two-sample t tests (comprising all six possible census region pairs) shows that five of the six census region pairs have significantly different obesity levels. The only exception is the Northeast-West pairing, with a t-test value of 0.06 and an associated p-value of .949. This result is not surprising given the similarity of the obesity level sample means (25.914 for the Northeast and 25.819 for the West). It seems that obesity levels generally vary considerably from one region of the country to another.

TABLE 11.2
Two-Sample Difference of Means t Test Results: County-Level Obesity Rates by Census Region

Census Region Pair         t       df    p-value
Midwest and Northeast     2.46     13    0.029
Midwest and South        -4.90    137    0.000
Midwest and West          3.16     32    0.003
Northeast and South      -3.94     15    0.001
Northeast and West        0.06     29    0.949
South and West            4.93     37    0.000
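The six pairwise comparisons in Table 11.2 can be generated in a loop of two-sample t tests. A sketch using `scipy.stats.ttest_ind` with `equal_var=False` (the separate variance estimate, since the Levene test indicated unequal variances); the samples below are random stand-ins drawn from the Table 11.1 summary statistics, because the actual county data are not listed in the text:

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-ins for the four regional samples: normal draws using the sizes,
# means, and standard deviations reported in Table 11.1. These are NOT
# the real county observations.
regions = {
    "Midwest":   rng.normal(28.780, 1.544, 65),
    "Northeast": rng.normal(25.914, 4.295, 14),
    "South":     rng.normal(30.609, 3.006, 90),
    "West":      rng.normal(25.819, 5.104, 31),
}

results = []
for (name_a, a), (name_b, b) in combinations(regions.items(), 2):
    # Welch t test (equal_var=False): separate variance estimate,
    # as used in the chapter example
    t, p = stats.ttest_ind(a, b, equal_var=False)
    results.append((name_a, name_b, round(t, 2), round(p, 3)))
```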

FIGURE 11.2
Boxplots of County-Level Obesity Rates by Census Region, 2008
(Vertical axis: county-level obesity rates, roughly 15 to 35 percent; one boxplot per census region (Midwest, Northeast, South, West), with group means and outliers marked.)

Example: ANOVA and Kruskal-Wallis Tests-Differences in Home Purchase Price by Middletown District

Suppose that members of the Middletown City Council want to determine whether home purchase prices differ between the four official town districts: Northside, Central, Easton, and Southside. (Note: to refresh your memory about Middletown, you might want to review parts of chapter 8, which show Middletown-related examples of confidence intervals and a map of Middletown with its four districts.) A random sample of home purchase prices is collected from each district and the multi-sample difference tests of ANOVA and Kruskal-Wallis are applied to answer this question. Table 11.3 shows the home purchase prices of the 34 total homes randomly selected from the four districts.

TABLE 11.3
Random Sample of Home Purchase Prices by Middletown District (in thousands of dollars)

Southside (n1 = 9): 218.38, 231.80, 179.34, 180.56, 224.48, 168.36, 204.96, 213.50, 190.32
Easton (n2 = 10): 212.28, 220.82, 235.46, 262.30, 248.88, 223.26, 235.46, 239.12, 184.22, 250.10
Central (n3 = 8): 202.52, 154.94, 151.28, 189.10, 173.24, 219.60, 183.00, 215.94
Northside (n4 = 7): 212.28, 233.02, 244.00, 256.20, 270.84, 257.42, 222.04

First, an analysis of variance (ANOVA) test is applied. ANOVA seems appropriate since this is a "difference test" problem with four samples and the data are measured on a ratio scale (with the purchase price of each home expressed in thousands of dollars). Before applying ANOVA, it must be determined if the samples are drawn from normally distributed populations and if each population has an equal variance. Testing of the sample data confirms that the samples are normally distributed and have equal variances.

Table 11.4 summarizes the results of the ANOVA analysis. The average home purchase price ranges from a low of $186,200 in the Central district to a high of $242,260 in the Northside district, with a total or city-wide mean purchase price of $214,971. The statistical question concerns the likelihood or probability that such observed sample differences in purchase price could occur if indeed no real difference exists in the overall Middletown population of home values. If these observed differences in mean purchase price are significantly more than expected from random sampling, then the result will be a large F statistic and a p-value close to zero.

TABLE 11.4
Summary of Analysis of Variance (ANOVA): Home Purchase Prices by Middletown District (in thousands of dollars)

H0: µ1 = µ2 = µ3 = µ4    HA: µ1 ≠ µ2 ≠ µ3 ≠ µ4

Middletown District      X̄        s
Southside (n1 = 9)     201.30    22.47
Easton (n2 = 10)       231.19    22.29
Central (n3 = 8)       186.20    25.77
Northside (n4 = 7)     242.26    20.96
n = 34    X̄_T = 214.971

F = 5382 / 527 = 10.22    (p = .000)

Ninety-five percent confidence intervals by Middletown district (plotted on a 175 to 250 scale; based on pooled standard deviation = 22.95).

The calculated F statistic of 10.22 has an associated p-value or significance level of .000. The logical conclusion is that at least one of the four Middletown districts has home purchase prices that differ from the other districts. We can be nearly 100% confident that rejecting the null hypothesis and concluding that Middletown home purchase prices differ by district is indeed the correct conclusion.

The next logical question is "where do the sample differences occur?" The ANOVA significance level of .000 clearly indicates that significant differences exist somewhere in the grouping structure of four districts, but does not indicate where. To determine the location of these differences, a series of two-sample difference of means t tests is applied. The descriptive statistics from ANOVA show the Central district with the lowest mean home purchase price (X̄ = $186,200) and the Northside district with the highest mean purchase price (X̄ = $242,260). Therefore, the largest magnitude t test result and lowest p-value will likely occur when the Central and Northside districts are paired.

The results (table 11.5) confirm that the largest magnitude t value exists when these two districts are contrasted (t = -4.58, df = 13, p = .001). However, an equally significant p-value occurs when the Easton and Central districts are compared (t = 3.97, df = 16, p = .001). Why is the p-value for Central and Northside the same as the p-value for Easton and Central when the calculated t for Central and Northside is quite a bit larger (absolute
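The ANOVA result in Table 11.4 can be reproduced from the raw purchase prices in Table 11.3 with `scipy.stats.f_oneway` (a sketch, assuming SciPy is available):

```python
from scipy import stats

# Home purchase prices from Table 11.3 (thousands of dollars)
southside = [218.38, 231.80, 179.34, 180.56, 224.48, 168.36, 204.96, 213.50, 190.32]
easton    = [212.28, 220.82, 235.46, 262.30, 248.88, 223.26, 235.46, 239.12, 184.22, 250.10]
central   = [202.52, 154.94, 151.28, 189.10, 173.24, 219.60, 183.00, 215.94]
northside = [212.28, 233.02, 244.00, 256.20, 270.84, 257.42, 222.04]

f, p = stats.f_oneway(southside, easton, central, northside)
# f is about 10.22 and p well below .001, matching Table 11.4
```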


magnitude 4.58 versus 3.97)? The answer can be found by examining the slightly different sample sizes. Mean home purchase prices differ by the largest magnitude between Central and Northside, but the Northside sample size is somewhat smaller than the Easton sample size (table 11.3). Quite simply, the smaller sample size from Northside has the effect of keeping the level of statistical confidence from dropping below p = .001, even though the magnitude of the test statistic t is higher.

TABLE 11.5
Two-Sample Difference of Means t Test Results: Home Purchase Prices by Middletown District

District Pair               t      df    p-value
Southside and Easton      -2.91    17    0.010
Southside and Central      1.29    15    0.216
Southside and Northside   -3.72    14    0.002
Easton and Central         3.97    16    0.001
Easton and Northside      -1.03    15    0.318
Central and Northside     -4.58    13    0.001

The ANOVA results validate the existence of significant differences in home purchase prices by neighborhood. Furthermore, the normality and equal variance requirements regarding the application of ANOVA have been met. However, in any problem with rather small sample sizes, application of powerful parametric tests like ANOVA must be evaluated cautiously. One conservative strategy is to run the equivalent nonparametric test as well and compare test results. In this case, the equivalent nonparametric test is Kruskal-Wallis.

The Kruskal-Wallis procedure is now applied to the same home purchase price data (table 11.6). With K-W, the sample values from all Middletown districts are combined into a single overall ranking, the rankings from each sample are summed, and the mean rank of each sample is calculated. The mean ranks of the four districts are quite different [Southside (12.6); Easton (23.1); Central (8.5); and Northside (26.1)]. It is therefore not surprising that the K-W test statistic indicates significant differences in home purchase prices by Middle-

TABLE 11.6
Worktable for Kruskal-Wallis: Home Purchase Prices by Middletown District

Home purchase price (overall rank) by district:

Southside (n1 = 9): 218.38 (17), 231.80 (23), 179.34 (5), 180.56 (6), 224.48 (22), 168.36 (3), 204.96 (12), 213.50 (15), 190.32 (10); R1 = 113, r̄1 = 12.6
Easton (n2 = 10): 212.28 (13.5), 220.82 (19), 235.46 (25.5), 262.30 (33), 248.88 (29), 223.26 (21), 235.46 (25.5), 239.12 (27), 184.22 (8), 250.10 (30); R2 = 231.5, r̄2 = 23.1
Central (n3 = 8): 202.52 (11), 154.94 (2), 151.28 (1), 189.10 (9), 173.24 (4), 219.60 (18), 183.00 (7), 215.94 (16); R3 = 68, r̄3 = 8.5
Northside (n4 = 7): 212.28 (13.5), 233.02 (24), 244.00 (28), 256.20 (31), 270.84 (34), 257.42 (32), 222.04 (20); R4 = 182.5, r̄4 = 26.1

H = [12 / (N(N + 1))] Σ n_i r̄_i² − 3(N + 1)
  = [12 / (34(35))] (9(12.6)² + 10(23.1)² + 8(8.5)² + 7(26.1)²) − 3(35)
  = .010084(12111.41) − 105
  = 17.13    associated p-value = .001 (both adjusted and not adjusted for ties)
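The H statistic in the worktable can be cross-checked with `scipy.stats.kruskal`, which pools and ranks the values and applies the tie correction of equation 11.9 automatically (a sketch, assuming SciPy; the exact result differs slightly from 17.13 because the worktable rounds the mean ranks to one decimal place):

```python
from scipy import stats

# Home purchase prices from Table 11.3 / Table 11.6 (thousands of dollars)
southside = [218.38, 231.80, 179.34, 180.56, 224.48, 168.36, 204.96, 213.50, 190.32]
easton    = [212.28, 220.82, 235.46, 262.30, 248.88, 223.26, 235.46, 239.12, 184.22, 250.10]
central   = [202.52, 154.94, 151.28, 189.10, 173.24, 219.60, 183.00, 215.94]
northside = [212.28, 233.02, 244.00, 256.20, 270.84, 257.42, 222.04]

# kruskal pools the 34 values, ranks them (averaging tied ranks),
# and applies the tie correction automatically
h, p = stats.kruskal(southside, easton, central, northside)
# h is about 17.16 and p about .001, in line with the worktable's
# rounded-rank value of 17.13
```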

town district (H = 17.13 and p = .001). These results clearly confirm the ANOVA findings that somewhere in the four samples, significant differences exist.

Wilcoxon and Mann-Whitney tests are run on each pair of neighborhoods as a follow-up to the K-W testing. The Wilcoxon/Mann-Whitney results identify several significant differences in home purchase prices, findings that closely match the results obtained earlier from the corresponding parametric t tests (table 11.7). The most significant difference statistically occurs when home purchase prices in Easton and Central are compared (p = .0029), and the next most significant difference is between Central and Northside (p = .0032).

TABLE 11.7
Two-Sample Difference Test Results (Wilcoxon and Mann-Whitney): Home Purchase Prices by Middletown District

District-specific statistics:
District     Overall mean rank (Wilcoxon)    Median home purchase price (Mann-Whitney)
Southside    12.6                            204.96
Easton       23.1                            235.46
Central       8.5                            186.05
Northside    26.1                            244.00

Middletown District Pairing    Wilcoxon and Mann-Whitney p-value
Southside-Easton               .0128
Southside-Central              .2685
Southside-Northside            .0081
Easton-Central                 .0029
Easton-Northside               .4350
Central-Northside              .0032

Statistically, the ANOVA and Kruskal-Wallis test results are remarkably consistent. Both tests confirm that highly significant differences in home purchase prices exist among the four districts of Middletown. The slightly lower p-value associated with ANOVA (p = .000) is expected given the greater statistical strength of this parametric technique, as contrasted with the nonparametric Kruskal-Wallis test (p = .001).

As a geographer, you should now ask why home purchase prices vary significantly among the four Middletown districts. Why do district pairs Easton-Central and Central-Northside differ by the largest magnitudes? These ANOVA and K-W results should be viewed as intermediate steps in the research process. You might want to examine other neighborhood characteristics that possibly affect home purchase prices, such as age of home, number of square feet, size of lot, and so on. Here is another instance where the results (output) from statistical analysis provide questions (input) that lead to further analysis. In other words, effective geographic problem solving often involves this cumulative, exploratory process.
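The pairwise follow-up tests in Table 11.7 can be sketched with `scipy.stats.mannwhitneyu`, shown here for the Easton-Central pair that the table reports as most significant (p = .0029; SciPy's p-value may differ slightly depending on its exact versus asymptotic method choice):

```python
from scipy import stats

# Easton and Central home purchase prices from Table 11.3 (thousands of dollars)
easton  = [212.28, 220.82, 235.46, 262.30, 248.88, 223.26, 235.46, 239.12, 184.22, 250.10]
central = [202.52, 154.94, 151.28, 189.10, 173.24, 219.60, 183.00, 215.94]

u, p = stats.mannwhitneyu(easton, central, alternative="two-sided")
# In the data, 74 of the 80 possible Easton-over-Central pairs favor Easton,
# so the test reports a highly significant difference (p well below .01),
# consistent with Table 11.7
```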

Example: ANOVA and Kruskal-Wallis Tests-Differences by Census Region in Immigrant Growth Rates for Large Metro Areas

Over the last 15 years, the Brookings Institution has developed a Metropolitan Policy Program that redefines the challenges facing metropolitan America and promotes innovative solutions to help communities grow in more productive, inclusive, and sustainable ways. A key component of the Metro Program is an ongoing examination of immigration patterns. In a recent report, "Immigrants in 2010 Metropolitan America: A Decade of Change," Jill H. Wilson and Audrey Singer examined newly released data from the 2010 American Community Survey on the foreign-born population. They focused on immigration flow patterns into the 100 largest metropolitan areas (as ranked by their 2010 census populations) and found that immigration settlement has become less concentrated (more dispersed) during the 2000s as metropolitan areas with relatively small numbers of immigrants in 2000 have been experiencing rapid growth of immigrant population since that time.

Building on their findings, we explore statistically the spatial pattern of immigration growth for these 100 largest metropolitan areas, using percent change in immigrant population from 2000 to 2010 as our measurement parameter. The Brookings policy brief states that "Metro areas experiencing the fastest growth rates were places that had relatively small immigrant populations. A swath of metro areas from Scranton stretching southwest to Indianapolis and Little Rock and sweeping east to encompass most of the Southeast and lower mid-Atlantic ... saw growth rates on the order of three times that of the 100-largest-metro-areas rate."¹

Using both ANOVA and Kruskal-Wallis, we apply these tests in a descriptive fashion only to explore the existence and strength of a regional concentration of high-percentage-change immigrant populations during the last decade. Each of the 100 largest metro areas is assigned to its census region (Northeast, South, Midwest, and West), and both of these difference tests are run. The statistical analysis must be limited to confirming or rejecting descriptive statements. We cannot make direct use of statistical test values and associated p-values to make inferential probabilistic statements because our data set is the total population of the 100 largest metro areas, and not a sample of metro areas from some larger population. However, we can use these inferential statistics in an exploratory way and visually show the spatial

¹ The Brookings Institution. "Immigrants in 2010 Metropolitan America: A Decade of Change," published October 24, 2011. http://www.brookings.edu/research/speeches/2011/10/24-immigration-singer#


pattern of metro areas with the most rapidly growing immigrant populations.

The results from both the ANOVA and Kruskal-Wallis tests shown in table 11.8 describe patterns of change in immigrant growth that seem to vary considerably from one census region to another. Both the ANOVA result (F = 11.39 and p-value = .000) and K-W result (H = 29.04 and p-value = .000) suggest notable overall regional differences. In addition, the intermediate summary statistics show a separation of the South from the other three census regions. With ANOVA, the mean percentage of immigrant growth in metro areas of the South is 71.19, compared to much lower mean growth rates in the other three census regions [Northeast (40.52), Midwest (43.28), and West (35.28)]. Very similar results are obtained with Kruskal-Wallis, as the average rank of immigrant growth in metro areas of the South is 70.2, much higher than the other census regions [Northeast (37.7), Midwest (43.1), and West (35.3)].

A map showing the percentage change in immigrant population in the largest 100 metro areas contrasts the spatial pattern of the 25 fastest-growing areas with the other 75 slower-growing places (fig. 11.3). This map is a modification of one that appears in the Brookings policy brief cited above. The rapid growth of immigrant population in many southern cities is clearly evident. In the 100 largest metro areas of the United States, 19 of the 25 with the most rapidly growing immigrant populations are located in the South.

Boxplots of each census region are compared in figure 11.4. The boxplot graphic confirms both the ANOVA and Kruskal-Wallis test results and the map evidence: the large metropolitan areas of the South have large percentage increases in immigrant population from 2000 to 2010.

Scranton, PA is the lone outlier in this series of boxplots, with an extraordinary immigrant population growth rate of 140.2% over the last decade. Although the numerical growth in immigrant population is not extraordinary (a 15,907 increase in immigrants), the growth rate is much higher than any other large metropolitan area in the Northeast. In fact, it is the fastest immigrant growth rate of all 100 large metro areas in the country!

What is going on in Scranton? We know that most of the immigrant growth is among younger cohorts. For example, according to the 2010 Census, over 80% of the Scranton Wilkes-Barre metropolitan area Hispanic population is less than 44 years old and nearly 40% are under the age of 24. We also know that the city of Scranton faces a total debt of about $300

FIGURE 11.3
The Hundred Largest Metro Areas in the United States: Highlighting the Twenty-Five Metro Areas with the Greatest Percentage Increase in Immigrant Population, 2000 to 2010
(Map legend: filled dots mark the 100 largest U.S. metro areas; open circles mark the top 25 U.S. metro areas with the greatest percentage increase in immigrant population, 2000 to 2010. Scranton, PA is labeled.)
Source: Brookings Institution, Immigrants in 2010 Metropolitan America: A Decade of Change

million and may have to declare bankruptcy (as of July 2012). This includes an unfunded pension obligation of $90 million, $150 million in debt held directly by the city or the Scranton Parking Authority, and a $15 million mandatory arbitration settlement due to the city's firefighters' union. Why then is this metropolitan area so attractive to immigrants? This is certainly a question worthy of further attention.

TABLE 11.8
Hundred Largest Metro Areas in the United States: ANOVA and Kruskal-Wallis Tests for Differences by Census Region in Percent Change in Immigrant Population, 2000 to 2010

Analysis of Variance Statistics
Census region     n     X̄        s
West             24    35.28    17.09
South            38    71.19    28.63
Midwest          19    43.28    30.99
Northeast        19    40.52    28.70
OVERALL:        100
Overall Mean = 51.44    F = 11.39    p = .0000

Kruskal-Wallis Statistics
Census region     n     Median    Average rank
West             24    33.35     35.3
South            38    67.50     70.2
Midwest          19    42.80     43.1
Northeast        19    33.00     37.7
OVERALL:        100              50.5
H = 29.04    p = .0000 (both adjusted and not adjusted for ties)
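As noted in section 11.2, the sampling distribution of H is chi-square with k - 1 degrees of freedom, and F follows an F distribution with (k - 1, N - k) degrees of freedom, so the near-zero p-values in Table 11.8 can be recovered from those distributions (a sketch assuming SciPy):

```python
from scipy import stats

k, n_total = 4, 100                 # four census regions; 100 metro areas
f_stat, h_stat = 11.39, 29.04       # test statistics reported in Table 11.8

p_anova = stats.f.sf(f_stat, k - 1, n_total - k)  # F on (3, 96) degrees of freedom
p_kw = stats.chi2.sf(h_stat, k - 1)               # chi-square on k - 1 = 3 df
# Both tail probabilities are far below .001, i.e. the ".0000" entries in Table 11.8
```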

[Figure 11.4 (boxplots). Boxplots of Percent Change in Immigrant Population, Hundred Largest Metro Areas in the United States, by Census Region, 2000 to 2010. Vertical axis: percent change in immigrant population, 2000 to 2010 (0 to 140); horizontal axis: Census Region (West, South, Midwest, Northeast). Scranton, PA appears as a high outlier.]
186  Part IV ▲ Inferential Problem Solving in Geography

KEY TERMS

analysis of variance (ANOVA) or F-test (parametric), 174
between-group variation, 175
Kruskal-Wallis test (non-parametric), 174
within-group variation, 175

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Recognize those geographic problems or situations for which application of a multiple sample difference test is appropriate.
2. Explain the rationale and purposes for conducting a difference test for three or more samples, either analysis of variance (ANOVA) or Kruskal-Wallis.
3. Distinguish between the major components of analysis of variance: between-group variability and within-group variability.

REFERENCES AND ADDITIONAL READING

If you are interested in exploring further the geographic examples mentioned in this chapter, the following are good places to start. Once again, see the Centers for Disease Control and Prevention for obesity data: www.cdc.gov. Home purchase price and other housing data are available from many sources. The U.S. Census Bureau has a variety of reports and detailed data related to housing affordability, prices, and starts at www.census.gov/housing. Home price data is constantly updated by numerous private agencies, including RealtyNow (www.realtynow.com) and Zillow (www.zillow.com/homes/recently_sold). With regard to immigration data, the U.S. Department of Homeland Security, Office of Immigration Statistics (www.dhs.gov/immigration-statistics) is a fine place to begin. Also see the U.S. Census Bureau. The specific issue of immigration and metropolitan growth is covered by the Brookings Institution, especially in a series of reports dealing with the "State of Metropolitan America." See www.brookings.edu. Of course, many other online organizations and interest groups focus on aspects of the immigration issue.
Categorical Difference Tests

12.1 Goodness-of-fit Tests


12.2 Contingency Table Analysis

In the last few chapters, we have presented a logical progression of difference tests. The progression started in chapter 9 with one-sample difference tests comparing a single sample statistic to a corresponding population parameter. The sequence continued through two-sample difference tests that analyze two independent samples for differences and dependent-sample (matched-pairs) difference tests that analyze one sample twice (chapter 10). The progression finished with three-or-more-sample difference tests that determine whether multiple samples have been drawn from one or more populations (chapter 11).

The focus now shifts to special types of difference tests that are designed to analyze data organized into categories. Some of these tests, known as goodness-of-fit tests, compare actual or observed sample frequency counts of a single variable (that is, the actual or observed number of observations in each category of a single variable) with some expected frequency distribution. In many cases, this type of goodness-of-fit test determines whether the actual data "fits" (is similar to) the distribution of some expected geographic model or theory. In section 12.1, we look at three goodness-of-fit examples: (1) the application of the chi-square statistic to determine if the actual sample frequency counts of a variable are distributed in a uniform or even manner; (2) the application of the chi-square statistic to determine if the actual frequency counts of a variable are distributed in a proportional manner similar to what is expected in some predictive model; and (3) the application of the Kolmogorov-Smirnov statistic to determine if a set of data is normally distributed.

Sometimes we have a two-way (contingency) table that classifies observations according to two different categorical variables. Our goal is to determine whether any nonrandom relationship exists between these two variables. This type of data structure (contingency table analysis) is presented in section 12.2, with three accompanying geographic examples.

12.1 GOODNESS-OF-FIT TESTS

In some problems, the distribution of frequency counts is expected to be uniform or equal. For example, suppose an environmental geographer wishes to examine nutrient runoff levels from a series of five adjacent tributaries running into the same bay. Water samples measuring nitrogen and phosphorus levels could be taken on a random selection of days simultaneously from each tributary, and the number of sample days that nutrient levels exceed some critical (threshold) value could be counted for each tributary. Although low levels of such nutrients are acceptable, nutrient levels exceeding some threshold value become pollution problems. It might be expected that the nutrient level is exceeded about the same number of sample days in each tributary, if the nutrient levels are similar from tributary to tributary. This expected frequency count information can be compared with the observed frequency counts of high nutrient level days among the bay tributaries to see whether the two frequency counts are statistically different. A goodness-of-fit procedure can determine whether the actual frequency counts are uniform or equal for this set of tributaries.

Goodness-of-fit procedures can also evaluate whether a set of frequency counts fits an expected distribution that is uneven or proportional. Suppose you expect the distribution of high nutrient level days from tributary to tributary to be something other than uniform. For example, a tributary with a greater proportion of its watershed in agricultural land might be expected to have more sample days exceeding critical nutrient levels than another tributary with less of its watershed in farmland. You might hypothesize that the expected number of high pollution days is proportional to the percentage of tributary land area under cultivation.

Proportional goodness-of-fit tests can also be applied to evaluate the validity of a geographic model. For example, suppose a recreation planner is studying attendance patterns at a regional park that serves people in several nearby communities. How many park visitors might be expected from each community? To estimate park usage patterns, a model of spatial interaction might be useful. One simple model predicts that the volume of spatial interaction (that is, the number of park visitors from a community) is directly related to the population size of the community and inversely related to the distance from the community to the park. That is:

    SI_ij = Pop_i / D_ij                                    (12.1)

where SI_ij = "spatial interaction," that is, the number of park visitors expected (predicted by the model) from community i to park j
      Pop_i = population of community i
      D_ij  = distance from community i to park j

The recreation planner can compare the observed number of park visitors from each community with the corresponding number of visitors predicted (expected) by the spatial interaction model. The extent to which the model "fits" (the extent to which the observed and expected visitor counts from the set of communities match) can be tested statistically with a goodness-of-fit procedure.

Goodness-of-fit tests are also used to compare an actual frequency distribution with a theoretical probability distribution such as normal or random. A frequency distribution often needs to be tested for normality, an important assumption for the data to be appropriately applied in a parametric test such as t or ANOVA. Suppose a geomorphologist is studying soil erosion rates of several soil orders (alfisols, aridosols, entisols, etc.) in a large regional study area. To test for significant differences in erosion rate by soil order, a spatial random sample is taken from each type of soil. If the samples are drawn from normally distributed populations, a parametric ANOVA could be applied as the difference test. A goodness-of-fit procedure can be used to determine whether this normality requirement has been met, thereby confirming the application of ANOVA as a valid statistical test.

In some geographic problems, a spatial frequency distribution can be tested for randomness. In these cases, a Poisson probability distribution would be expected. For example, a goodness-of-fit test for randomness could be applied to a spatial pattern of influenza outbreaks in a metropolitan area. This procedure would compare the observed pattern of influenza cases with the pattern expected if the outbreak follows a random spatial process.

Chi-square (χ²) Goodness-of-Fit Test

When applied as a goodness-of-fit test, the chi-square statistic compares the observed frequency counts of a single variable (organized into nominal or ordinal categories) with an expected distribution of frequency counts allocated over the same categories. The chi-square test must use absolute frequency counts and cannot be applied if the observations or sampling units are in relative frequency form, such as percentages, proportions, or rates.

Chi-square is a method to determine if a truly significant difference exists between a set of observed frequency counts and the corresponding expected frequency counts. The focus is on how closely the two frequency counts match, thereby providing a goodness-of-fit measure. The null hypothesis (H0) states that the population from which the sample has been drawn fits an expected frequency distribution. Thus, H0 assumes no difference between observed and expected frequency counts. If the magnitude of difference between frequency counts is small across all categories, the data are likely to be a random sample drawn from the expected frequency distribution and H0 is not rejected. Conversely, the alternate hypothesis (HA) suggests that the magnitude of difference between frequencies is large in at least one category. If HA is true, the data cannot be considered a random sample from the expected frequency distribution and H0 is rejected.

The formula for the chi-square test statistic is

    χ² = Σ (i = 1 to k) (O_i − E_i)² / E_i                  (12.2)

where O_i = observed or actual frequency count in the ith category
      E_i = expected frequency count in the ith category
      k   = number of nominal or ordinal categories

If the observed and expected frequency counts for each category are similar, then all of the differences (O_i − E_i) will be slight, χ² will be small, the goodness-of-fit will be strong, and H0 will not be rejected. However, if at least one difference between frequency counts is large, then the χ² statistic will be large, the observed frequency counts will not necessarily come from the population or model theorized by the expected frequency counts, and H0 will be rejected. In later example problems, the methodology for calculating expected frequencies will be discussed.
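As a quick illustration of equation 12.2 (our sketch, not the textbook's; the function name and the four-category counts are invented for the example), the statistic is just a sum of squared deviations, each scaled by its expected count:

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic (equation 12.2):
    the sum over the k categories of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical uniform case: 60 counts over 4 categories, so E_i = 60/4 = 15.
observed = [12, 18, 14, 16]
expected = [sum(observed) / len(observed)] * len(observed)
print(round(chi_square_gof(observed, expected), 3))  # prints 1.333
```

A statistic near zero indicates a close fit to the expected distribution; whether a given value is significant depends on the degrees of freedom, discussed next.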
Chapter 12 • Categorical Difference Tests 189

[Figure 12.1. Chi-square Distributions: density curves for d.f. = 5, d.f. = 15, and d.f. = 30, plotted over chi-square values from 0 to 50.]

The shape of the χ² distribution is determined by the degrees of freedom (df), which itself depends on the number of nominal or ordinal categories in the data set (fig. 12.1). The distribution is always positively skewed, but the amount of skew decreases as the degrees of freedom approaches 30. In addition, the total area under the curve is always equal to 1.0. When the df is 30 or more, the χ² distribution is closely approximated by the normal distribution.

While chi-square is a very flexible goodness-of-fit test, minimum size restrictions apply under certain circumstances. For example, if the variable has only two categories, the expected frequency in both should be at least 5. For the test to be valid with more than two categories, no more than one-fifth of the expected frequencies should be less than 5, and no expected frequency should be less than 2. Sometimes categories need to be combined to increase the expected frequencies to an appropriate size.

Chi-square (χ²) Goodness-of-Fit Test

Primary Objective: Compare random sample frequency counts of a single variable with expected frequency counts (goodness-of-fit)

Requirements and Assumptions:
1. Single random sample
2. Variables are organized by nominal or ordinal categories; frequency counts by category are input to statistical test
3. If two categories, both expected frequency counts must be at least five; if three or more categories, no more than one-fifth of the expected frequency counts should be less than five, and no expected frequency count should be less than two

Hypotheses:
H0: Population from which sample has been drawn fits an expected frequency distribution (uniform or equal, proportional or unequal, random, etc.); no difference between observed and expected frequencies
HA: Population does not fit an expected frequency distribution; there is a significant difference between observed and expected frequencies

Test Statistic:
    χ² = Σ (i = 1 to k) (O_i − E_i)² / E_i

Example: Chi-square Goodness-of-Fit (Uniform Distribution): Comparing High School Student SAT Scores

Suppose a school district superintendent wants to conduct a geographic study that compares the Scholastic Aptitude Test (SAT) scores among juniors from five comparably sized high schools located in different portions of a metropolitan school district. Are students from all five high schools performing equally well on the SATs? More specifically, do about the same number of students in each school earn test scores above the national median? If the educational experiences are similar at all five high schools, it should be expected that about the same number of students from each school will have above-median SAT scores. If performance levels are unequal, the superintendent has indicated she will begin a five-year improvement program, expending particular effort toward improving the SAT scores of the lower-achieving schools, while continuing to improve the performance levels of the better schools.

Suppose an initial random sample of 100 students is taken from each school and the number of sampled students with an SAT score exceeding the national median is recorded (table 12.1). A chi-square goodness-of-fit test is appropriate since a set of observed frequency counts from nominal categories is being compared with a comparable set of expected frequency counts. Moreover, expected frequency counts are uniform or equal, because each of the five high schools is expected to have an equal number of sampled students with above-median SAT scores.

The chi-square calculation from the "Initial year results" column of table 12.1 is shown in the top portion of table 12.2. A combined total of 245 sampled students from all five high schools scored above the national SAT median. If this group of students is uniformly or evenly distributed among the five high schools, the expected frequency per high school would be 49 (245 divided by 5). The statistical question is whether the actual (observed) frequency counts (42, 45, 51, 47, and 60) are significantly different from the expected frequency counts (49, 49, 49, 49, and 49). If the frequency counts are different,

(continued)

then the superintendent has statistical verification to begin her five-year "SAT improvement" program.

The calculated chi-square statistic (χ² = 3.959) has an associated p-value or significance level of .412. How should this intermediate-sized p-value be interpreted? If the p-value is very close to 1.0, the actual and expected frequency counts would be very similar, the logical conclusion would be to not reject the null hypothesis, and it would be clear that each of the five high schools has an equal or uniform number of students exceeding the national median SAT score. Alternatively, if the p-value is very close to 0.0, the actual and expected frequency count distributions would be very different, and the decision would be to reject the null hypothesis and conclude that the high schools have different SAT test results.

This intermediate-sized p-value is inconclusive statistically, but should suggest to the superintendent that there seems to

TABLE 12.1
Random Sample of Students with SAT Scores above National Median, by High School

                          Number of SAT scores above the national median
High School               Initial year results    Results after fifth year
John Muir H.S.                   42                        51
Gifford Pinchot H.S.             45                        52
Rachel Carson H.S.               51                        56
Garrett Hardin H.S.              47                        53
Aldo Leopold H.S.                60                        64
TOTAL                           245                       276

TABLE 12.2
Worktable for Goodness-of-Fit Chi-square Uniform: Number of Students with SAT Scores above National Median, by High School

Initial Year Results:

Observed number of students with above-median test scores (from table 12.1):

Muir H.S.   Pinchot H.S.   Carson H.S.   Hardin H.S.   Leopold H.S.   Total
   42           45             51            47            60          245

If testing goodness-of-fit uniform or equal distribution, the expected frequency of each category is calculated by dividing the total observed frequency count by the number of categories:

    E_i = ΣO_i / k    for all categories

    E_i = 245 / 5 = 49

    χ² = (42 − 49)²/49 + (45 − 49)²/49 + (51 − 49)²/49 + (47 − 49)²/49 + (60 − 49)²/49

       = 3.959    (p = .412)

Results after fifth year:

Observed number of students with above-median test scores (from table 12.1):

Muir H.S.   Pinchot H.S.   Carson H.S.   Hardin H.S.   Leopold H.S.   Total
   51           52             56            53            64          276

    E_i = 276 / 5 = 55.2

    χ² = Σ (i = 1 to k) (O_i − E_i)²/E_i
       = (51 − 55.2)²/55.2 + (52 − 55.2)²/55.2 + (56 − 55.2)²/55.2 + (53 − 55.2)²/55.2 + (64 − 55.2)²/55.2

       = 2.007    (p = .734)
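The worktable values above are easy to verify in code. This sketch is ours rather than the authors'; with five categories the test has 4 degrees of freedom, for which the chi-square p-value has the exact closed form e^(−x/2)(1 + x/2), so no statistics library is needed:

```python
import math

def chi2_stat(observed):
    """Uniform goodness-of-fit: E_i = (sum of O_i) / k for every category."""
    e = sum(observed) / len(observed)
    return sum((o - e) ** 2 / e for o in observed)

def chi2_pvalue_df4(x):
    """P(chi-square with 4 d.f. > x); exact closed form for 4 degrees
    of freedom (five categories), the case used in table 12.2."""
    return math.exp(-x / 2) * (1 + x / 2)

initial = [42, 45, 51, 47, 60]   # table 12.1, initial year results
fifth = [51, 52, 56, 53, 64]     # table 12.1, results after fifth year

for counts in (initial, fifth):
    x = chi2_stat(counts)
    print(round(x, 3), round(chi2_pvalue_df4(x), 3))
# prints: 3.959 0.412
#         2.007 0.734
```

Both statistic and p-value match the worktable; for other degrees of freedom a chi-square table or a statistical package would be used instead of the closed form.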

be "room for improvement; particularly if the test scores of frequency counts, there seems to be a shift toward inter•
Muir (42), Pinchot (45), and Hardin (47) high school students school equality. The magnitude of the chi-square statistic
could be substantially improved. The "SAT improvement pro• declines from 3.959 to 2.007, confirming this shift toward
gram" can be deemed a success (at least statistically) if the equity among high schools (table 12.2).
magnitude of the chi-square test statistic can be reduced and The "SATimprovement program" has apparently been sue•
the resultant p•value shifted dramatically toward 1.0. A cessful. High schools with the lowest frequency counts during
p•value of 1.0 will occur only if there is perfect uniformity or the initial year improved their frequency counts the most by
equality across all of the frequency count categories-that is, the end of the fifth year. Leopold High School had the highest
if an equal number of students from each high school were to frequency count during the initial year. Their students man•
score above the national median on their SATtest. aged to improve, but not to the same degree as the other high
Suppose now that the "SATimprovement program" is schools. As a result of these changes in student test perfor•
implemented. Five years later, the effectiveness of the pro- ma nee, the p•value increases considerably from .412 to .734,
gram is evaluated. The number of students scoring above the indicating stronger confidence that the null hypothesis
national median increases at each of the five high schools and should not be rejected. If the null hypothesis is rejected and a
the five-school total of students increases from 245 to 276 conclusion drawn that high school test scores are different, a
(table 12.1, "Results after fifth year"). Looking at the observed 73.4% chance exists that this conclusion is incorrect.

Example: Chi-square Goodness-of-Fit (Proportional Distribution): Interprovincial Migration to Alberta, Canada

Chi-square goodness-of-fit is now applied to a simple spatial interaction model to see how well the model "fits" an actual set of data. We will also focus on the ways in which the model pattern does not fit the actual pattern. Geographers often use spatial interaction models to replicate or predict the volume of movement or amount of activity (interaction) between locations. One simple spatial interaction model assumes that the amount of movement or interaction from a given origin to a destination is directly related to the population size of the origin and inversely related to the distance from the origin to the destination. If the frequency counts of the chi-square goodness-of-fit model closely match the frequency counts that actually exist, then the model is predicting the actual situation effectively.

The spatial interaction we are examining is the pattern of interprovincial migration into Alberta, Canada during 2011. During that year, 87,064 out-of-province migrants settled in Alberta (table 12.3), making it Canada's fastest growing province. In just the fifteen months from January 1, 2011 through March 2012 Alberta experienced a 2.37% growth rate, considerably more than any other province. Of course, this population growth is largely due to the recent development of the Athabasca oil sands (or tar sands) located in the northeastern part of the Province. The Athabasca deposit is the single largest known reservoir of crude bitumen in the world.

Our question concerns how the actual pattern of Alberta in-migration and population growth compares with the pattern predicted by a simple spatial interaction model. The specific model we are using predicts that the expected number of migrants into Alberta is directly proportional to the population in each origin province and inversely proportional to the distance from that origin province to Alberta. That is:

    E_ij = Pop_i / D_ij                                     (12.3)

To estimate these distances, we used a mileage guide published by the Canadian Government Travel Bureau. The actual measure used was average road distance in miles from Calgary and Edmonton (the two largest cities in Alberta) to the estimated population centroid of each of the other provinces.

Using the actual number of interprovincial migrants to Alberta (as provided by Statistics Canada), the observed frequencies for the model are a "simulated sample" of 255 in-migrants (maintaining a constant proportion of approximately three-tenths of 1% (.003) of the actual total as closely as possible). For example, the actual number of arrivals from Manitoba is 5,433 (.0624 or 6.24% of the 87,064 total) (table 12.3). If the "simulated sample" of 255 is used as the total, then Manitoba's in-migration would be .0624(255) = 15.912, or 16 (table 12.4).

TABLE 12.3
Interprovincial Migration to Alberta, Canada, 2011

Origin                        Actual number of interprovincial migrants to Alberta, 2011
Newfoundland and Labrador          4,369
Prince Edward Island                 814
Nova Scotia                        5,253
New Brunswick                      3,176
Quebec                             4,171
Ontario                           22,736
Manitoba                           5,433
Saskatchewan                      11,299
British Columbia                  28,156
Yukon Territory                      329
Northwest Territories              1,164
Nunavut Territory                    164
TOTAL                             87,064

Source: Statistics Canada

(continued)
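The scaling step described above (each province's actual flow held at a constant share of a 255-migrant "simulated sample") can be sketched as follows. This is our illustration, not the authors' worktable; simple rounding reproduces the Manitoba and Atlantic values exactly, while the authors adjusted one or two other provinces slightly so the simulated counts sum to 255.

```python
actual = {
    "Atlantic": 4369 + 814 + 5253 + 3176,  # four Atlantic provinces combined
    "Quebec": 4171,
    "Ontario": 22736,
    "Manitoba": 5433,
    "Saskatchewan": 11299,
    "British Columbia": 28156,
}
total_actual = 87064      # all interprovincial migrants to Alberta, 2011
simulated_total = 255     # size of the "simulated sample"

# Each region keeps its share of the actual total, rescaled to 255.
simulated = {region: round(count / total_actual * simulated_total)
             for region, count in actual.items()}
print(simulated["Manitoba"])  # .0624 * 255 = 15.9, rounds to 16 as in the text
```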

With such a low simulated total of in-migrants (255), we must combine certain low-population provinces and use regional population totals instead of individual province populations. As discussed earlier, a valid application of chi-square requires at least two observations in each cell, and no more than 20% (one-fifth) of the cells should have an expected frequency count of 5 or less. Therefore we have chosen to combine Newfoundland and Labrador, Prince Edward Island, Nova Scotia, and New Brunswick into an "Atlantic Canada" region. Also, the three territories of "Northern Canada" (Yukon, Northwest, and Nunavut) are dropped from the analysis due to small simulated sample sizes.

Using the "population/distance" model and assuming a simulated total of 255 in-migrants, we are now able to contrast the observed and expected frequency counts using the chi-square test. The result is a chi-square test statistic of 130.51. By themselves, the magnitude of the chi-square statistic and its associated p-value have no inferential value: we are using chi-square in an exploratory and descriptive fashion only. In short, this is because the simulated data are just constant proportions (.003) of the actual total and not true random samples. We cannot make any inferences from the "simulated sample" of 255 in-migrants to the actual total of 87,064 in-migrants to Alberta.

We are applying chi-square as an exploratory and descriptive tool to see where the largest differences between observed and expected frequencies are located. Then we can speculate about the underlying spatial processes causing these differences. The final two columns in table 12.4 indicate the contribution of each Canadian province and region to the total chi-square value.

One key discovery is that the Atlantic Provinces are sending many more migrants to Alberta than expected using the "population/distance" predictive model. Nearly 82% (81.82%) of the total "contribution to chi-square" is from the Atlantic region. The logical follow-up geographic question is: "why are many more migrants than expected relocating from the Atlantic region to Alberta?" We might investigate a variety of economic conditions in these eastern provinces (depressed primary industries such as fishing and forestry, current unemployment rate, and so on) to begin to answer this question.

The other important contributor to the magnitude of the chi-square statistic is Quebec, which accounts for slightly more than 16% of the chi-square value. In this case, many fewer migrants than expected are moving to Alberta. As we ask why, we might refer to the "linguistic divide" and other aspects of cultural distinctiveness separating largely French-speaking Quebec from the rest of English-speaking Canada. Other factors may be suggested as well.

TABLE 12.4
Summary Table for Chi-square Goodness-of-Fit Proportional: "Simulated" Interprovincial Migration to Alberta, Canada, 2011

Province (Region) of origin    Observed number of         Expected number of
included in final simulation   interprovincial migrants   interprovincial migrants       Contribution   Percent of total
                               (simulated)                (population/distance model)    to χ²          contribution to χ²
Atlantic*                            40                          9                       106.778             81.82
Quebec                               13                         43                        20.930             16.04
Ontario                              68                         76                         0.842              0.64
Manitoba                             16                         19                         0.474              0.36
Saskatchewan                         34                         28                         1.286              0.99
British Columbia                     84                         80                         0.200              0.15
TOTAL                               255                        255                       130.510            100.0

* The Atlantic region of Canada comprises the four provinces located on the Atlantic coast, excluding Quebec. That is, Newfoundland and Labrador, New Brunswick, Prince Edward Island, and Nova Scotia.

Note: The three territories of Northern Canada (Yukon, Northwest, and Nunavut) were dropped from the analysis due to small simulated sample sizes.

"Population/Distance" model:  E_ij = Pop_i / D_ij

    χ² = Σ (i = 1 to k) (O_i − E_i)² / E_i
       = (40 − 9)²/9 + (13 − 43)²/43 + (68 − 76)²/76 + (16 − 19)²/19 + (34 − 28)²/28 + (84 − 80)²/80
       = 130.510
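The contribution-to-chi-square columns of table 12.4 can be reproduced with a few lines of code (our sketch; the observed and expected counts are taken directly from the table):

```python
regions = ["Atlantic", "Quebec", "Ontario", "Manitoba",
           "Saskatchewan", "British Columbia"]
observed = [40, 13, 68, 16, 34, 84]  # simulated in-migrants (table 12.4)
expected = [9, 43, 76, 19, 28, 80]   # population/distance model (table 12.4)

# Each region's contribution to chi-square: (O_i - E_i)^2 / E_i
contrib = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
total = sum(contrib)
for name, c in zip(regions, contrib):
    print(f"{name}: {c:.3f} ({100 * c / total:.2f}% of total)")
print(f"chi-square = {total:.3f}")  # prints chi-square = 130.510
```

The Atlantic (81.82%) and Quebec (16.04%) shares match table 12.4 exactly; the Ontario share prints as 0.65% rather than the table's 0.64% because of rounding in the table's last digit.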

Kolmogorov-Smirnov Normality Test

The Kolmogorov-Smirnov statistic is an alternative to chi-square for testing similarity between two frequency distributions. Chi-square uses frequencies from either nominal or ordinal classes, but Kolmogorov-Smirnov (K-S) requires data measured at the ordinal level or interval-ratio level downgraded to ordinal. Technically, K-S requires continuously distributed variables, but only very slight errors result when the technique is applied to discrete data. In the most common goodness-of-fit application of K-S (testing a set of sample data to determine if it has been taken from a normally distributed population), the observed distribution of sample data is compared to an expected theoretical normal distribution. The null hypothesis states that no significant difference exists between the two frequency distributions. If it is confirmed that the sample has been taken from a normally distributed population, then it is fully valid to use a parametric statistic such as t or ANOVA in the subsequent analysis.

More specifically, the Kolmogorov-Smirnov test for normality compares the cumulative relative frequencies of the observed sample data with the cumulative frequencies expected for a perfectly normal distribution. If the two sets of cumulative frequencies closely match, the theoretical and sample distributions can be considered the same and the population considered normal. Any sizable difference between the two cumulative distributions suggests that the sample data are not taken from a normally distributed population.

The step-by-step manual calculation of the Kolmogorov-Smirnov is somewhat cumbersome, involving the conversion of the actual ranked data to Z-scores, the calculation of cumulative normal values and cumulative relative frequencies from those Z-scores, and the subsequent comparison of those with the corresponding cumulative relative frequencies expected with a perfectly normal distribution. Fortunately, we can make use of a statistical software package to do the sequence of calculations for us, so it is not necessary to include a step-by-step worktable of calculations. The resultant K-S statistic, D, is the maximum absolute difference between the two sets (observed and expected) of cumulative relative frequencies, that is:

    D = maximum |CRF_o(X) − CRF_e(X)|

where CRF_o(X) = cumulative relative frequencies (observed) for variable X
      CRF_e(X) = cumulative relative frequencies (expected if normal) for variable X

The concept is illustrated in figure 12.2 with a hypothetical sample of 10 observations (ranked in ascending order) being tested for normality. Note that CRF_e(X) is a continuous curve, because its value can be calculated for any magnitude of X. By contrast, CRF_o(X) is a series of points, because its values are determined only at the actual sample points. The K-S test compares the expected and observed relative frequency values only at the sample locations.

When the deviation between the actual and theoretical (normal) distribution is very large, D is very large and the null hypothesis of no difference between the two distributions is likely to be incorrect. In this instance, if D is a large enough value, the actual distribution may be found statistically non-normal. Conversely, if all the differences between the actual and expected distributions are small, then D will be small. In this case, the null hypothesis is not rejected, and the observed distribution is considered normal.

In the p-value approach to determining the significance of the Kolmogorov-Smirnov test statistic, the approximate value of p is estimated by interpolating from a special table of D values. Most statistical software packages provide both the K-S (D) value and the corresponding p-value estimate, along with certain basic statistics such as the mean and standard deviation of the data set.

Kolmogorov-Smirnov Goodness-of-Fit Test

Primary Objective: Compare random sample frequency counts of a single variable with expected frequency counts (goodness-of-fit)

Requirements and Assumptions:
1. Single random sample
2. Population is continuously distributed (test less valid with discrete distribution)
3. Variable is measured at ordinal scale or downgraded from interval/ratio to ordinal

Hypotheses:
H0: Population from which sample has been drawn fits an expected frequency distribution (uniform or equal, proportional or unequal, random, etc.); no difference between observed and expected frequencies
HA: Population does not fit an expected frequency distribution; there is a significant difference between observed and expected frequencies

Test Statistic:
    D = maximum |CRF_o(X) − CRF_e(X)|

[Figure 12.2. Kolmogorov-Smirnov Test: Maximum Difference (D) between Observed and Expected Cumulative Relative Frequencies. The plot shows cumulative relative frequency (.00 to 1.00) against ranked X values in ascending order for a hypothetical sample of 10 observations: a continuous curve gives CRF_e(X), the cumulative relative frequency expected with a theoretical normal distribution; points give CRF_o(X), the cumulative relative frequency observed from the actual sample; D is the largest vertical gap between them.]

Example: Kolmogorov-Smirnov Goodness-of-Fit (Normal Distribution)-Household Annual Energy Expenditures, Parkwood Estates

The Parkwood Estates survey of household water usage change before and after the attempted implementation of a water conservation program was not very successful (see the Wilcoxon matched-pairs test example in chapter 10). Local planners are not deterred, however, and want to conduct another survey of Parkwood Estates households, this time analyzing annual energy expenditures.

A random sample of 25 households is selected and their total energy expenditures are monitored for an entire year (table 12.5). Most of the households sustained energy costs in the $1,800 to $2,000 range. This finding is not very surprising to local officials, as the homes in the development were generally built about the same time with similar quality construction materials and insulation levels, and have about the same square footage. For some reason, however, a few of the homes (particularly numbers 2, 5, and 6) registered much lower energy costs.

TABLE 12.5
Annual Energy Expenditures for Twenty-five Parkwood Estates Households

Household number    Annual energy expenditure (dollars)
 1    $1,915
 2    $  412
 3    $1,840
 4    $1,845
 5    $  841
 6    $1,044
 7    $1,773
 8    $1,691
 9    $1,865
10    $1,846
11    $1,880
12    $1,844
13    $1,822
14    $1,915
15    $2,010
16    $1,991
17    $1,974
18    $1,967
19    $1,920
20    $1,885
21    $1,872
22    $1,887
23    $1,891
24    $1,820
25    $1,813

FIGURE 12.3
Histogram of Annual Energy Expenditures for Twenty-five Parkwood Estates Households
(Frequency of households plotted against household annual energy expenditure in dollars; the low-cost outliers stand apart from the main cluster.)

To explore further, a histogram is constructed (fig. 12.3) and the low-energy cost outliers are easily seen. The shape of the histogram suggests that this sample of households may have been taken from a non-normal population (the "population" being all of the households in Parkwood Estates). If true, significant non-normality would affect which statistical tests could be subsequently used to analyze the energy expenditure data. Recall that many statistical procedures (parametric tests) assume population normality, and using a normality
Chapter 12 • Categorical Difference Tests 195

test such as K-S is often an important step in the analysis procedure. Kolmogorov-Smirnov must be used as the normality test in this problem because the sample size is less than 30. We also decided to generate a probability plot of the energy expenditure data along with calculation of the K-S statistic (fig. 12.4) to assess population normality visually. This graphic plots the actual ranked data values against the values you would expect to occur if the population from which the sample has been taken is normally distributed.

The sequence of plotted points will form an approximately straight line if the sample has been taken from a normally distributed population; clearly that is not the case in the figure. Based on visual evidence, we must conclude that the sample has been taken from a non-normal population. This is confirmed by the Kolmogorov-Smirnov test result. The K-S value of 0.372 indicates the maximum absolute difference in cumulative percentages between the actual and theoretical normal distributions. The likelihood (p-value) that a difference as large as 0.372 could occur if the sample is drawn from a normally distributed population is less than 0.010. That is, we are over 99% confident that the sample has been taken from a non-normal population.

Local officials are upset with this result, as they expected household energy costs to be normally distributed. A quick check of the situation revealed that households 2 and 5 were occupied only during the summer months and that a single elderly person was the sole occupant of household 6. The study would have been improved by including only year-round occupants or by using energy expenditure per month of occupancy as an alternative measure.

FIGURE 12.4
Probability Plot of Annual Energy Expenditures for Twenty-five Parkwood Estates Households
(Cumulative percentage of households, on a normal probability scale from 1 to 99, plotted against household annual energy expenditure in dollars, from 500 to 2,500; the low-expenditure outliers pull the plotted points well away from a straight line.)
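The K-S computation in this example can be reproduced in a few lines of code. The sketch below (not from the text) uses Python's scipy library on the table 12.5 data; because the normal distribution's mean and standard deviation are estimated from the sample itself, the standard K-S p-value reported here is only approximate, and dedicated normality-testing routines apply a Lilliefors-type correction.

```python
# Sketch (not from the text): K-S normality test on the Table 12.5 data
import numpy as np
from scipy import stats

expenditures = np.array([1915, 412, 1840, 1845, 841, 1044, 1773, 1691,
                         1865, 1846, 1880, 1844, 1822, 1915, 2010, 1991,
                         1974, 1967, 1920, 1885, 1872, 1887, 1891, 1820, 1813])

# Theoretical normal distribution fitted from the sample itself
mean = expenditures.mean()
sd = expenditures.std(ddof=1)

# D = maximum |CRFo(X) - CRFe(X)|; d comes out near the 0.372 in the text
d, p = stats.kstest(expenditures, "norm", args=(mean, sd))
```

Here d matches the example's 0.372 and p falls well below 0.01, agreeing with the conclusion that the sample comes from a non-normal population.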

12.2 CONTINGENCY TABLE ANALYSIS

The idea of testing an actual frequency distribution against an expected frequency distribution is now expanded to research problems that examine the relationship between two variables. This is a logical extension of the single-sample or single-variable goodness-of-fit tests. In this instance, categorical frequency counts of a single variable are not tested against some theoretical model or expected distribution (such as uniform, proportional, or normal). Instead, the frequency distributions of two variables are compared directly with one another to see if they are different (independent) or related in some way (not independent). The null hypothesis would state that the two categorical variables have no relationship, while the alternate hypothesis would state that they do have a non-random relationship.

For example, suppose a city planning agency is developing a five-year capital improvement program (CIP). This program involves the scheduling of public physical improvements based on estimates of future fiscal resources and project costs. As part of the CIP development process, decision-makers want citizen input in determining the method for financing several proposed cultural facilities. The proposals include:

• improving the grounds of a local historical site
• constructing a new wing on the art museum
• expanding the city aquarium
• building a new science center with science education facilities
• constructing a new branch library

Several methods of financing are available, including general tax revenue, user fees, administrative fees, and revenue bonds. Suppose a random survey of residents is taken and respondents are asked which of the five projects they prefer. For that preferred project, each respondent is then asked which method of funding they prefer. A contingency table could then be constructed showing the number of people who prefer each project/funding source combination. Contingency analysis could then be applied to the frequency counts in this table to determine if the preferred financing methods differ significantly from project to project.

Geographers can also use contingency analysis effectively when comparing maps. Given two spatial pattern (choropleth) maps of the same study area, the extent of areal relationship between the maps can be examined by selecting n identical sets of random point locations from both maps, then allocating each of the sample point combinations to the appropriate cell of a contingency table. For example, suppose a geographer has two maps of the same region. One map shows the suitability of land for general farming (in five ordinal categories: well suited, moderately suited, somewhat poorly suited, poorly suited, and unsuited), and the other map shows soil association (in five nominal categories: alfisols, entisols, histosols, inceptisols, and spodosols). The spatial relationship between these two maps can be measured through contingency analysis of the frequency counts of points assigned to each cell of the table.

When using contingency analysis, data values from both variables are assigned to nominal or ordinal categories, and the frequency count of each category from one variable is cross-tabulated with the frequency count of each category from the second variable. The frequencies are summarized in a two-dimensional contingency or cross-tabulation table. Each variable must have at least two categories to which observations can be allocated. The actual frequency count in each cell of the contingency table is compared with the frequency count expected if no relationship exists between the two variables. If at least one difference between actual and expected frequency counts is large, then the variables are not likely to be statistically independent, but rather related in some nonrandom (systematic) fashion. The null hypothesis of no relationship is then rejected. Conversely, if the differences between frequency counts in all cells of the table are small, then you would conclude the variables are statistically independent with only a random association, and the null hypothesis is not rejected.

The chi-square statistic evaluates contingency table frequency counts and is a simple extension of the goodness-of-fit test described in section 12.1. The entire contingency table is evaluated statistically as a single entity. That is, the analysis determines whether significantly large differences exist between actual and expected frequency counts of the two variables, but does not indicate which cells in the table contain these differences. When both variables have exactly two categories, this analysis poses no problem. With tables having three or more categories for at least one variable, however, further analysis is sometimes needed to learn exactly where the significantly different frequency counts are located.

Contingency Analysis (χ²)

Primary Objective: Compare random sample frequency counts of two variables for statistical independence

Requirements and Assumptions:
1. Single random sample
2. Variables are organized by nominal or ordinal categories; frequency counts by category are input to statistical test
3. No more than one-fifth of the expected frequency counts should be less than five, and none of the expected frequencies should be less than two

Hypotheses:
H0: There is no relationship between two variables in the population from which sample has been drawn
HA: There is a relationship between two variables in the population from which sample has been drawn (variables are not statistically independent)

Test Statistic:
χ² = Σ(i=1 to r) Σ(j=1 to k) (Oij - Eij)² / Eij

Certain restrictions apply to the use of chi-square in contingency analysis. Data must be absolute frequency counts rather than relative frequencies such as percentages or proportions. In addition, no more than one-fifth of the expected frequencies should be less than five and none less than two. As discussed with goodness-of-fit

procedures, combining categories is sometimes an option if too many expected frequency counts are too low.

The chi-square statistic used in contingency analysis is:

χ² = Σ(i=1 to r) Σ(j=1 to k) (Oij - Eij)² / Eij     (12.4)

where: Oij = observed frequency count in the ith row and jth column
       Eij = expected frequency count in the ith row and jth column
       r = number of rows in the contingency table
       k = number of columns in the contingency table

If the observed and expected frequency counts in each cell of the contingency table are similar, then the differences (Oij - Eij) in all of the cells will be small, the statistic will be low, and the null hypothesis will not be rejected. Conversely, if at least one cell in the contingency table has a large difference between observed and expected frequency counts, the test statistic will be large and the null hypothesis will more likely be rejected.

To calculate the expected frequency of cell ij in the contingency table, the sum of all observed frequency counts in row i (Ri) is multiplied by the sum of all observed frequency counts in column j (Cj), then divided by the grand total of all observed frequencies (N):

Eij = (Ri)(Cj) / N     (12.5)

Two items regarding these expected frequencies should be noted. First, the frequency count totals of a row or column are often called marginal totals. Thus, R1 is the marginal total for row 1, C3 is the marginal total for column 3, and so on. Second, even though all observed frequency counts are integers, the calculated expected frequency values will generally not be integers. Rounding the expected frequency values to integers is not necessary. Doing so would only result in a less precise test.
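Equations 12.4 and 12.5 translate directly into array arithmetic. The sketch below (not from the text) uses Python with NumPy on a small hypothetical 2 × 3 table; the counts are invented purely for illustration.

```python
# Sketch with invented counts: equations 12.4 and 12.5 in NumPy
import numpy as np

# Hypothetical 2 x 3 contingency table of observed frequency counts
observed = np.array([[20, 30, 10],
                     [10, 15, 15]])

row_totals = observed.sum(axis=1)   # Ri, the row marginal totals
col_totals = observed.sum(axis=0)   # Cj, the column marginal totals
n = observed.sum()                  # N, the grand total

# Equation 12.5: Eij = (Ri)(Cj) / N, computed for all cells at once
expected = np.outer(row_totals, col_totals) / n

# Equation 12.4: sum the squared differences over all r x k cells
chi_square = ((observed - expected) ** 2 / expected).sum()
```

Note that the expected values (here 18, 27, 15 in the first row and 12, 18, 10 in the second) are left unrounded, in keeping with the advice above.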

Example: Contingency Table Analysis-Comparing Obesity Status, Metropolitan and Non-metropolitan Counties

A number of interrelated risk factors have been associated with the spatial patterns of obesity in the conterminous United States. Obesity seems to be related to several "meso-environmental" factors such as land use, population density, and level of urbanization. Since obesity data are available at the county level, and each county can be classified as either metropolitan or non-metropolitan, we propose a research hypothesis testing the relationship between obesity status and type of county (metropolitan or non-metropolitan).

Suppose we take a random sample of 200 U.S. counties, classify each county as either metropolitan or non-metropolitan, and classify its obesity status as either "low" (less than the national median of 29%) or "high" (greater than or equal to the national median). We can now formulate the following inferential null hypothesis: "The type of county (metropolitan or non-metropolitan) and the obesity status of the county (low or high obesity level) are independent." That is, the frequency counts of counties across these two variables are not related. On the other hand, if the alternate or research hypothesis is true, type of county and obesity status of the county will be related (not independent).

Agencies such as the Centers for Disease Control and Prevention (CDC) and the National Center for Health Statistics (NCHS) have made statements that obesity levels are generally higher in rural areas. We should therefore expect that a statistically significant chi-square value will be calculated from the random sample of 200 counties.

The data are entered into a chi-square analysis, with the following results (table 12.6). The number of non-metro counties (130) exceeds the number of metro counties (69) by a considerable margin, but this is proportional to the nationwide county counts. The number (frequency count) of "low" obesity counties is almost identical to the number of "high" obesity counties, which is to be expected since we are using the median as the dividing value. There was a technical problem with one of the counties, making the final sample size 199, not 200.

TABLE 12.6
Contingency Table: Difference Between Metro and Nonmetro County-level Obesity Rates*

                   County obesity rates
County             "Low" obesity level   "High" obesity level
classification**   (< 29%)               (≥ 29%)                TOTAL
Nonmetro           60 (62.7)             70 (67.3)              130 (130)
Metro              36 (33.3)             33 (35.7)              69 (69)
TOTAL              96 (96)               103 (103)              199 (199)

* Each cell of the table contains the observed frequency count, followed by the expected frequency count, in parentheses.
** "Low" obesity level defined as less than the median of 29. "High" obesity level defined as greater than or equal to the median of 29.

All expected frequency counts Eij = (Ri)(Cj)/N; for example: E11 = (130)(96)/199 = 62.7

χ² = Σ(i=1 to r) Σ(j=1 to k) (Oij - Eij)² / Eij = 0.654     p = 0.419

Source: Centers for Disease Control and Prevention (CDC)

Notice that the magnitudes of the observed and expected frequency counts are very similar in all cells of the table. These small differences result in a chi-square statistic of 0.654, with an associated p-value of 0.419. We should therefore conclude that the type of county (metropolitan or non-metropolitan) and the obesity status of the county (low or high obesity level) are independent. That is, the frequency counts of counties across these two variables appear to be unrelated. This finding seems to contradict the conventional wisdom expressed by the CDC and the NCHS, and it therefore seems logical to explore the sample data further. Perhaps we should look at particular counties in the sample, to see if there are any influential anomalies or outliers. In fact, we will return to this set of data when constructing a multiple regression model in chapter 18.
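The table 12.6 result can be checked in software. The sketch below (not from the text) uses scipy; note that chi2_contingency applies Yates' continuity correction to 2 × 2 tables by default, so correction=False is needed to reproduce the uncorrected chi-square of equation 12.4.

```python
# Sketch (not from the text): chi-square contingency analysis of Table 12.6
import numpy as np
from scipy import stats

# Rows: nonmetro, metro counties; columns: "low", "high" obesity level
observed = np.array([[60, 70],
                     [36, 33]])

# correction=False matches the uncorrected statistic of equation 12.4
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
# chi2 near 0.65 and p near 0.42, so the null hypothesis is not rejected
```

The returned expected array reproduces the parenthesized values in table 12.6 (for example, 62.7 for nonmetro low-obesity counties).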

Example: Contingency Table Analysis-Comparing Congressional Approval Ratings by Census Division

By now, it is common knowledge that most Americans do not have a very favorable view of Congress. In fact, the approval rating of Congress has been quite low for several years: in 2010, it hit an all-time low of 10%. Not surprisingly, Congressional ratings have not improved much given the continuing budget problems facing the country. Paradoxically, while the general approval rating of Congress remains very low, most Americans have a surprisingly strong positive impression about their own representative. The overwhelming majority (90%) of all congressional lawmakers running for reelection were returned to Washington in 2012.

Much less attention is directed toward how congressional approval ratings vary from one part of the country to another. We propose the following inferential hypothesis: "Congressional approval ratings are no different from one census division to another." The latest available survey of approval ratings by census division was done in 2010, as part of the General Social Survey conducted by the National Opinion Research Center at the University of Chicago. The results are shown in table 12.7, with various levels of confidence in Congress cross-tabulated against census division.

Less than 10% of all respondents (125 out of 1,347) had a great deal of confidence in Congress. Most cells of the contingency table show only slight differences between the actual frequency counts and those frequency counts expected if there is no difference in views by census division (the null hypothesis).

The calculated chi-square value of 22.56 has an associated p-value of 0.126 (table 12.8). This result suggests that some minor differences between observed and expected frequency counts exist, but they are relatively insignificant. If we reject the null hypothesis and conclude that there are significant differences from one census division to another in congressional approval levels, we would be wrong 12.6% of the time. If we follow conventional practice, it is probably best to not reject the null hypothesis and conclude that approval ratings do not vary across census divisions; it seems that people all across the country think that Congress is doing a poor job!

TABLE 12.7
Contingency Table: Census Division of Respondent Cross-tabulated with Level of Confidence in Congress, 2010

Level of confidence in Congress•


Census Division A great deal Only some Hardly any TOTAL
New England 6 (4.64) 21 (23.57) 23 (21.79) 50 (50)
Middle Atlantic 19(16.15) 83 (82.03) 72 (75.83) 174 (174)
E. North Central 19 (21.25) 95 (107.95) 115 (99.79) 229 (229)
W. North Central 5 (6.87) 35 (34.88) 34 (32.25) 74 (74)
South Atlantic 23 (26.63) 148 (135.30) 116 (125.07) 287 (287)
E. South Central 7 (8.35) 39 (42.43) 44 (39.22) 90 (90)
W. South Central 22 (13.27) 70 (67.41) 51 (62.32) 143 (143)
Mountain 3 (8.82) 45 (44.78) 47 (41.40) 95 (95)
Pacific 21 (19.02) 99 (96.64) 85 (89.34) 205 (205)
TOTAL 125 (125) 635 (635) 587 (587) 1,347 (1,347)

• Each cell of the table contains the observed frequency count, followed by the expected frequency count, in parentheses.

All expected frequency counts Eij = (Ri)(Cj)/N (for example: E11 = (50)(125)/1,347 = 4.64)

Source: General Social Survey (GSS), conducted by the National Opinion Research Center (NORC)

TABLE 12.8

Worktable for Chi-Square Contingency Analysis: Level of Confidence in Congress

H0: there is no relationship between two variables (variables are statistically independent with only a random association).
HA: there is a relationship between two variables (variables are not statistically independent, but related to one another in some nonrandom fashion).

Test statistic:

χ² = Σ(i=1 to r) Σ(j=1 to k) (Oij - Eij)² / Eij

where Oij = the observed frequency count in the ith row and jth column
      Eij = the expected frequency count in the ith row and jth column
      r = the number of rows in the contingency table
      k = the number of columns in the contingency table

The row variable is census division and the column variable is level of confidence in Congress.

Eij = (Ri)(Cj) / N

E11 = (R1)(C1) / N = (50)(125) / 1,347 = 4.64

χ² = Σ(i=1 to r) Σ(j=1 to k) (Oij - Eij)²/Eij = (6 - 4.64)²/4.64 + (21 - 23.57)²/23.57 + ... + (99 - 96.64)²/96.64 + (85 - 89.34)²/89.34 = 22.56

χ² = 22.56     p-value = 0.126

(1) cell with expected count less than 5
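The worktable arithmetic can likewise be verified in software. The sketch below (not from the text) feeds the table 12.7 counts to scipy; no continuity correction applies to tables larger than 2 × 2, so the result matches equation 12.4 directly.

```python
# Sketch (not from the text): chi-square analysis of the Table 12.7 counts
import numpy as np
from scipy import stats

# Rows: the nine census divisions, New England through Pacific
# Columns: a great deal, only some, hardly any confidence in Congress
observed = np.array([[ 6,  21,  23],
                     [19,  83,  72],
                     [19,  95, 115],
                     [ 5,  35,  34],
                     [23, 148, 116],
                     [ 7,  39,  44],
                     [22,  70,  51],
                     [ 3,  45,  47],
                     [21,  99,  85]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
# dof = (9 - 1)(3 - 1) = 16, with chi2 near 22.56 and p near 0.126
```

Note that scipy does not flag the single cell whose expected count falls below five (New England, "a great deal", E11 = 4.64), so the one-fifth rule must still be checked by inspecting the returned expected array.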

Example: Contingency Table Analysis-Patterns of State-level Housing Vacancy Rates, Former Owner-occupied Units and Former Rental Units

As data for the 2010 Census of Population was being collected, the United States was struggling through a major housing crisis. Part of chapter 8 deals with the problem of housing units being "underwater" (having lost so much value that the amount owed on the mortgage exceeds the current value of the home).

Another unfortunate consequence of this crisis has been the abandonment of many housing units. As geographers, one of many interesting issues we might investigate is the spatial distribution and variation of vacant housing units across the United States. As part of the 2010 Census effort, state-level data were collected concerning housing unit vacancy rates, both for units last occupied by the homeowner and for units last occupied by a renter. At the time these data were collected, the national average homeowner-occupied vacancy rate was 2.4% and the national average rental unit vacancy rate was 9.2%.

To show the spatial patterns of both homeowner-occupied and rental unit vacancy rates, we construct a crosstab map (fig. 12.5). Each state is allocated to its particular combination of vacancy rates, whether above or below the national average.

Certain states were particularly hard-hit by the housing crisis, including Nevada (especially Las Vegas), Arizona (Phoenix), Florida, Michigan (Detroit), and Ohio (Cleveland). As you can see, these are all among the states that have both homeowner and rental vacancy rates above the national average. The Southeast appears to be the region that has suffered the most. For your information, Nevada had the highest homeowner-occupied vacancy rate (5.2%), while South Carolina had the highest rental unit vacancy rate (14.3%).

Conversely, many states in the upper Midwest and New England had vacancy rates below the national average for both former homeowner-occupied units and former rental housing units. Massachusetts had both the lowest former homeowner-occupied vacancy rate (1.5%) and the lowest former rental unit vacancy rate (6.5%).

A contingency table analysis confirms a state-level relationship between the owner-occupied and the renter-occupied vacancy rates of housing units (table 12.9). While only descriptive in nature, a chi-square test statistic of 7.710 (with associated p-value of 0.005) strongly suggests that states having above national average owner-occupied housing unit vacancy rates are also likely to have above national average rental unit vacancy rates.

Once again, the basic intriguing question for geographers is why these spatial patterns vary in the way they do. Many different factors undoubtedly influence these patterns in complex ways. It is known, for example, that mortgage lending and banking practices vary considerably, even at the local or metropolitan level.

TABLE 12.9
Contingency Table: State-level Housing Unit Vacancy Rates, Formerly Owner-occupied Units and Rental Units, 2010

                                        Vacancy rate of formerly owner-occupied housing units
Vacancy rate of formerly                Above national          Below national
renter-occupied housing units           average vacancy rate    average vacancy rate    TOTAL
Above national average vacancy rate     13 (8.24)                8 (12.76)              21
Below national average vacancy rate      7 (11.76)              23 (18.24)              30
TOTAL                                   20                      31                      51

χ² = 7.710     p-value = 0.005

FIGURE 12.5
Crosstab Map: Pattern of State-level Housing Vacancy Rates, Owner-occupied and Rental Units
(Washington, DC also included in the analysis)
Map legend, with state counts in parentheses: both homeowner and rental vacancy rates above the national average (13); homeowner vacancy rates high, rental vacancy rates low (7); homeowner vacancy rates low, rental vacancy rates high (8); both homeowner and rental vacancy rates less than the national average (23).
Source: United States Bureau of the Census, Population and Housing Occupancy Status
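The table 12.9 statistic can be reproduced the same way. The sketch below (not from the text) again passes correction=False so that the uncorrected chi-square of 7.710 is matched; with scipy's default Yates correction the 2 × 2 statistic would come out somewhat smaller.

```python
# Sketch (not from the text): chi-square statistic for Table 12.9
import numpy as np
from scipy import stats

# Rows: rental vacancy above/below the national average
# Columns: owner-occupied vacancy above/below the national average
observed = np.array([[13,  8],
                     [ 7, 23]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
# chi2 near 7.71 with p near 0.005
```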
Chapter 12 • Categorical Difference Tests 201

KEY TERMS

chi-square goodness-of-fit (proportional distribution), 191
chi-square goodness-of-fit (uniform distribution), 189
contingency (crosstab) map, 199
contingency table (cross-tabulation table) analysis, 187
cumulative relative frequency, 193
Kolmogorov-Smirnov goodness-of-fit (normal distribution), 194
marginal totals, 197
probability plot, 195

MAJOR GOALS AND OBJECTIVES

If you have mastered the material in this chapter, you should now be able to:

1. Recognize the variety of geographic research problems for which goodness-of-fit tests are appropriate.
2. Explain the difference between an actual or observed frequency distribution and an expected frequency distribution in a goodness-of-fit context.
3. Understand the nature of various frequency distributions (uniform or equal, proportional or unequal, and normal) and identify the corresponding goodness-of-fit test.
4. Define the term "cross-tabulation" and identify geographic situations that use cross-tabulation.
5. Identify the types of geographic problems for which contingency analysis is appropriate and be able to create and organize an original "geographic situation" where contingency analysis could be used.

REFERENCES AND ADDITIONAL READING

If you are interested in further exploring the geographic examples mentioned in this chapter, the following are good places to start. Additional information about educational testing and the statistical validity, accuracy, and precision of test scores can be found at the National Center for Educational Statistics (NCES) www.nces.ed.gov. Without priority access, however, it will be difficult to obtain location-based test score data (confidentiality restrictions). Migration and immigration pattern data are available from Statistics Canada www.stats.ca. The Energy Information Administration has the nationwide averages from which the Parkwood Estates example problem data were designed to closely replicate: www.eia.gov/consumption/residential. Attitudes regarding Congressional approval have been collected for decades through Gallup polling www.gallup.com. For a full statistical analysis of congressional approval rating contingency tables by Census Division and Region, see the General Social Survey archives at the Survey Documentation and Analysis (SDA) website: http://sda.berkeley.edu. Finally, housing vacancy rate data (for both owner-occupied and rental units) can be found at American FactFinder, especially various tables related to "Population and Housing Occupancy Status": http://factfinder2.census.gov/
