Chapter 7: Disease Cluster and Cluster Analysis
Session Objectives
At the end of this session, students will be able to:
• Describe geocoding and data linkage using primary and secondary data
• Define cluster analysis
• Differentiate local and global spatial autocorrelation
• Identify the types of interpolation
• Define network analysis
What are spatial statistics?
• They are similar to traditional statistics, but they integrate spatial relationships into the calculations.
• Spatial statistics allow you to answer the following questions about your data:
  o How are the features distributed?
  o What is the pattern created by the features?
  o Where are the clusters?
  o How do the patterns and clusters of different variables compare with one another?
  o What are the relationships between sets of features or values?
What is spatial autocorrelation?
• Spatial autocorrelation in GIS helps to understand the degree to which one object is similar to other nearby objects.
• It is a statistical concept that measures the degree of similarity or dissimilarity between neighboring locations in a spatial dataset.
• It measures how similar nearby objects are in comparison with objects that are farther apart.
• It is a measure of how likely two neighboring areas are to have similar values for a specific field of data.
What is spatial autocorrelation?
• Moran’s I (index) measures spatial autocorrelation.
• The Global Moran’s I statistic measures spatial autocorrelation based on both feature locations and attribute values.
• Moran’s I is robust in detecting the presence of a spatial pattern in a variable.
• Spatial autocorrelation measured by Moran’s I can be positive, negative, or absent.
• Positive autocorrelation occurs when observations with similar values lie close to one another (clustered).
• Negative autocorrelation occurs when observations with dissimilar values lie near one another.
Autocorrelation cont’d…
Positive Spatial Autocorrelation Example
• Positive spatial autocorrelation is when similar values cluster together on a map; it occurs when Moran’s I is close to +1.
• Similar attribute values tend to cluster together in neighboring locations.
• It indicates a spatial pattern where areas with similar characteristics are spatially grouped.
Autocorrelation cont’d…
Negative Spatial Autocorrelation Example
• Negative spatial autocorrelation is when dissimilar values lie next to each other on a map; it occurs when Moran’s I is near -1.
• In the example, Moran’s I is -1 because dissimilar values are always next to each other.
• Negative autocorrelation indicates a spatial pattern where areas with contrasting characteristics are adjacent to one another.
• A value of 0 for Moran’s I typically indicates no autocorrelation.
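The positive and negative examples can be checked numerically. The following is a minimal sketch, assuming a 4 x 4 grid and binary rook-contiguity weights (cells sharing an edge are neighbors); the helper names `rook_weights` and `morans_i` are hypothetical and are not taken from any GIS package.

```python
import numpy as np

def rook_weights(nrows, ncols):
    """Binary weights for a regular grid: 1 if two cells share an edge."""
    n = nrows * ncols
    w = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    w[i, rr * ncols + cc] = 1.0
    return w

def morans_i(values, w):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), z = deviations from the mean."""
    x = np.asarray(values, dtype=float).ravel()
    z = x - x.mean()
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

w = rook_weights(4, 4)

# Similar values grouped together: left half high, right half low.
clustered = np.array([[1, 1, 0, 0]] * 4)
# Dissimilar values always adjacent: a checkerboard.
checkerboard = np.indices((4, 4)).sum(axis=0) % 2

i_clustered = morans_i(clustered, w)        # positive (2/3 for this layout)
i_checkerboard = morans_i(checkerboard, w)  # exactly -1 for this layout
```

With these weights the grouped layout gives a clearly positive index and the checkerboard gives -1, matching the two slides above.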
Testing for the existence of clusters (Autocorrelation)
A. Global Tools (Statistics)
• Test for the existence of overall clustering (of either high or low values)
• Do not indicate where specific patterns occur
• Identify and measure the pattern of the entire study area
• Produce a single-value statistic that summarizes the pattern
• Homogeneity: the pattern is assumed to be the same across the whole study area
Testing for the existence of clusters cont’d…
B. Local Tools (Statistics)
• Test for the existence of local clusters
• Identify variation across the study area, focusing on individual features and their relationships to nearby features
• Are location-specific statistics (i.e. they identify specific areas of clustering)
• Heterogeneity: the pattern is allowed to vary from place to place
A. Global Statistics
i) Getis-Ord General G (High/Low Clustering)
• General G measures the concentration of high or low values for a given study area.
• It computes a single statistic for the entire study area.
• It can indicate clustering of high values or of low values, but not both at once.
• The z-score of G indicates whether the clustering is statistically significant.
Global Statistics cont’d…
• The drawback is that, if both high and low clusters are present, they tend to cancel each other out, so it is advisable to run Moran’s I first.
• G statistics are useful when negative spatial autocorrelation (outliers) is negligible.
• High (positive) G z-score: statistically significant clustering of high values
• Low (negative) G z-score: clustering of low values
Formula for Getis-Ord General G
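A commonly cited form of the General G statistic is sketched below. The notation is assumed here rather than taken from the slide: $x_i$ is the attribute value of feature $i$, $w_{ij}$ is the spatial weight between features $i$ and $j$, and $n$ is the number of features.

```latex
G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, x_i x_j}
         {\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j}, \qquad j \neq i

z_G = \frac{G - \mathrm{E}[G]}{\sqrt{\mathrm{V}[G]}}, \qquad
\mathrm{E}[G] = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}}{n(n-1)}, \qquad j \neq i
```

The numerator only counts products of values at neighboring features, so G rises above its expectation when high values sit next to each other and falls below it when low values do.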
Interpretation of Getis-Ord General G result
• It is an inferential statistic, which means that the results of the analysis are interpreted within the context of the null hypothesis.
• The null hypothesis states that there is no spatial clustering of feature values.
• When the p-value returned by this tool is small and statistically significant, the null hypothesis can be rejected.
• If the null hypothesis is rejected, the sign of the z-score becomes important:
  o If the z-score is positive, the observed General G index is larger than the expected General G index, indicating that high values of the attribute are clustered in the study area.
  o If the z-score is negative, the observed General G index is smaller than the expected index, indicating that low values are clustered in the study area.
Global statistics cont’d…
ii) Spatial Autocorrelation (Global Moran’s I)
• Measures whether the pattern of feature values is clustered, dispersed, or random
• Is a global statistic: it returns a single value for the whole study area
• Calculates an I value, with a z-score and p-value, to test for statistically significant clustering
• Does not distinguish clusters of high values from clusters of low values (they are treated together)
Spatial Autocorrelation Calculation
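The standard form of the Global Moran's I calculation is sketched below. The symbols are assumed notation, not taken from the slide: $x_i$ is the attribute value of feature $i$, $\bar{x}$ is the mean, $w_{ij}$ is the spatial weight between features $i$ and $j$, and $n$ is the number of features.

```latex
\begin{aligned}
I &= \frac{n}{S_0}\,
     \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, z_i z_j}{\sum_{i=1}^{n} z_i^{2}},
  \qquad z_i = x_i - \bar{x},
  \qquad S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} \\
\mathrm{E}[I] &= -\frac{1}{n-1},
  \qquad z_I = \frac{I - \mathrm{E}[I]}{\sqrt{\mathrm{V}[I]}}
\end{aligned}
```

The cross-product $z_i z_j$ is positive when neighboring features are both above or both below the mean, and negative when they fall on opposite sides of it, which is why clustering pushes I toward +1 and dispersion pushes it toward -1.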
Global statistics cont’d…
Interpretation of Global Moran’s I
• The Global Moran’s I tool is an inferential statistic, which means that the results of the analysis are always interpreted within the context of its null hypothesis.
• The null hypothesis states that the attribute being analyzed is randomly distributed among the features in the study area.
• Put another way, the spatial process producing the observed pattern of values is random chance.
• When the p-value returned by this tool is statistically significant, you can reject the null hypothesis.
Interpretation Global Moran’s I cont’d….
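One way to make the null hypothesis of random chance concrete is a permutation test: shuffle the values across locations many times and see how often the shuffled Moran's I is as extreme as the observed one. The sketch below is an illustration under stated assumptions, not the procedure used by any particular GIS tool (which may rely on an analytical z-score instead). It assumes 20 locations along a line, binary adjacency weights, and a two-sided pseudo p-value; `permutation_test` is a hypothetical helper name.

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I for values x and spatial weights matrix w."""
    z = x - x.mean()
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

def permutation_test(x, w, n_perm=999, seed=0):
    """Pseudo p-value: share of random shuffles at least as extreme as observed."""
    rng = np.random.default_rng(seed)
    observed = morans_i(x, w)
    sims = np.array([morans_i(rng.permutation(x), w) for _ in range(n_perm)])
    center = sims.mean()
    extreme = np.sum(np.abs(sims - center) >= abs(observed - center))
    # Adding 1 counts the observed arrangement as one of the permutations.
    return observed, (extreme + 1) / (n_perm + 1)

# Assumed toy data: 20 locations in a row, each adjacent to the next one.
n = 20
w = np.zeros((n, n))
idx = np.arange(n - 1)
w[idx, idx + 1] = w[idx + 1, idx] = 1.0

# A smooth west-to-east trend: neighboring values are very similar.
trend = np.arange(n, dtype=float)
observed_i, p_value = permutation_test(trend, w)
reject_null = p_value < 0.05
```

For this strongly trended series the observed index is far outside the range produced by random shuffles, so the null hypothesis of spatial randomness is rejected and the positive sign points to clustering.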
Global Moran's I vs. Getis-Ord General G
• Both techniques assess global clustering: they simply tell you whether clustering exists, not where it occurs.
• Both statistics assume that your data are continuous and normally distributed in the study area.
• Moran's I only indicates that similar values occur together; it does not indicate whether a cluster is composed of high or low values.
• The General G statistic indicates whether high or low values are concentrated over the study area.
• Hence, to find out whether the data are clustered in general (autocorrelated), use Moran's I.
• To know more specifically whether there are clusters of high or low values, use the G statistic.
Moran's I vs. Getis-Ord General G
B. Local statistics
i) Anselin Local Moran’s I (Cluster and Outlier Analysis)
• Measures the strength of patterns for each specific feature
• Given a set of weighted features, Cluster and Outlier Analysis identifies statistically significant hot spots, cold spots, and spatial outliers
• The math is essentially the same as for the global variant, but the results are reported per feature and so differ somewhat
• Anselin Local Moran’s I can identify HH, LL, HL, and LH clusters, where H = High and L = Low; for example, HL is a high value surrounded by low values (an outlier)
Local statistics cont’d…
Interpretation of Anselin Local Moran’s I
• A positive value for I:
  o Indicates that a feature has neighboring features with similarly high or low attribute values
  o The feature is part of a cluster
  o Statistically significant clusters can consist of high values (HH) or low values (LL)
• A negative value for I:
  o Indicates that a feature has neighboring features with dissimilar values
  o The feature is an outlier
  o A statistically significant outlier can be a feature with a high value surrounded by features with low values (HL), or a feature with a low value surrounded by features with high values (LH)
• In either instance, the p-value for the feature must be small enough for the cluster or outlier to be considered statistically significant
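The HH, LL, HL, and LH labels can be illustrated with a small calculation. This is a minimal sketch, assuming 12 locations along a line, row-standardized adjacency weights, and the common form I_i = z_i * (average neighbor deviation) / m2; it assigns quadrant labels only and does not perform the significance test that a full analysis requires. The function name `local_morans_i` is hypothetical.

```python
import numpy as np

def local_morans_i(x, w):
    """Local Moran's I per feature plus an HH/LL/HL/LH quadrant label."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()                               # deviation of each feature
    w_row = w / w.sum(axis=1, keepdims=True)       # row-standardize the weights
    lag = w_row @ z                                # mean deviation of the neighbors
    m2 = (z @ z) / len(x)                          # variance-like scaling term
    local_i = z * lag / m2
    labels = [
        ("H" if zi >= 0 else "L") + ("H" if li >= 0 else "L")
        for zi, li in zip(z, lag)
    ]
    return local_i, labels

# Assumed toy data: high (9) and low (1) values along a line of 12 locations.
x = [9, 9, 9, 1, 9, 9, 1, 1, 1, 9, 1, 1]
n = len(x)
w = np.zeros((n, n))
idx = np.arange(n - 1)
w[idx, idx + 1] = w[idx + 1, idx] = 1.0

local_i, labels = local_morans_i(x, w)
# Location 1: high among highs (HH, positive I).
# Location 3: low between highs (LH outlier, negative I).
# Location 7: low among lows (LL, positive I).
# Location 9: high between lows (HL outlier, negative I).
```

Positive local values mark features that resemble their neighbors (clusters), while negative values mark features that differ from them (outliers), as described above.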
ii) Hot Spot Analysis (Getis-Ord Gi*)
• A local version of the G statistic that identifies hot spots (clusters of high values) and cold spots (clusters of low values)
• To be a statistically significant hot or cold spot, a feature must have a high or low value and be surrounded by other features with similarly high or low values
• Getis-Ord Gi* can identify hot (high) or cold (low) clusters at different confidence levels
• It is useful when negative spatial autocorrelation (outliers) is negligible

Getis-Ord Gi* (Hot Spot Analysis) vs. Anselin Local Moran’s I
• The math of the two is essentially the same as for their global variants, but the results are somewhat different
• Getis-Ord Gi* can identify hot (high) or cold (low) clusters at different confidence levels
• Anselin Local Moran's I can identify HH, LL, LH, and HL clusters, where H = High, L = Low, and HL is a high value surrounded by low values
Why is spatial autocorrelation important?
• One of the main reasons is that conventional statistics rely on observations being independent of one another.
• If autocorrelation exists in a map, this violates the assumption that observations are independent of one another.
• Another application is analyzing clustering and dispersion in ecology and disease.
• For example: is the disease an isolated case, or is it spreading?
• These trends can be better understood using spatial autocorrelation analysis.
Best practice guidelines for using cluster and outlier analysis
(Anselin Local Moran’s I)
• Results are only reliable if the input feature class contains at least 30 features.
• This tool requires an input field such as a count, rate, or other numeric measurement.
• If you are analyzing point data, where each point represents a single event or incident, you might not have a specific numeric attribute to evaluate (a severity ranking, count, or other measurement).
• If you are interested in finding locations with many incidents (hot spots) and/or locations with very few incidents (cold spots), you will need to aggregate the incident data (for example, into counts per area) before analysis.
Best practice guidelines for using cluster cont’d…
• Select an appropriate conceptualization of spatial relationships.
• Select an appropriate distance band or threshold distance:
  o All features should have at least one neighbor.
  o No feature should have all other features as neighbors.
  o Especially if the values for the input field are skewed, each feature should have about eight neighbors.
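The distance-band guidelines above can be checked before running an analysis. The sketch below is a minimal illustration, assuming planar coordinates in the same units as the distance band and a 5 x 5 grid of unit-spaced points as toy data; `neighbor_counts` and `check_distance_band` are hypothetical helper names.

```python
import numpy as np

def neighbor_counts(coords, distance_band):
    """Number of other features within the distance band of each feature."""
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    within = (d <= distance_band) & ~np.eye(len(coords), dtype=bool)
    return within.sum(axis=1)

def check_distance_band(coords, distance_band):
    """Summarize how well a candidate distance band meets the guidelines."""
    counts = neighbor_counts(coords, distance_band)
    n = len(counts)
    return {
        "min_neighbors": int(counts.min()),
        "mean_neighbors": float(counts.mean()),
        "all_have_neighbor": bool((counts >= 1).all()),
        "none_has_all": bool((counts < n - 1).all()),
    }

# Assumed toy data: a 5 x 5 grid of points spaced one unit apart.
grid = np.array([(r, c) for r in range(5) for c in range(5)], dtype=float)

too_small = check_distance_band(grid, 0.5)    # nobody has a neighbor
reasonable = check_distance_band(grid, 1.5)   # interior points have 8 neighbors
too_large = check_distance_band(grid, 10.0)   # everybody neighbors everybody
```

A band of 1.5 satisfies both rules on this grid, while 0.5 leaves features without neighbors and 10 makes every feature a neighbor of every other.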
Best practice guidelines for using cluster cont’d…
• Given a set of weighted features, the Getis-Ord Gi* (pronounced "Gee Eye Star") statistic identifies statistically significant hot spots and cold spots.
• The tool works by looking at each feature within the context of its neighboring features.
• To be a statistically significant hot spot, a feature must have a high value and be surrounded by other features with high values as well.
• The local sum for a feature and its neighbors is compared proportionally to the sum of all features.
• When the local sum is very different from the expected local sum, and that difference is too large to be the result of random chance, a statistically significant z-score results.
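The comparison of the local sum with its expected value can be written out directly. The following is a sketch of the commonly published Gi* z-score, assuming 10 locations along a line, binary adjacency weights, and that each feature counts as its own neighbor; `getis_ord_gi_star` is a hypothetical function name, not a library call.

```python
import numpy as np

def getis_ord_gi_star(x, w):
    """Gi* z-score per feature; each feature is included in its own neighborhood."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    w_star = np.array(w, dtype=float)
    np.fill_diagonal(w_star, 1.0)                 # the "star": include the feature itself
    mean = x.mean()
    s = np.sqrt((x ** 2).mean() - mean ** 2)      # population standard deviation
    w_sum = w_star.sum(axis=1)
    w_sq_sum = (w_star ** 2).sum(axis=1)
    local_sum = w_star @ x                        # observed local sum
    expected = mean * w_sum                       # expected local sum
    scale = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return (local_sum - expected) / scale

# Assumed toy data: a run of three high values in an otherwise low series.
x = [1, 1, 1, 1, 10, 10, 10, 1, 1, 1]
n = len(x)
w = np.zeros((n, n))
idx = np.arange(n - 1)
w[idx, idx + 1] = w[idx + 1, idx] = 1.0

gi = getis_ord_gi_star(x, w)
hot_spots_95 = np.flatnonzero(gi > 1.96)   # features significant at roughly 95%
```

Only the middle of the high run, whose whole neighborhood is high, exceeds the 1.96 threshold here; the two edges of the run have one low neighbor each and fall short.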
Clustering vs Clusters
• The Mapping Clusters tools perform cluster analysis to identify the locations of statistically significant hot spots, cold spots, spatial outliers, and similar features.
• Clustering is detected at the global level, whereas clusters are detected at the local level.
• Moran’s I is a global statistic, i.e. a single value for the whole spatial pattern.
• Global Moran’s I does not provide the location of clusters.
• Cluster detection requires a local statistic.
Interpolation
What is Interpolation?
• Interpolation is the procedure of estimating unknown values at unsampled sites using known values of existing observations.
• It can be used to predict unknown values for any geographic point data, such as rates of home delivery, child mortality, and antenatal care (ANC) visits.
• Interpolation predicts values for cells in a raster from a limited number of sample data points.
Interpolation Methods (Types)
INVERSE DISTANCE WEIGHTED (IDW)
• The inverse distance weighted (IDW) interpolator assumes that each input point has a local influence that diminishes with distance.
• It weights the points closer to the processing cell more heavily than those farther away.
• A specified number of points, or all points within a specified radius, can be used to determine the output value for each location.
• Use of this method assumes that the variable being mapped decreases in influence with distance from its sampled location.
Interpolation Methods cont’d…
• IDW interpolation explicitly implements the assumption that things that are close to one another are more alike than those that are farther apart.
• To predict a value for any unmeasured location, IDW uses the measured values surrounding the prediction location.
• The measured values closest to the prediction location have more influence on the predicted value than those farther away.
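The weighting idea behind IDW fits in a few lines. This is a minimal sketch, assuming weights of 1 / distance^power with power = 2 and using all sample points (no search radius or point limit); the function name `idw` and the sample values are illustrative assumptions.

```python
import numpy as np

def idw(known_xy, known_values, query_xy, power=2.0):
    """Inverse distance weighted estimate at each query location."""
    known_xy = np.asarray(known_xy, dtype=float)
    known_values = np.asarray(known_values, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)
    d = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)
    estimates = np.empty(len(query_xy))
    for k, row in enumerate(d):
        exact = row == 0
        if exact.any():
            # The query sits on a sample point: return the measured value.
            estimates[k] = known_values[exact][0]
        else:
            weights = 1.0 / row ** power          # closer points weigh more
            estimates[k] = (weights @ known_values) / weights.sum()
    return estimates

# Assumed toy data: two measured sites with values 10 and 20.
sites = [(0.0, 0.0), (2.0, 0.0)]
values = [10.0, 20.0]
predicted = idw(sites, values, [(1.0, 0.0), (0.5, 0.0), (0.0, 0.0)])
# Midpoint -> 15 (equal weights); a quarter of the way along -> 11
# (pulled toward the nearer site); on the first site -> 10.
```

Raising `power` makes the nearest samples dominate even more strongly, which is the main tuning decision in this method.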
Kriging
• Kriging is a geostatistical interpolation technique that considers both the distance and the degree of variation between known data points when estimating values in unknown areas.
• A kriged estimate is a weighted linear combination of the known sample values around the point to be estimated.
• Kriging is a procedure that generates an estimated surface from a scattered set of points with z-values.
• Kriging assumes that the distance or direction between sample points reflects a spatial correlation that can be used to explain variation in the surface.
• The Kriging tool fits a mathematical function to a specified number of points, or all points within a specified radius, to determine the output value for each location.
• Kriging is a multistep process; it includes exploratory statistical analysis of the data, variogram modeling, creating the surface, and (optionally) exploring a variance surface.
• Kriging is most appropriate when you know there is a spatially correlated distance or directional bias in the data.
• It is often used in soil science and geology.
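To show what "a weighted linear combination of the known sample values" means in practice, the sketch below solves the ordinary kriging system for a single prediction location. It rests on loud assumptions: the exponential semivariogram and its `sill` and `range_` parameters are chosen for illustration rather than fitted to data (a real workflow would fit them during variogram modeling), and the function names are hypothetical.

```python
import numpy as np

def exponential_variogram(h, sill=1.0, range_=3.0):
    """Assumed semivariogram model: 0 at h = 0, rising toward `sill`."""
    return sill * (1.0 - np.exp(-np.asarray(h, dtype=float) / range_))

def ordinary_kriging(xy, z, query, variogram=exponential_variogram):
    """Estimate, kriging variance, and weights at one query location."""
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(z)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    # Kriging system: semivariances between samples, plus the constraint
    # that the weights sum to 1 (last row and column, Lagrange multiplier).
    a = np.ones((n + 1, n + 1))
    a[:n, :n] = variogram(d)
    a[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(xy - np.asarray(query, dtype=float), axis=1))
    solution = np.linalg.solve(a, b)
    weights, lagrange = solution[:n], solution[n]
    estimate = weights @ z
    variance = weights @ b[:n] + lagrange
    return estimate, variance, weights

# Assumed toy data: four samples at the corners of a square.
samples = [(0, 0), (2, 0), (0, 2), (2, 2)]
values = [10.0, 20.0, 14.0, 24.0]

center_estimate, center_variance, center_weights = ordinary_kriging(samples, values, (1, 1))
sample_estimate, sample_variance, _ = ordinary_kriging(samples, values, (0, 0))
```

At the center of the square all four samples receive equal weight, and at a sample location the estimate reproduces the measured value with zero variance; repeating the variance calculation over a grid gives the optional variance surface mentioned above.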
SaTScan Analysis
What is SaTScan?
• SaTScan is freely available software that uses the scan statistic to detect clusters (www.satscan.org).
• It is used to:
  o Test whether a disease is randomly distributed over space, over time, or over space and time
  o Perform geographical surveillance of disease, detecting areas of significantly high or low rates
• The spatial scan statistic can be a useful addition to disease maps, helping to determine whether the observed patterns are likely due to chance.
• It is a complement to, rather than a replacement for, regular disease maps.

The Spatial Scan Statistic
• Create a regular or irregular grid of centroids covering the whole study region.
• Around each centroid, consider circles of different sizes (from zero up to a maximum that includes 50% of the population).
• For each distinct window, calculate the likelihood, which (for the Poisson model) is proportional to:
  $\left(\frac{n}{\mu}\right)^{n}\left(\frac{N-n}{N-\mu}\right)^{N-n} I(n > \mu)$
  Where: n = number of cases inside the circle
  N = total number of cases
  µ = expected number of cases inside the circle
  I(·) = 1 when the circle contains more cases than expected, and 0 otherwise
• For each circle, a likelihood ratio statistic is computed from the numbers of observed and expected cases inside and outside the circle, and compared with the likelihood L0 under the null hypothesis.
• The log likelihood ratio (LLR) is used to compare the goodness of fit of the two models: when the LLR is greater than the Monte Carlo critical value, the null hypothesis is rejected.
The Spatial Scan Statistic cont’d…
For each circle:
• Obtain the actual and expected number of cases inside and outside the circle.
• Calculate the likelihood function.
Compare circles:
• Pick the circle with the highest likelihood as the most likely cluster.
Inference:
• Generate random replicas of the data set under the null hypothesis of no clusters (Monte Carlo sampling).
• Compare the most likely clusters in the real and random data sets (likelihood ratio test).
The Spatial Scan Statistic cont’d…
• The scan statistic is the maximum likelihood ratio over all possible circles; it identifies the most unusual cluster.
• To find the p-value, use Monte Carlo hypothesis testing:
  o Redistribute the cases randomly and recalculate the scan statistic many times.
  o The p-value is the proportion of scan statistics from the Monte Carlo replicates that are greater than or equal to the scan statistic for the true cluster.
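The whole procedure (circles, likelihood ratio, most likely cluster, Monte Carlo p-value) can be sketched on toy data. This is an illustration under stated assumptions, not SaTScan's implementation: it assumes a Poisson model scanning for high rates only, circles centered on the data locations themselves, a 50% population cap, and a rank-based p-value that counts the observed data set as one of the replicates. All function names and the 5 x 5 grid of areas are hypothetical.

```python
import numpy as np

def log_likelihood_ratio(n, mu, total):
    """Poisson LLR for one window; 0 unless cases exceed the expected count."""
    if n <= mu:
        return 0.0
    inside = n * np.log(n / mu)
    outside = (total - n) * np.log((total - n) / (total - mu)) if n < total else 0.0
    return float(inside + outside)

def scan(coords, pop, cases, max_pop_frac=0.5):
    """Return the maximum LLR and the member indices of the most likely cluster."""
    total_cases, total_pop = cases.sum(), pop.sum()
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    best_llr, best_members = 0.0, ()
    for center in range(len(coords)):
        for radius in np.unique(dist[center]):        # growing circles
            inside = dist[center] <= radius
            pop_in = pop[inside].sum()
            if pop_in > max_pop_frac * total_pop:
                break
            n = cases[inside].sum()
            mu = total_cases * pop_in / total_pop     # expected under the null
            llr = log_likelihood_ratio(n, mu, total_cases)
            if llr > best_llr:
                best_llr = llr
                best_members = tuple(np.flatnonzero(inside).tolist())
    return best_llr, best_members

def monte_carlo_p(coords, pop, cases, n_sim=99, seed=0):
    """Rank-based p-value from random redistribution of the cases."""
    rng = np.random.default_rng(seed)
    observed, members = scan(coords, pop, cases)
    probs = pop / pop.sum()
    sims = [scan(coords, pop, rng.multinomial(cases.sum(), probs))[0]
            for _ in range(n_sim)]
    p_value = (1 + sum(s >= observed for s in sims)) / (n_sim + 1)
    return observed, members, p_value

# Assumed toy data: 25 areas on a 5 x 5 grid, 100 people each, 2 cases each,
# except a plus-shaped group of 5 areas around grid cell (1, 1) with 12 cases each.
coords = np.array([(r, c) for r in range(5) for c in range(5)], dtype=float)
pop = np.full(25, 100)
cases = np.full(25, 2)
cases[[1, 5, 6, 7, 11]] = 12

observed_llr, cluster_members, p_value = monte_carlo_p(coords, pop, cases)
```

Because every candidate circle is compared through a single maximum, the Monte Carlo step tests that maximum rather than each circle separately, which is how the method accounts for multiple testing.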
What can and can’t SaTScan do?
CAN
• Identify spatial, temporal, and space-time clusters
• Work with flexible geographic units
CANNOT
• Display maps of event and cluster locations; this requires GIS or mapping software (such as ArcGIS)
• Create other statistical and regression models
Spatial Scan Statistic: Properties
• Adjusts for inhomogeneous population density
• Simultaneously tests for clusters of any size and in any location, by using circular windows with continuously variable radius
• Accounts for multiple testing
• Can include confounding variables, such as age, sex, or socio-economic variables
• Works with aggregated or non-aggregated data (states, counties, census tracts, block groups, households, individuals)
Introduction to Statistical Models in SaTScan
Bernoulli Model
• Individuals (people or animals) either have or do not have the disease (represented by a 0/1 variable), i.e. a set of cases and controls
• Available for purely temporal, purely spatial, or space-time scan statistics
Discrete Poisson Model
• The number of cases in each location is Poisson-distributed.
• Under the null hypothesis, and when there are no covariates, the expected number of cases in each area is proportional to its population size.
• Available for purely temporal, purely spatial, and space-time scan statistics
• This model is a very good approximation to the Bernoulli model when there are few cases relative to controls (less than 10%).
Introduction to Statistical Models in SaTScan cont’d…
Space-Time Permutation Model
• Requires only case data, with information about the spatial location and time of each case (no information is needed on the population at risk)
• If the population increase (or decrease) is the same across the study region, the results will not be biased
• The user is advised to be very careful when using this method for data spanning several years, because the population in some areas may grow faster than in others
Reading assignment
• Geocoding and data linkage using primary and secondary data
• Network analysis
