
Chapter 7

Disease Cluster and Cluster Analysis

Session Objectives
At the end of this session, students will be able to:
• Describe geocoding and data linkage using primary and secondary data
• Define cluster analysis
• Differentiate local and global spatial autocorrelation
• Identify the types of interpolation
• Define network analysis

What are spatial statistics?
• They are similar to traditional statistics, but they integrate spatial relationships into the calculations.
• Spatial statistics allow you to answer the following questions about your data:
  • How are the features distributed?
  • What is the pattern created by the features?
  • Where are the clusters?
  • How do patterns and clusters of different variables compare with one another?
  • What are the relationships between sets of features or values?

What is spatial autocorrelation?

• Spatial autocorrelation in GIS helps you understand the degree to which one object is similar to other nearby objects.
• It is a statistical concept that measures the degree of similarity or dissimilarity between neighboring locations in a spatial dataset.
• It compares how alike nearby objects are with how alike objects are across the whole dataset.
• It measures how likely two neighboring areas are to have similar values for a specific field of data.

What is spatial autocorrelation?
• Moran's I (Index) measures spatial autocorrelation.
• The Global Moran's I statistic measures spatial autocorrelation based on both feature locations and attribute values.
• Moran's I is robust in detecting the presence of a spatial pattern in a variable.
• Spatial autocorrelation measured by Moran's I can be positive, negative, or absent.
• Positive autocorrelation occurs when observations with similar values are close to one another (clustered).
• Negative autocorrelation occurs when observations with dissimilar values occur near one another.

Autocorrelation cont’d….

Positive Spatial Autocorrelation Example

• Positive spatial autocorrelation is when similar values cluster together in a map; it occurs when Moran's I is close to +1.
• Similar attribute values tend to occur in neighboring locations.
• It indicates a spatial pattern in which areas with similar characteristics are spatially grouped.

Autocorrelation cont’d….

Negative Spatial Autocorrelation Example

• Negative spatial autocorrelation is when dissimilar values occur next to each other in a map; it occurs when Moran's I is near -1.
• Moran's I approaches -1 when neighboring values are consistently dissimilar, as in a checkerboard pattern.
• A value of Moran's I near 0 typically indicates no spatial autocorrelation (a random pattern).
• It indicates a spatial pattern in which areas with contrasting characteristics are adjacent to one another.

Testing of the existence of clusters (Autocorrelation)
A. Global Tools /Statistics/

• Global tools test for the existence of overall clustering (of either high or low values).
• They do not indicate where a specific pattern occurs.
• Instead, they identify and measure the pattern of the entire study area.
• Each is a single-value statistic that summarizes the pattern.
• They assume homogeneity: the same spatial pattern holds across the whole study area.

Testing of the existence of clusters cont’d…

B. Local Tools /statistics/


• Local tools test for the existence of local clusters.
• They identify variation across the study area, focusing on individual features and their relationships to nearby features.
• They are location-specific statistics (i.e. they identify specific areas of clustering).
• They allow for heterogeneity: the pattern may differ from one part of the study area to another.

A. Global Statistics
i) Getis-Ord General G (High/Low Clustering)

• General G is a tool used to measure the concentration of high or low values for a given study area.
• The General G statistic computes a single statistic for the entire study area.
• It can indicate whether there is clustering of high values or of low values, but not both.
• The z-score of the G statistic indicates whether the clustering is statistically significant.

Global Statistics cont’d…

• The drawback is that, if there are both high and low clusters, they will counteract each other, so it is advisable to use Moran's I first.
• G statistics are useful when negative spatial autocorrelation (outliers) is negligible.
• High (positive) G z-score: statistically significant clustering of high values.
• Low (negative) G z-score: statistically significant clustering of low values.

Formula for Getis-Ord General G
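
The General G statistic is commonly written as G = Σi Σj w_ij x_i x_j / Σi Σj x_i x_j (for j ≠ i), with expected value E[G] = Σi Σj w_ij / (n(n - 1)) under the null hypothesis. The following is a minimal Python sketch of that calculation, not the implementation used by any GIS package. The row-of-areas layout, the binary weights, and the function names are assumptions made for illustration.

```python
def general_g(values, weights):
    """Observed General G: weighted cross-products over all cross-products (j != i)."""
    n = len(values)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    num = sum(weights[i][j] * values[i] * values[j] for i, j in pairs)
    den = sum(values[i] * values[j] for i, j in pairs)
    return num / den


def expected_g(weights):
    """Expected General G under the null hypothesis of no spatial clustering."""
    n = len(weights)
    s0 = sum(weights[i][j] for i in range(n) for j in range(n) if i != j)
    return s0 / (n * (n - 1))


# Hypothetical study area: six areas in a row, adjacent areas are neighbors.
n_areas = 6
w = [[1 if abs(i - j) == 1 else 0 for j in range(n_areas)] for i in range(n_areas)]

g_high_together = general_g([9, 9, 9, 1, 1, 1], w)   # high values side by side
g_high_apart = general_g([9, 1, 9, 1, 9, 1], w)      # high values separated
g_expected = expected_g(w)
print(g_high_together, g_high_apart, g_expected)
```

When the high values sit next to each other, the observed G exceeds the expected G (a positive z-score); when they are separated, it falls below it. A full analysis would also compute the variance of G to obtain the z-score and p-value.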

Interpretation of Getis-Ord General G result
• It is an inferential statistic, which means that the results of the analysis are interpreted within the context of the null hypothesis.
• The null hypothesis states that there is no spatial clustering of feature values.
• When the p-value returned by this tool is small and statistically significant, the null hypothesis can be rejected.
• If the null hypothesis is rejected, the sign of the z-score becomes important.
• If the z-score is positive, the observed General G index is larger than the expected General G index, indicating that high values for the attribute are clustered in the study area.
• If the z-score is negative, the observed General G index is smaller than the expected index, indicating that low values are clustered in the study area.
Global statistics cont’d….
ii) Spatial Auto-Correlation (Global Moran’s I)

• Measures whether the pattern of feature values is clustered, dispersed, or random.
• It is a global statistic.
• It calculates an I value and tests it for statistically significant clustering.
• It does not separate high values from low values: clusters of either kind contribute to the same index.

Spatial Autocorrelation Calculation
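
Global Moran's I is commonly written as I = (n / S0) × Σi Σj w_ij (x_i - x̄)(x_j - x̄) / Σi (x_i - x̄)², where S0 is the sum of all spatial weights. The sketch below computes it in plain Python. The row-of-areas layout, the binary contiguity weights, and the function names are assumptions made for illustration.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)            # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


def chain_weights(n):
    """Binary weights for n areas in a row: adjacent areas are neighbors (assumed layout)."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


w = chain_weights(6)
i_clustered = morans_i([1, 1, 1, 9, 9, 9], w)      # similar values side by side
i_alternating = morans_i([1, 9, 1, 9, 1, 9], w)    # dissimilar values side by side
print(round(i_clustered, 3), round(i_alternating, 3))
```

The row with similar values side by side gives a positive I (0.6), and the alternating row gives I = -1, matching the positive and negative autocorrelation cases described earlier.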

Global statistics cont’d..
Interpretation of Global Moran's I
• The Global Moran's I tool is an inferential statistic, which means that the results of the analysis are always interpreted within the context of its null hypothesis.
• The null hypothesis states that the attribute being analyzed is randomly distributed among the features in your study area.
• In other words, the spatial process producing the observed pattern of values is random chance.
• When the p-value returned by this tool is statistically significant, you can reject the null hypothesis.
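
One way to see what "random chance" means here is a permutation test: shuffle the values among the features many times and count how often the shuffled pattern is at least as clustered as the observed one. This is a swapped-in illustration, not the analytical z-score approach a GIS tool would typically report. The neighbor lists, the seed, and the one-sided test for clustering are assumptions.

```python
import random


def morans_i(values, neighbors):
    """Global Moran's I with binary weights given as neighbor index lists."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(len(nb) for nb in neighbors)
    cross = sum(dev[i] * dev[j] for i in range(n) for j in neighbors[i])
    return (n / s0) * cross / sum(d * d for d in dev)


def permutation_p_value(values, neighbors, permutations=999, seed=42):
    """One-sided pseudo p-value: share of shuffles with I >= observed I."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)                 # null hypothesis: values placed at random
        if morans_i(shuffled, neighbors) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (permutations + 1)


# Hypothetical data: twelve areas in a row, low values on the left, high on the right.
n_areas = 12
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < n_areas] for i in range(n_areas)]
observed_i, p_value = permutation_p_value([1] * 6 + [9] * 6, neighbors)
print(round(observed_i, 3), p_value)
```

A small p-value means that very few random arrangements are as clustered as the observed one, so the null hypothesis of a random distribution can be rejected.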
Interpretation Global Moran’s I cont’d….

Global Moran's I vs. Getis-Ord General G
• Both techniques assess global clustering: they simply tell you whether clustering exists, not where it occurs.
• Both statistics assume that your data is continuous and normally distributed in the study area.
• Moran's I only indicates that similar values occur together; it does not indicate whether a cluster is composed of high or low values.
• The General G statistic can indicate whether high or low values are concentrated over the study area.
• Hence, when we wish to find out whether our data is clustered in general (autocorrelated), we can use Moran's I.
• However, if we want to know more specifically whether there are clusters of high or low values, we can use the G statistic.
Moran's I vs. Getis-Ord General G

B. Local statistics
i) Anselin Local Moran’s I (Cluster and outlier analysis )

• Measures the strength of patterns for each specific feature.
• Given a set of weighted features, cluster and outlier analysis identifies statistically significant hot spots, cold spots, and spatial outliers.
• The math is essentially the same as for the global variant, but the results differ: a value is reported for each feature rather than one value for the whole study area.
• Anselin Local Moran's I can identify HH, LL, HL, and LH patterns, where H = High and L = Low. HH and LL are clusters; HL is a high value surrounded by low values and LH is a low value surrounded by high values (outliers).
Local statistics cont’d…

Interpretation of Anselin Local Moran’s I

A positive value for I

• Indicates that a feature has neighboring features with similarly high or low attribute values.
• The feature is part of a cluster.
• Statistically significant clusters can consist of high values (HH) or low values (LL).

Local statistics cont’d…

A negative value for I

• Indicates that a feature has neighboring features with dissimilar values.
• The feature is an outlier.
• A statistically significant outlier can be a feature with a high value surrounded by features with low values (HL) or a feature with a low value surrounded by features with high values (LH).
• In either instance, the p-value for the feature must be small enough for the cluster or outlier to be considered statistically significant.
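
The sign rules above can be sketched in a few lines of Python. This sketch uses one common form of the local statistic, I_i = (z_i / m2) × (mean of the neighbors' z values), with z_i = x_i - x̄ and m2 = Σ z² / n, and labels each feature by the sign of its own deviation and of its neighbors' average deviation. It deliberately skips the significance test, so the labels are only candidate cluster types. The data and the row-of-areas layout are hypothetical.

```python
def local_morans_i(values, neighbors):
    """Return (local I, label) per feature; label is HH, LL, HL or LH (no significance test)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n
    results = []
    for i in range(n):
        lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])   # neighbors' mean deviation
        local_i = z[i] * lag / m2
        label = ("H" if z[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((local_i, label))
    return results


# Hypothetical data: eight areas in a row, with one isolated high value at index 5.
values = [9, 9, 8, 2, 1, 9, 1, 2]
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < len(values)]
             for i in range(len(values))]
results = local_morans_i(values, neighbors)
labels = [label for _, label in results]
print(labels)
```

The isolated high value at index 5 receives a negative local I and the label HL, while the run of high values at the start of the row receives positive local I values and the label HH.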
Local statistics cont’d…
ii) Hotspot Analysis (Getis-Ord Gi*)

• Local version of the G statistic that indicates hot spots (clusters of high values) or cold spots (clusters of low values).
• To be a statistically significant hot or cold spot, a feature must have a high or low value and be surrounded by other features with similarly high or low values.
• Getis-Ord Gi* can identify hot (high) or cold (low) clusters at different confidence levels.
• It is useful when negative spatial autocorrelation (outliers) is negligible.


Getis-Ord Gi* (Hot Spot Analysis) vs. Anselin Local Moran's I

• The math of the two local statistics is essentially the same as for their global variants, but the results are somewhat different.
• Getis-Ord Gi* can identify hot (high) or cold (low) clusters at different confidence levels.
• Anselin Local Moran's I can identify HH, LL, LH, and HL patterns, where H = High, L = Low, and HL is a high value surrounded by low values.

Why is spatial autocorrelation important?
• One of the main reasons why spatial autocorrelation is important is that many statistical methods rely on observations being independent of one another.
• If autocorrelation exists in a map, it violates the assumption that observations are independent of one another.
• Another application is analyzing clustering and dispersion in ecology and disease.
• Is the disease an isolated case, or is it spreading and dispersing?
• These trends can be better understood using spatial autocorrelation analysis.

Best practice guidelines for using cluster and outlier analysis
(Anselin Local Moran’s I)
• Results are only reliable if the input feature class contains at least 30 features.
• This tool requires an input field such as a count, rate, or other numeric measurement.
• If you are analyzing point data where each point represents a single event or incident, you might not have a specific numeric attribute to evaluate (a severity ranking, count, or other measurement).
• If you are interested in finding locations with many incidents (hot spots) and/or locations with very few incidents (cold spots), you will need to aggregate the incident data into counts before analysis.
Best practice guidelines for using cluster cont’d……

• Select an appropriate conceptualization of spatial relationships.
• Select an appropriate distance band or threshold distance.
• All features should have at least one neighbor.
• No feature should have all other features as neighbors.
• Especially if the values for the input field are skewed, each feature should have about eight neighbors.
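
A candidate distance band can be checked against these guidelines before running the analysis. The sketch below counts neighbors within a fixed distance band for a hypothetical 5 x 5 grid of points with unit spacing. The function names and the returned keys are illustrative, not part of any GIS tool.

```python
import math


def neighbors_within(points, band):
    """For each point, list the indices of other points within the distance band."""
    return [[j for j, (xj, yj) in enumerate(points)
             if j != i and math.hypot(xi - xj, yi - yj) <= band]
            for i, (xi, yi) in enumerate(points)]


def check_band(points, band):
    """Summarize neighbor counts against the best practice guidelines."""
    counts = [len(nb) for nb in neighbors_within(points, band)]
    n = len(points)
    return {
        "min_neighbors": min(counts),
        "max_neighbors": max(counts),
        "mean_neighbors": sum(counts) / n,
        "every_feature_has_neighbor": min(counts) >= 1,
        "no_feature_has_all_others": max(counts) < n - 1,
    }


# Hypothetical features: a 5 x 5 grid of points spaced 1 unit apart.
grid = [(x, y) for x in range(5) for y in range(5)]
too_small = check_band(grid, 0.5)    # nobody has a neighbor
reasonable = check_band(grid, 1.5)   # interior points have eight neighbors
too_large = check_band(grid, 10.0)   # everybody neighbors everybody
print(reasonable)
```

With a band of 1.5 units, interior points have eight neighbors and corner points have three, so both neighbor rules are satisfied. The bands of 0.5 and 10 units each break one of them.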
Best Practice guidelines for using Cluster cont’d…
• Given a set of weighted features, the Getis-Ord Gi* (pronounced "Gee Eye Star") statistic identifies statistically significant hot spots and cold spots.
• This tool works by looking at each feature within the context of neighboring features.
• To be a statistically significant hot spot, a feature must have a high value and be surrounded by other features with high values as well.
• The local sum for a feature and its neighbors is compared proportionally to the sum of all features.
• When the local sum is very different from the expected local sum, and when that difference is too large to be the result of random chance, a statistically significant z-score results.
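
The comparison of the local sum with its expected value can be sketched as follows, using the standard Gi* z-score with binary weights: Gi* = (Σj w_ij x_j - x̄ Σj w_ij) / (S × sqrt((n Σj w_ij² - (Σj w_ij)²) / (n - 1))), where S is the standard deviation of all values and the feature itself is included in its own neighborhood. The row-of-areas layout and the data are hypothetical.

```python
import math


def gi_star(values, neighbors):
    """Getis-Ord Gi* z-scores with binary weights; each feature is its own neighbor too."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        hood = set(neighbors[i]) | {i}
        k = len(hood)                         # binary weights: sum(w) == sum(w^2) == k
        local_sum = sum(values[j] for j in hood)
        expected = mean * k                   # expected local sum under the null
        spread = s * math.sqrt((n * k - k * k) / (n - 1))
        scores.append((local_sum - expected) / spread)
    return scores


# Hypothetical data: ten areas in a row with three high values at one end.
values = [9, 9, 9, 1, 1, 1, 1, 1, 1, 1]
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < len(values)]
             for i in range(len(values))]
scores = gi_star(values, neighbors)
print([round(z, 2) for z in scores])
```

The feature at index 1 has a z-score of 3.0, beyond the 1.96 threshold for 95% confidence, so it would be flagged as a hot spot. The low-value areas have negative z-scores, but not extreme enough to be flagged as cold spots in this small example.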

Clustering vs Clusters
• The Mapping Clusters tools perform cluster analysis to identify the locations of statistically significant hot spots, cold spots, spatial outliers, and similar features.
• Clustering is detected at the global level, whereas clusters are detected at the local level.
• Global Moran's I is a global statistic, i.e. a single value for the whole spatial pattern.
• Global Moran's I does not provide the location of clusters.
• Cluster detection requires a local statistic.

Interpolation
What is Interpolation?

• Interpolation is the procedure of estimating unknown values at unsampled sites using known values of existing observations.
• It can be used to predict unknown values for any geographic point data, such as home delivery, high child mortality, low ANC visits, and so on.
• Interpolation predicts values for cells in a raster from a limited number of sample data points.

Interpolation Methods/Types/
INVERSE DISTANCE WEIGHTED (IDW)

• The Inverse Distance Weighting interpolator assumes that each input point has
a local influence that diminishes with distance.

• It weights the points closer to the processing cell greater than those further
away.

• A specified number of points, or all points within a specified radius can be used
to determine the output value of each location.

• Use of this method assumes the variable being mapped decreases in influence with distance from its sampled location.
Interpolation Methods cont’d…

• IDW interpolation explicitly implements the assumption that things that are
close to one another are more alike than those that are farther apart.

• To predict a value for any unmeasured location, IDW will use the measured
values surrounding the prediction location.

• Those measured values closest to the prediction location will have more
influence on the predicted value than those farther away.
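
A minimal sketch of IDW in plain Python, assuming weights of 1 / distance^power with power = 2 (a common default) and an optional limit on the number of nearest points. The sample coordinates and values are hypothetical.

```python
import math


def idw(samples, x, y, power=2.0, k=None):
    """Inverse distance weighted estimate at (x, y) from (sx, sy, value) samples."""
    measured = []
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0:
            return value                      # exactly on a sample point
        measured.append((d, value))
    measured.sort(key=lambda pair: pair[0])   # nearest samples first
    if k is not None:
        measured = measured[:k]               # use only the k nearest points
    weights = [1 / d ** power for d, _ in measured]
    return sum(w * v for w, (_, v) in zip(weights, measured)) / sum(weights)


# Hypothetical measurements at two sites 10 units apart.
samples = [(0, 0, 10.0), (10, 0, 20.0)]
midway = idw(samples, 5, 0)       # equally far from both samples
near_first = idw(samples, 1, 0)   # much closer to the first sample
on_sample = idw(samples, 0, 0)
print(midway, near_first, on_sample)
```

Midway between the two samples the estimate is their average (15), while one unit from the first sample the estimate stays close to that sample's value of 10, showing how influence diminishes with distance.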

Interpolation Methods cont’d…
Kriging

• Kriging is a geostatistical interpolation technique that considers both the distance and the
degree of variation between known data points when estimating values in unknown areas.

• A kriged estimate is a weighted linear combination of the known sample values around the
point to be estimated.

• Kriging is a procedure that generates an estimated surface from a scattered set of points with z-values.

• Kriging assumes that the distance or direction between sample points reflects a spatial correlation that can be used to explain variation in the surface.
Interpolation Methods cont’d…

• The Kriging tool fits a mathematical function to a specified number of points, or all points within a specified radius, to determine the output value for each location.
• Kriging is a multistep process; it includes exploratory statistical analysis of the data, variogram modeling, creating the surface, and (optionally) exploring a variance surface.
• Kriging is most appropriate when you know there is a spatially correlated distance or directional bias in the data.
• It is often used in soil science and geology.
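
The variogram modeling step starts from an empirical semivariogram, which summarizes how the difference between values grows with the distance between sample points: γ(h) = Σ (z_i - z_j)² / (2 N(h)) over the N(h) pairs separated by roughly distance h. The sketch below computes only this first step, not the kriging prediction itself. The distance binning scheme and the sample data are assumptions.

```python
import math


def empirical_semivariogram(samples, bin_width, n_bins):
    """Return (mean lag distance, semivariance, pair count) for each non-empty distance bin."""
    sq_diff = [0.0] * n_bins
    dist_sum = [0.0] * n_bins
    counts = [0] * n_bins
    for a in range(len(samples)):
        xa, ya, za = samples[a]
        for b in range(a + 1, len(samples)):
            xb, yb, zb = samples[b]
            h = math.hypot(xa - xb, ya - yb)
            k = int(h // bin_width)           # index of the distance bin
            if k < n_bins:
                sq_diff[k] += (za - zb) ** 2
                dist_sum[k] += h
                counts[k] += 1
    return [(dist_sum[k] / counts[k], sq_diff[k] / (2 * counts[k]), counts[k])
            for k in range(n_bins) if counts[k] > 0]


# Hypothetical samples along a line where the value rises steadily with x.
samples = [(0, 0, 0.0), (1, 0, 1.0), (2, 0, 2.0), (3, 0, 3.0)]
variogram = empirical_semivariogram(samples, bin_width=1.0, n_bins=4)
gammas = [gamma for _, gamma, _ in variogram]
print(variogram)
```

The semivariance rises with lag distance, which is the signature of spatial correlation: points close together are more alike than points far apart. In kriging, a model curve is fitted to these points and then used to derive the interpolation weights.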


SaTScan Analysis

What is SaTScan?
• SaTScan is freely available software that uses the scan statistic to detect clusters (www.satscan.org).
• It tests whether a disease is randomly distributed over space, over time, or over space and time.
• It performs geographical surveillance of disease to detect areas of significantly high or low rates.
• The spatial scan statistic can be a useful addition to disease maps, helping to determine whether the observed patterns are likely due to chance.
• It is a complement to, rather than a replacement for, regular disease maps.


The Spatial Scan Statistic
• For each distinct window, calculate the likelihood, which is proportional to:

$\left(\frac{n}{\mu}\right)^{n}\left(\frac{N-n}{N-\mu}\right)^{N-n}$

Where: n = number of cases inside the circle
N = total number of cases
µ = expected number of cases inside the circle
• Circles of different sizes are used (from zero up to a maximum that includes 50% of the population).
• The log likelihood ratio (LLR) is used to compare the goodness of fit of two models: when the LLR is greater than the Monte Carlo critical value, we reject the null model (hypothesis).
• For each circle, a likelihood ratio statistic is computed from the number of observed and expected cases inside and outside the circle, and compared with the likelihood L0 under the null hypothesis.
• Create a regular or irregular grid of centroids covering the whole study region.
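
For a single circle, the log likelihood ratio under a Poisson-type model can be written as LLR = n ln(n/µ) + (N - n) ln((N - n)/(N - µ)) when n > µ, and 0 otherwise. The sketch below evaluates it for a few hypothetical windows. It looks only for high-rate clusters, and the function name is illustrative rather than taken from SaTScan.

```python
import math


def poisson_llr(n, mu, total):
    """Log likelihood ratio for one window: n observed and mu expected cases, total cases overall."""
    if n <= mu:
        return 0.0                            # no excess of cases inside the window
    inside = n * math.log(n / mu)
    outside = 0.0
    if total > n:
        outside = (total - n) * math.log((total - n) / (total - mu))
    return inside + outside


# Hypothetical windows in a study area with 100 cases, each expecting 10 cases.
as_expected = poisson_llr(10, 10, 100)
doubled = poisson_llr(20, 10, 100)
tripled = poisson_llr(30, 10, 100)
print(as_expected, round(doubled, 3), round(tripled, 3))
```

A window with exactly the expected number of cases scores 0, and the score grows as the excess of observed over expected cases grows.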
The Spatial Scan Statistic cont’d…
For each circle:
 Obtain actual and expected number of cases inside and outside the circle.

 Calculate Likelihood Function.

Compare Circles:
– Pick circle with highest likelihood function as Most Likely Cluster.

Inference:

 Generate random replicas of the data set under the null-hypothesis of no clusters
(Monte Carlo sampling).

 Compare most likely clusters in real and random data sets (Likelihood ratio test).

The Spatial Scan Statistic cont’d…

• The scan statistic is the maximum likelihood over all possible circles.
• It identifies the most unusual cluster.
• To find the p-value, use Monte Carlo hypothesis testing:
• Redistribute the cases randomly and recalculate the scan statistic many times.
• The p-value is the proportion of scan statistics from the Monte Carlo replicates that are greater than or equal to the scan statistic for the true cluster.
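
The whole procedure (grow circles around each centroid, keep the circle with the highest likelihood ratio, then compare it with randomized data) can be sketched in plain Python. This is a toy illustration under stated assumptions, not SaTScan's implementation: it uses a Poisson-type log likelihood ratio for high-rate clusters only, a hypothetical 4 x 4 grid of locations with equal populations, and the common rank-based p-value (exceedances + 1) / (replicas + 1).

```python
import math
import random


def llr(n, mu, total):
    """Poisson-type log likelihood ratio for a window (high rates only)."""
    if n <= mu:
        return 0.0
    outside = (total - n) * math.log((total - n) / (total - mu)) if total > n else 0.0
    return n * math.log(n / mu) + outside


def most_likely_cluster(coords, pops, cases):
    """Scan circles around every location, up to 50% of the population."""
    total_cases, total_pop = sum(cases), sum(pops)
    best_score, best_members = 0.0, []
    for cx, cy in coords:
        order = sorted(range(len(coords)),
                       key=lambda j: math.hypot(coords[j][0] - cx, coords[j][1] - cy))
        n = pop = 0
        members = []
        for j in order:                       # grow the circle one location at a time
            if pop + pops[j] > total_pop / 2:
                break
            n, pop = n + cases[j], pop + pops[j]
            members.append(j)
            score = llr(n, total_cases * pop / total_pop, total_cases)
            if score > best_score:
                best_score, best_members = score, list(members)
    return best_score, best_members


def scan_test(coords, pops, cases, replicas=199, seed=0):
    """Monte Carlo test: redistribute cases in proportion to population."""
    rng = random.Random(seed)
    observed, members = most_likely_cluster(coords, pops, cases)
    locations = list(range(len(pops)))
    exceed = 0
    for _ in range(replicas):
        sim = [0] * len(pops)
        for j in rng.choices(locations, weights=pops, k=sum(cases)):
            sim[j] += 1
        if most_likely_cluster(coords, pops, sim)[0] >= observed:
            exceed += 1
    return observed, members, (exceed + 1) / (replicas + 1)


# Hypothetical data: 4 x 4 grid, 100 people per location, excess cases in one corner.
coords = [(c, r) for r in range(4) for c in range(4)]
pops = [100] * 16
cases = [2] * 16
for hot in (0, 1, 4):
    cases[hot] = 10
observed, members, p_value = scan_test(coords, pops, cases)
print(round(observed, 2), sorted(members), p_value)
```

The most likely cluster is the three corner locations with the excess cases, and very few of the randomized data sets produce a scan statistic as large, so the p-value is small.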

What can and can't SaTScan do?
CAN

• Identify spatial, temporal, and space-time clusters
• Work with flexible geographic units

CANNOT
o Display maps of events and cluster locations (GIS or mapping software such as ArcGIS is needed)
o Create other statistical and regression models

Spatial Scan Statistic: Properties
• Adjusts for inhomogeneous population density.
• Simultaneously tests for clusters of any size and at any location, by using circular windows with continuously variable radius.
• Accounts for multiple testing.
• Can include confounding variables, such as age, sex, or socio-economic variables.
• Works with aggregated or non-aggregated data (states, counties, census tracts, block groups, households, individuals).

Introduction of Statistical models in SaTScan
Bernoulli Model
• There are individuals (e.g. animals) with or without a disease (represented by a 0/1 variable).
• The data are a set of cases and controls.
• Can be used with the purely temporal, purely spatial, or space-time scan statistics.

Discrete Poisson Model

• The number of cases in each location is Poisson-distributed.
• Under the null hypothesis, and when there are no covariates, the expected number of cases in each area is proportional to its population size.
• Can be used with the purely temporal, purely spatial, or space-time scan statistics.
• This model is a very good approximation to the Bernoulli model when there are few cases compared to controls (less than 10%).
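
The statement that expected cases are proportional to population size amounts to a short calculation: expected_i = total cases × population_i / total population. The sketch below uses hypothetical populations and case counts for three areas and assumes no covariates.

```python
def expected_cases(populations, cases):
    """Expected cases per area under the null hypothesis, proportional to population."""
    total_cases = sum(cases)
    total_pop = sum(populations)
    return [total_cases * p / total_pop for p in populations]


# Hypothetical data for three areas.
populations = [1000, 3000, 6000]
cases = [5, 15, 80]
expected = expected_cases(populations, cases)
ratios = [obs / exp for obs, exp in zip(cases, expected)]   # observed / expected
print(expected, [round(r, 2) for r in ratios])
```

Here the third area has more cases than its population share predicts (80 observed against 60 expected), which is the kind of excess the scan statistic then tests for significance.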
Introduction of Statistical models in SaTScan cont’d…
Space-Time Permutation Model

• Requires only case data, with information about the spatial location and time of each case (no information is needed about the population at risk).
• If the population increase (or decrease) is the same across the study region, that is okay and will not lead to biased results.
• The user is advised to be very careful when using this method for data spanning several years, because the population in some areas may grow faster than in others.

Reading assignment
• Geocoding and data linkage using primary and secondary data
• Network analysis

Thank You!!