Clustering
By
Nirdesh Kumar-06004008
Introduction
Regionalization
Clustering Algorithms
K-Means Algorithm
• Each feature vector represents one of the N sites in the study region.
• Rescaling of the attributes is necessary to nullify the effects of differences in
their variance and relative magnitude.
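$x_{ij} = (y_{ij} - \bar{y}_j) / \sigma_j$, where
$x_{ij}$ denotes the rescaled value of $y_{ij}$, the value of attribute $j$ in feature vector $i$;
$\sigma_j$ represents the standard deviation of attribute $j$;
$\bar{y}_j$ is the mean value of attribute $j$ over all $N$ feature vectors.
$K$: number of clusters.
$N_k$: number of feature vectors in cluster $k$.
$x_{ij}^{k}$: rescaled value of attribute $j$ in the feature vector $i$ assigned to cluster $k$.
$\bar{x}_{j}^{k}$: mean value of attribute $j$ for cluster $k$, computed as $\bar{x}_{j}^{k} = \frac{1}{N_k} \sum_{i=1}^{N_k} x_{ij}^{k}$.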
• Minimizing F minimizes the distance of each feature vector from the centre of the
cluster to which it belongs.
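For reference, the objective function F usually minimized by K-means is the total squared Euclidean distance of the feature vectors from their cluster centres. The form below is a sketch under that assumption. The symbol $n$ (number of attributes) and the bar notation for the cluster mean are notation introduced here, not taken from the slides.

```latex
F = \sum_{k=1}^{K} \sum_{j=1}^{n} \sum_{i=1}^{N_k}
    \left( x_{ij}^{k} - \bar{x}_{j}^{k} \right)^{2},
\qquad
\bar{x}_{j}^{k} = \frac{1}{N_k} \sum_{i=1}^{N_k} x_{ij}^{k}
```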
• Steps involved in K-means algorithm to delineate clusters for a given value of K
are:
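The individual steps are not reproduced in this text. As an illustration only, the standard iteration (often called Lloyd's algorithm) can be sketched as follows. This is a generic sketch, not necessarily the exact variant used in the study. The function names `rescale` and `kmeans`, the random initialisation, and the stopping rule are assumptions made here.

```python
import numpy as np

def rescale(Y):
    """Z-score each attribute (column) of an N x n array.
    Assumes no attribute has zero variance."""
    Y = np.asarray(Y, dtype=float)
    return (Y - Y.mean(axis=0)) / Y.std(axis=0)

def nearest_centre(X, centres):
    """Index of the closest centre (squared Euclidean distance) per row of X."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, K, n_iter=100, seed=0):
    """Generic Lloyd iteration (hypothetical helper, not the study's code)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct feature vectors as initial centres (assumption).
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every feature vector to its nearest centre.
        labels = nearest_centre(X, centres)
        # Step 3: recompute each centre as the mean of its members;
        # an empty cluster keeps its previous centre.
        new_centres = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
            for k in range(K)
        ])
        # Step 4: stop once the centres no longer move.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    labels = nearest_centre(X, centres)
    F = float(((X - centres[labels]) ** 2).sum())  # objective value
    return labels, centres, F
```

In practice the iteration is usually repeated from several initialisations and the partition with the smallest F is retained, since the result depends on the starting centres.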
• Single linkage: the distance between the cluster [yi, yj], formed by merging clusters yi
and yj, and another cluster yk is the smaller of the distances between yi and yk and between yj and yk.
• Complete linkage: the distance between the new cluster [yi, yj] and any other singleton
cluster yk is the greater of the distances between yi and yk and between yj and yk.
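The two update rules can be written as a small helper. This is an illustrative sketch; the function name `merged_distance` and its arguments are hypothetical.

```python
def merged_distance(d_ik, d_jk, linkage="single"):
    """Distance from the merged cluster [yi, yj] to another cluster yk,
    given the distances d_ik (yi to yk) and d_jk (yj to yk)."""
    if linkage == "single":
        return min(d_ik, d_jk)   # nearest-neighbour rule
    if linkage == "complete":
        return max(d_ik, d_jk)   # furthest-neighbour rule
    raise ValueError("linkage must be 'single' or 'complete'")
```

For example, with d_ik = 2.0 and d_jk = 5.0, single linkage gives 2.0 and complete linkage gives 5.0.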
• At each step in the analysis, the union of every possible pair of clusters is considered,
and the two clusters whose fusion results in the smallest increase in W are merged.
• The change depends only on the relationship between the two merged clusters
and not on the relationships with other clusters.
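The merging rule above corresponds to what is commonly called Ward's minimum-variance method, assuming W denotes the total within-cluster sum of squared deviations (W is not defined in the surviving text, so this is an assumption). Under that assumption, the increase in W caused by merging two clusters depends only on their sizes and centroids, as the sketch below shows. The function names are hypothetical.

```python
import numpy as np

def within_ss(X):
    """Sum of squared deviations of the rows of X from their centroid."""
    X = np.asarray(X, dtype=float)
    return float(((X - X.mean(axis=0)) ** 2).sum())

def ward_increase(A, B):
    """Increase in W when clusters A and B are merged:
    (n_A * n_B / (n_A + n_B)) * squared distance between centroids."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    diff = A.mean(axis=0) - B.mean(axis=0)
    return len(A) * len(B) / (len(A) + len(B)) * float(diff @ diff)
```

The closed form agrees with the direct difference `within_ss(A + B) - within_ss(A) - within_ss(B)` for clusters given as lists of points, which is why the change does not involve any other cluster.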
Cluster Validity Indices
The heterogeneity of each of the plausible regions obtained from the cluster analysis
is assessed.
The test uses the advantages offered by the sampling properties of L-moment ratios (LMRs).
It examines whether the between-site dispersion of the sample LMRs for the
group of sites under consideration is larger than the dispersion expected in a
homogeneous region.
Three measures of dispersion are used:
(1) weighted standard deviation of the at-site sample L-CVs (V1);
(2) weighted average distance from the site to the group weighted mean in the
two-dimensional space of L-CV and L-skewness (V2);
(3) weighted average distance from the site to the group weighted mean in the
two-dimensional space of L-skewness and L-kurtosis (V3).
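A minimal sketch of the dispersion measures follows. It assumes weights proportional to the record length at each site, which is the usual choice for this test, and takes V1 to be the weighted standard deviation of the at-site L-CVs. The function name and argument names are hypothetical.

```python
import numpy as np

def dispersion_measures(n, t, t3, t4):
    """n: record lengths; t, t3, t4: at-site sample L-CV, L-skewness
    and L-kurtosis. Returns (V1, V2, V3)."""
    n, t, t3, t4 = (np.asarray(a, dtype=float) for a in (n, t, t3, t4))
    w = n / n.sum()                          # record-length weights (assumption)
    tR, t3R, t4R = (w * t).sum(), (w * t3).sum(), (w * t4).sum()
    # V1: weighted standard deviation of the at-site L-CVs
    V1 = float(np.sqrt((w * (t - tR) ** 2).sum()))
    # V2: weighted mean distance to the group mean in (L-CV, L-skewness) space
    V2 = float((w * np.sqrt((t - tR) ** 2 + (t3 - t3R) ** 2)).sum())
    # V3: weighted mean distance to the group mean in (L-skewness, L-kurtosis) space
    V3 = float((w * np.sqrt((t3 - t3R) ** 2 + (t4 - t4R) ** 2)).sum())
    return V1, V2, V3
```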
• For each simulated realization (a homogeneous region), V1, V2 and V3 are computed.
• μV1, μV2, μV3 are the means and σV1, σV2, σV3 are the standard deviations of V1, V2 and V3
over the simulated realizations.
• HM < 1: acceptably homogeneous.
• 1 ≤ HM < 2: possibly heterogeneous.
• HM ≥ 2: definitely heterogeneous.
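The heterogeneity measure for each dispersion statistic is the observed value standardized by the mean and standard deviation of the simulated values, HM = (VM - μVM) / σVM. The sketch below assumes the simulated values are already available (the simulation of homogeneous regions is not shown) and uses the sample standard deviation, which is an assumption. The middle class is labelled "possibly heterogeneous", the usual wording for this test. Function names are hypothetical.

```python
import numpy as np

def heterogeneity_measure(v_obs, v_sim):
    """H = (V_obs - mean of simulated V) / std of simulated V."""
    v_sim = np.asarray(v_sim, dtype=float)
    return float((v_obs - v_sim.mean()) / v_sim.std(ddof=1))

def classify(h):
    """Map a heterogeneity measure to the three region types."""
    if h < 1:
        return "acceptably homogeneous"
    if h < 2:
        return "possibly heterogeneous"
    return "definitely heterogeneous"
```

For instance, with simulated values [1, 2, 3] and an observed value of 4.5, the measure is 2.5 and the region is classed as definitely heterogeneous.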
The regions are adjusted to improve their homogeneity through the following:
(1) Eliminating (or deleting) one or more sites from the data set;
(2) Transferring one or more discordant sites from a region to other regions;
(3) Dividing a region to form two or more new regions;
(4) Allowing a site to be shared by two or more regions;
(5) Dissolving regions by transferring their sites to other regions;
(6) Merging a region with another or others;
(7) Merging two or more regions and redefining groups;
(8) Obtaining more data and redefining regions.
The first three options are useful in reducing the values of the heterogeneity measures of a region.
Options 4–7 help ensure that each region is sufficiently large in terms of the collective
data length at all the sites in it.
Hydrologic Regionalization of India
Description of the Study Region
The study region, India (Figure 2), lies between 8°4′ and 37°6′ north latitude and 68°7′
and 97°25′ east longitude, and has an area of 3,287,263 km².
Climate: four seasons, namely winter (January and February), summer (March to May),
summer monsoon (June to September), and post-monsoon (October to December).
Data Used
Daily gridded rainfall data for the period 1951–2004 were procured from the India
Meteorological Department (IMD).
Based on records at 2140 stations.
Gridded reanalysis data of monthly mean atmospheric variables for 1951–2004 are taken from
the database of the National Centers for Environmental Prediction (NCEP).
Elevation of terrain in each of the NCEP grid boxes is computed from Shuttle Radar
Topography Mission (SRTM) data.
Five maps of summer monsoon rainfall (SMR) regions currently in use by the IMD are used.
Results and Discussion
The statistical homogeneity of each of the five IMD SMR regions is tested using SMR
data at grid points in the region as shown in the table below.
Serial   Region Name         Number of     Heterogeneity Measures      Region Type
Number                       Grid Points   H1       H2       H3
1        Peninsular          49            23.28    5.93     0.26      Definitely heterogeneous
2        West Central        86            10.89    0.64     -1.33     Definitely heterogeneous
3        Northwest           69            20.96    5.87     -1.08     Definitely heterogeneous
4        Central Northeast   59            4.32     -0.73    -1.90     Definitely heterogeneous
5        Northeast           36            4.44     -0.91    1.06      Definitely heterogeneous
Table 1- Characteristics of the IMD SMR Regions Determined Using Heterogeneity Measures
The IMD regions are adjusted to improve their homogeneity, and the results are tabulated in Table 2.
Figure 2 shows the number of sites removed to make the regions acceptably
homogeneous.
Figure 2- SMR regions that are considered as homogeneous by IMD
Figure 3- SMR regions after adjusting
Serial Number   Region Name   Number of Grid Points   Heterogeneity Measures   Number of Grid Points Eliminated
To delineate new homogeneous SMR regions in the study region, 52 out of 60 NCEP
grid boxes covering India are considered.
Rain gauge density is low in the Himalayan region, so the 8 boxes there are discarded.
Mean monthly values of each of the 15 atmospheric variables are considered at each
NCEP grid point for the summer monsoon months.
960 values (15 variables × 16 grid points × 4 months) are obtained for each grid box.
To reduce redundancy, principal components are extracted from these values. The
principal components and standardized location attributes (latitude, longitude,
and average elevation of terrain in each of the NCEP grid boxes) are used as
attributes to form 52 feature vectors for K-means cluster analysis.
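The construction of the feature vectors can be sketched as below. This is an illustration under stated assumptions: the number of principal components retained is not given in the text, so a hypothetical rule (keep enough components to explain a fraction `var_fraction` of the variance) is used, and the function names are invented here.

```python
import numpy as np

def standardize(M):
    """Z-score each column; assumes no column has zero variance."""
    M = np.asarray(M, dtype=float)
    return (M - M.mean(axis=0)) / M.std(axis=0)

def pc_scores(A, var_fraction=0.9):
    """Principal component scores of the standardized columns of A,
    keeping the leading components that explain var_fraction of the variance."""
    Z = standardize(A)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    m = min(int(np.searchsorted(cum, var_fraction)) + 1, len(s))
    return U[:, :m] * s[:m]

def build_feature_vectors(A, location, var_fraction=0.9):
    """A: (52, 960) atmospheric values; location: (52, 3) latitude,
    longitude and mean elevation. Returns one feature vector per row."""
    return np.hstack([pc_scores(A, var_fraction), standardize(location)])
```

Whether the component scores are rescaled again before clustering is a design choice the text does not settle; here they are left unscaled.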
Figure 3- Grid boxes covering India.
Figure 4- Identification of optimal partition provided by K-means clustering algorithm
The number of sites that had to be eliminated from the regions to improve their
statistical homogeneity is excessive, indicating that the IMD
SMR regions are not useful as precursors to derive homogeneous SMR regions.
New SMR regions are delineated using the proposed methodology.
Conclusion
Existing approaches are based on statistics computed from observed hydrological data.
The proposed method can form regions irrespective of the available data (rain
gauges for this study).
However, as seen in this study, there is uncertainty in validating homogeneous regions in
areas having few rain gauges.
Thank You