Unsupervised Classification
What is unsupervised classification?
o Given a scatter of 1,500 points, our brain is wired to detect patterns: the clusters seem obvious
o We can easily detect clusters that maximize inter-cluster distance and minimize intra-cluster distance (most compact clusters)
o Number of clusters?
o Centroids? Medoids?
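The number-of-clusters question can be probed numerically. A minimal sketch, assuming scikit-learn and a synthetic 1,500-point cloud (not the example shown in the figure), that scores candidate cluster counts with the silhouette coefficient, which measures exactly the inter-cluster versus intra-cluster distance trade-off described above:

```python
# Sketch: choosing the number of clusters by silhouette score.
# Synthetic data for illustration only.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1500, centers=4, random_state=0)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette rewards large inter-cluster distance and small
    # intra-cluster distance (the "most compact clusters" criterion).
    print(k, silhouette_score(X, labels))
```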
[Figure: field model with correlated data (well logs, production)]
STANDARD ALGORITHM
o Initialization: start with a guess of k candidate centroids (the quality of the answer depends strongly on the initialization!)
o Assignment: compute the distance of each point to each centroid, and assign each point to the cluster with the nearest centroid (= Voronoi diagram partition) [expectation]
o Update: compute the new centroid as the mean position of all points contained within a cluster [maximization]
o Convergence: stop when centroid position changes are below a predetermined tolerance

The loop minimizes the within-cluster sum of squared distances over the k clusters (here under the L2 norm):

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where $S = \{S_1, \dots, S_k\}$ is the set of clusters and $\mu_i$ is the centroid of cluster $S_i$.
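As a concrete illustration of the loop above, here is a minimal NumPy sketch (the toy data and random initialization are assumptions for the example; in practice a library implementation such as scikit-learn's KMeans, with k-means++ initialization, is preferable):

```python
# Sketch: the standard (Lloyd) k-means loop described above, in NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))        # toy data; any (N, d) array works
k, tol = 3, 1e-4
mu = X[rng.choice(len(X), k, replace=False)]  # initialization: k random points

while True:
    # Assignment (expectation): nearest centroid under the L2 norm.
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, k)
    labels = d.argmin(axis=1)
    # Update (maximization): centroid = mean of each cluster's points
    # (an empty cluster keeps its previous centroid).
    new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                       else mu[i] for i in range(k)])
    # Convergence: stop when centroid movement falls below the tolerance.
    if np.linalg.norm(new_mu - mu) < tol:
        break
    mu = new_mu
```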
PROS
CONS
(Figs. 1-4: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html)
o Poor choice of the number of clusters can lead to spurious results (Fig. 1)
o Sensitive to the initial guess and prone to local minima
o "Spherical cluster" assumption: because the centroid is the mean under Euclidean distance, clustering works best if clusters are isotropic around the centroid (Fig. 2; see the sketch after this list)
o Overlapping clusters: when clusters are touching, boundary assignments can lead to misclassification (Fig. 3)
o Can be CPU-consuming because k*N distances are computed at each iteration, even more so with kernel transformations (Fig. 4)
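The spherical-cluster limitation is easy to reproduce. A minimal sketch in the spirit of the scikit-learn plot_kmeans_assumptions example linked above, with a synthetic data set and an assumed shear matrix (both illustrative, not the figures' exact parameters):

```python
# Sketch: k-means mis-clusters anisotropic blobs even when k is correct
# (cf. the plot_kmeans_assumptions example linked above).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y_true = make_blobs(n_samples=1500, centers=3, random_state=170)
X_aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])  # stretch the blobs

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_aniso)
# The Voronoi partition cuts across the elongated clusters, so many points
# are assigned to the wrong cluster despite the correct choice of k.
```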
HIERARCHICAL CLUSTERING
(dendrogram figures from https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/)
PROS
o No need to know the number of clusters a priori; the cut can be chosen later
o Only distances between points matter, not their positions: useful when working in kernel spaces or with MDS
o CPU consumption is predictable, and the search for the closest clusters speeds up as more links are formed
CONS
o Agglomerative behavior: large clusters tend to grow faster with each distance iteration than small clusters (the rich get richer, singletons are left behind), because their larger envelope automatically offers more opportunities for contact with nearby points; see the sketch below
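A minimal SciPy sketch of agglomerative clustering and its dendrogram, in the spirit of the tutorial linked above (the synthetic two-blob data and Ward linkage are assumptions for the example):

```python
# Sketch: agglomerative clustering with SciPy, then cutting the tree.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Build the full merge tree; no cluster count is chosen up front.
Z = linkage(X, method="ward")

dendrogram(Z)        # inspect the tree first...
plt.show()
labels = fcluster(Z, t=2, criterion="maxclust")  # ...then cut into 2 clusters
```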
It is easy and tempting to replace expert knowledge with unsupervised learning: instead of interpreting lithology, obtain measurements and let algorithms identify similarities.
This is dangerous:
o Clusters are pure mathematical objects; they are useful but not explanatory
o They contain no direct physical information, have no extrapolation quality, and offer no judgement with respect to the data (the algorithm uses all data provided, without distinction)
o Clusters depend on our choice of features, on our subset of data, and on our algorithm parameterization
o More interesting is to understand how these clusters came to be produced by the algorithm: clustering is only the beginning of our work, and we have to extract some sense from them (see the sketch after this list)
o Our assumption: we know that we have an underlying "law" (the physics and chemistry of geological reservoirs) to help us understand why our clusters were formed, and we need to use it
o Forming clusters is a way of acknowledging that similar causes (features) will lead to similar behaviors, even though we do not yet know which behavior (outcome)
o Supervised learning: identify a behavior and a set of causal features, and infer a modeling relationship
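One possible way to start "extracting sense" from clusters, bridging to the supervised step above: fit a simple surrogate classifier to predict the cluster labels and inspect which features drive the assignment. A sketch only; the feature names are hypothetical placeholders and the surrogate-tree approach is one option, not a workflow prescribed by this document:

```python
# Sketch: explaining clusters with a surrogate decision tree.
# Feature names are hypothetical placeholders, not from this document.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=500, n_features=4, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised step: which features separate the clusters we just formed?
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
for name, importance in zip(["porosity", "permeability", "gamma_ray", "density"],
                            tree.feature_importances_):
    print(name, round(importance, 3))
```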
REFERENCES
Hyvärinen, A. and Oja, E. (2000). "Independent Component Analysis: Algorithms and Applications". Neural Networks, 13(4-5), pp. 411-430.
Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc.
Hartigan, J. A. and Wong, M. A. (1979). "Algorithm AS 136: A K-Means Clustering Algorithm". Journal of the Royal Statistical Society, Series C, 28(1), pp. 100-108. JSTOR 2346830.
Honarkhah, M. and Caers, J. (2010). "Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling". Mathematical Geosciences, 42, pp. 487-517. doi:10.1007/s11004-010-9276-7.
Rokach, L. and Maimon, O. (2005). "Clustering Methods". In Data Mining and Knowledge Discovery Handbook, Springer US, pp. 321-352.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). "14.3.12 Hierarchical Clustering". In The Elements of Statistical Learning (2nd ed.), Springer, New York, pp. 520-528. ISBN 0-387-84857-6.