
CHAPTER 4


Unsupervised Classification

Hervé Gross, PhD - Reservoir Engineer


Advanced Resources and Risk Technology
hgross@ar2tech.com

© ADVANCED RESOURCES AND RISK TECHNOLOGY. This document can only be distributed within UFRGS.
What is unsupervised classification?

o Given a data set with multiple features, unsupervised classification is the action of sorting, classifying, and categorizing this multidimensional data into a number of groups with similar features
o These groups are called clusters
o The goal is to identify the distinguishing features of a large data set in a multidimensional space
o If an unsupervised classification is successful, any member of a given cluster is more similar to the other members of its own cluster than to any member of the other clusters
o Because no output (no classification label) is provided, unsupervised classification is not validated against a “truth” (contrary to supervised classification, where a partial truth is given)

Illustration: social networks use unsupervised classification to group large sets of people into categories (“buckets”) based on gender, age, taste, recent activity, friends, etc. By finding similarities between users, they can offer customized services to each bucket, test new functionalities, analyze similar behaviors, and so on.
(Image: https://pixabay.com/en/human-crowds-collection-people-592738/)



Illustration on 2D data

o 1,500 points
o Our brain is wired to detect patterns: the grouping seems obvious
o We can easily detect clusters that maximize the extra-cluster (between-cluster) distance and minimize the intra-cluster distance (most compact clusters)
o Number of clusters?
o Centroids? Medoids?



Illustration on 2D data

o K-means clustering done here (see the sketch below)
o We provide the number of clusters
o Any misclassification?
o Never use clustering without understanding the reason for the clusters
o VISUALIZATION is very important
o Clustering quality metrics do not offer a true validation; they only indicate how difficult the clustering was to perform

ALWAYS ASK YOURSELF:

Does this clustering make sense?
Plots, statistics, spot-checks

E.g.: what is the point of creating 20 clusters for facies if you cannot find variograms for them?
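As a concrete companion to this slide, here is a minimal sketch (not the deck's original example or data) of K-means on synthetic 2D points with a visual spot-check; scikit-learn and matplotlib are assumed, and make_blobs simply stands in for the 1,500-point illustration.

```python
# Minimal sketch: K-means on synthetic 2D points, followed by a visual sanity check.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 1,500-point illustration
X, _ = make_blobs(n_samples=1500, centers=4, random_state=42)

# We provide the number of clusters ourselves
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# ALWAYS visualize: does this clustering make sense?
plt.scatter(X[:, 0], X[:, 1], c=labels, s=5)
plt.scatter(*kmeans.cluster_centers_.T, c="red", marker="x", s=100)
plt.title("K-means on 2D data (visual spot-check)")
plt.show()
```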



Generalization to n-dim data
o Geomodeling data sets are multidimensional (see the sketch below):
o Well logs (density, gamma, sonic, etc.), interpreted data (rock type…), secondary data (seismic)
o Spatial information must always be accounted for: cluster by horizon, by region
o Time-dependent information: production, pressure, history-match quality

[Figures: field model, correlated data, well logs, production]
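A minimal sketch of how such multidimensional features might be prepared for clustering; the DataFrame columns (density, gamma_ray, sonic, region) and their values are hypothetical, and pandas/scikit-learn are assumed. Standardization and per-region grouping illustrate the bullets above, not the deck's actual workflow.

```python
# Minimal sketch with hypothetical well-log features: standardize features that
# have different units before clustering, and honor spatial context by clustering
# per region/horizon rather than pooling everything together.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table: one row per sample along the wells
logs = pd.DataFrame({
    "density":   [2.31, 2.45, 2.60, 2.28, 2.52, 2.47],
    "gamma_ray": [85.0, 40.0, 25.0, 95.0, 33.0, 60.0],
    "sonic":     [95.0, 72.0, 60.0, 99.0, 68.0, 80.0],
    "region":    ["A", "A", "A", "B", "B", "B"],   # horizon / region label
})

features = ["density", "gamma_ray", "sonic"]
for region, group in logs.groupby("region"):
    X = StandardScaler().fit_transform(group[features])          # unit-free features
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(region, labels)
```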



Techniques used in Unsupervised Classification

Clustering = automatic classification = numerical taxonomy. Sometimes focused on the resulting groups, sometimes focused on the discriminative power of a set of features.

o Clustering
  o k-means
  o mixture models (such as Gaussian mixture models)
  o hierarchical clustering
o Anomaly detection
o Neural networks
  o Hebbian learning
  o Generative adversarial networks
o Approaches for learning latent variable models, such as
  o Expectation–maximization algorithm (EM)
  o Method of moments
  o Blind signal separation techniques
    o Principal component analysis
    o Independent component analysis
    o Non-negative matrix factorization
    o Singular value decomposition
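For orientation, a minimal sketch applying three of the techniques listed above (k-means, a Gaussian mixture model, and PCA as a dimension-reduction / blind signal separation step) to the same synthetic data; scikit-learn is assumed and the data and parameters are illustrative only.

```python
# Minimal sketch: three of the listed techniques on the same synthetic data set.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, n_features=5, random_state=1)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)        # k-means
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)       # mixture model
X_2d = PCA(n_components=2).fit_transform(X)                                        # project onto 2 PCs

print(km_labels[:10], gmm_labels[:10], X_2d.shape)
```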



K-means algorithms

Many variants of the same base algorithm (centroid-based algorithms)
Principle: partition the data into k clusters so as to minimize, within each cluster, the sum of squared distances to the cluster's centroid (a heuristic)

STANDARD ALGORITHM
o Initialization: start with a guess of k candidate centroids (the quality of the answer is strongly initialization-dependent!)
o Assignment: Compute the distance of each point to
each centroid, and assign each point to the cluster
with the nearest centroid (=Voronoi diagram
partition) [expectation]
o Update: compute the new centroid as the mean
position of all points contained within a cluster
[maximization]
o Convergence: stop when centroid position changes are below a predetermined tolerance

Objective minimized over the k clusters $S = \{S_1, \dots, S_k\}$, with $\mu_i$ the centroid of cluster $S_i$ (here with the L2 norm):

$\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert_2^2$
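A minimal NumPy sketch of the standard algorithm described above: initialization, assignment (nearest centroid), update (mean position), and a tolerance-based convergence test. It is a bare-bones illustration (random initialization, no safeguard against empty clusters), not a production implementation.

```python
# Minimal NumPy sketch of the standard k-means heuristic (L2 norm).
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, rng=np.random.default_rng(0)):
    # Initialization: guess k candidate centroids (here: k random data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: distance of each point to each centroid, keep the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: new centroid = mean position of the points in each cluster
        # (no safeguard against empty clusters in this sketch)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Convergence: stop when centroid moves fall below the tolerance
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(100, 2) + [0, 0], np.random.randn(100, 2) + [5, 5]])
labels, centroids = kmeans(X, k=2)
print(centroids)
```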



Pros and Cons of K-Means

PROS
(Figures from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html)

o Works with large data sets in large dimensions


o Robust to clusters with different densities (Fig. 4)
o Heuristic: always an answer regardless of the required number
of clusters [good for representative model selection problems]


CONS
o A poor choice of the number of clusters can lead to spurious results (Fig. 1)
o Sensitive to initial guess and prone to local minima
o “Spherical cluster” assumption: because each centroid is the mean of its points and the objective uses Euclidean distances, clustering works best when clusters are isotropic (roughly spherical) around their centroid (Fig. 2)
o Overlapping clusters: when clusters are touching, boundary
assignments can lead to misclassification (Fig. 3)
o Can be CPU-consuming because k·N distances are computed at each iteration, even more so with kernel transformations (Figs. 3 and 4)
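A minimal sketch of the “spherical cluster” limitation, in the spirit of the scikit-learn example linked above (the shearing matrix and random seed follow that example); scikit-learn and matplotlib are assumed.

```python
# Minimal sketch: after an anisotropic (shearing) transform, k-means cuts across
# the elongated clusters because it assumes roughly isotropic clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1500, centers=3, random_state=170)
X_aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])   # stretch the blobs

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_aniso)

plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=labels, s=5)
plt.title("K-means on anisotropic clusters (misassignments near boundaries)")
plt.show()
```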



Many K-means variations

o Smart initialization strategies (boundary sampling, DoE concepts, best guesses…)


o Reduce dimensions (pre-process with PCA, k-PCA, MDS,…)
o Transform coordinates (including Kernel transformations)
o Apply weights on the dimensions (favor separation in a metric)
o Work with different distance norms (k-median clustering uses L1, taxicab norm)
o Add stochasticity to avoid local minima (random assignments)
o Use medoids instead of centroids (i.e. only existing data points can be cluster centers)
o Add internal cluster evaluation (measures of the density, extent, shape, or “silhouette” of clusters) to update the optimal number of clusters (see the sketch after this list)
o Expectation-Maximization algorithms: centroids are maximum likelihood points
determined by the marginal likelihood of data (continuous variables)
o Hierarchical clustering: split clusters to build a hierarchy and find optimal number of
clusters
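As a sketch of the “internal cluster evaluation” idea mentioned above, here is one common recipe, assuming scikit-learn: sweep the number of clusters and keep the k with the best silhouette score. The data and the range of k are arbitrary.

```python
# Minimal sketch: pick the number of clusters with the silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=800, centers=4, random_state=7)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # higher = better-separated clusters

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```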



Hierarchical clustering algorithms

o Hierarchical clustering analysis (HCA) is also


called “connectivity-based clustering”
o Cluster data points (“connect” them) based on
a maximum distance required to travel from
one cluster to another. The inter-cluster
similarity (distance) can be defined in several
ways.
o “Hierarchical” because clusters are either
divided into smaller clusters (divisive
clustering approach) as the maximum
distance to connect decreases, or
agglomerated as the maximum distance to
connect increases (agglomerative clustering
approach)
o Dendrograms are used to represent the partitioning of the data as a function of distance

Figure: iterative construction of 10 clusters with Euclidean distance and a “complete” linkage approach



Most efficient implementation:

Agglomerative algorithm

o Initialization: start with each point in its own cluster (N clusters)
o Iteration #1: look for the two closest points according to the selected linkage and distance type, and “link” them together (i.e. add their centroid to the pool of points, and remove the two original points from the pool)
o Iteration #N: look for the next two closest points or two closest clusters of points (based on a measure of inter-cluster distance) and link them (add centroid, remove parents)
o Stop when there are only 2 clusters left and report all distances at which linkages occurred
o Result: a dendrogram (classification tree) where all linkages are shown as a function of distance. The user can then pick either a distance or a desired number of clusters (see the sketch below)

Figure: N = 150 points, 2 dimensions, Euclidean distance. Agglomerative hierarchical clustering iteratively searches for the two “closest” (most similar) points or previously-formed clusters and links them.

https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
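A minimal sketch in the spirit of the SciPy tutorial linked above: build the linkage matrix on 150 random 2D points, draw the dendrogram, and cut the tree into a desired number of clusters. The random data and the choice of 10 clusters are illustrative only.

```python
# Minimal sketch: agglomerative linkage, dendrogram, and a cut of the tree.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.random.default_rng(0).normal(size=(150, 2))       # 150 points, 2 dimensions

Z = linkage(X, method="complete", metric="euclidean")     # agglomerative linkage matrix
dendrogram(Z)
plt.xlabel("sample index")
plt.ylabel("linkage distance")
plt.show()

labels = fcluster(Z, t=10, criterion="maxclust")          # pick a desired number of clusters
print(labels)
```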



How to read a dendrogram

(Greek dendro: tree + gramma: drawing)

Figure: an annotated dendrogram of the 150 original points.
o Horizontal axis: label of the samples; reflects only their adjacency in the tree, nothing to do with their actual position in space
o Vertical axis: value of the linkage distance (progression of agglomerative clustering)
o Vertical lines: origin of the merged clusters
o Horizontal lines: merge of two clusters



How to read a dendrogram

Figure: cutting the dendrogram at this linkage distance (30) forms 2 clusters (labeled 1 and 2). The leaf labels index the 150 original samples and have nothing to do with their actual position in space.





How to read a dendrogram

Figure: cutting the dendrogram at this linkage distance (15) forms 5 clusters (labeled 1 to 5). The leaf labels index the 150 original samples and have nothing to do with their actual position in space.





How to read a dendrogram

o Dendrograms are often “leaf-truncated” to avoid showing unreadable depths (leaves = the ends of the branches of the tree)
o A truncated leaf reports either the singleton index or the number of points contained in the cluster
o Dendrograms are sensitive to the distances between points/clusters (not to their positions)
o The choice of metric and linkage strategy impacts the dendrogram (see the sketch below)
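A minimal sketch of leaf truncation with SciPy's dendrogram: truncate_mode="lastp" shows only the last p merges, and truncated leaves are labeled with the number of points they contain. The data and p=12 are arbitrary.

```python
# Minimal sketch: a leaf-truncated dendrogram (only the last 12 merges shown).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(1).normal(size=(150, 2))
Z = linkage(X, method="ward")

dendrogram(Z, truncate_mode="lastp", p=12, show_contracted=True)
plt.ylabel("linkage distance")
plt.show()
```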



HCA variations

o Two key components of the method can be customized (see the sketch below):
  o The metric (Euclidean L2, Manhattan L1, maximum distance Linf, Mahalanobis (covariance), …)
  o The linkage:
    o Link if the closest two points of the sets are < threshold [single or minimum linkage]
    o Link if the farthest two points of the sets are < threshold [complete or maximum linkage]
    o Link if the average distance between all pairs of points is < threshold [average linkage]
    o Link if the centroid distance is < threshold [centroid linkage]
    o “Minimum energy clustering”: link if the distance between the two point sets' density distributions is < threshold
    o Ward linkage: minimize the distances inside all clusters (similar to k-means), favors compactness
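A minimal sketch, assuming SciPy, of how swapping the metric and linkage strategy changes the resulting partition of the same (random, illustrative) data:

```python
# Minimal sketch: same data, different metric/linkage choices, different clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(2).normal(size=(100, 3))

for method, metric in [("single", "euclidean"),     # minimum linkage, L2
                       ("complete", "cityblock"),   # maximum linkage, L1 (Manhattan)
                       ("average", "euclidean"),
                       ("ward", "euclidean")]:      # Ward requires Euclidean distance
    Z = linkage(X, method=method, metric=metric)
    labels = fcluster(Z, t=4, criterion="maxclust")
    print(method, metric, np.bincount(labels)[1:])   # cluster sizes
```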



Pros and Cons

PROS
o No need to know the number of clusters a
priori, can choose later
o Only the distances between points matter, not their positions: useful when working in kernel spaces or with MDS coordinates
o CPU consumption is predictable, and the search for the closest clusters speeds up as points get linked (fewer candidates remain)

CONS
o Agglomerative clustering: large clusters tend to grow faster with each distance iteration than small clusters (the rich get richer, singletons are left over). This is because their larger envelope automatically offers more opportunities for contact with nearby points



Unsupervised learning in geomodeling

It is easy and tempting to replace expert knowledge with unsupervised learning: instead of interpreting lithology, obtain measurements and let algorithms identify similarities.
This is dangerous:
o Clusters are purely mathematical objects; they are useful but not explanatory
o They contain no direct physical information, have no extrapolation power, and exercise no judgement with respect to the data (they use all data provided, without distinction)
o Clusters depend on our choice of features, they depend on our subset of data, they depend on our
algorithm parameterization
o It is more interesting to understand how these clusters came to be produced by the algorithm: clustering is only the beginning of our work, and we have to extract some sense from the clusters
o Our assumption: we know that there is an underlying “law” (the physics and chemistry of geological reservoirs) to help us understand why our clusters were formed, and we need to use it
o Forming clusters is a way of acknowledging that similar causes (features) will lead to similar behaviors,
although we do not know which behavior (outcome) yet.
o Supervised learning: identify a behavior and a set of causal features, and infer a modeling relationship



References

A. Hyvärinen and E. Oja (2000). "Independent Component Analysis: Algorithms and Applications". Neural Networks, 13(4-5), pp. 411-430.
J. A. Hartigan (1975). Clustering Algorithms. John Wiley & Sons, Inc.
Hartigan, J. A.; Wong, M. A. (1979). "Algorithm AS 136: A K-Means Clustering Algorithm". Journal of the Royal Statistical Society, Series C, 28(1), pp. 100-108. JSTOR 2346830.
Honarkhah, M.; Caers, J. (2010). "Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling". Mathematical Geosciences, 42, pp. 487-517. doi:10.1007/s11004-010-9276-7.
Rokach, Lior; Maimon, Oded (2005). "Clustering Methods". In Data Mining and Knowledge Discovery Handbook. Springer US, pp. 321-352.
Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "14.3.12 Hierarchical Clustering". The Elements of Statistical Learning (2nd ed.). New York: Springer, pp. 520-528. ISBN 0-387-84857-6.

