
Introduction to

Machine Learning and Data Mining


(Học máy và Khai phá dữ liệu)

Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology

2021
2

Contents
¡ Introduction to Machine Learning & Data Mining
¡ Supervised learning
¡ Unsupervised learning
¨ Clustering

¡ Practical advice
3
1. Basic learning problems
¡ Supervised learning: learn a function y = f(x) from a given
training set {{x1, x2, …, xN}; {y1, y2,…, yN}} so that yi ≅ f(xi) for
every i.
¨ Each training instance has a label/response.

¡ Unsupervised learning: learn a function y = f(x) from a given
training set {x1, x2, …, xN}.
¨ No response is available.
¨ Our target is the hidden structure in data.

4
Unsupervised learning: examples (1)
¡ Clustering: grouping data into clusters
Fig. 1. Learning problems: dots correspond to points without any labels. Points with labels are denoted by plus signs, asterisks, and crosses. In (c), the must-link and
cannot-link constraints are denoted by solid and dashed lines, respectively (figure taken from Lange et al. (2005)).

¨ Discover the data groups/clusters

Fig. 2. Diversity of clusters. The seven clusters in (a) (denoted by seven different colors in 1(b)) differ in shape, size, and density. Although these clusters are apparent to a data
analyst, none of the available clustering algorithms can detect all these clusters.

¡ Community detection
¨ Detect communities in online social networks
5
Unsupervised learning: examples (2)
¡ Trend detection
¨ Discover the trends, demands, and future needs of online users

¡ Entity-interaction analysis

6

2. Clustering
¡ Clustering problem:

¨ Input: a training set without any label.



¨ Output: clusters of the training instances


¡ A cluster:
¨ Consists of instances that are similar to each other in some sense.
¨ Two clusters should be different from each other.

After clustering
7
Clustering
¡ Approaches to clustering
¨ Partition-based clustering
¨ Hierarchical clustering
¨ Mixture models
¨ Deep clustering
¨ …
¡ Evaluation of clustering quality
¨ The distance/difference between any two clusters should be large
(inter-cluster distance).
¨ The difference between instances inside a cluster should be small
(intra-cluster distance); a small sketch of both quantities is given below.
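As a rough, self-contained illustration (not from the slides; the function names and the two small clusters are made up), the sketch below computes an average intra-cluster spread and an inter-cluster (centroid-to-centroid) distance:

```python
import numpy as np

def intra_cluster_spread(cluster):
    """Average Euclidean distance of a cluster's points to its centroid."""
    centroid = cluster.mean(axis=0)
    return np.linalg.norm(cluster - centroid, axis=1).mean()

def inter_cluster_distance(cluster_a, cluster_b):
    """Euclidean distance between the two cluster centroids."""
    return np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

# Two small, made-up 2-D clusters
a = np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]])
b = np.array([[5.0, 5.2], [4.8, 5.1], [5.1, 4.9]])

print(intra_cluster_spread(a), intra_cluster_spread(b))  # small: compact clusters
print(inter_cluster_distance(a, b))                      # large: well-separated clusters
```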
8
3. K-means for clustering
¡ K-means was first introduced by Lloyd in 1957.
¡ K-means is the most popular partition-based method for clustering.
¡ Data representation: D = {x1, x2, …, xr}, each xi is a vector
in the n-dimensional Euclidean space.
¡ K-means partitions D into K clusters:
¨ Each cluster has a central point, which is called the centroid.
¨ K is a pre-specified constant.
9
K-means: main steps
¡ Input: training data D, number K of clusters, and distance
measure d(x,y).
¡ Initialization: randomly select K instances in D as the initial
centroids.
¡ Repeat the following two steps until convergence
¨ Step 1: for each instance, assign it to the cluster with the nearest
centroid.
¨ Step 2: for each cluster, recompute its centroid from all the
instances assigned to that cluster (a minimal sketch follows below).
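A minimal NumPy sketch of these steps (my own illustrative code, not a reference implementation from the course; it uses Euclidean distance and stops when the centroids stop moving):

```python
import numpy as np

def kmeans(D, K, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick K training instances as the initial centroids
    centroids = D[rng.choice(len(D), size=K, replace=False)]
    labels = np.zeros(len(D), dtype=int)
    for _ in range(max_iter):
        # Step 1: assign each instance to the cluster with the nearest centroid
        dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid from the instances assigned to it
        new_centroids = np.array([
            D[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:  # centroids stopped moving
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels

# Example usage: centroids, labels = kmeans(np.random.rand(200, 2), K=3)
```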
10
K-means: example (1)

[Liu, 2006]
11
K-means: example (2)

[Liu, 2006]
12
K-means: convergence
¡ The algorithm converges if:
¨ Very few instances are reassigned to new clusters, or
¨ The centroids do not change significantly, or
¨ The following sum does not change significantly

Error = Σ_{i=1}^{K} Σ_{x∈Ci} d(x, mi)²

¨ Where Ci is the ith cluster; mi is the centroid of cluster Ci.
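For concreteness, a small sketch of this error with Euclidean distance (my own names; it assumes the `centroids` and `labels` arrays that a K-means run would produce):

```python
import numpy as np

def clustering_error(D, centroids, labels):
    """Sum over clusters of squared distances from each instance to its centroid."""
    return float(np.sum(np.linalg.norm(D - centroids[labels], axis=1) ** 2))
```

One can then stop iterating when this value changes by less than a small threshold between two consecutive iterations.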


13
K-means: centroid, distance
¡ Re-computation of the centroids:
mi = (1/|Ci|) Σ_{x∈Ci} x

¨ mi is the centroid of cluster Ci. |Ci| denotes the size of Ci.


¡ Distance measure:
¨ Euclidean

d(x, mi) = ||x − mi|| = √[(x1 − mi1)² + (x2 − mi2)² + … + (xn − min)²]

¨ Other measures are possible.
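A tiny worked example of both formulas (the numbers are made up):

```python
import numpy as np

C_i = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])  # instances of cluster i
m_i = C_i.mean(axis=0)                                # centroid m_i = (3.0, 2.0)

x = np.array([0.0, 0.0])
d = np.linalg.norm(x - m_i)                           # Euclidean distance = sqrt(9 + 4) ≈ 3.61
print(m_i, d)
```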


14
K-means: about distance
¡ Distance measure
¨ Each measure provides a view on data
¨ There are infinitely many distance measures
¨ Which distance is good?

¡ Similarity measures can also be used
¨ Similarity between two objects (a contrast with Euclidean distance is
sketched below)
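For instance (made-up vectors), Euclidean distance and cosine similarity can give very different views of the same pair of objects:

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([10.0, 1.0])

euclidean = np.linalg.norm(u - v)                          # ≈ 9.06: far apart as points
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))   # ≈ 0.995: almost the same direction
print(euclidean, cosine)
```

Which view is the right one depends on what "similar" means in the domain at hand.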
15
K-means: effect of outliers
¡ K-means is sensitive to outliers, i.e., outliers might significantly
affect the clustering results.
¨ Outliers are instances that significantly differ from the normal
instances.
¨ The attribute distributions of outliers are very different from
those of normal points.
¨ Noise or errors in the data can result in outliers.
16
K-means: outlier example

[Liu, 2006]
17
K-means: outlier solutions
¡ Outlier removal: we may remove instances that are significantly
farther from the centroids than the other instances.
¨ Removal can be done a priori or when learning clusters.
¡ Random sampling: instead of clustering all data, we take a
random sample S from the whole training data.
¨ S will be used to learn K clusters. Note that S often contains less
noise and fewer outliers than the original training data.
¨ After learning, the remaining data will be assigned to the learned
clusters (see the sketch below).
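A possible sketch of the random-sampling strategy (my own code; it assumes some `kmeans_fn(S, K)` that returns `(centroids, labels)`, such as the K-means sketch given earlier):

```python
import numpy as np

def cluster_with_sampling(D, K, sample_size, kmeans_fn, seed=0):
    """Learn the clusters on a random sample S, then assign all instances."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(D), size=min(sample_size, len(D)), replace=False)
    S = D[idx]                                    # random sample, hopefully with fewer outliers
    centroids, _ = kmeans_fn(S, K)                # learn K clusters from the sample only
    # Assign every instance (including those outside S) to its nearest learned centroid
    dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)
```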
18
K-means: initialization
¡ The quality of K-means depends heavily on the initial centroids.

1st centroid

2nd centroid

[Liu, 2006]
19
K-means: initialization solution (1)
¡ We repeat K-means many times
¨ Each time we initialize a different set of centroids.
¨ After learning, we combine results from those runs to obtain a
unified clustering, or simply keep the run with the lowest error
(see the sketch below).

[Liu, 2006]
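A simple variant of this idea, sketched below under stated assumptions (a hypothetical `kmeans_fn(D, K, seed=...)` returning `(centroids, labels)`), keeps the run with the smallest sum of squared distances instead of actually combining the runs:

```python
import numpy as np

def best_of_restarts(D, K, kmeans_fn, n_restarts=10):
    """Run K-means several times with different seeds and keep the best run."""
    best = None
    for seed in range(n_restarts):
        centroids, labels = kmeans_fn(D, K, seed=seed)  # a different initialization each run
        sse = float(np.sum(np.linalg.norm(D - centroids[labels], axis=1) ** 2))
        if best is None or sse < best[0]:
            best = (sse, centroids, labels)
    return best[1], best[2]
```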
20
K-means: initialization solution (2)
¡ K-means++: to obtain a good clustering, we can initialize
the centroids from D in sequence as follows
¨ Select the first centroid m1 uniformly at random from D.
¨ Select each subsequent centroid mi randomly from D, with probability
proportional to its squared distance to the nearest of the already
chosen centroids {m1, …, mi-1}.
¨ Stop once K centroids have been chosen (see the sketch below).
¡ With this initialization scheme, K-means converges to a near-optimal
solution in expectation [Arthur & Vassilvitskii, 2007]
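A sketch of this D²-weighted seeding (my own code, following the description in Arthur & Vassilvitskii (2007); all names are illustrative):

```python
import numpy as np

def kmeanspp_init(D, K, seed=0):
    """Pick K initial centroids: the first uniformly at random, each next one with
    probability proportional to its squared distance to the nearest chosen centroid."""
    rng = np.random.default_rng(seed)
    centroids = [D[rng.integers(len(D))]]
    for _ in range(1, K):
        d2 = np.min(
            np.linalg.norm(D[:, None, :] - np.asarray(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = d2 / d2.sum()                       # D^2 weighting
        centroids.append(D[rng.choice(len(D), p=probs)])
    return np.array(centroids)
```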
21
K-means: curved clusters
¡ When using Euclidean distance, K-means cannot detect
non-spherical clusters.
¨ How can we deal with such cases?

[Liu, 2006]
22
K-means: summary
¡ Advantages:
¨ Very simple,
¨ Efficient in practice,
¨ Converges in expected polynomial time [Arthur, Manthey & Röglin,
JACM 2011],
¨ Flexible in the choice of distance measure.
¡ Limitations:
¨ Choosing a good distance/similarity measure for a domain is not easy.
¨ Sensitive to outliers.
23
4. Online K-means
¡ K-means:
¨ We need all training data at each iteration.
¨ Therefore, it scales poorly to big datasets,
¨ And cannot work with streaming data, where instances arrive in
sequence.
¡ Online K-means helps us to cluster big or streaming data.
¨ It is an online version of K-means [Bottou, 1998].
¨ It follows the methodology of online learning and stochastic
gradient descent.
¨ At each iteration, one instance will be exploited to update the
available clusters.
24
Revisiting K-means
¡ Note that K-means finds K clusters from the training
instances {x1, x2, …, xM} by minimizing the following loss
function:
Qk-means(w) = (1/2) Σ_{i=1}^{M} ||xi − w(xi)||₂²
¨ Where w(xi) is the nearest centroid to xi.
¡ Using its gradient, we can minimize Q by repeating the
following update until convergence:
wt+1 = wt + γt Σ_{i=1}^{M} [xi − wt(xi)]

¨ Where γt is a small constant, often called the learning rate.


¡ This update will converge to a local minimum.
25
Online K-means: idea
¡ Note that each iteration of K-means requires the full
gradient:
Q′t = Σ_{i=1}^{M} [xi − wt(xi)]

¨ Which requires all training data.


¡ Online K-means minimizes Q stochastically:
¨ At each iteration, we just use a little information from the whole
gradient Q’.
¨ That information comes from the single training instance at
iteration t:
xt − wt(xt)
26
Online K-means: algorithm
¡ Initialize K centroids randomly.
¡ Update the centroids as each instance arrives
¨ At iteration t, take an instance xt.
¨ Find the nearest centroid wt to xt, and then update wt as
follows:
wt+1 = wt + γ t (xt − wt )

¡ Note: the learning rates {γ1, γ2, …} are positive constants, which
should satisfy

Σ_{t=1}^{∞} γt = ∞;   Σ_{t=1}^{∞} γt² < ∞
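A minimal streaming sketch of this update rule (illustrative code, not the course's implementation; the learning-rate schedule γt = (t + τ)^(−κ) anticipates the next slide, and the initialization from random noise is just a placeholder):

```python
import numpy as np

def online_kmeans(stream, K, n_dims, tau=1.0, kappa=0.6, seed=0):
    rng = np.random.default_rng(seed)
    centroids = rng.normal(size=(K, n_dims))                       # random initial centroids
    for t, x in enumerate(stream, start=1):
        x = np.asarray(x, dtype=float)
        k = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))  # nearest centroid to x_t
        gamma = (t + tau) ** (-kappa)                              # learning rate γ_t
        centroids[k] += gamma * (x - centroids[k])                 # move it a little toward x_t
    return centroids

# Example usage on a synthetic stream of 2-D points:
# centroids = online_kmeans((np.random.rand(2) for _ in range(10_000)), K=3, n_dims=2)
```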
27
Online K-means: learning rate
¡ A popular choice of learning rate:
γt = (t + τ)^(−κ)

¨ τ, κ are positive constants.
¨ κ ∈ (0.5, 1] is called the forgetting rate. A large κ means that the
algorithm remembers the past longer, and that new observations
play a less and less important role as t grows.
28
Convergence of Online K-means
¡ The objective Q decreases as t increases.

(Figure: KM cost and EM cost versus iteration, 1 to 20)

¨ Online K-means (black circles), K-means (black squares)
¨ Partial gradient (empty circles), full gradient (empty squares)
29
References
¡ Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In
Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp.
1027–1035.
¡ Arthur, D., Manthey, B., & Röglin, H. (2011). Smoothed analysis of the k-means
method. Journal of the ACM (JACM), 58(5), 19.
¡ Bottou, L. (1998). Online learning and stochastic approximations. On-line Learning in
Neural Networks, 17.
¡ Liu, B. (2006). Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data.
Springer.
¡ Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information
Theory, 28, 129–137. Originally circulated as an unpublished Bell Laboratories
technical note (1957).
¡ Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition
Letters, 31(8), 651–666.
30
Exercises
¡ What are possible solutions for K-means when the data distributions
are not spherical?
¡ How to decide a suitable cluster for a new instance?
