
▪ k-means clustering performs a “hard” clustering, generating circular clusters centered at the found centroids.
▪ Gaussian mixture model (GMM) clustering considers each cluster as a probabilistic model.
▪ Like the k-means method, GMM also initially assumes the number of clusters for the input dataset.
▪ GMM tries to fit a mixture of Gaussian distributions to the dataset, where each distribution defines one cluster.
▪ Each cluster in GMM clustering follows a Gaussian distribution with two parameters: the cluster’s mean µ and standard deviation σ (see the density below).
▪ The intuition behind Gaussian mixture clustering is that the algorithm tries to fit a
set of Gaussian distributions to the data points. It does this by iteratively updating
the parameters of the distributions until the parameters converge to a stable
solution.
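For reference, this is the standard univariate Gaussian density (it is not written out on the slides); the subscript k indexes the cluster:

\[ \mathcal{N}(x \mid \mu_k, \sigma_k) = \frac{1}{\sigma_k \sqrt{2\pi}} \, \exp\!\left( -\frac{(x - \mu_k)^2}{2\sigma_k^2} \right) \]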
1. Initialization: Choose a number of clusters K and randomly initialize K Gaussian distributions.
▪ Guess initial parameter values for each cluster: µk, σk, and qk, where qk is the probability that any given data point is a member of that specific cluster.
▪ A natural initial assumption is that the qk are equal for all clusters.
2. Expectation step: Assign each data point to the cluster whose Gaussian distribution has the highest probability of generating that point.
3. Maximization step: Update the parameters of each Gaussian distribution based on the data points assigned to it.
▪ For each cluster k, calculate the total likelihood mk (loosely speaking, the fraction of points allocated to cluster k).
▪ Update the initially assumed parameters.
4. Repeat steps 2 and 3 until convergence: Repeat the expectation and maximization
steps until the parameters converge to a stable solution.
Finally, assign each data record to the cluster with which it has the highest membership probability.
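The loop above maps directly onto code. Below is a minimal sketch of the hard-assignment variant described here, for one-dimensional data; the function name gmm_hard_em and all parameter choices are illustrative, not from any particular library.

```python
import numpy as np
from scipy.stats import norm

def gmm_hard_em(x, K, n_iter=100, seed=0):
    """Hard-assignment EM for a 1-D Gaussian mixture (a sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K random points as means, use the global
    #    std for every cluster, and assume equal weights q_k = 1/K.
    mu = rng.choice(x, size=K, replace=False).astype(float)
    sigma = np.full(K, x.std())
    q = np.full(K, 1.0 / K)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # 2. Expectation: assign each point to the cluster whose weighted
        #    Gaussian density is highest at that point.
        dens = q * norm.pdf(x[:, None], mu, sigma)   # shape (n, K)
        labels = dens.argmax(axis=1)
        # 3. Maximization: re-estimate mu_k, sigma_k, q_k from the points
        #    assigned to cluster k (m_k points, a fraction m_k/n of the data).
        new_mu = mu.copy()
        for k in range(K):
            members = x[labels == k]
            m_k = len(members)
            if m_k == 0:
                continue  # keep the old parameters for an empty cluster
            q[k] = m_k / len(x)
            new_mu[k] = members.mean()
            sigma[k] = max(members.std(), 1e-6)  # avoid a degenerate sigma
        # 4. Convergence: stop once the means no longer move.
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu, sigma, q
```

Production libraries implement the soft-assignment version, where every point contributes to every cluster in proportion to its membership probability; scikit-learn’s sklearn.mixture.GaussianMixture is one such implementation.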
HIERARCHICAL CLUSTERING
▪ In hierarchical clustering, the clusters are built in a hierarchy, which is represented as a tree.
▪ The root of this tree is the “universe” cluster that includes all the data records.
▪ The leaves are “single-point” clusters, each containing an individual data record.
▪ From the leaves to the root, the most similar
“single-point” clusters are combined to form
larger clusters, and the process is repeated
until reaching the “universe” cluster.
▪ There are two types of hierarchical clustering:
▪ agglomerative
▪ bottom-up approach that starts at the “single-point” clusters and moves up by merging similar
clusters until it reaches the “universe” cluster.

▪ divisive
▪ a top-down approach that works the other way around, starting from the “universe” cluster and recursively splitting clusters until reaching the “single-point” clusters.
AGGLOMERATIVE CLUSTERING ALGORITHM
1. Consider each data record as a cluster (i.e., a “single-point” cluster).
2. The number of clusters is initially equal to n, the number of data records in the input dataset.
3. Merge the two closest clusters into one bigger cluster. The number of clusters becomes (n − 1).
4. Repeat step 3 until a single cluster is formed: the “universe” cluster.
5. Construct a tree (i.e., dendrogram) to visualize the progression of the formed clusters at each step.
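As a sketch of steps 1–4, here is a naive Python version; the function name and the choice of single linkage (cluster distance = minimum pairwise distance) are assumptions, since the slides do not specify a linkage rule.

```python
import numpy as np

def agglomerative(points):
    """Naive agglomerative clustering (single linkage), returning the
    merge history so a dendrogram can be drawn from it."""
    # Step 1: every data record starts as its own "single-point" cluster.
    clusters = [[i] for i in range(len(points))]
    history = []
    # Steps 3-4: keep merging the two closest clusters until one remains.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between clusters = minimum
                # Euclidean distance over all cross-cluster point pairs.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a].extend(clusters[b])   # merge b into a
        del clusters[b]                   # one fewer cluster (n -> n-1)
    return history
```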
Example:
1. Calculate the Euclidean distance between the students (the data records).
2. Create the proximity matrix of the Euclidean distances between all the data records.
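A proximity matrix like the one used below can be computed in a few lines; the five 2-D records here are hypothetical stand-ins, since the students’ actual values are not reproduced in the text.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical records; the real student data is not shown in the text.
records = np.array([[2.0, 3.0], [4.0, 6.0], [10.0, 12.0],
                    [17.0, 15.0], [14.0, 20.0]])

# All pairwise Euclidean distances, laid out as a symmetric matrix whose
# (i, j) entry is the distance between record i and record j.
proximity = squareform(pdist(records, metric="euclidean"))
print(np.round(proximity, 2))
```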
The next step is to combine the closest two data records into one cluster (Cluster 1). These are data record number 1 and data record number 2, because they have the minimum distance (3.61) in the proximity matrix.
▪ The next minimum distance is 9.22, between data record number 3 and data record number 5; therefore, they are combined into one cluster (Cluster 2).
▪ The process continues with the next minimum distance between the data records (or clusters), 9.43, found between data record number 3 and data record number 4. Since record 3 already belongs to Cluster 2, data record number 4 is merged with Cluster 2 to form Cluster 3.
▪ The next minimum distance is 14.14 between data record number 1 (element of
Cluster 1) and data record number 4 (element of Cluster 3). A larger cluster
(Cluster 4) is then formed by combining clusters 1 and 3. Since all the data records
are included in this new cluster, this cluster is the “universe” cluster.
▪ Finally, the tree (i.e., dendrogram) is constructed as a two-dimensional graph, with the x-axis for the data records of the input dataset and the y-axis for the recorded distance between the combined data records (or clusters) at each step.
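In practice the whole procedure, including the dendrogram, is usually delegated to SciPy; a minimal sketch, again with hypothetical data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

records = np.array([[2.0, 3.0], [4.0, 6.0], [10.0, 12.0],
                    [17.0, 15.0], [14.0, 20.0]])  # hypothetical data

# linkage performs the agglomerative merges; 'single' matches the
# minimum-distance rule used in the walkthrough above.
Z = linkage(records, method="single", metric="euclidean")

# x-axis: data records; y-axis: distance at which clusters were merged.
dendrogram(Z, labels=[f"record {i + 1}" for i in range(len(records))])
plt.ylabel("Merge distance")
plt.show()
```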
