Overview of clustering:
In general, clustering is the use of unsupervised techniques for grouping similar objects.
In machine learning, unsupervised learning refers to the problem of finding hidden structure within unlabeled data.
Clustering is often used for exploratory analysis of the data.
Example:
Customers can be divided into groups by personal income using arbitrarily selected cut-off values.
For example, they could be divided into three groups as follows:
Earn less than $10,000
Earn between $10,000 and $99,999
Earn $100,000 or more
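The grouping above can be sketched in a few lines of Python. This is only an illustration: the function name `income_group` and the sample customers are hypothetical, and the cut-off values are fixed by hand rather than discovered from the data, which is what a clustering method would try to do instead.

```python
def income_group(income: float) -> str:
    """Return the group label for a yearly income in dollars."""
    if income < 10_000:
        return "Earn less than $10,000"
    if income < 100_000:
        return "Earn between $10,000 and $99,999"
    return "Earn $100,000 or more"


# Hypothetical customers, made up for this illustration.
customers = {"A": 8_500, "B": 42_000, "C": 250_000}
groups = {name: income_group(income) for name, income in customers.items()}
```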
BIG DATA ANALYTICS (2017 REGULATION)
Partitioning clustering : Used to classify observations, within a data set, into multiple groups based on
their similarity. The algorithms require the analyst to specify the number of clusters to be generated.
Hierarchical clustering : Works by grouping data objects into a hierarchy or tree of clusters. (Top-Down
Fuzzy clustering : A form of clustering in which each data point can belong to more than one cluster.
Density-based clustering : Which can be used to identify clusters of any shape in a data set containing
Model-based clustering : Considers the data as coming from a distribution that is a mixture of two or
more clusters.
Partitioning clustering :
Used to classify observations, within a data set, into multiple groups based on their similarity.
The algorithms require the analyst to specify the number of clusters to be generated.
Algorithms:
K-means clustering : Each cluster is represented by its center (centroid), which is the mean of the data
points belonging to the cluster.
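As a small illustration of what "center or mean" means here, the sketch below computes the center of one cluster as the per-variable mean of its points. The helper name `centroid` and the three sample points are assumptions made for this example.

```python
from statistics import fmean


def centroid(points):
    """Mean of each coordinate across the points of one cluster."""
    return [fmean(coords) for coords in zip(*points)]


# Hypothetical cluster of three two-dimensional points.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
center = centroid(cluster)  # [3.0, 5.0]
```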
Example:
1. Specify the number of clusters (k) to be generated.
2. Randomly select k objects from the data set as the initial cluster centers or means.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the
observation and the centroid.
4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all
the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the
means of all variables for the observations in the kth cluster; p is the number of variables.
5. Iteratively minimize the total within-cluster sum of squares. That is, iterate steps 3 and 4 until the
cluster assignments stop changing or the maximum number of iterations is reached. By default,
the R software uses 10 as the maximum number of iterations.
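The procedure described above can be sketched as follows. This is a minimal illustration in plain Python, not the R implementation the notes refer to. The function names `kmeans` and `euclidean`, the `seed` parameter, the sample data, and the rule of keeping the old center when a cluster becomes empty are all assumptions made for this example. `max_iter=10` simply mirrors the default mentioned in the text.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=10, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Pick k distinct observations as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign each observation to its closest center.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        # Stop once the cluster assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each center to the mean of the observations assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    # Total within-cluster sum of squares, the quantity being minimized.
    wss = sum(euclidean(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, wss


# Hypothetical data: two visibly separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers, wss = kmeans(data, k=2)
```

Because the initial centers are chosen at random, different seeds can lead to different final clusters on less clearly separated data, so in practice the procedure is often run several times and the result with the smallest within-cluster sum of squares is kept.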