
BIG DATA ANALYTICS (2017 REGULATION)

Overview of clustering:
• In general, clustering is the use of unsupervised techniques for grouping similar objects.
• In machine learning, unsupervised learning refers to the problem of finding hidden structure within unlabeled data.
• Clustering is a method often used for exploratory analysis of the data.
Example:
• Based on customers’ personal income, it is straightforward to divide the customers into three groups depending on arbitrarily selected values.
The customers could be divided into three groups as follows:
• Earn less than $10,000
• Earn between $10,000 and $99,999
• Earn $100,000 or more
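
The three income groups above can be sketched as a simple rule in code. This is only an illustration of grouping by hand-picked thresholds: the function name `income_group` and the sample customers are made up for the example and do not come from the slides.

```python
def income_group(income):
    """Return the group label for one customer's yearly income in dollars."""
    if income < 10_000:
        return "less than $10,000"
    if income < 100_000:
        return "$10,000 to $99,999"
    return "$100,000 or more"


# Hypothetical customers, used only to show the rule in action.
customers = {"A": 8_500, "B": 42_000, "C": 99_999, "D": 100_000}
groups = {name: income_group(income) for name, income in customers.items()}

for name, label in groups.items():
    print(name, "->", label)
```

Here the thresholds are fixed in advance by the analyst. A clustering algorithm would instead derive the groups from the data itself.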

There are different types of clustering methods, including:

• Partitioning clustering: Classifies the observations in a data set into multiple groups based on their similarity. The algorithms require the analyst to specify the number of clusters to be generated.

• Hierarchical clustering: Groups data objects into a hierarchy, or tree, of clusters, built either top-down (divisive) or bottom-up (agglomerative).

• Fuzzy clustering: Each data point can belong to more than one cluster.

• Density-based clustering: Identifies clusters of any shape in a data set containing noise and outliers.

• Model-based clustering: Treats the data as coming from a distribution that is a mixture of two or more clusters.

Partitioning clustering:
• Classifies the observations in a data set into multiple groups based on their similarity. The algorithms require the analyst to specify the number of clusters to be generated.

Algorithms:

K-means clustering: Each cluster is represented by the center (the mean) of the data points belonging to the cluster.
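
The idea of a cluster center can be written as a short formula. The notation is an assumption introduced here, not taken from the slides: $C_k$ denotes the set of data points assigned to cluster $k$, $x_i$ is one data point (a vector), and $\mu_k$ is the center of cluster $k$.

```latex
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

In words, the center is the average of the points in the cluster, computed separately for each variable.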

The k-means algorithm can be summarized as follows:

1. Specify the number of clusters (k) to be created; this is chosen by the analyst.

2. Randomly select k objects from the data set as the initial cluster centers, or means.

3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.

4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in the kth cluster; p is the number of variables.

5. Iteratively minimize the total within-cluster sum of squares. That is, repeat steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. By default, the R software uses 10 as the maximum number of iterations.
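
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the R implementation the slides refer to: the names `kmeans`, `mean_point`, `total_within_ss`, the `seed` parameter, and the six sample points are all hypothetical, and an empty cluster simply keeps its previous centroid.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Centroid of a non-empty list of points: the mean of each variable."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def kmeans(data, k, max_iter=10, seed=0):
    """Return (centroids, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # step 2: k random objects as initial centers
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each observation to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(point, centroids[j]))
            for point in data
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its current members.
        for j in range(k):
            members = [p for p, label in zip(data, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, labels


def total_within_ss(data, centroids, labels):
    """Total within-cluster sum of squared distances to the assigned centroid."""
    return sum(euclidean(p, centroids[label]) ** 2 for p, label in zip(data, labels))


# Hypothetical data: two visibly separate groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(labels)
print(total_within_ss(data, centroids, labels))
```

The default `max_iter=10` mirrors the default mentioned in step 5. Because the initial centers are random, different seeds can number the clusters differently even when the grouping itself is the same.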

K-means Algorithm (Overview of the method):

[Figure: four panels labeled Initialization, Cluster Assignment, Moving Centroid, and Convergence]


Use Cases of K-means:

Here are some interesting use cases for k-means.
Document Classification
• Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard classification problem, and k-means is a highly suitable algorithm for this purpose.
Delivery Store Optimization
• Optimize the delivery of goods by truck drones, using k-means to find the optimal number of launch locations.
Identifying Crime Localities
• Given data on crimes in specific localities of a city, such as the category of crime (drug trade, kidnapping, robbery, murder, etc.) and the area of the crime, clustering can help identify crime-prone localities.
Customer Segmentation
• Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring.
Fantasy League Stat Analysis
• Analyzing player stats has always been a critical element of the sporting world, and with increasing competition, machine learning has a critical role to play here.
