
Unsupervised Machine Learning

Clustering: Introduction
Unsupervised learning:-
As the name suggests, unsupervised learning is a machine learning technique in which
models are not supervised using a labeled training dataset. Instead, the model itself finds the hidden
patterns and insights in the given data.
It can be compared to the learning which takes place in the human brain while learning new
things.
It can be defined as: “Unsupervised learning is a type of machine learning in which
models are trained using an unlabeled dataset and are allowed to act on that data without
any supervision.”
Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of the dataset, group that
data according to similarities, and represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained on the given
dataset, which means it has no prior idea about the features of the dataset. The task of
the unsupervised learning algorithm is to identify the image features on its own.
The unsupervised learning algorithm will perform this task by clustering the image dataset into
groups according to the similarities between images.

Why use Unsupervised Learning?


 Unsupervised learning is helpful for finding useful insights from the data.
 Unsupervised learning is much closer to how a human learns to think from their own
experiences, which makes it closer to real AI.
 Unsupervised learning works on unlabeled and uncategorized data, which makes
unsupervised learning more important.
 In the real world, we do not always have input data with corresponding output, so to
solve such cases we need unsupervised learning.

Working of Unsupervised Learning
The working of unsupervised learning can be understood by the below diagram:
Here, we have taken unlabeled input data, which means it is not categorized and the
corresponding outputs are also not given. Now, this unlabeled input data is fed to the
machine learning model in order to train it.
Firstly, the model interprets the raw data to find the hidden patterns in the data, and then
a suitable algorithm such as k-means clustering, hierarchical clustering, etc. is applied.
Once the suitable algorithm is applied, it divides the data objects into
groups according to the similarities and differences between the objects.
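As a small illustration of this workflow, here is a minimal Python sketch (assuming scikit-learn and NumPy are installed; the data points and parameter values are made up for illustration): a few unlabeled 2-D points are fed to k-means, which assigns group labels on its own.

# Minimal sketch: unlabeled points are fed to a clustering model,
# which groups them without any target labels.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled input data: no output/target column is provided.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.2], [4.8, 5.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)   # group index assigned to each point, e.g. [0 0 1 1]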
Types of Unsupervised Learning Algorithm: The unsupervised learning algorithm
can be further categorized into two types of problems:

Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities with
the objects of another group.
Cluster analysis finds the commonalities between the data objects and categorizes
them as per the presence and absence of those commonalities.
Association: An association rule is an unsupervised learning method which is used
for finding relationships between variables in a large database. It determines the
set of items that occur together in the dataset.
Association rules make marketing strategies more effective; for example, people who buy
item X (say, bread) also tend to purchase item Y (butter/jam). A typical
example of an association rule is Market Basket Analysis.
Advantages of Unsupervised Learning
Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
Unsupervised learning is preferable because it is easier to obtain unlabeled data than
labeled data.
Disadvantages of Unsupervised Learning
Unsupervised learning is intrinsically more difficult than supervised learning as it
does not have corresponding output.
The result of the unsupervised learning algorithm might be less accurate, as the input
data is not labeled and the algorithm does not know the exact output in advance.

Clustering is a process of grouping examples based on their similarity.


Clustering is the task of dividing the data into groups such that data points with similar features lie
in one group.
A cluster is a collection of data points that are similar to one another within the cluster and
dissimilar to the points of other clusters.
A good clustering method has high intra-cluster similarity and low inter-cluster similarity.

Types of clustering:-
We have the following types of clustering
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to
group the unlabeled datasets into clusters and is also known as hierarchical cluster
analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but
they differ in how they work,
as there is no requirement to predetermine the number of clusters as we do in the K-means
algorithm.
We construct nested partitions layer by layer by grouping the objects into a tree of clusters.
It uses generalized distance metrics for clustering

The hierarchical clustering technique has two approaches:


1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm
starts by treating every data point as a single cluster and merges the closest pairs until one cluster
is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is
a top-down approach.

Need of hierarchical clustering:-


We already have other clustering algorithms such as K-Means clustering, but there are some
challenges with that algorithm: it requires a
predetermined number of clusters, and it always tries to create clusters of roughly the same size.
To solve these two challenges, we can opt for the hierarchical clustering algorithm because,
in this algorithm, we don't need to know the number of clusters in advance.
Agglomerative Hierarchical clustering
 The agglomerative hierarchical clustering algorithm is a popular example of HCA. To
group the datasets into clusters,
 it follows the bottom-up approach.
 It means this algorithm considers each single instance as an atomic cluster at the beginning,
and then starts combining the closest pair of clusters; this continues until all the clusters
are merged into a single cluster that contains all the points in the dataset.
 This hierarchy of clusters is represented in the form of a dendrogram.
 The distance measure between clusters is considered a design decision.
Working of Agglomerative Hierarchical clustering:-
The working of the AHC algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let's say there are N data points, so
the number of clusters will also be N.

o Step-2: Take two closest data points or clusters and merge them to form one cluster.
So, there will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to form one
cluster. There will be N-2 clusters.

o Step-4: Repeat Step 3 until only one cluster is left.

o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
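These steps are what off-the-shelf implementations carry out internally. Below is a minimal, non-authoritative Python sketch using scikit-learn's AgglomerativeClustering (the data points and parameter values are hypothetical); its linkage parameter corresponds to the distance measures discussed in the next section.

# Hedged sketch of agglomerative clustering with scikit-learn (hypothetical data).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# linkage can be 'single', 'complete', 'average' or 'ward'
# (see the linkage methods described in the next section).
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg.fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1] (cluster numbering may differ)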

Measure for the distance between two clusters:-

As we have seen, the closest distance between the two clusters is crucial for the hierarchical
clustering.
There are various ways to calculate the distance between two clusters, and these ways decide
the rule for clustering.
These measures are called Linkage methods. Some of the popular linkage methods are given
below:
Single Linkage: It is the Shortest Distance between the closest points of the clusters.

Complete Linkage: It is the farthest distance between the two points of two different
clusters.
It is one of the popular linkage methods as it forms tighter clusters than single-
linkage.

Average Linkage: It is the linkage method in which the distance between every pair of
points (one from each cluster) is added up and then divided by the total number of pairs to
calculate the average distance between two clusters. It is also one of the most popular linkage
methods.

Centroid Linkage: It is the linkage method in which the distance between the
centroids of the clusters is calculated.

From the above-given approaches, we can apply any of them according to the type of
problem

Flowchart of Agglomerative algorithm


Example 1:- Given a dataset of 5 objects characterized by a single feature, assume that there are two
clusters s1={a,b}, s2={c,d,e}

Data point   Feature
a            1
b            2
c            4
d            5
e            6
Calculate the distance matrix and also find the single linkage, complete linkage and average
linkage distances.
Solution:- Distance matrix: to compute the distance between points we consider the Manhattan
distance.

     a   b   c   d   e
a    0   1   3   4   5
b    1   0   2   3   4
c    3   2   0   1   2
d    4   3   1   0   1
e    5   4   2   1   0
s1={a,b} s2={c,d,e}
Single linkage distance between clusters s1,s2=
d(s1,s2)=min{d(a,c),d(a,d),d(a,e),d(b,c),d(b,d),d(b,e)}=min{3,4,5,2,3,4}=2
complete linkage distance between clusters s1,s2=
d(s1,s2)=max{d(a,c),d(a,d),d(a,e),d(b,c),d(b,d),d(b,e)}=max{3,4,5,2,3,4}=5
Average linkage distance between cluster s1,s2=
d(s1,s2)=average{d(a,c),d(a,d),d(a,e),d(b,c),d(b,d),d(b,e)}=average{3,4,5,2,3,4}
=(3+4+5+2+3+4)/6=21/6=7/2=3.5
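The same three linkage distances can be checked quickly in code. The following minimal Python sketch (assuming NumPy; the variable names are illustrative) recomputes them for s1 and s2 using the Manhattan distance:

# Reproduce the single, complete and average linkage distances
# between s1={a,b} and s2={c,d,e} on the single feature above.
import numpy as np

features = {'a': 1, 'b': 2, 'c': 4, 'd': 5, 'e': 6}
s1, s2 = ['a', 'b'], ['c', 'd', 'e']

# All pairwise Manhattan distances between the two clusters.
pair_dists = np.array([abs(features[x] - features[y]) for x in s1 for y in s2])

print("single  :", pair_dists.min())    # 2
print("complete:", pair_dists.max())    # 5
print("average :", pair_dists.mean())   # 3.5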

Problem 2:-
Find the clusters using hierarchical clustering with the agglomerative algorithm for the following
data matrix
     X1    X2
A    1     1
B    1.5   1.5
C    5     5
D    3     4
E    4     4
F    3     3.5

Step 1: construct the distance matrix using Euclidean distance

  A B C D E F
A 0.00 0.71 5.66 3.61 4.24 3.20
B 0.71 0.00 4.95 2.92 3.54 2.50
C 5.66 4.95 0.00 2.24 1.41 2.50
D 3.61 2.92 2.24 0.00 1.00 0.50
E 4.24 3.54 1.41 1.00 0.00 1.12
F 3.20 2.50 2.50 0.50 1.12 0.00
Iteration 1:- we create 6 clusters with atomic data points

Step 2: merge the two closest clusters based on the minimum distance between the points
       A      B      C      D,F    E
A      0.00   0.71   5.66   3.20   4.24
B      0.71   0.00   4.95   2.50   3.54
C      5.66   4.95   0.00   2.24   1.41
D,F    3.20   2.50   2.24   0.00   1.00
E      4.24   3.54   1.41   1.00   0.00

Consider single linkage between the clusters


d(A,{D,F})=d({D,F},A)=min{d(A,D),d(A,F)}=min{3.61,3.20}=3.20
d(B,{D,F})=d({D,F},B)=min{d(B,D),d(B,F)}=min{2.92,2.50}=2.50
d(C,{D,F})=d({D,F},C)=min{d(C,D),d(C,F)}=min{2.24,2.50}=2.24
d(E,{D,F})=d({D,F},E)=min{d(E,D),d(E,F)}=min{1.00,1.12}=1.00

Iteration 2:- now merge the 2 closest clusters (A and B)

       A,B    C      D,F    E
A,B    0.00   4.95   2.50   3.54
C      4.95   0.00   2.24   1.41
D,F    2.50   2.24   0.00   1.00
E      3.54   1.41   1.00   0.00

d(C,{A,B})=d({A,B},C)=min{d(A,C),d(B,C)}=min{5.66,4.95}=4.95
d({D,F},{A,B})=d({A,B},{D,F})=min{d(A,D),d(A,F),d(B,D),d(B,F)}=min{3.61,3.20,2.92,2.50}=2.50
d(E,{A,B})=d({A,B},E)=min{d(A,E),d(B,E)}=min{4.24,3.54}=3.54
Iteration 3:- {D,F} and E are the closest clusters, so we merge them
         A,B    C      D,F,E
A,B      0.00   4.95   2.50
C        4.95   0.00   1.41
D,F,E    2.50   1.41   0.00


d({D,F,E},{A,B})=min{d(D,A),d(D,B),d(F,A),d(F,B),d(E,A),d(E,B)}=min{3.61,2.92,3.20,2.50,4.24,3.54}=2.50
d(C,{D,E,F})=d({D,E,F},C)=min{d(D,C),d(E,C),d(F,C)}=min{2.24,1.41,2.50}=1.41
Iteration 4:- {D,F,E} is closest to C => new cluster {D,F,E,C}

           A,B    D,F,E,C
A,B        0.00   2.5
D,F,E,C    2.5    0.00

d({D,F,E,C},{A,B})=d({A,B},{C,D,F,E})=min{d(A,C),d(A,D),d(A,F),d(A,E),d(B,C),d(B,D),d(B,F),d(B,E)}
=min{5.66,3.61,3.20,4.24,4.95,2.92,3.54,2.5}=2.5

Iteration 5: now we merge {A,B} and {C,D,F,E} => {A,B,C,D,E,F} form a single cluster,
so we stop the process

Then the dendrogram (tree diagram) is built from the following merge distances:

d(D,F)=0.5

d(A,B)=0.71

d((D,F),E)=1.00

d(((D,F),E),C)=1.41

d((((D,F),E),C),(A,B))=2.5
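These merge heights can be verified programmatically. The sketch below (Python, assuming SciPy and NumPy) runs single-linkage clustering on the six points of Problem 2 and prints the same sequence of merge distances.

# The third column of Z should match the dendrogram heights above
# (0.50, 0.71, 1.00, 1.41, 2.50, up to rounding).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1, 1],      # A
                   [1.5, 1.5],  # B
                   [5, 5],      # C
                   [3, 4],      # D
                   [4, 4],      # E
                   [3, 3.5]])   # F

Z = linkage(points, method='single', metric='euclidean')
print(np.round(Z[:, 2], 2))     # merge distances

# dendrogram(Z) can be used with matplotlib to draw the tree diagram.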

Advantages:-
Embedded flexibility regarding the level of granularity
Ease of handling any form of similarity or distance
Applicable to any attribute type

Disadvantages:- vagueness of the termination criteria


Most hierarchical clustering algorithms do not revisit clusters once they are constructed, even for
the purpose of performance improvement

Divisive clustering algorithm:-


It is a top-down approach
Initially, all objects are considered to be in a single cluster
Then the cluster is recursively split into smaller ones until a termination condition occurs

Divisive algorithm: a simple approach based on MST (minimum spanning tree)


1. Compute the minimum spanning tree for the given adjacency matrix
2. Repeat:
3. Create a new cluster by breaking the link corresponding to the largest distance
4. Continue this process until only singleton clusters remain
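A hedged Python sketch of this procedure is shown below (assuming SciPy and NumPy; the weighted adjacency matrix is hypothetical, not the one used in the problem that follows). It builds the MST and then repeatedly breaks the heaviest remaining link, printing the clusters after each split.

# MST-based divisive clustering sketch (hypothetical adjacency matrix).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# Symmetric weighted adjacency matrix of a small graph
# (zeros appear only on the diagonal; every other pair is connected).
adj = np.array([[0, 1, 2, 2, 3],
                [1, 0, 2, 3, 4],
                [2, 2, 0, 1, 4],
                [2, 3, 1, 0, 5],
                [3, 4, 4, 5, 0]], dtype=float)

mst = minimum_spanning_tree(adj).toarray()   # keeps only the MST edges

# Repeatedly break the largest remaining MST edge; each break adds one cluster.
for step in range(1, mst[mst > 0].size + 1):
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0                            # remove the heaviest link
    n_clusters, labels = connected_components(mst, directed=False)
    print(f"after break {step}: {n_clusters} clusters, labels = {labels}")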

Minimum spanning Tree:

Graphs are mathematical structures that represent pairwise relationships between objects.
A graph is a structure that represents the relationships between various objects.
Let the graph be represented as G=(V,E), where
graph G comprises a set V of vertices, and a collection of pairs of vertices from V
forms the set E of edges of the graph.

Eg:-

Vertices are V={1,2,3,4}


Edges are E={(1, 2),(1,4) ,(2,1),(2,3),(3,2),(3,4),(4,1),(4, 3)}
 Examples:
o Cities with distances between
o Roads with distances between intersection points

Adjacency matrix representation


Minimum spanning Tree
Given a connected and undirected graph, a spanning tree of that graph is a subgraph that is a
tree and connects all the vertices together. A single graph can have many different spanning
trees.

A minimum spanning tree (MST) or minimum weight spanning tree for a weighted,
connected, undirected graph is a spanning tree with a weight less than or equal to the weight
of every other spanning tree.
The weight of a spanning tree is the sum of weights given to each edge of the spanning tree.
A minimum spanning tree has (V – 1) edges where V is the number of vertices in the given
graph

Kruskal’s algorithm
1. Sort all the edges in non-decreasing order of their weight.
2. Pick the smallest edge. Check if it forms a cycle with the spanning tree formed so
far. If a cycle is not formed, include this edge; else, discard it.
3. Repeat step 2 until there are (V-1) edges in the spanning tree.
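A minimal, non-authoritative Python sketch of Kruskal’s algorithm is given below (the edge list in the usage line is hypothetical); it follows the three steps above and uses a simple union-find structure to detect cycles.

# Kruskal's algorithm with union-find (illustrative edge list; weights are hypothetical).
def kruskal(n_vertices, edges):
    """edges: list of (weight, u, v); returns the list of MST edges."""
    parent = list(range(n_vertices))

    def find(x):                          # find the root of x's component
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):         # step 1: sort edges by weight
        ru, rv = find(u), find(v)
        if ru != rv:                      # step 2: keep the edge only if no cycle
            parent[ru] = rv
            mst.append((u, v, w))
        if len(mst) == n_vertices - 1:    # step 3: stop at V-1 edges
            break
    return mst

# Example usage with 4 vertices labelled 0..3:
print(kruskal(4, [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]))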

Divisive Problem 1
Given the adjacency matrix of a graph as follows.
Now arrange the edges in increasing order of weight.

Now apply Kruskal’s algorithm and generate the minimum spanning tree
1. Join the edge A-B because there is no cycle
2. Join the edge C-D because there is no cycle
3. Join the edge A-C because there is no cycle
4. Don’t join the edge A-D because of a cycle
5. Don’t join the edge B-C because of a cycle
6. Join the edge A-E because there is no cycle
7. Don’t join the edge B-E because of a cycle
8. Don’t join the edge D-E because of a cycle
9. Don’t join the edge B-D because of a cycle
10. Don’t join the edge C-E because of a cycle
Then the MST will be as follows

Consider the largest edge, i.e., the edge of weight 3 between A and E


We have to remove this largest edge; as a result we get two clusters {E},{A,B,C,D}

Now again consider the largest edge, i.e., the edge of weight 2 between A and C
We have to remove this largest edge; as a result we get three clusters
{A,B},{C,D},{E}

Now we have 2 edges of equal weight and we can remove either of them. Let us suppose we
remove the edge between A and B; then we have the clusters {A},{B},{C,D},{E}

Now we remove the edge between vertices C and D


Then the clusters are {A},{B},{C},{D},{E}

The process is complete, so we stop


Problem2
Association Rules
Introduction:-
Association rules are used to show the relationships between data items.
It is an unsupervised learning technique.
Association rule learning checks for
the dependency of one data item on another data item and maps them accordingly so that it
can be more profitable. It is similar to an if/then rule in logic.

The purchasing of one product when another product is purchased represents an association
rule. Association rules are frequently used by retail stores to assist in marketing, advertising,
floor placement, and inventory control.
Although they have direct applicability to retail businesses, they have been used for other
purposes as well, including predicting faults in telecommunication networks
Consider the following example

 
Tid   Beer   Bread   Jelly   Milk   Peanut Butter
T1    0      1       1       0      1
T2    0      1       0       0      1
T3    0      1       0       1      1
T4    1      1       0       0      0
T5    1      0       0       1      0

Association Rule:- Given a set of items I = {I1, I2, ..., Im} and a database of


transactions
D = {t1, t2, ..., tn}, where ti = {ti1, ti2, ..., tik} and each tij belongs to I, an association
rule is an implication of the form X => Y, where X and Y are subsets of I called
itemsets and X ∩ Y = ∅.

Association rule learning can be divided into three types of algorithms:


1. Apriori
2. Eclat
3. F-P Growth Algorithm

Working of Association Rule Learning

Association rule learning works on the concept of If and Else Statement, such as if A then B.

Here the “if” element is called the antecedent, and the “then” element is called the consequent.

These types of relationships, where we can find some association or relation between two
items, are known as single cardinality.

It is all about creating rules, and if the number of items increases, then the cardinality also
increases accordingly. So, to measure the associations between thousands of data items, there
are several metrics. These metrics are given below:

o Support
o Confidence
o Lift

Support

Support is the frequency of A, or how frequently an item appears in the dataset.

It is defined as the fraction of the transactions in T that contain the itemset
X.

For an itemset X and a set of transactions T, it can be written as: Support(X) = frequency(X) / |T|

Eg:- support(Beer)=frequency(Beer)/total no of Transactions=2/5=0.4


Support(Bread)=frequency(Bread)/total no of Transactions=4/5=0.8

support(Beer,Bread)=frequency(Beer,Bread)/total no of transactions=1/5=0.2
Confidence:-

Confidence indicates how often the rule has been found to be true. Or

how often the items X and Y occur together in the dataset when the occurrence of X is
already given.

It is the ratio of the transaction that contains X and Y to the number of records that contain X.

Eg:- confidence(Beer => Bread) = frequency(Beer, Bread)/frequency(Beer) = 1/2 = 0.5

Lift:-

It is the strength of any rule, which can be defined by the formula
Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y)).

Lift(Beer, Bread) = support(Beer, Bread)/(support(Beer) × support(Bread)) = 0.2/(0.4 × 0.8) = 0.625

It is the ratio of the observed support measure and expected support if X and Y are
independent of each other. It has three possible values:

o If Lift= 1: The probability of occurrence of antecedent and consequent is independent


of each other.
o Lift>1: It determines the degree to which the two itemsets are dependent on each
other.
o Lift<1: It tells us that one item is a substitute for the other, which means one item
has a negative effect on the other.
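The three metrics can be recomputed directly from the transaction table above. The short Python sketch below (plain Python; item and variable names are written without spaces, purely for illustration) reproduces the Beer/Bread numbers used in the examples.

# Recompute support, confidence and lift for the Beer/Bread example.
transactions = {
    'T1': {'Bread', 'Jelly', 'PeanutButter'},
    'T2': {'Bread', 'PeanutButter'},
    'T3': {'Bread', 'Milk', 'PeanutButter'},
    'T4': {'Beer', 'Bread'},
    'T5': {'Beer', 'Milk'},
}

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

sup_beer = support({'Beer'})             # 2/5 = 0.4
sup_bread = support({'Bread'})           # 4/5 = 0.8
sup_both = support({'Beer', 'Bread'})    # 1/5 = 0.2

confidence = sup_both / sup_beer                  # 0.5
lift = sup_both / (sup_beer * sup_bread)          # 0.625
print(sup_beer, sup_bread, sup_both, confidence, lift)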
Apriori Algorithm:-
The Apriori algorithm refers to the algorithm which is used to calculate the association
rules between objects.
It means how two or more objects are related to one another.
In other words, we can say that the Apriori algorithm is an association rule learning method that
analyzes the fact that people who bought product A also tend to buy product B.
The Apriori algorithm is also used for frequent pattern mining. Generally, you
operate the Apriori algorithm on a database that consists of a huge number of
transactions.
Itemset:- a collection of one or more items
Frequent or Large itemset:-
A set of items that occurs frequently in the transactions

A large (frequent) itemset is an itemset whose number of occurrences is above a threshold s.
We use the notation L to indicate the complete set of large itemsets and l to indicate a
specific large itemset.

Finding large itemsets generally is conceptually easy but computationally very costly.


The naive approach would be to count all itemsets that appear in any transaction. Given a
set of items of size m, there are 2^m subsets. Since we are not interested in the empty set, the
potential number of large itemsets is then 2^m - 1.
To solve the above problem naively we would need to examine all 2^m - 1 candidate itemsets.
But we can complete this problem with k scans of the database, where k is the length of the
largest frequent itemset.

Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

 Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994


for mining frequent itemsets for Boolean association rules.
 The name of the algorithm is based on the fact that the algorithm uses prior
knowledge of frequent itemset properties.
 Apriori employs an iterative approach known as a level-wise search, where k-
itemsets are used to explore (k+1)-itemsets.
 First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item, and collecting those items that satisfy
minimum support.
The resulting set is denoted L1.Next, L1 is used to find L2, the set of frequent
2-itemsets, which is used to find L3, and so on, until no more frequent k-
itemsets can be found.
 The finding of each Lk requires one full scan of the database.
 A two-step process is followed in Apriori consisting of join and prune action.
We have two basic assumptions in this (the Apriori property):
1. If an itemset x is frequent, then all subsets of the itemset are also frequent.
2. If an itemset x is not frequent, then no superset of the itemset can be frequent.
3. If we have n items in the database, then there are 2^n possible itemsets, so a naive approach
would have to check all 2^n combinations to find which ones occur most frequently.
4. But by using the Apriori algorithm we can accomplish this task by computing candidate itemsets
based on the 2 basic assumptions: we remove an itemset from the candidate set if
it is not frequent, and then we do not consider any superset of it.
5. Based on the candidate set and the threshold value, we find the frequent 1-itemsets; using the
frequent 1-itemsets we find the frequent 2-itemsets, then the frequent 3-itemsets, and so
on.
6. Finally, we form association rules and decide whether to select or reject each rule.
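As a rough illustration of this level-wise process, here is a compact Apriori sketch in plain Python. It is a non-authoritative example: the five transactions in the usage line are an assumption, chosen only to be consistent with the support counts used in the worked example below (the original transaction table is not reproduced here).

# Compact Apriori sketch; min_sup is the minimum support count.
from itertools import combinations

def apriori(transactions, min_sup):
    """Return a dict mapping each frequent itemset (frozenset) to its count."""
    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_sup}   # L1
    all_frequent, k = dict(frequent), 2
    while frequent:
        # join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # prune step: drop candidates having an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(candidates).items() if n >= min_sup}  # Lk
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Hypothetical usage (items are labelled 1..5, min_sup = 2):
db = [frozenset(t) for t in ([1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5], [1, 3, 5])]
for itemset, cnt in sorted(apriori(db, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), cnt)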

Example:

There are five transactions in this database, that is, |D| = 5


Steps
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets,
C1. The algorithm simply scans all of the transactions in order to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent
1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying
minimum support. In our example, all of the candidates in C1 except {4} satisfy minimum support.

Now remove the itemset {4}

3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 on L1 to generate a
candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step
because each subset of the candidates is also frequent.

4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is
accumulated.
The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets
in C2 having minimum support. {1,2} is removed since its support count is less than the minimum support of 2.

5. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get
C3 = L2 x L2 = {{1, 2, 3}, {1, 2, 5}, {1, 3, 5}, {2, 3, 5}}. Based on the Apriori property that all subsets
of a frequent itemset must also be frequent, we can determine that the first two candidates
cannot possibly be frequent
{1,2,3}=>{1,2},{1,3},{2,3}=> {1,2} is not frequent, so {1,2,3} is not a frequent itemset and is pruned
{1,2,5}=>{1,2},{1,5},{2,5}=> {1,2} is not frequent, so {1,2,5} is not a frequent itemset and is pruned
{1,3,5}=>{1,3},{1,5},{3,5}=> all subsets are frequent, so {1,3,5} may be a frequent itemset
{2,3,5}=>{2,3},{2,5},{3,5}=> all subsets are frequent, so {2,3,5} may be a frequent itemset

6. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-
itemsets in C3 having minimum support.
7. The algorithm uses L3 x L3 to generate a candidate set of 4-itemsets, C4 = {{1,2,3,5}}.
{1,2,3,5}=>{1,2},{1,3},{1,5},{2,3},{2,5},{3,5},{1,2,3},{1,2,5},{2,3,5},{1,3,5}=> since {1,2} (and hence
{1,2,3} and {1,2,5}) is not frequent, {1,2,3,5} is not a frequent itemset
L4 = null set

Now we have to frame the association rules


Suppose the (minimum confidence) threshold value is 60%

Rule1:-

{1,3}=>({1,3,5}-{1,3})=>5
{1,3}=>5
Confidence=support(1,3,5)/support(1,3)=2/3=66.66>60
Rule selected

Rule2:-
{1,5}=>({1,3,5}-{1,5})=>3
{1,5}=>3
Confidence=support(1,3,5)/support(1,5)=2/2=100>60
Rule selected

Rule3:-
{3,5}=>({1,3,5}-{3,5})=>1
{3,5}=>1
Confidence=support(1,3,5)/support(3,5)=2/3=66.66>60
Rule selected

Rule4:-
{1}=>({1,3,5}-{1})=>{3,5}
1=>{3,5}
Confidence=support(1,3,5)/support(1)=2/3=66.66>60
Rule selected

Rule5:-
{3}=>({1,3,5}-{3})=>{1,5}
3=>{1,5}
Confidence=support(1,3,5)/support(3)=2/4=50<60
Rule Rejected
Rule6:-
{5}=>({1,3,5}-{5})=>{1,3}
5=>{1,3}
Confidence=support(1,3,5)/support(5)=2/4=50<60
Rule Rejected

Rule7:-
{2,3}=>({2,3,5}-{2,3})=>{5}
{2,3}=>5
Confidence=support(2,3,5)/support({2,3})=2/2=100>60
Rule selected
Rule8:-
{2,5}=>({2,3,5}-{2,5})=>{3}
{2,5}=>3
Confidence=support(2,3,5)/support({2,5})=2/3=66.66>60
Rule selected

Rule9:-
{2}=>({2,3,5}-{2})=>{3,5}
2=>{3,5}
Confidence=support(2,3,5)/support({2})=2/3=66.66>60
Rule selected
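The nine rules above can be verified in a few lines. The sketch below reuses the same assumed transaction database as the earlier Apriori sketch (again, hypothetical but chosen to be consistent with the support counts shown) and prints each rule's confidence and verdict against the 60% threshold.

# Check the confidence of each rule derived from {1,3,5} and {2,3,5}.
db = [frozenset(t) for t in ([1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5], [1, 3, 5])]

def count(items):
    # number of transactions containing every item in 'items'
    return sum(frozenset(items) <= t for t in db)

rules = [({1, 3}, {5}), ({1, 5}, {3}), ({3, 5}, {1}),
         ({1}, {3, 5}), ({3}, {1, 5}), ({5}, {1, 3}),
         ({2, 3}, {5}), ({2, 5}, {3}), ({2}, {3, 5})]

for X, Y in rules:
    conf = 100 * count(X | Y) / count(X)
    verdict = "selected" if conf > 60 else "rejected"
    print(f"{sorted(X)} => {sorted(Y)}: confidence = {conf:.2f}% ({verdict})")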
