Let I = {i1, ..., ik} be a set of items. Let D be a set of transactions, where each transaction T
is a set of items such that T ⊆ I. Each transaction is associated with an identifier called
TID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T.
Transaction   TID   Items bought
T1            10    Beer, Nuts, Diaper
T2            20    Beer, Coffee, Diaper
T3            30    Beer, Diaper, Eggs
T4            40    Nuts, Eggs, Milk
T5            50    Nuts, Coffee, Diaper, Eggs, Milk

What is a k-itemset? A k-itemset is an itemset that contains k items: when k = 1 it is a 1-itemset, when k = 2 a 2-itemset, when k = 3 a 3-itemset, and so on.
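A minimal sketch of these definitions in Python, using sets for itemsets and the transaction table above; the variable names are illustrative only.

```python
from itertools import combinations

# The transaction database D from the table above, keyed by TID.
D = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

# A transaction T contains an itemset A iff A is a subset of T.
A = {"Beer", "Diaper"}                      # an example 2-itemset
contains = {tid: A.issubset(T) for tid, T in D.items()}
print(contains)   # {10: True, 20: True, 30: True, 40: False, 50: False}

# All 2-itemsets that can be formed from the items of transaction 10.
print([set(c) for c in combinations(sorted(D[10]), 2)])
```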
For example, suppose you are in a supermarket to buy milk. Referring to the example below,
there are nine baskets containing varying combinations of milk, cheese, apples, and
bananas.
Question: given that you buy milk, are you more likely to buy apples or cheese in the same
transaction than somebody who did not buy milk?
The next step is to determine the relationships and the rules, so association rule
mining is applied in this context. It is a procedure which aims to discover frequently
occurring patterns, correlations, or associations in datasets found in various kinds
of databases such as relational databases, transactional databases, and other forms of
repositories.
Market-Basket Model cont…
An association rule has three measures that express the degree of confidence in the
rule: Support, Confidence, and Lift. Since the market-basket model has its origin in retail
applications, a basket is sometimes called a transaction.
Support: the number of transactions that include the items in both the {A} and {B} parts of
the rule, as a percentage of the total number of transactions. It is a measure of how
frequently the collection of items occurs together, as a percentage of all transactions.
Example: referring to the earlier dataset, Support(milk) = 6/9, Support(cheese) =
7/9, and Support(milk & cheese) = 6/9. The rule is often written milk => cheese, i.e.
milk and cheese are bought together.
Confidence: the ratio of the number of transactions that include all items in both {A}
and {B} to the number of transactions that include all items in {A}. Example: referring
to the earlier dataset, Confidence(milk => cheese) = Support(milk & cheese)/Support(milk) = 6/6 = 1.
Lift: the lift of the rule A => B is the confidence of the rule divided by the expected
confidence, assuming that the itemsets A and B are independent of each other.
Example: referring to the earlier dataset, Lift(milk => cheese) = [Support(milk &
cheese)/Support(milk)] / [Support(cheese)] = [6/6] / [7/9] = 9/7 ≈ 1.29.
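A minimal sketch of the three measures in Python, using the counts quoted above for the nine-basket example (6 baskets with milk, 7 with cheese, 6 with both); the function and variable names are illustrative only.

```python
def support(count_itemset: int, n_transactions: int) -> float:
    """Fraction of all transactions that contain the itemset."""
    return count_itemset / n_transactions

def confidence(count_a_and_b: int, count_a: int) -> float:
    """Fraction of transactions containing A that also contain B."""
    return count_a_and_b / count_a

def lift(count_a_and_b: int, count_a: int, count_b: int, n_transactions: int) -> float:
    """Confidence of A => B divided by the expected confidence Support(B)."""
    return confidence(count_a_and_b, count_a) / support(count_b, n_transactions)

n = 9                              # nine baskets in total
milk, cheese, both = 6, 7, 6       # counts from the example above

print(support(both, n))            # Support(milk & cheese) = 6/9 ≈ 0.667
print(confidence(both, milk))      # Confidence(milk => cheese) = 6/6 = 1.0
print(lift(both, milk, cheese, n)) # Lift(milk => cheese) = 9/7 ≈ 1.286
```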
Exercise: referring to the earlier dataset, compute the missing values.
1. Milk => Cheese: Support(Milk) = ?, Support(Cheese) = ?, and the Support, Confidence, and Lift of the rule = ?
2. (Apple, Milk) => Cheese: Support(Apple, Milk) = ?, Support(Apple, Cheese) = ?, and the Support, Confidence, and Lift of the rule = ?
Also compute: Support(Grapes) = ?, Confidence({Grapes, Apple} => {Mango}) = ?, and Lift({Grapes, Apple} => {Mango}) = ?
Imagine you’re at the supermarket, and in your mind, you have the items you
want to buy. But you end up buying a lot more than you were supposed to.
This is called impulse buying, and brands use the Apriori algorithm to
leverage this phenomenon.
The Apriori algorithm uses frequent itemsets to generate association rules,
and it is designed to work on databases that contain transactions.
With the help of these association rules, it determines how strongly or how
weakly two objects are connected.
It is an iterative process for finding the frequent itemsets in a large
dataset.
The algorithm uses a breadth-first search and a Hash Tree to calculate the
itemset associations efficiently.
It is mainly used for market basket analysis and helps to find products
that can be bought together. It can also be used in the healthcare field to find
drug reactions for patients.
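As a sketch of how this looks in practice, the snippet below assumes the third-party mlxtend library (and pandas) is available; it reuses the five-transaction table from earlier, and the support and confidence thresholds are illustrative choices, not values from the notes.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Beer", "Nuts", "Diaper"],
    ["Beer", "Coffee", "Diaper"],
    ["Beer", "Diaper", "Eggs"],
    ["Nuts", "Eggs", "Milk"],
    ["Nuts", "Coffee", "Diaper", "Eggs", "Milk"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.6, then rules filtered by confidence.
frequent = apriori(onehot, min_support=0.6, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```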
A hash tree is a data structure used for data verification and synchronization.
It is a tree data structure where each non-leaf node is a hash of its child
nodes. All the leaf nodes are at the same depth and are as far left as possible.
It is also known as a Merkle Tree.
[Figure: Merkle tree over leaves A, B, C, D — the top hash is hash(h_A+B + h_C+D), i.e. the hash of its children's hashes.]
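A minimal sketch of the Merkle-tree idea for the four leaves A, B, C, D shown in the figure, using SHA-256; the choice of hash function and the byte contents of the leaves are purely illustrative.

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).digest()

# Leaf hashes for the four data blocks A, B, C, D.
hA, hB, hC, hD = (h(block) for block in (b"A", b"B", b"C", b"D"))

# Each non-leaf node is the hash of the concatenation of its children.
hAB = h(hA + hB)
hCD = h(hC + hD)
top_hash = h(hAB + hCD)   # the Merkle root, covering all four leaves

print(top_hash.hex())
```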
Step-1: Candidate Generation C1, and L1:

C1:

Itemset   Support Count
A         6
B         7
C         6
D         2
E         1

Now, we will take out all the itemsets that have a support count greater than or equal to the minimum support (2). This gives us the table for the frequent itemset L1. Since all the itemsets have a support count greater than or equal to the minimum support except E, the itemset E will be removed.
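A one-line sketch of this pruning step in Python; the dictionary simply mirrors the C1 table above.

```python
# Support counts for the 1-itemsets (candidate set C1) from the table above.
C1 = {"A": 6, "B": 7, "C": 6, "D": 2, "E": 1}
min_support = 2

# L1 keeps only the itemsets whose support count is >= the minimum support.
L1 = {item: count for item, count in C1.items() if count >= min_support}
print(L1)   # {'A': 6, 'B': 7, 'C': 6, 'D': 2} -- E is pruned
```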
Apriori Algorithm Example cont…

L1:

Itemset   Support Count
A         6
B         7
C         6
D         2

Step-2: Candidate Generation C2, and L2: In this step, we will generate C2 with the help of L1. In C2, we will create pairs of the itemsets of L1 in the form of subsets. After creating the subsets, we will again find the support count from the main transaction table of the dataset, i.e., how many times these pairs have occurred together in the given dataset. So, we will get the below table for C2:

C2:

Itemset   Support Count
{A, B}    4
{A, C}    4
{A, D}    1
{B, C}    4
{B, D}    2
{C, D}    0

Again, we need to compare the C2 support counts with the minimum support count, and after comparing, the itemsets with a smaller support count will be eliminated from table C2. This gives us the below table for L2. In this case, the itemsets {A, D} and {C, D} will be removed.
Apriori Algorithm Example cont…

L2:

Itemset   Support Count
{A, B}    4
{A, C}    4
{B, C}    4
{B, D}    2

Step-3: Candidate Generation C3, and L3: For C3, we will repeat the same two processes, but now we will form the C3 table with subsets of three items together, and will calculate the support count from the dataset. It will give the below table:

C3:

Itemset     Support Count
{A, B, C}   2
{B, C, D}   0
{A, C, D}   0
{A, B, D}   0

Now we will create the L3 table. As we can see from the C3 table above, there is only one combination of itemsets that has a support count equal to the minimum support count. So, L3 will have only one combination, i.e., {A, B, C}.

L3:

Itemset     Support Count
{A, B, C}   2
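The whole C1 → L1 → C2 → L2 → C3 → L3 procedure can be sketched in a few lines of Python. The transaction list below is hypothetical (the original transaction table behind the counts above is not reproduced in this section), so the printed support counts will differ from the tables; the structure of the computation is the point.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # C1 / L1: frequent 1-itemsets.
    current = {c: n for c, n in count([frozenset([i]) for i in items]).items()
               if n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Candidate generation: unions of frequent (k-1)-itemsets that have k items.
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset, then count and filter.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transactions over items A-E, with minimum support count 2.
transactions = [
    {"A", "B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "B", "C", "E"},
]
result = apriori_frequent_itemsets(transactions, min_support=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```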
Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same group are more similar to each
other than to the data points in other groups.
It is basically a grouping of objects on the basis of the similarity and
dissimilarity between them.
The following is an example of finding clusters of a population based on their
income and debt.
The data points clustered together can be classified into one single group. The
clusters can be distinguished, and we can identify that there are 3 clusters.
Now, based on the similarity of these clusters, the most similar clusters are combined
together, and this process is repeated until only a single cluster is left.
Proximity Matrix

Roll No   1    2    3    4    5
1         0    3    18   10   25
2         3    0    21   13   28
3         18   21   0    8    7
4         10   13   8    0    15
5         25   28   7    15   0

The diagonal elements of this matrix are always 0, as the distance of a point from itself is
always 0. The Euclidean distance formula is used to calculate the rest of the distances. Here
each student (roll no 1 to 5) is described by a single mark (10, 7, 28, 20, and 35 respectively), so
Point 1 and 2: √(10-7)^2 = √9 = 3
Point 1 and 3: √(10-28)^2 = √324 = 18, and so on…
Similarly, all the distances are calculated and the proximity matrix is filled.
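A short sketch that reproduces this proximity matrix from the single-mark data (10, 7, 28, 20, 35) implied by the distances above; with one feature, the Euclidean distance is just the absolute difference.

```python
marks = [10, 7, 28, 20, 35]          # marks for roll nos 1..5

# Pairwise Euclidean distance; with a single feature this is the absolute difference.
proximity = [[abs(a - b) for b in marks] for a in marks]

for roll, row in enumerate(proximity, start=1):
    print(roll, row)
# 1 [0, 3, 18, 10, 25]
# 2 [3, 0, 21, 13, 28]
# 3 [18, 21, 0, 8, 7]
# 4 [10, 13, 8, 0, 15]
# 5 [25, 28, 7, 15, 0]
```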
Step 1: First, each point is assigned to an individual cluster. Different colors here
represent different clusters; hence, there are 5 different clusters for the 5 points in the data.
Step 2: Next, look at the smallest distance in the proximity matrix and merge the points
with the smallest distance. Then the proximity matrix is updated.
Roll No   1    2    3    4    5
1         0    3    18   10   25
2         3    0    21   13   28
3         18   21   0    8    7
4         10   13   8    0    15
5         25   28   7    15   0
Here, the smallest distance is 3, and hence points 1 and 2 are merged.
Let’s look at the updated clusters and accordingly update the proximity matrix. Here, we
have taken the maximum of the two marks (7, 10) to replace the mark for this cluster.
Instead of the maximum, the minimum value or the average value could also be used.

Roll No   Mark
(1, 2)    10
3         28
4         20
5         35
Step 3: Step 2 is repeated until only a single cluster is left. So, look at the minimum
distance in the proximity matrix and then merge the closest pair of clusters. Repeating
these steps gives the fully merged clusters.
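A minimal sketch of this merge loop, following the same rule as the example (merge the closest pair, then represent the merged cluster by the maximum of the two marks); the cluster labels are illustrative.

```python
from itertools import combinations

# Each cluster is (label, representative mark); start with one cluster per student.
clusters = [("1", 10), ("2", 7), ("3", 28), ("4", 20), ("5", 35)]
merge_heights = []

while len(clusters) > 1:
    # Find the closest pair of clusters (smallest distance between representatives).
    a, b = min(combinations(clusters, 2), key=lambda pair: abs(pair[0][1] - pair[1][1]))
    distance = abs(a[1] - b[1])
    merge_heights.append(distance)
    # Replace the pair by one cluster, represented by the maximum of the two marks.
    merged = ("(" + a[0] + ", " + b[0] + ")", max(a[1], b[1]))
    clusters = [c for c in clusters if c is not a and c is not b] + [merged]
    print(f"merge {a[0]} and {b[0]} at distance {distance}")

print(merge_heights)   # [3, 7, 10, 15]
```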
To get the number of clusters for hierarchical clustering, we make use of a concept
called a dendrogram.
A dendrogram is a tree-like diagram that records the sequences of merges or splits.
Let’s get back to the faculty-student example. Whenever we merge two clusters, the
dendrogram records the distance between these clusters and represents it in graph
form.
Here, we can see that we have merged samples 1 and 2. The vertical line represents the
distance between these samples.
Dendrogram cont…
Similarly, we plot all the steps where we merged the clusters and finally, we get a
dendrogram like this:
Now, we can set a threshold distance and draw a horizontal line (generally, the threshold
is set in such a way that it cuts the tallest vertical line). Let’s set this threshold to 12 and
draw a horizontal line:
The number of clusters will be the number of vertical lines intersected by the line drawn
using the threshold. In the above example, since the red line intersects 2 vertical lines, we
will have 2 clusters. One cluster will have samples (1, 2, 4) and the other will have samples (3, 5).
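Continuing the earlier sketch, the number of clusters at a given threshold can be read off directly from the recorded merge heights, since every merge performed below the threshold reduces the cluster count by one; the heights [3, 7, 10, 15] come from the merge loop above.

```python
merge_heights = [3, 7, 10, 15]   # distances at which merges happened, from the loop above
n_points = 5
threshold = 12

# Each merge below the threshold reduces the number of clusters by one.
n_clusters = n_points - sum(1 for h in merge_heights if h < threshold)
print(n_clusters)   # 2 -> clusters {1, 2, 4} and {3, 5}
```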
Hierarchical Clustering: closeness of two clusters

The decision to merge two clusters is taken on the basis of the closeness of these
clusters. There are multiple metrics for deciding the closeness of two clusters;
the primary ones are:
Euclidean distance
Squared Euclidean distance
Manhattan distance
Maximum distance
Mahalanobis distance
Single Linkage

In single linkage hierarchical clustering, the distance between two clusters is defined as
the shortest distance between two points in each cluster. For example, the distance
between clusters “r” and “s” to the left is equal to the length of the arrow between their
two closest points.
Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as
the longest distance between two points in each cluster. For example, the distance
between clusters “r” and “s” to the left is equal to the length of the arrow between their
two furthest points.
Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as
the average distance between each point in one cluster and every point in the other
cluster. For example, the distance between clusters “r” and “s” to the left is equal to the
average length of the arrows connecting the points of one cluster to the points of the other.
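A compact sketch of the three linkage criteria for two clusters of one-dimensional points; the cluster values below are illustrative.

```python
from itertools import product

def single_linkage(r, s):
    """Shortest distance between any point of r and any point of s."""
    return min(abs(a - b) for a, b in product(r, s))

def complete_linkage(r, s):
    """Longest distance between any point of r and any point of s."""
    return max(abs(a - b) for a, b in product(r, s))

def average_linkage(r, s):
    """Average distance over all pairs of points, one from each cluster."""
    return sum(abs(a - b) for a, b in product(r, s)) / (len(r) * len(s))

r, s = [10, 7], [28, 35]       # two illustrative clusters of marks
print(single_linkage(r, s))    # 18
print(complete_linkage(r, s))  # 28
print(average_linkage(r, s))   # 23.0
```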
K-Means Clustering Algorithm
The following steps explain the working of the K-means clustering algorithm:
Begin
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurred, then go to Step-4; else go to Step-7.
Step-7: The model is ready.
End
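A minimal NumPy sketch of these steps, for 2-D points like the x-y scatter plot described next; the data points and the choice K = 2 are illustrative, not taken from the notes.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means: random initial centroids, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Step-2: pick K random initial centroids (here, random points from the data).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step-3/5: assign each point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step-6: if no point changed cluster, the model is ready.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step-4: move each centroid to the mean of the points assigned to it.
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Illustrative 2-D data (two loose groups of points) and K = 2.
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                   [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])
labels, centroids = kmeans(points, k=2)
print(labels)      # e.g. [0 0 0 1 1 1] (cluster numbering may be swapped)
print(centroids)
```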
Suppose we have two variables x and y. The x-y axis scatter plot of these two variables is
given below:
Let's take the number K of clusters as K=2, to identify the dataset and to put the points
into different clusters. It means here we will try to group this dataset into two
different clusters.
We need to choose K random points or centroids to form the clusters. These
points can be either points from the dataset or any other points. So, here we
are selecting the below two points as K points, which are not part of the dataset.
Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute this by calculating the distance between the points. So,
we will draw a median line between both centroids. Consider the below image:
From the image, it is clear that the points on the left side of the line are near to K1, the
blue centroid, and the points to the right of the line are close to the yellow centroid,
i.e. K2. Let's color them blue and yellow for clear visualization.
As we need to find the closest cluster, we will repeat the process by choosing
new centroids. To choose the new centroids, we will compute the center of
gravity of the points in each cluster, and will find the new centroids as below:
Next, we will reassign each data point to the new centroid. For this, we will
repeat the same process of finding a median line. The median line will be as in the
below image:
From the above image, we can see that one yellow point is on the left side of the line, and
two blue points are to the right of the line. So, these three points will be assigned to the new
centroids.
Working of K-Means Algorithm cont…
We will repeat the process by finding the center of gravity of the points in each cluster, so
the new centroids will be as shown in the below image:
As we have got the new centroids, we will again draw the median line and reassign the
data points. So, the image will be:
We can see in the previous image that no data points end up on the wrong side
of the line, i.e. no reassignment occurs, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:
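In practice this whole loop is usually done with a library call; the sketch below assumes scikit-learn is available and reuses the illustrative 2-D points from the earlier from-scratch sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                   [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

# Fit K-means with K = 2; n_init controls how many random initializations are tried.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster index for each point
print(km.cluster_centers_)  # the two final centroids
```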