Data Analytics (KIT601) UNIT 4 Notes

Data Analytics (Dr. A.P.J. Abdul Kalam Technical University)

Apriori Algorithm
The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on databases that contain transactions. With the help of these
association rules, it determines how strongly or how weakly two objects are connected.
The algorithm uses a breadth-first search and a hash tree to count candidate itemsets
efficiently. It is an iterative process for finding frequent itemsets in a large dataset.

This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly
used for market basket analysis and helps to find products that are likely to be bought
together. It can also be used in the healthcare field, for example to find drug reactions for patients.

What is a Frequent Itemset?

Frequent itemsets are those itemsets whose support is greater than or equal to the
threshold value (the user-specified minimum support). By the Apriori property, if {A, B}
is a frequent itemset, then A and B must individually be frequent itemsets as well.

For example, suppose there are two transactions: A = {1, 2, 3, 4, 5} and B = {2, 3, 7}.
Items 2 and 3 appear in both transactions, so {2}, {3}, and {2, 3} are the frequent
itemsets (with minimum support 2).

Note: To better understand the Apriori algorithm and related terms such as support and
confidence, it is recommended to first understand association rule learning.

Steps for Apriori Algorithm

Below are the steps of the Apriori algorithm (a minimal code sketch follows the list):

Step-1: Determine the support of the itemsets in the transactional database, and select the
minimum support and confidence.

Step-2: Keep all itemsets whose support value is higher than the minimum (selected)
support value.

Step-3: Find all the rules over these subsets that have a higher confidence value than the
threshold (minimum confidence).

Step-4: Sort the rules in decreasing order of lift.
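As a rough illustration of Steps 1 and 2 (counting supports and keeping the itemsets that meet minimum support), the following Python sketch enumerates frequent itemsets by brute force. The function and variable names are illustrative, not from any standard library, and a full implementation would add the Apriori join/prune optimisations.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    transactions = [set(t) for t in transactions]
    # C1/L1: count individual items and keep those meeting minimum support
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: combine items that appear in frequent (k-1)-itemsets
        items = sorted({i for itemset in frequent for i in itemset})
        candidates = [frozenset(c) for c in combinations(items, k)]
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Example with the two toy transactions above: {2}, {3} and {2, 3} each have support 2.
print(apriori_frequent_itemsets([{1, 2, 3, 4, 5}, {2, 3, 7}], min_support=2))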


Apriori Algorithm Working

We will understand the Apriori algorithm using an example and a step-by-step calculation.

Example: Suppose we have the following dataset containing various transactions. From
this dataset, we need to find the frequent itemsets and generate the association rules
using the Apriori algorithm:

Solution:

Step-1: Calculating C1 and L1:

● In the first step, we create a table that contains the support count (the frequency
of each itemset individually in the dataset) of each item in the given dataset.
This table is called the candidate set, C1.


● Now, we take all the itemsets that have a support count greater than or equal to the
minimum support (2). This gives us the table for the frequent itemset L1.
Since all the itemsets except {E} meet the minimum support, the itemset {E} is removed.

Step-2: Candidate generation C2, and L2:

● In this step, we generate C2 with the help of L1. In C2, we create all pairs of the
itemsets of L1 in the form of subsets.

● After creating the pairs, we again find the support count from the main
transaction table of the dataset, i.e., how many times these pairs occur
together in the given dataset. This gives us the table for C2.

● Again, we need to compare the C2 support counts with the minimum support
count; after comparing, the itemsets with lower support counts will be


eliminated from table C2. This gives us the table for L2.

Step-3: Candidate generation C3, and L3:

● For C3, we repeat the same two steps, but now we form the C3 table with
itemsets of three items and calculate their support counts from the dataset.
This gives the table below.

● Now we create the L3 table. As we can see from the C3 table, there is
only one combination of items whose support count is equal to the minimum
support count. So L3 has only one combination, i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:

To generate the association rules, we first create a new table with the possible
rules from the frequent combination {A, B, C}. For each rule X → Y, we calculate the
confidence using the formula confidence(X → Y) = sup(X ∪ Y) / sup(X). After calculating
the confidence value for all rules, we exclude the rules whose confidence is below the
minimum threshold (50%).

Consider the table below:


Rules       Support   Confidence
A^B → C     2         sup(A^B^C)/sup(A^B) = 2/4 = 0.5 = 50%
B^C → A     2         sup(B^C^A)/sup(B^C) = 2/4 = 0.5 = 50%
A^C → B     2         sup(A^C^B)/sup(A^C) = 2/4 = 0.5 = 50%
C → A^B     2         sup(C^A^B)/sup(C)   = 2/5 = 0.4 = 40%
A → B^C     2         sup(A^B^C)/sup(A)   = 2/6 = 0.33 = 33.33%
B → A^C     2         sup(B^A^C)/sup(B)   = 2/7 = 0.28 = 28%

As the given threshold (minimum confidence) is 50%, the first three rules A^B → C,
B^C → A, and A^C → B can be considered strong association rules for the given problem.
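The confidence values in the table can be checked with a short calculation of sup(X ∪ Y)/sup(X). The snippet below is a minimal sketch that assumes the support counts from the worked example (sup(A) = 6, sup(B) = 7, sup(C) = 5, each pair 4, and sup(A, B, C) = 2).

support = {
    frozenset("ABC"): 2,
    frozenset("AB"): 4, frozenset("BC"): 4, frozenset("AC"): 4,
    frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 5,
}

def confidence(antecedent, consequent):
    # confidence(X -> Y) = sup(X union Y) / sup(X)
    union = frozenset(antecedent) | frozenset(consequent)
    return support[union] / support[frozenset(antecedent)]

for x, y in [("AB", "C"), ("BC", "A"), ("AC", "B"), ("C", "AB"), ("A", "BC"), ("B", "AC")]:
    print(f"{x} -> {y}: {confidence(x, y):.1%}")   # 50.0%, 50.0%, 50.0%, 40.0%, 33.3%, 28.6%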

Advantages of the Apriori Algorithm

● It is an easy-to-understand algorithm.

● The join and prune steps of the algorithm can be easily implemented on large
datasets.

Disadvantages of the Apriori Algorithm

● The Apriori algorithm is slow compared to other algorithms.

● The overall performance can be reduced because it scans the database multiple
times.


● The time and space complexity of the Apriori algorithm is O(2^D), which is very
high. Here D represents the horizontal width (the number of distinct items) present
in the database.


Handling Larger Datasets in Main Memory

Improvements to A-Priori


PCY Algorithm (Park-Chen-Yu)

• Hash-based improvement to A-Priori.
• During Pass 1 of A-Priori, most memory is idle.
• Use that memory to keep counts of buckets into which pairs of items are hashed.
  – Just the count, not the pairs themselves.
• Gives an extra condition that candidate pairs must satisfy on Pass 2.


Picture of PCY: in Pass 1, main memory holds the item counts and the hash table of bucket counts; in Pass 2, it holds the frequent items, the bitmap of frequent buckets, and the counts of candidate pairs.


PCY Algorithm – Before Pass 1: Organize Main Memory

• Space to count each item.
  – One (typically) 4-byte integer per item.
• Use the rest of the space for as many integers, representing buckets, as we can.


PCY Algorithm – Pass 1

FOR (each basket) {
    FOR (each item)
        add 1 to item's count;
    FOR (each pair of items) {
        hash the pair to a bucket;
        add 1 to the count for that bucket;
    }
}
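The pseudocode above can be sketched in Python as follows. NUM_BUCKETS and the helper names are illustrative choices, and Python's built-in hash stands in for the bucket hash function.

from itertools import combinations

NUM_BUCKETS = 8   # chosen to fit in the memory left over after the item counts

def pcy_pass1(baskets):
    item_counts = {}
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(basket), 2):
            bucket = hash(pair) % NUM_BUCKETS   # hash the pair to a bucket
            bucket_counts[bucket] += 1          # keep only the count, not the pair
    return item_counts, bucket_counts

def frequent_bucket_bitmap(bucket_counts, s):
    # Between passes: replace the bucket counts by a bitmap of frequent buckets
    return [1 if count >= s else 0 for count in bucket_counts]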


Observations About Buckets

1. If a bucket contains a frequent pair, then the bucket is surely frequent.
   – We cannot use the hash table to eliminate any member of this bucket.
2. Even without any frequent pair, a bucket can be frequent.
   – Again, nothing in the bucket can be eliminated.
3. But in the best case, the count for a bucket is less than the support s.
   – Then all pairs that hash to this bucket can be eliminated as candidates, even if a pair consists of two frequent items.


PCY Algorithm – Between Passes

• Replace the buckets by a bit-vector:
  – 1 means the bucket count exceeds the support s (a frequent bucket);
  – 0 means it did not.
• 4-byte integers are replaced by bits, so the bit-vector requires 1/32 of the memory.
• Also, decide which items are frequent and list them for the second pass.


PCY Algorithm – Pass 2

• Count all pairs {i, j} that meet the conditions:
  1. Both i and j are frequent items.
  2. The pair {i, j} hashes to a bucket number whose bit in the bit-vector is 1.
• Notice that both these conditions are necessary for the pair to have a chance of being frequent.
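Continuing the Pass 1 sketch, a minimal Pass 2 might look like the following; the names are again illustrative, and NUM_BUCKETS must match the value used in Pass 1.

from itertools import combinations

NUM_BUCKETS = 8   # must match the Pass 1 sketch

def pcy_pass2(baskets, item_counts, bitmap, s):
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    pair_counts = {}
    for basket in baskets:
        items = sorted(i for i in basket if i in frequent_items)   # condition 1
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % NUM_BUCKETS]:                   # condition 2
                pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p: c for p, c in pair_counts.items() if c >= s}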


Memory Details

• The hash table requires buckets of 2-4 bytes.
  – The number of buckets is thus almost 1/4 to 1/2 of the number of bytes of main memory.
• On the second pass, a table of (item, item, count) triples is essential.
  – Thus, the hash table must eliminate 2/3 of the candidate pairs to beat A-Priori with a triangular matrix for counts.


Multistage Algorithm (Improvements to PCY)

• It might happen that even after hashing there are still too many surviving pairs and main memory isn't sufficient to hold their counts.
• Key idea: after Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY.
  – Using a different hash function!
• On the middle pass, fewer pairs contribute to buckets, so there are fewer false positives (frequent buckets with no frequent pair).


Multistage Picture: Pass 1 holds the item counts and the first hash table; Pass 2 holds the frequent items, Bitmap 1, and the second hash table; Pass 3 holds the frequent items, Bitmap 1, Bitmap 2, and the counts of candidate pairs.


Multistage – Pass 3

• Count only those pairs {i, j} that satisfy:
  1. Both i and j are frequent items.
  2. Using the first hash function, the pair hashes to a bucket whose bit in the first bit-vector is 1.
  3. Using the second hash function, the pair hashes to a bucket whose bit in the second bit-vector is 1.


Multihash

• Key idea: use several independent hash tables on the first pass.
• Risk: halving the number of buckets doubles the average count. We have to be sure most buckets will still not reach count s.
• If so, we can get a benefit like multistage, but in only 2 passes.


Multihash Picture: Pass 1 holds the item counts and two hash tables (first and second); Pass 2 holds the frequent items, Bitmap 1, Bitmap 2, and the counts of candidate pairs.


Extensions

• Either multistage or multihash can use more than two hash functions.
• In multistage, there is a point of diminishing returns, since the bit-vectors eventually consume all of main memory.
• For multihash, the bit-vectors occupy exactly what one PCY bitmap does, but too many hash functions makes all counts exceed s.


All (or Most) Frequent Itemsets in ≤ 2 Passes

• Simple algorithm (random sampling).
• SON (Savasere, Omiecinski, and Navathe).
• Toivonen's algorithm.


Simple Algorithm

• Take a random sample of the market baskets that fits in main memory.
• Run A-Priori (for sets of all sizes, not just pairs) in main memory, so you don't pay for disk I/O each time you increase the size of itemsets.
  – Be sure you leave enough space for counts.
• Use as support threshold a suitable, scaled-back number.
  – E.g., if your sample is 1/100 of the baskets, use s/100 as your support threshold instead of s.

(Main-memory picture: a copy of the sample baskets plus space for counts.)


Simple Algorithm – Option

• Optionally, verify that your guesses are truly frequent in the entire data set by a second pass.
• But you don't catch sets frequent in the whole but not in the sample.
  – A smaller threshold, e.g., s/125, helps catch more truly frequent itemsets.
    • But it requires more space.


SON Algorithm – (1)

• Repeatedly read small subsets of the baskets into main memory and perform the first pass of the simple algorithm on each subset.
• An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets.
• On a second pass, count all the candidate itemsets and determine which are frequent in the entire set.
• Key "monotonicity" idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset.


SON Algorithm – Distributed Version

• This idea lends itself to distributed data mining.
• If baskets are distributed among many nodes, compute frequent itemsets at each node, then distribute the candidates from each node.
• Finally, accumulate the counts of all candidates.


Toivonen's Algorithm – (1)

• Start as in the simple algorithm, but lower the threshold slightly for the sample.
• Example: if the sample is 1% of the baskets, use s/125 as the support threshold rather than s/100.
• The goal is to avoid missing any itemset that is frequent in the full set of baskets.


Toivonen's Algorithm – (2)

• Add to the itemsets that are frequent in the sample the negative border of those itemsets.
• An itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are.
• Example: ABCD is in the negative border if and only if it is not frequent, but all of ABC, BCD, ACD, and ABD are.


Toivonen's Algorithm – (3)

• In a second pass, count all candidate frequent itemsets from the first pass, and also count their negative border.
• If no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.


Toivonen's Algorithm – (4)

• What if we find that something in the negative border is actually frequent?
• We must start over again!
• Try to choose the support threshold so that the probability of failure is low, while the number of itemsets checked on the second pass fits in main memory.


Theorem:
• If there is an itemset that is frequent in the whole, but not frequent in the sample,
• then there is a member of the negative border for the sample that is frequent in the whole.


Proof:
• Suppose not; i.e., there is an itemset S frequent in the whole but
  – not frequent in the sample, and
  – not present in the sample's negative border.
• Let T be a smallest subset of S that is not frequent in the sample.
• T is frequent in the whole (S is frequent, by monotonicity).
• T is in the negative border (otherwise it would not be "smallest").


Finding Frequent Itemsets: Limited Pass Algorithms

Thanks for source slides and material to: J. Leskovec, A. Rajaraman, J. Ullman, Mining of Massive Datasets, http://www.mmds.org


Limited Pass Algorithms

◆ Algorithms so far compute the exact collection of frequent itemsets of size k in k passes
  ◗ A-Priori, PCY, Multistage, Multihash
◆ There are many applications where it is not essential to discover every frequent itemset
  ◗ Sufficient to discover most of them
◆ Next: algorithms that find all or most frequent itemsets using at most 2 passes over the data
  ◗ Sampling
  ◗ SON
  ◗ Toivonen's Algorithm

Random Sampling of Input Data


Random Sampling

◆ Take a random sample of the market baskets that fits in main memory
  ◗ Leave enough space in memory for counts
◆ Run A-Priori or one of its improvements in main memory
  ◗ For sets of all sizes, not just pairs
  ◗ Don't pay for disk I/O each time we increase the size of itemsets
  ◗ Reduce the support threshold proportionally to match the sample size

(Main-memory picture: a copy of the sample baskets plus space for counts.)

How to Pick the Sample

◆ Best way: read the entire data set
◆ For each basket, select that basket for the sample with probability p
  ◗ For input data with m baskets
  ◗ At the end, we will have a sample with size close to pm baskets
◆ If the file is part of a distributed file system, we can pick chunks at random for the sample
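A minimal sketch of this selection rule, assuming the baskets can be streamed and p is the sampling fraction (the names are illustrative):

import random

def sample_baskets(baskets, p, seed=42):
    # Keep each basket independently with probability p; with m baskets the
    # sample size is close to p*m in expectation.
    rng = random.Random(seed)
    return [basket for basket in baskets if rng.random() < p]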


Support Threshold for Random Sampling

◆ Adjust the support threshold to a suitable, scaled-back number
  ◗ To reflect the smaller number of baskets
◆ Example
  ◗ If the sample size is 1% or 1/100 of the baskets
  ◗ Use s/100 as your support threshold
  ◗ An itemset is frequent in the sample if it appears in at least s/100 of the baskets in the sample


Random Sampling: Not an Exact Algorithm

◆ With a single pass, we cannot guarantee:
  ◗ That the algorithm will produce all itemsets that are frequent in the whole dataset
    • False negative: an itemset that is frequent in the whole but not in the sample
  ◗ That it will produce only itemsets that are frequent in the whole dataset
    • False positive: frequent in the sample but not in the whole
◆ If the sample is large enough, there are unlikely to be serious errors


Random Sampling: Avoiding Errors

◆ Eliminate false positives
  ◗ Make a second pass through the full dataset
  ◗ Count all itemsets that were identified as frequent in the sample
  ◗ Verify that the candidate pairs are truly frequent in the entire data set
◆ But this doesn't eliminate false negatives
  ◗ Itemsets that are frequent in the whole but not in the sample
  ◗ They remain undiscovered
◆ Reduce false negatives
  ◗ Before, we used threshold ps where p is the sampling fraction
  ◗ Reduce this threshold: e.g., 0.9ps
  ◗ More itemsets of each size have to be counted
  ◗ If memory allows: requires more space
  ◗ The smaller threshold helps catch more truly frequent itemsets


Savasere, Omiecinski and Navathe (SON) Algorithm


SON Algorithm

◆ Avoids false negatives and false positives
◆ Requires two full passes over the data


SON Algorithm – (1)

◆ Repeatedly read small subsets of the baskets into main memory
◆ Run an in-memory algorithm (e.g., A-Priori, random sampling) to find all frequent itemsets
  ◗ Note: we are not sampling, but processing the entire file in memory-sized chunks
◆ An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets


SON Algorithm – (2)

◆ On a second pass, count all the candidate itemsets and determine which are frequent in the entire set
◆ Key "monotonicity" idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset
  ◗ A subset or chunk contains fraction p of the whole file
  ◗ There are 1/p chunks in the file
  ◗ If an itemset is not frequent in any chunk, then its support in each chunk is less than ps
  ◗ Its support in the whole file is then less than s: not frequent


SON – Distributed Version

◆ SON lends itself to distributed data mining
  ◗ MapReduce
◆ Baskets are distributed among many nodes
  ◗ Subsets of the data may correspond to one or more chunks in the distributed file system
  ◗ Compute frequent itemsets at each node
  ◗ Distribute candidates to all nodes
  ◗ Accumulate the counts of all candidates


SON: Map/Reduce

◆ Phase 1: Find candidate itemsets
  ◗ Map?
  ◗ Reduce?
◆ Phase 2: Find true frequent itemsets
  ◗ Map?
  ◗ Reduce?


SON: Map/Reduce – Phase 1: Find candidate itemsets

◆ Map
  ◗ The input is a chunk/subset of all baskets; a fraction p of the total input file
  ◗ Find the itemsets frequent in that subset (e.g., using the random sampling algorithm)
  ◗ Use support threshold ps
  ◗ The output is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample
◆ Reduce
  ◗ Each reduce task is assigned a set of keys, which are itemsets
  ◗ It produces the keys that appear one or more times
  ◗ These are frequent in some subset
  ◗ These are the candidate itemsets


SON: Map/Reduce – Phase 2: Find true frequent itemsets

◆ Map
  ◗ Each Map task takes the output from the first Reduce task AND a chunk of the total input data file
  ◗ All candidate itemsets go to every Map task
  ◗ Count the occurrences of each candidate itemset among the baskets in the input chunk
  ◗ The output is a set of key-value pairs (C, v), where C is a candidate frequent itemset and v is the support for that itemset among the baskets in the input chunk
◆ Reduce
  ◗ Each reduce task is assigned a set of keys (itemsets)
  ◗ It sums the associated values for each key: the total support for the itemset
  ◗ If the support of the itemset >= s, emit the itemset and its count


Toivonen’s Algorithm



Toivonen's Algorithm

◆ Given sufficient main memory, uses one pass over a small sample and one full pass over the data
◆ Gives no false positives or false negatives
◆ BUT, there is a small but finite probability that it will fail to produce an answer
  ◗ It will not identify the frequent itemsets
◆ It must then be repeated with a different sample until it gives an answer
◆ Only a small number of iterations are needed


Toivonen's Algorithm (1)

First find candidate frequent itemsets from the sample
◆ Start as in the random sampling algorithm, but lower the threshold slightly for the sample
  ◗ Example: if the sample is 1% of the baskets, use s/125 as the support threshold rather than s/100
  ◗ For a fraction p of the baskets in the sample, use 0.8ps or 0.9ps as the support threshold
◆ The goal is to avoid missing any itemset that is frequent in the full set of baskets
◆ The smaller the threshold:
  ◗ The more memory is needed to count all candidate itemsets
  ◗ The less likely it is that the algorithm will fail to produce an answer


Toivonen's Algorithm – (2)

After finding the frequent itemsets for the sample, construct the negative border
◆ Negative border: the collection of itemsets that are not frequent in the sample but all of whose immediate subsets are frequent
  ◗ An immediate subset is constructed by deleting exactly one item


Example: Negative Border

◆ ABCD is in the negative border if and only if:
  1. It is not frequent in the sample, but
  2. All of ABC, BCD, ACD, and ABD are frequent
     • Immediate subsets: formed by deleting one item
◆ A singleton such as A is in the negative border if and only if it is not frequent in the sample
◆ Note: the empty set is always frequent
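A minimal sketch of how the negative border could be computed from the itemsets found frequent in the sample (the helper name and the toy frequent collection are illustrative assumptions):

from itertools import combinations

def negative_border(frequent, items):
    border = set()
    max_k = max((len(s) for s in frequent), default=0) + 1
    for k in range(1, max_k + 1):
        for combo in combinations(sorted(items), k):
            candidate = frozenset(combo)
            if candidate in frequent:
                continue
            # Keep the candidate if every immediate subset (delete one item) is
            # frequent; the empty set counts as frequent, so singletons qualify.
            subsets = [candidate - {i} for i in candidate]
            if all(len(sub) == 0 or sub in frequent for sub in subsets):
                border.add(candidate)
    return border

# Toy sample: singletons A-D plus AB, AC, BC are frequent; the border is then
# {AD, BD, CD, ABC}.
freq = {frozenset(x) for x in ["A", "B", "C", "D", "AB", "AC", "BC"]}
print(negative_border(freq, "ABCD"))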


Picture of the negative border: the frequent itemsets from the sample (singletons, doubletons, tripletons, ...) are surrounded by the negative border of itemsets that lie just outside the frequent collection.


Toivonen's Algorithm (1)

First pass:
(1) First find candidate frequent itemsets from the sample
  ◗ Sample on the first pass!
  ◗ Use a lower threshold: for a fraction p of the baskets in the sample, use 0.8ps or 0.9ps as the support threshold
  ◗ This identifies the itemsets that are frequent for the sample
(2) Construct the negative border
  ◗ Itemsets that are not frequent in the sample but all of whose immediate subsets are frequent


Toivonen's Algorithm – (3)

◆ In the second pass, process the whole file (no sampling!)
◆ Count:
  ◗ all candidate frequent itemsets from the first pass
  ◗ all itemsets on the negative border
◆ Case 1: No itemset from the negative border turns out to be frequent in the whole data set
  ◗ The correct set of frequent itemsets is exactly the itemsets from the sample that were found frequent in the whole data
◆ Case 2: Some member of the negative border is frequent in the whole data set
  ◗ Can give no answer at this time
  ◗ Must repeat the algorithm with a new random sample

Toivonen's Algorithm – (4)

◆ Goal: save time by looking at a sample on the first pass
  ◗ But is the set of frequent itemsets for the sample the correct set for the whole input file?
◆ If some member of the negative border is frequent in the whole data set, we can't be sure that there are not some even larger itemsets that:
  ◗ are neither in the negative border nor in the collection of frequent itemsets for the sample
  ◗ but are frequent in the whole
◆ So start over with a new sample
◆ Try to choose the support threshold so that the probability of failure is low, while the number of itemsets checked on the second pass fits in main memory


A Few Slides on Hashing

Source: Introduction to Data Mining with Case Studies, G. K. Gupta, Prentice Hall India, 2006.


Hashing

In the PCY algorithm, when generating L1 (the set of frequent itemsets of size 1), the algorithm also:
• generates all possible pairs for each basket
• hashes them to buckets
• keeps a count for each hash bucket
• identifies frequent buckets (count >= s)

(Recall the main-memory picture of PCY: Pass 1 holds the item counts and a hash table for pairs; Pass 2 holds the frequent items, the bitmap, and the counts of candidate pairs.)


Example

Consider the basket database in the first table below. All itemsets of size 1 were determined to be frequent on the previous pass. The second table shows all possible 2-itemsets for each basket.

Basket ID   Items
100         Bread, Cheese, Eggs, Juice
200         Bread, Cheese, Juice
300         Bread, Milk, Yogurt
400         Bread, Juice, Milk
500         Cheese, Juice, Milk

100   (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200   (B, C) (B, J) (C, J)
300   (B, M) (B, Y) (M, Y)
400   (B, J) (B, M) (J, M)
500   (C, J) (C, M) (J, M)


Example Hash Function

• For each pair, a numeric value is obtained by first representing B by 1, C by 2, E by 3, J by 4, M by 5 and Y by 6.
• Now each pair can be represented by a two-digit number, e.g., (B, E) by 13 and (C, M) by 25.
• Use a hash function on these numbers: e.g., the number modulo 8.
• The hashed value is the bucket number.
• Keep a count of the number of pairs hashed to each bucket.
• Buckets that have a count at or above the support value are frequent buckets.
• Set the corresponding bit in the bit map to 1; otherwise, the bit is 0.
• All pairs in rows that have a zero bit are removed as candidates.
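A minimal sketch of this bucket-counting scheme, using the item codes and the modulo-8 hash from the example (the helper names are illustrative):

from itertools import combinations
from collections import Counter

CODE = {"B": 1, "C": 2, "E": 3, "J": 4, "M": 5, "Y": 6}

def bucket(pair):
    i, j = sorted(pair, key=CODE.get)
    return (CODE[i] * 10 + CODE[j]) % 8     # two-digit number, modulo 8

baskets = [
    {"B", "C", "E", "J"}, {"B", "C", "J"}, {"B", "M", "Y"},
    {"B", "J", "M"}, {"C", "J", "M"},
]
bucket_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b, key=CODE.get), 2):
        bucket_counts[bucket(pair)] += 1

s = 3
bitmap = {k: int(bucket_counts[k] >= s) for k in range(8)}
# bucket_counts matches the tables below: bucket 0 -> 5, buckets 5, 6, 7 -> 3,
# bucket 4 -> 2, buckets 1 and 2 -> 1, bucket 3 -> 0.
print(bucket_counts, bitmap)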


Hashing Example (Support threshold = 3)

The possible pairs:

100   (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200   (B, C) (B, J) (C, J)
300   (B, M) (B, Y) (M, Y)
400   (B, J) (B, M) (J, M)
500   (C, J) (C, M) (J, M)

(B, C) -> 12, 12 % 8 = 4;  (B, E) -> 13, 13 % 8 = 5;  (C, J) -> 24, 24 % 8 = 0

Mapping table: B = 1, C = 2, E = 3, J = 4, M = 5, Y = 6

Bit map for frequent buckets   Bucket number   Count   Pairs that hash to bucket
1                              0
0                              1
0                              2
0                              3
0                              4
1                              5
1                              6
1                              7


Hashing Example (Support threshold = 3)

The possible pairs:

100   (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200   (B, C) (B, J) (C, J)
300   (B, M) (B, Y) (M, Y)
400   (B, J) (B, M) (J, M)
500   (C, J) (C, M) (J, M)

(B, C) -> 12, 12 % 8 = 4;  (B, E) -> 13, 13 % 8 = 5;  (C, J) -> 24, 24 % 8 = 0

Mapping table: B = 1, C = 2, E = 3, J = 4, M = 5, Y = 6

Bit map for frequent buckets   Bucket number   Count   Pairs that hash to bucket
1                              0
0                              1
0                              2
0                              3
0                              4               2       (B, C)
1                              5               3       (B, E) (J, M)
1                              6
1                              7

Bucket 5 is frequent. Are any of the pairs that hash to the bucket frequent?
Does Pass 1 of PCY know which pairs contributed to the bucket?

Hashing Example (Support threshold = 3)

The possible pairs:

100   (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200   (B, C) (B, J) (C, J)
300   (B, M) (B, Y) (M, Y)
400   (B, J) (B, M) (J, M)
500   (C, J) (C, M) (J, M)

(B, C) -> 12, 12 % 8 = 4;  (B, E) -> 13, 13 % 8 = 5;  (C, J) -> 24, 24 % 8 = 0

Mapping table: B = 1, C = 2, E = 3, J = 4, M = 5, Y = 6

Bit map for frequent buckets   Bucket number   Count   Pairs that hash to bucket
1                              0               5       (C, J) (B, Y) (M, Y)
0                              1               1       (C, M)
0                              2               1       (E, J)
0                              3               0
0                              4               2       (B, C)
1                              5               3       (B, E) (J, M)
1                              6               3       (B, J)
1                              7               3       (C, E) (B, M)

At the end of Pass 1, we know only which buckets are frequent.
All pairs that hash to those buckets are candidates and will be counted.

Reducing the Number of Candidate Pairs

◆ Goal: reduce the size of the candidate set C2
  ◗ We only have to count the candidate pairs
  ◗ Pairs that hash to a frequent bucket
◆ It is essential that the hash table is large enough so that collisions are few
◆ Collisions result in a loss of effectiveness of the hash table
◆ In our example, three frequent buckets had collisions
◆ We must count all those pairs to determine which are truly frequent


PCY Algorithm in Big Data Analytics

In this article, we discuss a very important algorithm in big data analytics, i.e., the PCY algorithm used for frequent itemset mining.

Submitted by Uma Dasgupta, on September 12, 2018

The PCY algorithm was developed by Park, Chen, and Yu. It is an algorithm used in the field of big data analytics for frequent itemset mining when the dataset is very large.

Consider that we have a huge collection of data, and in this data we have a number of
transactions. For example, if we buy any product online, its transaction is recorded.
Say a person is buying a shirt from a site; along with the shirt, the site advises the
person to also buy jeans, with some discount. So, we can see how two different
items are made into a single set and associated. The main purpose of this algorithm is
to find frequent itemsets, e.g., along with a shirt, people frequently buy jeans.

For example:

Transaction     Items bought
Transaction 1   Shirt + Jeans
Transaction 2   Shirt + Jeans + Trouser
Transaction 3   Shirt + Tie
Transaction 4   Shirt + Jeans + Shoes

So, from the above example we can see that a shirt is most frequently bought along with
jeans, so it is considered a frequent itemset.

An Example Problem Solved Using the PCY Algorithm

Question: Apply the PCY algorithm on the following transactions to find the candidate sets
(frequent sets).

Given data:

Threshold value (minimum support) = 3
Hash function = (i * j) mod 10

T1 = {1, 2, 3}
T2 = {2, 3, 4}


T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12= {3, 4, 6}

Use buckets and the concepts of MapReduce to solve the above problem.

Solution:

1. Identify the support count (number of occurrences) of each candidate item in the given dataset.
2. Reduce the candidate set by removing items whose count is below the threshold.
3. Form pairs of the remaining candidates and find the count of each pair.
4. Apply the hash function to each pair to find its bucket number.
5. Draw the candidate set table.

Step 1: Map all the elements in order to find their support counts.

Items → {1, 2, 3, 4, 5, 6}
Key    1  2  3  4  5  6
Value  4  6  8  9  5  4

Step 2: Remove all elements whose count is less than the threshold.

Here no item has a count less than the threshold (3). Hence, the candidate set
= {1, 2, 3, 4, 5, 6}.

Step 3: Map all the candidate items into pairs and calculate the support count of each pair
(each pair is listed only once, under the first transaction in which it appears).

T1: (1, 2) = 2, (1, 3) = 3, (2, 3) = 3
T2: (2, 4) = 4, (3, 4) = 5
T3: (3, 5) = 3, (4, 5) = 3
T4: (4, 6) = 4, (5, 6) = 1
T5: (1, 5) = 1
T6: (2, 6) = 1


T7: (1, 4) = 2
T8: (2, 5) = 2
T9: (3, 6) = 2
T10–T12: no new pairs

Note: Pairs should not be repeated; skip any pair that has already been listed under an
earlier transaction.

Listing all the pairs whose support count is at least the threshold value:
{(1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (4, 6)}

Step 4: Apply the hash function (it gives us the bucket number).

Hash function = (i * j) mod 10

(1, 3) = (1 * 3) mod 10 = 3
(2, 3) = (2 * 3) mod 10 = 6
(2, 4) = (2 * 4) mod 10 = 8
(3, 4) = (3 * 4) mod 10 = 2
(3, 5) = (3 * 5) mod 10 = 5
(4, 5) = (4 * 5) mod 10 = 0
(4, 6) = (4 * 6) mod 10 = 4

Now, arrange the pairs in ascending order of their bucket numbers.

Bucket no.   Pair
0            (4, 5)
2            (3, 4)
3            (1, 3)
4            (4, 6)
5            (3, 5)
6            (2, 3)
8            (2, 4)

Step 5: In this final step we will prepare the candidate set.


Bit vector   Bucket no.   Support count   Pair     Candidate set
1            0            3               (4, 5)   (4, 5)
1            2            5               (3, 4)   (3, 4)
1            3            3               (1, 3)   (1, 3)
1            4            4               (4, 6)   (4, 6)
1            5            3               (3, 5)   (3, 5)
1            6            3               (2, 3)   (2, 3)
1            8            4               (2, 4)   (2, 4)
Note: The support count is the number of transactions in which the pair occurs (here each
bucket receives a single pair, so the bucket count equals the pair's support count).

Pairs whose support count is greater than or equal to the threshold (3) are written into the
candidate set; pairs with a lower count are rejected. Correspondingly, the bit vector for a
bucket is 1 if its count is greater than or equal to the threshold, and 0 otherwise.

Hence, the candidate (frequent) pairs are (1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5) and (4, 6).
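The whole worked example can be reproduced with a short sketch like the one below (illustrative names; it counts items, pairs and buckets in one pass, builds the bit vector, and keeps the candidate pairs that meet the threshold):

from itertools import combinations
from collections import Counter

transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 3, 5}, {2, 4, 6},
    {1, 3, 4}, {2, 4, 5}, {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6},
]
s = 3   # threshold

item_counts, pair_counts, bucket_counts = Counter(), Counter(), Counter()
for t in transactions:
    item_counts.update(t)
    for i, j in combinations(sorted(t), 2):
        pair_counts[(i, j)] += 1
        bucket_counts[(i * j) % 10] += 1        # hash function (i*j) mod 10

frequent_items = {i for i, c in item_counts.items() if c >= s}
bit_vector = {b: int(c >= s) for b, c in bucket_counts.items()}

candidates = {
    (i, j): c for (i, j), c in pair_counts.items()
    if i in frequent_items and j in frequent_items and bit_vector[(i * j) % 10]
}
frequent_pairs = {p: c for p, c in candidates.items() if c >= s}
# Prints the seven pairs listed above with their support counts.
print(frequent_pairs)

Note that in this full computation some buckets receive more than one pair (for example (1, 2) and (3, 4) both hash to bucket 2), so a few infrequent pairs survive as candidates and are only removed by the final count, which is exactly the behaviour described in the PCY slides.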


What is Clustering?

Many things around us can be categorized as "this and that" or, to be less vague and more specific, we have groupings that could be binary or groups that can be more than two, like a type of pizza base or type of car that you might want to purchase. The choices are always clear – or, as the technical lingo puts it – predefined groups, and the process of predicting them is an important process in the Data Science stack called Classification.

But what if we bring into play a quest where we don't have pre-defined choices initially; rather, we derive those choices! Choices that are based on hidden patterns, underlying similarities between the constituent variables, salient features from the data, etc. This process is known as Clustering in Machine Learning or Cluster Analysis, where we group the data together into an unknown number of groups and later use that information for further business processes.

So, to put it in simple words, in machine learning clustering is the process by which we create groups in data, like customers, products, employees, text documents, in such a way that objects falling into one group exhibit many similar properties with each other and are different from objects that fall in the other groups created during the process.

Clustering algorithms take the data and, using some sort of similarity metric, form these groups – later these groups can be used in various business processes like information retrieval, pattern recognition, image processing, data compression, bioinformatics, etc. In the machine learning process for clustering, as mentioned above, a distance-based similarity metric plays a pivotal role in deciding the clustering.

In this article, we shall understand the various types of clustering, the numerous clustering methods used in machine learning, and eventually see how they are key to solving various business problems.

Types of Clustering Methods

As we made the point earlier, for a successful grouping we need to attain two major goals: one, a similarity between one data point and another, and two, a distinction of those similar data points from others which most certainly, heuristically, differ from those points. The basis of such divisions begins with our ability to scale large datasets, and that is a major starting point for us. Once we are through it, we are presented with the challenge that our data contains different kinds of attributes – categorical, continuous data, etc. – and we should be able to deal with them. Now, we know that our data these days is not limited in terms of dimensions; we have data that is multi-dimensional in nature. The clustering algorithm that we intend to use should successfully cross this hurdle as well.

The clusters that we need should not only be able to distinguish data points but also be inclusive. Sure, a distance metric helps a lot, but the cluster shape is often limited to being a geometric shape and many important data points get excluded. This problem too needs to be taken care of.

In our progress, we notice that our data is highly "noisy" in nature. Many unwanted features have been residing in the data, which makes it a rather Herculean task to bring about any similarity between the data points – leading to the creation of improper groups. As we move towards the end of the line, we are faced with the challenge of business interpretation. The outputs from the clustering algorithm should be understandable and should fit the business criteria and address the business problem correctly.

To address the problem points above – scalability, attributes, dimensionality, boundary shape, noise, and interpretation – we have various types of clustering methods that solve one or many of these problems and, of course, many statistical and machine learning clustering algorithms that implement the methodology.

The various types of clustering are:

1. Connectivity-based Clustering (Hierarchical clustering)
2. Centroids-based Clustering (Partitioning methods)
3. Distribution-based Clustering
4. Density-based Clustering (Model-based methods)
5. Fuzzy Clustering
6. Constraint-based (Supervised Clustering)

1. Connectivity-Based Clustering (Hierarchical Clustering)


Hierarchical Clustering is a method of unsupervised machine learning clustering where it begins with a pre-defined top-to-bottom hierarchy of clusters. It then proceeds to perform a decomposition of the data objects based on this hierarchy, hence obtaining the clusters. This method follows two approaches based on the direction of progress, i.e., whether it is the top-down or bottom-up flow of creating clusters. These are the Divisive Approach and the Agglomerative Approach respectively.

1.1 Divisive Approach

This approach of hierarchical clustering follows a top-down approach where we consider that all the data points belong to one large cluster and try to divide the data into smaller groups based on a termination logic, or a point beyond which there will be no further division of data points. This termination logic can be based on the minimum sum of squares of error inside a cluster, or for categorical data, the metric can be the GINI coefficient inside a cluster.

Hence, iteratively, we are splitting the data which was once grouped as a single large cluster into "n" smaller clusters to which the data points now belong.

It must be taken into account that this algorithm is highly "rigid" when splitting the clusters – meaning, once a clustering is done inside a loop, there is no way that the task can be undone.

1.2 Agglomerative Approach

Agglomerative is quite the contrary of Divisive, where all the "N" data points are initially considered to be single-member clusters, so the data is comprised of "N" clusters. We iteratively combine these numerous "N" clusters into a fewer number of clusters, let's say "k" clusters, and hence assign the data points to each of these clusters accordingly. This approach is a bottom-up one, and it also uses a termination logic in combining the clusters. This logic can be a number-based criterion (no more clusters beyond this point), a distance criterion (clusters should not be too far apart to be merged), or a variance criterion (the increase in the variance of the cluster being merged should not exceed a threshold – the Ward Method).

2. Centroid-Based Clustering

Centroid-based clustering is considered one of the simplest clustering algorithms, yet the most effective way of creating clusters and assigning data points to them. The intuition behind centroid-based clustering is that a cluster is characterized and represented by a central vector, and data points that are in close proximity to these vectors are assigned to the respective clusters.

These clustering methods iteratively measure the distance between the clusters and the characteristic centroids using various distance metrics. These are either of Euclidean distance, Manhattan distance or Minkowski distance.

The major setback here is that we should either intuitively or scientifically (Elbow Method) define the number of clusters, "k", to begin the iteration of any clustering machine learning algorithm and start assigning the data points.


Despite the flaws, centroid-based clustering has proven its worth over hierarchical clustering when working with large datasets. Also, owing to its simplicity in implementation and interpretation, these algorithms have wide application areas, viz., market segmentation, customer segmentation, text topic retrieval, image segmentation etc.

3. Density-based Clustering (Model-based Methods)

If one looks into the previous two methods that we discussed, one would observe that both hierarchical and centroid-based algorithms are dependent on a distance (similarity/proximity) metric. The very definition of a cluster is based on this metric. Density-based clustering methods take density into consideration instead of distances. Clusters are considered as the densest regions in a data space, separated by regions of lower object density, and a cluster is defined as a maximal set of connected points.

When performing most of the clustering, we make two major assumptions: one, the data is devoid of any noise, and two, the shape of the cluster so formed is purely geometrical (circular or elliptical). The fact is, data always has some extent of inconsistency (noise) which cannot be ignored. Added to that, we must not limit ourselves to a fixed attribute shape; it is desirable to have arbitrary shapes so as not to ignore any data points. These are the areas where density-based algorithms have proven their worth!

Density-based algorithms can get us clusters with arbitrary shapes, clusters without any limitation on cluster sizes, clusters that contain the maximum level of homogeneity by ensuring the same levels of density within them, and clusters that are inclusive of outliers or noisy data.

4. Distribution-Based Clustering

Until now, the clustering techniques as we know them are based around either proximity (similarity/distance) or composition (density). There is a family of clustering algorithms that take a totally different metric into consideration – probability. Distribution-based clustering creates and groups data points based on their likelihood of belonging to the same probability distribution (Gaussian, Binomial etc.) in the data.


The distribution models of clustering are most closely related to statistics, as they very closely relate to the way datasets are generated and arranged using random sampling principles, i.e., fetching data points from one form of distribution. Clusters can then easily be defined as objects that are most likely to belong to the same distribution.

A major drawback of density- and boundary-based approaches is in specifying the clusters a priori for some of the algorithms, and mostly the definition of the shape of the clusters for most of the algorithms. There is at least one tuning or hyper-parameter which needs to be selected, and not only is that non-trivial, but any inconsistency in it would lead to unwanted results.

Distribution-based clustering has a vivid advantage over the proximity- and centroid-based clustering methods in terms of the flexibility, correctness and shape of the clusters formed. The major problem, however, is that these clustering methods work well only with synthetic or simulated data, or with data where most of the data points most certainly belong to a predefined distribution; if not, the results will overfit.

5. Fuzzy Clustering

The general idea about clustering revolves around assigning data points to mutually exclusive clusters, meaning a data point always resides uniquely inside a cluster and cannot belong to more than one cluster. Fuzzy clustering methods change this paradigm by assigning a data point to multiple clusters with a quantified degree-of-belongingness metric. The data points that are in proximity to the center of a cluster may also belong to the cluster to a higher degree than points at the edge of a cluster. The possibility with which an element belongs to a given cluster is measured by a membership coefficient that varies from 0 to 1.

Fuzzy clustering can be used with datasets where the variables have a high level of overlap. It is a strongly preferred algorithm for image segmentation, especially in bioinformatics, where identifying overlapping gene codes makes it difficult for generic clustering algorithms to differentiate between the image's pixels, so they fail to perform a proper clustering.

6. Constraint-based (Supervised Clustering)

The clustering process, in general, is based on the approach that the data can be divided into an optimal number of "unknown" groups. The underlying stages of all the clustering algorithms are to find those hidden patterns and similarities, without any intervention or predefined conditions. However, in certain business scenarios, we might be required to partition the data based on certain constraints. Here is where a supervised version of clustering machine learning techniques comes into play.

A constraint is defined as the desired properties of the clustering results, or a user's expectation of the clusters so formed – this can be in terms of a fixed number of clusters, or the cluster size, or important dimensions (variables) that are required for the clustering process.

Usually, tree-based classification machine learning algorithms like Decision Trees, Random Forest, and Gradient Boosting, etc. are made use of to attain constraint-based clustering. A tree is constructed by splitting without the interference of the constraints or clustering labels. Then, the leaf nodes of the tree are combined together to form the clusters while incorporating the constraints and using suitable algorithms.

Types of Clustering Algorithms with Detailed Description

1. k-Means Clustering

k-Means is one of the most widely used and perhaps the simplest unsupervised algorithms to solve clustering problems. Using this algorithm, we classify a given data set into a certain number of predetermined clusters or "k" clusters. Each cluster is assigned a designated cluster center, and the centers are placed as far away from each other as possible. Subsequently, each point gets associated with the nearest centroid until no point is left unassigned. Once that is done, the centers are re-calculated and the above steps are repeated. The algorithm converges at a point where the centroids cannot move any further. This algorithm targets minimizing an objective function called the squared error function F(V):


F(V) = Σ_{j=1..C} Σ_{i=1..C_j} ( ||x_i – v_j|| )²

where ||x_i – v_j|| is the distance between data point x_i and centroid v_j, C_j is the count of data points in cluster j, and C is the number of cluster centroids.

Implementation:

In R, there is a built-in function kmeans(), and in Python we make use of the scikit-learn cluster module, which has the KMeans function (sklearn.cluster.KMeans).

Advantages:

1. Can be applied to any form of data – as long as the data has numerical (continuous) entities.
2. Much faster than other algorithms.
3. Easy to understand and interpret.

Drawbacks:

1. Fails for non-linear data.
2. It requires us to decide on the number of clusters before we start the algorithm – the user needs to use additional mathematical methods and also heuristic knowledge to verify the correct number of centers.
3. It cannot work for categorical data.
4. Cannot handle outliers.

Application Areas:

a. Document clustering – a high application area in segmenting text-matrix-like data such as DTM, TF-IDF etc.
b. Banking and insurance fraud detection, where the majority of the columns represent financial figures – continuous data.
c. Image segmentation.
d. Customer segmentation.
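A minimal usage sketch of the scikit-learn KMeans function mentioned in the Implementation note above (the toy data and the choice k = 2 are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.2, 7.8], [7.9, 8.3]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # the two centroids the algorithm converged to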

2. Hierarchical Clustering Algorithm

As discussed in the earlier section, hierarchical clustering methods follow two approaches – Divisive and Agglomerative. Their implementation family contains two algorithms respectively, the divisive DIANA (Divisive Analysis) and AGNES (Agglomerative Nesting), for each of the approaches.

2.1 DIANA or Divisive Analysis

As discussed earlier, the divisive approach begins with one single cluster to which all the data points belong. It is then split into multiple clusters and the data points get reassigned to each of the clusters on the basis of the nearest distance measure of the pairwise distance between the data points. These distance measures can be Ward's distance, centroid distance, average linkage, complete linkage or single linkage. Ideally, the algorithm continues until each data point has its own cluster.

Implementation:

In R, we make use of the diana() function from the cluster package (cluster::diana).

2.2 Agglomerative Nesting or AGNES

AGNES starts by considering the fact that each data point has its own cluster, i.e., if there are n data rows, then the algorithm begins with n clusters initially. Then, iteratively, the clusters that are most similar – again based on the distances as measured in DIANA – are combined to form a larger cluster. The iterations are performed until we are left with one huge cluster that contains all the data points.

Implementation:

In R, we make use of the agnes() function from the cluster package (cluster::agnes()) or the built-in hclust() function from the native stats package. In Python, the implementation can be found in the scikit-learn package via the AgglomerativeClustering function inside the cluster module (sklearn.cluster.AgglomerativeClustering).

Advantages:

1. No prior knowledge about the number of clusters is needed, although the user needs to define a threshold for divisions.
2. Easy to implement across various forms of data and known to provide robust results for data generated via various sources. Hence it has a wide application area.


Disadvantages:

1. The cluster division (DIANA) or combination (AGNES) is really strict and, once performed, it cannot be undone and re-assigned in subsequent iterations or re-runs.
2. It has a high time complexity, in the order of O(n^2 log n) for all the n data points, hence it cannot be used for larger datasets.
3. Cannot handle outliers and noise.

Application areas:

1. Widely used in DNA sequencing to analyse the evolutionary history and the relationships among biological entities (Phylogenetics).
2. Identifying fake news by clustering the news article corpus, assigning the tokens or words into these clusters and marking out suspicious and sensationalized words to get possible faux words.
3. Personalization and targeting in marketing and sales.
4. Classifying the incoming network traffic into a website by classifying the http requests into various clusters and then heuristically identifying the problematic clusters and eventually restricting them.
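A minimal usage sketch of the scikit-learn AgglomerativeClustering function mentioned in the Implementation note above (the toy data and parameters are illustrative assumptions):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
print(labels)   # two groups, e.g. [0 0 0 1 1 1] (label numbering may differ)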

3. Fuzzy C Means Algorithm – FANNY (Fuzzy Analysis Clustering)

This algorithm follows the fuzzy cluster assignment methodology of clustering. The working of the FCM algorithm is almost similar to k-means – distance-based cluster assignment – however, the major difference is, as mentioned earlier, that according to this algorithm, a data point can be put into more than one cluster. This degree of belongingness can be clearly seen in the cost function of this algorithm:

J = Σ_i Σ_j (u_ij)^m ||x_i – μ_j||²

where u_ij is the degree of belongingness of data point x_i to a cluster c_j, μ_j is the cluster center of cluster j, and m is the fuzzifier.

So, just like the k-means algorithm, we first specify the number of clusters k and then assign the degree of belongingness to the clusters. We then repeat the algorithm until max_iterations is reached, which again can be tuned according to the requirements.

Implementation:

In R, FCM can be implemented using fanny() from the cluster package (cluster::fanny), and in Python fuzzy clustering can be performed using the cmeans() function from the skfuzzy module (skfuzzy.cmeans); further, it can be adapted to be applied on new data using the predictor function (skfuzzy.cmeans_predict).

Advantages:

1. FCM works best for highly correlated and overlapped data, where k-means cannot give any conclusive results.
2. It is an unsupervised algorithm and it has a higher rate of convergence than other partitioning-based algorithms.

Disadvantages:

1. We need to specify the number of clusters "k" prior to the start of the algorithm.
2. Although convergence is always guaranteed, the process is very slow and it cannot be used for larger data.
3. Prone to errors if the data has noise and outliers.

Application Areas:

1. Used widely in image segmentation of medical imagery, especially the images generated by an MRI.
2. Market definition and segmentation.
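As a minimal NumPy sketch of the degree-of-belongingness idea behind FCM, the membership update below assumes Euclidean distances and already-known cluster centers; in practice one would call cluster::fanny in R or skfuzzy.cmeans in Python, as noted in the Implementation note above.

import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    # u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)); each row sums to 1
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers = np.array([[0.0, 0.1], [5.1, 5.0]])
print(fuzzy_memberships(X, centers).round(3))   # degrees of belongingness in [0, 1]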
4. Mean Shift Clustering


Mean shift clustering is a form of nonparametric clustering approach which not only eliminates the 1. Image segmentation and computer vision – mostly used for handwritten text identification.
need for apriori specification of the number of clusters but also it removes the spatial and shape 2. Image tracking in video analysis.
constraints of the clusters – two of the major problems from the most widely preferred k-means
algorithm.
5. DBSCAN – Density-based Spatial Clustering
It is a density-based clustering algorithm where it firstly, seeks for stationary points in the density Density-based algorithms, in general, are pivotal in the application areas where we require non-linear
function. Then next, the clusters are eventually shifted to a region with higher density by shifting the cluster structures, purely based out of density. One of the ways how this principle can be made into
center of the cluster to the mean of the points present in the current window. The shift if the reality is by using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
window is repeated until no more points can be accommodated inside of that window. algorithm. There are two major underlying concepts in DBSCAN – one, Density Reachability and
second, Density Connectivity. This helps the algorithm to differentiate and separate regions with
varying degrees of density – hence creating clusters.

For implementing DBSCAN, we first begin with defining two important parameters – a radius
parameter eps (ϵ) and a minimum number of points within the radius (m).

Implementation:

In R, bmsClustering() function from MeanShift package performs the clustering


(MeanShift::bmsClustering()) and MeanShift() function in scikit learn package does the job in
Python. (sklearn.cluster.MeanShift)
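A short usage sketch with scikit-learn; the toy data and the quantile used for bandwidth estimation are illustrative choices:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.vstack([np.random.randn(100, 2),
               np.random.randn(100, 2) + [5, 5]])

# The bandwidth plays the role of the "window radius" discussed above.
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth)
ms.fit(X)

print("number of clusters found:", len(ms.cluster_centers_))
print("cluster centers:\n", ms.cluster_centers_)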
Advantages:

1. Non-parametric, and the number of clusters need not be specified apriori.
2. Owing to its density dependency, the shape of the cluster is not limited to circular or spherical.
3. More robust and more practical as it works for any form of data and the results are easily interpretable.

Disadvantages:

1. The selection of the window radius is highly arbitrary and cannot be related to any business logic, and selecting an incorrect window size is never desirable.

Applications:

1. Image segmentation and computer vision – mostly used for handwritten text identification.
2. Image tracking in video analysis.

5. DBSCAN – Density-based Spatial Clustering

Density-based algorithms, in general, are pivotal in the application areas where we require non-linear cluster structures, purely based out of density. One of the ways this principle can be made into reality is by using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. There are two major underlying concepts in DBSCAN – one, Density Reachability and second, Density Connectivity. This helps the algorithm to differentiate and separate regions with varying degrees of density – hence creating clusters.

For implementing DBSCAN, we first begin with defining two important parameters – a radius parameter eps (ϵ) and a minimum number of points within the radius (m).

a. The algorithm starts with a random data point that has not been accessed before and its neighborhood is marked according to ϵ.
b. If this contains all the m minimum points, then cluster formation begins – hence marking it as "visited" – if not, then it is labeled as "noise" for that iteration, which can get changed later.
c. If a next data point belongs to this cluster, then subsequently the ϵ neighborhood around this point becomes a part of the cluster formed in the previous step. This step is repeated until there are no more data points that can follow Density Reachability and Density Connectivity.
d. Once this loop is exited, it moves to the next "unvisited" data point and creates further clusters or noise.
e. The algorithm converges when no more unvisited data points remain.

Implementation:

In Python it is implemented via the DBSCAN() function from the scikit-learn cluster module (sklearn.cluster.DBSCAN) and in R it is implemented through dbscan() from the dbscan package (dbscan::dbscan(x, eps, minpts)).
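A minimal scikit-learn sketch, where eps and min_samples play the roles of ϵ and m described above; the data and parameter values are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
dense_blob = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
second_blob = rng.normal(loc=5.0, scale=0.3, size=(100, 2))
noise = rng.uniform(low=-2, high=7, size=(10, 2))
X = np.vstack([dense_blob, second_blob, noise])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                      # -1 marks points labelled as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", list(labels).count(-1))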

Advantages:

1. Doesn't require prior specification of the number of clusters.
2. Can easily deal with noise, not affected by outliers.
3. It has no strict shape requirement, so it can correctly accommodate many data points.

Disadvantages:

1. Cannot work with datasets of varying densities.
2. Sensitive to the clustering hyper-parameters – the eps and the min_points.
3. Fails if the data is too sparse.
4. The density measures (Reachability and Connectivity) can be affected by sampling.

Applications:

1. Used in document Network Analysis of text data for identifying plagiarism and copyrights in various scientific documents and scholarly articles.
2. Widely used in recommendation systems for various web applications and eCommerce websites.
3. Used in x-ray Crystallography to categorize the protein structure of a certain protein and to determine its interactions with other proteins in the strands.
4. Clustering in Social Network Analysis is implemented by DBSCAN where objects (points) are clustered based on the object's linkage rather than similarity.

6. Gaussian Mixed Models (GMM) with Expectation-Maximization Clustering

In Gaussian Mixed Models, we assume that the data points follow a Gaussian distribution, which is never a constraint at all as compared to the restrictions in the previous algorithms. Added to that, this assumption can lead to important selection criteria for the shape of the clusters – that is, cluster shapes can now be quantified. This quantification happens by use of the two most common and simple metrics – mean and variance.

To find the mean and variance, Expectation-Maximization is used, which is a form of optimization function. This function starts with random Gaussian parameters, say θ, and checks if the hypothesis confirms that a sample actually belongs to a cluster c. Once it does, we perform the maximization step where the Gaussian parameters are updated to fit the points assigned to the said cluster. The maximization step aims at increasing the likelihood of the sample belonging to the cluster distribution.

Implementation:

In Python, it is implemented via the GaussianMixture() function from scikit-learn (sklearn.mixture.GaussianMixture) and in R, it is implemented using GMM() from the ClusterR package.
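A short sketch with scikit-learn's GaussianMixture, which runs Expectation-Maximization internally; the data and the choice of two components are illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [1.0, 0.5], size=(200, 2)),
               rng.normal([4, 4], [0.5, 1.5], size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most likely component per point
soft_probs = gmm.predict_proba(X)   # probabilistic (mixed) membership
print("means:\n", gmm.means_)
print("first point memberships:", soft_probs[0].round(3))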
Advantages:

1. The associativity of a data point to a cluster is quantified using probability metrics – which can be easily interpreted.
2. Proven to be accurate for real-time data sets.
3. Some versions of GMM allow for mixed membership of data points, hence it can be a good alternative to Fuzzy C-Means to achieve fuzzy clustering.

Disadvantages:

1. Complex algorithm and cannot be applied to larger data.
2. It is hard to find clusters if the data is not Gaussian, hence a lot of data preparation is required.

Applications:

1. GMM has been more practically used in Topic Mining, where we can associate multiple topics to a particular document (an atomic part of a text – a news article, online review, Twitter tweet, etc.)
2. Spectral clustering, combined with Gaussian Mixed Models-EM, is used in image processing.

Applications of Clustering

We have seen numerous methodologies and approaches for clustering in machine learning and some of the important algorithms that implement those techniques. Let's have a quick overview of business applications of clustering and understand its role in Data Mining.

1. It is the backbone of search engine algorithms – where objects that are similar to each other must be presented together and dissimilar objects should be ignored. Also, it is required to fetch objects that are closely related to a search term, if not completely related.
2. A similar application of text clustering like search engines can be seen in academics, where clustering can help in the associative analysis of various documents – which can in turn be used in plagiarism, copyright infringement, patent analysis etc.
3. Used in image segmentation in bioinformatics, where clustering algorithms have proven their worth in detecting cancerous cells from various medical imagery – eliminating the prevalent human errors and other bias.
4. Netflix has used clustering in implementing movie recommendations for its users.
5. News summarization can be performed using Cluster analysis where articles can be divided into a
group of related topics.
6. Clustering is used in getting recommendations for sports training for athletes based on their goals
and various body related metrics and assign the training regimen to the players accordingly.
7. Marketing and sales applications use clustering to identify the Demand-Supply gap based on
various past metrics – where a definitive meaning can be given to huge amounts of scattered data.
8. Various job search portals use clustering to divide job posting requirements into organized groups
which becomes easier for a job-seeker to apply and target for a suitable job.
9. Resumes of job-seekers can be segmented into groups based on various factors like skill-sets,
experience, strengths, type of projects, expertise etc., which makes potential employers connect
with correct resources.
10. Clustering effectively detects hidden patterns, rules, constraints, flow etc. based on various metrics
of traffic density from GPS data and can be used for segmenting routes and suggesting users with
best routes, location of essential services, search for objects on a map etc.
11. Satellite imagery can be segmented to find suitable and arable lands for agriculture.
12. Pizza Hut very famously used clustering to perform Customer Segmentation which helped them to
target their campaigns effectively and helped increase their customer engagement across various
channels.
13. Clustering can help in getting customer persona analysis based on various metrics of Recency,
Frequency, and Monetary metrics and build an effective User Profile – in-turn this can be used for
Customer Loyalty methods to curb customer churn.
14. Document clustering is effectively being used in preventing the spread of fake news on Social
Media.
15. Website network traffic can be divided into various segments, so that requests can be prioritized heuristically; this also helps in detecting and preventing malicious activities.
16. Fantasy sports have become a part of popular culture across the globe and clustering algorithms
can be used in identifying team trends, aggregating expert ranking data, player similarities, and
other strategies and recommendations for the users.


K-Means Clustering Algorithm | Examples

Pattern Recognition

K-Means Clustering-

● K-Means clustering is an unsupervised iterative clustering technique.
● It partitions the given data set into k predefined distinct clusters.
● A cluster is defined as a collection of data points exhibiting certain similarities.

It partitions the data set such that-

● Each data point belongs to a cluster with the nearest mean.
● Data points belonging to one cluster have high degree of similarity.
● Data points belonging to different clusters have high degree of dissimilarity.

K-Means Clustering Algorithm-

K-Means Clustering Algorithm involves the following steps-


Step-01:

● Choose the number of clusters K.

Step-02:

● Randomly select any K data points as cluster centers.


● Select cluster centers in such a way that they are as far apart from each other as possible.

Step-03:

● Calculate the distance between each data point and each cluster center.
● The distance may be calculated either by using the given distance function or by using the Euclidean distance formula.

Step-04:

● Assign each data point to some cluster.


● A data point is assigned to that cluster whose center is nearest to that data point.

Step-05:

● Re-compute the center of newly formed clusters.


● The center of a cluster is computed by taking mean of all the data points contained in
that cluster.

Step-06:

Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria
is met-

● Centers of newly formed clusters do not change


● Data points remain present in the same cluster
● Maximum number of iterations are reached
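A compact NumPy sketch of the six steps above (Lloyd's algorithm); the sample data, K and the random initial centers are illustrative:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # Step-02
    for _ in range(max_iter):                                   # Step-06 loop
        # Step-03: distance of every point to every center (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                               # Step-04
        # Step-05: re-compute centers as the mean of each cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                   # stopping criterion
            break
        centers = new_centers
    return centers, labels

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers, labels = k_means(X, k=3)
print(centers)
print(labels)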

Advantages-

K-Means Clustering Algorithm offers the following advantages-

Point-01:

It is relatively efficient with time complexity O(nkt) where-

● n = number of instances
● k = number of clusters
● t = number of iterations

Point-02:

● It often terminates at a local optimum.


● Techniques such as Simulated Annealing or Genetic Algorithms may be used to find


the global optimum.

Disadvantages-

K-Means Clustering Algorithm has the following disadvantages-

● It requires the number of clusters (k) to be specified in advance.


● It cannot handle noisy data and outliers.
● It is not suitable for identifying clusters with non-convex shapes.

PRACTICE PROBLEMS BASED ON K-MEANS


CLUSTERING ALGORITHM-

Problem-01:

Cluster the following eight points (with (x, y) representing locations) into three clusters:

A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).

The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-

Ρ(a, b) = |x2 – x1| + |y2 – y1|

Use K-Means Algorithm to find the three cluster centers after the second iteration.


Solution-

We follow the above discussed K-Means Clustering Algorithm-

Iteration-01:

● We calculate the distance of each point from each of the center of the three clusters.
● The distance is calculated by using the given distance function.

The following illustration shows the calculation of distance between point A1(2, 10) and each of
the center of the three clusters-

Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|

=0

Calculating Distance Between A1(2, 10) and C2(5, 8)-

Ρ(A1, C2)


= |x2 – x1| + |y2 – y1|

= |5 – 2| + |8 – 10|

=3+2

=5

Calculating Distance Between A1(2, 10) and C3(1, 2)-

Ρ(A1, C3)

= |x2 – x1| + |y2 – y1|

= |1 – 2| + |2 – 10|

=1+8

=9

In the similar manner, we calculate the distance of other points from each of the center of the
three clusters.

Next,

● We draw a table showing all the results.


● Using the table, we decide which point belongs to which cluster.
● The given point belongs to that cluster whose center is nearest to it.


Given Points Distance from Distance from Distance from Point belongs
center (2, 10) of center (5, 8) of center (1, 2) of to Cluster
Cluster-01 Cluster-02 Cluster-03

A1(2, 10) 0 5 9 C1

A2(2, 5) 5 6 4 C3

A3(8, 4) 12 7 9 C2

A4(5, 8) 5 0 10 C2

A5(7, 5) 10 5 9 C2

A6(6, 4) 10 5 7 C2

A7(1, 2) 9 10 0 C3

A8(4, 9) 3 2 10 C2

From here, New clusters are-

Cluster-01:

First cluster contains points-

● A1(2, 10)

Cluster-02:


Second cluster contains points-

● A3(8, 4)
● A4(5, 8)
● A5(7, 5)
● A6(6, 4)
● A8(4, 9)

Cluster-03:

Third cluster contains points-

● A2(2, 5)
● A7(1, 2)

Now,

● We re-compute the new cluster centers.


● The new cluster center is computed by taking mean of all the points contained in that
cluster.

For Cluster-01:

● We have only one point A1(2, 10) in Cluster-01.


● So, cluster center remains the same.

For Cluster-02:

Center of Cluster-02


= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)

= (6, 6)

For Cluster-03:

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

This is completion of Iteration-01.

Iteration-02:

● We calculate the distance of each point from each of the center of the three clusters.
● The distance is calculated by using the given distance function.

The following illustration shows the calculation of distance between point A1(2, 10) and each of
the center of the three clusters-

Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|


=0

Calculating Distance Between A1(2, 10) and C2(6, 6)-

Ρ(A1, C2)

= |x2 – x1| + |y2 – y1|

= |6 – 2| + |6 – 10|

=4+4

=8

Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-

Ρ(A1, C3)

= |x2 – x1| + |y2 – y1|

= |1.5 – 2| + |3.5 – 10|

= 0.5 + 6.5

=7

In the similar manner, we calculate the distance of other points from each of the center of the
three clusters.

Next,

● We draw a table showing all the results.


● Using the table, we decide which point belongs to which cluster.


● The given point belongs to that cluster whose center is nearest to it.

Given Points Distance from Distance from Distance from Point belongs
center (2, 10) of center (6, 6) of center (1.5, 3.5) of to Cluster
Cluster-01 Cluster-02 Cluster-03

A1(2, 10) 0 8 7 C1

A2(2, 5) 5 5 2 C3

A3(8, 4) 12 4 7 C2

A4(5, 8) 5 3 8 C2

A5(7, 5) 10 2 7 C2

A6(6, 4) 10 2 5 C2

A7(1, 2) 9 9 2 C3

A8(4, 9) 3 5 8 C1

From here, New clusters are-

Cluster-01:

First cluster contains points-

● A1(2, 10)


● A8(4, 9)

Cluster-02:

Second cluster contains points-

● A3(8, 4)
● A4(5, 8)
● A5(7, 5)
● A6(6, 4)

Cluster-03:

Third cluster contains points-

● A2(2, 5)
● A7(1, 2)

Now,

● We re-compute the new cluster centers.


● The new cluster center is computed by taking mean of all the points contained in that
cluster.

For Cluster-01:

Center of Cluster-01

= ((2 + 4)/2, (10 + 9)/2)

= (3, 9.5)


For Cluster-02:

Center of Cluster-02

= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)

= (6.5, 5.25)

For Cluster-03:

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

This is completion of Iteration-02.

After second iteration, the center of the three clusters are-

● C1(3, 9.5)
● C2(6.5, 5.25)
● C3(1.5, 3.5)
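The two iterations above can be checked quickly in NumPy using the given Manhattan distance ρ(a, b) = |x2 – x1| + |y2 – y1|; this snippet only verifies the worked example:

import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)   # A1, A4, A7

for it in range(2):
    # Manhattan distance of every point to every center
    d = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print(f"centers after iteration {it + 1}:\n{centers}")
# Expected final centers: (3, 9.5), (6.5, 5.25), (1.5, 3.5)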

Problem-02:

Use K-Means Algorithm to create two clusters for the points A(2, 2), B(3, 2), C(1, 1), D(3, 1) and E(1.5, 0.5)-


Solution-

We follow the above discussed K-Means Clustering Algorithm.

Assume A(2, 2) and C(1, 1) are centers of the two clusters.

Iteration-01:

● We calculate the distance of each point from each of the center of the two clusters.
● The distance is calculated by using the Euclidean distance formula.

The following illustration shows the calculation of distance between point A(2, 2) and each of the
center of the two clusters-

Calculating Distance Between A(2, 2) and C1(2, 2)-


Ρ(A, C1)

= sqrt [ (x2 – x1)² + (y2 – y1)² ]

= sqrt [ (2 – 2)² + (2 – 2)² ]

= sqrt [ 0 + 0 ]

=0

Calculating Distance Between A(2, 2) and C2(1, 1)-

Ρ(A, C2)

= sqrt [ (x2 – x1)² + (y2 – y1)² ]

= sqrt [ (1 – 2)² + (1 – 2)² ]

= sqrt [ 1 + 1 ]

= sqrt [ 2 ]

= 1.41

In the similar manner, we calculate the distance of other points from each of the center of the
two clusters.

Next,

● We draw a table showing all the results.


● Using the table, we decide which point belongs to which cluster.
● The given point belongs to that cluster whose center is nearest to it.


Given Points Distance from center Distance from center Point belongs to
(2, 2) of Cluster-01 (1, 1) of Cluster-02 Cluster

A(2, 2) 0 1.41 C1

B(3, 2) 1 2.24 C1

C(1, 1) 1.41 0 C2

D(3, 1) 1.41 2 C1

E(1.5, 0.5) 1.58 0.71 C2

From here, New clusters are-

Cluster-01:

First cluster contains points-

● A(2, 2)
● B(3, 2)
● D(3, 1)

Cluster-02:

Second cluster contains points-

● C(1, 1)


● E(1.5, 0.5)

Now,

● We re-compute the new cluster centers.


● The new cluster center is computed by taking mean of all the points contained in that
cluster.

For Cluster-01:

Center of Cluster-01

= ((2 + 3 + 3)/3, (2 + 2 + 1)/3)

= (2.67, 1.67)

For Cluster-02:

Center of Cluster-02

= ((1 + 1.5)/2, (1 + 0.5)/2)

= (1.25, 0.75)

This is completion of Iteration-01.

Next, we go to iteration-02, iteration-03 and so on until the centers do not change anymore.


SNS COLLEGE OF ENGINEERING
Coimbatore-107

Clustering High-Dimensional Data: CLIQUE and PROCLUS

Clustering High-Dimensional Data


• Clustering high-dimensional data
– Many applications: text documents, DNA micro-array data
– Major challenges:
• Many irrelevant dimensions may mask clusters
• Distance measure becomes meaningless—due to equi-distance
• Clusters may exist only in some subspaces
• Methods
– Feature transformation: only effective if most dimensions are relevant
• PCA & SVD useful only when features are highly correlated/redundant
– Feature selection: wrapper or filter approaches
• useful to find a subspace where the data have nice clusters
– Subspace-clustering: find clusters in all the possible subspaces
• CLIQUE, ProClus, and frequent pattern-based clustering


The Curse of Dimensionality


(graphs adapted from Parsons et al. KDD Explorations
2004)
• Data in only one dimension is relatively packed
• Adding a dimension “stretch” the points across
that dimension, making them further apart
• Adding more dimensions will make the points
further apart—high dimensional data is extremely
sparse
• Distance measure becomes meaningless—due to
equi-distance


CLIQUE (Clustering In QUEst)


• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifying subspaces of a high dimensional data space that
allow better clustering than original space
• CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal length interval
– It partitions an m-dimensional data space into non-overlapping rectangular units
– A unit is dense if the fraction of total data points contained in the unit exceeds the
input model parameter
– A cluster is a maximal set of connected dense units within a subspace


CLIQUE: The Major Steps


• Partition the data space and find the number of points that lie
inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori
principle
• Identify clusters
– Determine dense units in all subspaces of interests
– Determine connected dense units in all subspaces of interests.
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected dense units
for each cluster
– Determination of minimal cover for each cluster

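A simplified two-dimensional sketch of the first steps just listed (equal-width grid, density threshold); this only illustrates the dense-unit idea, not the full subspace search of CLIQUE, and all names and thresholds are illustrative:

import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 0.3, size=(200, 2)),
               rng.normal(7, 0.3, size=(200, 2)),
               rng.uniform(0, 10, size=(40, 2))])       # background noise

xi = 10          # number of equal-length intervals per dimension
tau = 0.03       # a unit is dense if it holds more than 3% of all points

# Partition each dimension into xi intervals and map each point to a grid cell
edges = [np.linspace(X[:, d].min(), X[:, d].max(), xi + 1) for d in range(2)]
cell = np.stack([np.clip(np.digitize(X[:, d], edges[d]) - 1, 0, xi - 1)
                 for d in range(2)], axis=1)

counts = {}
for c in map(tuple, cell):
    counts[c] = counts.get(c, 0) + 1

dense_units = {c for c, n in counts.items() if n / len(X) > tau}
print("dense grid units:", sorted(dense_units))
# Connected dense units (sharing a face) would then be merged into clusters.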

Strength and Weakness of CLIQUE


• Strength
– automatically finds subspaces of the highest dimensionality such that
high density clusters exist in those subspaces
– insensitive to the order of records in input and does not presume some
canonical data distribution
– scales linearly with the size of input and has good scalability as the
number of dimensions in the data increases
• Weakness
– The accuracy of the clustering result may be degraded at the expense of
simplicity of the method


Frequent Pattern-Based Approach

• Clustering high-dimensional space (e.g., clustering text documents,


microarray data)
– Projected subspace-clustering: which dimensions to be projected on?
• CLIQUE, ProClus
– Feature extraction: costly and may not be effective?
– Using frequent patterns as “features”
• “Frequent” are inherent features
• Mining freq. patterns may not be so expensive

• Typical methods
– Frequent-term-based document clustering
– Clustering by pattern similarity in micro-array data (pClustering)


Clustering by Pattern Similarity (p-Clustering)


• Right: The micro-array “raw” data shows
3 genes and their values in a multi-
dimensional space
– Difficult to find their patterns
• Bottom: Some subsets of dimensions
form nice shift and scaling patterns


Why p-Clustering?
• Microarray data analysis may need to
– Clustering on thousands of dimensions (attributes)
– Discovery of both shift and scaling patterns
• Clustering with Euclidean distance measure? — cannot find shift patterns
• Clustering on derived attribute Aij = ai – aj? — introduces N(N-1) dimensions
• Bi-cluster using transformed mean-squared residue score matrix (I, J)
   – where d_iJ = (1/|J|) Σ_{j in J} d_ij,  d_Ij = (1/|I|) Σ_{i in I} d_ij,
     d_IJ = (1/(|I||J|)) Σ_{i in I, j in J} d_ij,
     and H(I, J) = (1/(|I||J|)) Σ_{i in I, j in J} (d_ij − d_iJ − d_Ij + d_IJ)²
   – A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
• Problems with bi-cluster
   – No downward closure property
   – Due to averaging, it may contain outliers but still be within the δ-threshold

p-Clustering
• Given objects x, y in O and features a, b in T, a pCluster is a 2 by 2 matrix

      | d_xa  d_xb |
      | d_ya  d_yb |

  with pScore of the matrix = | (d_xa − d_xb) − (d_ya − d_yb) |
• A pair (O, T) is in a δ-pCluster if for any 2 by 2 matrix X in (O, T), pScore(X) ≤ δ for some δ > 0
• Properties of δ-pCluster
   – Downward closure
   – Clusters are more homogeneous than bi-cluster (thus the name: pair-wise Cluster)
• Pattern-growth algorithm has been developed for efficient mining
• For scaling patterns, one can observe that taking the logarithm on (d_xa / d_ya) / (d_xb / d_yb) ≤ δ will lead to the pScore form

Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods


Model based clustering


• Assume data generated from K probability
distributions
• Typically Gaussian distributions; a soft or probabilistic version of K-means clustering
• Need to find distribution parameters.
• EM Algorithm


EM Algorithm

• Initialize K cluster centers
• Iterate between two steps
   – Expectation step: assign points to clusters

        P(d_i ∈ c_k) = w_k Pr(d_i | c_k) / Σ_j w_j Pr(d_i | c_j)

        w_k = Σ_i P(d_i ∈ c_k) / N

   – Maximization step: estimate model parameters

        μ_k = (1/m) Σ_{i=1..m} [ d_i P(d_i ∈ c_k) / Σ_j P(d_i ∈ c_j) ]
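A tiny NumPy sketch of the E and M steps for one-dimensional Gaussians with fixed unit variance; note that the maximization step below uses the usual weighted-mean form of the update, and all names and data are illustrative:

import numpy as np

rng = np.random.default_rng(0)
d = np.concatenate([rng.normal(0, 1, 150), rng.normal(6, 1, 150)])   # data
K, N = 2, len(d)
mu = rng.choice(d, size=K)          # initialize K cluster centers
w = np.full(K, 1.0 / K)             # mixture weights

def gauss(x, m):                    # Pr(d_i | c_k) for unit variance
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

for _ in range(50):
    # Expectation: P(d_i in c_k) proportional to w_k Pr(d_i | c_k)
    p = w[None, :] * gauss(d[:, None], mu[None, :])
    p /= p.sum(axis=1, keepdims=True)
    # Maximization: update weights and means from the soft assignments
    w = p.sum(axis=0) / N
    mu = (p * d[:, None]).sum(axis=0) / p.sum(axis=0)

print("estimated means:", mu.round(2), "weights:", w.round(2))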

Frequent Pattern Based Clustering Methods

What Is Frequent Pattern Analysis?


• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why Is Freq. Pattern Mining Important?

• Freq. pattern: An intrinsic and important property of datasets


• Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
– Classification: discriminative, frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering
– Data warehousing: iceberg cube and cube-gradient
– Semantic data compression: fascicles
– Broad applications

Basic Concepts: Frequent Patterns


Tid    Items bought
10     Beer, Nuts, Diaper
20     Beer, Coffee, Diaper
30     Beer, Diaper, Eggs
40     Nuts, Eggs, Milk
50     Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
• k-itemset X = {x1, …, xk}
• (absolute) support, or, support count of X: frequency or occurrence of an itemset X
• (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold

Basic Concepts: Association Rules


Tid    Items bought
10     Beer, Nuts, Diaper
20     Beer, Coffee, Diaper
30     Beer, Diaper, Eggs
40     Nuts, Eggs, Milk
50     Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
   – support, s, probability that a transaction contains X ∪ Y
   – confidence, c, conditional probability that a transaction having X also contains Y
• Let minsup = 50%, minconf = 50%
   Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
   Association rules (many more!):
      Beer → Diaper (60%, 100%)
      Diaper → Beer (60%, 75%)
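The support and confidence figures above can be reproduced in a few lines of Python; the helper support() is illustrative:

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

s = support({"Beer", "Diaper"})                        # 3/5 = 60%
c = support({"Beer", "Diaper"}) / support({"Beer"})    # 3/3 = 100%
print(f"Beer -> Diaper: support {s:.0%}, confidence {c:.0%}")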

Closed Patterns and Max-Patterns


• A long pattern contains a combinatorial number of sub-
patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) =
2^100 – 1 ≈ 1.27×10^30 sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-
pattern Y ⊃ X, with the same support as X (proposed by
Pasquier, et al. @ ICDT'99)
• An itemset X is a max-pattern if X is frequent and there exists
no frequent super-pattern Y ⊃ X (proposed by Bayardo @
SIGMOD’98)
• Closed pattern is a lossless compression of freq. patterns
– Reducing the # of patterns and rules

Closed Patterns and Max-Patterns


• Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
– Min_sup = 1.
• What is the set of closed itemset?
– <a1, …, a100>: 1
– < a1, …, a50>: 2
• What is the set of max-pattern?
– <a1, …, a100>: 1
• What is the set of all patterns?
– !!

Computational Complexity of Frequent Itemset Mining

• How many itemsets are potentially to be generated in the worst case?


– The number of frequent itemsets to be generated is sensitive to the minsup threshold
– When minsup is low, there exist potentially an exponential number of frequent itemsets
– The worst case: M^N where M: # distinct items, and N: max length of transactions
• The worst case complexity vs. the expected probability
   – Ex. Suppose Walmart has 10^4 kinds of products
      • The chance to pick up one product: 10^-4
      • The chance to pick up a particular set of 10 products: ~10^-40
      • What is the chance this particular set of 10 products to be frequent 10^3 times in 10^9 transactions?

Chapter 5: Mining Frequent Patterns, Association and


Correlations: Basic Concepts and Methods

• Basic Concepts

• Frequent Itemset Mining Methods

• Which Patterns Are Interesting?—Pattern

Evaluation Methods

• Summary

Scalable Frequent Itemset Mining Methods

• Apriori: A Candidate Generation-and-Test Approach

• Improving the Efficiency of Apriori

• FPGrowth: A Frequent Pattern-Growth Approach

• ECLAT: Frequent Pattern Mining with Vertical Data

Format


The Downward Closure Property and Scalable


Mining Methods
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent
– If {beer, diaper, nuts} is frequent, so is {beer, diaper}
– i.e., every transaction having {beer, diaper, nuts} also contains
{beer, diaper}
• Scalable mining methods: Three major approaches
– Apriori (Agrawal & Srikant@VLDB’94)
– Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
– Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)


Apriori: A Candidate Generation & Test Approach

• Apriori pruning principle: If there is any itemset which is


infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from length k
frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be
generated

The Apriori Algorithm—An Example


Supmin = 2

Database TDB:
Tid    Items
10     A, C, D
20     B, C, E
30     A, B, C, E
40     B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (itemsets with sup ≥ 2): {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidates generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidates generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2

The Apriori Algorithm (Pseudo-Code)


Ck: Candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}

How to Count Supports of Candidates?

• Why counting supports of candidates a problem?


– The total number of candidates can be very huge
– One transaction may contain many candidates
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and
counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a
transaction

Counting Supports of Candidates Using Hash Tree

[Figure: a hash tree storing the candidate 3-itemsets, built with the hash function 1,4,7 / 2,5,8 / 3,6,9 on items; the subset function walks the tree for transaction {1 2 3 5 6} to find all candidates contained in it.]

Candidate Generation: An SQL Implementation


• SQL Implementation of candidate generation
– Suppose the items in Lk-1 are listed in an order
– Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
– Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
• Use object-relational extensions like UDFs, BLOBs, and Table functions for efficient
implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association
rule mining with relational database systems: Alternatives and implications.
SIGMOD’98]

Scalable Frequent Itemset Mining Methods

• Apriori: A Candidate Generation-and-Test Approach

• Improving the Efficiency of Apriori

• FPGrowth: A Frequent Pattern-Growth Approach

• ECLAT: Frequent Pattern Mining with Vertical Data Format

• Mining Close Frequent Patterns and Maxpatterns


Further Improvement of the Apriori Method

• Major computational challenges


– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates


Partition: Scan Database Only Twice


• Any itemset that is potentially frequent in DB must be frequent
in at least one of the partitions of DB
– Scan 1: partition database and find local frequent patterns
– Scan 2: consolidate global frequent patterns
• A. Savasere, E. Omiecinski and S. Navathe, VLDB’95

DB1 + DB2 + … + DBk = DB

sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, supk(i) < σ|DBk|  ⇒  sup(i) < σ|DB|

DHP: Reduce the Number of Candidates


• A k-itemset whose corresponding hashing bucket count is below the
threshold cannot be frequent
– Candidates: a, b, c, d, e
– Hash entries (itemsets in a bucket, with bucket count), e.g.:
   • {ab, ad, ae} → count 35
   • {bd, be, de} → count 88
   • …
   • {yz, qs, wt} → count 102
– Frequent 1-itemset: a, b, d, e
– ab is not a candidate 2-itemset if the count of the bucket holding {ab, ad, ae} is below the support threshold
• J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95

Sampling for Frequent Patterns

• Select a sample of original database, mine frequent patterns


within sample using Apriori
• Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
– Example: check abcd instead of ab, ac, …, etc.
• Scan database again to find missed frequent patterns
• H. Toivonen. Sampling large databases for association rules. In
VLDB’96


SNS COLLEGE OF ENGINEERING
Coimbatore-107

Clustering in Non-Euclidean Space, Clustering for Streams and Parallelism

Main Topics
 What is Clustering?
 Distance measures and spaces.
 Algorithmic approaches.
 The curse of dimensionality.
 Hierarchical clustering.
 Point-assignment clustering.
 Non-main-memory data clustering.
 Summary and other topics.

What is Clustering?

 The process of examining a collection of "points" and grouping them into clusters according to some distance measure.
 The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.
Main issues
 Data is very large.
 High dimensional data space.
 Data space is not Euclidean (e.g. NLP problems).

Clustering illustration

Example of data clustered into 3 clusters based on Euclidean space.

Distance measures and spaces

• A distance measure requires 3 conditions (was given in previous lectures).
Distance examples:
• Euclidean distance (for sets of points).
Given by: d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xd − yd)² )

Distance measures and spaces

• Jaccard distance (for sample sets).
Given by: d(A, B) = 1 − |A ∩ B| / |A ∪ B|

• Cosine distance (for sets of vectors).
Given by the angle between the two vectors, whose cosine is (x · y) / (‖x‖ ‖y‖).

• Edit distance (comparing strings).
Given two strings a and b, the edit distance is the minimum number of operations (insertion, deletion, substitution) that transform a into b.

And many more…
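Small self-contained Python helpers matching these definitions; the names are illustrative, and cosine distance is taken here as the angle between the vectors:

import math

def jaccard_distance(a: set, b: set) -> float:
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(x, y) -> float:
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return math.acos(dot / (nx * ny))          # angle in radians

print(jaccard_distance({1, 2, 3}, {2, 3, 7}))   # 0.5
print(cosine_distance([1, 0], [1, 1]))          # ~0.785 (45 degrees)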

Algorithmic Approaches

There are two main approaches:
• Hierarchical algorithms:
   – Agglomerative (bottom-up): Start with each point as a cluster. Clusters are combined based on their "closeness", using some definition of "close" (will be discussed later).
   – Divisive (top-down): Start with one cluster including all points and recursively split each cluster based on some criterion. Will not be discussed in this presentation.
• Point assignment algorithms:
   – Points are considered in some order, and each one is assigned to the cluster into which it best fits.

Algorithmic Approaches

Other clustering algorithm distinctions:
• Whether the algorithm assumes a Euclidean space, or whether the algorithm works for an arbitrary distance measure (as those mentioned before).
• Whether the algorithm assumes that the data is small enough to fit in main memory, or whether data must reside primarily in secondary memory.

The curse of dimensionality

• Refers to high dimensional data properties which might make the clustering task much harder or yield bad results.
• Properties:
• In high dimensional data all points are equally far away from one another.
  Consider points x = [x1, x2, …, xd] where the dimension d is large, selected uniformly from a d-dimensional unit cube. Each coordinate xi is a random variable chosen uniformly from the range [0,1].

  The Euclidean distance between two points x and y is:

      d(x, y) = sqrt( Σ_{i=1..d} (xi − yi)² )

  If the dimension d is high, we can expect that for some i, |xi − yi| will be close to 1. That puts a lower bound of about 1 between almost any two points. The upper bound is given by sqrt(d). Further calculations give stronger bounds.
  Hence, it should be hard to find clusters among so many pairs that are all at approximately the same distance.
• Should be handled by dimensionality reduction methods.

Hierarchical Clustering

Hierarchical Clustering

We first consider Euclidean space. The algorithm:

- While stop condition is false Do
    - Pick the best two clusters to merge.
    - Combine them into one cluster.
- End;

Hierarchical Clustering

Three important questions:

1. How do you represent a cluster with more than one point?
2. How will you choose which two clusters to merge?
3. When will we stop combining clusters?

Hierarchical Clustering

• Since we assume Euclidean space, we represent a cluster by its centroid, or average of the points in the cluster. Of course, in clusters with one point, that point is the centroid.

• Merging rule: merge the two clusters with the shortest Euclidean distance between their centroids.

• Stopping rules: We may know in advance how many clusters there should be, and stop when this number is reached. Or stop merging when the minimum distance between any two clusters is greater than some threshold.
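A short SciPy sketch that mirrors these rules, using centroid linkage and either stopping condition; the data and the threshold are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(30, 2)) for c in ([0, 0], [5, 0], [2, 6])])

# 'centroid' linkage merges the two clusters with the closest centroids
Z = linkage(X, method="centroid")

# Stop by asking for a fixed number of clusters...
labels = fcluster(Z, t=3, criterion="maxclust")
# ...or, alternatively, stop once the merge distance exceeds a threshold:
labels_by_threshold = fcluster(Z, t=2.0, criterion="distance")

print(np.bincount(labels))   # cluster sizes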

Hierarchical Clustering – Clustering illustration

Hierarchical Clustering – Tree representation

• The tree represents the way in which all the points were combined.
• That may help in drawing conclusions about the data, together with how many clusters there should be.

Hierarchical Clustering – Controlling clustering

Alternative rules for controlling hierarchical


clustering:
• Take the distance between two clusters to be the minimum of
the distances between any two points, one chosen from each
cluster.
For example in phase 2 we would next combine (10,5) with the
two points cluster .
• Take the distance between two clusters to be the average
distance between all pair of points, one from each cluster.
• The Radius of a cluster is the maximum distance between all
the points and the centroid. Combine the two clusters whose
resulting cluster has the lowest radius. May use also average
or sum of squares of distances from the centroid.

Hierarchical Clustering – Controlling clustering

Continuation.
• The Diameter of a cluster is the maximum
distance between any two points of the cluster.
We merge those clusters whose resulting cluster
has the lowest diameter.
For example, the centroid of the cluster in step 3
is (11,4), so the radius will be
And the diameter will be


Hierarchical Clustering – Stopping rules


Alternative stopping rules.
• Stop if the diameter of cluster results from the best merger
exceeds some threshold.
• Stop if the density of the cluster that results from the best
merger is lower than some threshold. The density may be
defined as the number of cluster points per unit volume of
the cluster. Volume may be some power of the radius or
diameter.
• Stop when there is evidence that next pair of clusters to be
combined yields bad cluster. For example, if we track the
average diameter of all clusters, we will see a sudden jump in
that value when a bad merge occurred.


Hierarchical Clustering in Non-Euclidean spaces

• Main problem: We use distance measures such as those mentioned at the beginning, so we can't base distances on the location of points.
The problem arises when we need to represent a cluster, because we cannot replace a collection of points by their centroid (e.g., a Euclidean space vs. a space of strings with edit distance).

Hierarchical Clustering in Non-Euclidean spaces

Example:
Suppose we use edit distance between strings; there is no string that represents their average.

Solution:
We pick one of the points in the cluster itself to represent the cluster. This point should be selected as close to all the points in the cluster, so it represents some kind of "center".
We call the representative point the Clustroid.

Hierarchical Clustering in Non-Euclidean spaces
Selecting the clustroid.
There are few ways of selecting the clustroid
point:
Select as clustroid the point that minimize:
1. The sum of the distances to the other points in
the cluster.
2. The maximum distance to another point in the
cluster.
3. The sum of the squares of the distances to the
other points in the cluster.

Hierarchical Clustering in Non-Euclidean spaces
Example:
• Using edit distance.
• Cluster points: abcd, aecdb, abecb, ecdab.
Their distances:

Applying the three clustroid criteria to each of the four points:


Hierarchical Clustering in Non-Euclidean spaces

Results:
For every criterion selected, "aecdb" will be selected as the clustroid.
Measuring distance between clusters:
Using the clustroid instead of the centroid, we can apply all the options used for the Euclidean space measure.
That includes:
• The minimum distance between any pair of points.
• Average distance between all pairs of points.
• Using radius or diameter (the same definition).
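A small Python sketch that recomputes the clustroid of the four example strings under the three criteria, using a plain dynamic-programming edit distance with insertion, deletion and substitution (as defined earlier); the implementation is illustrative:

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

points = ["abcd", "aecdb", "abecb", "ecdab"]
for p in points:
    dists = [edit_distance(p, q) for q in points if q != p]
    print(p, "sum:", sum(dists), "max:", max(dists),
          "sum of squares:", sum(d * d for d in dists))
# The point minimizing a chosen criterion is taken as the clustroid.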

Hierarchical Clustering in Non-Euclidean spaces

Stopping criterion:
• The stopping criteria do not directly use centroids, except the radius, which is valid also for non-Euclidean spaces.
• So all criteria may be used for non-Euclidean spaces as well.

K – Means Algorithms

• Best known point-assignment clustering algorithms.
• Works well in practice.
• Assumes a Euclidean space.
• Assumes the number of clusters K is known in advance.

K – Means Algorithm
The algorithm:
Initially choose k points which are likely to be in different clusters;
Make this points the centroids of this clusters;
FOR each remaining point p DO
Find the centroids to which p is closest;
Add p to the cluster of that centroid;
Adjust the centroid of that cluster to account for p;
END;
• Optional: Fix the centroids of the clusters and assign each point
to the k clusters (usually does not influence).


K – Means Algorithm
• Illustration (similar to our algorithm):


K – Means Algorithm
Initializing clusters.
Few approaches:
• Pick points that are as far away from one another as possible.
• Cluster a sample of the data (perhaps hierarchically) so there
are k clusters. Pick a point from each cluster (perhaps that point
closest to cluster centroid).

Algorithm for the first approach:


Pick the first point at random;
WHILE there are fewer than k points DO
    Add the point whose minimum distance from the selected points is as large as possible;
END;
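A direct NumPy translation of this initialization loop; the data and k are illustrative:

import numpy as np

def farthest_point_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    chosen = [rng.integers(len(X))]              # pick the first point at random
    while len(chosen) < k:
        d = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2)
        score = d.min(axis=1)                    # distance to nearest chosen point
        score[chosen] = -1                       # never re-pick a chosen point
        chosen.append(int(score.argmax()))
    return X[chosen]

X = np.array([[2, 2], [3, 2], [6, 8], [12, 3], [1, 1], [1.5, 0.5]])
print(farthest_point_init(X, k=3))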

K – Means Algorithm
Example for initializing clusters:
We have the following set of points:

• We first pick a starting point, which is (6,8). That's the first point.
• The furthest point from (6,8) is (12,3), so that's the next point.

K – Means Algorithm
• Now, we check the point whose minimum distance to either
(6,8) or (12,3) is the maximum.
d((2,2),(6,8)) = 7.21, d((2,2),(12,3)) = 10.05.
So, the score is min(7.21, 10.05)= 7.21.


K – Means Algorithm
• Picking the right value of k:
• Recall measures of appropriateness of clusters, i.e. radius or diameter.
• We run k-means on a series of values of k, say 1, …, 10, and search for a significant decrease in the average cluster diameter, after which it does not change much.
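A sketch of this "run k-means for k = 1, …, 10 and look for the knee" idea, using scikit-learn's KMeans and its within-cluster sum of squares (inertia_) as the quality measure; average cluster diameter could be used instead, and the data here are illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [5, 0], [2, 6])])

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# The value of k where the measure stops decreasing sharply (here k = 3)
# is a reasonable choice.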