
A Pattern Growth Approach



Two issues in Apriori
◼ It may still need to generate a huge number of
candidate sets
◼ If there are 10^4 frequent 1-itemsets, Apriori needs
to generate more than 10^7 candidate 2-itemsets

◼ It may need to repeatedly scan the whole dataset
and check a large set of candidates by pattern
matching
◼ It is costly to go over each transaction in the database to
determine the support of the candidate itemsets



FP-growth: Frequent Pattern-Growth

 Adopts a divide-and-conquer strategy

 Compresses the database representing frequent items into a
frequent-pattern tree, or FP-tree

 Retains the itemset association information

 Divides the compressed database into a set of conditional
databases, each associated with one frequent item

 Mines each such database separately



Example: FP-growth

 The first scan of the data is the same as in Apriori
 Derive the set of frequent 1-itemsets

Transactional Database
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Item ID   Support count
I1        6
I2        7
I3        6
I4        2
I5        2

 Let min_sup = 2
 Generate the set of ordered items (apply the min_sup = 2
condition and write the items in descending order of support count)

Item ID   Support count
I2        7
I1        6
I3        6
I4        2
I5        2
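
The first scan and the ordered-item list can be reproduced with a short Python sketch (a minimal illustration; the names transactions, min_sup and order are not from the slides):

from collections import Counter

# Toy database from the slide; min_sup = 2 (absolute support count).
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
min_sup = 2

# First scan: count item supports, keep the frequent items,
# and order them by descending support count.
counts = Counter(item for t in transactions for item in t)
freq = {item: c for item, c in counts.items() if c >= min_sup}
order = sorted(freq, key=lambda item: (-freq[item], item))
print(order)                                  # ['I2', 'I1', 'I3', 'I4', 'I5']

# Each transaction is then rewritten with its frequent items in this order,
# e.g. T100 {I1, I2, I5} becomes [I2, I1, I5].
ordered_txns = [[i for i in order if i in t] for t in transactions]
print(ordered_txns[0])                        # ['I2', 'I1', 'I5']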



Construct the FP-Tree
- Create a branch for each transaction
- Items in each transaction are processed in order
1- Order the items of T100: {I2, I1, I5}
2- Construct the first branch: <I2:1>, <I1:1>, <I5:1>

Tree so far: null -> I2:1 -> I1:1 -> I5:1


Construct the FP-Tree
1- Order the items of T200: {I2, I4}
2- Construct the second branch: <I2:1>, <I4:1>

Tree so far: null -> I2:2 -> { I1:1 -> I5:1 , I4:1 }


Construct the FP-Tree
1- Order the items of T300: {I2, I3}
2- Construct the third branch: <I2:2>, <I3:1>

Tree so far: null -> I2:3 -> { I1:1 -> I5:1 , I4:1 , I3:1 }


Construct the FP-Tree
1- Order the items of T400: {I2, I1, I4}
2- Construct the fourth branch: <I2:3>, <I1:1>, <I4:1>

Tree so far: null -> I2:4 -> { I1:2 -> { I5:1 , I4:1 } , I4:1 , I3:1 }
Construct the FP-Tree
1- Order the items of T500: {I1, I3}
2- Construct the fifth branch: <I1:1>, <I3:1>

Tree so far: null -> { I2:4 -> { I1:2 -> { I5:1 , I4:1 } , I4:1 , I3:1 } , I1:1 -> I3:1 }
Construct the FP-Tree
After processing all nine transactions, the complete FP-tree is:

null
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2

When a branch of a transaction is added, the count for each node
along a common prefix is incremented by 1.
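
A minimal sketch of this insertion logic (the FPNode class and insert function are illustrative names, not from the slides; the node-links to a header table are omitted):

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}                    # item -> FPNode

def insert(root, ordered_items):
    # Walk the ordered transaction, incrementing counts along the shared
    # prefix and creating new child nodes where the path diverges.
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
        else:
            child.count += 1
        node = child

root = FPNode(None)
root.count = 0
ordered_txns = [
    ["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"], ["I2", "I1", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I2", "I1", "I3", "I5"],
    ["I2", "I1", "I3"],
]
for t in ordered_txns:
    insert(root, t)

# The children of the root match the two branches of the tree above: I2:7 and I1:2.
print({item: child.count for item, child in root.children.items()})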
Construct the FP-Tree

The problem of mining frequent patterns in databases is
transformed to that of mining the FP-tree.


Mining the FP-Tree: suffix I5

- Occurrences of I5: <I2, I1, I5> and <I2, I1, I3, I5>
- Two prefix paths: <I2, I1: 1> and <I2, I1, I3: 1>
- The conditional FP-tree contains only <I2: 2, I1: 2>; I3 is not
considered because its support count of 1 is less than the
minimum support count.
- Frequent patterns: {I2,I5:2}, {I1,I5:2}, {I2,I1,I5:2}
Mining the FP-Tree: conditional pattern bases

Item   Conditional Pattern Base           Conditional FP-tree
I5     {{I2,I1:1}, {I2,I1,I3:1}}          <I2:2, I1:2>
I4     {{I2,I1:1}, {I2:1}}                <I2:2>
I3     {{I2,I1:2}, {I2:2}, {I1:2}}        <I2:4, I1:2>, <I1:2>
I1     {{I2:4}}                           <I2:4>
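
A small sketch of how the I5 row is obtained (the pattern base is hard-coded from the table above; because the resulting conditional FP-tree is a single path, the support of each pattern is simply the minimum count along the path):

from collections import Counter
from itertools import combinations

min_sup = 2
# Conditional pattern base for suffix I5, taken from the table above:
# each entry is (prefix path, count).
pattern_base_I5 = [(["I2", "I1"], 1), (["I2", "I1", "I3"], 1)]

# Count item supports inside the pattern base; items below min_sup (here I3)
# are dropped, leaving the conditional FP-tree <I2:2, I1:2>.
support = Counter()
for path, cnt in pattern_base_I5:
    for item in path:
        support[item] += cnt
cond_items = {i: c for i, c in support.items() if c >= min_sup}
print(cond_items)                             # {'I2': 2, 'I1': 2}

# Every non-empty combination of the conditional items, concatenated with the
# suffix I5, is a frequent pattern: {I2,I5:2}, {I1,I5:2}, {I2,I1,I5:2}.
for r in range(1, len(cond_items) + 1):
    for combo in combinations(cond_items, r):
        sup = min(cond_items[i] for i in combo)
        print(set(combo) | {"I5"}, ":", sup)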


Mining the FP-Tree: frequent patterns generated

Item   Conditional FP-tree            Frequent Patterns Generated
I5     <I2:2, I1:2>                   {I2,I5:2}, {I1,I5:2}, {I2,I1,I5:2}
I4     <I2:2>                         {I2,I4:2}
I3     <I2:4, I1:2>, <I1:2>           {I2,I3:4}, {I1,I3:4}, {I2,I1,I3:2}
I1     <I2:4>                         {I2,I1:4}


FP-growth properties

 FP-growth transforms the problem of finding long frequent patterns
into searching for shorter ones recursively and then concatenating the
suffix

 It uses the least frequent suffix, offering good selectivity

 It reduces the search cost

 If the tree does not fit into main memory, partition the database

 Efficient and scalable for mining both long and short frequent
patterns


Generating Association Rules

 Once the frequent itemsets have been found, it is straightforward
to generate strong association rules that satisfy:

  minimum support
  minimum confidence

 Relation between support and confidence:

   confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)

 support_count(A ∪ B) is the number of transactions containing the
itemset A ∪ B
 support_count(A) is the number of transactions containing the
itemset A.
Generating Association Rules

 For each frequent itemset L, generate all non-empty proper subsets of L

 For every non-empty proper subset S of L, output the rule:

   S => (L − S)

 if support_count(L) / support_count(S) >= min_conf
 (this ratio is the confidence of the rule)


Example
 Suppose the frequent itemset L = {I1, I2, I5}

Transactional Database
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

 The non-empty proper subsets of L are:
   S = {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}
 Association rules S => (L − S):
   I1 ∧ I2 => I5   confidence = 2/4 = 50%
   I1 ∧ I5 => I2   confidence = 2/2 = 100%
   I2 ∧ I5 => I1   confidence = 2/2 = 100%
   I1 => I2 ∧ I5   confidence = 2/6 = 33%
   I2 => I1 ∧ I5   confidence = 2/7 = 29%
   I5 => I1 ∧ I2   confidence = 2/2 = 100%
 A rule is output only if support_count(L)/support_count(S) >= min_conf.
 If the minimum confidence is 70%, then for the first rule support_count(L) = 2
 and support_count(S) = 4, giving 2/4 = 50%, which is below 70%, so that rule is
 not strong; only the rules with 100% confidence qualify.
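
The subset enumeration and confidence check can be written down directly; the following sketch (variable names are illustrative) reproduces the six rules and flags which ones meet min_conf = 70%:

from itertools import combinations

transactions = [
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
    {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

L = {"I1", "I2", "I5"}
min_conf = 0.70
for r in range(1, len(L)):                     # non-empty proper subsets S of L
    for S in map(set, combinations(L, r)):
        conf = support_count(L) / support_count(S)
        strong = "strong" if conf >= min_conf else "rejected"
        print(f"{sorted(S)} => {sorted(L - S)}  confidence = {conf:.0%}  ({strong})")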
Classification
Classification
◼ Classification is a form of data analysis that
extracts models describing important data classes.
◼ Such models, called classifiers, predict
categorical (discrete, unordered) class labels
◼ Example:
◼ Building a classification model to categorize bank loan
applications as either safe or risky.
◼ A marketing manager needs data analysis to help guess
whether a customer with a given profile will buy a new
computer.
◼ A medical researcher wants to analyse cancer data to
predict which one of three specific treatments a patient
should receive.

◼ In all the above examples, the data analysis task
is classification, where a model or classifier is
constructed to predict a class label (categorical)
such as
◼ “Safe” or “Risky”
◼ “Yes” or “No”
◼ “Treatment A”, “Treatment B”, or “Treatment C”, which can
be represented as 1, 2, 3 or A, B, C

Prediction
◼ Suppose the marketing manager wants to predict
how much a given customer will spend during a
sale on Amazon.
◼ The above data analysis task is an example of
numeric prediction, where the model constructed
predicts a continuous valued function or ordered
value as opposed to a class label. This model is a
predictor.
◼ Regression analysis is a statistical methodology
that is often used for numeric prediction.

Classification vs. Prediction
◼ Classification
◼ predicts categorical class labels (discrete or nominal)

◼ classifies data (constructs a model) based on the


training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
◼ Prediction
◼ models continuous-valued functions, i.e., predicts
unknown or missing values
◼ Typical applications
◼ Credit approval

◼ Target marketing

◼ Medical diagnosis

◼ Fraud detection



Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
◼ The set of tuples used for model construction is training set

◼ The model is represented as classification rules, decision trees,


or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model

◼ The known label of test sample is compared with the

classified result from the model


◼ Accuracy rate is the percentage of test set samples that are

correctly classified by the model


◼ Test set is independent of training set, otherwise over-fitting

will occur
◼ If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Process (1): Model Construction

A classification algorithm is applied to the training data to build the classifier (model).

Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is applied first to testing data, then to unseen data,
e.g. (Jeff, Professor, 4) -> Tenured?

Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
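
As a toy illustration of the two-step process (the rule below is the one induced on the model-construction slide; the accuracy computation is a minimal sketch, not part of the original slides):

testing = [("Tom", "Assistant Prof", 2, "no"), ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"), ("Joseph", "Assistant Prof", 7, "yes")]

def model(rank, years):
    # The rule learned in step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2: compare the known labels of the test samples with the model's predictions.
correct = sum(model(rank, years) == label for _, rank, years, label in testing)
print("accuracy on test set:", correct / len(testing))   # 3 of 4 classified correctly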
Supervised vs. Unsupervised Learning

◼ Supervised learning (classification)


◼ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
◼ The class labels of training data are unknown
◼ Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Supervised Learning
Training dataset:
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 1
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 1
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0 1
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0 1
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0 0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 1
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1 0

Test dataset:
71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 ?
Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition.
Unsupervised Learning

Data Set:

57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1

Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition.
Bayesian Classification
Bayesian Classification: Why?
◼ A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
◼ Foundation: Based on Bayes’ Theorem.
◼ Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
◼ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed
data
◼ Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
Bayesian Theorem: Basics

◼ Let X be a data sample (“evidence”): class label is unknown


◼ Let H be a hypothesis that X belongs to class C
◼ Classification is to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
◼ P(H) (prior probability), the initial probability
◼ E.g., X will buy computer, regardless of age, income, …
◼ P(X): probability that sample data is observed
◼ P(X|H) (posteriori probability), the probability of observing
the sample X, given that the hypothesis holds
◼ E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Bayesian Theorem

◼ Given training data X, the posteriori probability of a
hypothesis H, P(H|X), follows the Bayes theorem

   P(H|X) = P(X|H) P(H) / P(X)

◼ Informally, this can be written as
   posteriori = likelihood x prior / evidence
◼ Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
◼ Practical difficulty: requires initial knowledge of many
probabilities, significant computational cost
Towards Naïve Bayesian Classifier
◼ Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem

   P(Ci|X) = P(X|Ci) P(Ci) / P(X)

◼ Since P(X) is constant for all classes, only

   P(Ci|X) = P(X|Ci) P(Ci)

needs to be maximized


Derivation of Naïve Bayes Classifier
◼ A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):

   P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

◼ This greatly reduces the computation cost: only counts
the class distribution
◼ If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
◼ If Ak is continuous-valued, P(xk|Ci) is usually computed
based on a Gaussian distribution with mean μ and
standard deviation σ:

   g(x, μ, σ) = (1 / (√(2π) σ)) · e^( −(x−μ)² / (2σ²) )

and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Naïve Bayesian Classifier: An Example
◼ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357

◼ Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

◼ X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)


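
The arithmetic on this slide can be checked with a short naive Bayes sketch (hand-rolled frequency counts only; variable names are illustrative):

from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31...40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
    ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
X = ("<=30", "medium", "yes", "fair")

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    p = n_c / len(data)                       # P(Ci)
    for k, value in enumerate(X):             # multiply the conditionals P(xk|Ci)
        p *= sum(1 for row in data if row[k] == value and row[-1] == c) / n_c
    scores[c] = p
print(scores)                                 # {'yes': ~0.028, 'no': ~0.007} -> predict 'yes'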
Avoiding the 0-Probability Problem
◼ Naïve Bayesian prediction requires each conditional prob. to be non-
zero. Otherwise, the predicted prob. will be zero:

   P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci)

◼ Ex. Suppose a dataset with 1000 tuples, income = low (0), income =
medium (990), and income = high (10)
◼ Use Laplacian correction (or Laplacian estimator)
   ◼ Adding 1 to each case:
      Prob(income = low) = 1/1003
      Prob(income = medium) = 991/1003
      Prob(income = high) = 11/1003
◼ The “corrected” prob. estimates are close to their “uncorrected”
counterparts


Naïve Bayesian Classifier: Comments
◼ Advantages
◼ Easy to implement

◼ Good results obtained in most of the cases

◼ Disadvantages
◼ Assumption: class conditional independence, therefore

loss of accuracy
◼ Practically, dependencies exist among variables

◼ E.g., hospitals: patients: Profile: age, family history, etc.


Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
◼ Dependencies among these cannot be modeled by Naïve

Bayesian Classifier
◼ How to deal with these dependencies?
◼ Bayesian Belief Networks
Bayesian Belief Networks

◼ Bayesian belief network allows a subset of the variables


conditionally independent
◼ A graphical model of causal relationships
◼ Represents dependency among the variables
◼ Gives a specification of joint probability distribution
❑ Nodes: random variables
❑ Links: dependency
❑ In the example graph (X and Y point to Z; Y points to P):
  X and Y are the parents of Z, and Y is the parent of P
❑ No dependency between Z and P
❑ Has no loops or cycles
Bayesian Belief Network: An Example

[Figure: belief network over the variables Family History, Smoker,
LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for the variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

The CPT shows the conditional probability for each possible
combination of its parents.

Derivation of the probability of a particular combination of values
of X, from the CPT:

   P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(Yi))
Training Bayesian Networks
◼ Several scenarios:
◼ Given both the network structure and all variables

observable: learn only the CPTs


◼ Network structure known, some hidden variables:

gradient descent (greedy hill-climbing) method,


analogous to neural network learning
◼ Network structure unknown, all variables observable:

search through the model space to reconstruct


network topology
◼ Unknown structure, all hidden variables: No good
algorithms known for this purpose
◼ Ref. D. Heckerman: Bayesian networks for data mining



Decision Tree Algorithm
Decision Tree Induction
A decision tree is a popular technique in data mining.
A decision tree represents decisions and decision making.
It allows the user to take a problem with multiple possible
solutions and display it in an easier-to-understand format.
Decision tree induction is the learning of decision trees
from class-labeled training tuples.
Parts of a Decision Tree

A decision tree is a flowchart-like tree structure.
Root Node → the topmost node in the tree; it has no incoming edges.
Internal Node (non-leaf node; shape → rectangle) → denotes a test on
an attribute; a node with outgoing edges is also referred to as a test node.
Branch → an outcome of the test.
Leaf Node (shape → oval), also called terminal node or decision node →
holds a class label.
Example of a Decision Tree
A decision tree is also called a classification tree.

Parents visiting?
  Yes → Cinema
  No → Weather?
        Sunny → Play Tennis
        Windy → Money?
                  Rich → Shopping
                  Poor → Cinema
        Rainy → Stay in
Decision Tree Induction
 The decision tree algorithm is called with three parameters:
D→ data partition
Attribute List
Attribute selection method
 D→ data partition
 Initially it is the complete set of training tuples and their associated class labels
 Attribute List
 The parameters Attribute List is a list of attributes describing the tuples
 Attribute selection method
 The attribute selection method specifies a heuristic approach for selecting the
attribute that “best” describes the given tuples according to class.
 Attribute selection measures are Information gain and gini index
Decision Tree Algorithm
Decision Tree Induction
The tree starts with a single node N→ representing the training tuples
in D (step1)
If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class (step2 and step3).
Otherwise, the algorithm calls Attribute selection method to determine
the splitting criterion.
The splitting criterion tells us which attribute to test at node N by
determining the “best” way to separate or partition the tuples in D into
individual classes (step 6).
 The splitting criterion also tells us which branches to grow from node
N with respect to the outcomes of the chosen test.
Decision Tree Induction
the splitting criterion indicates the
splitting attribute and
A split-point or
a splitting subset.
The splitting criterion is determined so that, ideally, the resulting
partitions at each branch are as “pure” as possible.
 A partition is pure if all of the tuples in it belong to the same class.
Three possibilities of partitioning tuples based on the
splitting criterion

A is discrete-
valued:

A is
continuous-
valued

A is discrete-
valued and
a binary tree
Decision Tree Induction
The node N is labeled with the splitting criterion, which serves as a
test at the node (step 7).
 A branch is grown from node N for each of the outcomes of the
splitting criterion.
The tuples in D are partitioned accordingly (steps 10 to 11).
There are three possible scenarios.
Let A be the splitting attribute. A has v distinct values,{a1,a2,...,
av}, based on the training data.
A is discrete-valued
A is continuous-valued
A is discrete-valued and a binary tree must be produced
Decision Tree Induction- A is discrete-valued
 The outcomes of the test at node N correspond directly to the known values of A.
 A branch is created for each known value, aj, of A and labeled with that value.
 Partition Dj is the subset of class-labeled tuples in D having value aj of A.
 Because all of the tuples in a given partition have the same value for A, then A need
not be considered in any future partitioning of the tuples.
 Therefore, it is removed from attribute list (steps 8 to 9).
Decision Tree Induction- A is continuous-valued
 In this case, the test at node N has two possible outcomes:
  A ≤ split point and
  A > split point
 where split point is the split-point returned by Attribute selection method as
part of the splitting criterion.
 (In practice, the split-point, a, is often taken as the midpoint of two known
adjacent values of A and therefore may not actually be a pre-existing value of A
from the training data.)
 The tuples are partitioned such that D1 holds the subset of class-labeled tuples
in D for which A ≤ split point, while D2 holds the rest.
Decision Tree Induction- A is discrete-valued and a binary
tree must be produced
 The test at node N is of the form “A∈SA?”.
 SA is the splitting subset for A, returned by Attribute selection method as part of
the splitting criterion.
 It is a subset of the known values of A.
 If a given tuple has value aj of A and if aj ∈SA, then the test at node N is
satisfied. Two branches are grown from N.
 the left branch out of N is labeled yes so that D1 corresponds to the subset of
class-labeled tuples in D that satisfy the test.
 The right branch out of N is labeled no so that D2 corresponds to the subset of
class-labeled tuples from D that do not satisfy the test.
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
There are no samples left
Problems in decision Tree
Advantages of Decision Tree

Simple and Transparent


Self Explanatory
Easy to interpret
Little preparation time
Fast
Easy to understand
Handles both Numerical and categorical Data
Decision tree algorithms

ID3 (Iterative Dichotomiser 3)
C4.5 (successor of ID3)
CART (Classification and Regression Trees)
CHAID (CHi-squared Automatic Interaction Detector)
MARS (Multivariate Adaptive Regression Splines)
Tree Pruning
 When a decision tree is built, many of the branches will reflect
anomalies in the training data due to noise or outliers.
 Tree pruning methods address this problem of overfitting the
data.
 Such methods typically use statistical measures to remove the
least reliable branches.
 An unpruned tree and a pruned version of it are shown in Figure
6.6.
An unpruned decision tree and a pruned version of it
Tree Pruning

 Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend.
 They are usually faster and better at correctly classifying independent test
data (i.e., of previously unseen tuples) than unpruned trees.

Tree Pruning Approaches

There are two tree pruning approaches:
 Pre-pruning − The tree is pruned by halting its construction early.
 Post-pruning − This approach removes a sub-tree from a fully grown tree.

Post-pruning
Post-pruning removes subtrees from a “fully grown” tree.
A subtree at a given node is pruned by removing its branches and
replacing it with a leaf.
The leaf is labeled with the most frequent class among the
subtree being replaced.
For example, notice the subtree at node “A3?” in the unpruned
tree of Figure 6.6.
Suppose that the most common class within this subtree is “class B.”
In the pruned version of the tree, the subtree in question is pruned
by replacing it with the leaf “class B.”
Why decision tree is so popular?
The construction of a decision tree classifier does not
require any domain knowledge or parameter settings.
Decision tree can handle high dimensional data
The learning and classification steps of decision tree
induction are simple and fast.
Decision tree classifiers have good accuracy.
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let pi be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci,D| / |D|
◼ Expected information (entropy) needed to classify a tuple
in D:

   Info(D) = − Σ_{i=1}^{m} pi log2(pi)

◼ Information needed (after using A to split D into v
partitions) to classify D:

   Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

◼ Information gained by branching on attribute A:

   Gain(A) = Info(D) − Info_A(D)
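
A short sketch (assumed variable names; the data are the age values and class labels from the buys_computer table shown earlier) that reproduces Info(D) ≈ 0.940 and Gain(age) ≈ 0.246:

import math
from collections import Counter

# age values and buys_computer labels of the 14 training tuples (9 yes, 5 no).
ages = ["<=30", "<=30", "31...40", ">40", ">40", ">40", "31...40",
        "<=30", "<=30", ">40", "<=30", "31...40", "31...40", ">40"]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

def info(class_labels):
    # Expected information (entropy) of a list of class labels.
    n = len(class_labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(class_labels).values())

info_D = info(labels)                          # ≈ 0.940 bits
info_age = 0.0
for v in set(ages):                            # weighted entropy of each age partition
    part = [l for a, l in zip(ages, labels) if a == v]
    info_age += (len(part) / len(labels)) * info(part)

print(round(info_D - info_age, 3))             # Gain(age) ≈ 0.246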
Gain Ratio for Attribute Selection (C4.5)

 The information gain measure is biased towards attributes
with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome
the problem (normalization of information gain)

   SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)

   GainRatio(A) = Gain(A) / SplitInfo_A(D)

 Ex. For income (4 high, 6 medium, 4 low values out of 14 tuples):
   SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
   gain_ratio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected
as the splitting attribute
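
A two-line check of the split information for income (the 4/6/4 split comes from the training table; note that the original slide showed 0.926, but the expression evaluates to 1.557):

import math

counts = [4, 6, 4]                       # income = high / medium / low out of 14 tuples
n = sum(counts)
split_info = -sum((c / n) * math.log2(c / n) for c in counts)
print(round(split_info, 3))              # 1.557
print(round(0.029 / split_info, 3))      # gain ratio for income ≈ 0.019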
Rule-Based Classifier
Using IF-THEN Rules for Classification
◼ Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
◼ Rule antecedent/precondition vs. rule consequent
◼ Assessment of a rule: coverage and accuracy
◼ ncovers = # of tuples covered by R
◼ ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
◼ If more than one rule is triggered, need conflict resolution
◼ Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirement (i.e., with the most attribute tests)
◼ Class-based ordering: decreasing order of prevalence or misclassification
cost per class
◼ Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
Rule Extraction from a Decision Tree

[Figure: decision tree with root age? and branches <=30 (→ student?),
31..40 (→ yes), >40 (→ credit rating?); student?: no → no, yes → yes;
credit rating?: excellent → yes, fair → no]

◼ Rules are easier to understand than large trees
◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction: the leaf
holds the class prediction
◼ Rules are mutually exclusive and exhaustive
◼ Example: Rule extraction from our buys_computer decision tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = old AND credit_rating = fair THEN buys_computer = no


Rule Extraction from the Training Data

◼ Sequential covering algorithm: Extracts rules directly from training data


◼ Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
◼ Rules are learned sequentially, each for a given class Ci will cover many
tuples of Ci but none (or few) of the tuples of other classes
◼ Steps:
◼ Rules are learned one at a time
◼ Each time a rule is learned, the tuples covered by the rules are
removed
◼ The process repeats on the remaining tuples unless termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold
◼ Comp. w. decision-tree induction: learning a set of rules simultaneously



How to Learn-One-Rule?
◼ Start with the most general rule possible: condition = empty
◼ Add new attributes by adopting a greedy depth-first strategy
   ◼ Pick the one that most improves the rule quality
◼ Rule-quality measures: consider both coverage and accuracy
   ◼ Foil-gain (in FOIL & RIPPER): assesses info_gain by extending the condition

      FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) − log2( pos / (pos + neg) ) )

   It favors rules that have high accuracy and cover many positive tuples
◼ Rule pruning based on an independent set of test tuples

      FOIL_Prune(R) = (pos − neg) / (pos + neg)

   pos/neg are the # of positive/negative tuples covered by R.
   If FOIL_Prune is higher for the pruned version of R, prune R
Classifier’s Accuracy Measures
◼ The confusion matrix is a useful tool for analyzing
how well the classifier can recognize tuples of
different classes.
◼ Given two classes, we can talk in terms of positive
tuples (tuples of the main class of interest) versus
negative tuples.
◼ True positives: the positive tuples that
were correctly labeled by the classifier.
◼ True negatives: the negative tuples that were
correctly labeled by the classifier.
◼ False positives: the negative tuples that were
incorrectly labeled as positive (e.g., tuples of class
buys_computer = no for which the classifier
predicted buys_computer = yes).
◼ False negatives: the positive tuples that were
incorrectly labeled as negative (e.g., tuples of class
buys_computer = yes for which the classifier
predicted buys_computer = no).
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:
Actual class \ Predicted class   P                      N
P                                True Positives (TP)    False Negatives (FN)
N                                False Positives (FP)   True Negatives (TN)

Example of Confusion Matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity

A \ P   P    N
P       TP   FN    P
N       FP   TN    N
        P'   N'    All

◼ Classifier Accuracy, or recognition rate: percentage of
test set tuples that are correctly classified
   Accuracy = (TP + TN) / All
◼ Error rate: 1 − accuracy, or
   Error rate = (FP + FN) / All

◼ Class Imbalance Problem:
   ◼ One class may be rare, e.g. fraud, or HIV-positive
   ◼ Significant majority of the negative class and minority of
     the positive class
   ◼ Sensitivity: True Positive recognition rate
      Sensitivity = TP / P
   ◼ Specificity: True Negative recognition rate
      Specificity = TN / N


Classifier Evaluation Metrics:
Precision and Recall, and F-measures
◼ Precision: exactness – what % of tuples that the
classifier labeled as positive are actually positive

◼ Recall: completeness – what % of positive tuples did the


classifier label as positive?
◼ Perfect score is 1.0
◼ Inverse relationship between precision & recall
◼ F measure (F1 or F-score): harmonic mean of precision
and recall,



Classifier Evaluation Metrics: Example

Actual class \ Predicted class   cancer = yes   cancer = no   Total    Recognition (%)
cancer = yes                     90             210           300      30.00 (sensitivity)
cancer = no                      140            9560          9700     98.56 (specificity)
Total                            230            9770          10000    96.40 (accuracy)

◼ Precision = 90/230 = 39.13%          Recall = 90/300 = 30.00%
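
The precision, recall, sensitivity, specificity and F-measure definitions from the previous slides, applied to this confusion matrix (a minimal sketch):

# TP = 90, FN = 210, FP = 140, TN = 9560 from the cancer example above.
TP, FN, FP, TN = 90, 210, 140, 9560

sensitivity = TP / (TP + FN)                          # 0.30   -> 30.00% (= recall)
specificity = TN / (TN + FP)                          # 0.9856 -> 98.56%
precision   = TP / (TP + FP)                          # 0.3913 -> 39.13%
recall      = sensitivity
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean, ≈ 0.34
print(sensitivity, specificity, precision, f1)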
Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
◼ Cluster: a collection of data objects
◼ Similar to one another within the same cluster
◼ Dissimilar to the objects in other clusters
◼ Cluster analysis
◼ Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
◼ Unsupervised learning: no predefined classes
◼ Typical applications
◼ As a stand-alone tool to get insight into data distribution
◼ As a preprocessing step for other algorithms
Clustering: Rich Applications and
Multidisciplinary Efforts
◼ Pattern Recognition
◼ Spatial Data Analysis
◼ Create thematic maps in GIS by clustering feature
spaces
◼ Detect spatial clusters or for other spatial mining tasks
◼ Image Processing
◼ Economic Science (especially market research)
◼ WWW
◼ Document classification
◼ Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
◼ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
◼ Land use: Identification of areas of similar land use in an earth
observation database
◼ Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
◼ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
◼ Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults



Quality: What Is Good Clustering?

◼ A good clustering method will produce high quality


clusters with
◼ high intra-class similarity
◼ low inter-class similarity
◼ The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
◼ The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns



Measure the Quality of Clustering

◼ Dissimilarity/Similarity metric: Similarity is expressed in


terms of a distance function, typically metric: d(i, j)
◼ There is a separate “quality” function that measures the
“goodness” of a cluster.
◼ The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
ratio, and vector variables.
◼ Weights should be associated with different variables
based on applications and data semantics.
◼ It is hard to define “similar enough” or “good enough”
◼ the answer is typically highly subjective.
Requirements of Clustering in Data Mining

◼ Scalability
◼ Ability to deal with different types of attributes
◼ Ability to handle dynamic data
◼ Discovery of clusters with arbitrary shape
◼ Minimal requirements for domain knowledge to
determine input parameters
◼ Able to deal with noise and outliers
◼ Insensitive to order of input records
◼ High dimensionality
◼ Incorporation of user-specified constraints
◼ Interpretability and usability
1. Scalability

◼ Many clustering algorithms work well on small


data sets containing fewer than several hundred
data objects;
◼ A large database may contain millions of
objects.
◼ Clustering on a sample of a given large data set
may lead to biased results.
◼ Highly scalable clustering algorithms are needed.



2. Ability to deal with different types of
attributes

◼ Many algorithms are designed to cluster interval-


based (numerical) data.

◼ Applications may require clustering other types of


data, such as binary, categorical (nominal), and
ordinal data, or mixtures of these data types.



3. Discovery of clusters with arbitrary
shape

• Many clustering algorithms determine clusters


based on Euclidean or Manhattan distance
measures.
• Algorithms based on such distance measures
tend to find spherical clusters with similar size
and density.
• However, a cluster could be of any shape. It is
important to develop algorithms that can detect
clusters of arbitrary shape.



4. Minimal requirements for domain
knowledge to determine input parameters
• Many clustering algorithms require users to input
certain parameters in cluster analysis (such as
the number of desired clusters).
• The clustering results can be quite sensitive to
input parameters.
• Parameters are often difficult to determine,
especially for data sets containing high-
dimensional objects.
• This not only burdens users, but it also makes
the quality of clustering difficult to control.



5. Ability to deal with noisy data

◼ Most real-world databases contain outliers or


missing, unknown, or erroneous data.

◼ Some clustering algorithms are sensitive to such


data and may lead to clusters of poor quality.



6. Insensitivity to the order of input records
◼ Some clustering algorithms cannot incorporate newly
inserted data (i.e., database updates) into existing
clustering structures and, instead, must determine a
new clustering from scratch.
◼ Some clustering algorithms are sensitive to the order
of input data.
◼ That is, given a set of data objects, such an
algorithm may return dramatically different
clusterings depending on the order of presentation of
the input objects.
◼ It is important to develop incremental clustering
algorithms and algorithms that are insensitive to the
order of input.
7. High dimensionality

• A database or a data warehouse can contain


several dimensions or attributes.
• Many clustering algorithms are good at handling
low-dimensional data, involving only two to three
dimensions.
• Human eyes are good at judging the quality of
clustering for up to three dimensions.
• Finding clusters of data objects in high
dimensional space is challenging, especially
considering that such data can be sparse and
highly skewed.
8. Constraint-based clustering
• Real-world applications may need to perform
clustering under various kinds of constraints.
Suppose that your job is to choose the locations
for a given number of new automatic banking
machines (ATMs) in a city.
• To decide upon this, you may cluster households
while considering constraints such as the city’s
rivers and highway networks, and the type and
number of customers per cluster.
• A challenging task is to find groups of data with
good clustering behavior that satisfy the
specified constraints.
9. Interpretability and usability

◼ Users expect clustering results to be


interpretable, comprehensible, and usable.
◼ That is, clustering may need to be tied to specific
semantic interpretations and applications.
◼ It is important to study how an application goal
may influence the selection of clustering features
and methods.



Data Structures
◼ Data matrix (two modes): n objects × p variables

   [ x11  ...  x1f  ...  x1p ]
   [ ...  ...  ...  ...  ... ]
   [ xi1  ...  xif  ...  xip ]
   [ ...  ...  ...  ...  ... ]
   [ xn1  ...  xnf  ...  xnp ]

◼ Dissimilarity matrix (one mode): n × n, storing d(i, j)

   [ 0                              ]
   [ d(2,1)  0                      ]
   [ d(3,1)  d(3,2)  0              ]
   [ :       :       :              ]
   [ d(n,1)  d(n,2)  ...  ...  0    ]


Type of data in clustering analysis

◼ Interval-scaled variables
◼ Binary variables
◼ Nominal, ordinal, and ratio variables
◼ Variables of mixed types



Interval-valued variables

◼ Standardize data
   ◼ Calculate the mean absolute deviation:

      s_f = (1/n) ( |x1f − mf| + |x2f − mf| + ... + |xnf − mf| )

   where mf = (1/n) (x1f + x2f + ... + xnf)

   ◼ Calculate the standardized measurement (z-score):

      zif = (xif − mf) / s_f

◼ Using the mean absolute deviation is more robust than using
the standard deviation


Similarity and Dissimilarity Between Objects

◼ Distances are normally used to measure the similarity or
dissimilarity between two data objects
◼ Some popular ones include the Minkowski distance:

   d(i, j) = ( |xi1 − xj1|^q + |xi2 − xj2|^q + ... + |xip − xjp|^q )^(1/q)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are
two p-dimensional data objects, and q is a positive integer
◼ If q = 1, d is the Manhattan distance:

   d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|


Similarity and Dissimilarity Between Objects (Cont.)

◼ If q = 2, d is the Euclidean distance:

   d(i, j) = sqrt( |xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2 )

◼ Properties
   ◼ d(i,j) ≥ 0
   ◼ d(i,i) = 0
   ◼ d(i,j) = d(j,i)
   ◼ d(i,j) ≤ d(i,k) + d(k,j)
◼ Also, one can use weighted distance, parametric
Pearson product moment correlation, or other
dissimilarity measures
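
A tiny sketch of the Minkowski distance and its two special cases (the function name and sample points are illustrative):

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

i, j = (1, 7, 2), (4, 3, 2)
print(minkowski(i, j, 1))   # Manhattan distance: |1-4| + |7-3| + |2-2| = 7
print(minkowski(i, j, 2))   # Euclidean distance: sqrt(9 + 16 + 0) = 5.0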


Binary Variables
◼ A contingency table for binary data:

                  Object j
                  1       0       sum
Object i    1     a       b       a + b
            0     c       d       c + d
            sum   a + c   b + d   p

◼ Distance measure for symmetric binary variables:

   d(i, j) = (b + c) / (a + b + c + d)

◼ Distance measure for asymmetric binary variables:

   d(i, j) = (b + c) / (a + b + c)

◼ Jaccard coefficient (similarity measure for asymmetric binary variables):

   sim_Jaccard(i, j) = a / (a + b + c)
Dissimilarity between Binary Variables

◼ Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

◼ gender is a symmetric attribute
◼ the remaining attributes are asymmetric binary
◼ let the values Y and P be set to 1, and the value N be set to 0

   d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
   d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
   d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
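
The three dissimilarities can be recomputed with a short sketch (gender, the symmetric attribute, is excluded; Y/P map to 1 and N to 0; names are illustrative):

records = {
    "jack": [1, 0, 1, 0, 0, 0],
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

def d_asym(x, y):
    # Asymmetric binary dissimilarity d = (b + c) / (a + b + c).
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    return (b + c) / (a + b + c)

print(round(d_asym(records["jack"], records["mary"]), 2))  # 0.33
print(round(d_asym(records["jack"], records["jim"]),  2))  # 0.67
print(round(d_asym(records["jim"],  records["mary"]), 2))  # 0.75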
Nominal Variables

◼ A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
◼ Method 1: Simple matching
   ◼ m: # of matches, p: total # of variables

      d(i, j) = (p − m) / p

◼ Method 2: use a large number of binary variables
   ◼ creating a new binary variable for each of the M
nominal states


Ordinal Variables

◼ An ordinal variable can be discrete or continuous
◼ Order is important, e.g., rank
◼ Can be treated like interval-scaled
   ◼ replace xif by its rank rif ∈ {1, …, Mf}
   ◼ map the range of each variable onto [0, 1] by replacing
the i-th object in the f-th variable by

      zif = (rif − 1) / (Mf − 1)

   ◼ compute the dissimilarity using methods for interval-
scaled variables


Ratio-Scaled Variables

◼ Ratio-scaled variable: a positive measurement on a


nonlinear scale, approximately at exponential scale,
such as Ae^(Bt) or Ae^(−Bt)
◼ Methods:
◼ treat them like interval-scaled variables—not a good
choice! (why?—the scale can be distorted)
◼ apply logarithmic transformation
yif = log(xif)
◼ treat them as continuous ordinal data treat their rank
as interval-scaled
Vector Objects

◼ Vector objects: keywords in documents, gene


features in micro-arrays, etc.
◼ Broad applications: information retrieval, biologic
taxonomy, etc.
◼ Cosine measure

◼ A variant: Tanimoto coefficient



Major Clustering Approaches (I)

◼ Partitioning approach:
◼ Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
◼ Typical methods: k-means, k-medoids, CLARANS
◼ Hierarchical approach:
◼ Create a hierarchical decomposition of the set of data (or objects) using
some criterion
◼ Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
◼ Density-based approach:
◼ Based on connectivity and density functions
◼ Typical methods: DBSCAN, OPTICS, DenClue



Major Clustering Approaches (II)
◼ Grid-based approach:
◼ based on a multiple-level granularity structure
◼ Typical methods: STING, WaveCluster, CLIQUE
◼ Model-based:
◼ A model is hypothesized for each of the clusters and tries to find the best
fit of that model to each other
◼ Typical methods: EM, SOM, COBWEB
◼ Frequent pattern-based:
◼ Based on the analysis of frequent patterns
◼ Typical methods: pCluster
◼ User-guided or constraint-based:
◼ Clustering by considering user-specified or application-specific constraints
◼ Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance
between Clusters
◼ Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)

◼ Complete link: largest distance between an element in one cluster


and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)

◼ Average: avg distance between an element in one cluster and an


element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)

◼ Centroid: distance between the centroids of two clusters, i.e.,


dis(Ki, Kj) = dis(Ci, Cj)

◼ Medoid: distance between the medoids of two clusters, i.e., dis(Ki,


Kj) = dis(Mi, Mj)
◼ Medoid: one chosen, centrally located object in the cluster
Cluster Distance Measures
◼ Single link: smallest distance single link
between an element in one cluster (min)
and an element in the other, i.e.,
d(Ci, Cj) = min{d(xip, xjq)}

◼ Complete link: largest distance


complete link
between an element in one cluster
and an element in the other, i.e.,
(max)
d(Ci, Cj) = max{d(xip, xjq)}

◼ Average: avg distance between


elements in one cluster and average
elements in the other, i.e.,
d(Ci, Cj) = avg{d(xip, xjq)}



Cluster Distance Measures
Example: Given a data set of five objects characterised by a single feature,
assume that there are two clusters: C1: {a, b} and C2: {c, d, e}.

          a   b   c   d   e
Feature   1   2   4   5   6

1. Calculate the distance matrix:

     a   b   c   d   e
a    0   1   3   4   5
b    1   0   2   3   4
c    3   2   0   1   2
d    4   3   1   0   1
e    5   4   2   1   0

2. Calculate the three cluster distances between C1 and C2:

Single link:
dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)}
             = min{3, 4, 5, 2, 3, 4} = 2

Complete link:
dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)}
             = max{3, 4, 5, 2, 3, 4} = 5

Average:
dist(C1, C2) = (d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)) / 6
             = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5
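
The same three cluster distances, computed directly from the single feature values (a minimal sketch with illustrative names):

from itertools import product

feature = {"a": 1, "b": 2, "c": 4, "d": 5, "e": 6}
C1, C2 = ["a", "b"], ["c", "d", "e"]

pair_dists = [abs(feature[x] - feature[y]) for x, y in product(C1, C2)]
print(min(pair_dists))                       # single link   = 2
print(max(pair_dists))                       # complete link = 5
print(sum(pair_dists) / len(pair_dists))     # average       = 3.5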


Centroid, Radius and Diameter of a Cluster (for numerical data sets)

◼ Centroid: the “middle” of a cluster

   Cm = ( Σ_{i=1}^{N} t_ip ) / N

◼ Radius: square root of the average distance from any point of the
cluster to its centroid

   Rm = sqrt( Σ_{i=1}^{N} (t_ip − Cm)^2 / N )

◼ Diameter: square root of the average mean squared distance between
all pairs of points in the cluster

   Dm = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_ip − t_jq)^2 / ( N (N − 1) ) )


Types of Clusterings
◼ A clustering is a set of clusters
◼ Important distinction between hierarchical and
partitional sets of clusters
◼ Partitional Clustering
   ◼ A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
◼ Hierarchical clustering
   ◼ A set of nested clusters organized as a hierarchical tree


Partitional Clustering

[Figure: a set of original points and a partitional clustering of them]
Partitioning Algorithms: Basic Concept

◼ Partitioning method: construct a partition of a database D of n objects
into a set of k clusters, s.t. the sum of squared distances is minimized:

   E = Σ_{m=1}^{k} Σ_{tmi ∈ Km} (Cm − tmi)^2

◼ Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
   ◼ Global optimal: exhaustively enumerate all partitions
   ◼ Heuristic methods: k-means and k-medoids algorithms
   ◼ k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
   ◼ k-medoids or PAM (Partition Around Medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster


The K-Means Clustering Method

◼ Given k, the k-means algorithm is implemented in


four steps:
◼ Partition objects into k nonempty subsets
◼ Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
◼ Assign each object to the cluster with the nearest
seed point
◼ Go back to Step 2, stop when no more new
assignment

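A minimal k-means sketch following the four steps above (it seeds the clusters by sampling k points, a common variant of step 1; all names and the sample points are illustrative):

import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)          # seed with k of the points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign each object to the nearest seed
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        new_centroids = [                         # recompute the centroid of each cluster
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:            # stop when no more new assignment
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)
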


The K-Medoids Clustering Method
◼ Find representative objects, called medoids, in clusters
◼ PAM (Partitioning Around Medoids, 1987)
◼ starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
◼ PAM works effectively for small data sets, but does not scale well for
large data sets

◼ CLARA (Kaufmann & Rousseeuw, 1990)


◼ CLARANS (Ng & Han, 1994): Randomized sampling
◼ Focusing + spatial data structure (Ester et al., 1995)



Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1–p4 with its
dendrogram, and a non-traditional hierarchical clustering of the same
points with its dendrogram]


Hierarchical Clustering

◼ Use distance matrix as clustering criteria.

◼ This method does not require the number of


clusters k as an input.

◼ It needs a termination condition



Contd….

❖ Two types of Hierarchical Clustering:


Agglomerative (bottom-up) and Divisive (top-
down).
➢ Agglomerative (AGNES): begin with each
element as a separate cluster and merge them into
successively larger clusters
➢ Divisive (DIANA): begin with the whole set and
proceed to divide it into successively smaller
clusters.

Contd….

[Figure: AGNES merges left to right over Steps 0–4 — a and b into ab, d and e into de, de and c into cde, and finally ab and cde into abcde — while DIANA (divisive) performs the same splits right to left, from abcde back to the single objects]

Agglomerative Algorithm
◼ The Agglomerative algorithm is carried out in
three steps:
1) Convert object attributes to
distance matrix
2) Set each object as a cluster
(thus if we have N objects, we
will have N clusters at the
beginning)
3) Repeat until the number of clusters
is one (or a known # of clusters)
▪ Merge two closest clusters
▪ Update distance matrix
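These three steps are what library routines such as SciPy's linkage implement; a brief sketch, assuming SciPy is available and using placeholder data X:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# placeholder data: five 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.2], [5.1, 4.9], [9.0, 0.5]])

# build the merge tree from the pairwise Euclidean distances,
# merging the two closest clusters at every step (single link here)
Z = linkage(X, method="single", metric="euclidean")

# cut the tree into a known number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(Z)       # each row: the two clusters merged and the merge distance
print(labels)  # cluster id per point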

AGNES (Agglomerative Nesting)
◼ Introduced in Kaufmann and Rousseeuw (1990)
◼ Implemented in statistical analysis packages, e.g., Splus
◼ Use the Single-Link method and the dissimilarity matrix.
◼ Merge nodes that have the least dissimilarity
◼ Go on in a non-descending fashion
◼ Eventually all nodes belong to the same cluster

Agglomerative Clustering Algorithm
◼ More popular hierarchical clustering technique
◼ Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

◼ Key operation is the computation of the proximity of two


clusters
◼ Different approaches to defining the distance between clusters
distinguish the different algorithms

Starting Situation

◼ Start with clusters of individual points and a proximity matrix

[Figure: the individual points p1, p2, ..., p12 as singleton clusters, together with their proximity matrix indexed by p1, p2, p3, p4, p5, ...]


Intermediate Situation
◼ After some merging steps, we have some clusters
[Figure: clusters C1–C5 formed over the points p1–p12, together with the corresponding C1–C5 proximity matrix]


Intermediate Situation

◼ We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.

[Figure: the C1–C5 proximity matrix, with the two closest clusters C2 and C5 to be merged]


After Merging

◼ The question is “How do we update the proximity matrix?”

[Figure: the updated proximity matrix after merging C2 and C5 into C2 ∪ C5; the entries between C2 ∪ C5 and C1, C3, C4 are marked “?”]


How to Define Inter-Cluster Similarity

[Figure: two clusters of points p1–p5 and their proximity matrix — which entries define the similarity between the clusters?]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward’s Method uses squared error
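The point-based definitions are easy to state in code; a small illustrative sketch (not from the slides) of MIN, MAX, group average and centroid distance between two clusters:

import numpy as np

def pairwise(ca, cb):
    """All Euclidean distances between the points of cluster ca and cluster cb."""
    return np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=2)

def d_min(ca, cb):            # MIN / single link
    return pairwise(ca, cb).min()

def d_max(ca, cb):            # MAX / complete link
    return pairwise(ca, cb).max()

def d_group_average(ca, cb):  # average of all pairwise distances
    return pairwise(ca, cb).mean()

def d_centroid(ca, cb):       # distance between the cluster centroids
    return np.linalg.norm(ca.mean(axis=0) - cb.mean(axis=0))

# example with two small placeholder clusters
c1 = np.array([[0.0, 0.0], [1.0, 0.0]])
c2 = np.array([[4.0, 3.0], [5.0, 4.0]])
print(d_min(c1, c2), d_max(c1, c2), d_group_average(c1, c2), d_centroid(c1, c2))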
Hierarchical Clustering: Time and
Space requirements

◼ O(N²) space, since it uses the proximity matrix
◼ N is the number of points
◼ O(N³) time in many cases
◼ There are N steps, and at each step the proximity matrix,
of size N², must be updated and searched
◼ Complexity can be reduced to O(N² log N) time for
some approaches

Hierarchical Clustering: Problems
and Limitations
◼ Once a decision is made to combine two clusters,
it cannot be undone
◼ No objective function is directly minimized
◼ Different schemes have problems with one or
more of the following:
◼ Sensitivity to noise and outliers

◼ Difficulty handling different sized clusters and


convex shapes
◼ Breaking large clusters

A Dendrogram Shows How the
Clusters are Merged Hierarchically
◼ Decompose data objects into several levels of
nested partitioning (a tree of clusters), called a
dendrogram.
◼ A clustering of the data objects is obtained by
cutting the dendrogram at the desired level; then
each connected component forms a cluster.
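For example, with SciPy a dendrogram built by linkage can be cut at a chosen height; a brief sketch with placeholder data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.3, 0.1], [4.0, 4.0], [4.2, 3.9], [8.0, 0.0]])  # placeholder points
Z = linkage(X, method="single")

# cut the dendrogram at height 1.0: merges above that distance are undone,
# and each remaining connected component becomes one cluster
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)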

Example
◼ Problem: clustering analysis with agglomerative
algorithm

[Figure: the data matrix and, using Euclidean distance, the derived distance matrix]
Example
◼ Merge two closest clusters (iteration 1)

Example
◼ Update distance matrix (iteration 1)

Example
◼ Merge two closest clusters (iteration 2)

Example
◼ Update distance matrix (iteration 2)

Example
◼ Merge two closest clusters/update distance matrix
(iteration 3)

Example
◼ Merge two closest clusters/update distance matrix
(iteration 4)

Example

◼ Final result (meeting termination condition)

Example
◼ Dendrogram tree representation (vertical axis: lifetime / merge distance; horizontal axis: objects)
1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge clusters A and B into (A, B) at distance 0.71
4. We merge clusters E and (D, F) into ((D, F), E) at distance 1.00
5. We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41
6. We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50
7. The last cluster contains all the objects, which concludes the computation
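The merge sequence above can be reproduced with SciPy. The coordinates below are an assumption (the slide's data matrix is not reproduced here); they were chosen to be consistent with the merge distances 0.50, 0.71, 1.00, 1.41 and 2.50:

import numpy as np
from scipy.cluster.hierarchy import linkage

# assumed coordinates for A..F, consistent with the dendrogram's merge distances
points = {"A": (1.0, 1.0), "B": (1.5, 1.5), "C": (5.0, 5.0),
          "D": (3.0, 4.0), "E": (4.0, 4.0), "F": (3.0, 3.5)}
X = np.array(list(points.values()))

Z = linkage(X, method="single", metric="euclidean")
print(np.round(Z, 2))   # merge distances ~0.50, 0.71, 1.00, 1.41, 2.50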

DIANA (Divisive Analysis)

◼Initially all items in one cluster


◼Large clusters are successively divided
◼Top Down
Comments on the Hierarchical Clustering

Weaknesses:

➢ do not scale well: time complexity of at least
O(n²), where n is the total number of objects

➢ can never undo what was done previously

Examples for Hierarchical Clustering

❖ Hierarchical Clustering

❖ ROCK

Density-based clustering

• Clusters are identified by looking at the density


of points.
• Regions with a high density of points indicate the
existence of clusters, whereas regions with a
low density of points indicate noise or outliers.
• Suited to deal with large datasets, with noise,
and is able to identify clusters with different
sizes and shapes.

DBSCAN algorithm

• It needs three input parameters:


1. K → the neighbour list size;
2. Eps → the radius that delimitate the
neighbourhood area of a point
(Epsneighbourhood);
3. MinPts → the minimum number of points that
must exist in the Eps-neighbourhood.
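For comparison, a typical library implementation exposes only Eps and MinPts (the neighbour-list size k above is specific to the variant described); a brief sketch with scikit-learn, assuming it is installed and using placeholder data:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [1.2, 1.1],     # a dense region
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],     # another dense region
              [4.0, 15.0]])                           # an isolated point

# eps plays the role of Eps, min_samples the role of MinPts
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise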

Contd….

◼ The clustering process is based on the


classification of the points in the dataset as core
points, border points and noise points.
◼ It uses density relations between points
(directly density-reachable, density-reachable,
density-connected) to form the clusters.
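A small illustrative sketch of that classification step, written directly against the definitions above (the function name, eps and min_pts are placeholders):

import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise' (DBSCAN-style)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = dists <= eps                   # Eps-neighbourhood (includes the point itself)
    core = neighbours.sum(axis=1) >= min_pts    # enough points in the Eps-neighbourhood
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif np.any(core & neighbours[i]):
            labels.append("border")             # not core, but inside some core point's neighbourhood
        else:
            labels.append("noise")
    return labels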

Example: a dataset consisting of 300 points was processed using
the following input parameters: k=7, Eps=0.007,
MinPts=4 [figure of the resulting clusters not reproduced here]

Density-Based Clustering definition

[Figures: formal definitions of the Eps-neighbourhood, core/border/noise points, and the directly density-reachable, density-reachable and density-connected relations]


K-MEANS
CLUSTERING

INTRODUCTION-
What is clustering?

⚫ Clustering is the classification of objects into


different groups, or more precisely, the
partitioning of a data set into subsets
(clusters), so that the data in each subset
(ideally) share some common trait - often
according to some defined distance measure.

Types of clustering:
1. Hierarchical algorithms: these find successive clusters
using previously established clusters.
1. Agglomerative ("bottom-up"): Agglomerative algorithms
begin with each element as a separate cluster and
merge them into successively larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin with
the whole set and proceed to divide it into successively
smaller clusters.
2. Partitional clustering: Partitional algorithms determine all
clusters at once. They include:
⚫ K-means and derivatives
⚫ Fuzzy c-means clustering
⚫ QT clustering algorithm
Common Distance measures:

⚫ Distance measure will determine how the similarity of two


elements is calculated and it will influence the shape of the
clusters.
They include:
1. The Euclidean distance (also called 2-norm distance) is
given by:

    d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }

2. The Manhattan distance (also called taxicab norm or
1-norm) is given by:

    d(x, y) = \sum_{i=1}^{n} | x_i - y_i |
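In code, both distances reduce to a single NumPy expression (a quick illustrative check):

import numpy as np

x = np.array([1.0, 1.0])
y = np.array([4.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # 2-norm distance, here sqrt(13) ≈ 3.606
manhattan = np.sum(np.abs(x - y))           # 1-norm (taxicab) distance, here 5.0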

K-MEANS CLUSTERING
⚫ The k-means algorithm is an algorithm to cluster
n objects based on attributes into k partitions,
where k < n.
⚫ It is similar to the expectation-maximization
algorithm for mixtures of Gaussians in that they
both attempt to find the centers of natural clusters
in the data.
⚫ It assumes that the object attributes form a vector
space.
⚫ An algorithm for partitioning (or clustering) N
data points into K disjoint subsets Sj
containing data points so as to minimize the
sum-of-squares criterion

    J = \sum_{j=1}^{K} \sum_{x_n \in S_j} \| x_n - u_j \|^2

where xn is a vector representing the nth
data point and uj is the geometric centroid of
the data points in Sj.

⚫ Simply speaking k-means clustering is an
algorithm to classify or to group the objects
based on attributes/features into K number of
group.
⚫ K is positive integer number.
⚫ The grouping is done by minimizing the sum
of squares of distances between data and the
corresponding cluster centroid.

How does the K-Means Clustering
algorithm work?

The K-Means Clustering
Method
[Figure: k-means iteration with K=2 — arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign until the assignments stop changing]
Comments on the K-Means
Method
⚫ Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
⚫ Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
⚫ Comment: Often terminates at a local optimum. The global optimum may be
found using techniques such as: deterministic annealing and genetic
algorithms
⚫ Weakness
⚫ Applicable only when mean is defined, then what about categorical data?
⚫ Need to specify k, the number of clusters, in advance
⚫ Unable to handle noisy data and outliers
⚫ Not suitable to discover clusters with non-convex shapes

Variations of the K-Means
Method
⚫ A few variants of the k-means which differ in
⚫ Selection of the initial k means

⚫ Dissimilarity calculations

⚫ Strategies to calculate cluster means

⚫ Handling categorical data: k-modes (Huang’98)


⚫ Replacing means of clusters with modes

⚫ Using new dissimilarity measures to deal with categorical objects

⚫ Using a frequency-based method to update modes of clusters

⚫ A mixture of categorical and numerical data: k-prototype method

What is the problem of k-
Means Method?
⚫ The k-means algorithm is sensitive to outliers !
⚫ Since an object with an extremely large value may substantially distort
the distribution of the data.

⚫ K-Medoids: Instead of taking the mean value of the object in a


cluster as a reference point, medoids can be used, which is the
most centrally located object in a cluster.
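A minimal PAM-style sketch of the medoid idea in Python/NumPy (illustrative only, not the slides' algorithm listing; the function name, X and k are placeholders):

import numpy as np

def pam(X, k, max_iter=50, seed=0):
    """Tiny PAM-style k-medoids: medoids are actual data points."""
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def total_cost(meds):
        # each point contributes its distance to the closest medoid
        return dists[:, meds].min(axis=1).sum()

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                       # try replacing each medoid ...
            for cand in range(len(X)):           # ... by each non-medoid object
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand
                cost = total_cost(trial)
                if cost < best:                  # keep the swap that lowers the total distance
                    best, medoids, improved = cost, trial, True
        if not improved:
            break
    labels = dists[:, medoids].argmin(axis=1)
    return np.array(medoids), labels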
⚫ Step 1: Begin with a decision on the value of
k = the number of clusters.
⚫ Step 2: Put any initial partition that classifies the
data into k clusters. You may assign the
training samples randomly or systematically
as follows:
1. Take the first k training samples as single-
element clusters
2. Assign each of the remaining (N-k) training
samples to the cluster with the nearest
centroid. After each assignment, recompute
the centroid of the gaining cluster.
⚫ Step 3: Take each sample in sequence and
compute its distance from the centroid of
each of the clusters. If a sample is not
currently in the cluster with the closest
centroid, switch this sample to that cluster
and update the centroid of the cluster
gaining the new sample and the cluster
losing the sample.
⚫ Step 4: Repeat step 3 until convergence is
achieved, that is, until a pass through the
training samples causes no new assignments.
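A short sketch of this sequential variant in Python/NumPy (illustrative only; it recomputes the affected centroids after every single reassignment, as described in Steps 2-4; the function name and arguments are placeholders):

import numpy as np

def sequential_k_means(X, k, max_passes=100):
    """K-means with centroids recomputed after every single reassignment."""
    labels = np.arange(len(X)) % k          # a systematic initial partition (Step 2)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(X):           # Step 3: take each sample in sequence
            j = np.linalg.norm(centroids - x, axis=1).argmin()
            if j != labels[i]:
                old = labels[i]
                labels[i] = j
                moved = True
                # update the centroids of the gaining and losing clusters
                for c in (j, old):
                    if np.any(labels == c):
                        centroids[c] = X[labels == c].mean(axis=0)
        if not moved:                       # Step 4: a full pass caused no new assignments
            break
    return labels, centroids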

A Simple example showing the
implementation of k-means algorithm
(using K=2)

Step 1:
Initialization: we randomly choose the following two centroids
(k=2) for the two clusters.
In this case the two centroids are m1=(1.0, 1.0) and
m2=(5.0, 7.0).

Step 2:
⚫ Thus, we obtain two clusters
containing:
{1,2,3} and {4,5,6,7}.
⚫ Their new centroids are:

Step 3:
⚫ Now using these centroids
we compute the Euclidean
distance of each object, as
shown in table.

⚫ Therefore, the new


clusters are:
{1,2} and {3,4,5,6,7}

⚫ Next centroids are:


m1=(1.25,1.5) and m2 =
(3.9,5.1)

⚫ Step 4 :
The clusters obtained are:
{1,2} and {3,4,5,6,7}

⚫ Therefore, there is no
change in the cluster.
⚫ Thus, the algorithm comes
to a halt here, and the final
result consists of 2 clusters:
{1,2} and {3,4,5,6,7}.

PLOT
[Figure: plot of the final K=2 clustering from the example above]

(with K=3)
[Figure: Step 1 and Step 2 plots of the same example run with K=3]


Real-Life Numerical Example
of K-Means Clustering
We have 4 medicines as our training data points object
and each medicine has 2 attributes. Each attribute
represents coordinate of the object. We have to
determine which medicines belong to cluster 1 and
which medicines belong to the other cluster.
Object         Attribute 1 (X): weight index    Attribute 2 (Y): pH
Medicine A     1                                1
Medicine B     2                                1
Medicine C     4                                3
Medicine D     5                                4
Step 1:
⚫ Initial value of
centroids : Suppose
we use medicine A and
medicine B as the first
centroids.
⚫ Let c1 and c2
denote the coordinates
of the centroids; then
c1=(1,1) and c2=(2,1)

⚫ Objects-Centroids distance: we calculate the
distance between each cluster centroid and each object.
Using Euclidean distance, the distance matrix at
iteration 0 is

    D0 (rows: c1, c2; columns: A, B, C, D) =
        [ 0     1     3.61  5.00 ]
        [ 1     0     2.83  4.24 ]

⚫ Each column in the distance matrix symbolizes one
object.
⚫ The first row of the distance matrix corresponds to the
distance of each object to the first centroid, and the
second row is the distance of each object to the second
centroid.
⚫ For example, the distance from medicine C = (4, 3) to the
first centroid c1 = (1, 1) is sqrt((4-1)² + (3-1)²) = sqrt(13) ≈ 3.61, and its distance to the
second centroid c2 = (2, 1) is sqrt((4-2)² + (3-1)²) = sqrt(8) ≈ 2.83, and so on.
Step 2:
⚫ Objects clustering : We
assign each object based
on the minimum distance.
⚫ Medicine A is assigned to
group 1, medicine B to
group 2, medicine C to
group 2 and medicine D to
group 2.
⚫ The elements of the Group
matrix below are 1 if and
only if the object is
assigned to that group:

    G0 (rows: group 1, group 2; columns: A, B, C, D) =
        [ 1  0  0  0 ]
        [ 0  1  1  1 ]

⚫ Iteration-1, Objects-Centroids distances:
The next step is to compute the distance of
all objects to the new centroids c1 = (1, 1) and
c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3).
⚫ Similar to step 2, the distance matrix at
iteration 1 is

    D1 (rows: c1, c2; columns: A, B, C, D) =
        [ 0     1     3.61  5.00 ]
        [ 3.14  2.36  0.47  1.89 ]

⚫ Iteration-1, Objects
clustering: Based on the new
distance matrix, we move
medicine B to Group 1 while
all the other objects remain.
The Group matrix becomes

    G1 (rows: group 1, group 2; columns: A, B, C, D) =
        [ 1  1  0  0 ]
        [ 0  0  1  1 ]

⚫ Iteration 2, determine
centroids: Now we repeat step
4 to calculate the new centroid
coordinates based on the
clustering of the previous iteration.
Group 1 and group 2 both have
two members, thus the new
centroids are c1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
and c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)
⚫ Iteration-2, Objects-Centroids distances:
Repeating step 2 again, the new distance
matrix at iteration 2 is

    D2 (rows: c1, c2; columns: A, B, C, D) =
        [ 0.50  0.50  3.20  4.61 ]
        [ 4.30  3.54  0.71  0.71 ]

⚫ Iteration-2, Objects clustering: Again, we
assign each object based on the minimum
distance.

⚫ We obtain the same grouping as before (G2 = G1). Comparing the
grouping of the last iteration and this iteration reveals
that the objects do not change group anymore.
⚫ Thus, the computation of the k-means clustering
has reached stability and no more iterations are
needed.
We get the final grouping as the result:

Object         Feature 1 (X): weight index    Feature 2 (Y): pH    Group (result)
Medicine A     1                              1                    1
Medicine B     2                              1                    1
Medicine C     4                              3                    2
Medicine D     5                              4                    2
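A short NumPy sketch reproduces this example end to end (the data points and the initial centroids A and B come from the slides; the loop itself is an illustrative reimplementation):

import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # medicines A, B, C, D
centroids = X[[0, 1]].copy()                                  # initial centroids c1 = A, c2 = B

while True:
    # distance matrix: rows = centroids, columns = objects
    D = np.linalg.norm(centroids[:, None, :] - X[None, :, :], axis=2)
    groups = D.argmin(axis=0)                                 # assign by minimum distance
    new_centroids = np.array([X[groups == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centroids, centroids):                 # grouping no longer changes
        break
    centroids = new_centroids

print(groups + 1)      # -> [1 1 2 2], matching the table above
print(centroids)       # -> [[1.5 1. ] [4.5 3.5]]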

K-Means Clustering Visual Basic Code

Sub kMeanCluster (Data() As Variant, numCluster As Integer)


' main function to cluster data into k number of Clusters
' input:
' + Data matrix (0 to 2, 1 to TotalData);
' Row 0 = cluster, 1 =X, 2= Y; data in columns
' + numCluster: number of cluster user want the data to be clustered
' + private variables: Centroid, TotalData
' output:
' o) update centroid
' o) assign cluster number to the Data (= row 0 of Data)

Dim i As Integer
Dim j As Integer
Dim X As Single
Dim Y As Single
Dim min As Single
Dim cluster As Integer
Dim d As Single
Dim sumXY()

Dim isStillMoving As Boolean


isStillMoving = True
If totalData <= numCluster Then
'only the last data is put here because it designed to be interactive
Data(0, totalData) = totalData ' cluster No = total data
Centroid(1, totalData) = Data(1, totalData) ' X
Centroid(2, totalData) = Data(2, totalData) ' Y
Else
'calculate minimum distance to assign the new data
min = 10 ^ 10 'big number
X = Data(1, totalData)
Y = Data(2, totalData)
For i = 1 To numCluster
    ' find the nearest existing centroid for the newly added point
    ' (dist is assumed to be a module-level Euclidean distance helper)
    d = dist(X, Y, Centroid(1, i), Centroid(2, i))
    If d < min Then
        min = d
        cluster = i
    End If
Next i
Data(0, totalData) = cluster ' tentatively assign the new point

Do While isStillMoving
' this loop will surely converge
'calculate new centroids
' 1 =X, 2=Y, 3=count number of data
ReDim sumXY(1 To 3, 1 To numCluster)
For i = 1 To totalData
sumXY(1, Data(0, i)) = Data(1, i) + sumXY(1, Data(0, i))
sumXY(2, Data(0, i)) = Data(2, i) + sumXY(2, Data(0, i))
sumXY(3, Data(0, i)) = 1 + sumXY(3, Data(0, i))
Next i
For i = 1 To numCluster
Centroid(1, i) = sumXY(1, i) / sumXY(3, i)
Centroid(2, i) = sumXY(2, i) / sumXY(3, i)
Next i
'assign all data to the new centroids
isStillMoving = False

For i = 1 To totalData
min = 10 ^ 10 'big number
X = Data(1, i)
Y = Data(2, i)
For j = 1 To numCluster
d = dist(X, Y, Centroid(1, j), Centroid(2, j))
If d < min Then
min = d
cluster = j
End If
Next j
If Data(0, i) <> cluster Then
Data(0, i) = cluster
isStillMoving = True
End If
Next i
Loop
End If
End Sub
Weaknesses of K-Mean Clustering
1. When the number of data points is small, the initial
grouping determines the clusters significantly.
2. The number of clusters, K, must be determined beforehand.
A further disadvantage is that the algorithm does not yield the
same result with each run, since the resulting clusters depend
on the initial random assignments.
3. We never know the real clusters: using the same data,
inputting it in a different order may produce different
clusters when the number of data points is small.
4. It is sensitive to the initial conditions. Different initial conditions
may produce different clustering results, and the algorithm
may be trapped in a local optimum.
Applications of K-Mean
Clustering
⚫ It is relatively efficient and fast. It computes result
at O(tkn), where n is number of objects or points, k
is number of clusters and t is number of iterations.
⚫ k-means clustering can be applied to machine
learning or data mining
⚫ Used on acoustic data in speech understanding to
convert waveforms into one of k categories (known
as Vector Quantization or Image Segmentation).
⚫ Also used for choosing color palettes on old-fashioned
graphical display devices and for image quantization.
CONCLUSION
⚫ The k-means algorithm is useful for undirected
knowledge discovery and is relatively simple.
K-means has found widespread usage in many
fields, ranging from unsupervised learning in
neural networks to pattern recognition,
classification analysis, artificial intelligence,
image processing, and machine vision.
