[Figure: building the FP-tree — the item header table links into the tree, whose nodes carry counts such as I2:7, I1:4, I1:2, I3:2, I4:1, I5:1]

Item ID   Support count
I2        7
I1        6
I3        6
I4        2
I5        2

When a branch of a transaction is added, the count for each node along a common prefix is incremented by 1.
Construct the FP-Tree
[Figure: the completed FP-tree — item header table (I2:7, I1:6, I3:6, I4:2, I5:2) with node-links into the tree branches, whose nodes carry counts such as I2:7, I1:4, I1:2, I3:2, I4:1, I5:1]
◼ If the tree does not fit into main memory, partition the database
◼ Efficient and scalable for mining both long and short frequent patterns
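A minimal Python sketch of the insertion step described above (the FPNode class and insert_transaction function are illustrative names; node-links to the header table are omitted):

class FPNode:
    """One node of an FP-tree: an item, a count, and child links."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}              # item -> FPNode

def insert_transaction(root, items):
    # items must already be sorted by descending support count (I2, I1, I3, I4, I5)
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:               # no shared prefix here: start a new branch
            child = FPNode(item, parent=node)
            node.children[item] = child
        child.count += 1                # shared prefix: increment the count by 1
        node = child

root = FPNode()
insert_transaction(root, ["I2", "I1", "I5"])
insert_transaction(root, ["I2", "I1", "I3"])
print(root.children["I2"].count)        # 2: the common prefix node I2 was counted twice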
Association rules must satisfy both minimum support and minimum confidence.

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

For each frequent itemset L, a candidate rule has the form S ⇒ (L − S), where S is a nonempty subset of L; the rule is output if its confidence meets the minimum confidence threshold.
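A small worked check of the confidence formula in Python, using a transaction table consistent with the support counts in the header table above (I2:7, I1:6, I3:6, I4:2, I5:2):

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset):
    # number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    # confidence(A => B) = support_count(A u B) / support_count(A)
    return support_count(antecedent | consequent) / support_count(antecedent)

print(confidence({"I1"}, {"I2"}))       # 4/6 ≈ 0.67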
◼ Target marketing
◼ Medical diagnosis
◼ Fraud detection
◼ If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Process (1): Model Construction
[Figure: classification as a two-step process — a classification algorithm learns a classifier from the training data; the classifier is then applied to testing data and to unseen data such as (Jeff, Professor, 4) → Tenured?]

Testing data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Supervised vs. Unsupervised Learning
Supervised Learning
Training dataset:
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 1
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 1
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0 1
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0 1
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0 0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 1
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1 0
Test dataset:
71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 ?
Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition.
Un-Supervised Learning
Data Set:
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1
Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition.
Bayesian Classification
Bayesian Classification: Why?
◼ A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
◼ Foundation: Based on Bayes’ Theorem.
◼ Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
◼ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed
data
◼ Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
Bayesian Theorem: Basics
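For reference, the theorem this slide builds on, with X a data tuple and H a hypothesis (e.g. that X belongs to class C):

P(H|X) = P(X|H) · P(H) / P(X)

The naïve Bayesian classifier predicts the class Ci that maximizes P(Ci|X) ∝ P(X|Ci) · P(Ci), where the class-conditional independence assumption gives P(X|Ci) = Π_{k=1..n} P(xk|Ci).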
◼ Disadvantages
◼ Assumption: class conditional independence, therefore
loss of accuracy
◼ Practically, dependencies exist among variables, which cannot be modeled by a naïve Bayesian classifier
◼ How to deal with these dependencies?
◼ Bayesian Belief Networks
Bayesian Belief Networks
[Figure: example Bayesian belief network — parent nodes Parents visiting (Yes/No), Weather (Sunny/Windy/Rainy), and Money (Rich/Poor) influence the choice of activity: Cinema, Play Tennis, Stay in, or Shopping]
Decision Tree Induction
The decision tree algorithm is called with three parameters:
◼ D, the data partition. Initially, it is the complete set of training tuples and their associated class labels.
◼ Attribute list, a list of the attributes describing the tuples.
◼ Attribute selection method, which specifies a heuristic approach for selecting the attribute that "best" describes the given tuples according to class. Attribute selection measures include information gain and the Gini index (defined below).
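For reference, the standard definitions of these two measures (pi is the probability that a tuple in D belongs to class Ci):

Info(D) = − Σ_{i=1..m} pi log2(pi)
Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
Gain(A) = Info(D) − Info_A(D)
Gini(D) = 1 − Σ_{i=1..m} pi²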
Decision Tree Algorithm
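The numbered steps referenced in the walk-through below belong to the algorithm figure, which is not reproduced here; as a rough stand-in, a minimal Python sketch of the recursion (the helper names and dict-based tree representation are illustrative, and only discrete-valued splits are handled):

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    # D: list of (tuple_as_dict, class_label) pairs
    classes = [label for _, label in D]
    if len(set(classes)) == 1:                         # all tuples in the same class -> leaf
        return {"leaf": classes[0]}
    if not attribute_list:                             # no attributes left -> majority vote
        return {"leaf": max(set(classes), key=classes.count)}
    A = attribute_selection_method(D, attribute_list)  # pick the "best" splitting attribute
    tree = {"test": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]  # A is removed from the attribute list
    for value in {t[A] for t, _ in D}:                 # one branch per known value of A
        Dj = [(t, label) for t, label in D if t[A] == value]
        tree["branches"][value] = generate_decision_tree(Dj, remaining, attribute_selection_method)
    return tree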
Decision Tree Induction
The tree starts with a single node N representing the training tuples in D (step 1).
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class (steps 2 and 3).
Otherwise, the algorithm calls Attribute selection method to determine
the splitting criterion.
The splitting criterion tells us which attribute to test at node N by
determining the “best” way to separate or partition the tuples in D into
individual classes (step 6).
The splitting criterion also tells us which branches to grow from node
N with respect to the outcomes of the chosen test.
Decision Tree Induction
The splitting criterion indicates the splitting attribute and possibly either a split-point or a splitting subset.
The splitting criterion is determined so that, ideally, the resulting
partitions at each branch are as “pure” as possible.
A partition is pure if all of the tuples in it belong to the same class.
Three possibilities of partitioning tuples based on the
splitting criterion
[Figure: the three ways of partitioning tuples on splitting attribute A — A is discrete-valued; A is continuous-valued; A is discrete-valued and a binary tree must be produced]
Decision Tree Induction
The node N is labeled with the splitting criterion, which serves as a
test at the node (step 7).
A branch is grown from node N for each of the outcomes of the
splitting criterion.
The tuples in D are partitioned accordingly (steps 10 to 11).
There are three possible scenarios.
Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av}, based on the training data.
A is discrete-valued
A is continuous-valued
A is discrete-valued and a binary tree must be produced
Decision Tree Induction- A is discrete-valued
The outcomes of the test at node N correspond directly to the known values of A.
A branch is created for each known value, aj, of A and labeled with that value.
Partition Dj is the subset of class-labeled tuples in D having value aj of A.
Because all of the tuples in a given partition have the same value for A, A need not be considered in any future partitioning of the tuples.
Therefore, it is removed from the attribute list (steps 8 to 9).
Decision Tree Induction- A is continuous-valued
In this case, the test at node N has two possible outcomes: A ≤ split point and A > split point, where split point is the split-point returned by Attribute selection method as part of the splitting criterion.
(In practice, the split-point, a, is often taken as the midpoint of two known adjacent values of A and therefore may not actually be a pre-existing value of A from the training data.)
The tuples are partitioned such that D1 holds the subset of class-labeled tuples in D for which A ≤ split point, while D2 holds the rest.
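A minimal sketch of this partitioning in Python (candidate split-points taken as midpoints of adjacent sorted values; function names are illustrative):

def candidate_split_points(values):
    # midpoints of adjacent known values of a continuous attribute A
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def binary_partition(D, attribute, split_point):
    # D1: tuples with A <= split_point; D2: the rest
    D1 = [t for t in D if t[attribute] <= split_point]
    D2 = [t for t in D if t[attribute] > split_point]
    return D1, D2

print(candidate_split_points([30, 40, 45, 55]))      # [35.0, 42.5, 50.0]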
Decision Tree Induction- A is discrete-valued and a binary
tree must be produced
The test at node N is of the form “A∈SA?”.
SA is the splitting subset for A, returned by Attribute selection method as part of
the splitting criterion.
It is a subset of the known values of A.
If a given tuple has value aj of A and if aj ∈SA, then the test at node N is
satisfied. Two branches are grown from N.
the left branch out of N is labeled yes so that D1 corresponds to the subset of
class-labeled tuples in D that satisfy the test.
The right branch out of N is labeled no so that D2 corresponds to the subset of
class-labeled tuples from D that do not satisfy the test.
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
There are no samples left
Problems in Decision Trees
Advantages of Decision Tree
Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend.
They are usually faster and better at correctly classifying independent test
data (i.e., of previously unseen tuples) than unpruned trees.
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
◼ Confusion matrix (actual class \ predicted class):
          P     N     total
  P       TP    FN    P
  N       FP    TN    N
  total   P’    N’    All
◼ Class Imbalance Problem:
◼ One class may be rare, e.g. fraud, or HIV-positive
◼ Significant majority of the negative class
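In terms of the matrix entries, the four measures named in the title are:

accuracy = (TP + TN) / All
error rate = (FP + FN) / All
sensitivity = TP / P      (true positive rate)
specificity = TN / N      (true negative rate)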
Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
◼ Cluster: a collection of data objects
◼ Similar to one another within the same cluster
◼ Dissimilar to the objects in other clusters
◼ Cluster analysis
◼ Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
◼ Unsupervised learning: no predefined classes
◼ Typical applications
◼ As a stand-alone tool to get insight into data distribution
◼ As a preprocessing step for other algorithms
Clustering: Rich Applications and
Multidisciplinary Efforts
◼ Pattern Recognition
◼ Spatial Data Analysis
◼ Create thematic maps in GIS by clustering feature
spaces
◼ Detect spatial clusters or for other spatial mining tasks
◼ Image Processing
◼ Economic Science (especially market research)
◼ WWW
◼ Document classification
◼ Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
◼ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
◼ Land use: Identification of areas of similar land use in an earth
observation database
◼ Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
◼ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
◼ Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Requirements of Clustering in Data Mining
◼ Scalability
◼ Ability to deal with different types of attributes
◼ Ability to handle dynamic data
◼ Discovery of clusters with arbitrary shape
◼ Minimal requirements for domain knowledge to
determine input parameters
◼ Able to deal with noise and outliers
◼ Insensitive to order of input records
◼ High dimensionality
◼ Incorporation of user-specified constraints
◼ Interpretability and usability
1. Scalability
◼ Dissimilarity matrix (one mode):

   0
   d(2,1)   0
   d(3,1)   d(3,2)   0
   ⋮        ⋮        ⋮
   d(n,1)   d(n,2)   ...   ...   0
◼ Interval-scaled variables
◼ Binary variables
◼ Nominal, ordinal, and ratio variables
◼ Variables of mixed types
◼ Standardize data
◼ Calculate the mean absolute deviation:
   s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
   where m_f = (1/n) (x_1f + x_2f + ... + x_nf)
◼ If q = 2, d is Euclidean distance:
   d(i, j) = √(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
◼ Properties
◼ d(i, j) ≥ 0
◼ d(i, i) = 0
◼ d(i, j) = d(j, i)
◼ d(i, j) ≤ d(i, k) + d(k, j)
◼ Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
◼ Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
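Treating Y and P as 1, N as 0, ignoring gender, and using the asymmetric binary (Jaccard) dissimilarity d(i, j) = (b + c) / (a + b + c) — the usual way this example is worked, though the formula is not shown in the extracted text — the pairwise dissimilarities are:

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75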
◼ Partitioning approach:
◼ Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
◼ Typical methods: k-means, k-medoids, CLARANS
◼ Hierarchical approach:
◼ Create a hierarchical decomposition of the set of data (or objects) using
some criterion
◼ Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
◼ Density-based approach:
◼ Based on connectivity and density functions
◼ Typical methods: DBSCAN, OPTICS, DenClue
1. Calculate the distance matrix.  2. Calculate the three cluster distances between C1 = {a, b} and C2 = {c, d, e}.

      a   b   c   d   e
  a   0   1   3   4   5
  b   1   0   2   3   4
  c   3   2   0   1   2
  d   4   3   1   0   1
  e   5   4   2   1   0

Single link:
  dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = min{3, 4, 5, 2, 3, 4} = 2

Complete link:
  dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = max{3, 4, 5, 2, 3, 4} = 5

Average:
  dist(C1, C2) = (d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5
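The same three cluster distances computed programmatically (a small Python sketch over the distance matrix above, with C1 = {a, b} and C2 = {c, d, e}):

dist = {
    ("a", "b"): 1, ("a", "c"): 3, ("a", "d"): 4, ("a", "e"): 5,
    ("b", "c"): 2, ("b", "d"): 3, ("b", "e"): 4,
    ("c", "d"): 1, ("c", "e"): 2, ("d", "e"): 1,
}

def d(x, y):
    # symmetric lookup in the distance matrix above
    return 0 if x == y else dist.get((x, y), dist.get((y, x)))

C1, C2 = ["a", "b"], ["c", "d", "e"]
pairs = [d(x, y) for x in C1 for y in C2]
print(min(pairs), max(pairs), sum(pairs) / len(pairs))   # 2 5 3.5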
Dm = Σ_{i=1..N} Σ_{q=1..N} (t_ip − t_iq)² / (N(N − 1))
◼ Hierarchical clustering
◼ A set of nested clusters organized as a hierarchical tree
[Figure: traditional hierarchical clustering of points p1, p2, p3, p4 and the corresponding dendrogram]
[Figure: non-traditional hierarchical clustering of the same points and the corresponding dendrogram]
[Figure: starting situation — individual points p1 ... p12 and their proximity matrix]
[Figure: intermediate situation — clusters C1, C2, C3, C4, C5 and their proximity matrix]
◼ We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: after merging — the row and column for the new cluster C2 ∪ C5 are marked "?" until its proximities to C1, C3 and C4 are recomputed]
How to Define Inter-Cluster Similarity
[Figure: two clusters of points and the proximity matrix over p1, p2, p3, p4, p5, ... — which entries define the similarity between the clusters?]
• MIN
• MAX
• Group Average
• Methods driven by an objective function
  – Ward's Method uses squared error
Hierarchical Clustering: Time and Space Requirements
The example below starts from a data matrix, computes the Euclidean distance between every pair of objects, and works on the resulting distance matrix.
Example
◼ Merge two closest clusters (iteration 1)
Weaknesses:
❖ Hierarchical Clustering
❖ ROCK
[Figure: the K-means process with K = 2 — arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign; the loop repeats until the assignments no longer change]
⚫ Dissimilarity calculations
⚫ Therefore, there is no change in the clusters.
⚫ Thus, the algorithm halts here, and the final result consists of the 2 clusters {1,2} and {3,4,5,6,7}.
Medicine A: (1, 1)
Medicine B: (2, 1)
Medicine C: (4, 3)
Medicine D: (5, 4)
Step 1:
⚫ Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids.
⚫ Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).
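Step 2 (objects-centroids distances) then gives the following Euclidean distances (worked here, since the distance matrix itself is not reproduced):

d(c1, A) = 0,  d(c1, B) = 1,  d(c1, C) = √13 ≈ 3.61,  d(c1, D) = 5
d(c2, A) = 1,  d(c2, B) = 0,  d(c2, C) = √8 ≈ 2.83,  d(c2, D) = √18 ≈ 4.24

so in the first assignment step A goes to cluster 1 and B, C, D go to cluster 2.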
⚫ Iteration 2, determine centroids: Now we repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and group 2 both have two members, so the new centroids are c1 = ((1+2)/2, (1+1)/2) = (1.5, 1) and c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5).
⚫ Iteration 2, objects-centroids distances: Repeating step 2, we obtain the new distance matrix at iteration 2.
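Recomputing those distances from c1 = (1.5, 1) and c2 = (4.5, 3.5) (a worked check, since the matrix itself is not reproduced):

d(c1, A) = 0.5,  d(c1, B) = 0.5,  d(c1, C) ≈ 3.20,  d(c1, D) ≈ 4.61
d(c2, A) ≈ 4.30,  d(c2, B) ≈ 3.54,  d(c2, C) ≈ 0.71,  d(c2, D) ≈ 0.71

Each medicine stays in its current group ({A, B} in cluster 1, {C, D} in cluster 2), so the grouping no longer changes and the algorithm stops.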
' K-means in VB: assignment step — reassign each data point to its nearest
' centroid; isStillMoving stays True as long as any point changes cluster.
Dim i As Integer
Dim j As Integer
Dim X As Single
Dim Y As Single
Dim min As Single
Dim cluster As Integer
Dim d As Single
Dim sumXY()                   ' used by the centroid-update step (not shown)

Do While isStillMoving
    isStillMoving = False
    For i = 1 To totalData
        min = 10 ^ 10         ' big number as the initial "infinite" distance
        X = Data(1, i)        ' x-coordinate of point i
        Y = Data(2, i)        ' y-coordinate of point i
        For j = 1 To numCluster
            d = dist(X, Y, Centroid(1, j), Centroid(2, j))
            If d < min Then   ' remember the closest centroid seen so far
                min = d
                cluster = j
            End If
        Next j
        If Data(0, i) <> cluster Then   ' row 0 of Data holds the assigned cluster
            Data(0, i) = cluster
            isStillMoving = True        ' a point moved, so iterate again
        End If
    Next i
    ' (centroid-update step omitted)
Loop
End Sub
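For comparison, the full assign-and-update loop as a compact Python sketch (plain Python; variable and function names are illustrative), applied to the medicine example above:

import math
import random

def kmeans(points, k, max_iter=100):
    # plain k-means on 2-D points: assign each point to the nearest centroid, then update means
    centroids = random.sample(points, k)                 # arbitrarily choose k objects as initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # assignment step
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:                    # no centroid moved: stop
            break
        centroids = new_centroids
    return centroids, clusters

print(kmeans([(1, 1), (2, 1), (4, 3), (5, 4)], 2))        # the medicine example with K = 2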
Weaknesses of K-Means Clustering
1. When the number of data points is small, the initial grouping will determine the clusters significantly.
2. The number of clusters, K, must be determined beforehand. The method does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
3. We never know the real clusters: using the same data, presenting it in a different order may produce different clusters when the number of data points is small.
4. It is sensitive to the initial condition. A different initial condition may produce a different clustering result, and the algorithm may be trapped in a local optimum.
Applications of K-Means Clustering
⚫ It is relatively efficient and fast: it computes the result in O(tkn) time, where n is the number of objects or points, k is the number of clusters, and t is the number of iterations.
⚫ K-means clustering can be applied to machine learning or data mining.
⚫ Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as vector quantization or image segmentation).
⚫ Also used for choosing color palettes on old-fashioned graphical display devices and for image quantization.
CONCLUSION
⚫ The K-means algorithm is useful for undirected knowledge discovery and is relatively simple. K-means has found widespread use in many fields, ranging from unsupervised learning of neural networks to pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision, among many others.