Data Mining
1
Advanced Topics (Data Mining)
Dr. Mehrdad Jalali
jalali@mshdiua.ac.ir
Jalali.mshdiau.ac.ir
2
Session Four – Association Rules
3
What Is Classification?
• The goal of data classification is to organize and
categorize data in distinct classes
– A model is first created based on the data distribution
– The model is then used to classify new data
– Given the model, a class can be predicted for new data
• Classification = prediction for discrete and nominal
values
– With classification, I can predict in which bucket to put the ball,
but I can’t predict the weight of the ball
4
Prediction, Clustering, Classification
• What is Prediction?
– The goal of prediction is to forecast or deduce the value of an attribute
based on values of other attributes
– models continuous-valued functions, i.e., predicts unknown or missing values
– A model is first created based on the data distribution
– The model is then used to predict future or unknown values
• Supervised vs. Unsupervised Classification
– Supervised Classification = Classification
• We know the class labels and the number of
classes
– Unsupervised Classification = Clustering
• We do not know the class labels and may not know
the number of classes
5
Classification and Prediction
• Typical applications
– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection
6
Classification: 3 Step Process
• 1. Model construction (Learning):
– Each record (instance) is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label
– The set of all records used for construction of the model is called training set
– The model is usually represented in the form of classification rules (IF-THEN
statements) or decision trees
• 2. Model Evaluation (Accuracy):
– Estimate the accuracy of the model on a test set
– The known label of each test sample is compared with the model's prediction
– Accuracy rate: percentage of test set samples correctly classified by the model
– The test set must be independent of the training set; otherwise overfitting will occur
• 3. Model Use (Classification):
– The model is used to classify unseen instances (assigning class labels)
– Predict the actual value of an attribute
7
Process (1): Model Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classification
Algorithms
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classifier
(Model)
8
Process (2): Using the Model in Prediction
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
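The classification step above can be sketched in a few lines of Python; the rule comes straight from the slide, while the record layout and function name are illustrative:

```python
# Apply the learned rule from the model-construction slide to the testing data.
# Rule: IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'.

def classify(rank, years):
    """Classifier (model) produced in the model-construction step."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual label)
testing_data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(1 for _, rank, years, label in testing_data
              if classify(rank, years) == label)
accuracy = correct / len(testing_data)  # Merlisa is misclassified: 3/4 = 0.75

# Unseen instance (Jeff, Professor, 4) -> Tenured?
jeff = classify("Professor", 4)  # 'yes'
```

Note that the rule misclassifies Merlisa (7 years, not tenured), which is exactly why accuracy is estimated on a test set rather than on the training data.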
9
Accuracy of Supervised Learning
• Holdout approach
– A certain amount of data (usually one-third) is reserved for testing and the
remainder (two-thirds) for training
• k-Fold cross-validation
– Split the data into k subsets of equal size; each subset is used in turn for
testing, with the remainder used for training
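Both evaluation schemes can be sketched in plain Python (function names are my own, not from the slides):

```python
import random

def holdout_split(data, test_fraction=1/3, seed=0):
    """Holdout: reserve a fraction (usually one-third) for testing."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]  # (training, testing)

def k_fold_splits(data, k=10):
    """k-fold CV: each of k equal-sized subsets is used once for testing."""
    folds = [list(data)[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

data = list(range(12))
train, test = holdout_split(data)        # 8 training, 4 testing items
splits = list(k_fold_splits(data, k=3))  # 3 (train, test) pairs
```

With k-fold cross-validation every example is tested exactly once, which gives a less variable accuracy estimate than a single holdout split.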
10
Issues: Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Normalize data
11
Issues: Evaluating Classification Methods
• Accuracy
– classifier accuracy: predicting class label
– predictor accuracy: guessing value of predicted attributes
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
12
Other Issues
• Function complexity and amount of training data
– If the true function is simple, a learning algorithm will be able to
learn it from a small amount of data; otherwise the function will only
be learnable from a very large amount of training data.
• Dimensionality of the input space
• Noise in the output values
• Heterogeneity of the data.
– If the feature vectors include features of many different kinds
(discrete, discrete ordered, counts, continuous values), some
algorithms are easier to apply than others.
• Dependency of features
13
Decision Tree – Review of Basics
What exactly is a Decision Tree?
A tree where each branching node represents a choice between two or
more alternatives, with every branching node being part of a path to a
leaf node (bottom of the tree). The leaf node represents a decision,
derived from the tree for the given input.
• How can Decision Trees be used to classify instances of data?
Instead of representing decisions, leaf nodes represent a particular
classification of a data instance, based on the given set of attributes (and
their discrete values) that define the instance of data, which is kind of like
a relational tuple for illustration purposes.
14
Review of Basics (cont’d)
• It is important that data instances have boolean or discrete
data values for their attributes to help with the basic
understanding of ID3, although there are extensions of ID3
that deal with continuous data.
15
How does ID3 relate to Decision Trees, then?
• ID3, or Iterative Dichotomiser 3 Algorithm, is a Decision Tree learning
algorithm. The name is correct in that it creates Decision Trees for
“dichotomizing” data instances, or classifying them discretely through
branching nodes until a classification “bucket” is reached (leaf node).
• By using ID3 and other machine-learning algorithms from Artificial
Intelligence, expert systems can engage in tasks usually done by human
experts, such as doctors diagnosing diseases by examining various
symptoms (the attributes) of patients (the data instances) in a complex
Decision Tree.
• Of course, accurate Decision Trees are fundamental to Data Mining and
Databases.
16
Other Decision Trees
• ID3
• J48
• C4.5
• SLIQ
• SPRINT
• PUBLIC
• RainForest
• BOAT
• Data Cube-based Decision Tree
17
Tree Presentation
18
Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
19
Output: A Decision Tree for “buys_computer”

age?
<=30 → student?
  no → no
  yes → yes
31..40 → yes
>40 → credit rating?
  excellent → no
  fair → yes
20
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuousvalued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure
(e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
– There are no samples left
21
Entropy
• S is the training set
• p⊕ is the proportion of positive examples in S
• p⊖ is the proportion of negative examples in S
• Entropy measures the impurity of S:

Entropy(S) = −p⊕ log2(p⊕) − p⊖ log2(p⊖)

Example with 6 positive and 2 negative examples:
Entropy(S) = −(6/8) log2(6/8) − (2/8) log2(2/8) ≈ 0.811

• For non-Boolean (c-class) classification:

Entropy(S) = −Σi pi log2(pi), summed over the c classes
22
Attribute Selection Measure: Information Gain
(ID3/C4.5)

Gain(S, A) = expected reduction in entropy due to sorting on A
23
Attribute Selection: Information Gain
• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”

Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) · Entropy(Sv)

For A = age:
Entropyage(S) = (|S<=30|/|S|)·Entropy(S<=30) + (|S31..40|/|S|)·Entropy(S31..40) + (|S>40|/|S|)·Entropy(S>40)
Gain for Age

Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Entropyage(S) = (5/14)·Entropy(2+, 3−) + (4/14)·Entropy(4+, 0−) + (5/14)·Entropy(3+, 2−) = 0.694
24
Attribute Selection: Information Gain
048 . 0 ) _ , (
151 . 0 ) , (
029 . 0 ) , (
=
=
=
rating credit S Gain
student S Gain
income S Gain
246 . 0 694 . 0 940 . 0 ) , ( = ÷ = age S Gain
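These entropy and gain values can be reproduced with a short Python sketch; the dataset is the buys_computer table from the slides, while the function names are my own:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Gain(S, A) = Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv)."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Columns: (age, income, student, credit_rating); label: buys_computer
rows = [
    ("<=30", "high", "no", "fair"),        ("<=30", "high", "no", "excellent"),
    ("31..40", "high", "no", "fair"),      (">40", "medium", "no", "fair"),
    (">40", "low", "yes", "fair"),         (">40", "low", "yes", "excellent"),
    ("31..40", "low", "yes", "excellent"), ("<=30", "medium", "no", "fair"),
    ("<=30", "low", "yes", "fair"),        (">40", "medium", "yes", "fair"),
    ("<=30", "medium", "yes", "excellent"), ("31..40", "medium", "no", "excellent"),
    ("31..40", "high", "yes", "fair"),     (">40", "medium", "no", "excellent"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

base = entropy(labels)                             # ≈ 0.940
gains = [info_gain(rows, a, labels) for a in range(4)]
# age has the largest gain (≈ 0.247; the slides round 0.940 - 0.694 to 0.246),
# so ID3 picks age as the root attribute.
```

Computing all four gains confirms the ranking on the slide: age > student > credit_rating > income.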
25
ID3 Algorithm
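The ID3 procedure itself (greedily pick the highest-gain attribute, split the examples, and recurse until a stopping condition holds) was shown here as a figure; a compact Python sketch of the same idea, applied to a trimmed three-attribute version of the buys_computer table, might look like this:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    """Build a decision tree; attrs are the attribute indices still unused."""
    if len(set(labels)) == 1:          # all samples in the same class
        return labels[0]
    if not attrs:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        groups = {}
        for row, lab in zip(rows, labels):
            groups.setdefault(row[a], []).append(lab)
        rem = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - rem

    best = max(attrs, key=gain)        # greedy choice of the test attribute
    branches = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                              [a for a in attrs if a != best])
    return {"attr": best, "branches": branches}

def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attr"]]]
    return tree

# Columns: (age, student, credit_rating); label: buys_computer
rows = [
    ("<=30", "no", "fair"), ("<=30", "no", "excellent"), ("31..40", "no", "fair"),
    (">40", "no", "fair"), (">40", "yes", "fair"), (">40", "yes", "excellent"),
    ("31..40", "yes", "excellent"), ("<=30", "no", "fair"), ("<=30", "yes", "fair"),
    (">40", "yes", "fair"), ("<=30", "yes", "excellent"), ("31..40", "no", "excellent"),
    ("31..40", "yes", "fair"), (">40", "no", "excellent"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

tree = id3(rows, labels, [0, 1, 2])
# The root is age (index 0), matching the tree shown on the earlier slide.
```

On this data the learned tree reproduces the slide's tree exactly and classifies all 14 training examples correctly.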
26
Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
– Sort the values of A in increasing order
– Typically, the midpoint between each pair of adjacent values is considered as
a possible split point
• (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
– The point with the minimum expected information requirement for A is
selected as the split-point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples
in D satisfying A > split-point
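The midpoint enumeration can be sketched as follows; picking the best candidate would then reuse the expected-information computation shown for discrete attributes:

```python
def candidate_split_points(values):
    """Midpoints (a_i + a_{i+1}) / 2 between adjacent sorted values of A."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

ages = [25, 32, 46, 32, 28]            # illustrative continuous attribute
points = candidate_split_points(ages)  # [26.5, 30.0, 39.0]
# For each candidate s: D1 = {A <= s}, D2 = {A > s}; keep the s with the
# minimum expected information requirement.
```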
27
Gain Ratio for Attribute Selection and Cost of Attributes
What happens if you choose Day as the root?
Attribute with Cost
28
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Pre-pruning: Halt tree construction early—do not split a node if this would result in the
goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Post-pruning: Remove branches from a “fully grown” tree—get a sequence of progressively
pruned trees
• Use a set of data different from the training data to decide which is the “best pruned tree”
29
Enhancements to Basic Decision Tree Induction
• Allow for continuous-valued attributes
– Dynamically define new discretevalued attributes that partition the continuous
attribute value into a discrete set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are sparsely represented
– This reduces fragmentation, repetition, and replication
30
Instance Based Learning
• Lazy learning (e.g., instancebased learning): Simply stores
training data (or only minor processing) and waits until it is
given a test tuple
• Eager learning (eg. Decision trees, SVM, NN): Given a set of
training set, constructs a classification model before receiving
new (e.g., test) data to classify
31
Instance-based Learning
It’s very similar to a
desktop!
32
Example
Image Scene Classification
33
Instance-based Learning
When To Consider IBL
• Instances map to points
• Less than 20 attributes per instance
• Lots of training data
Advantages:
• Can learn complex target functions (classes)
• Don't lose information
Disadvantages:
• Slow at query time
• Easily fooled by irrelevant attributes
34
K-Nearest Neighbor Learning (kNN)
• kNN is a type of instance-based learning, or lazy learning,
where the function is only approximated locally and all
computation is deferred until classification.
• The k-nearest neighbor algorithm is amongst the simplest of
all machine learning algorithms: an object is classified by a
majority vote of its neighbors, with the object being assigned
to the class most common amongst its k nearest neighbors (k
is a positive integer, typically small).
– If k = 1, then the object is simply assigned to the class of its nearest
neighbor.
35
K-Nearest Neighbor Learning (kNN) Cont.
• Usually Euclidean distance is used as the distance metric;
however, this is only applicable to continuous variables. In
cases such as text classification, another metric such as the
overlap metric can be used (or Hamming distance: it measures the minimum
number of substitutions required to change one string into the other, i.e., the
number of errors that transformed one string into the other).
36
Distance or Similarity Measures
• Common distance measures:
– Manhattan distance: dist(X, Y) = Σi |xi − yi|
– Euclidean distance: dist(X, Y) = √(Σi (xi − yi)²)
– Cosine similarity: sim(X, Y) = (Σi xi·yi) / (√(Σi xi²) · √(Σi yi²)),
with dist(X, Y) = 1 − sim(X, Y)
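The three measures translate directly into Python:

```python
from math import sqrt

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_sim(x, y):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm

x, y = (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)
d_man = manhattan(x, y)   # 6.0
d_euc = euclidean(x, y)   # sqrt(14)
sim = cosine_sim(x, y)    # 1.0, since y is a scaled copy of x
dist = 1 - sim            # cosine distance = 1 - similarity
```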
37
kNN – Algorithm
• Key idea: just store all training examples
• Nearest neighbour:
– Given query instance xq, first locate the nearest training example xn,
then estimate f(xq) = f(xn)
• k-Nearest neighbour:
– Given xq, take a vote among its k nearest neighbours (if the class is
discrete-valued)
– Take the mean of the f values of the k nearest neighbours (if real-valued)
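A minimal kNN covering both cases (majority vote for discrete classes, mean of the f values for real-valued targets); names and data are illustrative:

```python
from math import sqrt
from collections import Counter

def neighbors(train, query, k):
    """The k training pairs nearest to the query (Euclidean distance)."""
    return sorted(train, key=lambda pair: sqrt(
        sum((a - b) ** 2 for a, b in zip(pair[0], query))))[:k]

def knn_classify(train, query, k=3):
    """Discrete class: majority vote among the k nearest neighbours."""
    votes = Counter(label for _, label in neighbors(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Real value: mean of the f values of the k nearest neighbours."""
    vals = [v for _, v in neighbors(train, query, k)]
    return sum(vals) / len(vals)

train = [((0.0, 0.0), "neg"), ((0.0, 1.0), "neg"),
         ((5.0, 5.0), "pos"), ((5.0, 6.0), "pos"), ((6.0, 5.0), "pos")]
label = knn_classify(train, (4.5, 5.0), k=3)  # 'pos'
```

Note that there is no training phase at all: the work happens at query time, which is exactly the lazy-learning trade-off described above.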
38
Figure: k-nearest neighbors example. Stored training set patterns, an input
pattern X to classify, and the Euclidean distance measured to the nearest
three patterns.
39
Figure: eight paintings, labeled one through eight.
Which one belongs to Mondrian?
40
Training data
Number Lines Line types Rectangles Colours Mondrian?
1 6 1 10 4 No
2 4 2 8 5 No
3 5 2 7 4 Yes
4 5 1 8 4 Yes
5 5 1 10 5 No
6 6 1 8 6 Yes
7 7 1 14 5 No
Number Lines Line types Rectangles Colours Mondrian?
8 7 2 9 4
Test instance
41
Normalization
• Min-max normalization: to [new_minA, new_maxA]

v′ = ((v − minA) / (maxA − minA)) · (new_maxA − new_minA) + new_minA

• Z-score normalization (μ: mean, σ: standard deviation):

v′ = (v − μA) / σA

• Normalization by decimal scaling:

v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
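The three normalization schemes, sketched in Python and applied for illustration to the Lines column of the Mondrian training table:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - min)/(max - min) * (new_max - new_min) + new_min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """v' = (v - mean) / std."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """v' = v / 10^j for the smallest j with max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

lines = [6, 4, 5, 5, 5, 6, 7]        # 'Lines' column of the training data
scaled = min_max(lines)              # all values now in [0, 1]
standardized = z_score(lines)        # mean 0, unit variance
decimal = decimal_scaling(lines)     # [0.6, 0.4, 0.5, 0.5, 0.5, 0.6, 0.7]
```

Normalizing before kNN matters because attributes on larger scales would otherwise dominate the distance computation.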
42
Normalised training data
Number Lines Line types Rectangles Colours Mondrian?
1 0.632 0.632 0.327 1.021 No
2 1.581 1.581 0.588 0.408 No
3 0.474 1.581 1.046 1.021 Yes
4 0.474 0.632 0.588 1.021 Yes
5 0.474 0.632 0.327 0.408 No
6 0.632 0.632 0.588 1.837 Yes
7 1.739 0.632 2.157 0.408 No
Number Lines Line types Rectangles Colours Mondrian?
8 1.739 1.581 0.131 1.021
Test instance
43
Distances of test instance from training data
Example  Distance of test from example  Mondrian?
1 2.517 No
2 3.644 No
3 2.395 Yes
4 3.164 Yes
5 3.472 No
6 3.808 Yes
7 3.490 No
Classification
1-NN Yes
3-NN Yes
5-NN No
7-NN No
44
Example
Classify Theatre?
45
Nearest neighbors algorithms: illustration

Figure: positive (+) and negative (−) examples in the feature space, with query q1.
1-nearest neighbor: q1 takes the concept represented by e1.
5-nearest neighbors: q1 is classified as negative.
46
Voronoi diagram

Figure: Voronoi diagram induced by the stored examples; a query point q falls
in the cell of its nearest neighbor qi.
47
Variant of kNN: Distance-Weighted kNN
• We might want to weight nearer neighbors more heavily
• Then it makes sense to use all training examples instead of just k

f(xq) = (Σi=1..k wi · f(xi)) / (Σi=1..k wi), where wi = 1 / d(xq, xi)²
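The weighting scheme can be sketched as a small variation of plain kNN; if the query coincides with a training point, its label is returned directly to avoid division by zero (names and data are illustrative):

```python
from math import sqrt
from collections import defaultdict

def distance_weighted_knn(train, query, k=3):
    """Vote with weight w_i = 1 / d(x_q, x_i)^2 among the k nearest."""
    pairs = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), label)
        for x, label in train)
    votes = defaultdict(float)
    for d, label in pairs[:k]:
        if d == 0.0:                   # exact match: return its class
            return label
        votes[label] += 1.0 / d ** 2   # nearer neighbours count more
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "neg"), ((1.0, 0.0), "neg"),
         ((4.0, 4.0), "pos"), ((4.0, 5.0), "pos")]
result = distance_weighted_knn(train, (1.5, 0.5), k=3)  # 'neg'
```

Because distant neighbours receive nearly zero weight, setting k equal to the full training set size (as the slide suggests) changes the result very little.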
48
Difficulties with k-nearest neighbour
algorithms
• Have to calculate the distance of the test case from all
training cases
• There may be irrelevant attributes amongst the attributes –
the curse of dimensionality
• For instance, if we have 20 attributes but only 2 of them are
relevant, the irrelevant attributes distort the distances and
hurt the results
• Solution: assign appropriate weights to the attributes, e.g.,
tuned by cross-validation
49
How to choose “k”
• Large k:
– less sensitive to noise (particularly class noise)
– better probability estimates for discrete classes
– larger training sets allow larger values of k
• Small k:
– captures fine structure of problem space better
– may be necessary with small training sets
• Balance must be struck between large and small k
• As training set approaches infinity, and k grows large, kNN
becomes Bayes optimal
50
Assignment
1. Use Weka to classify and evaluate the Letter Recognition dataset with IBL
2. Write down 3 new challenges in kNN