Class 10-13: Pattern Classification (13 Oct 2020)
Data Science
• Multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge
Data Preprocessing
• Pipeline: Database → Data Cleaning and Cleansing → Feature Representation
Pattern Classification
Classification
• Problem of identifying to which of a set of categories a
new observation belongs
• Predicts categorical labels
• Example:
– Assigning a given email to the "spam" or "non-spam"
class
– Assigning a diagnosis (disease) to a given patient based
on observed characteristics of the patient
• Classification is a two-step process
  – Step 1: Building a classifier (data modeling)
    • Learning from data (training phase)
  – Step 2: Using the classification model for classification
    • Testing phase
2-class Classification
• Example: Classifying a person as child or adult
  – Features: height (x1) in cm and weight (x2) in kg
  – Feature vector: x = [x1 x2]T
  – Adult: class C1; Child: class C2
[Figure: an Adult/Child classifier taking height x1 and weight x2 as input, and a scatter plot of weight (x2) vs. height (x1) showing the Adult and Child classes]
Training Phase
[Figure: feature extraction maps each training image to a (height, weight) feature vector with a class label — e.g. (98, 28.43) → Child, (183, 90) → Adult, (163, 67.45) → Adult — and the labelled vectors train the classifier]
Training and Testing Phases
[Figure: training phase — feature extraction on labelled training examples, e.g. (90, 21.5) → Child, (100, 32.45) → Child, (98, 28.43) → Child, (183, 90) → Adult, (163, 67.45) → Adult, builds the classifier; testing phase — feature extraction on a new example (150, 50.6), whose class label (Adult) is output by the trained classifier]
[Figure: three scatter plots in the (x1, x2) plane showing linearly separable classes, nonlinearly separable classes, and overlapping classes]
Nearest-Neighbour Method
• Training data with N samples: D = {(xn, yn)}, n = 1, …, N, where xn ∈ Rd and yn ∈ {1, 2, …, M}
  – d: dimension of an input example
  – M: number of classes
• Step 1: Compute the Euclidean distance of a test example x = [x1 x2]T to every training example x1, x2, …, xn, …, xN:
  ||xn − x|| = sqrt((xn − x)T(xn − x)) = sqrt(Σ_{i=1…d} (xni − xi)²)
• Step 2: Sort the examples in the training set in ascending order of their distance to the test example x
[Figure: scatter plot of the training set (weight in kg vs. height in cm) with the test example; Step 1 computes the Euclidean distance (ED) to each training example]
[Figure: scatter plot (weight in kg vs. height in cm) with the training examples ranked by distance to the test example]
• Step 3: Assign the class of the training example with the minimum distance to the test example
  – Class: Adult
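The three nearest-neighbour steps above can be sketched in plain Python. This is a minimal illustration, not the lecture's reference implementation; the (height, weight) values are the example data shown in the figures.

```python
import math

# Toy training set from the Adult-Child example:
# each example is (height_cm, weight_kg) with a class label.
train_X = [(90, 21.5), (100, 32.45), (98, 28.43), (183, 90), (163, 67.45)]
train_y = ["Child", "Child", "Child", "Adult", "Adult"]

def euclidean(a, b):
    # Step 1: Euclidean distance between two feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbour(x):
    # Step 2: sort training examples in ascending order of distance to x
    ranked = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], x))
    # Step 3: assign the class of the nearest training example
    return ranked[0][1]

print(nearest_neighbour((150, 50.6)))  # test example from the slides → Adult
```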
K-Nearest-Neighbour (K-NN) Method
• Consider K = 5
• Step 3: Choose the first K = 5 examples in the sorted list
[Figure: scatter plot (weight in kg vs. height in cm) highlighting the K = 5 training examples nearest to the test example]
• Step 4: The test example is assigned the most common class among its K neighbours
  – Class: Adult
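The K-NN variant replaces Step 3 of the nearest-neighbour rule with a majority vote over the K closest examples. A minimal sketch (the training points beyond those shown in the figures are made-up for illustration):

```python
import math
from collections import Counter

# Illustrative (height_cm, weight_kg) training data with class labels
train_X = [(90, 21.5), (100, 32.45), (98, 28.43), (183, 90), (163, 67.45),
           (170, 72.0), (158, 60.2), (105, 30.0), (175, 80.5), (95, 24.0)]
train_y = ["Child", "Child", "Child", "Adult", "Adult",
           "Adult", "Adult", "Child", "Adult", "Child"]

def knn_classify(x, k=5):
    dist = lambda a: math.sqrt(sum((ai - xi) ** 2 for ai, xi in zip(a, x)))
    # Steps 1-2: distances to all training examples, sorted ascending
    ranked = sorted(zip(train_X, train_y), key=lambda p: dist(p[0]))
    # Step 3: keep the first K examples; Step 4: majority vote on their labels
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((150, 50.6), k=5))  # → Adult
```

An odd K avoids ties in the 2-class vote, which is why K = 5 is a convenient choice here.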
Data Normalization
• Since a distance measure is used, the K-NN classifier requires normalizing the values of each attribute
• Normalizing the training data:
  – Compute the minimum and maximum values of each of the attributes in the training data
  – Store the minimum and maximum values of each of the attributes
  – Perform min-max normalization on the training data set
• Normalizing the test data:
  – Use the stored minimum and maximum values of each of the attributes from the training set to normalize the test examples
• NOTE: Ensure that test examples do not cause out-of-bound values (a test attribute outside the stored minimum-maximum range maps outside [0, 1])
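The min-max procedure above can be sketched as follows; the data values are illustrative, and the function names are ours, not from the lecture:

```python
def fit_min_max(train):
    # Compute and store the per-attribute minimum and maximum
    # from the training data
    mins = [min(col) for col in zip(*train)]
    maxs = [max(col) for col in zip(*train)]
    return mins, maxs

def min_max_normalize(x, mins, maxs):
    # Map each attribute into [0, 1] using the stored training-set range.
    # A test attribute outside that range yields a value outside [0, 1]:
    # the out-of-bound case the slide warns about.
    return [(xi - lo) / (hi - lo) for xi, lo, hi in zip(x, mins, maxs)]

train = [(90, 21.5), (100, 32.45), (183, 90.0), (163, 67.45)]
mins, maxs = fit_min_max(train)   # mins = [90, 21.5], maxs = [183, 90.0]
print(min_max_normalize((150, 50.6), mins, maxs))
```

The same stored `mins`/`maxs` must be reused for every test example; re-fitting on test data would make train and test features incomparable.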
Image Classification
[Figure: example images from four classes — Tiger, Giraffe, Horse, Bear — with several images per class illustrating intraclass variability]
Confusion Matrix

                            Actual Class
                            Class1 (Positive)     Class2 (Negative)
Predicted   Class1          True Positive (TP)    False Positive (FP)   | Total test samples predicted as class1
Class       (Positive)
            Class2          False Negative (FN)   True Negative (TN)    | Total test samples predicted as class2
            (Negative)
                            Total test samples    Total test samples
                            in class1             in class2
Accuracy

Accuracy(%) = (Number of samples correctly classified / Total number of samples used for testing) × 100
            = ((C11 + C22) / Total number of samples used for testing) × 100
            = ((TP + TN) / Total number of samples used for testing) × 100
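As a small sketch, accuracy computed from a 2×2 confusion matrix; the counts here are made-up for illustration:

```python
# Hypothetical confusion matrix: rows = predicted class, columns = actual class.
# C[0][0] = TP (C11), C[1][1] = TN (C22).
C = [[45, 5],   # predicted class1: 45 true positives, 5 false positives
     [10, 40]]  # predicted class2: 10 false negatives, 40 true negatives

total = sum(sum(row) for row in C)              # total test samples = 100
accuracy = (C[0][0] + C[1][1]) / total * 100    # (45 + 40) / 100 * 100
print(f"Accuracy = {accuracy:.1f}%")            # prints "Accuracy = 85.0%"
```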
Confusion Matrix: 3-class Case

                       Actual Class
                       Class1   Class2   Class3
Predicted   Class1     C11      C21      C31    | Total samples predicted as class1
Class       Class2     C12      C22      C32    | Total samples predicted as class2
            Class3     C13      C23      C33    | Total samples predicted as class3
(Column totals: total samples in class1, class2, and class3; their sum is the total)

• Cij: number of test samples of actual class i predicted as class j
Precision and Recall

                            Actual Class
                            Class1 (Positive)     Class2 (Negative)
Predicted   Class1          True Positive (TP)    False Positive (FP)
Class       (Positive)
            Class2          False Negative (FN)   True Negative (TN)
            (Negative)

• Precision:
  – Number of samples correctly classified as positive, out of all the examples classified as positive
  – Also called positive predictive value
  Precision = TP / (TP + FP)
• Recall:
  – Number of samples correctly classified as positive, out of all the examples belonging to the positive class
  – Also called sensitivity or true positive rate (TPR)
  Recall = TP / (TP + FN)
F-score

F-score = (2 × Precision × Recall) / (Precision + Recall)
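Precision, recall, and F-score follow directly from the four confusion-matrix counts. A sketch with made-up counts:

```python
# Hypothetical counts from a 2x2 confusion matrix
TP, FP, FN, TN = 45, 5, 10, 40

precision = TP / (TP + FP)   # 45 / 50 = 0.90
recall = TP / (TP + FN)      # 45 / 55 ≈ 0.818
# F-score: harmonic mean of precision and recall
f_score = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```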
Probability Distribution
• Data of a class is represented by a probability
distribution
• For a class whose data is considered to be forming a
single cluster, it can be represented by a normal or
Gaussian distribution
• Multivariate Gaussian distribution:
– Adult-Child class
[Figure: scatter plot of the Adult-Child data, weight in kg vs. height in cm]
• For the Adult-Child example, the class data is modelled by a bivariate Gaussian distribution; each example is assumed to be sampled from that Gaussian distribution
[Figure: bivariate Gaussian density p(x) fitted over the weight (kg) vs. height (cm) scatter plot]
• Covariance matrix:
  Σ = [ σ1²   σ12 ]
      [ σ21   σ2² ]
  where σ1² = E[(x1 − μ1)²], σ12 = E[(x1 − μ1)(x2 − μ2)], σ21 = E[(x2 − μ2)(x1 − μ1)], σ2² = E[(x2 − μ2)²]
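Estimating the mean vector and covariance matrix from a class's data can be sketched with NumPy; the samples below are illustrative, not the lecture's data set:

```python
import numpy as np

# Illustrative (height_cm, weight_kg) samples for one class
X = np.array([[90.0, 21.5], [100.0, 32.45], [98.0, 28.43], [105.0, 30.0]])

mu = X.mean(axis=0)                            # estimate of the mean vector
Sigma = np.cov(X, rowvar=False, bias=True)     # ML covariance (divides by N)

# Entries of Sigma match the definitions above:
# Sigma[0, 0] = E[(x1 - mu1)^2], Sigma[0, 1] = E[(x1 - mu1)(x2 - mu2)], ...
print(mu)
print(Sigma)
```

`bias=True` gives the maximum-likelihood estimate (division by N); the default `bias=False` gives the unbiased estimate (division by N − 1).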
– Prior: prior information of a class, P(Ci) = Ni / N
  • where Ni is the number of training examples in class Ci and N is the total number of training examples, for classes i = 1, …, M
Illustration of ML Method:
Training Set: Adult-Child
• Number of training examples (N) = 20
• Dimension of a training example = 2
• Class label attribute is the 3rd dimension
• Class:
  – Child (0)
  – Adult (1)
[Figure: scatter plot of the training set, weight in kg vs. height in cm]
• Estimated covariance matrix for the Child class:
  Σ1 = [ 109.3778   61.3500 ]
       [  61.3500   43.5415 ]
[Figure: class-conditional likelihoods p(Di | μi, Σi) for the two classes, with mean vectors μ1 and μ2]
• Posterior probability of class Ci for a test example x:
  P(Ci | x) = p(x | μi, Σi) P(Ci) / Σ_{i=1…M} p(x | μi, Σi) P(Ci)
• Decision rule:
  class label = argmax over i of p(x | μi, Σi)
[Figure: bank of M class models θi = [μi, Σi], i = 1, …, M, each computing p(x | μi, Σi); the argmax over the M scores gives the class label]
• Maximum-likelihood estimates for class 1 (Child):
  Σ1 = [ 109.3778   61.3500 ]
       [  61.3500   43.5415 ]
  – Prior probability for class 1 (Child): P(C1) = 10/20 = 0.5
• Maximum-likelihood estimates for class 2 (Adult):
  Σ2 = [ 110.6667   160.5278 ]
       [ 160.5278   255.4911 ]
  – Prior probability for class 2 (Adult): P(C2) = 10/20 = 0.5
• Posterior probability of the Child class for the test example x:
  P(C1 | x) = (3.52×10⁻⁸ × 0.5) / ((3.52×10⁻⁸ × 0.5) + (3.72×10⁻⁴ × 0.5))
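The whole Bayes decision rule can be sketched end-to-end. The covariance matrices Σ1, Σ2 and the priors P(Ci) = 0.5 are the slides' values; the mean vectors are assumptions added for illustration ([103.6, 30.1] appears later as a fitted mean, and the Adult mean is made-up):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # Multivariate Gaussian density p(x | mu, Sigma)
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Class parameters: covariances and priors from the slides;
# mean vectors are illustrative assumptions.
params = {
    "Child": (np.array([103.6, 30.1]),
              np.array([[109.3778, 61.3500], [61.3500, 43.5415]]), 0.5),
    "Adult": (np.array([170.0, 75.0]),
              np.array([[110.6667, 160.5278], [160.5278, 255.4911]]), 0.5),
}

def classify(x):
    # Score each class by p(x | mu_i, Sigma_i) * P(C_i) and take the argmax
    scores = {c: gaussian_pdf(x, mu, S) * prior
              for c, (mu, S, prior) in params.items()}
    return max(scores, key=scores.get)

print(classify(np.array([150.0, 50.6])))  # → Adult
```

With equal priors, the argmax over p(x | μi, Σi) P(Ci) reduces to the argmax over the class-conditional densities alone, matching the decision rule above.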
• Statistical model:
  – Unimodal Gaussian density
    • Univariate
    • Multivariate
[Figure: unimodal bivariate Gaussian density p(x) fitted to the weight (kg) vs. height (cm) data, with mean vector [103.6 30.1]]
• Real-world data need not be unimodal
  – The shape of the density can be arbitrary
  – Bayes classifier?
    • Multimodal density function
Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Third Edition, Morgan Kaufmann Publishers,
2011.