Foundations of Data Science - Unit 5 - Accuracy KNN
3/14/2023
Outline
▪ Accuracy Measures
▪ Eager and Lazy Learners
▪ Distance Computation and Normalization
▪ K-Nearest Neighbor
▪ Cross Validation
Dr. Muhammad Usman Arif, Spring 2023
▪ Confusion Matrix:

                           PREDICTED CLASS
                           Class=Yes    Class=No
    ACTUAL     Class=Yes   a (TP)       b (FN)
    CLASS      Class=No    c (FP)       d (TN)

▪ Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
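The accuracy formula above can be sketched directly in Python; the counts used in the example are hypothetical, not from the slides.

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), per the confusion matrix above."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: a = 50 TP, b = 10 FN, c = 5 FP, d = 35 TN
print(accuracy(50, 10, 5, 35))  # 0.85
```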
Limitation of Accuracy
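A commonly cited limitation is class imbalance: a classifier that always predicts the majority class can score high accuracy while never detecting the minority class. A minimal sketch with a hypothetical 990-vs-10 split:

```python
# Hypothetical 2-class problem: 990 actual No tuples, 10 actual Yes tuples.
# A classifier that always predicts "No" misses every Yes case,
# yet its accuracy still looks excellent.
tp, fn = 0, 10    # all 10 actual Yes tuples are missed
fp, tn = 0, 990   # all 990 actual No tuples are "correct"
acc = (tp + tn) / (tp + tn + fp + fn)
print(acc)  # 0.99
```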
Recall and Precision

▪ Precision (p) = a / (a + c)
▪ Recall (r) = a / (a + b)
▪ F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
▪ Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)

▪ Example:

    Actual   Prediction
    T        T
    T        F
    F        T
    F        F
    F        T
    T        T
    T        T
    T        F
    F        T
    T        T
▪ For the example table above: TP = 4, FN = 2, FP = 3, TN = 1
▪ Precision = 4 / 7
▪ Recall = 4 / 6
▪ F-Measure = 8 / 13
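The counts and metrics for the example can be checked with a short Python sketch over the same ten (Actual, Prediction) pairs:

```python
pairs = [("T", "T"), ("T", "F"), ("F", "T"), ("F", "F"), ("F", "T"),
         ("T", "T"), ("T", "T"), ("T", "F"), ("F", "T"), ("T", "T")]

tp = sum(1 for a, p in pairs if a == "T" and p == "T")  # 4
fn = sum(1 for a, p in pairs if a == "T" and p == "F")  # 2
fp = sum(1 for a, p in pairs if a == "F" and p == "T")  # 3

precision = tp / (tp + fp)                                 # 4/7
recall = tp / (tp + fn)                                    # 4/6
f_measure = 2 * recall * precision / (recall + precision)  # 8/13
print(precision, recall, f_measure)
```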
K-Nearest Neighbor
Eager Learners
▪ Eager learners construct a classification model from the given training tuples before receiving new (test) tuples to classify.
Lazy Learners
▪ Imagine a contrasting lazy approach
▪ The learner instead waits until the last minute before doing any model
construction in order to classify a given test tuple.
▪ When given a training tuple, a lazy learner simply stores it (or does only
a little minor processing) and waits until it is given a test tuple.
▪ Only when it sees the test tuple does it perform generalization in order
to classify the tuple based on its similarity to the stored training tuples.
Distance Computation
▪ Absolute Difference
Normalization
▪ Min-Max Normalization
  ▪ v'(i) = (v(i) - min) / (max - min)
▪ Decimal Scaling
  ▪ v'(i) = v(i) / 10^k
  ▪ for the smallest k such that max |v'(i)| < 1.
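Both normalization schemes can be sketched in a few lines of Python; the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Min-max normalization: rescale to [0, 1] via v' = (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def decimal_scale(values):
    """Decimal scaling: v'(i) = v(i) / 10^k for the smallest k with max |v'(i)| < 1."""
    k = 0
    while max(abs(v) for v in values) / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values]

print(min_max_normalize([20, 30, 40]))  # [0.0, 0.5, 1.0]
print(decimal_scale([-991, 45, 300]))   # k = 3 -> [-0.991, 0.045, 0.3]
```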
Distance Function
▪ Although there are other possible choices, most instance-based learners use Euclidean distance.
▪ Euclidean distance between two examples:
▪ P = [p1, p2, p3, ..., pn]
▪ Q = [q1, q2, q3, ..., qn]
▪ The Euclidean distance between P and Q is defined as:
  d(P, Q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
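The definition translates directly to Python:

```python
import math

def euclidean(p, q):
    """d(P, Q) = sqrt of the sum of squared attribute differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean([0, 0], [3, 4]))  # 5.0 (the classic 3-4-5 triangle)
```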
K-Nearest Neighbor
▪ The k-nearest-neighbor method was first described in the early
1950s.
K-Nearest Neighbor
[Figure: an unknown record plotted among the labeled training records]
Majority Voting
▪ Take the majority vote of class labels among the k-nearest
neighbors
▪ If K=5: Response -> No
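Majority voting over the k nearest neighbors can be sketched as follows; the training points and query are hypothetical, chosen so that K=5 yields "No" as on the slide.

```python
import math
from collections import Counter

def knn_classify(train, query, k=5):
    """train: list of (feature_vector, label) pairs.
    Returns the majority class among the k nearest neighbors
    of query, using Euclidean distance."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([1, 1], "No"), ([1, 2], "No"), ([2, 1], "No"),
         ([5, 5], "Yes"), ([6, 5], "Yes")]
print(knn_classify(train, [1.5, 1.5], k=5))  # "No" (3 of 5 votes)
```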
Right Value of K
▪ A good value for K can be determined experimentally.
▪ Starting with k = 1, we use a test set to estimate the error rate of the
classifier.
▪ This process can be repeated each time by incrementing k to allow
for one more neighbor.
▪ The k value that gives the minimum error rate may be selected.
▪ In general, the larger the number of training tuples is, the larger the value of k will be (so that classification and prediction decisions can be based on a larger portion of the stored tuples).
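The procedure above (try k = 1, 2, ... and keep the k with minimum test-set error) can be sketched in Python. The classifier and data here are illustrative assumptions, not the slides' example.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    # Majority vote among the k nearest neighbors by Euclidean distance.
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def error_rate(train, test, k):
    # Fraction of test tuples the k-NN classifier gets wrong.
    wrong = sum(1 for x, y in test if knn_classify(train, x, k) != y)
    return wrong / len(test)

def best_k(train, test, max_k):
    """Try k = 1, 2, ..., max_k and return the k with minimum test error."""
    return min(range(1, max_k + 1), key=lambda k: error_rate(train, test, k))

train = [([0], "A"), ([1], "A"), ([10], "B"), ([11], "B")]
test = [([0.5], "A"), ([10.5], "B")]
print(best_k(train, test, max_k=4))  # 1 (already zero error at k = 1)
```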
Example: Non-Normalized
Example: Normalized
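The point of the two examples can be sketched with hypothetical [income, age] tuples: without normalization the large-range attribute (income) dominates the Euclidean distance, while after min-max normalization (assumed ranges: income 0-100000, age 0-100) the age difference matters again.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Non-normalized: income's range dwarfs age's.
a = [50000, 25]
b = [51000, 65]
print(euclidean(a, b))  # ~1000.8: almost entirely the income difference

# Normalized (min-max, assumed ranges income 0-100000, age 0-100):
a_n = [0.50, 0.25]
b_n = [0.51, 0.65]
print(euclidean(a_n, b_n))  # ~0.40: the 40-year age gap now dominates
```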
Limitations
▪ It tends to be slow for large training sets, because the entire set must be searched for each test instance.
▪ It performs badly with noisy data, because the class of a test instance is
determined by its single nearest neighbor without any “averaging” to help to
eliminate noise.
▪ A simple variant is to classify each example with respect to the examples already
seen and to save only ones that are misclassified.
▪ This can be done by keeping a record of the number of correct and incorrect
classification decisions that each exemplar makes.
▪ For numeric prediction, the classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown tuple.
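Averaging the neighbors' real-valued labels can be sketched as follows; the training data are hypothetical.

```python
import math

def knn_predict(train, query, k):
    """Numeric prediction: average the real-valued labels of the
    k nearest neighbors of query (Euclidean distance)."""
    dist = lambda p: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, query)))
    neighbors = sorted(train, key=lambda t: dist(t[0]))[:k]
    return sum(y for _, y in neighbors) / k

train = [([1], 10.0), ([2], 12.0), ([3], 14.0), ([9], 40.0)]
print(knn_predict(train, [2.5], k=3))  # (12 + 14 + 10) / 3 = 12.0
```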