Learning

Classification assigns data to categories with the help of predefined class labels. Clustering is similar to classification in that it groups data, but there are no predefined class labels. Classification is therefore a form of supervised learning, whereas clustering is a form of unsupervised learning.

Nearest Neighbor Classifier

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it.

The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space.

When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k "nearest neighbors" of the unknown tuple.

Closeness

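Closeness is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$, is

$$d(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$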
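As a minimal sketch, the Euclidean distance (the square root of the summed squared attribute differences) could be computed in Python as follows. The function name `euclidean_distance` is chosen here for illustration and does not come from the slides.

```python
import math

def euclidean_distance(x1, x2):
    """Return the Euclidean distance between two equal-length tuples."""
    if len(x1) != len(x2):
        raise ValueError("tuples must have the same number of attributes")
    # Sum the squared differences attribute by attribute, then take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```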

Class Label

For k-nearest-neighbor classification, the unknown tuple is assigned the most common class among its k nearest neighbors. When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
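The voting rule can be sketched in a few lines of Python. This assumes the labels of the k nearest neighbors have already been collected; the helper name `majority_class` is hypothetical, and ties are broken in favor of the label encountered first, which is a choice of this sketch rather than something the slides specify.

```python
from collections import Counter

def majority_class(neighbor_labels):
    """Return the most common label among the k nearest neighbors."""
    # most_common(1) yields a list with one (label, count) pair.
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_class(["Good", "Bad", "Good"]))  # Good
print(majority_class(["Bad"]))                  # Bad (the k = 1 case)
```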

1-Nearest Neighbor

3-Nearest Neighbor

K-Nearest Neighbor

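An arbitrary instance $x$ is represented by the feature vector

$$(a_1(x), a_2(x), a_3(x), \ldots, a_n(x))$$

where $a_i(x)$ denotes the i-th feature of $x$.

The Euclidean distance between two instances $x_i$ and $x_j$ is

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$$

For a continuous-valued target function, the prediction is the mean value of the k nearest training examples.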
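For the continuous-valued case, a minimal sketch might look like the following. The function name `knn_regress` and the toy data are hypothetical, and `math.dist` assumes Python 3.8 or later.

```python
import math

def knn_regress(training, query, k):
    """Predict a continuous target as the mean over the k nearest examples.

    training: list of (features, target) pairs; query: feature tuple.
    """
    # Order the training examples by Euclidean distance to the query.
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], query))
    nearest = by_distance[:k]
    # The prediction is the mean target value of the k nearest examples.
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical one-dimensional data, for illustration only.
data = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_regress(data, (2.0,), k=3))  # 20.0
```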

The K-nearest-neighbor algorithm is computed step by step as follows:

1. Determine the parameter K, the number of nearest neighbors.
2. Calculate the distance between the query instance and all the training samples.
3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance.
4. Gather the category Y of the nearest neighbors.
5. Use the majority category of the nearest neighbors as the prediction value for the query instance.
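The steps above can be sketched in Python roughly as follows. This is one possible reading, not a definitive implementation: the function name `knn_classify` and the toy data are hypothetical, Euclidean distance is assumed as the metric, and `math.dist` assumes Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """Classify `query` by majority vote among its k nearest training samples.

    training: list of (features, label) pairs; query: feature tuple.
    Step 1 (choosing K) is left to the caller via the `k` argument.
    """
    # Step 2: distance from the query to every training sample.
    distances = [(math.dist(features, query), label) for features, label in training]
    # Step 3: sort by distance and keep the k closest.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Step 4: gather the labels of the nearest neighbors.
    labels = [label for _, label in nearest]
    # Step 5: the majority label is the prediction.
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy data, for illustration only.
samples = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B"), ((1, 0), "A")]
print(knn_classify(samples, (0.5, 0.5), k=3))  # A
```

Sorting all distances costs O(m log m) for m training samples; that is fine for small data sets like the ones on these slides.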

Example

We have data on two attributes, acid durability and strength, used to classify whether a special tissue is 'good' or 'bad'.

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Y = Classification
7                                7                                 Bad
7                                4                                 Bad
3                                4                                 Good
1                                4                                 Good

Now the factory produces a new tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess the classification of this new tissue?

1. Determine the parameter K, the number of nearest neighbors. Suppose we use K = 3.

2. Calculate the distance between the query instance and all the training samples.

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Squared distance to query instance (3, 7)
7                                7                                 (7-3)^2 + (7-7)^2 = 16
7                                4                                 (7-3)^2 + (4-7)^2 = 25
3                                4                                 (3-3)^2 + (4-7)^2 = 9
1                                4                                 (1-3)^2 + (4-7)^2 = 13
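The values in this step can be checked with a short script. Note that the table lists squared distances (no square root is taken); because the square root is monotonic, ranking by squared distance gives the same order as ranking by Euclidean distance. The helper name `squared_distance` is chosen here for illustration.

```python
# Training samples from the example: (acid durability, strength).
samples = [(7, 7), (7, 4), (3, 4), (1, 4)]
query = (3, 7)

def squared_distance(p, q):
    """Sum of squared attribute differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

squared = [squared_distance(s, query) for s in samples]
print(squared)  # [16, 25, 9, 13]
```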

3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance.

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Squared distance to query instance (3, 7)   Rank of minimum distance   Included in 3-nearest neighbors?
7                                7                                 (7-3)^2 + (7-7)^2 = 16                      3                          Yes
7                                4                                 (7-3)^2 + (4-7)^2 = 25                      4                          No
3                                4                                 (3-3)^2 + (4-7)^2 = 9                       1                          Yes
1                                4                                 (1-3)^2 + (4-7)^2 = 13                      2                          Yes

4. Gather the category Y of the nearest neighbors.

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Rank of minimum distance   Included in 3-nearest neighbors?   Y = Category of nearest neighbor
7                                7                                 3                          Yes                                Bad
7                                4                                 4                          No                                 -
3                                4                                 1                          Yes                                Good
1                                4                                 2                          Yes                                Good

5. Use the majority category of the nearest neighbors as the prediction value for the query instance.

Among the 3 nearest neighbors, 2 are Good and 1 is Bad, so the query instance (3, 7) is classified as Good.
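The ranking and the vote for this example can be reproduced with the short sketch below. It assumes K = 3 and squared Euclidean distance, as in the worked tables; the variable and function names are chosen here for illustration.

```python
from collections import Counter

# Training data from the example: ((acid durability, strength), class).
training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

def squared_distance(p, q):
    """Sum of squared attribute differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Rank the training samples from nearest to farthest.
ranked = sorted(training, key=lambda row: squared_distance(row[0], query))
# Keep the labels of the k nearest neighbors and take the majority.
nearest_labels = [label for _, label in ranked[:k]]
prediction = Counter(nearest_labels).most_common(1)[0][0]

print(nearest_labels)  # ['Good', 'Good', 'Bad']
print(prediction)      # Good
```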

Exercise

Find the class label of the tuple (Age = 26, Income = 18) using the training data below.

Age   Income ($/hr)   Class label
25    15              Yes
20    10              No
30    20              Yes
35    25              No

