
Data Imputation with KNN

 K Nearest Neighbors (KNN) assigns a value to a case based on how similar that case is to the points in the
training set.
 A missing value is imputed with the mean of that attribute over the k nearest neighbors.
 $E(a, b) = \sqrt{\sum_{i \in D} (x_{ai} - x_{bi})^2}$

 $E(a, b)$ is the distance between the two cases $a$ and $b$,

 $x_{ai}$ and $x_{bi}$ are the values of attribute $i$ in cases $a$ and $b$ respectively,

 $D$ is the set of attributes with non-missing values in both cases.
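
The distance defined above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `partial_euclidean` is hypothetical, missing values are assumed to be encoded as NaN, and returning infinity when two cases share no attributes is an assumption the text does not cover.

```python
import math

def partial_euclidean(a, b):
    """Euclidean distance over the attributes present (not NaN) in both cases."""
    # D: pairs of values for attributes that are non-missing in both a and b
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        # Assumption: cases with no common attributes are treated as infinitely far apart
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

nan = math.nan
r1 = [100.0, 30.0, nan]   # TotalDayMinutes, TotalDayCalls, TotalDayCharge
r3 = [nan, 56.0, 80.0]

# Only TotalDayCalls is present in both rows, so D contains a single attribute
d31 = partial_euclidean(r3, r1)
print(d31)
```

Here only TotalDayCalls is shared by the two rows, so the distance reduces to |56 - 30| = 26.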

No  TotalDayMinutes  TotalDayCalls  TotalDayCharge
1   100.0            30.0           NaN
2   90.0             45.0           40.0
3   NaN              56.0           80.0
4   95.0             NaN            98.0

In this example calculation, k is set to 2.

TotalDayMinutes
$E(r_3, r_1) = \sqrt{(56 - 30)^2} = 26$

$E(r_3, r_2) = \sqrt{(56 - 45)^2 + (80 - 40)^2} \approx 41.48$

$E(r_3, r_4) = \sqrt{(80 - 98)^2} = 18$

Sort the Euclidean distances in ascending order and select the two nearest rows: r4 (distance 18) and r1 (distance 26).
Their TotalDayMinutes values are 95 and 100.
The mean of these values, 97.5, is imputed for row 3.
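
The selection step above can be sketched as follows. This is an illustration under stated assumptions: the helper names `partial_euclidean` and `impute_value` are hypothetical, missing values are encoded as NaN, and only rows that actually have a value in the target column are considered as neighbors.

```python
import math

def partial_euclidean(a, b):
    """Euclidean distance over the attributes present (not NaN) in both rows."""
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf  # assumption: no shared attributes means "infinitely far"
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def impute_value(rows, target, col, k=2):
    """Mean of column `col` over the k rows nearest to rows[target]."""
    # Candidate neighbors: every other row that has a value in the target column
    candidates = [
        (partial_euclidean(rows[target], row), row[col])
        for i, row in enumerate(rows)
        if i != target and not math.isnan(row[col])
    ]
    candidates.sort(key=lambda pair: pair[0])      # ascending distance
    nearest = [value for _, value in candidates[:k]]
    return sum(nearest) / len(nearest)

nan = math.nan
rows = [
    [100.0, 30.0, nan],   # row 1
    [90.0, 45.0, 40.0],   # row 2
    [nan, 56.0, 80.0],    # row 3
    [95.0, nan, 98.0],    # row 4
]

# Impute TotalDayMinutes (column 0) for row 3 (index 2) with k = 2
minutes_r3 = impute_value(rows, target=2, col=0, k=2)
print(minutes_r3)  # 97.5
```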

No  TotalDayMinutes  TotalDayCalls  TotalDayCharge
1   100.0            30.0           NaN
2   90.0             45.0           40.0
3   97.5             56.0           80.0
4   95.0             NaN            98.0

TotalDayCalls
$E(r_4, r_1) = \sqrt{(95 - 100)^2} = 5$

$E(r_4, r_2) = \sqrt{(95 - 90)^2 + (98 - 40)^2} \approx 58.22$

$E(r_4, r_3) = \sqrt{(95 - 97.5)^2 + (98 - 80)^2} \approx 18.17$

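The two nearest rows are r1 (distance 5) and r3 (distance 18.17); the distance to r3 uses the TotalDayMinutes value 97.5 imputed in the previous step.
Their TotalDayCalls values are 30 and 56.
The imputed value is their mean, 43.
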
No  TotalDayMinutes  TotalDayCalls  TotalDayCharge
1   100.0            30.0           NaN
2   90.0             45.0           40.0
3   97.5             56.0           80.0
4   95.0             43.0           98.0

TotalDayCharge
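$E(r_1, r_2) = \sqrt{(100 - 90)^2 + (30 - 45)^2} \approx 18.03$

$E(r_1, r_3) = \sqrt{(100 - 97.5)^2 + (30 - 56)^2} \approx 26.12$

$E(r_1, r_4) = \sqrt{(100 - 95)^2 + (30 - 43)^2} \approx 13.93$

The two nearest rows are r4 (distance 13.93) and r2 (distance 18.03), whose TotalDayCharge values are 98 and 40.
The imputed value (mean of neighbors) is 69.
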
No  TotalDayMinutes  TotalDayCalls  TotalDayCharge
1   100.0            30.0           69.0
2   90.0             45.0           40.0
3   97.5             56.0           80.0
4   95.0             43.0           98.0
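
The whole walkthrough can be reproduced with a short script. It is a sketch that mirrors the worked example rather than a general-purpose imputer: the function names are hypothetical, NaN encodes a missing value, and columns are filled one at a time in place, so later distances reuse earlier imputed values (as the example does with 97.5 and 43). Library implementations may differ in such details.

```python
import math

def partial_euclidean(a, b):
    """Euclidean distance over the attributes present (not NaN) in both rows."""
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf  # assumption: no shared attributes means "infinitely far"
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def impute_value(rows, target, col, k):
    """Mean of column `col` over the k rows nearest to rows[target]."""
    candidates = [
        (partial_euclidean(rows[target], row), row[col])
        for i, row in enumerate(rows)
        if i != target and not math.isnan(row[col])
    ]
    candidates.sort(key=lambda pair: pair[0])
    nearest = [value for _, value in candidates[:k]]
    return sum(nearest) / len(nearest)

def impute_sequentially(rows, k=2):
    """Fill missing values column by column, reusing values imputed earlier."""
    filled = [list(row) for row in rows]   # work on a copy of the table
    for col in range(len(filled[0])):
        for i, row in enumerate(filled):
            if math.isnan(row[col]):
                row[col] = impute_value(filled, i, col, k)
    return filled

nan = math.nan
rows = [
    [100.0, 30.0, nan],   # TotalDayMinutes, TotalDayCalls, TotalDayCharge
    [90.0, 45.0, 40.0],
    [nan, 56.0, 80.0],
    [95.0, nan, 98.0],
]

result = impute_sequentially(rows, k=2)
for number, row in enumerate(result, start=1):
    print(number, row)
```

With k = 2 this yields 97.5, 43.0, and 69.0 for the three missing cells, matching the final table above.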
