
Given the following data,

data = [4, 8, 13, 15, 21, 27, 32, 37, 45, 45, 46, 75, 85, 121, 155, 207, 254, 300,
1200]
Perform the following tasks:
1) clean the data (use the 3-standard-deviation method to detect and remove
outliers, if any);
2) apply data binning to the clean data. Use two different methods: equal width
(number of bins=6), and equal depth (height) (4 items in a bin).
1 Solution –
Step – 1:

Calculating the average (mean) =


(4+8+13+15+21+27+32+37+45+45+46+75+85+121+155+207+254+300+1200)/19 = 2690/19 = 141.58

Step – 2:
Calculating the squared deviation of each value from the mean =
Squared deviation of 1st value: (4 - 141.58)² = (-137.58)² = 18,928.26
Squared deviation of 2nd value: (8 - 141.58)² = (-133.58)² = 17,843.62
Squared deviation of 3rd value: (13 - 141.58)² = (-128.58)² = 16,532.82
Squared deviation of 4th value: (15 - 141.58)² = (-126.58)² = 16,022.50
Squared deviation of 5th value: (21 - 141.58)² = (-120.58)² = 14,539.54
Squared deviation of 6th value: (27 - 141.58)² = (-114.58)² = 13,128.58
Squared deviation of 7th value: (32 - 141.58)² = (-109.58)² = 12,007.78
Squared deviation of 8th value: (37 - 141.58)² = (-104.58)² = 10,936.98
Squared deviation of 9th value: (45 - 141.58)² = (-96.58)² = 9,327.70
Squared deviation of 10th value: (45 - 141.58)² = (-96.58)² = 9,327.70
Squared deviation of 11th value: (46 - 141.58)² = (-95.58)² = 9,135.54
Squared deviation of 12th value: (75 - 141.58)² = (-66.58)² = 4,432.90
Squared deviation of 13th value: (85 - 141.58)² = (-56.58)² = 3,201.30
Squared deviation of 14th value: (121 - 141.58)² = (-20.58)² = 423.54
Squared deviation of 15th value: (155 - 141.58)² = (13.42)² = 180.10
Squared deviation of 16th value: (207 - 141.58)² = (65.42)² = 4,279.78
Squared deviation of 17th value: (254 - 141.58)² = (112.42)² = 12,638.26
Squared deviation of 18th value: (300 - 141.58)² = (158.42)² = 25,096.90
Squared deviation of 19th value: (1200 - 141.58)² = (1058.42)² = 1,120,252.90

Step – 3:

Sum of all the squared deviations = 1,318,236.70
Variance (mean of the squared deviations, dividing by N = 19) = 1,318,236.70/19 = 69,380.88


S = √69,380.88
S = 263.40
Step – 4:
Calculate the 3-standard-deviation limits = Average ± (K * Standard Deviation), with K = 3
Upper limit = 141.58 + (3 * 263.40) = 931.78
Lower limit = 141.58 - (3 * 263.40) = -648.62
Only 1200 lies outside the range [-648.62, 931.78]. Therefore, the outlier is 1200, and it is removed. The clean data is:
[4, 8, 13, 15, 21, 27, 32, 37, 45, 45, 46, 75, 85, 121, 155, 207, 254, 300]
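
The four steps above can be checked with a short script. This is a minimal sketch, not part of the original solution: it works from the data exactly as listed in the problem statement, assumes the population standard deviation (dividing by N, as in Step 3), and the function name three_sigma_clean is a hypothetical helper chosen for illustration.

```python
import math

# Data as given in the problem statement.
data = [4, 8, 13, 15, 21, 27, 32, 37, 45, 45, 46, 75, 85,
        121, 155, 207, 254, 300, 1200]


def three_sigma_clean(values, k=3):
    """Split values into kept and removed using mean +/- k * std.

    Assumption: population standard deviation (divide by N, not N - 1).
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    std = math.sqrt(variance)
    lower, upper = mean - k * std, mean + k * std
    kept = [x for x in values if lower <= x <= upper]
    removed = [x for x in values if not (lower <= x <= upper)]
    return kept, removed, (mean, std, lower, upper)


clean, outliers, stats = three_sigma_clean(data)
print(outliers)  # [1200]
```

The script keeps full precision instead of rounding the mean to two decimals, so intermediate values can differ from a hand calculation in the last digit. The detected outlier is the same.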

2 Solution –

Equal Width Bins:

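Dividing the range of the clean data (1200 removed) into 6 intervals of equal width,


 4 and 300 are the lowest and highest values in the clean data.
 Width of each interval = (300 - 4)/6 = 296/6 = 49.33
 Bin 1, [4, 53.33): [4, 8, 13, 15, 21, 27, 32, 37, 45, 45, 46]
 Bin 2, [53.33, 102.67): [75, 85]
 Bin 3, [102.67, 152.00): [121]
 Bin 4, [152.00, 201.33): [155]
 Bin 5, [201.33, 250.67): [207]
 Bin 6, [250.67, 300]: [254, 300]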
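
Equal-width binning on the cleaned data (1200 removed) can be sketched as follows. This is an illustration under stated assumptions: intervals are half-open, [low, high), except the last one, which also includes the maximum value. The function name equal_width_bins is hypothetical.

```python
# Clean data: the original data with the outlier 1200 removed.
clean = [4, 8, 13, 15, 21, 27, 32, 37, 45, 45, 46, 75, 85,
         121, 155, 207, 254, 300]


def equal_width_bins(values, num_bins):
    """Assign each value to one of num_bins intervals of equal width.

    Assumption: intervals are half-open, and the maximum value is
    placed in the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    bins = [[] for _ in range(num_bins)]
    for x in sorted(values):
        if width == 0:
            idx = 0  # all values identical: everything goes in bin 1
        else:
            # Clamp so the maximum lands in the last bin.
            idx = min(int((x - lo) / width), num_bins - 1)
        bins[idx].append(x)
    return width, bins


width, bins = equal_width_bins(clean, 6)
print(round(width, 2))  # 49.33
for i, b in enumerate(bins, start=1):
    print(i, b)
```

Because the data is skewed toward small values, equal-width bins end up very uneven in size: most items fall into the first interval.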

Equal Depth Bins:

Sorting the clean data (1200 removed) and placing 4 items in each bin (the last bin holds the 2 remaining items, since 18 is not a multiple of 4),


 [4, 8, 13, 15], [21, 27, 32, 37], [45, 45, 46, 75], [85, 121, 155, 207], [254, 300]
