
3.3 Exercise 2.2 gave the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21,
22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

a) Use smoothing by bin means to smooth these data, using a bin depth of 3. Illustrate your steps. Comment
on the effect of this technique for the given data.
b) How might you determine outliers in the data?
c) What other methods are there for data smoothing?

Solution:

(a) Smoothing by bin means:

Step 1: Sort the data. (This step is not required here because the data are already sorted.)

Step 2: Partition the data into equidepth bins of depth 3.

Bin#    Values
Bin 1   13, 15, 16
Bin 2   16, 19, 20
Bin 3   20, 21, 22
Bin 4   22, 25, 25
Bin 5   25, 25, 30
Bin 6   33, 33, 35
Bin 7   35, 35, 35
Bin 8   36, 40, 45
Bin 9   46, 52, 70

Step 3: Calculate the arithmetic mean of each bin.

Bin#    Arithmetic Sum      Arithmetic Mean Values
Bin 1   13+15+16 = 44       44/3, 44/3, 44/3
Bin 2   16+19+20 = 55       55/3, 55/3, 55/3
Bin 3   20+21+22 = 63       63/3, 63/3, 63/3
Bin 4   22+25+25 = 72       72/3, 72/3, 72/3
Bin 5   25+25+30 = 80       80/3, 80/3, 80/3
Bin 6   33+33+35 = 101      101/3, 101/3, 101/3
Bin 7   35+35+35 = 105      105/3, 105/3, 105/3
Bin 8   36+40+45 = 121      121/3, 121/3, 121/3
Bin 9   46+52+70 = 168      168/3, 168/3, 168/3

Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.

Bin#    Smoothed Values
Bin 1   14.67, 14.67, 14.67
Bin 2   18.33, 18.33, 18.33
Bin 3   21, 21, 21
Bin 4   24, 24, 24
Bin 5   26.67, 26.67, 26.67
Bin 6   33.67, 33.67, 33.67
Bin 7   35, 35, 35
Bin 8   40.33, 40.33, 40.33
Bin 9   56, 56, 56

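This method smooths each sorted data value by consulting its “neighborhood”, that is, the other values in the same bin, so it performs local smoothing. For the given data, the extreme value 70 is pulled down to 56, while its bin neighbors 46 and 52 are pulled up to the same value.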
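The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original solution; the function name `smooth_by_bin_means` and its `depth` parameter are assumptions chosen for this example.

```python
from statistics import mean

# Age data from the exercise (already sorted).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def smooth_by_bin_means(values, depth):
    """Sort, split into equal-depth bins, and replace each value by its bin mean.

    Hypothetical helper for illustration; a final short bin (if any) is
    averaged over however many values it holds.
    """
    ordered = sorted(values)                      # Step 1
    smoothed = []
    for start in range(0, len(ordered), depth):   # Step 2
        bin_values = ordered[start:start + depth]
        bin_mean = mean(bin_values)               # Step 3
        smoothed.extend([bin_mean] * len(bin_values))  # Step 4
    return smoothed

smoothed_ages = smooth_by_bin_means(ages, depth=3)
print([round(v, 2) for v in smoothed_ages])
```

With 27 values and a depth of 3, the data split evenly into the nine bins shown in the tables above.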

Page 1 of 7
This study source was downloaded by 100000823284023 from CourseHero.com on 06-29-2021 23:06:06 GMT -05:00

https://www.coursehero.com/file/35924318/Assigment2-bodypdf/
(b) Determining outliers:

Outliers in data can be identified in several ways:

• By partitioning the data into an equal-width histogram and flagging values that fall in isolated, sparsely
populated bins.
• By clustering, where similar values are organized into groups, or “clusters”. Values that fall outside of the set
of clusters may be considered outliers.
• By combined computer and human inspection, where a predetermined data distribution lets the computer
flag possible outliers, which are then verified by a human with much less effort than checking the entire
initial data set.
• In general, by fitting a model to the data. Any data points that deviate significantly (based on some threshold)
from the model can be considered outliers.
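As one possible instance of the last approach, the sketch below treats the mean and standard deviation as the "model" and flags values whose distance from the mean exceeds a threshold. The function name `flag_outliers` and the threshold of 3 standard deviations are assumptions made for illustration; the original solution does not prescribe either.

```python
from statistics import mean, pstdev

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def flag_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean.

    Assumption: a simple mean/standard-deviation model; the threshold is
    a hypothetical choice, not taken from the exercise.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divide by n)
    return [v for v in values if abs(v - mu) / sigma > threshold]

outliers = flag_outliers(ages)
print(outliers)
```

With this threshold only the value 70 is flagged. Note that the mean and standard deviation are themselves influenced by extreme values, so a stricter or looser threshold may be appropriate for other data.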

(c) Other methods of data smoothing:

Other methods that can be used for data smoothing include:

• Alternate forms of binning, such as smoothing by bin medians or smoothing by bin boundaries.
• Equal-width bins, in which the interval range of values in each bin is constant, can be used with any of
these forms of binning.
• Regression techniques can smooth the data by fitting it to a function, for example through linear or
multiple regression.
• Classification techniques can be used to implement concept hierarchies, which smooth the data by rolling
up lower-level concepts to higher-level concepts.
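The two alternate binning forms from the first bullet can be sketched as follows. The helper names are hypothetical, and the tie-breaking rule for bin boundaries (a value equidistant from both boundaries goes to the lower one) is an assumption of this sketch.

```python
from statistics import median

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def equal_depth_bins(values, depth):
    """Split the sorted values into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_bin_medians(values, depth):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in equal_depth_bins(values, depth)]

def smooth_by_bin_boundaries(values, depth):
    """Replace every value by the closer of its bin's minimum and maximum.

    Assumption: ties are resolved toward the lower boundary.
    """
    result = []
    for b in equal_depth_bins(values, depth):
        low, high = b[0], b[-1]
        result.append([low if v - low <= high - v else high for v in b])
    return result

print(smooth_by_bin_medians(ages, 3)[-1])
print(smooth_by_bin_boundaries(ages, 3)[-1])
```

For the last bin (46, 52, 70), the median form yields 52 for all three values, while the boundary form keeps 46 and 70 and moves 52 to 46.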

3.7 Using the data for age given in Exercise 3.3, answer the following:
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0,1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is
12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving reasons as to why.

Solution:
(a) Min-max normalization:

Min-max normalization onto the range $[\mathit{new\_min}_A, \mathit{new\_max}_A]$ is computed by

\[
v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A
\]

For the age data, $\mathit{min}_A = 13$, $\mathit{max}_A = 70$, $\mathit{new\_min}_A = 0$ and $\mathit{new\_max}_A = 1.0$, so $v = 35$ is transformed to

\[
v' = \frac{35 - 13}{70 - 13}\,(1 - 0) + 0 = \frac{22}{57} = 0.385965 \approx 0.39
\]
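The same calculation can be checked with a small Python function. The name `min_max_normalize` and its parameter names are assumptions made for this sketch.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max] (hypothetical helper)."""
    if max_a == min_a:
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

normalized = min_max_normalize(35, min_a=13, max_a=70)
print(round(normalized, 2))  # 0.39
```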

(b) Z-score Normalization:

Z-score normalization for an attribute with mean $\mu_A$ and standard deviation $\sigma_A$ is computed by

\[
v' = \frac{v - \mu_A}{\sigma_A}
\]

For the age data, $\mu_A = \frac{809}{27} = 29.96$ and $\sigma_A = 12.94$, so $v = 35$ is transformed to

\[
v' = \frac{35 - 29.96}{12.94} = \frac{5.04}{12.94} = 0.38949 \approx 0.39
\]
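A short Python check of this step is given below. It assumes that the 12.94 stated in the exercise is the sample standard deviation (dividing by n − 1) of the 27 age values, which is what `statistics.stdev` computes; the variable names are illustrative only.

```python
from statistics import mean, stdev

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mu = mean(ages)      # 809 / 27
sigma = stdev(ages)  # sample standard deviation (assumed to match the given 12.94)
z = (35 - mu) / sigma

print(round(mu, 2), round(sigma, 2), round(z, 2))
```

Because the code keeps the unrounded mean and standard deviation, later digits of `z` can differ slightly from the hand calculation, which uses the rounded values 29.96 and 12.94.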

(c) Normalization by decimal scaling:

Normalization by decimal scaling is computed by

\[
v' = \frac{v}{10^j}
\]

where $j$ is the smallest integer such that $\max(|v'|) < 1$.

For the age data the largest absolute value is 70, so $j = 2$ and $v = 35$ is transformed to

\[
v' = \frac{35}{10^2} = \frac{35}{100} = 0.35
\]
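The choice of j can be sketched in Python as follows. The function name `decimal_scaling_exponent` is hypothetical, and the search starts at j = 0 on the assumption that the data contain at least one value with absolute value of 1 or more.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j with max(|v|) / 10**j < 1 (illustrative helper)."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

j = decimal_scaling_exponent(ages)
scaled = 35 / 10 ** j
print(j, scaled)  # 2 0.35
```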

(d) Comment on preferred method:


Given the data,

• One may prefer decimal scaling for normalization, since this transformation maintains the data distribution
while still allowing mining on specific age groups.

• Min-max normalization has the undesired effect of not permitting any future values to fall outside the current
minimum and maximum values without encountering an “out of bounds” error. As it is probable that such
values may be present in future data, this method is less appropriate.

• Z-score normalization transforms values into measures that represent their distance from the mean, in terms
of standard deviations. It is probable that this type of transformation would not increase the information value
of the attribute in terms of intuitiveness to users or in usefulness of mining results.

3.8 Using the data for age and body fat given in Exercise 2.4, answer the following:
a) Normalize the two attributes based on z-score normalization.
b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two attributes
positively or negatively correlated? Compute their covariance.

Solution:

(a) z-score normalization:

For age, where n = 18 and x = age,

\[
\mu_{age} = \frac{1}{18}\sum_{i=1}^{18} age_i
\]
\[
= \frac{1}{18}(23 + 23 + 27 + 27 + 39 + 41 + 47 + 49 + 50 + 52 + 54 + 54 + 56 + 57 + 58 + 58 + 60 + 61)
= \frac{836}{18} = 46.44
\]

\[
\sigma_{age} = \sqrt{\frac{1}{18}\sum_{i=1}^{18}(age_i - 46.44)^2}
\]

where the sum of squared deviations is

\[
\begin{aligned}
\sum_{i=1}^{18}(age_i - 46.44)^2 ={}& (23 - 46.44)^2 + (23 - 46.44)^2 + (27 - 46.44)^2 + (27 - 46.44)^2 + (39 - 46.44)^2 + (41 - 46.44)^2 \\
&+ (47 - 46.44)^2 + (49 - 46.44)^2 + (50 - 46.44)^2 + (52 - 46.44)^2 + (54 - 46.44)^2 + (54 - 46.44)^2 \\
&+ (56 - 46.44)^2 + (57 - 46.44)^2 + (58 - 46.44)^2 + (58 - 46.44)^2 + (60 - 46.44)^2 + (61 - 46.44)^2 \\
={}& 549.43 + 549.43 + 377.91 + 377.91 + 55.35 + 29.59 + 0.31 + 6.55 + 12.67 \\
&+ 30.91 + 57.15 + 57.15 + 91.39 + 111.51 + 133.63 + 133.63 + 183.87 + 211.99 \\
={}& 2970.38
\end{aligned}
\]

so that

\[
\sigma_{age} = \sqrt{2970.38/18} = \sqrt{165.02} = 12.85
\]

For body fat, where n = 18 and x = body_fat,

\[
\mu_{body\_fat} = \frac{1}{18}\sum_{i=1}^{18} body\_fat_i
\]
\[
= \frac{1}{18}(9.5 + 26.5 + 7.8 + 17.8 + 31.4 + 25.9 + 27.4 + 27.2 + 31.2 + 34.6 + 42.5 + 28.8 + 33.4 + 30.2 + 34.1 + 32.9 + 41.2 + 35.7)
= \frac{518.1}{18} = 28.78
\]

\[
\sigma_{body\_fat} = \sqrt{\frac{1}{18}\sum_{i=1}^{18}(body\_fat_i - 28.78)^2}
\]

where the sum of squared deviations (terms shown rounded to two decimals, total taken before rounding) is

\[
\begin{aligned}
\sum_{i=1}^{18}(body\_fat_i - 28.78)^2 ={}& (9.5 - 28.78)^2 + (26.5 - 28.78)^2 + (7.8 - 28.78)^2 + (17.8 - 28.78)^2 + (31.4 - 28.78)^2 + (25.9 - 28.78)^2 \\
&+ (27.4 - 28.78)^2 + (27.2 - 28.78)^2 + (31.2 - 28.78)^2 + (34.6 - 28.78)^2 + (42.5 - 28.78)^2 + (28.8 - 28.78)^2 \\
&+ (33.4 - 28.78)^2 + (30.2 - 28.78)^2 + (34.1 - 28.78)^2 + (32.9 - 28.78)^2 + (41.2 - 28.78)^2 + (35.7 - 28.78)^2 \\
={}& 371.72 + 5.20 + 440.16 + 120.56 + 6.86 + 8.29 + 1.90 + 2.50 + 5.86 + 33.87 \\
&+ 188.24 + 0.00 + 21.34 + 2.02 + 28.30 + 16.97 + 154.26 + 47.89 \\
\approx{}& 1455.95
\end{aligned}
\]

so that

\[
\sigma_{body\_fat} = \sqrt{1455.95/18} = \sqrt{80.89} = 8.99
\]

So, the z-score normalization for age, using $v' = (v - \mu_{age})/\sigma_{age}$ with $\mu_{age} = 46.44$ and $\sigma_{age} = 12.85$:

\[
\begin{aligned}
v = 23:&\quad v' = \frac{23 - 46.44}{12.85} = \frac{-23.44}{12.85} = -1.82412451 \approx -1.82 \\
v = 27:&\quad v' = \frac{27 - 46.44}{12.85} = \frac{-19.44}{12.85} = -1.51284047 \approx -1.51 \\
v = 39:&\quad v' = \frac{39 - 46.44}{12.85} = \frac{-7.44}{12.85} = -0.57898833 \approx -0.58 \\
v = 41:&\quad v' = \frac{41 - 46.44}{12.85} = \frac{-5.44}{12.85} = -0.42334630 \approx -0.42 \\
v = 47:&\quad v' = \frac{47 - 46.44}{12.85} = \frac{0.56}{12.85} = 0.04357977 \approx 0.04 \\
v = 49:&\quad v' = \frac{49 - 46.44}{12.85} = \frac{2.56}{12.85} = 0.19922179 \approx 0.20 \\
v = 50:&\quad v' = \frac{50 - 46.44}{12.85} = \frac{3.56}{12.85} = 0.27704280 \approx 0.28 \\
v = 52:&\quad v' = \frac{52 - 46.44}{12.85} = \frac{5.56}{12.85} = 0.43268483 \approx 0.43 \\
v = 54:&\quad v' = \frac{54 - 46.44}{12.85} = \frac{7.56}{12.85} = 0.58832685 \approx 0.59 \\
v = 56:&\quad v' = \frac{56 - 46.44}{12.85} = \frac{9.56}{12.85} = 0.74396887 \approx 0.74 \\
v = 57:&\quad v' = \frac{57 - 46.44}{12.85} = \frac{10.56}{12.85} = 0.82178988 \approx 0.82 \\
v = 58:&\quad v' = \frac{58 - 46.44}{12.85} = \frac{11.56}{12.85} = 0.89961090 \approx 0.90 \\
v = 60:&\quad v' = \frac{60 - 46.44}{12.85} = \frac{13.56}{12.85} = 1.05525292 \approx 1.06 \\
v = 61:&\quad v' = \frac{61 - 46.44}{12.85} = \frac{14.56}{12.85} = 1.13307392 \approx 1.13
\end{aligned}
\]

So, the z-score normalization for body_fat, using $v' = (v - \mu_{body\_fat})/\sigma_{body\_fat}$ with $\mu_{body\_fat} = 28.78$ and $\sigma_{body\_fat} = 8.99$:

\[
\begin{aligned}
v = 9.5:&\quad v' = \frac{9.5 - 28.78}{8.99} = \frac{-19.28}{8.99} = -2.14460512 \approx -2.14 \\
v = 26.5:&\quad v' = \frac{26.5 - 28.78}{8.99} = \frac{-2.28}{8.99} = -0.25361513 \approx -0.25 \\
v = 7.8:&\quad v' = \frac{7.8 - 28.78}{8.99} = \frac{-20.98}{8.99} = -2.33370412 \approx -2.33 \\
v = 17.8:&\quad v' = \frac{17.8 - 28.78}{8.99} = \frac{-10.98}{8.99} = -1.22135706 \approx -1.22 \\
v = 31.4:&\quad v' = \frac{31.4 - 28.78}{8.99} = \frac{2.62}{8.99} = 0.29143493 \approx 0.29 \\
v = 25.9:&\quad v' = \frac{25.9 - 28.78}{8.99} = \frac{-2.88}{8.99} = -0.32035595 \approx -0.32 \\
v = 27.4:&\quad v' = \frac{27.4 - 28.78}{8.99} = \frac{-1.38}{8.99} = -0.15350389 \approx -0.15 \\
v = 27.2:&\quad v' = \frac{27.2 - 28.78}{8.99} = \frac{-1.58}{8.99} = -0.17575083 \approx -0.18 \\
v = 31.2:&\quad v' = \frac{31.2 - 28.78}{8.99} = \frac{2.42}{8.99} = 0.26918799 \approx 0.27 \\
v = 34.6:&\quad v' = \frac{34.6 - 28.78}{8.99} = \frac{5.82}{8.99} = 0.64738598 \approx 0.65 \\
v = 42.5:&\quad v' = \frac{42.5 - 28.78}{8.99} = \frac{13.72}{8.99} = 1.52614017 \approx 1.53 \\
v = 28.8:&\quad v' = \frac{28.8 - 28.78}{8.99} = \frac{0.02}{8.99} = 0.00222469 \approx 0.00 \\
v = 33.4:&\quad v' = \frac{33.4 - 28.78}{8.99} = \frac{4.62}{8.99} = 0.51390434 \approx 0.51 \\
v = 30.2:&\quad v' = \frac{30.2 - 28.78}{8.99} = \frac{1.42}{8.99} = 0.15795328 \approx 0.16 \\
v = 34.1:&\quad v' = \frac{34.1 - 28.78}{8.99} = \frac{5.32}{8.99} = 0.59176863 \approx 0.59 \\
v = 32.9:&\quad v' = \frac{32.9 - 28.78}{8.99} = \frac{4.12}{8.99} = 0.45828699 \approx 0.46 \\
v = 41.2:&\quad v' = \frac{41.2 - 28.78}{8.99} = \frac{12.42}{8.99} = 1.38153504 \approx 1.38 \\
v = 35.7:&\quad v' = \frac{35.7 - 28.78}{8.99} = \frac{6.92}{8.99} = 0.76974416 \approx 0.77
\end{aligned}
\]

Finally,

age            23     23     27     27     39     41     47     49     50
z-age       -1.82  -1.82  -1.51  -1.51  -0.58  -0.42   0.04   0.20   0.28
body_fat      9.5   26.5    7.8   17.8   31.4   25.9   27.4   27.2   31.2
z-body_fat  -2.14  -0.25  -2.33  -1.22   0.29  -0.32  -0.15  -0.18   0.27

age            52     54     54     56     57     58     58     60     61
z-age        0.43   0.59   0.59   0.74   0.82   0.90   0.90   1.06   1.13
body_fat     34.6   42.5   28.8   33.4   30.2   34.1   32.9   41.2   35.7
z-body_fat   0.65   1.53   0.00   0.51   0.16   0.59   0.46   1.38   0.77
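The whole table can be reproduced with a short Python sketch. The function name `z_scores` is an assumption of this example; it uses the population standard deviation (dividing by n), as the hand calculation does.

```python
from statistics import mean, pstdev

ages = [23, 23, 27, 27, 39, 41, 47, 49, 50,
        52, 54, 54, 56, 57, 58, 58, 60, 61]
body_fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
            34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def z_scores(values):
    """Return (v - mean) / population standard deviation for every v (illustrative helper)."""
    mu = mean(values)
    sigma = pstdev(values)  # divides by n, matching the hand calculation
    return [(v - mu) / sigma for v in values]

z_age = z_scores(ages)
z_fat = z_scores(body_fat)

for a, za, f, zf in zip(ages, z_age, body_fat, z_fat):
    print(f"{a:>3} {za:6.2f} {f:>5} {zf:6.2f}")
```

Because the code keeps the unrounded means and standard deviations, a few entries can differ in the last digit from the table, which is built from the rounded values 46.44, 12.85, 28.78 and 8.99.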

(b) Correlation coefficient (Pearson’s product moment coefficient):

The correlation coefficient (also called Pearson’s product moment coefficient) is defined by

\[
r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \mu_A)(b_i - \mu_B)}{n\,\sigma_A \sigma_B}
= \frac{\sum_{i=1}^{n} a_i b_i - n\,\mu_A \mu_B}{n\,\sigma_A \sigma_B}
\]

where n is the number of tuples, $\mu_A$ and $\mu_B$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard
deviations of A and B, and $\sum a_i b_i$ is the sum of the A and B cross-product.

For these data, $\mu_A = 46.44$, $\mu_B = 28.78$, $\sigma_A = 12.85$ and $\sigma_B = 8.99$. Then,

\[
r_{A,B} = \frac{\sum_{i=1}^{18}(a_i - 46.44)(b_i - 28.78)}{18 \times 12.85 \times 8.99}
\]

The denominator is $18 \times 12.85 \times 8.99 = 2079.387$, and the numerator is

\[
\begin{aligned}
\sum_{i=1}^{18}(a_i - 46.44)(b_i - 28.78) ={}& (23 - 46.44)(9.5 - 28.78) + (23 - 46.44)(26.5 - 28.78) + (27 - 46.44)(7.8 - 28.78) \\
&+ (27 - 46.44)(17.8 - 28.78) + (39 - 46.44)(31.4 - 28.78) + (41 - 46.44)(25.9 - 28.78) \\
&+ (47 - 46.44)(27.4 - 28.78) + (49 - 46.44)(27.2 - 28.78) + (50 - 46.44)(31.2 - 28.78) \\
&+ (52 - 46.44)(34.6 - 28.78) + (54 - 46.44)(42.5 - 28.78) + (54 - 46.44)(28.8 - 28.78) \\
&+ (56 - 46.44)(33.4 - 28.78) + (57 - 46.44)(30.2 - 28.78) + (58 - 46.44)(34.1 - 28.78) \\
&+ (58 - 46.44)(32.9 - 28.78) + (60 - 46.44)(41.2 - 28.78) + (61 - 46.44)(35.7 - 28.78) \\
={}& 451.9232 + 53.4432 + 407.8512 + 213.4512 - 19.4928 + 15.6672 - 0.7728 - 4.0448 + 8.6152 \\
&+ 32.3592 + 103.7232 + 0.1512 + 44.1672 + 14.9952 + 61.4992 + 47.6272 + 168.4152 + 100.7552 \\
={}& 1700.3336
\end{aligned}
\]

so that

\[
r_{A,B} = \frac{1700.3336}{2079.387} = 0.81770733 \approx 0.82
\]

So, the correlation coefficient is 0.82 and the two attributes are positively correlated. Their covariance is the same sum of cross-products divided by n: 1700.3336/18 ≈ 94.46.
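The correlation and covariance can be checked with the Python sketch below. The function name `pearson_and_covariance` is hypothetical; it divides by n throughout, matching the population standard deviations used in the hand calculation.

```python
from math import sqrt

ages = [23, 23, 27, 27, 39, 41, 47, 49, 50,
        52, 54, 54, 56, 57, 58, 58, 60, 61]
body_fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
            34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def pearson_and_covariance(a, b):
    """Return (r, covariance) using population formulas (divide by n). Illustrative helper."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cross / (n * sd_a * sd_b), cross / n

r, cov = pearson_and_covariance(ages, body_fat)
print(round(r, 2), round(cov, 2))
```

The code keeps unrounded means and standard deviations, so digits beyond the second decimal can differ slightly from the hand calculation.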
