IS328 Data Mining-Tutorial Lab Session 2 - Solution - Updated

IS328 Data Mining Semester
Week 3: Data Types and Data Pre-processing

Tutorial/Lab Session - 2 - Solution
Part A – MCQs
Q1 Which of the following is not part of the KDD process?

A Data cleaning
B Data Mining
C Data Integration
D Data Encryption
E Data Transformation
Q2 Which of the following is a measurement of data quality?

A Accuracy
B Completeness
C Timeliness
D Reliability
E All of the above
Q3 Which of the following is not a data type?

A Nominal
B Binary
C Discrete
D Random
E Ordinal
Q4 In KDD and data mining, noisy data is referred to as ________.

A repeated data.
B complex data.
C meta data.
D random errors in database.
Q5 __________________ refers to the process of deriving high-quality information from text.

A Text Mining.
B Image Mining.
C Database Mining.
D Multimedia Mining.
Q6 Consider discretizing a continuous attribute whose values are listed below:
3, 4, 5, 10, 12, 21, 32, 43, 44, 46, 52, 59, 63
Using equal-width partitioning and five bins, how many values are there in the first bin?
A. 1
B. 2
C. 3
D. 4
E. 5
Working: W = (max - min) / number of bins = (63 - 3) / 5 = 12
Bin 1 (3 to 15): 3, 4, 5, 10, 12
Bin 2 (16 to 27): 21
Bin 3 …
The first bin therefore holds 5 values.
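The equal-width working above can be checked with a short script. This is a minimal sketch, not part of the original solution: the function name `equal_width_bins` is hypothetical, and it assumes half-open intervals with the maximum value placed in the last bin, which gives the same bin membership as the integer ranges used in the working.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # W = (max - min) / number of bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Assumed convention: bin k covers [lo + k*W, lo + (k+1)*W);
        # the maximum value is clamped into the last bin.
        k = min(int((v - lo) / width), n_bins - 1)
        bins[k].append(v)
    return bins


q6 = [3, 4, 5, 10, 12, 21, 32, 43, 44, 46, 52, 59, 63]
q6_bins = equal_width_bins(q6, 5)
print(q6_bins[0])       # contents of the first bin
print(len(q6_bins[0]))  # number of values in the first bin
```

The same function can be applied to the test scores in Q9 with `n_bins=3` to count the items in the second bin.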
Q7 Which of the following are known as qualitative data types?
(i) nominal
(ii) interval
(iii) ordinal
(iv) discrete
(v) continuous
A. (i) and (ii).
B. (ii) and (iii)
C. (i) and (iii).
D. (iii) and (iv)
E. (iv) and (v)
Q8 Consider discretizing a continuous attribute whose values are listed below:

3, 4, 5, 10, 20, 32, 43, 44, 46, 52, 59, 61
Which of the following numbers of bins is not possible when using equi-depth bins?
A. 2
B. 3
C. 4
D. 5
E. 6
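One way to reason about this question: with strict equi-depth binning every bin holds the same number of values, so the number of bins has to divide the number of values (12 here). The sketch below illustrates that check; the function name `equal_depth_bins` is hypothetical, and the strict "exactly equal" requirement is an assumption (some texts allow approximately equal bin sizes).

```python
def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins of exactly equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        # Assumed strict rule: every bin must hold the same number of values.
        raise ValueError(
            f"{len(ordered)} values cannot be split into {n_bins} equal bins"
        )
    depth = len(ordered) // n_bins
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


q8 = [3, 4, 5, 10, 20, 32, 43, 44, 46, 52, 59, 61]
feasible = []
for k in (2, 3, 4, 5, 6):
    try:
        equal_depth_bins(q8, k)
        feasible.append(k)
    except ValueError:
        pass
print(feasible)  # bin counts that divide the 12 values evenly
```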
Q9 Suppose a group of 12 students with the test scores listed as follows:
19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 100, 95
By partitioning them into three bins using the equi-width method and smoothing by bin boundaries,
how many data items will be in the second bin?
A 1
B 2
C 3
D 4
E 5
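Working (sorted): 19, 35, 48, 63, 69, 71, 72, 81, 85, 88, 95, 100
W = (max - min) / 3 = (100 - 19) / 3 = 27
Bin 1 (19 to 46): [19, 35]
Bin 2 (47 to 73): [48, 63, 69, 71, 72]
Bin 3 (74 to 100): [81, 85, 88, 95, 100]
Smoothing by bin boundaries changes the values but not the bin membership, so the second bin holds 5 items.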
Q10 This step of the KDD process model deals with noisy data.
A Data Integration
B Data pre-processing
C Data transformation
D Data mining
E Data Interpretation
Q11 Which of the following is not part of pre-processing?

A Data cleaning
B Data integration
C Data transformation
D Data reduction
E Data visualisation
Q12 Income data for a group of 12 people have the following properties: mean = $33000 and sd = $11000.
Using z-score normalization, an income of $73600 is transformed to:
A 2.24
B 2.79
C 3.69
D 4.25
E 5.36
Working: Z-norm = (value – mean)/SD = (73600 – 33000)/ 11000 = 3.69
Q13 Income data for a group of 12 people have the following properties: min = $55000 and max = $150000.
Using min-max normalization to map to the range (0 - 1), an income of $73600 is transformed to:
A 0.543
B 0.434
C 0.365
D 0.196
E 0.098
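Working:
v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
v' = ((73600 - 55000) / (150000 - 55000)) × (1 - 0) + 0 = 18600 / 95000 = 0.196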
Q14 Normalisation using decimal scaling is done using the formula

$$v' = \frac{v}{10^{j}}$$

where j is the smallest integer such that max(|v'|) < 1.

Assume the data range is -722 to 889. What will be the value of j?
A 1
B 2
C 3
D 4
E 5
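The search for j can be sketched as a small loop that keeps increasing j until the largest absolute value, divided by 10^j, falls below 1. The function name `decimal_scaling_exponent` is hypothetical, and starting the search at j = 0 is an assumption made for this illustration.

```python
def decimal_scaling_exponent(values):
    """Return the smallest j >= 0 such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0  # assumption: the search starts at 0 (no negative exponents)
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


q14_range = [-722, 889]
j = decimal_scaling_exponent(q14_range)
print(j)                                 # exponent for the range -722 to 889
print([v / 10 ** j for v in q14_range])  # scaled end points
```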
Part B: Descriptive / Problem Solving

Q1) Give an example of each data type below:
Nominal
Ordinal
Binary
Discrete
Continuous
Nominal: hair colour -> {black, brown, grey, red}
Ordinal: size -> {small, medium, large}
Binary: {good, bad}
Discrete: {28, 30, 31}
Continuous: {1.24, 32.4, 4.34, 5.67}
Q2) Use 5 bins to partition the following group of data based on Equal-Depth. Smooth the data within each bin using bin boundaries.
33, 35, 16, 21, 36, 73, 45, 22, 22, 20, 25, 25, 30, 13, 33, 16, 19, 25, 20, 35, 15, 35, 46, 52, 40.
Sorted:
13 15 16 16 19 20 20 21 22 22 25 25 25 30 33 33 35 35 35 36 40 45 46 52 73
Equal-depth bins (25 values / 5 bins = 5 values per bin):
Bin1: 13, 15, 16, 16, 19
Bin2: 20, 20, 21, 22, 22
Bin3: 25, 25, 25, 30, 33
Bin4: 33, 35, 35, 35, 36
Bin5: 40, 45, 46, 52, 73
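Smooth by Bin Boundaries
Each value is replaced by the closer of its bin's minimum and maximum. In this solution a value exactly halfway between the two (for example 16 in Bin1 or 21 in Bin2) is assigned to the lower boundary.
Bin1: 13, 13, 13, 13, 19
Bin2: 20, 20, 20, 22, 22
Bin3: 25, 25, 25, 33, 33
Bin4: 33, 36, 36, 36, 36
Bin5: 40, 40, 40, 40, 73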
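The equal-depth partitioning and boundary smoothing above can be reproduced with the following sketch. The helper names are hypothetical, and sending ties to the lower boundary is an assumption chosen to match the worked answer.

```python
def equal_depth_bins(values, n_bins):
    """Sort the values and cut them into n_bins bins of equal size."""
    ordered = sorted(values)
    depth = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_boundaries(bin_values):
    """Replace each value by the nearer of the bin minimum and maximum."""
    lo, hi = min(bin_values), max(bin_values)
    # Assumed tie rule: a value equally far from both goes to the lower boundary.
    return [lo if v - lo <= hi - v else hi for v in bin_values]


data = [33, 35, 16, 21, 36, 73, 45, 22, 22, 20, 25, 25, 30,
        13, 33, 16, 19, 25, 20, 35, 15, 35, 46, 52, 40]
bins = equal_depth_bins(data, 5)
smoothed = [smooth_by_boundaries(b) for b in bins]
for number, values in enumerate(smoothed, start=1):
    print(f"Bin{number}: {values}")
```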
Q3) Consider the following group of numbers:
200, 300, 400, 600, 750, 900, 1000, 1200
Normalize the above numbers using the following methods. Show your calculations.
(i) Min-Max Normalization (0 – 1)
(ii) Decimal Scaling
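Working for (i)
min = 200, max = 1200, new_min = 0, new_max = 1
v' = ((v - min) / (max - min)) × (new_max - new_min) + new_min
V = 200
v' = ((200 - 200) / (1200 - 200)) × (1 - 0) + 0 = 0
V = 300
v' = (300 - 200) / 1000 = 0.1
V = 400
v' = (400 - 200) / 1000 = 0.2
V = 600
v' = (600 - 200) / 1000 = 0.4
V = 750
v' = (750 - 200) / 1000 = 0.55
V = 900
v' = (900 - 200) / 1000 = 0.7
V = 1000
v' = (1000 - 200) / 1000 = 0.8
V = 1200
v' = (1200 - 200) / 1000 = 1.0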
(i) Min-Max Normalisation (0 – 1)
{0, 0.1, 0.2, 0.4, 0.55, 0.7, 0.8, 1.0}
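The min-max result above can be verified with a few lines of code. This is an illustrative sketch: the function name `min_max` and its default arguments are hypothetical, and the target range is passed as parameters so that the same function also covers a (1 - 10) scale.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    return [
        (v - lo) / (hi - lo) * (new_max - new_min) + new_min
        for v in values
    ]


q3 = [200, 300, 400, 600, 750, 900, 1000, 1200]
q3_scaled = [round(v, 2) for v in min_max(q3)]
print(q3_scaled)
```

Calling `min_max(column, new_min=1, new_max=10)` on a single column, such as the blood pressure readings in Q5, gives the (1 - 10) scaling used there.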
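Working for (ii)
$$v' = \frac{v}{10^{j}}$$
Step 1: Find the value of j. The largest absolute value is 1200; 1200 / 10^3 = 1.2 is not below 1, but 1200 / 10^4 = 0.12 < 1, so j = 4.
Step 2: Divide each value in V = {200, 300, 400, 600, 750, 900, 1000, 1200} by 10^4 = 10000.
(ii) Decimal Scaling: the normalised values are:
{0.02, 0.03, 0.04, 0.06, 0.075, 0.09, 0.1, 0.12}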

Q4) Consider the following data vector denoted Y:
Y= {35 36 46 68 70}
(a) Calculate the mean and standard deviation for the above data. Use the following formula for the standard deviation:
$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2}$$
Mean = (35 + 36 + 46 + 68 + 70) / 5 = 51
(35-51)^2 + (36-51)^2 + (46-51)^2 + (68-51)^2 + (70-51)^2 = 256 + 225 + 25 + 289 + 361 = 1156
StdDev = SQRT(1156 / 4) = SQRT(289) = 17
(b) Normalise the data Y using the Z-Score normalization method.
Formula: Z-score normalization = (value - mean) / StdDev
NY1 = (35-51) / 17 = -0.941
NY2 = (36-51) / 17 = -0.882
NY3 = (46-51) / 17 = -0.294
NY4 = (68-51) / 17 = 1
NY5 = (70-51) / 17 = 1.118

Therefore, the normalized vector is Normalized Y = {-0.941, -0.882, -0.294, 1, 1.118}
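As a cross-check of the calculation, the sketch below computes the mean, the sample standard deviation (dividing by N - 1, as in the formula given in the question) and the z-scores. The function names are hypothetical and only the Python standard library is used.

```python
import math


def sample_std(values):
    """Sample standard deviation: sqrt of squared deviations over N - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def z_scores(values):
    """Z-score normalisation: (value - mean) / standard deviation."""
    mean = sum(values) / len(values)
    sd = sample_std(values)
    return [(x - mean) / sd for x in values]


y = [35, 36, 46, 68, 70]
y_mean = sum(y) / len(y)
y_sd = sample_std(y)
y_norm = [round(z, 3) for z in z_scores(y)]
print(y_mean, y_sd)
print(y_norm)
```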
Q5) Using the Min-Max method, normalize the following data to the scale (1 - 10).
Show your calculations. Use the formula
v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
new_min = 1
new_max = 10
Each cell shows the original value followed by its normalised value (original -> normalised).

Name      Blood Sugar Reading   Body Mass Index (BMI)   Blood Pressure Measurement
Jacki     5.5  -> 1             22 -> 4.6               125 -> 4.86
David     5.8  -> 4.6           28 -> 10                145 -> 10
Jessica   6.0  -> 7             18 -> 1                 110 -> 1
Mary      6.25 -> 10            20 -> 2.8               135 -> 7.43
Rahini    5.9  -> 5.8           21 -> 3.7               120 -> 3.57
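Blood Sugar Reading
minA = 5.5, maxA = 6.25, new_maxA = 10, new_minA = 1
[Note the use of brackets]
V = 5.5 (Jacki)
v' = ((5.5 - 5.5) / (6.25 - 5.5) × 9) + 1 = 1
V = 5.8 (David)
v' = ((5.8 - 5.5) / (6.25 - 5.5) × 9) + 1 = 4.6
V = 6.0 (Jessica)
v' = ((6.0 - 5.5) / (6.25 - 5.5) × 9) + 1 = 7
V = 6.25 (Mary)
v' = ((6.25 - 5.5) / (6.25 - 5.5) × 9) + 1 = 10
V = 5.9 (Rahini)
v' = ((5.9 - 5.5) / (6.25 - 5.5) × 9) + 1 = 5.8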
Body Mass Index (BMI)
minA = 18, maxA = 28, new_maxA = 10, new_minA = 1
[Note the use of brackets]
V = 22 (Jacki)
v' = ((22 - 18) / (28 - 18) × 9) + 1 = 4.6
V = 28 (David)
v' = ((28 - 18) / (28 - 18) × 9) + 1 = 10
V = 18 (Jessica)
v' = ((18 - 18) / (28 - 18) × 9) + 1 = 1
V = 20 (Mary)
v' = ((20 - 18) / (28 - 18) × 9) + 1 = 2.8
V = 21 (Rahini)
v' = ((21 - 18) / (28 - 18) × 9) + 1 = 3.7
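Blood Pressure Measurement
minA = 110, maxA = 145, new_maxA = 10, new_minA = 1
V = 125 (Jacki)
v' = ((125 - 110) / (145 - 110) × 9) + 1 = 4.86
V = 145 (David)
v' = ((145 - 110) / (145 - 110) × 9) + 1 = 10
V = 110 (Jessica)
v' = ((110 - 110) / (145 - 110) × 9) + 1 = 1
V = 135 (Mary)
v' = ((135 - 110) / (145 - 110) × 9) + 1 = 7.43
V = 120 (Rahini)
v' = ((120 - 110) / (145 - 110) × 9) + 1 = 3.57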
Min-Max Normalisation (1-10)

Name      Blood Sugar Reading   Body Mass Index (BMI)   Blood Pressure Measurement
Jacki     1                     4.6                     4.86
David     4.6                   10                      10
Jessica   7                     1                       1
Mary      10                    2.8                     7.43
Rahini    5.8                   3.7                     3.57
Part C: WEKA for Data Mining
Objective: To use simple filters and visualize data
Step 1: Use filter to remove an attribute
1.1 Open the Weather.nominal dataset and click on the Choose button to choose a filter.
There are a lot of different filters. AllFilter and MultiFilter are ways of combining filters. There are
supervised and unsupervised filters. Supervised filters use the class value for their
operation. They are less common than unsupervised filters, which do not use the class value. There are
also attribute filters and instance filters. We want to remove an attribute, so we are looking for an attribute
filter. There are so many filters in Weka that you simply have to look around to find the one you
want.
1.2 Expand the attribute filters and choose Remove.
1.3 Click on the text box of the Choose field, set attributeIndices to 3 and select OK.
1.4 Click on the Apply button to see the effect (attribute 3, Humidity, is removed from the data set).
Note: The same result can be achieved by selecting the attribute and clicking the Remove button.
1.5 Select the Undo button to undo the operation.

1.6 Remove the instances where the Humidity attribute value is High (do it by yourself; hint: use an instance filter).
[Screenshots omitted: the original file; clicking on the Choose button]
Step 2: Visualize the data
2.1 Open the Iris data set (iris.arff).
2.2 Go to the Visualize panel and visualize this data.
2.3 Select sepalwidth on the x-axis and petalwidth on the y-axis.
2.4 Click on any point in the scatter plot to see its details.
2.5 Change the X and Y axes using the bars on the right side and check the different plots.
2.6 Use the Jitter slider. Sometimes points sit right on top of each other, and jitter adds a little
randomness to the x- and y-coordinates so that overlapping points can be told apart.
2.7 Select Rectangle and see how the plot changes.
Summary:
1. Used simple filters
2. Visualized data
