
IS328 Data Mining Semester

Week 3: Data Types and Data Pre-processing


Tutorial/Lab Session - 2 - Solution

Part A – MCQs

Q1. Which of the following is not part of the KDD process?


A Data cleaning
B Data Mining
C Data Integration
D Data Encryption
E Data Transformation

Q2 Which of the following is a measurement of data quality?


A Accuracy
B Completeness
C Timeliness
D Reliability
E All of the above

Q3 Which of the following is not a data type?


A Nominal
B Binary
C Discrete
D Random
E Ordinal

Q4 In KDD and data mining, noisy data is referred to as ________.


A repeated data.
B complex data.
C meta data.
D random errors in database.

Q5 __________________ refers to the process of deriving high-quality information from text.


A Text Mining.
B Image Mining.
C Database Mining.
D Multimedia Mining.

Q6 Consider discretizing a continuous attribute whose values are listed below:
3, 4, 5, 10, 12, 21, 32, 43, 44, 46, 52, 59, 63
Using equal-width partitioning and five bins, how many values are there in the first bin?
A. 1
B. 2
C. 3
D. 4
E. 5

Workings: W = (max – min) / number of bins = (63 – 3) / 5 = 12

Bin 1: (3 to 15) – [3, 4, 5, 10, 12]
Bin 2: (16 to 27) – [21]
Bin 3 …

The first bin therefore holds 5 values.
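The equal-width working above can be checked with a short Python sketch. The function name `equal_width_bins` is illustrative, and the sketch assumes half-open bins [lower, lower + W) with the maximum value placed in the last bin; a different boundary convention would only matter for values that fall exactly on a bin edge, which does not happen with this data.

```python
def equal_width_bins(values, k):
    """Partition values into k equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would compute index k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return width, bins


values = [3, 4, 5, 10, 12, 21, 32, 43, 44, 46, 52, 59, 63]
width, bins = equal_width_bins(values, 5)
print(width)    # 12.0
print(bins[0])  # [3, 4, 5, 10, 12]
```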
Q7 Which of the following are known as qualitative data types?
(i) nominal
(ii) interval
(iii) ordinal
(iv) discrete
(v) continuous
A. (i) and (ii).
B. (ii) and (iii)
C. (i) and (iii).
D. (iii) and (iv)
E (iv) and (v)

Q8 Consider discretizing a continuous attribute whose values are listed below:


3, 4, 5, 10, 20, 32, 43, 44, 46, 52, 59, 61
Which of the following number of bins is not possible for using equi-depth bins?
A. 2
B. 3
C. 4
D. 5
E. 6
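
One way to reason about this question is to assume strict equi-depth binning, where every bin must hold exactly the same number of values, so the number of bins has to divide the number of values. The sketch below checks each option under that assumption (in practice, equi-depth binning is often allowed to produce bins of only approximately equal size).

```python
values = [3, 4, 5, 10, 20, 32, 43, 44, 46, 52, 59, 61]
n = len(values)

# A bin count k is feasible (under the strict assumption) when n is divisible by k.
feasible = {k: n % k == 0 for k in (2, 3, 4, 5, 6)}
for k, ok in feasible.items():
    print(k, "bins:", "possible" if ok else "not possible")
```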

Q9 Suppose a group of 12 students with the test scores listed as follows:
19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 100, 95
By partitioning them into three bins using the equi-width method and smoothing by bin boundaries,
how many data items will be in the second bin?
A 1
B 2
C 3
D 4
E 5
Working: sorted values are 19, 35, 48, 63, 69, 71, 72, 81, 85, 88, 95, 100
W = (max – min)/3 = (100 – 19)/3 = 27
Bin 1: (19 to 46) – [19, 35]
Bin 2: (47 to 73) – [48, 63, 69, 71, 72]
Bin 3: (74 to 100) – [81, 85, 88, 95, 100]

Q10 This step of the KDD process model deals with noisy data.
A Data Integration
B Data pre-processing
C Data transformation
D Data mining
E Data Interpretation

Q11 Which of the following is not part of pre-processing:


A Data cleaning
B Data integration
C Data transformation
D Data reduction
E Data visualisation

Q12 Income data for a group of 12 people have the following properties:
mean = $33000 and sd = $11000
Using z-score normalization, an income of $73600 is transformed to:
A 2.24
B 2.79
C 3.69
D 4.25
E 5.36
Working: Z-norm = (value – mean)/SD = (73600 – 33000)/ 11000 = 3.69

Q13 Income data for a group of 12 people have the following properties:
min =$55000 and max = $150000:
Using min-max normalization to map (0 – 1), an income of $73600 is transformed to

A 0.543
B 0.434
C 0.365
D 0.196
E 0.098
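Working:

v' = ((v – minA) / (maxA – minA)) × (new_maxA – new_minA) + new_minA

v' = ((73600 – 55000) / (150000 – 55000)) × (1 – 0) + 0 = 18600 / 95000 = 0.196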

Q14 Normalisation using decimal scaling is done using the formula

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.

Assume the data range is -722 to 889. What will be the value of j?
A 1
B 2
C 3
D 4
E 5

Part B: Descriptive /Problem Solving


Q1) Give an example of each data type below:
Nominal
Ordinal
Binary
Discrete
Continuous
Nominal: hair colour {black, brown, grey, red}
Ordinal: size {small, medium, large}
Binary: {good, bad}
Discrete: {28, 30, 31}
Continuous: {1.24, 32.4, 4.34, 5.67}
Q2) Use 5 bins to partition the following group of data based on Equal-Depth.
33, 35, 16, 21, 36, 73, 45, 22, 22, 20, 25, 25, 30, 13, 33, 16, 19, 25, 20, 35, 15, 35, 46, 52, 40.

Sorted:

13 15 16 16 19 20 20 21 22 22 25 25 25 30 33 33 35 35 35 36 40 45 46 52 73

Smooth the data within each bin using bin boundaries.

Bin1: 13, 15, 16, 16, 19


Bin2: 20, 20, 21, 22, 22
Bin3: 25, 25, 25, 30, 33
Bin4: 33, 35, 35, 35, 36
Bin5: 40, 45, 46, 52, 73

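Smooth by Bin Boundaries (each value is replaced by the closer of its bin's minimum and maximum; a tie goes to the lower boundary)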


Bin1: 13, 13, 13, 13, 19
Bin2: 20, 20, 20, 22, 22
Bin3: 25, 25, 25, 33, 33
Bin4: 33, 36, 36, 36, 36
Bin5: 40, 40, 40, 40, 73
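
The equal-depth partitioning and boundary smoothing above can be reproduced with a minimal Python sketch. The function names are illustrative, the sketch assumes the number of values is divisible by the number of bins, and it sends ties to the lower boundary, which is the convention the worked answer appears to follow (16 becomes 13, 21 becomes 20).

```python
def equal_depth_bins(values, k):
    """Sort the values and cut them into k bins of equal size."""
    data = sorted(values)
    depth = len(data) // k  # assumes len(values) is divisible by k
    return [data[i * depth:(i + 1) * depth] for i in range(k)]


def smooth_by_boundaries(bin_values):
    """Replace each value with the nearer bin boundary (ties go to the lower one)."""
    lo, hi = bin_values[0], bin_values[-1]
    return [lo if v - lo <= hi - v else hi for v in bin_values]


data = [33, 35, 16, 21, 36, 73, 45, 22, 22, 20, 25, 25, 30,
        13, 33, 16, 19, 25, 20, 35, 15, 35, 46, 52, 40]
bins = equal_depth_bins(data, 5)
smoothed = [smooth_by_boundaries(b) for b in bins]
for i, (b, s) in enumerate(zip(bins, smoothed), start=1):
    print(f"Bin{i}: {b} -> {s}")
```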
Q3) Consider the following group of numbers:
200, 300, 400, 600, 750, 900, 1000, 1200
Normalize the above numbers using the following methods. Show your calculations.
(i) Min-Max Normalization (0 – 1)
(ii) Decimal Scaling
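Working for (i)

v' = ((v – minA) / (maxA – minA)) × (new_maxA – new_minA) + new_minA

min = 200, max = 1200, new_max = 1, new_min = 0

v = 200: v' = ((200 – 200) / (1200 – 200)) × (1 – 0) + 0 = 0
v = 300: v' = ((300 – 200) / (1200 – 200)) × (1 – 0) + 0 = 0.1
v = 400: v' = ((400 – 200) / (1200 – 200)) × (1 – 0) + 0 = 0.2
v = 600: v' = ((600 – 200) / (1200 – 200)) × (1 – 0) + 0 = 0.4
v = 750: v' = ((750 – 200) / (1200 – 200)) × (1 – 0) + 0 = 0.55
v = 900: v' = ((900 – 200) / (1200 – 200)) × (1 – 0) + 0 = 0.7
v = 1000: v' = ((1000 – 200) / (1200 – 200)) × (1 – 0) + 0 = 0.8
v = 1200: v' = ((1200 – 200) / (1200 – 200)) × (1 – 0) + 0 = 1.0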

(i) Min-Max Normalisation (0 – 1)

{0, 0.1, 0.2, 0.4, 0.55, 0.7, 0.8, 1.0}
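
The min-max result can be double-checked with a few lines of Python. This is a sketch only: the helper name `min_max` and its default arguments are illustrative, and it assumes the values are not all identical (otherwise the denominator would be zero).

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


data = [200, 300, 400, 600, 750, 900, 1000, 1200]
scaled = [round(x, 2) for x in min_max(data)]
print(scaled)  # [0.0, 0.1, 0.2, 0.4, 0.55, 0.7, 0.8, 1.0]
```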

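Working for (ii)

v' = v / 10^j

Steps: find the smallest integer j such that max(|v'|) < 1. The largest absolute value is 1200, so j = 4 (1200 / 10^4 = 0.12 < 1, whereas j = 3 gives 1.2).

v = {200, 300, 400, 600, 750, 900, 1000, 1200}

(ii) Decimal Scaling: the normalized values are

{0.02, 0.03, 0.04, 0.06, 0.075, 0.09, 0.1, 0.12}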
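
A small Python sketch of decimal scaling, for checking the value of j and the scaled numbers. The helper name `decimal_scaling` is illustrative, and the search assumes a non-negative j, which is sufficient whenever the largest absolute value is at least 1.

```python
def decimal_scaling(values):
    """Return (j, scaled) where j is the smallest j >= 0 with max(|v| / 10**j) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


data = [200, 300, 400, 600, 750, 900, 1000, 1200]
j, scaled = decimal_scaling(data)
print(j)       # 4
print(scaled)  # [0.02, 0.03, 0.04, 0.06, 0.075, 0.09, 0.1, 0.12]
```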


Q4) Consider the following data vector denoted Y:
Y= {35 36 46 68 70}
(a) Calculate the mean and standard deviation for the above data. Use the following formula.

This is the formula for Standard Deviation:

σ = √( (1 / (N – 1)) × Σ (x_i – μ)^2 ), with the sum taken over i = 1 to N

Mean = (35 + 36 + 46 + 68 + 70) / 5 = 51
(35 – 51)^2 + (36 – 51)^2 + (46 – 51)^2 + (68 – 51)^2 + (70 – 51)^2 = 1156
StdDev = SQRT(1156 / 4) = 17

(b) Normalise the data Y using the Z-Score normalization method.

Formula: Z-score normalization = (value – mean) / StdDev

NY1 = (35 – 51) / 17 = -0.941
NY2 = (36 – 51) / 17 = -0.882
NY3 = (46 – 51) / 17 = -0.294
NY4 = (68 – 51) / 17 = 1
NY5 = (70 – 51) / 17 = 1.118

Therefore, the normalized vector is Normalized Y = {-0.941, -0.882, -0.294, 1, 1.118}
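
The mean, standard deviation and z-scores can be verified with Python's standard `statistics` module. `statistics.stdev` computes the sample standard deviation (dividing by N – 1), which matches the formula given in the question; the variable names are illustrative.

```python
import statistics

y = [35, 36, 46, 68, 70]
mean = statistics.mean(y)   # arithmetic mean
sd = statistics.stdev(y)    # sample standard deviation (divides by N - 1)

# Z-score: distance from the mean measured in standard deviations.
z = [round((v - mean) / sd, 3) for v in y]
print(mean, sd)
print(z)  # [-0.941, -0.882, -0.294, 1.0, 1.118]
```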

Q5) Using the Min-Max method, normalize the following data to the scale (1 – 10).
Show your calculations. Use the formula

v' = ((v – minA) / (maxA – minA)) × (new_maxA – new_minA) + new_minA

new_min = 1
new_max = 10

(Each cell shows the original value -> the normalised value.)

Name      Blood Sugar Reading    Body Mass Index (BMI)    Blood Pressure Measurement
Jacki     5.5  -> 1              22 -> 4.6                125 -> 4.86
David     5.8  -> 4.6            28 -> 10                 145 -> 10
Jessica   6.0  -> 7              18 -> 1                  110 -> 1
Mary      6.25 -> 10             20 -> 2.8                135 -> 7.43
Rahini    5.9  -> 5.8            21 -> 3.7                120 -> 3.57

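Blood Sugar Reading

v' = ((v – minA) / (maxA – minA)) × (new_maxA – new_minA) + new_minA

minA = 5.5, maxA = 6.25, new_maxA = 10, new_minA = 1

V = 5.5 (Jacki) [Note the use of brackets]
v' = ((5.5 – 5.5) / (6.25 – 5.5) × 9) + 1 = 1

V = 5.8 (David)
v' = ((5.8 – 5.5) / (6.25 – 5.5) × 9) + 1 = 4.6

V = 6.0 (Jessica)
v' = ((6.0 – 5.5) / (6.25 – 5.5) × 9) + 1 = 7

V = 6.25 (Mary)
v' = ((6.25 – 5.5) / (6.25 – 5.5) × 9) + 1 = 10

V = 5.9 (Rahini)
v' = ((5.9 – 5.5) / (6.25 – 5.5) × 9) + 1 = 5.8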

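Body Mass Index (BMI)

v' = ((v – minA) / (maxA – minA)) × (new_maxA – new_minA) + new_minA

minA = 18, maxA = 28, new_maxA = 10, new_minA = 1

V = 22 (Jacki) [Note the use of brackets]
v' = ((22 – 18) / (28 – 18) × 9) + 1 = 4.6

V = 28 (David)
v' = ((28 – 18) / (28 – 18) × 9) + 1 = 10

V = 18 (Jessica)
v' = ((18 – 18) / (28 – 18) × 9) + 1 = 1

V = 20 (Mary)
v' = ((20 – 18) / (28 – 18) × 9) + 1 = 2.8

V = 21 (Rahini)
v' = ((21 – 18) / (28 – 18) × 9) + 1 = 3.7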

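Blood Pressure Measurement

v' = ((v – minA) / (maxA – minA)) × (new_maxA – new_minA) + new_minA

minA = 110, maxA = 145, new_maxA = 10, new_minA = 1

V = 125 (Jacki)
v' = ((125 – 110) / (145 – 110) × 9) + 1 = 4.86

V = 145 (David)
v' = ((145 – 110) / (145 – 110) × 9) + 1 = 10

V = 110 (Jessica)
v' = ((110 – 110) / (145 – 110) × 9) + 1 = 1

V = 135 (Mary)
v' = ((135 – 110) / (145 – 110) × 9) + 1 = 7.43

V = 120 (Rahini)
v' = ((120 – 110) / (145 – 110) × 9) + 1 = 3.57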

Min-Max Normalisation (1-10)

Name      Blood Sugar Reading    Body Mass Index (BMI)    Blood Pressure Measurement
Jacki     1                      4.6                      4.86
David     4.6                    10                       10
Jessica   7                      1                        1
Mary      10                     2.8                      7.43
Rahini    5.8                    3.7                      3.57
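
The whole table can be recomputed column by column with a short Python sketch. The names `readings` and `rescale` are illustrative; the key point is that each attribute is rescaled using its own minimum and maximum, not a single range shared across all three columns.

```python
# name -> (blood sugar, BMI, blood pressure), taken from the question's table
readings = {
    "Jacki":   (5.5, 22, 125),
    "David":   (5.8, 28, 145),
    "Jessica": (6.0, 18, 110),
    "Mary":    (6.25, 20, 135),
    "Rahini":  (5.9, 21, 120),
}


def rescale(column, new_min=1, new_max=10):
    """Min-max normalise one attribute column to [new_min, new_max]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in column]


names = list(readings)
columns = list(zip(*readings.values()))           # one tuple per attribute
scaled_columns = [rescale(col) for col in columns]
table = {
    name: tuple(round(col[i], 2) for col in scaled_columns)
    for i, name in enumerate(names)
}
for name, row in table.items():
    print(name, row)
```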

Part C: WEKA for Data Mining

Objective: To use simple filters and visualize data

Step 1: Use filter to remove an attribute

1.1 Open the Weather.nominal dataset and click on the Choose button to choose a filter

There are a lot of different filters. AllFilter and MultiFilter are ways of combining filters. We have
supervised and unsupervised filters. Supervised filters are ones that use a class value for their
operation. They aren't as common as unsupervised filters, which don't use the class value. There are
attribute filters and instance filters. We want to remove an attribute, so we're looking for an attribute
filter. There are so many filters in Weka that you simply have to learn to look around and find what you
want.

1.2 Expand the attribute filters and choose Remove

1.3 Click on the text box of the Choose field, specify the attributeIndices as 3, and select OK

1.4 Click on the Apply button to see the impact (attribute 3, Humidity, is removed from the data set).

Note: The same removal can be achieved by selecting the attribute and clicking the Remove button

1.5 Select the Undo button to undo the operation

1.6 Remove the instances where the Humidity attribute value is High (do it by yourself; hint: instance
filter)
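
To make the difference between an attribute filter and an instance filter concrete, here is a rough Python sketch of what the two operations do to a table. It is not Weka code: the function names are illustrative, the attribute names are assumed to follow the standard weather dataset, and the four rows are made-up samples, not the full file.

```python
attributes = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [  # illustrative sample rows only
    ["sunny", "hot", "high", "FALSE", "no"],
    ["rainy", "cool", "normal", "FALSE", "yes"],
    ["overcast", "mild", "high", "TRUE", "yes"],
    ["sunny", "cool", "normal", "FALSE", "yes"],
]


def remove_attribute(attributes, rows, index):
    """Attribute filter: drop one column (1-based index, like attributeIndices)."""
    i = index - 1
    new_attrs = attributes[:i] + attributes[i + 1:]
    new_rows = [r[:i] + r[i + 1:] for r in rows]
    return new_attrs, new_rows


def remove_instances(attributes, rows, name, value):
    """Instance filter: drop every row whose attribute `name` equals `value`."""
    i = attributes.index(name)
    return [r for r in rows if r[i] != value]


attrs_no_humidity, rows_no_humidity = remove_attribute(attributes, rows, 3)
rows_not_high = remove_instances(attributes, rows, "humidity", "high")
print(attrs_no_humidity)   # ['outlook', 'temperature', 'windy', 'play']
print(len(rows_not_high))  # 2
```

The attribute filter keeps every row but shortens each one, while the instance filter keeps every column but returns fewer rows.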

Original File:

Click on the Choose button.

Step 2: Visualize the data

2.1 Open the Iris data set (iris.arff)

2.2 Go to the Visualize panel and visualize this data

2.3 Select sepalwidth on the x-axis and petalwidth on the y-axis

2.4 Click on any point in the scatter plot to see its details

2.5 Change the X and Y axes through the bars on the right side and check the different plots
2.6 Use the Jitter slider. Sometimes, points sit right on top of each other, and jitter just adds a little bit of
randomness to the x- and the y-axes so that they can be told apart.
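
The idea behind jitter can be illustrated in a few lines of Python. This is a conceptual sketch, not Weka's actual implementation: the function name `jitter`, the `amount` parameter, the fixed seed and the sample points are all assumptions made for illustration.

```python
import random


def jitter(points, amount, seed=0):
    """Add small uniform noise to x and y so overlapping points become visible."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Three instances with identical (sepalwidth, petalwidth) values plot as one dot.
points = [(3.0, 0.2), (3.0, 0.2), (3.0, 0.2)]
jittered = jitter(points, amount=0.05)
print(len(set(points)), "distinct position(s) before jitter")
print(len(set(jittered)), "distinct position(s) after jitter")
```

The noise only changes where the points are drawn; the underlying data values stay the same.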

2.7 Select Rectangle and see how the plot changes

Summary:

1. Used simple Filters


2. Visualized data
