
IS328 Data Mining

Semester 2, 2021

Week 2: Data Mining Concepts and Numerical Analysis


Tutorial/Lab Session – 1 - Solution

PART A: Data Mining Concepts

Q1. Data Mining is a ________.


A. new technology that is used to store data.
B. multidisciplinary field of research.
C. database technology.
D. business information system

Q2. KDD stands for ________.


A. Knowledge Definition and Description.
B. Knowledge Discovery and Description.
C. Knowledge Discovery in Databases.
D. Knowledge Description and Discovery

Q3. Data Mining is commonly carried out using ________.


A. Flat Files
B. Relational Databases
C. Data Warehouses
D. Transaction Databases
E. Any of the above

Q4. Data mining refers to the ______ stage in knowledge discovery in databases.
A. selection.
B. retrieving.
C. discovery.
D. presentation.

Q5. _______ is the heart of knowledge discovery in database process.


A. Data Selection.
B. Data Warehouse.
C. Data Transformation.
D. Data Mining

Q6. Knowledge discovery in database refers to _____.


A. whole process of extraction of knowledge from data.
B. selection of data.
C. data mining algorithm.
D. cleaning the data.

Q7. ________ analysis divides data into groups that are meaningful, useful, or both.

A. Cluster
B. Association
C. Classification
D. Regression

Q8. The _______ data are stored in data warehouse.


A. operational
B. historical
C. transactional
D. optimized

Q9. The next stage after data selection in the KDD process is ______.


A. data mining
B. data visualisation
C. cleaning
D. reporting

Q10. Which of the following is an open-source data mining tool?


A. MYSQL
B. JAVA
C. SPSS
D. WEKA

An electronics store sells CD players at the following prices:


$350, $275, $500, $325, $100, $375, and $300.
Answer questions 12-15.

Q12. What is the mean price?


A $300.25
B $317.86
C $423.89
D $376.34

Q13. What is the mode?


A $225
B $317
C $350
D There is no mode

Working for Q14:


Sorted prices: 100, 275, 300, 325, 350, 375, 500

There are 7 values (an odd number), so the median is the single middle score.


Median position = 0.5 * 7 = 3.5, rounded up to the 4th score. Median = 325

Q14. What is the median price?
A $225
B $325
C $350
D $400

Q15. What is the range?


A $225
B $325
C $350
D $400
Range = Max – min = 500 – 100 = 400
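
The answers to Q12-Q15 can be checked with a short Python sketch. It is only an illustration using the standard library; the helper name `modes` is hypothetical and is written here so that a list with no repeated value reports no mode.

```python
from collections import Counter
from statistics import mean, median

# CD player prices from the question
prices = [350, 275, 500, 325, 100, 375, 300]

def modes(values):
    """Return the most frequent values, or [] if every value occurs only once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

mean_price = round(mean(prices), 2)      # Q12
price_modes = modes(prices)              # Q13
median_price = median(prices)            # Q14
price_range = max(prices) - min(prices)  # Q15

print(mean_price, price_modes, median_price, price_range)
```
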
Q16. A part of the population selected for data mining is called a:
A Variable
B Data
C Sample
D Parameter

Q17. Monthly rainfall in Suva during the last ten years is an example of a:

A Discrete variable
B Continuous variable
C Qualitative variable
D Random variable

Q18. Number of courses studied by a student in a semester is an example of a:

A Discrete variable
B Continuous variable
C Qualitative variable
D Random variable

PART B: Numerical Data Analysis


Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.

The symbol for Standard Deviation is σ (the Greek letter sigma).

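This is the formula for Standard Deviation (the population form, dividing by N, as used in the solutions below):

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where μ is the mean of the N values x_1, ..., x_N.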

Q1. Find the mean and standard deviation of the following:


a) 19, 13, 15, 25, and 18
Rearrange: 13, 15, 18, 19, 25

b) 19, 13, 15, 25, and 78
Rearrange: 13, 15, 19, 25, 78
Mean
(a) (19 + 13 + 15 + 25 +18)/5 = 18
(b) (19 + 13 + 15 + 25 +78)/5 = 30
SD
(a) Sqrt(1/5*[(19-18)^2 + (13-18)^2 + (15-18)^2 + (25-18)^2 + (18-18)^2]) = Sqrt(84/5) = 4.1
(b) Sqrt(1/5*[(19-30)^2 + (13-30)^2 + (15-30)^2 + (25-30)^2 + (78-30)^2]) = Sqrt(2964/5) = 24.35
Are the results sensitive to outliers?
Yes. The outlier 78 raises the mean from 18 to 30 and the SD from 4.1 to 24.35. The median is barely affected: it moves only from 18 to 19.
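
As a check on the arithmetic, the following Python sketch recomputes the mean and standard deviation for both lists. It assumes the population form of the standard deviation (dividing by N), which is what the 1/5 factor in the working implies, and uses `statistics.pstdev` for it. The median is included so all three measures can be compared.

```python
from statistics import mean, median, pstdev

a = [19, 13, 15, 25, 18]
b = [19, 13, 15, 25, 78]  # same list with 18 replaced by the outlier 78

summary = {}
for name, data in (("a", a), ("b", b)):
    # pstdev is the population standard deviation (divides by N, not N - 1)
    summary[name] = (mean(data), round(pstdev(data), 2), median(data))
    print(name, "mean, SD, median =", summary[name])
```
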

Q2. Find the median and mode of the following:


a) 15, 21, 26, 25, 21, 23, 25, 28, 21
b) 12, 15, 18, 26, 15, 9, 12, 27
c) 12, 15, 18, 26, 17, 19, 22, 27

(a) 15, 21, 21, 21, 23, 25, 25, 26, 28 (odd number of scores)
Median position = 0.5 * 9 = 4.5, rounded up to the 5th score. Median = 23
Mode = 21

(b) 9, 12, 12, 15, 15, 18, 26, 27 (even number of scores)
Median position = 0.5 * 8 = 4, so take the average of the 4th and 5th scores. Median = (15+15)/2 = 15
Mode = 12 and 15 (bimodal)

(c) 12, 15, 17, 18, 19, 22, 26, 27 (even number of scores)
Median position = 0.5 * 8 = 4, so take the average of the 4th and 5th scores. Median = (18+19)/2 = 18.5
Mode = no mode (every value occurs once).
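
The three answers above can be reproduced with the sketch below, which uses `statistics.median` and `statistics.multimode` (Python 3.8 or later). The wrapper `modes_or_none` is a hypothetical helper added for this illustration: `multimode` returns every value when nothing repeats, so the wrapper reports an empty list in that case.

```python
from statistics import median, multimode

datasets = {
    "a": [15, 21, 26, 25, 21, 23, 25, 28, 21],
    "b": [12, 15, 18, 26, 15, 9, 12, 27],
    "c": [12, 15, 18, 26, 17, 19, 22, 27],
}

def modes_or_none(values):
    """Return the sorted modes, or [] when every value occurs exactly once."""
    found = multimode(values)
    if len(found) == len(values):
        return []  # nothing repeats, so there is no mode
    return sorted(found)

results = {name: (median(data), modes_or_none(data)) for name, data in datasets.items()}
for name, (med, modes) in results.items():
    print(name, "median =", med, "mode(s) =", modes if modes else "no mode")
```
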

Q3. Suppose that a sample of health data for analysis includes the attribute age.

The age values for the data tuples are as follows:

13, 52, 46, 16, 45, 20, 20, 21, 40, 22, 35, 25, 35, 25, 70, 33, 33, 25, 35, 25, 35, 36, 22, 19, 16,
15, 30

Rearranged:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52,
70.

(a) What is the mean of the data?


The (arithmetic) mean of the data:
Sum of the ages = 809
Number of ages = 27
The mean of the data = 809 / 27 = 29.96, or approximately 30
(b) What is the median?
Median position = 0.5 * 27 = 13.5, rounded up to the 14th score. Median = 25

(c) What is the mode of the data? Comment on the data's modality (i.e., bimodal,
trimodal, etc.).
This data set has two values that occur with the same highest frequency and is, therefore,
bimodal.
The modes (values occurring with the greatest frequency) of the data are 25 and 35.

(d) What is the range of the data?


The range (range = max - min) of the data is: 70 – 13 = 57
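
Parts (a) to (d) can be verified with a few lines of Python. This is a sketch using only the standard library; the variable names are chosen for this illustration.

```python
from statistics import mean, median, multimode

# Age values from the question, in the order given
ages = [13, 52, 46, 16, 45, 20, 20, 21, 40, 22, 35, 25, 35, 25, 70,
        33, 33, 25, 35, 25, 35, 36, 22, 19, 16, 15, 30]

total = sum(ages)                     # sum of the ages
count = len(ages)                     # number of ages
mean_age = round(mean(ages), 2)       # (a)
median_age = median(ages)             # (b)
age_modes = sorted(multimode(ages))   # (c) more than one entry means multimodal
age_range = max(ages) - min(ages)     # (d)

print(total, count, mean_age, median_age, age_modes, age_range)
```
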

(e) Find the first quartile (Q1) and the third quartile (Q3) of the data?
Rearranged:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45,
46, 52, 70.

LQ position = 0.25 * 27 = 6.75, rounded up to the 7th score. LQ = 20


UQ position = 0.75 * 27 = 20.25, rounded up to the 21st score. UQ = 35

(f) Give the five-number summary of the data.


Min – 13
LQ – 20
Median – 25
UQ – 35
Max – 70
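
The positional rule used for the median and quartiles in this tutorial can be sketched in Python as follows. The function name `score_at` is hypothetical. Library routines such as `statistics.quantiles` use different interpolation conventions and may give slightly different quartiles, so the rule is coded by hand here.

```python
import math

ages = [13, 52, 46, 16, 45, 20, 20, 21, 40, 22, 35, 25, 35, 25, 70,
        33, 33, 25, 35, 25, 35, 36, 22, 19, 16, 15, 30]

def score_at(sorted_values, fraction):
    """Tutorial's positional rule, intended for fractions strictly between 0 and 1.

    Compute fraction * n. If it is not a whole number, round up and take that
    score. If it is whole, average that score and the next one.
    """
    position = fraction * len(sorted_values)
    if position != int(position):
        return sorted_values[math.ceil(position) - 1]  # 1-based position to 0-based index
    k = int(position)
    return (sorted_values[k - 1] + sorted_values[k]) / 2

ordered = sorted(ages)
five_number = (
    ordered[0],                # minimum
    score_at(ordered, 0.25),   # lower quartile (LQ)
    score_at(ordered, 0.5),    # median
    score_at(ordered, 0.75),   # upper quartile (UQ)
    ordered[-1],               # maximum
)
print(five_number)
```
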

(g) Construct a box and whisker plot

[Figure: box and whisker plot of the age data, drawn on an axis running from 5 to 80]

(h) Find the IQR (Inter-Quartile Range)


IQR = UQ - LQ = 35 – 20 = 15
(i) Find any outliers. Explain your answer.
Note that the five-number summary of a distribution consists of the minimum value, first
quartile, median value, third quartile, and maximum value.
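
One common way to answer part (i) is the 1.5 × IQR rule: any value below LQ − 1.5 × IQR or above UQ + 1.5 × IQR is flagged as an outlier. The sheet does not state which rule is intended, so the sketch below is an assumption, using the quartiles found in part (e).

```python
ages = [13, 52, 46, 16, 45, 20, 20, 21, 40, 22, 35, 25, 35, 25, 70,
        33, 33, 25, 35, 25, 35, 36, 22, 19, 16, 15, 30]

lq, uq = 20, 35   # quartiles from part (e)
iqr = uq - lq     # inter-quartile range from part (h)

# Assumed rule: fences at 1.5 * IQR beyond the quartiles
lower_fence = lq - 1.5 * iqr
upper_fence = uq + 1.5 * iqr

outliers = [age for age in ages if age < lower_fence or age > upper_fence]
print(iqr, lower_fence, upper_fence, outliers)
```

Under this assumed rule the fences are −2.5 and 57.5, so the only age flagged is 70.
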
