
CANDLEMAS 2022-23 EXAMINATION DIET

SCHOOL OF COMPUTER SCIENCE

MODULE CODE: ID5059

MODULE TITLE: Knowledge Discovery & Data Mining

EXAM DURATION: 3 hours

EXAM INSTRUCTIONS:

a. Answer all three questions
b. Each question carries 20 marks

This assessment consists of exam-style questions and you should answer as you
would in an exam. You cannot copy or paraphrase text or material from other sources
and present this as your own work. Your exam answers should be entirely your own
work without unacknowledged input from others. If you are in any doubt, you should
clearly acknowledge the origin of any material, text passages or ideas presented (e.g.
through references). You must not co-operate with any other person when completing
the exam, which must be entirely your own work. You must not share any information
about the exam with another person (e.g. another student) or act on any such
information you may receive. Any attempt to do so will be dealt with under the
University's Policy for Good Academic Practice and may result in severe sanctions.
You must submit your completed assessment on MMS within 3 hours of
downloading the exam. Assuming you have revised the module contents beforehand,
answering the questions should take no more than three hours.

Page 1 of 8
1. Classification
Two binary classification models C1 and C2 are trained on a particular dataset,
and then evaluated on a separate small dataset containing 7 cases. Each row in
the table below shows the actual class (0 or 1) and the score produced by
each classifier for that test case.

Actual class   C1 score   C2 score
     0           0.55       0.05
     0           0.10       0.17
     0           0.71       0.24
     1           0.82       0.43
     1           0.89       0.47
     1           0.70       0.51
     1           0.53       0.25

(a) For each classifier, calculate the following for the test data:
• the confusion matrix [2 marks]
• the precision [1 mark]
• the recall [1 mark]
• the F1 measure [1 mark]

Assume a decision threshold of 0.5 (that is, a case will be classified as 1 if
the corresponding score is greater than or equal to 0.5).
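For reference, the part (a) quantities follow mechanically from the table and the stated threshold rule. A minimal sketch in plain Python (scikit-learn offers equivalent functions, but nothing beyond the standard library is needed):

```python
# Confusion matrix, precision, recall and F1 for the part (a) data,
# with a decision threshold of 0.5 (score >= 0.5 predicts class 1).
actual = [0, 0, 0, 1, 1, 1, 1]
scores = {
    "C1": [0.55, 0.10, 0.71, 0.82, 0.89, 0.70, 0.53],
    "C2": [0.05, 0.17, 0.24, 0.43, 0.47, 0.51, 0.25],
}

def metrics(actual, scores, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, preds) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, preds) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, preds) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, preds) if a == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

for name, s in scores.items():
    print(name, metrics(actual, s))
```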

(b) For each classifier, sketch the ROC curve, including points on the curve
corresponding to threshold values of 0, 0.2, 0.4, 0.6, 0.8 and 1.

[8 marks]
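The ROC points requested in part (b) can be checked the same way: each threshold yields one (FPR, TPR) pair, applying the same score-greater-than-or-equal rule. A sketch:

```python
# (FPR, TPR) points for the part (b) ROC curves, at the listed thresholds.
actual = [0, 0, 0, 1, 1, 1, 1]
scores = {
    "C1": [0.55, 0.10, 0.71, 0.82, 0.89, 0.70, 0.53],
    "C2": [0.05, 0.17, 0.24, 0.43, 0.47, 0.51, 0.25],
}

def roc_point(actual, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, preds))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, preds))
    pos = sum(actual)                 # number of actual positives
    neg = len(actual) - pos           # number of actual negatives
    return fp / neg, tp / pos         # (FPR, TPR)

for name, s in scores.items():
    points = [roc_point(actual, s, t) for t in (0, 0.2, 0.4, 0.6, 0.8, 1)]
    print(name, points)
```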

(c) Explain which classifier you would select for deployment.

[4 marks]

(d) Given the following F1 and recall curves for a new classification model C3,
describe (using words, via a sketch, or both) how the precision varies with
threshold.

[3 marks]

[Figure: Recall and F1 curves for C3 plotted against threshold; both axes run from 0 to 1.]
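One route into part (d): precision can be recovered algebraically at each threshold from the two plotted curves, by rearranging the F1 definition (writing P for precision and R for recall):

```latex
F_1 = \frac{2PR}{P + R}
\;\Longrightarrow\;
F_1 (P + R) = 2PR
\;\Longrightarrow\;
P = \frac{F_1 R}{2R - F_1}
```

Reading F1 and R off the figure at a few thresholds and applying this identity gives the shape of the precision curve.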

[Total marks 20]

2. Modelling

(a) The plot below shows observations of attribute y for various values of
attribute x. Both axes have a linear scale. A polynomial regression model is
to be fitted to this data. Suggest which model degree is likely to give the
best results, and explain your reasoning.

[3 marks]

(b) During training of a regression model it is observed that error on a
validation data set remains significantly higher than on the training data
set, as training progresses. Explain how the model might be altered to give
better results if it is:
(i) a linear model
(ii) a polynomial model
(iii) a tree model

[4 marks]

(c) A regression model is to be fitted to a data set containing attributes x1..x9.
The scatter plots below show the relationships between each x attribute and
the attribute to be predicted, y. All scales are linear.

For each x attribute, explain whether you would include it in the model,
and if so, whether you would perform any additional processing based on
that attribute before fitting the model.

[6 marks]

(d) The table below shows a sample from a data set on grocery shopping habits.
A model predicting the attribute weekly_spend is to be fitted. For each of the
following model types, explain which attributes you would include in the
model, and any additional processing that would be necessary:

(i) a logistic regression model
(ii) a tree model
(iii) a neural net

[7 marks]

weekly_spend   age   postcode   car_registration   income       children
low            23    EH9 1DY    ST15RXA            £32,000      0
medium         30    KY16 8XZ   N/A                £16,723.24   1
medium         45    NG31 9LP   N/A                £46,100
low            71    NE69 7BT   AH51WXN                         3
high           36    PH20 1AL   GSZ7809            £67,840      2
medium         -18   YO30 6PP   NC21PLD            £12,000      0
medium         52    DN31 2BT   N/A                             4
[Total marks 20]

3. Ensemble models

(a) Compare the interpretability of individual decision trees and random
forests.

[2 marks]

(b) Explain how a random forest using bagging operates, and how it is able to
give better predictive performance than an individual decision tree.

[4 marks]
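The bootstrap sampling step at the heart of bagging can be illustrated in a few lines. A minimal sketch (the "cases" here are just indices standing in for real training rows, not data from this paper):

```python
import random

# Bagging's bootstrap step: each tree in the forest is trained on a
# sample of the n training cases drawn with replacement, so each sample
# typically repeats some cases and omits others (on average about
# (1 - 1/n)^n, roughly 35% for n = 10, of cases are left out).
random.seed(0)
cases = list(range(10))

def bootstrap_sample(cases):
    return [random.choice(cases) for _ in cases]  # n draws with replacement

samples = [bootstrap_sample(cases) for _ in range(3)]
for s in samples:
    print(sorted(s))
```

Averaging trees grown on such samples is what reduces the variance of the individual trees.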

(c) Three separate classifiers C1, C2 and C3 have been trained to predict
whether an image contains a cat or a dog. These three classifiers will then
be aggregated into an ensemble. Each classifier has predicted the following
probabilities for seven test cases:

Actual   C1          C2          C3
         Cat   Dog   Cat   Dog   Cat   Dog
Cat      0.6   0.4   0.8   0.2   0.9   0.1
Cat      0.4   0.6   0.6   0.4   0.8   0.2
Cat      0.7   0.3   0.4   0.6   0.7   0.3
Dog      0.6   0.4   0.2   0.8   0.6   0.4
Dog      0.4   0.6   0.3   0.7   0.7   0.3
Dog      0.6   0.4   0.8   0.2   0.6   0.4
Cat      0.9   0.1   0.4   0.6   0.9   0.1

Assuming a decision threshold of 0.5, calculate for the test data set:

(i) the accuracy of each classifier
(ii) the accuracy of the ensemble using hard voting
(iii) the accuracy of the ensemble using soft voting

[4 marks]
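All three accuracies in part (c) can be verified with a short script. A sketch, keeping only the Cat probabilities (the Dog column is their complement):

```python
# Individual, hard-voting and soft-voting accuracy for the part (c) table.
# Each list holds the predicted probability of "Cat"; threshold 0.5.
actual = ["Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Cat"]
cat_probs = {
    "C1": [0.6, 0.4, 0.7, 0.6, 0.4, 0.6, 0.9],
    "C2": [0.8, 0.6, 0.4, 0.2, 0.3, 0.8, 0.4],
    "C3": [0.9, 0.8, 0.7, 0.6, 0.7, 0.6, 0.9],
}

def label(p):
    return "Cat" if p >= 0.5 else "Dog"

def accuracy(preds):
    return sum(p == a for p, a in zip(preds, actual)) / len(actual)

# (i) individual accuracies
individual = {name: accuracy([label(p) for p in probs])
              for name, probs in cat_probs.items()}

# (ii) hard voting: majority of the three predicted labels per case
hard_preds = []
for i in range(len(actual)):
    votes = [label(cat_probs[name][i]) for name in cat_probs]
    hard_preds.append("Cat" if votes.count("Cat") > votes.count("Dog")
                      else "Dog")

# (iii) soft voting: average the Cat probabilities, then threshold
soft_preds = [label(sum(cat_probs[name][i] for name in cat_probs) / 3)
              for i in range(len(actual))]

print(individual, accuracy(hard_preds), accuracy(soft_preds))
```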

(d) A soft voting ensemble classifier is composed of 10 different models, all
trained on the same data. All the individual models are evaluated on the
same test data, yielding F1 values ranging between 0.7 and 0.8. Explain
what range of performance can be expected for the ensemble classifier.

[2 marks]

(e) The diagram below shows a model fitted to a training set as the first stage
of a gradient boost regression model. Sketch the following:

(i) the data used for training the second stage
(ii) a possible model that might be fitted in the second stage (assuming
the same type of model as in the first stage)
(iii) the resulting ensemble model

You can show these on a single diagram or multiple diagrams, as you wish.

[8 marks]
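The mechanics behind part (e) can be sketched with synthetic data, since the paper's figure is not reproduced here: the second stage is fitted to the residuals of the first, and the ensemble sums the stages. A one-split regression stump stands in for whatever model type the figure shows; the data below is illustrative only.

```python
# Gradient-boost mechanics on synthetic data: stage 2 is trained on the
# residuals of stage 1, and the ensemble prediction is their sum.

def fit_stump(xs, ys):
    """Fit a one-split regression stump: the threshold minimising
    the total within-segment squared error."""
    def mean(vs):
        return sum(vs) / len(vs)
    best = None
    for t in xs[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        sse = (sum((y - mean(left)) ** 2 for y in left)
               + sum((y - mean(right)) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, mean(left), mean(right))
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

xs = [i / 10 for i in range(10)]
ys = [x * x for x in xs]                   # synthetic training targets

stage1 = fit_stump(xs, ys)                 # first-stage model
residuals = [y - stage1(x) for x, y in zip(xs, ys)]   # (i) stage-2 data
stage2 = fit_stump(xs, residuals)          # (ii) second-stage model

def ensemble(x):                           # (iii) ensemble model
    return stage1(x) + stage2(x)

sse1 = sum((y - stage1(x)) ** 2 for x, y in zip(xs, ys))
sse2 = sum((y - ensemble(x)) ** 2 for x, y in zip(xs, ys))
print(sse1, sse2)
```

Because each stage fits the previous stage's residuals, the ensemble's training error can only shrink as stages are added, which is the pattern the sketch in (iii) should show.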

[Total marks 20]

*** END OF PAPER ***
