
CANDLEMAS 2022-23 EXAMINATION DIET

SCHOOL OF COMPUTER SCIENCE

MODULE CODE: ID5059

MODULE TITLE: Knowledge Discovery & Data Mining

EXAM DURATION: 3 hours

EXAM INSTRUCTIONS:

a. Answer all three questions
b. Each question carries 20 marks

This assessment consists of exam-style questions and you should answer as you
would in an exam. You cannot copy or paraphrase text or material from other sources
and present this as your own work. Your exam answers should be entirely your own
work without unacknowledged input from others. If you are in any doubt, you should
clearly acknowledge the origin of any material, text passages or ideas presented (e.g.
through references). You must not co-operate with any other person when completing
the exam, which must be entirely your own work. You must not share any information
about the exam with another person (e.g. another student) or act on any such
information you may receive. Any attempt to do so will be dealt with under the
University's Policy for Good Academic Practice and may result in severe sanctions.
You must submit your completed assessment on MMS within 3 hours of
downloading the exam. Assuming you have revised the module contents beforehand,
answering the questions should take no more than three hours.

Page 1 of 8
1. Classification
Two binary classification models C1 and C2 are trained on a particular dataset,
and then evaluated on a separate small dataset containing 7 cases. Each row in
the table below shows the actual class (0 or 1) and the score produced by
each classifier for that test case.

Actual class   C1 score   C2 score
     0           0.55       0.05
     0           0.10       0.17
     0           0.71       0.24
     1           0.82       0.43
     1           0.89       0.47
     1           0.70       0.51
     1           0.53       0.25

(a) For each classifier, calculate the following for the test data:
• the confusion matrix [2 marks]
• the precision [1 mark]
• the recall [1 mark]
• the F1 measure [1 mark]

Assume a decision threshold of 0.5 (that is, a case will be classified as 1 if
the corresponding score is greater than or equal to 0.5).
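For reference, the part (a) quantities follow mechanically from the table and the stated threshold rule. A minimal sketch in plain Python (scikit-learn offers equivalent functions, but nothing beyond the standard library is needed):

```python
# Confusion matrix, precision, recall and F1 for the part (a) data,
# with a decision threshold of 0.5 (score >= 0.5 predicts class 1).
actual = [0, 0, 0, 1, 1, 1, 1]
scores = {
    "C1": [0.55, 0.10, 0.71, 0.82, 0.89, 0.70, 0.53],
    "C2": [0.05, 0.17, 0.24, 0.43, 0.47, 0.51, 0.25],
}

def metrics(actual, scores, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, preds) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, preds) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, preds) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, preds) if a == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

for name, s in scores.items():
    print(name, metrics(actual, s))
```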

(b) For each classifier, sketch the ROC curve, including points on the curve
corresponding to threshold values of 0, 0.2, 0.4, 0.6, 0.8 and 1.

[8 marks]
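The ROC points requested in part (b) can be checked the same way: each threshold yields one (FPR, TPR) pair, applying the same score-greater-than-or-equal rule. A sketch:

```python
# (FPR, TPR) points for the part (b) ROC curves, at the listed thresholds.
actual = [0, 0, 0, 1, 1, 1, 1]
scores = {
    "C1": [0.55, 0.10, 0.71, 0.82, 0.89, 0.70, 0.53],
    "C2": [0.05, 0.17, 0.24, 0.43, 0.47, 0.51, 0.25],
}

def roc_point(actual, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, preds))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, preds))
    pos = sum(actual)                 # number of actual positives
    neg = len(actual) - pos           # number of actual negatives
    return fp / neg, tp / pos         # (FPR, TPR)

for name, s in scores.items():
    points = [roc_point(actual, s, t) for t in (0, 0.2, 0.4, 0.6, 0.8, 1)]
    print(name, points)
```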

(c) Explain which classifier you would select for deployment.

[4 marks]

(d) Given the following F1 and recall curves for a new classification model C3,
describe (using words, via a sketch, or both) how the precision varies with
threshold.

[3 marks]

[Figure: Recall and F1 curves for C3 plotted against threshold; both axes run from 0 to 1.]
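One route into part (d): precision can be recovered algebraically at each threshold from the two plotted curves, by rearranging the F1 definition (writing P for precision and R for recall):

```latex
F_1 = \frac{2PR}{P + R}
\;\Longrightarrow\;
F_1 (P + R) = 2PR
\;\Longrightarrow\;
P = \frac{F_1 R}{2R - F_1}
```

Reading F1 and R off the figure at a few thresholds and applying this identity gives the shape of the precision curve.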

[Total marks 20]

2. Modelling

(a) The plot below shows observations of attribute y for various values of
attribute x. Both axes have a linear scale. A polynomial regression model is
to be fitted to this data. Suggest which model degree is likely to give the
best results, and explain your reasoning.

[3 marks]

(b) During training of a regression model it is observed that error on a
validation data set remains significantly higher than on the training data
set, as training progresses. Explain how the model might be altered to give
better results if it is:
(i) a linear model
(ii) a polynomial model
(iii) a tree model

[4 marks]

(c) A regression model is to be fitted to a data set containing attributes x1..x9.
The scatter plots below show the relationships between each x attribute and
the attribute to be predicted, y. All scales are linear.

For each x attribute, explain whether you would include it in the model,
and if so, whether you would perform any additional processing based on
that attribute before fitting the model.

[6 marks]

(d) The table below shows a sample from a data set on grocery shopping habits.
A model predicting the attribute weekly_spend is to be fitted. For each of the
following model types, explain which attributes you would include in the
model, and any additional processing that would be necessary:

(i) a logistic regression model
(ii) a tree model
(iii) a neural net

[7 marks]

weekly_spend   age   postcode   car_registration   income       children
low            23    EH9 1DY    ST15RXA            £32,000      0
medium         30    KY16 8XZ   N/A                £16,723.24   1
medium         45    NG31 9LP   N/A                £46,100
low            71    NE69 7BT   AH51WXN                         3
high           36    PH20 1AL   GSZ7809            £67,840      2
medium         -18   YO30 6PP   NC21PLD            £12,000      0
medium         52    DN31 2BT   N/A                             4
[Total marks 20]

3. Ensemble models

(a) Compare the interpretability of individual decision trees and random
forests.

[2 marks]

(b) Explain how a random forest using bagging operates, and how it is able to
give better predictive performance than an individual decision tree.

[4 marks]
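The bootstrap sampling step at the heart of bagging can be illustrated in a few lines. A minimal sketch (the "cases" here are just indices standing in for real training rows, not data from this paper):

```python
import random

# Bagging's bootstrap step: each tree in the forest is trained on a
# sample of the n training cases drawn with replacement, so each sample
# typically repeats some cases and omits others (on average about
# (1 - 1/n)^n, roughly 35% for n = 10, of cases are left out).
random.seed(0)
cases = list(range(10))

def bootstrap_sample(cases):
    return [random.choice(cases) for _ in cases]  # n draws with replacement

samples = [bootstrap_sample(cases) for _ in range(3)]
for s in samples:
    print(sorted(s))
```

Averaging trees grown on such samples is what reduces the variance of the individual trees.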

(c) Three separate classifiers C1, C2 and C3 have been trained to predict
whether an image contains a cat or a dog. These three classifiers will then
be aggregated into an ensemble. Each classifier has predicted the following
probabilities for seven test cases:

Actual   C1          C2          C3
         Cat   Dog   Cat   Dog   Cat   Dog
Cat      0.6   0.4   0.8   0.2   0.9   0.1
Cat      0.4   0.6   0.6   0.4   0.8   0.2
Cat      0.7   0.3   0.4   0.6   0.7   0.3
Dog      0.6   0.4   0.2   0.8   0.6   0.4
Dog      0.4   0.6   0.3   0.7   0.7   0.3
Dog      0.6   0.4   0.8   0.2   0.6   0.4
Cat      0.9   0.1   0.4   0.6   0.9   0.1

Assuming a decision threshold of 0.5, calculate for the test data set:

(i) the accuracy of each classifier
(ii) the accuracy of the ensemble using hard voting
(iii) the accuracy of the ensemble using soft voting

[4 marks]
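All three accuracies in part (c) can be verified with a short script. A sketch, keeping only the Cat probabilities (the Dog column is their complement):

```python
# Individual, hard-voting and soft-voting accuracy for the part (c) table.
# Each list holds the predicted probability of "Cat"; threshold 0.5.
actual = ["Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Cat"]
cat_probs = {
    "C1": [0.6, 0.4, 0.7, 0.6, 0.4, 0.6, 0.9],
    "C2": [0.8, 0.6, 0.4, 0.2, 0.3, 0.8, 0.4],
    "C3": [0.9, 0.8, 0.7, 0.6, 0.7, 0.6, 0.9],
}

def label(p):
    return "Cat" if p >= 0.5 else "Dog"

def accuracy(preds):
    return sum(p == a for p, a in zip(preds, actual)) / len(actual)

# (i) individual accuracies
individual = {name: accuracy([label(p) for p in probs])
              for name, probs in cat_probs.items()}

# (ii) hard voting: majority of the three predicted labels per case
hard_preds = []
for i in range(len(actual)):
    votes = [label(cat_probs[name][i]) for name in cat_probs]
    hard_preds.append("Cat" if votes.count("Cat") > votes.count("Dog")
                      else "Dog")

# (iii) soft voting: average the Cat probabilities, then threshold
soft_preds = [label(sum(cat_probs[name][i] for name in cat_probs) / 3)
              for i in range(len(actual))]

print(individual, accuracy(hard_preds), accuracy(soft_preds))
```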

(d) A soft voting ensemble classifier is composed of 10 different models, all
trained on the same data. All the individual models are evaluated on the
same test data, yielding F1 values ranging between 0.7 and 0.8. Explain
what range of performance can be expected for the ensemble classifier.

[2 marks]

(e) The diagram below shows a model fitted to a training set as the first stage
of a gradient boost regression model. Sketch the following:

(i) the data used for training the second stage
(ii) a possible model that might be fitted in the second stage (assuming
the same type of model as in the first stage)
(iii) the resulting ensemble model

You can show these on a single diagram or multiple diagrams, as you wish.

[8 marks]
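The mechanics behind part (e) can be sketched with synthetic data, since the paper's figure is not reproduced here: the second stage is fitted to the residuals of the first, and the ensemble sums the stages. A one-split regression stump stands in for whatever model type the figure shows; the data below is illustrative only.

```python
# Gradient-boost mechanics on synthetic data: stage 2 is trained on the
# residuals of stage 1, and the ensemble prediction is their sum.

def fit_stump(xs, ys):
    """Fit a one-split regression stump: the threshold minimising
    the total within-segment squared error."""
    def mean(vs):
        return sum(vs) / len(vs)
    best = None
    for t in xs[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        sse = (sum((y - mean(left)) ** 2 for y in left)
               + sum((y - mean(right)) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, mean(left), mean(right))
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

xs = [i / 10 for i in range(10)]
ys = [x * x for x in xs]                   # synthetic training targets

stage1 = fit_stump(xs, ys)                 # first-stage model
residuals = [y - stage1(x) for x, y in zip(xs, ys)]   # (i) stage-2 data
stage2 = fit_stump(xs, residuals)          # (ii) second-stage model

def ensemble(x):                           # (iii) ensemble model
    return stage1(x) + stage2(x)

sse1 = sum((y - stage1(x)) ** 2 for x, y in zip(xs, ys))
sse2 = sum((y - ensemble(x)) ** 2 for x, y in zip(xs, ys))
print(sse1, sse2)
```

Because each stage fits the previous stage's residuals, the ensemble's training error can only shrink as stages are added, which is the pattern the sketch in (iii) should show.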

[Total marks 20]

*** END OF PAPER ***
