Examiners’ commentaries 2020


ST104a Statistics 1

Important note

This commentary reflects the examination and assessment arrangements for this course in the
academic year 2019–20. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.

General remarks

Learning outcomes

At the end of the half course and having completed the Essential reading and activities you should:

• be familiar with the key ideas of statistics that are accessible to a candidate with a
moderate mathematical competence
• be able to routinely apply a variety of methods for explaining, summarising and presenting
data and interpreting results clearly using appropriate diagrams, titles and labels when
required
• be able to summarise the ideas of randomness and variability, and the way in which these
link to probability theory to allow the systematic and logical collection of statistical
techniques of great practical importance in many applied areas
• have a grounding in probability theory and some grasp of the most common statistical
methods
• be able to perform inference to test the significance of common measures such as means and
proportions and conduct chi-squared tests of contingency tables
• be able to use simple linear regression and correlation analysis and know when it is
appropriate to do so.

Planning your time in the examination

You have two hours to complete this paper, which is in two parts. The first part, Section A, is
compulsory; it comprises several subquestions and accounts for 50 per cent of the total marks.

Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.
Remember that each of the Section B questions is likely to cover more than one topic. In 2020, for
example, the first part of Question 2 related to a contingency table while the second part covered
aspects of sampling design. In Question 3, the first part covered correlation and linear regression
while the second part related to statistical inference for the difference between two population
means. Finally, in Question 4, the first part required a boxplot while the second part related to
statistical inference for paired data. This means that it is really important that you make sure you
have a reasonable idea of what topics are covered before you start work on the paper! We suggest
you divide your time as follows during the examination:

• Spend the first 10 minutes annotating the paper. Note the topics covered in each question
and subquestion.
• Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one
question, but do not just give up after two minutes!
• Once you have chosen your two Section B questions, give them about 25 minutes each.
• This leaves you with 15 minutes. Do not leave the examination hall at this point! Check
over any questions you may not have completely finished. Make sure you have labelled and
given a title to any tables or diagrams which were required and, if you did more than the
two questions required in Section B, decide which one to delete. Remember that only two of
your answers will be given credit in Section B and that you must choose which these are!

What are the examiners looking for?

The examiners are looking for very simple demonstrations from you. They want to be sure that you:

• have covered the syllabus as described and explained in the subject guide
• know the basic formulae given there and when and how to use them
• understand and answer the questions set.

You are not expected to write long essays where explanations or descriptions of sampling design
are required, and note-form answers are acceptable. However, clear and accurate language, both
mathematical and written, is expected and marked. The explanations below and in the specific
commentaries for the papers for each zone should make these requirements clear.

Key steps to improvement

The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates did not do, though asked, in the 2019 examinations!
Remember the following.

• If you are asked to label a diagram (which is almost always the case!), please do so. Writing
‘Histogram’, ‘Stem-and-leaf diagram’, ‘Boxplot’ or ‘Scatter diagram’ in itself is insufficient.
What do the data describe? What are the units? What are the x-axis and y-axis?
• If you are specifically asked to perform a hypothesis test, or calculate a confidence interval,
do so. It is not acceptable to do one rather than the other! If you are asked to use a 5%
significance level, this is what will be marked.
• Do not waste time calculating things which are not required by the examiners. If you are
asked to find the line of best fit, you will gain no extra marks for also calculating the
correlation coefficient. If you are asked to use the confidence interval you have just calculated
to comment on the results, carrying out an additional hypothesis test will not gain you marks.
• When performing calculations, try to use as many decimal places as possible in intermediate
steps to reach the most accurate solution. You are advised to keep at least two decimal places
in general and at least three decimal places when calculating probabilities.

How should you use the specific comments on each question given in the
Examiners’ commentaries?

We hope that you find these useful. For each question and subquestion, they give:

• further guidance for each question on the points made in the last section
• the answers, or keys to the answers, which the examiners were looking for
• the relevant detailed reference to Newbold, P., W.L. Carlson and B.M. Thorne Statistics for
Business and Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN
9780273767060] and the subject guide
• where appropriate, suggested activities from the subject guide which should help you to
prepare, and similar questions from Newbold et al. (2012).

Any further references you might need are given in the part of the subject guide to which you are
referred for each answer.

Memorising from the Examiners’ commentaries

It was noted recently that a small number of candidates appeared to be memorising answers from
previous years’ Examiners’ commentaries, for example plots, and produced the exact same image of
them without looking at the current year’s examination paper questions! Note that this is very easy
to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination
questions and it is pointless to attempt to memorise them.

Examination revision strategy

Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons, but one particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.

We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.

The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.

If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.

Comments on specific questions

Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.

Section A

Answer all parts of question 1 (50 marks in total).

Question 1
(a) Suppose that $x_1 = \sqrt{4}$, $x_2 = \sqrt{5}$, $x_3 = 4.5$, $x_4 = -0.8$, and $y_1 = \sqrt{4}$,
$y_2 = \sqrt{5}$, $y_3 = 0$, $y_4 = 2$. Calculate the following quantities:

i. $\sum_{i=1}^{2} x_i y_i$    ii. $\sum_{i=3}^{4} x_i^3 y_i^3$    iii. $x_1^4 + \sum_{i=1}^{2} \frac{x_i^2}{y_i^4}$.

(6 marks)

Reading for this question


This question refers to the basic bookwork which can be found in Section 2.9 of the subject
guide and in particular Activity A1.6.
Approaching the question
Be careful to leave the $x_i$s and $y_i$s in the order given and only cover the values of i asked for.
This question was generally well done; the answers are as follows.

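i. We have:
\[ \sum_{i=1}^{2} x_i y_i = \sqrt{4} \times \sqrt{4} + \sqrt{5} \times \sqrt{5} = 4 + 5 = 9. \]

ii. We have:
\[ \sum_{i=3}^{4} x_i^3 y_i^3 = (4.5)^3 \times 0^3 + (-0.8)^3 \times 2^3 = 0 + (-0.512) \times 8 = -4.096. \]

iii. We have:
\[ x_1^4 + \sum_{i=1}^{2} \frac{x_i^2}{y_i^4} = (\sqrt{4})^4 + \left( \frac{(\sqrt{4})^2}{(\sqrt{4})^4} + \frac{(\sqrt{5})^2}{(\sqrt{5})^4} \right) = 16 + \frac{4}{16} + \frac{5}{25} = 16.45. \]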

One thing to note here is the excessive use of calculators. Some candidates failed to see
that, for example, $\sqrt{4} \times \sqrt{4} = 4$, hence there is no need to use a calculator, and used the
calculator for both $\sqrt{4}$ and the product, resulting in some loss of accuracy.
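
As a way of checking the hand calculations above, the three sums can be sketched in a few lines of Python. This is only an illustrative check, not part of the expected examination answer; the list layout (with an unused entry at index 0 so that `x[i]` matches the notation x_i) is an assumption made for readability.

```python
import math

# Values from part (a); index 0 is a placeholder so that x[i] matches x_i.
x = [None, math.sqrt(4), math.sqrt(5), 4.5, -0.8]
y = [None, math.sqrt(4), math.sqrt(5), 0, 2]

# i. sum of x_i * y_i for i = 1, 2
part_i = sum(x[i] * y[i] for i in range(1, 3))

# ii. sum of x_i^3 * y_i^3 for i = 3, 4
part_ii = sum(x[i] ** 3 * y[i] ** 3 for i in range(3, 5))

# iii. x_1^4 plus the sum of x_i^2 / y_i^4 for i = 1, 2
part_iii = x[1] ** 4 + sum(x[i] ** 2 / y[i] ** 4 for i in range(1, 3))

print(round(part_i, 4), round(part_ii, 4), round(part_iii, 4))
```

Because square roots are stored in floating point, the raw results may differ from 9, −4.096 and 16.45 in the last few digits, which mirrors the loss of accuracy mentioned above when a calculator is used unnecessarily.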

(b) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer. (No marks will be awarded without a justification.)
i. Class of cabin on an airliner: ‘first class’, ‘business class’ and ‘economy class’.
ii. The carbon dioxide emissions of an airliner.
iii. The nationalities of airline passengers.
(6 marks)

Reading for this question


This question requires identifying types of variable so reading the relevant section in the
subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion
of a variable and be able to distinguish between discrete and continuous (measurable) data.
In addition to identifying whether a variable is categorical or measurable, candidates should
further distinguish between ordinal and nominal categorical variables.
Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible
values they can take. If these are finite and represent specific entities the variable is
categorical. Otherwise, if these consist of numbers corresponding to measurements, the data
are continuous and the variable is measurable. Such variables may also have measurement
units or can be measured to various decimal places.
i. Categorical, ordinal. Classes of cabin are in a ranked order, in terms of price, comfort etc.
ii. Measurable because carbon dioxide emissions can be quantified in terms of units, such as
tonnes.
iii. Categorical, nominal. Nationalities have no natural ordering.
Weak candidates did not provide justifications for their choices, classified measurable
variables as nominal or categorical, and sometimes answered ordinal when their justifications
pointed to a nominal variable. There were also phrases like ‘It is measurable because
it can be measured’ which were not awarded any marks.

(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. Skewness of a distribution cannot be determined from a boxplot.
ii. For any event A, it is possible for $P(A) + P(A^c) < 1$, where $A^c$ denotes the
complement set.

iii. A standardised normal random variable has a standard deviation equal to its
variance.

iv. A lower level of confidence produces a wider confidence interval.

v. A cross-sectional design collects data over time.


(10 marks)

Reading for this question


This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than technical
proficiency in computations. Part i. refers to Chapter 4 and in particular Section 4.9.2 of the
subject guide, whereas part ii. requires knowledge of basic probability properties which can
be found in Section 5.9. Part iii. is about probability properties of the normal distribution,
see Chapter 6. Part iv. targets concepts related to confidence intervals, the principles of
which are covered in Section 7.6, but also check Sections 7.7–7.10. Finally, part v. focuses on
material of Chapter 11 and more specifically Sections 11.5–11.7.

Approaching the question


Candidates always find this type of question tricky. It requires a brief explanation of the
reason for why a statement is true/false and not just a choice between the two. Some
candidates lost marks for long rambling explanations without a decision as to whether a
statement was actually true or false.

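i. False. Skewness can be determined from the box and whiskers, for example from the position
of the median within the box and the relative lengths of the whiskers.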

ii. False, actually it holds that $P(A) + P(A^c) = 1$.

iii. True. A standardised normal random variable has a variance of 1, and hence a standard
deviation of 1.

iv. False. A lower (higher) level of confidence means a lower (higher) confidence coefficient,
hence a narrower (wider) confidence interval.

v. False. A cross-sectional design collects data at one moment in time (or a longitudinal
design collects data over time).
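
The reasoning in part iv. can be illustrated numerically. The sketch below assumes a large-sample interval of the form x̄ ± z × s/√n; the values s = 10 and n = 100 are hypothetical and are not taken from the examination paper. It shows only that the interval width grows with the confidence level.

```python
from statistics import NormalDist


def ci_half_width(confidence, s, n):
    """Half-width z * s / sqrt(n) of a large-sample confidence interval for a mean."""
    # Two-sided interval: put (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * s / n ** 0.5


s, n = 10.0, 100  # hypothetical sample standard deviation and sample size
widths = {level: 2 * ci_half_width(level, s, n) for level in (0.90, 0.95, 0.99)}

for level, width in widths.items():
    print(f"{level:.0%} interval width: {width:.3f}")
```

The 90% interval is the narrowest and the 99% interval the widest, which is why the statement in part iv. is false.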

(d) A quota sample of total size n = 500 is to be selected. The researcher has
decided to use age group as a quota control. It is known that the composition of
age groups in the population is:

Age groups                 18–29   30–49   50–69   70+
Percentage of population   20%     35%     30%     15%

Determine the quotas to be selected from each age group.


(4 marks)

Reading for this question


This question targets material on non-probability sampling which can be found in Chapter
10 of the subject guide. Particular focus is given to the quota sampling scheme which is
covered in Section 10.7.1, but it is good to familiarise yourself with Section 10.6 and 10.7
(the entire section) as well.

Approaching the question


The quotas from each group would be:

\[ n_{18\text{–}29} = 0.20 \times 500 = 100 \]
\[ n_{30\text{–}49} = 0.35 \times 500 = 175 \]
\[ n_{50\text{–}69} = 0.30 \times 500 = 150 \]
\[ n_{70+} = 0.15 \times 500 = 75. \]
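
The same proportional allocation can be sketched in Python. The dictionary name `population_share` is purely illustrative; the percentages and the sample size are those given in the question.

```python
n = 500  # total quota sample size from the question

# Population composition by age group, as given in the question.
population_share = {"18-29": 0.20, "30-49": 0.35, "50-69": 0.30, "70+": 0.15}

# Each quota is the group's population share multiplied by the sample size.
quotas = {group: round(share * n) for group, share in population_share.items()}

print(quotas)
print("Total:", sum(quotas.values()))
```

Checking that the quotas add up to n = 500 is a quick way to catch arithmetic slips.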

(e) The probability distribution of a random variable Y is given below.


Y = y        2     4     6     8
P(Y = y)     0.1   0.3   0.4   0.2

i. Find the expected value of Y, denoted E(Y).
(2 marks)
ii. Find the standard deviation of Y, denoted $\sigma_Y$.
(3 marks)
iii. Find the probability that Y is within one standard deviation of its expected
value.
(3 marks)
iv. Explain why Y does not have a discrete uniform distribution.
(2 marks)

Reading for this question


This is a question on probability, exploring the concepts of relative frequency, conditional
probability and probability distribution. Reading from Chapter 5 is suggested with a focus
on the sections on these topics. Try activity A5.1 and the exercises on probability trees. For
part iv., and in particular the discrete uniform distribution, check Section 9.8.
Approaching the question
i. We have:
\[ E(Y) = \sum_y y \, p(y) = 2 \times 0.1 + 4 \times 0.3 + 6 \times 0.4 + 8 \times 0.2 = 5.4. \]

ii. We have:
\[ E(Y^2) = \sum_y y^2 \, p(y) = 2^2 \times 0.1 + 4^2 \times 0.3 + 6^2 \times 0.4 + 8^2 \times 0.2 = 32.4 \]
hence $\mathrm{Var}(Y) = E(Y^2) - (E(Y))^2 = 32.4 - (5.4)^2 = 3.24$, so the standard deviation is
$\sqrt{3.24} = 1.8$.
Several candidates here mistook $E(Y^2)$ for the variance or even the standard deviation.
Make sure you can confidently distinguish between these. Note also that an alternative
method to find the variance is through the formula $\sum_y (y - \mu)^2 \, p(y)$, where $\mu = E(Y)$ was
found in part i.
iii. Since:
\[ E(Y) \pm \sigma_Y \Rightarrow 5.4 \pm 1.8 \Rightarrow [3.6, 7.2] \]
this means we require P(Y = 4) + P(Y = 6) = 0.3 + 0.4 = 0.7.
iv. Since the probability masses are not equal for each value of Y, this is not a discrete
uniform distribution.
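
The four parts above can be checked with a short Python sketch. The variable names are illustrative only; the probability distribution is the one given in the question, and both variance formulas mentioned in part ii. are computed so that they can be compared.

```python
import math

# Probability distribution of Y from the question: value -> probability.
pmf = {2: 0.1, 4: 0.3, 6: 0.4, 8: 0.2}

# i. Expected value E(Y).
mean = sum(y * p for y, p in pmf.items())

# ii. Variance via E(Y^2) - (E(Y))^2, then the standard deviation.
second_moment = sum(y ** 2 * p for y, p in pmf.items())
variance = second_moment - mean ** 2
sd = math.sqrt(variance)

# Alternative variance formula: sum of (y - mu)^2 * p(y).
variance_alt = sum((y - mean) ** 2 * p for y, p in pmf.items())

# iii. Probability that Y lies within one standard deviation of E(Y).
within_one_sd = sum(p for y, p in pmf.items() if mean - sd <= y <= mean + sd)

# iv. A discrete uniform distribution would give every value the same probability.
is_uniform = len(set(pmf.values())) == 1

print(round(mean, 4), round(variance, 4), round(sd, 4), round(within_one_sd, 4), is_uniform)
```

Note that `second_moment` (32.4) is neither the variance nor the standard deviation, which is exactly the confusion described in part ii.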

(f) A population is normally distributed with µ = 138 and σ = 21. Given a simple
random sample of size n = 25, determine the probability that the sample mean,
X̄, will be less than 128.
(5 marks)

Reading for this question


This question contains material on sample size determination in relation to the normal
distribution and the distribution of the sample mean. Sample size determination is covered
in Section 7.11 whereas for information on the normal distribution and the sample mean you
can check Sections 6.8–6.10. Also, check Examples 7.4–7.6 for related exercises.
Approaching the question
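We have:
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(138, \frac{(21)^2}{25}\right) = N(138, 17.64). \]
Hence:
\[
\begin{aligned}
P(\bar{X} < 128) &= P\left(Z < \frac{128 - 138}{\sqrt{17.64}}\right) \\
&= P(Z < -2.38) \\
&= 1 - P(Z \le 2.38) \\
&= 1 - 0.9913 \\
&= 0.0087.
\end{aligned}
\]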
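
For readers who want to verify the table look-up, the calculation can be sketched with Python's standard-library `statistics.NormalDist`. The sketch computes the probability twice: once with z rounded to two decimal places, as when using statistical tables, and once without rounding. The variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, n = 138, 21, 25

# Standard error of the sample mean: sigma / sqrt(n) = 21 / 5 = 4.2.
standard_error = sigma / n ** 0.5

# Table-style calculation: round z to two decimal places first.
z = round((128 - mu) / standard_error, 2)
p_table = NormalDist().cdf(z)

# Direct calculation from the sampling distribution of the sample mean.
p_exact = NormalDist(mu, standard_error).cdf(128)

print(z, round(p_table, 4), round(p_exact, 4))
```

The two results differ slightly in the fourth decimal place because the table method rounds z before looking up the probability; either approach follows the same reasoning.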

(g) Two cards are chosen at random from a standard 52-card deck. Consider the
events:

A = first card is the ace of spades

B = second card is the ace of spades.

Suppose the two cards are selected with replacement.

i. Are the events A and B independent? Explain your answer.


(2 marks)
ii. Are the events A and B mutually exclusive? Explain your answer.
(2 marks)
Now suppose the two cards are selected without replacement.

iii. Are the events A and B mutually exclusive? Explain your answer.
(2 marks)
iv. Are the events A and B independent? Explain your answer.
(3 marks)

Reading for this question


This question targets material on basic probability which can be found in Chapter 5 of the
subject guide. Particular focus is given to the total probability and Bayes’ formulae in
Section 5.10. Related exercises to test yourself against these types of questions are Activities
5.2 and 5.3, Learning activity 5.7 and the Sample examination question 5.4.
Approaching the question
i. A and B are independent, since with replacement P(A) = P(B) = 1/52 and
P(A ∩ B) = P(A) P(B) = 1/52 × 1/52 = 1/2,704 (calculation not required).

ii. Since both events can occur simultaneously (by part i., P(A ∩ B) > 0), A and B
are not mutually exclusive.

iii. A and B are mutually exclusive – if A occurs, then without replacement B cannot occur.

iv. A and B are not independent. If A occurs, then B cannot occur, so P(B | A) = 0, but if A
does not occur, then $P(B \mid A^c) = 1/51$.
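
The four answers can be confirmed by brute-force enumeration of the equally likely ordered pairs of cards. This is a sketch for checking the reasoning only; labelling the ace of spades as card 0 is an arbitrary assumption, and the helper `probabilities` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

deck = range(52)
ACE = 0  # arbitrary label standing for the ace of spades


def probabilities(outcomes):
    """Return P(A), P(B) and P(A and B) over equally likely (first, second) pairs."""
    outcomes = list(outcomes)
    total = len(outcomes)
    p_a = Fraction(sum(first == ACE for first, _ in outcomes), total)
    p_b = Fraction(sum(second == ACE for _, second in outcomes), total)
    p_ab = Fraction(sum(first == ACE and second == ACE for first, second in outcomes), total)
    return p_a, p_b, p_ab


# With replacement: all 52 * 52 ordered pairs, repeats allowed.
with_repl = probabilities(product(deck, repeat=2))
# Without replacement: all 52 * 51 ordered pairs of distinct cards.
without_repl = probabilities(permutations(deck, 2))

for label, (p_a, p_b, p_ab) in (("with", with_repl), ("without", without_repl)):
    print(label, "replacement:",
          "independent" if p_ab == p_a * p_b else "not independent", "|",
          "mutually exclusive" if p_ab == 0 else "not mutually exclusive")
```

Independence is checked through P(A ∩ B) = P(A) P(B) and mutual exclusivity through P(A ∩ B) = 0, which are the definitions used in the answers above.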

Section B

Answer two out of the three questions from this section (25 marks each).

Question 2

(a) A random sample of 3,586 students from various areas is taken to compare their
performance in Physics, based on the final examination mark (low: below 40%,
medium: between 40% and 60%, high: above 60%), in relation to the amount of
private tutoring they receive. The results are summarised in the table below.

                        Performance in Physics
Tutoring            Low         Medium       High           Total
No tutoring         46 (11%)    168 (41%)    196 (48%)      410 (100%)
Some tutoring       100 (5%)    572 (31%)    1,148 (63%)    1,820 (100%)
Frequent tutoring   32 (2%)     248 (18%)    1,076 (79%)    1,356 (100%)
Total               178 (5%)    988 (28%)    2,420 (67%)    3,586 (100%)

i. Based on the data in the table, and without doing a significance test, how
would you describe the relationship between receiving private tutoring and
performance in Physics?

ii. Calculate the $\chi^2$ statistic and use it to test for independence between private
tutoring and performance in Physics, using a 10% significance level. What do you
conclude?

iii. Would you conclude that private tutoring improves the performance in
Physics? Briefly justify your answer.

(13 marks)

Reading for this question


This part targets Chapter 8 on contingency tables and chi-squared tests. Note that part i. of
the question does not require any calculations, just understanding and interpreting
contingency tables. Candidates can attempt Activity A8.4 for practice. Part ii. is a
straightforward chi-squared test and the reading is also given in Chapter 8. Look also at
Activity A8.4. Finally, for part iii. the concepts of correlation and causation covered in
Chapter 11 are relevant.

Approaching the question

i. An example of a ‘good’ answer is as follows.


Looking at the percentages, we see some differences between tutoring frequency and
performance in Physics. More specifically, 79% of students who receive frequent tutoring
got a high mark in Physics versus 48% for students who receive no tutoring. Moreover,
the percentage of students who receive frequent tutoring and got a low mark in Physics
is 2% versus 11% for those receiving no tutoring. Hence there may be an association,
although this needs to be investigated further.

ii. Set out the null hypothesis that there is no association between tutoring frequency and
performance in Physics against the alternative, that there is an association. Be careful to
get these the correct way round!
H0 : No association between tutoring frequency and performance in Physics.
vs.
H1 : Association between tutoring frequency and performance in Physics.
Work out the expected values to obtain the table below:

20.3514   112.962   276.687
90.3402   501.439   1,228.22
67.3084   373.6     915.092

The test statistic formula is:
\[ \sum_{i=1}^{3} \sum_{j=1}^{3} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \sim \chi^2_{(r-1)(c-1)} \]

which gives a test statistic value of 187.913. This is a 3 × 3 contingency table so the
degrees of freedom are (3 − 1) × (3 − 1) = 4.
For α = 0.10, the critical value is $\chi^2_{0.10,\,4} = 7.779$, hence we reject H0 at the 10%
significance level. We conclude that there is evidence of an association between receiving
tutoring and performance in Physics.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
iii. Some reference to correlation and causation is expected here. For example, there is an
association but it could be that those who receive private tutoring may come from
families where there is a strong motivation to do well in school and this is the cause of
better performance in Physics.
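
The chi-squared calculation in part ii. can be reproduced with a short Python sketch. Only the observed counts come from the question; the variable names are illustrative, and the critical value 7.779 (10% significance level, 4 degrees of freedom) is hard-coded from statistical tables rather than computed.

```python
# Observed counts: rows are tutoring levels, columns are Low, Medium, High.
observed = [
    [46, 168, 196],     # no tutoring
    [100, 572, 1148],   # some tutoring
    [32, 248, 1076],    # frequent tutoring
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Test statistic: sum over all cells of (O - E)^2 / E.
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_10_PERCENT = 7.779  # tabulated chi-squared value for 4 degrees of freedom
reject_h0 = chi_sq > CRITICAL_10_PERCENT

print(round(chi_sq, 3), df, reject_h0)
```

The test statistic is far above the critical value, so the conclusion does not depend on small rounding differences in the expected counts.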

(b) i. You have been asked to design a nationwide survey in your country to find
out about the use of Twitter by university students. Provide a probability
sampling scheme and a sampling frame that you would like to use. Identify a
potential source of selection bias that may occur and discuss how this issue
can be addressed.
ii. Describe how you would adapt the sampling design of part i. to create a
longitudinal survey.
(12 marks)

Reading for this question


This was a question on basic material on survey designs. Background reading is given in
Chapters 10 and 11 of the subject guide which, along with the recommended reading, should
be looked at carefully. Candidates were expected to have studied and understood the main
important constituents of design in random sampling. It is also a good idea to try the
learning activities of Chapter 10.
Approaching the question
One of the main things to avoid in this part is writing essays without any structure. This
exercise asks for specific things and each one of them requires one or two lines only. If you
are unsure of what these things are, do not write lengthy essays. Doing so gains you no
marks and wastes valuable examination time. If you can identify what is
being asked, keep in mind that the answer should not be long. Note also that in some
cases there is no unique answer to the question.
i. A couple of ‘good’ suggested answers are as follows.
Example answer 1:
• Sampling frame and definition of population: Target is university students, so
university admission records can be used.

• Sampling scheme: State a scheme (any probability random sampling scheme would
do) and provide a justification. For example, if you went for clustering discuss the
area of the country, or if stratified discuss stratification factors: gender, subject of
study etc., and why these schemes would be advantageous.
• Source of selection bias: Selection bias will arise from the omission of students at
universities whose admission records are more difficult to obtain.
• Way to address it: Reset the target population group to match what the sampling
frame is actually providing.
Example answer 2:
• Sampling frame: Note that despite the target group of ‘university students’, it may
be good to also look at people who are not at a university but otherwise similar (say
aged 18–22). If this is the case, you might suggest using an electoral register and
sampling from this list.
• Sampling scheme: State a scheme (any probability random sampling scheme would
do) and provide a justification. For example, if you went for clustering discuss the
area of the country, or if stratified discuss stratification factors: gender, subject of
study etc., and why these schemes would be advantageous.
• Source of selection bias: Selection bias will arise from the omission of those who are
not on electoral registers.
• Way to address it: Reset the target population group to match what the sampling
frame is actually providing.
ii. Again, an indicative answer would contain the following statement.
‘A cohort of people aged 18 will be chosen (both at a university and not) and will be
re-surveyed each year’ and some critical discussion highlighting potential issues such as
participant dropout, or potential advantages such as identifying the year of study where
use of Twitter becomes more prevalent.

Question 3

(a) A farmer would like to investigate the relationship between the obtained yield of
apple trees and the amount of weeds found in their roots. For this reason, ten
apple trees of the same type were randomly selected and the amount of weeds
in their roots (x grams) was recorded together with their yield (y kilograms).

Tree                 #1   #2   #3   #4   #5   #6   #7   #8   #9   #10
Weeds in roots (x)   30   28   32   25   25   24   22   24   35   40
Yield (y)            25   30   27   40   42   41   50   45   30   25

The summary statistics for these data are:
Sum of x data: 285 Sum of the squares of x data: 8,419
Sum of y data: 355 Sum of the squares of y data: 13,349
Sum of the products of x and y data: 9,718
i. Draw a scatter diagram of these data on the graph paper provided. Label the
diagram carefully.
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Using the equation you found in iii., obtain the predicted yield of an apple tree
with 37 grams of weeds in its roots. Do you think this value is
realistic? Justify your answer.
v. Briefly comment on the suitability of the simple linear regression model for
the data of this question.
(13 marks)

Reading for this question


This is a standard regression question and the reading is to be found in Chapter 12 of the
subject guide. Section 12.6 provides details for scatter diagrams and is suitable for part i.
whereas the remaining parts are on correlation and regression which are covered in Sections
12.8–12.10. Section 12.7 is also relevant. Sample examination question 2 of this chapter is
also recommended for practice on questions of this type.
Approaching the question
i. Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes
which give their units in addition. Far too many candidates threw away marks by
neglecting these points and consequently were only given one mark out of the possible
four allocated for this part of the question.

ii. The summary statistics can be substituted into the formula for the correlation (make
sure you know which one it is!) to obtain the value −0.8492. An interpretation of this
value is the following: the data suggest that the greater the amount of weeds in the
roots, the lower the yield of the apple trees. The fact that the value is close to −1
suggests that this is a strong, negative linear relationship.
Many candidates did not mention all three words (strong, negative, linear). Note that all
of these words provide useful information on interpreting the relationship and are hence
required to obtain full marks.
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is:
\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]
and by substituting the summary statistics we get b = −1.3474. The formula for a is:
\[ a = \bar{y} - b\bar{x} \]
and we get a = 73.9005. Hence the regression line can be written as:
\[ \hat{y} = 73.9005 - 1.3474x \quad \text{or} \quad y = 73.9005 - 1.3474x + \varepsilon. \]

It should also be plotted on the scatter diagram.


Many candidates incorrectly reported the regression line as y = 73.9005 − 1.3474x. This
expression is incorrect because it has neither the hat on y nor the error term; one of the two
forms above is required. Also, many candidates did not draw this line on the scatter diagram;
instead they drew an approximate line trying to go around the points but without reference
to the above equation. No marks were awarded in such cases.
iv. In this case one can note in the scatter diagram that the points seem to be ‘scattered’
around a straight line. Hence a linear regression model does seem to be a good model.
According to the model, the predicted yield for x = 37 is 73.9005 − 1.3474 × 37 = 24.047 kilograms.
Many candidates did not provide units here. It is essential to do so in order to obtain full
marks. To assess whether this prediction is realistic, good answers commented critically on
its validity, with statements such as ‘the point is very close to the limits of the
data’ or ‘the assumption does not appear to be linear’.
v. Some discussion is expected here, on the non-linear association implied by the scatter
diagram.
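
The correlation coefficient, the least squares line and the prediction at x = 37 can all be reproduced from the summary statistics with the sketch below. It assumes n = 10, the number of observations listed in the data table (the sums given in the question agree with those ten observations); the variable names are illustrative.

```python
import math

# Summary statistics from the question; n = 10 is the number of observations in the table.
n = 10
sum_x, sum_y = 285, 355
sum_x2, sum_y2, sum_xy = 8419, 13349, 9718

x_bar, y_bar = sum_x / n, sum_y / n

# Corrected sums of squares and cross-products.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2

r = s_xy / math.sqrt(s_xx * s_yy)  # sample correlation coefficient
b = s_xy / s_xx                    # slope of the least squares line
a = y_bar - b * x_bar              # intercept of the least squares line

predicted_yield = a + b * 37       # predicted yield (kg) for 37 grams of weeds

print(round(r, 4), round(b, 4), round(a, 4), round(predicted_yield, 3))
```

Keeping full precision in `b` and `a` until the final step follows the earlier advice on intermediate decimal places; rounding the coefficients first changes the prediction only in the fourth decimal place.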

(b) A company wants to check the quality of its customer service regarding web
chat enquiries. More specifically, the manager wants to compare the waiting
times until each enquiry was answered during the years 2013 and 2012.
Unfortunately, extensive records of the company are not available, and he can
only check a random sample of web chat enquiries within these two years. The
available data, measured in minutes of waiting times, are provided below:

        Sample size   Sample mean   Sample standard deviation
2013    41            5.8           0.5
2012    34            6.1           0.6

i. Use an appropriate hypothesis test to determine whether the mean waiting
times were different between these two years. Test at two appropriate
significance levels, stating clearly the hypotheses, the test statistic and its
distribution under the null hypothesis. Comment on your findings.
ii. State clearly any other assumptions you make. In your view, which one of
them is most likely to be violated? Justify your answer.
iii. Adjust the procedure above to determine whether the mean waiting time in
2013 was less than that of 2012.
(12 marks)

Reading for this question


The first two parts of the question refer to a two-sided hypothesis test comparing two
population means, whereas the third part of this exercise refers to a one-sided hypothesis
test. While the entire chapter on hypothesis testing is relevant, one can focus on Section
8.15. In terms of exercises one can check Example 8.6 and Learning activity 8.7.
Approaching the question
i. Let µA denote the mean waiting time during 2012 and µB the mean waiting time during
2013. The wording ‘were different between these two years’ implies a two-sided test,
hence the hypotheses can be written as:

H0 : µA = µB vs. H1 : µA ≠ µB .

The test statistic formula, depending on the assumption on variances, is:


$$\frac{\bar{x} - \bar{y}}{\sqrt{s_A^2/n_A + s_B^2/n_B}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\,(1/n_A + 1/n_B)}}.$$

The test statistic value is 2.3225 (or 2.3624 if pooled variance used).


The sample size is quite large, hence the standard normal distribution can be used due to
the central limit theorem. Nevertheless the use of a t60 distribution is also reasonable.
The critical values at the 5% significance level are ±1.96 (2.00 if a t60 distribution is
used), hence we reject the null hypothesis. If we take a (smaller) α such as a 1%
significance level, the critical values are ±2.576 (2.66 if a t60 distribution is used), so we
do not reject H0 . We conclude that there is moderate evidence of a difference in the
mean waiting times between the two years.

ii. The assumptions for i. were:


• assumption about equal variances
• assumption about whether nA + nB is ‘large’ so that the normality assumption is
satisfied
• assumption about independent samples.

Some candidates stated assumptions in this part that were not made in part i. Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
Regarding the assumption which is most likely to be violated the following statement is
an indicative answer: ‘Independent samples assumption is most likely to be violated
since the sampling was done in two successive years’.

iii. It is important to identify the correct hypotheses here, i.e. we have:

H0 : µA = µB vs. H1 : µA > µB

Also make sure to get the correct z-values: ≈ 1.645 for a 5% significance level, and ≈ 2.326
for a 1% significance level (1.671 and 2.390, respectively, if a t60 distribution is used).
Based on these, the result is borderline highly significant (moderately significant will also
do), i.e. there is evidence that the mean waiting time in 2013 was less than that of 2012.
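
The calculations for parts i. and iii. can be reproduced as follows. This is a sketch under stated assumptions rather than a model answer: the standard normal distribution is used for the critical values (as the commentary allows for large samples), and all variable and function names are illustrative.

```python
import math
from statistics import NormalDist

# Summary statistics from the question (A = 2012, B = 2013).
n_a, mean_a, sd_a = 34, 6.1, 0.6
n_b, mean_b, sd_b = 41, 5.8, 0.5

diff = mean_a - mean_b

# Test statistic without assuming equal variances.
se_unpooled = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
z_unpooled = diff / se_unpooled

# Test statistic assuming equal variances (pooled variance estimate).
pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
se_pooled = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
z_pooled = diff / se_pooled

std_normal = NormalDist()  # assumption: normal approximation via the CLT


def two_sided_critical(alpha: float) -> float:
    """Upper critical value for a two-sided test at level alpha."""
    return std_normal.inv_cdf(1 - alpha / 2)


def one_sided_critical(alpha: float) -> float:
    """Critical value for an upper-tailed test at level alpha."""
    return std_normal.inv_cdf(1 - alpha)


print(f"unpooled: {z_unpooled:.4f}, pooled: {z_pooled:.4f}")
for alpha in (0.05, 0.01):
    print(
        f"alpha = {alpha}: "
        f"two-sided reject = {abs(z_unpooled) > two_sided_critical(alpha)}, "
        f"one-sided reject = {z_unpooled > one_sided_critical(alpha)}"
    )
```

With these figures the one-sided test at the 1% level is very close to the boundary, which is why the result is described as borderline.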

Question 4

(a) i. Carefully construct a boxplot to display the following annual before tax
earnings for the employees of a company, measured in £000s:

35, 26, 22, 24, 21, 57, 36, 35, 29, 47, 30 and 36.

ii. Based on the shape of the box plot you have drawn, describe the distribution
of the data.

iii. Name two other types of graphical displays that would be suitable to
represent the data. Briefly explain your choices.

iv. Provide a hypothesis test statistic for the hypothesis that the mean income is
equal to £33,000. Comment on the suitability of this test to the data of this
question.
(13 marks)

Reading for this question


Chapter 4 of the subject guide provides all the relevant material for this question. More
specifically, reading on boxplots can be found in Section 4.9.2, but the entire Sections 4.7,
4.8 and 4.9 are highly relevant. Part iv. requires some knowledge on hypothesis testing
covered in Chapter 8.


Approaching the question


i. A boxplot compatible with what the examiners were expecting to see is shown below.

Note that in order to draw it you will need to calculate the median and lower/upper
quartiles. These in turn will allow you to determine the outlier limits, the
extreme outlier limits and the whiskers. These calculations are summarised below.
• Quartiles: 25, 32.5, 36, hence IQR = 36 − 25 = 11 (should be consistent with the Q1
and Q3 values).
• Outlier limits: lower is Q1 − 1.5 × IQR = 8.5, upper is Q3 + 1.5 × IQR = 52.5.
• Outlier: 57.
Marks were also awarded for accurately drawing the figure. Note that the numbers
added in the figure above were for illustration purposes.
ii. The distribution of the data appears to be positively/right-skewed. This is also
supported by the fact that the mean is larger than the median.
iii. The variable income is measurable, hence these graphs are suitable for displaying the
distribution of such variables:
∗ histogram
∗ stem-and-leaf diagram
∗ dot plot.

iv. A suitable test statistic would be (x̄ − 33)/(s/√n), where s/√n is the estimated
standard error of the sample mean, which assumes the data to be normally
distributed. Nevertheless, the assumption of normality here is questionable due to the
presence of skewness. With only n = 12 observations, the t distribution is a better choice
than the standard normal, although it is still symmetric. Taking a log-transform may help.
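
The summary values used in parts i., ii. and iv. can be verified with the sketch below. It assumes that the quartiles are taken as the medians of the lower and upper halves of the ordered data (this reproduces the values 25 and 36 quoted above), that the whiskers end at the most extreme observations inside the outlier limits, and that the denominator of the test statistic is the estimated standard error s/√n. The function and variable names are illustrative.

```python
import math
from statistics import mean, median, stdev

earnings = [35, 26, 22, 24, 21, 57, 36, 35, 29, 47, 30, 36]  # in £000s


def quartiles_by_halves(values):
    """Return (Q1, median, Q3), with Q1 and Q3 as medians of each half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered), median(ordered[-half:])


q1, q2, q3 = quartiles_by_halves(earnings)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in earnings if x < lower_limit or x > upper_limit]
inside = [x for x in earnings if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)  # assumed convention

# Mean above the median is consistent with positive skewness.
sample_mean = mean(earnings)

# One-sample test statistic for H0: mean = 33 (i.e. £33,000).
t_stat = (sample_mean - 33) / (stdev(earnings) / math.sqrt(len(earnings)))

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"outlier limits = ({lower_limit}, {upper_limit}), outliers = {outliers}")
print(f"whiskers end at {whisker_low} and {whisker_high}")
print(f"mean = {sample_mean:.3f}, t statistic = {t_stat:.4f}")
```

Different quartile conventions can give slightly different values, so what matters in the examination is that the IQR and the outlier limits are consistent with the Q1 and Q3 values you report.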

(b) A new fitness programme is devised for people who would like to lose weight.
Each participant’s weight in kg was measured before and after the fitness
programme to see if it is effective in reducing their weights. The following data
were obtained.
Participant   #1   #2   #3   #4   #5   #6   #7   #8
Before        86   65   74   55   93   52   66   67
After         81   62   71   49   83   51   61   63

i. Carry out an appropriate hypothesis test to determine whether the fitness
programme is effective in reducing weight. State the test hypotheses, and
specify your test statistic and its distribution under the null hypothesis.
Comment on your findings.


ii. State any assumptions you made in part i.


iii. State an alternative hypothesis test to determine whether the fitness
programme is effective in reducing weight. Justify why you did not use this
hypothesis test in part i.
iv. Compute a 95% confidence interval for the difference in the means.
v. On the basis of the data alone, would you consider trying this fitness
programme if you wanted to lose weight? Provide an explanation with your
answer.
(12 marks)

Reading for this question


While the wording of the exercise may appear complicated, it in fact only refers to
comparing two population means in the case of two dependent (paired) samples. Look up
Section 8.16.4, although all of Section 8.16 and in fact the whole of Chapter 8 are relevant. In terms
of exercises, check Example 8.8 and Learning activities 8.8 and 8.9. Part iv. requires using
confidence intervals, still for comparing two population means in the case of two dependent
(paired) samples. This is covered in Chapter 7 and, in particular, Section 7.13.4. In terms of
exercises, check Example 7.9 and Learning activity 7.9.
Approaching the question
i. We test:
H0 : µbefore = µafter vs. H1 : µbefore > µafter .
The differences (‘before − after’) are:

5, 3, 3, 6, 10, 1, 5 and 4.

Note that it is perfectly acceptable to compute ‘after − before’; the signs are then
reversed (i.e. all negative in this instance). We have:

x̄d = 4.625 and sd = 2.6693

hence the test statistic value is:


$$\frac{\bar{x}_d - 0}{s_d/\sqrt{n}} = \frac{4.625 - 0}{2.6693/\sqrt{8}} = 4.901.$$
At the 5% significance level the critical value for this one-sided test is t0.05, 7 = 1.895
(or −1.895 if ‘after − before’ was computed). Hence we reject H0 at the 5% significance level.
Testing at the 1% significance level, the critical value is t0.01, 7 = 2.998. We again reject
H0 and conclude that there is strong evidence that the fitness programme is effective in
reducing weight.
ii. Two assumptions made are given below:
• Differences are normally distributed.
• Pairs of observations are independent (a weaker condition which suffices is that the
differences are independent, but this is unlikely if observations are not).
iii. An independent samples t test is an alternative. However, this ignores the dependence
between the two measurements taken on each individual, so it is not the most appropriate test.
iv. The working for a 95% confidence interval gives the endpoints (2.393, 6.857). The
correct t-value to be used here is t0.025, 7 = 2.365.
Note: Make sure to report the above as an interval.
v. The evidence in the data suggests that the fitness programme is effective in reducing
weight. This can be seen, for example, from the 95% confidence interval as well as the
paired samples t test.
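
The paired-sample calculations can be checked with the following sketch. The critical values are entered by hand from t tables with 7 degrees of freedom (2.998 for the upper 1% point and 2.365 for the upper 2.5% point; the latter is the value that reproduces the interval endpoints quoted in part iv.), and all names are illustrative.

```python
import math
from statistics import mean, stdev

before = [86, 65, 74, 55, 93, 52, 66, 67]
after = [81, 62, 71, 49, 83, 51, 61, 63]

# Work with the differences 'before - after' for each participant.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

mean_d = mean(diffs)
sd_d = stdev(diffs)  # sample standard deviation (divisor n - 1)
se_d = sd_d / math.sqrt(n)
t_stat = (mean_d - 0) / se_d

# Tabulated values for 7 degrees of freedom, entered by hand.
T_UPPER_1PCT_DF7 = 2.998   # one-sided test at the 1% level
T_UPPER_2_5PCT_DF7 = 2.365  # two-sided 95% confidence interval

reject_at_1pct = t_stat > T_UPPER_1PCT_DF7
ci_95 = (
    mean_d - T_UPPER_2_5PCT_DF7 * se_d,
    mean_d + T_UPPER_2_5PCT_DF7 * se_d,
)

print(f"differences: {diffs}")
print(f"mean = {mean_d}, sd = {sd_d:.4f}, t = {t_stat:.3f}")
print(f"reject H0 at 1%: {reject_at_1pct}")
print(f"95% CI: ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

Rejecting H0 at the 1% level implies rejection at the 5% level as well, and the confidence interval lying entirely above zero points to the same conclusion.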
