
Examiners’ commentaries 2014


ST104a Statistics 1: Preliminary examination

Important note

This commentary reflects the examination and assessment arrangements for this course in the
academic year 2013–14. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).

A change that took place from 2011–12 onwards is the presence of a formula sheet. The purpose of
this change is to encourage candidates to devote more time to understanding the key concepts of the
syllabus rather than memorising a large number of formulae. Nevertheless, candidates should not rely
on this formula sheet entirely but only use it for verification. The formula sheet is available on the
virtual learning environment (VLE).

Information about the subject guide

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2011).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refers to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.

Comments on specific questions

Candidates should answer THREE of the following FOUR questions: QUESTION 1 of Section
A (50 marks) and TWO questions from Section B (25 marks each). Candidates are strongly
advised to divide their time accordingly.

Section A

Answer all parts of Question 1 (50 marks in total).

Question 1

(a) Find the mean, median and mode of the following dataset:
3, 6, 3, 5, 8, 11.

(3 marks)
Reading for this question
Measures of location are covered in Section 3.8 of the subject guide.
Approaching the question
Mean = 6, median = 5.5, mode = 3. Remember to order the observations first when finding
the median.
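
As a quick check of the figures above, the three measures of location can be computed with Python's standard `statistics` module. This is only an illustrative sketch; the variable names are ours and are not part of the examination.

```python
import statistics

# The six observations from part (a).
data = [3, 6, 3, 5, 8, 11]

mean = statistics.mean(data)      # sum of the values divided by how many there are
median = statistics.median(data)  # orders the data, then averages the two middle values
mode = statistics.mode(data)      # the most frequently occurring value

print(mean, median, mode)
```

With an even number of observations the median is the average of the two middle ordered values, here 5 and 6.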


(b) State whether the following statements are possible or not possible. Give a brief
explanation. (Note that no marks will be awarded for a simple ‘possible’/‘not
possible’ reply.)
i. Random sampling gives unbiased estimates for parameters.
ii. A median can be equal to the upper quartile.
iii. A negative chi-squared value shows that there is little association between
the variables tested.
iv. We use the Student’s t-distribution in sampling if the population size is small.
(8 marks)
Reading for this question
For (i.) see Section 9.7, for (ii.) see Sections 3.8 and 3.9; for (iii.) see Section 8.7, and for
(iv.) see Section 6.7 of the subject guide.
Approaching the question
i. Possible. No selection bias with random sampling (this is not the same thing as sampling
error).
ii. Possible. The easiest thing is to try to imagine a situation where the median and the third quartile
would be the same. If all the observations are the same, or even if only the observations
between Q2 and Q3 are the same, then Q2 = Q3 . For example, if the data are 2, 2, 2, 2,
2, then the median and upper quartile are the same, i.e. 2.
iii. Not possible. $\chi^2 \geq 0$: the statistic is a sum of squared terms, so it can never be negative. So: ‘It is not possible that a negative
chi-squared would show little association because there is no such thing as a negative
chi-squared!’
iv. Possible. Provided $\sigma^2$ is unknown, we use Student’s t, otherwise we use standard normal.

(c) The random variable X has the following probability distribution:

i            1      2      3      4
X = x       −1      0      1      2
P(X = x)    0.2    0.3     y     0.1

Calculate:
i. $y$
ii. $E(X)$
iii. $\sum_{i=2}^{3} x_i \, p(x_i)$
iv. $\sum_{i=1}^{4} x_i^2$.
(7 marks)
Reading for this question
See Sections 5.6 and 5.7 of the subject guide.
Approaching the question
i. We know there are only four possible values listed, so the total probability must sum to
1. Hence
$y = 1 - (0.2 + 0.3 + 0.1) = 0.4.$
ii. $E(X) = \sum_{i=1}^{4} x_i \, p(x_i) = -0.2 + 0 + 0.4 + 0.2 = 0.4.$

iii. $\sum_{i=2}^{3} x_i \, p(x_i) = (0 \times 0.3) + (1 \times 0.4) = 0.4.$


iv. $\sum_{i=1}^{4} x_i^2 = (-1)^2 + 0^2 + 1^2 + 2^2 = 6.$
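
The four answers to part (c) can be reproduced with a few lines of Python. This is a minimal sketch rather than a required method; the list names `xs` and `ps` are our own, and the values of X are taken to be −1, 0, 1, 2 as in the worked solution.

```python
# Values of X and their probabilities; y is the unknown third probability.
xs = [-1, 0, 1, 2]
y = 1 - (0.2 + 0.3 + 0.1)          # probabilities must sum to 1
ps = [0.2, 0.3, y, 0.1]

# (ii) E(X): sum of x_i * p(x_i) over all four values.
expected_value = sum(x * p for x, p in zip(xs, ps))

# (iii) The same sum restricted to i = 2 and i = 3 (list positions 1 and 2).
partial_sum = sum(x * p for x, p in zip(xs[1:3], ps[1:3]))

# (iv) Sum of the squared values of X (no probabilities involved).
sum_of_squares = sum(x ** 2 for x in xs)

print(round(y, 1), round(expected_value, 1), round(partial_sum, 1), sum_of_squares)
```

Note that part (iv) is a plain sum of squares and not $E(X^2)$, a distinction that is easy to miss.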

(d) i. Researchers at your college tell you that they have been looking at the
connection between student hours spent in the library and computing
facilities and their examination results. Their report shows a correlation
coefficient of +0.9 between the hours spent per week and the examination
marks. What statistical conclusions might you draw from this?
ii. The researchers come back to you and say that they made a mistake with
their calculations, the correlation coefficient is actually +0.4. How would
that alter your conclusions?
iii. You decide to carry out a regression to assess the possible strength of
contributing factors to examination success. Name two other variables apart
from time spent in library and computing facilities which you could include
in your regression instead.
(6 marks)
Reading for this question
See Section 11.8 of the subject guide.
Approaching the question
i. r = 0.9 implies a strong, positive, linear relationship between hours and results. Explanation
also required either in terms of correlation or causation.
ii. r = 0.4 means weak(er)/moderate evidence of a linear relationship.
iii. Any two sensible suggestions (continuous variables) accepted. Possible examples: IQ,
amount of leisure time, etc.

(e) Define each of the following and give two criteria which would make you prefer
one method to the other in particular circumstances:
i. A random sample.
ii. A quota sample.
A tourist agency has a list of clients who have used their services in the last few
years and their telephone numbers, addresses and emails. They want you to
find out which factors have contributed to good holiday experiences.
iii. Which of the two methods above would you use? Explain briefly.
(7 marks)
Reading for this question
See Sections 9.7 and 9.11 of the subject guide.
Approaching the question
i. Random sample: A known (not necessarily equal) non-zero probability of each unit
being selected.
ii. Quota sample: A form of non-probability/non-random sampling, i.e. the probability of
an individual’s selection is not known.
Use quota if (i.) no list (ii.) need quick results (iii.) need cheap data (iv.) accuracy not
important. Use random if (i.) list (ii.) speed not important (iii.) cost not important (iv.)
accuracy is important. Any two criteria.
iii. Random sample due to availability of list/sampling frame (it would be foolish not to
make use of this when provided). This will lead to more accurate/precise inference.
Emails or telephone would be easy if you are in a hurry.


(f ) i. Two fair dice are thrown and you are told that the sum of the upturned dice
is at least nine. What is the probability that neither of the upturned faces is
a four?
ii. In a small village with an adult population of 20, only four adults watch
television sports events on a pay-per-view basis. A researcher asks two adults
selected at random, one after the other, whether they watch sport this way.
What is the probability that the second person asked does?
Rachel is going to a party. There is a 40 per cent chance Bill will go. If Bill
does not go, there is an 80 per cent chance she will enjoy herself. If Bill does go,
there is only a 10 per cent chance she will enjoy herself.
iii. What is the probability that Rachel will enjoy the party?
iv. At the same party, suppose you know Rachel did enjoy herself. What is the
probability that Bill was not present?
(8 marks)
Reading for this question
See Chapter 4 of the subject guide.
Approaching the question
i. There are 10 equally likely pairs such that the sum is at least 9; six have no four, hence the probability that neither face
is a four is 0.6.
ii. Using obvious notation:
$$P(W_1 \cap W_2) + P(W_1^c \cap W_2) = \frac{4}{20} \times \frac{3}{19} + \frac{16}{20} \times \frac{4}{19} = 0.2.$$
iii. $P(E) = P(E \mid B)P(B) + P(E \mid B^c)P(B^c) = 0.1 \times 0.4 + 0.8 \times 0.6 = 0.52.$
iv.
$$P(B^c \mid E) = \frac{P(E \mid B^c)P(B^c)}{P(E)} = \frac{0.8 \times 0.6}{0.52} = 0.923.$$
For parts (iii.) and (iv.) it is also acceptable to use a probability tree (tree diagram) such as:
[Figure: probability tree diagram]
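
The four probabilities in part (f) can be verified by brute-force enumeration and exact fractions. The sketch below uses only Python's standard library; all variable names are our own and are not part of the examination.

```python
from fractions import Fraction
from itertools import product

# (i) Condition on the sum being at least nine, then count pairs without a four.
pairs = [(a, b) for a, b in product(range(1, 7), repeat=2) if a + b >= 9]
no_four = [pair for pair in pairs if 4 not in pair]
p_no_four = Fraction(len(no_four), len(pairs))

# (ii) Total probability: the second adult watches, whether or not the first does.
p_second_watches = (Fraction(4, 20) * Fraction(3, 19)
                    + Fraction(16, 20) * Fraction(4, 19))

# (iii) Total probability that Rachel enjoys the party.
p_bill = Fraction(4, 10)
p_enjoy_given_bill = Fraction(1, 10)
p_enjoy_given_no_bill = Fraction(8, 10)
p_enjoy = p_enjoy_given_bill * p_bill + p_enjoy_given_no_bill * (1 - p_bill)

# (iv) Bayes' theorem: probability Bill was absent given that Rachel enjoyed herself.
p_no_bill_given_enjoy = p_enjoy_given_no_bill * (1 - p_bill) / p_enjoy

print(p_no_four, p_second_watches, p_enjoy, round(float(p_no_bill_given_enjoy), 3))
```

Working in fractions avoids rounding error, so the results can be compared exactly with 0.6, 0.2, 0.52 and 12/13 ≈ 0.923.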

(g) A UK government agency carries out a large-scale random survey of public
attitudes towards the recession. 132 of the 600 workers surveyed indicated they
were worried about losing their job. Newspaper reports claim that 25% of workers
fear losing their job. Is such a high percentage claim justified? State and carry
out an appropriate hypothesis test at two significance levels, and explain your
results.


(8 marks)
Reading for this question
See Section 7.14 of the subject guide.
Approaching the question
We test $H_0: \pi = 0.25$ vs. $H_1: \pi < 0.25$.
The sample proportion is $p = 132/600 = 0.22$, and we obtain a test statistic value of:

$$\frac{0.22 - 0.25}{\sqrt{(0.25 \times 0.75)/600}} = -1.697.$$

For $\alpha = 0.05$, the lower-tail critical value is $-1.645$. Since $-1.697 < -1.645$, we reject $H_0$ at
the 5% significance level. Choosing a second (lower) $\alpha$, say 0.01, gives a critical value of
$-2.326$, hence we do not reject $H_0$ at the 1% significance level. Therefore, we conclude that
the test is moderately significant as there is some evidence against the claim.
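
The test in part (g) can be sketched in Python as follows. The critical values are the standard normal table values quoted in the solution (with their negative signs, since this is a lower-tail test); the variable names are our own.

```python
import math

n = 600          # workers surveyed
worried = 132    # number worried about losing their job
pi_0 = 0.25      # proportion claimed under H0

p_hat = worried / n
# The standard error uses the hypothesised proportion, as the test is carried out under H0.
standard_error = math.sqrt(pi_0 * (1 - pi_0) / n)
z = (p_hat - pi_0) / standard_error

# Lower-tail critical values from standard normal tables.
critical_values = {0.05: -1.645, 0.01: -2.326}
reject = {alpha: z < value for alpha, value in critical_values.items()}

print(round(z, 3), reject)
```

Rejecting at 5% but not at 1% is what the commentary describes as a moderately significant result.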

(h) A variable has a unimodal distribution with a mean of 50 and a median of 30. Is
the distribution skewed to the left, to the right, symmetric, or is there not
enough information to decide? Explain your answer briefly and give a rough
sketch of the distribution.
(3 marks)
Reading for this question
See Section 3.8 of the subject guide.
Approaching the question
This is a positively-skewed (right-skewed) distribution because the mean > median. An
appropriate sketch might be as follows:
[Figure: sketch of a positively-skewed distribution]

Section B

Answer two questions from this section (25 marks each).

Question 2

(a) The monthly losses (x, in $000s) of a random sample of thirty managed funds at
the height of the 2008 ‘credit crunch’ are shown below:


53 23 30 72 55 49
65 36 40 51 48 43
35 39 42 44 56 31
50 54 64 57 53 42
52 63 49 33 47 44
i. Draw and label an appropriate histogram using exactly five classes on the
graph paper provided.
ii. Calculate the mean, median and range.
iii. Give the modal group.
iv. Comment on the data given the information you have calculated.
(12 marks)
Reading for this question
See Sections 3.7, 3.8 and 3.9 of the subject guide.
Approaching the question
i.
Calculating frequency density

Class interval   Interval width   Frequency   Frequency density
(20, 30]         10               2           0.2
(30, 40]         10               6           0.6
(40, 50]         10               10          1.0
(50, 60]         10               8           0.8
(60, 80]         20               4           0.2

[Figure: Histogram of monthly losses of managed funds. x-axis: Monthly losses in $000s (20 to 80); y-axis: Frequency density (0 to 1).]

Note that marks are awarded for:
• Title.
• ‘Frequency density’ on y-axis.
• Suitable x-axis label.
• ‘5 classes’ adhered to.
• Axis units.
• Accuracy.
ii. Mean = $47,333, median = $48,500, range = $49,000.
iii. Modal group is 40–50 group.
iv. Mean < median, hence (slightly) negatively (left) skewed.
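
The summary statistics and the frequency density table for part (a) can be reproduced with the sketch below. It assumes the same five classes as the solution, with each class including its upper boundary; the variable names are our own.

```python
import statistics

# Monthly losses in $000s, read row by row from the question's table.
losses = [
    53, 23, 30, 72, 55, 49,
    65, 36, 40, 51, 48, 43,
    35, 39, 42, 44, 56, 31,
    50, 54, 64, 57, 53, 42,
    52, 63, 49, 33, 47, 44,
]

mean_loss = statistics.mean(losses)
median_loss = statistics.median(losses)
loss_range = max(losses) - min(losses)

# Boundaries of the five classes (lower, upper]; the last class is twice as wide.
edges = [20, 30, 40, 50, 60, 80]
frequencies = []
densities = []
for lower, upper in zip(edges, edges[1:]):
    count = sum(1 for value in losses if lower < value <= upper)
    frequencies.append(count)
    # Frequency density = frequency / class width, so unequal widths are comparable.
    densities.append(count / (upper - lower))

print(round(mean_loss, 3), median_loss, loss_range)
print(frequencies, densities)
```

Because the final class is 20 units wide, its bar height is 0.2 rather than 0.4, which is why the y-axis must show frequency density and not raw frequency.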


(b) A pet food supplier is studying the difference between two of its stores. It is
particularly interested in the time it takes before customers receive the
products they have ordered. Using standard notation, the data of delivery times
from the two stores are as follows:
        Store A      Store B
x̄       34.3 days    38.6 days
s       2.4 days     3.1 days
n       41           31
i. Use an appropriate hypothesis test to see if there is a difference in the
average delivery times for the two stores. Test at two appropriate
significance levels and comment on your findings.
ii. Give any assumptions you have made.
iii. Give a 98% confidence interval for the mean delivery time for Store A.
(13 marks)

Reading for this question

For (i.) and (ii.) see Section 7.16, and for (iii.) see Section 6.9 of the subject guide.

Approaching the question

i. We test $H_0: \mu_A = \mu_B$ vs. $H_1: \mu_A \neq \mu_B$. The test statistic is (either accepted, depending on the assumption made in (ii.)):
$$\frac{\bar{x} - \bar{y}}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}}.$$

The test statistic value is 6.41 (or 6.64 if pooled variance used).
For $\alpha = 0.05$, the critical values are $\pm 1.96$ ($\pm 2.00$ if $t_{60}$ used). Since 1.96 < 6.41, we reject
$H_0$. Choosing a smaller $\alpha$, say 0.01, gives critical values of $\pm 2.576$, hence we still reject $H_0$.
Therefore the result is highly significant; there is strong evidence of a difference in delivery
times.
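
Both versions of the test statistic can be checked with the sketch below. It reports the magnitude of the statistic, since the test is two-sided and the sign depends only on the order of subtraction (Store A minus Store B gives a negative value). The variable names are our own.

```python
import math

# Summary data from the question.
mean_a, sd_a, n_a = 34.3, 2.4, 41
mean_b, sd_b, n_b = 38.6, 3.1, 31

difference = mean_a - mean_b

# Version 1: separate (unpooled) variances.
se_unpooled = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
stat_unpooled = abs(difference) / se_unpooled

# Version 2: pooled variance, which assumes equal population variances.
pooled_variance = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
se_pooled = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
stat_pooled = abs(difference) / se_pooled

# Two-sided critical values from standard normal tables, as quoted in the solution.
reject_5_percent = stat_unpooled > 1.96
reject_1_percent = stat_unpooled > 2.576

print(round(stat_unpooled, 2), round(stat_pooled, 2), reject_5_percent, reject_1_percent)
```

Either statistic is far beyond both critical values, so the conclusion does not depend on which variance assumption is made.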
ii. Any two of:
• Assumption about whether $\sigma_A^2 = \sigma_B^2$.
• Assumption about whether $n_A + n_B - 2$ is ‘large’, hence t v. z.
• Assumption about independent samples.
• Assumption about the populations being normally distributed.
iii. The confidence interval structure is:
$$\bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}}.$$

We have 40 degrees of freedom, leading to a t value of 2.423. Reporting as a confidence
interval gives (33.38, 35.22).

Question 3

(a) A random sample of 21 female students is chosen from students at higher
education establishments in a particular area of a country, and it is found that
their mean height is 165 centimetres with a sample variance of 81.
i. Assuming that the distribution of the heights of the students may be
regarded as normally distributed, calculate a 98% confidence interval for the
mean height of female students.


ii. You are asked to obtain a 98% confidence interval for the mean height of
width 3 centimetres. What sample size would be needed in order to achieve
that degree of accuracy?
iii. Suppose that a sample of 15 had been obtained from a single student hostel,
in which there were a large number of female students (still with a
distribution of heights which was well-approximated by the normal
distribution). The mean height is found to be 160 centimetres. Calculate a
99% confidence interval for the mean height of female students in the hostel.
How do their heights compare with the group in part ‘i.’ ?
(13 marks)
Reading for this question
See Sections 6.9 and 6.11 of the subject guide.
Approaching the question
i. The confidence interval formula is:
$$\bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}}.$$

There are $n - 1 = 20$ degrees of freedom, leading to a t value of $t_{0.01,\,20} = 2.528$. The
computed confidence interval is therefore:
$$165 \pm 2.528 \times \frac{9}{\sqrt{21}} = (160.04, 169.96).$$
Note the written form. Always give the lower value first!
ii. We seek $n$ such that $2.326 \times 9/\sqrt{n} = 1.5$ (for a confidence interval width of 3, i.e.
estimation to within 1.5 units). We solve for $n = 194.77 \rightarrow 195$ (rounded up, as $n$ must
be an integer).
Note here we anticipate a large sample size (given $n = 21$ gives a confidence interval of
width $169.96 - 160.04 = 9.92$) and so we automatically use a standard normal
approximation for the $t_{n-1}$ value.
iii. The confidence interval formula is:
$$\bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}}.$$

The correct t value is $t_{0.005,\,14} = 2.977$. The computed confidence interval is therefore:

$$160 \pm 2.977 \times \frac{9}{\sqrt{15}} = (153.08, 166.92).$$

Comment: The confidence intervals are not directly comparable since (i.) is 98% and
(iii.) is 99%. Other things equal, a 99% confidence interval is wider than one at 98%.
Other things equal, a smaller sample size would lead to a wider confidence interval too.
Therefore the confidence interval in (iii.) is wider for two reasons: the higher level of
confidence and the smaller sample size.
Although there is some overlap of the confidence intervals, suggesting that the mean
heights may be the same, a formal hypothesis test should be performed.
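
The two intervals and the sample size calculation in part (a) can be checked with the sketch below. The t and z values are the table values quoted in the solution, and the helper `t_interval` is a hypothetical name introduced here for illustration only.

```python
import math

def t_interval(mean, sd, n, t_value):
    """Return (lower, upper) for mean ± t_value * sd / sqrt(n)."""
    margin = t_value * sd / math.sqrt(n)
    return (mean - margin, mean + margin)

sd = 9  # square root of the sample variance of 81

# (i) 98% interval, 20 degrees of freedom, t = 2.528.
ci_students = t_interval(165, sd, 21, 2.528)

# (iii) 99% interval, 14 degrees of freedom, t = 2.977 (same standard deviation as in the solution).
ci_hostel = t_interval(160, sd, 15, 2.977)

# (ii) Solve 2.326 * sd / sqrt(n) = 1.5 for n, then round up to a whole number.
z_value = 2.326
half_width = 1.5
n_exact = (z_value * sd / half_width) ** 2
n_required = math.ceil(n_exact)

print([round(v, 2) for v in ci_students])
print([round(v, 2) for v in ci_hostel])
print(round(n_exact, 2), n_required)
```

Rounding up in part (ii) is deliberate: rounding down would give an interval slightly wider than the 3 centimetres requested.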

(b) You have just started work at your country’s education ministry and have been
asked to plan a survey of student welfare and living conditions in your country.
You have been asked to look at post-school, 18–25 year-old, students, and are
provided with details of the 10 institutions you are expected to cover. You have
been told that you are expected to take a random sample of students.
Outline your design giving at least two stratification factors, an indication of
any clustering you might use, and an example of three questions you might ask
and the way in which you might do so. Your boss is expecting you to come up
with a design which will ensure timeliness and accuracy, but is prepared to
argue for more resources than budgeted, should that become necessary.
(12 marks)
Reading for this question
See Chapter 9 of the subject guide.
Approaching the question
Some points to note:
• You are asked to take a random sample. It is to be accurate and on time. Resources are
not an issue.
• You have information about 10 institutions with 18–25 year old students from which to
sample.
• You are asked for stratification and clustering factors, possible questions.
• Be careful. There are no marks if you give a quota sample design here! You can then
only get the marks allocated for the three questions asked for!
• Main points in sampling could be that you stratify the institutions according to the
employment rates in the areas for the institutions (high/medium/low) and the
proportions of students getting jobs after finishing their courses (high/medium/low) or
proportions of students in technical courses.
• If you are aiming to interview students, you might cluster them into class groups.
Examples of questions you might ask could be:
• Background questions: What accommodation do you have? Do you live at home, share
with other students, in a hall or hostel? Are you male or female?
• Resources questions: How much do you spend on food? Do you cook your own food?
When did you last buy clothes for yourself? Do your parents give you money to help?
Do you have a job? Full or part-time?

Question 4

(a) A leading global retailer is studying the relationship between the area available
for sales in its shops (x values in metres squared) and the weekly amount of
sales in each shop (y values in thousands of pounds). The figures for a random
sample of 14 outlets are recorded, and are shown in the following table:
Area (x) in m²   Sales (y) in £000s   Area (x) in m²   Sales (y) in £000s
1.9              3.6                  1.1              3.3
1.3              4.1                  3.7              6.1
2.8              3.7                  1.4              4.8
5.8              9.6                  5.1              10.1
2.9              3.9                  4.5              9.9
1.1              3.6                  6.5              10.0
1.2              1.7                  3.9              5.1
The summary statistics for these data are:
$$\sum x_i = 43.2, \quad \sum x_i^2 = 178.42, \quad \sum y_i = 79.5, \quad \sum y_i^2 = 563.85, \quad \sum x_i y_i = 309.39$$

i. Draw and label the scatter diagram of these data carefully on the graph
paper provided.
ii. What does your diagram show? Make at least two comments.
iii. Calculate and write down the least squares regression line.


iv. An architect has designed a series of retail areas of exactly 6 metres squared.
What would you expect the sales to be in such an area?
(12 marks)
Reading for this question
See Sections 11.6 and 11.9 of the subject guide.
Approaching the question
i. [Figure: Scatter plot of sales against shop area. x-axis: Area in metres squared; y-axis: Sales in £000s.]

ii. Positive, strong, linear relationship. Sales tend to increase with area.
iii. We have
$$b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} = 1.420$$
$$a = \bar{y} - b\bar{x} = 1.296$$

The least squares regression line is $\hat{y} = 1.296 + 1.420x$.


iv. When $x = 6$, the predicted sales value is $\hat{y} = 1.296 + 1.420(6) = 9.816$, i.e. £9,816. Note the
correct units.
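
The regression coefficients can be recovered directly from the summary statistics given in the question. The sketch below follows the same formulae as the solution and, like the solution, rounds the coefficients to three decimal places before predicting; the variable names are our own.

```python
n = 14
sum_x, sum_x2 = 43.2, 178.42
sum_y, sum_xy = 79.5, 309.39

x_bar = sum_x / n
y_bar = sum_y / n

# Slope and intercept of the least squares line y-hat = a + b x.
b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
a = y_bar - b * x_bar

# Prediction for a 6 square metre retail area, using the rounded coefficients.
predicted_thousands = round(a, 3) + round(b, 3) * 6
predicted_pounds = predicted_thousands * 1000

print(round(b, 3), round(a, 3), round(predicted_thousands, 3))
```

Since y is measured in thousands of pounds, the fitted value of 9.816 corresponds to weekly sales of about £9,816.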

(b) A researcher into the use of computational aids conducts a survey in which 195
students, as a homework assignment, compute the standard deviation of a set of
data. The students are asked what computational aids, if any, they used. The
researcher is particularly interested as to whether male and female students
differ in this respect. The following table shows the results of this survey:
                          Method of computation
Gender   No aids   Basic calculator   Statistical function on a calculator   Computer
Male     10        45                 30                                     10
Female   12        30                 52                                     6
i. Carry out an overall test for association between gender and method of
calculation at two significance levels. Give the null and alternative
hypotheses and comment on your results.
ii. The researcher is interested in whether there are any gender differences in
the preferred method of computation. Discuss any potential gender
differences which appeared in the test for association.
(13 marks)
Reading for this question
See Section 8.7 of the subject guide.


Approaching the question


i. We test H0 : No association between gender and method vs. H1 : Association between
gender and method.
Intermediate calculations are best presented in tabular form as follows:

The correct test statistic value is 9.9631, and we have $(2 - 1)(4 - 1) = 3$ degrees of
freedom.
For $\alpha = 0.05$, the critical value is 7.815, hence we reject $H_0$ at the 5% significance level.
Choosing a second (smaller) $\alpha$, say 0.01, gives a critical value of 11.34, hence we do not
reject $H_0$ at the 1% significance level. The test is moderately significant as there is some
evidence of an association between gender and computation.
ii. The main sources of association are identifiable by the large test statistic contributors
which are gender vs. basic calculator and statistical function on a calculator. Males
prefer basic calculator; females prefer statistical function on calculator.
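
The expected frequencies, the individual contributions and the overall test statistic can be computed as in the sketch below. It is an illustration only, with variable names of our own; carrying full precision gives a statistic of about 9.96, which agrees with the quoted value up to rounding in the intermediate table.

```python
# Observed counts: rows are Male, Female; columns are the four methods.
observed = [
    [10, 45, 30, 10],
    [12, 30, 52, 6],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under H0 = row total * column total / grand total.
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

# Contribution of each cell: (O - E)^2 / E.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chi_squared = sum(sum(row) for row in contributions)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Columns with the largest total contribution drive the association.
column_contributions = [sum(column) for column in zip(*contributions)]

print(round(chi_squared, 2), degrees_of_freedom)
print([round(c, 2) for c in column_contributions])
```

The two largest column totals belong to the basic calculator and the statistical function on a calculator, matching the sources of association identified in part (ii).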
