You should attempt ALL questions. Marks available are shown next to the
questions.
• You may use calculators and computers, but you must show your working
for any calculations you do.
• You may use the Internet as a resource, but not to ask for the solution to an
exam question or to copy any solution you find.
All work should be handwritten and should include your student number.
You have 24 hours to complete and submit this assessment. When you have finished:
• scan your work, convert it to a single PDF file, and submit this file using the
tool below the link to the exam;
• e-mail a copy to maths@qmul.ac.uk with your student number and the module
code in the subject line;
• with your e-mail, include a photograph of the first page of your work together
with either yourself or your student ID card.
You are expected to spend about 3 hours completing the assessment, plus the time
taken to scan and upload your work. Please try to upload your work well before the
end of the submission window, in case you experience computer problems. Only one
attempt is allowed – once you have submitted your work, it is final.
Some questions use digits from your 9-digit ID number. The digits used are A, B and
C, the third-to-last, second-to-last and last digits of your ID number . . . ABC.
Question 1 [29 marks]. Let A, B and C be the last three digits of your ID number.
Consider the samples x = (9, 10 + A) and y = (7, 20 + B, 30 + C) from two
populations. We want to know if the mean in the population associated with the first
sample is different from the population mean for the second sample. We are not
prepared, however, to assume that the data are normally distributed.
(b) For sample sizes which are too large to calculate the exact null distribution for
the permutation test of part (a), even by computer, explain how we might
approximate the null distribution. [3]
(c) Calculate the value of the Mann-Whitney statistic U_X for the two samples x and y. [3]
(d) In R, the function dwilcox calculates the probability mass function of the null
distribution for the Mann-Whitney statistic. Suppose that this function is run
and outputs the following:
> dwilcox(0:3, m=2, n=3)
[1] 0.1 0.1 0.2 0.2
Use this output to calculate a p-value for the same comparison as part (a). [4]
(e) Suppose that the x and y samples were of the same size as each other, m = n. If
we carried out a Mann-Whitney test with a two-sided significance level of 5%,
what is the smallest m that would allow us to reject the null hypothesis? [4]
(f) If the samples were stored in x and y in R, what would be the purpose of the
following code?
length(c(x, y)) == length(unique(c(x, y))) [2]
(g) Briefly explain the main difference between the parametric and nonparametric
approaches to hypothesis testing. [3]
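The R functions mentioned in parts (b), (d) and (f) can be explored on made-up data. The following is a minimal sketch, assuming hypothetical vectors x and y (not the exam samples), the difference in sample means as test statistic, and an arbitrary number of random relabellings n_perm. It approximates a permutation null distribution by random relabelling, lists the full dwilcox probability mass function for the hypothetical sample sizes, and evaluates the expression from part (f).

```r
# Hypothetical data for illustration only (not the exam samples).
set.seed(1)
x <- c(4.1, 5.3, 6.0, 7.2)
y <- c(5.9, 6.8, 8.4, 9.1, 7.7)
m <- length(x)
n <- length(y)

# Monte Carlo approximation to a permutation null distribution of the
# difference in sample means: relabel the pooled data at random.
pooled <- c(x, y)
obs <- mean(x) - mean(y)
n_perm <- 5000   # arbitrary number of random relabellings (an assumption)
perm_stats <- replicate(n_perm, {
  idx <- sample(m + n, m)   # positions that play the role of x
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided Monte Carlo p-value. The +1 terms count the observed labelling
# itself; the small tolerance guards against floating-point ties.
p_mc <- (sum(abs(perm_stats) >= abs(obs) - 1e-12) + 1) / (n_perm + 1)

# Full null probability mass function of the Mann-Whitney statistic,
# whose support is 0, 1, ..., m * n.
u <- 0:(m * n)
pmf <- dwilcox(u, m = m, n = n)

# The expression from part (f), evaluated on the hypothetical data.
no_dup <- length(c(x, y)) == length(unique(c(x, y)))

p_mc
sum(pmf)
no_dup
```

Increasing n_perm reduces the Monte Carlo error in p_mc at the cost of run time.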
(a) For sample size n = 3, using the assumptions of symmetry about zero and
continuous data, calculate the null distribution for K. In other words, calculate
P(K = k) for each k for which this probability is non-zero. [8]
(b) Let A, B and C be the last three digits of your ID number. Suppose that the
observed data are
Using the null distribution found in part (a), calculate the one-sided p-value
testing if the first member of each pair tends to be greater than the second
member. [3]
(c) Give an example of a type of study where the data consists of matched pairs. [2]
Question 3 [8 marks].
Consider the simple linear regression model
Y_i = α + β x_i + e_i,   i = 1, ..., n,
where Y_i is the random variable representing the response at the value x_i of the
explanatory variable and the e_i are uncorrelated random errors.
(b) What does the method of part (a) assume about the data-points
(x_1, y_1), ..., (x_n, y_n)? [2]
Question 4 [7 marks].
Suppose that we have a certain method for calculating 95% confidence intervals
(θ_L, θ_U) for a population parameter θ.
(a) Define what is meant by the coverage of the confidence interval. [2]
(b) Suppose that (θ_L, θ_U) are calculated based on data of the form y_1, ..., y_n, where
each y_i is assumed to follow a distribution that depends on θ. Explain how we
could use simulation to assess the coverage of the confidence intervals. [5]
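One way such a simulation might look in R is sketched below. Everything specific in it is an assumption made for illustration: the data are taken to be normal with mean θ, the interval method is the t-interval returned by t.test, and the values of theta, n and n_sim are arbitrary.

```r
# A minimal sketch under stated assumptions, not a prescribed method.
set.seed(42)
theta <- 5       # true parameter value used to generate the data (assumed)
n <- 20          # sample size (assumed)
n_sim <- 2000    # number of simulated data sets (assumed)

covered <- replicate(n_sim, {
  y <- rnorm(n, mean = theta, sd = 2)            # simulate data with known theta
  ci <- t.test(y, conf.level = 0.95)$conf.int    # 95% interval for the mean
  ci[1] <= theta && theta <= ci[2]               # does the interval contain theta?
})

# Proportion of simulated intervals that contain the true value.
coverage_hat <- mean(covered)
coverage_hat
```

The estimate coverage_hat carries Monte Carlo error of roughly sqrt(0.95 * 0.05 / n_sim), so a larger n_sim gives a sharper comparison with the nominal 95%.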
Explain how this graph would be used to select a value of λ. By doing this,
what feature of the fitted model are we selecting for? Why would PRESS
initially decrease as λ increases for small values of λ? [7]
(a) What is the name for the statistical procedure that the code in the loop is
carrying out? [2]
(b) Explain what each of the three lines of code inside the loop is doing. [5]
(c) In statistical terms, what will the command sd(v) output? [2]
(d) In statistical terms, what will the last line of code output? [3]
Suppose now that the first two lines of code inside the loop are replaced by:
xb = rgamma(length(x), shape=ax, rate=bx)
yb = rgamma(length(y), shape=ay, rate=by)
where it is assumed that ax, bx, ay, by have been given values in previous code.
(f) With that change, what is the name of the statistical procedure? [1]
(g) Explain what the line starting with xb = rgamma is doing. [3]
(h) What does this method assume about the observations in the sample x? [2]
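The two replacement lines can be tried out in isolation. The sketch below is hypothetical throughout: the loop that the questions above refer to is not reproduced here, so the data x and y, the values of ax, bx, ay and by, the number of repetitions B, and the statistic stored in v (a difference in means) are all assumptions chosen only so that the code runs.

```r
# Hypothetical set-up; none of these values come from the exam.
set.seed(7)
x <- c(1.2, 0.8, 2.5, 1.9, 3.1)
y <- c(2.2, 3.4, 1.7, 4.0, 2.9, 3.3)
ax <- 2; bx <- 1   # assumed gamma shape and rate for the x population
ay <- 3; by <- 1   # assumed gamma shape and rate for the y population

B <- 1000          # number of repetitions (assumed)
v <- numeric(B)
for (b in 1:B) {
  # The two lines quoted above: draw samples of the original sizes
  # from gamma distributions with the given parameters.
  xb <- rgamma(length(x), shape = ax, rate = bx)
  yb <- rgamma(length(y), shape = ay, rate = by)
  v[b] <- mean(xb) - mean(yb)   # assumed statistic, for illustration only
}

sd(v)
```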
Table 1: The standard normal cumulative distribution function (cdf) Φ(x) for the given
values of x. The cdf for x < 0 can be found using the fact that Φ(x) = 1 − Φ(−x). For
x ≥ 3.8, 1 − Φ(x) < 10^-4.
  x   Φ(x)      x   Φ(x)      x   Φ(x)      x   Φ(x)      x   Φ(x)
 0.0  0.500    0.8  0.788    1.6  0.945    2.4  0.992    3.2  0.9993
 0.1  0.540    0.9  0.816    1.7  0.955    2.5  0.994    3.3  0.9995
 0.2  0.579    1.0  0.841    1.8  0.964    2.6  0.995    3.4  0.9997
 0.3  0.618    1.1  0.864    1.9  0.971    2.7  0.997    3.5  0.9998
 0.4  0.655    1.2  0.885    2.0  0.977    2.8  0.997    3.6  0.9998
 0.5  0.691    1.3  0.903    2.1  0.982    2.9  0.998    3.7  0.9999
 0.6  0.726    1.4  0.919    2.2  0.986    3.0  0.999    3.8  0.9999
 0.7  0.758    1.5  0.933    2.3  0.989    3.1  0.999
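The entries of Table 1 can be checked in R with pnorm, the standard normal cdf. This is a short sketch; the object names xs, Phi and tab are arbitrary choices.

```r
# Recompute the tabulated values of the standard normal cdf.
xs <- seq(0, 3.8, by = 0.1)
Phi <- pnorm(xs)
tab <- data.frame(x = xs, Phi = round(Phi, 4))
head(tab)

# Symmetry relation quoted in the caption: Phi(x) = 1 - Phi(-x).
sym_ok <- isTRUE(all.equal(pnorm(-1.3), 1 - pnorm(1.3)))

# Upper-tail probability at x = 3.8, computed directly for accuracy.
tail_38 <- pnorm(3.8, lower.tail = FALSE)

sym_ok
tail_38
```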
End of Appendix.