Main Examination period 2021 – May/June – Semester B

Online Alternative Assessments

MTH791U/MTH791P: Computational Statistics with R

You should attempt ALL questions. Marks available are shown next to the
questions.

In completing this assessment:

• You may use books and notes.

• You may use calculators and computers, but you must show your working
for any calculations you do.

• You may use the Internet as a resource, but not to ask for the solution to an
exam question or to copy any solution you find.

• You must not seek or obtain help from anyone else.

All work should be handwritten and should include your student number.

You have 24 hours to complete and submit this assessment. When you have finished:

• scan your work, convert it to a single PDF file, and submit this file using the
tool below the link to the exam;

• e-mail a copy to maths@qmul.ac.uk with your student number and the module
code in the subject line;

• with your e-mail, include a photograph of the first page of your work together
with either yourself or your student ID card.

You are expected to spend about 3 hours to complete the assessment, plus the time
taken to scan and upload your work. Please try to upload your work well before the
end of the submission window, in case you experience computer problems. Only one
attempt is allowed – once you have submitted your work, it is final.

Examiners: J. Griffin, H. Maruri-Aguilar

© Queen Mary University of London (2021) Continue to next page



Some questions use digits from your 9-digit ID number. The digits used are A, B and
C, the third-to-last, second-to-last and last digits of your ID number . . . ABC.

Question 1 [29 marks]. Let A, B and C be the last three digits of your ID number.
Consider the samples x = (9, 10 + A) and y = (7, 20 + B, 30 + C ) from two
populations. We want to know if the mean in the population associated with the first
sample is different from the population mean for the second sample. We are not
prepared, however, to assume that the data are normally distributed.

(a) Suppose we want to perform a permutation test. State an appropriate null
hypothesis and a test statistic. Carry out this test at the 10% significance level to
test the hypothesis. In your answer, calculate the full null distribution. [10]

(b) For sample sizes which are too large to calculate the exact null distribution for
the permutation test of part (a), even by computer, explain how we might
approximate the null distribution. [3]

(c) Calculate the value of the Mann-Whitney statistic U_X for the two samples x and y. [3]

(d) In R, the function dwilcox calculates the probability mass function of the null
distribution for the Mann-Whitney statistic. Suppose that this function is run
and outputs the following:
> dwilcox(0:3, m=2, n=3)
[1] 0.1 0.1 0.2 0.2
Use this output to calculate a p-value for the same comparison as part (a). [4]

(e) Suppose that the x and y samples were of the same size as each other, m = n. If
we carried out a Mann-Whitney test with a two-sided significance level of 5%,
what is the smallest m that would allow us to reject the null hypothesis? [4]

(f) If the samples were stored in x and y in R, what would be the purpose of the
following code?
length(c(x, y)) == length(unique(c(x, y))) [2]

(g) Briefly explain the main difference between the parametric and nonparametric
approaches to hypothesis testing. [3]
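
The mechanics behind parts (a) and (b) can be sketched in R as follows. This is only an illustration under stated assumptions, not a model answer: the samples are hypothetical values chosen for the sketch (they do not use the digits A, B, C), the difference in sample means is one possible test statistic, and all variable names are invented here.

```r
# Hypothetical samples for illustration only (not the samples in the question)
x <- c(3, 8)
y <- c(5, 12, 20)

pooled <- c(x, y)
m <- length(x)

# Each column of idx is one way of labelling m pooled observations as "x"
idx <- combn(length(pooled), m)

# Difference in means under every relabelling: the exact null distribution
null_stats <- apply(idx, 2, function(i) mean(pooled[i]) - mean(pooled[-i]))

observed <- mean(x) - mean(y)

# Two-sided p-value: proportion of relabellings at least as extreme as observed
# (the small tolerance guards against floating-point ties)
p_value <- mean(abs(null_stats) >= abs(observed) - 1e-12)

# When full enumeration is infeasible, sample random relabellings instead
set.seed(1)
R <- 2000
mc_stats <- replicate(R, {
  i <- sample(length(pooled), m)
  mean(pooled[i]) - mean(pooled[-i])
})
mc_p_value <- mean(abs(mc_stats) >= abs(observed) - 1e-12)
```

With two and three observations there are only choose(5, 2) = 10 relabellings, so the exact distribution is cheap to list; the Monte Carlo version trades exactness for a cost that does not grow with the number of possible relabellings.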


Question 2 [13 marks].


Let ( x1 , y1 ), . . . , ( xn , yn ) be n pairs of data from continuous distributions. The null
hypothesis is that the differences xi − yi , i = 1, . . . , n, have a distribution which is
symmetric about zero. Let the test statistic K be the number of pairs for which
xi − yi > 0.

(a) For sample size n = 3, using the assumptions of symmetry about zero and
continuous data, calculate the null distribution for K. In other words, calculate
P(K = k ) for each k for which this probability is non-zero. [8]

(b) Let A, B and C be the last three digits of your ID number. Suppose that the
observed data are

( x1 , y1 ), ( x2 , y2 ), ( x3 , y3 ) = ( A, 10 + B), (9, 10 + C ), (14, 2)

Using the null distribution found in part (a), calculate the one-sided p-value
testing if the first member of each pair tends to be greater than the second
member. [3]

(c) Give an example of a type of study where the data consists of matched pairs. [2]

Question 3 [8 marks].
Consider the simple linear regression model

Yi = α + βxi + ei , i = 1, . . . , n,

where Yi is the random variable representing the response at the value xi of the
explanatory variable and the ei s are uncorrelated random errors.

(a) Give a step-by-step description of using the method of bootstrapping cases
with a sample (x1, y1), . . . , (xn, yn) in order to estimate the standard error of the
least squares estimators α̂ and β̂ of the intercept and the slope. [6]

(b) What does the method of part (a) assume about the data-points
(x1, y1), . . . , (xn, yn)? [2]
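
As a rough illustration of the idea in part (a), the sketch below resamples whole (x, y) pairs and refits the least squares line each time. It is a sketch under assumptions, not a model answer: the data are simulated, and the number of resamples B and all variable names are chosen here for illustration.

```r
set.seed(42)

# Simulated data, purely illustrative
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)

B <- 1000  # number of bootstrap resamples (an arbitrary choice)
coefs <- matrix(NA_real_, nrow = B, ncol = 2,
                dimnames = list(NULL, c("alpha", "beta")))

for (b in seq_len(B)) {
  # Resample case indices with replacement, keeping each (x, y) pair together
  idx <- sample(n, replace = TRUE)
  # Refit the regression to the resampled cases and store both estimates
  coefs[b, ] <- coef(lm(y[idx] ~ x[idx]))
}

# Bootstrap standard errors: standard deviation of each column of estimates
se_boot <- apply(coefs, 2, sd)
```

The two entries of se_boot can be compared with the standard errors reported by summary(lm(y ~ x)) as a sanity check on simulated data.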

Question 4 [7 marks].
Suppose that we have a certain method for calculating 95% confidence intervals
(θ L , θU ) for a population parameter θ.

(a) Define what is meant by the coverage of the confidence interval. [2]

(b) Suppose that (θ L , θU ) are calculated based on data of the form y1 , . . . , yn , where
each yi is assumed to follow a distribution that depends on θ. Explain how we
could use simulation to assess the coverage of the confidence intervals. [5]
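
A simulation of the kind described in part (b) might look like the sketch below. The specific setting is an assumption made for illustration: normal data with known true mean theta, and the usual t-interval from t.test standing in for the "certain method"; the sample size, number of simulations and variable names are likewise invented.

```r
set.seed(123)

theta <- 5      # true parameter value, known because we simulate the data
n <- 20         # sample size of each simulated data set
n_sim <- 2000   # number of simulated data sets

covered <- logical(n_sim)
for (s in seq_len(n_sim)) {
  # Simulate one data set from the assumed model
  y <- rnorm(n, mean = theta, sd = 2)
  # Compute the 95% interval with the method under study
  ci <- t.test(y, conf.level = 0.95)$conf.int
  # Record whether the interval contains the true value
  covered[s] <- ci[1] <= theta && theta <= ci[2]
}

# Estimated coverage: proportion of intervals containing theta
coverage_hat <- mean(covered)
```

Because coverage_hat is itself a proportion estimated from n_sim trials, its Monte Carlo standard error is roughly sqrt(0.95 * 0.05 / n_sim), which indicates how many simulations are needed for a given precision.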


Question 5 [21 marks].


Suppose that we have bivariate data of the form (y1 , x1 ), . . . , (yn , xn ). We wish to fit
models of the form E(Yi ) = f ( xi , β), where f is a known functional form and β is a
vector of parameters to be estimated.
(a) Describe the procedure for using leave-one-out cross-validation to obtain both a
set of predictions ŷ[1] , . . . , ŷ[n] , and the predicted or cross-validated residuals. [5]
(b) Assume now that f ( xi , β) is a linear model, and let H be the hat matrix when the
model is fitted to the whole dataset. You do not need to state the formula for H.
(i) State the formula relating the predicted residuals to the ordinary residuals
found when fitting the model to the original dataset. How does this
formula allow us to save computing time? [4]
(ii) How do the predicted residuals compare in magnitude to the ordinary
residuals? [2]
(c) Suppose that f depends on a set of spline functions, λ > 0 is a smoothing
parameter, and we estimate β by minimizing the penalized sum of squares
$$S_P = \sum_{i=1}^{n} \bigl(y_i - f(x_i, \beta)\bigr)^2 + \lambda \int_{-\infty}^{\infty} f''(x, \beta)^2 \, dx$$
The answers do not need any details about spline functions.
(i) Explain why for sufficiently large values of λ, the fitted model approaches
a linear function. [3]
(ii) Suppose that we have fitted this model for a range of values of λ and
calculated the PRESS statistic each time, with the results as plotted below.
The PRESS statistic is defined as
$$\mathrm{PRESS} = \sum_{i=1}^{n} e_{[i]}^2,$$ where $e_{[i]}$ is the ith predicted residual.

Explain how this graph would be used to select a value of λ. By doing this,
what feature of the fitted model are we selecting for? Why would PRESS
initially decrease as λ increases for small values of λ? [7]
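
For the linear-model case of part (b), leave-one-out cross-validation and the PRESS statistic can be sketched as below. This is an illustration on simulated data with invented variable names, not a model answer; the final lines check the brute-force predicted residuals against the well-known leverage-based identity for linear models, using the diagonal of the hat matrix returned by hatvalues.

```r
set.seed(7)

# Simulated data, purely illustrative
n <- 25
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Brute-force leave-one-out: refit the model n times
pred <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(y ~ x, data = dat[-i, ])                         # fit without case i
  pred[i] <- predict(fit_i, newdata = dat[i, , drop = FALSE])  # predict case i
}

e_pred <- dat$y - pred   # predicted (cross-validated) residuals
press <- sum(e_pred^2)   # PRESS statistic

# Single fit to the whole data set: ordinary residuals scaled by 1 - leverage
fit <- lm(y ~ x, data = dat)
e_short <- resid(fit) / (1 - hatvalues(fit))
```

The two sets of residuals agree up to rounding, so for a linear model one fit replaces n refits.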


Question 6 [22 marks].


Suppose that we have two samples x = ( x1 , . . . , xm ) and y = (y1 , . . . , yn ), which in R
are stored in vectors named x and y, respectively. Consider the following R code:
N = 5000
v = vector(length=N)
for(i in 1:N){
xb = sample(x, replace=TRUE)
yb = sample(y, replace=TRUE)
v[i] = mean(xb) - mean(yb)
}
sd(v)
a = 0.025
k = floor(a*(N+1))
sv = sort(v)
c(sv[k], sv[N+1-k])

(a) What is the name for the statistical procedure that the code in the loop is
carrying out? [2]

(b) Explain what each of the three lines of code inside the loop is doing. [5]

(c) In statistical terms, what will the command sd(v) output? [2]

(d) In statistical terms, what will the last line of code output? [3]

(e) What does the statistical method assume about:

(i) the observations in the sample x? [2]


(ii) the relationship between the samples x and y? [2]

Suppose now that the first two lines of code inside the loop are replaced by:
xb = rgamma(length(x), shape=ax, rate=bx)
yb = rgamma(length(y), shape=ay, rate=by)
where it is assumed that ax, bx, ay, by have been given values in previous code.

(f) With that change, what is the name of the statistical procedure? [1]

(g) Explain what the line starting with xb = rgamma is doing. [3]

(h) What does this method assume about the observations in the sample x? [2]

End of Paper – An appendix of 1 page follows.


Appendix: Normal distribution function

Table 1: The standard normal cumulative distribution function (cdf) Φ( x ) for the given
values of x. The cdf for x < 0 can be found using the fact that Φ( x ) = 1 − Φ(− x ). For
x ≥ 3.8, 1 − Φ(x) < 10^{−4}.

x Φ( x ) x Φ( x ) x Φ( x ) x Φ( x ) x Φ( x )
0.0 0.500 0.8 0.788 1.6 0.945 2.4 0.992 3.2 0.9993
0.1 0.540 0.9 0.816 1.7 0.955 2.5 0.994 3.3 0.9995
0.2 0.579 1.0 0.841 1.8 0.964 2.6 0.995 3.4 0.9997
0.3 0.618 1.1 0.864 1.9 0.971 2.7 0.997 3.5 0.9998
0.4 0.655 1.2 0.885 2.0 0.977 2.8 0.997 3.6 0.9998
0.5 0.691 1.3 0.903 2.1 0.982 2.9 0.998 3.7 0.9999
0.6 0.726 1.4 0.919 2.2 0.986 3.0 0.999 3.8 0.9999
0.7 0.758 1.5 0.933 2.3 0.989 3.1 0.999

End of Appendix.
