MH3511 Midterm Test (Total 50 pts)

1. Circle T (True) or F (False) for each of the following statements. (15 pts: 3 pts for a correct
answer, -1 pt for an incorrect answer, and 0 for no answer. The minimum total is 0 pts.)

(a) The interquartile range can be used to construct a robust estimate for the location of a distribution.
[T / F]

(b) The fact that the mean is less than the median often indicates a left-skewed distribution.
[T / F]

(c) To compare the distributions of two data sets using a Q-Q plot, e.g., > qqplot(data1, data2), a
straight-line pattern in the plot implies that both data sets have nearly identical distributions.
[T / F]

(d) If (2.23, 4.57) is a 90% confidence interval for the mean GPA of NTU year-4 students in 2016, then a
randomly selected student from this cohort has a GPA between 2.23 and 4.57 with probability 90%.
[T / F]

(e) A student applies the t-test for testing H0: µ = 1 vs HA: µ > 1. If the t-test statistic is obtained
as t = 1.28 based on a random sample of size n, then the p-value of the test can be calculated by
$1 - P(T < 1.28 \mid T \sim t_{n-1})$.
[T / F]

Solution: F, T, F, F, T
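
As a rough illustration of statement (e), the upper-tail p-value can be computed with pt(). The sample size n = 20 below is an assumed value, since the statement leaves n unspecified; only t = 1.28 comes from the question.

```r
# Assumed for illustration: the question only gives t = 1.28; n = 20 is hypothetical
t_stat <- 1.28
n <- 20

# Upper-tail p-value for HA: mu > 1, written as in statement (e)
p_value <- 1 - pt(t_stat, df = n - 1)

# Equivalent form that asks for the upper tail directly
p_upper <- pt(t_stat, df = n - 1, lower.tail = FALSE)

print(c(p_value, p_upper))
```

Both forms give the same number; the second avoids the subtraction from 1.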

2. Write R code to complete the following tasks (35 pts, 5 pts each. For a question requiring
interpretation, 3 pts for code and 2 pts for interpretation).

(a) Suppose there is a matrix B in the R console. Write R code to subtract from each row of B that
row's mean value and name the resulting matrix C (so that every row sum of C equals zero).
Solution:
> rm = rowMeans(B)
# or
> rm = apply(B, 1, mean)
# or (only if B has exactly two rows)
> rm = c(mean(B[1, ]), mean(B[2, ]))
> C = B - rm
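
The row centering above can be checked on a small example. The sketch below uses a hypothetical 3 x 4 matrix in place of B and compares the recycling-based solution with sweep(), which states the row-wise intent explicitly.

```r
# Hypothetical 3 x 4 matrix standing in for B
B <- matrix(c(1, 2, 3, 4,
              5, 7, 9, 11,
              0, 0, 6, 6), nrow = 3, byrow = TRUE)

# sweep() subtracts the i-th row mean from every entry in row i (MARGIN = 1)
C <- sweep(B, 1, rowMeans(B))

# The recycling-based form from the solution gives the same matrix
C2 <- B - rowMeans(B)

print(rowSums(C))  # each row sum should be zero up to rounding
```

The form B - rm works because R stores matrices column by column, so a vector of length nrow(B) is recycled down each column and lines up with the rows.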

(b) Write R code to i) generate 1000 random values from the standard normal distribution and put
them in variable x; ii) calculate the mean, variance, standard deviation and IQR of x; iii) visualize
the distribution of x using histogram and then plot the density of the standard normal distribution
between -3 and 3 over it, as shown in the figure below.
[Figure: "Histogram of x" with the standard normal density curve overlaid; x-axis from -3 to 3, y-axis "Density" from 0.0 to 0.4.]

Solution:
> x = rnorm(1000)
> mean(x); var(x); sd(x); IQR(x)
> hist(x, freq = FALSE)
# or
> hist(x, breaks = 14, freq = FALSE)

> lines(seq(-3, 3, 0.01), dnorm(seq(-3, 3, 0.01)))
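
As an alternative sketch for step iii), curve() can draw the density without building the grid by hand. The seed is an assumption added only to make the example reproducible.

```r
set.seed(1)              # assumed seed, only for reproducibility
x <- rnorm(1000)

# Density-scale histogram, then overlay the N(0, 1) density on [-3, 3]
hist(x, freq = FALSE)
curve(dnorm(x), from = -3, to = 3, add = TRUE)
```

Inside curve(), the symbol x refers to the plotting grid that curve() generates, not to the data vector x.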

(c) Two sets of daily maximum temperature data, Temp1 and Temp2, were collected from two cities,
respectively, over the past 3 months. Suppose that variables Temp1 and Temp2 exist in the R
console. Explain how to graphically check which of the two samples is more likely to be normally
distributed and write the corresponding R code.
Solution:
> qqnorm(Temp1)
> qqline(Temp1)
> qqnorm(Temp2)
> qqline(Temp2)
Compare the two plots: the sample whose points lie closer to its reference line is the one more
likely to be normally distributed.
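
To make the comparison easier, the two normal Q-Q plots can be drawn side by side. In the sketch below the data are simulated stand-ins for Temp1 and Temp2 (the means, spreads and the seed are assumptions, not values from the question).

```r
set.seed(2)                                   # assumed seed for reproducibility
Temp1 <- rnorm(90, mean = 30, sd = 2)         # hypothetical: roughly normal
Temp2 <- 25 + rexp(90, rate = 0.5)            # hypothetical: right-skewed

op <- par(mfrow = c(1, 2))                    # two panels side by side
qqnorm(Temp1, main = "Temp1"); qqline(Temp1)
qqnorm(Temp2, main = "Temp2"); qqline(Temp2)
par(op)                                       # restore the previous layout
```

With data like these, the skewed sample would be expected to bend away from its reference line in the upper tail.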

(d) A previous health study found that 5% of the population suffered from a blood disease. 23 people
randomly selected from an area near a mobile phone transmitter recently had a medical exam, and 3
of them were found to have that disease. The locals believe that the transmitter increases the likelihood
of having the disease. A researcher wants to perform a formal hypothesis test on whether the mobile
phone transmitter increases the incidence of the disease. State H0 and HA in this test, write R code
to test the hypotheses and explain how you can make a decision and draw a conclusion.
Solution:
> prop.test(3, 23, p = 0.05, alternative = "greater")
H0: p = 0.05 vs HA: p > 0.05, where p is the proportion of people near the transmitter who have the disease.
Decision: if the p-value reported by prop.test is less than α, we reject H0 at level α;
otherwise we do not reject H0.
Conclusion: When rejecting H0, we conclude that the mobile phone transmitter increases the
incidence of the disease significantly. Otherwise, we conclude that there is no significant evidence that
the mobile phone transmitter increases the incidence of the disease.
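
With n = 23 and p = 0.05, the expected number of cases under H0 is only 23 × 0.05 = 1.15, so the normal approximation behind prop.test may be poor (R may issue a warning). As a hedged alternative, not part of the original solution, an exact binomial test can be sketched as follows.

```r
# Exact one-sided test of H0: p = 0.05 against HA: p > 0.05
res <- binom.test(3, 23, p = 0.05, alternative = "greater")

# The same p-value computed directly: P(X >= 3) for X ~ Binomial(23, 0.05)
p_manual <- 1 - pbinom(2, size = 23, prob = 0.05)

print(c(res$p.value, p_manual))
```

The decision rule is unchanged: reject H0 at level α if the p-value is below α.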

(e) Variable "Score" in the R console contains math exam scores of a random sample of 120 primary-
five students. Write R code to construct a 98% confidence interval for the mean math score of the
underlying population. State necessary assumption(s) required in your construction of the confidence
interval.
Solution: Use a t-confidence interval:
> t.test(Score, conf.level = 0.98)
To use the t-interval, we need the distributional assumption that the variable Score is normally
distributed.
Or we can use a z-interval:
> L = mean(Score) - qnorm(0.99) * sd(Score)/sqrt(120)
> R = mean(Score) + qnorm(0.99) * sd(Score)/sqrt(120)
> cat("98%CI = ", "(", L, ", ", R, ")")
For the z-interval we only need the assumptions of the CLT: the sample size n is large enough and
the distribution of Score has finite first- and second-order moments.
The normality assumption is not necessary in this case.
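
The t-interval can also be built by hand in the same style as the z-interval, replacing qnorm(0.99) with the t quantile on n - 1 degrees of freedom. The sketch below uses simulated data as a stand-in for Score (the mean, standard deviation and seed are assumptions) and compares the result with t.test.

```r
set.seed(3)                                   # assumed seed for reproducibility
Score <- rnorm(120, mean = 70, sd = 10)       # hypothetical data standing in for Score

n <- length(Score)
# 98% confidence leaves 1% in each tail, hence the 0.99 quantile
half_width <- qt(0.99, df = n - 1) * sd(Score) / sqrt(n)
ci_manual <- c(mean(Score) - half_width, mean(Score) + half_width)

ci_ttest <- t.test(Score, conf.level = 0.98)$conf.int

print(ci_manual)
print(as.numeric(ci_ttest))
```

The t-interval is slightly wider than the z-interval because qt(0.99, 119) is larger than qnorm(0.99).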

(f) Suppose you have a set of data collected from a study that aims to explore whether hair color and
gender are dependent. The raw data are given in the following table. State the null and alternative
hypotheses in this problem and write R code to perform an appropriate hypothesis test based on the
raw data.

Subject    1    2    3      4    5    6    7      8    9    10
Haircolor  BRN  BRN  BLOND  BRN  RED  RED  BLOND  BRN  RED  BLOND
Gender     F    M    M      M    M    F    M      M    F    F

Solution:
> gender = c("F", "M", "M", "M", "M", "F", "M", "M", "F", "F")
> hair = c("BRN", "BRN", "BLOND", "BRN", "RED", "RED", "BLOND", "BRN", "RED", "BLOND")
> D = table(gender, hair)
> chisq.test(D)
H0: Hair color and gender of a subject are independent
against
H1: Hair color and gender of a subject are dependent.
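
With only 10 subjects, the expected cell counts under independence are all small, so the chi-squared approximation may be unreliable. The sketch below computes the expected counts and, as a hedged alternative that is not part of the original solution, runs Fisher's exact test on the same table.

```r
gender <- c("F", "M", "M", "M", "M", "F", "M", "M", "F", "F")
hair <- c("BRN", "BRN", "BLOND", "BRN", "RED", "RED", "BLOND", "BRN", "RED", "BLOND")
D <- table(gender, hair)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(D), colSums(D)) / sum(D)
print(expected)

# Exact test of independence, which does not rely on the chi-squared approximation
print(fisher.test(D))
```

The hypotheses are the same as for the chi-squared test; only the way the p-value is computed differs.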

(g) Breast cancer patients receiving treatment-M followed by chemotherapy were matched to each other
on age and cancer stage. By random assignment, one patient in each matched pair received both
chemotherapy and treatment-M, while the other patient in each matched pair received chemotherapy
only. After 5 years of follow-up, the data collected from this study are summarized in the table below.
                                 Chemo only
                                 Survived 5 years   Died within 5 years
Chemo+M   Survived 5 years       510                17
          Died within 5 years    5                  90
To answer the question "Does survival to 5 years differ by treatment group?", one applied McNemar's
test in R and obtained the following output:

> cancer = matrix(c(510, 17, 5, 90), nrow = 2, byrow = TRUE)

> mcnemar.test(cancer, correct = FALSE)
McNemar's Chi-squared test
data: cancer
McNemar's chi-squared = 6.5455, df = 1, p-value = 0.01052

i) What are the hypotheses H0 and HA in this test?

ii) What conclusion can you draw from the given output at level α = 0.05?

Solution: H0: The 5-year survival status and the treatment type are independent.
HA: The 5-year survival status depends on the treatment type.
Decision: Reject H0 since the p-value = 0.0105 < α = 0.05. Equivalently, by the rejection-region
method, the chi-squared statistic 6.55 exceeds the critical value $\chi^2_{1,0.05} = 3.84$.
Conclusion: The data provide evidence that the 5-year survival status depends on the treatment type:
chemotherapy combined with the extra treatment-M results in a different survival status
from that of chemotherapy alone.
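
The statistic in the output can be reproduced by hand from the two discordant cells of the table, which are the only cells McNemar's test uses when no continuity correction is applied. The variable names n12 and n21 below are introduced for illustration only.

```r
cancer <- matrix(c(510, 17, 5, 90), nrow = 2, byrow = TRUE)

n12 <- cancer[1, 2]  # pairs where only the chemo+M patient survived 5 years
n21 <- cancer[2, 1]  # pairs where only the chemo-only patient survived 5 years

# McNemar statistic without continuity correction
stat <- (n12 - n21)^2 / (n12 + n21)

# Upper-tail p-value and the 5% critical value of the chi-squared(1) distribution
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)
crit <- qchisq(0.95, df = 1)

print(c(stat, p_value, crit))
```

Here (17 - 5)^2 / (17 + 5) = 144 / 22, which matches the 6.5455 reported by mcnemar.test.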