Full points may be obtained for correct answers to eight questions. Each numbered question
(which may have several parts) is worth the same number of points. All answers will be
graded, but the score for the examination will be the sum of the scores of your best eight
solutions.
Use separate answer sheets for each question. DO NOT PUT YOUR NAME ON
YOUR ANSWER SHEETS. When you have finished, insert all your answer sheets into
the envelope provided, then seal it.
To earn full credit, you must show all the steps leading to your answer.
Problem 1—Stat 401. Let X be a Gaussian random variable with mean 0 and variance
Y , where Y is a random variable such that P{Y = 1} = p and P{Y = 2} = 1 − p for some
0 < p < 1. Find P{X > 0}.
Solution to Problem 1. Conditional on Y = v (v = 1, 2), X is Gaussian with mean 0
and is therefore symmetric about 0, so P{X > 0 | Y = v} = 1/2. We have
$$P\{X > 0\} = P\{X > 0 \mid Y = 1\}P\{Y = 1\} + P\{X > 0 \mid Y = 2\}P\{Y = 2\} = \frac{1}{2}p + \frac{1}{2}(1 - p) = \frac{1}{2}.$$
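As a quick numerical sanity check (illustrative only, with an arbitrary choice p = 0.3), a Monte Carlo sketch in Python:

```python
# Monte Carlo check that P{X > 0} = 1/2 regardless of p.
# X | Y = 1 ~ N(0, 1) and X | Y = 2 ~ N(0, 2); p = 0.3 is arbitrary.
import random

random.seed(1)
p, trials = 0.3, 100_000
count = 0
for _ in range(trials):
    variance = 1.0 if random.random() < p else 2.0
    x = random.gauss(0.0, variance ** 0.5)
    if x > 0:
        count += 1
print(count / trials)  # close to 0.5
```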
Problem 2—Stat 401. Let X, Y, Z have joint probability density function f (x, y, z) =
2(x + y + z)/3, 0 < x, y, z < 1; zero elsewhere.
1. Find the marginal probability density functions of X, Y , and Z.
2. Are X, Y, Z independent?
Solution to Problem 2.
1. We have
$$f_X(x) = \int_0^1\!\!\int_0^1 f(x, y, z)\,dy\,dz = \frac{2}{3}(x + 1), \quad 0 < x < 1.$$
Similarly, we have
$$f_Y(y) = \frac{2}{3}(y + 1), \quad 0 < y < 1,$$
and
$$f_Z(z) = \frac{2}{3}(z + 1), \quad 0 < z < 1.$$
2. Since $f_X(x)f_Y(y)f_Z(z) \neq f(x, y, z)$, they are not independent. Indeed, the conditional density
$$f_{X|Y,Z}(x \mid y, z) = \frac{f(x, y, z)}{f_{Y,Z}(y, z)} = \frac{2(x + y + z)}{2y + 2z + 1}, \quad 0 < x, y, z < 1,$$
depends on (y, z).
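The marginal and the failure of factorization can be checked numerically; the sketch below uses a midpoint rule (the grid size m and the evaluation points are arbitrary choices):

```python
# Numerical check of Problem 2: marginal f_X and non-factorization of f.
def f(x, y, z):
    """Joint density 2(x+y+z)/3 on the open unit cube, zero elsewhere."""
    if 0 < x < 1 and 0 < y < 1 and 0 < z < 1:
        return 2 * (x + y + z) / 3
    return 0.0

def f_X(x, m=200):
    """Midpoint-rule integral of f over (y, z) in (0,1)^2."""
    h = 1.0 / m
    return sum(
        f(x, (i + 0.5) * h, (j + 0.5) * h) for i in range(m) for j in range(m)
    ) * h * h

marg = lambda t: 2 * (t + 1) / 3
print(f_X(0.3), marg(0.3))               # both ~ 0.8667
# Product of marginals differs from the joint density, e.g. at (0.1, 0.1, 0.1):
print(f(0.1, 0.1, 0.1), marg(0.1) ** 3)  # 0.2 vs ~0.394
```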
Problem 3—Stat 401. For independent and identically distributed random variables
$X_1, X_2, \ldots, X_n$ with $P[X_i > 0] = 1$ and $Var(\log(X_i)) = \sigma^2$, show that for every $\epsilon > 0$
(Hint: Chebyshev inequality)
(a)
$$P\left[\exp\{n[E(\log(X_1)) - \epsilon]\} < \prod_{i=1}^{n} X_i < \exp\{n[E(\log(X_1)) + \epsilon]\}\right] \geq 1 - \frac{\sigma^2}{n\epsilon^2}$$
(b)
$$P\left[\prod_{i=1}^{n} X_i < (E(X_1))^n e^{n\epsilon}\right] \geq 1 - \frac{\sigma^2}{n\epsilon^2}$$
(You may use the inequality $\log(E(X_1)) \geq E(\log(X_1))$.)
Solution to Problem 3.
(a) By the condition that $P[X_i > 0] = 1$, we have
$$P\left[e^{n[E(\log(X_1)) - \epsilon]} < \prod_{i=1}^{n} X_i < e^{n[E(\log(X_1)) + \epsilon]}\right]
= P\left[n[E(\log(X_1)) - \epsilon] < \sum_{i=1}^{n} \log(X_i) < n[E(\log(X_1)) + \epsilon]\right]$$
$$= P\left[-n\epsilon < \sum_{i=1}^{n} \log(X_i) - nE(\log(X_1)) < n\epsilon\right] \qquad (1)$$
$$= P\left[\left|\sum_{i=1}^{n} \log(X_i) - nE(\log(X_1))\right| < n\epsilon\right]
= 1 - P\left[\left|\sum_{i=1}^{n} \log(X_i) - nE(\log(X_1))\right| \geq n\epsilon\right].$$
Since $Var\left(\sum_{i=1}^n \log(X_i)\right) = n\sigma^2$, Chebyshev's inequality gives
$$P\left[\left|\sum_{i=1}^{n} \log(X_i) - nE(\log(X_1))\right| \geq n\epsilon\right] \leq \frac{n\sigma^2}{n^2\epsilon^2} = \frac{\sigma^2}{n\epsilon^2},$$
which yields the bound in (a).
(b) Since $\log(E(X_1)) \geq E(\log(X_1))$, we have $e^{n[E(\log(X_1)) + \epsilon]} \leq e^{n[\log(E(X_1)) + \epsilon]} = (E(X_1))^n e^{n\epsilon}$. On the other hand,
$$P\left[e^{n[E(\log(X_1)) - \epsilon]} < \prod_{i=1}^{n} X_i < e^{n[E(\log(X_1)) + \epsilon]}\right]
\leq P\left[\prod_{i=1}^{n} X_i < e^{n[E(\log(X_1)) + \epsilon]}\right]
\leq P\left[\prod_{i=1}^{n} X_i < e^{n[\log(E(X_1)) + \epsilon]}\right]
= P\left[\prod_{i=1}^{n} X_i < (E(X_1))^n e^{n\epsilon}\right]. \qquad (3)$$
Combining (3) with part (a) gives $P\left[\prod_{i=1}^{n} X_i < (E(X_1))^n e^{n\epsilon}\right] \geq 1 - \sigma^2/(n\epsilon^2)$.
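The bound in (a) can be illustrated by simulation; the sketch below takes $\log(X_i) \sim N(0, 1)$, so $E(\log X_1) = 0$ and $\sigma^2 = 1$ (arbitrary choices made for illustration):

```python
# Monte Carlo illustration of the Chebyshev bound in part (a).
import random

random.seed(0)
mu, sigma, n, eps, trials = 0.0, 1.0, 50, 0.5, 20_000
bound = 1 - sigma**2 / (n * eps**2)  # lower bound 1 - sigma^2/(n eps^2) = 0.92
hits = 0
for _ in range(trials):
    # sum of log(X_i) for iid lognormal X_i is a sum of normals
    s = sum(random.gauss(mu, sigma) for _ in range(n))
    if n * (mu - eps) < s < n * (mu + eps):
        hits += 1
print(hits / trials, ">=", bound)
```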
Problem 4—Stat 401. In a lengthy manuscript, it is discovered that 86.5% of the
pages contain at least one typing error. Assume that the number of errors per page is
Poisson distributed.
(a) Let X be the number of errors per page and X follows a Poisson(λ) distribution. What
is λ?
(b) Let Y be the total number of typos on n pages. Suppose that the numbers of typos on
different pages are independent and identically distributed as Poisson(λ). Show that Y
follows a Poisson distribution with parameter nλ.
(c) Suppose that the number of typos on n pages follows a Poisson distribution with
parameter nλ, where λ is computed in part (a). How many pages should be checked
so that at least one typo is found with probability no less than 0.99?
Solution to Problem 4. (a) Because X follows a Poisson distribution, we have
$$P(X \geq 1) = 1 - P(X < 1) = 1 - P(X = 0) = 1 - \frac{\lambda^0}{0!}\exp(-\lambda) = 1 - \exp(-\lambda) = 0.865.$$
Then, it follows that λ = − log(0.135) = 2.00.
(b) Note that the moment generating function of X is
$$M_X(t) = E(e^{tX}) = \sum_{k=0}^{\infty} e^{tk}\frac{\lambda^k}{k!}\exp(-\lambda) = \exp(-\lambda)\sum_{k=0}^{\infty}\frac{(e^t\lambda)^k}{k!} = \exp\{(e^t - 1)\lambda\}.$$
Then the moment generating function of Y is
$$M_Y(t) = \{M_X(t)\}^n = \exp\{(e^t - 1)n\lambda\}.$$
Hence, Y follows a Poisson distribution with parameter nλ.
(c) We would like to find n so that
P (Y ≥ 1) ≥ 0.99.
The above inequality is equivalent to
1 − P (Y < 1) ≥ 0.99
Then, it follows that
$$P(Y < 1) = P(Y = 0) = \frac{(n\lambda)^0}{0!}\exp(-n\lambda) = \exp(-n\lambda) \leq 0.01,$$
which is equivalent to
$$n \geq -\log(0.01)/\lambda = -\log(0.01)/2 = 2.30.$$
Thus, at least 3 pages need to be checked so that at least one typo is found with probability
no less than 0.99.
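Parts (a) and (c) reduce to a two-line computation; a quick check in Python:

```python
# Problem 4(a) and (c): solve for lambda, then the number of pages.
import math

lam = -math.log(0.135)       # from 1 - exp(-lam) = 0.865
pages = -math.log(0.01) / 2  # using lambda rounded to 2.00
print(round(lam, 2), round(pages, 2), math.ceil(pages))  # 2.0 2.3 3
```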
Problem 5—Stat 411. Let X1 , . . . , Xn and Y1 , . . . , Yn be independent random samples
from two normal distributions N (µ1 , σ 2 ) and N (µ2 , σ 2 ), respectively, where σ 2 is the common
but unknown variance.
(a) Find the likelihood ratio ∆ for testing H0 : µ1 = µ2 = 0 against all alternatives.
Solution to Problem 5. (a) Maximizing the likelihood over (µ1 , µ2 , σ 2 ) under the full
model and under H0 , and substituting the resulting MLEs of σ 2 , the likelihood ratio reduces
to
$$\Delta = (1 + Z)^{-n},$$
where
$$Z = \frac{n\bar{X}^2 + n\bar{Y}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{j=1}^{n}(Y_j - \bar{Y})^2}.$$
Problem 6—Stat 411. Let X1 , ..., Xn be iid Poisson random variables with parameter
θ > 0, and q(θ) = 1 − e−θ = Pθ (X1 > 0).
(a) Show that $S_n = \sum_{i=1}^{n} X_i$ is sufficient and complete for θ.
(b) Find the distribution of $S_n$.
(c) Find the UMVUE of q(θ).
Solution to Problem 6. (a) Sufficiency is established by noting that the Poisson distri-
bution is an exponential family.
To check that the sufficient statistic $U = S_n = \sum_{i=1}^{n} X_i$ is complete, suppose
$$E_\theta[g(U)] = \sum_{u=0}^{\infty} g(u)\frac{(n\theta)^u}{u!}e^{-n\theta} = 0 \quad \forall\, \theta > 0.$$
Hence
$$\sum_{u=0}^{\infty} g(u)\frac{n^u}{u!}\theta^u = 0 \quad \forall\, \theta > 0,$$
which implies $g(u)\frac{n^u}{u!} = 0$ for all u = 0, 1, ..., and hence g(u) = 0 for all u = 0, 1, ....
(b) Sn has a Poisson distribution with parameter nθ by using the MGF.
(c) First, propose an unbiased estimator $T = I_{\{X_1 > 0\}}$ of q(θ).
Next, apply Rao-Blackwellization: given $S_n = s$, $X_1$ follows a Binomial(s, 1/n) distribution, so
$$E[T \mid S_n] = 1 - P(X_1 = 0 \mid S_n) = 1 - \left(\frac{n-1}{n}\right)^{S_n},$$
which is the UMVUE of q(θ) by the Lehmann-Scheffé theorem, since $S_n$ is sufficient and complete.
Problem 7—Stat 411. Let X1 , ..., Xn be iid uniform(0, θ) random variables, θ > 0.
(a) Find a sufficient and complete statistic for θ and verify it.
Solution to Problem 7. (a) By the factorization theorem, $Y_n = \max_i\{X_i\}$ is sufficient
for θ. To verify completeness, suppose
$$E_\theta[u(Y_n)] = \int_0^\theta u(y)\,\frac{ny^{n-1}}{\theta^n}\,dy = 0$$
for all θ > 0. Differentiating with respect to θ, we get $u(\theta)\theta^{n-1} = 0$ for all θ > 0. That
is, u(y) = 0 for all y > 0, which is the support of $Y_n$. By the definition of completeness,
$Y_n$ is also complete for θ ∈ (0, ∞).
(b) The likelihood function of θ can be written as
$$L(\theta) = \prod_{i=1}^{n} \theta^{-1}\mathbf{1}_{[0,\theta]}(X_i) = \theta^{-n}\mathbf{1}_{[\max_i\{X_i\},\,\infty)}(\theta),$$
which attains its maximum at $\max_i\{X_i\}$. That is, the MLE of θ is $\hat\theta = \max_i\{X_i\}$.
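A quick simulation illustrating that the MLE $\max_i\{X_i\}$ sits just below θ (θ = 4 and n = 1000 are arbitrary choices):

```python
# MLE of theta for Uniform(0, theta) is the sample maximum.
import random

random.seed(2)
theta, n = 4.0, 1000
sample = [random.uniform(0, theta) for _ in range(n)]
theta_hat = max(sample)
print(theta_hat)  # slightly below 4.0; E[theta - theta_hat] = theta/(n+1)
```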
Solution to Problem 8. (a) The probability mass function of b(1, θ) is
$$f(x; \theta) = \theta^x(1 - \theta)^{1-x}, \quad x = 0, 1.$$
Then the log-likelihood is $l(\theta) = \sum_i X_i \log\theta + (n - \sum_i X_i)\log(1 - \theta)$, whose derivative
$$l'(\theta) = \frac{\sum_i X_i}{\theta} - \frac{n - \sum_i X_i}{1 - \theta}$$
is positive for θ < X̄ and negative for θ > X̄, respectively. That is, l(θ) increases before
θ = X̄ and decreases after that. Given 1/3 ≤ θ ≤ 1, l(θ) attains its maximum at X̄ if
X̄ ≥ 1/3, or at 1/3 if X̄ < 1/3. That is, the MLE in this case is max{X̄, 1/3}.
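A small sketch of the constrained MLE max{X̄, 1/3}; the grid search is an illustrative cross-check, not part of the solution:

```python
# Constrained MLE over 1/3 <= theta <= 1 for Bernoulli data.
import math

def constrained_mle(xs):
    xbar = sum(xs) / len(xs)
    return max(xbar, 1 / 3)

def loglik(theta, xs):
    s = sum(xs)
    return s * math.log(theta) + (len(xs) - s) * math.log(1 - theta)

xs = [1, 0, 0, 0, 0]                 # xbar = 0.2 < 1/3
mle = constrained_mle(xs)            # 1/3
# grid search over the constraint set agrees with the closed form
grid = [1/3 + k * (1 - 1/3 - 1e-6) / 1000 for k in range(1001)]
best = max(grid, key=lambda t: loglik(t, xs))
print(mle, round(best, 3))
```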
Problem 9—Stat 481. Consider the regression model that relates gas mileage and weight
of automobiles. Thirty-eight cars were selected, and their weights x (in units of 1,000 pounds)
and fuel efficiencies MPG (miles per gallon) were measured.
(a) The residual plot in Figure 1 was obtained after fitting the simple linear regression
model (Model I) MPG = β0 + β1 x + ε. Discuss whether Model I is appropriate based
on Figure 1. What would you suggest based on Figure 1?
Figure 1: Residual Plot of Model I (Residual vs. Fitted Value)
(b) Define GPM = 100/MPG, that is, gallons per 100 miles. The residual plot in Figure 2
was obtained after fitting the model (Model II) GPM = β0 + β1 x + ε. Discuss whether
Model II is better than Model I based on Figure 2.
Figure 2: Residual Plot of Model II (Residual vs. Fitted Value)
(c) Based on the output, estimate the GPM of cars with a weight of 3,500 pounds. Con-
struct a 95% confidence interval for β0 . What is your conclusion based on your confi-
dence interval? (For your reference, some relevant critical values of t-distributions are
t(0.025; df=36) = 2.03, t(0.05; df=36) = 1.69.)
(d) Would you accept Model II as your final model? What else do you want to do?
Solution to Problem 9. (a) Model I is not appropriate based on Figure 1 because there
is a clear nonlinear pattern in the residual plot. One may consider adding a quadratic term
x2 to the model or using some transformation of MPG instead.
(b) Figure 2 is much better than Figure 1 in the sense that there is little pattern in the
residual plot. We conclude that Model II is better than Model I.
(c) Based on the output, the fitted model is
$$\widehat{GPM} = -0.00623 + 1.515x.$$
At x = 3.5 (3,500 pounds), the estimate of GPM is −0.00623 + 1.515 × 3.5 = 5.30 (gallons
per 100 miles).
Based on the output, β̂0 = −0.00623 and se(β̂0 ) = 0.303. Using the critical value t(0.025;
df=36) = 2.03, a 95% confidence interval for β0 is −0.00623 ± 2.03 × 0.303, or (−0.621, 0.609),
which includes 0. The conclusion based on the confidence interval is that β0 is not signifi-
cantly different from 0 at the 0.05 level.
(d) Based on the output in (c), the intercept (β0 ) can be omitted. One may remove the
intercept term and fit a new model GPM = β1 x + ε instead. This can be done in R by
lm(GPM ~ -1 + x).
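The point estimate and interval in (c) can be reproduced in a few lines; a Python sketch using the values from the output:

```python
# Problem 9(c): prediction at x = 3.5 and 95% CI for beta0.
b0, b1 = -0.00623, 1.515
se_b0, t_crit = 0.303, 2.03          # se(beta0_hat) and t(0.025, df=36)
gpm = b0 + b1 * 3.5                  # estimated GPM at 3,500 pounds
lo, hi = b0 - t_crit * se_b0, b0 + t_crit * se_b0
print(round(gpm, 2), (round(lo, 3), round(hi, 3)))  # 5.3 (-0.621, 0.609)
```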
Problem 10—Stat 481. A study conducted by Baty et al. (2006) aims to measure
the influence of beverage on blood gene expression. In the study, they measured the gene
expression of participants who had 4 different beverages (500mL each: grape juice, red wine,
alcohol and water). Consider a linear regression with the gene expression as the response
and beverage types as predictors. The following output is given by R, where BeverFac is
a categorical variable with level 1 (Alcohol), level 2 (Grape Juice), level 3 (Red wine) and
level 4 (water), and resp is the gene expression response.
> BeverFac<-as.factor(Beverages)
> resp<-averageRFC2
> lm2<-lm(resp~BeverFac)
> summary(lm2)
Call:
lm(formula = resp ~ BeverFac)
Residuals:
Min 1Q Median 3Q Max
-0.198736 -0.062475 0.000081 0.062119 0.301559
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.28993 0.02028 211.525 <2e-16 ***
BeverFac2 -0.04088 0.02840 -1.439 0.153
BeverFac3 -0.03108 0.02790 -1.114 0.268
BeverFac4 0.02851 0.02767 1.030 0.305
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
> vcov(lm2)
(Intercept) BeverFac2 BeverFac3 BeverFac4
(Intercept) 0.0004113169 -0.0004113169 -0.0004113169 -0.0004113169
BeverFac2 -0.0004113169 0.0008068140 0.0004113169 0.0004113169
BeverFac3 -0.0004113169 0.0004113169 0.0007785642 0.0004113169
BeverFac4 -0.0004113169 0.0004113169 0.0004113169 0.0007659005
(a) Let Y be the gene expression and Xi (i = 0, 1, 2, 3) be the dummy variables, respec-
tively, representing Alcohol, Grape Juice, Red Wine and Water. Write down the linear
regression model corresponding to the above R code. Provide the least squares estimate
of the difference between the effects of alcohol and water on the gene expression. Is this
difference significantly different from 0? Please justify your answer using the parameters
defined in your linear regression model.
(b) Give the least squares estimate of the difference between the effects of Red Wine and
Water on the gene expression, and the standard error of this estimate.
(c) Please complete the following ANOVA table using the R output given above.
Response: resp
Degree of freedom Sum of Squares Mean Squares F value Pr(>F)
BeverFac
Residuals
Solution to Problem 10. (a) The linear regression model corresponding to the fitted
model is
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon,$$
where ε has mean zero and constant variance σ 2 . Because the baseline in the linear model is
Alcohol, the difference between the effects of alcohol and water is represented by β3 , the
coefficient of the dummy variable corresponding to water. The least squares estimate of β3
is 0.02851. To decide whether the difference is significantly different from 0, we can test the
null hypothesis H0 : β3 = 0 versus the alternative H1 : β3 ≠ 0. The corresponding test
statistic value is 1.030 with p-value 0.305; since the p-value exceeds 0.05, the difference is
not significantly different from 0.
(b) The difference between the effects of Red Wine and Water on the gene expression is
given by β2 − β3 . The least squares estimate is −0.03108 − 0.02851 = −0.05959, and its
variance is
$$(1, -1)\begin{pmatrix} 0.0007785642 & 0.0004113169 \\ 0.0004113169 & 0.0007659005 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = 0.0007218308,$$
so the standard error is $\sqrt{0.0007218308} \approx 0.0269$.
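The contrast variance is $c^{\top}Vc$ with $c = (1, -1)$; a quick check using the entries of vcov(lm2):

```python
# Variance and standard error of beta2_hat - beta3_hat from vcov(lm2).
v22, v23, v33 = 0.0007785642, 0.0004113169, 0.0007659005
est = -0.03108 - 0.02851            # beta2_hat - beta3_hat
var = v22 - 2 * v23 + v33           # (1, -1) V (1, -1)^T
se = var ** 0.5
print(est, var, round(se, 4))
```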
Problem 11—Stat 481. In the computer science department of a large university, many
students change their major after the first year. A detailed study of the 256 students enrolled
as first-year computer science majors in one year was undertaken to help understand this
phenomenon. Students were classified on the basis of their status at the beginning of their
second year, and several variables measured at the time of their entrance to the university
were obtained. Here are summary statistics for the SAT mathematics scores:
Second-year major n x̄ s
Computer Science 103 619 86
Engineering & other sciences 31 629 67
Other 122 575 83
Assume a fixed, completely randomized design is used.
1. Write down the effect model for the CRD, the assumptions, and any model constraints.
2. Given SSTR = 139,372, calculate SSE and SSTO. Construct the ANOVA table.
3. Given that F0.01 (2, 253) = 4.7, test if there are differences in the SAT scores among
the three groups of students at α = 0.01. What is your conclusion?
4. Find an estimate for the mean difference between the SAT scores of computer science
and engineering students. Then derive its sampling distribution and find its 95%
confidence interval.
1.
$$Y_{ij} = \mu + \tau_i + \epsilon_{ij}, \quad i = 1, 2, 3;\ j = 1, 2, \ldots, n_i.$$
In this case, n1 = 103, n2 = 31, and n3 = 122.
Assumption: i.i.d. errors $\epsilon_{ij} \sim N(0, \sigma^2)$.
Constraint: $\sum_i n_i\tau_i = 0$; in this case, $103\tau_1 + 31\tau_2 + 122\tau_3 = 0$.
2. We know that
$$s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}_i\right)^2 \quad\Rightarrow\quad SSE = \sum_{i=1}^{k} s_i^2 (n_i - 1).$$
Then
Source SS df MS F
Treatment 139372 2 69686 10.23
Error 1722631 253 6808.82
Total 1862003 255
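The ANOVA table can be rebuilt from the group summary statistics alone; a short Python check:

```python
# One-way ANOVA table from n_i, s_i and SSTR = 139,372.
ns = [103, 31, 122]                  # CS, Engineering, Other
sds = [86, 67, 83]
sstr = 139_372
sse = sum((n - 1) * s**2 for n, s in zip(ns, sds))
ssto = sstr + sse
df_tr, df_e = len(ns) - 1, sum(ns) - len(ns)   # 2 and 253
F = (sstr / df_tr) / (sse / df_e)
print(sse, ssto, round(sse / df_e, 2), round(F, 2))  # 1722631 1862003 6808.82 10.23
```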
3. Hypotheses:
$$H_0: \tau_1 = \tau_2 = \tau_3 = 0 \;\Leftrightarrow\; H_0: \mu_1 = \mu_2 = \mu_3$$
versus
$$H_1: \text{at least one } \tau_i \neq 0 \;\Leftrightarrow\; H_1: \text{not all } \mu_i \text{ are equal.}$$
Since F = 10.23 > F0.01 (2, 253) = 4.7, we reject H0 and conclude that there are differences
in the SAT scores among the three groups of students at α = 0.01.
4. We know that $\bar{Y}_{CS} \sim N\left(\mu_{CS}, \frac{\sigma^2}{n_{CS}}\right)$ and $\bar{Y}_E \sim N\left(\mu_E, \frac{\sigma^2}{n_E}\right)$. Then $\bar{Y}_{CS} - \bar{Y}_E$ follows a
normal distribution with mean $\mu_{CS} - \mu_E$ and variance $\sigma^2\left(\frac{1}{n_{CS}} + \frac{1}{n_E}\right)$.
In addition,
$$\frac{\bar{Y}_{CS} - \bar{Y}_E - (\mu_{CS} - \mu_E)}{\sqrt{\sigma^2\left(\frac{1}{n_{CS}} + \frac{1}{n_E}\right)}} \sim N(0, 1) \quad\text{and}\quad \frac{SSE}{\sigma^2} \sim \chi^2(N - k).$$
Based on Student's Theorem, $\bar{Y}_{CS} - \bar{Y}_E$ is independent of $SSE = \sum_{i=1}^{k}(n_i - 1)s_i^2$.
Hence, it has the sampling distribution
$$t = \frac{\bar{Y}_{CS} - \bar{Y}_E - (\mu_{CS} - \mu_E)}{\sqrt{MSE\left(\frac{1}{n_{CS}} + \frac{1}{n_E}\right)}} \sim t(N - k).$$
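Plugging in the numbers gives the estimate and interval; in the sketch below the critical value t(0.025, df=253) ≈ 1.97 is an assumed approximation, since it is not given in the exam:

```python
# Point estimate and approximate 95% CI for mu_CS - mu_E.
import math

ybar_cs, ybar_e, n_cs, n_e = 619, 629, 103, 31
mse = 1_722_631 / 253               # from the ANOVA table
t_crit = 1.97                       # approx t(0.025, df=253); assumed value
diff = ybar_cs - ybar_e             # -10
se = math.sqrt(mse * (1 / n_cs + 1 / n_e))
ci = (diff - t_crit * se, diff + t_crit * se)
print(diff, round(se, 1), tuple(round(c, 1) for c in ci))  # -10 16.9 (-43.3, 23.3)
```

The interval contains 0, so with these numbers the mean SAT scores of the two groups are not significantly different.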
Problem 12—Stat 481. A plant manager wants to investigate the productivity of three
groups of workers: those with little, those with average, and those with considerable expe-
rience. Because productivity depends to some degree on the day-to-day variability of the
available raw materials, which affects all groups in a similar fashion, the manager suspects
that the comparison should be blocked with respect to day. The results from five production
days are as follows:
Day (Block)
Experience Level 1 2 3 4 5 Row Mean
A 53 58 49 52 60 54.4
B 55 57 53 57 64 57.2
C 60 62 55 64 69 62.0
You obtain the SAS table:
1. What design method was used? Write down the model and necessary assumptions and
constraints.
2. Based on the SAS output, are there treatment effects at α = 0.05? If so, between
which groups are there differences (use 95% Tukey Simultaneous Confidence Intervals
where q0.05 (3, 8) = 4.04).
3. Are there block effects at α = 0.05? What is the conclusion you can draw about block
effects?
4. If you ran this design as a completely randomized design, what would be the new
ANOVA table? Is this design as sensitive as using an RCBD at α = 0.05? Given
F0.05 (2, 12) = 3.89, F0.05 (4, 10) = 3.47.
1. A randomized complete block design (RCBD) was used, with experience level as the
treatment and day as the block:
$$Y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij}, \quad i = 1, 2, 3;\ j = 1, \ldots, 5,$$
with i.i.d. errors $\epsilon_{ij} \sim N(0, \sigma^2)$ and constraints $\sum_i \tau_i = 0$, $\sum_j \beta_j = 0$.
2. Hypotheses:
$$H_0: \tau_1 = \tau_2 = \tau_3 = 0 \quad\text{versus}\quad H_1: \text{at least one } \tau_i \neq 0.$$
Decision: Reject H0 .
Conclusion: There is a difference among treatment means.
3. Hypotheses:
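Since the SAS output itself is not reproduced above, the RCBD ANOVA can be rebuilt directly from the data table; this is an illustrative sketch, not the original SAS computation:

```python
# RCBD ANOVA for Problem 12 from the raw data (3 treatments x 5 blocks).
data = {
    "A": [53, 58, 49, 52, 60],
    "B": [55, 57, 53, 57, 64],
    "C": [60, 62, 55, 64, 69],
}
a, b = len(data), 5                          # treatments, blocks (days)
grand = sum(sum(v) for v in data.values()) / (a * b)
trt_means = [sum(v) / b for v in data.values()]
blk_means = [sum(data[k][j] for k in data) / a for j in range(b)]
ss_tr = b * sum((m - grand) ** 2 for m in trt_means)
ss_bl = a * sum((m - grand) ** 2 for m in blk_means)
ss_to = sum((y - grand) ** 2 for v in data.values() for y in v)
ss_e = ss_to - ss_tr - ss_bl
ms_tr = ss_tr / (a - 1)
ms_bl = ss_bl / (b - 1)
ms_e = ss_e / ((a - 1) * (b - 1))
print(round(ss_tr, 2), round(ss_bl, 2), round(ss_e, 2))  # 147.73 231.73 24.27
print(round(ms_tr / ms_e, 2), round(ms_bl / ms_e, 2))    # F for treatment, block
```

The large treatment F ratio is consistent with the decision above to reject H0 for treatment effects.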