KARATINA UNIVERSITY

UNIVERSITY EXAMINATIONS
2015/2016 ACADEMIC YEAR
THIRD YEAR SUPPLEMENTARY EXAMINATION
FOR THE DEGREE OF
BACHELOR OF SCIENCE IN
APPLIED STATISTICS WITH COMPUTING
COURSE CODE: STA 328

COURSE TITLE: APPLIED REGRESSION ANALYSIS II

DATE: TIME:

INSTRUCTIONS TO CANDIDATES

• SEE INSIDE

Page 1 of 12
1. The paper comprises 7 questions.
2. Attempt Questions 1 and 2 (compulsory) and any 3 other questions (13 marks each).
3. Electronic scientific calculators may be used.

4. Statistical tables are provided.

5. Where level of significance is not stated, use 𝛼 = 0.05.

6. Observe further instructions on the answer booklet.

Question 1 [15 marks]

(a) Define extra sum of squares. [1 mark]


(b) An educator examined the relationship between number of hours devoted to
reading each week (Y) and the independent variables social class (X1), number
of years school completed (X2), and reading speed measured by pages read per
hour (X3). The analysis of variance table obtained from a stepwise regression
analysis on data for a sample of 19 women over the age of 60 is shown.

Source                    DF   Sum of squares
Regression: X3             1       1058.628
            X2 | X3        1        183.743
            X1 | X2, X3    1         37.982
Residual                  15        363.300
Total                     18       1643.653

(i) Test the significance of each variable as it enters the model. [6 marks]
(ii) Test H0: β1 = β2 = 0 in the model. [2 marks]

(iii) Why can’t we test H0: β1 = β3 = 0 using the ANOVA table given? What
formula would you use for this test? [2 marks]
(iv) What is your overall evaluation concerning the appropriate model to use
given the results in parts (i) and (ii)? [1 mark]
(c) Consider the model

$Y_i = \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$, where $\varepsilon \sim N(0, I\sigma^2)$ in matrix form and the errors are uncorrelated.

Find the least squares estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ using the matrix method. [3 marks]
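
A minimal sketch of the arithmetic behind Question 1(b)(i) and (ii), assuming each sequential term is tested against the residual mean square of the full three-variable model. The helper name `sequential_f` is hypothetical, and the critical values would still be read from the statistical tables provided.

```python
def sequential_f(extra_ss, sse, df_error):
    """Return F = (extra SS / 1) / MSE for each single-df sequential term."""
    mse = sse / df_error
    return {term: ss / mse for term, ss in extra_ss.items()}

# Values copied from the ANOVA table in Question 1(b).
extra_ss = {"X3": 1058.628, "X2|X3": 183.743, "X1|X2,X3": 37.982}
f_stats = sequential_f(extra_ss, sse=363.300, df_error=15)

for term, f in f_stats.items():
    print(f"{term}: F = {f:.3f} on (1, 15) df")

# Joint test of H0: beta1 = beta2 = 0 given X3 (2 numerator df).
f_joint = ((183.743 + 37.982) / 2) / (363.300 / 15)
print(f"X1, X2 | X3: F = {f_joint:.3f} on (2, 15) df")
```

Each statistic is compared with the tabulated F value at the chosen significance level; the joint statistic pools the two extra sums of squares because both coefficients are tested at once.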

Question 2 [16 marks]

(a) Describe properties and uses of logistic regression model. [4 marks]


(b) When modeling the relationship between a numerical predictor variable and
a binary response, why is the logit transformation a good idea? [1 mark]

(c) The usual assumptions placed on the error terms in ordinary least squares
regression are:
• Independently distributed
• Identically distributed (equal variance)
• Normally distributed
Which of these assumptions are violated when dealing with binary response
data? Explain briefly how each is violated. [3 marks]
(d) The results below are the estimated coefficients for a multiple logistic regression
model using the variables AGE, weight at last menstrual period (LWT) and
number of first trimester physician visits (FTV) from a given data set.

Variable    Estimated coefficient   Std. error
AGE                 -0.024          0.035
LWT                 -0.014          0.652E-02
FTV                 -0.049          0.167
Constant             1.295          1.069

Log-likelihood = -111.286
(i) Write down a multiple logistic regression model for the above case and
interpret it. [4 marks]

(ii) State the corresponding logit expression. [1 mark]
(iii) If the log-likelihood of the model after excluding AGE and FTV is
-111.630, test whether or not it is advantageous to include these two covariates
in our model. [3 marks]
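
A minimal sketch of the likelihood ratio calculation that Question 2(d)(iii) calls for, assuming the reduced model is nested in the full model and differs by the two parameters for AGE and FTV. The function name `lr_test_df2` is hypothetical; it uses the closed-form upper tail of the chi-square distribution with 2 degrees of freedom.

```python
import math

def lr_test_df2(loglik_full, loglik_reduced):
    """Likelihood ratio statistic and p-value when 2 parameters are dropped."""
    g = -2.0 * (loglik_reduced - loglik_full)
    # For a chi-square variable with 2 df, P(X > g) = exp(-g / 2) exactly.
    p_value = math.exp(-g / 2.0)
    return g, p_value

# Log-likelihoods copied from Question 2(d).
g, p_value = lr_test_df2(loglik_full=-111.286, loglik_reduced=-111.630)
print(f"G = {g:.3f} on 2 df, p-value = {p_value:.3f}")
```

The statistic can equally be compared with the tabulated chi-square critical value on 2 degrees of freedom.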

Question 3 [13 marks]

(a) In the context of regression models, define “polynomial regression”. [1 mark]


(b) Why is a polynomial regression useful? [2 marks]
(c) A physiologist wants to investigate the impact of exercise on the human immune
system. The physiologist theorizes that the amount of immunoglobulin y in blood
(called IgG, an indicator of long-term immunity) is related to the maximal
oxygen uptake x (a measure of aerobic fitness level) of a person by the model
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$

To fit the model, values of y and x were measured for each of 30 human subjects.
Denoting the amount of immunoglobulin by IGG and the maximal oxygen
uptake by MAXOXY, the following are R-program outputs:

Output 1: Scatterplot

Output 2
Call:
lm(formula = IGG ~ MAXOXY)

Residuals:
     Min       1Q   Median       3Q      Max
 -228.16   -79.96   -11.78    83.75   211.93

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -100.345     100.450   -0.999     0.326
MAXOXY         32.743       1.932   16.947  2.97e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 124.8 on 28 degrees of freedom


Multiple R-squared: 0.9112, Adjusted R-squared: 0.908
F-statistic: 287.2 on 1 and 28 DF, p-value: 2.973e-16

Output 3
Analysis of Variance Table

Response: IGG

          Df  Sum Sq  Mean Sq  F value     Pr(>F)
MAXOXY     1 4472047  4472047   287.21  2.973e-16 ***
Residuals 28  435982    15571
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Output 4
Analysis of Variance Table

Response: IGG
          Df  Sum Sq  Mean Sq  F value     Pr(>F)
MAXOXY     1 4472047  4472047  394.827  < 2.2e-16 ***
MAXOXY2    1  130164   130164   11.492   0.002165 **
Residuals 27  305818    11327
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Output 5
Call:
lm(formula = IGG ~ MAXOXY + MAXOXY2)

Residuals:
     Min       1Q   Median       3Q      Max
-185.375  -82.129    1.047   66.007  227.377

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -1464.4042    411.4012   -3.560   0.00140 **
MAXOXY         88.3071     16.4735    5.361  1.16e-05 ***
MAXOXY2        -0.5362      0.1582   -3.390   0.00217 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 106.4 on 27 degrees of freedom


Multiple R-squared: 0.9377, Adjusted R-squared: 0.9331
F-statistic: 203.2 on 2 and 27 DF, p-value: < 2.2e-16

Use the accompanying R-program outputs to answer the following questions.


(i) According to the scatterplot for the data, is there evidence to support the use of a
quadratic model? [1 mark]
(ii) Write the quadratic model. [2 marks]

(iii) Interpret the $\beta$ estimates. [3 marks]
(iv) Is the overall model useful for predicting IgG (y)? Use $\alpha = 0.01$. [2 marks]
(v) Is there sufficient evidence of concave downward curvature in the immunity-fitness
level relationship? Use $\alpha = 0.01$. [2 marks]
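
A minimal sketch of how the statistics in Outputs 4 and 5 relate to one another, assuming the sums of squares and estimates are copied exactly as printed. It recomputes the partial F for the quadratic term, the overall F, and the t statistic for MAXOXY2; small differences from the printed values come from rounding in the output.

```python
# Values copied from Output 4 (sequential ANOVA) and Output 5 (coefficients).
ss_maxoxy, ss_maxoxy2 = 4472047.0, 130164.0
sse, df_error = 305818.0, 27

mse = sse / df_error
f_quadratic = ss_maxoxy2 / mse                     # partial F for MAXOXY2 given MAXOXY
f_overall = ((ss_maxoxy + ss_maxoxy2) / 2) / mse   # overall F on (2, 27) df
t_quadratic = -0.5362 / 0.1582                     # estimate / standard error

print(f"partial F = {f_quadratic:.2f}, t^2 = {t_quadratic ** 2:.2f}")
print(f"overall F = {f_overall:.1f}")
```

Because the partial F for a single added term equals the square of its t statistic, either output can be used for part (v). For the one-sided alternative of downward curvature, the two-sided p-value is halved when the estimate is negative, as it is here.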

Question 4 [13 marks]

Consider the following error sums of squares (SSE):

SSE(X2) = 4024.559
SSE(X1, X2) = 2063.998
SSE(X1, X2, X3) = 2011.584

The total sum of squares SSTO is equal to 6145.217. The values are obtained from the full
regression model
$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i, \quad i = 1, 2, \ldots, 23$

(a) Obtain the analysis of variance table that decomposes the regression sum of squares
into extra sums of squares associated with X2; with X1 given X2; and with X3 given
X1 and X2. [7 marks]
(b) Test whether X3 can be dropped from the regression model given that X1 and X2
are retained. Use $\alpha = 0.05$. [3 marks]
(c) Compute the coefficient of partial determination $r^2_{3.12}$. Comment on the
reduction in error sum of squares. [3 marks]
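
A minimal sketch of the decomposition asked for in Question 4, assuming the full model has 4 parameters (an intercept and three slopes), so that the error degrees of freedom are 23 - 4 = 19. The function name `extra_ss_decomposition` is hypothetical.

```python
def extra_ss_decomposition(ssto, sse_x2, sse_x1x2, sse_full):
    """Split the regression sum of squares into sequential extra sums of squares."""
    return {
        "X2": ssto - sse_x2,
        "X1|X2": sse_x2 - sse_x1x2,
        "X3|X1,X2": sse_x1x2 - sse_full,
    }

n, p = 23, 4  # 23 observations; intercept plus three slopes (assumed)
ssto, sse_x2, sse_x1x2, sse_full = 6145.217, 4024.559, 2063.998, 2011.584

extra = extra_ss_decomposition(ssto, sse_x2, sse_x1x2, sse_full)
mse = sse_full / (n - p)
f_x3 = extra["X3|X1,X2"] / mse              # partial F on (1, 19) df
r2_partial = extra["X3|X1,X2"] / sse_x1x2   # partial determination for X3

for term, ss in extra.items():
    print(f"SSR({term}) = {ss:.3f}")
print(f"MSE = {mse:.3f}, F = {f_x3:.3f}, partial r^2 = {r2_partial:.4f}")
```

The three extra sums of squares add up to the regression sum of squares SSTO - SSE(X1, X2, X3), which gives a quick check on the table.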

Question 5 [13 marks]

An experiment is conducted on the toxicity of doses of an insecticide on the tobacco
budworm moth. In the experiment, batches of 20 male moths were exposed for 3 days to
the insecticide and the number in each batch that were dead or knocked down was
recorded. The data are given below.

Dose Level          1    2    4    8   16   32
log2(Dose Level)    0    1    2    3    4    5
Dead or down        1    4    9   13   18   20

(a) In general, describe the relationship between the dose level and the proportion of
male moths dead or down. [1 mark]
(b) What is the observed proportion of dead or down at dose level 16? [1 mark]
(c) What are the observed odds of dead or down at dose level 16? [1 mark]

A logistic regression of the proportion of dead or down on the log2(Dose Level) is run.
Below is the summary from R-program.

Output 6

Coefficients:
                 Value  Std. Error    z value
(Intercept)  -2.818555   0.5479524  -5.143796
log2dose      1.258949   0.2120484   5.937086

Null Deviance: 71.13758 on 5 degrees of freedom


Residual Deviance: 1.88097 on 4 degrees of freedom

(d) Give the logistic regression equation. [1 mark]


(e) Use the equation in (d) to predict the log-odds for a dose of 16. [2 marks]
(f) What are the predicted odds? What is the predicted proportion of dead or down
male moths? [2 marks]
(g) If we increase the dose from 16 to 32, by what multiple will the predicted odds
increase? [3 marks]

(h) According to the plot, if you wish to kill or knock down 50% of the males what dose
should you use? [1 mark]

(i) According to the plot, if you wish to kill or knock down 50% of the females what
dose should you use? [1 mark]
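
A minimal sketch of the calculations in Question 5(b) to (g), assuming the fitted model is linear in log2(dose) with the estimates printed in Output 6. The function name `predicted_log_odds` is hypothetical.

```python
import math

b0, b1 = -2.818555, 1.258949  # estimates copied from Output 6

def predicted_log_odds(dose):
    """Linear predictor evaluated on the log2(dose) scale."""
    return b0 + b1 * math.log2(dose)

# Observed values at dose level 16: 18 of 20 moths dead or down.
observed_proportion = 18 / 20
observed_odds = 18 / (20 - 18)

# Predicted values at dose level 16.
log_odds = predicted_log_odds(16)
odds = math.exp(log_odds)
proportion = odds / (1.0 + odds)

# Doubling the dose raises log2(dose) by 1, so the odds are multiplied by exp(b1).
odds_multiplier = math.exp(b1)

print(f"observed: proportion = {observed_proportion:.2f}, odds = {observed_odds:.1f}")
print(f"predicted: log-odds = {log_odds:.3f}, odds = {odds:.3f}, proportion = {proportion:.3f}")
print(f"odds multiplier per doubling of dose = {odds_multiplier:.3f}")
```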

Question 6 [13 marks]

Consider N independent binary random variables, Y1, Y2, ..., YN such that

P(Yi=1)= 𝜋𝑖 and P(Yi=0)=1-𝜋𝑖

The probability function of Yi can be written as

$\pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}$,
where $Y_i = 0$ or $1$.

(a) Show that this probability function belongs to the exponential family of
distributions. [2 marks]
(b) Show that E(Yi)= 𝜋𝑖 [3 marks]
(c) If the link function is defined as

$g(\pi) = \log\left(\frac{\pi}{1 - \pi}\right)$, where $\pi = \frac{\exp(X^T \beta)}{1 + \exp(X^T \beta)}$,

show that this is equivalent to modeling the probability 𝜋 as

$\log\left(\frac{\pi}{1 - \pi}\right) = X^T \beta$. [3 marks]

(d) Consider the particular case where $X^T \beta = \beta_1 + \beta_2 X$, which gives

$\pi = \frac{\exp(\beta_1 + \beta_2 X)}{1 + \exp(\beta_1 + \beta_2 X)}$,

which is the logistic function.
(i) Sketch the graph of 𝜋 against x in this case, taking $\beta_1$ and $\beta_2$ as constants.
[2 marks]
(ii) How would you interpret this if x is the dose of an insecticide and 𝜋 the
probability of an insect dying? [3 marks]
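
A minimal numerical sketch of the logistic function in Question 6(d) and its inverse, the logit link. The coefficient values `beta1 = -3.0` and `beta2 = 1.5` are hypothetical, chosen only to show the S-shaped rise in 𝜋 when the slope is positive.

```python
import math

def logistic(eta):
    """Map a linear predictor eta to a probability in (0, 1)."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(pi):
    """Log-odds of a probability; the inverse of logistic()."""
    return math.log(pi / (1.0 - pi))

beta1, beta2 = -3.0, 1.5  # hypothetical constants, not taken from the paper

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
probs = [logistic(beta1 + beta2 * x) for x in xs]

for x, pi in zip(xs, probs):
    print(f"x = {x:.1f}: pi = {pi:.3f}, logit(pi) = {logit(pi):.3f}")
```

With these values the curve passes through 𝜋 = 0.5 at x = -beta1 / beta2 = 2, the point where the linear predictor is zero.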

Question 7 [13 marks]

Each autumn, individuals (especially older persons or the chronically ill) are
encouraged to get a flu shot. Fifty persons are selected at random from a health clinic
client list and asked if they actually went to get a flu shot. A client who got a flu shot
has a response of Y = 1; if no flu shot, the response is Y = 0. Other data collected were
age (Age) and health awareness (Aware), for which higher values indicate greater
awareness.
Simple logistic regressions were run on Age and Aware separately.

Output 7

Coefficients:
                Value  Std. Error
(Intercept)  -6.57492     2.12560
Age           0.13302     0.04439
Null Deviance: 68.03 on 49 degrees of freedom
Residual Deviance: 56.08 on 48 degrees of freedom

Output 8

Coefficients:
                Value  Std. Error
(Intercept)  -7.39019     2.09332
Aware         0.13486     0.03884
Null Deviance: 68.03 on 49 degrees of freedom

Residual Deviance: 49.28 on 48 degrees of freedom

(a) In the model with Age alone, is there a significant lack of fit? Is the variable Age
significant? Use tests based on deviances to support your answers. [4 marks]
(b) If you were to pick only one variable, Age or Aware, to model the binary
response of whether a client received a flu shot, which one would you choose?
Support your choice statistically. [3 marks]
A multiple logistic regression was run with both Age and Aware in the model.

Output 9

Coefficients:
                 Value  Std. Error
(Intercept)  -21.58213     6.33966
Age            0.22175     0.07360
Aware          0.20348     0.06206
Null Deviance: 68.03 on 49 degrees of freedom
Residual Deviance: 32.42 on 47 degrees of freedom
(c) What is the z-test statistic for the variable Age in this multiple logistic regression
analysis? Is it statistically significant? [3 marks]
(d) Use the change in deviance to test the significance of adding the variable Aware
to the simple model using Age. [3 marks]
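
A minimal sketch of the change-in-deviance and z calculations used in Question 7, assuming each comparison involves nested models that differ by one parameter, so the drop in deviance is referred to a chi-square distribution with 1 degree of freedom. The helper name `chi2_sf_1df` is hypothetical.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    # A chi-square(1) variable is a squared standard normal, so
    # P(Z^2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

# Deviances copied from Outputs 7, 8 and 9.
null_dev, dev_age, dev_aware, dev_both = 68.03, 56.08, 49.28, 32.42

drops = {
    "Age vs null": null_dev - dev_age,
    "Aware vs null": null_dev - dev_aware,
    "Aware added to Age": dev_age - dev_both,
}
for label, drop in drops.items():
    print(f"{label}: change in deviance = {drop:.2f}, p-value = {chi2_sf_1df(drop):.2e}")

# Wald z statistic for Age in the two-variable model (Output 9).
z_age = 0.22175 / 0.07360
print(f"z for Age = {z_age:.3f}")
```

A larger drop in deviance from the null model indicates the single predictor that explains more, which is one statistical basis for the choice in part (b).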
