
Regression Analysis - STAT510

Dr. Xiyue Liao

Department of Mathematics and Statistics

California State University, Long Beach


Review of the Previous Class

Polynomial regression: another transformation - One way to try to account for a nonlinear relationship is through a polynomial regression model. Generally, such a model for a single predictor, x, is:

Yi = β0 + β1 xi + β2 xi² + ... + βh xi^h + εi

where h is called the degree of the polynomial.

I For lower degrees, the relationship has a specific name (e.g., h = 2 is called quadratic, h = 3 is called cubic, h = 4 is called quartic, and so on).
I Hierarchy principle: if your model includes x^h and x^h is shown to be a statistically significant predictor of Y, then your model should also include each x^j for all j < h, whether or not the coefficients for these lower-order terms are significant.
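To make this concrete, here is a minimal R sketch (simulated data, with all object names made up for illustration) of fitting a quadratic model that respects the hierarchy principle by keeping the lower-order term x:

# Simulate data with a quadratic trend and fit y ~ x + x^2
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(50)
fit_quad <- lm(y ~ x + I(x^2))   # I() protects the power inside the formula
summary(fit_quad)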
Multiple Linear Regression Model
A population model for a multiple linear regression model that relates a
response Y to (p − 1) x variables is written as

Yi = β0 + β1 xi1 + β2 xi2 + ... + βp−1 xi(p−1) + εi

I We assume that the εi are independent and have a normal


distribution with mean 0 and constant variance σ 2 .
I The subscript i refers to the i th individual or unit in the population,
and for the x variables, the subscript following i simply denotes
which x variable it is.
I The word “linear” in “multiple linear regression” refers to the fact
that the model is linear in the parameters β0 , β1 , . . . , βp−1 .
I For example, if a model is Yi = β0 + β1 xi + β2 xi2 + εi , it is still a
“multiple linear regression” model, though the highest power for x is
two.
Coefficient of Determination, R², and Adjusted R²
I As in simple linear regression,

R² = SSR/SSTO = 1 − SSE/SSTO

and represents the proportion of variation in Y “explained” by the multiple linear regression model with predictors x1, x2, . . . , xp−1.
I However, R 2 always increases (or stays the same) as more predictors
are added to a multiple linear regression model, even if the predictors
added are unrelated to the response variable.
I An alternative measure,

adjusted R² = 1 − (1 − R²) (n − 1)/(n − p)
does not necessarily increase as more predictors are added, and can
be used to help us identify which predictors should be included in a
model and which should be excluded.
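Here is a minimal R sketch (simulated data, hypothetical names) that computes R² and adjusted R² directly from SSE and SSTO; the results can be checked against the values reported by summary():

# Simulated example: x2 is unrelated to y
set.seed(2)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)
p <- length(coef(fit))                 # number of parameters (here 3)
SSE <- sum(resid(fit)^2)
SSTO <- sum((y - mean(y))^2)
R2 <- 1 - SSE / SSTO
adjR2 <- 1 - (1 - R2) * (n - 1) / (n - p)
c(R2 = R2, adjR2 = adjR2)              # compare with summary(fit)$r.squared and $adj.r.squared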
Significance Testing of Each Variable

Within a multiple regression model, we may want to know whether a particular x variable is making a useful contribution to the model. For instance, suppose that we have three x variables in the model:

Yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + εi
I As an example, to determine whether variable x1 is a useful predictor
variable in this model, we could test

H0 : β1 = 0 vs H1 : β1 ≠ 0

I When we cannot reject the null hypothesis, we should say that we do not need variable x1 in the model, given that variables x2 and x3 will remain in the model.
I To carry out each test, the t-statistic is still calculated as

t* = (sample estimate − hypothesized value) / (standard error of sample estimate)

I For the previous example, the t-statistic is

t* = (b1 − 0)/se(b1) = b1/se(b1)

I Multiple linear regression, in contrast to simple linear regression, involves multiple predictors, so testing each variable can quickly become complicated (the R sketch below shows where these per-variable t-tests appear in R output).
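As a quick illustration on simulated data (all names hypothetical), the per-variable t-tests described above are reported in the coefficient table returned by summary():

# Three predictors; only x1 is actually related to y
set.seed(4)
d <- data.frame(x1 = rnorm(25), x2 = rnorm(25), x3 = rnorm(25))
d$y <- 1 + 0.5 * d$x1 + rnorm(25)
summary(lm(y ~ x1 + x2 + x3, data = d))$coefficients
# Each "t value" is the estimate divided by its standard error, i.e. b_k / se(b_k)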
Matrix Formulation of the Multiple Regression Model

I In the multiple regression setting, because of the potentially large number of predictors, it is more efficient to use matrices to define the regression model and the subsequent analyses.
I Consider the following simple linear regression function

Yi = β0 + β1 xi + εi , i = 1, . . . , n

I If we actually let i = 1, ..., n, we see that we obtain n equations

Y1 = β0 + β1 x1 + ε1
Y2 = β0 + β1 x2 + ε2
⋮
Yn = β0 + β1 xn + εn
I We can formulate the above simple linear regression function in
matrix notation:

[Y1]   [1  x1]          [ε1]
[Y2]   [1  x2]   [β0]   [ε2]
[ ⋮] = [⋮   ⋮]   [β1] + [ ⋮]        (1)
[Yn]   [1  xn]          [εn]

I That is, instead of writing out the n equations, using matrix notation our simple linear regression function reduces to a short and simple statement:

Y = Xβ + ε
Least Squares Estimates in Matrix Notation
I X is an n × 2 matrix.
I Y is an n × 1 column vector, β is a 2 × 1 column vector, and ε is an n × 1 column vector.
I The matrix X and vector β are multiplied together using the
techniques of matrix multiplication.
I The vector Xβ is added to the vector ε using the techniques of
matrix addition.

We can get the least squares estimates b0 and b1 using matrix notation:

b = [b0] = (X′X)⁻¹ X′Y        (2)
    [b1]
where

I X is an n × 2 matrix.
I X′ is the transpose of the X matrix.
I (X′X)⁻¹ is the inverse of the X′X matrix.
I Y is an n × 1 column vector.
In simple linear regression, X′X is a 2 × 2 matrix:

X′X = [  n      Σ xi  ]
      [ Σ xi    Σ xi² ]

and X′Y is a 2 × 1 column vector:

X′Y = [ Σ yi    ]
      [ Σ xi yi ]

where each sum runs over i = 1, . . . , n.

It can be shown that the two elements of the 2 × 1 column vector (X′X)⁻¹X′Y are the least squares estimates b0 and b1 we have already seen, i.e.,

b1 = Sxy/Sxx  and  b0 = Ȳ − b1 x̄.
Generally, for a multiple linear regression model:

Yi = β0 + β1 xi1 + β2 xi2 + ... + βp−1 xi(p−1) + εi .

The p × 1 vector containing the estimates of the p parameters of the regression function can be shown to equal

    [  b0  ]
b = [  b1  ] = (X′X)⁻¹ X′Y
    [   ⋮  ]
    [ bp−1 ]

(a numerical check in R follows the list below)

and
I X is an n × p matrix.
I X′ is the transpose of the X matrix.
I (X′X)⁻¹ is the inverse of the X′X matrix.
I Y is an n × 1 column vector.
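A small numerical check in R (simulated data, made-up names) of the matrix formula against lm() and against the simple-linear-regression formulas b1 = Sxy/Sxx and b0 = Ȳ − b1 x̄:

# Simulate a simple linear regression and compute b three ways
set.seed(3)
n <- 30
x <- runif(n, 0, 5)
y <- 1 + 0.8 * x + rnorm(n, sd = 0.3)
X <- cbind(1, x)                         # n x 2 design matrix
b <- solve(t(X) %*% X, t(X) %*% y)       # (X'X)^{-1} X'Y
drop(b)
coef(lm(y ~ x))                          # same two numbers
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
c(b0 = mean(y) - (Sxy / Sxx) * mean(x), b1 = Sxy / Sxx)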
Topics in Today’s Class

I Some research questions answered by multiple linear regression
I Introduction to general linear F -test
I Sequential sum of squares
I R examples
What Types of Questions can We Answer by Multiple
Linear Regression?

An Example: Heart attacks in rabbits.


I When heart muscle is deprived of oxygen, the tissue dies, leading to a heart attack.
I Some researchers hypothesized that cooling the heart would be
effective in reducing the size of the heart attack even if it takes
place after the blood flow becomes restricted.
I To investigate their hypothesis, the researchers conducted an
experiment on 32 completely sedated rabbits that were
subjected to a heart attack.
The researchers established three experimental groups:
I Early cooling
I Late cooling
I No cooling
At the end of the experiment, the researchers measured the size of
the infarcted (i.e., damaged) area (in grams) in each of the 32
rabbits. But, as you can imagine, there is great variability in the size
of hearts.
Therefore, in order to adjust for differences in heart sizes, the
researchers also measured the size of the region at risk for infarction
(in grams) in each of the 32 rabbits.
The researchers’ primary research question was:

I Does the mean size of the infarcted area differ among the three
treatment groups – no cooling, early cooling, and late cooling – when
controlling for the size of the region at risk for infarction?
I If we translate this question into a model, then it is:

Yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + εi

where:
I Yi is the size of the infarcted area (in grams) of rabbit i
I xi1 is the size of the region at risk (in grams) of rabbit i
I xi2 = 1 if early cooling of rabbit i, 0 if not
I xi3 = 1 if late cooling of rabbit i, 0 if not

and the independent error terms εi follow a normal distribution with mean
0 and equal variance σ 2 .
Categorical Variable

I The predictors x2 and x3 are “indicator variables” that translate the categorical information on the experimental group to which a rabbit belongs into a usable form.
I For “early cooling” rabbits x2 = 1 and x3 = 0
I For “late cooling” rabbits x2 = 0 and x3 = 1
I For “no cooling” rabbits x2 = 0 and x3 = 0
Simple Examples about Categorical Variable

I For a binary variable, e.g., gender, we can use an “indicator” and code it as x = 1 for female and x = 0 for male.
I So for a variable with two possible categories, we only need one indicator x. Its value is 1 or 0.
I What about a variable with three possible categories? For example, for degree of pain (mild, moderate, severe) we need two indicators: for the mild condition, x1 = 1 and x2 = 0; for the moderate condition, x1 = 0 and x2 = 1; for the severe condition, x1 = 0 and x2 = 0.
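A toy R sketch (made-up data) of how a three-level categorical variable is turned into two indicator columns; listing “severe” first makes it the baseline, matching the coding above:

# R builds the indicator columns automatically from a factor
pain <- factor(c("mild", "moderate", "severe", "mild"),
               levels = c("severe", "mild", "moderate"))
model.matrix(~ pain)
# The columns painmild and painmoderate play the roles of x1 and x2:
# mild -> (1, 0), moderate -> (0, 1), severe -> (0, 0)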
The model can therefore be simplified for each of the three experimental
groups:

I For “early cooling” rabbits

Yi = β0 + β1 xi1 + β2 + εi

I For “late cooling” rabbits

Yi = β0 + β1 xi1 + β3 + εi

I For “no cooling” rabbits

Yi = β0 + β1 xi1 + εi
I Thus, β2 represents the difference in mean size of the infarcted
area – controlling for the size of the region at risk – between
“early cooling” and “no cooling” rabbits.
I β3 represents the difference in mean size of the infarcted area –
controlling for the size of the region at risk – between “late
cooling” and “no cooling” rabbits.
Fitting the model to the rabbits’ data, the summary table in R is
summary(fit)

##
## Call:
## lm(formula = Infarcted ~ Area + X2 + X3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29410 -0.06511 -0.01329 0.07855 0.35949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.13454 0.10402 -1.293 0.206459
## Area 0.61265 0.10705 5.723 3.87e-06 ***
## X2 -0.24348 0.06229 -3.909 0.000536 ***
## X3 -0.06566 0.06507 -1.009 0.321602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1395 on 28 degrees of freedom
## Multiple R-squared: 0.6377, Adjusted R-squared: 0.5989
## F-statistic: 16.43 on 3 and 28 DF, p-value: 2.363e-06

The regression equation is:

Infarcted = −0.135 + 0.613 Area − 0.244x2 − 0.066x3


Consider a plot of the data adorned with the estimated regression equation (not reproduced here).
I The plot suggests that, as we’d expect, as the size of the area at risk
increases, the size of the infarcted area also tends to increase.
I The plot also suggests that for this sample of 32 rabbits with a given
size of area at risk, 1.0 gram say, the average size of the infarcted
area differs for the three experimental groups.

As always, the researchers aren’t just interested in this sample. They want
to be able to answer their research question for the whole population of
rabbits.
Recall that the research question is: Does the mean size of the infarcted
area differ among the three treatment groups – no cooling, early cooling,
and late cooling – when controlling for the size of the region at risk for
infarction?
How Could the Researchers Use the Above Regression
Model to Answer Their Research Question?
I Note that the estimated slope coefficients are b2 = −0.2435 and
b3 = −0.0657. If the estimated coefficients b2 and b3 were instead
both 0, then the average size of the infarcted area would be the same
for the three groups of rabbits in this sample.
I If the two slopes β2 and β3 simultaneously equal 0, then the mean
size of the infarcted area would be the same for the whole population
of rabbits – controlling for the size of the region at risk.
I That is, the researchers’s question reduces to testing the hypothesis

H0 : β2 = β3 = 0 vs H1 : At least one βk ≠ 0 (k = 2, 3)

In this case, the researchers are interested in testing whether a subset (more than one, but not all) of the slope parameters are simultaneously zero.
We will learn a general linear F -test for testing such a hypothesis.
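As a preview, such a test can be carried out in R with a full-versus-reduced model comparison. The sketch below assumes the rabbit data behind the earlier summary(fit) output are stored in a data frame named rabbit (a name made up here) with columns Infarcted, Area, X2, and X3:

# Partial (general linear) F-test of H0: beta2 = beta3 = 0
fit_reduced <- lm(Infarcted ~ Area, data = rabbit)            # model under H0
fit_full    <- lm(Infarcted ~ Area + X2 + X3, data = rabbit)  # full model
anova(fit_reduced, fit_full)   # F compares SSE(R) and SSE(F)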
Another Research Question

Consider the research question: “Is a regression model containing at least one predictor useful in predicting the size of the infarct?”

H0 : β1 = β2 = β3 = 0 and H1 : At least one βk ≠ 0 (k = 1, 2, 3)

In this case, the researchers are interested in testing whether all three slope parameters are zero.
We’ll soon see that this null hypothesis is tested using the analysis of variance F-test (this is the overall F-statistic reported at the bottom of the summary(fit) output shown earlier: F = 16.43 on 3 and 28 DF).
A Final Research Question

Consider the research question: “Is the size of the infarct significantly (linearly) related to the area of the region at risk?”

H0 : β1 = 0 and H1 : β1 ≠ 0

In this case, the researchers are interested in testing whether just one of the three slope parameters is zero. Wouldn’t this just involve performing a t-test for β1?
We’ll soon learn how to think about the t-test for a single slope
parameter in the multiple regression framework.
The General Linear F -Test

The “general linear F -test” involves three basic steps


I Define a larger full model. (Model with more parameters.)
I Define a smaller reduced model. (Model with fewer
parameters.)
I Use an F -statistic to decide whether or not to reject the
smaller reduced model in favor of the larger full model.
As you can see by the wording of the third step, the null hypothesis
always pertains to the reduced model, while the alternative
hypothesis always pertains to the full model.
The Full Model

I First, let’s go back to what we know, namely the simple linear regression model, to learn about the general linear F -test.
I The “full model,” which is also sometimes referred to as the
“unrestricted model,” is the model thought to be most
appropriate for the data. For simple linear regression, the full
model is:
Yi = β0 + β1 xi1 + εi
The Reduced Model

I The “reduced model,” which is sometimes also referred to as the “restricted model,” is the model described by the null hypothesis H0.
I For simple linear regression, a common null hypothesis is
H0 : β1 = 0, and the reduced model is

Yi = β0 + εi

This reduced model suggests that each response Yi is a function only of some overall mean, β0, and some error εi.
Review of SSTO Decomposition
I Regression sum of squares (a component that is due to the change in x):

SSR = Σᵢ (Ŷi − Ȳ)²

I Error sum of squares (a component that is just due to random error):

SSE = Σᵢ (Yi − Ŷi)²

I Total sum of squares:

SSTO = Σᵢ (Yi − Ȳ)²

where each sum runs over i = 1, . . . , n.

If the regression sum of squares is a “large” component of the total sum of squares, it suggests that there is a linear association between the predictor x and the response Y.
The General Linear F -Test

I How do we decide if the reduced model or the full model does a better job of describing the trend in the data when it can’t be determined by simply looking at a plot?
I What we need to do is to quantify how much error remains after
fitting each of the two models to our data.

That is, we take the general linear F -test approach:

I “Fit the full model” to the data: Obtain the least squares estimates
of β0 and β1 . Determine the error sum of squares, which we denote
“SSE (F ).”
I “Fit the reduced model” to the data. Obtain the least squares
estimate of β0 . Determine the error sum of squares, which we denote
“SSE (R).”
Where are we going with this general linear F -test approach? In short:

I The general linear F -test involves a comparison between SSE (R) and
SSE (F ).
I SSE (R) can never be smaller than SSE (F ). It is always larger than
(or possibly the same as) SSE (F ).

1. If SSE (F ) is close to SSE (R), then the variation around the estimated full model regression function is almost as large as the variation around the estimated reduced model regression function.
2. If that’s the case, it makes sense to use the simpler reduced model.
3. On the other hand, if SSE (F ) and SSE (R) differ greatly, then the
additional parameter(s) in the full model substantially reduce the
variation around the estimated regression function.
4. In this case, it makes sense to go with the larger full model.
I How different does SSE(R) have to be from SSE(F) in order to justify using the larger full model? The general linear F-statistic:

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] ÷ [ SSE(F) / dfF ]

helps answer this question (a small R helper implementing this is sketched after this list).
I The F -statistic intuitively makes sense because it is a function
of SSE (R) − SSE (F ), the difference in the error between the
two models.
I The degrees of freedom – denoted dfR and dfF – are those
associated with the reduced and full model error sum of
squares, respectively.
I In general, we reject H0 if F* is large – or equivalently if its associated p-value is small.
I We use the general linear F -statistic to decide whether or not
to reject the null hypothesis H0 : the reduced model in favor of
the alternative hypothesis H1 : the full model.
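The small R helper below (written here for illustration; not part of base R) computes F* and its p-value from the two error sums of squares and their error degrees of freedom:

# General linear F-statistic from SSE(R), SSE(F) and their df
general_F <- function(sse_r, df_r, sse_f, df_f) {
  Fstar <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
  c(F = Fstar,
    p.value = pf(Fstar, df1 = df_r - df_f, df2 = df_f, lower.tail = FALSE))
}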
As noted earlier for the simple linear regression case, the full model
is:
Yi = β0 + β1 xi1 + εi
and the reduced model is:

Yi = β0 + εi

Therefore, the appropriate null and alternative hypotheses are specified either as:

H0 : Yi = β0 + εi and H1 : Yi = β0 + β1 xi1 + εi

or as
H0 : β1 = 0 and H1 : β1 ≠ 0
For simple linear regression, it turns out that the general linear F-test is just the same ANOVA F-test that we learned before. The formula for each entry is summarized in the usual analysis of variance table (not reproduced here).
F -test for the Slope Parameter β1

I Hypothesis Test

H0 : β1 = 0 and H1 : β1 ≠ 0

I Test statistic:

F* = (SSR/1) / (SSE/(n − 2)) = MSR/MSE

I Null distribution: F* ∼ F(1, n − 2)
I We reject H0 if F* > F(1 − α; 1, n − 2)
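In R, the critical value F(1 − α; 1, n − 2) can be looked up with qf(); for example, with α = 0.05 and a hypothetical sample size of n = 25:

n <- 25
qf(1 - 0.05, df1 = 1, df2 = n - 2)   # 95th percentile of F(1, n - 2)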
The degrees of freedom associated with the error sum of squares for the reduced model are n − 1, and:

SSE(R) = Σᵢ (Yi − Ȳ)² = SSTO

The degrees of freedom associated with the error sum of squares for the full model are n − 2, and:

SSE(F) = Σᵢ (Yi − Ŷi)² = SSE

That is, the general linear F-statistic reduces to the ANOVA F-statistic:

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] ÷ [ SSE(F) / dfF ] = [ SSR / ((n − 1) − (n − 2)) ] ÷ [ SSE / (n − 2) ] = MSR/MSE
Sequential (or Extra) Sums of Squares

What is a “sequential sum of squares”? It can be viewed in either of two ways:

I It is the reduction in the error sum of squares (SSE) when one or more predictor variables are added to the model.
I Or, it is the increase in the regression sum of squares (SSR) when one or more predictor variables are added to the model.
I In essence, when we add a predictor to a model, we hope to explain
some of the variability in the response, and thereby reduce some of
the error.
I A sequential sum of squares quantifies how much variability we
explain (increase in regression sum of squares) or alternatively
how much error we reduce (reduction in the error sum of
squares).
We’ll just note what predictors are in the model by listing them in
parentheses after any SSE or SSR. For example:

I SSE (x1 ) denotes the error sum of squares when x1 is the only
predictor in the model.
I SSR(x1 , x2 ) denotes the regression sum of squares when x1 and x2
are both in the model.
I SSR(x2 | x1 ) denotes the sequential sum of squares obtained by
adding x2 to a model already containing only the predictor x1 .
I The vertical bar “|” is read as “given” – that is, “x2 | x1 ” is read as
“x2 given x1 .”
Here are a few more examples of the notation:

I The sequential sum of squares obtained by adding x1 to the model already containing only the predictor x2 is denoted as SSR(x1 | x2).
I The sequential sum of squares obtained by adding x1 to the model in
which x2 and x3 are predictors is denoted as SSR(x1 | x2 , x3 ).
I The sequential sum of squares obtained by adding x1 and x2 to the
model in which x3 is the only predictor is denoted as SSR(x1 , x2 | x3 ).
Example

Let’s try out the notation and the two alternative definitions of a
sequential sum of squares on an example.
An “Allen Cognitive Level” (ACL) Study investigated the relationship
between ACL test scores and level of psychopathology. Researchers
collected the following data on each of 69 patients in a hospital psychiatry
unit:

I Potential predictor x1 = vocabulary (“Vocab”) score on Shipley Institute of Living Scale
I Potential predictor x2 = abstraction (“Abstract”) score on Shipley
Institute of Living Scale
I Potential predictor x3 = score on Symbol-Digit Modalities Test
(“SDMT”)
I Response Y = ACL score
As we mentioned in the previous class, we can use a scatterplot matrix to check the pairwise relationships among the four variables in this data set.
[Scatterplot matrix of ACL, Vocab, Abstract, and SDMT. Pairwise correlations: ACL–Vocab 0.250, ACL–Abstract 0.354, ACL–SDMT 0.521, Vocab–Abstract 0.698, Vocab–SDMT 0.556, Abstract–SDMT 0.577.]
If we estimate the regression function with Y = ACL score as the response
and x1 = Vocab as the predictor, that is, if we “regress Y = ACL on x1 =
Vocab,” we obtain:
anova(fit1)

## Analysis of Variance Table


##
## Response: ACL
## Df Sum Sq Mean Sq F value Pr(>F)
## Vocab 1 2.691 2.69060 4.4667 0.03829 *
## Residuals 67 40.359 0.60237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Noting that x1 is the only predictor in the model, the output tells us that:
SSR(x1 ) = 2.691, SSE (x1 ) = 40.359, and SSTO = 43.050.
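As a small check, the F value in the anova(fit1) table can be reproduced from these sums of squares using the general linear F-statistic, taking the intercept-only model as the reduced model (so SSE(R) = SSTO):

sse_r <- 43.050; df_r <- 68    # reduced model: intercept only (n = 69)
sse_f <- 40.359; df_f <- 67    # full model: ACL ~ Vocab
Fstar <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
c(F = Fstar, p = pf(Fstar, df_r - df_f, df_f, lower.tail = FALSE))
# F is about 4.47 with p about 0.038, matching the anova(fit1) table above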
If we regress Y = ACL on x1 = Vocab and x3 = SDMT, we obtain:
##
## Call:
## lm(formula = ACL ~ Vocab + SDMT, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55758 -0.44619 -0.01027 0.34114 1.55955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.845292 0.324380 11.854 < 2e-16 ***
## Vocab -0.006840 0.015045 -0.455 0.651
## SDMT 0.029795 0.006803 4.379 4.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6883 on 66 degrees of freedom
## Multiple R-squared: 0.2736, Adjusted R-squared: 0.2516
## F-statistic: 12.43 on 2 and 66 DF, p-value: 2.624e-05
Noting that x1 and x3 are the predictors in the model, the regression equation is

ACL = 3.845 − 0.0068 Vocab + 0.0298 SDMT

From this fit, we can compute SSR(x1, x3) = 11.7778, SSE(x1, x3) = 31.2717, and SSTO = 43.050.
Comparing the sums of squares for this model containing x1 and x3 to the
previous model containing only x1 , we note that:

I the error sum of squares has been reduced,


I the regression sum of squares has increased,
I and the total sum of squares stays the same.

For a given data set, the total sum of squares will always be the same regardless of the predictors in the model, because the total sum of squares quantifies how much the observed responses Yi vary around Ȳ, which has nothing to do with which predictors are in the model.
Now, how much have the error sum of squares (SSE ) decreased and the
regression sum of squares (SSR) increased?

I The sequential sum of squares SSR(x3 | x1 ) tells us how much.


I Recall that SSR(x3 | x1 ) is the reduction in the error sum of squares
when x3 is added to the model in which x1 is the only predictor.
Therefore,
SSR(x3 | x1 ) = SSE (x1 ) − SSE (x1 , x3 )
SSR(x3 | x1 ) = 40.359 − 31.2717 = 9.087
I Alternatively, SSR(x3 | x1 ) is the increase in the regression sum of
squares when x3 is added to the model in which x1 is the only
predictor:
SSR(x3 | x1 ) = SSR(x1 , x3 ) − SSR(x1 )
SSR(x3 | x1 ) = 11.7778 − 2.691 = 9.087
The anova() function in R can get the sequential sum of squares
according to the order of each predictor entered in the model.
fit2$call

## lm(formula = ACL ~ Vocab + SDMT, data = dat)


anova(fit2)

## Analysis of Variance Table


##
## Response: ACL
## Df Sum Sq Mean Sq F value Pr(>F)
## Vocab 1 2.6906 2.6906 5.6786 0.02006 *
## SDMT 1 9.0872 9.0872 19.1789 4.35e-05 ***
## Residuals 66 31.2717 0.4738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' '
Order Matters for Sequential Sum of Squares

I Perhaps you noticed from the previous illustration that the order in which we add predictors to the model determines the sequential sums of squares (“Seq SS”) we get. That is, order is important!
I Therefore, we’ll have to pay attention to it – we’ll soon see
that the desired order depends on the hypothesis test we want
to conduct.
Let’s revisit the Allen Cognitive Level Study data to see what happens
when we reverse the order in which we enter the predictors in the model.
Let’s start by regressing Y = ACL on x3 = SDMT.
anova(fit3)

## Analysis of Variance Table


##
## Response: ACL
## Df Sum Sq Mean Sq F value Pr(>F)
## SDMT 1 11.68 11.6799 24.946 4.468e-06 ***
## Residuals 67 31.37 0.4682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' '
Noting that x3 is the only predictor in the model, the resulting output tells
us that: SSR(x3 ) = 11.68 and SSE (x3 ) = 31.37.
Now, regressing Y = ACL on x3 = SDMT and x1 = Vocab – in that order, that is,
specifying x3 first and x1 second, we obtain:
anova(fit4)

## Analysis of Variance Table


##
## Response: ACL
## Df Sum Sq Mean Sq F value Pr(>F)
## SDMT 1 11.6799 11.6799 24.6508 5.12e-06 ***
## Vocab 1 0.0979 0.0979 0.2067 0.6508
## Residuals 66 31.2717 0.4738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Noting that x1 and x3 are the two predictors in the model, the output tells us that:
SSR(x1 , x3 ) = 11.7778 and SSE (x1 , x3 ) = 31.2717.
How much did the error sum of squares decrease – or alternatively, the
regression sum of squares increase?

I The sequential sum of squares SSR(x1 | x3) tells us how much. SSR(x1 | x3) is the reduction in the error sum of squares when x1 is added to the model in which x3 is the only predictor:

SSR(x1 | x3) = SSE(x3) − SSE(x1, x3) = 31.37 − 31.2717 = 0.098


I Alternatively, SSR(x1 |x3 ) is the increase in the regression sum of
squares when x1 is added to the model in which x3 is the only
predictor:
SSR(x1 |x3 ) = SSR(x1 , x3 ) − SSR(x3 )
SSR(x1 |x3 ) = 11.7778 − 11.68 = 0.098
Again, we obtain the same answers. Regardless of how we perform
the calculation, it appears that taking into account x1 = Vocab
doesn’t help much in explaining the variability in Y = ACL after
x3 = SDMT has already been considered.
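The same conclusion can be reached with a direct full-versus-reduced comparison in R. The sketch below assumes fit3 and fit4 are the models behind the two anova() tables above, i.e. fit3 <- lm(ACL ~ SDMT, data = dat) and fit4 <- lm(ACL ~ SDMT + Vocab, data = dat):

anova(fit3, fit4)   # does adding Vocab help, given SDMT is already in the model?
# The resulting F equals (SSR(x1 | x3) / 1) / MSE(x1, x3) = 0.0979 / 0.4738,
# about 0.21, matching the Vocab row of anova(fit4) above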
Study Guide

To prepare for the quiz and exam, you will want to

I know and be able to define three types of research questions and use the general linear F -test to answer them
I know and be able to compute sequential sums of squares
