Multiple Regression Models

Exploring an example:
Chapter 4: Multiple Regression II data
Online stock trading through the Internet has
increased dramatically during the past several
years. An article discussing this new method of
investing provided data on the major Internet
stock brokerages that provide this service. Here
we have some data for the top 10 Internet
brokerages. The variables are Mshare, the market
share of the firm; Accts, the number of Internet
accounts in thousands; and Assets, the total
assets in billions of dollars.
Describe the data:
How many variables does the data set contain?
How would you describe them in terms of levels of
measurement?
Explaining Assets with each predictor
variable

Find the correlation between Assets and each of
the explanatory variables, Mshare and Accts.
Use a simple linear regression to predict
Assets using the number of accounts (Accts).
What is the regression equation?
What are the results of the significance test for
the regression coefficient?
Do the same using Mshare.
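
As a minimal sketch of these steps, the correlation, the fitted line, and the t statistic for the slope can be computed directly from their definitions. The numbers below are made-up illustration values, not the brokerage data, and the names accts and assets only mirror the variables Accts and Assets. The p-value for the t statistic would still come from a t(n-2) table or a regression program.

```python
import math

# Made-up illustration values, NOT the brokerage data from the article.
accts = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical predictor (x)
assets = [2.0, 4.0, 5.0, 4.0, 5.0]   # hypothetical response (y)

n = len(accts)
x_bar = sum(accts) / n
y_bar = sum(assets) / n

# Centered sums of squares and cross-products.
sxx = sum((x - x_bar) ** 2 for x in accts)
syy = sum((y - y_bar) ** 2 for y in assets)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(accts, assets))

# Pearson correlation between x and y.
r = sxy / math.sqrt(sxx * syy)

# Least-squares slope and intercept: y-hat = b0 + b1 * x.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Significance test for the slope: t = b1 / SE(b1) on n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(accts, assets))
mse = sse / (n - 2)
se_b1 = math.sqrt(mse / sxx)
t_stat = b1 / se_b1

print(f"r = {r:.4f}")
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
print(f"t = {t_stat:.4f} on {n - 2} df")
```

Repeating the same calculation with Mshare in place of accts gives the second simple regression.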
What is Multiple Regression?

Predicting an outcome (dependent variable)
based upon several independent variables
simultaneously.
Why is this important?
Behavior is rarely a function of just one
variable, but is instead influenced by
many variables. So the idea is that we
should be able to obtain a more accurate
predicted score if using multiple
variables to predict our outcome.
Strategy for Multiple Regression
Start

Hypothesize form of the model (choose which independent variables to include)

Conduct exploratory data analysis

Develop one or more tentative models

Identify most suitable model

Make inferences based on model

Stop
The Multiple Linear Regression Model

Regression applications in which there are several
independent variables, $x_1, x_2, \ldots, x_p$. A multiple
linear regression model with p independent
variables has the equation

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$

$\beta_0$ is the intercept and $\beta_i$ determines the
contribution of the independent variable $x_i$.
$\varepsilon$ is a random variable with mean 0 and
variance $\sigma^2$.
The Prediction Equation
The equation for this model fitted to data is
$\hat{y} = b_0 + b_1 x_1 + \cdots + b_p x_p$
where $\hat{y}$ denotes the predicted value
computed from the equation, and $b_i$ denotes
an estimate of $\beta_i$.
As with simple linear regression, the estimates are
obtained by the method of least squares:
among the set of all possible values for the
parameter estimates, we find the ones that
minimize the sum of squared residuals.
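
One way to see what a regression program does internally is to solve the normal equations (X'X)b = X'y, whose solution minimizes the sum of squared residuals. The sketch below is an illustration under stated assumptions, not the method any particular package uses: the helper names solve and fit_least_squares are hypothetical, and the data are constructed so that y = 1 + 2*x1 + 3*x2 exactly, which lets the recovered estimates be checked by eye.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_least_squares(predictor_rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(r) for r in predictor_rows]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Constructed data: y = 1 + 2*x1 + 3*x2 with no noise (illustration only).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [9, 8, 19, 18, 29, 28]

b = fit_least_squares(list(zip(x1, x2)), y)
print([round(value, 6) for value in b])
```

With real, noisy data the same function returns the estimates b0, b1, ..., bp of the prediction equation; the residuals are then no longer all zero.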
Basic Idea
With multiple regression, we form a 'linear
combination' of multiple variables to
best predict an outcome, and then we
assess the contribution that each
predictor variable makes to the
equation.
My research question might be:
How much does an independent variable
contribute to explaining the dependent variable
after the effect of another independent
variable is taken into account?
Doing the Calculations

Computation of the estimates by hand
is tedious.
They are ordinarily obtained using a
regression computer program.
Standard errors also are usually part
of output from a regression program.
Let's Return to the Example
Construct a 3-D plot.
Come up with a prediction equation for
the multiple regression model.
Assessing the Utility of the Model:
Hypothesis tests (see MLR handout)

Test if all of the slope parameters
are zero: F test.
Test if a particular slope parameter
is zero given that all other x's
remain in the model: t test.
ANOVA: ANalysis Of VAriance
This is a test of the null hypothesis that
Multiple R in the population = 0.0. If the p-value is .05
or less, reject the null hypothesis.
For a multiple linear regression model with p
independent variables fitted to a data set with
n observations, the ANOVA table is:

Source of Variation   DF      SS    MS
Model                 p       SSM   MSM
Error                 n-p-1   SSE   MSE
Total                 n-1     SST
Sums of squares

The sums of squares SSM, SSE, and
SST have the same definitions in
relation to the model as in simple
linear regression:

$SSM = \sum (\hat{y} - \bar{y})^2$
$SSE = \sum (y - \hat{y})^2$
$SST = \sum (y - \bar{y})^2$

SST = SSM + SSE

The value of SST does not change with the
model.
It depends only on the values of the dependent
variable y.
SSE decreases (or at worst stays the same) as
variables are added to a model, and SSM
increases by the same amount.
This increase in SSM is the amount of
variation due to variables in the larger model that
was not accounted for by variables in the smaller
model.
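
The decomposition can be checked numerically by fitting a smaller model (x1 only) and a larger model (x1 and x2) to the same response. This is a sketch on made-up numbers, not course data; the helper names are hypothetical, and the two-predictor fit uses the closed-form least-squares solution written in terms of centered cross-product sums.

```python
def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Centered cross-product sum: sum of (u - u_bar) * (v - v_bar)."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def fitted_one(x1, y):
    """Fitted values from the least-squares model with x1 only."""
    b1 = cross(x1, y) / cross(x1, x1)
    b0 = mean(y) - b1 * mean(x1)
    return [b0 + b1 * a for a in x1]


def fitted_two(x1, x2, y):
    """Fitted values from the least-squares model with x1 and x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]


def sums_of_squares(y, fitted):
    """Return (SST, SSM, SSE) for observed y and fitted values."""
    y_bar = mean(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssm = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, ssm, sse


# Made-up illustration values.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3, 2, 6, 5, 9, 7]

sst1, ssm1, sse1 = sums_of_squares(y, fitted_one(x1, y))
sst2, ssm2, sse2 = sums_of_squares(y, fitted_two(x1, x2, y))

print(f"x1 only  : SST={sst1:.3f} SSM={ssm1:.3f} SSE={sse1:.3f}")
print(f"x1 and x2: SST={sst2:.3f} SSM={ssm2:.3f} SSE={sse2:.3f}")
print(f"gain in SSM = {ssm2 - ssm1:.3f}, drop in SSE = {sse1 - sse2:.3f}")
```

SST is identical in both rows because it depends only on y, and the gain in SSM matches the drop in SSE.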
F statistic: $F = \dfrac{MSM}{MSE}$

F is the statistic to test if ALL the slope
parameters are zero.
ANOVA gives the F statistic and p-value (be
sure to set the significance level).
Under the null hypothesis
$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
the F statistic has an F(p, n-p-1) distribution
and the p-value is ___. According to this
distribution, the chance of obtaining an F
statistic of __ or larger is _(p-value). We
conclude that the model is useful/not useful
for predicting ___.
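
The mean squares and the F statistic follow from the ANOVA table by simple division. The sketch below uses a hypothetical helper name, anova_f, and made-up sums of squares; the p-value itself is not computed here and would be read from an F(p, n-p-1) table or from the regression output.

```python
def anova_f(ssm, sse, n, p):
    """Return (MSM, MSE, F) for a model with p predictors fitted to n observations."""
    df_model = p
    df_error = n - p - 1
    if df_error <= 0:
        raise ValueError("need more than p + 1 observations")
    msm = ssm / df_model   # mean square for the model
    mse = sse / df_error   # mean square for error
    return msm, mse, msm / mse


# Made-up illustration values: SSM = 3.6, SSE = 2.4, n = 5 observations, p = 1 predictor.
msm, mse, f_stat = anova_f(ssm=3.6, sse=2.4, n=5, p=1)
print(f"MSM = {msm:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f} on (1, 3) df")
```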
Proceed only if F and corresponding
p-value indicate sufficient evidence
that the overall model is useful

If so, look to the individual variables
to determine their contribution
We do this with t-tests
A p-value of .05 or less for a variable
indicates a significant contribution
Interpreting coefficients

Constant = intercept
Other coefficients are the regression
coefficients, interpreted as the
change in the mean dependent
variable for each unit change in the
corresponding independent variable,
all other variables held constant.
Confidence Intervals


Use $b_j \pm t^* \, SE_{b_j}$
$b_j$ is the least-squares estimate of $\beta_j$
t* is the upper (1-C)/2 critical value from the
t(n-p-1) distribution.
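
A small sketch of the interval calculation, assuming the estimate, its standard error, and the critical value are already in hand. The function name is hypothetical, the estimate and standard error are made-up values, and t* = 3.182 is the standard table value for 95% confidence with 3 degrees of freedom.

```python
def coefficient_interval(b_j, se_b_j, t_star):
    """Return the (lower, upper) limits of b_j +/- t* x SE(b_j)."""
    margin = t_star * se_b_j
    return b_j - margin, b_j + margin


# Made-up estimate and standard error; t* taken from a t table (df = 3, C = 0.95).
lower, upper = coefficient_interval(b_j=0.6, se_b_j=0.2828, t_star=3.182)
print(f"95% CI for the coefficient: ({lower:.3f}, {upper:.3f})")
```

An interval that contains 0, as this one does, agrees with a t-test that fails to reject the hypothesis that the coefficient is zero at the matching level.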
Returning to our example

How good is the model?
Which variables contribute to the
model?
What if the Relationship is Curvilinear?

Example: Application journal for chapter 4
(data- Chapter 4: Curvilinear Relationship)
Explore the relationship between IgG (y) as a
function of maximal oxygen uptake (x).
Does a linear or curvilinear model better explain
the variation in IgG? How do you determine
this?
Basic Quadratic Model

$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$
$\beta_0$ is the y-intercept of the curve; value
of E(y) when x = 0
$\beta_1$ is the shift parameter; changing the
value of $\beta_1$ shifts the parabola to the
right or left (for a mound-shaped curve, right
if increased and left if decreased)
$\beta_2$ is the rate of curvature
Interpreting the Coefficient ($\beta$) Estimates
The estimate of $\beta_0$ can only be meaningfully interpreted
if the sampled range of the independent variable
includes zero.
The estimated coefficient of the first-order term no
longer represents the slope and cannot typically be
meaningfully interpreted.
The sign of the coefficient associated with the
quadratic term ($x^2$) indicates whether the curve is
concave downward (mound-shaped): -
concave upward (bowl-shaped): +
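
To make the roles of the coefficients concrete, the sketch below classifies the shape of a fitted quadratic from the sign of the squared-term coefficient and locates its turning point at x = -b1 / (2 b2). The function name and the coefficient values are hypothetical illustrations, not estimates from the IgG data.

```python
def describe_parabola(b0, b1, b2):
    """Return (shape, x at the turning point, E(y) at the turning point)."""
    if b2 == 0:
        raise ValueError("b2 = 0 gives a straight line, not a parabola")
    if b2 < 0:
        shape = "concave downward (mound-shaped)"
    else:
        shape = "concave upward (bowl-shaped)"
    x_turn = -b1 / (2 * b2)
    y_turn = b0 + b1 * x_turn + b2 * x_turn ** 2
    return shape, x_turn, y_turn


# Hypothetical coefficients for illustration.
shape, x_turn, y_turn = describe_parabola(b0=1.0, b1=4.0, b2=-0.5)
print(shape, x_turn, y_turn)

# Increasing b1 on a mound-shaped curve moves the turning point to the right.
_, x_turn_shifted, _ = describe_parabola(b0=1.0, b1=6.0, b2=-0.5)
print(x_turn_shifted)
```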

What is the prediction equation, and how would
you interpret the $\beta$ estimates for the example?
Assessing Model Utility
Again, refer to the F test statistic and associated
p-value.
If these indicate that the model is useful, proceed to
the t-test of the $\beta$ associated with the quadratic term
($x^2$), $\beta_2$ here:
$H_0: \beta_2 = 0$ (no curvature in response curve)
$H_a: \beta_2 < 0$ (downward concavity exists)
or
$H_a: \beta_2 > 0$ (upward concavity exists)
This is a one-tailed test, so we divide the reported
two-tailed p-value by 2 (provided the estimate has the
sign stated in $H_a$).
We do not need to consider the test statistics for the
coefficients associated with the y-intercept and
first-order term(s).
What if I have a Qualitative
Independent Variable?

Create a dummy variable (indicator
variable).
Instructions included on Minitab
worksheet.
Example: Application journal # 3 (data-
Chapter 4: Dummy Variable)
Create a dummy variable for repellent type
Is repellent type useful for predicting cost per
use? Number of hours of protection?
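
A dummy variable codes one level of the qualitative variable as 1 and the other as 0. With a single dummy predictor, the least-squares intercept is the mean response of the level coded 0 and the slope is the difference between the two group means. The sketch below illustrates this on made-up values; the level names "lotion" and "aerosol", the cost figures, and the helper name dummy_code are assumptions, not taken from the course data set.

```python
def dummy_code(labels, level_coded_one):
    """Return 1 for rows at the chosen level and 0 for all other rows."""
    return [1 if label == level_coded_one else 0 for label in labels]


# Made-up repellent types and costs per use (illustration only).
repellent_type = ["lotion", "aerosol", "lotion", "aerosol", "lotion", "aerosol"]
cost_per_use = [2.0, 1.0, 3.0, 2.0, 4.0, 3.0]

x = dummy_code(repellent_type, "lotion")

# Simple least-squares fit of cost on the dummy variable.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(cost_per_use) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, cost_per_use))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

print(f"dummy variable: {x}")
print(f"b0 = {b0:.2f} (mean cost when x = 0), b1 = {b1:.2f} (difference in means)")
```

The t-test on b1 then answers whether repellent type is useful for predicting the response.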
What if the relationship between E(y)
and any one IV depends on the value
of another IV?
In this case, the two independent
variables interact, and we model
this as a cross-product of the IVs.
Example: Graph and interpret the following findings

Let's say we want to study how hard students work on tests.
We have some achievement-oriented students and some
achievement-avoiders. We create two random halves in
each sample, and give half of each sample a challenging
test, the other half an easy test. We measure how hard the
students work on the test. The means of this study are:

                   Achievement-oriented   Achievement-avoiders
                   (n=100)                (n=100)
Challenging test   10                     5
Easy test          5                      10
Caution!

Once an interaction has been
deemed important in a model, all
associated first-order terms should
be kept in the model, regardless of
the magnitude of their p-values.
Conclusions

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
The effect of test difficulty ($x_1$) on effort (y)
depends on a student's achievement
orientation ($x_2$).
Thus, the type of achievement orientation
and test difficulty interact in their effect on
effort.
This is an example of a two-way interaction
between achievement orientation and test
difficulty.
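
The four cell means from the study determine the four coefficients of the interaction model exactly once a 0/1 coding is chosen. The coding below is an assumption made for illustration (x1 = 1 for the challenging test, 0 for the easy test; x2 = 1 for achievement-oriented students, 0 for avoiders); a different coding would change the individual coefficients but not the conclusion.

```python
# Cell means reported in the study: (test difficulty, orientation) -> mean effort.
means = {
    ("challenging", "oriented"): 10,
    ("challenging", "avoider"): 5,
    ("easy", "oriented"): 5,
    ("easy", "avoider"): 10,
}

# Assumed coding: x1 = 1 challenging / 0 easy, x2 = 1 oriented / 0 avoider.
b0 = means[("easy", "avoider")]                        # E(y) at x1 = 0, x2 = 0
b1 = means[("challenging", "avoider")] - b0            # difficulty effect when x2 = 0
b2 = means[("easy", "oriented")] - b0                  # orientation effect when x1 = 0
b3 = means[("challenging", "oriented")] - b0 - b1 - b2  # interaction term

print(f"E(y) = {b0} + ({b1}) x1 + ({b2}) x2 + ({b3}) x1 x2")
print(f"effect of a challenging test for avoiders: {b1}")
print(f"effect of a challenging test for achievement-oriented students: {b1 + b3}")
```

Under this coding the effect of test difficulty reverses sign between the two groups, which is what the crossing lines in a plot of the means show.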