regression

Multiple Regression II data

Online stock trading through the Internet has

increased dramatically during the past several

years. An article discussing this new method of

investing provided data on the major Internet

stock brokerages that provide this service. Here

we have some data for the top 10 Internet

brokerages. The variables are Mshare, the market

share of the firm; Accts, the number of Internet

accounts in thousands; and Assets, the total

assets in billions of dollars.

Describe the data:

How many variables does the data set contain?

How would you describe them in terms of levels of

measurement?

Explaining Assets with each predictor

variable

Explore the relationship between Assets and the explanatory variables Mshare and

Accts.

Use a Simple Linear Regression to predict

Assets using the number of

accounts.

What is the regression equation?

What are the results of the significance test for

the regression coefficient?

Do the same using Mshare.

What is Multiple Regression?

Predicting an outcome (the dependent variable) based upon several

independent variables

simultaneously.

Why is this important?

Behavior is rarely a function of just one

variable, but is instead influenced by

many variables. So the idea is that we

should be able to obtain a more accurate

predicted score if we use multiple

variables to predict our outcome.

Strategy for Multiple Regression

[Flowchart not reproduced: the steps run from Start to Stop]

The Multiple Linear Regression Model

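The dependent variable y is modeled as a function of p independent variables, $x_1, x_2, \ldots, x_p$. A multiple linear regression model with p independent variables has the equation

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$

$\beta_0$ is the intercept and $\beta_i$ determines the contribution of the independent variable $x_i$.

The error term $\varepsilon$ is assumed to be normally distributed with mean 0 and variance $\sigma^2$.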

The Prediction Equation

The equation for this model fitted to data is

$\hat{y} = b_0 + b_1 x_1 + \cdots + b_p x_p$

where $\hat{y}$ denotes the predicted value computed from the equation, and $b_i$ denotes an estimate of $\beta_i$.

The estimates are obtained by the method of least squares.

Among the set of all possible values for the

parameter estimates, I find the ones which

minimize the sum of squared residuals.
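The least-squares idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses NumPy, the helper name `fit_least_squares` is hypothetical, and the numbers are made up (they are not the brokerage data). The made-up y values follow y = 2 + 3x1 - x2 exactly, so the fitted estimates should recover those three numbers.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return [b0, b1, ..., bp], the estimates that minimize the sum of squared residuals."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that b0 plays the role of the intercept.
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

# Made-up illustration data (NOT the brokerage data): y = 2 + 3*x1 - 1*x2 exactly.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 2.0 + 3.0 * x1 - 1.0 * x2

b = fit_least_squares(np.column_stack([x1, x2]), y)
print(np.round(b, 6))  # estimates of the intercept and the two slopes
```

With real data the residuals are not zero, and a regression program reports the same kind of estimates together with their standard errors.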

Basic Idea

With multiple regression, we form a 'linear

combination' of multiple variables to

best predict an outcome, and then we

assess the contribution that each

predictor variable makes to the

equation.

My research question might be:

How much does an independent variable

contribute to explaining dependent variable

after the effect of another independent

variable is taken into account?

Doing the Calculations

Computing the parameter estimates by hand is tedious.

They are ordinarily obtained using a

regression computer program.

Standard errors also are usually part

of output from a regression program.

Let's Return to the Example

Come up with a prediction equation for

the multiple regression model.

Assessing the Utility of the Model:

Hypothesis tests (see MLR handout)

Test if all of the slope parameters are zero: F test.

Test if a particular slope parameter

is zero given that all other x's

remain in the model: t test.

ANOVA: ANalysis Of VAriance

This is a test of the null hypothesis that

Multiple R in the population = 0.0. If the p-value is .05

or less, reject the null hypothesis.

For a multiple linear regression model with p independent variables fitted to a data set with n observations, the ANOVA table is:

Source of Variation   DF      SS    MS
Model                 p       SSM   MSM = SSM/p
Error                 n-p-1   SSE   MSE = SSE/(n-p-1)
Total                 n-1     SST

Sums of squares

SSM, SSE, and SST have the same definitions in relation to the model as in simple linear regression:

$SSM = \sum (\hat{y}_i - \bar{y})^2$

$SSE = \sum (y_i - \hat{y}_i)^2$

$SST = \sum (y_i - \bar{y})^2$

SST = SSM + SSE
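As a quick numerical check of the decomposition, the sketch below computes the three sums of squares for a fitted model and compares SSM + SSE with SST. The assumptions here are mine: NumPy is used, the function name `sums_of_squares` is hypothetical, and the data are made up for illustration.

```python
import numpy as np

def sums_of_squares(X, y):
    """Fit y on X (with an intercept) by least squares and return (SSM, SSE, SST)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ b      # predicted values
    y_bar = y.mean()        # mean of the observed values
    ssm = float(np.sum((y_hat - y_bar) ** 2))
    sse = float(np.sum((y - y_hat) ** 2))
    sst = float(np.sum((y - y_bar) ** 2))
    return ssm, sse, sst

# Made-up illustration data.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.7])

ssm, sse, sst = sums_of_squares(np.column_stack([x1, x2]), y)
print(ssm + sse, sst)  # the two values agree up to rounding error
```

The identity relies on the model containing an intercept, which the column of ones provides.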

The value of SST does not change with the model.

It depends only on the values of the dependent

variable y.

SSE decreases as variables are added to a model, and SSM increases by the same

amount.

This amount of increase in SSM is the amount of

variation due to variables in the larger model that

was not accounted for by variables in the smaller

model.
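The trade-off between SSE and SSM can be seen by fitting a smaller and a larger model to the same y values. The sketch below is an illustration only: it assumes NumPy, uses a hypothetical helper `fit_ss`, and runs on made-up data.

```python
import numpy as np

def fit_ss(X, y):
    """Least-squares fit with an intercept; returns (SSM, SSE, SST)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ b
    y_bar = y.mean()
    return (float(np.sum((y_hat - y_bar) ** 2)),
            float(np.sum((y - y_hat) ** 2)),
            float(np.sum((y - y_bar) ** 2)))

# Made-up illustration data.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.7])

small = fit_ss(x1, y)                          # model with x1 only
large = fit_ss(np.column_stack([x1, x2]), y)   # model with x1 and x2

gain_in_ssm = large[0] - small[0]   # variation newly explained by adding x2
drop_in_sse = small[1] - large[1]   # unexplained variation removed by adding x2
print(gain_in_ssm, drop_in_sse)     # equal up to rounding; SST is the same in both fits
```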

F statistic: $F = \dfrac{MSM}{MSE}$

This tests the null hypothesis that all of the slope parameters are zero.

ANOVA gives the F statistic and p-value (be

sure to set the significance level)

Under the null hypothesis

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$

the F statistic has an F(p, n-p-1) distribution

and the p-value is ___. According to this

distribution, the chance of obtaining an F

statistic of __ or larger is _(p-value). We

conclude that the model is useful/not useful

for predicting ___.
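A minimal sketch of how the F statistic and its degrees of freedom come out of the sums of squares is given below. It assumes NumPy, the function name `overall_f` is hypothetical, and the data are made up. The p-value itself is not computed here; it would be looked up in the F(p, n-p-1) distribution by a statistics package or table.

```python
import numpy as np

def overall_f(X, y):
    """Return (F, (df_model, df_error)) for the overall test of a least-squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ b
    ssm = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    msm = ssm / p              # model mean square
    mse = sse / (n - p - 1)    # error mean square
    return float(msm / mse), (p, n - p - 1)

# Made-up illustration data: n = 6 observations, p = 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.7])

f_stat, df = overall_f(np.column_stack([x1, x2]), y)
print(f_stat, df)  # compare f_stat with the F(2, 3) distribution
```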

Proceed only if F and corresponding

p-value indicate sufficient evidence

that the overall model is useful

Then examine the individual variables to determine their contribution

We do this with t-tests

A p-value of .05 or less for a variable indicates a significant contribution

Interpreting coefficients

Constant = intercept

Other coefficients are the regression

coefficients, interpreted as the

change in the mean dependent

variable for each unit change in the

corresponding independent variable,

all other variables held constant.

Confidence Intervals

Use $b_j \pm t^* \, SE_{b_j}$

t* is the (1-C)/2 critical value from the

t(n-p-1) distribution.
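The interval formula can be sketched as follows. Everything here is an assumption made for illustration: NumPy is used, the function name `coef_table` is hypothetical, the data are made up, and the critical value 2.365 is the 95% value for t(7) taken from a standard t table (n = 10, p = 2, so n-p-1 = 7). The standard errors use the usual least-squares formula, the square root of MSE times the diagonal of the inverse of X'X.

```python
import numpy as np

def coef_table(X, y, t_star):
    """Return estimates, standard errors, t statistics, and interval limits b_j -/+ t* SE."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ b
    mse = np.sum(resid ** 2) / (n - p - 1)
    # Estimated covariance matrix of the coefficient estimates.
    cov = mse * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))
    t_stats = b / se
    return b, se, t_stats, b - t_star * se, b + t_star * se

# Made-up illustration data: n = 10 observations, p = 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
x2 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0])
y = np.array([5.2, 5.9, 9.1, 8.7, 12.3, 16.8, 13.9, 18.2, 19.1, 19.6])

T_STAR = 2.365  # upper 0.025 critical value of t(7), from a t table (C = 0.95)
b, se, t_stats, lower, upper = coef_table(np.column_stack([x1, x2]), y, T_STAR)
for name, lo_j, hi_j in zip(["b0", "b1", "b2"], lower, upper):
    print(name, round(float(lo_j), 3), round(float(hi_j), 3))
```

An interval that excludes zero corresponds to a t-test that rejects at the matching significance level.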

Returning to our example

Which variables contribute to the

model?

What if the Relationship is Curvilinear?

(data- Chapter 4: Curvilinear Relationship)

Explore the relationship between IgG (y) as a

function of maximal oxygen uptake (x).

Does a linear or curvilinear model better explain

the variation in IgG? How do you determine

this?

Basic Quadratic Model

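$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$

$\beta_0$ is the y-intercept of the curve; the value of E(y) when x = 0

$\beta_1$ is the shift parameter; changing the value of $\beta_1$ shifts the parabola to the right or left (for a concave-downward curve, right if increased and left if decreased; the directions reverse for a concave-upward curve)

$\beta_2$ is the rate of curvature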

Interpreting the Coefficient ($\beta$) Estimates

The estimated y-intercept can be meaningfully interpreted only if the sampled range of the independent variable includes zero.

The estimated coefficient of the first-order term no longer represents the slope and cannot typically be meaningfully interpreted.

The sign of the coefficient associated with the quadratic term ($x^2$) indicates whether the curve is

concave downward (mound-shaped): -

concave upward (bowl-shaped): +

How would you interpret the $\beta$s for the example?
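To see how the sign of the quadratic coefficient shows up in a fit, here is a small sketch. It assumes NumPy and uses made-up values that follow y = 1 + 4x - 0.5x^2 exactly (not the IgG data), so the estimated quadratic coefficient should come out negative.

```python
import numpy as np

# Made-up illustration data (NOT the IgG data): a mound-shaped curve.
x = np.arange(0.0, 9.0)               # x = 0, 1, ..., 8
y = 1.0 + 4.0 * x - 0.5 * x ** 2

# np.polyfit returns coefficients from the highest power down: [b2, b1, b0].
b2, b1, b0 = np.polyfit(x, y, deg=2)

if b2 < 0:
    shape = "concave downward (mound-shaped)"
else:
    shape = "concave upward (bowl-shaped)"

print(round(float(b0), 4), round(float(b1), 4), round(float(b2), 4))
print(shape)
```

Because the sampled x values include zero here, the estimated intercept has a direct reading as the value of the curve at x = 0.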

Assessing Model Utility

Again, refer to the F test statistic and associated p-value.

If these indicate that the model is useful, proceed to the t-test of the $\beta$ associated with the quadratic term ($x^2$), $\beta_2$ here

$H_0: \beta_2 = 0$ (no curvature in response curve)

$H_a: \beta_2 < 0$ (downward concavity exists)

Or

$H_a: \beta_2 > 0$ (upward concavity exists)

For a one-sided alternative, divide the reported two-sided p-value by 2.

We do not need to consider the test statistics for the coefficients associated with the y-intercept and first-order term(s)

What if I have a Qualitative

Independent Variable?

(Use a dummy variable, also called an indicator variable.)

Instructions included on Minitab

worksheet.

Example: Application journal # 3 (data-

Chapter 4: Dummy Variable)

Create a dummy variable for repellent type

Is repellent type useful for predicting cost per

use? Number of hours of protection?
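Outside Minitab, the same dummy-variable idea can be sketched in Python. The assumptions are mine: NumPy is used, the repellent types are given the hypothetical labels "A" and "B", and the cost values are made up (they are not the Chapter 4 data). With a single 0/1 dummy, the intercept estimates the mean of the baseline group and the slope estimates the difference between the two group means.

```python
import numpy as np

# Made-up illustration data; "A" and "B" are hypothetical repellent types.
repellent_type = ["A", "A", "A", "B", "B", "B", "B"]
cost_per_use = np.array([1.0, 1.5, 2.0, 3.0, 2.5, 3.5, 3.0])

# Dummy variable: 1 if the type is "B", 0 if it is the baseline type "A".
dummy = np.array([1.0 if t == "B" else 0.0 for t in repellent_type])

design = np.column_stack([np.ones(len(dummy)), dummy])
coefs, *_ = np.linalg.lstsq(design, cost_per_use, rcond=None)
b0, b1 = coefs

mean_a = cost_per_use[dummy == 0].mean()
mean_b = cost_per_use[dummy == 1].mean()
print(b0, mean_a)            # intercept matches the mean cost for type A
print(b1, mean_b - mean_a)   # slope matches the difference in mean cost, B minus A
```

Testing whether the slope on the dummy is zero is then the same as asking whether repellent type is useful for predicting the response.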

What if the relationship between E(y)

and any one IV depends on the value

of another IV?

In this case, the two independent

variables interact, and we model

this with a cross-product of the IVs.

Example: Graph and interpret the following findings

We have some achievement-oriented students and some

achievement-avoiders. We create two random halves in

each sample, and give half of each sample a challenging

test, the other an easy test. We measure how hard the

students work on the test. The means of this study are:

                   Achievement-oriented   Achievement avoiders
                   (n=100)                (n=100)
Challenging test   10                     5
Easy test          5                      10

Caution!

Once an interaction has been deemed important in a model, all

associated first-order terms should

be kept in the model, regardless of

the magnitude of their p-values.

Conclusions

The effect of test difficulty (x1) on effort (y)

depends on a student's achievement

orientation (x2).

Thus, the type of achievement orientation

and test difficulty interact in their effect on

effort.

This is an example of a two-way interaction

between achievement orientation and test

difficulty.
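The interaction can be made concrete by fitting a cross-product model to the four cell means from the example. This is a sketch under assumptions of my own: NumPy is used, and the 0/1 coding (x1 = 1 for the challenging test, x2 = 1 for achievement-oriented students) is chosen only for illustration. The model is E(y) = b0 + b1 x1 + b2 x2 + b3 x1 x2.

```python
import numpy as np

# Cell means from the example: (x1, x2, mean effort).
# Assumed coding: x1 = 1 challenging test, 0 easy test;
#                 x2 = 1 achievement-oriented, 0 achievement-avoider.
cells = [
    (1, 1, 10.0),  # challenging test, achievement-oriented
    (1, 0, 5.0),   # challenging test, achievement-avoider
    (0, 1, 5.0),   # easy test, achievement-oriented
    (0, 0, 10.0),  # easy test, achievement-avoider
]
x1 = np.array([c[0] for c in cells], dtype=float)
x2 = np.array([c[1] for c in cells], dtype=float)
y = np.array([c[2] for c in cells])

# Four parameters and four cell means, so the model fits the means exactly.
design = np.column_stack([np.ones(4), x1, x2, x1 * x2])
b0, b1, b2, b3 = np.linalg.solve(design, y)

effect_for_avoiders = b1        # change in effort, easy -> challenging, when x2 = 0
effect_for_oriented = b1 + b3   # change in effort, easy -> challenging, when x2 = 1
print(b0, b1, b2, b3)
print(effect_for_avoiders, effect_for_oriented)
```

The effect of test difficulty is -5 for avoiders and +5 for achievement-oriented students, and the nonzero cross-product coefficient b3 = 10 is exactly the difference between those two effects.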

