
Multiple Regression

• Multiple regression extends simple regression to include several independent variables (called predictors).
• Multiple regression is required when a single-predictor model is inadequate to describe
the true relationship between the dependent variable Y (the response variable) and its
potential predictors (X1, X2, X3, . . .).
• The interpretation of multiple regression is similar to simple regression because simple
regression is a special case of multiple regression.
Limitations of Simple Regression
• Multiple relationships usually exist.
• Estimates are biased if relevant predictors are omitted.
• Lack of fit does not show that X is unrelated to Y if the true model is multivariate.
Regression Terminology
• The response variable (Y) is assumed to be related to the k predictors (X1, X2, . . . , Xk) by a linear equation called the population regression model:

y = β0 + β1x1 + β2x2 + . . . + βkxk + ε

• A random error ε represents everything that is not part of the model. The unknown regression coefficients β0, β1, β2, . . . , βk are parameters and are denoted by Greek letters.
Regression Terminology
• The sample estimates of the regression coefficients are denoted by Roman letters b0, b1, b2, . . . , bk. The predicted value of the response variable is denoted ŷ and is calculated by inserting the values of the predictors into the estimated regression equation:

ŷ = b0 + b1x1 + b2x2 + . . . + bkxk  (predicted value of Y)
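As a minimal sketch of plugging predictor values into an estimated equation (the coefficients and predictor values here are made up for illustration, not from any fitted model):

```python
# Hypothetical estimates b0, b1, b2 and predictor values x1, x2 (illustration only)
b0, b1, b2 = 10.0, 2.0, 0.5
x1, x2 = 3.0, 8.0

# Predicted value of Y: insert the predictor values into the estimated equation
y_hat = b0 + b1 * x1 + b2 * x2
print(y_hat)  # 20.0
```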
Data Format
• To obtain a fitted regression, we need n observed
values of the response variable Y and its proposed
predictors X1, X2, . . . , Xk . A multivariate data set
is a single column of Y-values and k columns of
X-values.
Sample Data
• Illustration: Home Prices
• Table shows sales of 30 new homes in an upscale development. Although the
selling price of a home (the response variable) may depend on many factors, we
will examine three potential explanatory variables.
• Definition of Variable (Short Name):
• Y = selling price of a home, in thousands of dollars (Price)
• X1 = home size, in square feet (SqFt)
• X2 = lot size, in thousands of square feet (LotSize)
• X3 = number of bathrooms (Baths)
Sample Data

Home  Sqft (X1)  LotSize (X2)  Baths (X3)  Price (Y)
1 2192 16.4 2.5 505.5
2 3429 24.7 3.5 784.1
3 2842 17.7 3.5 649
4 2987 20.3 3.5 689.8
5 3029 22.2 3 709.8
6 2616 20.8 2.5 590.2
7 2978 17.3 3 643.3
8 3595 22.4 3.5 789.7
9 2838 27.4 3 683
10 2591 19.2 2 544.3
11 3633 26.9 4 822.8
12 2822 23.1 3 637.7
13 2994 20.4 3 618.7
14 2696 22.7 3.5 619.3
15 2134 13.4 2.5 490.5
16 3076 19.8 3 675.1
17 3259 20.8 3.5 710.4
18 3162 19.4 4 674.7
19 2885 23.2 3 663.6
20 2550 20.2 3 606.6
21 3380 19.6 4.5 758.9
22 3131 22.5 3.5 723.3
23 2754 19.2 2.5 621.8
24 2710 21.6 3 622.4
25 2616 20.8 2.5 631.3
26 2608 17.3 3.5 574
27 3572 29 4 863.8
28 2924 21.8 2.5 652.7
29 3614 25.5 3.5 844.2
30 2600 24.1 3.5 629.9
Estimation of Regression
• A regression equation can be estimated by using Excel, MegaStat, MINITAB, or any other statistical package. Using the sample of n = 30 home sales, we obtain the fitted regression and its statistics of fit (R² is the coefficient of determination, SE is the standard error):

Price = -28.85 + 0.171 Sqft + 6.78 LotSize + 15.53 Baths  (R² = 0.956, SE = 20.3)

• Intercept = -28.85
• Slope Sqft = 0.171
• Slope LotSize = 6.78
• Slope Baths = 15.53
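As a cross-check, the same fit can be reproduced outside a spreadsheet. This sketch uses NumPy's least-squares solver on the 30 rows transcribed from the Sample Data table:

```python
import numpy as np

# Home-price data transcribed from the Sample Data table: [Sqft, LotSize, Baths, Price]
data = np.array([
    [2192, 16.4, 2.5, 505.5], [3429, 24.7, 3.5, 784.1],
    [2842, 17.7, 3.5, 649.0], [2987, 20.3, 3.5, 689.8],
    [3029, 22.2, 3.0, 709.8], [2616, 20.8, 2.5, 590.2],
    [2978, 17.3, 3.0, 643.3], [3595, 22.4, 3.5, 789.7],
    [2838, 27.4, 3.0, 683.0], [2591, 19.2, 2.0, 544.3],
    [3633, 26.9, 4.0, 822.8], [2822, 23.1, 3.0, 637.7],
    [2994, 20.4, 3.0, 618.7], [2696, 22.7, 3.5, 619.3],
    [2134, 13.4, 2.5, 490.5], [3076, 19.8, 3.0, 675.1],
    [3259, 20.8, 3.5, 710.4], [3162, 19.4, 4.0, 674.7],
    [2885, 23.2, 3.0, 663.6], [2550, 20.2, 3.0, 606.6],
    [3380, 19.6, 4.5, 758.9], [3131, 22.5, 3.5, 723.3],
    [2754, 19.2, 2.5, 621.8], [2710, 21.6, 3.0, 622.4],
    [2616, 20.8, 2.5, 631.3], [2608, 17.3, 3.5, 574.0],
    [3572, 29.0, 4.0, 863.8], [2924, 21.8, 2.5, 652.7],
    [3614, 25.5, 3.5, 844.2], [2600, 24.1, 3.5, 629.9],
])

X = np.column_stack([np.ones(len(data)), data[:, :3]])  # prepend intercept column
y = data[:, 3]
b, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimates b0..b3

y_hat = X @ b
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(np.round(b, 3), round(r2, 3))
```

The coefficients should match the intercept and slopes reported above (up to rounding).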
Estimation of Regression
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.977722197
R Square            0.955940694
Adjusted R Square   0.950856928
Standard Error      20.29773478
Observations        30

ANOVA
             df    SS           MS         F          Significance F
Regression    3    232413.739   77471.25   188.0379   9.66E-18
Residual     26    10711.949    411.998
Total        29    243125.688

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept    -28.8056       29.7000          -0.9699    0.3410     -89.8548    32.2436
Sqft           0.1709        0.0154          11.0677    2.46E-11     0.1392     0.2026
LotSize        6.7772        1.4207           4.7702    6.16E-05     3.8569     9.6976
Baths         15.5239        9.2047           1.6865    0.1037      -3.3967    34.4444
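The F statistic in the ANOVA block is simply MS Regression divided by MS Residual; the printed values can be verified directly:

```python
# Sums of squares and degrees of freedom from the ANOVA table above
ss_reg, df_reg = 232413.739, 3
ss_err, df_err = 10711.949, 26

ms_reg = ss_reg / df_reg   # mean square for regression
ms_err = ss_err / df_err   # mean square error (MSE)
f_stat = ms_reg / ms_err   # F = MSR / MSE
print(round(ms_err, 1), round(f_stat, 2))  # 412.0 188.04
```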
Regression Modeling
Four Criteria for Regression Assessment
• Logic: Is there an a priori reason to expect a causal relationship between the predictors and the response variable?
• Fit: Does the overall regression show a significant relationship between the predictors and the response variable?
• Parsimony: Does each predictor contribute significantly to the explanation? Are some predictors not worth the trouble?
• Stability: Are the predictors related to one another so strongly that regression estimates become erratic?
ANOVA Table Format
• The ANOVA table partitions the total variation in Y into the part explained by the regression and the unexplained residual part: SST = SSR + SSE.
Coefficient of Determination
• R² = SSR/SST = 1 - SSE/SST is the fraction of the variation in Y explained by the predictors (0 ≤ R² ≤ 1).
Adjusted R²
• Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1) penalizes the addition of predictors that contribute little explanatory power.
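Using the home-sales ANOVA numbers from the summary output (n = 30 observations, k = 3 predictors), both statistics can be checked numerically:

```python
n, k = 30, 3
r2 = 232413.739 / 243125.688              # R² = SSR / SST from the ANOVA table
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted for model size
print(round(r2, 4), round(adj_r2, 4))  # 0.9559 0.9509
```

These reproduce the R Square (0.9559) and Adjusted R Square (0.9509) lines of the output.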
ANOVA vs t-test

t-test: checks whether two populations are statistically different; the test statistic is the t-value.
ANOVA: checks whether three or more populations are statistically different; the test statistic is the F-value.
Both compare the difference in means to the spread of the distributions (i.e., the variance) across groups, and in both a lower p-value reflects a difference that is more statistically significant across populations.
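To make the contrast concrete, here is a small sketch (made-up samples, assuming SciPy is available) running both tests:

```python
from scipy.stats import ttest_ind, f_oneway

# Three made-up samples; the third is shifted well away from the first two
a, b, c = [1, 2, 3], [2, 3, 4], [8, 9, 10]

t_res = ttest_ind(a, c)    # t-test: compares exactly two groups
f_res = f_oneway(a, b, c)  # one-way ANOVA: compares three (or more) groups

print(t_res.pvalue < 0.05, f_res.pvalue < 0.05)  # True True
```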
• ANOVA (ANalysis of Variance) is a hypothesis testing procedure that is used to evaluate differences between 2 or more samples.
• Compare THREE sample means to see if a difference exists somewhere among them:
  - Do all THREE sample means come from a common population?
  - Is one mean so far away from the other two that it is likely not from the same population (an "oddball" distribution that does not belong with the other two)?
  - Or are all THREE sample means so far apart that they ALL come from unique populations?
• Put all the data points from all THREE samples into a common larger distribution, and ask where each sample mean lies relative to the overall data set in the background, that is, how far each mean is from the mean of the larger combined population.
• Sample means in very different locations relative to the overall mean indicate variability AMONG/BETWEEN the sample means.
• We are not asking whether the means are EXACTLY equal; we are asking whether each mean likely came from the larger overall population.
• What else changes the picture? The variance or spread of each distribution: variability AROUND/WITHIN the distributions.
Example for ANOVA
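As a numeric sketch of the between/within decomposition described above (made-up samples, one-way ANOVA computed by hand):

```python
import numpy as np

# Three made-up samples; the third sits well away from the first two
groups = [np.array([1., 2., 3.]), np.array([2., 3., 4.]), np.array([8., 9., 10.])]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Variability AMONG/BETWEEN the sample means
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability AROUND/WITHIN each distribution
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1              # 2
df_within = len(all_data) - len(groups)   # 6
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(f_stat, 4))  # 43.0
```

A large F means the spread between the group means dwarfs the spread within each group, so at least one group likely comes from a different population.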
ANOVA Table
• The total degrees of freedom (DF) are the amount of
information in your data.
• DF Num : degrees of freedom for the numerator to
calculate the probability of obtaining an F value that is at
least as extreme as the observed F value.
• DF DENOM : degrees of freedom for the denominator to
calculate the probability of obtaining an F value that is at
least as extreme as the observed F value.
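For the home-price regression above (n = 30 observations, k = 3 predictors), these degrees of freedom work out as:

```python
n, k = 30, 3
df_num = k              # DF Num: regression (numerator) degrees of freedom
df_denom = n - k - 1    # DF Denom: residual (denominator) degrees of freedom
df_total = n - 1        # total degrees of freedom
print(df_num, df_denom, df_total)  # 3 26 29
```

These match the df column of the ANOVA table in the regression output.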
Adj SS

Adj SS Term
The adjusted sum of squares for a term is the increase in the regression
sum of squares compared to a model with only the other terms. It quantifies
the amount of variation in the response data that is explained by each term
in the model.
Adj SS Error
The error sum of squares is the sum of the squared residuals. It quantifies
the variation in the data that the predictors do not explain.
Adj SS Total
The total sum of squares is the sum of the term sum of squares and the
error sum of squares. It quantifies the total variation in the data.
Adj MS
• The adjusted mean square of the error (also called MSE
or s2) is the variance around the fitted values.
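The standard error reported in the regression summary (20.2977) is the square root of this MSE; a quick check using the home-price ANOVA numbers:

```python
import math

ss_error, df_error = 10711.949, 26   # error SS and df from the ANOVA table
mse = ss_error / df_error            # adjusted mean square error, s^2
se = math.sqrt(mse)                  # standard error of the regression, s
print(round(se, 4))  # 20.2977
```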
F- Value

F-value for the model or the terms
The F-value is the test statistic used to determine whether the term is associated with the response.
F-value for the lack-of-fit test
The F-value is the test statistic used to determine whether
the model is missing higher-order terms that include the
predictors in the current model.
p-value
• The p-value is a probability that measures the evidence against the null hypothesis.
Lower probabilities provide stronger evidence against the null hypothesis.

• P-value ≤ α: The differences between some of the means are statistically significant. If the p-value is less than or equal to the significance level, you reject the null hypothesis and conclude that not all population means are equal. Use your specialized knowledge to determine whether the differences are practically significant.
• P-value > α: The differences between the means are not statistically significant. If the p-value is greater than the significance level, you do not have enough evidence to reject the null hypothesis that the population means are all equal. Verify that your test has enough power to detect a difference that is practically significant.
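For the overall regression F test above (F ≈ 188.04 with 3 and 26 df), this decision rule can be sketched with SciPy's F distribution (assuming SciPy is available):

```python
from scipy.stats import f

f_stat, df_num, df_denom = 188.0379, 3, 26
p_value = f.sf(f_stat, df_num, df_denom)  # survival function: P(F >= f_stat) under H0

alpha = 0.05
print(p_value <= alpha)  # True: reject H0 that all slope coefficients are zero
```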
T-value

• The t-value is a test statistic that measures the ratio


between the difference in means and the standard error of
the difference.
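In the regression output earlier, each t Stat is exactly this kind of ratio: the coefficient estimate divided by its standard error. For the Sqft slope:

```python
coef, se = 0.170904, 0.0154417   # Sqft slope and its standard error, from the output
t_stat = coef / se               # t Stat = coefficient / standard error
print(round(t_stat, 2))  # 11.07
```

This reproduces the t Stat of 11.0677 reported for Sqft.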
Example for hypothesis testing
a) Estimated regression equation
b) Interpreting coefficients
c) Sample prediction
d) Age coefficient
Hypothesis testing
e) Significance of house price parameters
e) Significance of Rooms
e) Significance of Age
e) Significance of Distance
References
• Doane, D. P., & Seward, L. E. (2016). Applied Statistics in Business and Economics (5th ed.). McGraw-Hill.
• https://www.youtube.com/watch?v=xj_e7M7YS6w
• https://www.youtube.com/watch?v=IQo_T7BmO90
Thank You!!!
