Linear Regression
Scatter Plots and Correlation
[Figure: scatter plots of y versus x illustrating strong relationships, weak relationships, and no relationship]
Correlation Coefficient
• The population correlation coefficient (ρ) measures the strength of the linear association between the variables; its sample counterpart is

$$r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y}$$

where

$$\operatorname{cov}(x, y) = \frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})$$

$$s_x = \sqrt{\frac{1}{n}\sum_i (x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n}\sum_i (y_i - \bar{y})^2}$$
Features of the correlation coefficient:
– Unit free
– Ranges between −1.00 and +1.00
– −1 ≤ ρ < 0 implies that as X ↑ (↓), Y ↓ (↑)
– 0 < ρ ≤ 1 implies that as X ↑ (↓), Y ↑ (↓)
– The closer to −1.00, the stronger the negative linear relationship
– The closer to +1.00, the stronger the positive linear relationship
– The closer to 0.00, the weaker the linear relationship
– ρ = 0 implies that X and Y are not linearly associated
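As a minimal sketch, the sample correlation coefficient can be computed directly from the 1/n definitions of the covariance and standard deviations; the data below is hypothetical, invented purely for illustration:

```python
import numpy as np

# Hypothetical sample data (assumed for illustration, not from the slides)
x = np.array([3.85, 3.75, 3.70, 3.60, 3.80, 3.85, 3.90, 3.70, 3.75, 3.80])
y = np.array([7.38, 8.51, 9.52, 7.50, 9.33, 8.28, 8.75, 7.87, 7.10, 8.00])

# Population-style (divide by n) covariance and standard deviations;
# the 1/n factors cancel in the ratio, so any consistent divisor works.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
s_x = np.sqrt(np.mean((x - x.mean()) ** 2))
s_y = np.sqrt(np.mean((y - y.mean()) ** 2))
r = cov_xy / (s_x * s_y)

# Agrees with NumPy's built-in correlation coefficient
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```

Because the same 1/n factor appears in the numerator and denominator, using n − 1 throughout would give the identical value of r.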
Examples of Approximate r Values
[Figure: five scatter plots with approximate correlation values r = −1.00, r = −0.60, r = 0.00, r = 0.20, and r = 1.00]
Introduction to Regression Analysis
• Regression analysis is used to predict the value of a dependent variable based on the value of at least one independent variable, and to explain the impact of changes in an independent variable on the dependent variable.

The simple linear regression model is

$$y = \beta_0 + \beta_1 x + \varepsilon$$

[Figure: population regression line, showing the observed value of y for x_i, the predicted value of y for x_i, the random error ε_i for this x value, the slope β_1, and the intercept β_0]
Estimated Regression Model
The sample regression line provides an estimate of the
population regression line
$$\hat{y}_i = b_0 + b_1 x_i$$

where x is the independent variable, b_0 and b_1 are the estimated intercept and slope, and the error term satisfies E(ε_i) = 0.
Least Square Regression
• b_0 and b_1 are obtained by finding the values of b_0 and b_1 that minimize the sum of the squared residuals:

$$SSE = \sum e^2 = \sum (y - \hat{y})^2 = \sum \left( y - (b_0 + b_1 x) \right)^2$$
Interpretation of the Slope and the Intercept
$$\hat{y} = \bar{y} + b_1 (x - \bar{x})$$

b_1: regression coefficient of y on x
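The least squares estimates have closed-form solutions. The sketch below computes b_1 and b_0 from hypothetical data (assumed purely for illustration) and verifies that the fitted line passes through the point of means (x̄, ȳ):

```python
import numpy as np

# Hypothetical data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates:
# b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2),  b0 = ybar - b1*xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Fitted values and the minimized sum of squared residuals (SSE)
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)

# The least squares line always passes through (xbar, ybar)
assert np.isclose(y.mean(), b0 + b1 * x.mean())
```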
Simple Linear Regression
Example: Demand for "Fresh" antiseptic liquid (Excel regression output)

Regression Statistics
Multiple R          0.469220295
R Square            0.220167685
Adjusted R Square   0.192316531
Standard Error      0.612239472
Observations        30

ANOVA (only the Total row is legible in the extracted output): df = 29, SS = 13.45859

Coefficients: b_0 = 21.62429 and b_1 = −3.54528, giving the estimated regression equation

$$\hat{y} = 21.62429 - 3.54528\, x_1$$
Coefficient of Determination
• Relationship among SST, SSR, and SSE:

$$SST = SSR + SSE$$

$$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination
The coefficient of determination is:

$$R^2 = r^2 = \frac{SSR}{SST}$$

where:
SSR = sum of squares due to regression
SST = total sum of squares

R² is a measure of relative fit based on a comparison of SSR and SST. Since 0 ≤ R² ≤ 1, a value of R² closer to one indicates a better fit, and a value closer to zero indicates a poorer fit.
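The decomposition SST = SSR + SSE and the resulting R² can be checked numerically; the sketch below uses hypothetical data (assumed for illustration) fitted by least squares:

```python
import numpy as np

# Hypothetical data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares due to regression
sse = np.sum((y - y_hat) ** 2)         # sum of squares due to error

assert np.isclose(sst, ssr + sse)      # SST = SSR + SSE
r_squared = ssr / sst
```

In simple linear regression R² also equals the squared sample correlation r², which can be confirmed against `np.corrcoef`.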
t Test
For the slope, test H_0: β_1 = β_{10} against H_1: β_1 ≠ β_{10} using

$$t_{obs} = \frac{b_1 - \beta_{10}}{s_{b_1}}, \qquad s_{b_1} = \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}}$$

For the intercept, test H_0: β_0 = β_{00} against H_1: β_0 ≠ β_{00} using

$$t_{obs} = \frac{b_0 - \beta_{00}}{s_{b_0}}, \qquad s_{b_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}$$

where:
t is based on a t distribution with n − 2 degrees of freedom
Testing for Significance: Example
3. Select the test statistic:

$$t = \frac{b_1}{s_{b_1}}, \qquad t_{obs} = -2.81161$$

4. State the rejection rule: reject H_0 if the p-value < 0.05.

Since the p-value = 0.008902 < 0.05, we reject the null hypothesis: the slope parameter is significant, and price has a significant impact on demand.
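A minimal sketch of the slope t test, using hypothetical data (assumed for illustration, not the "Fresh" data set) and testing H_0: β_1 = 0:

```python
import numpy as np
from scipy import stats

# Hypothetical data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8])
n = len(x)

# Least squares estimates
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Standard error of the estimate (s) and of the slope (s_b1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))
s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# t statistic for H0: beta1 = 0, with n - 2 degrees of freedom
t_obs = (b1 - 0.0) / s_b1
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)

# Agrees with SciPy's built-in simple regression test
assert np.isclose(p_value, stats.linregress(x, y).pvalue)
```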
Confidence Interval for β_1
• The form of a confidence interval for β_1 is

$$b_1 \pm t_{\alpha/2}\, s_{b_1}$$

where b_1 is the point estimator, t_{\alpha/2}\, s_{b_1} is the margin of error, and t_{\alpha/2} is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom.
Confidence Interval for β_0
• The form of a confidence interval for β_0 is

$$b_0 \pm t_{\alpha/2}\, s_{b_0}$$

where b_0 is the point estimator, t_{\alpha/2}\, s_{b_0} is the margin of error, and t_{\alpha/2} is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom.
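The interval for the slope can be sketched as follows, again with hypothetical data (assumed for illustration) and a 95% confidence level:

```python
import numpy as np
from scipy import stats

# Hypothetical data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8])
n = len(x)

# Least squares fit and standard error of the slope
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))
s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# 95% confidence interval: b1 +/- t_{alpha/2} * s_b1, df = n - 2
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci_low, ci_high = b1 - t_crit * s_b1, b1 + t_crit * s_b1
```

The interval for β_0 follows the same pattern with s_{b_0} in place of s_{b_1}.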
Point Estimation
$$\hat{y} = 10 + 5(3) = 25 \text{ cars}$$
Multiple Regression
The simple linear regression model was used to analyze how one interval variable (the dependent variable y) is related to one other interval variable (the independent variable x). The multiple regression model extends this to p independent variables:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

where ε is the error variable and β_0, β_1, …, β_p are the coefficients.
Estimating the Regression Coefficients
• The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression.
• We choose β̂_0, β̂_1, …, β̂_p to minimize the sum of squared residuals

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip} \right)^2$$
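Minimizing RSS over all the coefficients at once is a linear least squares problem. A minimal sketch with simulated data (the coefficients 1, 2, and −3 are assumptions chosen for illustration):

```python
import numpy as np

# Simulated data: n = 30 observations, p = 2 predictors (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=30)

# Prepend a column of ones for the intercept, then minimize ||y - X1 @ beta||^2
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The minimized sum of squared residuals
rss = np.sum((y - X1 @ beta_hat) ** 2)
```

With the small noise level used here, beta_hat recovers the assumed coefficients closely.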
Estimating the Coefficients…
The sample regression equation is expressed as

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$$
Some Important Questions
1. Is at least one of the predictors 𝑋1 , … , 𝑋𝑝
useful in predicting the response?
2. How well does the model fit the data?
3. Given a set of predictor values, what response
value should we predict, and how accurate is
our prediction?
Is at least one of the predictors X_1, …, X_p useful in predicting the response?
• We test the null hypothesis

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$$

against the alternative

$$H_a: \text{at least one } \beta_j \text{ is non-zero.}$$

• We use the following F-statistic to test the above hypothesis:

$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$$

which, under H_0, follows an F-distribution with p and n − p − 1 degrees of freedom.
• The observed p-value is P(F_{p, n−p−1} > F_obs). Reject H_0 if the p-value is small.
• For the Advertising data set, the observed p-value is very low. Thus we reject H_0.
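The overall F test can be sketched with simulated data (the true coefficients are assumptions for illustration, so under this setup H_0 is false and the test should reject):

```python
import numpy as np
from scipy import stats

# Simulated data: n = 30 observations, p = 2 predictors (assumed for illustration)
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Fit by least squares with an intercept column
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# F-statistic for H0: beta_1 = ... = beta_p = 0
rss = np.sum((y - X1 @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)
```

Here the predictors genuinely drive y, so the p-value is tiny and H_0 is rejected, mirroring the Advertising example above.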
Model Fit