
Introduction to Correlation and Linear Regression
Scatter Plots and Correlation

– A scatter plot (or scatter diagram) is used to show the relationship between two variables
– Correlation analysis is used to measure the strength of the linear association between two variables
• Only concerned with the strength of the relationship
• No causal effect is implied
Examples
• Salary and number of years of experience
• Household income and expenditure
• Price and supply of commodities
• Amount of rainfall and yield of crops
• Price and demand of goods
• Weight and blood pressure
• Sales and GDP
Scatter Plot Examples

[Scatter plots illustrating linear and curvilinear relationships, strong and weak relationships, and no relationship between x and y.]
Correlation Coefficient
• The population correlation coefficient (ρ) measures the strength of the linear association between the variables
• The sample correlation coefficient (r) is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations
Calculating the Sample Correlation Coefficient

    r_xy = cov(x, y) / (s_x · s_y)

where

    cov(x, y) = (1/n) Σ (x_i − x̄)(y_i − ȳ)
    s_x = √[ (1/n) Σ (x_i − x̄)² ]
    s_y = √[ (1/n) Σ (y_i − ȳ)² ]
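As a quick numerical check on these formulas, here is a minimal Python sketch. The x and y arrays are placeholder values, not data from these notes; note that the 1/n factors cancel in r, so np.corrcoef gives the same answer.

    import numpy as np

    # Placeholder data for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    n = len(x)
    # cov, s_x, s_y with 1/n divisors, matching the formulas above
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n
    s_x = np.sqrt(np.sum((x - x.mean()) ** 2) / n)
    s_y = np.sqrt(np.sum((y - y.mean()) ** 2) / n)

    r = cov_xy / (s_x * s_y)
    print(r)                        # strong positive linear association
    print(np.corrcoef(x, y)[0, 1])  # cross-check: identical value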
Features of correlation coefficient
– Unit free
– Ranges between −1.00 and 1.00
– −1 ≤ ρ < 0 implies that as X ↑ (↓), Y ↓ (↑)
– 0 < ρ ≤ 1 implies that as X ↑ (↓), Y ↑ (↓)
– The closer to −1.00, the stronger the negative linear relationship
– The closer to 1.00, the stronger the positive linear relationship
– The closer to 0.00, the weaker the linear relationship
– ρ = 0 implies that X and Y are not linearly associated
Examples of Approximate r Values

[Scatter plots illustrating approximate r values: r = −1.00, r = −0.60, r = 0.00, r = 0.20, r = 1.00]
Introduction to Regression Analysis
• Regression analysis is used to:
  – Predict the value of a dependent variable based on the value of at least one independent variable
  – Explain the impact of changes in an independent variable on the dependent variable

**Dependent variable: the variable we wish to explain (predict)
**Independent variable: the variable used to explain (predict) the dependent variable

Simple linear regression:
• Only one independent variable, x
• Relationship between x and y is described by a linear function

Model Building

The inexact nature of the relationship between years of experience and annual sales suggests that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from the random component:

    Statistical model:  Data = Systematic component + Random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
Population Linear Regression

The population regression model:

    y = β0 + β1·x + ε

where y is the dependent variable, β0 is the population y-intercept, β1 is the population slope coefficient, x is the independent variable, and ε is the random error term (residual).

Linear (systematic) component: β0 + β1·x
Random error component: ε
Population Linear Regression

[Figure: the population regression line y = β0 + β1x + ε, with intercept β0, slope β1, and the random error εi shown as the vertical distance between the observed and predicted values of y at xi.]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

    ŷ_i = b0 + b1·x

where ŷ_i is the estimated (or predicted) y value, b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, and x is the independent variable. The model assumes E(εi) = 0.
Least Squares Regression
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

    SSE = Σ e² = Σ (y − ŷ)² = Σ (y − (b0 + b1·x))²
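Setting the partial derivatives of SSE to zero gives the standard closed-form estimates b1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and b0 = ȳ − b1·x̄. A minimal sketch of the calculation, again with placeholder arrays:

    import numpy as np

    # Placeholder data for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

    # Closed-form least squares estimates
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    y_hat = b0 + b1 * x
    sse = np.sum((y - y_hat) ** 2)  # the quantity being minimized
    print(b0, b1, sse)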
Interpretation of the Slope and the Intercept

• b0 is the estimated average value of y when the value of x is zero
• b1 is the estimated change in the average value of y as a result of a one-unit change in x

The regression equation of y on x is

    ŷ = ȳ + b1·(x − x̄)

where b1 is the regression coefficient of y on x.
Simple Linear Regression
Example: Demand for “Fresh” antiseptic liquid

A company produces a brand of antiseptic liquid (“Fresh”). In order to manage its inventory more effectively and make revenue projections, the company would like to better predict demand for Fresh. To develop a prediction model, the company has gathered data on demand for Fresh over the last 30 sales periods (1 period = four weeks). The factory manager assumes that the price of Fresh influences demand.

Y: demand for the 500 ml bottle of Fresh (in hundreds of thousands of bottles) in the sales period
X1: price (in $) of Fresh in the sales period
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.469220295
R Square             0.220167685
Adjusted R Square    0.192316531
Standard Error       0.612239472
Observations         30

ANOVA
             df        SS        MS         F   Significance F
Regression    1   2.963146  2.963146  7.905155         0.008902
Residual     28  10.49544   0.374837
Total        29  13.45859

              Coefficients  Standard Error    t Stat   P-value  Lower 95%  Upper 95%
Intercept      21.62429127        4.710949   4.59022   8.5E-05   11.97435   31.27423
X Variable 1   -3.545281018       1.260943  -2.81161  0.008902   -6.12821   -0.96236
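Output in this format can be produced in Python with statsmodels. The 30 actual (price, demand) observations are not reproduced in these notes, so the sketch below generates synthetic stand-in data, with coefficients chosen only to loosely resemble the fitted model; the call pattern, not the numbers, is the point.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    # Synthetic stand-ins for the 30 (price, demand) observations
    price = rng.uniform(3.0, 4.0, size=30)
    demand = 21.6 - 3.5 * price + rng.normal(0, 0.6, size=30)

    X = sm.add_constant(price)      # adds the intercept column
    model = sm.OLS(demand, X).fit()
    print(model.summary())          # R-squared, ANOVA F, coefficients, t stats, CIs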


Estimated Regression Equation

Slope of the estimated regression equation:
    b1 = −3.54528

y-intercept of the estimated regression equation:
    b0 = 21.62429

Estimated regression equation:
    ŷ = 21.62429 − 3.54528·x1
Coefficient of Determination
• Relationship Among SST, SSR, SSE
SST = SSR + SSE

    Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination
The coefficient of determination R² is a measure of relative fit based on a comparison of SSR and SST:

    R² = r² = SSR / SST

where:
    SSR = sum of squares due to regression
    SST = total sum of squares

0 ≤ R² ≤ 1: a value of R² closer to one indicates a better fit, while a value closer to zero indicates a poorer fit.
Coefficient of Determination

R² = SSR/SST = 2.963146 / 13.45859 = 0.220168

The regression relationship is weak; only 22.02% of the variability in the demand for Fresh is explained by the linear relationship between demand and price.
Test for Significance

To test for a significant regression relationship, we test the intercept parameter β0, the slope parameter β1, and the predicted y.

The test commonly used is the t test.

The t test requires an estimate of σe², the variance of the error (residuals) in the regression model.
Testing for slope parameter
• Hypotheses
    H0: β1 = β10
    H1: β1 ≠ β10

• Test statistic, under H0:

    t_obs = (b1 − β10) / s_b1,   where s_b1 = s / √[ Σ (x_i − x̄)² ]

For the example, s_b1 = 1.260943
Testing for intercept parameter
• Hypotheses
    H0: β0 = β00
    H1: β0 ≠ β00

• Test statistic, under H0:

    t_obs = (b0 − β00) / s_b0,   where s_b0 = s · √[ 1/n + x̄² / Σ (x_i − x̄)² ]

For the example, s_b0 = 4.710949
Testing for Significance: t Test
Critical Region

Reject H0 if p-value < α, or if t_obs < −t_{α/2; n−2} or t_obs > t_{α/2; n−2}

where t_{α/2; n−2} is based on a t distribution with n − 2 degrees of freedom.
Testing for Significance: Example

1. Determine the hypotheses. H0 : 1 = 0


H a : 1  0
2. Specify the level of significance. a = .05

b1
3. Select the test statistic. t=
sb1
t(obs)= -2.81161
4. State the rejection rule.
P value= 0.008902<0.05
Reject null
Slope parameter is significant
Price has significant impact on
demand
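The reported t statistic and p-value can be verified directly from the output above (b1 = −3.545281, s_b1 = 1.260943, n = 30, so df = 28). A sketch using SciPy:

    from scipy import stats

    b1, s_b1, n = -3.545281, 1.260943, 30

    t_obs = b1 / s_b1                               # tests H0: beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)  # two-sided p-value

    print(t_obs)    # about -2.8116, matching the output
    print(p_value)  # about 0.0089 < 0.05, so H0 is rejected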
Confidence Interval for β1

• The form of a confidence interval for β1 is:

    b1 ± t_{α/2} · s_b1

where b1 is the point estimator, t_{α/2} · s_b1 is the margin of error, and t_{α/2} is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom.
Confidence Interval for β0

• The form of a confidence interval for β0 is:

    b0 ± t_{α/2} · s_b0

where b0 is the point estimator, t_{α/2} · s_b0 is the margin of error, and t_{α/2} is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom.
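Applied to the Fresh output with t_{0.025; 28} ≈ 2.0484, these formulas reproduce the Lower 95% and Upper 95% columns of the summary table. A sketch:

    from scipy import stats

    n = 30
    t_crit = stats.t.ppf(0.975, df=n - 2)  # upper alpha/2 point, about 2.0484

    b1, s_b1 = -3.545281, 1.260943
    b0, s_b0 = 21.624291, 4.710949

    print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)  # about (-6.128, -0.962)
    print(b0 - t_crit * s_b0, b0 + t_crit * s_b0)  # about (11.974, 31.274)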
Point Estimation

For a price fixed at $3.12, the mean demand for Fresh is estimated as

    ŷ = 21.62429 − 3.54528(3.12) ≈ 10.56 (hundreds of thousands of bottles)
Multiple Regression
The simple linear regression model was used to analyze
how one interval variable (the dependent variable y) is
related to one other interval variable (the independent
variable x).

Multiple regression allows for any number of


independent variables.

We expect to develop models that fit the data better


than would a simple linear regression model.
We now assume we have k independent variables potentially related to the one dependent variable. This relationship is represented in this first-order linear equation:

    y = β0 + β1·x1 + β2·x2 + ⋯ + βk·xk + ε

where y is the dependent variable, x1, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ε is the error variable.
Estimating the Regression Coefficients
• The parameters are estimated using the same least
squares approach that we saw in the context of simple
linear regression.
• We choose β̂0, β̂1, …, β̂p to minimize the sum of squared residuals:

    RSS = Σ_{i=1..n} (y_i − ŷ_i)²
        = Σ_{i=1..n} (y_i − β̂0 − β̂1·x_i1 − β̂2·x_i2 − ⋯ − β̂p·x_ip)²
Estimating the Coefficients…
The sample regression equation is expressed as:

    ŷ = β̂0 + β̂1·x1 + β̂2·x2 + ⋯ + β̂p·xp
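In matrix form the minimizer is β̂ = (XᵀX)⁻¹Xᵀy; in practice the minimization is handed to a linear algebra routine. A minimal sketch with hypothetical data and p = 2 predictors:

    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical data: n = 50 observations, p = 2 predictors
    n, p = 50, 2
    X = rng.normal(size=(n, p))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.3, size=n)

    # Design matrix with an intercept column; lstsq minimizes RSS
    A = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

    rss = np.sum((y - A @ beta_hat) ** 2)  # minimized sum of squared residuals
    print(beta_hat, rss)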
Some Important Questions
1. Is at least one of the predictors 𝑋1 , … , 𝑋𝑝
useful in predicting the response?
2. How well does the model fit the data?
3. Given a set of predictor values, what response
value should we predict, and how accurate is
our prediction?
Is at least one of the predictors X1, …, Xp useful in predicting the response?

• We test the null hypothesis
    H0: β1 = β2 = ⋯ = βp = 0
  against the alternative
    Ha: at least one βj is non-zero.

• We use the following F-statistic to test the above hypothesis:

    F = [ (TSS − RSS) / p ] / [ RSS / (n − p − 1) ]

  which, under H0, follows an F-distribution with degrees of freedom p and n − p − 1.

• The observed p-value is P(F_{p, n−p−1} > F_obs). Reject H0 if the p-value is small.

• For the Advertising data set, the observed p-value is very low, so we reject H0.
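For the simple (p = 1) Fresh regression, the ANOVA table above already lists the ingredients: SSR = TSS − RSS = 2.963146 and RSS = 10.49544 with n = 30. A sketch that reproduces the reported F and Significance F:

    from scipy import stats

    n, p = 30, 1
    ssr, rss = 2.963146, 10.49544  # from the Fresh ANOVA table

    F_obs = (ssr / p) / (rss / (n - p - 1))
    p_value = stats.f.sf(F_obs, p, n - p - 1)

    print(F_obs)    # about 7.905, matching the table
    print(p_value)  # about 0.0089 (equals the slope t test p-value when p = 1)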
Model Fit

• The quality of a linear regression fit is assessed using the following quantities:
  ➢ R² statistic
  ➢ Adjusted R² statistic
  ➢ Residual Standard Error (RSE)
R² and Adjusted R² Statistic

• In MLR, R² equals Cor(Y, Ŷ)², i.e., the square of the correlation between the response and the fitted values.
• R² close to 1 indicates that the model explains a large portion of the variance in the response variable.
• However, R² always increases with the addition of every new variable.
• This is remedied using

    Adjusted R² = 1 − [ RSS / (n − p − 1) ] / [ TSS / (n − 1) ]

• A model with more variables can have a lower Adjusted R².
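With the Fresh numbers (RSS = 10.49544, TSS = 13.45859, n = 30, p = 1), this formula reproduces the Adjusted R Square line of the summary output:

    rss, tss = 10.49544, 13.45859
    n, p = 30, 1

    r2 = 1 - rss / tss                                  # about 0.2202
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))  # about 0.1923

    print(r2, adj_r2)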
Residual Standard Error (RSE)

• The RSE is defined as

    RSE = √[ RSS / (n − p − 1) ]

• A model with more variables can have higher RSE if the decrease in RSS is small relative to the increase in the number of variables (p).
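For the Fresh regression, RSE = √(10.49544 / 28) ≈ 0.6122, which matches the Standard Error reported in the summary output. A one-line check:

    import math

    rss, n, p = 10.49544, 30, 1
    print(math.sqrt(rss / (n - p - 1)))  # about 0.6122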
