
Introduction to Correlation and Linear Regression
Scatter Plots and Correlation

– A scatter plot (or scatter diagram) is used to show the relationship between two variables
– Correlation analysis is used to measure the strength of the linear association between two variables
• Only concerned with the strength of the relationship
• No causal effect is implied
Examples
• Salary and number of years of experience
• Household income and expenditure
• Price and supply of commodities
• Amount of rainfall and yield of crops
• Price and demand of goods
• Weight and blood pressure
• Sales and GDP
Scatter Plot Examples

[Scatter plots illustrating linear and curvilinear relationships, strong and weak relationships, and no relationship between x and y.]
Correlation Coefficient
• The population correlation coefficient (ρ) measures the strength of the linear association between the variables
• The sample correlation coefficient (r) is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations
Calculating the Sample Correlation Coefficient

    r_xy = cov(x, y) / (s_x · s_y)

where

    cov(x, y) = (1/n) Σ (x_i − x̄)(y_i − ȳ)
    s_x = √[ (1/n) Σ (x_i − x̄)² ]
    s_y = √[ (1/n) Σ (y_i − ȳ)² ]
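As a quick numerical check on these formulas, here is a minimal Python sketch. The x and y arrays are placeholder values, not data from these notes; note that the 1/n factors cancel in r, so np.corrcoef gives the same answer.

    import numpy as np

    # Placeholder data for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    n = len(x)
    # cov, s_x, s_y with 1/n divisors, matching the formulas above
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n
    s_x = np.sqrt(np.sum((x - x.mean()) ** 2) / n)
    s_y = np.sqrt(np.sum((y - y.mean()) ** 2) / n)

    r = cov_xy / (s_x * s_y)
    print(r)                        # strong positive linear association
    print(np.corrcoef(x, y)[0, 1])  # cross-check: identical value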
Features of correlation coefficient
– Unit free
– Ranges between −1.00 and 1.00
– −1 ≤ ρ < 0 implies that as X ↑ (↓), Y ↓ (↑)
– 0 < ρ ≤ 1 implies that as X ↑ (↓), Y ↑ (↓)
– The closer to −1.00, the stronger the negative linear relationship
– The closer to 1.00, the stronger the positive linear relationship
– The closer to 0.00, the weaker the linear relationship
– ρ = 0 implies that X and Y are not linearly associated
Examples of Approximate r Values

[Scatter plots illustrating approximate r values: r = −1.00, r = −0.60, r = 0.00, r = 0.20, r = 1.00]
Introduction to Regression Analysis
• Regression analysis is used to:
  – Predict the value of a dependent variable based on the value of at least one independent variable
  – Explain the impact of changes in an independent variable on the dependent variable

**Dependent variable: the variable we wish to explain (predict)
**Independent variable: the variable used to explain (predict) the dependent variable

Simple linear regression:
• Only one independent variable, x
• Relationship between x and y is described by a linear function

Model Building

The inexact nature of the relationship between years of experience and annual sales suggests that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from the random component:

    Statistical model:  Data = Systematic component + Random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
Population Linear Regression

The population regression model:

    y = β0 + β1·x + ε

where y is the dependent variable, β0 is the population y-intercept, β1 is the population slope coefficient, x is the independent variable, and ε is the random error term (residual).

Linear (systematic) component: β0 + β1·x
Random error component: ε
Population Linear Regression

[Figure: the population regression line y = β0 + β1x + ε, with intercept β0, slope β1, and the random error εi shown as the vertical distance between the observed and predicted values of y at xi.]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

    ŷ_i = b0 + b1·x

where ŷ_i is the estimated (or predicted) y value, b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, and x is the independent variable. The model assumes E(εi) = 0.
Least Squares Regression
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

    SSE = Σ e² = Σ (y − ŷ)² = Σ (y − (b0 + b1·x))²
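Setting the partial derivatives of SSE to zero gives the standard closed-form estimates b1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and b0 = ȳ − b1·x̄. A minimal sketch of the calculation, again with placeholder arrays:

    import numpy as np

    # Placeholder data for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

    # Closed-form least squares estimates
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    y_hat = b0 + b1 * x
    sse = np.sum((y - y_hat) ** 2)  # the quantity being minimized
    print(b0, b1, sse)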
Interpretation of the Slope and the Intercept

• b0 is the estimated average value of y when the value of x is zero
• b1 is the estimated change in the average value of y as a result of a one-unit change in x

The regression equation of y on x is

    ŷ = ȳ + b1·(x − x̄)

where b1 is the regression coefficient of y on x.
Simple Linear Regression
Example: Demand for “Fresh” antiseptic liquid

A company produces a brand of antiseptic liquid (“Fresh”). In order to manage its inventory more effectively and make revenue projections, the company would like to better predict demand for Fresh. To develop a prediction model, the company has gathered data on demand for Fresh over the last 30 sales periods (1 period = four weeks). The factory manager assumes that the price of Fresh influences demand.

Y: demand for the 500 ml bottle of Fresh (in hundreds of thousands of bottles) in the sales period
X1: price (in $) of Fresh in the sales period
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.469220295
R Square             0.220167685
Adjusted R Square    0.192316531
Standard Error       0.612239472
Observations         30

ANOVA
             df        SS        MS         F   Significance F
Regression    1   2.963146  2.963146  7.905155         0.008902
Residual     28  10.49544   0.374837
Total        29  13.45859

              Coefficients  Standard Error    t Stat   P-value  Lower 95%  Upper 95%
Intercept      21.62429127        4.710949   4.59022   8.5E-05   11.97435   31.27423
X Variable 1   -3.545281018       1.260943  -2.81161  0.008902   -6.12821   -0.96236
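Output in this format can be produced in Python with statsmodels. The 30 actual (price, demand) observations are not reproduced in these notes, so the sketch below generates synthetic stand-in data, with coefficients chosen only to loosely resemble the fitted model; the call pattern, not the numbers, is the point.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    # Synthetic stand-ins for the 30 (price, demand) observations
    price = rng.uniform(3.0, 4.0, size=30)
    demand = 21.6 - 3.5 * price + rng.normal(0, 0.6, size=30)

    X = sm.add_constant(price)      # adds the intercept column
    model = sm.OLS(demand, X).fit()
    print(model.summary())          # R-squared, ANOVA F, coefficients, t stats, CIs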


Estimated Regression Equation

Slope of the estimated regression equation:
    b1 = −3.54528

y-intercept of the estimated regression equation:
    b0 = 21.62429

Estimated regression equation:
    ŷ = 21.62429 − 3.54528·x1
Coefficient of Determination
• Relationship Among SST, SSR, SSE
SST = SSR + SSE

    Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination
The coefficient of determination R² is a measure of relative fit based on a comparison of SSR and SST:

    R² = r² = SSR / SST

where:
    SSR = sum of squares due to regression
    SST = total sum of squares

0 ≤ R² ≤ 1: a value of R² closer to one indicates a better fit, while a value closer to zero indicates a poorer fit.
Coefficient of Determination

R² = SSR/SST = 2.963146 / 13.45859 = 0.220168

The regression relationship is weak; only 22.02% of the variability in the demand for Fresh is explained by the linear relationship between demand and price.
Test for Significance

To test for a significant regression relationship, we test the intercept parameter β0, the slope parameter β1, and the predicted y.

The test commonly used is the t test.

The t test requires an estimate of σe², the variance of the error (residuals) in the regression model.
Testing for slope parameter
• Hypotheses
    H0: β1 = β10
    H1: β1 ≠ β10

• Test statistic, under H0:

    t_obs = (b1 − β10) / s_b1,   where s_b1 = s / √[ Σ (x_i − x̄)² ]

For the example, s_b1 = 1.260943
Testing for intercept parameter
• Hypotheses
    H0: β0 = β00
    H1: β0 ≠ β00

• Test statistic, under H0:

    t_obs = (b0 − β00) / s_b0,   where s_b0 = s · √[ 1/n + x̄² / Σ (x_i − x̄)² ]

For the example, s_b0 = 4.710949
Testing for Significance: t Test
Critical Region

Reject H0 if p-value < α, or if t_obs < −t_{α/2; n−2} or t_obs > t_{α/2; n−2}

where t_{α/2; n−2} is based on a t distribution with n − 2 degrees of freedom.
Testing for Significance: Example

1. Determine the hypotheses. H0 : 1 = 0


H a : 1  0
2. Specify the level of significance. a = .05

b1
3. Select the test statistic. t=
sb1
t(obs)= -2.81161
4. State the rejection rule.
P value= 0.008902<0.05
Reject null
Slope parameter is significant
Price has significant impact on
demand
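The reported t statistic and p-value can be verified directly from the output above (b1 = −3.545281, s_b1 = 1.260943, n = 30, so df = 28). A sketch using SciPy:

    from scipy import stats

    b1, s_b1, n = -3.545281, 1.260943, 30

    t_obs = b1 / s_b1                               # tests H0: beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)  # two-sided p-value

    print(t_obs)    # about -2.8116, matching the output
    print(p_value)  # about 0.0089 < 0.05, so H0 is rejected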
Confidence Interval for β1

• The form of a confidence interval for β1 is:

    b1 ± t_{α/2} · s_b1

where b1 is the point estimator, t_{α/2} · s_b1 is the margin of error, and t_{α/2} is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom.
Confidence Interval for β0

• The form of a confidence interval for β0 is:

    b0 ± t_{α/2} · s_b0

where b0 is the point estimator, t_{α/2} · s_b0 is the margin of error, and t_{α/2} is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom.
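Applied to the Fresh output with t_{0.025; 28} ≈ 2.0484, these formulas reproduce the Lower 95% and Upper 95% columns of the summary table. A sketch:

    from scipy import stats

    n = 30
    t_crit = stats.t.ppf(0.975, df=n - 2)  # upper alpha/2 point, about 2.0484

    b1, s_b1 = -3.545281, 1.260943
    b0, s_b0 = 21.624291, 4.710949

    print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)  # about (-6.128, -0.962)
    print(b0 - t_crit * s_b0, b0 + t_crit * s_b0)  # about (11.974, 31.274)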
Point Estimation

For a price fixed at $3.12, the mean demand for Fresh is estimated as

    ŷ = 21.62429 − 3.54528(3.12) ≈ 10.56 (hundreds of thousands of bottles)
Multiple Regression
The simple linear regression model was used to analyze
how one interval variable (the dependent variable y) is
related to one other interval variable (the independent
variable x).

Multiple regression allows for any number of


independent variables.

We expect to develop models that fit the data better


than would a simple linear regression model.
We now assume we have k independent variables potentially related to the one dependent variable. This relationship is represented in this first-order linear equation:

    y = β0 + β1·x1 + β2·x2 + ⋯ + βk·xk + ε

where y is the dependent variable, x1, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ε is the error variable.
Estimating the Regression Coefficients
• The parameters are estimated using the same least
squares approach that we saw in the context of simple
linear regression.
• We choose β̂0, β̂1, …, β̂p to minimize the sum of squared residuals:

    RSS = Σ_{i=1..n} (y_i − ŷ_i)²
        = Σ_{i=1..n} (y_i − β̂0 − β̂1·x_i1 − β̂2·x_i2 − ⋯ − β̂p·x_ip)²
Estimating the Coefficients…
The sample regression equation is expressed as:

    ŷ = β̂0 + β̂1·x1 + β̂2·x2 + ⋯ + β̂p·xp
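In matrix form the minimizer is β̂ = (XᵀX)⁻¹Xᵀy; in practice the minimization is handed to a linear algebra routine. A minimal sketch with hypothetical data and p = 2 predictors:

    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical data: n = 50 observations, p = 2 predictors
    n, p = 50, 2
    X = rng.normal(size=(n, p))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.3, size=n)

    # Design matrix with an intercept column; lstsq minimizes RSS
    A = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

    rss = np.sum((y - A @ beta_hat) ** 2)  # minimized sum of squared residuals
    print(beta_hat, rss)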
Some Important Questions
1. Is at least one of the predictors 𝑋1 , … , 𝑋𝑝
useful in predicting the response?
2. How well does the model fit the data?
3. Given a set of predictor values, what response
value should we predict, and how accurate is
our prediction?
Is at least one of the predictors X1, …, Xp useful in predicting the response?

• We test the null hypothesis
    H0: β1 = β2 = ⋯ = βp = 0
  against the alternative
    Ha: at least one βj is non-zero.

• We use the following F-statistic to test the above hypothesis:

    F = [ (TSS − RSS) / p ] / [ RSS / (n − p − 1) ]

  which, under H0, follows an F-distribution with degrees of freedom p and n − p − 1.

• The observed p-value is P(F_{p, n−p−1} > F_obs). Reject H0 if the p-value is small.

• For the Advertising data set, the observed p-value is very low, so we reject H0.
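For the simple (p = 1) Fresh regression, the ANOVA table above already lists the ingredients: SSR = TSS − RSS = 2.963146 and RSS = 10.49544 with n = 30. A sketch that reproduces the reported F and Significance F:

    from scipy import stats

    n, p = 30, 1
    ssr, rss = 2.963146, 10.49544  # from the Fresh ANOVA table

    F_obs = (ssr / p) / (rss / (n - p - 1))
    p_value = stats.f.sf(F_obs, p, n - p - 1)

    print(F_obs)    # about 7.905, matching the table
    print(p_value)  # about 0.0089 (equals the slope t test p-value when p = 1)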
Model Fit

• The quality of a linear regression fit is assessed using the following quantities:
  ➢ R² statistic
  ➢ Adjusted R² statistic
  ➢ Residual Standard Error (RSE)
R² and Adjusted R² Statistic

• In MLR, R² equals Cor(Y, Ŷ)², i.e., the square of the correlation between the response and the fitted values.
• R² close to 1 indicates that the model explains a large portion of the variance in the response variable.
• However, R² always increases with the addition of every new variable.
• This is remedied using

    Adjusted R² = 1 − [ RSS / (n − p − 1) ] / [ TSS / (n − 1) ]

• A model with more variables can have a lower Adjusted R².
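With the Fresh numbers (RSS = 10.49544, TSS = 13.45859, n = 30, p = 1), this formula reproduces the Adjusted R Square line of the summary output:

    rss, tss = 10.49544, 13.45859
    n, p = 30, 1

    r2 = 1 - rss / tss                                  # about 0.2202
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))  # about 0.1923

    print(r2, adj_r2)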
Residual Standard Error (RSE)

• The RSE is defined as

    RSE = √[ RSS / (n − p − 1) ]

• A model with more variables can have higher RSE if the decrease in RSS is small relative to the increase in the number of variables (p).
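For the Fresh regression, RSE = √(10.49544 / 28) ≈ 0.6122, which matches the Standard Error reported in the summary output. A one-line check:

    import math

    rss, n, p = 10.49544, 30, 1
    print(math.sqrt(rss / (n - p - 1)))  # about 0.6122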
