
Chapter 5

Correlation and Linear Regression Analyses
Correlation Analysis
• A statistical technique used to determine the strength or degree of association (relationship) between two or more variables
• Considers the joint variation of two measurements, neither of which is controlled by the researcher or experimenter
• Does not imply a cause-and-effect relationship
Correlation Analysis
• No variable is labeled as independent or dependent
• Determines the existence of a linear association among variables
• Measures the strength of their linear relationship
Regression Analysis
• A statistical technique used to study the functional relationship between two or more variables
• Describes the effect of one or more variables (independent variables) on a single variable (dependent variable) by expressing the latter as a function of the former
Regression Analysis
• Establish the possible causal relationship

• Estimate the value of a variable given the


value of other variables

• Explain the variation of one variable by other


variables
Scatter diagram
• For two variables X and Y, it gives a rough idea of the relationship between X and Y

Supply Voltage (X)   Current without Electronics (Y)
0.66                  7.32
1.32                 12.22
1.98                 16.34
2.64                 23.66
3.30                 28.06
3.96                 33.39
4.62                 34.12
3.28                 39.21
5.94                 44.21
6.60                 47.48
Scatter diagram
[Scatter plot of Current without Electronics (Y) against Supply Voltage (X), showing an increasing linear trend]
Pearson’s correlation coefficient
• For two variables X and Y, Pearson’s correlation coefficient in the population is given by

ρ = σ_XY / (σ_X σ_Y) = cov(X, Y) / √[var(X) · var(Y)],   −1 ≤ ρ ≤ 1
Pearson’s correlation coefficient
The Pearson correlation coefficient computed from a sample is

r = Σ(x − x̄)(y − ȳ) / √{[Σ(x − x̄)²][Σ(y − ȳ)²]}

or the algebraic equivalent:

r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable
Pearson’s correlation coefficient
• The coefficient of correlation is used to measure the strength of association between two variables.
• The coefficient values range between −1 and +1.
  – ρ = −1: perfect negative linear association (as X increases, Y decreases)
  – ρ = +1: perfect positive linear association (X and Y increase together)
  – ρ = 0: there is no linear pattern
• The coefficient can be used to test for a linear relationship between two variables.
Pearson’s correlation coefficient
[Six scatter plots illustrating different strengths of linear association: r = −1, r = −.6, and r = 0 (top row); r = +1, r = +.3, and r = 0 (bottom row)]
Guide in interpreting ρ (or r)

|ρ|                Interpretation
0                  No linear association
0 < |ρ| < 0.2      Very weak linear association
0.2 ≤ |ρ| < 0.4    Weak linear association
0.4 ≤ |ρ| < 0.6    Moderate linear association
0.6 ≤ |ρ| < 0.8    Strong linear association
0.8 ≤ |ρ| < 1.0    Very strong linear association
1.0                Perfect linear association

(The sign of ρ gives the direction of the association.)
Test of hypothesis about ρ
Ho: ρ = 0 (There is no linear association between X and Y)

Ha: ρ ≠ 0 (There is a linear association between X and Y)
Ha: ρ > 0 (There is a positive linear association between X and Y)
Ha: ρ < 0 (There is a negative linear association between X and Y)
Test of hypothesis about ρ
Test statistic: t test

t_c = r / √[(1 − r²)/(n − 2)]

Decision rule: Reject Ho if _____; otherwise, fail to reject Ho.

i.   |t_c| ≥ t_(α/2, n−2)   (for Ha: ρ ≠ 0)
ii.  t_c ≥ t_(α, n−2)       (for Ha: ρ > 0)
iii. t_c ≤ −t_(α, n−2)      (for Ha: ρ < 0)
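As a sketch in Python, the two-sided version of this test can be carried out directly; the values r = 0.9479 and n = 10 are taken from the worked example later in the chapter, and 2.306 is the tabulated t₀.₀₂₅,₈ critical value quoted there.

```python
import math

# Sample correlation and sample size from the chapter's worked example
r, n = 0.9479, 10

# Test statistic: t_c = r / sqrt((1 - r^2) / (n - 2))
t_c = r / math.sqrt((1 - r**2) / (n - 2))

# Two-sided decision rule at alpha = 0.05: reject Ho if |t_c| >= t_(0.025, 8)
t_crit = 2.306  # from a t table, df = n - 2 = 8
reject_Ho = abs(t_c) >= t_crit

print(round(t_c, 2), reject_Ho)  # 8.42 True
```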
Calculation Example

Supply Voltage (X)  Current without Electronics (Y)  X²        Y²         XY
0.66                 7.32                             0.4356     53.5824     4.8312
1.32                12.22                             1.7424    149.328     16.1304
1.98                16.34                             3.9204    266.996     32.3532
2.64                23.66                             6.9696    559.796     62.4624
3.30                28.06                            10.89      787.364     92.598
3.96                33.39                            15.6816   1114.89     132.224
4.62                34.12                            21.3444   1164.17     157.634
3.28                39.21                            10.7584   1537.42     128.609
5.94                44.21                            35.2836   1954.52     262.607
6.60                47.48                            43.56     2254.35     313.368
ΣX = 34.3           ΣY = 286.01                      ΣX² = 150.6  ΣY² = 9842  ΣXY = 1203
Calculation Example (continued)

r = [nΣxy − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}
  = [10(1203) − (34.3)(286.01)] / √{[10(150.6) − (34.3)²][10(9842) − (286.01)²]}
  = 0.9479

r = 0.9479 → very strong positive linear association between Voltage and Current
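A quick cross-check in Python, applying the algebraic form of r to the raw (X, Y) data from the table:

```python
import math

# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

# r = [n*Sum(xy) - Sum(x)Sum(y)] / sqrt([n*Sum(x^2) - (Sum x)^2][n*Sum(y^2) - (Sum y)^2])
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 4))  # 0.9479
```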
Test of hypothesis about ρ
Ho: ρ = 0
Ha: ρ ≠ 0
α = 0.05
Test statistic: t test

t_c = r / √[(1 − r²)/(n − 2)] = 0.9479 / √[(1 − 0.9479²)/(10 − 2)] = 8.42

Decision rule: Reject Ho if |t_c| ≥ t_(0.025, 8) = 2.306; otherwise, do not reject Ho.
Decision: Reject Ho since 8.42 > 2.306.
Conclusion: At α = 0.05, there is sufficient evidence to conclude that Voltage and Current are linearly correlated; since r > 0, the association is positive.
Introduction to Regression Analysis
Regression analysis is used to:
– Predict the value of a dependent variable based
on the value of at least one independent variable
– Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to
explain
Independent variable: the variable used to
explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, x
• Relationship between x and y is described by a linear function
• Changes in y are assumed to be caused by changes in x
Population Linear Regression Model

y = β0 + β1x + ε

where:
y = dependent variable
x = independent variable
β0 = population y-intercept
β1 = population slope coefficient
ε = random error term (residual)

β0 + β1x is the linear component; ε is the random error component.
Linear Regression Assumptions
• Error values (ε) are statistically independent
• Error values are normally distributed for any given value of x
• The probability distribution of the errors has constant variance
• The underlying relationship between the x variable and the y variable is linear
Population Linear Regression (continued)
[Graph of y = β0 + β1x + ε: the intercept β0, the slope β1, and, for a given xi, the observed value of y, the predicted value on the line, and the random error εi between them]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

ŷi = b0 + b1x

where:
ŷi = estimated (or predicted) y value
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
x = independent variable

The individual random error terms ei have a mean of zero.
Least Squares Criterion
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

Σe² = Σ(y − ŷ)² = Σ(y − (b0 + b1x))²
The Least Squares Equation
• The formulas for b1 and b0 are:

b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

or the algebraic equivalent:

b1 = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]

and

b0 = ȳ − b1x̄
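A minimal Python sketch of the deviation-form formulas, reusing the chapter's voltage/current data:

```python
# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# b1 = Sum((x - x_bar)(y - y_bar)) / Sum((x - x_bar)^2)
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))

# b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(round(b1, 3), round(b0, 3))  # 6.734 5.503
```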
Interpretation of the Slope and the Intercept
• b0 is the estimated average value of y when the value of x is zero
• b1 is the estimated change in the average value of y as a result of a one-unit change in x
Finding the Least Squares Equation
• The coefficients b0 and b1 will usually be found using computer software, such as Excel or Minitab
• Other regression measures will also be computed as part of computer-based regression analysis
Example

Supply Voltage (X)  Current without Electronics (Y)  X²        Y²         XY
0.66                 7.32                             0.4356     53.5824     4.8312
1.32                12.22                             1.7424    149.328     16.1304
1.98                16.34                             3.9204    266.996     32.3532
2.64                23.66                             6.9696    559.796     62.4624
3.30                28.06                            10.89      787.364     92.598
3.96                33.39                            15.6816   1114.89     132.224
4.62                34.12                            21.3444   1164.17     157.634
3.28                39.21                            10.7584   1537.42     128.609
5.94                44.21                            35.2836   1954.52     262.607
6.60                47.48                            43.56     2254.35     313.368
ΣX = 34.3           ΣY = 286.01                      ΣX² = 150.6  ΣY² = 9842  ΣXY = 1203
Example

b1 = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] = [10(1203) − (34.3)(286.01)] / [10(150.6) − (34.3)²] = 6.734

b0 = ȳ − b1x̄ = 28.601 − 6.734(3.43) = 5.503

ŷi = 5.503 + 6.734xi
Example
[Scatter plot of the data with the fitted line ŷi = 5.503 + 6.734xi]
Interpretation of the Intercept, b0

Ŷ = 5.503 + 6.734X

• b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values)
• b1 is the estimated change in the value of Y per unit change in X
Least Squares Regression Properties
• The sum of the residuals from the least squares regression line is 0: Σ(y − ŷ) = 0
• The sum of the squared residuals, Σ(y − ŷ)², is a minimum
• The simple regression line always passes through the mean of the y variable and the mean of the x variable
• The least squares coefficients are unbiased estimates of β0 and β1
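The first and third properties are easy to verify numerically for the example's fitted line (a sketch; coefficients are recomputed from the data rather than rounded):

```python
# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar

# Property 1: residuals sum to (numerically) zero
residual_sum = sum(b - (b0 + b1 * a) for a, b in zip(x, y))

# Property 3: the line passes through (x_bar, y_bar)
on_mean = b0 + b1 * x_bar

print(abs(residual_sum) < 1e-9, abs(on_mean - y_bar) < 1e-9)  # True True
```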
Explained and Unexplained Variation

• Total variation is made up of two parts:

TSS = SSR + SSE

TSS = Σ(y − ȳ)²   (total sum of squares)
SSR = Σ(ŷ − ȳ)²   (sum of squares regression)
SSE = Σ(y − ŷ)²   (sum of squares error)

where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value
Explained and Unexplained Variation
(continued)

• TSS = total sum of squares


– Measures the variation of the yi values around their mean ȳ
• SSE = error sum of squares
– Variation attributed to factors other than the
relationship between x and y
• SSR = regression sum of squares
– Explained variation attributed to the linear
relationship between x and y
Explained and Unexplained Variation (continued)
[Graph at a point xi decomposing the total deviation of yi from ȳ into an explained part and an unexplained part: TSS = Σ(yi − ȳ)², SSR = Σ(ŷi − ȳ)², SSE = Σ(yi − ŷi)²]
Coefficient of Determination, R²
• The coefficient of determination is the proportion of the total variation in the dependent variable that is explained by variation in the independent variable
• The coefficient of determination is also called R-squared and is denoted as R²

R² = SSR / TSS,  where 0 ≤ R² ≤ 1
Coefficient of Determination, R² (continued)

R² = SSR / TSS = (sum of squares explained by regression) / (total sum of squares)

Note: In the single independent variable case, the coefficient of determination is

R² = r²

where:
R² = coefficient of determination
r = simple correlation coefficient
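For the chapter's example this identity can be checked directly; a sketch computing R² as SSR/TSS and comparing it with r²:

```python
import math

# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

TSS = sum((b - y_bar) ** 2 for b in y)
SSR = sum((h - y_bar) ** 2 for h in y_hat)
R2 = SSR / TSS

# Pearson's r for the same data
r = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
     / math.sqrt(sum((a - x_bar) ** 2 for a in x)
                 * sum((b - y_bar) ** 2 for b in y)))

print(round(R2, 4), round(r**2, 4))  # 0.8986 0.8986
```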
Assessing the overall goodness-of-fit of the model
• ANOVA F test: if the F test of the ANOVA is significant at level α, then the estimated equation fits the data well
• R²: the higher the value of R², the better the fit of the model
• Adjusted R²: more reliable than R²; use this if there is more than one predictor in the model
Standard Error of Estimate
• The standard deviation of the variation of observations around the regression line is estimated by

s_ε = √[SSE / (n − k − 1)]

where:
SSE = sum of squares error
n = sample size
k = number of independent variables in the model
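For the example data (k = 1, n = 10), a sketch of the computation:

```python
import math

# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar

# SSE = sum of squared residuals
SSE = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

n, k = len(x), 1  # one independent variable
s_e = math.sqrt(SSE / (n - k - 1))
print(round(s_e, 2))  # 4.59
```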
The Standard Deviation of the Regression Slope
• The standard error of the regression slope coefficient (b1) is estimated by

s_b1 = s_ε / √[Σ(x − x̄)²] = s_ε / √[Σx² − (Σx)²/n]

where:
s_b1 = estimate of the standard error of the least squares slope
s_ε = √[SSE / (n − 2)] = sample standard error of the estimate
Inference about the Slope: t Test
• t test for a population slope
  – Is there a linear relationship between x and y?
• Null and alternative hypotheses
  – H0: β1 = 0 (no linear relationship)
  – H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic

t = b1 / s_b1,  d.f. = n − 2

where:
b1 = sample regression slope coefficient
s_b1 = estimator of the standard error of the slope
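A sketch tying the pieces together for the example data: s_b1 as defined on the previous slide, then t = b1/s_b1 with d.f. = 8. Note that the result matches the t ≈ 8.42 obtained earlier from the correlation test, as it must in simple linear regression.

```python
import math

# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((a - x_bar) ** 2 for a in x)
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / Sxx
b0 = y_bar - b1 * x_bar

SSE = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s_e = math.sqrt(SSE / (n - 2))   # sample standard error of the estimate
s_b1 = s_e / math.sqrt(Sxx)      # standard error of the slope

t = b1 / s_b1                    # compare to t_(0.025, 8) = 2.306
print(round(t, 2))  # 8.42
```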
SLR using the Data Analysis ToolPak (MS Excel)
Residual Analysis
• Purposes
  – Examine the linearity assumption
  – Examine for constant variance for all levels of x
  – Evaluate the normal distribution assumption
• Graphical Analysis of Residuals
  – Can plot residuals vs. x
  – Can create a histogram of residuals to check for normality
Residual Analysis for Linearity
[Two pairs of plots: a curved y-vs-x pattern whose residuals show a systematic curve (not linear), and a straight-line y-vs-x pattern whose residuals scatter randomly around zero (✓ linear)]
Residual Analysis for Constant Variance
[Two pairs of plots: residuals whose spread grows with x (non-constant variance), and residuals with roughly equal spread at all x (✓ constant variance)]


Using the model for prediction or estimation
• If the model fits the data well
• If the model residuals do not violate any of the assumptions
• Example: What is the expected current if voltage is 5?

Ans:
Ŷ = 5.503 + 6.734X = 5.503 + 6.734(5) = 39.173
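The prediction as a one-liner in Python, using the coefficients of the fitted example equation:

```python
# Fitted equation from the example: y_hat = 5.503 + 6.734 x
b0, b1 = 5.503, 6.734

voltage = 5
current_hat = b0 + b1 * voltage
print(round(current_hat, 3))  # 39.173
```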