
Chapter 5

Correlation and Linear Regression Analyses
Correlation Analysis
• A statistical technique used to determine the strength or degree of association (relationship) between two or more variables
• Considers the joint variation of two measurements, neither of which is controlled by the researcher or experimenter
• Does not imply a cause-and-effect relationship
Correlation Analysis
• No variable is labeled as independent or dependent
• Determines the existence of a linear association among variables
• Measures the strength of their linear relationship
Regression Analysis
• A statistical technique used to study the functional relationship between two or more variables
• Describes the effect of one or more variables (independent variables) on a single variable (dependent variable) by expressing the latter as a function of the former
Regression Analysis
• Establish the possible causal relationship

• Estimate the value of a variable given the


value of other variables

• Explain the variation of one variable by other


variables
Scatter diagram
• For two variables X and Y, it gives a rough idea of the relationship between X and Y

Supply Voltage (X)   Current without Electronics (Y)
0.66                  7.32
1.32                 12.22
1.98                 16.34
2.64                 23.66
3.30                 28.06
3.96                 33.39
4.62                 34.12
3.28                 39.21
5.94                 44.21
6.60                 47.48
Scatter diagram
[Scatter plot of Current without Electronics (Y) against Supply Voltage (X), showing an increasing linear trend]
Pearson’s correlation coefficient
• For two variables X and Y, Pearson’s correlation coefficient in the population is given by

ρ = σ_XY / (σ_X σ_Y) = cov(X, Y) / √[var(X) · var(Y)],   −1 ≤ ρ ≤ 1
Pearson’s correlation coefficient
The Pearson correlation coefficient computed from a sample is

r = Σ(x − x̄)(y − ȳ) / √{[Σ(x − x̄)²][Σ(y − ȳ)²]}

or the algebraic equivalent:

r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable
Pearson’s correlation coefficient
• The coefficient of correlation is used to measure the strength of association between two variables.
• The coefficient values range between −1 and +1.
  – ρ = −1: perfect negative linear association (as X increases, Y decreases)
  – ρ = +1: perfect positive linear association (X and Y increase together)
  – ρ = 0: there is no linear pattern
• The coefficient can be used to test for a linear relationship between two variables.
Pearson’s correlation coefficient
[Six scatter plots illustrating different strengths of linear association: r = −1, r = −.6, and r = 0 (top row); r = +1, r = +.3, and r = 0 (bottom row)]
Guide in interpreting ρ (or r)

|ρ|                Interpretation
0                  No linear association
0 < |ρ| < 0.2      Very weak linear association
0.2 ≤ |ρ| < 0.4    Weak linear association
0.4 ≤ |ρ| < 0.6    Moderate linear association
0.6 ≤ |ρ| < 0.8    Strong linear association
0.8 ≤ |ρ| < 1.0    Very strong linear association
1.0                Perfect linear association

(The sign of ρ gives the direction of the association.)
Test of hypothesis about ρ
Ho: ρ = 0 (There is no linear association between X and Y)

Ha: ρ ≠ 0 (There is a linear association between X and Y)
Ha: ρ > 0 (There is a positive linear association between X and Y)
Ha: ρ < 0 (There is a negative linear association between X and Y)
Test of hypothesis about ρ
Test statistic: t test

t_c = r / √[(1 − r²)/(n − 2)]

Decision rule: Reject Ho if _____; otherwise, fail to reject Ho.

i.   |t_c| ≥ t_(α/2, n−2)   (for Ha: ρ ≠ 0)
ii.  t_c ≥ t_(α, n−2)       (for Ha: ρ > 0)
iii. t_c ≤ −t_(α, n−2)      (for Ha: ρ < 0)
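As a sketch in Python, the two-sided version of this test can be carried out directly; the values r = 0.9479 and n = 10 are taken from the worked example later in the chapter, and 2.306 is the tabulated t₀.₀₂₅,₈ critical value quoted there.

```python
import math

# Sample correlation and sample size from the chapter's worked example
r, n = 0.9479, 10

# Test statistic: t_c = r / sqrt((1 - r^2) / (n - 2))
t_c = r / math.sqrt((1 - r**2) / (n - 2))

# Two-sided decision rule at alpha = 0.05: reject Ho if |t_c| >= t_(0.025, 8)
t_crit = 2.306  # from a t table, df = n - 2 = 8
reject_Ho = abs(t_c) >= t_crit

print(round(t_c, 2), reject_Ho)  # 8.42 True
```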
Calculation Example

Supply Voltage (X)  Current without Electronics (Y)  X²        Y²         XY
0.66                 7.32                             0.4356     53.5824     4.8312
1.32                12.22                             1.7424    149.328     16.1304
1.98                16.34                             3.9204    266.996     32.3532
2.64                23.66                             6.9696    559.796     62.4624
3.30                28.06                            10.89      787.364     92.598
3.96                33.39                            15.6816   1114.89     132.224
4.62                34.12                            21.3444   1164.17     157.634
3.28                39.21                            10.7584   1537.42     128.609
5.94                44.21                            35.2836   1954.52     262.607
6.60                47.48                            43.56     2254.35     313.368
ΣX = 34.3           ΣY = 286.01                      ΣX² = 150.6  ΣY² = 9842  ΣXY = 1203
Calculation Example (continued)

r = [nΣxy − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}
  = [10(1203) − (34.3)(286.01)] / √{[10(150.6) − (34.3)²][10(9842) − (286.01)²]}
  = 0.9479

r = 0.9479 → very strong positive linear association between Voltage and Current
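A quick cross-check in Python, applying the algebraic form of r to the raw (X, Y) data from the table:

```python
import math

# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

# r = [n*Sum(xy) - Sum(x)Sum(y)] / sqrt([n*Sum(x^2) - (Sum x)^2][n*Sum(y^2) - (Sum y)^2])
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 4))  # 0.9479
```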
Test of hypothesis about ρ
Ho: ρ = 0
Ha: ρ ≠ 0
α = 0.05
Test statistic: t test

t_c = r / √[(1 − r²)/(n − 2)] = 0.9479 / √[(1 − 0.9479²)/(10 − 2)] = 8.42

Decision rule: Reject Ho if |t_c| ≥ t_(0.025, 8) = 2.306; otherwise, do not reject Ho.
Decision: Reject Ho since 8.42 > 2.306.
Conclusion: At α = 0.05, there is sufficient evidence to conclude that Voltage and Current are linearly correlated; since r > 0, the association is positive.
Introduction to Regression Analysis
Regression analysis is used to:
– Predict the value of a dependent variable based
on the value of at least one independent variable
– Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to
explain
Independent variable: the variable used to
explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, x
• Relationship between x and y is described by a linear function
• Changes in y are assumed to be caused by changes in x
Population Linear Regression Model

y = β0 + β1x + ε

where:
y = dependent variable
x = independent variable
β0 = population y-intercept
β1 = population slope coefficient
ε = random error term (residual)

β0 + β1x is the linear component; ε is the random error component.
Linear Regression Assumptions
• Error values (ε) are statistically independent
• Error values are normally distributed for any given value of x
• The probability distribution of the errors has constant variance
• The underlying relationship between the x variable and the y variable is linear
Population Linear Regression (continued)
[Graph of y = β0 + β1x + ε: the intercept β0, the slope β1, and, for a given xi, the observed value of y, the predicted value on the line, and the random error εi between them]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

ŷi = b0 + b1x

where:
ŷi = estimated (or predicted) y value
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
x = independent variable

The individual random error terms ei have a mean of zero.
Least Squares Criterion
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

Σe² = Σ(y − ŷ)² = Σ(y − (b0 + b1x))²
The Least Squares Equation
• The formulas for b1 and b0 are:

b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

or the algebraic equivalent:

b1 = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]

and

b0 = ȳ − b1x̄
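A minimal Python sketch of the deviation-form formulas, reusing the chapter's voltage/current data:

```python
# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# b1 = Sum((x - x_bar)(y - y_bar)) / Sum((x - x_bar)^2)
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))

# b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(round(b1, 3), round(b0, 3))  # 6.734 5.503
```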
Interpretation of the Slope and the Intercept
• b0 is the estimated average value of y when the value of x is zero
• b1 is the estimated change in the average value of y as a result of a one-unit change in x
Finding the Least Squares Equation
• The coefficients b0 and b1 will usually be found using computer software, such as Excel or Minitab
• Other regression measures will also be computed as part of computer-based regression analysis
Example

Supply Voltage (X)  Current without Electronics (Y)  X²        Y²         XY
0.66                 7.32                             0.4356     53.5824     4.8312
1.32                12.22                             1.7424    149.328     16.1304
1.98                16.34                             3.9204    266.996     32.3532
2.64                23.66                             6.9696    559.796     62.4624
3.30                28.06                            10.89      787.364     92.598
3.96                33.39                            15.6816   1114.89     132.224
4.62                34.12                            21.3444   1164.17     157.634
3.28                39.21                            10.7584   1537.42     128.609
5.94                44.21                            35.2836   1954.52     262.607
6.60                47.48                            43.56     2254.35     313.368
ΣX = 34.3           ΣY = 286.01                      ΣX² = 150.6  ΣY² = 9842  ΣXY = 1203
Example

b1 = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] = [10(1203) − (34.3)(286.01)] / [10(150.6) − (34.3)²] = 6.734

b0 = ȳ − b1x̄ = 28.601 − 6.734(3.43) = 5.503

ŷi = 5.503 + 6.734xi
Example
[Scatter plot of the data with the fitted line ŷi = 5.503 + 6.734xi]
Interpretation of the Intercept, b0

Ŷ = 5.503 + 6.734X

• b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values)
• b1 is the estimated change in the value of Y per unit change in X
Least Squares Regression Properties
• The sum of the residuals from the least squares regression line is 0: Σ(y − ŷ) = 0
• The sum of the squared residuals, Σ(y − ŷ)², is a minimum
• The simple regression line always passes through the mean of the y variable and the mean of the x variable
• The least squares coefficients are unbiased estimates of β0 and β1
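The first and third properties are easy to verify numerically for the example's fitted line (a sketch; coefficients are recomputed from the data rather than rounded):

```python
# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar

# Property 1: residuals sum to (numerically) zero
residual_sum = sum(b - (b0 + b1 * a) for a, b in zip(x, y))

# Property 3: the line passes through (x_bar, y_bar)
on_mean = b0 + b1 * x_bar

print(abs(residual_sum) < 1e-9, abs(on_mean - y_bar) < 1e-9)  # True True
```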
Explained and Unexplained Variation

• Total variation is made up of two parts:

TSS = SSR + SSE

TSS = Σ(y − ȳ)²   (total sum of squares)
SSR = Σ(ŷ − ȳ)²   (sum of squares regression)
SSE = Σ(y − ŷ)²   (sum of squares error)

where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value
Explained and Unexplained Variation
(continued)

• TSS = total sum of squares


– Measures the variation of the yi values around their mean ȳ
• SSE = error sum of squares
– Variation attributed to factors other than the
relationship between x and y
• SSR = regression sum of squares
– Explained variation attributed to the linear
relationship between x and y
Explained and Unexplained Variation (continued)
[Graph at a point xi decomposing the total deviation of yi from ȳ into an explained part and an unexplained part: TSS = Σ(yi − ȳ)², SSR = Σ(ŷi − ȳ)², SSE = Σ(yi − ŷi)²]
Coefficient of Determination, R²
• The coefficient of determination is the proportion of the total variation in the dependent variable that is explained by variation in the independent variable
• The coefficient of determination is also called R-squared and is denoted as R²

R² = SSR / TSS,  where 0 ≤ R² ≤ 1
Coefficient of Determination, R² (continued)

R² = SSR / TSS = (sum of squares explained by regression) / (total sum of squares)

Note: In the single independent variable case, the coefficient of determination is

R² = r²

where:
R² = coefficient of determination
r = simple correlation coefficient
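For the chapter's example this identity can be checked directly; a sketch computing R² as SSR/TSS and comparing it with r²:

```python
import math

# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

TSS = sum((b - y_bar) ** 2 for b in y)
SSR = sum((h - y_bar) ** 2 for h in y_hat)
R2 = SSR / TSS

# Pearson's r for the same data
r = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
     / math.sqrt(sum((a - x_bar) ** 2 for a in x)
                 * sum((b - y_bar) ** 2 for b in y)))

print(round(R2, 4), round(r**2, 4))  # 0.8986 0.8986
```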
Assessing the overall goodness-of-fit of the model
• ANOVA F test: if the F test of the ANOVA is significant at level α, then the estimated equation fits the data well
• R²: the higher the value of R², the better the fit of the model
• Adjusted R²: more reliable than R²; use this if there is more than one predictor in the model
Standard Error of Estimate
• The standard deviation of the variation of observations around the regression line is estimated by

s_ε = √[SSE / (n − k − 1)]

where:
SSE = sum of squares error
n = sample size
k = number of independent variables in the model
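For the example data (k = 1, n = 10), a sketch of the computation:

```python
import math

# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar

# SSE = sum of squared residuals
SSE = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

n, k = len(x), 1  # one independent variable
s_e = math.sqrt(SSE / (n - k - 1))
print(round(s_e, 2))  # 4.59
```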
The Standard Deviation of the Regression Slope
• The standard error of the regression slope coefficient (b1) is estimated by

s_b1 = s_ε / √[Σ(x − x̄)²] = s_ε / √[Σx² − (Σx)²/n]

where:
s_b1 = estimate of the standard error of the least squares slope
s_ε = √[SSE / (n − 2)] = sample standard error of the estimate
Inference about the Slope: t Test
• t test for a population slope
  – Is there a linear relationship between x and y?
• Null and alternative hypotheses
  – H0: β1 = 0 (no linear relationship)
  – H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic

t = b1 / s_b1,  d.f. = n − 2

where:
b1 = sample regression slope coefficient
s_b1 = estimator of the standard error of the slope
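A sketch tying the pieces together for the example data: s_b1 as defined on the previous slide, then t = b1/s_b1 with d.f. = 8. Note that the result matches the t ≈ 8.42 obtained earlier from the correlation test, as it must in simple linear regression.

```python
import math

# Supply Voltage (X) and Current without Electronics (Y) from the example
x = [0.66, 1.32, 1.98, 2.64, 3.30, 3.96, 4.62, 3.28, 5.94, 6.60]
y = [7.32, 12.22, 16.34, 23.66, 28.06, 33.39, 34.12, 39.21, 44.21, 47.48]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((a - x_bar) ** 2 for a in x)
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / Sxx
b0 = y_bar - b1 * x_bar

SSE = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s_e = math.sqrt(SSE / (n - 2))   # sample standard error of the estimate
s_b1 = s_e / math.sqrt(Sxx)      # standard error of the slope

t = b1 / s_b1                    # compare to t_(0.025, 8) = 2.306
print(round(t, 2))  # 8.42
```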
SLR using the Data Analysis ToolPak (MS Excel)
Residual Analysis
• Purposes
  – Examine the linearity assumption
  – Examine for constant variance for all levels of x
  – Evaluate the normal distribution assumption
• Graphical Analysis of Residuals
  – Can plot residuals vs. x
  – Can create a histogram of residuals to check for normality
Residual Analysis for Linearity
[Two pairs of plots: a curved y-vs-x pattern whose residuals show a systematic curve (not linear), and a straight-line y-vs-x pattern whose residuals scatter randomly around zero (✓ linear)]
Residual Analysis for Constant Variance
[Two pairs of plots: residuals whose spread grows with x (non-constant variance), and residuals with roughly equal spread at all x (✓ constant variance)]


Using the model for prediction or estimation
• If the model fits the data well
• If the model residuals do not violate any of the assumptions
• Example: What is the expected current if voltage is 5?

Ans:
Ŷ = 5.503 + 6.734X = 5.503 + 6.734(5) = 39.173
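The prediction as a one-liner in Python, using the coefficients of the fitted example equation:

```python
# Fitted equation from the example: y_hat = 5.503 + 6.734 x
b0, b1 = 5.503, 6.734

voltage = 5
current_hat = b0 + b1 * voltage
print(round(current_hat, 3))  # 39.173
```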