
Linear Correlation and Regression

Instructor:
Dr. Md. Sohel Rana
Associate Professor of Statistics,
Department of Mathematical & Physical Sciences, EWU
Email: srana@ewubd.edu

Dr. Md. Sohel Rana, Stat, EWU

Correlation Analysis

• Correlation analysis is used to measure the strength of the association
(linear relationship) between two variables
– Correlation is only concerned with the strength of the relationship
– No causal effect is implied by correlation
– Correlation was first presented in Chapter 3

Positive (Weak and Strong) Correlation Patterns

Negative (Weak and Strong) Correlation Patterns

Sample Pearson’s correlation coefficient

$$
r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
$$

where

$$
S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \qquad
S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}, \qquad
S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}
$$

Pearson’s Correlation Coefficient


• “r” indicates…
– strength of relationship (strong, weak, or none)
– direction of relationship
• positive (direct) – variables move in same direction
• negative (inverse) – variables move in opposite directions
• r ranges in value from –1.0 to +1.0

–1.0: Strong Negative     0.0: No Relationship     +1.0: Strong Positive


Rule of thumb

Value of r          Strength of relationship
± 0.8 to ± 1        Very strong
± 0.7 to ± 0.8      Strong
± 0.5 to ± 0.7      Moderate
± 0.3 to ± 0.5      Weak
± 0 to ± 0.3        Very weak / No relationship
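
The rule of thumb can be written as a small lookup. This is only a sketch: the function name `strength_of_relationship` is hypothetical, and because the ranges in the table share their end points (0.8 appears in two rows), the code assumes each boundary value belongs to the stronger category.

```python
def strength_of_relationship(r: float) -> str:
    """Map a correlation coefficient to the rule-of-thumb label."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # the sign gives the direction, not the strength
    # Assumption: a boundary value is assigned to the stronger category.
    if size >= 0.8:
        return "Very strong"
    if size >= 0.7:
        return "Strong"
    if size >= 0.5:
        return "Moderate"
    if size >= 0.3:
        return "Weak"
    return "Very weak / No relationship"


print(strength_of_relationship(0.97))
print(strength_of_relationship(-0.45))
```

The sign of r is ignored here on purpose, since the table grades strength only; direction is read from the sign separately.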

Example
• A major airline wants to estimate the relationship
between the number of reservations and the
actual number of passengers who show up for
flight ABC.
• Information gathered over 12 randomly selected
days for flight ABC is given in the table below:

Airline Data
Day   No. of Reservations   No. of Passengers
1     250                   210
2     548                   405
3     156                   120
4     121                   89
5     416                   304
6     450                   320
7     462                   319
8     508                   410
9     307                   275
10    311                   289
11    265                   236
12    189                   170


Example

[Figure: scatter plot of No. of Passengers (vertical axis) against No. of Reservations (horizontal axis)]

Sxy = 154483.2; Sxx = 223736.9; Syy = 113564.2

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{154483.2}{\sqrt{(223736.9)(113564.2)}} = 0.97
$$

Comment: ?
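
The quantities in this example can be checked with a few lines of Python. The sketch below applies the computational formulas for Sxx, Syy and Sxy to the airline data; the names `reservations`, `passengers` and `corrected_sums` are illustrative choices, not part of the slides.

```python
from math import sqrt

# Airline data: 12 randomly selected days for flight ABC.
reservations = [250, 548, 156, 121, 416, 450, 462, 508, 307, 311, 265, 189]
passengers = [210, 405, 120, 89, 304, 320, 319, 410, 275, 289, 236, 170]


def corrected_sums(x, y):
    """Return (Sxx, Syy, Sxy) using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


s_xx, s_yy, s_xy = corrected_sums(reservations, passengers)
r = s_xy / sqrt(s_xx * s_yy)

print(f"Sxy = {s_xy:.1f}, Sxx = {s_xx:.1f}, Syy = {s_yy:.1f}")
print(f"r = {r:.2f}")
```

The results should agree with the values quoted on the slide up to rounding in the last digit.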

Linear Regression: Introduction


• We are often interested in trying to determine
the relationship between a pair of variables.
For instance,
o How does the amount of money spent in
advertising a new product relate to the first
month’s sales figures for that product? Or
o How does the house price relate to the
house space? Or
o How does the expenditure of a family relate
to the family income?


Introduction
• The variable whose value is determined first is
called the input or independent variable and
the other is called the response or
dependent variable.

Simple Linear Regression Model


• We consider a basic regression model where
there is one input (or independent) variable X
and one response (or dependent) variable Y
and the relationship is linear.
• The regression model can be stated as follows:
Y = β0 + β1X + ε
• The quantities β0 and β1 are parameters
(unknown population characteristics). The
variable ε, called the random error, is assumed
to be a random variable having a normal
distribution with mean 0 and variance σ².
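
One way to read the model is as a recipe for generating data: pick an input X, compute β0 + β1X, and add a normally distributed error. The sketch below simulates this. All numbers in it (β0 = 30, β1 = 0.7, σ = 20, and the inputs) are hypothetical values chosen for illustration, and `simulate_responses` is a made-up helper name.

```python
import random


def simulate_responses(x_values, beta0, beta1, sigma, seed=0):
    """Draw one response per input from Y = beta0 + beta1*X + error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Each error is an independent normal draw with mean 0 and std. dev. sigma.
    errors = [rng.gauss(0.0, sigma) for _ in x_values]
    responses = [beta0 + beta1 * x + e for x, e in zip(x_values, errors)]
    return responses, errors


# Hypothetical inputs and parameter values, chosen only for illustration.
x_values = [100 + (i % 400) for i in range(10_000)]
y_values, errors = simulate_responses(x_values, beta0=30.0, beta1=0.7, sigma=20.0)

mean_error = sum(errors) / len(errors)
print(f"average simulated error: {mean_error:.3f}")
```

With many simulated points the average error should be close to 0, as the model assumes.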

Simple Linear Regression Model


Definition:
• The relationship between the response variable
Y and the input variable X specified by the
equation
Y = β0 + β1X + ε
is called a simple linear regression.


Estimating Regression Parameters
• Suppose that the responses $y_i$ corresponding to
the input values $x_i$, i = 1, 2, …, n, are to be
observed and used to estimate the parameters
$\beta_0$ and $\beta_1$ in a simple linear regression model
$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n
$$
• If $\hat{\beta}_0$ and $\hat{\beta}_1$ are the respective estimators of $\beta_0$ and
$\beta_1$, then the estimator of the response corresponding
to the input value $x_i$ would be
$$
\hat{\beta}_0 + \hat{\beta}_1 x_i
$$
• Since the actual response is $y_i$, it follows that the
difference between the actual response and its
estimated value is given by
$$
\varepsilon_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)
$$

Estimating Regression Parameters


• Now, it is reasonable to choose our estimates of
$\beta_0$ and $\beta_1$ to be the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that make these
errors as small as possible.

• To do this, we choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the
value of the sum of the squares of the errors (SSE),
$$
\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2
$$

• The resulting estimators of $\beta_0$ and $\beta_1$ are called
least-squares estimators.
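
To make "as small as possible" concrete, the sketch below evaluates the SSE for many candidate lines on the airline data and keeps the one with the smallest value. The grid of candidate intercepts and slopes is a hypothetical choice, and a brute-force search like this is only an illustration of the minimization idea, not the way the estimators are normally computed.

```python
# Airline data: 12 randomly selected days for flight ABC.
reservations = [250, 548, 156, 121, 416, 450, 462, 508, 307, 311, 265, 189]
passengers = [210, 405, 120, 89, 304, 320, 319, 410, 275, 289, 236, 170]


def sse(b0, b1, x, y):
    """Sum of squared differences between y and the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical search grid: intercepts 0..60 in steps of 1,
# slopes 0.50..0.90 in steps of 0.01.
candidates = [(b0, k / 100) for b0 in range(0, 61) for k in range(50, 91)]
best_b0, best_b1 = min(
    candidates, key=lambda c: sse(c[0], c[1], reservations, passengers)
)

print("best grid point:", best_b0, best_b1)
print("SSE at best grid point:", round(sse(best_b0, best_b1, reservations, passengers), 1))
print("SSE for the line y = x:", round(sse(0, 1.0, reservations, passengers), 1))
```

A poorly chosen line such as y = x gives a far larger SSE than the best grid point, which is the behaviour the least-squares criterion is designed to penalize.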

Estimating Regression Parameters


Definition:
For given data pairs $(x_i, y_i)$, i = 1, 2, ..., n, the least-squares
estimators of $\beta_0$ and $\beta_1$ are the values $\hat{\beta}_0$ and $\hat{\beta}_1$
that make
$$
\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2
$$
as small as possible.

Estimating Regression Parameters
• The Least Square Method: Minimize SSE
$$
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i
$$

Estimating Regression Parameters

• It can be shown that the least-squares estimators of
$\beta_0$ and $\beta_1$, which we call $\hat{\beta}_0$ and $\hat{\beta}_1$, are given by
$$
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$
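
The slides state this result without proof. A short derivation, sketched here with standard calculus, sets the partial derivatives of the SSE with respect to each estimator to zero and solves the two resulting equations. All sums run over i = 1, …, n.

```latex
\begin{align*}
\frac{\partial}{\partial \hat{\beta}_0} \sum \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2
  &= -2 \sum \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right] = 0 \\
\frac{\partial}{\partial \hat{\beta}_1} \sum \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2
  &= -2 \sum x_i \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right] = 0 \\[4pt]
\text{From the first equation:} \quad
  \sum y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum x_i
  \;&\Rightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
\text{From the second equation:} \quad
  \sum x_i y_i &= \hat{\beta}_0 \sum x_i + \hat{\beta}_1 \sum x_i^2 \\
\text{Substituting } \hat{\beta}_0: \quad
  \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}
  &= \hat{\beta}_1 \left[ \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \right] \\
  \;\Rightarrow\; \hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}}
\end{align*}
```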

Example
Refer to the Airline Example
a) Which is the dependent variable (Y) and which
is the explanatory variable (X) in this problem?
b) Draw a scatter diagram with X and Y. Does the
relationship look linear?
c) Fit the regression line of Y on X. Or, fit a linear
regression model to these data with no. of
Passengers being the response variable and
no. of Reservations the explanatory variable.
d) From the output, identify and interpret the slope
and the intercept.

Example
Solution:
a) Since the no. of Passengers depends on the
no. of Reservations, the dependent variable
(Y) in this example is the no. of Passengers
and the explanatory (or independent) variable
(X) is the no. of Reservations.

Example
b) The scatter plot of no. of Reservations (X) and no. of
Passengers (Y) is shown below.
It is clear from the diagram that there is a positive
relationship between the variables. It is also
reasonably clear that there is a linear trend in the
data.

[Figure: scatter plot of No. of Passengers (vertical axis) against No. of Reservations (horizontal axis)]


x      y      x²        y²        xy
250    210    62500     44100     52500
548    405    300304    164025    221940
156    120    24336     14400     18720
121    89     14641     7921      10769
416    304    173056    92416     126464
450    320    202500    102400    144000
462    319    213444    101761    147378
508    410    258064    168100    208280
307    275    94249     75625     84425
311    289    96721     83521     89879
265    236    70225     55696     62540
189    170    35721     28900     32130
Σx = 3983    Σy = 3147    Σx² = 1545761    Σy² = 938865    Σxy = 1199025

Example - 1
• For this data:
Σx = 3983, Σy = 3147, Σx² = 1545761, Σy² = 938865, Σxy = 1199025

Sxy = 154483.2; Sxx = 223736.9; Syy = 113564.2

$$
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{154483.2}{223736.9} = 0.69
$$

and

$$
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 33.0721
$$

The fitted line is $\hat{y} = 33.0721 + 0.69x$
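
The fit can be reproduced directly from the data. The sketch below applies the least-squares formulas to the airline data; `fit_line` is a hypothetical helper name, and the 300-reservation day used for the prediction is an invented input for illustration.

```python
# Airline data: 12 randomly selected days for flight ABC.
reservations = [250, 548, 156, 121, 416, 450, 462, 508, 307, 311, 265, 189]
passengers = [210, 405, 120, 89, 304, 320, 319, 410, 275, 289, 236, 170]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(reservations, passengers)
print(f"fitted line: y = {intercept:.4f} + {slope:.4f} x")

# Hypothetical input: a day with 300 reservations.
predicted = intercept + slope * 300
print(f"predicted passengers for 300 reservations: {predicted:.1f}")
```

Note that the intercept 33.0721 is obtained with the unrounded slope; plugging the rounded value 0.69 into the intercept formula gives a slightly different number.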

Example
d) Interpretation of intercept (β0) and slope (β1):
(i) Slope β1:
• In general, this tells us how we expect Y to
change, on average, if X is increased by 1 unit.
• In this example, the estimate of β1 is 0.69. Thus, for every
additional reservation, the no. of
Passengers increases by 0.69 on average.
• Since the slope is positive, we expect Y to
increase as X increases.
• If the slope were negative, we would expect
Y to decrease as X increases.

Example

(ii) Intercept β0:
• This is the value of Y predicted for X = 0.
• In this example, the estimate of β0 is 33.0721, which means
that at zero Reservations, the no. of
Passengers is estimated to be 33.0721 (??).
• In most applications, the intercept has no
useful practical interpretation. It just serves
to fix the line.

Coefficient of determination

• The coefficient of determination, R², for a
simple regression is equal to the simple
correlation squared:
$$
R^2 = r_{xy}^2
$$
• R² greater than 0.80 is usually an indication of a good
fitted model
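
The identity can be checked numerically on the airline data. The sketch below computes R² from its usual definition, 1 − SSE/Syy (a definition the slides do not state, so treat it as an added assumption), and compares it with the squared correlation. Sxx, Syy and Sxy are computed here as sums of deviations from the means, which is equivalent to the computational formulas.

```python
# Airline data: 12 randomly selected days for flight ABC.
reservations = [250, 548, 156, 121, 416, 450, 462, 508, 307, 311, 265, 189]
passengers = [210, 405, 120, 89, 304, 320, 319, 410, 275, 289, 236, 170]

n = len(reservations)
x_bar = sum(reservations) / n
y_bar = sum(passengers) / n
s_xx = sum((x - x_bar) ** 2 for x in reservations)
s_yy = sum((y - y_bar) ** 2 for y in passengers)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(reservations, passengers))

# Least-squares line of passengers on reservations.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Sum of squared errors around the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(reservations, passengers))

r_squared = 1 - sse / s_yy  # share of the variation in y explained by the line
r = s_xy / (s_xx * s_yy) ** 0.5

print(f"R^2 = {r_squared:.3f}, r^2 = {r * r:.3f}")
```

For these data R² comes out at roughly 0.94, which is above the 0.80 guideline mentioned on the slide.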


The End
