
Correlation and Regression: Week 4

Dr Pooja Soni
University Business School
Panjab University, Chandigarh

Random Variable

A random variable is a variable with probabilities preassigned to its values. For example, consider the outcome of tossing a coin as a random variable X, where P(X = H) = 1/2 and P(X = T) = 1/2.
Similarly, X can be defined as the number that appears when a die is thrown; then P(X = i) = 1/6 for i = 1, 2, 3, 4, 5, 6.
This will be done in detail in probability.
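The die example above can be checked empirically: a minimal sketch that simulates throws of a fair die and confirms each face appears with relative frequency close to 1/6 (the sample size and seed are arbitrary choices for illustration):

```python
import random

# Simulate a fair die as a random variable X with P(X = i) = 1/6.
random.seed(42)
n = 60_000
counts = {i: 0 for i in range(1, 7)}
for _ in range(n):
    counts[random.randint(1, 6)] += 1

for face in range(1, 7):
    # Each relative frequency should be close to 1/6 ≈ 0.167.
    print(face, round(counts[face] / n, 3))
```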

Bivariate Random Variable

(X, Y) is called a bivariate random variable, e.g. when we observe (Height, Weight) on one individual. Here we have a joint probability P(X = x, Y = y), which is predefined.
Similarly, we have a multivariate random variable (X₁, X₂, ..., Xₙ).

Correlation

A statistical technique used to analyse the strength (magnitude) and direction of the relationship between two quantitative variables is called correlation analysis. For example, the relation between:
Family income and expenditure on luxury items;
Sales revenue and expenses incurred on advertising;
Frequency of smoking and lung damage;
Height and weight, etc.

Types of correlation

Positive and negative;
Linear and non-linear;
Simple, partial and multiple.

Positive and negative correlation

If in (X, Y), X and Y ↑ (increase) or ↓ (decrease) simultaneously, then it is called positive correlation;
If in (X, Y), X ↓ (decreases) while Y ↑ (increases), or vice versa, then it is called negative correlation.

Linear and Non-linear

A linear relationship is one in which variations in the values are proportional, of the type Y = aX + b.
If the relationship is Y = X² or Y = log(X), then it is called non-linear.
Simple, Partial and Multiple

Simple Correlation: the relationship between exactly two variables;
Partial Correlation: the relationship between two variables, keeping the other variables constant or fixed;
Multiple Correlation: the correlation when more than two variables are chosen.

Methods of measuring correlation

Scatter diagram;
Karl Pearson's coefficient of correlation;
Spearman's rank correlation.

Scatter diagram

A scatter diagram is a graphical representation of bivariate data, which helps in understanding the relationship between two variables. It is an X–Y graph in two dimensions.

Karl Pearson Correlation Coefficient

Consider bivariate data (X₁, Y₁), (X₂, Y₂), ..., (Xₙ, Yₙ), where both X and Y are quantitative in nature.
Karl Pearson's correlation measure is a quantitative method of calculating the strength and direction of the relationship between two variables. It is denoted by the symbol r and is given by:

r = Σ(X − X̄)(Y − Ȳ) / [√Σ(X − X̄)² · √Σ(Y − Ȳ)²]

Equivalently,

r = [Σ(X − X̄)(Y − Ȳ)/n] / [√(Σ(X − X̄)²/n) · √(Σ(Y − Ȳ)²/n)] = Cov(X, Y) / (SD(X) · SD(Y))
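The Cov(X, Y)/(SD(X)·SD(Y)) form of Pearson's r can be sketched directly in Python; the data values below are made up for illustration:

```python
import math

# Toy bivariate data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Cov(X, Y) and the two standard deviations, as in the formula.
cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
sd_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)

r = cov / (sd_x * sd_y)
print(round(r, 4))  # → 0.7746
```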

Properties of Correlation Coefficient

r has no unit; it is a pure number;
Karl Pearson's coefficient satisfies −1 ≤ r ≤ 1;
If r = 1, perfect positive linear correlation;
If r = −1, perfect negative linear correlation;
If r = 0, no linear correlation;
If 0 < r ≤ 0.5, weak positive correlation;
If 0.5 < r ≤ 0.7, moderate positive correlation;
If 0.7 < r < 1, strong positive correlation (similarly for negative values);
It is independent of change of scale and origin;
r² is called the coefficient of determination.

Probable error: The probable error of Pearson's correlation coefficient, r, indicates the extent to which its value depends on the conditions of random sampling.

PEᵣ = 0.6745 · (1 − r²)/√n

If r < PEᵣ, then the value of r is not significant, i.e., there is no relationship between the two variables.
If r > PEᵣ, then the value of r is significant, i.e., there is a significant relationship between the two variables.
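The probable-error check can be sketched as follows; the values of r and n are hypothetical, chosen only to illustrate the computation:

```python
import math

# Hypothetical sample correlation and sample size.
r = 0.8
n = 64

# Probable error of r: PE = 0.6745 (1 - r²) / √n.
pe = 0.6745 * (1 - r ** 2) / math.sqrt(n)
print(round(pe, 4))  # → 0.0304

# Significance rule from the slide: compare r against PE.
print("significant" if r > pe else "not significant")
```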

Coefficient of determination

The value of the coefficient of determination represents the proportion of the total variability in the dependent variable, y, that is explained by the independent variable x. Mathematically, the coefficient of determination is

r² = Explained variability in y / Total variability in y
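This ratio can be checked numerically: a sketch that fits a least-squares line to toy data (illustrative values only) and computes explained variability over total variability in y:

```python
# Toy dataset (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope a and intercept b for y on x.
a = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
b = y_bar - a * x_bar

# Explained variability: spread of fitted values around ȳ;
# total variability: spread of observed y around ȳ.
y_hat = [a * xi + b for xi in x]
explained = sum((yh - y_bar) ** 2 for yh in y_hat)
total = sum((yi - y_bar) ** 2 for yi in y)
print(round(explained / total, 4))  # → 0.6, i.e. r²
```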

Spearman’s Rank Correlation

This measure of correlation is applied in situations where a quantitative measure of qualitative factors such as judgement, brand personality, beauty, intelligence, honesty, efficiency, etc. cannot be fixed, but individual observations can be arranged in a definite order (rank). The rank correlation is given by

R = 1 − 6Σdᵢ² / (n(n² − 1)),

where dᵢ = RXᵢ − RYᵢ, and RXᵢ and RYᵢ are the ranks of the X's and Y's.
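The formula above can be sketched for the untied case; the two score lists are made-up sample data:

```python
# Toy paired scores (illustrative only).
x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
y = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]

def ranks(values):
    # Rank 1 for the smallest value; assumes no ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(x), ranks(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))

# R = 1 - 6 Σd² / (n(n² - 1))
R = 1 - 6 * d2 / (n * (n * n - 1))
print(round(R, 4))
```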
Assumptions

Spearman's correlation coefficient is to be calculated:
when data are given in the form of ranks;
when the distribution of the data is not normal (skewed or not bell-shaped).
It has the same properties as the Karl Pearson correlation coefficient.

In case of ties

If more than one observation takes the same value, those observations can be ranked in the following ways:
assign equal ranks to all tied observations;
assign each tied observation the average of the ranks those observations would otherwise have received.
The formula in the tied case becomes:

ρ = 1 − 6{Σdᵢ² + (1/12)(m₁³ − m₁) + (1/12)(m₂³ − m₂) + ...} / (n(n² − 1)),

where m₁ is the number of observations tied in the first group, m₂ the number tied in the second group, and so on.
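The tie-corrected formula can be sketched as follows, using the average-rank method and one made-up tie group:

```python
from collections import Counter

# Toy data with one tie group of size 2 in x (illustrative only).
x = [10, 20, 20, 30, 40]
y = [1, 3, 2, 4, 5]

def avg_ranks(values):
    # Average-rank method: tied observations share the mean of their ranks.
    order = sorted(values)
    return [sum(i + 1 for i, v in enumerate(order) if v == val) /
            order.count(val) for val in values]

rx, ry = avg_ranks(x), avg_ranks(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))

# Correction term: (m³ - m)/12 summed over tie groups in each variable.
def tie_term(values):
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

rho = 1 - 6 * (d2 + tie_term(x) + tie_term(y)) / (n * (n * n - 1))
print(round(rho, 4))  # → 0.95
```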
Regression

The statistical technique that expresses the functional relationship between two or more variables in the form of an equation, in order to estimate the value of one variable based on the given value of another, is called regression analysis.

The variable to be predicted is called the dependent, explained, or response variable;
The variable used for prediction is called the independent, predictor, or explanatory variable.

Types of regression

Simple Regression: when a regression model has only two variables (one independent (X) and one dependent (Y)), it is called simple regression: y = f(x) + error;
Multiple Regression: if a model has n independent variables (X₁, X₂, ..., Xₙ) and one dependent variable (Y), it is called a multiple regression model: Y = f(X₁, X₂, ..., Xₙ) + error.

Linear Regression: if the change in the value of the dependent variable is directly proportional to a unit change in the value of the independent variable, the model is called linear: y = ax + b + error, where a is the slope and b is the intercept;
Non-linear Regression: when the functional form of the model is not linear, e.g. Y = β₀ + β₁X³ + β₂sin(X) + error, it is called a non-linear model.

Simple Linear Regression

If we consider two variables with a linear relationship, i.e. Y = aX + b + error (or Y = β₀ + β₁X + error), the model is called simple linear regression. For bivariate data (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), the linear relationship between X and Y is

yᵢ = axᵢ + b + eᵢ,  i = 1, 2, ..., n,

where a and b are called regression coefficients and eᵢ is called the residual or error.

The least square method

The aim is to estimate the regression coefficients a and b so that the model gives the best fit to the data. To find the best fit we need to minimise the error, and for that we use the least squares principle.
The error term is eᵢ = yᵢ − axᵢ − b, and the sum of squared errors is

Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)².

Minimising this sum of squared errors is called the least squares principle.
Let

E = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)²   (1)

Setting the partial derivatives to zero:

∂E/∂b = −2 Σᵢ₌₁ⁿ (yᵢ − axᵢ − b) = 0   (2)

∂E/∂a = −2 Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)xᵢ = 0   (3)

From these two equations we can solve for a and b and obtain the best-fit model.
Simplifying equations (2) and (3) gives

Σᵢ₌₁ⁿ yᵢ = nb + a Σᵢ₌₁ⁿ xᵢ   (4)

Σᵢ₌₁ⁿ yᵢxᵢ = b Σᵢ₌₁ⁿ xᵢ + a Σᵢ₌₁ⁿ xᵢ²   (5)

These equations are called the normal equations.
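The two normal equations can be solved by elimination, which can be sketched directly (the data values are made up for illustration):

```python
# Toy dataset (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

# Eliminating b from (4) Σy = nb + aΣx and (5) Σxy = bΣx + aΣx²
# gives the familiar closed forms for the slope a and intercept b.
a = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx)
b = (Sy - a * Sx) / n
print(round(a, 4), round(b, 4))  # → 0.6 2.2
```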

From the fitted model one can predict the values ŷᵢ = b + axᵢ, with error eᵢ = yᵢ − ŷᵢ.

Assumptions of Simple Linear Regression

The model is linear;
The error terms have constant variance (homoscedasticity);

The error terms are independent;
The error terms are normally distributed.

Regression Lines

There are two regression lines: Y on X and X on Y. The line of Y on X is given by:

y − ȳ = b_yx (x − x̄),

where b_yx = r · SD(y)/SD(x) is the regression coefficient of y on x. The line of X on Y is given by:

x − x̄ = b_xy (y − ȳ),

where b_xy = r · SD(x)/SD(y) is the regression coefficient of x on y.
Clearly, b_yx · b_xy = r².
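The two regression coefficients and the identity b_yx · b_xy = r² can be checked numerically; the data values are made up for illustration:

```python
import math

# Toy bivariate data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
var_x = sum((xi - x_bar) ** 2 for xi in x) / n
var_y = sum((yi - y_bar) ** 2 for yi in y) / n
r = cov / math.sqrt(var_x * var_y)

byx = r * math.sqrt(var_y) / math.sqrt(var_x)  # coefficient of y on x
bxy = r * math.sqrt(var_x) / math.sqrt(var_y)  # coefficient of x on y

# The product of the two coefficients equals r².
print(round(byx * bxy, 4), round(r ** 2, 4))  # → 0.6 0.6
```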
Difference between correlation and regression

Regression helps in predicting one variable with the help of other variables, while correlation measures only strength and direction;
Correlation does not establish a cause-and-effect relationship;
In correlation there is no concept of dependent and independent variables.

For a multiple regression model with two predictors,

yᵢ = a + bxᵢ + czᵢ + eᵢ,

the normal equations are:

Σyᵢ = na + bΣxᵢ + cΣzᵢ
Σyᵢxᵢ = aΣxᵢ + bΣxᵢ² + cΣxᵢzᵢ
Σyᵢzᵢ = aΣzᵢ + bΣxᵢzᵢ + cΣzᵢ²

An illustrative fitted model of the simple linear form y = ax + b + error: bonus = 2.1 · YearEx + 0.9 + error.
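The three normal equations above form a 3×3 linear system, which can be sketched and solved with NumPy (hypothetical data; numpy assumed available):

```python
import numpy as np

# Hypothetical data for y = a + b·x + c·z + error (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([4.1, 5.0, 8.9, 9.1, 12.2])
n = len(y)

# Coefficient matrix and right-hand side of the normal equations.
A = np.array([
    [n,       x.sum(),       z.sum()],
    [x.sum(), (x * x).sum(), (x * z).sum()],
    [z.sum(), (x * z).sum(), (z * z).sum()],
])
rhs = np.array([y.sum(), (y * x).sum(), (y * z).sum()])

# Solve for the intercept a and the two slopes b, c.
a, b, c = np.linalg.solve(A, rhs)
print(round(a, 3), round(b, 3), round(c, 3))
```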
