
Correlation and Regression: Week 4

Dr Pooja Soni
University Business School
Panjab University, Chandigarh

Random Variable

A random variable is a variable with probabilities preassigned to its values. For example, consider the outcome of tossing a coin as a random variable X, where P(X = H) = 1/2 and P(X = T) = 1/2.
Similarly, X can be defined as the number that appears when a die is thrown; then P(X = i) = 1/6 for i = 1, 2, 3, 4, 5, 6.
This will be done in detail in probability.
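The die example above can be checked empirically: a minimal sketch that simulates throws of a fair die and confirms each face appears with relative frequency close to 1/6 (the sample size and seed are arbitrary choices for illustration):

```python
import random

# Simulate a fair die as a random variable X with P(X = i) = 1/6.
random.seed(42)
n = 60_000
counts = {i: 0 for i in range(1, 7)}
for _ in range(n):
    counts[random.randint(1, 6)] += 1

for face in range(1, 7):
    # Each relative frequency should be close to 1/6 ≈ 0.167.
    print(face, round(counts[face] / n, 3))
```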

Bivariate Random Variable

(X, Y) is called a bivariate random variable, e.g. when we observe (Height, Weight) on one individual. Here we have a joint probability P(X = x, Y = y), which is predefined.
Similarly, we have a multivariate random variable (X₁, X₂, ..., Xₙ).

Correlation

A statistical technique used to analyse the strength (magnitude) and direction of the relationship between two quantitative variables is called correlation analysis. For example, the relation between:
Family income and expenditure on luxury items;
Sales revenue and expenses incurred on advertising;
Frequency of smoking and lung damage;
Height and weight, etc.

Types of correlation

Positive and negative;
Linear and non-linear;
Simple, partial and multiple.

Positive and negative correlation

If in (X, Y), X and Y ↑ (increase) or ↓ (decrease) simultaneously, then it is called positive correlation;
If in (X, Y), X ↓ (decreases) while Y ↑ (increases), or vice versa, then it is called negative correlation.

Linear and Non-linear

A linear relationship is one in which variations in the values are proportional, of the type Y = aX + b.
If the relationship is Y = X² or Y = log(X), then it is called non-linear.
Simple, Partial and Multiple

Simple Correlation: the relationship between exactly two variables;
Partial Correlation: the relationship between two variables, keeping the other variables constant or fixed;
Multiple Correlation: the correlation when more than two variables are chosen.

Methods of measuring correlation

Scatter diagram;
Karl Pearson's coefficient of correlation;
Spearman's rank correlation.

Scatter diagram

A scatter diagram is a graphical representation of bivariate data, which helps in understanding the relationship between two variables. It is an X–Y graph in two dimensions.

Karl Pearson Correlation Coefficient

Consider bivariate data (X₁, Y₁), (X₂, Y₂), ..., (Xₙ, Yₙ), where both X and Y are quantitative in nature.
Karl Pearson's correlation measure is a quantitative method of calculating the strength and direction of the relationship between two variables. It is denoted by the symbol r and is given by:

r = Σ(X − X̄)(Y − Ȳ) / [√Σ(X − X̄)² · √Σ(Y − Ȳ)²]

Equivalently,

r = [Σ(X − X̄)(Y − Ȳ)/n] / [√(Σ(X − X̄)²/n) · √(Σ(Y − Ȳ)²/n)] = Cov(X, Y) / (SD(X) · SD(Y))
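The Cov(X, Y)/(SD(X)·SD(Y)) form of Pearson's r can be sketched directly in Python; the data values below are made up for illustration:

```python
import math

# Toy bivariate data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Cov(X, Y) and the two standard deviations, as in the formula.
cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
sd_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)

r = cov / (sd_x * sd_y)
print(round(r, 4))  # → 0.7746
```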

Properties of Correlation Coefficient

r has no unit; it is a pure number;
Karl Pearson's coefficient satisfies −1 ≤ r ≤ 1;
If r = 1, perfect positive linear correlation;
If r = −1, perfect negative linear correlation;
If r = 0, no linear correlation;
If 0 < r ≤ 0.5, weak positive correlation;
If 0.5 < r ≤ 0.7, moderate positive correlation;
If 0.7 < r < 1, strong positive correlation (similarly for negative values);
It is independent of change of scale and origin;
r² is called the coefficient of determination.

Probable error: The probable error of Pearson's correlation coefficient, r, indicates the extent to which its value depends on the conditions of random sampling.

PEᵣ = 0.6745 · (1 − r²)/√n

If r < PEᵣ, then the value of r is not significant, i.e., there is no relationship between the two variables.
If r > PEᵣ, then the value of r is significant, i.e., there is a significant relationship between the two variables.
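The probable-error check can be sketched as follows; the values of r and n are hypothetical, chosen only to illustrate the computation:

```python
import math

# Hypothetical sample correlation and sample size.
r = 0.8
n = 64

# Probable error of r: PE = 0.6745 (1 - r²) / √n.
pe = 0.6745 * (1 - r ** 2) / math.sqrt(n)
print(round(pe, 4))  # → 0.0304

# Significance rule from the slide: compare r against PE.
print("significant" if r > pe else "not significant")
```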

Coefficient of determination

The value of the coefficient of determination represents the proportion of the total variability in the dependent variable, y, that is explained by the independent variable x. Mathematically, the coefficient of determination is

r² = Explained variability in y / Total variability in y
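This ratio can be checked numerically: a sketch that fits a least-squares line to toy data (illustrative values only) and computes explained variability over total variability in y:

```python
# Toy dataset (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope a and intercept b for y on x.
a = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
b = y_bar - a * x_bar

# Explained variability: spread of fitted values around ȳ;
# total variability: spread of observed y around ȳ.
y_hat = [a * xi + b for xi in x]
explained = sum((yh - y_bar) ** 2 for yh in y_hat)
total = sum((yi - y_bar) ** 2 for yi in y)
print(round(explained / total, 4))  # → 0.6, i.e. r²
```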

Spearman’s Rank Correlation

This measure of correlation is applied in situations where a quantitative measure of qualitative factors such as judgement, brand personality, beauty, intelligence, honesty, efficiency, etc. cannot be fixed, but individual observations can be arranged in a definite order (rank). The rank correlation is given by

R = 1 − 6Σdᵢ² / (n(n² − 1)),

where dᵢ = RXᵢ − RYᵢ, and RXᵢ and RYᵢ are the ranks of the X's and Y's.
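The formula above can be sketched for the untied case; the two score lists are made-up sample data:

```python
# Toy paired scores (illustrative only).
x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
y = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]

def ranks(values):
    # Rank 1 for the smallest value; assumes no ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(x), ranks(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))

# R = 1 - 6 Σd² / (n(n² - 1))
R = 1 - 6 * d2 / (n * (n * n - 1))
print(round(R, 4))
```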
Assumptions

Spearman's correlation coefficient is to be calculated:
when data are given in the form of ranks;
when the distribution of the data is not normal (skewed or not bell-shaped).
It has the same properties as the Karl Pearson correlation coefficient.

In case of ties

If more than one observation takes the same value, those observations can be ranked in the following ways:
assign equal ranks to all tied observations;
assign each tied observation the average of the ranks those observations would otherwise have received.
The formula in the tied case becomes:

ρ = 1 − 6{Σdᵢ² + (1/12)(m₁³ − m₁) + (1/12)(m₂³ − m₂) + ...} / (n(n² − 1)),

where m₁ is the number of observations tied in the first group, m₂ the number tied in the second group, and so on.
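The tie-corrected formula can be sketched as follows, using the average-rank method and one made-up tie group:

```python
from collections import Counter

# Toy data with one tie group of size 2 in x (illustrative only).
x = [10, 20, 20, 30, 40]
y = [1, 3, 2, 4, 5]

def avg_ranks(values):
    # Average-rank method: tied observations share the mean of their ranks.
    order = sorted(values)
    return [sum(i + 1 for i, v in enumerate(order) if v == val) /
            order.count(val) for val in values]

rx, ry = avg_ranks(x), avg_ranks(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))

# Correction term: (m³ - m)/12 summed over tie groups in each variable.
def tie_term(values):
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

rho = 1 - 6 * (d2 + tie_term(x) + tie_term(y)) / (n * (n * n - 1))
print(round(rho, 4))  # → 0.95
```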
Regression

The statistical technique that expresses the functional relationship between two or more variables in the form of an equation, in order to estimate the value of one variable based on the given value of another, is called regression analysis.

The variable to be predicted is called the dependent, explained, or response variable;
The variable used for prediction is called the independent, predictor, or explanatory variable.

Types of regression

Simple Regression: when a regression model has only two variables (one independent (X) and one dependent (Y)), it is called simple regression: y = f(x) + error;
Multiple Regression: if a model has n independent variables (X₁, X₂, ..., Xₙ) and one dependent variable (Y), it is called a multiple regression model: Y = f(X₁, X₂, ..., Xₙ) + error.

Linear Regression: if the change in the value of the dependent variable is directly proportional to a unit change in the value of the independent variable, the model is called linear: y = ax + b + error, where a is the slope and b is the intercept;
Non-linear Regression: when the functional form of the model is not linear, e.g. Y = β₀ + β₁X³ + β₂sin(X) + error, it is called a non-linear model.

Simple Linear Regression

If we consider two variables with a linear relationship, i.e. Y = aX + b + error (or Y = β₀ + β₁X + error), the model is called simple linear regression. For bivariate data (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), the linear relationship between X and Y is

yᵢ = axᵢ + b + eᵢ,  i = 1, 2, ..., n,

where a and b are called regression coefficients and eᵢ is called the residual or error.

The least square method

The aim is to estimate the regression coefficients a and b so that the model gives the best fit to the data. To find the best fit we need to minimise the error, and for that we use the least squares principle.
The error term is eᵢ = yᵢ − axᵢ − b, and the sum of squared errors is

Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)².

Minimising this sum of squared errors is called the least squares principle.
Let

E = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)²   (1)

Setting the partial derivatives to zero:

∂E/∂b = −2 Σᵢ₌₁ⁿ (yᵢ − axᵢ − b) = 0   (2)

∂E/∂a = −2 Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)xᵢ = 0   (3)

From these two equations we can solve for a and b and obtain the best-fit model.
Simplifying equations (2) and (3) gives

Σᵢ₌₁ⁿ yᵢ = nb + a Σᵢ₌₁ⁿ xᵢ   (4)

Σᵢ₌₁ⁿ yᵢxᵢ = b Σᵢ₌₁ⁿ xᵢ + a Σᵢ₌₁ⁿ xᵢ²   (5)

These equations are called the normal equations.
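The two normal equations can be solved by elimination, which can be sketched directly (the data values are made up for illustration):

```python
# Toy dataset (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

# Eliminating b from (4) Σy = nb + aΣx and (5) Σxy = bΣx + aΣx²
# gives the familiar closed forms for the slope a and intercept b.
a = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx)
b = (Sy - a * Sx) / n
print(round(a, 4), round(b, 4))  # → 0.6 2.2
```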

From the fitted model one can predict the values ŷᵢ = b + axᵢ, with error eᵢ = yᵢ − ŷᵢ.

Assumptions of Simple Linear Regression

The model is linear;
The error terms have constant variance (homoscedasticity);

The error terms are independent;
The error terms are normally distributed.

Regression Lines

There are two regression lines: Y on X and X on Y. The line of Y on X is given by:

y − ȳ = b_yx (x − x̄),

where b_yx = r · SD(y)/SD(x) is the regression coefficient of y on x. The line of X on Y is given by:

x − x̄ = b_xy (y − ȳ),

where b_xy = r · SD(x)/SD(y) is the regression coefficient of x on y.
Clearly, b_yx · b_xy = r².
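The two regression coefficients and the identity b_yx · b_xy = r² can be checked numerically; the data values are made up for illustration:

```python
import math

# Toy bivariate data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
var_x = sum((xi - x_bar) ** 2 for xi in x) / n
var_y = sum((yi - y_bar) ** 2 for yi in y) / n
r = cov / math.sqrt(var_x * var_y)

byx = r * math.sqrt(var_y) / math.sqrt(var_x)  # coefficient of y on x
bxy = r * math.sqrt(var_x) / math.sqrt(var_y)  # coefficient of x on y

# The product of the two coefficients equals r².
print(round(byx * bxy, 4), round(r ** 2, 4))  # → 0.6 0.6
```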
Difference between correlation and regression

Regression helps in predicting one variable with the help of other variables, while correlation measures only strength and direction;
Correlation does not establish a cause-and-effect relationship;
In correlation there is no concept of dependent and independent variables.

For a multiple regression model with two predictors,

yᵢ = a + bxᵢ + czᵢ + eᵢ,

the normal equations are:

Σyᵢ = na + bΣxᵢ + cΣzᵢ
Σyᵢxᵢ = aΣxᵢ + bΣxᵢ² + cΣxᵢzᵢ
Σyᵢzᵢ = aΣzᵢ + bΣxᵢzᵢ + cΣzᵢ²

An illustrative fitted model of the simple linear form y = ax + b + error: bonus = 2.1 · YearEx + 0.9 + error.
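The three normal equations above form a 3×3 linear system, which can be sketched and solved with NumPy (hypothetical data; numpy assumed available):

```python
import numpy as np

# Hypothetical data for y = a + b·x + c·z + error (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([4.1, 5.0, 8.9, 9.1, 12.2])
n = len(y)

# Coefficient matrix and right-hand side of the normal equations.
A = np.array([
    [n,       x.sum(),       z.sum()],
    [x.sum(), (x * x).sum(), (x * z).sum()],
    [z.sum(), (x * z).sum(), (z * z).sum()],
])
rhs = np.array([y.sum(), (y * x).sum(), (y * z).sum()])

# Solve for the intercept a and the two slopes b, c.
a, b, c = np.linalg.solve(A, rhs)
print(round(a, 3), round(b, 3), round(c, 3))
```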
