
Correlation and Regression

drvijay.niftem@gmail.com
Correlation

 Finding the relationship between two quantitative variables without being able to infer causal relationships

 Correlation is a statistical technique used to determine the degree to which two variables are related
Correlation
• Correlation is “a statistical technique used to determine the
relationship between two or more variables”
• We use two different techniques to determine score
relationships:
1. graphical technique
2. mathematical technique called correlation coefficient

Scatter Plots of Data with Various Correlation
Coefficients

[Six scatter plots of Y against X, showing r = -1, r = -0.6, r = 0, r = +1, r = +0.3, and r = 0 (no linear relationship).]
 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
Linear relationships vs. curvilinear relationships

[Four scatter plots of Y against X: two showing linear (straight-line) relationships and two showing curvilinear relationships.]
Linear Correlation
Strong relationships vs. weak relationships

[Four scatter plots of Y against X: two showing strong relationships (points tight around a line) and two showing weak relationships (points widely scattered around a line).]
Linear Correlation
No relationship

[A scatter plot of Y against X with points scattered randomly: no relationship.]
Types of Relationships

• The scatter diagram can indicate a positive relationship, a negative relationship, or a zero relationship
• What are the characteristics of positive, negative,
and zero relationships?
Correlation

• Measures the relative strength of the linear relationship between two variables
• Unit-less
• Ranges between -1 and +1

  • The closer to -1, the stronger the negative linear relationship
  • The closer to +1, the stronger the positive linear relationship
  • The closer to 0, the weaker any linear relationship


Example

Wt. (kg)     67   69   85   83   74   81   97   92  114   85
SBP (mmHg)  120  125  140  160  130  180  150  140  200  130

[Scatter plot: SBP (mmHg), from 80 to 220 on the vertical axis, against Wt (kg), from 60 to 120 on the horizontal axis.]

Scatter diagram of weight and systolic blood pressure


 The value of r ranges between -1 and +1.
 The value of r denotes the strength of the association, as illustrated by the following scale:

   strong   intermediate   weak | weak   intermediate   strong
  -1 ------ -0.75 ------- -0.25 -- 0 -- 0.25 ------- 0.75 ------ 1
  perfect indirect        no relation at 0        perfect direct
  correlation at -1                            correlation at +1
If r = 0, this means no association or correlation between the two variables.

If 0 < |r| < 0.25: weak correlation.

If 0.25 ≤ |r| < 0.75: intermediate correlation.

If 0.75 ≤ |r| < 1: strong correlation.

If |r| = 1: perfect correlation.
Simpler calculation formula…

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √[ Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² ] = SSxy / √(SSx · SSy)

Here SSxy = Σ(xᵢ - x̄)(yᵢ - ȳ) is the numerator of the covariance, while
SSx = Σ(xᵢ - x̄)² and SSy = Σ(yᵢ - ȳ)² are the numerators of the variances
(each sum runs over i = 1, …, n; dividing each by n - 1 gives the sample covariance and variances themselves).
FORMULA FOR CALCULATING CORRELATION COEFFICIENT

r = Cov(X, Y) / (σX · σY)
  = [ (1/N) ΣXY - X̄ Ȳ ] / √{ [ (1/N) ΣX² - X̄² ] · [ (1/N) ΣY² - Ȳ² ] }
Example:

A sample of 6 children was selected; data about their age in years and weight in kilograms were recorded as shown in the following table. It is required to find the correlation between age and weight.

Serial No   Age (years)   Weight (kg)
    1            7            12
    2            6             8
    3            8            12
    4            5            10
    5            6            11
    6            9            13
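The slide leaves the computation to the reader; a short sketch under the same deviation formula (the data below restates the table, and the variable names are mine):

```python
from math import sqrt

age = [7, 6, 8, 5, 6, 9]          # X, years
weight = [12, 8, 12, 10, 11, 13]  # Y, kg
n = len(age)
mx, my = sum(age) / n, sum(weight) / n

# r = SSxy / sqrt(SSx * SSy)
ss_xy = sum((a - mx) * (w - my) for a, w in zip(age, weight))
ss_x = sum((a - mx) ** 2 for a in age)
ss_y = sum((w - my) ** 2 for w in weight)
r = ss_xy / sqrt(ss_x * ss_y)
```

This gives r ≈ 0.76: an intermediate-to-strong direct correlation between age and weight.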
EXAMPLE: Relationship between Anxiety and Test Scores

Anxiety (X)   Test score (Y)
    10              2
     8              3
     2              9
     1              7
     5              6
     6              5
 ∑X = 32        ∑Y = 32
Solution:

Anxiety (X)   Test score (Y)     X²      Y²      XY
    10              2           100       4      20
     8              3            64       9      24
     2              9             4      81      18
     1              7             1      49       7
     5              6            25      36      30
     6              5            36      25      30
 ∑X = 32        ∑Y = 32     ∑X² = 230  ∑Y² = 204  ∑XY = 129
Calculating Correlation Coefficient

r = (n ∑XY - ∑X ∑Y) / √[ (n ∑X² - (∑X)²) · (n ∑Y² - (∑Y)²) ]
  = (6 × 129 - 32 × 32) / √[ (6 × 230 - 32²) · (6 × 204 - 32²) ]
  = -250 / √(356 × 200) ≈ -0.94

Indirect strong correlation
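The reported value can be verified directly from the column totals with a few lines of Python (a sketch; the sums are taken from the solution table):

```python
from math import sqrt

# Column totals from the solution table
n = 6
sum_x, sum_y = 32, 32
sum_x2, sum_y2, sum_xy = 230, 204, 129

# Computational form of Pearson's r
num = n * sum_xy - sum_x * sum_y  # numerator: -250
den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den                     # ≈ -0.94
```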


Rank Correlation
(Spearman Rank correlation)
Spearman Rank Correlation Coefficient (rs)
• It is a non-parametric measure of correlation.
• This procedure makes use of the two sets of ranks that may
be assigned to the sample values of X and Y.
• Spearman Rank correlation coefficient could be computed in
the following cases:
• Both variables are quantitative.
• Both variables are qualitative ordinal.
• One variable is quantitative and the other is qualitative
ordinal.
Procedure:

1. Rank the values of X from 1 to n where n is the numbers of


pairs of values of X and Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of di for each pair of observations by
subtracting the rank of Yi from the rank of Xi.
4. Square each di and compute ∑di², the sum of the squared
differences.
5. Apply the following formula:

   rs = 1 - (6 ∑di²) / (n(n² - 1))
The value of rs denotes the magnitude and nature of
association giving the same interpretation as simple r.
Example
In a study of the relationship between level of education and income, the following
data were obtained. Find the relationship between them and comment.

Sample    Level of education (X)   Income (Y)
   A          Preparatory             25
   B          Primary                 10
   C          University               8
   D          Secondary               10
   E          Secondary               15
   F          Illiterate              50
   G          University              60
Answer:

Sample   (X)           (Y)   Rank X   Rank Y    di      di²
   A     Preparatory    25      5        3       2       4
   B     Primary        10      6       5.5     0.5     0.25
   C     University      8     1.5       7     -5.5    30.25
   D     Secondary      10     3.5      5.5     -2       4
   E     Secondary      15     3.5       4     -0.5     0.25
   F     Illiterate     50      7        2       5      25
   G     University     60     1.5       1      0.5     0.25

                                              ∑di² = 64
6  64
rs  1   0.1
7(48)

Comment:
There is an indirect weak correlation
between level of education and income.
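The computation can be reproduced in a couple of lines (a sketch; note that, like the slide, this version of the formula does not apply a tie correction to the averaged ranks):

```python
n = 7         # number of pairs
sum_d2 = 64   # Σdi² from the answer table

# Spearman's rank correlation from rank differences
rs = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))  # 1 - 384/336 ≈ -0.14
```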
EXAMPLE

The following are the age (in years) and systolic blood pressure of 20 apparently healthy adults.

Age (x)   B.P (y)     Age (x)   B.P (y)
  46        128         20        120
  53        136         43        128
  60        146         63        141
  20        124         26        126
  63        143         53        134
  43        130         31        128
  26        124         58        136
  19        121         46        132
  31        126         58        140
  23        123         70        144

• Find the correlation between age and blood pressure using simple and
Spearman's correlation coefficients, and comment.

• Find the regression equation.

• What is the predicted blood pressure for a man aged 25 years?
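A worked sketch of the simple-correlation and regression parts of this exercise, restating the 20 pairs from the table (the Spearman part follows the ranking procedure shown earlier; variable names are mine):

```python
from math import sqrt

age = [46, 53, 60, 20, 63, 43, 26, 19, 31, 23,
       20, 43, 63, 26, 53, 31, 58, 46, 58, 70]
bp  = [128, 136, 146, 124, 143, 130, 124, 121, 126, 123,
       120, 128, 141, 126, 134, 128, 136, 132, 140, 144]
n = len(age)
mx, my = sum(age) / n, sum(bp) / n

ss_xy = sum((x - mx) * (y - my) for x, y in zip(age, bp))
ss_x = sum((x - mx) ** 2 for x in age)
ss_y = sum((y - my) ** 2 for y in bp)

r = ss_xy / sqrt(ss_x * ss_y)  # ≈ 0.95: strong direct correlation
b = ss_xy / ss_x               # slope ≈ 0.455 mmHg per year
a = my - b * mx                # intercept ≈ 112.1 mmHg
bp_at_25 = a + b * 25          # predicted BP at age 25 ≈ 123.5 mmHg
```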


• Example: The left side of Figure 1 displays the association
between the IQ of each adolescent in a sample with the
number of hours they listen to rap music per month.
Determine the strength of the correlation between IQ and
rap music using both the Pearson’s correlation coefficient
and Spearman’s rank correlation. Compare the results.
Regression
Demand
Linear Regression
Linear Equation

Y = a + bX
Is it correct? Strictly, the value computed from the line is the fitted value: Yc = a + bX.

Linear Regression.....
Linear Equation

Yc = a + bX + Error term
Is it correct? The error term attaches to the observed value, not the fitted one:
Y = a + bX + Error term
Y = Yc + Error term
Linear Regression.....
Linear Equation

Y = a+bX + Error term

Demand = a + b (Price) + Error term

Y = Demand (Dependent Variable)

X = Price (Independent Variable)

a = Intercept

b = Slope
Linear Regression.....

a = Intercept
The intercept (often labeled the constant) is the expected mean value of Y when all X = 0. In a regression with one predictor X, if X can take the value 0, the intercept is simply the expected mean value of Y at that value.

b = Slope
If the beta coefficient is positive, the interpretation is that for every 1-unit increase in the predictor variable, the outcome variable increases by the value of the beta coefficient.
Dependent and Independent Variables

The dependent variable is also referred to as the outcome, target, or criterion variable.

The independent variable is also referred to as the predictor, explanatory, or regressor variable.
Different Names of Variables
Y= a+bX
Left Side Right Side
Dependent Variable Independent Variable

Explained Variable Explanatory Variable

Endogenous Variable Exogenous Variable

Forecast Variable Predictor Variable

Regressand Variable Regressor Variable

Response Variable Stimulus Variable

Controlled Variable Control Variable


Open file: Demand.xls
Three variables:
Demand, Price and Income
LINEAR EQUATION
Yc = a + bX
Yc = 140 - 10*X
Demand = 140 - 10*Price
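A tiny illustration of using this fitted line for prediction (the function name is mine):

```python
def predicted_demand(price):
    """Point on the fitted line Yc = 140 - 10 * Price."""
    return 140 - 10 * price

# At a price of 5 the line predicts a demand of 90;
# at a price of 0 it returns the intercept, 140.
```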
LINEAR EQUATION
Y = a + bX + error term
https://365datascience.com/normal-distribution/
Example: SST = 3450 and SSE = 450, so SSR = 3450 - 450 = 3000 and R² = SSR/SST = 3000/3450 ≈ 0.869
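Reading the figures above as a sums-of-squares decomposition (SST = 3450, SSE = 450), R-squared is the explained share of the total variability:

```python
sst = 3450       # total sum of squares
sse = 450        # residual (unexplained) sum of squares
ssr = sst - sse  # explained (regression) sum of squares

r_squared = ssr / sst  # 3000 / 3450 ≈ 0.869
```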
What’s the Best Value for an R-squared?

The immediate question you may be asking:


“What is a good R-squared? When do I know,
for sure, that my regression is good enough?”
An R-squared of 41% is neither good nor bad.
But since it is far away from 90%, we may
conclude we are missing some important
information.
Other determinants must be considered. Variables
such as gender, income, and marital status could
help us understand the full picture a little better.
What it Adjusts for
Let’s consider the following two statements:
1) The R-squared measures how much of the total variability is
explained by our model.
2) Multiple regressions are always better than simple ones. This is
because with each additional variable that you add, the explanatory
power may only increase or stay the same.
The adjusted R-squared measures how much of the total variability our
model explains, considering the number of variables.
The adjusted R-squared is always smaller than the R-squared, as it penalizes excessive use of variables.
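The penalty can be written out directly; a sketch in Python using the 41% example from earlier (the sample size n = 100 and the predictor counts are illustrative values of mine):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R² = 0.41 and n = 100: more predictors means a bigger penalty
one_pred = adjusted_r2(0.41, 100, 1)   # ≈ 0.404
ten_pred = adjusted_r2(0.41, 100, 10)  # ≈ 0.344
```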
Comparing Regression Models
Finally, the adjusted R-squared is the basis for comparing regression
models.
It only makes sense to compare two models considering the same
dependent variable and using the same dataset. If we compare two
models that are about two different dependent variables, we will be
making an apples-to-oranges comparison. If we use different datasets, it
is an apples-to-dinosaurs problem.
The “Game” of Maximising R Square
Assumptions of
Regression
Assumptions of Regression
In order to generalize the results to the broader population from which the sample is drawn, the basic assumptions must be checked and met (Field, 2005).
In order to apply multiple regression, the assumptions are:

1. Model Fit: F is significant


2. Significance of Independent variables.
3. No Multicollinearity between the independent variables
4. Normality of residuals
5. Homogeneity of Variance (Homoscedasticity)
Model Fit
Model Fit: F is significant
This means the model fits: the variables entered in the model are sufficient to explain the dependent variable.
H0: There is no significant difference between explained variance and unexplained variance (i.e., the model is not a good fit).

If the p-value is < 0.05, the null hypothesis is rejected and the model is a good fit.
Significance of Independent Variables

H0: There is no significant impact of the independent variable on the dependent variable.

If p < 0.05, H0 is rejected, which means there is a significant impact of the independent variable on the dependent variable.
A t test is performed to check the significance:
t = (difference of beta from 0) / standard error
No Multicollinearity

In multiple regression there may exist strong correlations and linear relationships among the independent (predictor) variables in the model; this is called the problem of multicollinearity. The presence of multicollinearity makes it difficult to assess the individual importance or predictive power of each independent variable, as the predictors are themselves highly correlated.
No Multicollinearity…

To identify multicollinearity, we check the collinearity statistics. The VIF (Variance Inflation Factor) indicates whether an independent variable has a strong linear relationship with the other independent variables in the model. The VIF should be less than 10 for all independent variables, indicating the absence of collinearity among them (Myers, 1990).
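In general, the VIF for a predictor is 1/(1 - R²) where R² comes from regressing that predictor on all the others. For the special case of exactly two predictors this reduces to 1/(1 - r²) with r their pairwise correlation; a sketch with made-up data (function and variable names are mine):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation via the sum-of-squares formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return ss_xy / sqrt(ss_x * ss_y)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor model: 1 / (1 - r²)."""
    r = pearson(x1, x2)
    return 1 / (1 - r ** 2)

# Illustrative data: x2 is nearly a linear copy of x1, so its VIF is large,
# while x3 is nearly unrelated to x1, so its VIF stays near 1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
x3 = [5, 1, 4, 2, 6, 3]
```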
Normality of Residuals

In linear regression, the residuals should be normally distributed.

The residuals capture the impact of left-out variables in the regression. If their behavior is not normal and follows a specific pattern, this means an important variable has been omitted and is showing its impact; without that omitted variable, prediction will not be good.
Steps: save the residuals as a variable and apply a normality test.
Homogeneity of Variance (Homoscedasticity)
Dummy Variables
Alternative
Regression
• Linear regression is the most basic and commonly
used predictive analysis. Regression estimates are
used to describe data and to explain the relationship
between one dependent variable and one or more
independent variables.
• At the center of the regression analysis is the task of
fitting a single line through a scatter plot. The simplest
form, with one dependent and one independent variable, is
defined by the formula Y = a + b*X.
Assumptions of Regression
• Two or more continuous variables (i.e., interval or ratio level)
• Cases that have values on both variables
• Linear relationship between the variables.
• Independent cases (i.e., independence of observations)
– There is no relationship between the values of variables between cases. This means that:
• the values for all variables across cases are unrelated
• for any case, the value for any variable cannot influence the value of any variable for
other cases
• no case can influence another case on any variable
– The bivariate Pearson correlation coefficient and corresponding significance test are not
robust when independence is violated.
• Bivariate normality
– Each pair of variables is bivariately normally distributed.
– Each pair of variables is bivariately normally distributed at all levels of the other variable(s)
– This assumption ensures that the variables are linearly related; violations of this
assumption may indicate that non-linear relationships among variables exist. Linearity can
be assessed visually using a scatterplot of the data.
• Random sample of data from the population
• No outliers
Difference between Correlation and Regression

Meaning: Correlation is a statistical measure which determines the co-relationship or association of two variables. Regression describes how an independent variable is numerically related to the dependent variable.

Usage: Correlation is used to represent the linear relationship between two variables. Regression is used to fit a best line and estimate one variable on the basis of another variable.

Dependent and independent variables: In correlation there is no difference between the variables; in regression the two variables are different.

Indicates: The correlation coefficient indicates the extent to which two variables move together. Regression indicates the impact of a unit change in the known variable (x) on the estimated variable (y).

Objective: Correlation aims to find a numerical value expressing the relationship between variables. Regression aims to estimate values of a random variable on the basis of the values of a fixed variable.
Comparison: Independent Variable and Dependent Variable

Meaning: The independent variable is one whose values are deliberately changed by the researcher in order to obtain a desired outcome. The dependent variable is one which changes its values in response to changes in the values of the independent variable.

What is it? The independent variable is the antecedent; the dependent variable is the consequent.

Relationship: The independent variable is the presumed cause; the dependent variable is the observed effect.

Values: The independent variable is manipulated by the researcher; the dependent variable is measured by the researcher.

Usually denoted by: x (independent) and y (dependent).
How to check linearity
References
• Relationships Among Variables ( Correlation and Regression ) KNES 510 Methods in Kinesiology
