
Correlation and Regression

drvijay.niftem@gmail.com
Correlation

 Finding the relationship between two quantitative variables without being able to infer causal relationships

 Correlation is a statistical technique used to determine the degree to which two variables are related
Correlation
• Correlation is “a statistical technique used to determine the
relationship between two or more variables”
• We use two different techniques to determine score
relationships:
1. graphical technique
2. mathematical technique called correlation coefficient

Scatter Plots of Data with Various Correlation
Coefficients

[Six scatter plots of Y against X, showing r = -1, r = -0.6, r = 0, r = +1, r = +0.3, and r = 0 (no linear relationship).]
 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
Linear relationships vs. curvilinear relationships

[Four scatter plots of Y against X: two showing linear (straight-line) relationships and two showing curvilinear relationships.]
Linear Correlation
Strong relationships vs. weak relationships

[Four scatter plots of Y against X: two showing strong relationships (points tight around a line) and two showing weak relationships (points widely scattered around a line).]
Linear Correlation
No relationship

[A scatter plot of Y against X with points scattered randomly: no relationship.]
Types of Relationships

• The scatter diagram can indicate a positive relationship, a negative relationship, or a zero relationship
• What are the characteristics of positive, negative,
and zero relationships?
Correlation

• Measures the relative strength of the linear relationship between two variables
• Unit-less
• Ranges between -1 and +1

  • The closer to -1, the stronger the negative linear relationship
  • The closer to +1, the stronger the positive linear relationship
  • The closer to 0, the weaker any linear relationship


Example

Wt. (kg)     67   69   85   83   74   81   97   92  114   85
SBP (mmHg)  120  125  140  160  130  180  150  140  200  130

[Scatter plot: SBP (mmHg), from 80 to 220 on the vertical axis, against Wt (kg), from 60 to 120 on the horizontal axis.]

Scatter diagram of weight and systolic blood pressure


 The value of r ranges between -1 and +1.
 The value of r denotes the strength of the association, as illustrated by the following scale:

   strong   intermediate   weak | weak   intermediate   strong
  -1 ------ -0.75 ------- -0.25 -- 0 -- 0.25 ------- 0.75 ------ 1
  perfect indirect        no relation at 0        perfect direct
  correlation at -1                            correlation at +1
If r = 0, this means no association or correlation between the two variables.

If 0 < |r| < 0.25: weak correlation.

If 0.25 ≤ |r| < 0.75: intermediate correlation.

If 0.75 ≤ |r| < 1: strong correlation.

If |r| = 1: perfect correlation.
Simpler calculation formula…

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √[ Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² ] = SSxy / √(SSx · SSy)

Here SSxy = Σ(xᵢ - x̄)(yᵢ - ȳ) is the numerator of the covariance, while
SSx = Σ(xᵢ - x̄)² and SSy = Σ(yᵢ - ȳ)² are the numerators of the variances
(each sum runs over i = 1, …, n; dividing each by n - 1 gives the sample covariance and variances themselves).
FORMULA FOR CALCULATING CORRELATION COEFFICIENT

r = Cov(X, Y) / (σX · σY)
  = [ (1/N) ΣXY - X̄ Ȳ ] / √{ [ (1/N) ΣX² - X̄² ] · [ (1/N) ΣY² - Ȳ² ] }
Example:

A sample of 6 children was selected; data about their age in years and weight in kilograms were recorded as shown in the following table. It is required to find the correlation between age and weight.

Serial No   Age (years)   Weight (kg)
    1            7            12
    2            6             8
    3            8            12
    4            5            10
    5            6            11
    6            9            13
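The slide leaves the computation to the reader; a short sketch under the same deviation formula (the data below restates the table, and the variable names are mine):

```python
from math import sqrt

age = [7, 6, 8, 5, 6, 9]          # X, years
weight = [12, 8, 12, 10, 11, 13]  # Y, kg
n = len(age)
mx, my = sum(age) / n, sum(weight) / n

# r = SSxy / sqrt(SSx * SSy)
ss_xy = sum((a - mx) * (w - my) for a, w in zip(age, weight))
ss_x = sum((a - mx) ** 2 for a in age)
ss_y = sum((w - my) ** 2 for w in weight)
r = ss_xy / sqrt(ss_x * ss_y)
```

This gives r ≈ 0.76: an intermediate-to-strong direct correlation between age and weight.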
EXAMPLE: Relationship between Anxiety and Test Scores

Anxiety (X)   Test score (Y)
    10              2
     8              3
     2              9
     1              7
     5              6
     6              5
 ∑X = 32        ∑Y = 32
Solution:

Anxiety (X)   Test score (Y)     X²      Y²      XY
    10              2           100       4      20
     8              3            64       9      24
     2              9             4      81      18
     1              7             1      49       7
     5              6            25      36      30
     6              5            36      25      30
 ∑X = 32        ∑Y = 32     ∑X² = 230  ∑Y² = 204  ∑XY = 129
Calculating Correlation Coefficient

r = (n ∑XY - ∑X ∑Y) / √[ (n ∑X² - (∑X)²) · (n ∑Y² - (∑Y)²) ]
  = (6 × 129 - 32 × 32) / √[ (6 × 230 - 32²) · (6 × 204 - 32²) ]
  = -250 / √(356 × 200) ≈ -0.94

Indirect strong correlation
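The reported value can be verified directly from the column totals with a few lines of Python (a sketch; the sums are taken from the solution table):

```python
from math import sqrt

# Column totals from the solution table
n = 6
sum_x, sum_y = 32, 32
sum_x2, sum_y2, sum_xy = 230, 204, 129

# Computational form of Pearson's r
num = n * sum_xy - sum_x * sum_y  # numerator: -250
den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den                     # ≈ -0.94
```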


Rank Correlation
(Spearman Rank correlation)
Spearman Rank Correlation Coefficient (rs)
• It is a non-parametric measure of correlation.
• This procedure makes use of the two sets of ranks that may
be assigned to the sample values of X and Y.
• Spearman Rank correlation coefficient could be computed in
the following cases:
• Both variables are quantitative.
• Both variables are qualitative ordinal.
• One variable is quantitative and the other is qualitative
ordinal.
Procedure:

1. Rank the values of X from 1 to n where n is the numbers of


pairs of values of X and Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of di for each pair of observations by
subtracting the rank of Yi from the rank of Xi.
4. Square each di and compute ∑di², the sum of the squared
differences.
5. Apply the following formula:

   rs = 1 - (6 ∑di²) / (n(n² - 1))
The value of rs denotes the magnitude and nature of
association giving the same interpretation as simple r.
Example
In a study of the relationship between level of education and income, the following
data were obtained. Find the relationship between them and comment.

Sample    Level of education (X)   Income (Y)
   A          Preparatory             25
   B          Primary                 10
   C          University               8
   D          Secondary               10
   E          Secondary               15
   F          Illiterate              50
   G          University              60
Answer:

Sample   (X)           (Y)   Rank X   Rank Y    di      di²
   A     Preparatory    25      5        3       2       4
   B     Primary        10      6       5.5     0.5     0.25
   C     University      8     1.5       7     -5.5    30.25
   D     Secondary      10     3.5      5.5     -2       4
   E     Secondary      15     3.5       4     -0.5     0.25
   F     Illiterate     50      7        2       5      25
   G     University     60     1.5       1      0.5     0.25

                                              ∑di² = 64
6  64
rs  1   0.1
7(48)

Comment:
There is an indirect weak correlation
between level of education and income.
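The computation can be reproduced in a couple of lines (a sketch; note that, like the slide, this version of the formula does not apply a tie correction to the averaged ranks):

```python
n = 7         # number of pairs
sum_d2 = 64   # Σdi² from the answer table

# Spearman's rank correlation from rank differences
rs = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))  # 1 - 384/336 ≈ -0.14
```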
EXAMPLE

The following are the age (in years) and systolic blood pressure of 20 apparently healthy adults.

Age (x)   B.P (y)     Age (x)   B.P (y)
  46        128         20        120
  53        136         43        128
  60        146         63        141
  20        124         26        126
  63        143         53        134
  43        130         31        128
  26        124         58        136
  19        121         46        132
  31        126         58        140
  23        123         70        144

• Find the correlation between age and blood pressure using simple and
Spearman's correlation coefficients, and comment.

• Find the regression equation.

• What is the predicted blood pressure for a man aged 25 years?
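A worked sketch of the simple-correlation and regression parts of this exercise, restating the 20 pairs from the table (the Spearman part follows the ranking procedure shown earlier; variable names are mine):

```python
from math import sqrt

age = [46, 53, 60, 20, 63, 43, 26, 19, 31, 23,
       20, 43, 63, 26, 53, 31, 58, 46, 58, 70]
bp  = [128, 136, 146, 124, 143, 130, 124, 121, 126, 123,
       120, 128, 141, 126, 134, 128, 136, 132, 140, 144]
n = len(age)
mx, my = sum(age) / n, sum(bp) / n

ss_xy = sum((x - mx) * (y - my) for x, y in zip(age, bp))
ss_x = sum((x - mx) ** 2 for x in age)
ss_y = sum((y - my) ** 2 for y in bp)

r = ss_xy / sqrt(ss_x * ss_y)  # ≈ 0.95: strong direct correlation
b = ss_xy / ss_x               # slope ≈ 0.455 mmHg per year
a = my - b * mx                # intercept ≈ 112.1 mmHg
bp_at_25 = a + b * 25          # predicted BP at age 25 ≈ 123.5 mmHg
```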


• Example: The left side of Figure 1 displays the association
between the IQ of each adolescent in a sample with the
number of hours they listen to rap music per month.
Determine the strength of the correlation between IQ and
rap music using both the Pearson’s correlation coefficient
and Spearman’s rank correlation. Compare the results.
Regression
Demand
Linear Regression
Linear Equation

Y = a + bX
Is it correct? Strictly, the value computed from the line is the fitted value: Yc = a + bX.

Linear Regression.....
Linear Equation

Yc = a + bX + Error term
Is it correct? The error term attaches to the observed value, not the fitted one:
Y = a + bX + Error term
Y = Yc + Error term
Linear Regression.....
Linear Equation

Y = a+bX + Error term

Demand = a + b (Price) + Error term

Y = Demand (Dependent Variable)

X = Price (Independent Variable)

a = Intercept

b = Slope
Linear Regression.....

a = Intercept
The intercept (often labeled the constant) is the expected mean value of Y when all X = 0. In a regression with one predictor X, if X can take the value 0, the intercept is simply the expected mean value of Y at that value.

b = Slope
If the beta coefficient is positive, the interpretation is that for every 1-unit increase in the predictor variable, the outcome variable increases by the value of the beta coefficient.
Dependent and Independent Variables

The dependent variable is also referred to as the outcome, target, or criterion variable.

The independent variable is also referred to as the predictor, explanatory, or regressor variable.
Different Names of Variables
Y= a+bX
Left Side Right Side
Dependent Variable Independent Variable

Explained Variable Explanatory Variable

Endogenous Variable Exogenous Variable

Forecast Variable Predictor Variable

Regressand Variable Regressor Variable

Response Variable Stimulus Variable

Controlled Variable Control Variable


Open file: Demand.xls
Three variables:
Demand, Price and Income
LINEAR EQUATION
Yc = a + bX
Yc = 140 - 10*X
Demand = 140 - 10*Price
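A tiny illustration of using this fitted line for prediction (the function name is mine):

```python
def predicted_demand(price):
    """Point on the fitted line Yc = 140 - 10 * Price."""
    return 140 - 10 * price

# At a price of 5 the line predicts a demand of 90;
# at a price of 0 it returns the intercept, 140.
```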
LINEAR EQUATION
Y = a + bX + error term
https://365datascience.com/normal-distribution/
Example: SST = 3450 and SSE = 450, so SSR = 3450 - 450 = 3000 and R² = SSR/SST = 3000/3450 ≈ 0.869
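Reading the figures above as a sums-of-squares decomposition (SST = 3450, SSE = 450), R-squared is the explained share of the total variability:

```python
sst = 3450       # total sum of squares
sse = 450        # residual (unexplained) sum of squares
ssr = sst - sse  # explained (regression) sum of squares

r_squared = ssr / sst  # 3000 / 3450 ≈ 0.869
```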
What’s the Best Value for an R-squared?

The immediate question you may be asking:


“What is a good R-squared? When do I know,
for sure, that my regression is good enough?”
An R-squared of 41% is neither good nor bad.
But since it is far away from 90%, we may
conclude we are missing some important
information.
Other determinants must be considered. Variables
such as gender, income, and marital status could
help us understand the full picture a little better.
What it Adjusts for
Let’s consider the following two statements:
1) The R-squared measures how much of the total variability is
explained by our model.
2) Multiple regressions are always better than simple ones. This is
because with each additional variable that you add, the explanatory
power may only increase or stay the same.
The adjusted R-squared measures how much of the total variability our
model explains, considering the number of variables.
The adjusted R-squared is always smaller than the R-squared, as it penalizes excessive use of variables.
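The penalty can be written out directly; a sketch in Python using the 41% example from earlier (the sample size n = 100 and the predictor counts are illustrative values of mine):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R² = 0.41 and n = 100: more predictors means a bigger penalty
one_pred = adjusted_r2(0.41, 100, 1)   # ≈ 0.404
ten_pred = adjusted_r2(0.41, 100, 10)  # ≈ 0.344
```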
Comparing Regression Models
Finally, the adjusted R-squared is the basis for comparing regression
models.
It only makes sense to compare two models considering the same
dependent variable and using the same dataset. If we compare two
models that are about two different dependent variables, we will be
making an apples-to-oranges comparison. If we use different datasets, it
is an apples-to-dinosaurs problem.
The “Game” of Maximising R Square
Assumptions of
Regression
Assumptions of Regression
In order to generalize the results to the broader population from which the sample is drawn, the basic assumptions must be checked and met (Field, 2005).
In order to apply multiple regression, the assumptions are:

1. Model Fit: F is significant


2. Significance of Independent variables.
3. No Multicollinearity between the independent variables
4. Normality of residuals
5. Homogeneity of Variance (Homoscedasticity)
Model Fit
Model Fit: F is significant
This means the model fits: the variables entered in the model are sufficient to explain the dependent variable.
H0: There is no significant difference between explained variance and unexplained variance (i.e., the model is not a good fit).

If the p-value is < 0.05, the null hypothesis is rejected and the model is a good fit.
Significance of Independent Variables

H0: There is no significant impact of the independent variable on the dependent variable.

If p < 0.05, H0 is rejected, which means there is a significant impact of the independent variable on the dependent variable.
A t test is performed to check the significance:
t = (difference of beta from 0) / standard error
No Multicollinearity

In multiple regression there may exist strong correlations and linear relationships among the independent (predictor) variables in the model; this is called the problem of multicollinearity. The presence of multicollinearity makes it difficult to assess the individual importance or predictive power of each independent variable, as the predictors are themselves highly correlated.
No Multicollinearity…

To identify multicollinearity, we check the collinearity statistics. The VIF (Variance Inflation Factor) indicates whether an independent variable has a strong linear relationship with the other independent variables in the model. The VIF should be less than 10 for all independent variables, indicating the absence of collinearity among them (Myers, 1990).
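In general, the VIF for a predictor is 1/(1 - R²) where R² comes from regressing that predictor on all the others. For the special case of exactly two predictors this reduces to 1/(1 - r²) with r their pairwise correlation; a sketch with made-up data (function and variable names are mine):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation via the sum-of-squares formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return ss_xy / sqrt(ss_x * ss_y)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor model: 1 / (1 - r²)."""
    r = pearson(x1, x2)
    return 1 / (1 - r ** 2)

# Illustrative data: x2 is nearly a linear copy of x1, so its VIF is large,
# while x3 is nearly unrelated to x1, so its VIF stays near 1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
x3 = [5, 1, 4, 2, 6, 3]
```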
Normality of Residuals

In linear regression, the residuals should be normally distributed.

The residuals capture the impact of left-out variables in the regression. If their behavior is not normal and follows a specific pattern, this means an important variable has been omitted and is showing its impact; without that omitted variable, prediction will not be good.
Steps: save the residuals as a variable and apply a normality test.
Homogeneity of Variance (Homoscedasticity)
Dummy Variables
Alternative
Regression
• Linear regression is the most basic and commonly
used predictive analysis. Regression estimates are
used to describe data and to explain the relationship
between one dependent variable and one or more
independent variables.
• At the center of the regression analysis is the task of
fitting a single line through a scatter plot. The simplest
form, with one dependent and one independent variable, is
defined by the formula Y = a + b*X.
Assumptions of Regression
• Two or more continuous variables (i.e., interval or ratio level)
• Cases that have values on both variables
• Linear relationship between the variables.
• Independent cases (i.e., independence of observations)
– There is no relationship between the values of variables between cases. This means that:
• the values for all variables across cases are unrelated
• for any case, the value for any variable cannot influence the value of any variable for
other cases
• no case can influence another case on any variable
– The bivariate Pearson correlation coefficient and corresponding significance test are not
robust when independence is violated.
• Bivariate normality
– Each pair of variables is bivariately normally distributed.
– Each pair of variables is bivariately normally distributed at all levels of the other variable(s)
– This assumption ensures that the variables are linearly related; violations of this
assumption may indicate that non-linear relationships among variables exist. Linearity can
be assessed visually using a scatterplot of the data.
• Random sample of data from the population
• No outliers
Difference between Correlation and Regression

Meaning: Correlation is a statistical measure which determines the co-relationship or association of two variables. Regression describes how an independent variable is numerically related to the dependent variable.

Usage: Correlation is used to represent the linear relationship between two variables. Regression is used to fit a best line and estimate one variable on the basis of another variable.

Dependent and independent variables: In correlation there is no difference between the variables; in regression the two variables are different.

Indicates: The correlation coefficient indicates the extent to which two variables move together. Regression indicates the impact of a unit change in the known variable (x) on the estimated variable (y).

Objective: Correlation aims to find a numerical value expressing the relationship between variables. Regression aims to estimate values of a random variable on the basis of the values of a fixed variable.
Comparison: Independent Variable and Dependent Variable

Meaning: The independent variable is one whose values are deliberately changed by the researcher in order to obtain a desired outcome. The dependent variable is one which changes its values in response to changes in the values of the independent variable.

What is it? The independent variable is the antecedent; the dependent variable is the consequent.

Relationship: The independent variable is the presumed cause; the dependent variable is the observed effect.

Values: The independent variable is manipulated by the researcher; the dependent variable is measured by the researcher.

Usually denoted by: x (independent) and y (dependent).
How to check linearity
References
• Relationships Among Variables ( Correlation and Regression ) KNES 510 Methods in Kinesiology
