
QT/U2 Topic 1 Correlation Coefficient, Assumptions of Correlation Coefficient
The correlation coefficient is a statistical measure that calculates the strength of the relationship between the
relative movements of two variables. The values of the correlation coefficient are bounded by 1.0 on an
absolute-value basis, i.e. they lie between -1.0 and 1.0. If a computed correlation coefficient is greater than 1.0 or less than -1.0, the
correlation measurement is incorrect. A correlation of -1.0 shows a perfect negative correlation, while a
correlation of 1.0 shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship between
the movements of the two variables.

While the correlation coefficient measures a degree of relation between two variables, it only measures the linear
relationship between the variables. The correlation coefficient cannot capture nonlinear relationships between two
variables.

A value of exactly 1.0 means there is a perfect positive relationship between the two variables. For a positive
increase in one variable, there is also a positive increase in the second variable. A value of -1.0 means there is a
perfect negative relationship between the two variables. This shows the variables move in opposite directions —
for a positive increase in one variable, there is a decrease in the second variable. If the correlation is 0, there is no
linear relationship between the two variables.

The strength of the relationship varies in degree based on the value of the correlation coefficient. For example, a
value of 0.2 shows there is a positive relationship between the two variables, but it is weak. Many analysts do not
consider a correlation strong until its absolute value surpasses at least 0.8, and a correlation coefficient with an
absolute value of 0.9 or greater represents a very strong relationship.

This statistic is useful in finance. For example, it can be helpful in determining how well a mutual fund performs
relative to its benchmark index, or another fund or asset class. By adding a low or negatively correlated mutual
fund to an existing portfolio, the investor gains diversification benefits.

Correlation Coefficient Formulas

One of the most commonly used formulas in stats is Pearson’s correlation coefficient formula. If you’re taking a
basic stats class, this is the one you’ll probably use:

r = [ n Σxy – (Σx)(Σy) ] / √{ [ n Σx² – (Σx)² ] [ n Σy² – (Σy)² ] }

Where,
r = Pearson correlation coefficient
x = Values in first set of data
y = Values in second set of data

n = Total number of values.
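A minimal sketch of this computation in Python (the data values below are hypothetical, chosen only for illustration):

import math

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Raw-score form: r = [n*Sum(xy) - Sum(x)*Sum(y)] / sqrt([n*Sum(x^2) - (Sum x)^2] * [n*Sum(y^2) - (Sum y)^2])
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(round(r, 4))  # 0.7746 for this sample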

The assumptions of the correlation coefficient are:

1. Normality means that the data sets to be correlated should approximate the normal distribution. In such
normally distributed data, most data points tend to hover close to the mean.
2. Homoscedasticity comes from the Greek prefix homo ('same'), along with the Greek word skedastikos, which means
'able to disperse'. Homoscedasticity means 'equal variances': the size of the error term is the
same for all values of the independent variable. If the error term, or the variance, is smaller for one
range of values of the independent variable and larger for another, then homoscedasticity is violated.
It is quite easy to check for homoscedasticity visually, by looking at a scatter plot. If the
points lie equally on both sides of the line of best fit, then the data is homoscedastic.
3. Linearity simply means that the data follows a linear relationship. Again, this can be examined by looking at
a scatter plot. If the data points have a straight line (and not a curve) relationship, then the data satisfies the
linearity assumption.
4. Continuous variables are those that can take any value within an interval. Ratio variables are also
continuous variables. To compute Karl Pearson’s Coefficient of Correlation, both data sets must contain
continuous variables. If even one of the data sets is ordinal, then Spearman’s Coefficient of Rank Correlation
would be a more appropriate measure.
5. Paired observations mean that every data point must be in pairs: for every observation of the
independent variable, there must be a corresponding observation of the dependent variable. We cannot
compute the correlation coefficient if one data set has 12 observations and the other has 10.
6. No outliers should be present in the data. Outliers can significantly skew the correlation coefficient and
make it inaccurate. When does a data point become an outlier? In general, a data point that lies more than
3.29 standard deviations from the mean is considered an outlier. Outliers are easy to spot visually on a
scatter plot.
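A short Python sketch of the outlier screen described in point 6, using the ±3.29 standard-deviation rule (the data is hypothetical; note that in very small samples a single extreme point inflates the standard deviation and may escape the rule):

import statistics

# Hypothetical sample: tightly clustered values plus one extreme point
data = [13, 14, 15] * 6 + [13, 60]
mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

# Flag points lying more than 3.29 standard deviations from the mean
outliers = [v for v in data if abs(v - mean) / sd > 3.29]
print(outliers)  # [60]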


QT/U2 Topic 2 Coefficient of Determination and Correlation


COEFFICIENT OF DETERMINATION
The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the
proportion of the variance in the dependent variable that is predictable from the independent variable.

The coefficient of determination is the square of the correlation (r) between predicted y scores and actual y
scores; thus, it ranges from 0 to 1.
With linear regression, the coefficient of determination is also equal to the square of the correlation between
x and y scores.
An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
An R² of 1 means the dependent variable can be predicted without error from the independent variable.
An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10
means that 10 percent of the variance in Y is predictable from X; an R² of 0.20 means that 20 percent is
predictable; and so on.

The formula for computing the coefficient of determination for a linear regression model with one independent
variable is given below.

Coefficient of determination. The coefficient of determination (R²) for a linear regression model with one
independent variable is:

R² = { (1/N) * Σ [ (xᵢ – x̄) * (yᵢ – ȳ) ] / (σx * σy) }²

where N is the number of observations used to fit the model, Σ is the summation symbol, xᵢ is the x value for
observation i, x̄ is the mean x value, yᵢ is the y value for observation i, ȳ is the mean y value, σx is the standard
deviation of x, and σy is the standard deviation of y.
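A small Python sketch of this formula on hypothetical data; because the numerator divided by (σx * σy) is Pearson's r, the result is simply r squared:

import statistics

# Hypothetical paired data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
N = len(x)

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sigma_x, sigma_y = statistics.pstdev(x), statistics.pstdev(y)  # population SDs, matching the 1/N factor

covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / N
r = covariance / (sigma_x * sigma_y)
r_squared = r ** 2
print(round(r_squared, 4))  # 0.6 for this sample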

Coefficient of Correlation

The coefficient of determination, r², is useful because it gives the proportion of the variance (fluctuation) of
one variable that is predictable from the other variable.
It is a measure that allows us to determine how certain one can be in making predictions from a certain
model/graph.
The coefficient of determination is the ratio of the explained variation to the total variation.
The coefficient of determination is such that 0 ≤ r² ≤ 1, and denotes the strength of the linear association
between x and y.
The coefficient of determination represents the percent of the data that is closest to the line of best fit. For
example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in y can be explained by
the linear relationship between x and y (as described by the regression equation). The other 15% of the total
variation in y remains unexplained.
The coefficient of determination is a measure of how well the regression line represents the data. If the
regression line passes exactly through every point on the scatter plot, it would be able to explain all of the
variation. The further the line is from the points, the less it is able to explain.


QT/U2 Topic 3 Measurement of Correlation – Karl Pearson’s Method, Spearman Rank Correlation
Karl Pearson’s Coefficient of Correlation is a widely used mathematical method in which a numerical
expression is used to calculate the degree and direction of the relationship between linearly related variables.

Pearson’s method, popularly known as the Pearsonian Coefficient of Correlation, is one of the most extensively used
quantitative methods in practice. The coefficient of correlation is denoted by “r”.

If the relationship between two variables X and Y is to be ascertained, then the following formula is used:

r = Σ (X – X̄)(Y – Ȳ) / √[ Σ (X – X̄)² · Σ (Y – Ȳ)² ]

Properties of Coefficient of Correlation

The value of the coefficient of correlation (r) always lies between ±1:
r = +1, perfect positive correlation
r = -1, perfect negative correlation
r = 0, no correlation
The coefficient of correlation is independent of the origin and scale (a quick check appears after this list). By origin, this means that if any
constant is subtracted from the given values of X and Y, the value of “r” remains unchanged. By scale, it means that
the value of “r” is unaffected if the values of X and Y are divided or multiplied by any positive constant.
The coefficient of correlation is the geometric mean of the two regression coefficients. Symbolically, it is
represented as:

r = √(bxy · byx)

The coefficient of correlation is “zero” when the variables X and Y are independent. However, the
converse is not true: a zero correlation does not imply independence.
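A quick Python check of the origin-and-scale property above (hypothetical data; statistics.correlation requires Python 3.10 or later):

from statistics import correlation  # Python 3.10+

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_original = correlation(x, y)

# Change of origin (subtract a constant) and of scale (multiply by a positive constant)
x_new = [(xi - 3) * 10 for xi in x]
y_new = [(yi - 2) * 4 for yi in y]
r_transformed = correlation(x_new, y_new)

# Multiplying by a negative constant would flip the sign of r
print(round(r_original, 4), round(r_transformed, 4))  # both 0.7746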

Assumptions of Karl Pearson’s Coefficient of Correlation


1. The relationship between the variables is “linear”, which means that when the two variables are plotted, the
points form a straight line.

2. A large number of independent causes affect the variables under study, so that they form a normal
distribution. Variables such as price, demand and supply, for example, are affected by many such factors.

3. The variables are independent of each other.

Note: The coefficient of correlation measures not only the magnitude of the correlation but also its direction.
For example, r = -0.67 shows a negative correlation, because the sign is “-”, with a magnitude of 0.67.

SPEARMAN RANK CORRELATION

Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two
variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data
and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.

The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of
those two variables; while Pearson’s correlation assesses linear relationships, Spearman’s correlation assesses
monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman
correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

Intuitively, the Spearman correlation between two variables will be high when observations have a similar (or
identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd,
3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully opposed for a
correlation of −1) rank between the two variables.

The following formula is used to calculate the Spearman rank correlation:

ρ = 1 – [ 6 Σdᵢ² ] / [ n (n² – 1) ]

Where,

ρ = Spearman rank correlation

dᵢ = the difference between the ranks of corresponding values

n = number of observations
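A minimal Python sketch of this formula, assuming no tied ranks (the paired scores are hypothetical):

# Hypothetical paired scores with no ties
x = [35, 23, 47, 17, 10, 43, 9, 6, 28]
y = [30, 33, 45, 23, 8, 49, 12, 4, 31]
n = len(x)

def ranks(values):
    # Rank 1 for the smallest value; assumes no ties
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

rx, ry = ranks(x), ranks(y)
d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))

# rho = 1 - 6 * Sum(d_i^2) / (n * (n^2 - 1))
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(rho)  # 0.9 for this sample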

Assumptions

The assumptions of the Spearman correlation are that data must be at least ordinal and the scores on one variable
must be monotonically related to the other variable.


QT/U2 Topic 4 Regression: Meaning, Assumption, Regression Line
REGRESSION

Regression is a statistical measurement used in finance, investing and other disciplines that attempts to determine
the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other
changing variables (known as independent variables).

Regression helps investment and financial managers to value assets and understand the relationships between
variables, such as commodity prices and the stocks of businesses dealing in those commodities.

Regression Explained

The two basic types of regression are linear regression and multiple linear regression, although there are non-
linear regression methods for more complicated data and analysis. Linear regression uses one independent
variable to explain or predict the outcome of the dependent variable Y, while multiple regression uses two or more
independent variables to predict the outcome.

Regression can help finance and investment professionals as well as professionals in other businesses. Regression
can also help predict sales for a company based on weather, previous sales, GDP growth or other types of
conditions. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing
assets and discovering costs of capital.

The general form of each type of regression is:

Linear regression: Y = a + bX + u
Multiple regression: Y = a + b1X1 + b2X2 + b3X3 + … + btXt + u

Where:

Y = the variable that you are trying to predict (dependent variable).

X = the variable that you are using to predict Y (independent variable).

a = the intercept.

b = the slope.

u = the regression residual.

Regression takes a group of random variables, thought to be predicting Y, and tries to find a mathematical
relationship between them. This relationship is typically in the form of a straight line (linear regression) that best
approximates all the individual data points. In multiple regression, the separate variables are differentiated by
using numbers with subscripts.
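A short Python sketch of fitting the linear form Y = a + bX + u by least squares (the observations are hypothetical; numpy is a common numerical library assumed to be installed):

import numpy as np

# Hypothetical observations
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# Slope b and intercept a from the least-squares formulas
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

predictions = a + b * X
residuals = Y - predictions  # the u term for each observation
print(f"a = {a:.3f}, b = {b:.3f}")  # a = 0.230, b = 1.930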

ASSUMPTIONS IN REGRESSION

Independence: The residuals are serially independent (no autocorrelation), and the residuals are not
correlated with any of the independent (predictor) variables.
Linearity: The relationship between the dependent variable and each of the independent variables is linear.
Mean of Residuals: The mean of the residuals is zero.
Homogeneity of Variance: The variance of the residuals at all levels of the independent variables is
constant.
Errors in Variables: The independent (predictor) variables are measured without error.
Model Specification: All relevant variables are included in the model. No irrelevant variables are included
in the model.
Normality: The residuals are normally distributed. This assumption is needed for valid tests of significance
but not for estimation of the regression coefficients.
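As a small illustration of the “Mean of Residuals” assumption, the sketch below fits a least-squares line to hypothetical data (numpy assumed) and confirms that the residuals of a fit with an intercept average to zero:

import numpy as np

# Hypothetical data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.9, 4.3, 5.8, 8.2, 9.7])

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()
residuals = Y - (a + b * X)

# For a least-squares fit with an intercept, the residuals always average to zero
print(np.isclose(residuals.mean(), 0.0))  # True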

REGRESSION LINE

Definition: The Regression Line is the line that best fits the data, such that the overall distance from the line to the
points (variable values) plotted on a graph is the smallest. In other words, a line used to minimize the squared
deviations of predictions is called the regression line.

There are as many regression lines as there are variables. Suppose we take two variables, say X and Y; then
there will be two regression lines:

Regression line of Y on X: This gives the most probable values of Y from the given values of X.
Regression line of X on Y: This gives the most probable values of X from the given values of Y.

The algebraic expressions of these regression lines are called Regression Equations. There will be two regression
equations for the two regression lines.

The correlation between the variables depends on the distance between these two regression lines: the nearer
the regression lines are to each other, the higher the degree of correlation, and the farther apart they are, the
lower the degree of correlation.

The correlation is said to be either perfect positive or perfect negative when the two regression lines coincide, i.e.
only one line exists. If the variables are independent, the correlation will be zero, and the lines of
regression will be at right angles, i.e. parallel to the X axis and the Y axis.

Note: The regression lines cut each other at the point of the averages of X and Y. This means that if, from the point
where the lines intersect, a perpendicular is dropped to the X axis, we get the mean value of X. Similarly, if a
horizontal line is drawn to the Y axis, we get the mean value of Y.
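A brief Python check of the note above, on hypothetical data (numpy assumed): the regression line of Y on X and the regression line of X on Y both pass through the point of means.

import numpy as np

# Hypothetical data
X = np.array([2.0, 4.0, 6.0, 8.0])
Y = np.array([3.0, 7.0, 5.0, 9.0])

# Regression line of Y on X: Y = a1 + byx * X
byx = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a1 = Y.mean() - byx * X.mean()

# Regression line of X on Y: X = a2 + bxy * Y
bxy = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((Y - Y.mean()) ** 2)
a2 = X.mean() - bxy * Y.mean()

# Both lines pass through (mean of X, mean of Y)
print(np.isclose(a1 + byx * X.mean(), Y.mean()))  # True
print(np.isclose(a2 + bxy * Y.mean(), X.mean()))  # True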


QT/U2 Topic 5 Ordinary Least Squares Method of Regression


Ordinary least squares (OLS) regression is a statistical method of analysis that estimates the relationship
between one or more independent variables and a dependent variable; the method estimates the relationship by
minimizing the sum of the squares in the difference between the observed and predicted values of the dependent
variable configured as a straight line. In this entry, OLS regression will be discussed in the context of a bivariate
model, that is, a model in which there is only one independent variable (X) predicting a dependent variable (Y).
However, the logic of OLS regression is easily extended to the multivariate model in which there are two or more
independent variables.

Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a
linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the
principle of least squares: minimizing the sum of the squares of the differences between the observed dependent
variable (values of the variable being predicted) in the given dataset and those predicted by the linear function.

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable,
between each data point in the set and the corresponding point on the regression surface – the smaller the
differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula,
especially in the case of a simple linear regression, in which there is a single regressor on the right side of the
regression equation.

The OLS estimator is consistent when the regressors are exogenous, and optimal in the class of linear unbiased
estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions,
the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances.
Under the additional assumption that the errors are normally distributed, OLS is the maximum likelihood
estimator.
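A compact Python sketch of the “simple formula” mentioned above, in matrix form beta = (X'X)^(-1) X'y, on hypothetical data (numpy assumed):

import numpy as np

# Hypothetical data: one regressor plus an intercept column
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]

# Closed-form OLS estimate; solving the normal equations is more stable than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta_hat

# The sum of squared residuals that OLS minimizes
ssr = np.sum((y - X @ beta_hat) ** 2)
print(intercept, slope)  # approximately 0.06 and 2.0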

OLS is used in fields as diverse as economics (econometrics), data science, political science, psychology and
engineering (control theory and signal processing).


QT/U2 Topic 6 Pitfalls and Limitations Associated With Regression and Correlation Analysis
Uses Associated With Regression and Correlation Analysis

Regression analysis as a statistical tool has a number of uses, or utilities, for which it is widely used in various
fields relating to almost all the natural, physical and social sciences. The specific uses of the technique may be
outlined as under:

It provides a functional relationship between two or more related variables with the help of which we can
easily estimate or predict the unknown values of one variable from the known values of another variable.
It provides a measure of the errors of estimates made through the regression line. A small scatter of the observed
(actual) values around the relevant regression line indicates good estimates of the values of a variable and a
low degree of error involved therein. On the other hand, a great deal of scatter of the observed values
around the relevant regression line indicates inaccurate estimates of the values of a variable and a high degree
of error involved therein.
It provides a measure of the coefficient of correlation between the two variables, which can be calculated by
taking the square root of the product of the two regression coefficients, i.e. r = √(bxy · byx).
It provides a measure of the coefficient of determination, which speaks of the effect of the independent
variable (explanatory, or regressing, variable) on the dependent variable (explained, or regressed, variable),
and which in turn gives us an idea about the predictive value of the regression analysis. This coefficient of
determination is computed by taking the product of the two regression coefficients, i.e. r² = bxy · byx. The
greater the value of the coefficient of determination (r²), the better the fit, and the more useful the
regression equations are as estimating devices.
It provides a formidable tool of statistical analysis in the field of business and commerce, where people are
interested in predicting future events, viz. consumption, production, investment, prices, sales, profits,
etc., and the success of business people depends very much on the degree of accuracy of their various estimates.
It provides a valuable tool for measuring and estimating the cause and effect relationship among the
economic variables that constitute the essence of economic theory and economic life. It is highly used in the
estimation of Demand curves, Supply curves, Production functions, Cost functions, Consumption functions
etc. In fact, economists have propounded many types of production function by fitting regression lines to the
input and output data.
This technique is also widely used in day-to-day life and in sociological studies to estimate various
factors, viz. birth rate, death rate, tax rate, yield rate, etc.
Last but not least, the regression analysis technique gives us an idea about the relative variation of a
series.

Limitations Associated With Regression and Correlation Analysis

Despite the above utilities and usefulness, the technique of regression analysis suffers from the following serious
limitations:

It is assumed that the cause and effect relationship between the variables remains unchanged. This
assumption may not always hold good and hence estimation of the values of a variable made on the basis of
the regression equation may lead to erroneous and misleading results.
The functional relationship that is established between any two or more variables on the basis of some
limited data may not hold good if more and more data are taken into consideration. For example, in the case
of the Law of Returns, the law of diminishing returns may come into play if too many inputs are used with a
view to increasing the volume of output.
It involves a very lengthy and complicated procedure of calculations and analysis.
It cannot be used in the case of a qualitative phenomenon, viz. honesty, crime, etc.
