Lecture 10:
Correlation and Regression
Faculty of Economics and Business
Universitas Gadjah Mada
2020/21
Learning Objectives
Correlation
Regression
Correlation and Regression
Basic Concept
Correlation
Regression
Background
[Diagram: two random variables (Variable 1 and Variable 2) → relationship between two variables → graphical representation: scatter diagram]
Scatter Diagram
[Diagram: two random variables (Variable 1 and Variable 2) → relationship between two variables → graphical representation (scatter diagram) and statistical measures (covariance, correlation, regression)]
Correlation
We develop numerical measures to express the relationship between two variables:
▪ Is the relationship strong or weak?
▪ Is it direct or inverse?
Regression
We develop an equation to express the relationship between variables.
This will allow us to estimate one variable on the basis of another.
Relationship Between Variables:
Correlation
[Figure: scatter diagram of Variable Y against Variable X]
Using the sample statistics of the two variables (X and Y), we can calculate the
correlation coefficient using the following formula:
$$r = \mathrm{Corr}(Y, X) = \frac{\mathrm{Cov}(Y, X)}{s_Y s_X}$$
Relationship Between Variables:
Outcome of Correlation
The correlation
coefficient indicates the
direction and closeness
(the strength or degree)
of the linear association
between two variables.
Relationship Between Variables:
Regression
Using the sample statistics of the two variables (X and Y), we make extensive use
of relationships between variables, which can be expressed mathematically as
$$Y = f(X)$$
Relationship Between Variables:
Regression
Using the sample statistics of the two variables (X and Y), we make extensive use of
relationships between variables, which can also be expressed as a linear model
$$Y = \beta_0 + \beta_1 X + \epsilon$$
Relationship Between Two Variables
[Diagram: two random variables (Variable 1 and Variable 2) → relationship between two variables → graphical representation (scatter diagram) and statistical measures: covariance; correlation (correlation coefficient); regression (regression equation)]
Correlation and Regression
Basic Concept
Correlation
Regression
Correlation Analysis
CORRELATION ANALYSIS
A group of techniques to measure the
direction and strength of the
relationship between two variables.
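Correlation Analysis
For population data (N is the population size):
$$\rho = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N \sigma_x \sigma_y}$$
For sample data:
$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1) s_x s_y}$$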
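Correlation Analysis
$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1) s_x s_y}$$
▪ The sum of cross-products in the numerator, divided by n − 1, is the covariance. It represents the average level of observed joint variation.
▪ The product of the standard deviations, $s_x s_y$, represents the maximum possible joint variation between X and Y.
$$r = \frac{\mathrm{Covariance}}{s_x s_y}, \qquad \mathrm{Covariance} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$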
▪ If there is absolutely no linear relationship between the two variables, r is zero.
▪ As a rule of thumb, if the absolute value of r is below 0.5, the relationship is considered weak.
▪ If the absolute value of r is above 0.5, the relationship is considered strong.
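As a minimal sketch of the sample formula above, the following Python computes the covariance and r from raw paired data. The function name `sample_correlation` and the five data points are made up for illustration; they are not from the lecture.

```python
import math

def sample_correlation(x, y):
    """Return (covariance, r) for paired samples x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products of deviations from the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    cov = sxy / (n - 1)
    return cov, cov / (s_x * s_y)

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
cov_xy, r = sample_correlation(x, y)
print(f"cov = {cov_xy:.3f}, r = {r:.3f}")  # cov = 1.500, r = 0.775
```

Here r is positive and above 0.5, so under the rule of thumb the relationship would be read as direct and strong.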
Characteristics of the Correlation Coefficient
Now we find the deviations from the mean number of sales calls
and the mean number of copiers sold, then multiply them.
The sum of their products is 6,672 and will be used in the
correlation coefficient formula to find r.
We also need the standard deviations.
$$r = \frac{6672}{(15 - 1)(42.76)(12.89)} = 0.865$$
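The arithmetic can be checked with a few lines of Python. This sketch uses only the summary figures quoted in the example; the variable names are illustrative.

```python
# Summary values quoted in the copier sales example.
n = 15                     # salespeople sampled
sum_cross_products = 6672  # sum of (X - mean X)(Y - mean Y)
s_x = 42.76                # standard deviation of sales calls
s_y = 12.89                # standard deviation of copiers sold

r = sum_cross_products / ((n - 1) * s_x * s_y)
print(round(r, 3))  # 0.865
```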
Testing the Significance of r
▪ Recall that the sales manager from North American Copier Sales found
an r of 0.865.
▪ Could the result be due to sampling error? Remember that only 15
salespeople were sampled.
▪ We ask the question: could there be zero correlation in the population
from which the sample was selected?
▪ We'll let "ρ" represent the correlation in the population and conduct a
hypothesis test to find out.
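The slides do not show the test statistic for this test. A minimal sketch, assuming the standard t test of H0: ρ = 0 with n − 2 degrees of freedom and a 0.05 significance level (both assumptions, not stated in the lecture), might look like this:

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = t_for_correlation(0.865, 15)

# Assumed critical value: two-tailed, alpha = 0.05, df = 13,
# read from a Student's t table.
critical = 2.160
print(f"t = {t:.3f}; reject H0: {abs(t) > critical}")
```

With r = 0.865 and n = 15 the statistic is far beyond the critical value, so sampling error alone is an unlikely explanation for the observed correlation.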
Correlation
Regression
Definition of Regression Analysis
Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖
𝛽0 and 𝛽1 are referred to as the parameters of the model, and 𝜖 is a random variable
referred to as the error term.
The error term accounts for the variability in 𝑌 that cannot be explained by the linear
relationship between 𝑌 and 𝑋.
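To make the role of the error term concrete, here is a small simulation of the model. The parameter values, the X values, and the choice of normally distributed errors are all assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

beta0, beta1 = 60.0, 5.0   # hypothetical parameter values
sigma = 10.0               # hypothetical std. dev. of the error term

xs = [2, 6, 8, 12, 16, 20]  # hypothetical X values
# Assumption: errors are drawn from a normal distribution with mean 0.
errors = [random.gauss(0, sigma) for _ in xs]
ys = [beta0 + beta1 * x + e for x, e in zip(xs, errors)]

for x, e, y in zip(xs, errors, ys):
    line = beta0 + beta1 * x
    print(f"x = {x:2d}  line = {line:6.1f}  error = {e:7.2f}  y = {y:7.2f}")
```

Each observed y differs from the value on the line by exactly its error term, which is the part of Y the linear relationship does not explain.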
▪ It estimates one variable (Y) based on another variable (X).
▪ The variable being estimated (Y) is the dependent variable.
▪ The variable used to make the estimate or predict the value (X) is the independent variable.
Least Squares Principle
[Figure: three charts (Chart 1, Chart 2, Chart 3), each plotting Y against X with a different line drawn]
Compared to the other charts, Chart 1 represents the best-fitting line because
its sum of squares is 24.
Charts 2 and 3 were drawn differently, and their sums of squares are 44 and 132
respectively; they are not the best-fitting lines.
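Least Squares Estimators
The least squares line is the one that minimizes the sum of squared deviations $\sum (y_i - \hat{y}_i)^2$. Its slope and intercept are
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$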
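The least squares principle can be illustrated by comparing the sum of squared deviations of the fitted line with that of a few other candidate lines. The data and the two alternative lines below are hypothetical; they are not the values behind Charts 1 to 3.

```python
def sse(x, y, b0, b1):
    """Sum of squared deviations of y from the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x, y)
sse_ls = sse(x, y, b0, b1)
print(f"least squares line: b0 = {b0:.2f}, b1 = {b1:.2f}, SSE = {sse_ls:.2f}")

# Two arbitrary alternative lines (intercept, slope) for comparison.
candidates = [(2.0, 0.5), (3.0, 0.4)]
other_sses = [sse(x, y, c0, c1) for c0, c1 in candidates]
for (c0, c1), value in zip(candidates, other_sses):
    print(f"line b0 = {c0}, b1 = {c1}: SSE = {value:.2f}")
```

Any line other than the least squares line gives a larger sum of squares on the same data, which is the sense in which it is the best-fitting line.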
Least Squares Regression Example 1
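The result:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{2840}{568} = 5$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 130 - 5(14) = 60$$
The regression line:
$$\hat{Y} = 60 + 5X$$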
Least Squares Regression Example 1
$$\hat{Y} = 60 + 5X$$
▪ The graph of this equation is drawn on the scatter diagram.
▪ The slope of the estimated regression equation is positive ($\hat{\beta}_1 = 5$), implying that as student population increases, sales increase.
▪ The intercept is $\hat{\beta}_0 = 60$.
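Example 1 can be reproduced from its summary sums alone. In this sketch the variable names and the helper `predict` are illustrative, and the X value passed to it is an arbitrary choice rather than a figure from the lecture.

```python
# Summary values quoted in Example 1.
sxy = 2840    # sum of (x - x_bar)(y - y_bar)
sxx = 568     # sum of (x - x_bar)^2
x_bar = 14
y_bar = 130

b1 = sxy / sxx            # slope
b0 = y_bar - b1 * x_bar   # intercept

def predict(x):
    """Estimated Y for a given X on the fitted line."""
    return b0 + b1 * x

print(b1, b0, predict(16))  # 5.0 60.0 140.0
```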
Least Squares Regression Example 2
Use the least squares method to determine a linear equation to express the
relationship between the two variables.
1. The first step is to find the slope of the least squares regression line, $\hat{\beta}_1$:
$$\hat{\beta}_1 = r \frac{s_y}{s_x} = 0.865 \left( \frac{12.89}{42.76} \right) = 0.2608$$
▪ The $\hat{\beta}_1$ value of 0.2608 indicates that for each additional sales call, the sales representative can
expect to increase the number of copiers sold by about 0.2608.
▪ So 20 additional sales calls in a month will result in about five more copiers being sold.
2. The second step is to find $\hat{\beta}_0$:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 45 - 0.2608(96) = 19.963$$
3. Then determine the regression line:
$$\hat{Y} = 19.963 + 0.2608X$$
So if a salesperson makes 100 calls, he or she can expect to sell
46.0432 copiers
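The three steps of Example 2 can be sketched in Python from the summary statistics quoted above. The slope is rounded to four decimals before it is reused, which is an assumption made here to match the rounding on the slides.

```python
# Summary statistics quoted in the copier sales example.
r = 0.865
s_x = 42.76   # standard deviation of sales calls (X)
s_y = 12.89   # standard deviation of copiers sold (Y)
x_bar = 96    # mean number of sales calls
y_bar = 45    # mean number of copiers sold

# Step 1: slope, rounded as on the slides.
b1 = round(r * s_y / s_x, 4)
# Step 2: intercept.
b0 = y_bar - b1 * x_bar

# Step 3: the regression line, used for prediction.
def predict(calls):
    """Estimated copiers sold for a given number of sales calls."""
    return b0 + b1 * calls

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}, predict(100) = {predict(100):.4f}")
```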
▪ The next step is to conduct a test of hypothesis to see if the slope of the
regression line is different from zero.
▪ If we find that the slope is different from zero, then we can conclude that
using the regression equation adds to our ability to predict the dependent
variable based on the independent variable.
▪ The test is equivalent to the test for the correlation coefficient. We use $\beta$
to represent the population slope.
Regression Equation Slope Test
Recall the North American Copier Sales example. We identified the slope
as $\hat{\beta}_1$, and it is our estimate of the slope of the population, $\beta$.
We conduct a hypothesis test.
▪ Step 1: State the null and alternate hypotheses
𝐻0 : 𝛽 = 0 The slope of the population is zero
𝐻1 : 𝛽 ≠ 0 The slope of the population is different from zero
▪ Step 2: Select the test statistic, t
▪ Step 3: Formulate the decision rule. Reject $H_0$ if t > 1.771, the one-tailed critical value at the 0.05 level
from the Student's t table with n − 2 = 13 degrees of freedom. (For the two-sided alternative stated in Step 1, the corresponding critical values are ±2.160.)
▪ Step 4: Make the decision. Since t = 6.205 exceeds the critical value, reject $H_0$.
$$t = \frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)} = \frac{0.2606 - 0}{0.042} = 6.205$$
Here $se(\hat{\beta}_1)$ is the standard error of the slope estimate, obtained by
$$se(\hat{\beta}_1) = \frac{s}{\sqrt{\sum (X_i - \bar{X})^2}}$$
where $s$ is the standard error of estimate, $s_{y,x}$.
▪ Step 5: Interpret, the number of sales calls is useful in estimating copier
sales
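The slope test can be reproduced as a short sketch. It takes the standard error of estimate (6.720) from the later slide on that topic, and it relies on the identity that the sum of squared deviations of X equals (n − 1) times the sample variance of X. The variable names are illustrative.

```python
import math

# Figures quoted in the copier sales example.
n = 15
b1 = 0.2606     # estimated slope used in the test
s_x = 42.76     # standard deviation of sales calls
s_yx = 6.720    # standard error of estimate

# Sum of squared deviations of X, recovered from the sample variance.
sxx = (n - 1) * s_x ** 2

se_b1 = s_yx / math.sqrt(sxx)   # standard error of the slope
t = (b1 - 0) / se_b1            # test statistic under H0: beta = 0

critical = 1.771  # critical value quoted on the slide (df = 13)
print(f"se = {se_b1:.3f}, t = {t:.3f}, reject H0: {t > critical}")
```

The result agrees with the slide's t = 6.205 up to rounding of the standard error.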
Evaluating a Regression Equation’s
Ability to Predict
▪ We need a measure that will tell how inaccurate the estimate might be.
▪ The measures we'll use are:
1. The standard error of estimate, $s_{y,x}$.
2. The coefficient of determination, $r^2$.
The Standard Error of Estimate
$$s_{y,x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{587.1108}{15 - 2}} = 6.720$$
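As a quick check of this calculation, the sketch below computes the standard error of estimate from the sum of squared residuals. The function name is illustrative.

```python
import math

def standard_error_of_estimate(sse, n):
    """Square root of the sum of squared residuals divided by n - 2."""
    return math.sqrt(sse / (n - 2))

# Figures quoted on the slide.
s_yx = standard_error_of_estimate(587.1108, 15)
print(round(s_yx, 3))  # 6.72
```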
The Coefficient of Determination
▪ Recall the standard error of estimate (𝑠𝑦,𝑥 ) measures how close the actual values are to
the regression line
When it is small, the two variables are closely related
▪ The correlation coefficient (𝑟) measures the strength of the linear association between
two variables
When points on the scatter diagram are close to the line, the correlation coefficient tends to be
large
▪ The coefficient of determination ($r^2$) is the correlation coefficient squared.
▪ Therefore, the correlation coefficient ($r$) and the coefficient of determination ($r^2$) have an
inverse relationship with the standard error of estimate ($s_{y,x}$): as $s_{y,x}$ gets smaller, $r^2$ gets larger.
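The inverse relationship can be seen numerically. This sketch uses two standard identities for simple linear regression that are not shown on the slides: the total sum of squares equals (n − 1) times the sample variance of Y, and r² equals 1 minus the ratio of the residual sum of squares to the total sum of squares. The trial r² values in the loop are arbitrary.

```python
import math

# Figures quoted in the copier sales example.
n = 15
r = 0.865
s_y = 12.89        # standard deviation of copiers sold
sse = 587.1108     # sum of squared residuals

r_squared = r ** 2
sst = (n - 1) * s_y ** 2            # total sum of squares
r_squared_from_sse = 1 - sse / sst  # same quantity via the residuals

print(f"r^2 = {r_squared:.3f}, from SSE = {r_squared_from_sse:.3f}")

def std_error_for(r2):
    """Standard error of estimate implied by a given r^2 on this data."""
    return math.sqrt((1 - r2) * sst / (n - 2))

# Arbitrary trial values: a larger r^2 implies a smaller s_{y,x}.
trial_values = [0.25, 0.50, 0.75, 0.95]
errors = [std_error_for(r2) for r2 in trial_values]
for r2, s in zip(trial_values, errors):
    print(f"r^2 = {r2:.2f}  ->  s_yx = {s:.2f}")
```

The two routes to r² agree up to the rounding of the quoted statistics.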
THANK YOU