
Statistics for Economics 1

Lecture 10:
Correlation and Regression
Faculty of Economics and Business
Universitas Gadjah Mada

2020/21
Learning Objectives

▪ LO10-1: Explain the purpose of correlation analysis


▪ LO10-2: Calculate a correlation coefficient to test and interpret the
relationship between two variables
▪ LO10-3: Apply regression analysis to estimate the linear relationship
between two variables
▪ LO10-4: Evaluate the significance of the slope of the regression
equation
▪ LO10-5: Evaluate a regression equation’s ability to predict using the
standard error of the estimate and the coefficient of determination
Correlation and Regression
Basic Concept

Correlation

Regression
Background

▪ Our study to this point has focused on analysis and inference related to a single variable.
▪ In this topic we extend our analysis to relationships between variables.

In all business fields, identifying and studying relationships between variables can provide information on ways to increase profits, methods to decrease costs, or variables to predict demand.
Background

Examples of relationships between two variables are:
▪ In a study of fuel efficiency, is there a relationship between miles per gallon and the weight of a car?
▪ Does the number of hours that students study for an exam influence the exam score?
▪ Does the amount a company spends per month on training its sales force affect its monthly sales?
Relationship Between Two Variables

Graphical representation: the relationship between two random variables (Variable 1 and Variable 2) is portrayed with a scatter diagram.
Scatter Diagram

▪ A scatter diagram is a graphic tool used to portray the relationship between two variables.
▪ Example: A sales manager wants to know if there is a relationship between the number of sales calls made in a month and the number of copiers sold that month, and begins the analysis with a random sample of 15 sales representatives.
With these data, the number of sales calls is the independent variable and the number of copiers sold is the dependent variable.
Scatter Diagram: Example

Graphing the data in a scatter diagram will make the relationship between sales calls and copier sales easier to see.
Scatter Diagram: Example

▪ The independent variable is scaled on the X-axis and is the variable used as the predictor.
▪ The dependent variable is scaled on the Y-axis and is the variable being estimated.
Scatter Diagram: Example

▪ It is perfectly reasonable for the manager to tell the salespeople that the more sales calls they make, the more copiers they can expect to sell.
▪ Note that while there does seem to be a positive relationship between the two variables, not all the points fall on a line.
Relationship Between Two Variables

The relationship between two random variables (Variable 1 and Variable 2) can be examined in two ways:
▪ Graphical representation: the scatter diagram.
▪ Statistical measures:
▪ Correlation (covariance and the correlation coefficient): numerical measures that express the relationship between two variables. Is the relationship strong or weak? Is it direct or inverse?
▪ Regression: an equation that expresses the relationship between the variables. This will allow us to estimate one variable on the basis of another.
Relationship Between Variables:
Correlation

For each variable we move from the population to a sample: the population parameters of the random variables X and Y (e.g., θ₁ = σ²) are estimated by the corresponding sample statistics (e.g., θ̂₁ = s²).

Using the sample statistics of the two variables (X and Y), we can calculate the correlation coefficient using the following formula:

r = Corr(Y, X) = Cov(Y, X) / (s_Y s_X)
Relationship Between Variables:
Outcome of Correlation

The correlation
coefficient indicates the
direction and closeness
(the strength or degree)
of the linear association
between two variables.
Relationship Between Variables:
Regression

As before, the population parameters of the random variables X and Y are estimated by the corresponding sample statistics.

Using the sample statistics of the two variables (X and Y), we make extensive use of relationships between variables, which can be expressed mathematically as

Y = f(X)
Relationship Between Variables:
Regression

Using the sample statistics of the two variables (X and Y), we make extensive use of relationships between variables, which can also be expressed as a linear model

Y = β₀ + β₁X + ε
Relationship Between Two Variables

Summary: the relationship between two random variables (Variable 1 and Variable 2) can be examined graphically with a scatter diagram and numerically with statistical measures: correlation (the covariance and the correlation coefficient) and regression (the regression equation).
Correlation and Regression
Basic Concept

Correlation

Regression
Correlation Analysis

Using correlation analysis we can measure the relationship between two variables.

CORRELATION ANALYSIS
A group of techniques to measure the direction and strength of the relationship between two variables.
Correlation Analysis

The formula to calculate the CORRELATION COEFFICIENT:

For population data:  ρ = Σ(Xᵢ − μ_X)(Yᵢ − μ_Y) / (N σ_x σ_y)

For sample data:  r = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / ((n − 1) s_x s_y)
Correlation Analysis

The formula to calculate the CORRELATION COEFFICIENT:

r = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / ((n − 1) s_x s_y)

The numerator, Σ(Xᵢ − X̄)(Yᵢ − Ȳ), represents the observed joint variation between X and Y; the denominator, (n − 1) s_x s_y, represents the maximum possible joint variation between X and Y. Equivalently, since the sample covariance is Cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1),

r = Cov(X, Y) / (s_x s_y)

CORRELATION COEFFICIENT (𝐫)


▪ The sample correlation coefficient is identified as r
▪ It shows the direction and strength of the linear relationship between
two interval- or ratio-scale variables
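
As an illustration of the sample formula above, here is a minimal Python sketch; the small x and y lists are made-up toy data (hours studied and exam scores), not the copier-sales data used later in this lecture.

```python
import math

def correlation(x, y):
    """Sample correlation: r = sum((xi - xbar)(yi - ybar)) / ((n - 1) * sx * sy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Observed joint variation (the numerator of r)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical toy data: hours studied and exam scores
hours = [2, 4, 5, 7, 9]
scores = [55, 60, 68, 74, 85]
print(round(correlation(hours, scores), 3))  # about 0.99: a strong, positive relationship
```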
Characteristics of the Correlation Coefficient

Characteristics of the correlation coefficient are:
▪ It ranges from −1.00 to 1.00
▪ A value near 1.00 indicates a direct or positive correlation
▪ A value near −1.00 indicates a negative correlation
Characteristics of the Correlation Coefficient

▪ If there is absolutely no relationship between the two variables, r is zero.
▪ If |r| is below 0.5, the relationship is considered weak.
▪ If |r| is above 0.5, the relationship is considered strong.
Characteristics of the Correlation Coefficient

▪ If the correlation is weak, there is considerable scatter about a line drawn through the center of the data.
▪ If the correlation is strong, there is very little scatter about the line.
Correlation Coefficient: An Example

How is the correlation coefficient determined? We’ll use the North American Copier Sales data as an example.
We begin with a scatter diagram, but this time we’ll draw a vertical line at the mean of the x-values (96 sales calls) and a horizontal line at the mean of the y-values (45 copiers).
Correlation Coefficient: An Example

How is the correlation coefficient determined? Continuing with the North American Copier Sales example:
▪ We find the deviations from the mean number of sales calls and the mean number of copiers sold, then multiply them.
▪ The sum of their products is 6,672 and will be used in the correlation coefficient formula to find r.
▪ We also need the standard deviations.

r = 6672 / ((15 − 1)(42.76)(12.89)) = 0.865

The result, r = 0.865, indicates a strong, positive relationship.


Testing the Significance of r

▪ Recall that the sales manager from North American Copier Sales found an r of 0.865.
▪ Could the result be due to sampling error? Remember, only 15 salespeople were sampled.
▪ We ask the question: could there be zero correlation in the population from which the sample was selected?
▪ We’ll let ρ represent the correlation in the population and conduct a hypothesis test to find out.
Testing the Significance of r

▪ Step 1: State the null and the alternate hypotheses:
H₀: ρ = 0 (the correlation in the population is zero)
H₁: ρ ≠ 0 (the correlation in the population is different from zero)
▪ Step 2: Select the test statistic; we use t.
▪ Step 3: Formulate the decision rule based on the selected level of significance. Suppose we use 0.05; with n − 2 = 13 degrees of freedom we reject H₀ if t < −2.160 or t > 2.160.
Testing the Significance of r

▪ Step 4: Make a decision: reject H₀, since t = 6.216.
▪ Step 5: Interpret: there is correlation between the number of sales calls made and the number of copiers sold in the population of salespeople.
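
The slide reports t = 6.216 without showing the computation. Assuming the standard test statistic for a correlation coefficient, t = r√(n − 2)/√(1 − r²), a short Python check reproduces it up to rounding:

```python
import math

r, n = 0.865, 15  # sample correlation and sample size from the copier example
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # t statistic for H0: rho = 0
print(round(t, 3))  # about 6.2, far beyond the critical value of 2.160 (df = 13)
```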
Correlation and Regression
Basic Concept

Correlation

Regression
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)

▪ The relationship between the variables is linear


▪ Both the independent and the dependent
variables must be interval or ratio scale
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖
𝛽0 and 𝛽1 are referred to as the parameters of the model, and 𝜖 is a random variable
referred to as the error term.
The error term accounts for the variability in 𝑌 that cannot be explained by the linear
relationship between 𝑌 and 𝑋.
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖

ε is the disturbance term (unobserved); it reflects factors other than X that affect Y, such as:
▪ Unavailability of data
▪ Intrinsic randomness in human behavior
▪ Poor proxy variables
▪ Vagueness of theory
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
▪ It estimates one variable based on another variable:
The variable being estimated (Y) is the dependent variable.
The variable used to make the estimate or to predict the value (X) is the independent variable.
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
▪ It estimates one variable (Y) based on another variable (X).
We are interested in “explaining Y in terms of X,” or in “studying how Y varies with changes in X.”
The Estimation Process in Simple Linear
Regression

▪ The estimation of β₀ and β₁ is a statistical process much like the estimation of μ discussed in week 8.
▪ β₀ and β₁ are the unknown parameters of interest, and β̂₀ and β̂₁ are the sample statistics used to estimate the parameters.
Relationship Between Variables:
Outcomes of Regression Analysis
Linear regression provides two important results:
▪ Predicted values, Ŷ, of the dependent, or endogenous, variable as a function of the independent, or exogenous, variable.
▪ The estimated marginal change in the endogenous variable, β̂₁, that results from a one-unit change in the independent, or exogenous, variable.
Regression Analysis

In regression analysis, our


objective is to use the data to
position a line that best
represents the relationship
between two variables
Regression Analysis: Least Squares

▪ The first approach is to use a scatter diagram to visually position the line.
▪ The lines drawn in the chart on the right represent different judgments.
▪ We would prefer a method that results in a single, best regression line.
▪ The method that results in a single, best regression line is called the least squares principle.
Least Squares Principle

A mathematical procedure that uses the data to position a line with


the objective of minimizing the sum of the squares of the vertical
distances between the actual y values and the predicted values of y.
Least Squares Principle

The prediction line that we want to fit:

Ŷᵢ = β̂₀ + β̂₁Xᵢ

▪ Ŷ is the estimated value of Y for a selected value of X
▪ β̂₀ is the constant or intercept
▪ β̂₁ is the slope of the fitted line
▪ X is the value of the independent variable

(In the chart, e₁, e₂, e₃, e₄ are the vertical distances, or residuals, between the observed points at x₁, …, x₄ and the fitted line.)
Least Squares Principle

Three candidate lines (Chart 1, Chart 2, Chart 3) are drawn through the same data:
▪ Compared to the other charts, Chart 1 represents the best-fitting line because its sum of squares, 24, is the smallest.
▪ Charts 2 and 3 were drawn differently; their sums of squares are 44 and 132, respectively, so they are not the best-fitting lines.
Least Squares Principle

The method allows us to minimize the total squared distance between the actual values Y and the fitted values Ŷ.
Least Squares Estimators

The least squares estimators are given by the following coefficients:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = covariance(X, Y) / variance(X)

β̂₀ = ȳ − β̂₁x̄
Least Squares Estimators

The least squares estimators are given by the following coefficients:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = r (s_y / s_x)

β̂₀ = ȳ − β̂₁x̄
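
As a sketch of how these estimators work in practice, the following Python function implements the formulas above on small made-up (x, y) data; the numbers are hypothetical, not from the examples that follow.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared vertical distances to the line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: covariance of (x, y) over variance of x (the 1/(n-1) factors cancel)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    # Intercept: the fitted line passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, e.g. advertising spend (x) and sales (y)
b0, b1 = least_squares([1, 2, 3, 4, 5], [3, 5, 7, 10, 11])
print(round(b0, 2), round(b1, 2))  # fitted line: y_hat = 0.9 + 2.1 * x
```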
Least Squares Regression Example 1

▪ Suppose data were collected from a sample of 10 Armand’s Pizza Parlor restaurants located near college campuses.
▪ For each restaurant in the sample, x is the size of the student population (in thousands) and y is the quarterly sales (in thousands of dollars).
Least Squares Regression Example 1

▪ The scatter diagram enables us to observe the data graphically and to draw preliminary conclusions about the possible relationship between the variables.
▪ Student population is shown on the horizontal axis and quarterly sales is shown on the vertical axis.
▪ Quarterly sales appear to be higher at campuses with larger student populations.
Least Squares Regression Example 1

The result:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 2840 / 568 = 5

β̂₀ = ȳ − β̂₁x̄ = 130 − 5(14) = 60

The regression line:

Ŷ = 60 + 5X
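
The arithmetic can be checked directly from the summary quantities shown on the slide (the individual restaurant observations are not reproduced here), for example with a few lines of Python:

```python
# Summary quantities from the Armand's Pizza example
sum_cross = 2840          # sum of (x_i - x_bar)(y_i - y_bar)
sum_sq_x = 568            # sum of (x_i - x_bar)^2
x_bar, y_bar = 14, 130    # means of student population and quarterly sales

b1 = sum_cross / sum_sq_x      # slope: 5.0
b0 = y_bar - b1 * x_bar        # intercept: 60.0
print(f"y_hat = {b0} + {b1} * x")  # y_hat = 60.0 + 5.0 * x
```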
Least Squares Regression Example 1

Ŷ = 60 + 5X

▪ The graph of this equation is drawn on the scatter diagram, with intercept β̂₀ = 60 and slope β̂₁ = 5.
▪ The slope of the estimated regression equation is positive, implying that as the student population increases, sales increase.
Least Squares Regression Example 2

▪ Suppose we have the case of North American Copier Sales: the sales manager gathered information on the number of sales calls made (X) and the number of copiers sold (Y).
▪ Based on the data, it is calculated that:
▪ The standard deviations of x (s_x) and y (s_y) are 42.76 and 12.89, respectively
▪ The coefficient of correlation (r) is 0.865
Least Squares Regression Example 2

Use the least squares method to determine a linear equation to express the relationship between the two variables.
1. The first step is to find the slope of the least squares regression line, β̂₁:

β̂₁ = r (s_y / s_x) = 0.865 (12.89 / 42.76) = 0.2608

▪ The β̂₁ value of 0.2608 indicates that for each additional sales call, the sales representative can expect to increase the number of copiers sold by about 0.2608.
▪ So 20 additional sales calls in a month will result in about five more copiers being sold.
Least Squares Regression Example 2

Use the least squares method to determine a linear equation to express the relationship between the two variables.
2. The second step is to find β̂₀:

β̂₀ = ȳ − β̂₁x̄ = 45 − 0.2608(96) = 19.963
Least Squares Regression Example 2

Use the least squares method to determine a linear equation to express the relationship between the two variables.
3. Then determine the regression line:

Ŷ = 19.963 + 0.2608X

So if a salesperson makes 100 calls, he or she can expect to sell about 46 copiers:

Ŷ = 19.963 + 0.2608(100) = 46.0432
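
Putting the three steps together in a short Python sketch, using the summary statistics quoted earlier for this example (r = 0.865, s_x = 42.76, s_y = 12.89, x̄ = 96, ȳ = 45):

```python
r, s_x, s_y = 0.865, 42.76, 12.89   # correlation and standard deviations from the example
x_bar, y_bar = 96, 45               # mean sales calls and mean copiers sold

b1 = r * s_y / s_x          # slope, about 0.2608
b0 = y_bar - b1 * x_bar     # intercept, about 19.96
y_hat_100 = b0 + b1 * 100   # expected copiers sold after 100 sales calls, about 46
print(round(b1, 4), round(b0, 2), round(y_hat_100, 2))
```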
Least Squares Regression Example 2

Estimated sales for all


sales representatives
are calculated using the
formula we determined
earlier and placed in the
table.
Least Squares Regression Example 2

▪ The line of regression is drawn on the scatter diagram.
▪ The regression line will always pass through the means of variables X and Y.
▪ In addition, there is no other line through the data for which the sum of the squared deviations is smaller.
Regression Equation Slope Test

▪ The next step is to conduct a test of hypothesis to see if the slope of the regression line is different from zero.
▪ If we find that the slope is different from zero, then we can conclude that using the regression equation adds to our ability to predict the dependent variable based on the independent variable.
▪ The test is equivalent to the test for the correlation coefficient. We use β to represent the slope in the population.
Regression Equation Slope Test

▪ For a regression equation, the slope is tested for significance


▪ We test the hypothesis that the slope of the line in the population is 0
▪ If we do not reject the null hypothesis, we conclude there is no
relationship between the two variables
Regression Equation Slope Test

Recall the North American Copier Sales example. We identified the slope estimate as β̂₁; it is our estimate of the slope in the population.
We conduct a hypothesis test.
▪ Step 1: State the null and alternate hypotheses:
H₀: β = 0 (the slope of the population is zero)
H₁: β ≠ 0 (the slope of the population is different from zero)
▪ Step 2: Select the test statistic, t
Regression Equation Slope Test

Recall the North American Copier Sales example. We identified the slope estimate as β̂₁; it is our estimate of the slope in the population.
We conduct a hypothesis test.
▪ Step 3: Formulate the decision rule: at the 0.05 level, with n − 2 = 13 degrees of freedom (from the Student’s t table), reject H₀ if t < −2.160 or t > 2.160.
▪ Step 4: Make a decision: reject H₀, since t = 6.205

t = (β̂₁ − β₁) / se(β̂₁) = (0.2606 − 0) / 0.042 = 6.205

where se(β̂₁) is the standard error of the slope estimate, obtained as se(β̂₁) = s / √(Σ(Xᵢ − X̄)²), with s the standard error of the estimate.
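
A quick check of the test statistic, using the slope estimate and standard error reported on the slide:

```python
b1_hat, se_b1 = 0.2606, 0.042   # slope estimate and its standard error from the example
t = (b1_hat - 0) / se_b1        # test statistic for H0: beta = 0
print(round(t, 3))              # about 6.2, so H0 is rejected
```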
Regression Equation Slope Test

Recall the North American Copier Sales example. We identified the slope estimate as β̂₁; it is our estimate of the slope in the population.
We conduct a hypothesis test.
▪ Step 5: Interpret: the number of sales calls is useful in estimating copier sales.
Evaluating a Regression Equation’s
Ability to Predict

▪ Perfect prediction is practically impossible in almost all disciplines, including economics and business.
▪ The North American Copier Sales example showed a significant relationship between sales calls and copier sales; the equation is:
Number of copiers sold = 19.9632 + 0.2608(Number of sales calls)
▪ What if the number of sales calls is 84? We calculate that the number of copiers sold is 41.8704, yet the sample included two employees who made 84 sales calls, and they sold just 30 and 24 copiers.
▪ So, is the regression equation a good predictor?
Evaluating a Regression Equation’s
Ability to Predict

▪ We need a measure that will tell us how inaccurate the estimate might be.
▪ The measures we’ll use are:
1. The standard error of the estimate, s_y,x
2. The coefficient of determination, r²
The Standard Error of Estimate

The standard error of estimate is a measure of the dispersion, or scatter, of


the observed values around the line of regression for a given value of x.
▪ It measures the variation around the regression line
▪ It is in the same units as the dependent variable
▪ It is based on squared deviations from the regression line
The Standard Error of Estimate

The standard error of estimate is computed using the following formula:

s_y,x = √( Σ(Y − Ŷ)² / (n − 2) )
▪ If the standard error of estimate is small, this indicates that the data are
relatively close to the regression line and the regression equation can be
used.
▪ If it is large, the data are widely scattered around the regression line and
the regression equation will not provide a precise estimate of y.
The Standard Error of Estimate

▪ We calculate the standard error of estimate in this example.
▪ We need the sum of the squared differences between each observed value of y and the predicted value ŷ, which is 587.1108.

s_y,x = √( Σ(Y − Ŷ)² / (n − 2) ) = √( 587.1108 / (15 − 2) ) = 6.720
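
A quick check of this calculation in Python, using the sum of squared residuals reported on the slide:

```python
import math

sse, n = 587.1108, 15              # sum of (y - y_hat)^2 and sample size
s_yx = math.sqrt(sse / (n - 2))    # standard error of the estimate
print(round(s_yx, 3))              # about 6.72 copiers
```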
The Coefficient of Determination

▪ The coefficient of determination is the proportion of the total variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X.
▪ The coefficient of determination provides a more interpretable measure of a regression equation’s ability to predict.
▪ It is found from the following formula:

r² = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)²
The Coefficient of Determination

▪ The characteristics of the coefficient of determination:
▪ It ranges from 0 to 1.0
▪ It is the square of the correlation coefficient
▪ In the North American Copier Sales example, the correlation coefficient was 0.865; just square that: (0.865)² = 0.748; this is the coefficient of determination.
▪ This means 74.8% of the variation in the number of copiers sold is explained by the variation in sales calls.
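
Because r² is simply the square of the correlation coefficient, the figure on the slide can be reproduced in one line:

```python
r = 0.865
r_squared = r ** 2
print(round(r_squared, 3))  # 0.748 -> 74.8% of the variation in copiers sold is explained
```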
Relationships among r, r², and s_y,x

▪ Recall that the standard error of estimate (s_y,x) measures how close the actual values are to the regression line. When it is small, the two variables are closely related.
▪ The correlation coefficient (r) measures the strength of the linear association between two variables. When points on the scatter diagram are close to the line, the correlation coefficient tends to be large.
▪ The coefficient of determination is the correlation coefficient squared.
▪ Therefore, the correlation coefficient (r) and the coefficient of determination (r²) have an inverse relationship with the standard error of estimate (s_y,x).
THANK YOU
