
Statistics for Economics 1

Lecture 10:
Correlation and Regression
Faculty of Economics and Business
Universitas Gadjah Mada

2020/21
Learning Objectives

▪ LO10-1: Explain the purpose of correlation analysis


▪ LO10-2: Calculate a correlation coefficient to test and interpret the
relationship between two variables
▪ LO10-3: Apply regression analysis to estimate the linear relationship
between two variables
▪ LO10-4: Evaluate the significance of the slope of the regression
equation
▪ LO10-5: Evaluate a regression equation’s ability to predict using the
standard error of the estimate and the coefficient of determination
Correlation and Regression
Basic Concept

Correlation

Regression
Background

▪ Our study to this point has focused on analysis and inference related to a single variable.
▪ In this topic we extend our analysis to relationships between variables.

In all business fields, identifying and studying relationships between variables can provide information on ways to increase profits, methods to decrease costs, or variables to predict demand.
Background

Examples of relationships between two variables are:
▪ In a study of fuel efficiency, is there a relationship between miles per gallon and the weight of a car?
▪ Does the number of hours that students study for an exam influence the exam score?
▪ Does the amount a company spends per month on training its sales force affect its monthly sales?
Relationship Between Two Variables

Graphical representation: the relationship between two random variables (Variable 1 and Variable 2) is portrayed with a scatter diagram.
Scatter Diagram

▪ A scatter diagram is a graphic tool used to portray the relationship between two variables.
▪ Example: A sales manager wants to know if there is a relationship between the number of sales calls made in a month and the number of copiers sold that month, and begins the analysis with a random sample of 15 sales representatives.
With these data, the number of sales calls is the independent variable and the number of copiers sold is the dependent variable.
Scatter Diagram: Example

Graphing the data in a scatter diagram will make the relationship between sales calls and copier sales easier to see.
Scatter Diagram: Example

▪ The independent variable is scaled on the X-axis and is the variable used as the predictor.
▪ The dependent variable is scaled on the Y-axis and is the variable being estimated.
Scatter Diagram: Example

▪ It is perfectly reasonable for the manager to tell the salespeople that the more sales calls they make, the more copiers they can expect to sell.
▪ Note that while there does seem to be a positive relationship between the two variables, not all the points fall on a line.
Relationship Between Two Variables

The relationship between two random variables (Variable 1 and Variable 2) can be examined in two ways:
▪ Graphical representation: the scatter diagram.
▪ Statistical measures:
▪ Correlation (covariance and the correlation coefficient): numerical measures that express the relationship between two variables. Is the relationship strong or weak? Is it direct or inverse?
▪ Regression: an equation that expresses the relationship between the variables. This will allow us to estimate one variable on the basis of another.
Relationship Between Variables:
Correlation

For each variable we move from the population to a sample: the population parameters of the random variables X and Y (e.g., θ₁ = σ²) are estimated by the corresponding sample statistics (e.g., θ̂₁ = s²).

Using the sample statistics of the two variables (X and Y), we can calculate the correlation coefficient using the following formula:

r = Corr(Y, X) = Cov(Y, X) / (s_Y s_X)
Relationship Between Variables:
Outcome of Correlation

The correlation
coefficient indicates the
direction and closeness
(the strength or degree)
of the linear association
between two variables.
Relationship Between Variables:
Regression

As before, the population parameters of the random variables X and Y are estimated by the corresponding sample statistics.

Using the sample statistics of the two variables (X and Y), we make extensive use of relationships between variables, which can be expressed mathematically as

Y = f(X)
Relationship Between Variables:
Regression

Using the sample statistics of the two variables (X and Y), we make extensive use of relationships between variables, which can also be expressed as a linear model

Y = β₀ + β₁X + ε
Relationship Between Two Variables

Summary: the relationship between two random variables (Variable 1 and Variable 2) can be examined graphically with a scatter diagram and numerically with statistical measures: correlation (the covariance and the correlation coefficient) and regression (the regression equation).
Correlation and Regression
Basic Concept

Correlation

Regression
Correlation Analysis

Using correlation analysis we can measure the relationship between two variables.

CORRELATION ANALYSIS
A group of techniques to measure the direction and strength of the relationship between two variables.
Correlation Analysis

The formula to calculate the CORRELATION COEFFICIENT:

For population data:  ρ = Σ(Xᵢ − μ_X)(Yᵢ − μ_Y) / (N σ_x σ_y)

For sample data:  r = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / ((n − 1) s_x s_y)
Correlation Analysis

The formula to calculate the CORRELATION COEFFICIENT:

r = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / ((n − 1) s_x s_y)

The numerator, Σ(Xᵢ − X̄)(Yᵢ − Ȳ), represents the observed joint variation between X and Y; the denominator, (n − 1) s_x s_y, represents the maximum possible joint variation between X and Y. Equivalently, since the sample covariance is Cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1),

r = Cov(X, Y) / (s_x s_y)

CORRELATION COEFFICIENT (𝐫)


▪ The sample correlation coefficient is identified as r
▪ It shows the direction and strength of the linear relationship between
two interval- or ratio-scale variables
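
As an illustration of the sample formula above, here is a minimal Python sketch; the small x and y lists are made-up toy data (hours studied and exam scores), not the copier-sales data used later in this lecture.

```python
import math

def correlation(x, y):
    """Sample correlation: r = sum((xi - xbar)(yi - ybar)) / ((n - 1) * sx * sy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Observed joint variation (the numerator of r)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical toy data: hours studied and exam scores
hours = [2, 4, 5, 7, 9]
scores = [55, 60, 68, 74, 85]
print(round(correlation(hours, scores), 3))  # about 0.99: a strong, positive relationship
```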
Characteristics of the Correlation Coefficient

Characteristics of the correlation coefficient are:
▪ It ranges from −1.00 to 1.00
▪ A value near 1.00 indicates a direct or positive correlation
▪ A value near −1.00 indicates a negative correlation
Characteristics of the Correlation Coefficient

▪ If there is absolutely no relationship between the two variables, r is zero.
▪ If |r| is below 0.5, the relationship is considered weak.
▪ If |r| is above 0.5, the relationship is considered strong.
Characteristics of the Correlation Coefficient

▪ If the correlation is weak, there is considerable scatter about a line drawn through the center of the data.
▪ If the correlation is strong, there is very little scatter about the line.
Correlation Coefficient: An Example

How is the correlation coefficient determined? We’ll use the North American Copier Sales data as an example.
We begin with a scatter diagram, but this time we’ll draw a vertical line at the mean of the x-values (96 sales calls) and a horizontal line at the mean of the y-values (45 copiers).
Correlation Coefficient: An Example

How is the correlation coefficient determined? Continuing with the North American Copier Sales example:
▪ We find the deviations from the mean number of sales calls and the mean number of copiers sold, then multiply them.
▪ The sum of their products is 6,672 and will be used in the correlation coefficient formula to find r.
▪ We also need the standard deviations.

r = 6672 / ((15 − 1)(42.76)(12.89)) = 0.865

The result, r = 0.865, indicates a strong, positive relationship.


Testing the Significance of r

▪ Recall that the sales manager from North American Copier Sales found an r of 0.865.
▪ Could the result be due to sampling error? Remember, only 15 salespeople were sampled.
▪ We ask the question: could there be zero correlation in the population from which the sample was selected?
▪ We’ll let ρ represent the correlation in the population and conduct a hypothesis test to find out.
Testing the Significance of r

▪ Step 1: State the null and the alternate hypotheses:
H₀: ρ = 0 (the correlation in the population is zero)
H₁: ρ ≠ 0 (the correlation in the population is different from zero)
▪ Step 2: Select the test statistic; we use t.
▪ Step 3: Formulate the decision rule based on the selected level of significance. Suppose we use 0.05; with n − 2 = 13 degrees of freedom we reject H₀ if t < −2.160 or t > 2.160.
Testing the Significance of r

▪ Step 4: Make a decision: reject H₀, since t = 6.216.
▪ Step 5: Interpret: there is correlation between the number of sales calls made and the number of copiers sold in the population of salespeople.
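
The slide reports t = 6.216 without showing the computation. Assuming the standard test statistic for a correlation coefficient, t = r√(n − 2)/√(1 − r²), a short Python check reproduces it up to rounding:

```python
import math

r, n = 0.865, 15  # sample correlation and sample size from the copier example
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # t statistic for H0: rho = 0
print(round(t, 3))  # about 6.2, far beyond the critical value of 2.160 (df = 13)
```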
Correlation and Regression
Basic Concept

Correlation

Regression
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)

▪ The relationship between the variables is linear


▪ Both the independent and the dependent
variables must be interval or ratio scale
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖
𝛽0 and 𝛽1 are referred to as the parameters of the model, and 𝜖 is a random variable
referred to as the error term.
The error term accounts for the variability in 𝑌 that cannot be explained by the linear
relationship between 𝑌 and 𝑋.
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖

ε is the disturbance term (unobserved); it reflects factors other than X that affect Y, such as:
▪ Unavailability of data
▪ Intrinsic randomness in human behavior
▪ Poor proxy variables
▪ Vagueness of theory
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
▪ It estimates one variable based on another variable:
The variable being estimated (Y) is the dependent variable.
The variable used to make the estimate or to predict the value (X) is the independent variable.
Definition of Regression Analysis

Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
▪ It estimates one variable (Y) based on another variable (X).
We are interested in “explaining Y in terms of X,” or in “studying how Y varies with changes in X.”
The Estimation Process in Simple Linear
Regression

▪ The estimation of β₀ and β₁ is a statistical process much like the estimation of μ discussed in week 8.
▪ β₀ and β₁ are the unknown parameters of interest, and β̂₀ and β̂₁ are the sample statistics used to estimate the parameters.
Relationship Between Variables:
Outcomes of Regression Analysis
Linear regression provides two important results:
▪ Predicted values, Ŷ, of the dependent, or endogenous, variable as a function of the independent, or exogenous, variable.
▪ The estimated marginal change in the endogenous variable, β̂₁, that results from a one-unit change in the independent, or exogenous, variable.
Regression Analysis

In regression analysis, our


objective is to use the data to
position a line that best
represents the relationship
between two variables
Regression Analysis: Least Squares

▪ The first approach is to use a scatter diagram to visually position the line.
▪ The lines drawn in the chart on the right represent different judgments.
▪ We would prefer a method that results in a single, best regression line.
▪ The method that results in a single, best regression line is called the least squares principle.
Least Squares Principle

A mathematical procedure that uses the data to position a line with


the objective of minimizing the sum of the squares of the vertical
distances between the actual y values and the predicted values of y.
Least Squares Principle

The prediction line that we want to fit:

Ŷᵢ = β̂₀ + β̂₁Xᵢ

▪ Ŷ is the estimated value of Y for a selected value of X
▪ β̂₀ is the constant or intercept
▪ β̂₁ is the slope of the fitted line
▪ X is the value of the independent variable

(In the chart, e₁, e₂, e₃, e₄ are the vertical distances, or residuals, between the observed points at x₁, …, x₄ and the fitted line.)
Least Squares Principle

Three candidate lines (Chart 1, Chart 2, Chart 3) are drawn through the same data:
▪ Compared to the other charts, Chart 1 represents the best-fitting line because its sum of squares, 24, is the smallest.
▪ Charts 2 and 3 were drawn differently; their sums of squares are 44 and 132, respectively, so they are not the best-fitting lines.
Least Squares Principle

The method allows us to minimize the total squared distance between the actual values Y and the fitted values Ŷ.
Least Squares Estimators

The least squares estimators are given by the following coefficients:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = covariance(X, Y) / variance(X)

β̂₀ = ȳ − β̂₁x̄
Least Squares Estimators

The least squares estimators are given by the following coefficients:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = r (s_y / s_x)

β̂₀ = ȳ − β̂₁x̄
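
As a sketch of how these estimators work in practice, the following Python function implements the formulas above on small made-up (x, y) data; the numbers are hypothetical, not from the examples that follow.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared vertical distances to the line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: covariance of (x, y) over variance of x (the 1/(n-1) factors cancel)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    # Intercept: the fitted line passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, e.g. advertising spend (x) and sales (y)
b0, b1 = least_squares([1, 2, 3, 4, 5], [3, 5, 7, 10, 11])
print(round(b0, 2), round(b1, 2))  # fitted line: y_hat = 0.9 + 2.1 * x
```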
Least Squares Regression Example 1

▪ Suppose data were collected from a sample of 10 Armand’s Pizza Parlor restaurants located near college campuses.
▪ For each restaurant in the sample, x is the size of the student population (in thousands) and y is the quarterly sales (in thousands of dollars).
Least Squares Regression Example 1

▪ The scatter diagram enables us to observe the data graphically and to draw preliminary conclusions about the possible relationship between the variables.
▪ Student population is shown on the horizontal axis and quarterly sales is shown on the vertical axis.
▪ Quarterly sales appear to be higher at campuses with larger student populations.
Least Squares Regression Example 1

The result:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 2840 / 568 = 5

β̂₀ = ȳ − β̂₁x̄ = 130 − 5(14) = 60

The regression line:

Ŷ = 60 + 5X
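
The arithmetic can be checked directly from the summary quantities shown on the slide (the individual restaurant observations are not reproduced here), for example with a few lines of Python:

```python
# Summary quantities from the Armand's Pizza example
sum_cross = 2840          # sum of (x_i - x_bar)(y_i - y_bar)
sum_sq_x = 568            # sum of (x_i - x_bar)^2
x_bar, y_bar = 14, 130    # means of student population and quarterly sales

b1 = sum_cross / sum_sq_x      # slope: 5.0
b0 = y_bar - b1 * x_bar        # intercept: 60.0
print(f"y_hat = {b0} + {b1} * x")  # y_hat = 60.0 + 5.0 * x
```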
Least Squares Regression Example 1

Ŷ = 60 + 5X

▪ The graph of this equation is drawn on the scatter diagram, with intercept β̂₀ = 60 and slope β̂₁ = 5.
▪ The slope of the estimated regression equation is positive, implying that as the student population increases, sales increase.
Least Squares Regression Example 2

▪ Suppose we have the case of North American Copier Sales: the sales manager gathered information on the number of sales calls made (X) and the number of copiers sold (Y).
▪ Based on the data, it is calculated that:
▪ The standard deviations of x (s_x) and y (s_y) are 42.76 and 12.89, respectively
▪ The coefficient of correlation (r) is 0.865
Least Squares Regression Example 2

Use the least squares method to determine a linear equation to express the relationship between the two variables.
1. The first step is to find the slope of the least squares regression line, β̂₁:

β̂₁ = r (s_y / s_x) = 0.865 (12.89 / 42.76) = 0.2608

▪ The β̂₁ value of 0.2608 indicates that for each additional sales call, the sales representative can expect to increase the number of copiers sold by about 0.2608.
▪ So 20 additional sales calls in a month will result in about five more copiers being sold.
Least Squares Regression Example 2

Use the least squares method to determine a linear equation to express the relationship between the two variables.
2. The second step is to find β̂₀:

β̂₀ = ȳ − β̂₁x̄ = 45 − 0.2608(96) = 19.963
Least Squares Regression Example 2

Use the least squares method to determine a linear equation to express the relationship between the two variables.
3. Then determine the regression line:

Ŷ = 19.963 + 0.2608X

So if a salesperson makes 100 calls, he or she can expect to sell about 46 copiers:

Ŷ = 19.963 + 0.2608(100) = 46.0432
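
Putting the three steps together in a short Python sketch, using the summary statistics quoted earlier for this example (r = 0.865, s_x = 42.76, s_y = 12.89, x̄ = 96, ȳ = 45):

```python
r, s_x, s_y = 0.865, 42.76, 12.89   # correlation and standard deviations from the example
x_bar, y_bar = 96, 45               # mean sales calls and mean copiers sold

b1 = r * s_y / s_x          # slope, about 0.2608
b0 = y_bar - b1 * x_bar     # intercept, about 19.96
y_hat_100 = b0 + b1 * 100   # expected copiers sold after 100 sales calls, about 46
print(round(b1, 4), round(b0, 2), round(y_hat_100, 2))
```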
Least Squares Regression Example 2

Estimated sales for all


sales representatives
are calculated using the
formula we determined
earlier and placed in the
table.
Least Squares Regression Example 2

▪ The line of regression is drawn on the scatter diagram.
▪ The regression line will always pass through the means of variables X and Y.
▪ In addition, there is no other line through the data for which the sum of the squared deviations is smaller.
Regression Equation Slope Test

▪ The next step is to conduct a test of hypothesis to see if the slope of the regression line is different from zero.
▪ If we find that the slope is different from zero, then we can conclude that using the regression equation adds to our ability to predict the dependent variable based on the independent variable.
▪ The test is equivalent to the test for the correlation coefficient. We use β to represent the slope in the population.
Regression Equation Slope Test

▪ For a regression equation, the slope is tested for significance


▪ We test the hypothesis that the slope of the line in the population is 0
▪ If we do not reject the null hypothesis, we conclude there is no
relationship between the two variables
Regression Equation Slope Test

Recall the North American Copier Sales example. We identified the slope estimate as β̂₁; it is our estimate of the slope in the population.
We conduct a hypothesis test.
▪ Step 1: State the null and alternate hypotheses:
H₀: β = 0 (the slope of the population is zero)
H₁: β ≠ 0 (the slope of the population is different from zero)
▪ Step 2: Select the test statistic, t
Regression Equation Slope Test

Recall the North American Copier Sales example. We identified the slope estimate as β̂₁; it is our estimate of the slope in the population.
We conduct a hypothesis test.
▪ Step 3: Formulate the decision rule: at the 0.05 level, with n − 2 = 13 degrees of freedom (from the Student’s t table), reject H₀ if t < −2.160 or t > 2.160.
▪ Step 4: Make a decision: reject H₀, since t = 6.205

t = (β̂₁ − β₁) / se(β̂₁) = (0.2606 − 0) / 0.042 = 6.205

where se(β̂₁) is the standard error of the slope estimate, obtained as se(β̂₁) = s / √(Σ(Xᵢ − X̄)²), with s the standard error of the estimate.
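
A quick check of the test statistic, using the slope estimate and standard error reported on the slide:

```python
b1_hat, se_b1 = 0.2606, 0.042   # slope estimate and its standard error from the example
t = (b1_hat - 0) / se_b1        # test statistic for H0: beta = 0
print(round(t, 3))              # about 6.2, so H0 is rejected
```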
Regression Equation Slope Test

Recall the North American Copier Sales example. We identified the slope estimate as β̂₁; it is our estimate of the slope in the population.
We conduct a hypothesis test.
▪ Step 5: Interpret: the number of sales calls is useful in estimating copier sales.
Evaluating a Regression Equation’s
Ability to Predict

▪ Perfect prediction is practically impossible in almost all disciplines, including economics and business.
▪ The North American Copier Sales example showed a significant relationship between sales calls and copier sales; the equation is:
Number of copiers sold = 19.9632 + 0.2608(Number of sales calls)
▪ What if the number of sales calls is 84? We calculate that the number of copiers sold is 41.8704, yet the sample included two employees who made 84 sales calls, and they sold just 30 and 24 copiers.
▪ So, is the regression equation a good predictor?
Evaluating a Regression Equation’s
Ability to Predict

▪ We need a measure that will tell us how inaccurate the estimate might be.
▪ The measures we’ll use are:
1. The standard error of the estimate, s_y,x
2. The coefficient of determination, r²
The Standard Error of Estimate

The standard error of estimate is a measure of the dispersion, or scatter, of


the observed values around the line of regression for a given value of x.
▪ It measures the variation around the regression line
▪ It is in the same units as the dependent variable
▪ It is based on squared deviations from the regression line
The Standard Error of Estimate

The standard error of estimate is computed using the following formula:

s_y,x = √( Σ(Y − Ŷ)² / (n − 2) )
▪ If the standard error of estimate is small, this indicates that the data are
relatively close to the regression line and the regression equation can be
used.
▪ If it is large, the data are widely scattered around the regression line and
the regression equation will not provide a precise estimate of y.
The Standard Error of Estimate

▪ We calculate the standard error of estimate in this example.
▪ We need the sum of the squared differences between each observed value of y and the predicted value ŷ, which is 587.1108.

s_y,x = √( Σ(Y − Ŷ)² / (n − 2) ) = √( 587.1108 / (15 − 2) ) = 6.720
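
A quick check of this calculation in Python, using the sum of squared residuals reported on the slide:

```python
import math

sse, n = 587.1108, 15              # sum of (y - y_hat)^2 and sample size
s_yx = math.sqrt(sse / (n - 2))    # standard error of the estimate
print(round(s_yx, 3))              # about 6.72 copiers
```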
The Coefficient of Determination

▪ The coefficient of determination is the proportion of the total variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X.
▪ The coefficient of determination provides a more interpretable measure of a regression equation’s ability to predict.
▪ It is found from the following formula:

r² = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)²
The Coefficient of Determination

▪ The characteristics of the coefficient of determination:
▪ It ranges from 0 to 1.0
▪ It is the square of the correlation coefficient
▪ In the North American Copier Sales example, the correlation coefficient was 0.865; just square that: (0.865)² = 0.748; this is the coefficient of determination.
▪ This means 74.8% of the variation in the number of copiers sold is explained by the variation in sales calls.
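
Because r² is simply the square of the correlation coefficient, the figure on the slide can be reproduced in one line:

```python
r = 0.865
r_squared = r ** 2
print(round(r_squared, 3))  # 0.748 -> 74.8% of the variation in copiers sold is explained
```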
Relationships among r, r², and s_y,x

▪ Recall that the standard error of estimate (s_y,x) measures how close the actual values are to the regression line. When it is small, the two variables are closely related.
▪ The correlation coefficient (r) measures the strength of the linear association between two variables. When points on the scatter diagram are close to the line, the correlation coefficient tends to be large.
▪ The coefficient of determination is the correlation coefficient squared.
▪ Therefore, the correlation coefficient (r) and the coefficient of determination (r²) have an inverse relationship with the standard error of estimate (s_y,x).
THANK YOU
