
MODULE 11: CORRELATION AND REGRESSION ANALYSIS

UNIT 2: SIMPLE LINEAR REGRESSION


(For DECEMBER 9-11)

Learning Outcomes:
(1) Develop an estimated simple linear regression model to predict the value of
a dependent variable based on one independent variable.
(2) Interpret the constants in the estimated simple linear regression equation.

At the end of this learning module, you are expected to know how to model certain
phenomena using simple linear regression. You will be tasked to derive linear regression
models given some business-related data using a scientific calculator.

Regression analysis is a tool for building and developing a statistical (regression) model that
will characterize the association between a dependent variable and one or more
independent, or explanatory, variables. If the regression model is found to be adequate, it
can then be used to estimate or forecast values of the dependent variable. In simple
linear regression, there is only one independent variable, while multiple linear regression
uses two or more independent variables.

Correlation and regression analysis are closely related since both involve the relationship
between two variables and both use paired observations obtained from the same (or
matched) subjects. While correlation is used to determine the degree as well as the
direction of the relationship between variables, regression analysis deals with the use of the
relationship for forecasting or predicting the value of a dependent variable.

For instance, regression analysis can be used for the following situations:

• Managers wish to predict the level of sales based on selling price, or extrapolate
  a trend into the future.
• A company may wish to predict sales based on the GDP and the 10-year
  treasury bond rate to capture the influence of the business cycle.
• A marketing researcher might want to predict the intent of buying a particular
  car model based on a survey that measured consumer attitudes toward the
  brand, negative word-of-mouth, and income level.

Before proceeding with (simple linear) regression analysis, it helps to draw a scatter diagram
of 𝑌 versus 𝑋, which may give an idea of the form of the relationship between them. It is
important to note here that the variable being predicted is always the dependent variable 𝑌,
and it must be on the vertical (𝑦) axis.

Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited. 193
SIMPLE LINEAR REGRESSION

Simple linear regression attempts to model the relationship between two variables by fitting
a linear equation to observed data. One variable is considered to be a regressor/predictor
or independent variable, and the other is considered to be a response or dependent
variable (the variable being predicted). The simple linear regression model postulates that

𝑌 = 𝛽₀ + 𝛽₁𝑋 + 𝑒

where the variables are defined as follows:


𝑌 = observed value of the dependent/response variable
𝑋 = observed value of the independent/explanatory variable
𝛽₀ and 𝛽₁ = regression coefficients
𝛽₀ = true regression intercept, or the value of the response variable when 𝑋 is zero
𝛽₁ = true regression slope, or the change (increase if positive or decrease if
     negative) in the response variable brought about by an increase of one
     unit in the independent variable
𝑒 = random error component, which captures all other factors that affect
    the response variable but were not included in the model

In practice, the parameter values 𝛽₀ and 𝛽₁ are not known and must be estimated using
sample data. In general, the goal of simple linear regression is to find the line that best
predicts 𝑌 from 𝑋, that is, to find the line Ŷ = 𝑎 + 𝑏𝑋 (the fitted regression line) that best
estimates the regression model 𝑌 = 𝛽₀ + 𝛽₁𝑋 + 𝑒. It determines the values 𝑎 and 𝑏 that best
estimate 𝛽₀ and 𝛽₁, where the quantities involved are defined as follows:

Ŷ = predicted value of 𝑌 for a given value of 𝑋
𝑎 = the 𝑦-intercept of the estimated regression line
𝑏 = the slope of the estimated regression line

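The values of the slope 𝑏 and the 𝑦-intercept 𝑎 can be obtained by the method of least
squares, using the following formulas:

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \quad\text{and}\quad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \bar{y} - b\bar{x}$$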

The 𝑦-intercept 𝑎 is interpreted as the value of 𝑦 when 𝑥 is zero (if such a case exists). The
slope 𝑏 is interpreted as the amount of increase (if it is positive) or decrease (if it is negative)
in the value of 𝑌 for every unit increase in the value of 𝑋.
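
The least-squares formulas above can be sketched in R as follows. This is only an illustration: the function name `least_squares_line` and the toy data are assumptions made for this example and are not part of the module.

```r
# Least-squares slope (b) and intercept (a) from the summation formulas.
# `least_squares_line` is a hypothetical helper name used for illustration.
least_squares_line <- function(x, y) {
  n <- length(x)
  b <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
  a <- mean(y) - b * mean(x)   # same as sum(y)/n - b * sum(x)/n
  c(a = a, b = b)
}

# Toy data that lie exactly on the line y = 1 + 2x,
# so the function should recover a = 1 and b = 2.
toy <- least_squares_line(c(1, 2, 3, 4), c(3, 5, 7, 9))
print(toy)
```

Because the toy points fall exactly on a line, the recovered intercept and slope match the line used to create them, which is a quick way to check that the formulas were typed correctly.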

Example 1:
To illustrate the interpretation of the values 𝑎 and 𝑏, suppose we wish to find the line that
best estimates the relationship between the number of hours of study of a student (𝑋) and
the score obtained in a test (𝑌). Suppose that the fitted regression line is found to be
Ŷ = 12 + 5𝑋. Then a student who does not study at all is predicted to get a score of 12. In
addition, every additional hour of study increases the student’s predicted score by 5 points.

Example 2:
As a second example, suppose the linear regression equation Ŷ = 14 − 2.5𝑋 predicts the
selling price 𝑌 of a second-hand laptop in thousands of pesos based on the age 𝑋 of the
laptop in years. Then the equation indicates that the predicted price of a brand-new laptop
is P14,000, and for every additional year of use, its predicted selling price decreases by P2,500.
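
As a small illustration of how the intercept and slope are read, the fitted lines of Examples 1 and 2 can be written as R functions. The function names `score_hat` and `price_hat` are made up for this sketch.

```r
# Fitted lines from Examples 1 and 2 written as functions.
# The names score_hat and price_hat are hypothetical.
score_hat <- function(hours) 12 + 5 * hours    # Example 1: test score
price_hat <- function(age) 14 - 2.5 * age      # Example 2: price in thousands of pesos

score_hat(0)                  # intercept: predicted score with no study
score_hat(3) - score_hat(2)   # slope: change in predicted score per extra hour
price_hat(0)                  # intercept: predicted price of a brand-new laptop
price_hat(2) - price_hat(1)   # slope: change in predicted price per extra year
```

Evaluating the line at 𝑋 = 0 returns the intercept, and the difference between predictions one unit apart returns the slope, whatever the starting value of 𝑋.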

Let us now show how to find the estimated simple linear regression equation.

Example 3:
In the 1990s, research efforts focused on the problem of predicting a manufacturer’s
market share using information on the quality of its product. Suppose that the following
data are available on market share, in percent (𝑌), and product quality, on a scale of 0
to 100, determined by an objective evaluation procedure (𝑋).

𝑿   27  39  73  66  33  43  47  55  60  68  70  75
𝒀    2   3  10   9   4   6   5   8   7   9  10  13

a. Draw the scatter diagram.
b. Determine the estimated simple linear regression equation to predict the market
   share from the product quality rating obtained from the objective evaluation
   procedure. Graph the fitted regression line and interpret the regression
   coefficients.
c. Estimate the market share when the product quality is 95.

Solution:
a. Here is the scatter diagram for the data set. It appears that there is a positive
relationship between product quality rating and market share.

b. To find the estimated simple linear regression equation, we determine the values
   of the following using the calculator: 𝑛, x̄, Σ𝑥, Σ𝑥², ȳ, Σ𝑦, Σ𝑦², and Σ𝑥𝑦. The process is
   the same as the procedure shown in the previous unit on finding the Pearson
   correlation coefficient (choose 𝑦 = 𝑎 + 𝑏𝑥 in the statistics mode of the
   calculator). Verify that these are the values for this problem:

   𝑛 = 12, x̄ = 54.66666667, Σ𝑥 = 656, Σ𝑥² = 38856,
   ȳ = 7.166666667, Σ𝑦 = 86, Σ𝑦² = 734, Σ𝑥𝑦 = 5267

   We copied the full calculator values of x̄ and ȳ to minimize round-off errors in computing
   𝑎. Hence, the slope and intercept are given respectively as follows:

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{12(5267) - (656)(86)}{12(38856) - (656)^2} = 0.1888913624$$

$$a = \bar{y} - b\bar{x} = 7.166666667 - (0.1888913624)(54.66666667) = -3.159394479$$

   Copying the values of 𝑎 and 𝑏 up to the fourth decimal place (rounding off
   properly), the estimated simple linear regression equation is given by

   Ŷ = 𝑎 + 𝑏𝑋
   Ŷ = −3.1594 + 0.1889𝑋
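
The calculator work for Example 3 can be cross-checked with a few lines of R. This is a sketch only: the data are typed in by hand from the table in Example 3, and the variable names are chosen for this illustration.

```r
# Example 3 data, typed in by hand
x <- c(27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75)  # product quality rating
y <- c(2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13)           # market share (%)

# The quantities read off the calculator
n <- length(x)
sums <- c(n = n, sum_x = sum(x), sum_x2 = sum(x^2),
          sum_y = sum(y), sum_y2 = sum(y^2), sum_xy = sum(x * y))
print(sums)

# Slope and intercept from the least-squares formulas
b <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
a <- mean(y) - b * mean(x)
print(round(c(a = a, b = b), 4))
```

The printed sums should agree with the values listed for this problem, and the rounded coefficients with 𝑎 = −3.1594 and 𝑏 = 0.1889.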

   How do we interpret the regression coefficients? The value 𝑎 = −3.1594
   represents the estimated market share when the product quality rating is zero,
   which may not have any meaning in this case because market share cannot be
   negative. On the other hand, the value 𝑏 = 0.1889 means that the predicted market
   share increases by 0.1889 percentage points for every unit increase in the product
   quality rating.

   The figure below shows the scatter plot and the graph of the fitted regression line. It
   can be seen that most of the data points are close to, but not on,
   the fitted regression line.
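
A plot like the one described can be produced in R along the following lines. This is a minimal sketch: the data are typed in by hand from Example 3, and the plot title and axis labels are choices made for this illustration.

```r
# Example 3 data, typed in by hand
quality <- c(27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75)
share   <- c(2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13)

# lm() fits the least-squares line share = a + b * quality
fit <- lm(share ~ quality)

# Scatter plot of the data with the fitted line drawn over it
plot(quality, share,
     main = "Market Share vs. Product Quality",
     xlab = "Product quality rating (X)",
     ylab = "Market share in % (Y)")
abline(fit)
```

`abline()` accepts the fitted model directly and draws the line using its intercept and slope.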

c. Next, let us estimate the market share when the product quality rating is 95. The
   market share can be predicted using the estimated simple regression equation,
   by substituting 𝑋 = 95:

   Ŷ = −3.1594 + 0.1889(95) = 14.7861

   Thus the market share is predicted to be 14.7861% when the product quality
   rating is 95.
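
The substitution can be checked in R with the rounded coefficients. The function name `predict_share` is hypothetical. The sketch also compares 95 with the observed ratings, which run from 27 to 75, so this prediction is an extrapolation beyond the data and is best read with some caution.

```r
# Rounded coefficients of the fitted line from Example 3
a <- -3.1594
b <- 0.1889

# Hypothetical helper: predicted market share (%) for a given quality rating
predict_share <- function(quality) a + b * quality

share_95 <- predict_share(95)
print(share_95)

# Observed quality ratings from Example 3
x <- c(27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75)
outside_range <- 95 < min(x) || 95 > max(x)
print(outside_range)   # TRUE means 95 lies outside the observed ratings
```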

When solving for the estimated linear regression equation, it is usually advisable to solve for
the Pearson correlation coefficient to see the magnitude and direction of the linear
relationship between the two variables. Using the values for 𝑛 and the summations above,
we have

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

$$r = \frac{12(5267) - (656)(86)}{\sqrt{\left[12(38856) - (656)^2\right]\left[12(734) - (86)^2\right]}} = 0.9529$$
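
The same value of 𝑟 can be reproduced in R, either from the summation formula or with the built-in `cor()` function. The data are typed in by hand from Example 3, and the variable names are chosen for this sketch.

```r
# Example 3 data, typed in by hand
x <- c(27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75)
y <- c(2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13)
n <- length(x)

# Pearson r from the summation formula
r_formula <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

# Pearson r from the built-in function (the default method is Pearson)
r_builtin <- cor(x, y)

print(round(c(formula = r_formula, builtin = r_builtin), 4))
```

Both approaches should print 0.9529 after rounding to four decimal places.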

This means that there is a very strong positive correlation between the product quality
rating and the market share. Thus, the estimated linear regression equation gives a good
prediction of market share.

Note that the values of 𝑎, 𝑏, and 𝑟 can all be obtained directly from the
calculator. However, just as I mentioned in the previous unit, I will still ask you to show your
substitution into the formula in exercises and graded activities. It would also be good for
you to compare the values given by the statistics mode output with the values you
obtained by using the formula.

THE COEFFICIENT OF DETERMINATION

Another way to judge how well the estimated simple regression equation fits the data
is to compute the coefficient of determination. The coefficient of
determination, 𝑟², is the square of the coefficient of correlation. It gives the
proportion of the variance (fluctuation) of one variable that is predictable from the other
variable, and so indicates how much confidence one can place in predictions made from the
model. It takes values from 0 to 1 and measures how well the fitted regression line
represents the data. That is, 𝑟² is the proportion of the total variation in the dependent
variable 𝑌 that is explained, or accounted for, by the variation in the independent variable 𝑋.

For example, if 𝑟 = 0.922, then 𝑟² = 0.8501. This means that 85.01% of the total variation in 𝑌
can be explained by the linear relationship between 𝑋 and 𝑌. Alternatively, we can say that
85.01% of the variation in 𝑌 is explained by the variation in 𝑋. The other 14.99% of the total
variation in 𝑌 remains unexplained. If the regression line passed exactly through every point
on the scatter plot, it would be able to explain all of the variation. The further the line is
away from the points, the less it is able to explain.

Example 4:
Compute the coefficient of determination for the example above and interpret the
resulting value.

Solution:
Using the values from Example 3, we have

𝑟² = (0.9529)² = 0.9080

Thus, 90.80% of the variation in the market share is explained by the variation in the
product quality rating. Alternatively, we can say that 90.80% of the variation in market
share is explained by the linear relationship between product quality rating and
market share.
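
To connect 𝑟² with the idea of explained variation, the sketch below computes it two ways in R: as the square of the correlation coefficient, and as one minus the ratio of unexplained variation to total variation. The data are typed in by hand from Example 3, and the names `sse` and `sst` are labels chosen for this illustration.

```r
# Example 3 data, typed in by hand
x <- c(27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75)
y <- c(2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13)
n <- length(x)

# Least-squares line and its fitted values
b <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
a <- mean(y) - b * mean(x)
y_hat <- a + b * x

sse <- sum((y - y_hat)^2)      # variation left unexplained by the line
sst <- sum((y - mean(y))^2)    # total variation in Y

r2_from_variation <- 1 - sse / sst
r2_from_r <- cor(x, y)^2

print(round(c(r2_from_variation, r2_from_r), 4))
```

With unrounded values both give 0.9081. The 0.9080 in Example 4 comes from squaring 𝑟 after it was rounded to 0.9529.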

Performing the Analysis using R

When R, or any other statistical software, is used for regression analysis, the
output includes not just the coefficients of the regression model but also
hypothesis test results for the significance of the regression coefficients. Let us work
on Example 3 using R.

R Script and Outputs

# Load readr package (optional: read.csv() below is base R and does not need it)
library(readr)

# Import the "marketshare.csv" file and assign it to "market"
market <- read.csv("marketshare.csv")
head(market)
   X  Y
1 27  2
2 39  3
3 73 10
4 66  9
5 33  4
6 43  6

# Create a scatterplot of the data
plot(market$X, market$Y, main = "Scatter Plot of Data", xlab = "X", ylab = "Y")

The scatterplot shows the points following a linear trend, which suggests a linear
relationship between the X and Y variables. We can therefore proceed with simple
linear regression modeling.

# Fit the simple linear regression model
model <- lm(Y ~ X, data = market)

# The Simple Linear Regression Analysis Output
summary(model)

Call:
lm(formula = Y ~ X, data = market)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2074 -0.6935 -0.1852  0.8093  1.9925

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.15939    1.08148  -2.921   0.0153 *
X            0.18889    0.01901   9.939 1.68e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.04 on 10 degrees of freedom
Multiple R-squared:  0.9081,    Adjusted R-squared:  0.8989
F-statistic: 98.78 on 1 and 10 DF,  p-value: 1.682e-06

The R output gives the regression intercept and slope (or coefficient of X). With these, we
can present the estimated simple linear regression model as

Ŷ = −3.1594 + 0.1889𝑋
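
Instead of copying the coefficients from the printed summary, they can be extracted from the fitted model with `coef()`. In the sketch below, the data frame is typed in by hand as a stand-in for `marketshare.csv`, on the assumption that the file holds the Example 3 data with columns named X and Y.

```r
# Stand-in for marketshare.csv (assumed to hold the Example 3 data)
market <- data.frame(
  X = c(27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75),
  Y = c(2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13)
)

model <- lm(Y ~ X, data = market)

# coef() returns a named vector: "(Intercept)" and "X"
coefs <- coef(model)
a <- coefs[["(Intercept)"]]
b <- coefs[["X"]]

# Print the fitted equation to four decimal places
cat(sprintf("Y-hat = %.4f + %.4f X\n", a, b))
# prints: Y-hat = -3.1594 + 0.1889 X
```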

Before using this regression model to predict market share (Y) based on product quality
rating (X), we need to make sure that the regression model is statistically significant.

The Coefficients portion of the output not only gives us the coefficients of the regression
equation but also gives the p-value for testing the significance of each coefficient.
For the intercept, the p-value is 0.0153, while for the coefficient of X, the p-value is
1.68 × 10⁻⁶. Both p-values are less than the 0.05 significance level, which
indicates that these coefficients are significant in the model.
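
The p-values can also be pulled out of the summary object and compared with the significance level in code. This is a sketch under the same assumption as before: the data frame below is typed in by hand as a stand-in for `marketshare.csv`, and `alpha` is a name chosen here for the 0.05 level.

```r
# Stand-in for marketshare.csv (assumed to hold the Example 3 data)
market <- data.frame(
  X = c(27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75),
  Y = c(2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13)
)
model <- lm(Y ~ X, data = market)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(model))
p_values <- coef_table[, "Pr(>|t|)"]
print(p_values)

# Compare each p-value with the significance level
alpha <- 0.05
print(p_values < alpha)   # TRUE means the coefficient is significant at alpha
```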

To assess the significance of the regression model as a whole, we now look at the last row,
labeled F-statistic, and its p-value. Here the p-value is 1.682 × 10⁻⁶, which is less than
the 0.05 significance level. This indicates that the regression
model is significant, that is, it can be used to predict or estimate market share based on
product quality rating.
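
As a sketch of how the overall test and a prediction can be obtained in code, the example below recomputes the F-test p-value from the stored F-statistic and then uses `predict()` for a quality rating of 95. The data frame is typed in by hand as a stand-in for `marketshare.csv`, assuming the file holds the Example 3 data.

```r
# Stand-in for marketshare.csv (assumed to hold the Example 3 data)
market <- data.frame(
  X = c(27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75),
  Y = c(2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13)
)
model <- lm(Y ~ X, data = market)

# summary() stores the F-statistic as a named vector: value, numdf, dendf
f <- summary(model)$fstatistic
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
print(p_model)

# Predicted market share when the product quality rating is 95
pred_95 <- predict(model, newdata = data.frame(X = 95))
print(pred_95)
```

With only one independent variable, the p-value of the F-test is the same as the p-value of the t-test for the slope. The prediction from `predict()` is about 14.7853, slightly different from the 14.7861 obtained by hand, because R uses the unrounded coefficients.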

Practice Exercise 11-2

(1) A department of transportation’s study on driving speed and miles per gallon for
midsize automobiles resulted in the following data.
a. Plot and interpret the scatter diagram.
b. Find the estimated simple linear regression equation to predict gas
consumption from speed.
c. Compute the coefficient of determination and interpret.
d. Estimate the gas consumption when the speed is 45 miles per hour.

Speed (miles per hour)    Consumption (miles per gallon)
30                        28
50                        25
40                        25
55                        23
30                        30
25                        32
60                        21
25                        35
50                        26
55                        25

(2) The marketing manager of a large supermarket chain would like to know the
correlation between shelf space and sales of pet food. A random sample of 12
equal-sized stores is selected, with the following results.
a. Plot and interpret the scatter diagram.
b. Find the estimated simple linear regression equation to predict weekly sales
from shelf space.
c. Compute the coefficient of determination and interpret.
d. Estimate the weekly sales when the shelf space is 18 feet.

Store    Shelf Space (Feet)    Weekly Sales ($)
1        5                     160
2        5                     220
3        5                     140
4        10                    190
5        10                    240
6        10                    260
7        15                    230
8        15                    270
9        15                    280
10       20                    260
11       20                    290
12       20                    310

Learning Reinforcement Activity No. 11-2: SIMPLE LINEAR REGRESSION


Accomplish by December 11, 2020

Following the solutions in the examples presented in this unit, solve the following problems
as directed using R. After each problem, copy and present the R output in a .docx file
and name your output file LRA11-2<LASTNAME>.docx. Give your interpretations of the
analysis results/output. Save your R script as LRA11-2<LASTNAME>.R.

1. Suppose data were collected from a sample of 10 branches of a pizza restaurant
chain located near college campuses, shown below.

Restaurant    Student Population (in 1000s)    Quarterly Sales (in $1000)
1             2                                58
2             6                                105
3             8                                88
4             8                                118
5             12                               117
6             16                               137
7             20                               157
8             20                               169
9             22                               149
10            26                               202

a. Plot the scatter diagram using Microsoft Excel or any available software. If you
don’t have the technology to do it, you may do it manually. Make sure to label
your 𝑋 and 𝑌 axes. Include a screenshot or photo of your scatter diagram in your
submission. (2 points)
b. What does the scatter diagram indicate about the relationship between student
population and quarterly sales? (1 point)
c. Develop the estimated linear regression equation that could be used to predict
the quarterly sales from the student population. (4 points)
d. Compute the Pearson correlation coefficient and the coefficient of
determination and interpret these. (4 points)

2. A study was made by a retail merchant to determine the relation between weekly
advertising expenditures and sales. The following data were recorded:

Advertising Costs ($)    Sales ($)
40                       385
20                       400
25                       395
20                       365
30                       475
50                       440
40                       490
20                       420
50                       560
40                       525
25                       480
50                       510

a. Plot and interpret the scatter diagram. You may do it using Microsoft Excel or any
available software. If you don’t have the technology to do it, you may do it
manually. Make sure to label your 𝑋 and 𝑌 axes. Include a screenshot or photo of
your scatter diagram in your submission. (3 points)
b. Find the estimated linear regression equation to predict weekly sales from weekly
advertising expenditures. (4 points)
c. Compute the coefficient of correlation. Interpret. (2 points)
d. Compute the coefficient of determination. Interpret. (2 points)
e. Estimate the weekly sales when advertising costs are $35. (2 points)

3. The paired data below consist of the costs of advertising (in thousands of dollars)
and the number of products sold (in thousands).

Cost 9 2 3 4 2 5 9 10
Number 85 52 55 68 67 86 83 73

a. Plot and interpret the scatter diagram. You may do it using Microsoft Excel or any
available software. If you don’t have the technology to do it, you may do it
manually. Make sure to label your 𝑋 and 𝑌 axes. Include a screenshot or photo of
your scatter diagram in your submission. (3 points)
b. Find the estimated linear regression equation to predict number of products sold
from advertising costs. (4 points)
c. Compute the coefficient of correlation. Interpret. (2 points)
d. Compute the coefficient of determination. Interpret. (2 points)
e. Estimate the number of products sold when advertising costs are $4500. (2 points)

4. An article in Business Week listed the “Best Small Companies” with their sales and
earnings. A random sample of 12 companies was selected and the sales and
earnings, in millions of dollars, are reported below.

Small Company    Sales (in million $)    Earnings (in million $)
1                89.2                    4.9
2                18.6                    4.4
3                18.2                    1.3
4                71.7                    8.0
5                58.6                    6.6
6                46.8                    4.1
7                17.5                    2.6
8                11.9                    1.7
9                19.6                    3.5
10               51.2                    8.2
11               28.6                    6.0
12               69.2                    12.8

a. Plot and interpret the scatter diagram. You may do it using Microsoft Excel or any
available software. If you don’t have the technology to do it, you may do it
manually. Make sure to label your 𝑋 and 𝑌 axes. Include a screenshot or photo of
your scatter diagram in your submission. (3 points)
b. Find the estimated linear regression equation to predict earnings from sales. (4 points)
c. Compute the coefficient of correlation. Interpret. (2 points)
d. Compute the coefficient of determination. Interpret. (2 points)
e. For a small company with $50 million in sales, estimate the earnings. (2 points)

Congratulations! You just completed Module 11 Unit 2. Next, let us consider
problems involving more than one independent variable.
