
Chapter 6:

Correlation and Regression Analysis


Lesson 2: Going Beyond Correlation with Simple Linear Regression
TIME FRAME: 3 hour session
OVERVIEW OF LESSON
In this lesson, students will continue exploring the data they examined in Lesson 7-01, this time
making use of simple linear regression. They will be asked to predict the value of the
dependent variable given a value of the independent variable. After each student has interpreted
their regression results, i.e., both regression
coefficients, they pool their findings as a class to explore the variability in the regression
coefficients that they estimated. As a class, they construct approximations to the sampling
distributions of the regression coefficients and use the sampling distributions to make assertions
about the values of the population parameters.
LEARNING COMPETENCIES: At the end of the lesson, the learner should be able to:

identify independent and dependent variables;

draw the best-fit line on a scatter plot;


calculate the slope and y-intercept of the regression line;
interpret the calculated slope and y-intercept of the regression line;
predict the value of the dependent variable given the value of the independent variable;
solve problems involving regression analysis.

LESSON OUTLINE:
1. Motivation / Introduction
2. Preliminary Lesson : Simple Linear Regression Line
3. Main Lesson : Obtaining the Simple Linear Regression Line and Explaining the
Regression Coefficients
4. Enrichment: Sampling Distribution of Regression Coefficients
DEVELOPMENT OF THE LESSON
(A) Introduction
Inform students that when examining the relationship between two variables x and y, we can
consider one variable as a kind of input variable within an input-output framework. We
plot this variable along the horizontal (also called the x) axis in the scatterplot. The output
variable is the variable along the vertical (also called the y) axis. The input or x variable is
typically called an independent variable; it is also called a covariate or an exogenous,
explanatory, regressor, or control variable. The output or y variable is called the dependent
variable; it is also called the regressand or the endogenous, explained, or response variable.

In Lesson 7-01, Karl Pearson's data on the heights of fathers and of their respective first-born
sons was presented. While taller-than-average fathers tend to have taller-than-average sons,
the sons are not quite as tall as the fathers. There is a regression toward
the average height, thus the term regression analysis. Likewise, shorter-than-average
fathers tend to have shorter-than-average sons, but the sons are not quite as short as the
fathers.
(B) Preliminary Lesson : Simple Linear Regression Line
When we visualize the points in a scatterplot generally clustering about a line, we may be
interested in obtaining an estimate of such a line in order to help us estimate the expected level
of a variable Y for a known specific value x of the variable X (say, daily allowance). For
instance, for the worked example in the previous lesson, we may want to determine how
many text messages a student is expected to usually send if his/her daily allowance is 150 pesos. In
Lesson 7-01, it was mentioned that we could consider the line that passes through the point of
averages and whose slope is the ratio of the standard deviations (the SD line) as one possible line. Inform
students that this line ignores information about the magnitude of the association between the
two variables. If the correlation coefficient is zero, then we should not expect an increase in
one variable to be accompanied by an increase or decrease in the other, yet the SD line would
still have a nonzero slope.
An alternative to this SD line that incorporates information provided by the correlation
coefficient, the means and standard deviations is the regression line:

The regression line for y on x is a line that


contains the point of averages and whose slope is
the product of the correlation coefficient and the
ratio of the standard deviation of y to the standard
deviation of x.

In mathematics, the point-slope form of an equation of the line is given by

$$ y - y_1 = m (x - x_1) $$

where $m$ is the slope of the line passing through the point $(x_1, y_1)$. In the given
description above, the regression line has slope equal to $r(\sigma_y / \sigma_x)$, and it passes through
the point of averages $(\bar{x}, \bar{y})$. Using the point-slope form, we get the
equation of the regression line as:

$$ y - \bar{y} = r \frac{\sigma_y}{\sigma_x} (x - \bar{x}) $$

This can be transformed to the equation:

$$ y = r \frac{\sigma_y}{\sigma_x} x + \left( \bar{y} - r \frac{\sigma_y}{\sigma_x} \bar{x} \right) $$


The term in parentheses in this expression is the y-intercept of the regression line. It can be
interpreted as what we expect y to be when the value of x is zero.
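The recipe above can be traced step by step on a small data set. The following Python sketch is only an illustration: the paired values and the helper name `regression_line` are made up for this purpose and do not come from the lesson database. It computes the means, the population standard deviations, and the correlation coefficient, and then forms the slope and the y-intercept.

```python
from math import sqrt

def regression_line(xs, ys):
    """Return (slope, intercept) of the regression line of y on x.

    The slope is r * (sd_y / sd_x) and the line passes through
    the point of averages.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population standard deviations (divide by n), as in the lesson.
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # Correlation coefficient: average product of the standardized values.
    r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = regression_line(xs, ys)
print(f"y = {slope:.2f} x + {intercept:.2f}")  # y = 0.60 x + 2.20
```

For these made-up values the point of averages is (3, 4), and substituting x = 3 into the fitted line gives 0.60(3) + 2.20 = 4, as expected.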
Explain to students that the regression line relates how much change in the y-value is
associated with a unit increase in the x-value. It estimates the expected value for the Y
variable corresponding to a particular level x of the variable X. On average, it associates with
each increase of one standard deviation in the x-units, r standard deviations in the y-units
(where r is the correlation coefficient).
Note that when we consider the notion of regression, we assume a functional dependence of
Y on X. Thus, we consider Y as a dependent, response, or output variable, while X is an
independent, explanatory or input variable. The magnitude of the output variable Y is
dependent on the magnitude of the input variable X. A person's blood pressure, for instance,
functionally depends on the person's age. This does not, however, suggest that age is the only
factor that is responsible for blood pressure, but that it is one possible determinant for blood
pressure.
On the other hand, arm length and leg length are correlated but not functionally dependent.
Increasing arm length would not have an effect on leg length although these variables are
correlated. In such instances, correlation can be calculated but obtaining a regression line
may not be of practical utility.
(C) Main Lesson : Obtaining the Simple Linear Regression Line and Explaining the Regression
Coefficients
Consider the worked example in Lesson 7-01 pertaining to information from the database
generated in Lesson 1-01. Students were asked in Lesson 7-01 to generate a random sample
of 30 students from the database.
Worked Example: We have generated the following summary measures in the worked
example for students with complete information on their daily allowance and the usual
number of text messages they send in a day:
Summary Measure                   Daily Allowance in School   Usual Number of Text Messages Sent in a Day
Mean                              90.37037                    33.2963
(Population) Standard Deviation   120.9984                    43.11124
Correlation                       0.780283


The regression line for Usual Number of Text Messages Sent in a Day on Daily
Allowance in School is then estimated as:
(Expected Usual Number of Text Messages Sent in a Day − 33.2963) =
(0.780283) × (43.11124 / 120.9984) × (Daily Allowance in School − 90.37037)
or simply
Expected Usual Number of Text Messages Sent in a Day =
0.278011805 × (Daily Allowance in School) + 8.172270273
The earlier representation of the estimated regression line clearly indicates that
students with an average daily allowance are also expected to have an average number of
text messages. That is, the point of averages is a point on the estimated regression line.
The latter representation of the regression line is shown in the typical slope-intercept
form of an equation. In particular, the slope is interpreted as follows: for each increase
of 1 peso in total daily allowance, we expect a corresponding increase of 0.28 text
messages sent in a day, or, roughly equivalently, every 4-peso increase in allowance is expected
to have a corresponding increase of about 1 text message sent by a student in a day.
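As a check on the arithmetic in the worked example, the coefficients can be recomputed from the summary measures alone. The Python sketch below uses only the five numbers in the summary table; the variable names are illustrative. Because the summary measures are rounded, the results should agree with the coefficients quoted above only to about five decimal places.

```python
# Summary measures taken from the worked example.
mean_allowance = 90.37037
mean_texts = 33.2963
sd_allowance = 120.9984   # population standard deviation
sd_texts = 43.11124       # population standard deviation
r = 0.780283

# Slope: correlation times the ratio of the SD of y to the SD of x.
slope = r * sd_texts / sd_allowance
# Intercept: chosen so that the line passes through the point of averages.
intercept = mean_texts - slope * mean_allowance

print(f"slope     = {slope:.4f}")
print(f"intercept = {intercept:.4f}")
# How many pesos of extra allowance correspond to one extra text message.
print(f"pesos per extra text message = {1 / slope:.2f}")
```

The last line makes the "every 4 pesos" reading of the slope explicit: the reciprocal of the slope is a little under 4 pesos per additional text message.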
Explaining the Regression Coefficients
Since the slope of a line is the rise over run, the slope of the regression line represents the rise
in Y over the run in X, i.e.,

The slope of the regression line of Y on X represents


how much we expect Y to change per unit increase in X.
Ask the students what a positive slope means. They should say that when the slope is
positive, Y increases as X increases. In this case, we say that Y is directly or positively
related to X. Ask students also what a negative slope means. They should say that when the
slope is negative, Y decreases as X increases. Here, we say that Y is inversely or negatively
related to X. Finally, ask the students what happens with a zero slope. They should say that when
the slope is zero, Y is a constant and is equal to the y-intercept. Here, there is no change in Y
whatever X will be, i.e., the fit is a horizontal line. In the next lesson, we consider how to
make valid statistical inferences about the slope of the regression line.
Remind students that in an equation of a line, the y-intercept is the value of Y when X is
zero. For the worked example, the intercept may be interpreted as the usual number of text
messages sent daily by a student who has zero daily allowance. Students may have zero daily
allowance when the family of the student decides not to give an allowance to the student
because the family is poor, or because the student is deemed not to need an allowance since
everything is being provided for the student. However, in other situations, such an
interpretation may not be valid, as we may be extending the regression line
far outside the usual range of X values. Consider, for
instance, relating the monetary value of a house (Y) to the area of the dwelling in square
meters (X). Here, a house must always have nonzero area, and thus the data on area do not
include X = 0.
Using the Regression Line for Predictions
The utility of the estimated regression line is not merely for explaining relationships between
X and Y but also for making predictions about Y given a certain value of X. Suppose we
wish to randomly pick one of the students who gave information for Lesson 1-01, and we
wish to guess his or her usual number of text messages per day. In the absence of any
information, the best guess would naturally be the average usual number of text messages
sent by the students per day. However, we may be given some specific level of daily
allowance of the student that can be utilized to improve the prediction.
Suppose that for the worked example, we are provided information about the level of daily
allowance of a student, say 150 pesos. According to our estimated regression line,
Expected Usual Number of Text Messages Sent in a Day =
0.278011805 × (Daily Allowance in School) + 8.172270273,
a student with a daily allowance of 150 pesos is expected to usually have the following total
number of text messages sent per day:
Expected Usual Number of Text Messages Sent in a Day =
0.278011805 × (150) + 8.172270273
= 49.87404 ≈ 50,
which is more than the average usual number of text messages sent by students per day.
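The prediction step can also be written as a small function, which makes it easy to try other allowance levels in class. This is a minimal sketch using the coefficients of the worked example; the function name `predict_texts` is an illustrative choice, not part of the lesson materials.

```python
SLOPE = 0.278011805       # text messages per peso, from the worked example
INTERCEPT = 8.172270273   # expected text messages at zero allowance
MEAN_TEXTS = 33.2963      # average usual number of text messages in the sample

def predict_texts(allowance):
    """Expected usual number of text messages for a daily allowance in pesos."""
    return SLOPE * allowance + INTERCEPT

prediction = predict_texts(150)
print(round(prediction, 5))       # 49.87404
print(prediction > MEAN_TEXTS)    # True: above the guess made without allowance data
```

Calling `predict_texts(90.37037)`, the mean allowance, returns approximately 33.2963, the mean number of text messages, which is another way to see that the line passes through the point of averages.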
In many cases, obtaining a regression fit gives a sensible way of estimating the y-value. If,
however, there are nonlinearities in the relationship between the variables, one may have to
transform the variables, say, by first taking the square roots or logarithms of the X and/or Y
variables, and then fit a regression model to the transformed variables. In this case, tell
students that one will eventually have to re-express the results of the analysis in terms of the
original units rather than the transformed units.
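The idea of transforming, fitting, and then returning to the original units can be sketched in Python. The data below are hypothetical and were constructed to follow y = 3x² exactly, so that the outcome is easy to verify; real data would not fit this neatly. The helper `fit_line` is an illustrative name. Its slope formula, the sum of products of deviations divided by the sum of squared x-deviations, is algebraically the same as r times the ratio of the standard deviations.

```python
from math import exp, log

def fit_line(xs, ys):
    """Slope and intercept of the regression line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical curved data: y = 3 * x**2 exactly.
xs = [1, 2, 4, 8, 16]
ys = [3, 12, 48, 192, 768]

# Step 1: transform both variables with logarithms.
log_xs = [log(x) for x in xs]
log_ys = [log(y) for y in ys]

# Step 2: fit a straight line to the transformed values.
slope, intercept = fit_line(log_xs, log_ys)

# Step 3: re-express predictions in the original units of y.
def predict(x):
    """Predict y in original units by undoing the log transformation."""
    return exp(intercept + slope * log(x))

print(round(slope, 6))        # 2.0
print(round(predict(10), 6))  # 300.0
```

On the log scale the points fall on a straight line with slope 2, and undoing the logarithm in `predict` returns the answer in the original units of y.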

(D) Enrichment: Sampling Distribution of Regression Coefficients


If you have extra time, you can ask students to individually compute the regression
coefficients based on the data they sampled in the last lesson and to share their results
with the class. Instruct the students to form the groups of five that they formed in the
previous lesson. Using the data that they used in constructing the scatterplot in the
previous session, ask them to compute the regression coefficients, i.e., the slope and the
intercept. Then, on the scatterplot that they constructed in the previous
session, instruct them to plot the regression line. Ask them to describe the
position of the line in light of the different points on the scatterplot. Are any of the points on
the line? Are all the points on the line? Is it necessary to have as many points as possible on the line?
Should you want to extend this further, have them draw vertical lines from the regression
line to the individual points. Ask them how they think these vertical lines are related to the
position of the regression line.
The regression line ought to be viewed as a sample regression line since we are only
working with sample data. This line is the best fitting line for predicting Y for any value of
X, in the sense of minimizing the distance between the data and the fitted line. By distance
here, we mean the sum of the squares of the vertical distances of the points to the line. Thus,
the resulting coefficients, slope and intercept, in the sample regression line are typically
called the least squares estimates (of the population regression line) or the least squares
regression coefficients.
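The least squares property can be checked numerically. The Python sketch below uses hypothetical data (the same kind of small made-up set a teacher might write on the board) and compares the sum of squared vertical distances for the regression line against the SD line and against a few lines obtained by nudging the slope and intercept. The names `sse`, `nudged`, and the nudge sizes are illustrative assumptions.

```python
from math import sqrt

def sse(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)

# Regression line: slope r * sd_y / sd_x through the point of averages.
reg_slope = r * sd_y / sd_x
reg_intercept = mean_y - reg_slope * mean_x

# SD line: slope sd_y / sd_x through the point of averages.
sd_slope = sd_y / sd_x
sd_intercept = mean_y - sd_slope * mean_x

sse_regression = sse(xs, ys, reg_slope, reg_intercept)
sse_sd_line = sse(xs, ys, sd_slope, sd_intercept)

# Lines obtained by nudging the regression slope and intercept a little.
nudged = [sse(xs, ys, reg_slope + ds, reg_intercept + di)
          for ds in (-0.1, 0.0, 0.1) for di in (-0.5, 0.0, 0.5)]

print(round(sse_regression, 3))              # 2.4
print(sse_regression <= sse_sd_line)         # True
print(all(v >= sse_regression - 1e-9 for v in nudged))  # True
```

None of the alternative lines tried here achieves a smaller sum of squared vertical distances than the regression line, which is what "best fitting" means in the paragraph above. Trying a handful of lines is of course not a proof, only a demonstration.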
In the next lesson, we will state the assumptions that underlie the fitting of a regression line
and the generation of these least squares estimates. Such assumptions will enable us to
proceed to making statistical inferences, i.e., hypothesis tests and confidence intervals, on the
regression coefficients.
KEY POINTS

The regression model suggests that for every increase in one unit of an independent
variable x, we expect a change of $r \, \sigma_y / \sigma_x$ units in a dependent variable y, where $\sigma_y$ is
the standard deviation of the y-values (with the data treated as a population), $\sigma_x$ is the
standard deviation of the x-values (with the data treated as a population), and $r$ is the
correlation coefficient.

The point of averages is a point on the estimated regression line.

The regression line may be used to make predictions. Given the value x for an independent
variable X, we expect or predict Y to take the value

$$ y = r \frac{\sigma_y}{\sigma_x} x + \left( \bar{y} - r \frac{\sigma_y}{\sigma_x} \bar{x} \right) $$

where $\bar{x}$ and $\bar{y}$ are respectively the mean of the x-values and y-values.

REFERENCES
Much of the material here is adapted from:
Text Messaging is Time Consuming! What Gives? by Jeanie Gibson, Mary McNelis, and Anna
Bargagliotti, STatistics Education Web (STEW). Available on the Internet at
https://www.amstat.org/education/stew/pdfs/TextMessagingisTimeConsumingWhatGives.doc

See also:
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo
Patungan, Nelia Marquez). Rex Bookstore.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Freedman, D., Pisani, R., and Purves, R. (2007). Statistics. Fourth Edition. W. W. Norton &
Company, New York.
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Banos, College, Laguna
4031.

ACTIVITY SHEET 6-02


Make use of the data set you used for Activity Sheet 6-01, pertaining to a random sample of 30
observations from the database collected at the beginning of the Statistics and Probability course.
Individually carry out the following steps:
1. Compute for the sample regression line

Y = ___________ X + _____________

2. Provide an interpretation for the estimated slope of the sample regression line

3. Give an interpretation of the estimated y-intercept of the sample regression line

4. Illustrate how to use the sample regression line you generated to predict Y for a given
level of X. (Make sure to agree with group mates what X is)

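5. Collect the regression coefficients and predictions found by each person in the class into
a table:

Question Y: _________________________ vs. X: _________________________________

Student   Slope      Intercept   Prediction for Y given X = ___
1         ________   ________    ________
2         ________   ________    ________
3         ________   ________    ________
4         ________   ________    ________
5         ________   ________    ________
6         ________   ________    ________
7         ________   ________    ________
8         ________   ________    ________
9         ________   ________    ________
10        ________   ________    ________
11        ________   ________    ________
12        ________   ________    ________
13        ________   ________    ________
14        ________   ________    ________
15        ________   ________    ________
16        ________   ________    ________
17        ________   ________    ________
18        ________   ________    ________
19        ________   ________    ________
20        ________   ________    ________
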

6. Create a dot plot for each of the regression coefficients (slope and intercept) and for the
prediction for Y given X = ____ (Note that three dot plots will be created).

7. Look at the dot plot for the slope. This dot plot represents an approximation to the
sampling distribution of the estimated slopes. What do you notice about the dot plot?
What is the range of the estimated slopes? What seems to be the most common slope? If
you had to guess what the slope of the regression line was for the entire population, what
would you guess? Explain why.

8. Repeat number 7 for the intercept.

9. Repeat number 7 for the prediction for Y given X = ____
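Teachers who wish to preview what steps 5 to 7 might look like can simulate the activity. The Python sketch below is only an illustration under stated assumptions: the population of 500 (x, y) pairs is generated artificially (it is NOT the class database), the relationship y = 8 + 0.28x plus random noise is invented, and the helper name `fit_line` is hypothetical. Twenty simulated "students" each draw a random sample of 30 and compute a slope, and the slopes are then displayed as a crude text dot plot.

```python
import random

def fit_line(xs, ys):
    """Least squares slope and intercept of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(2024)  # fixed seed so the run is repeatable

# Hypothetical population of 500 (x, y) pairs, invented for this sketch.
population = []
for _ in range(500):
    x = rng.uniform(20, 300)
    y = 8 + 0.28 * x + rng.gauss(0, 10)
    population.append((x, y))

pop_slope, pop_intercept = fit_line([p[0] for p in population],
                                    [p[1] for p in population])

# Twenty "students", each drawing a random sample of 30 and fitting a line.
slopes = []
for _ in range(20):
    sample = rng.sample(population, 30)
    slope, intercept = fit_line([p[0] for p in sample], [p[1] for p in sample])
    slopes.append(slope)

# Crude text dot plot: one dot per sample, grouped by slope rounded to 2 decimals.
counts = {}
for s in slopes:
    key = round(s, 2)
    counts[key] = counts.get(key, 0) + 1
for key in sorted(counts):
    print(f"{key:5.2f} " + "." * counts[key])

print("population slope:", round(pop_slope, 3))
print("range of sample slopes:", round(min(slopes), 3), "to", round(max(slopes), 3))
```

The sample slopes differ from one another but cluster near the population slope, which is the pattern students are asked to look for in their own dot plots.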

ASSESSMENT 6-02
1. In a regression line, the Y-intercept represents the
a) predicted value of Y when X = 0.
b) change in estimated average Y per unit change in X.
c) predicted value of Y.
d) variation around the sample regression line.
ANSWER: a
2. In a regression line, the slope represents
a) the predicted value of Y when X = 0.
b) the estimated average change in Y per unit change in X.
c) the predicted value of Y.
d) the variation around the line of regression.
ANSWER: b

Case 1 (For items 3 - 5 ) : A candy bar manufacturer is interested in trying to estimate how sales are
influenced by the price of their product. To do this, the company randomly chooses 6 cities and offers
the candy bar at different prices. Using candy bar sales as the dependent variable, the company will
conduct a simple linear regression on the data below:
City             Price (PHP)   Sales
Los Banos        39            100
Legazpi          48            90
Cagayan de Oro   54            90
Davao            60            40
Cebu             72            38
Makati           87            32


3. Referring to Case 1, what is the estimated average change in the sales of the candy bar if price
goes up by 1 peso?
a) 161.386
b) 0.784
c) 3.810
d) -1.606426
ANSWER: d

4. Referring to Case 1, what is the coefficient of correlation for these data?
a) -0.8854
b) -0.7839
c) 0.7839
d) 0.8854
ANSWER: a

5. Referring to Case 1, if the price of the candy bar is set at 60 pesos, the estimated average sales
will be
a) 30
b) 65
c) 90
d) 100
ANSWER: b
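The answers to items 3 to 5 can be verified directly from the six data pairs in Case 1. The following Python sketch is one way to do the check; the variable names are illustrative. Note that the correlation comes out negative, since sales tend to fall as the price rises.

```python
from math import sqrt

# Data from Case 1.
prices = [39, 48, 54, 60, 72, 87]
sales = [100, 90, 90, 40, 38, 32]

n = len(prices)
mean_price = sum(prices) / n
mean_sales = sum(sales) / n

# Sums of squares and cross products of the deviations from the means.
sxy = sum((x - mean_price) * (y - mean_sales) for x, y in zip(prices, sales))
sxx = sum((x - mean_price) ** 2 for x in prices)
syy = sum((y - mean_sales) ** 2 for y in sales)

slope = sxy / sxx                       # item 3
r = sxy / sqrt(sxx * syy)               # item 4
intercept = mean_sales - slope * mean_price
predicted_at_60 = slope * 60 + intercept  # item 5

print(round(slope, 6))            # -1.606426
print(round(r, 4))                # -0.8854
print(round(predicted_at_60, 2))  # 65.0
```

Since the mean price is exactly 60 pesos, the prediction in item 5 is simply the mean of the sales values, which again illustrates that the regression line passes through the point of averages.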

II. A study was done to investigate the relationship between the amount of protix (a new
protein-vitamin-mineral supplement) in fortified-vitamin rice, known as FVR, and the gain in weight of
children. Ten randomly chosen sections of grade one pupils were fed with FVR containing
protix; different amounts X of protix were used for the 10 sections. The increase in the weight of
each child was measured after a given period. The average gain Y in weight for each section
with a prescribed protix level X is as follows:

Section   Protix   Gain
1         50       92.6
2         60       97.5
3         70       96.5
4         80       102.3
5         90       105.8
6         100      106.2
7         110      108.9
8         120      108.4
9         130      110.2
10        140      110.8


a. Obtain the sample regression line to predict the average gain in weight given the protix
level
ANSWER: Estimated Average weight gain = 0.2014546 (Protix) + 84.78182
b. How would you predict the average gain in weight to be at a protix level of 125?
ANSWER: Using the regression line at Protix = 125, the estimated Average weight gain is
0.2014546 (125) + 84.78182 = 109.96, or about 110

III. At a large local high school, the principal wanted to ensure that her students would perform
well on this year's standardized tests. As such, the principal came up with a list of factors that
may negatively or positively impact test scores and aimed to demonstrate their effect to the students by giving
a practice test out of 100 points. A month before the practice test, the principal asked students to
fill out a survey asking them how many hours per week they hung out with their friends and how
many hours per week they spent in study hall. Because the high school was very large, the
principal only surveyed a sample of the students. The following two scatterplots show
the results of the survey versus the students' scores on the practice exam.

[Scatter plot (Collection 1): practice test score (vertical axis, 50 to 110) versus Hours_With_Friends (horizontal axis, 0 to 35); fitted line Y = -2.69X + 122.87, R2 = .71]

[Scatter plot (Collection 1): practice test score (vertical axis, 50 to 110) versus Hours_in_Study_Hall (horizontal axis, 0.0 to 3.5); fitted line Y = 2.85X + 76.183, R2 = .02]


Based on these two scatterplots, answer the following questions.


1. Is there a positive or negative relationship between the hours a student spends with their
friends and their test scores? Hours spent in study hall and their test scores?
2. On average, what would a student score if they spent zero hours per week hanging out
with friends? In study hall?
3. On average, how many points on the test would a student increase/decrease if they spent
1 extra hour in study hall? Hanging out with friends?
When the students heard the results of the study, they asked the principal to look at different
samples of students in the high school. To satisfy the students, the principal decided to randomly
sample groups of 20 students at a time 15 more times. The following dot plots provide the
summary of the results.

[Dot plot: Hours in Study Hall; horizontal axis: Slope, from 1.5 to 4.5]

[Dot plot: Hours with Friends; horizontal axis: Slope, from -4.0 to -1.5]


Based on the dot plots above, answer the following questions:


4. Should the students believe that the principal's decision to mandate an extra hour of study
hall every week would increase their scores on the test? Explain.
5. Should the students try to decrease the number of hours they spend hanging out with
friends before the test? Explain.
Answers
1. There appears to be a negative linear relationship between the amount of time a student
spends hanging out with their friends and their test scores. There does not seem to be a
clear positive or negative relationship between the number of hours spent in study hall
and the test scores.
2. On average, a student would score 122.87 on the test if they spent zero hours per week
hanging out with friends. This y-intercept does not have a practical interpretation since
there is no way to score more than 100 on the test. Also note that 0 is not within the
range of the collected data values for hours spent with friends. On average, a student
would score 76.183 on the test if they spent zero hours per week in study hall.
3. On average, a student's score will change by -2.69 points for every hour they spend
hanging out with friends. On average, a student's score will increase by 2.85 points for
every hour they spend in study hall.
4. The dot plot illustrates that all the sampled slopes are positive. This means that for every
one of the 50 samples of 20 subjects sampled, the slope of the regression line was
positive showing that as the number of hours of study hall increases, the scores on the test
increase. In particular, the dot plot shows that the slopes tend to be for the most part
between 2.6 and 3.6, meaning that on average scores would be raised between 2.6 and 3.6
for every hour extra spent in study hall.
5. The dot plot illustrates that all the sampled slopes are negative. This means that for every
one of the 50 samples of 20 subjects sampled, the slope of the regression line was
negative, showing that as the number of hours spent with friends increases, the scores
on the test decrease. In particular, the dot plot shows that the slopes tend to be centered
around -2.5, meaning that on average scores would change by about -2.5 points for every
extra hour spent hanging out with friends.