
U2-1

Examining Relationships
In statistics, we often want to compare two (or
more) different populations with respect to the
same variable.

We use tools such as side-by-side boxplots to make comparisons between the samples.
U2-2
Examining Relationships
Often, however, we wish to examine relationships
between several variables for the same population.

When we are interested in examining the relationship between two variables, we may find ourselves in one of two situations:
• We may simply be interested in the nature of the relationship.
• One of the variables may be thought to explain or predict the other.
U2-3
Explanatory and Response Variables
In this second case, one of the variables is an explanatory variable (which we denote by X) and the other is a response variable (denoted by Y).

A response variable takes values representing the outcome of a study, while an explanatory variable helps explain this outcome.
U2-4
Example
Does the caffeine in coffee really help keep you
awake? Researchers interviewed 300 adults and
asked them how many cups of coffee they drink
on an average day, as well as how many hours of
sleep they get at night.

The response variable Y is the hours of sleep, while the explanatory variable X is the number of cups of coffee per day.
U2-5
Example
Are students who excel in English also good at
math, or are most people strictly left- or right-
brained? A psychology professor locates 450
students at a large university who have taken the
same introductory English course and the same
Math course and compares their percentage
grades in the two courses at the end of the
semester.

In this case, there is no explanatory or response variable; we are simply interested in the nature of the relationship.
U2-6
Scatterplots
The best way to display the relationship between
two quantitative variables is with a scatterplot.

A scatterplot displays the values of two different quantitative variables measured on the same individuals. The data for each individual (for both variables) appear as a single point on the scatterplot. If there is an explanatory and a response variable, they should be plotted on the x- and y-axes, respectively. Otherwise, the choice of axes is arbitrary.
U2-7
Example
Consider the relationship between the number of
classes a student misses during the term and his or
her final exam score. The table on the following
page gives the values for both variables for a sample
of eight students.
U2-8
Example
Student   Classes Missed   Exam Score
   1            5              60
   2            2              95
   3            6              73
   4           10              56
   5            1              81
   6            8              45
   7            4              82
   8            2              78
U2-9
Example
The scatterplot for these data is shown below:
U2-10
Scatterplots
We look for four things when examining a scatterplot:

1) Direction
• In this case, there is a negative association between the two variables. An above-average number of classes missed tends to be accompanied by a below-average exam score, and vice versa. If the pattern of points slopes upward from left to right, we say there is a positive association.
U2-11
Scatterplots
2) Form
• A straight line would do a fairly good job
approximating the relationship between the
two variables. It is therefore reasonable to
assume that these two variables share a
linear relationship.
U2-12
Scatterplots
3) Strength
• The strength of the relationship is determined
by how close the points lie to a simple form
such as a straight line. In our example, if we
draw a line which roughly approximates the
relationship between the two variables, all
points will fall quite close to the line. As such,
the linear relationship is quite strong.
U2-13
Scatterplots
3) Strength (cont’d)
• Not all relationships are linear in form. They
can be quadratic, logarithmic or exponential,
to name a few. Sometimes the points appear
to be “randomly scattered”, in which case
many of them will fall far from a line used to
approximate the relationship. In this case, we
say the linear relationship between the two
variables is weak.
U2-14
Scatterplots
4) Outliers
• There are several types of outliers for bivariate
data. An observation may be outlying in either
the x- or y-directions (or both). Another type of
outlier occurs when an observation simply falls
outside the general pattern of points, even if it is
not extreme in either the x- or y-directions.
Some types of outliers have more of an impact
on our analysis than others, as we will discuss
shortly.
R Code
> missed <- c(5, 2, 6, 10, 1, 8, 4, 2)
> score <- c(60, 95, 73, 56, 81, 45, 82, 78)
> plot(missed, score)
U2-15
Strength of Linear Relationship
The STAT 1000 and STAT 2000 percentage grades
for a sample of students who have taken both
courses are displayed in the scatterplot below:
[Scatterplot: STAT 1000 grade (x-axis, 40 to 100) vs. STAT 2000 grade (y-axis, 10 to 100)]
U2-16
Strength of Linear Relationship
The scatterplot shows a moderately strong positive
linear relationship. Does the relationship for the
data in the following scatterplot appear stronger?
[Scatterplot: the same STAT 1000 vs. STAT 2000 data, with both axes extended from 0 to 140]
U2-17
Strength of Linear Relationship
It might, but these are the same data; the scatterplots
are just constructed with different scales!
[Side-by-side scatterplots: the original plot (axes 40 to 100) and the rescaled plot (axes 0 to 140), displaying the same data]
U2-18
Strength of Linear Relationship
This example shows that our eyes are not the best tools to assess the strength of relationship between two quantitative variables.

Can we find a numerical measure that will give us a concrete description of the strength of a linear relationship between two quantitative variables?

The measure we use is called correlation.
U2-19
Correlation Coefficient
The correlation coefficient r measures the direction
and strength of a linear relationship between two
quantitative variables.

Suppose the values of two quantitative variables X and Y have been measured for n individuals. Then

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{1}{(n-1)\,s_x s_y}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$
U2-20
Correlation
We will use the second version of the formula, as
it is computationally simpler. To calculate the
correlation r:
(i) Calculate x̄, ȳ, sx and sy.
(ii) Calculate the deviations xᵢ − x̄ and yᵢ − ȳ.
(iii) Multiply the corresponding deviations for x and y: (xᵢ − x̄)(yᵢ − ȳ).
(iv) Add the n products: Σ(xᵢ − x̄)(yᵢ − ȳ).
(v) Divide by (n − 1)sxsy:

$$r = \frac{1}{(n-1)\,s_x s_y}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$
U2-21
Correlation
For the Classes Missed and Exam Score example,
(i) x̄ = 4.75, ȳ = 71.25, sx ≈ 3.151, sy ≈ 16.351

  xi    yi   (ii) xi − x̄   yi − ȳ   (iii) (xi − x̄)(yi − ȳ)
   5    60        0.25     −11.25        −2.8125
   2    95       −2.75      23.75       −65.3125
   6    73        1.25       1.75         2.1875
  10    56        5.25     −15.25       −80.0625
   1    81       −3.75       9.75       −36.5625
   8    45        3.25     −26.25       −85.3125
   4    82       −0.75      10.75        −8.0625
   2    78       −2.75       6.75       −18.5625
              sum = 0     sum = 0   (iv) sum = −294.5
U2-22
Correlation
(v) r = −294.5 / [(8 − 1)(3.151)(16.351)] ≈ −0.8166

Note that some software programs display only the value of r². If there is a positive association, r is the positive square root of r², and if there is a negative association, r is the negative square root of r².
R Code
> cor(missed, score)
[1] -0.8165786

Calculations in R will often differ slightly from our calculations, as R carries more decimal places.
U2-23
Association vs. Causation
We must be careful when interpreting correlation.
Despite the very strong negative correlation, we
cannot conclude that missing more classes causes
a student’s grade to decrease.

There are many other variables that could help explain the strong relationship between Classes Missed and Exam Score. One such variable is the effort of a student.
U2-24
Lurking Variable
Students who put more effort into the course
generally miss fewer classes. We also know that
exam scores tend to be higher for more dedicated
students.

The effort of a student in this example is known as a lurking variable. A lurking variable is one that helps explain the relationship between variables in a study, but which is not itself included in the study.
U2-25
Association vs. Causation
Regardless of the existence of identifiable lurking variables, we must remember that correlation measures only the linear association between two quantitative variables. It gives us no information about the causal nature of the relationship.

Association does not imply causation!
U2-26
Correlation
Some properties of correlation:
• Positive values of r indicate a positive association and negative values indicate a negative association.
• r falls between –1 and 1, inclusive. Values of r close to –1 or 1 indicate a strong linear association (negative or positive, respectively). A correlation of –1 or 1 is obtained only in the case of a perfect linear relationship, i.e., when all points fall on a straight line. Values of r close to zero indicate a weak linear relationship.
U2-27
Correlation
Some properties of correlation (cont’d):
• r has no units (i.e., it is just a number).
• The correlation makes no distinction between X and Y. As such, an explanatory and response variable are not necessary.
• Changing the units of X and Y has no effect on the correlation, i.e., it doesn’t matter if we measure a variable in pounds or kilograms, feet or metres, dollars or cents, etc. (see the sketch below).
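For example, a quick check in R that rescaling the units leaves r unchanged; the data vectors here are made up for illustration:

R Code
> x1 <- c(150, 162, 175, 148, 180)   # hypothetical weights in pounds
> y1 <- c(60, 66, 74, 59, 78)
> cor(x1, y1)
> cor(x1 * 0.4536, y1)   # pounds converted to kilograms: same correlation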
U2-28
Correlation
Some properties of correlation (cont’d):
• r measures only the strength of a linear relationship. In other cases, it is a useless measure.
• Because the correlation is a function of several measures that are affected by outliers, r is itself strongly affected by outliers.
U2-29
Linear Regression
When a relationship appears to be linear in nature,
we often wish to estimate this relationship between
variables with a single straight line.

A regression line is a straight line that describes how a response variable Y changes as an explanatory variable X changes. This line is often used to predict values of Y for given values of X.
U2-30
Linear Regression
Note that with correlation, we didn’t require a
response variable and an explanatory variable.

In regression, we always have an explanatory variable X and a response variable Y.

Given a value of X, we would like to predict the corresponding value of Y. Unless there is a perfect relationship, we won’t know the exact value of Y, because Y is a variable.
U2-31
Regression Line
We will use a sample to estimate the true relationship
between the two variables. Our estimate of the “true
line” is
ŷ = b0 + b1x

ŷ is the predicted value of Y for a given value of X. b0 is the intercept of the line and b1 is the slope.

We will use this regression line to make our predictions.
U2-32
Regression Line
We would like to find the line that fits our data the
best. That is, we need to find the appropriate values
of b0 and b1.

But there are infinitely many possible lines. Which one is the “best” line?

Since we are using X to predict Y, we would like the line to lie as close to the points as possible in the vertical direction.
U2-33
Regression Line
The line we will use is the line that minimizes the sum of squared deviations in the vertical direction:

$$\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

[Scatterplot: the fitted line, with the vertical deviation between an observed value yᵢ and its predicted value ŷᵢ marked]
U2-34
Least Squares Regression
The values of b0 and b1 that give us the line that minimizes this sum of squared deviations are:

b1 = r(sy / sx)  and  b0 = ȳ − b1x̄

The line ŷ = b0 + b1x is called the least squares regression line, for obvious reasons.
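These formulas translate directly into R, and the result can be checked against R’s built-in lm() function; a sketch using the missed and score data from earlier:

R Code
> b1 <- cor(missed, score) * sd(score) / sd(missed)
> b0 <- mean(score) - b1 * mean(missed)
> c(b0, b1)
> coef(lm(score ~ missed))   # same intercept and slope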
U2-35
Slope
The slope of the regression line, b1, is defined as the predicted change in y when x increases by one unit.
b1 = Δŷ / Δx

[Scatterplot: the regression line, with a horizontal change Δx and the corresponding vertical change Δŷ marked]
U2-36
Intercept
The intercept of the regression line, b0, is defined as
the predicted value of y when x = 0.
[Scatterplot: the regression line crossing the y-axis at height b0 when x = 0]
U2-37
Coefficient of Determination r²
Some variability in Y is accounted for by the fact
that, as X changes, it pulls Y along with it. The
remaining variation is accounted for by other
factors (which we usually don’t know).

The value of r² has a special meaning in least squares regression. It is the fraction of variation in Y that is accounted for by its regression on X.
U2-38
Coefficient of Determination r²
If r = –1 or 1, then r² = 1. That is, we can predict Y exactly for any value of X, as regression on X accounts for all of the variation in Y.

If r = 0, then r² = 0, and so regression on X tells us nothing about the value of Y.

Otherwise, r² is between 0 and 1.
U2-39
Example
Can the monthly rent for an apartment be predicted by
the size of the apartment? The size X (in square feet)
and the monthly rent Y (in $) are recorded for a
sample of ten apartments in a large city. The data
are shown below:

X    770   650   925   850   575   860   800  1000   730   900
Y   1270   990  2230  1295   860  1925  1575  1790  1580  1550
U2-40
Example
The scatterplot for these data is shown below:
U2-41
Example
We see a strong positive linear relationship between Size and Rent. From the data, we calculate

x̄ = 806, ȳ = 1506.5, sx ≈ 129.2, sy ≈ 418.5 and r = 0.8031

And so

b1 = r(sy / sx) = 0.8031(418.5 / 129.2) ≈ 2.60  and  b0 = ȳ − b1x̄ = 1506.5 − 2.60(806) = −589.10
U2-42
Example
The equation of the least squares regression line is therefore ŷ = −589.10 + 2.60x. The line is shown on the scatterplot below:
R Code
> Size <- c(770, 650, 925, 850, 575, 860, 800,
+           1000, 730, 900)
> Rent <- c(1270, 990, 2230, 1295, 860, 1925,
+           1575, 1790, 1580, 1550)
> lm(Rent ~ Size)

R Code
> plot(Size, Rent)
> abline(lm(Rent ~ Size), col = "red")
U2-43
Example
The slope b1 = 2.60 tells us that, when the size of an
apartment increases by one square foot, we predict the
monthly rent to increase by $2.60.
The intercept b0 = –589.10 has no practical meaning in this case. An apartment cannot have a size of 0 square feet, and a negative rent is impossible.
We also see that r² = (0.8031)² ≈ 0.645, which tells us that 64.5% of the variation in an apartment’s monthly rent is accounted for by its regression on size.
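In R, r² can be read from the regression summary, or computed by squaring the correlation; a sketch using the Size and Rent vectors from above:

R Code
> summary(lm(Rent ~ Size))$r.squared
> cor(Size, Rent)^2   # same value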
U2-44
Example
We can now use this line to predict the monthly rent
for an apartment of a given size.

To do this, we simply plug the size of an apartment into the equation of the least squares regression line. For example, the predicted monthly rent for an 860 square foot apartment is

ŷ = −589.10 + 2.60(860) = $1646.90
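The same prediction can be obtained in R with the predict() function; a minimal sketch:

R Code
> model <- lm(Rent ~ Size)
> predict(model, newdata = data.frame(Size = 860))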
U2-45
Predicted Value of Y
We call this the predicted value of Y when X = 860.
U2-46
Residuals
Note that there is an 860 square foot apartment in
the sample. How does the actual monthly rent for
this apartment compare with the predicted rent?

The monthly rent for this apartment is $1925, which is $278.10 higher than we would have predicted it to be from our regression line.

The value yᵢ − ŷᵢ is called the residual for the ith observation. The residual for any value of X reflects the error of our prediction.
R Code
> lm(Rent ~ Size)$residuals
U2-47
Residuals
residual = actual value of y − predicted value of y

[Scatterplot: a point above the regression line, with the actual value, the predicted value and the vertical gap between them (the residual) labelled]
U2-48
Residuals
A positive residual indicates that an observation falls
above the regression line and a negative residual indicates
that it falls below the line. As an example, check that the
residual for the 770 square foot apartment in the sample is
equal to –142.90.
Note that it is in fact the sum of squared residuals that
is minimized in calculating the least squares regression
line.
What if we want to predict the monthly rent for a 1250 square foot apartment? Our predicted value is

ŷ = −589.10 + 2.60(1250) = $2660.90
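Both calculations can be checked in R; a sketch (the results may differ slightly from the hand calculations, which use rounded coefficients):

R Code
> lm(Rent ~ Size)$residuals[1]   # residual for apartment 1 (770 sq ft)
> predict(lm(Rent ~ Size), newdata = data.frame(Size = 1250))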
U2-49
Extrapolation
Mathematically, there is no problem with making this
prediction. However, there is a statistical problem.
Our range of values for X is from 575 to 1000 square
feet. We have good evidence of a linear relationship
within this range of values. However, we have no
apartments in our sample as large as 1250 square feet,
and so we have no idea whether this relationship
continues to hold outside our range of data.
The process of predicting a value of Y for a value of X
outside our range of data is known as extrapolation,
and should be avoided if at all possible.
U2-50
Transformations
The values of some explanatory variable X and some
response variable Y are measured on a sample of
individuals. The data are shown below:

X    2    3    5    8   10   14   15   18   21
Y   88  234   67  228  841 1621  904 1017 2809

X   23   27   32   36   40   45
Y 2154 5327 4118 6715 9063 8664
U2-51
Transformations
A scatterplot of the data is shown below:

The relationship between X and Y appears to be nonlinear. It appears that the relationship may be parabolic.
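Following the pattern of the earlier examples, the data can be entered and plotted in R as follows; the vector names x and y are our own choice:

R Code
> x <- c(2, 3, 5, 8, 10, 14, 15, 18, 21, 23, 27, 32, 36, 40, 45)
> y <- c(88, 234, 67, 228, 841, 1621, 904, 1017, 2809,
+        2154, 5327, 4118, 6715, 9063, 8664)
> plot(x, y)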
U2-52
Transformations
Let us examine instead the relationship between X and the transformed variable

Y* = √Y

The relationship between X and Y* appears to be linear.
U2-53
Transformations
We fit the least squares regression line to the transformed data. The equation of the regression line for this transformed relationship is

ŷ* = 3.44 + 2.12x
U2-54
Transformations
Now suppose we want to predict the value of Y when X = 25. We first find the predicted value of Y* using the regression line for this transformed relationship:

ŷ* = 3.44 + 2.12(25) = 56.44

Now we can back-transform to find the predicted value of Y:

ŷ = (56.44)² ≈ 3185.5
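A sketch of the whole analysis in R, assuming the square-root transformation above and the x and y vectors entered earlier:

R Code
> ystar <- sqrt(y)          # transform the response
> model2 <- lm(ystar ~ x)   # fit the line to the transformed data
> coef(model2)
> # predict Y* at x = 25, then back-transform by squaring
> predict(model2, newdata = data.frame(x = 25))^2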
U2-55
Outliers
We have seen that an outlier can be defined as a point
that is far from the other data points in the x-direction
or the y-direction, or if it falls outside the general
pattern of points.

We now examine the effect of each of these three types of outliers.
U2-56
Outliers
Point # 1 is an outlier in the y-direction. It generally
has little effect on the regression line.
[Scatterplot: Point #1 lies far above the other points in the y-direction; the regression line is largely unaffected]
U2-57
Outliers
Point # 2 is not an outlier in either the x- or
y-directions, but falls outside the pattern of points.
[Scatterplot: Point #2 lies below the overall pattern of points, but is not extreme in either the x- or y-direction]
U2-58
Outliers
A bivariate outlier such as this generally has little
effect on the regression line.
[Scatterplot: with Point #2 included, the regression line is largely unchanged]
U2-59
Outliers
Point # 3 is an outlier in the x-direction. It has a
strong effect on the regression line.
[Scatterplot: Point #3 lies far to the right in the x-direction and pulls the regression line strongly toward it]
U2-60
Influential Observations
An observation is called influential if removing it
from the data set would dramatically alter the
position of the regression line (and the value of r²).

In the above illustration, Point #3 is an influential observation, which is often the case for outliers in the x-direction.
U2-61
Influential Observations
In our example, suppose the size of the largest
apartment was 1500 square feet instead of 1000
square feet, and the monthly rent was still $1790.
The equation of the regression line changes to

ŷ = 705.6 + 0.94x

In addition, the value of r² reduces to 0.316. The outlying value has had a strong effect on the equation of the line and the value of r².
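This effect can be reproduced in R by modifying the outlying value and refitting; a sketch:

R Code
> Size2 <- Size
> Size2[8] <- 1500   # the 1000 sq ft apartment becomes 1500 sq ft
> lm(Rent ~ Size2)
> summary(lm(Rent ~ Size2))$r.squared   # drops to about 0.316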
U2-62
Influential Observations
We see that, with the outlier included, the regression
line is a less accurate description of the relationship.

[Scatterplot: two least squares regression lines shown, labelled “LSR line with outlier excluded” and “LSR line with outlier included”]
U2-63
Least Squares Regression
One property of the least squares regression line is that it always passes through the point (x̄, ȳ). Consider our previous example for the regression of Rent vs. Size of an apartment. The mean size of the apartments in the sample was 806 square feet. The predicted monthly rent for an apartment of this size is

ŷ = −589.10 + 2.60(806) = $1506.50

which is exactly equal to the mean monthly rent for the apartments in the sample.
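A quick R check of this property; a sketch using the Size and Rent data:

R Code
> predict(lm(Rent ~ Size), newdata = data.frame(Size = mean(Size)))
> mean(Rent)   # the two values agree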
U2-64
Association vs. Causation
Recall our discussion of association vs. causation.
The former does not imply the latter. In the
apartment example, there was a strong positive
relationship between the size of an apartment and
its monthly rent. However, this doesn’t mean an
apartment being larger causes its monthly rent to
be higher. This was an observational study, and
so the observed relationship may be due to one or
more lurking variables. For example, perhaps
apartments in nicer, more expensive parts of the
city are larger. Then the neighbourhood where an
apartment is located might be a lurking variable.
U2-65
Experiment vs. Observational Study
The best way to avoid lurking variables is to perform an
experiment rather than an observational study.

In an experiment, the value of the explanatory variable is randomly “assigned” to the sample units, rather than being simply observed prior to the study.

For example, consider the issue of drug use among teenagers. Does marijuana use cause teenagers to try harder illegal drugs?
U2-66
Experiment vs. Observational Study
A national drug study examined a sample of American
cities. In these cities, the percentage X of teenagers who
have tried marijuana, and the percentage Y of teenagers
who have tried hard drugs were recorded. The correlation
between X and Y was calculated to be r = 0.85. But this
doesn’t mean that using marijuana causes teenagers to use
hard drugs. (We don’t even know if the teens using
marijuana are the same ones who are using other drugs.)

There are other possible lurking variables that we are not considering. One possible example of a lurking variable is the availability of drugs in different cities. Teenagers in cities where drugs are more easily available may be more likely to try them.
U2-67
Experiment vs. Observational Study
If we really wanted to know if marijuana use among teens
causes hard drug use, we would need to perform an
experiment. We would have to get a large number of
teenagers who have never tried marijuana or other drugs
to volunteer to participate in the study. We would
randomly assign half of the volunteers to start smoking
marijuana, and the other half would continue not to use it.
After two years, we could determine whether each
volunteer subsequently used hard drugs. If we still see a
strong positive association, we can then say that marijuana
use does in fact cause hard drug use.
U2-68
Experiment vs. Observational Study
The reason for this is that random assignment balances the two groups (those who use marijuana and those who don’t) with respect to all possible lurking variables.

For example, some teenagers who live in cities where drugs are easily available will be assigned to use marijuana, while others won’t. The same will be true for teenagers who live in cities where drugs are not easily available.
U2-69
Experiment vs. Observational Study
This example provides a good illustration that it is
not always possible to perform an experiment rather
than an observational study. It is not realistic to
expect to find a group of teenagers who have never
tried marijuana who are willing to start using it.

Note, however, that this doesn’t mean observational studies are “bad”. We must just remember that association does not imply causation!
U2-70
Categorical Variables on a Scatterplot
Sometimes a scatterplot may actually be displaying
two or more distinct relationships.
For example, the Average Driving Distance X and
the Average Score Y are recorded for a sample of
professional golfers. (A “drive” is a golfer’s first
shot on a golf hole).
U2-71
Categorical Variables on a Scatterplot
The data are plotted on the scatterplot below. The
relationship does not appear to be linear, but…
U2-72
Categorical Variables on a Scatterplot
This scatterplot is actually displaying two distinct
linear relationships, one for male golfers and one
for female golfers.
U2-73
Categorical Variables on a Scatterplot
This example illustrates that we should be careful
when examining a relationship to make sure that
the data belong to only one population. In this
case, a separate regression line should be fit to the
data for the male and female golfers.
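A sketch of how such grouped data might be handled in R; the vectors drive, score and sex are hypothetical stand-ins for the golfer data, which are not given here:

R Code
> # plot the two groups with different symbols
> plot(drive, score, pch = ifelse(sex == "M", 1, 2))
> # fit and draw a separate regression line for each group
> abline(lm(score[sex == "M"] ~ drive[sex == "M"]))
> abline(lm(score[sex == "F"] ~ drive[sex == "F"]), lty = 2)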
