MAT 2377 B
Chapter 6 - Correlation and Regression
In this chapter, we discuss how to describe the association between two quantitative variables.
Goals:
• Construct a visual representation of the association between the two variables.
• Learn to describe the association.
• Introduce statistics that describe the strength of a linear association between two quantitative variables (the sample covariance and the sample correlation).
• Describe the linear association between two quantitative variables with a linear model. This is called regression analysis.
6.1 Sample Covariance and Correlation
In this section, we introduce some techniques that describe the association between
quantitative variables.
Scatter Plot
We will use a scatter plot to study the association between two random variables
X and Y . To do this, consider n paired observations (xi , yi ), for i = 1, . . . , n, from a
pair (X, Y ) of random variables.
Note: We place x on the horizontal axis and y on the vertical axis.
In R, we use the following command:
plot(x,y)
The following figure illustrates linear associations.
For each scatter plot,
• we display a horizontal line at ȳ (the sample mean of the yi);
• we display a vertical line at x̄ (the sample mean of the xi).
Note that these lines define four quadrants.
Remarks:
If there is a positive linear association between X and Y , then most of the points
are going to lie in quadrants I and III.
If there is a negative linear association between X and Y , then most of the points
are going to lie in quadrants II and IV.
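A minimal R sketch (assuming x and y are numeric vectors of paired observations) that draws the scatter plot together with the two quadrant-defining lines:

# scatter plot of y against x
plot(x, y)
# horizontal line at the sample mean of y and vertical line at the sample mean of x;
# together they split the plot into the four quadrants discussed above
abline(h = mean(y))
abline(v = mean(x))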
Sample Covariance
To describe the linear association between the two variables, we can use the sample
covariance
$$
c_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}
       = \frac{\left(\sum_{i=1}^{n} x_i y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)/n}{n-1}.
$$
It is positive for positive linear associations and negative for negative linear associations; that is, the covariance captures the direction (sign) of a linear association.
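As a hedged sketch (again assuming x and y are numeric vectors), the sample covariance can be computed from either form of the formula and checked against the built-in cov function:

n <- length(x)
# deviation form: sum of cross-products of deviations, divided by n - 1
c_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
# computational form
c_xy2 <- (sum(x * y) - sum(x) * sum(y) / n) / (n - 1)
# built-in check; all three values should agree
cov(x, y)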
Sample Correlation
We now define a statistic which is based on the covariance. The sample correlation is
$$
r_{xy} = \frac{c_{xy}}{s_x s_y},
$$
where sx and sy are the respective sample standard deviations.
The sample correlation is also called Pearson’s correlation, or the product-moment
correlation.
Properties of the sample correlation:
• It is invariant to linear scaling. For example, if we measure height in millimeters,
centimeters or meters, the correlation is the same.
• It has the same sign as the covariance.
• It can be shown that −1 ≤ r ≤ 1.
• If the points fall exactly on a line with a positive slope, then r = 1.
• If the points fall exactly on a line with a negative slope, then r = −1.
• If there is little to no linear association between x and y, then r ≈ 0.
Note: Because of the above properties, we say that the correlation is a measure of
the linear association between two numerical variables.
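A short R sketch (assuming x and y are numeric vectors) illustrating the definition and the invariance to linear scaling:

# correlation as the covariance divided by the product of the standard deviations
cov(x, y) / (sd(x) * sd(y))
cor(x, y)        # built-in version; same value
# rescaling x (e.g. changing its units) leaves the correlation unchanged
cor(100 * x, y)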
Example 1. We would like to describe the relationship between the mean adult
female body mass (in kg) of grizzly bears (y) and the percentage of meat in the diet
(x). Below are the data for n = 12 different regions.
 x    y        x    y
 5   120      42   169
 6   122      42   171
 7   117      60   201
11   129      76   210
12   132      77   225
26   139      79   220
(a) Calculate the mean and standard deviation for the mean adult female body mass
and for the percentage of meat in the diet.
(b) Draw a scatter plot of the mean adult female body mass against the percentage
of meat in the diet.
(c) Calculate the sample covariance and the sample correlation between the percentage of meat in the diet and the mean adult female body mass.
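A hedged R sketch of parts (a), (b) and (c), entering the data above directly:

x <- c(5, 6, 7, 11, 12, 26, 42, 42, 60, 76, 77, 79)                  # % of meat in the diet
y <- c(120, 122, 117, 129, 132, 139, 169, 171, 201, 210, 225, 220)   # mean body mass (kg)
# (a) means and standard deviations
mean(x); sd(x)
mean(y); sd(y)
# (b) scatter plot of body mass against percentage of meat
plot(x, y)
# (c) sample covariance and sample correlation
cov(x, y)
cor(x, y)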
Sample covariance and sample correlation for Example 1, computed with R:
> cov(x,y)
[1] 1236.447
> cor(x,y)
[1] 0.9933918
6.2 Least Squares Line
Objective: describe the association between the following two random variables:
• y (also called the response or dependent variable);
• x (also called the predictor or independent variable).
We assume that we have a random sample of paired observations (xi , yi ) for i =
1, . . . , n.
Example 2. Consider the data from Example 1.
• response variable y is the mean adult female body mass (in kg) of grizzly bears;
• predictor variable x is the percentage of meat in the diet.
For these data, the line of best fit is ŷ = 111.529 + 1.392 x, which is overlaid on the scatter plot in the following figure.
[Figure: scatter plot of mean adult female body mass y (kg) against percentage of meat in the diet x, with the least-squares line ŷ = 111.529 + 1.392 x overlaid.]
We can use the line to estimate the mean adult female body mass (in kg) of grizzly
bears when the percentage of meat in the diet is x = 80:
$$
\hat{\mu}_{Y \mid x=80} = 111.529 + 1.392\,(80) = 222.889.
$$
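A hedged R sketch of the same prediction (with x and y holding the data from Example 1): the fitted model is stored and predict is evaluated at x = 80.

fit <- lm(y ~ x)                              # least-squares fit
predict(fit, newdata = data.frame(x = 80))    # estimated mean body mass at x = 80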
Now, to find the line of best fit, denoted by
ŷ = α̂ + β̂ x,
we will define what we mean by “best”. Consider the i-th case (xi, yi). The corresponding fitted value
ŷi = α̂ + β̂ xi
is the evaluation of the estimated line at x = xi .
Note that the difference ei = yi − ŷi is called the i-th residual (or observed error).
The sum of the squared residuals
$$
L(\hat{\alpha}, \hat{\beta}) = \sum_{i=1}^{n}\left[\, y_i - (\hat{\alpha} + \hat{\beta}\, x_i) \,\right]^2
$$
is used as a measure of fit.
To find the line of best fit, we need to minimize the sum of squared residuals L(α̂, β̂).
This approach is known as the method of least-squares.
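As an illustrative sketch only (not the derivation that follows), the sum of squared residuals can be written as an R function and minimized numerically; optim should recover essentially the same intercept and slope as the closed-form solution derived below.

# sum of squared residuals for a candidate intercept par[1] and slope par[2]
L <- function(par) sum((y - (par[1] + par[2] * x))^2)
# numerical minimization starting from (0, 0)
optim(c(0, 0), L)$par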
The least-squares method finds the values of α̂ and β̂ that minimize the sum of squared residuals. The solution of this optimization problem is obtained by setting each of the partial derivatives equal to zero:
$$
\frac{\partial L}{\partial \hat{\alpha}} = 0 \quad \text{and} \quad \frac{\partial L}{\partial \hat{\beta}} = 0.
$$
We can show that the solution is
slope
$$
\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
            = \frac{\left(\sum_{i=1}^{n} x_i y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)/n}{\left(\sum_{i=1}^{n} x_i^2\right) - \left(\sum_{i=1}^{n} x_i\right)^2/n},
$$
intercept
$$
\hat{\alpha} = \bar{y} - \hat{\beta}\, \bar{x}.
$$
The line of best fit is
ŷ = α̂ + β̂ x.
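A hedged R sketch of these formulas (assuming x and y are numeric vectors), checked against lm:

beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
alpha_hat <- mean(y) - beta_hat * mean(x)                                # intercept
c(alpha_hat, beta_hat)
coef(lm(y ~ x))    # should agree with the values above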
Example 3. Consider the data from Example 1. Estimate the mean adult female body mass (in kg) of grizzly bears when the percentage of meat in the diet is x = 50.
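A sketch of the computation, using the rounded coefficients of the line of best fit from Example 2:
$$
\hat{\mu}_{Y \mid x=50} = 111.529 + 1.392\,(50) = 181.129 \text{ kg}.
$$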
Linear regression with R
To produce a scatter plot of y against x, we use
plot(x,y)
To overlay the least-squares line on the plot, we use
abline(lm(y~x))
To compute the least-squares line, we use
lm(y~x)
Example 1 with R
> x=c(5, 6, 7, 11, 12, 26, 42, 42, 60, 76, 77, 79)
> y=c(120, 122, 117, 129, 132, 139,169, 171, 201, 210, 225, 220)
> lm(y~x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
111.529 1.392