
Correlation and Linear Regression

Sometimes we collect numerical data in pairs, (x_i, y_i), for the purpose of determining what kind of relationship exists between the two variables. The correlation coefficient measures the strength of the linear relationship between two quantitative (usually ratio or interval) variables. The correlation coefficient has the following properties:

1. Its value is between +1 and −1 inclusive.


2. Values of +1 and −1 signify an exact positive and negative relationship, respectively, between the variables. That is, a plot of the values of x and y exactly describes a straight line with a positive or negative slope depending on the sign.

3. A correlation of zero indicates that no linear relationship exists between the two variables. This condition does not, however, imply that there is no relationship, since correlation does not measure the strength of curvilinear relationships.

4. The correlation coefficient is symmetric with respect to x and y. It is thus a measure of the strength of a linear relationship regardless of whether x comes first or y.

Interpreting the Correlation Coefficient

• Correlation coefficients are always between −1 and +1.
• The magnitude of the correlation indicates the strength of the relationship, the overall closeness of the points to a straight line. The sign of the correlation does not tell us about the strength of the linear relationship.
• A correlation of either +1 or −1 indicates that there is a perfect linear relationship and all data points fall on the same straight line.
• The sign of the correlation indicates the direction of the relationship. A positive correlation indicates that as one variable increases, so does the other. A negative correlation indicates that as one variable increases, the other decreases.
• A correlation of 0 indicates that the best straight line through the data is exactly horizontal, so knowing the value of one variable does not change the other.

Relationship between Numerical Variables

Scatterplots illustrate the relationship (if any) between two variables. The scatterplot below shows height versus shoe size. From the scatterplot, a positive or increasing relationship can be concluded: as height increases, shoe size also increases. You should notice how the value of the correlation relates to the graph it represents.

Figure: Scatter plot of Shoe Size versus Height, showing a positive, increasing relationship.

So we can classify the strength of the relationship between two variables depending on the correlation coefficient into:

• If 0.00 ≤ |r| ≤ 0.35, the correlation is weak.
• If 0.35 < |r| ≤ 0.65, the correlation is moderate.
• If 0.65 < |r| ≤ 1.00, the correlation is strong.

Correlation Coefficient Calculational Formula

r = S_xy / √(S_xx · S_yy)

where

S_xy = Σ x_i y_i − (Σ x_i)(Σ y_i)/n,   S_xx = Σ x_i² − (Σ x_i)²/n,   and   S_yy = Σ y_i² − (Σ y_i)²/n.

Example:

In a study of reaction times, the time to respond to a visual stimulus (x) and the time to respond to an auditory stimulus (y) were recorded for each of 10 subjects. Times were measured in ms. The results are presented in the following table.

x 161 203 235 176 201 188 228 211 191 178
y 159 206 241 163 197 193 209 189 169 201

Find the correlation coefficient between the time to respond to a visual stimulus (x) and the time to respond to an auditory stimulus (y).
Solution:

We need to find Σ x_i, Σ y_i, Σ x_i y_i, Σ x_i², and Σ y_i², so we build the following table.

i      x_i    y_i    x_i y_i   x_i²     y_i²
1      161    159    25599     25921    25281
2      203    206    41818     41209    42436
3      235    241    56635     55225    58081
4      176    163    28688     30976    26569
5      201    197    39597     40401    38809
6      188    193    36284     35344    37249
7      228    209    47652     51984    43681
8      211    189    39879     44521    35721
9      191    169    32279     36481    28561
10     178    201    35778     31684    40401
Total  1972   1927   384209    393746   376789
S_xy = Σ x_i y_i − (Σ x_i)(Σ y_i)/n = 384209 − (1972)(1927)/10 = 4204.6,

S_xx = Σ x_i² − (Σ x_i)²/n = 393746 − (1972)²/10 = 4867.6,

S_yy = Σ y_i² − (Σ y_i)²/n = 376789 − (1927)²/10 = 5456.1.

The correlation coefficient is

r = S_xy / √(S_xx · S_yy) = 4204.6 / √((4867.6)(5456.1)) = 0.8159.
So we can conclude that the relationship between the time to respond to a visual
stimulus (x) and the time to respond to an auditory stimulus (y) is positive and strong.
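As a quick check of the hand calculation, here is a minimal R sketch (the vector names x and y are chosen only for this illustration):

x <- c(161, 203, 235, 176, 201, 188, 228, 211, 191, 178)   # visual reaction times
y <- c(159, 206, 241, 163, 197, 193, 209, 189, 169, 201)   # auditory reaction times
Sxy <- sum(x * y) - sum(x) * sum(y) / length(x)             # 4204.6
Sxx <- sum(x^2) - sum(x)^2 / length(x)                      # 4867.6
Syy <- sum(y^2) - sum(y)^2 / length(y)                      # 5456.1
Sxy / sqrt(Sxx * Syy)                                       # 0.8159
cor(x, y)                                                   # same value, computed directly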

The Confidence Interval for the population Correlation

The confidence interval for, and most tests on, the population correlation ρ are based on the following result:

Let X and Y be random variables with the bivariate normal distribution.


Let 𝜌 denote the population correlation between X and Y.
Let (x_1, y_1), . . . , (x_n, y_n) be a random sample from the joint distribution of X and Y.
Let r be the sample correlation of the n points.
Then the quantity
W = (1/2) ln[(1 + r)/(1 − r)]

is approximately normally distributed, with mean given by

μ_W = (1/2) ln[(1 + ρ)/(1 − ρ)],

and variance given by

σ_W² = 1/(n − 3).

Note that μ_W is a function of the population correlation ρ. To construct confidence intervals, we will need to solve the μ_W equation for ρ. We obtain

ρ = (e^(2μ_W) − 1) / (e^(2μ_W) + 1).

Example:

Find a 95% confidence interval for the correlation between the two reaction times.

Solution:

We compute the sample correlation, obtaining r = 0.8159. Next we compute the quantity
W:

W = (1/2) ln[(1 + r)/(1 − r)] = (1/2) ln[(1 + 0.8159)/(1 − 0.8159)] = 1.1444.
Since W is normally distributed with known standard deviation

σ_W = √(1/(n − 3)) = √(1/(10 − 3)) = 0.378.

The 95% confidence interval for μ_W is given by

1.1444 − 1.96(0.378) < μ_W < 1.1444 + 1.96(0.378)

0.4036 < μ_W < 1.8852

To obtain a 95% confidence interval for 𝜌 we transform the inequality using the 𝜌
equation, obtaining

(e^(2(0.4036)) − 1) / (e^(2(0.4036)) + 1) < ρ < (e^(2(1.8852)) − 1) / (e^(2(1.8852)) + 1)

0.383 < ρ < 0.955
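The same interval can be reproduced in R; this is only a sketch of the back-transformation, with r and n taken from the example above:

r <- 0.8159; n <- 10
W   <- 0.5 * log((1 + r) / (1 - r))          # 1.1444
se  <- 1 / sqrt(n - 3)                        # 0.378
ciW <- W + c(-1, 1) * 1.96 * se               # interval for mu_W
(exp(2 * ciW) - 1) / (exp(2 * ciW) + 1)       # about 0.38 to 0.95 for rho
# cor.test(x, y) reports an interval of this type for the Pearson correlation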

Hypothesis Testing for the Population Correlation (The Hypothesized Correlation


Value = 𝝆𝟎 )

The hypotheses:
H₀: ρ ≥ ρ₀            H₀: ρ ≤ ρ₀            H₀: ρ = ρ₀
Hₐ: ρ < ρ₀            Hₐ: ρ > ρ₀            Hₐ: ρ ≠ ρ₀
Lower-tailed test      Upper-tailed test      Two-tailed test
Test statistic:
z = (W − μ_W) / σ_W
where ρ₀ is the hypothesized correlation and
μ_W = (1/2) ln[(1 + ρ₀)/(1 − ρ₀)].
P-value = P(Z ≤ −z)    P-value = P(Z ≥ z)    P-value = 2 P(Z ≥ z) or 2 P(Z ≤ −z)
The decision: Reject H₀ if p-value ≤ α; otherwise fail to reject H₀.

Example:

Test the null and alternate hypotheses H₀: ρ ≤ 0.3 versus Hₐ: ρ > 0.3. Use significance level 5%.
Solution:
Under H₀ we take ρ = 0.3, so
μ_W = (1/2) ln[(1 + ρ₀)/(1 − ρ₀)] = (1/2) ln[(1 + 0.3)/(1 − 0.3)] = 0.3095.

The standard deviation of W is σ_W = √(1/(n − 3)) = √(1/(10 − 3)) = 0.378. The observed value of W is 1.1444. The z-score is therefore

z = (W − μ_W) / σ_W = (1.1444 − 0.3095) / 0.378 = 2.21

The P-value is P(Z ≥ z) = P(Z ≥ 2.21) = 1 − P(Z < 2.21) = 1 − 0.9864 = 0.0136.

So we reject the null hypothesis and conclude that the population correlation is larger
than 0.3 at significance level 5%.
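The same z-test can be carried out numerically in R; a minimal sketch using the quantities from this example:

r <- 0.8159; n <- 10; rho0 <- 0.3
W   <- 0.5 * log((1 + r) / (1 - r))          # 1.1444
muW <- 0.5 * log((1 + rho0) / (1 - rho0))    # 0.3095
z   <- (W - muW) * sqrt(n - 3)               # (W - muW)/sigma_W, about 2.21
pnorm(z, lower.tail = FALSE)                 # one-sided P-value, about 0.014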

Hypothesis Testing for the Population Correlation (The Hypothesized Correlation


Value is equal to zero)

The hypotheses:
H₀: ρ ≥ 0            H₀: ρ ≤ 0            H₀: ρ = 0
Hₐ: ρ < 0            Hₐ: ρ > 0            Hₐ: ρ ≠ 0
Lower-tailed test     Upper-tailed test     Two-tailed test
Test statistic:
t = r √(n − 2) / √(1 − r²)
Compute the P-value. The P-value is an area under the Student’s t curve with n − 2
degrees of freedom, which depends on the alternate hypothesis as follows:
P-value = 𝑃(𝑇 ≤ −𝑡) P-value = 𝑃(𝑇 ≥ 𝑡) P-value = 2 𝑃(𝑇 ≥ 𝑡) or 2 𝑃(𝑇 ≤ −𝑡)
The decision: Reject H₀ if p-value ≤ α; otherwise fail to reject H₀.
Example:

Test the hypothesis H₀: ρ ≤ 0 versus Hₐ: ρ > 0. Use significance level 5%.


Solution:
The test statistic for the null hypothesis is
t = r √(n − 2) / √(1 − r²) = 0.8159 √(10 − 2) / √(1 − 0.8159²) = 3.991
Consulting the Student’s t table with 8 degrees of freedom, we find that the P-value is
between 0.001 and 0.005. It is reasonable to conclude that 𝜌 > 0.

R code

The cor.test() function in R allows you to calculate a correlation coefficient and test hypotheses about it; with the "spearman" option it uses the Spearman correlation, and with the "pearson" option it performs the Pearson correlation test used above. The cor.test() function options are

cor.test(Variable #1, Variable #2, alternative = "less", "greater", or "two.sided",
         method = c("pearson", "kendall", "spearman"))
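For instance, a sketch of the call for the reaction-time data, using the x and y vectors defined in the earlier R sketch:

cor.test(x, y, alternative = "greater", method = "pearson")
# reports t = 3.99 on 8 degrees of freedom, a one-sided p-value of about 0.002,
# and the sample estimate r = 0.816, matching the hand calculations above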

Simple Linear Regression

Regression analysis is a statistical method for analyzing a relationship between two or


more variables in such a manner that one variable can be predicted or explained by using
information on the others. In regression analysis, we are interested in a random variable y
that is related to a number of independent variables x’s. The objective is to create a
prediction equation that expresses y as a function of these independent variables. Then, if
you can measure the independent variables, you can substitute these values into the
prediction equation and obtain the prediction for y (on average).

In simple linear regression, the relationship is specified to have only one predictor
variable and the relationship is described by a straight line. This is, as the name implies,
the simplest of all regression models.

Some examples of analyses using regression include

• estimating weight gain by the addition to children’s diets of different amounts of a


dietary supplement,
• predicting scholastic success (grade point ratio) based on students’ scores on an
aptitude or entrance test,
• estimating changes in sales associated with increased expenditures on advertising,
• estimating fuel consumption for home heating based on daily temperatures, or
• estimating changes in interest rates associated with the amount of
deficit spending.
The simple linear regression
Assume that the variable of interest, y, is linearly related to an independent variable x. You can use the following linear equation to describe this relationship:

𝒚 = 𝜶 + 𝜷𝒙 + 𝜺

where α is the y-intercept (the value of y when x = 0) and β is the slope of the line, defined as the change in y for a one-unit change in x. The variable y is sometimes called the response variable or dependent variable, and the independent variable x is often called the predictor variable. ε is the error term or the random error component.

Assumptions about the random error

Assume that the values of ε satisfy these conditions:
• Are independent in the probabilistic sense
• Have a mean of 0 and a common variance equal to σ²
• Have a normal probability distribution

Least Squares Method

We use this method to estimate the parameters α and β of the simple linear regression model from the data. The formula for the best-fitting line is

ŷ = a + bx

where a and b are the estimates of the intercept and slope parameters α and β, respectively. The least-squares estimators of α and β are

b = S_xy / S_xx   and   a = ȳ − b x̄

where the quantities S_xy and S_xx are defined as

S_xy = Σ (x_i − x̄)(y_i − ȳ) = Σ x_i y_i − (Σ x_i)(Σ y_i)/n

and

S_xx = Σ (x_i − x̄)² = Σ x_i² − (Σ x_i)²/n.

Example:

In the following table are the mathematics achievement test scores and the final calculus grades for 10 college freshmen.

Student   Test Score   Calculus Grade
1         39           65
2         43           78
3         21           52
4         64           82
5         57           92
6         47           89
7         28           73
8         75           98
9         34           56
10        52           75

Find the least squares prediction line for the calculus grade data. If a freshman scored 50 on the achievement test, find the student's predicted calculus grade.

Solution:

The scatter plot of the data shows that there is a linear relationship between test score and final calculus grade. We construct the following table to find the sums required for the calculations.

y_i    x_i    x_i²     x_i y_i   y_i²
65     39     1521     2535      4225
78     43     1849     3354      6084
52     21     441      1092      2704
82     64     4096     5248      6724
92     57     3249     5244      8464
89     47     2209     4183      7921
73     28     784      2044      5329
98     75     5625     7350      9604
56     34     1156     1904      3136
75     52     2704     3900      5625
Sum    760    460      23,634    36,854    59,816

Then we find

S_xy = Σ x_i y_i − (Σ x_i)(Σ y_i)/n = 36,854 − (460)(760)/10 = 1894,

S_xx = Σ x_i² − (Σ x_i)²/n = 23,634 − (460)²/10 = 2474,

x̄ = Σ x_i / n = 460/10 = 46,   and   ȳ = Σ y_i / n = 760/10 = 76.
Then

b = S_xy / S_xx = 1894/2474 = 0.76556   and   a = ȳ − b x̄ = 76 − (0.76556)(46) = 40.78424

The least-squares regression line is then

ŷ = a + bx = 40.78424 + 0.76556x

This line can now be used to predict y for a given value of x by substituting the value of x into the equation. For example, if a freshman scored x = 50 on the achievement test, the student's predicted calculus grade is (using full decimal accuracy)

ŷ = a + b(50) = 40.78424 + (0.76556)(50) = 79.06
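The same fit can be obtained in R with lm(); this is only a sketch, with the variable names score and grade chosen for this illustration:

score <- c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52)
grade <- c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)
fit <- lm(grade ~ score)                          # least-squares line
coef(fit)                                         # intercept about 40.784, slope about 0.766
predict(fit, newdata = data.frame(score = 50))    # predicted grade, about 79.06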

Analysis of Variance for Linear Regression and Inference Concerning the Line Slope β

For simple regression analysis we divide the total variation in the response variable y into two parts:

• SSR (sum of squares for regression) measures the amount of variation explained by using the regression line with one independent variable x.
• SSE (sum of squares for error) measures the "residual" variation in the data that is not explained by the independent variable x.

The variation in the response variable y is measured by the total sum of squares, and the quantities are

TSS = S_yy = Σ y_i² − (Σ y_i)²/n,   SSR = S_xy² / S_xx,   and   SSE = TSS − SSR.

The degrees of freedom corresponding to each sum of squares are:
• The degrees of freedom for TSS = n − 1
• The degrees of freedom for SSR = 1
• The degrees of freedom for SSE = n − 2

The ANOVA table for simple linear regression is

Source       SS     df       MS                    F
Regression   SSR    1        MSR = SSR/1           MSR/MSE
Error        SSE    n − 2    MSE = SSE/(n − 2)
Total        TSS    n − 1

As with all ANOVA tables, the mean square for error, MSE = SSE/(n − 2), is an unbiased estimator of the underlying error variance σ².

Hypotheses Testing for the Simple Linear Regression Model

Hypotheses
H₀: β = 0, which means there is no significant relationship between x and y.
Hₐ: β ≠ 0, which means there is a significant relationship between x and y.
The test statistic:
F = MSR / MSE
The sampling distribution of the test statistic is F with two degrees of freedom, df₁ = 1 and df₂ = n − 2. It is a one-tailed test, namely an upper-tailed test.
The decision: we reject H₀ if the test statistic F > F(df₁, df₂, α), or we reject H₀ if the p-value of the test statistic F ≤ α, the predetermined significance level. Otherwise, we fail to reject H₀.

Example

Construct the ANOVA table for the previous example and check if there is any significant linear relationship between x and y.

For the calculus grade data, you can calculate

Total SS = S_yy = Σ y_i² − (Σ y_i)²/n = 59,816 − (760)²/10 = 2056

SSR = S_xy² / S_xx = (1894)² / 2474 = 1449.9741

so that

SSE = Total SS − SSR = 2056 − 1449.9741 = 606.0259

and

MSE = SSE / (n − 2) = 606.0259 / 8 = 75.7532

ANOVA table

Source       SS          df    MS          F
Regression   1449.9741   1     1449.9741   19.14
Error        606.0259    8     75.7532
Total        2056        9

To determine the significance of the relationship between x and y, we have to test the null hypothesis H₀: β = 0, which means there is no relationship between x and y, against the alternative hypothesis Hₐ: β ≠ 0.

Consulting the F table with df₁ = 1 and df₂ = 8 degrees of freedom, we find that the P-value for F = 19.14 is less than 0.005. Therefore, the null hypothesis is rejected and we can conclude that there is a significant linear relationship between x and y.
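In R, the same ANOVA table and F-test can be obtained from the fit object in the lm() sketch above:

anova(fit)
# gives Sum Sq 1449.97 (score) and 606.03 (Residuals), F = 19.14 on 1 and 8 df,
# with a p-value of roughly 0.002, so the linear relationship is significant
summary(fit)   # the same F statistic is reported at the bottom of the summary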

Another Hypothesis Test Concerning the Line Slope

In the previous section we introduced the F-statistic for testing the significance of the slope β. In this section we introduce a t-statistic test and a (1 − α)100% confidence interval for the slope.

Hypotheses Testing Concerning the Line Slope
The hypotheses:
H₀: β ≥ β₀            H₀: β ≤ β₀            H₀: β = β₀
Hₐ: β < β₀            Hₐ: β > β₀            Hₐ: β ≠ β₀
Lower-tailed test      Upper-tailed test      Two-tailed test
Test statistic:
t = (b − β₀) / √(MSE / S_xx)
Compute the P-value. The P-value is an area under the Student's t curve with n − 2 degrees of freedom, which depends on the alternate hypothesis as follows:
P-value = P(T ≤ −t)    P-value = P(T ≥ t)    P-value = 2 P(T ≥ t) or 2 P(T ≤ −t)
The decision: Reject H₀ if p-value ≤ α; otherwise fail to reject H₀.

Example:

Determine whether there is a significant linear relationship between the calculus grades and test scores in the previous example. Test at the 5% level of significance.

Solution:

The hypotheses are

H₀: β = 0 versus Hₐ: β ≠ 0

and the observed value of the test statistic is calculated as

t = (b − 0) / √(MSE / S_xx) = (0.7656 − 0) / √(75.7532 / 2474) = 4.38

with (n − 2) = 8 degrees of freedom. Consulting the t table with 8 degrees of freedom, we find that the two-tailed P-value for t = 4.38 is between 0.002 and 0.005. Therefore, the null hypothesis is rejected and we can conclude that there is a significant linear relationship between the calculus grades and the test scores for the population of college freshmen.
A (1 − α)100% Confidence Interval for the Slope

b ± t(α/2) (SE)

where t(α/2) is based on (n − 2) degrees of freedom and the standard error of b is

SE = √(s² / S_xx) = √(MSE / S_xx).
Example

Find a 95% confidence interval estimate of the slope β for the calculus grade data.

Solution

Substituting previously calculated values into

b ± t(0.025) √(MSE / S_xx)

you have

0.766 ± 2.306 √(75.7532 / 2474)

0.766 ± 0.404

The resulting 95% confidence interval is 0.362 to 1.170. Since the interval does not contain 0, you can conclude that the true value of the slope β is not equal to 0, and you can reject the null hypothesis H₀: β = 0 in favor of Hₐ: β ≠ 0, a conclusion that agrees with the findings in the previous two examples. Furthermore, the confidence interval estimate indicates that there is an increase of anywhere from as little as 0.4 to as much as 1.2 points in the calculus test score for each 1-point increase in the achievement test score.
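In R, the same interval comes directly from the fit object in the lm() sketch above:

confint(fit, level = 0.95)
# the row for "score" is roughly 0.362 to 1.170, matching the hand calculation,
# and the interval excludes 0, so the slope is significantly different from 0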

The Coefficient of Determination

The coefficient of determination r² can be interpreted as the percent reduction in the total variation in the experiment obtained by using the regression line, instead of ignoring x and using the sample mean ȳ to predict the response variable y.

r² = SSR / TSS = S_xy² / (S_xx S_yy)
Example:

Find the coefficient of determination for the calculus grade data.

Solution:

From the ANOVA table we have SSR = 1450 and TSS = 2056. Therefore,

r² = SSR / TSS = 1450 / 2056 = 0.705 or 70.5%

which means that the predictor variable x explains 70.5% of the total variation in y; the remaining 29.5% is unexplained and corresponds to the SSE portion.

Diagnostic tools for checking the regression assumptions

Residual plots

This plot allows checking two of the error assumptions: that the errors have a mean of 0 and a common variance equal to σ², and that they are independent. To draw it, you need to
1. Calculate the residual at each point: the residual is the difference between the actual value of y and the predicted value ŷ from the regression equation.
2. Plot the fitted values on the horizontal axis versus the residual values on the vertical axis.
3. If the plot of residuals versus fitted values
• Shows no substantial trend or curve, and
• Is homoscedastic, that is, the vertical spread does not vary too much
along the horizontal length of plot, except perhaps near the edges, then it
is likely, but not certain, that those two assumptions hold. However, if the
residual plot does show a substantial trend or curve, or is heteroscedastic,
it is certain that the assumptions do not hold.

The Normal Probability Plot

This plot allows checking the normality assumption of the errors. It is a graph that plots
the residuals against the expected values of that residual if it had come from a normal
distribution. When the residuals are normally distributed or approximately so, the plot
should appear as a straight line, sloping upward.
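Both diagnostic plots can be drawn in R from a fitted lm object; a minimal sketch using the fit object from the calculus grade example above:

res <- resid(fit)                                  # residuals y - y_hat
plot(fitted(fit), res,                             # residuals versus fitted values
     xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)                             # reference line at zero
qqnorm(res); qqline(res)                           # normal probability (Q-Q) plot of the residuals
# plot(fit) produces a similar set of diagnostic plots automatically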

Example:

Check the error assumptions for the linear regression model of the test scores and final calculus grades of the 10 college freshmen.

Solution:

To check the assumptions we need to calculate the residuals as follows (residual = observed grade − fitted value):

Student   Math Achievement   Final Calculus   Fitted      Residual
number    test score         Grade            Value
1         39                 65               70.64104    −5.64104
2         43                 78               73.70328    4.29672
3         21                 52               56.86096    −4.86096
4         64                 82               89.78004    −7.78004
5         57                 92               84.42112    7.57888
6         47                 89               76.76552    12.23448
7         28                 73               62.21988    10.78012
8         75                 98               98.20120    −0.20120
9         34                 56               66.81324    −10.81324
10        52                 75               80.59332    −5.59332

The residual plot and the normal probability plot are given below.

One property of the residuals is that they sum to 0 and therefore have a sample mean of 0.

Figure: Plot of the residuals versus the fitted values, and normal probability plot of the residuals, for the calculus grade example.

The residual plot shows no apparent pattern or trend, which indicates that the assumptions that the errors have mean zero and constant variance appear to be satisfied for these data.

In the normal probability plot, with the exception of the fourth and fifth plotted points, the remaining points appear to lie approximately on a straight line. This plot is not unusual and does not indicate nonnormality. The most serious violations of the normality assumption usually appear in the tails of the distribution, because this is where the normal distribution differs most from other types of distributions with a similar mean and measure of spread. Hence, curvature in either or both ends of the normal probability plot is indicative of nonnormality.

The Multiple Linear Regression

Simple regression uses one numerical independent variable, X, to predict a numerical dependent variable, Y. Often you can make better predictions by using more than one independent variable. Multiple regression uses two or more independent variables to predict the value of a dependent variable. The following linear equation describes this relationship:

𝒚 = 𝜶 + 𝜷𝟏 𝒙𝟏 + 𝜷𝟐 𝒙𝟐 + 𝜷𝟑 𝒙𝟑 + ⋯ + 𝜷𝒑 𝒙𝒑 + 𝜺

where α is the intercept, the value of y when all the x's are 0, and β₁, β₂, β₃, …, β_p are the slopes. y is the response variable or dependent variable, and the independent variables x₁, x₂, x₃, …, x_p are often called the predictor variables. ε is the error term or the random error component.

Assumptions about the random error

Assume that the values of 𝜺 satisfy these conditions:


• Are independent in the probabilistic sense
• Have a mean of 0 and a common variance equal to σ²
• Have a normal probability distribution

Least Squares Method

We use this method to estimate the parameters of the linear regression 𝛼 and 𝛽′𝑠 from the
data. The formula for the best-fitting line is

𝒚 = 𝒂 + 𝒃𝟏 𝒙𝟏 + 𝒃𝟐 𝒙𝟐 + 𝒃𝟑 𝒙𝟑 + ⋯ + 𝒃𝒑 𝒙𝒑

where 𝑎 and 𝑏′𝑠 are the estimates of the intercept and slopes parameters 𝛼 and 𝛽′𝑠,
respectively.

The Least Squares Method computes the 𝑏′𝑠 so as to minimize the sum of the squared
residuals
Σ e_i² = Σ (y_i − ŷ_i)² = Σ (y_i − a − b₁x_1i − b₂x_2i − ⋯ − b_p x_pi)²

where the e_i are the residuals or errors.
or errors.
!

Example (1):

Suppose that we collected data on the final selling price of houses, their assessed values, and how many months after each house was assessed it was sold; the data are below. Now we are using two independent variables (assessed value and months until being sold) to predict selling price. When there are several independent variables, you can extend the simple linear regression equation, with as many b's and x's as there are independent variables. For our example, we are trying to estimate the following regression equation:

ŷ = a + b₁x₁ + b₂x₂

where x₁ represents the assessed value and x₂ represents the time, in months, until being sold.

House Selling Assessed Time House Selling Assessed Time


1 194.10 178.17 10 16 206.70 184.36 12
2 201.90 180.24 10 17 181.50 172.94 5
3 188.65 174.03 11 18 194.50 176.50 14
4 215.50 186.31 2 19 169.00 166.28 1
5 187.50 175.22 5 20 196.90 179.74 3
6 172.00 165.54 4 21 186.50 172.78 14
7 191.50 172.43 17 22 197.90 177.90 12
8 213.90 185.61 13 23 183.00 174.31 11
9 169.34 160.80 6 24 197.30 179.85 12
10 196.90 181.88 5 25 200.80 184.78 2
11 196.00 179.11 7 26 197.90 181.61 6
12 161.90 159.93 4 27 190.50 174.92 12
13 193.00 175.27 11 28 197.00 179.98 4
14 209.50 185.88 10 29 192.00 177.96 9
15 193.75 176.64 17 30 195.90 179.07 12

It is difficult to graph scatterplots of multiple regression for cases when we have more than two independent variables because we are in three-dimensional space. We can perform multiple regression with any statistical package. The following is R output:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -120.04830 15.06092 -7.971 1.44e-08 ***
Assessed 1.75060 0.08576 20.414 < 2e-16 ***
Time 0.36795 0.12805 2.873 0.00782 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.097 on 27 degrees of freedom


Multiple R-squared: 0.943, Adjusted R-squared: 0.9388
F-statistic: 223.5 on 2 and 27 DF, p-value: < 2.2e-16



Therefore, our estimated multiple regression equation is

𝒚 = −𝟏𝟐𝟎. 𝟎𝟒𝟖 + 𝟏. 𝟕𝟓𝟏𝒙𝟏 + 𝟎. 𝟑𝟔𝟖𝒙𝟐

Like simple regression, we need to interpret the regression coefficients. To do this, we hold all other variables constant. For instance, for houses that were sold the same number of months after having their value assessed, every additional $1,000 increase in a house's assessed value increases the final selling price by $1,751, on average.

The predicted final selling price for a house that was assessed at $175,000 and is sold 8 months after being assessed is

ŷ = −120.048 + 1.751(175) + 0.368(8) = 189.321,

which means the predicted final selling price is, on average, $189,321.
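The same prediction can be obtained in R; a sketch assuming the model object fitted with lm(Selling ~ Assessed + Time), as shown in the R code section below:

new_house <- data.frame(Assessed = 175, Time = 8)
predict(model, newdata = new_house)                           # roughly $189 thousand, matching the hand calculation up to coefficient rounding
predict(model, newdata = new_house, interval = "prediction")  # adds a 95% prediction interval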

Coefficient of Determination

It is important to note that we cannot rely on the R² value simply in the way that it is calculated. Although it still measures the percent reduction in variability, adding more variables to the model will always increase R². Some statisticians suggest using the adjusted R², which reflects both the number of independent variables in the model and the sample size; typically, we use this statistic when we are comparing different models. The adjusted coefficient of determination is given by

R²(adj) = R² − (1 − R²) · p / (n − p − 1),

where p is the total number of predictors or independent variables in the linear model and n is the sample size.
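As a quick numerical check against the R output for the selling price model:

r2 <- 0.943; n <- 30; p <- 2
r2 - (1 - r2) * p / (n - p - 1)    # about 0.9388, matching "Adjusted R-squared" in the output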
Analysis of Variance for Multiple Linear Regression
The total variation in the response variable y (TSS) is divided into two parts:
• SSR (sum of squares for regression) measures the amount of variation explained by the regression line.
• SSE (sum of squares for error) measures the "residual" variation in the data that is not explained by the regression line.

The ANOVA table for multiple linear regression is

Source       SS     df           MS                        F
Regression   SSR    p            MSR = SSR/p               MSR/MSE
Error        SSE    n − p − 1    MSE = SSE/(n − p − 1)
Total        TSS    n − 1

where 𝑝 is the total number of predictors or independent variables in the linear model and
𝑛 is the sample size.

Hypotheses Testing for the Multiple Linear Regression Model

Hypotheses
H₀: β₁ = β₂ = ⋯ = β_p = 0, which means the linear model is not significant.
Hₐ: at least one β_j ≠ 0, which means the linear model is significant.
The test statistic:
F = MSR / MSE
The sampling distribution of the test statistic is F with two degrees of freedom, df₁ = p and df₂ = n − p − 1. It is a one-tailed test, namely an upper-tailed test.

The decision: we reject H₀ if the test statistic F > F(df₁, df₂, α), or we reject H₀ if the p-value of the test statistic F ≤ α, the predetermined significance level. Otherwise, we fail to reject H₀.

From the R output, the p-value of the F statistic is essentially zero (< 2.2e-16), and if we compare it with any significance level we find that the model is significant.

Hypotheses Testing Concerning Slopes

The hypotheses:
H₀: β_j ≥ β_j0            H₀: β_j ≤ β_j0            H₀: β_j = β_j0
Hₐ: β_j < β_j0            Hₐ: β_j > β_j0            Hₐ: β_j ≠ β_j0
Lower-tailed test          Upper-tailed test          Two-tailed test
Test statistic:
t = (b_j − β_j0) / SE(b_j)
Compute the P-value. The P-value is an area under the Student's t curve with n − p − 1 degrees of freedom, which depends on the alternate hypothesis as follows:
P-value = P(T ≤ −t)    P-value = P(T ≥ t)    P-value = 2 P(T ≥ t) or 2 P(T ≤ −t)
The decision: Reject H₀ if p-value ≤ α; otherwise fail to reject H₀.
From the R output, the p-values of the t statistics for assessed value and time are very small, and if we compare them with any common significance level we find that assessed value and time are significant and important for the model. The 95% confidence intervals for the slopes support this conclusion.
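Those confidence intervals can be obtained in R; a sketch assuming the model object fitted with lm(Selling ~ Assessed + Time) in the R code section below:

confint(model, level = 0.95)
# neither the Assessed interval nor the Time interval contains 0,
# which supports the conclusion from the individual t-tests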

Checking Assumptions in Multiple Regression

In multiple regression, as in simple linear regression, it is important to test the validity of


the assumptions for errors in linear models. We use the same Diagnostic tools for
checking the regression assumptions:

• Residuals Plot

• Normal Probability Plot

R. Code for Example (1)

=====================================================================

Entering the data into R

library(gdata) # This library allows reading Excel files (Mac Users).


data=read.xls(file.choose(), sheet=1) # Reading the data from the Excel file (Mac
Users)

OR
library(xlsx) # This library allows reading Excel files (Windows and Mac Users).
data=read.xlsx(file.choose(),1) # Reading the data from the Excel file (Windows and Mac Users).

data # Display the data

Implementing the regression analysis

model=with(data, lm(Selling~Assessed+Time)) # lm() function allows modeling the


linear regression relationships
summary(model) # Display linear regression results
plot(model) # Display plots for diagnosing the model
===============================================================

Variables Selection Methods

When an experiment has several independent variables and the main objective is to find the best regression model containing the minimum number of independent variables that explain the majority of the variation in the dependent variable, a method is needed for selecting the best set of independent variables among all the independent variables of the experiment.

The purpose of variable selection is to find that subset of the variables in the original
model that will in some sense be “optimum.” There are two interrelated factors in
determining that optimum:

• For any given subset size (number of variables in the model) we want the subset
of independent variables that provides the minimum residual sum of squares.
Such a model is considered “optimum” for that subset size.
• Given a set of such optimum models, select the most appropriate subset size.

The problem here is that all possible subsets must be examined until the optimum subset is reached. For example, if we have 10 independent variables, we have to examine 2¹⁰ = 1024 subsets. Modern computers allow solving this problem by using highly efficient computational algorithms.

The most frequently used methods for variable selection are as follows:

• Backward elimination: Starting with the full model, delete the variable whose
coefficient has the smallest partial sum of squares (or smallest magnitude t
statistic). Repeat with the resulting (𝑝 − 1) variable equation, and so forth. Stop
deleting variables when all variables contribute some specified minimum partial
sum of squares (or have some minimum magnitude t statistic).
• Forward selection: Start by selecting the variable that, by itself, provides the
best-fitting equation. Add the second variable whose additional contribution to the
regression sum of squares is the largest, and so forth. Continue to add variables,
one at a time, until any variable when added to the model contributes less than
some specified amount to the regression sum of squares.
• Stepwise: This is an adaptation of forward selection in which, each time a
variable has been added, the resulting model is examined to see whether any
variable included makes a sufficiently small contribution so that it can be dropped
(as in backward elimination).

Criterion-based procedures

There are several criteria used for evaluating models and selecting the best one. The most commonly used criterion is the Akaike Information Criterion (AIC):

AIC = −2 log-likelihood + 2p,

where the −2 log-likelihood is known as the deviance and, for a linear model, is equal (up to an additive constant) to n log(RSS/n), RSS is the residual sum of squares, and p is the number of independent variables (predictors). The best model is the model that has the smallest AIC among all the candidate models.
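In R, candidate models fitted to the same data can be compared directly; a minimal sketch with hypothetical lm objects model_full and model_reduced:

AIC(model_full, model_reduced)   # the model with the smaller AIC is preferred
# stepAIC() in the MASS package, used below, automates this comparison over many subsets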

Example (2): Suppose we want to investigate the housing market in anticipation of moving to a new city. The well-known association between home size and cost has made the price per square foot a widely used measure of housing costs. The following five factors are also associated with housing costs: age of home in years, number of bedrooms, number of bathrooms, size of home in 1000 ft², and size of lot in 1000 ft². The data are on your D2L account. Find the multiple linear regression model for the price using the optimum set of independent variables.

Solution Using R
library(gdata)
data=read.xls(file.choose( ), sheet=1)
data
data1=na.omit(data) # omit the missing values from the data
data1
model=lm(price~age+bed+bath+size+lot, data=data1)
summary(model)
=================================================================
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.529e+01 1.412e+01 2.500 0.0161 *
age -3.498e-01 1.990e-01 -1.758 0.0855 .
bed -1.124e+01 4.427e+00 -2.539 0.0147 *
bath -4.540e+00 6.343e+00 -0.716 0.4778
size 6.595e+01 6.376e+00 10.342 1.79e-13 ***
lot 6.205e-05 5.020e-05 1.236 0.2229
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.5 on 45 degrees of freedom
Multiple R-squared: 0.8267, Adjusted R-squared: 0.8074
F-statistic: 42.93 on 5 and 45 DF, p-value: 4.965e-16

From the results, it is obvious that the bath and lot variables are not significant, and we need to select the best set of predictors for the model.

In the following, we need to install the MASS library to use the stepAIC function for implementing the variable selection. The arguments of this function are as follows:
• model: the linear regression model containing all predictors from the prior step.
• direction: there are three options, "both" for the stepwise selection method, "forward" for the forward selection method, and "backward" for the backward selection method.
• trace: if it is TRUE, the selection steps will be displayed. If it is FALSE, only the optimum model will be displayed.
=================================================================

library(MASS) # required for performing the variables selection method
summary(stepAIC(model, direction="both", trace = FALSE))

Call:
lm(formula = price ~ age + bed + size, data = data1)

Residuals:
Min 1Q Median 3Q Max
-40.747 -8.065 -0.094 9.774 56.194

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.9254 12.4484 2.966 0.00473 **
age -0.3193 0.1956 -1.632 0.10930
bed -11.8683 4.3851 -2.706 0.00945 **
size 61.7736 4.6383 13.318 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.44 on 47 degrees of freedom
Multiple R-squared: 0.8202, Adjusted R-squared: 0.8087
F-statistic: 71.44 on 3 and 47 DF, p-value: < 2.2e-16

================================================================
As seen, the stepwise selection method was used without tracing the steps, and the final results are displayed. The best multiple regression model for the housing price is the model that contains the age of home in years, the number of bedrooms, and the size of home in 1000 ft² as predictors.

Using Qualitative and Quantitative Predictors in a Linear Regression Model

The multiple linear regression model is flexible enough to deal with both qualitative and quantitative variables as predictors. For multiple regression methods, the response variable y must be quantitative. However, the independent predictor variables can be quantitative variables, qualitative variables, or both. A qualitative variable is one whose levels represent qualities or characteristics and can only be categorized.

As seen in the previous section, quantitative variables enter the regression model as linear terms x₁, x₂, …, x_p. In contrast, qualitative predictor variables are entered into a regression model through dummy or indicator variables. For example, in a model that relates the mean salary of a group of employees to a number of predictor variables, you may want to include the employee's ethnic background. If each employee included in your study belongs to one of three ethnic groups, say A, B, or C, you can enter the qualitative variable "ethnicity" into your model using two dummy variables:

x₁ = 1 if group B, 0 if not
x₂ = 1 if group C, 0 if not

Look at the effect these two variables have on the model E(y) = b₀ + b₁x₁ + b₂x₂: for employees in group A, E(y) = b₀ + b₁(0) + b₂(0) = b₀; for employees in group B, E(y) = b₀ + b₁(1) + b₂(0) = b₀ + b₁; and for those in group C,

E(y) = b₀ + b₁(0) + b₂(1) = b₀ + b₂

The model allows a different average response for each group. 𝑏1 measures the
difference in the average responses between groups B and A, while 𝑏2 measures
the difference between groups C and A.
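This is exactly the coding R produces when a factor is used as a predictor; a small sketch with hypothetical group labels:

ethnicity <- factor(c("A", "B", "C", "B", "A"))   # qualitative variable with k = 3 levels
model.matrix(~ ethnicity)                          # columns ethnicityB and ethnicityC are the
                                                   # two dummy variables; group A is the baseline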

When a qualitative variable involves 𝑘 categories or levels, (𝑘 − 1) dummy


variables should be added to the regression model. This model may contain other
predictor variables—quantitative or qualitative—as well as cross-products
(interactions) of the dummy variables with other variables that appear in the
model. As you can see, the process of model building—deciding on the
appropriate terms to enter into the regression model—can be quite complicated.

Example:

A study was conducted to examine the relationship between university salary y, the number of years of experience of the faculty member, and the gender of the faculty member. If you expect a straight-line relationship between mean salary and years of experience for both men and women, write the model that relates mean salary to the two predictor variables: years of experience (quantitative) and gender of the professor (qualitative).

Solution:

Since you may suspect that the mean salary lines for women and men are different, your model for mean salary E(y) may appear as in the following figure. A straight-line relationship between E(y) and years of experience x₁ implies the model E(y) = b₀ + b₁x₁ (which graphs as a straight line).
Figure: Hypothetical relationship between mean salary E(y) and years of experience (x₁), with separate lines for men and women.

The qualitative variable "gender" involves k = 2 categories, men and women. Therefore, you need (k − 1) = 1 dummy variable, x₂, defined as

x₂ = 1 if a man, 0 if a woman

and the model is expanded to become E(y) = b₀ + b₁x₁ + b₂x₂ (which graphs as two parallel lines).
The fact that the slopes of the two lines may differ means that the two predictor variables interact; that is, the change in E(y) corresponding to a change in x₁ depends on whether the professor is a man or a woman. To allow for this interaction (difference in slopes), the interaction term x₁x₂ is introduced into the model. The complete model that characterizes the graph in the previous figure is

E(y) = b₀ + b₁x₁ + b₂x₂ + b₃x₁x₂

where x₁ is the years of experience and x₂ = 1 if a man, 0 if a woman.

You can see how the model works by assigning values to the dummy variable x₂. When the faculty member is a woman, the model is

E(y) = b₀ + b₁x₁ + b₂(0) + b₃x₁(0) = b₀ + b₁x₁

which is a straight line with slope b₁ and intercept b₀. When the faculty member is a man, the model is

E(y) = b₀ + b₁x₁ + b₂(1) + b₃x₁(1) = (b₀ + b₂) + (b₁ + b₃)x₁

which is a straight line with slope b₁ + b₃ and intercept b₀ + b₂. The two lines
have different slopes and different intercepts, which allows the relationship
between salary y and years of experience 𝑥! to behave differently for men and
women.

Example (3):

Random samples of six female and six male assistant professors were selected
from among the assistant professors in a college of arts and sciences. The data on
salary and years of experience are shown in the following Table. Note that each of
the two samples (male and female) contained two professors with 3 years of
experience, but no male professor had 2 years of experience. Find the linear
regression and check the regression assumptions

Years Gender Salary


1 Man 60710
2 Man -
3 Man 63160
3 Man 63210
4 Man 64140
5 Man 65760
5 Man 65590
1 Woman 59510
2 Woman 60440
3 Woman 61340
3 Woman 61760
4 Woman 62750
5 Woman 63200
5 Woman -

R. code

===============================================================

library(gdata)
data=read.xls(file.choose( ), sheet=1)
data

model=lm(Salary~Years*factor(Gender), data=data)
summary(model)
===============================================================
As seen in the R code, we use the factor() function to define the dummy variable Gender. The output is as follows:
Call:
lm(formula = Salary ~ Years * factor(Gender), data = data)

Residuals:
Min 1Q Median 3Q Max
-238.000 -108.250 -1.232 85.833 281.000

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59459.71 223.47 266.072 < 2e-16 ***
Years 1229.13 59.37 20.702 3.11e-08 ***
factor(Gender)Woman -866.71 305.26 -2.839 0.0218 *
Years:factor(Gender)Woman -260.13 87.06 -2.988 0.0174 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 201.3 on 8 degrees of freedom


(2 observations deleted due to missingness)
Multiple R-squared: 0.9924, Adjusted R-squared: 0.9895
F-statistic: 346.2 on 3 and 8 DF, p-value: 8.372e-09

From the output, we can find the following two regression lines.
For men: E(y) = b₀ + b₁x₁ = 59459.71 + 1229.13x₁
For women: E(y) = (b₀ + b₂) + (b₁ + b₃)x₁ = (59459.71 − 866.71) + (1229.13 − 260.13)x₁ = 58593 + 969x₁
Next, consider the overall fit of the model using the analysis of variance F-test. Since the
observed test statistic in the ANOVA portion of the printout is 𝐹 = 346.2 with P-value =
0.000, you can conclude that at least one of the predictor variables is contributing
information for the prediction of y. The strength of this model is further measured by the
coefficient of determination, R-squared= 99.2%. You can see that the model appears to fit
very well.

The following R. code is for drawing the lines and checking the regression assumptions
==============================================================
nf <- layout(matrix(c(1,2,1,3), 2, 2, byrow=TRUE), respect=TRUE)
layout.show(nf)
with(data, plot(Years, Salary))
abline(a=59459.71,b=1229.13,lty=1)
abline(a=58593, b=969, lty=2)
legend(1,65000,c("Men","Women"),lty=c(1,2))
plot(model)
==============================================================

Figure: Scatterplot of salary versus years of experience with the two fitted lines (men and women), the plot of residuals versus fitted values, and the normal Q-Q plot of the residuals.

To explore the effect of the predictor variables in more detail, look at the individual t-tests for the three predictor variables. The p-values for these tests are all significant, which means that each predictor variable adds significant information to the prediction when the other two variables are already in the model. Finally, check the residual plots to make sure that there are no strong violations of the regression assumptions. These plots, which behave as expected for a properly fit model, are shown in the figure above.

