Chapter 14
CH 14.1: Linear Equations with One Independent Variable
In this chapter we will study the relationship between 2 variables for a given individual or unit.
In these cases, we are interested in whether the two variables have some kind of relationship.
They could be unrelated
o age and last digit of Social Security Number
One variable could be used to explain the other
o Volume of ice cream sales can be used to explain the reported number of cases of
Lyme disease
One variable could be thought of as causing the other to change
o Number of hours worked on a HW assignment can be thought of as causing the
grade on the assignment
Sometimes it is clear which variable is the explanatory variable and which is the response
variable; sometimes it is not.
Lurking Variable: Sometimes the two variables are both affected by a third variable, a lurking
variable, that had not been included in the study.
For example, volume of ice cream sales and number of reported cases of Lyme disease. What is
the lurking variable?
Equation 1: y = 4 + 2x
For every value of x, y has a single value.
x y
-4 -4
-2 0
0 4
2 8
4 12
Equation 2: y = 2 - 3x
Fill in the chart and graph the line.
x y
-2
-1
0
1
2
Every linear equation, y = b0 + b1 x, has 2 constant numbers in it: b0 and b1.
In equation 1: y = 4 + 2x
b0 = 4
b1 = 2
In equation 2: y = 2 - 3x,
Find:
b0 = ______
b1 = ______
The value b0 is called the y-intercept.
It is where the line intersects the y-axis. It shows what the value of y is when x = 0.
The value b1 is called the slope.
It measures how steep the line is. It determines how much the y-value will change when the x-value
increases by 1 unit. Look back at the graph and check that these interpretations hold true
for equations 1 and 2.
A positive slope
If b1 > 0 the line slopes up as we move to the right.
A negative slope
If b1 < 0 the line slopes down as we move to the right.
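The interpretations of b0 and b1 can be checked numerically. The following is a minimal Python sketch (not part of the original notes); the helper name `line` is hypothetical. It evaluates Equations 1 and 2, reads the y-intercept as the value at x = 0, and reads the slope as the change in y when x increases by 1 unit.

```python
def line(b0, b1):
    """Return a function that computes y = b0 + b1 * x (hypothetical helper)."""
    return lambda x: b0 + b1 * x

eq1 = line(4, 2)    # Equation 1: y = 4 + 2x
eq2 = line(2, -3)   # Equation 2: y = 2 - 3x

# Reproduce the x-y table given for Equation 1.
table1 = [(x, eq1(x)) for x in (-4, -2, 0, 2, 4)]
print(table1)

for name, f in (("Equation 1", eq1), ("Equation 2", eq2)):
    intercept = f(0)          # value of y when x = 0
    slope = f(1) - f(0)       # change in y when x increases by 1 unit
    direction = "up" if slope > 0 else "down"
    print(f"{name}: y-intercept = {intercept}, slope = {slope}, line slopes {direction}")
```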
Example: y = 5 + 3 x
y-intercept = ______
Slope = ______
Example: y = 2 - 4x
y-intercept = ______
Slope = ______
Slope positive or negative?
Graph the line.
Example: A rental car costs $15 per day plus $2 per mile. Let x be the number of miles driven.
b0 = 15
b1 = 2
Total cost (y) for a one-day rental can be calculated as:
y = 15 + 2x.
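As a small illustration, the rental-car equation can be written as a function. This is only a sketch: the function name `rental_cost` and the sample mileages are made up for the example, and x is assumed to be the number of miles driven.

```python
def rental_cost(miles, daily_fee=15.0, per_mile=2.0):
    """Total cost in dollars of a one-day rental: y = 15 + 2x."""
    if miles < 0:
        raise ValueError("miles must be non-negative")
    return daily_fee + per_mile * miles

# The daily fee is the y-intercept (cost at 0 miles);
# each extra mile adds the slope, $2, to the total.
for miles in (0, 10, 50):
    print(miles, rental_cost(miles))
```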
1. What is the equation for the total cost of buying x gallons of gasoline this week?
2. If it takes you 10 minutes to stretch and 9 minutes to run each mile, how much time does it
take you to complete a workout if the workout consists of stretching and running x miles?
CH 14.2: The Regression Equation
A relationship is nonlinear when, plotted on a scatter diagram, the points follow a general
pattern, but the pattern is not a line.
A relationship has no correlation when, plotted on a scatter diagram, the points do not show any
pattern.
Linear Relations
Nonlinear Relationship (Points that have a trend, but not around a line)
Example: Total Length of Fish and Mass of Fish
No Relationship
Points have no relationship
Example: Boys Height and Birth Month
x y
1 1
1 2
2 2
4 6
[Scatter plot of the four data points above; horizontal axis X, vertical axis Y]
Line A: Y = 0.50 + 1.25x
[Plot of the data points with Line A; horizontal axis X, vertical axis Y]
Line B: Y = -0.25 + 1.50x
[Plot of the data points with Line B; horizontal axis X, vertical axis Y]
Which line fits the points better, A or B?
That depends on how we define a better fit.
Definition: ŷ denotes the y-value predicted by the straight line for a value of x.
At x = 2:
For Line A
ŷ = 0.5 + 1.25·2
ŷ = 3
For Line B
ŷ = −0.25 + 1.50·2
ŷ = 2.75
Define the error e (residual) to be the error made in using the line to predict the y-value.
e = y − ŷ.
The observed value at x = 2 is y = 2, so e = 2 − 3 = −1 for Line A and e = 2 − 2.75 = −0.75 for Line B.
So we see that at the point x = 2, Line B is a better fit than Line A.
Define
(xi, yi) = observation i
For a given line, let ŷi = the predicted value of point i (the value of the line at xi)
The residual (error) at point i is denoted by ei = yi − ŷi
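The residual at every data point can be computed the same way as at x = 2. The sketch below is in Python and is not part of the original notes; it assumes the four (x, y) pairs in the table above are the plotted data, and the function names are hypothetical.

```python
# Data from the x-y table: (1, 1), (1, 2), (2, 2), (4, 6)
points = [(1, 1), (1, 2), (2, 2), (4, 6)]

def predict(b0, b1, x):
    """Predicted value y-hat on the line y = b0 + b1 * x."""
    return b0 + b1 * x

def residuals(b0, b1, data):
    """Residual e_i = y_i - y-hat_i for each observation."""
    return [y - predict(b0, b1, x) for x, y in data]

res_a = residuals(0.50, 1.25, points)    # Line A: Y = 0.50 + 1.25x
res_b = residuals(-0.25, 1.50, points)   # Line B: Y = -0.25 + 1.50x

for (x, y), ea, eb in zip(points, res_a, res_b):
    print(f"x={x}, y={y}: residual A = {ea}, residual B = {eb}")
```

A line can be closer at one point and farther away at another, which is why a single overall measure of error is needed to compare lines.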
The best line minimizes “total error”.
We choose to define total error as Total Sum of Squared Errors = Σ ei²
The line having the smaller sum of squared errors fits the data best.
Least-Squares Criterion: The straight line that best fits a set of data points is the one having the
smallest possible sum of squared errors.
Regression line: The straight line that best fits a set of data points according to the least-squares
criterion.
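To compare Lines A and B under the least-squares criterion, the sum of squared errors can be computed for each. This is a minimal Python sketch under the assumption that the four points in the x-y table are the data being fitted; `sum_squared_errors` is a hypothetical helper name.

```python
points = [(1, 1), (1, 2), (2, 2), (4, 6)]   # data from the x-y table

def sum_squared_errors(b0, b1, data):
    """Sum over all points of (y_i - (b0 + b1 * x_i)) squared."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in data)

sse_a = sum_squared_errors(0.50, 1.25, points)    # Line A
sse_b = sum_squared_errors(-0.25, 1.50, points)   # Line B

better = "A" if sse_a < sse_b else "B"
print(f"SSE for Line A = {sse_a}, SSE for Line B = {sse_b}, better fit: Line {better}")
```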
Define
x̄ = (1/n) Σ xi
ȳ = (1/n) Σ yi
Sxx = Σ (xi − x̄)²
Sxy = Σ (xi − x̄)(yi − ȳ)
The regression line is ŷ = b0 + b1 x, where
b1 = Sxy / Sxx
b0 = ȳ − b1 x̄.
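The formulas can be applied directly to a small data set. The Python sketch below is not from the original notes. It follows the chapter's convention that b1 is the slope and b0 is the y-intercept, so b1 = Sxy / Sxx and b0 = ȳ − b1·x̄, and it uses the four points from the earlier x-y table as an assumed example data set.

```python
def least_squares(data):
    """Return (b0, b1) for the least-squares line through (x, y) pairs."""
    n = len(data)
    x_bar = sum(x for x, _ in data) / n
    y_bar = sum(y for _, y in data) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in data)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    b1 = s_xy / s_xx            # slope
    b0 = y_bar - b1 * x_bar     # y-intercept
    return b0, b1

b0, b1 = least_squares([(1, 1), (1, 2), (2, 2), (4, 6)])
print(f"regression line: y = {b0} + {b1}x")
```

For these four points the result coincides with Line B, which is consistent with Line B having the smaller sum of squared errors.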
Example: Finding the relationship between number of absences and average final grade.
The points seem to be scattered about a line, so we use technology to find the line of best fit, which is:
y = 88.73 – 2.83x
Regression Equation:
Average Grade = 88.73 – 2.83 (Number of Absences)
Slope = -2.83
y-Intercept = 88.73
Graphically:
Algebraically:
Find the grade the model predicts for a student who has 7 absences.
ŷ = 88.73 − 2.83 ∙ 7
ŷ = 68.92
Finding the residual:
Graphically:
Algebraically:
e = y − ŷ
e = 71.8 – 68.92
e = 2.88
Data:
Number of Absences    Average Final Grade
0.0                   89.2
1.0                   86.4
2.0                   83.5
3.0                   81.1
4.0                   78.2
5.0                   73.9
6.0                   64.3
7.0                   71.8
8.0                   65.5
9.0                   66.2
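The "technology" step can be reproduced from the table. The sketch below uses plain Python as a stand-in (the notes do not say which software was used for this example) and applies the least-squares formulas b1 = Sxy / Sxx and b0 = ȳ − b1·x̄ to the absences data. It then repeats the prediction and residual at 7 absences using the rounded coefficients.

```python
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n
s_xx = sum((x - x_bar) ** 2 for x in absences)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))

b1 = s_xy / s_xx           # slope
b0 = y_bar - b1 * x_bar    # y-intercept
print(f"Average Grade = {b0:.2f} + ({b1:.2f}) * Number of Absences")

# Prediction and residual at 7 absences, using the rounded coefficients
# from the notes so the arithmetic matches the worked example.
predicted = round(88.73 - 2.83 * 7, 2)
observed = grades[absences.index(7)]
residual = round(observed - predicted, 2)
print(predicted, residual)
```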
When is it valid to interpret the slope? When the data seem to be linear (scattered about a
line).
Interpretation of slope: For each one-unit increase in the x-value, the model predicts the y-value will
change by b1 (the value of the slope).
Example: What is the slope of the Average Grade regression equation? Can you interpret the
slope? Why or why not. If you can interpret the slope, then do so.
When is it valid to interpret the y-intercept? A regression model is valid only in the range of
the x-data. The y-intercept occurs when x = 0, so it is only valid to interpret the y-int when x=0
is in the range of the data.
Interpretation of y-intercept: When x is equal to 0, the model predicts that the y-value is equal
to b0 (the value of the y-int).
Example: What is the y-int of the Average Grade regression equation? Can you interpret the y-
int? Why or why not. If you can interpret the y-int, then do so.
The y-intercept = 88.73
The y-intercept is interpretable because x = 0 is in the range of the x-data.
Interpretation: When the number of absences is equal to 0, the model predicts that the Average
Grade is equal to 88.73.
Identify and interpret the coefficients of the parent’s height/student’s height model
Regression Plot (vertical axis: Heights)
Y = -4.97942 + 0.526150X
R-Sq = 74.8 %
What is the y-int of the Students Height regression equation? Can you interpret the y-int? Why or
why not. If you can interpret the y-int, then do so.
Using the regression equation to make predictions is OK only within the range of the data.
Extrapolation: The use of the regression equation to make predictions outside the range of the
data.
Extrapolation is NOT VALID.
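One way to enforce this rule in practice is to refuse predictions outside the observed x-range. The Python sketch below is only an illustration: it uses the Average Grade model from the absences example, whose data cover 0 to 9 absences, and the function name `predict_grade` is hypothetical.

```python
def predict_grade(absences, x_min=0, x_max=9):
    """Predict Average Grade from absences, only inside the range of the data."""
    if not x_min <= absences <= x_max:
        # Outside the observed range a prediction would be extrapolation.
        raise ValueError(
            f"{absences} absences is outside the data range [{x_min}, {x_max}]"
        )
    return 88.73 - 2.83 * absences

print(round(predict_grade(7), 2))    # inside the range of the data

try:
    predict_grade(15)                # outside the range of the data
except ValueError as err:
    print("refused:", err)
```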
Outliers: An outlier is an observation that lies outside the overall pattern of data.
Influential Observation: An influential observation is a data point whose removal causes the
regression equation to change considerably.
If a point is influential you can either
1. Collect more data to fill in
2. Remove the point from the analysis and limit the scope of the analysis
Tiger Woods joins the regulars at the country club: (An influential analysis)
Data:
Club Head Speed (mph)    Distance (yards)
100                      257
102                      264
103                      274
101                      266
105                      277
100                      263
99                       258
105                      275
120                      305
[Scatter plot of the data, with the Tiger Woods point labeled]
Outlier Removed
(forced axes to remain constant)
Distance = -55.80 + 3.17 Club Speed
Conclusion: The Tiger Woods point is an influential observation because removing it caused a significant
shift in the regression line.
Final Model:
Distance = -55.80 + 3.17 Club Speed
Limit the scope of the analysis: Valid for club speeds less than 115 mph
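The effect of the influential point can be checked by fitting the line with and without it. This Python sketch is not part of the original notes. It assumes the 120 mph observation is the Tiger Woods point, which is an inference from the 115 mph limit on the final model, and `fit` is a hypothetical helper implementing the least-squares formulas from 14.2.

```python
# (club head speed in mph, distance in yards)
data = [(100, 257), (102, 264), (103, 274), (101, 266), (105, 277),
        (100, 263), (99, 258), (105, 275), (120, 305)]

def fit(pairs):
    """Least-squares (b0, b1) with b1 = Sxy / Sxx and b0 = y_bar - b1 * x_bar."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

b0_all, b1_all = fit(data)

# Assumption: the point above 115 mph is the influential observation.
trimmed = [(x, y) for x, y in data if x < 115]
b0_trim, b1_trim = fit(trimmed)

print(f"all points:      Distance = {b0_all:.2f} + {b1_all:.2f} Club Speed")
print(f"outlier removed: Distance = {b0_trim:.2f} + {b1_trim:.2f} Club Speed")
```

Comparing the two printed equations shows how much a single observation shifts the fitted line.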
CH 14.3: The Coefficient of Determination
How reliable is the regression line?
Coefficient of Determination: r²
Consider the amount of variability in the y-values. The regression model explains some of the
variability. The Coefficient of Determination, r², is the proportion of this variability that the
regression model explains (software output often reports it as a percentage, R-Sq).
0 ≤ r² ≤ 1
When r² is close to 1, the model is explaining most of the variability in the y-values.
When r² is close to 0, the model is explaining almost none of the variability in the y-values.
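The "proportion of variability explained" can be computed from sums of squares. The Python sketch below assumes the standard identity r² = 1 − SSE/SST, where SSE is the sum of squared errors from 14.2 and SST is the total squared variation of the y-values about their mean. The name SST is not used in these notes. The data are the absences and grades from the earlier example.

```python
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n
s_xx = sum((x - x_bar) ** 2 for x in absences)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Variability left unexplained by the line (sum of squared residuals).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(absences, grades))
# Total variability of the y-values about their mean.
sst = sum((y - y_bar) ** 2 for y in grades)

r_squared = 1 - sse / sst
print(f"r^2 = {r_squared:.3f}")
```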
Some examples of r²
Regression Plot (vertical axis: Heights)
Y = -4.97942 + 0.526150X
R-Sq = 74.8 %
Regression Plot (heights vs. Mom height)
Y = 11.3885 + 0.870252X
R-Sq = 37.6 %
Regression Plot (heights vs. Dad height)
Y = 33.0924 + 0.490966X
R-Sq = 20.7 %
Regression Plot (TAX EFF vs. ENERGY)
Y = 112.676 - 5.26403X
R-Sq = 95.0 %
Regression Plot (PRICE vs. AGE)
Y = 371.602 - 27.9029X
R-Sq = 93.7 %
Plant weight and hydrocarbons
Regression Plot (EMISSION vs. WEIGHT)
Y = 3.52369 + 0.162848X
R-Sq = 11.0 %
The linear correlation coefficient, r, is a measure of the strength of linear relation between two
quantitative variables.
[Scatter plots: Maximum positive correlation (r = 1.0); Strong positive correlation (r = 0.80); Zero correlation (r = 0)]
When the slope of the line in the plot is negative, the correlation is negative; when the slope
is positive, the correlation is positive.
The strongest correlations (r = 1.0 and r = -1.0 ) occur when data points fall exactly on a
straight line.
The correlation becomes weaker as the data points become more scattered.
If the data points fall in a random pattern, the correlation is close to zero.
Correlation is affected by outliers. Compare the first scatterplot with the last scatterplot.
The single outlier in the last plot greatly reduces the correlation (from 1.00 to 0.71).
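Several of these properties can be checked numerically. The Python sketch below assumes the standard formula r = Sxy / √(Sxx · Syy), where Syy is defined like Sxx but for the y-values. Neither the formula nor Syy appears in these notes. The data sets are points lying exactly on Equations 1 and 2 from 14.1 and the four scattered points from 14.2.

```python
from math import sqrt

def correlation(data):
    """Linear correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(data)
    x_bar = sum(x for x, _ in data) / n
    y_bar = sum(y for _, y in data) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in data)
    s_yy = sum((y - y_bar) ** 2 for _, y in data)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    return s_xy / sqrt(s_xx * s_yy)

on_line_up = [(x, 4 + 2 * x) for x in (-4, -2, 0, 2, 4)]     # exactly on y = 4 + 2x
on_line_down = [(x, 2 - 3 * x) for x in (-2, -1, 0, 1, 2)]   # exactly on y = 2 - 3x
scattered = [(1, 1), (1, 2), (2, 2), (4, 6)]                 # near, but not on, a line

r_up = correlation(on_line_up)
r_down = correlation(on_line_down)
r_scattered = correlation(scattered)
print(r_up, r_down, round(r_scattered, 3))
```

Points exactly on a rising line give the strongest positive correlation, points exactly on a falling line give the strongest negative correlation, and scattered points give a value strictly between the two.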
SPSS Output:
Model Summary
Summary of Correlation
Just because two variables are correlated does not mean that one causes the other to
change.
o Example: There is a strong correlation between shoe sizes and vocabulary sizes
for grade school children.
Clearly larger shoe sizes do not cause larger vocabularies
Clearly larger vocabularies do not cause larger shoe sizes (lurking
variable?)
Correlation between two variables can be described with both visual and numeric
summaries
o Visual summaries: Scatter plots
o Numerical summaries: correlation coefficient
Care should be taken in the interpretation of linear correlation (nonlinearity and
causation)