
MA150

Statistics Notes (Chapter 14)


Descriptive Methods in
Regression and Correlation

Chapter 14
CH 14.1: Linear Equations with One Independent Variable

In this chapter we will study the relationship between 2 variables for a given individual or unit.

Some variable pairs:

● Height and weight for basketball players on the WSU men’s basketball team
● Hours per week of exercise and cholesterol level for each faculty member at WSU

In these cases, we are interested in whether the two variables have some kind of relationship:
● They could be unrelated
  o Age and last digit of Social Security Number
● One variable could be used to explain the other
  o Volume of ice cream sales can be used to explain the reported number of cases of Lyme disease
● One variable could be thought of as causing the other to change
  o Number of hours worked on a HW assignment can be thought of as causing the grade on the assignment

Response Variable: the variable that is being explained:


Explanatory Variable: the variable that is being used to explain:

Sometimes it is clear which variable is the explanatory variable and which is the response
variable, sometimes it is not.

Explanatory Variable                         Response Variable
Independent Variable (Predictor Variable)    Dependent Variable
x Variable                                   y Variable

Number of hours worked                       Total wages earned
Age of a car                                 Number of miles
Gallons of gas purchased                     Cost at the pump
Volume of water in a beaker                  Weight of water in a beaker
Age of a child                               Vocabulary of a child
Height of parents                            Height of their offspring

Lurking Variable: Sometimes the two variables are both affected by a third variable, a lurking
variable, that had not been included in the study.

For example, volume of ice cream sales and number of reported cases of Lyme disease. What is
the lurking variable?

The Linear Equation

Equation 1: y = 4 + 2x
For every value of x, y has a single value.

x y
-4 -4
-2 0
0 4
2 8
4 12
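
The chart for Equation 1 can be reproduced with a few lines of Python. This is only a minimal sketch; the helper name `line` is our own choice for illustration.

```python
def line(b0, b1, x):
    """Return y = b0 + b1 * x."""
    return b0 + b1 * x

# Equation 1: y = 4 + 2x, evaluated at the x-values from the chart.
xs = [-4, -2, 0, 2, 4]
table = [(x, line(4, 2, x)) for x in xs]

print("  x   y")
for x, y in table:
    print(f"{x:>3} {y:>3}")
```

The same helper can be used to check a hand-computed chart for any other linear equation by changing the two constants.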

Equation 2: y = 2 -3x
Fill in the chart and graph the line.

x y
-2
-1
0
1
2

Every linear equation, y = b0 + b1 x, has 2 constant numbers in it: b0 and b1.

In equation 1: y = 4 + 2x
b0 = 4
b1 = 2

In equation 2: y = 2 − 3x,
Find:

b0 = ____
b1 = ____

The value b0 is called the y-intercept.
It is where the line intersects the y-axis. It shows what the value of y is when x = 0.

The value b1 is called the slope.

Slope measures how steep the line is. It determines how much the y value will change when the x value increases by 1 unit. Look back at the graph and check that these interpretations hold true for equations 1 and 2.
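
The slope interpretation can also be checked numerically. The sketch below (helper names are our own) computes how much y changes when x increases by 1 unit, at several starting values of x, for both equations.

```python
def line(b0, b1, x):
    """Return y = b0 + b1 * x."""
    return b0 + b1 * x

def change_per_unit(b0, b1, x):
    """Change in y when x increases from x to x + 1."""
    return line(b0, b1, x + 1) - line(b0, b1, x)

# Equation 1: y = 4 + 2x    Equation 2: y = 2 - 3x
changes_eq1 = [change_per_unit(4, 2, x) for x in range(-2, 3)]
changes_eq2 = [change_per_unit(2, -3, x) for x in range(-2, 3)]

print(changes_eq1)  # the same value at every x: the slope of Equation 1
print(changes_eq2)  # the same value at every x: the slope of Equation 2
```

No matter where we start on the line, the change in y for a 1-unit increase in x is the same number, which is why a straight line has a single slope.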

A positive slope
If b1 > 0, the line slopes up as we move to the right.
A negative slope
If b1 < 0, the line slopes down as we move to the right.

We can graph a linear equation using b0 and b1.

Example: y = 5 + 3x
y-intercept = ______
Slope = ______
Slope positive or negative?
Graph the line.

Example: y = 2 – 4x
y-intercept = ______
Slope = ______
Slope positive or negative?
Graph the line.

We can use linear equations to model many systems.

Example: A rental car costs $15 per day plus $2 per mile.
b0 = 15
b1 = 2

The total cost (y) for a one-day rental driven x miles can be calculated as:
y = 15 + 2x.
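
As a quick illustration, the rental model can be written as a small function. The constant and function names, and the 40-mile trip, are hypothetical choices made only for this sketch.

```python
DAILY_FEE = 15      # b0: dollars charged for the day
COST_PER_MILE = 2   # b1: dollars charged per mile driven

def rental_cost(miles):
    """Total cost y (in dollars) of a one-day rental driven `miles` miles."""
    return DAILY_FEE + COST_PER_MILE * miles

print(rental_cost(0))    # 15, the y-intercept: cost before driving anywhere
print(rental_cost(40))   # 95
```

The y-intercept is the cost at x = 0 miles, and each additional mile adds the slope, $2, to the total.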

1. What is the equation for the total cost of buying x gallons of gasoline this week?

2. If it takes you 10 minutes to stretch and 9 minutes to run each mile, how much time does it
take you to complete a workout if the workout consists of stretching and running x miles?

CH 14.2: The Regression Equation

There are several different types of relations between two variables:


A relationship is linear when, plotted on a scatter diagram, the points follow the general pattern of a line.

A relationship is nonlinear when, plotted on a scatter diagram, the points follow a general pattern, but it is not a line.

A relationship has no correlation when, plotted on a scatter diagram, the points do not show any pattern.

Linear Relations

Points that cluster around a line

Positive (the points slant upwards to the right)
Example: Grip Strength and Arm Strength

Negative (the points slant downwards to the right)
Example: Elevation and Temperature

Nonlinear Relationship (points that have a trend, but not around a line)
Example: Total Length of Fish and Mass of Fish

No Relationship (points show no pattern)
Example: Boys’ Height and Birth Month

Example: Find the relationship between x and y.

x y
1 1
1 2
2 2
4 6

[Scatter plot of the four data points, X on the horizontal axis and Y on the vertical axis]

Line A: Y = 0.50 + 1.25x
[Scatter plot of the data with Line A drawn through it]

Line B: Y = -0.25 + 1.50x
[Scatter plot of the data with Line B drawn through it]

Which line fits the points better, A or B?
That depends on how we define a better fit.

Definition: $\hat{y}$ denotes the y-value predicted by the straight line for a value of x.

At x = 2:
For Line A
$\hat{y} = 0.5 + 1.25 \cdot 2$
$\hat{y} = 3$
For Line B
$\hat{y} = -0.25 + 1.50 \cdot 2$
$\hat{y} = 2.75$
Define the error e (residual) to be the error made in using the line to predict the y-value:
$e = y - \hat{y}$.

For Line A the error at x = 2 is
$e = 2 - 3 = -1$
For Line B the error at x = 2 is
$e = 2 - 2.75 = -0.75$

So we see that at the point x = 2, Line B is a better fit than Line A.

Define
$(x_i, y_i)$ = observation i

For a given line, let $\hat{y}_i$ = the predicted value of point i (the value of the line at $x_i$).
The residual (error) at point i is denoted by $e_i = y_i - \hat{y}_i$.

● The best line minimizes “total error”.
● We choose to define total error as Total Sum of Squared Errors = $\sum_i e_i^2$
● The line having the smaller sum of squared errors fits the data best.

Line B fits the data better than Line A.
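
The comparison of Line A and Line B can be verified by computing the sum of squared errors for each line over the four data points. This is a minimal sketch; the function name `sse` is our own.

```python
# The four observations (x, y) from the example.
points = [(1, 1), (1, 2), (2, 2), (4, 6)]

def sse(b0, b1, data):
    """Sum of squared errors e_i = y_i - (b0 + b1 * x_i) over all points."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in data)

sse_a = sse(0.50, 1.25, points)    # Line A: y = 0.50 + 1.25x
sse_b = sse(-0.25, 1.50, points)   # Line B: y = -0.25 + 1.50x

print(sse_a, sse_b)  # 1.875 1.25
```

Line B has the smaller sum of squared errors (1.25 versus 1.875), which is what "fits the data better" means under this definition of total error.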

Least Square Criterion: The straight line that best fits a set of data points is the one having the
smallest possible sum of squared errors.

Regression line: The straight line that best fits a set of data points according to the least-squares
criterion.

How do we get the equation for the Regression Line?

We use SPSS, a graphing calculator, or other technology. But this is how it’s done:

Define
$\bar{x} = \frac{1}{n} \sum x_i$
$\bar{y} = \frac{1}{n} \sum y_i$
$S_{xx} = \sum (x_i - \bar{x})^2$
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$

Then the Regression Equation has the form

Y = b0 + b1x where

$b_1 = \frac{S_{xy}}{S_{xx}}$
$b_0 = \bar{y} - b_1 \bar{x}$.
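
To see these formulas in action, the sketch below applies them to the four points from the Line A / Line B example, taking the slope as S_xy / S_xx and the y-intercept as ȳ minus the slope times x̄. The function name `least_squares` is our own; in practice SPSS or a calculator does this step.

```python
def least_squares(data):
    """Return (b0, b1) for the regression line y = b0 + b1 * x."""
    n = len(data)
    x_bar = sum(x for x, _ in data) / n
    y_bar = sum(y for _, y in data) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in data)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # y-intercept
    return b0, b1

points = [(1, 1), (1, 2), (2, 2), (4, 6)]
b0, b1 = least_squares(points)
print(b0, b1)  # -0.25 1.5
```

For these four points the regression line is y = -0.25 + 1.50x, which is exactly Line B. That is why no other line can have a smaller sum of squared errors for this data.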

Example: Finding the relationship between number of absences and average final grade.

Number of Absences    Average Final Grade
0.0                   89.2
1.0                   86.4
2.0                   83.5
3.0                   81.1
4.0                   78.2
5.0                   73.9
6.0                   64.3
7.0                   71.8
8.0                   65.5
9.0                   66.2

Each individual is represented by a point in the diagram.

● The explanatory (x) variable is plotted on the horizontal scale
● The response (y) variable is plotted on the vertical scale
Note the truncated vertical scale. Truncating the axis is more acceptable when the vertical axis does not represent counts.
Scatter plots allow us to see relationships.
A relationship clearly exists between the Number of Absences and Average Grade. The scattering suggests that some of the variation in Grade is not accounted for by Absences.

The points seem to be scattered about a line, so we use technology to find the line of best fit, which is:
y = 88.73 – 2.83x

Regression Equation:
Average Grade = 88.73 – 2.83 Number of Absences

Slope = -2.83
y-Intercept = 88.73

Calculating a Fitted Value:

Graphically:

Algebraically:
Find the grade the model predicts for a student that has 7 absences.
$\hat{y} = 88.73 - 2.83 \cdot 7$
$\hat{y} = 68.92$

Finding the residual:
Graphically:

Algebraically:
From the data table, the observed Average Final Grade at 7 absences is y = 71.8, so
$e = y - \hat{y}$
e = 71.8 – 68.92
e = 2.88
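
The fitted value and residual at 7 absences can be checked in code. This sketch uses the rounded regression equation from the notes; the function name `predict` and the list names are our own.

```python
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

def predict(x):
    """Fitted value from the rounded equation: Average Grade = 88.73 - 2.83x."""
    return 88.73 - 2.83 * x

y_hat = predict(7)                      # fitted value at 7 absences
observed = grades[absences.index(7)]    # observed grade at 7 absences
residual = observed - y_hat             # e = y - y_hat

print(round(y_hat, 2), round(residual, 2))  # 68.92 2.88
```

A positive residual means the observed grade lies above the regression line; a negative residual means it lies below.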
When is it valid to interpret the slope? When the data seem to be linear (scattered about a line).
Interpretation of slope: For any unit increase in x-value, the model predicts the y-value will change by b1 (the value of the slope).

Example: What is the slope of the Average Grade regression equation? Can you interpret the
slope? Why or why not. If you can interpret the slope, then do so.

Average Grade = 88.73 – 2.83 Number of Absences


The slope is -2.83.
You can interpret the slope because the points seem to be scattered about the regression line.
Interpretation of slope:
For every additional absence, the model predicts the Average Grade will change by -2.83.
or
For every additional absence, the model predicts the Average Grade will decrease by 2.83.

When is it valid to interpret the y-intercept? A regression model is valid only in the range of the x-data. The y-intercept occurs when x = 0, so it is only valid to interpret the y-int when x = 0 is in the range of the data.
Interpretation of y-intercept: When x is equal to 0, the model predicts that the y-value is equal to b0 (the value of y-int).

Example: What is the y-int of the Average Grade regression equation? Can you interpret the y-
int? Why or why not. If you can interpret the y-int, then do so.

Average Grade = 88.73 – 2.83 Number of Absences

The y-intercept = 88.73
The y-intercept is interpretable because x = 0 is in the range of the x-data.
Interpretation: When the number of absences is equal to 0, the model predicts that the Average
Grade is equal to 88.73.

Identify and interpret the coefficients of the parents’ height/student’s height model

[Regression plot: Heights (vertical axis) against Mom and Dad (horizontal axis)]
Y = -4.97942 + 0.526150X
R-Sq = 74.8 %


What is the slope of the Student Heights regression equation? Can you interpret the slope? Why
or why not. If you can interpret the slope, then do so.

What is the y-int of the Students Height regression equation? Can you interpret the y-int? Why or
why not. If you can interpret the y-int, then do so.

WARNINGS FOR THE USE OF REGRESSION

Using the regression equation to make predictions is OK only within the range of the data.

Extrapolation: The use of the regression equation to make predictions outside the range of the
data.
Extrapolation is NOT VALID.
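
One way to guard against accidental extrapolation is to have the prediction function check the range of the data first. The sketch below does this for the Average Grade model, whose x-data run from 0 to 9 absences; the names `X_MIN`, `X_MAX`, and `predict_grade`, and the choice to raise an error, are our own assumptions rather than anything SPSS does.

```python
X_MIN, X_MAX = 0, 9  # range of Number of Absences observed in the data

def predict_grade(absences):
    """Predict Average Grade, refusing to extrapolate beyond the data."""
    if not X_MIN <= absences <= X_MAX:
        raise ValueError(
            f"{absences} absences is outside the data range "
            f"[{X_MIN}, {X_MAX}]; predicting here would be extrapolation"
        )
    return 88.73 - 2.83 * absences

print(round(predict_grade(3), 2))  # 80.24

try:
    predict_grade(40)
except ValueError as err:
    print(err)
```

Without the check, the equation at 40 absences would give 88.73 − 2.83 · 40 = −24.47, a negative grade, which shows why predictions outside the range of the data cannot be trusted.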

Outliers: An outlier is an observation that lies outside the overall pattern of data.

Influential Observation: An influential observation is a data point whose removal causes the
regression equation to change considerably.
If a point is influential you can either
1. Collect more data to fill in
2. Remove the point from the analysis and limit the scope of the analysis

Tiger Woods joins the regulars at the country club: (An influential-observation analysis)

Data:
Club Head Speed (mph)    Distance (yards)
100                      257
102                      264
103                      274
101                      266
105                      277
100                      263
99                       258
105                      275
120                      305

[Scatter plot of the data, with the Tiger Woods point labeled]

Distance = 39.46 + 2.229 Club Speed


r = 0.975

Outlier Removed
(axes forced to remain constant)
Distance = -55.80 + 3.17 Club Speed

Put both lines on one graph.

Conclusion: The Tiger Woods point is an influential point because removing it caused a considerable shift in the regression line.
Final Model:
Distance = -55.80 + 3.17 Club Speed
Limit the scope of the analysis: Valid for club speeds less than 115 mph
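
The effect of the influential point can be reproduced by fitting the regression line twice, once with all nine golfers and once without the fastest swing. This is a sketch under one assumption: that the Tiger Woods observation is the 120 mph point, which is the only one above the 115 mph limit. The function name `least_squares` is our own.

```python
def least_squares(data):
    """Return (b0, b1) for the regression line y = b0 + b1 * x."""
    n = len(data)
    x_bar = sum(x for x, _ in data) / n
    y_bar = sum(y for _, y in data) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in data)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

speeds = [100, 102, 103, 101, 105, 100, 99, 105, 120]
distances = [257, 264, 274, 266, 277, 263, 258, 275, 305]
data = list(zip(speeds, distances))

b0_all, b1_all = least_squares(data)

# Assumption: the influential observation is the one swing above 115 mph.
trimmed = [(x, y) for x, y in data if x < 115]
b0_trim, b1_trim = least_squares(trimmed)

print(f"all points:    Distance = {b0_all:.2f} + {b1_all:.3f} Club Speed")
print(f"point removed: Distance = {b0_trim:.2f} + {b1_trim:.2f} Club Speed")
```

Removing a single observation changes the slope from about 2.2 to about 3.2 yards per mph and moves the intercept by roughly 95 yards, which is what makes the point influential.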
CH 14.3: The Coefficient of Determination
How reliable is the regression line?

Coefficient of Determination: $r^2$
Consider the amount of variability in the y values. The regression model explains some of the variability. The Coefficient of Determination, $r^2$, is the proportion of this variability that the regression model explains (often reported as a percentage).

$0 \le r^2 \le 1$
● When $r^2$ is close to 1, the model is explaining most of the variability in the y-values.
● When $r^2$ is close to 0, the model is explaining almost none of the variability in the y-values.
● You should use technology (a calculator or software) to compute $r^2$.
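
For readers curious what the technology is doing, the sketch below computes the coefficient of determination for the absences data. It uses the standard identity r² = 1 − SSE/SST, which these notes do not state: SST is the total variation of the y-values about their mean, and SSE is the variation left over after fitting the regression line. All variable names are our own.

```python
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n
s_xx = sum((x - x_bar) ** 2 for x in absences)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))
b1 = s_xy / s_xx          # slope of the regression line
b0 = y_bar - b1 * x_bar   # y-intercept of the regression line

# Total variation in y, and the variation the line leaves unexplained.
sst = sum((y - y_bar) ** 2 for y in grades)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(absences, grades))

r_squared = 1 - sse / sst
print(round(r_squared, 3))  # 0.898
```

So the regression on Number of Absences explains about 89.8% of the variability in Average Final Grade.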

Some examples of $r^2$

[Regression plot] Heights vs. Mom and Dad: Y = -4.97942 + 0.526150X, R-Sq = 74.8 %
[Regression plot] Heights vs. Mom height: Y = 11.3885 + 0.870252X, R-Sq = 37.6 %
[Regression plot] Heights vs. Dad height: Y = 33.0924 + 0.490966X, R-Sq = 20.7 %
[Regression plot] TAX EFF vs. ENERGY: Y = 112.676 - 5.26403X, R-Sq = 95.0 %
[Regression plot] PRICE vs. AGE: Y = 371.602 - 27.9029X, R-Sq = 93.7 %
[Regression plot] Plant weight and hydrocarbons, EMISSION vs. WEIGHT: Y = 3.52369 + 0.162848X, R-Sq = 11.0 %

CH 14.4: Linear Correlation

The linear correlation coefficient, r, is a measure of the strength of the linear relation between two quantitative variables.

Some properties of the linear correlation coefficient

● r is a unitless measure (so that r would be the same for a data set whether x and y are measured in feet, inches, meters, or fathoms)
● r is always between –1 and +1
● Positive values of r correspond to positive relations
● Negative values of r correspond to negative relations

Some more properties of the linear correlation coefficient

● The closer r is to +1, the stronger the positive relation. When r = +1, there is a perfect positive relation; the points fall exactly on a straight line with a positive slope.
● The closer r is to –1, the stronger the negative relation. When r = –1, there is a perfect negative relation; the points fall exactly on a straight line with a negative slope.
● The closer r is to 0, the weaker the linear relation (either positive or negative).

[Six scatterplots illustrating different correlations:]
1. Maximum positive correlation (r = 1.0)
2. Strong positive correlation (r = 0.80)
3. Zero correlation (r = 0)
4. Minimum negative correlation (r = -1.0)
5. Moderate negative correlation (r = -0.43)
6. Strong correlation with outlier (r = 0.71)

Several points are evident from the scatterplots.

● When the slope of the line in the plot is negative, the correlation is negative; when the slope is positive, the correlation is positive.
● The strongest correlations (r = 1.0 and r = -1.0) occur when data points fall exactly on a straight line.
● The correlation becomes weaker as the data points become more scattered.
● If the data points fall in a random pattern, the correlation is close to zero.
● Correlation is affected by outliers. Compare the first scatterplot with the last scatterplot. The single outlier in the last plot greatly reduces the correlation (from 1.00 to 0.71).

We won’t calculate correlation by hand; we can do that using SPSS.

We will need to know how to interpret it.

SPSS Output:

Model Summary

Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .947a    .898        .885                 3.0673

a. Predictors: (Constant), Number Absences

SPSS reports R without a sign. Because the slope of the Average Grade regression line is negative, the correlation coefficient is r = -0.947.
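
As a check on the SPSS output, the sketch below computes r for the absences data with the standard formula r = S_xy / sqrt(S_xx · S_yy). These notes do not give that formula, so treat it as an outside fact; the variable names are our own.

```python
import math

absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n
s_xx = sum((x - x_bar) ** 2 for x in absences)
s_yy = sum((y - y_bar) ** 2 for y in grades)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3), round(r ** 2, 3))  # -0.947 0.898
```

The sign of r comes from S_xy, which is negative here because grades tend to fall as absences rise. Squaring r gives the R Square value in the table.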

Summary of Correlation

● Correlation is not causation!

● Just because two variables are correlated does not mean that one causes the other to change.
  o Example: There is a strong correlation between shoe sizes and vocabulary sizes for grade school children.
    - Clearly larger shoe sizes do not cause larger vocabularies
    - Clearly larger vocabularies do not cause larger shoe sizes (lurking variable?)

● Correlation between two variables can be described with both visual and numeric summaries
  o Visual summaries: Scatter plots
  o Numerical summaries: correlation coefficient

● Care should be taken in the interpretation of linear correlation (nonlinearity and causation)
