MA150 Statistics Notes (Chapter 14) Descriptive Methods in Regression and Correlation

CH 14.1: Linear Equations with One Independent Variable
In this chapter we will study the relationship between 2 variables for a given individual or unit.
Some variable pairs:

• Height and weight for basketball players on the WSU men’s basketball team
• Hours per week of exercise and cholesterol level for each faculty member at WSU
In these cases, we are interested in whether the two variables have some kind of relationship.
• They could be unrelated
o Age and last digit of Social Security Number
• One variable could be used to explain the other
o Volume of ice cream sales can be used to explain the reported number of cases of Lyme disease
• One variable could be thought of as causing the other to change
o Number of hours worked on a HW assignment can be thought of as causing the grade on the assignment
Response Variable: the variable that is being explained:

Explanatory Variable: the variable that is being used to explain:
Sometimes it is clear which variable is the explanatory variable and which is the response
variable, sometimes it is not.
Explanatory Variable                        Response Variable
Independent Variable (Predictor Variable)   Dependent Variable
x Variable                                  y Variable
Number of hours worked                      Total wages earned
Age of a car                                Number of miles
Gallons of gas purchased                    Cost at the pump
Volume of water in a beaker                 Weight of water in a beaker
Age of a child                              Vocabulary of a child
Height of parents                           Height of their offspring
Lurking Variable: Sometimes the two variables are both affected by a third variable, a lurking
variable, that had not been included in the study.
For example, volume of ice cream sales and number of reported cases of Lyme disease. What is
the lurking variable?
The Linear Equation
Equation 1: y = 4 + 2x
For every value of x, y has a single value.
x y
-4 -4
-2 0
0 4
2 8
4 12
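The table for Equation 1 can also be produced mechanically. The following is a minimal Python sketch; the function name `line` is an illustrative choice and not part of these notes.

```python
def line(b0, b1, x):
    """Return y = b0 + b1*x for a single x-value."""
    return b0 + b1 * x

# Equation 1: y = 4 + 2x, evaluated at the x-values used in the table
xs = [-4, -2, 0, 2, 4]
table = [(x, line(4, 2, x)) for x in xs]

for x, y in table:
    print(x, y)
```

Each printed pair matches one row of the table, which confirms that every x gives exactly one y.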
Equation 2: y = 2 -3x
Fill in the chart and graph the line.
x y
-2
-1
0
1
2
Every linear equation, y = b0 + b1x, has 2 constant numbers in it: b0 and b1.
In equation 1: y = 4 + 2x
b0 = 4
b1 = 2
In equation 2: y = 2 - 3x,
Find:
b0 = ____
b1 = ____
The value b0 is called the y-intercept.
It is where the line intersects the y-axis. It shows what the value of y is when x = 0.
The value b1 is called the slope.

Slope measures how steep the line is. It determines how much the y value will change when the
x value increases by 1 unit. Look back at the graph and check that these interpretations hold true
for equations 1 and 2.
A positive slope
If b1 > 0, the line slopes up as we move to the right.
A negative slope
If b1 < 0, the line slopes down as we move to the right.
We can graph a linear equation using b0 and b1.
Example: y = 5 + 3x
y-Intercept = ______
Slope = ______
Slope positive or negative?

Graph the line.
Example: y = 2 - 4x
y-Intercept = ______
Slope = ______
Slope positive or negative?
Graph the line.
We can use linear equations to model many systems.
Example: A rental car costs $15 per day plus $2 per mile.
b0 = 15
b1 = 2
Total Cost (y) for a one-day rental in which x miles are driven can be calculated as:
y = 15 + 2x.
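As a rough illustration of how the intercept and slope act in this model, here is a short Python sketch. The names `DAILY_FEE`, `COST_PER_MILE`, and `rental_cost` are hypothetical, and x is taken to be the number of miles driven in one day.

```python
DAILY_FEE = 15      # b0: cost in dollars when no miles are driven
COST_PER_MILE = 2   # b1: added cost in dollars for each additional mile

def rental_cost(miles):
    """Total cost y = 15 + 2x for a one-day rental of `miles` miles."""
    return DAILY_FEE + COST_PER_MILE * miles

cost_0 = rental_cost(0)                              # the y-intercept
cost_50 = rental_cost(50)                            # a 50-mile day
extra_per_mile = rental_cost(51) - rental_cost(50)   # the slope

print(cost_0, cost_50, extra_per_mile)  # 15 115 2
```

Driving one more mile always adds the same $2, which is exactly what the slope b1 measures.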
1. What is the equation for the total cost of buying x gallons of gasoline this week?
2. If it takes you 10 minutes to stretch and 9 minutes to run each mile, how much time does it
take you to complete a workout if the workout consists of stretching and running x miles?
CH 14.2: The Regression Equation
There are several different types of relations between two variables:

A relationship is linear when, plotted on a scatter diagram, the points follow the general pattern
of a line.
A relationship is nonlinear when, plotted on a scatter diagram, the points follow a general
pattern, but it is not a line.
A relationship has no correlation when, plotted on a scatter diagram, the points do not show any
pattern.
Linear Relations
Points that cluster around a line
Positive (the points slant upwards to the right)

Example: Grip Strength and Arm Strength
Negative (the points slant downwards to the right)

Example: Elevation and Temperature
Nonlinear Relationship (Points that have a trend, but not around a line)
Example: Total Length of Fish and Mass of Fish
No Relationship
Points have no relationship
Example: Boys’ Height and Birth Month
Example: Find the relationship between x and y.
x y
1 1
1 2
2 2
4 6
[Figure: scatter plot of the four data points; X and Y axes each run from -3 to 7]
Line A: Y = 0.50 + 1.25x
[Figure: the same scatter plot with Line A drawn through the points]
Line B: Y = -0.25 + 1.50x
[Figure: the same scatter plot with Line B drawn through the points]
Which line fits the points better, A or B?
That depends on how we define a better fit.
Definition: ŷ denotes the y-value predicted by the straight line for a value of x.
At x = 2:
For Line A
ŷ = 0.5 + 1.25 · 2
ŷ = 3
For Line B
ŷ = -0.25 + 1.50 · 2
ŷ = 2.75
Define the error e (residual) to be the error made in using the line to predict the y-value.
e = y - ŷ.
For Line A the error at x = 2 is

e = 2 - 3 = -1
For Line B the error at x = 2 is
e = 2 - 2.75 = -0.75
So we see that at the point x = 2, Line B is a better fit than Line A.
Define
$(x_i, y_i)$ = observation i
For a given line, let $\hat{y}_i$ = the predicted value of point i (the value of the line at $x_i$).
The residual (error) at point i is denoted by $e_i = y_i - \hat{y}_i$.
• The best line minimizes “total error”.
• We choose to define total error as Total Sum of Squared Errors = $\sum e_i^2$
• The line having the smaller sum of squared errors fits the data best.
Line B fits the data better than Line A: its sum of squared errors is 1.25, compared with 1.875 for Line A.
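The comparison of Line A and Line B can be checked by computing the sum of squared errors for each line over the four data points (1, 1), (1, 2), (2, 2), (4, 6). The sketch below is one way to do that in Python; the helper name `sse` is an illustrative choice.

```python
# The four data points from the example
points = [(1, 1), (1, 2), (2, 2), (4, 6)]

def sse(b0, b1, data):
    """Sum of squared errors, where each error is e_i = y_i - (b0 + b1*x_i)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in data)

sse_a = sse(0.50, 1.25, points)    # Line A: Y = 0.50 + 1.25x
sse_b = sse(-0.25, 1.50, points)   # Line B: Y = -0.25 + 1.50x

print(sse_a, sse_b)  # 1.875 1.25
```

Line B has the smaller total, so it is the better fit under the sum-of-squared-errors definition.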
Least-Squares Criterion: The straight line that best fits a set of data points is the one having the
smallest possible sum of squared errors.
Regression line: The straight line that best fits a set of data points according to the least-squares
criterion.
How do we get the equation for the Regression Line?

We use SPSS, a graphing calculator, or other technology. But this is how it’s done:
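Define
$$\bar{x} = \frac{1}{n}\sum x_i, \qquad \bar{y} = \frac{1}{n}\sum y_i$$
$$S_{xx} = \sum (x_i - \bar{x})^2, \qquad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$$
Then the Regression Equation has the form
$\hat{y} = b_0 + b_1 x$ where
$$b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}.$$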
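To make the formulas concrete, the following Python sketch computes the slope as b1 = Sxy / Sxx and the y-intercept as b0 = ȳ - b1·x̄, then applies them to the four points (1, 1), (1, 2), (2, 2), (4, 6) from the earlier example. The function name `least_squares` is an assumption made for illustration.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the regression line y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # y-intercept
    return b0, b1

b0, b1 = least_squares([1, 1, 2, 4], [1, 2, 2, 6])
print(b0, b1)  # -0.25 1.5
```

For these four points the result is Y = -0.25 + 1.50x, which is Line B from the earlier example. That explains why no other line can beat Line B on sum of squared errors.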
Example: Finding the relationship between number of absences and average final grade.
Number of Absences   Average Final Grade
0.0 89.2
1.0 86.4
2.0 83.5
3.0 81.1
4.0 78.2
5.0 73.9
6.0 64.3
7.0 71.8
8.0 65.5
9.0 66.2
Each individual is represented by a point in the scatter diagram:

• The explanatory (x) variable is plotted on the horizontal scale
• The response (y) variable is plotted on the vertical scale
Note the truncated vertical scale. Truncation is more acceptable when the vertical axis does not represent counts.
Scatter plots allow us to see relationships.
A relationship clearly exists between the Number Absences and Average Grade. The scattering
suggests that some of the variation in Grade is not accounted for by Absences.
The points seem to be scattered about a line, so we use technology to find the line of best fit, which is:
y = 88.73 – 2.83x
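The line of best fit reported above can be reproduced from the absences data with the least-squares formulas. This is a plain-Python sketch rather than the SPSS procedure the course uses; the variable names are illustrative.

```python
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n

# Sums of squares and cross-products about the means
s_xx = sum((x - x_bar) ** 2 for x in absences)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

print(round(intercept, 2), round(slope, 2))  # 88.73 -2.83
```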
Regression Equation:
Average Grade = 88.73 – 2.83 Number of Absences
Slope = -2.83
y-Intercept = 88.73
Calculating a Fitted Value:
Graphically:
Algebraically:
Find the grade the model predicts for a student that has 7 absences.
ŷ = 88.73 - 2.83 · 7
ŷ = 68.92
Finding the residual:
Graphically:
Algebraically:
e = y - ŷ
e = 71.8 - 68.92
e = 2.88
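The fitted value and residual for the student with 7 absences (observed Average Final Grade 71.8) can be checked with a few lines of Python. The function name `predict_grade` is hypothetical; it uses the rounded coefficients 88.73 and -2.83 from the regression equation.

```python
def predict_grade(absences):
    """Fitted value from Average Grade = 88.73 - 2.83 * Number of Absences."""
    return 88.73 - 2.83 * absences

observed = 71.8                  # actual grade for the student with 7 absences
fitted = predict_grade(7)        # value on the regression line at x = 7
residual = observed - fitted     # e = y - y_hat

print(round(fitted, 2), round(residual, 2))  # 68.92 2.88
```

The residual is positive, so this student's observed grade lies above the regression line.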
When is it valid to interpret the slope? When the data seems to be linear (scattered about a
line).
Interpretation of slope: For every one-unit increase in the x-value, the model predicts the y-value will
change by b1 (the value of the slope).
Example: What is the slope of the Average Grade regression equation? Can you interpret the
slope? Why or why not. If you can interpret the slope, then do so.

The slope is -2.83.
You can interpret the slope because the points seem to be scattered about that regression line.
Interpretation of slope:
For every additional absence, the model predicts the Average Grade will change by -2.83.
or
For every additional absence, the model predicts the Average Grade will decrease by 2.83.
When is it valid to interpret the y-intercept? A regression model is valid only in the range of
the x-data. The y-intercept occurs when x = 0, so it is only valid to interpret the y-int when x=0
is in the range of the data.
Interpretation of y-intercept: When x is equal to 0, the model predicts that the y-value is equal
to b0 (the value of y-int).
Example: What is the y-int of the Average Grade regression equation? Can you interpret the y-
int? Why or why not. If you can interpret the y-int, then do so.
The y-intercept = 88.73
The y-intercept is interpretable because x = 0 is in the range of the x-data.
Interpretation: When the number of absences is equal to 0, the model predicts that the Average
Grade is equal to 88.73.
Identify and interpret the coefficients of the parents’ height / student’s height model.
[Regression plot: student Heights (vertical axis) versus Mom and Dad (horizontal axis). Fitted line: Y = -4.97942 + 0.526150X, R-Sq = 74.8%]

What is the slope of the Student Heights regression equation? Can you interpret the slope? Why
or why not. If you can interpret the slope, then do so.
What is the y-int of the Students Height regression equation? Can you interpret the y-int? Why or
why not. If you can interpret the y-int, then do so.
WARNINGS FOR THE USE OF REGRESSION
Using the regression equation to make predictions is OK only within the range of the data.
Extrapolation: The use of the regression equation to make predictions outside the range of the
data.
Extrapolation is NOT VALID.
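One way to enforce this warning in practice is to have the prediction routine refuse x-values outside the observed range. The sketch below assumes the absences model (Average Grade = 88.73 - 2.83 × Number of Absences, with data for 0 to 9 absences); the names `X_MIN`, `X_MAX`, and `predict_grade` are illustrative.

```python
X_MIN, X_MAX = 0, 9   # smallest and largest Number of Absences in the data

def predict_grade(absences):
    """Predict Average Grade, but only inside the range of the x-data."""
    if not (X_MIN <= absences <= X_MAX):
        raise ValueError("extrapolation: x is outside the range of the data")
    return 88.73 - 2.83 * absences

print(round(predict_grade(3), 2))  # 80.24

try:
    predict_grade(40)
except ValueError as err:
    print(err)  # extrapolation: x is outside the range of the data
```

Without the guard, the equation would give 88.73 - 2.83 · 40 = -24.47 for a student with 40 absences. A negative grade is impossible, which shows why predictions outside the data range cannot be trusted.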
Outliers: An outlier is an observation that lies outside the overall pattern of data.
Influential Observation: An influential observation is a data point whose removal causes the
regression equation to change considerably.
If a point is influential you can either
1. Collect more data to fill in the gap between the influential point and the rest of the data
2. Remove the point from the analysis and limit the scope of the analysis
Tiger Woods joins the regulars at the country club: (An influential observation analysis)
Data:
Club Head Speed (mph)   Distance (yards)
100                     257
102                     264
103                     274
101                     266
105                     277
100                     263
99                      258
105                     275
120                     305
[Scatter plot: Distance (yards) versus Club Head Speed (mph)]
With all nine points:
Distance = 39.46 + 2.229 × Club Speed
r = 0.975
Outlier removed (axes forced to remain constant):
Distance = -55.80 + 3.17 × Club Speed
Put both lines on one graph.
Conclusion: The Tiger Woods point is an influential observation because removing it caused a considerable
shift in the regression line.
Final Model:
Distance = -55.80 + 3.17 × Club Speed
Limit the scope of the analysis: Valid for club speeds less than 115 mph
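The effect of the Tiger Woods point can be reproduced by fitting the least-squares line twice, once with all nine golfers and once without the (120 mph, 305 yards) observation. This is a plain-Python sketch, not the software output shown in class; the helper name `fit` is an assumption.

```python
def fit(xs, ys):
    """Least-squares fit; returns (b0, b1) for y = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

speed = [100, 102, 103, 101, 105, 100, 99, 105, 120]           # mph
distance = [257, 264, 274, 266, 277, 263, 258, 275, 305]       # yards

b0_all, b1_all = fit(speed, distance)             # all nine points
b0_cut, b1_cut = fit(speed[:-1], distance[:-1])   # last point (120, 305) dropped

print(round(b0_all, 2), round(b1_all, 3))  # 39.46 2.229
print(round(b0_cut, 2), round(b1_cut, 2))  # -55.8 3.17
```

Dropping a single observation moves the slope from about 2.23 to about 3.17 yards per mph, which is the kind of considerable change that marks an influential observation.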
CH 14.3: The Coefficient of Determination
How reliable is the regression line?
Coefficient of Determination: r²
Consider the amount of variability in the y-values. The regression model explains some of this
variability. The Coefficient of Determination, r², is the proportion of this variability that the
regression model explains. Software often reports it as a percentage (for example, R-Sq = 74.8%).
0 ≤ r² ≤ 1
• When r² is close to 1, the model is explaining most of the variability in the y-values.
• When r² is close to 0, the model is explaining almost none of the variability in the y-values.
• You should use technology (a calculator or software) to compute r².
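For readers who want to see what the technology is doing, here is a minimal sketch that computes r² for the absences data. It assumes the standard identity r² = 1 - SSE/SST, which is not spelled out in these notes: SSE is the sum of squared errors about the regression line and SST is the total variation of the y-values about their mean.

```python
absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n
s_xx = sum((x - x_bar) ** 2 for x in absences)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Variation left unexplained by the line, and total variation in y
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(absences, grades))
sst = sum((y - y_bar) ** 2 for y in grades)

r_squared = 1 - sse / sst
print(round(r_squared, 3))  # 0.898
```

About 89.8% of the variability in Average Final Grade is explained by Number of Absences.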
Some examples of r²:
[Regression plot: Heights versus Mom and Dad. Y = -4.97942 + 0.526150X, R-Sq = 74.8%]
[Regression plot: heights versus Mom height. Y = 11.3885 + 0.870252X, R-Sq = 37.6%]
[Regression plot: heights versus Dad height. Y = 33.0924 + 0.490966X, R-Sq = 20.7%]
[Regression plot: TAX EFF versus ENERGY. Y = 112.676 - 5.26403X, R-Sq = 95.0%]
[Regression plot: PRICE versus AGE. Y = 371.602 - 27.9029X, R-Sq = 93.7%]
Plant weight and hydrocarbons
[Regression plot: EMISSION versus WEIGHT. Y = 3.52369 + 0.162848X, R-Sq = 11.0%]
CH 14.4: Linear Correlation
The linear correlation coefficient, r, is a measure of the strength of the linear relation between two
quantitative variables.
Some properties of the linear correlation coefficient:

• r is a unitless measure (so that r would be the same for a data set whether x and y
are measured in feet, inches, meters, or fathoms)
• r is always between -1 and +1
• Positive values of r correspond to positive relations
• Negative values of r correspond to negative relations
Some more properties of the linear correlation coefficient:

• The closer r is to +1, the stronger the positive relation. When r = +1, there is a
perfect positive relation; the points fall exactly on a straight line with a positive
slope.
• The closer r is to -1, the stronger the negative relation. When r = -1, there is a
perfect negative relation; the points fall exactly on a straight line with a negative
slope.
• The closer r is to 0, the less of a linear relation (either positive or negative).
[Figure: six scatterplots]
Top row: Maximum positive correlation (r = 1.0); Strong positive correlation (r = 0.80); Zero correlation (r = 0)
Bottom row: Minimum negative correlation (r = -1.0); Moderate negative correlation (r = -0.43); Strong correlation with outlier (r = 0.71)
Several points are evident from the scatterplots.
• When the slope of the line in the plot is negative, the correlation is negative; when the slope is
positive, the correlation is positive.
• The strongest correlations (r = 1.0 and r = -1.0) occur when data points fall exactly on a
straight line.
• The correlation becomes weaker as the data points become more scattered.
• If the data points fall in a random pattern, the correlation is close to zero.
• Correlation is affected by outliers. Compare the first scatterplot with the last scatterplot.
The single outlier in the last plot greatly reduces the correlation (from 1.00 to 0.71).
We won’t calculate correlation by hand. We can do it using SPSS.
We will need to know how to interpret it.
SPSS Output:
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .947a   .898       .885                3.0673
a. Predictors: (Constant), Number Absences
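SPSS reports R = .947 without a sign. Because the slope of the regression line (-2.83) is negative, the correlation coefficient is r = -0.947.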
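Although the course relies on SPSS for this, the value can be checked directly. The sketch below assumes the standard formula r = Sxy / √(Sxx · Syy), which these notes do not state, and applies it to the absences data; the sign of r comes out automatically.

```python
import math

absences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [89.2, 86.4, 83.5, 81.1, 78.2, 73.9, 64.3, 71.8, 65.5, 66.2]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n

s_xx = sum((x - x_bar) ** 2 for x in absences)
s_yy = sum((y - y_bar) ** 2 for y in grades)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3))       # -0.947
print(round(r ** 2, 3))  # 0.898
```

The magnitude agrees with the R column of the Model Summary, and squaring r gives the R Square value of .898.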
Summary of Correlation
• Correlation is not causation!
• Just because two variables are correlated does not mean that one causes the other to
change.
o Example: There is a strong correlation between shoe sizes and vocabulary sizes
for grade school children.
▪ Clearly larger shoe sizes do not cause larger vocabularies
▪ Clearly larger vocabularies do not cause larger shoe sizes (lurking
variable?)
• Correlation between two variables can be described with both visual and numeric
summaries
o Visual summaries: Scatter plots
o Numerical summaries: correlation coefficient
• Care should be taken in the interpretation of linear correlation (nonlinearity and
causation)