
Chapter 14

Describing Relationships:
Scatterplots and Correlation


Statistical versus Deterministic Relationships

• Distance versus Speed (when travel time is constant).
• Income (in millions of dollars) versus total assets of banks (in billions of dollars).

Distance versus Speed

• Distance = Speed × Time
• Suppose time = 1.5 hours
• Each subject drives a fixed speed for the 1.5 hrs
  – speed chosen for each subject varies from 10 mph to 50 mph
• Distance does not vary for those who drive the same fixed speed
• Deterministic relationship

Income versus Assets

• Income = a + b×Assets
• Assets vary from 3.4 billion to 49 billion
• Income varies from bank to bank, even among those with similar assets
• Statistical relationship (a sketch contrasting the two kinds of relationship follows this list)
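
A minimal Python sketch of the contrast. The 1.5-hour trip, the 10–50 mph speeds, and the 3.4–49 billion asset range come from the slides; the intercept, slope, and noise level for bank income are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic: distance is fully determined once speed is known (time fixed at 1.5 hours).
speeds = np.array([10, 20, 30, 40, 50])   # mph, one per subject
distances = speeds * 1.5                  # miles; same speed -> same distance, no scatter

# Statistical: income tracks assets on average, but individual banks scatter around the line.
# The intercept, slope, and noise level below are made up for illustration only.
assets = rng.uniform(3.4, 49, size=30)                    # billions of dollars
income = 2.0 + 0.8 * assets + rng.normal(0, 5, size=30)   # millions of dollars, with bank-to-bank noise

print(distances)     # identical speeds give identical distances
print(income[:5])    # banks with similar assets still differ in income
```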

• A scatterplot shows a linear relationship if the points lie, more or less, along a straight line.
• Example: heights and weights of 165 students in a college statistics course (a simulated stand-in is sketched below):
Positive association: High values of one variable tend to occur together with high values of the other variable.

Negative association: High values of one variable tend to occur together with low values of the other variable.

No relationship: x and y vary independently. Knowing x tells you nothing about y.

One way to remember this: the equation of the flat line in the plot is y = 5; x is not involved.

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

With a strong relationship, you can get a pretty good estimate of y if you know x. With a weak relationship, for any x you might get a wide range of y values.

Correlation

• Measures the strength and direction of a linear relationship between two quantitative variables

• Negative correlation
  – X↑ Y↓
  – X↓ Y↑
• X, Y behave “oppositely”

• Positive correlation
  – X↑ Y↑
  – X↓ Y↓
• X, Y behave “similarly”

r

• The Pearson correlation coefficient (r) describes the direction and strength of a linear relationship between two variables (a short computation appears below).

  −1 ≤ r ≤ −0.8    strong negative correlation
  −0.8 < r < −0.2  weak to moderate negative correlation
  −0.2 ≤ r ≤ 0.2   negligible correlation
  0.2 < r < 0.8    weak to moderate positive correlation
  0.8 ≤ r ≤ 1      strong positive correlation
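
A small sketch of how r can be computed, assuming NumPy is available; the data points are arbitrary toy values.

```python
import numpy as np

def pearson_r(x, y):
    """Average product of standardized scores: r = sum(z_x * z_y) / (n - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (len(x) - 1)

x = [2, 4, 6, 8, 10]
y = [1.8, 4.1, 5.9, 8.3, 9.9]
print(pearson_r(x, y))            # hand-rolled value
print(np.corrcoef(x, y)[0, 1])    # NumPy's built-in gives the same answer
```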

Problems with Correlations

• Outliers can inflate or deflate correlations (a small demonstration follows this list)
• Groups combined inappropriately may mask relationships (a third variable)
  – groups may have different relationships when separated
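
A quick illustration, with made-up data, of how a single outlier can inflate r.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = rng.normal(size=30)                      # unrelated variables: r should be near 0
print(round(np.corrcoef(x, y)[0, 1], 2))

# Add a single extreme point far from the rest.
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
print(round(np.corrcoef(x_out, y_out)[0, 1], 2))   # one outlier can manufacture a large r
```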

Outliers

Not an outlier: The upper right-hand point here is not an outlier of the relationship; it is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol.

Outlier: This point is not in line with the others, so it is an outlier of the relationship.

What does “statistical significance” mean?

• Dictionary definition (sense 5, Statistics): Of or relating to observations or occurrences that are too closely correlated to be attributed to chance and therefore indicate a systematic relationship.

Strength and Statistical Significance

• A strong relationship seen in the sample may indicate a strong relationship in the population.
• However, the sample may exhibit a strong relationship simply by chance, even when the relationship in the population is weak or zero.
• The observed relationship is considered statistically significant if it is stronger than a large proportion of the relationships we would expect to see just by chance.

Warnings about Statistical Significance

• “Statistical significance” does not imply the relationship is strong enough to be considered “practically important.”
• Even weak relationships may be labeled statistically significant if the sample size is very large.
• Even very strong relationships may not be labeled statistically significant if the sample size is very small (see the sketch after this list).
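
A rough illustration of both warnings using SciPy's pearsonr; the sample sizes, slopes, and noise levels are arbitrary choices for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Weak relationship, very large sample: a tiny r can still come out "statistically significant".
x_big = rng.normal(size=5000)
y_big = 0.05 * x_big + rng.normal(size=5000)
r_big, p_big = stats.pearsonr(x_big, y_big)

# Strong relationship, very small sample: a large r may fail to reach significance.
x_small = rng.normal(size=5)
y_small = 0.9 * x_small + rng.normal(scale=0.5, size=5)
r_small, p_small = stats.pearsonr(x_small, y_small)

print(f"n=5000: r = {r_big:.2f}, p = {p_big:.4f}")
print(f"n=5:    r = {r_small:.2f}, p = {p_small:.4f}")
```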

Chapter 15

Describing Relationships:
Regression, Prediction, and
Causation


Straight lines (a quick review)

• y = a + bx
• a = y-intercept
• b = slope
• Slope = Δy/Δx = rise/run
  – e.g. for y = 3 − 2x the slope is −2: y decreases 2 units for every one-unit increase in x

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible.

Distances between the points and the line are squared so all are positive values. This is done so that distances can be properly added.

The least-squares regression line can be shown to have this equation: ŷ = a + bx

• ŷ (“y hat”) is the predicted y value
• b is the slope
• a is the y-intercept
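
A minimal sketch of the standard least-squares formulas (slope b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept a = ȳ − b·x̄) on toy data, assuming NumPy.

```python
import numpy as np

# Toy data; any (x, y) pairs would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = x.mean(), y.mean()
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)   # slope
a = ybar - b * xbar                                             # intercept
y_hat = a + b * x                                               # predicted values on the fitted line

print(a, b)
print(np.polyfit(x, y, 1))        # NumPy's least-squares fit: [slope, intercept], for comparison
print(np.sum((y - y_hat) ** 2))   # the sum of squared vertical distances the line minimizes
```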

Making predictions

The equation of the least-squares regression line allows you to predict y for any x within the range studied. This is called interpolation.

ŷ = 0.0144x + 0.0008

Nobody in the study drank 6.5 beers, but by finding the value of ŷ from the regression line for x = 6.5, we would expect a blood alcohol content of about 0.094 mg/ml:

ŷ = 0.0144 × 6.5 + 0.0008 = 0.0936 + 0.0008 = 0.0944 mg/ml
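
The same interpolation as a tiny Python snippet, using the fitted coefficients quoted above.

```python
def predict_bac(beers):
    """Predicted blood alcohol content from the fitted line y_hat = 0.0144*x + 0.0008."""
    return 0.0144 * beers + 0.0008

print(round(predict_bac(6.5), 4))   # 0.0944, matching the worked calculation above
```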

Coefficient of Determination (R²)

• Measures the usefulness of the regression prediction
• R² (or r², the square of the correlation) measures the percentage of the variation in the values of the response variable (y) that is explained by the regression line
• r = 1: R² = 1: the regression line explains all (100%) of the variation in y
• r = 0.7: R² = 0.49: the regression line explains almost half (about 50%) of the variation in y

• r = −1, r² = 1: changes in x explain 100% of the variation in y; y can be entirely predicted for any given value of x.
• r = 0, r² = 0: changes in x explain 0% of the variation in y; the value y takes is entirely independent of what value x takes.
• r = 0.87, r² = 0.76: here the change in x explains only 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows in the figure) must be explained by something other than x.

Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line.

This can be a very stupid thing to do, as seen here.

[Figure: an extrapolated regression line; axis labeled “Height in Inches”]
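
One possible guard against accidental extrapolation: warn when a new x falls outside the range used to fit the line. The beer counts below are hypothetical, since the slides do not list the study's actual x values.

```python
import numpy as np

def predict(x_new, a, b, x_fit):
    """Return a + b*x_new, flagging predictions outside the fitted x range (extrapolation)."""
    lo, hi = float(np.min(x_fit)), float(np.max(x_fit))
    if not (lo <= x_new <= hi):
        print(f"warning: x = {x_new} is outside [{lo}, {hi}]; this is extrapolation")
    return a + b * x_new

# Hypothetical range of beers actually observed in the study.
x_fit = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(predict(6.5, 0.0008, 0.0144, x_fit))   # inside the range: interpolation
print(predict(20, 0.0008, 0.0144, x_fit))    # far outside: the prediction is not trustworthy
```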

Correlation Does Not Imply Causation

Even very strong correlations may not correspond to a real causal relationship.

Evidence of Causation

• A properly conducted experiment establishes the connection

• Other considerations:
  – A reasonable explanation for a cause and effect exists
  – The connection happens in repeated trials
  – The connection happens under varying conditions
  – Potential confounding factors are ruled out
  – The alleged cause precedes the effect in time

Reasons for relationships between variables

1. The explanatory variable is the direct cause of the response variable
2. The response variable is causing a change in the explanatory variable
3. The explanatory variable is contributing to, but is not the sole cause of, change in the response variable
4. Confounders may exist
5. Both variables result from a common cause
6. Both variables are changing over time
7. The association is coincidence

Association and causation
It appears that lung cancer is associated with smoking.
How do we know that both of these variables are not being affected by an
unobserved third (lurking) variable?
For instance, what if there is a genetic predisposition that causes people to
both get lung cancer and become addicted to smoking, but the smoking itself
doesn’t CAUSE lung cancer?

We can evaluate the association using the following criteria:

1) The association is strong.
2) The association is consistent.
3) Higher doses are associated with stronger responses.
4) The alleged cause precedes the effect.
5) The alleged cause is plausible.

Ch 14 & 15 concepts

• Statistical vs. Deterministic Relationships
• Statistical Significance
• Correlation Coefficient
• Problems with Correlations
• Least-Squares (LS) Regression Equation
• R²
• Correlation does not imply causation!
