U6 Deck1h PDF

Announcements
Unit 6: Introduction to linear regression ▶ MT 2 grades have been posted today!
1. Introduction to regression The CDC monitors the physical activity level of Americans. A recent survey on a
random sample of 23,129 Americans yielded a 95% confidence interval of 61.1%
to 62.9% for the proportion of Americans who walk for at least 10 minutes per
day. Which is the most accurate statement?
STA 104 - Summer 2017

A. 95% of random samples of 23,129 Americans will yield confidence
intervals between 61.1% and 62.9%.
Duke University, Department of Statistical Science
B. This interval does not support the claim that less than 50% of Americans
walk at least 10 minutes per day.
C. We are 95% confident that each American walks for at least 10 minutes
per day on 61.1% to 62.9% of the days.
D. Between 61.1% and 62.9% of random samples of 23,129 Americans are

expected to yield confidence intervals that contain the true proportion of
Americans who walk for at least 10 minutes per day.
E. 95% of the time the true proportion of Americans who walk for at least 10
Prof. van den Boom Slides posted at minutes per day is between 61.1% to 62.9%.
http://www2.stat.duke.edu/courses/Summer17/sta104.001-1/
For post-hoc tests of the results of an ANOVA we use a corrected

alpha or significance level. If we want an overall type 1 error rate of 5%, what 1
should the alpha be for the individual pairwise tests if the number of groups
equals 6?
Choose the closest option.
Modeling numerical variables A. 0.16667 Guessing the correlation

B. 0.00833
C. 0.00333
Clicker questionD. 0.05

E. 0.3
Which of the following is the best guess for the correlation between
▶ So far we have worked with single numerical and categorical annual murders per million and percentage living in poverty?
variables, and explored relationships between numerical and
categorical, and two categorical variables.
●
40
▶ In this unit we will learn to quantify the relationship between two
(a) -1.52
● ●
35
annual murders per million
numerical variables, as well as modeling numerical response
30
variables using a numerical or categorical explanatory variable. (b) -0.63
●
●
● ●
25
●
▶ In the next unit we’ll learn to model numerical variables using (c) -0.12
●
●
20
●
many explanatory variables at once. (d) 0.02

●
15
●●
●
●
●
10
●
(e) 0.84
●
5
14 16 18 20 22 24 26
% in poverty
2 3
Guessing the correlation Assessing the correlation
Clicker question
Clicker question
Which of the following is has the strongest correlation, i.e.
Which of the following is the best guess for the correlation between correlation coefficient closest to +1 or -1?
annual murders per million and population size?
●●
● ●●
●
●
● ●
●
● ●● ●●
● ●● ●
● ●●●●●● ●● ●
●
●●
● ● ●
● ●
40
● ●●
● ● ● ●
●● ●
● ●● ● ●●●● ●
●●● ●● ●●
● ● ●
(a) -0.97
● ● ● ●
●
35
●
● ● ●

●●● ●
●●
●
●● ●●● ●●
● ●
●●
●● ●● ●●● ●●
●
● ●●
●●● ● ● ● ●
●
● ●●
30
● ●● ● ●
(b) -0.61
● ●
● ●●
●●●
● ● ● ● ●● ●
●●
●● ●●
● ●●
● ●● ●●
● ● ● ●
● ●
●● ● ● ● ●●●● ● ●●●
● ● ●●
●●●
● ●
●●●● ●● ●●●
● ●
●●●●● ●
● ●
25
● ●●●
●●●
● ●●● ●
(c) -0.06 (a) (b)

●
●
20
●
(d) 0.55
●
15
● ● ● ● ●● ●
● ●
● ● ●
● ● ● ●
● ● ● ● ● ● ●
● ●● ●
10 ●
● ● ●● ● ●● ● ● ● ●
(e) 0.97
● ●● ● ● ●
● ● ● ●● ● ●●●●●● ●● ●
●● ● ● ● ●● ●● ● ● ●
●
● ●●● ●●
● ● ● ● ● ● ●
● ● ● ●
● ● ● ●● ● ●●●
● ●●●
5
● ●
● ● ● ● ●● ● ● ● ● ●●●
●● ●● ● ●● ● ● ●● ●
●●● ● ● ● ● ●●● ● ● ● ●● ●
● ● ● ● ● ● ● ●
●● ● ● ● ●●●
● ● ●● ●●● ●●● ●●
2e+06 4e+06 6e+06 8e+06 ● ● ●
● ● ● ●● ●
● ●
●●
● ● ● ●●
● ● ● ●
● ●●
● ● ● ●●●
● ●
population
● ● ●
(c) (d)
4 5
Play the game! Spurious correlations
Remember: correlation does not always imply causation!

http://www.tylervigen.com/
To sharpen your correlation guessing abilities:

http://guessthecorrelation.com/
6 7
(2) Least squares line minimizes squared residuals (3) Interpreting the last squares line
▶ Residuals are the leftovers from the model fit, and calculated as
the difference between the observed and predicted y:
ei = yi − ŷi ▶ Slope: For each unit increase in x, y is expected to be
▶ The least squares line minimizes squared residuals: higher/lower on average by the slope.
– Population data: ŷ = β0 + β1 x
– Sample data: ŷ = b0 + b1 x sy
b1 = R
sx
●
40
● ●
▶ Intercept: When x = 0, y is expected to equal the intercept.
35
b0 = ȳ − b1 x̄
30
●
●
● ●
25
●
●
20
– The calculation of the intercept uses the fact the a regression line
●
15
●●
●
●
always passes through (x̄, ȳ).

●
10
●
●
●
5
14 16 18 20 22 24 26
% in poverty
8 9
Why does the regression line always pass through (x̄, ȳ)?
▶ If there is no relationship between x and y (b1 = 0), the best

guess for ŷ for any value of x is ȳ.
▶ Even when there is a relationship between x and y (b1 ̸= 0), the
best guess for ŷ when x = x̄ is still ȳ.
Application exercise: 6.1 Linear model
See course website for details
10
1.5
8
0.5
6
2
(x, y) (x, y)
● ● ● ●
y2
y3
●
4
● ● ●
y
(x, y)
−0.5
●
0
●
0
−2
−1.5
−2
−1.0 0.0 0.5 1.0 1.5 2.0 −1.0 0.0 0.5 1.0 1.5 2.0 −1.0 0.0 0.5 1.0 1.5 2.0
x x x
10 11
Clicker question
Clicker question
Suppose you want to predict annual murder count (per million) for a
What is the interpretation of the slope?
series of districts that were not included in the dataset. For which of
the following districts would you be most comfortable with your
(a) Each additional percentage in those living in poverty increases prediction?
number of annual murders per million by 2.56.
(b) For each percentage increase in those living in poverty, the
number of annual murders per million is expected to be higher A district where % in ●
40
poverty =
●
by 2.56 on average.
●
35
30
(c) For each percentage increase in those living in poverty, the (a) 5% ● ●
●
●
25
number of annual murders per million is expected to be lower by
●
(b) 15% ●
●
20
29.91 on average. (c) 20%
●
●
15
●●
(d) For each percentage increase annual murders per million, the
●
●
(d) 26% ●
10
●
percentage of those living in poverty is expected to be higher by

●
(e) 40%
●
5
2.56 on average. 14 16 18 20 22 24 26
% in poverty
12 13
A note about the intercept Calculating predicted values
\ = −29.91 + 2.56 poverty

By hand: murder
The predicted number of murders per million per year for a county
Sometimes the intercept might be an extrapolation: useful for with 20% poverty rate is:
adjusting the height of the line, but meaningless in the context of the
data. \ = −29.91 + 2.56 × 20 = 21.29
murder
In R:
80
# load data
40
murder <- read.csv("https://stat.duke.edu/~mc301/data/murder.csv")

# fit model
0
m_mur_pov <- lm(annual_murders_per_mil ~ perc_pov, data = murder)

● # create new data
−40
newdata <- data.frame(perc_pov = 20)

0 10 20 30 40 50 60
# predict
% in poverty predict(m_mur_pov, newdata)
1
21.28663
14 15
Summary of main ideas
1. Correlation coefficient describes the strength and direction of

the linear association between two numerical variables
2. Least squares line minimizes squared residuals
3. Interpreting the least squares line
4. Predict, but don’t extrapolate
16

U6 Deck1h PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

U6 Deck1h PDF

Uploaded by

Copyright:

Available Formats

Announcements

Unit 6: Introduction to linear regression ▶ MT 2 grades have been posted today!

STA 104 - Summer 2017

D. Between 61.1% and 62.9% of random samples of 23,129 Americans are

For post-hoc tests of the results of an ANOVA we use a corrected

Choose the closest option.

Modeling numerical variables A. 0.16667 Guessing the correlation

Clicker questionD. 0.05

many explanatory variables at once. (d) 0.02

annual murders per million

(c) -0.06 (a) (b)

Play the game! Spurious correlations

Remember: correlation does not always imply causation!

To sharpen your correlation guessing abilities:

always passes through (x̄, ȳ).

▶ If there is no relationship between x and y (b1 = 0), the best

percentage of those living in poverty is expected to be higher by

A note about the intercept Calculating predicted values

\ = −29.91 + 2.56 poverty

murder <- read.csv("https://stat.duke.edu/~mc301/data/murder.csv")

m_mur_pov <- lm(annual_murders_per_mil ~ perc_pov, data = murder)

newdata <- data.frame(perc_pov = 20)

1. Correlation coefﬁcient describes the strength and direction of

You might also like