
Chapter 4

• Describing Bivariate Numerical Data
• Honors 281
Bivariate Data
• Two Numeric Variables
• Goal: Explore relationship between the two variables.
• Correlation
• Linear Regression
Learning Goals
• Identify patterns in scatterplots
• Describe relationships with scatterplots
• Calculate and interpret the correlation coefficient (r)
• Classify variables in simple linear regression (SLR)
• Find the equation for SLR and interpret its parts
• Use SLR
• Analyze SLR


Correlation Coefficient r
Pearson’s sample correlation coefficient, denoted by r, measures the strength and
direction of a linear relationship between two numerical variables.
Properties of r
1) r is positive when the linear relationship is positive and negative
when the linear relationship is negative.

2) r is always between -1 and 1.

Strong relationship when r is between -1 and -0.8 or between 0.8 and 1
Moderate relationship when r is between -0.8 and -0.5 or between 0.5 and 0.8
Weak relationship when r is between -0.5 and 0.5
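The strength cutoffs can be written as a small helper. This is a minimal Python sketch, not part of the course materials (which use Excel and R). The function names `describe_strength` and `describe_direction` are made up for illustration. Treating exactly 0.8 and 0.5 as belonging to the stronger category is an assumption, because the slide does not say which side the boundaries fall on.

```python
def describe_strength(r):
    """Label r as strong, moderate, or weak using the slide's cutoffs."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)  # strength ignores the sign
    if size >= 0.8:  # assumption: 0.8 itself counts as strong
        return "strong"
    if size >= 0.5:  # assumption: 0.5 itself counts as moderate
        return "moderate"
    return "weak"


def describe_direction(r):
    """Report the direction of the linear relationship from the sign of r."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"


for value in (0.95, -0.6, 0.2):
    print(value, describe_strength(value), describe_direction(value))
```

The labels are rough guidelines only; a scatterplot should still be checked to confirm that the relationship is linear before r is reported.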
More properties of r
3) r = 1 occurs when all of the points lie on a straight line going upward.
   r = -1 occurs when all of the points lie on a straight line going downward.

4) r is a measure of the extent to which x and y are linearly related.

5) The value of r does not depend on the unit of measurement for either variable.
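Properties 3 and 5 can be checked numerically. The sketch below is in Python (an assumption; the class uses Excel and R), and the height and weight values are hypothetical numbers chosen only for illustration. The helper `pearson_r` is also a made-up name; it averages the products of z-scores using n - 1.

```python
import math
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# Property 3: points exactly on a line give r = 1 (upward) or r = -1 (downward).
r_up = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
r_down = pearson_r([1, 2, 3, 4], [9, 7, 5, 3])

# Property 5: changing units does not change r.
heights_in = [60, 63, 66, 69, 72]        # hypothetical heights in inches
weights_lb = [115, 130, 150, 160, 185]   # hypothetical weights in pounds
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)

print(r_up, r_down)
print(math.isclose(r_imperial, r_metric))
```

The unit change rescales both the deviations from the mean and the standard deviation by the same factor, so every z-score, and therefore r, stays the same.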
Let’s test your understanding
• In the following graphs, tell me
• Strength of Relationship
• Direction of Relationship
• If linear, is it appropriate to report correlation or do a linear regression?
[Figure: scatterplot of price by carat]
[Figure: scatterplot of diamond length by width (y against x)]
[Figure: scatterplot of hwy by displ]
[Figure: scatterplot of hwy by cty]
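Calculating r
• $r = \dfrac{\sum z_x z_y}{n - 1} = \dfrac{\sum \left(\dfrac{x - \bar{x}}{s_x}\right)\left(\dfrac{y - \bar{y}}{s_y}\right)}{n - 1}$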
Try Calculating the Correlation Coefficient
How much should a healthy pony weigh? Let x be the age of the pony
(in months), and let y be the average weight of the pony (in kilograms).

x 3 6 12 18 24
y 60 95 140 170 185
Paste into first cell
(Columns A to E hold the data and z-scores; column G holds the labels and column H the summary formulas.)

x    y     ZX                 ZY                 Zxy        (label)         (formula)
                                                            Correlation     =CORREL(A2:A6,B2:B6)
3    60    =(A2-$H$3)/$H$2    =(B2-$H$5)/$H$4    =C2*D2     std_X           =STDEV.S(A2:A6)
6    95    =(A3-$H$3)/$H$2    =(B3-$H$5)/$H$4    =C3*D3     mean_X          =AVERAGE(A2:A6)
12   140   =(A4-$H$3)/$H$2    =(B4-$H$5)/$H$4    =C4*D4     std_Y           =STDEV.S(B2:B6)
18   170   =(A5-$H$3)/$H$2    =(B5-$H$5)/$H$4    =C5*D5     mean_Y          =AVERAGE(B2:B6)
24   185   =(A6-$H$3)/$H$2    =(B6-$H$5)/$H$4    =C6*D6     r=Correlation   =SUM(E2:E6)/(5-1)


x    y     ZX         ZY         Zxy        (label)         (value)
                                            Correlation     0.972246
3    60    -1.11749   -1.34404   1.501953   std_X           8.590693
6    95    -0.76827   -0.67202   0.516296   mean_X          12.6
12   140   -0.06984   0.192006   -0.01341   std_Y           52.08167
18   170   0.628587   0.768025   0.48277    mean_Y          130
24   185   1.327018   1.056034   1.401375   r=Correlation   0.972246
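The same z-score calculation can be reproduced outside a spreadsheet. The following is a minimal Python sketch (Python is an assumption here; the class materials use Excel and R) that mirrors the spreadsheet columns for the pony data. `stdev` from the standard library is the sample standard deviation, which plays the role of `STDEV.S`.

```python
from statistics import mean, stdev

x = [3, 6, 12, 18, 24]        # age of the pony in months
y = [60, 95, 140, 170, 185]   # average weight in kilograms
n = len(x)

mean_x, mean_y = mean(x), mean(y)
std_x, std_y = stdev(x), stdev(y)   # sample standard deviations

# One z-score per observation, then the product of each pair.
z_x = [(xi - mean_x) / std_x for xi in x]
z_y = [(yi - mean_y) / std_y for yi in y]
z_xy = [zx * zy for zx, zy in zip(z_x, z_y)]

# r is the sum of the products divided by n - 1.
r = sum(z_xy) / (n - 1)

for row in zip(x, y, z_x, z_y, z_xy):
    print("{:>3} {:>4} {:>9.5f} {:>9.5f} {:>9.5f}".format(*row))
print("r =", round(r, 6))
```

The printed r should agree with the spreadsheet value of 0.972246 up to rounding.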


Fitting a Linear Regression
Idea
• When there is a linear relationship between two variables, you can
use information about one variable to learn about the value of the
second variable.
• The letter y is used to denote the variable you would like to predict,
and this variable is called the response variable (dependent variable).
• The other variable, denoted by x, is the predictor variable
(independent or explanatory variable).
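Least Squares Line
• The equation of a line is y = a + bx, where b = slope and a = y-intercept.

The least squares regression line is the line that minimizes the sum of squared vertical deviations between the observed points and the line.
The slope of the least squares regression line is:
$b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

A calculating formula for the slope that is easier to use by hand is:
$b = \dfrac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$

The y-intercept is $a = \bar{y} - b\bar{x}$
Least Squares Equation of a Line
• $\hat{y} = a + bx$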
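A short numerical check of the least squares slope and intercept, applied to the pony data from the correlation example. This is a Python sketch under stated assumptions (the class uses Excel and R). It uses the standard textbook formulas: the slope is the sum of products of deviations divided by the sum of squared x deviations, the calculating form uses raw sums, and the intercept is the mean of y minus the slope times the mean of x.

```python
x = [3, 6, 12, 18, 24]        # age in months (predictor)
y = [60, 95, 140, 170, 185]   # average weight in kg (response)
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Definition formula: deviations from the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b_def = s_xy / s_xx

# Calculating formula: raw sums only.
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
b_calc = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)

# Intercept: the line passes through the point of means.
a = y_bar - b_def * x_bar

print(f"slope (definition)  = {b_def:.4f}")
print(f"slope (calculating) = {b_calc:.4f}")
print(f"y-hat = {a:.4f} + {b_def:.4f}x")
```

Both slope formulas are algebraically equivalent, so the two printed slopes should match; the calculating form simply avoids computing each deviation by hand.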
Return to Ponies!
x 3 6 12 18 24
y 60 95 140 170 185

Recall that these data had a correlation coefficient of r = 0.972246.

Find the Regression line! Use Excel or do it by hand.


[Figure: "Months by Weight" scatterplot of Pony Weight (y-axis) against Months (x-axis), with data labels 60, 95, 140, 170, 185 and fitted trendline f(x) = 5.89x + 55.73, R² = 0.95]
Residuals:
      1       2       3       4       5
-13.415   3.902  13.537   8.171 -12.195

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  55.7317    12.0856   4.611  0.01918 *
x             5.8943     0.8189   7.198  0.00553 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.07 on 3 degrees of freedom
Multiple R-squared:  0.9453,  Adjusted R-squared:  0.927
F-statistic: 51.81 on 1 and 3 DF,  p-value: 0.005527
Regression Line:
• $\hat{y} = 5.8943x + 55.7317$

$\widehat{\text{Weight}} = 5.8943 \cdot \text{Time} + 55.7317$
Assessing a Line of Best Fit
Residuals
• The residuals are the n quantities:

$y_1 - \hat{y}_1$ = 1st residual

$y_2 - \hat{y}_2$ = 2nd residual

$\vdots$

$y_n - \hat{y}_n$ = nth residual

Each residual is the difference between an observed y value and the corresponding predicted y value.
Ideal Pattern is No Pattern for Residuals
Ponies Again Residuals
• Fit each of the points from the pony example:
• 73.41463 91.09756 126.46341 161.82927 197.19512
• Find the difference from the observed:
• -13.414634 3.902439 13.536585 8.170732 -12.195122
• Is this a good fit? Why?

[Figure: residual plot of y - fitted(lm(y ~ x)) against Index (1 to 5)]
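The fitted values and residuals for the pony example can be recomputed directly. This is a minimal Python sketch (an assumption; the slide's own output comes from R's `lm`), and `least_squares` is a made-up helper name that applies the usual slope and intercept formulas.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


x = [3, 6, 12, 18, 24]        # age in months
y = [60, 95, 140, 170, 185]   # average weight in kg

a, b = least_squares(x, y)

# Predicted weight at each observed age.
fitted = [a + b * xi for xi in x]

# Residual = observed y minus predicted y.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for xi, yi, fi, ei in zip(x, y, fitted, residuals):
    print(f"x={xi:>2}  observed={yi:>3}  fitted={fi:9.5f}  residual={ei:10.6f}")
```

The residuals from a least squares fit always sum to zero (up to rounding), so the question of fit is about their pattern: here they run negative, positive, positive, positive, negative, which is worth discussing when deciding whether a straight line is appropriate.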
 Coefficient of Determination:
• The coefficient of determination, denoted by r², is the proportion of
variability in y that can be attributed to an approximate linear
relationship between x and y. The value of r² is often converted to a
percentage (by multiplying by 100).
• It is essentially a measure of the strength of the linear relationship
between the two variables.
• A large value of r² indicates that a large proportion of the variability in
y can be explained by the approximate linear relationship between x
and y. This tells you that knowing the value of x is helpful for
predicting y.
 
Calculating r²
• Residual sum of squares = SSResid = $\sum (y - \hat{y})^2$

Total sum of squares = SSTo = $\sum (y - \bar{y})^2$

The coefficient of determination is calculated as
$r^2 = 1 - \dfrac{\text{SSResid}}{\text{SSTo}}$

*Note that we can also calculate r² by squaring r, which is what we will do to find it.
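Both routes to the coefficient of determination can be compared on the pony data. The sketch below is in Python (an assumption; the class uses Excel and R) and uses the standard definitions: SSResid is the sum of squared residuals, SSTo is the sum of squared deviations of y from its mean, and the coefficient of determination is 1 minus their ratio.

```python
import math

x = [3, 6, 12, 18, 24]        # age in months
y = [60, 95, 140, 170, 185]   # average weight in kg
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares line and its predictions.
b = s_xy / s_xx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

# Route 1: sums of squares.
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_to = s_yy
r_squared = 1 - ss_resid / ss_to

# Route 2: square the correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"SSResid = {ss_resid:.4f}, SSTo = {ss_to:.1f}")
print(f"1 - SSResid/SSTo = {r_squared:.4f}")
print(f"r squared        = {r ** 2:.4f}")
```

The two values should agree, and both should match the "Multiple R-squared" line of the R output (0.9453) after rounding: about 95% of the variability in pony weight is explained by the linear relationship with age.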
Bivariate Summary
• Visualize with Scatterplot
• Make preliminary claims based on plot
• Fit Simple Linear Regression
• Look at residual plot for patterns
• Calculate and Interpret the Coefficient of Determination
• Decide if SLR is appropriate for the relationship
• Use SLR line to make predictions
Regression Activity
• Using either Excel or the provided R code on the D2L page, create a scatterplot of
weight described by height and calculate the correlation coefficient (r).
• Share your scatterplot and r value by sketching them on your white board.
• Is your plot different from that of the other students? Correlation Coefficient? Why?
• Using Excel or R, and the R code on our D2L page, generate a simple linear regression line for
your data. Write out the model.
• Interpret the model and coefficient of determination (r^2)
• Append your Linear Model to your plot.
• Using your model, predict the value of a person’s weight given that you know they are 68
inches tall.
• If you used someone else’s model, would you get the same result? Would it be close?
• If I gave you a larger subset of the data would your model have differed more or less from
every other group’s model? Explain.
