(SEE5211/SEE8212)
Chapter 3
Example
Suppose we found the age and weight for each person in a sample
of 10 adults. Is there any relationship between the age and weight
of these adults? Create a scatterplot of the data below.
Age     24  30  41  28  50  46  49  35  20  39
Weight  256 124 320 185 158 129 103 196 110 130
Suppose we found the height and weight for each person in a
sample of 10 adults. Is there any relationship between the height
and weight of these adults? Create a scatterplot of the data
below.
Is it positive or negative?
Weak or strong?
Height  74  65  77  72  68  60  62  73  61  64
Weight  256 124 320 185 158 129 103 196 110 130
Correlation
$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
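As a minimal sketch of the formula above, the following Python computes r as the sum of products of standardized deviations divided by n - 1, using the age, height and weight data listed earlier. The function name `correlation` is an illustrative choice, not something from the slides.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r: sum of products of standardized deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

age = [24, 30, 41, 28, 50, 46, 49, 35, 20, 39]
height = [74, 65, 77, 72, 68, 60, 62, 73, 61, 64]
weight = [256, 124, 320, 185, 158, 129, 103, 196, 110, 130]

r_age = correlation(age, weight)
r_height = correlation(height, weight)
print(f"age vs weight:    r = {r_age:.3f}")
print(f"height vs weight: r = {r_height:.3f}")
```

Comparing the two printed values shows which pair of variables has the stronger linear relationship.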
Example
r = 0.05
[Scatterplot: graduation rates versus expenditures]
Properties of r
|r| between 0.8 and 1: strong correlation
|r| between 0.5 and 0.8: moderate correlation
|r| between 0 and 0.5: weak correlation
r = 0: no correlation
Properties of r
Expenditures      8011 7323 8735 7548 7071 8248
Graduation rates  64.6 53.0 46.3 42.5 38.5 33.9
Suppose that the graduation rates were changed from percents to
decimals (divide by 100).
Transform the graduation rates and calculate r.
Do the following transformations and calculate r:
1) x’ = 5(x + 14)
2) y’ = (y + 30) ÷ 4
r = 0.05
It is the same!
Graduation rates  64.6 53.0 46.3 42.5 38.5 33.9
Suppose we wanted to estimate the expenditures per student for
given graduation rates.
Switch x and y, then calculate r.
r = 0.05. It is the same!
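These properties can be checked numerically. The sketch below uses the expenditure and graduation rate values from the slides; the helper name `correlation` is hypothetical, and it uses the algebraically equivalent form r = S_xy / sqrt(S_xx * S_yy) rather than the standardized form.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r via sums of squared and cross deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r_original = correlation(expenditures, grad_rates)

# Percents changed to decimals (divide by 100)
r_decimal = correlation(expenditures, [y / 100 for y in grad_rates])

# x' = 5(x + 14) and y' = (y + 30) / 4
r_linear = correlation([5 * (x + 14) for x in expenditures],
                       [(y + 30) / 4 for y in grad_rates])

# Switch x and y
r_swapped = correlation(grad_rates, expenditures)

for label, value in [("original", r_original), ("decimals", r_decimal),
                     ("linear transforms", r_linear), ("x and y switched", r_swapped)]:
    print(f"{label}: r = {value:.2f}")
```

All four values agree up to floating-point rounding, because r does not depend on the units of measurement or on which variable is called x.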
[Scatterplot: graduation rates versus expenditures, annotated r = 0.42; the accompanying question is truncated: "... correlation coefficient?"]
Interpret r = 0.05
[Scatterplot: graduation rates versus expenditures]
Correlation does not imply causation
Does a value of r close to 1 or -1 mean that a change in one
variable causes a change in the other variable?
Consider the following examples:
• The relationship between the number of cavities in a child’s teeth and the
size of his or her vocabulary is strong and positive.
So does this mean I should feed children more candy to increase their vocabulary?
These variables are both strongly related to the age of the child.
Should we all drink more hot chocolate to lower the crime rate?
Causality can only be shown by carefully controlling the values of all variables that might be
related to the ones under study. In other words, with a well-controlled, well-designed
experiment.
What is the objective of regression analysis?
To find the line $\hat{y} = a + bx$ that minimizes the sum of the squares of the deviations from the line.
ŷ – means the predicted y
b – is the slope
• it is the approximate amount by which y increases when x
increases by 1 unit
a – is the y-intercept
• it is the approximate height of the line when x = 0
• in some situations, the y-intercept has no meaning
The slope of the LSRL is $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
The intercept of the LSRL is $a = \bar{y} - b\bar{x}$
Scatterplots frequently exhibit a linear pattern. When this is the case, it makes sense to summarize the
relationship between the variables by finding a line that is as close as possible to the points in the plot.
This is done by calculating the line of best fit or Least Squares Regression Line (LSRL).
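Suppose we have a data set that consists of the observations (0,0), (3,10) and (6,2).
Consider the line $\hat{y} = .5x + 4$. Find the vertical deviations from the line.
(0,0): ŷ = .5(0) + 4 = 4, deviation 0 – 4 = -4
(3,10): ŷ = .5(3) + 4 = 5.5, deviation 10 – 5.5 = 4.5
(6,2): ŷ = .5(6) + 4 = 7, deviation 2 – 7 = -5
Now consider the line $\hat{y} = \frac{1}{3}x + 3$. Find the vertical deviations from the line.
(0,0): deviation -3
(3,10): deviation 6
(6,2): deviation -3
What is the sum of the deviations from the line? Will it always be zero?
Find the sum of the squares of the deviations from the line.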
The line that minimizes the sum of the squares of the
deviations from the line is the LSRL.
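The three-point example can be worked through in a few lines of Python. This is a sketch using the slope and intercept formulas for the LSRL; the function names `deviations` and `least_squares` are illustrative only.

```python
points = [(0, 0), (3, 10), (6, 2)]

def deviations(points, a, b):
    """Vertical deviations y - (a + b*x) from the line y-hat = a + b*x."""
    return [y - (a + b * x) for x, y in points]

def least_squares(points):
    """Intercept a and slope b of the least squares regression line."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in points)
         / sum((x - x_bar) ** 2 for x, _ in points))
    a = y_bar - b * x_bar
    return a, b

# Trial line y-hat = .5x + 4
trial = deviations(points, a=4, b=0.5)
sse_trial = sum(d ** 2 for d in trial)

# Least squares regression line
a, b = least_squares(points)
lsrl = deviations(points, a, b)
sse_lsrl = sum(d ** 2 for d in lsrl)

print("trial line deviations:", trial, "sum:", sum(trial), "sum of squares:", sse_trial)
print("LSRL deviations:", lsrl, "sum of squares:", sse_lsrl)
```

The deviations from the LSRL sum to zero (up to rounding), and its sum of squared deviations is smaller than that of the trial line.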
LSRL
x 11 15 19 23 27
y 150 270 450 580 740
Sketch a scatterplot for this data set.
x = number of days after injection of cancer cells in mice assigned to
plain water and y = average tumor volume
x 11 15 19 23 27
y 150 270 450 580 740
Calculate the LSRL and the correlation coefficient.
The average volume of the tumor increases by approximately 37.25 mm3 for each day increase in the number of days after injection.
There is a strong, positive, linear relationship between the average tumor volume and the number of days since injection.
x = number of days after injection of cancer cells in mice assigned to plain water and y =
average tumor volume
x 11 15 19 23 27
y 150 270 450 580 740
$\hat{y} = -269.75 + 37.25x$
Predict the average volume of the tumor for 20 days after injection.
It is unknown whether the pattern observed in the scatterplot continues outside the range of x-values.
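The fit and the day-20 prediction can be reproduced with a short sketch. The `predict` helper is hypothetical, and its refusal to extrapolate outside the observed x-range is one way to encode the caution above, not a rule from the slides.

```python
from math import sqrt

days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]

n = len(days)
x_bar = sum(days) / n
y_bar = sum(volume) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, volume))
s_xx = sum((x - x_bar) ** 2 for x in days)
s_yy = sum((y - y_bar) ** 2 for y in volume)

b = s_xy / s_xx            # slope
a = y_bar - b * x_bar      # intercept
r = s_xy / sqrt(s_xx * s_yy)

def predict(x):
    """Predicted average tumor volume; raises outside the observed range of days."""
    if not min(days) <= x <= max(days):
        raise ValueError("x is outside the range of the data; prediction not supported")
    return a + b * x

print(f"y-hat = {a} + {b}x, r = {r:.4f}")
print("predicted volume at 20 days:", predict(20))
```

Calling `predict(45)` raises an error, since 45 days lies beyond the last observation at 27 days.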
x = number of days after injection of cancer cells in mice assigned to plain water and y =
average tumor volume
x 11 15 19 23 27
y 150 270 450 580 740
Suppose we want to know how many days after injection of cancer cells would the
average tumor size be 500 mm3?
$\hat{y} = -269.75 + 37.25x$
$\hat{x} = 7.277 + .027y$
The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared
deviations in the x direction.
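To illustrate the difference, the sketch below fits both lines to the tumor data and compares the two ways of estimating the number of days for a 500 mm3 tumor. The helper name `fit_line` is an assumption made for this example.

```python
days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]

def fit_line(xs, ys):
    """Least squares intercept and slope for predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Regression of y on x: predicts volume from days
a_yx, b_yx = fit_line(days, volume)

# Regression of x on y: predicts days from volume
a_xy, b_xy = fit_line(volume, days)

target = 500
days_from_x_on_y = a_xy + b_xy * target           # appropriate line for predicting x
days_from_inverting = (target - a_yx) / b_yx      # solving the y-on-x line for x

print(f"x-hat = {a_xy:.3f} + {b_xy:.3f}y")
print(f"x on y line: {days_from_x_on_y:.3f} days")
print(f"inverting y on x line: {days_from_inverting:.3f} days")
```

The two answers are close here only because the correlation is very strong; with weaker correlation they diverge more.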
x = number of days after injection of cancer cells in mice
assigned to plain water and y = average tumor volume
x 11 15 19 23 27
y 150 270 450 580 740
Find the mean of the x-values ($\bar{x}$) and the mean of the y-values ($\bar{y}$).
$\bar{x} = 19$ and $\bar{y} = 438$
Plot the point of averages ($\bar{x}$, $\bar{y}$) on the scatterplot.
Let’s investigate how the LSRL and correlation coefficient
change when different points are added to the data set
Suppose we have the following data set.
x 4 5 6 7 8
y 2 5 4 6 9
Sketch a scatterplot. Calculate the LSRL and the correlation coefficient.
$\hat{y} = -3.8 + 1.5x$
$r = 0.916$
Let’s investigate how the LSRL and correlation coefficient
change when different points are added to the data set
Suppose we have the following data set.
x 4 5 6 7 8 5
y 2 5 4 6 9 8
Suppose we add the point (5,8) to the data set. What happens to the regression
line and the correlation coefficient?
Original data: $\hat{y} = -3.8 + 1.5x$, $r = 0.916$
With (5,8) added: $\hat{y} = -1.15 + 1.17x$, $r = 0.667$
Let’s investigate how the LSRL and correlation coefficient
change when different points are added to the data set
Suppose we have the following data set.
x 4 5 6 7 8 12
y 2 5 4 6 9 12
Suppose we add the point (12,12) to the data set. What happens to the
regression line and the correlation coefficient?
Original data: $\hat{y} = -3.8 + 1.5x$, $r = 0.916$
With (12,12) added: $\hat{y} = -2.24 + 1.225x$, $r = 0.959$
Let’s investigate how the LSRL and correlation coefficient
change when different points are added to the data set
Suppose we have the following data set.
x 4 5 6 7 8 12
y 2 5 4 6 9 0
Suppose we add the point (12,0) to the data set. What happens to the regression
line and the correlation coefficient?
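The effect of each added point can be explored with a small sketch that refits the line after appending one point at a time to the original five observations. The function name `fit` and the dictionary layout are illustrative choices.

```python
from math import sqrt

def fit(xs, ys):
    """Return (intercept a, slope b, correlation r) for the LSRL of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b, s_xy / sqrt(s_xx * s_yy)

base_x = [4, 5, 6, 7, 8]
base_y = [2, 5, 4, 6, 9]

results = {"original": fit(base_x, base_y)}
for point in [(5, 8), (12, 12), (12, 0)]:
    # Refit with exactly one extra point appended to the original data
    results[point] = fit(base_x + [point[0]], base_y + [point[1]])

for label, (a, b, r) in results.items():
    print(f"{label}: y-hat = {a:.3f} + {b:.3f}x, r = {r:.3f}")
```

The output for (12,0) answers the last question: a single point far from the others in the x direction can change both the slope and the correlation substantially.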
Once the LSRL is obtained, the next step is to examine how effectively the
line summarizes the relationship between x and y.
In a study, researchers were interested in how the distance a deer
mouse will travel for food (y) is related to the distance from the
food to the nearest pile of fine woody debris (x). Distances were
measured in meters.
$\hat{y} = -7.69 + 3.234x$
In a study, researchers were interested in how the distance a deer mouse will
travel for food (y) is related to the distance from the food to the nearest pile of
fine woody debris (x). Distances were measured in meters.
Use the LSRL to calculate the predicted distance traveled.
Subtract to find the residuals.
[Residual plot: residuals versus distance from debris]
Since the residual plot displays no pattern, a linear model is appropriate for describing the
relationship between the distance from debris and the distance a deer mouse will travel for food.
Now plot the residuals against the predicted distance from food.
What do you notice about the general scatter of points on this residual plot versus the residual
plot using the x-values?
[Residual plot: residuals versus predicted distance traveled]
Residual plots can be plotted against either the x-values or the predicted y-values.
[Residual plot: residuals versus distance from debris]
Let’s examine the following data set:
The following data is for 12 black bears from the Boreal Forest.
x = age (in years) and y = weight (in kg)
x 10.5 6.5 28.5 10.5 6.5 7.5 6.5 5.5 7.5 11.5 9.5 5.5
y 54 40 62 51 55 56 62 42 40 59 51 50
Sketch a scatterplot with the fitted regression line.
Do you notice anything unusual about this data set?
[Scatterplot: weight versus age with the fitted regression line; one point is labeled "Influential observation"]
This point is considered an influential point because it
affects the placement of the least-squares regression line.
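One way to see the influence is to fit the line with and without the suspect observation. The sketch below assumes the influential observation is the bear aged 28.5 years, the only point far from the rest in the x direction; the helper name `fit_line` is hypothetical.

```python
ages = [10.5, 6.5, 28.5, 10.5, 6.5, 7.5, 6.5, 5.5, 7.5, 11.5, 9.5, 5.5]
weights = [54, 40, 62, 51, 55, 56, 62, 42, 40, 59, 51, 50]

def fit_line(xs, ys):
    """Least squares intercept and slope of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Fit using all 12 bears
a_all, b_all = fit_line(ages, weights)

# Fit again after dropping the bear aged 28.5 (assumed to be the influential point)
kept = [(x, y) for x, y in zip(ages, weights) if x != 28.5]
a_without, b_without = fit_line([x for x, _ in kept], [y for _, y in kept])

print(f"all bears:       y-hat = {a_all:.2f} + {b_all:.3f}x")
print(f"without 28.5 yr: y-hat = {a_without:.2f} + {b_without:.3f}x")
```

A large change in slope between the two fits indicates that the single observation is pulling the line toward itself.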
Let’s examine the following data set:
The following data is for 12 black bears from the Boreal Forest.
x = age (in years) and y = weight (in kg)
x 10.5 6.5 28.5 10.5 6.5 7.5 6.5 5.5 7.5 11.5 9.5 5.5
y 54 40 62 51 55 56 62 42 40 59 51 50
An observation is an outlier if it has a large residual.
[Scatterplot: weight versus age]
• Denoted by $r^2$
[Plot: distance traveled, with the mean $\bar{y} = 15.938$ marked]
Your best guess would be the predicted
distance traveled (the point on the LSRL).
The standard deviation ($s_e$): This is the typical amount by which an observation
deviates from the least squares regression line. It’s found by:
$s_e = \sqrt{\frac{SSResid}{n - 2}}$
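Because the deer mouse observations are not listed in these slides, the sketch below illustrates the calculation on the tumor volume data from earlier in the chapter. It computes the residuals, SSResid, the standard deviation about the line, and r squared as 1 - SSResid/SSTotal; the variable names are illustrative.

```python
from math import sqrt

days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]

n = len(days)
x_bar = sum(days) / n
y_bar = sum(volume) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(days, volume))
     / sum((x - x_bar) ** 2 for x in days))
a = y_bar - b * x_bar

# Residual = observed y minus predicted y
residuals = [y - (a + b * x) for x, y in zip(days, volume)]

ss_resid = sum(e ** 2 for e in residuals)          # unexplained variation
ss_total = sum((y - y_bar) ** 2 for y in volume)   # total variation about the mean

s_e = sqrt(ss_resid / (n - 2))
r_squared = 1 - ss_resid / ss_total

print("residuals:", residuals)
print(f"SSResid = {ss_resid:.1f}, s_e = {s_e:.3f}, r^2 = {r_squared:.4f}")
```

Here s_e is in the units of y (mm3), while r squared is unitless.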
The slope (b): The distance traveled to food increases by approximately 3.234 meters for an
increase of 1 meter to the nearest debris pile.
[Scatterplot with fitted quadratic curve: average finish time versus representative age]
This curve minimizes the sum of the squares of the residuals (similar to
least-squares linear regression).
Let’s examine this data set:
x = representative age
y = average marathon finish time
Age 15 25 35 45 55 65
Time 302.38 193.63 185.46 198.49 224.30 288.71
Notice the residuals from the quadratic regression. Here is the residual plot:
[Scatterplot: average finish time versus representative age, alongside a residual plot of residuals versus age]
Since there is no pattern in the residual plot, the quadratic regression is an appropriate model for
this data set.
Let’s examine this data set:
x = representative age
y = average marathon finish time
Age 15 25 35 45 55 65
[Scatterplot: average finish time versus representative age]
$R^2 = .921$
92.1% of the variation in average marathon finish times can be explained by the
approximate quadratic relationship between average finish time and age.
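The quadratic fit can be reproduced without statistical software by solving the least squares normal equations directly. The sketch below is one possible approach, not necessarily how the slides obtained the result: it centers age at its mean for numerical stability (the fitted curve is the same quadratic), and the `solve` helper is a hypothetical small Gaussian elimination routine.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

age = [15, 25, 35, 45, 55, 65]
time = [302.38, 193.63, 185.46, 198.49, 224.30, 288.71]

# Center age so the normal equations are well conditioned
age_mean = sum(age) / len(age)
u = [x - age_mean for x in age]

# Normal equations for y-hat = c0 + c1*u + c2*u^2
power_sums = [sum(v ** k for v in u) for k in range(5)]
matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
rhs = [sum((v ** i) * y for v, y in zip(u, time)) for i in range(3)]
c0, c1, c2 = solve(matrix, rhs)

fitted = [c0 + c1 * v + c2 * v * v for v in u]
residuals = [y - f for y, f in zip(time, fitted)]

time_mean = sum(time) / len(time)
ss_resid = sum(e ** 2 for e in residuals)
ss_total = sum((y - time_mean) ** 2 for y in time)
r_squared = 1 - ss_resid / ss_total

print(f"R^2 = {r_squared:.3f}")
```

The positive coefficient on the squared term corresponds to the U shape in the scatterplot: finish times are longest for the youngest and oldest age groups.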
Depending on the data set, other regression models, such as
cubic regression, may be used. Statistical software is commonly
used to calculate these regression models.
Transformation        Equation
No transformation     $\hat{y} = a + bx$
Square root of x      $\hat{y} = a + b\sqrt{x}$
Log of x *            $\hat{y} = a + b \log_{10} x$
Reciprocal of x       $\hat{y} = a + b\left(\frac{1}{x}\right)$
Log of y *            $\log_{10} \hat{y} = a + bx$
(Exponential growth or decay)
[Scatterplot: horizontal axis "Number of days"; the caption is truncated, ending "... the data points."]
Let’s use a transformation to linearize the data.
Pomegranate study revisited:
The LSRL is
[Plot: horizontal axis "Number of days"]
Pomegranate study revisited:
... cancer cells?
log ŷ = 2.456 ...
[Plot: horizontal axis "Number of days"]
1     (Original value)                 No transformation
1/3   Cube root of (Original value)    Cube root
Suppose that the scatterplot looks like the curve labeled 1.
Then we would use a power that is up the ladder
$p = \frac{e^{a + bx}}{1 + e^{a + bx}}$
Where a and b are constants
For any value of x, the value of p is always between 0 and 1.
In a study on wolf spiders, researchers were interested in what variables might be related to a female wolf
spider’s decision to kill and consume her partner during courtship or mating. Data was collected for 53
pairs of courting wolf spiders.
x = the difference in body width (female – male)
y = cannibalism; coded 0 for no cannibalism and 1 for cannibalism
What is the probability of cannibalism if the male & female
spiders are the same width (difference of 0)?
Note that the plot was constructed so that if two points fell in the exact
same location they would be offset a little bit so that all points
would be visible (called jittering).
This equation can be used to predict the probability of the male
spider being cannibalized based on the difference in size.
$p = \frac{e^{-3.08904 + 3.06928x}}{1 + e^{-3.08904 + 3.06928x}}$
$p = \frac{e^{-3.08904 + 3.06928(0)}}{1 + e^{-3.08904 + 3.06928(0)}} \approx 0.044$
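The calculation for a size difference of 0 can be checked with a short sketch. The negative intercept is implied by the stated answer of 0.044; the positive sign on the slope is an assumption here, since the signs did not survive in the extracted equation. The function name `logistic` is illustrative.

```python
from math import exp

def logistic(x, a, b):
    """Logistic regression probability p = e^(a + bx) / (1 + e^(a + bx))."""
    z = a + b * x
    return exp(z) / (1 + exp(z))

A = -3.08904   # intercept; negative sign implied by the result 0.044 at x = 0
B = 3.06928    # slope; positive sign is an assumption

p_equal_width = logistic(0, A, B)
print(f"P(cannibalism) at difference 0: {p_equal_width:.3f}")

# For any x, p stays strictly between 0 and 1
probabilities = [logistic(x / 10, A, B) for x in range(-20, 21)]
print(min(probabilities), max(probabilities))
```

At x = 0 the slope term drops out, so the result at equal widths does not depend on the assumed slope sign.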