SEE5211 Chapter3-P2017

Data Analysis in Envir Application
(SEE5211/SEE8212)
Dr. Wen Zhou

School of Energy and Environment
Email: wenzhou@cityu.edu.hk ; Office: B5425, AC1

Outline
• The role of statistics and the data analysis process

• Numerical method of describing data
• Summarizing bivariate data
• Population distributions
• Sampling variability and Confidence interval
• Hypothesis Testing Using a Single Sample
• Comparing Two populations
• Regression Analysis
• Analysis of Variance
• Wavelet Analysis
Summarizing Bivariate Data
Chapter 3
Example
Suppose we found the age and weight for each person in a sample
of 10 adults. Is there any relationship between the age and weight
of these adults? Create a scatterplot of the data below.
Age 24 30 41 28 50 46 49 35 20 39
Wt 256 124 320 185 158 129 103 196 110 130
[Scatterplot: Weight versus Age]
Do you think there is a relationship? If so, what kind? If not, why not?
Suppose we found the height and weight for each person in a
sample of 10 adults. Is there any relationship between the height
and weight of these adults? Create a scatterplot of the data
below.
Ht 74 65 77 72 68 60 62 73 61 64
Wt 256 124 320 185 158 129 103 196 110 130
[Scatterplot: Weight versus Height]
Is it positive or negative? Weak or strong?
Correlation
• The relationship between bivariate numerical variables
• May be positive or negative
What does it mean if the relationship is positive? Negative?
• May be weak or strong
What feature(s) of the graph would indicate a weak or strong relationship?
Identify the strength and direction of the data
[Four scatterplots: Set A, Set B, Set C, Set D]
Set A shows a strong, positive linear relationship.
Set B shows little or no relationship.
Set C shows a weaker (moderate), negative linear relationship.
Set D shows a strong, positive curved relationship.
Identify as having a positive relationship, a negative relationship,
or no relationship.
1. Heights of mothers and heights of their adult daughters: positive (+)
2. Age of a car in years and its current value: negative (-)
3. Weight of a person and calories consumed: positive (+)
4. Height of a person and the person’s birth month: no relationship
5. Number of hours spent in safety training and the number of accidents that occur: negative (-)
Correlation Coefficient (r)
• A quantitative assessment of the strength and direction of the linear relationship in bivariate, quantitative data
• Pearson’s sample correlation is used the most
• Population correlation coefficient – ρ
• Sample (statistic) correlation coefficient – r
• Equation:
$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
What are the two values in parentheses called? They are the z-scores for x and y.
Example
For the six primarily undergraduate universities in California with enrollments between 10,000 and 20,000, six-year graduation rates (y) and student-related expenditures per full-time student (x) for 2003 were reported as follows:
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 33.9
Create a scatterplot and calculate r.

Example
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 33.9
[Scatterplot: Graduation Rates versus Expenditures]
$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = 0.05 $$
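The calculation of r for this example can be sketched in a few lines of Python. This is a minimal illustration of the z-score formula, not the course's prescribed tool; the function name `pearson_r` and the variable names are assumptions made for this sketch.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation: sum of the z-score products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r = pearson_r(expenditures, grad_rates)
print(round(r, 2))  # 0.05
```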
Properties of r
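1) legitimate values are -1 ≤ r ≤ 1
[Number line for r marked at -1, -0.8, -0.5, 0, 0.5, 0.8, 1: no correlation at 0, weak correlation between 0 and ±0.5, moderate correlation between ±0.5 and ±0.8, strong correlation between ±0.8 and ±1]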
Properties of r
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 33.9
Suppose that the graduation rates were changed from percents to decimals (divide by 100). Transform the graduation rates and calculate r.
Then do the following transformations and calculate r:
1) x’ = 5(x + 14)
2) y’ = (y + 30) ÷ 4
r = 0.05
It is the same!
2) value of r is not changed by any linear transformation (adding a constant and/or multiplying by a positive constant)
Properties of r
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 33.9
Suppose we wanted to estimate the expenditures per student for given graduation rates.
Switch x and y, then calculate r.
r = 0.05. It is the same!
3) value of r does not depend on which of the two variables is labeled x
Properties of r
Suppose the 33.9 was REALLY 63.9. What do you think would happen to the value of the correlation coefficient?
Plot a revised scatterplot and find r.
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 63.9
[Scatterplots: Graduation Rates versus Expenditures, with 33.9 and with 63.9]
r = 0.42
Extreme values affect the correlation coefficient.
4) value of r is affected by extreme values.

Properties of r
Find the correlation for these points:
x -3 -1 1 3 5 7 9
y 40 20 8 4 8 20 40
Compute the correlation coefficient. Sketch the scatterplot.
r = 0
Does this mean that there is NO relationship between these points?
r = 0, but the data set has a definite relationship!
5) value of r is a measure of the extent to which x and y are linearly related
Properties of r:
1. legitimate values of r are -1 ≤ r ≤ 1
2. value of r is not changed by any linear transformation
3. value of r does not depend on which of the two variables is labeled x
4. value of r is affected by extreme values
5. value of r is a measure of the extent to which x and y are linearly related
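Properties 2 to 5 can be checked numerically with the data from the slides. The sketch below uses the algebraically equivalent form r = S_xy / sqrt(S_xx · S_yy) rather than the z-score form; the function and variable names are assumptions made for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation as S_xy / sqrt(S_xx * S_yy), equivalent to the z-score formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]
r = pearson_r(expenditures, grad_rates)

# Property 2: x' = 5(x + 14) and y' = (y + 30) / 4 leave r unchanged
r_transformed = pearson_r([5 * (x + 14) for x in expenditures],
                          [(y + 30) / 4 for y in grad_rates])

# Property 3: swapping the roles of x and y leaves r unchanged
r_swapped = pearson_r(grad_rates, expenditures)

# Property 4: changing the last rate from 33.9 to 63.9 changes r a lot
r_extreme = pearson_r(expenditures, grad_rates[:-1] + [63.9])

# Property 5: a perfectly curved (symmetric) pattern can still give r = 0
r_curved = pearson_r([-3, -1, 1, 3, 5, 7, 9], [40, 20, 8, 4, 8, 20, 40])

print(round(r, 2), round(r_transformed, 2), round(r_swapped, 2))
print(round(r_extreme, 2), round(r_curved, 2))
```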
Example
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 33.9
[Scatterplot: Graduation Rates versus Expenditures]
Interpret r = 0.05
There is a weak, positive, linear relationship between expenditures and graduation rates.
Correlation does not imply causation
Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable?
Consider the following examples:
• The relationship between the number of cavities in a child’s teeth and the size of his or her vocabulary is strong and positive.
So does this mean I should feed children more candy to increase their vocabulary?
These variables are both strongly related to the age of the child.
• Consumption of hot chocolate is negatively correlated with crime rate.
Should we all drink more hot chocolate to lower the crime rate?
Both are responses to cold weather.
Causality can only be shown by carefully controlling values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.
What is the objective of regression analysis?
The objective of regression analysis is to use information about one variable, x, to draw some sort of a conclusion about a second variable, y.
• x-variable: the independent or explanatory variable
• y-variable: the dependent or response variable
• We will use values of x to predict values of y.
Suppose that we have two variables:
x = the amount spent on advertising
y = the amount of sales for the product during a given period
What question might I want to answer using this data?
Scatterplots frequently exhibit a linear pattern. When this is the case, it makes sense to summarize the relationship between the variables by finding a line that is as close as possible to the points in the plot. This is done by calculating the line of best fit, or Least Squares Regression Line (LSRL).
The LSRL is
$$ \hat{y} = a + bx $$
The LSRL is the line that minimizes the sum of the squares of the deviations from the line.
ŷ – means the predicted y
b – is the slope
• it is the approximate amount by which y increases when x increases by 1 unit
a – is the y-intercept
• it is the approximate height of the line when x = 0
• in some situations, the y-intercept has no meaning
The slope of the LSRL is
$$ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$
The intercept of the LSRL is
$$ a = \bar{y} - b\bar{x} $$
Suppose we have a data set that consists of the observations (0,0), (3,10) and (6,2).
Let’s just fit a line to the data by drawing a line through what appears to be the middle of the points:
$$ \hat{y} = 0.5x + 4 $$
Now find the vertical distance from each point to the line.
(0,0): ŷ = 0.5(0) + 4 = 4, so 0 – 4 = -4
(3,10): ŷ = 0.5(3) + 4 = 5.5, so 10 – 5.5 = 4.5
(6,2): ŷ = 0.5(6) + 4 = 7, so 2 – 7 = -5
Find the sum of the squares of these deviations.
Sum of the squares = (-4)² + (4.5)² + (-5)² = 61.25

Use a calculator to find the line of best fit:
$$ \hat{y} = \frac{1}{3}x + 3 $$
Find the vertical deviations from the line.
(0,0): -3   (3,10): 6   (6,2): -3
What is the sum of the deviations from the line? Will it always be zero?
Find the sum of the squares of the deviations from the line.
Sum of the squares = 54
The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
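The comparison between the eyeballed line and the LSRL for the three points (0,0), (3,10) and (6,2) can be reproduced as follows. This is a small sketch using the slope and intercept formulas from the slides; the helper names `sum_sq_dev` and `lsrl` are assumptions for illustration.

```python
points = [(0, 0), (3, 10), (6, 2)]

def sum_sq_dev(a, b, pts):
    """Sum of squared vertical deviations from the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in pts)

def lsrl(pts):
    """Least squares intercept a and slope b."""
    n = len(pts)
    x_bar = sum(x for x, _ in pts) / n
    y_bar = sum(y for _, y in pts) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pts)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pts)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

guess_ss = sum_sq_dev(4, 0.5, points)   # eyeballed line y-hat = 0.5x + 4
a, b = lsrl(points)                     # least squares line
best_ss = sum_sq_dev(a, b, points)

print(guess_ss)                          # 61.25
print(round(a, 3), round(b, 3), round(best_ss, 2))
```

No other straight line gives a smaller sum of squares than the one returned by `lsrl`.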
Researchers are studying pomegranate's antioxidant properties to see if it might be helpful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded for several points in time.
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume (in mm³)
x 11 15 19 23 27
y 150 270 450 580 740
Sketch a scatterplot for this data set.
Calculate the LSRL and the correlation coefficient.
$$ \hat{y} = -269.75 + 37.25x, \qquad r = 0.998 $$
Interpret the slope and the correlation coefficient in context.
The average volume of the tumor increases by approximately 37.25 mm³ for each one-day increase in the number of days after injection.
There is a strong, positive, linear relationship between the average tumor volume and the number of days since injection.
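As a check on the calculation, the LSRL and correlation coefficient for the plain-water group can be computed directly from the five data points. This is a minimal sketch; the variable names are assumptions chosen for illustration.

```python
from math import sqrt

days = [11, 15, 19, 23, 27]            # x: days after injection
volume = [150, 270, 450, 580, 740]     # y: average tumor volume (mm^3)

n = len(days)
x_bar = sum(days) / n
y_bar = sum(volume) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, volume))
s_xx = sum((x - x_bar) ** 2 for x in days)
s_yy = sum((y - y_bar) ** 2 for y in volume)

b = s_xy / s_xx              # slope
a = y_bar - b * x_bar        # intercept (negative for this data set)
r = s_xy / sqrt(s_xx * s_yy)

print(a, b, round(r, 3))
```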
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume
x 11 15 19 23 27
y 150 270 450 580 740
$$ \hat{y} = -269.75 + 37.25x $$
Predict the average volume of the tumor for 20 days after injection.
$$ \hat{y} = -269.75 + 37.25(20) = 475.25 \text{ mm}^3 $$
Predict the average volume of the tumor for 5 days after injection.
$$ \hat{y} = -269.75 + 37.25(5) = -83.5 \text{ mm}^3 $$
Can volume be negative?
This is the danger of extrapolation. The least-squares line should not be used to make predictions for y using x-values outside the range in the data set. It is unknown whether the pattern observed in the scatterplot continues outside the range of x-values.
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume
x 11 15 19 23 27
y 150 270 450 580 740
Suppose we want to know how many days after injection of cancer cells the average tumor size would be 500 mm³.
The regression line of y on x, ŷ = -269.75 + 37.25x, should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction.
The slope of the line for predicting x is
$$ r\,\frac{s_x}{s_y} \quad \text{not} \quad r\,\frac{s_y}{s_x} $$
and the intercepts are almost always different.
Here is the appropriate regression line:
$$ \hat{x} = 7.277 + 0.027y $$
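The difference between the two regression lines can be seen numerically. The sketch below fits the line for predicting x from y with slope r·s_x/s_y and compares its answer for y = 500 with the answer obtained by (incorrectly) solving the y-on-x line for x. Variable names are assumptions for illustration.

```python
from math import sqrt

days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]

n = len(days)
x_bar = sum(days) / n
y_bar = sum(volume) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, volume))
s_x = sqrt(sum((x - x_bar) ** 2 for x in days) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in volume) / (n - 1))
r = s_xy / ((n - 1) * s_x * s_y)

# Regression of x on y: slope r * s_x / s_y, intercept x_bar - slope * y_bar
slope_x_on_y = r * s_x / s_y
intercept_x_on_y = x_bar - slope_x_on_y * y_bar
days_at_500 = intercept_x_on_y + slope_x_on_y * 500

# For comparison only: solving the y-on-x line for x gives a different value
b = r * s_y / s_x
a = y_bar - b * x_bar
days_inverted = (500 - a) / b

print(round(intercept_x_on_y, 3), round(slope_x_on_y, 3))
print(round(days_at_500, 2), round(days_inverted, 2))
```

Because r is very close to 1 here, the two answers are close; with a weaker correlation the gap would be larger.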
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume
x 11 15 19 23 27
y 150 270 450 580 740
Find the mean of the x-values (x̄) and the mean of the y-values (ȳ).
x̄ = 19 and ȳ = 438
Plot the point of averages (x̄, ȳ) on the scatterplot. The LSRL always passes through this point.
Let’s investigate how the LSRL and correlation coefficient change when different points are added to the data set.
Suppose we have the following data set.
x 4 5 6 7 8
y 2 5 4 6 9
Sketch a scatterplot. Calculate the LSRL and the correlation coefficient.
$$ \hat{y} = -3.8 + 1.5x, \qquad r = 0.916 $$
Suppose we add the point (5,8) to the data set. What happens to the regression line and the correlation coefficient?
x 4 5 6 7 8 5
y 2 5 4 6 9 8
$$ \hat{y} = -1.15 + 1.17x, \qquad r = 0.667 $$
Suppose we add the point (12,12) to the data set. What happens to the regression line and the correlation coefficient?
x 4 5 6 7 8 12
y 2 5 4 6 9 12
$$ \hat{y} = -2.24 + 1.225x, \qquad r = 0.959 $$
Suppose we add the point (12,0) to the data set. What happens to the regression line and the correlation coefficient?
x 4 5 6 7 8 12
y 2 5 4 6 9 0
$$ \hat{y} = 6.26 - 0.275x, \qquad r = -0.248 $$
The correlation coefficient and the LSRL are both measures that are affected by extreme values.
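The effect of each added point can be reproduced with a short script. This is an illustrative sketch; the helper name `fit` is an assumption, and it returns the intercept, slope and correlation for a list of (x, y) pairs.

```python
from math import sqrt

def fit(points):
    """Return (a, b, r): LSRL intercept, slope, and correlation coefficient."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b, s_xy / sqrt(s_xx * s_yy)

base = [(4, 2), (5, 5), (6, 4), (7, 6), (8, 9)]

for extra in [None, (5, 8), (12, 12), (12, 0)]:
    points = base if extra is None else base + [extra]
    a, b, r = fit(points)
    print(extra, round(a, 2), round(b, 3), round(r, 3))
```

The point (12,0) is far from the others in the x direction and off the trend, so it reverses the sign of both the slope and r.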
Assessing the fit of the LSRL
Once the LSRL is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y.
Important questions are:
1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions?
3. If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?
In a study, researchers were interested in how the distance a deer
mouse will travel for food (y) is related to the distance from the
food to the nearest pile of fine woody debris (x). Distances were
measured in meters.
x 6.94 5.23 5.21 7.10 8.16 5.50 9.19 9.05 9.36
y 0 6.13 11.29 14.35 12.03 22.72 20.11 26.16 30.65
Predictor             Coef     SE Coef   T       P
Constant              -7.69    13.33     -0.58   0.582
Distance to debris    3.234    1.782     1.82    0.112
S = 8.67071   R-Sq = 32.0%   R-Sq(adj) = 22.3%
$$ \hat{y} = -7.69 + 3.234x $$
Use the LSRL to calculate the predicted distance traveled. Subtract to find the residuals.
Distance from debris (x)   Distance traveled (y)   Predicted distance traveled (ŷ)   Residual (y − ŷ)
6.94                       0.00                    14.76                             -14.76
5.23                       6.13                    9.23                              -3.10
5.21                       11.29                   9.16                              2.13
7.10                       14.35                   15.28                             -0.93
8.16                       12.03                   18.70                             -6.67
5.50                       22.72                   10.10                             12.62
9.19                       20.11                   22.04                             -1.93
9.05                       26.16                   21.58                             4.58
9.36                       30.65                   22.59                             8.06
What does the sum of the residuals equal? Will the sum of the residuals always equal zero?
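The predicted values and residuals in the table can be regenerated from the raw data. The sketch below fits the LSRL and then subtracts; the variable names are assumptions for illustration, and small differences from the table come from rounding.

```python
debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]       # x (m)
traveled = [0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]  # y (m)

n = len(debris)
x_bar = sum(debris) / n
y_bar = sum(traveled) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(debris, traveled))
s_xx = sum((x - x_bar) ** 2 for x in debris)
b = s_xy / s_xx
a = y_bar - b * x_bar

predicted = [a + b * x for x in debris]
residuals = [y - y_hat for y, y_hat in zip(traveled, predicted)]

for x, y, y_hat, e in zip(debris, traveled, predicted, residuals):
    print(f"{x:5.2f} {y:6.2f} {y_hat:6.2f} {e:7.2f}")

# For a least squares line with an intercept, the residuals sum to zero
# (up to floating-point rounding).
print(abs(sum(residuals)) < 1e-9)
```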
Residual plots
• A residual plot is a scatterplot of the (x, residual) pairs.
• Residuals can also be graphed against the predicted y-values.
• The purpose is to determine if a linear model is the best way to describe the relationship between the x & y variables.
• If no pattern exists between the points in the residual plot, then the linear model is appropriate.
[Two residual plots: Residuals versus x]
Left: this residual plot shows no pattern, so it indicates that the linear model is appropriate.
Right: this residual plot shows a curved pattern, so it indicates that the linear model is not appropriate.
Plot the residuals against the distance from debris (x).
[Residual plot: Residuals versus Distance from debris]
Since the residual plot displays no pattern, a linear model is appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food.
Now plot the residuals against the predicted distance traveled.
[Residual plot: Residuals versus Predicted Distance traveled]
What do you notice about the general scatter of points on this residual plot versus the residual plot using the x-values?
Residual plots can be plotted against either the x-values or the predicted y-values.
Let’s examine the following data set:
The following data is for 12 black bears from the Boreal Forest.
x = age (in years) and y = weight (in kg)
x 10.5 6.5 28.5 10.5 6.5 7.5 6.5 5.5 7.5 11.5 9.5 5.5
y 54 40 62 51 55 56 62 42 40 59 51 50
Sketch a scatterplot with the fitted regression line.
[Scatterplot: Weight versus Age with the fitted regression line; one point is labeled "Influential observation"]
Do you notice anything unusual about this data set?
The point at age 28.5, far from the other x-values, is considered an influential point because it affects the placement of the least-squares regression line.
What would happen to the regression line if this point is removed?
An observation is an outlier if it has a large residual.

Coefficient of determination
• Denoted by r²
• gives the proportion of variation in y that can be attributed to an approximate linear relationship between x & y
Let’s explore the meaning of r² by revisiting the deer mouse data set.
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
x 6.94 5.23 5.21 7.10 8.16 5.50 9.19 9.05 9.36
y 0 6.13 11.29 14.35 12.03 22.72 20.11 26.16 30.65
[Scatterplot: Distance traveled versus Distance to debris]
Suppose you didn’t know any x-values. What distance would you expect deer mice to travel?
$$ \bar{y} = 15.938 $$
What is the total amount of variation in the distance traveled (y-values)? Hint: Find the sum of the squared deviations.
$$ SSTo = \sum (y - \bar{y})^2 $$
Total amount of variation in the distance traveled is 773.95 m².
Now suppose you DO know the x-values. Your best guess would be the predicted distance traveled (the point on the LSRL).
By how much do the observed points vary from the LSRL? Hint: Find the sum of the residuals squared.
$$ SSResid = \sum (y - \hat{y})^2 $$
The points vary from the LSRL by 526.27 m².
$$ r^2 = 1 - \frac{SSResid}{SSTo} = 1 - \frac{526.27}{773.95} = 0.320 $$
Approximately what percent of the variation in distance traveled can be explained by the regression line?
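The two sums of squares and r² can be computed from the deer mouse data as follows. This is a minimal sketch under the definitions above; the variable names are assumptions for illustration.

```python
debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
traveled = [0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(debris)
x_bar = sum(debris) / n
y_bar = sum(traveled) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(debris, traveled))
     / sum((x - x_bar) ** 2 for x in debris))
a = y_bar - b * x_bar

# Variation around the mean (x ignored) and around the LSRL (x used)
ss_total = sum((y - y_bar) ** 2 for y in traveled)
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(debris, traveled))
r_squared = 1 - ss_resid / ss_total

print(round(y_bar, 3), round(ss_total, 2), round(ss_resid, 2), round(r_squared, 3))
```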
Partial output from the regression analysis of deer mouse data:
Predictor             Coef     SE Coef   T       P
Constant              -7.69    13.33     -0.58   0.582
Distance to debris    3.234    1.782     1.82    0.112
S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%
The y-intercept (a): This value has no meaning in context since it doesn't make sense to have a negative distance.
The slope (b): The distance traveled to food increases by approximately 3.234 meters for an increase of 1 meter in the distance to the nearest debris pile.
The standard deviation (s_e): This is the typical amount by which an observation deviates from the least squares regression line. It’s found by:
$$ s_e = \sqrt{\frac{SSResid}{n - 2}} $$
The coefficient of determination (r²): Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.
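The value S in the output can be recovered from SSResid and the sample size. The short sketch below plugs in the SSResid value reported on the earlier slide (526.27) and n = 9 observations; the variable names are assumptions for illustration.

```python
from math import sqrt

ss_resid = 526.27   # SSResid for the deer mouse data, as reported on the slide
n = 9               # number of (x, y) observations

# Standard deviation about the least squares line
s_e = sqrt(ss_resid / (n - 2))
print(round(s_e, 2))  # 8.67
```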
Let’s examine this data set:
x = representative age
y = average marathon finish time
Age 15 25 35 45 55 65
Time 302.38 193.63 185.46 198.49 224.30 288.71
Create a scatterplot for this data set.
[Scatterplot: Average Finish Time versus Representative Age]
Because of the curved pattern, a straight line would not accurately describe the relationship between average finish time and age.
Since this curve resembles a parabola, a quadratic function can be used to describe this relationship.
The least-squares quadratic regression is
$$ \hat{y} = a + b_1 x + b_2 x^2 $$
$$ \hat{y} = 462 - 14.2x + 0.179x^2 $$
This curve minimizes the sum of the squares of the residuals (similar to least-squares linear regression).
Notice the residuals from the quadratic regression.
[Residual plot: Residuals versus Age]
Since there is no pattern in the residual plot, the quadratic regression is an appropriate model for this data set.
The measure R² is useful for assessing the fit of the quadratic regression.
$$ R^2 = 1 - \frac{SSResid}{SSTo} $$
R² = 0.921
92.1% of the variation in average marathon finish times can be explained by the approximate quadratic relationship between average finish time and age.
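In practice statistical software fits the quadratic, but the idea can be sketched without any library by solving the least-squares normal equations for a, b₁ and b₂. The solver below (plain Gaussian elimination) and all names are assumptions made for this illustration, not the method used to produce the slide.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution

ages = [15, 25, 35, 45, 55, 65]
times = [302.38, 193.63, 185.46, 198.49, 224.30, 288.71]

# Normal equations for y-hat = a + b1*x + b2*x^2:
# matrix entry (i, j) is the sum of x^(i+j); right side i is the sum of y*x^i
power_sums = [sum(x ** p for x in ages) for p in range(5)]
normal = [[power_sums[i + j] for j in range(3)] for i in range(3)]
rhs = [sum(y * x ** i for x, y in zip(ages, times)) for i in range(3)]
a, b1, b2 = solve(normal, rhs)

predicted = [a + b1 * x + b2 * x ** 2 for x in ages]
mean_time = sum(times) / len(times)
ss_resid = sum((y - p) ** 2 for y, p in zip(times, predicted))
ss_total = sum((y - mean_time) ** 2 for y in times)
r_squared = 1 - ss_resid / ss_total

print(round(a, 1), round(b1, 2), round(b2, 4), round(r_squared, 3))
```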
Depending on the data set, other regression models, such as cubic regression, may be used. Statistical software is commonly used to calculate these regression models.
Another method for fitting regression models to non-linear data sets is to transform the data, making it linear. Then a least-squares regression line can be fit to the transformed data.
Commonly Used Transformations
Transformation                               Equation
No transformation                            $\hat{y} = a + bx$
Square root of x                             $\hat{y} = a + b\sqrt{x}$
Log of x *                                   $\hat{y} = a + b\log_{10}(x)$
Reciprocal of x                              $\hat{y} = a + b\left(\frac{1}{x}\right)$
Log of y * (exponential growth or decay)     $\log_{10}(\hat{y}) = a + bx$
*Natural log may also be used

Pomegranate study revisited:
x = number of days after injection of cancer cells in mice assigned to 0.2% PFE and y = average tumor volume
x 11 15 19 23 27 31 35 39
y 40 75 90 210 230 330 450 600
Sketch a scatterplot for this data set.
[Scatterplot: Average tumor volume versus Number of days]
There appears to be a curve in the data points. Let’s use a transformation to linearize the data.
Since the data appears to be exponential growth, let’s try the “log of y” transformation.
x 11 15 19 23 27 31 35 39
Log(y) 1.60 1.88 1.95 2.32 2.36 2.52 2.65 2.78
Sketch a scatterplot of the log(y) and x.
[Scatterplot: Log of Average tumor volume versus Number of days]
Notice that the relationship now appears linear. Let’s fit an LSRL to the transformed data.
The LSRL is
$$ \log \hat{y} = 1.226 + 0.041x $$
What would the predicted average tumor size be 30 days after injection of cancer cells?
$$ \log \hat{y} = 1.226 + 0.041(30) = 2.456 $$
$$ \hat{y} = 10^{2.456} = 285.76 \text{ mm}^3 $$
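The transform, fit and back-transform steps can be sketched as follows. The variable names are assumptions for illustration. Note that fitting with unrounded logarithms gives coefficients that differ slightly from the rounded values on the slide, so the back-transformed prediction also differs a little; the last line repeats the slide's calculation with its rounded coefficients.

```python
from math import log10

days = [11, 15, 19, 23, 27, 31, 35, 39]
volume = [40, 75, 90, 210, 230, 330, 450, 600]

# Step 1: transform y
log_volume = [log10(v) for v in volume]

# Step 2: fit the LSRL to (x, log y)
n = len(days)
x_bar = sum(days) / n
ly_bar = sum(log_volume) / n
b = (sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(days, log_volume))
     / sum((x - x_bar) ** 2 for x in days))
a = ly_bar - b * x_bar

# Step 3: predict on the log scale, then undo the transformation
pred_full_precision = 10 ** (a + b * 30)
pred_slide_coefficients = 10 ** (1.226 + 0.041 * 30)

print(round(a, 3), round(b, 3))
print(round(pred_full_precision, 1), round(pred_slide_coefficients, 2))
```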

Another useful transformation is the power transformation. The power transformation ladder and the scatterplot (both below) can be used to help determine what type of transformation is appropriate.
Power Transformation Ladder
Power   Transformed Value        Name
3       (Original value)³        Cube
2       (Original value)²        Square
1       (Original value)         No transformation
½       √(Original value)        Square root
1/3     ∛(Original value)        Cube root
0       Log(Original value)      Logarithm
-1      1/(Original value)       Reciprocal
[Figure: scatterplot showing numbered curve shapes]
Suppose that the scatterplot looks like the curve labeled 2. Then we would use a power that is up the ladder from the no transformation row for both the x and y variables.
Suppose that the scatterplot looks like the curve labeled 1. Then we would use a power that is up the ladder from the no transformation row for the x variable and a power down the ladder for the y variable.
Logistic Regression (Optional)
• Can be used if the dependent variable is categorical with just two possible values
• Used to describe how the probability of “success” changes as a numerical predictor variable, x, changes
• With p denoting the probability of success, the logistic regression equation is
$$ p = \frac{e^{a + bx}}{1 + e^{a + bx}} $$
where a and b are constants.
For any value of x, the value of p is always between 0 and 1.
In a study on wolf spiders, researchers were interested in what variables might be related to a female wolf spider’s decision to kill and consume her partner during courtship or mating. Data was collected for 53 pairs of courting wolf spiders.
x = the difference in body width (female – male)
y = cannibalism; coded 0 for no cannibalism and 1 for cannibalism
[Scatterplot: cannibalism (0 or 1) versus difference in body width]
Note that the plot was constructed so that if two points fell in the exact same location they would be offset a little bit so that all points would be visible (called jittering).
This equation can be used to predict the probability of the male spider being cannibalized based on the difference in size.
$$ p = \frac{e^{-3.08904 + 3.06928x}}{1 + e^{-3.08904 + 3.06928x}} $$
What is the probability of cannibalism if the male & female spiders are the same width (difference of 0)?
$$ p = \frac{e^{-3.08904 + 3.06928(0)}}{1 + e^{-3.08904 + 3.06928(0)}} = 0.044 $$
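The logistic equation is easy to evaluate directly. In the sketch below the function name is an assumption, and the signs of the coefficients (a = -3.08904, b = +3.06928) are also an assumption: the signs are not legible in the extracted slide, and these are the values that reproduce the stated probability of 0.044 at x = 0.

```python
from math import exp

def logistic_probability(x, a, b):
    """p = e^(a + b*x) / (1 + e^(a + b*x)); always strictly between 0 and 1."""
    t = exp(a + b * x)
    return t / (1 + t)

# Assumed signs for the wolf spider coefficients (see the note above)
A, B = -3.08904, 3.06928

p_same_width = logistic_probability(0, A, B)
print(round(p_same_width, 3))  # 0.044

# The probability rises as the female becomes wider relative to the male
for diff in [-0.5, 0.0, 0.5, 1.0, 1.5]:
    print(diff, round(logistic_probability(diff, A, B), 3))
```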
