1) A biologist (Pierce, 1949) believed that the chirp rate of crickets might be influenced
by temperature. He measured the number of chirps per second for different crickets
at different temperatures (in °F). A portion of the (modified) data and some R output
are shown below.
Regression Analysis: chirp_rate versus temperature

The regression equation is
(A) = (B) + 0.214 temperature

Predictor      Coef      SE Coef   T       P
(Intercept)    -0.435    2.323     -0.19   0.853
temperature    0.21443   0.02910   7.37    0.000

Portion of the data:
chirp_rate   temperature
20.0         88.6
16.0         71.6
19.8         93.3
18.4         84.3
17.1         80.6

[Figure: Scatterplot of chirp_rate vs temperature]

a) Two items have been replaced with letters. Fill in what they should be. [4]
A -

B -

b) What does the estimated slope of 0.214 tell us about the relationship between
chirp rate and temperature? Be specific; a qualitative answer will not suffice
here.
[2]

The residual plots for this analysis are shown below.


[Figure: Residual Plots for chirp_rate. Panels: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]

c) Do you see any problems with these residuals? If so, list the problems in order of
importance. If not, state why you came to this conclusion.
[3]

d) Predict the chirp rate for a cricket at 78 °F. Are there any problems with this
prediction? Explain if so, ignoring the prediction if you like.
[2]

e) Predict the temperature for a chirp rate of 21. Are there any problems with this
prediction? Explain if so, ignoring the prediction if you like.
[3]
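
As a rough illustration of how a fitted line like the one in this question is used, the sketch below evaluates the line at a new temperature and flags temperatures outside the observed range. The coefficients are copied from the output in question 1. The function names are hypothetical, and the range check uses only the five listed temperatures because the full data set is not shown.

```python
# Coefficients copied from the regression output in question 1.
INTERCEPT = -0.435
SLOPE = 0.21443

# Only the five temperatures listed in the question; the full data set
# is not shown, so this range is an assumption for illustration.
LISTED_TEMPS = [88.6, 71.6, 93.3, 84.3, 80.6]


def predict_chirp_rate(temperature):
    """Evaluate the fitted line at the given temperature (degrees F)."""
    return INTERCEPT + SLOPE * temperature


def is_extrapolation(temperature, observed=LISTED_TEMPS):
    """True when the temperature lies outside the observed range."""
    return temperature < min(observed) or temperature > max(observed)


if __name__ == "__main__":
    temp = 78.0
    print(round(predict_chirp_rate(temp), 2), is_extrapolation(temp))
```

The line was fitted with chirp rate as the response, so the sketch only predicts in that direction. Solving the same equation for temperature is not the same as fitting a regression of temperature on chirp rate.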

2) A research team wants to determine what brand of cat food results in the least
amount of cat hair on furniture. The team runs an experiment with 300 cats,
separated into long-haired and short-haired and each with their own living space.
Each cat is given one of three brands of cat food (cheap dry food, fancy dry food, or
wet food) as well as either purified water or tap water. The team measures the
amount of hair on the floor of each living space.
Identify all the key design elements, such as:
[10]
a) the factors, levels, and treatments

b) any blocking variables (if present)

c) response variable(s)

d) use of blinding

e) possible improvements to the design

f) We could use side-by-side boxplots to compare the amount of hair left on the
floor for each type of cat food.
True or False ?
g) If we notice a significant difference between the mean hair found from different
brands, we can assume a cause-effect relationship.
True or False ?
3) Regression, again
Below is some output from a regression analysis performed on a dataset containing
the age and systolic blood pressure measurement for 30 patients. These patients
were a random sample from all of the patients at a medical clinic in Toronto.
The regression equation is
blood_pressure = 98.7 + 0.971 age

Predictor   Coef     SE Coef   T      P
Constant    98.71    10.00     9.87   0.000
age         0.9709   0.2102    4.62   0.000

S = 17.3137   R-Sq = 43.2%

Unusual Observations
Obs   age    blood_pressure   Fit      SE Fit   Residual   St Resid
2     47.0   220.00           144.35   3.19     XXX        XXX

[Figure: Scatterplot of blood_pressure vs age]

a) Something was minimized by this regression procedure. What is it, in very simple
words, and what is the actual (minimal) numerical value of this quantity for the
regression here?
[3]

b) ____________ of the variation in ____________________ can be explained by the
relationship with _________________ . Fill in the blanks. [2]

c) R has identified one unusual observation. This observation has: (circle one) [3]
i) High leverage       True or False

ii) High influence     True or False

iii) Large residual    True or False

d) The value under Residual has been replaced with XXX. What is this value? [1]

e) What is the meaning of the slope estimate 0.971? Again, be specific. [2]

f) Does the intercept have any meaning in this analysis? If so, what is the
interpretation of the intercept? If not, state why it is meaningless. [2]

g) A researcher wants to use this model to predict the blood pressure of all Toronto
residents between the ages of 18-70. Assuming we deal with the outlier (by
removing it, for example), is this an appropriate use of regression? Explain why
or why not using terminology from class.
[3]
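
For reference, a minimal sketch of how the entries in an "Unusual Observations" row relate to one another. It takes a residual to be the observed value minus the fitted value, and it assumes the usual least-squares relationship in which the leverage equals (SE Fit / S) squared. The function names are hypothetical; the numbers are copied from the output in this question.

```python
import math

# Values copied from the "Unusual Observations" row and the summary line.
observed = 220.00
fit = 144.35
se_fit = 3.19
s = 17.3137


def residual(y, y_hat):
    """Residual = observed value minus fitted value."""
    return y - y_hat


def standardized_residual(y, y_hat, se_fit, s):
    """Residual divided by its estimated standard error.

    Assumes leverage h = (SE Fit / S)^2, the usual relationship for a
    least-squares fit; treat this as an assumption about the output.
    """
    leverage = (se_fit / s) ** 2
    return residual(y, y_hat) / (s * math.sqrt(1.0 - leverage))


if __name__ == "__main__":
    print(residual(observed, fit))
    print(standardized_residual(observed, fit, se_fit, s))
```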

Below is a portion (first 7 children) of a data set consisting of observations on a number of variables
of interest for 78 seventh grade students, followed by some analyses and plots (residuals are
calculated from the regression preceding the plots). Higher self-concept scores (based on a
standard test) indicate more positive self-concept. Two different regression models are fitted below
using the data. We are interested in what factors influence the GPA of these students.
Portion of the data set:
OBS   GPA     IQ    Gender   Self-concept
1     7.940   111   M        67
2     8.292   107   M        43
3     4.643   100   M        52
4     7.470   107   M        66
5     8.882   114   F        58
6     7.585   115   M        51
7     7.650   111   M        71
.
.
.
78    ..      ..    ..       ..

Correlations (Pearson)
            GPA     IQ
IQ          0.677
Self-con    0.612   0.382

Cell Contents: Correlation

Regression Analysis 1
The regression equation is
GPA = -6.60 + 0.127 IQ

Predictor   Coef      StDev     T       P
Constant    -6.602    2.568     -2.57   0.014
IQ          0.12729   0.02272   5.60    0.000

S = 1.545   R-Sq = 45.9%   R-Sq(adj) = 44.4%

Analysis of Variance
Source           DF   SS        MS       F       P
Regression       1    74.980    74.980   31.39   0.000
Residual Error   37   88.377    2.389
Total            38   163.357

Unusual Observations
Obs   IQ    GPA     Fit     StDev Fit   Residual   St Resid
8     97    2.412   5.745   0.430       -3.333     -2.25R
22    109   1.760   7.273   0.260       -5.513     -3.62R
R denotes an observation with a large standardized residual
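
The summary figures in output of this kind follow from the Analysis of Variance table. The sketch below is an illustration under standard simple-regression definitions (R-Sq as regression SS over total SS, F as the ratio of mean squares, S as the square root of the error mean square). The function name is hypothetical; the inputs are the sums of squares and degrees of freedom from Regression Analysis 1.

```python
import math


def anova_summary(ss_regression, ss_error, df_regression, df_error):
    """Derive R-Sq, F and S from the sums of squares in an ANOVA table."""
    ss_total = ss_regression + ss_error
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error
    return {
        "r_sq": ss_regression / ss_total,  # share of variation explained
        "f": ms_regression / ms_error,     # ratio of mean squares
        "s": math.sqrt(ms_error),          # residual standard deviation
    }


# Sums of squares and degrees of freedom from Regression Analysis 1.
summary = anova_summary(74.980, 88.377, 1, 37)
print({name: round(value, 3) for name, value in summary.items()})
```

Up to rounding, the results should agree with the printed R-Sq = 45.9%, F = 31.39 and S = 1.545.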

Regression Analysis 2
The regression equation is
GPA = 1.52 + 0.106 Self-concept

Predictor   Coef      StDev     T      P
Constant    1.519     1.345     1.13   0.266
Self-con    0.10638   0.02263   4.70   0.000

S = 1.663   R-Sq = 37.4%   R-Sq(adj) = 35.7%

Analysis of Variance
Source           DF   SS        MS       F       P
Regression       1    61.087    61.087   22.10   0.000
Residual Error   37   102.270   2.764
Total            38   163.357

Unusual Observations
Obs   Self-con   GPA     Fit     StDev Fit   Residual   St Resid
8     51.0       2.412   6.944   0.313       -4.532     -2.78R
22    20.0       1.760   3.647   0.906       -1.887     -1.35 X
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

1) Answer the following questions, based on the previous three pages of output.
a) Look at the first regression above (GPA on IQ). Carefully describe the plot and relationship to
someone who will not be able to actually view the plot. [3]

b) If looking only at the numerical output (without taking the graphs into account), which variable is
the better predictor? Why? [2]

c) Some but not all of the variation in GPAs is explained by a linear relationship with self-concept.
Many other variables are involved. How much of the variation in GPAs is not accounted for after
taking into account self-concept scores? [2]

d) A student's IQ is 155. Give a prediction of this student's GPA. Is your prediction reliable (why or
why not)? [3]

e) For the second regression (GPA on self-concept), examine the scatterplot of (standardized)
residuals versus predictor. What exactly do you learn from this plot? [2]

f) Attempting to make sense of the results, Joe came to the conclusion that high GPA is the result of a
high IQ quotient, and in addition high GPA is also a consequence of high self-concept, although
self-concept is not as strong a cause as IQ. Do you agree? Defend your argument. (<20 words)
[1]

g) For the regression of GPA on self-concept, explain what makes the


i) 22nd student in the list somewhat unusual (do not use very technical terms like residual or
deviation). [1]

ii) 8th student in the list somewhat unusual (do not use very technical terms like residual or
deviation). [1]

h) Describe the (frequency) distribution of the residuals resulting from the regression of GPA on
self-concept. [2]

2) A hobbyist / researcher rediscovered the research of optometrist Dr. William H. Bates (1860-1931),
who demonstrated that nearsightedness and farsightedness can be cured with proper training and care.
He believed that glasses are in fact the main reason why people's eyesight deteriorates over time, and
recorded hundreds of patients whom he helped to regain 20/20 (standard) vision from myopia
(nearsightedness).
Due to vehement opposition from glasses manufacturers and optometrists who refused to accept his
view in his time, only a small number of people today know about Dr. Bates' ambitious work that
challenged the orthodox belief about eyesight. Unfortunately, the theory of statistical experimental
design, which could have supported Dr. Bates' research, had not yet been developed 100 years ago.
Although this eyesight research sounded very convincing, the researcher could not find any
experiment on Bates' research, and decided to carry out an experiment on his own to confirm the truth.
The researcher found 80 volunteers from U. of T. He decided to determine the effects of two
different types of eye exercises (plus lens method, shifting method; both described as highly
effective in the literature); the length of exercise time (30 min/day, 90 min/day - the longer the
better); and the effect of taking Vitamin A (vitamin A pill, placebo pill) on vision improvement
(measured on a numeric 1-10 scale) after 3 months.
The subjects were randomly assigned to each possible treatment, so that 10 subjects were allocated
for each treatment. All the subjects were extremely eager to participate, although they did not know
anything about the effects of the different exercise types, length of exercises, or supplements. In
addition, the researcher was very careful with measuring the vision improvement on his own after 90
days of training.
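
The allocation described above can be sketched in a few lines, as one possible way to enumerate level combinations and deal subjects out at random. The subject IDs 1 to 80, the dictionary keys, and the function names are hypothetical labels for illustration; the levels are those listed in the question.

```python
import itertools
import random

# Factor levels as described in the question.
FACTORS = {
    "exercise": ["plus lens method", "shifting method"],
    "duration": ["30 min/day", "90 min/day"],
    "supplement": ["vitamin A pill", "placebo pill"],
}


def all_treatments(factors):
    """Every combination of one level from each factor."""
    return list(itertools.product(*factors.values()))


def assign_at_random(units, treatments, seed=None):
    """Shuffle the units, then deal them out evenly to the treatments."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    per_treatment = len(shuffled) // len(treatments)
    return {
        treatment: shuffled[i * per_treatment:(i + 1) * per_treatment]
        for i, treatment in enumerate(treatments)
    }


treatments = all_treatments(FACTORS)
# Subject IDs 1..80 are hypothetical labels for the 80 volunteers.
assignment = assign_at_random(range(1, 81), treatments, seed=0)
```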
a) Identify explicitly:
i) the experimental units [1]

ii) the factors and the levels of each factor [3]

iii) the treatments [3]

iv) the response variable(s) [1]

b) Is this a randomized block design or a completely randomized design? (circle one) [1]

c) This experiment can be criticized severely by the scientific community due to two reasons - identify
those two "main" reasons relevant to experimental design. (< 40 words in total) [4]
(1)

(2)

3) Think about the sampling or experimental approach that you would use in each of the following
situations. Explain briefly how you would proceed, mentioning any critical procedural details, e.g. what
instructions you might give to your research assistant (you do not need to explain to him how to use a
random number table; just be sure to tell him what/where to randomize). If there is a technical term
that describes the sample/experimental design, mention it as part of your explanation. (< 30 words,
each)
a) Child welfare service areas (CWSAs) are spread across Canada. We want to collect data on
investigated cases of child abuse, in order to estimate types of abuse, percentage of cases
substantiated, etc. Each investigated case occupies a file at one of these CWSAs. We'd like to
sample about 1% of total cases. [2]

b) We have a list of students enrolled in STA220 (the population of interest). We want to sample 20
students to estimate some population characteristics; these characteristics may be associated with
gender. [2]

c) You want to select 10 test papers, randomly, from a pile of 80 sitting in front of you, and have to do
it as fast as possible. [2]

d) We want to compare three diets for their effect on weight gain (over the next 4 months), in young
rats. We have 18 young rats of about the same age, though differing in weight. [2]

e) You want to compare the taste of french fries, where some will be cooked from potatoes stored at
room temp, and others will be cooked from potatoes stored at a colder temperature. 10 people are
available for your study. Each taster will have to give a rating on a 0-10 scale for flavour. [2]
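
Where a random number table is mentioned above, software can play the same role. The sketch below shows two generic randomization procedures: drawing a simple random sample, and randomizing treatments within blocks of similar units. It is an illustration only, not a model answer to any part. The paper numbers, rat labels, rat weights, diet names, and function names are all made up.

```python
import random


def simple_random_sample(items, k, seed=None):
    """Draw k items without replacement, each subset equally likely."""
    return random.Random(seed).sample(list(items), k)


def randomized_blocks(weights, treatments, seed=None):
    """Block units of similar weight, then randomize within each block.

    `weights` maps a unit label to its weight; units are sorted by weight
    and cut into consecutive blocks of len(treatments) units.
    """
    rng = random.Random(seed)
    ordered = sorted(weights, key=weights.get)
    size = len(treatments)
    assignment = {}
    for start in range(0, len(ordered), size):
        block = ordered[start:start + size]
        shuffled = list(treatments)
        rng.shuffle(shuffled)
        assignment.update(zip(block, shuffled))
    return assignment


# Hypothetical paper numbers 1..80 and made-up rat weights in grams.
papers = simple_random_sample(range(1, 81), 10, seed=1)
rat_weights = {f"rat{i}": 100 + 3 * i for i in range(1, 19)}
diets = randomized_blocks(rat_weights, ["diet A", "diet B", "diet C"], seed=1)
```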

4) Some questions re regression/correlation:


a) Look back to question 4 computer output. If two students differ by 10 points on IQ score, estimate
how much their GPA scores will differ by, on average. Show how you quickly got the correct
answer (but making two predictions using the fitted equation will not get any marks). [2]

b) Look back to question 4 computer output. In assessing the relation between GPA and self-concept,
we used a scatterplot, plotting GPA vs. self-concept, as shown in the output there. How could you
improve the information in this scatterplot, if interested in assessing this relationship as accurately
as possible, in this particular study? [2]

c) For the following scatterplot, with fitted regression line shown, draw a rough picture of the
histogram of the residuals, with 3 - 5 bins. [2]
[Figure: scatterplot of y vs x with fitted regression line; y-axis from 0 to 6]
