STAT7055 Topic 01 Tutorial Solutions

1. Data was collected on 105 homes in Canberra in 2003. For each house, the following
information was collected: the estimated price of the house (in dollars); the number of
bedrooms; the size of the house (in square metres); whether or not a pool was present (yes
or no); the distance from Civic; the rating of the insulation in the house (none, average
or high); the suburb; the number of bathrooms; and the type of internet connectivity
available (dialup, ADSL or the NBN, where dialup is the slowest connection and the NBN
is the fastest). Classify each variable as either nominal, ordinal, discrete or continuous.

Solution: There are nine variables in this study:


1. Estimated price. Although the estimated price is likely to be a whole dollar amount,
we generally consider monetary values to be continuous, as they can technically be
any possible amount.
2. Number of bedrooms. This is a count variable so it is discrete.
3. Size. This is a measurement so it is continuous.
4. Pool present. It’s not necessarily clear that having a pool is better than not having
a pool (e.g., increasing costs for maintenance), so this is nominal.
5. Distance from Civic. Again a measurement, so continuous.
6. Insulation rating. Clearly an ordering between the categories, so ordinal.
7. Suburb. Nominal.
8. Number of bathrooms. Discrete.
9. Type of internet. Again a clear ordering between the categories, so ordinal.
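
One way to make the nominal/ordinal distinction concrete is to look at how such variables would typically be stored in R. The sketch below is illustrative only: the object names are our own, and the vectors simply list the possible categories from the question rather than actual observations.

```r
# Nominal: categories with no natural order -> unordered factor
pool <- factor(c("yes", "no"))

# Ordinal: categories with a natural order -> ordered factor
insulation <- factor(c("none", "average", "high"),
                     levels = c("none", "average", "high"),
                     ordered = TRUE)
internet <- factor(c("dialup", "ADSL", "NBN"),
                   levels = c("dialup", "ADSL", "NBN"),
                   ordered = TRUE)

# Comparisons are only meaningful for ordered factors
insulation[1] < insulation[3]   # is "none" ranked below "high"?
internet[3] > internet[2]       # is "NBN" ranked above "ADSL"?
```

Discrete and continuous variables (for example number of bedrooms and size) would usually be stored as plain numeric vectors.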

2. You work in a country where every resident plays a sport every day. However the only
two sports played are table tennis (when it is raining) and golf (when it is sunny). Your
job is to provide statistical analysis to the management of a company that sells “ping-
pong” (table tennis) balls directly through the internet. Over the past eight months you
have collected the following data:

Month   Marketing          Number of     Number of
        expenditure ($)    rainy days    sales
1        4150               6             778
2        3000              10             779
3        2500              25            4200
4       10600               2             250
5       12000               7             300
6        8000              20            6000
7        1500              18            1500
8        6850               9             500

For this data, the sample coefficients of variation for marketing expenditure, number of
rainy days per month, and number of sales have been calculated to be 0.642849, 0.656009,
and 1.194023, respectively.

(a) The marketing manager has told you that it simply makes sense that there is a
strong and positive correlation between marketing expenditure and the number
of sales made. Provide some analysis regarding this relationship. What do you
conclude from your results?
Solution: If we let X be marketing expenditure and Y be number of sales, we can
calculate the sample means to be X̄ = 6075 and Ȳ = 1788.375. From the given
coefficients of variation, we can then determine the sample standard deviations:
sX = cvX × X̄ = 3905.3077
sY = cvY × Ȳ = 2135.3603
The following table can be used to help calculate the sample covariance between X
and Y :

Month   Xi = Marketing     Yi = Number    (Xi − X̄)(Yi − Ȳ)
        expenditure ($)    of sales
1        4150               778            1944971.88
2        3000               779            3103828.13
3        2500              4200           −8621559.38
4       10600               250           −6961146.88
5       12000               300           −8818621.88
6        8000              6000            8107378.13
7        1500              1500            1319315.63
8        6850               500            −998490.63
        X̄ = 6075           Ȳ = 1788.375   Sum = −10924325.00

Hence
$$ s_{XY} = \frac{-10924325}{8 - 1} = -1560617.86 $$
Therefore, the correlation coefficient is
$$ r_{XY} = \frac{s_{XY}}{s_X s_Y} = \frac{-1560617.86}{3905.3077 \times 2135.3603} = -0.1871 $$
Based on this information we would conclude that, in contrast to the marketing
manager’s assertion of a strong positive correlation, there is a weak negative linear
relationship between marketing expenditure and number of sales.
If we were to do this using R, we would first create vectors for the X and Y values
(the order of the numbers is important: the i-th entries of the two vectors must
belong to the same month):
> x <- c(4150,3000,2500,10600,12000,8000,1500,6850)
> y <- c(778,779,4200,250,300,6000,1500,500)
The cor function can then be used to calculate the sample correlation coefficient:
> cor(x,y)
[1] -0.1871415
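
The intermediate quantities from the hand calculation can be checked in the same way. The following is a minimal sketch using only the base R functions mean, sd and cov; the names cv_x, cv_y, s_xy and r_xy are our own choices.

```r
x <- c(4150, 3000, 2500, 10600, 12000, 8000, 1500, 6850)  # marketing expenditure ($)
y <- c(778, 779, 4200, 250, 300, 6000, 1500, 500)         # number of sales

# Coefficient of variation = sample standard deviation / sample mean
cv_x <- sd(x) / mean(x)
cv_y <- sd(y) / mean(y)

# Sample covariance from the definition (denominator n - 1)
n <- length(x)
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation from the definition; this should agree with cor(x, y)
r_xy <- s_xy / (sd(x) * sd(y))

print(c(cv_x = cv_x, cv_y = cv_y, s_xy = s_xy, r_xy = r_xy))
```

Computing s_xy from the definition rather than calling cov(x, y) directly mirrors the table above, which makes it easier to compare individual terms if the hand calculation disagrees.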

(b) Using the data above, calculate the correlation coefficient between the number of
rainy days per month and the number of sales. The covariance between the number
of rainy days per month and the number of sales has been calculated as 14012.23.
Solution: Let Z denote the number of rainy days per month. Again from the
coefficient of variation, we can calculate the sample standard deviation of Z:

sZ = cvZ × Z̄ = 0.656009 × 12.125 = 7.9541

Since we are given sZY, the sample correlation between Z and Y is equal to:
$$ r_{ZY} = \frac{s_{ZY}}{s_Z s_Y} = \frac{14012.23}{7.9541 \times 2135.3603} = 0.8250 $$

Using R, we would just need to create the new vector of Z values:


> z <- c(6,10,25,2,7,20,18,9)
Then we can use the cor function again for the sample correlation coefficient:
> cor(z,y)
[1] 0.8249823
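
The covariance quoted in the question and the standard deviation of Z can also be verified from the raw data. A short sketch, again with base R only (the names s_zy, s_z, s_y and r_zy are our own):

```r
z <- c(6, 10, 25, 2, 7, 20, 18, 9)                  # rainy days per month
y <- c(778, 779, 4200, 250, 300, 6000, 1500, 500)   # number of sales

s_zy <- cov(z, y)   # sample covariance (denominator n - 1)
s_z  <- sd(z)
s_y  <- sd(y)

r_zy <- s_zy / (s_z * s_y)   # same quantity that cor(z, y) computes
print(round(c(s_zy = s_zy, s_z = s_z, r_zy = r_zy), 4))
```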
(c) What does the result in (b) above suggest? Provide a potential reason for this
result.
Solution: Part (b) suggests that there is a strong positive linear relationship between
the number of rainy days in a month and the number of sales in that month. This
might be because on rainy days people play table tennis rather than golf, so months
with more rainy days generate more demand for ping-pong balls and hence more
sales.
Try using R to calculate the sample correlation coefficients from the raw data given in
the table.

3. A quality control officer in a chocolate factory records the number of minutes it takes
for the company’s signature chocolate bar to melt at room temperature. He recorded
the following 11 times for 11 different chocolate bars:

14 20 20 12 9 13 35 12 11 12 46

(a) Calculate the mean, mode and median of the times.
Solution: For the mean we have $\bar{X} = \frac{1}{11}\sum_{i=1}^{11} X_i = 18.55$. For the median, we first
sort the observations from smallest to largest:

9 11 12 12 12 13 14 20 20 35 46

Since there are an odd number of observations, the median is the middle observation,
13. For the mode, the time of 12 occurs the most (three times).
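
These three summaries can be reproduced in R. The sketch below assumes nothing beyond base R; note that R's built-in mode() function reports the storage type of an object, not the statistical mode, so the most frequent value is found by tabulating the data (the names counts and mode_time are our own).

```r
times <- c(14, 20, 20, 12, 9, 13, 35, 12, 11, 12, 46)

mean(times)     # sample mean
median(times)   # middle value of the sorted data

# Tabulate the values and pick the one(s) with the highest count
counts <- table(times)
mode_time <- as.numeric(names(counts)[counts == max(counts)])
mode_time
```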

(b) It turned out that the quality control officer occasionally fell asleep while recording
the time for a chocolate bar to melt, leading to some incorrectly large melting times.
Based on this information, which would be a better measure of central tendency for
this data, the mean or the median?
Solution: Since we know that the extreme observations are incorrect values, we should
use the median, since it is more robust to extreme values than the mean.
(c) Calculate the IQR of the times.
Solution: We need to first calculate the first and third quartiles. Using the formula
given in lectures, $L_p = (n + 1)\frac{p}{100}$, with n = 11 we find that $L_{25} = 3$ and $L_{75} = 9$. Therefore, the
IQR is the difference between the 9th and 3rd values of the sorted data, that is, IQR = 20 − 12 = 8.
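
R's quantile function offers several quantile definitions through its type argument. As far as we are aware, type = 6 is the one that matches the (n + 1)p/100 location rule used here, so the following sketch should reproduce the hand calculation; the default (type = 7) uses a different rule and need not agree in general.

```r
times <- c(14, 20, 20, 12, 9, 13, 35, 12, 11, 12, 46)

# type = 6 places the p-th percentile at position (n + 1) * p / 100
q <- quantile(times, probs = c(0.25, 0.75), type = 6)
q

# IQR() accepts the same type argument
IQR(times, type = 6)
```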
(d) Calculate the difference between the 60th percentile and the 10th percentile.
Solution: Again, using the formula from lectures, the locations of the 10th and 60th
percentiles are L10 = 1.2 and L60 = 7.2. So the 10th percentile is 0.2 of the distance
between the 1st and 2nd values:

9 + 0.2 × (11 − 9) = 9.4

and the 60th percentile is 0.2 of the distance between the 7th and 8th values:

14 + 0.2 × (20 − 14) = 15.2

Therefore, the difference between the 60th and 10th percentiles is 15.2 − 9.4 = 5.8.
(e) To what percentile does a time of 15.5 minutes correspond?
Solution: A time of 15.5 lies exactly one quarter of the way between 14 and 20,
which are the 7th and 8th values. Hence its location is equal to $L_p = 7.25$.
Substituting this into $L_p = (n + 1)\frac{p}{100}$ and solving for p we get
$p = 100 \times 7.25 / 12 = 60.4167$. So a time of
15.5 corresponds to the 60.4167th percentile.
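
The location-and-interpolation steps in parts (d) and (e) can be sketched as a small R function. Here percentile_at is a helper written purely for this illustration (it is not a built-in function) and is only valid when the location falls between 1 and n.

```r
times <- sort(c(14, 20, 20, 12, 9, 13, 35, 12, 11, 12, 46))
n <- length(times)

# Percentile via the location formula L_p = (n + 1) * p / 100,
# interpolating linearly between neighbouring sorted values.
percentile_at <- function(p) {
  loc  <- (n + 1) * p / 100
  k    <- floor(loc)        # index of the value just below the location
  frac <- loc - k           # fractional distance towards the next value
  if (frac == 0) return(times[k])
  times[k] + frac * (times[k + 1] - times[k])
}

p10 <- percentile_at(10)   # 9 + 0.2 * (11 - 9)
p60 <- percentile_at(60)   # 14 + 0.2 * (20 - 14)
p60 - p10

# Part (e): go from a value back to a location, then to a percentile
loc_15_5 <- 7 + (15.5 - 14) / (20 - 14)   # a quarter of the way from the 7th to the 8th value
p_15_5 <- 100 * loc_15_5 / (n + 1)
p_15_5
```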

4. There is a shortcut version for calculating the sample variance given by the following
formula:
$$ s^2 = \frac{1}{n-1}\left(\left(\sum_{i=1}^{n} X_i^2\right) - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right) $$
Show that this is equivalent to the definition given in the lectures. In other words, show
that:
$$ \frac{1}{n-1}\left(\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2\right) = \frac{1}{n-1}\left(\left(\sum_{i=1}^{n} X_i^2\right) - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right) $$
Bonus: Show that the shortcut version of the sample covariance given below is equivalent
to the definition given in lectures.
$$ s_{XY} = \frac{1}{n-1}\left(\left(\sum_{i=1}^{n} X_i Y_i\right) - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}\right) $$

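Solution: Since the term $\frac{1}{n-1}$ is the same for both versions of the sample variance, we
only need to show that the remaining terms are equal. Starting with the left hand side,
let’s first expand the square:
$$ \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 = \sum_{i=1}^{n} \left(X_i^2 - 2X_i\bar{X} + \bar{X}^2\right) $$

We then apply the summation to each separate term:
$$
\begin{aligned}
\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2
  &= \left(\sum_{i=1}^{n} X_i^2\right) - \left(\sum_{i=1}^{n} 2X_i\bar{X}\right) + \left(\sum_{i=1}^{n} \bar{X}^2\right) \\
  &= \left(\sum_{i=1}^{n} X_i^2\right) - 2\bar{X}\left(\sum_{i=1}^{n} X_i\right) + n\bar{X}^2
\end{aligned}
$$

To get the second line, we simply recognise that $\bar{X}$ does not change with i. But we also
know that $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$, which we can rearrange to get $\sum_{i=1}^{n} X_i = n\bar{X}$. Substituting this
into the above equation, we get:
$$
\begin{aligned}
\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2
  &= \left(\sum_{i=1}^{n} X_i^2\right) - 2\bar{X} \times n\bar{X} + n\bar{X}^2 \\
  &= \left(\sum_{i=1}^{n} X_i^2\right) - 2n\bar{X}^2 + n\bar{X}^2 \\
  &= \left(\sum_{i=1}^{n} X_i^2\right) - n\bar{X}^2
\end{aligned}
$$

Finally, substituting back $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$, we get:
$$
\begin{aligned}
\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2
  &= \left(\sum_{i=1}^{n} X_i^2\right) - n\left(\frac{\sum_{i=1}^{n} X_i}{n}\right)^2 \\
  &= \left(\sum_{i=1}^{n} X_i^2\right) - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}
\end{aligned}
$$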

A similar calculation can be used to derive the shortcut version of the sample covariance.
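
For completeness, here is a sketch of how that calculation could go. It follows the same steps as the variance derivation and uses $\sum_{i=1}^{n} X_i = n\bar{X}$ and $\sum_{i=1}^{n} Y_i = n\bar{Y}$:

```latex
\begin{aligned}
\sum_{i=1}^{n} \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)
  &= \sum_{i=1}^{n} \left( X_i Y_i - X_i \bar{Y} - \bar{X} Y_i + \bar{X}\bar{Y} \right) \\
  &= \left( \sum_{i=1}^{n} X_i Y_i \right) - \bar{Y} \sum_{i=1}^{n} X_i - \bar{X} \sum_{i=1}^{n} Y_i + n \bar{X}\bar{Y} \\
  &= \left( \sum_{i=1}^{n} X_i Y_i \right) - n\bar{X}\bar{Y} - n\bar{X}\bar{Y} + n\bar{X}\bar{Y} \\
  &= \left( \sum_{i=1}^{n} X_i Y_i \right) - n\bar{X}\bar{Y} \\
  &= \left( \sum_{i=1}^{n} X_i Y_i \right) - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}
\end{aligned}
```

Multiplying both sides by $\frac{1}{n-1}$ then gives the shortcut formula for $s_{XY}$.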

5. The Hula painted frog is an extremely rare species of frog that was thought to be extinct
but was rediscovered in 2011. Only 11 are believed to be living in the wild. Suppose the
weights of these 11 frogs are known and given in the table below (in grams):

13 26 22 16 18 28 14 15 15 17 25

(a) Calculate the population variance of these 11 frogs.


Solution: The population mean is equal to µ = 19 and the population variance is
equal to
$$ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2 = \frac{282}{11} = 25.6364 $$
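
In R, some care is needed here because var() always divides by N − 1. A minimal sketch of the population calculation (the names frogs, mu and sigma2 are our own):

```r
frogs <- c(13, 26, 22, 16, 18, 28, 14, 15, 15, 17, 25)  # weights in grams
N <- length(frogs)

mu <- mean(frogs)
sigma2 <- sum((frogs - mu)^2) / N   # population variance: divide by N
sigma2

# var() divides by N - 1, so rescale it to recover the population variance
var(frogs) * (N - 1) / N
```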

Suppose now we take five random samples of size four from this population, with each
new sample being taken after returning the previous sample to the population. The five
samples, along with some sample statistics, are listed below:

Sample            X̄        Σ Xi²
13, 22, 18, 16    17.25    1233
26, 15, 17, 15    18.25    1415
14, 18, 15, 25    18       1370
25, 14, 16, 17    18       1366
13, 26, 25, 18    20.5     1794

(b) Calculate the sample variance for each of the five samples.
Solution: See the solution to part (c).
(c) Calculate the sample variance for each of the five samples, but this time using n as
the denominator, instead of n − 1. That is, calculate:
$$ s^{*2} = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 $$

Solution: We can use the shortcut version of the sample variance formula to calculate
both $s^2$ and $s^{*2}$:

Sample            X̄        Σ Xi²    Σ Xi² − (Σ Xi)²/n    s²         s*²
13, 22, 18, 16    17.25    1233      42.75               14.25      10.6875
26, 15, 17, 15    18.25    1415      82.75               27.5833    20.6875
14, 18, 15, 25    18       1370      74                  24.6667    18.5
25, 14, 16, 17    18       1366      70                  23.3333    17.5
13, 26, 25, 18    20.5     1794     113                  37.6667    28.25
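
The table can be reproduced in R along the following lines. This is a sketch only: shortcut_ss is a helper defined here for the numerator of the shortcut formula, and the other names are likewise our own.

```r
samples <- list(c(13, 22, 18, 16),
                c(26, 15, 17, 15),
                c(14, 18, 15, 25),
                c(25, 14, 16, 17),
                c(13, 26, 25, 18))

# Numerator of the shortcut formula: sum(X^2) - (sum X)^2 / n
shortcut_ss <- function(x) sum(x^2) - sum(x)^2 / length(x)

ss      <- sapply(samples, shortcut_ss)
n       <- lengths(samples)
s2      <- ss / (n - 1)   # usual sample variance
s2_star <- ss / n         # variance with n in the denominator

round(data.frame(mean = sapply(samples, mean), ss, s2, s2_star), 4)
```

Since s2 uses the n − 1 denominator, it should match the output of var() applied to each sample.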

(d) Calculate the average of the five sample variances in part (b) and the average of
the five sample variances in part (c). What do you notice?
Solution: The average of the sample variances in part (b) is equal to 25.5. The
average of the sample variances in part (c) is equal to 19.125. What we notice is
that the average from part (b) is much closer to the true population variance than
the average from part (c). More specifically, the average in part (c) underestimates
the true population variance. This is an illustration of why we use n − 1 in the
denominator for the sample variance, rather than n. Using n − 1 produces a better
estimator of the true population variance.
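
The comparison can be checked numerically with a short sketch like the one below. Keep in mind that five samples are only an illustration of the idea, not a proof; all object names are our own.

```r
frogs <- c(13, 26, 22, 16, 18, 28, 14, 15, 15, 17, 25)
samples <- list(c(13, 22, 18, 16),
                c(26, 15, 17, 15),
                c(14, 18, 15, 25),
                c(25, 14, 16, 17),
                c(13, 26, 25, 18))

s2      <- sapply(samples, var)                                # denominator n - 1
s2_star <- sapply(samples, function(x) mean((x - mean(x))^2))  # denominator n

sigma2 <- mean((frogs - mean(frogs))^2)   # population variance (denominator N)

c(mean_s2 = mean(s2), mean_s2_star = mean(s2_star), sigma2 = sigma2)
```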

6. The average score for a class of 30 students was 75. The 20 male students in the class
averaged 70. The boxplots for the scores for the male and female students are given
below.

[Figure: side-by-side boxplots of the scores for male and female students; vertical axis (score) from 40 to 100.]

(a) What was the average of the 10 female students in the class?
Solution: Let’s denote the scores of the 20 male students as $X_1, \ldots, X_{20}$ and those of the 10 female
students as $X_{21}, \ldots, X_{30}$. We know that the overall average was 75:
$$ \frac{X_1 + \ldots + X_{20} + X_{21} + \ldots + X_{30}}{30} = 75 $$
We also know that the average of the male students was 70:
$$ \frac{X_1 + \ldots + X_{20}}{20} = 70 $$
which implies that the sum of the scores for the male students was:
$$ X_1 + \ldots + X_{20} = 20 \times 70 = 1400 $$
Using this information, from the formula for the overall average we can work out
the sum of the scores for the female students:
$$ \frac{1400 + X_{21} + \ldots + X_{30}}{30} = 75 $$
$$ \therefore \; X_{21} + \ldots + X_{30} = 75 \times 30 - 1400 = 850 $$
So the average of the 10 female students is:
$$ \frac{X_{21} + \ldots + X_{30}}{10} = \frac{850}{10} = 85 $$
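
The same arithmetic can be written as a few lines of R, which also makes it easy to confirm that the group means combine back to the overall mean. The variable names below are our own.

```r
n_total <- 30
n_male  <- 20
n_female <- n_total - n_male

mean_total <- 75
mean_male  <- 70

# Total of all scores minus the total of the male scores
sum_female  <- mean_total * n_total - mean_male * n_male
mean_female <- sum_female / n_female
mean_female

# Check: the weighted average of the group means recovers the overall mean
overall_check <- (n_male * mean_male + n_female * mean_female) / n_total
overall_check
```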

(b) Describe the relationship between the median and the mean for both male students
and female students.
Solution: From the boxplot for male students, the distribution seems fairly sym-
metric, so we would guess that the mean is approximately equal to the median. For
the female students, the distribution seems slightly positively skewed, so the mean is
probably larger than the median.
(c) Did a greater proportion of male students or female students score at least 83?
Solution: From the boxplot, the median for the female students is at least 83,
indicating that at least 50% of female students scored at least 83. From the boxplot,
we can also see that the third quartile for the male students is less than the median
for the female students, i.e., less than 83. This implies that the proportion of male
students that scored at least 83 cannot be greater than 25%. Therefore, a greater
proportion of female students scored at least 83.

7. Discussion Question
Some scientists are conducting a study to investigate the effects of exercise and caffeine
on sleep quality. A random sample of 300 people aged between 20 and 50 were included
in the study. For a particular day, each person was asked to record the number of
cups of coffee/tea they drank, the number of minutes they exercised, and the number
of hours they slept that night. The scientists have asked you to help them analyse their
data. They would like to summarise each variable in their sample data. They are also
interested in determining whether doing more exercise or consuming less caffeine is more
likely to cause the person to sleep for longer. Discuss some approaches that you could
use to help the scientists. Remember to note any important issues that need to be
considered in the analysis or in the interpretation of the analysis.

8. swirl
Work through lessons 1 and 2 of the R Programming course.
