
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 5

Please download the t5e1, t5e2, t5e3 and t5e4 Excel data files from the subject website and
save them to your USB flash drive. Read this handout and try to complete the tutorial
exercises before your tutorial class, so that you can ask help from your tutor during the Zoom
session if necessary.

After you have completed the tutorial exercises attempt the “Exercises for assessment”.
Save your solutions and answers along with the relevant R/Rstudio scripts and printouts in
a Word document and email it to your tutor by the next tutorial in order to get the tutorial
mark.

Inferences about Population Variances

Let’s start with a single population variance. Suppose we take all possible random samples
of the same size (n) from a normal population, X: N(μ, σ²), calculate the sample variance
(s²) from each, and develop the relative frequency distribution, i.e. the sampling distribution,
of the sample variance estimator.

It can be shown that the sample variance is an unbiased estimator of the population
variance, i.e. E(s²) = σ², and that the sum of squared deviations from the sample mean
divided by the population variance follows a chi-square distribution with n − 1 degrees of
freedom, i.e.

Σ(xᵢ − x̄)² / σ² = (n − 1)s² / σ² ~ χ²(n − 1)

From this result, (i) the (1 − α)100% confidence interval estimate of σ² is

( (n − 1)s² / χ²(α/2, n−1) , (n − 1)s² / χ²(1−α/2, n−1) )

where χ²(α/2, n−1) and χ²(1−α/2, n−1) are the (1 − α/2)100% and (α/2)100% percentiles of the chi-
square distribution with n − 1 degrees of freedom, and (ii) the test statistic for H0: σ² = σ0²
against a one-sided or two-sided alternative hypothesis is

χ² = (n − 1)s² / σ0² ~ χ²(n − 1)
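As a quick sketch, the two formulas above can be computed in R with the qchisq quantile function (the sample x and the null value sigma0sq below are hypothetical; the VarTest function used later in this handout automates the same calculations):

```r
# Sketch of the chi-square CI and test for a single variance,
# assuming x is a sample from a (roughly) normal population.
x     <- c(10.2, 9.8, 10.5, 10.1, 9.9, 10.3)   # hypothetical data
n     <- length(x)
s2    <- var(x)
alpha <- 0.10

# (1 - alpha)100% confidence interval for sigma^2
ci <- c((n - 1) * s2 / qchisq(1 - alpha/2, df = n - 1),
        (n - 1) * s2 / qchisq(alpha/2, df = n - 1))

# test statistic for H0: sigma^2 = sigma0^2
sigma0sq  <- 0.3                               # hypothetical null value
chisq.obs <- (n - 1) * s2 / sigma0sq
```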

1
L. Kónya, 2020, Semester 1 ECON20003 - Tutorial 5
Exercise 1 (Selvanathan, p. 581, ex. 14.10)

A company manufactures steel shafts for use in engines. One method for judging
inconsistencies in the production process is to determine the variance of the lengths of the
shafts. A random sample of 10 shafts was taken and their lengths measured in centimetres.
These measurements are saved in the t5e1 file. Do the calculations first manually and then
with R.

a) Find a 90% confidence interval for the population variance σ2, assuming that the lengths
of the steel shafts are normally distributed, at least approximately.

Use your calculator to find the sample standard deviation of length and square it to
obtain s² = 0.493.

The sample size is 10 and the confidence level is (1 − α)100% = 90%, so the required
chi-square percentiles from Table 5 of Selvanathan (Appendix B, p. 1077) are¹

χ²(α/2, n−1) = χ²(0.05, 9) = 16.9 and χ²(1−α/2, n−1) = χ²(0.95, 9) = 3.33

Therefore, the confidence interval estimate is

( (n − 1)s² / χ²(α/2, n−1) , (n − 1)s² / χ²(1−α/2, n−1) ) = ( 9 × 0.493 / 16.9 , 9 × 0.493 / 3.33 ) = (0.263, 1.332)

Hence, with 90% confidence, the population variance of the lengths of the shafts is
somewhere between 0.263 and 1.332 squared centimetres.²

b) In order to meet quality requirements, the production process of steel shafts has to be
suspended and the machines adjusted as soon as possible when the variance of the
lengths of the shafts is larger than 0.4 centimetres. At the 5% level of significance, can
we conclude that the population variance σ2 is greater than 0.4 centimetres and thus
some urgent adjustment is required? Assume again that the lengths of the steel shafts
are normally distributed.

The question implies the following hypotheses:

H0: σ² = 0.4 , HA: σ² > 0.4

This is a right-tail test and the 5% critical value with 9 degrees of freedom is the same
as the upper χ² table value in part (a), i.e. 16.9, and H0 is rejected if the observed test
statistic exceeds this critical value.

¹ These chi-square percentiles can also be obtained by applying the R quantile function qchisq(alpha, df = ),
i.e. by executing the qchisq(0.05, df = 9) and qchisq(0.95, df = 9) commands.
² The question is about the population variance. If it were about the population standard deviation, we should
take the square roots of these confidence interval limits.
The test statistic value calculated from the sample at hand is

χ²obs = (n − 1)s² / σ0² = 9 × 0.493 / 0.4 = 11.0925

Since it is smaller than the critical value, at the 5% significance level we cannot reject
H0 and cannot conclude that urgent adjustment is needed.

To complete parts (a) and (b) with R, launch RStudio, create a new project and script,
name them t5e1, and import the data saved in the t5e1 Excel data file to RStudio. In
Tutorial 3 you already installed the DescTools package and used its SignTest function.
Another function in this package is

VarTest(x, sigma.squared = sigmasq0, alternative = " ")

where x is the variable of interest, sigmasq0 is the hypothesized population variance
value (1 by default) and alternative is one of "two.sided" (default), "greater" or "less".³

Execute the

library(DescTools)
VarTest(length, sigma.squared = 0.4, alternative = "greater")

commands to obtain

One Sample Chi-Square test on variance


data: length
X-squared = 11.103, df = 9, p-value = 0.2687
alternative hypothesis: true variance is greater than 0.4
95 percent confidence interval:
0.2624863 Inf
sample estimates:
variance of x
0.4934444

The variable of interest is length and the alternative hypothesis is true variance is greater
than 0.4. The test statistic value is 11.103, about the same as the one calculated
manually. The p-value is 0.2687, so H0 is maintained at the 5% significance level.

The printout also shows a 95% confidence interval, but since we performed a right-tail
test, it is unbounded from above. To obtain a two-sided confidence interval like the one
in part (a), request a two-tail test; note, though, that the default confidence level is 95%,
not the 90% used in part (a).

VarTest(length, sigma.squared = 0.4)

returns the printout shown at the top of the next page. It shows the 95% confidence
interval, (0.233, 1.645).

³ Like other tests, by default this test is also performed at the 5% significance level. Use the optional conf.level
argument if your significance level is different.
One Sample Chi-Square test on variance
data: length
X-squared = 11.103, df = 9, p-value = 0.5375
alternative hypothesis: true variance is not equal to 0.4
95 percent confidence interval:
0.2334571 1.6445776
sample estimates:
variance of x
0.4934444

c) What assumption did we have to make to answer parts (a) and (b)?

The confidence interval estimator and the chi-square test of 2 require that the sampled
population be normally distributed. Although these techniques remain valid for some
moderate deviations from normality, they are less robust than the confidence interval
estimator and the t-test of . Unfortunately, given the small sample size, this time we
cannot rely on the standard checks of normality.

Suppose now that we are interested in the comparison of the variances of two normally
distributed populations, X1: N(μ1, σ1²) and X2: N(μ2, σ2²). Assuming that we draw
independent random samples of sizes n1 and n2 from these populations, and calculate the
sample variances (s1², s2²) from each,

(n1 − 1)s1² / σ1² ~ χ²(n1 − 1) and (n2 − 1)s2² / σ2² ~ χ²(n2 − 1)

The ratio of two independent chi-square random variables divided by their respective
degrees of freedom has an F distribution, so

[χ²(n1 − 1) / (n1 − 1)] / [χ²(n2 − 1) / (n2 − 1)] = (s1² / σ1²) / (s2² / σ2²) ~ F(n1 − 1, n2 − 1)

From this result, (i) the (1 − α)100% confidence interval estimate of σ1² / σ2² is

( (s1²/s2²) / F(α/2, n1−1, n2−1) , (s1²/s2²) / F(1−α/2, n1−1, n2−1) )

where F(α/2, n1−1, n2−1) and F(1−α/2, n1−1, n2−1) are the (1 − α/2)100% and (α/2)100% percentiles of the F
distribution with df1 = n1 − 1 numerator degrees of freedom and df2 = n2 − 1 denominator
degrees of freedom, and (ii) the test statistic for H0: σ1² / σ2² = 1 against a one-sided or two-
sided alternative hypothesis is

F = s1² / s2² ~ F(n1 − 1, n2 − 1)
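Analogously to the single-variance case, these F-based formulas can be sketched in R with the qf quantile function (the two samples below are hypothetical):

```r
# Sketch of the F CI and test for the ratio of two variances,
# assuming x1, x2 are independent samples from normal populations.
set.seed(1)                      # hypothetical data
x1 <- rnorm(30, sd = 2)
x2 <- rnorm(25, sd = 3)
ratio <- var(x1) / var(x2)
alpha <- 0.05
df1 <- length(x1) - 1
df2 <- length(x2) - 1

# (1 - alpha)100% confidence interval for sigma1^2 / sigma2^2
ci <- c(ratio / qf(1 - alpha/2, df1, df2),
        ratio / qf(alpha/2, df1, df2))

# test statistic for H0: sigma1^2 / sigma2^2 = 1
F.obs <- ratio
```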

Exercise 2 (Selvanathan, p. 596, ex. 14.37)

An important statistical measurement in service facilities (such as restaurants, banks and
car washes) is the variability of service times. As an experiment, two tellers at a bank were
observed, and the service times of 100 customers served by each of the tellers were
recorded (Teller1 and Teller2, in seconds) and saved in the t5e2 Excel file.

a) The sample variances are s1² = 3.346 and s2² = 10.950. Estimate the ratio of the two
population variances with 95% confidence.

Assuming that both service times, X1 and X2, are normally distributed, we can use the
confidence interval estimator mentioned on the previous page.

Both sample sizes are 100 and the confidence level is (1 − α)100% = 95%, so the
required F percentiles from Table 6(b) of Selvanathan (Appendix B, pp. 1080-81)⁴ are

F(α/2, n1−1, n2−1) = F(0.025, 99, 99) ≈ F(0.025, 100, 100) = 1.48

and

F(1−α/2, n1−1, n2−1) = F(0.975, 99, 99) ≈ F(0.975, 100, 100) = 1 / F(0.025, 100, 100) = 1 / 1.48 ≈ 0.676

Therefore, the confidence interval estimate is

( (s1²/s2²) / F(α/2, n1−1, n2−1) , (s1²/s2²) / F(1−α/2, n1−1, n2−1) ) = ( (3.346/10.950) / 1.48 , (3.346/10.950) / 0.676 ) = (0.206, 0.452)

Hence, with 95% confidence, the ratio of the variances of the populations of the service
times at the two tellers is somewhere between 0.206 and 0.452.

b) Do the data allow us to infer at the 10% significance level that the variances in service
times differ between the two tellers?

The question implies the following hypotheses:

H0: σ1² / σ2² = 1 , HA: σ1² / σ2² ≠ 1

⁴ Tables 6(a)-6(d) provide F values only under the right tails of the various F distributions. F values under the
left tails can be determined from right-tail F values using the following formula: F(1−α, df1, df2) = 1 / F(α, df2, df1). Note
also that we need to round df1 and df2 up to 100 because the F tables provide the percentiles for selected
degrees of freedom only. The exact F percentiles could be obtained by applying the R quantile function
qf(alpha, df1 = , df2 = ), i.e. by executing the qf(0.025, df1 = 99, df2 = 99) and qf(0.975, df1 = 99, df2 = 99)
commands.
Assume again that both service times are at least approximately normally distributed,
so that we can perform the F-test described on page 4.

This is a two-tail test and the 10% critical values are

F(α/2, n1−1, n2−1) = F(0.05, 99, 99) ≈ F(0.05, 100, 100) = 1.39

and

F(1−α/2, n1−1, n2−1) = F(0.95, 99, 99) ≈ F(0.95, 100, 100) = 1 / F(0.05, 100, 100) = 1 / 1.39 ≈ 0.719

H0 is rejected if the observed test statistic is smaller than the lower critical value (0.719)
or larger than the upper critical value (1.39).

The observed test statistic value is

Fobs = s1² / s2² = 3.346 / 10.950 ≈ 0.306

Since it is smaller than the lower critical value, we reject H0 at the 10% significance level
and conclude that the variances in service times differ between the two tellers.

When performing this test manually, it is possible to simplify the procedure a bit by labelling
the population with the larger sample variance as population 1. By doing so we ensure
that the sample statistic is not smaller than one and hence we need only the upper
critical value straight from the F tables. The observed test statistic is

Fobs = 10.950 / 3.346 ≈ 3.273

and since it is larger than the upper critical value, we would again reject H0.⁵

To complete parts (a) and (b) with R, you can use the same VarTest command as in
Exercise 1, but with a slightly different set of arguments. This time, the general form of
the command is

VarTest(x, y, ratio = ratio0, alternative = " ")

where x and y are the variables to be compared, ratio0 is the hypothesized ratio of the
two population variances (by default, it is 1), and alternative is like before.

⁵ A word of caution is in order: if the test were a one-tail test, swapping the sample variances would imply that
the alternative hypothesis must be swapped as well, for example, from HA: σ1² / σ2² < 1 to HA: σ2² / σ1² > 1.
Launch RStudio, create a new project and script, name them t5e2, import the data
saved in the t5e2 Excel data file to RStudio, and execute the following commands:

library(DescTools)
VarTest(Teller1, Teller2)

You should get

F test to compare two variances


data: x and y
F = 0.30561, num df = 99, denom df = 99, p-value = 1.045e-08
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2056242 0.4542014
sample estimates:
ratio of variances
0.3056056

The variable of interest is the variance of x (Teller1) divided by the variance of y (Teller2),
and the alternative hypothesis is true ratio of variances is not equal to 1. The ratio of the
sample variances, and hence the test statistic value, is about 0.306. The numerator and
denominator degrees of freedom are both 99, and the p-value is practically zero, so H0
can be rejected at any reasonable significance level.

c) What assumption did you have to make in order to answer parts (a) and (b)? Try to
verify whether that assumption is reasonable this time.

In parts (a) and (b) we assumed that the samples are random and independent and that
both sampled populations are normally distributed, at least approximately. Take the first
two requirements for granted and perform the usual diagnostics for normality.

hist(Teller1, freq = FALSE, col = "blue")
lines(seq(0, 16, by = 0.1),
      dnorm(seq(0, 16, by = 0.1), mean(Teller1), sd(Teller1)),
      col = "red")
hist(Teller2, freq = FALSE, col = "green")
lines(seq(0, 16, by = 0.1),
      dnorm(seq(0, 16, by = 0.1), mean(Teller2), sd(Teller2)),
      col = "red")

return the histograms shown on the next page. The superimposed normal curves seem
to fit the relative frequency distributions reasonably well.

qqnorm(Teller1, main = "Normal Q-Q Plot for Teller1",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", col = "blue")
qqline(Teller1, col = "red")
qqnorm(Teller2, main = "Normal Q-Q Plot for Teller2",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", col = "green")
qqline(Teller2, col = "red")

produce the normal QQ plots displayed on page 9. On both plots most of the points fall
close to the reference line.
[Histograms of Teller1 and Teller2 (density scale) with superimposed normal curves]

[Normal Q-Q plots for Teller1 and Teller2 with reference lines]

The

library(pastecs)
round(stat.desc(Teller1, basic = FALSE, desc = TRUE, norm = TRUE), 3)
round(stat.desc(Teller2, basic = FALSE, desc = TRUE, norm = TRUE), 3)

commands provide the following statistics

for Teller1, and

for Teller2.

For both samples, the mean and the median are fairly similar, skewness and excess
kurtosis are only insignificantly different from zero, and p-values of the Shapiro-Wilk test
are above 0.5.

All things considered, the normality assumption is supported by all diagnostic checks.⁶

Inferences about Population Proportions

Consider a binary population X, which has only two possible elements, "success" coded as
1 and "failure" coded as 0. Suppose we are interested in the proportion (relative frequency)
of successes in this population, denoted as p. It is given by the usual population mean
formula,

p = (1/N) Σ Xᵢ (summing over all N population elements)

It can be estimated from a random sample of n observations with the sample proportion,

p̂ = (1/n) Σ Xᵢ (summing over the n sample observations)

which is also the sample mean of the (0, 1) binary sample.

⁶ When the normality assumption is unreasonable, one can use the nonparametric Siegel-Tukey test for
equality in variability. This test is also available in the DescTools package (SiegelTukeyTest), but we do not
learn about it in this course.
Depending on whether sampling is with or without replacement, the sample proportion (p-
hat) is a binomial or hypergeometric random variable. When the sample size is large (np ≥
5 and nq ≥ 5, and in the case of sampling without replacement, n < 0.05N as well), the
sampling distribution of p-hat can be approximated with a normal distribution,

p̂ ~ N(μ_p̂, σ_p̂) , μ_p̂ = p , σ_p̂ = √(pq / n)

Consequently, (i) the large-sample approximate (1 − α)100% confidence interval estimate
of p is

p̂ ± z(α/2) s_p̂ , s_p̂ = √(p̂q̂ / n)

and (ii) the test statistic for H0: p = p0 against a one-sided or two-sided alternative hypothesis
is

Z = (p̂ − p0) / √(p0 q0 / n) ~ N(0, 1)
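As a sketch, the large-sample confidence interval and z-test can be computed directly in R with qnorm and pnorm (the figures below are those of Exercise 3, so you can compare the results):

```r
# Sketch of the large-sample CI and z test for a single proportion
f <- 126; n <- 600                 # successes and sample size (Exercise 3)
phat  <- f / n
alpha <- 0.01

# (1 - alpha)100% confidence interval for p
se.ci <- sqrt(phat * (1 - phat) / n)
ci <- phat + c(-1, 1) * qnorm(1 - alpha/2) * se.ci

# z test statistic for H0: p = 0.2 against HA: p > 0.2
p0 <- 0.2
z.obs   <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
p.value <- 1 - pnorm(z.obs)        # right-tail p-value
```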

Exercise 3

Suppose that in a survey 600 employers are asked whether they have used a recruitment
service within the past two months to find new staff. The responses are saved in the t5e3
Excel file.

a) Construct a 99% confidence interval for the population proportion of employers who
have used a recruitment service within the past two months to find new staff.

The exact population proportion is unknown, so strictly speaking we cannot check
whether the 'large-sample' requirements, np ≥ 5 and nq ≥ 5, are satisfied. At best we
can replace p with its estimate p-hat, and if n(p-hat) and n(q-hat) are both relatively
large, then we can be willing to assume that np ≥ 5 and nq ≥ 5 are also satisfied,
justifying the large-sample confidence interval estimation of the population proportion.

Launch RStudio, create a new project and script, name them t5e3, import the data
saved in the t5e3 Excel data file to RStudio, and execute the following commands:

attach(t5e3)
table(used)

It returns the frequency distribution of used:

used
no yes
474 126

In general, the

table()

R function returns a basic contingency table. To obtain relative frequencies, we also
need the

length()

function, which returns the length of vectors and other R objects. In this case, the length
of the used vector is the sample size, so execute

table(used) / length(used)

to get the following relative frequency distribution:

used
no yes
0.79 0.21

Alternatively, we can combine table with the

prop.table()

function, which returns proportions (relative frequencies). Hence, by executing the

prop.table(table(used))

command, you get the same output as before.

These frequency and relative frequency distributions show that in the given sample 126
employers out of 600 (i.e. 21%) used a recruitment service within the past two months
to find new staff, so the sample proportion is 0.21. Using this sample proportion,

np̂ = 600 × 0.21 = 126 , nq̂ = 600 × (1 − 0.21) = 474

are both well above 5, so we can expect np ≥ 5 and nq ≥ 5 to be also satisfied.⁷

The estimated standard error of this sample proportion is⁸

s_p̂ = √(p̂q̂ / n) = √(0.21 × 0.79 / 600) ≈ 0.0166

⁷ n(p-hat) and n(q-hat) are the expected numbers of successes and failures in the sample, granted that the
probability of success is equal to p-hat and the probability of failure is equal to q-hat, and they are the same
as the frequencies of yes and no.
⁸ The sample proportion is always a small number between zero and one, and the estimate of its standard
error is even smaller. In order to avoid unreasonable loss of precision, it is recommended to do the manual
calculations with a precision of 4 or even more decimal places. Once you have determined the required
confidence interval, you can round its limits if you wish.

The confidence level is 99% and from the Standard Normal table

z(α/2) = z(0.005) = 2.576

Therefore, the 99% confidence interval is

p̂ ± z(α/2) s_p̂ = 0.21 ± 2.576 × 0.0166 = (0.167, 0.253)

It implies that, with 99% confidence, the proportion of employers who have used a
recruitment service within the past two months to find new staff is between 16.7% and
25.3%.

With R, this confidence interval can be generated by the binom.test command that you
already used in Tutorial 3. Execute

binom.test(126, 600, conf.level = 0.99)

to obtain

Exact binomial test


data: 126 and 600
number of successes = 126, number of trials = 600, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
99 percent confidence interval:
0.1687938 0.2559079
sample estimates:
probability of success
0.21

The reported 99% confidence interval is (0.169, 0.256). It is slightly wider than the one
we got manually because it is based on the correct binomial distribution while ours was
based on the normal approximation of the binomial distribution.

b) Based on the survey data, is there sufficient evidence at the 0.05 level of significance
that more than 20% of all employers have used a recruitment service within the past two
months to find new staff?

The question implies the following hypotheses:

H0: p = 0.2 , HA: p > 0.2

The hypothesized value of the population proportion is 0.20, and it implies the following
standard error under the null hypothesis:

√(p0 q0 / n) = √(0.2 × 0.8 / 600) ≈ 0.0163

This is a right-tail Z-test and the 5% critical value is

z(α) = z(0.05) = 1.645

The observed test statistic value is

zobs = (p̂ − p0) / √(p0 q0 / n) = (0.21 − 0.20) / 0.0163 ≈ 0.613

Since it is smaller than the critical value, we cannot reject H0 at the 5% significance
level. Hence, there is not enough evidence to conclude that more than 20% of all
employers have used a recruitment service within the past two months to find new staff.

By default, the binom.test function assumes that the hypothesized population proportion
is 0.5. In part (a) this was fine because we were interested only in the confidence interval
estimate, which does not depend on the hypothesized population proportion. This time,
however, we are interested in a right-tail test at the 5% significance level with p0 = 0.2,
so we need to execute the following command:⁹

binom.test(126, 600, p = 0.2, alternative = "greater", conf.level = 0.95)

It returns

Exact binomial test

data: 126 and 600


number of successes = 126, number of trials = 600, p-value = 0.2849
alternative hypothesis: true probability of success is greater than 0.2
95 percent confidence interval:
0.1829191 1.0000000
sample estimates:
probability of success
0.21

The p-value is 0.2849, implying that H0 cannot be rejected at the 5% significance level.

We can arrive at the same conclusion by applying the

prop.test(x, n, p = p0, alternative = " ")

command, where x is the number of successes, n is the sample size, and p0 is the
hypothesized population proportion (0.5 by default).¹⁰

⁹ Since the default confidence level is 0.95, this time the last argument could be dropped from the command.
¹⁰ This command performs a chi-square test. The chi-square distribution was introduced in the Week 4 lecture
and you will learn about the chi-square test in the lectures next week.
In this case,

prop.test(126, 600, p = 0.2, alternative = "greater", conf.level = 0.95)

returns the following printout:


1-sample proportions test with continuity correction

data: 126 out of 600, null probability 0.2


X-squared = 0.3151, df = 1, p-value = 0.2873
alternative hypothesis: true p is greater than 0.2
95 percent confidence interval:
0.1831912 1.0000000
sample estimates:
p
0.21

As you can see, the reported p-value is almost the same as on the previous printout.

Let’s now turn our attention to the two binary populations case. Suppose that the population
proportions are p1 and p2, that we draw independently a random sample from each
population, and that the sample sizes, n1 and n2, are large enough so that n1p1, n1q1, n2p2
and n2q2 are all at least 5 (and ni < 0.05Ni if sampling is without replacement).

Under these conditions, (i) an approximate (1 − α)100% confidence interval of the difference
between the two population proportions, p1 − p2, is

(p̂1 − p̂2) ± z(α/2) s(p̂1−p̂2) , s(p̂1−p̂2) = √( p̂1q̂1/n1 + p̂2q̂2/n2 )

and (ii) the test statistic for H0: p1 − p2 = D0 against a one-sided or two-sided alternative
hypothesis follows the standard normal distribution reasonably well, but its actual formula
depends on D0.

Namely, on the one hand, if D0 = 0 and hence p1 = p2 under H0, the common population
proportion is best estimated from the pooled sample,

p̂ = (f1 + f2) / (n1 + n2)

and the test statistic is

Z = (p̂1 − p̂2) / s(p̂1−p̂2) ~ N(0, 1)

where

s(p̂1−p̂2) = √( p̂q̂ (1/n1 + 1/n2) )

On the other hand, if D0 ≠ 0 and hence p1 ≠ p2 under H0, the two population proportions
must be estimated separately,

p̂1 = f1 / n1 , p̂2 = f2 / n2

the estimated standard error is

s(p̂1−p̂2) = √( p̂1q̂1/n1 + p̂2q̂2/n2 )

and the test statistic is

Z = ( (p̂1 − p̂2) − D0 ) / s(p̂1−p̂2) ~ N(0, 1)
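These two cases can be sketched in R as follows (the figures are those of Exercise 4 below; note that for D0 = 0 the pooled standard error is used for the test but not for the confidence interval):

```r
# Sketch of the two-proportion z test (D0 = 0 case) and the CI for p1 - p2
f1 <- 248; n1 <- 400               # first sample (Exercise 4)
f2 <- 260; n2 <- 500               # second sample
p1hat <- f1 / n1
p2hat <- f2 / n2

# pooled proportion and standard error, valid only under H0: p1 = p2
ppool <- (f1 + f2) / (n1 + n2)
se.pooled <- sqrt(ppool * (1 - ppool) * (1/n1 + 1/n2))
z.obs <- (p1hat - p2hat) / se.pooled

# the confidence interval uses the unpooled standard error
se.ci <- sqrt(p1hat * (1 - p1hat)/n1 + p2hat * (1 - p2hat)/n2)
ci <- (p1hat - p2hat) + c(-1, 1) * qnorm(0.95) * se.ci   # 90% CI
```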

Exercise 4 (Selvanathan, p. 455, ex. 11.47 and p. 559, ex. 13.68)

The impact of the accumulation of carbon dioxide in the atmosphere caused by burning of
fossil fuels such as oil, coal and natural gas has been hotly debated for more than a decade.
Some environmentalists and scientists have predicted that the excess carbon dioxide will
increase the Earth’s temperature over the next 50 to 100 years with disastrous
consequences.

To gauge the public’s opinion on the subject, a random sample of 400 people was asked
two years ago whether they believed in the greenhouse effect. This year, 500 people were
asked the same question. The results are recorded as 1 = believe in greenhouse effect and
0 = do not believe in greenhouse effect, the variables concerning belief are denoted as X1
(first sample, i.e. two years ago) and X2 (second sample, i.e. this year) and the data are
saved in the t5e4 Excel file.

a) Estimate the real change in the public’s opinion about the subject, using a 90%
confidence level.

X1 and X2 are two independent qualitative variables and, defining "success" as 1 =
believe in greenhouse effect, the sample proportions¹¹ are

p̂1 = 0.62 , p̂2 = 0.52

¹¹ To save time, these sample proportions were obtained by R, like in Exercise 3.
Using these sample proportions as estimates of the corresponding population
proportions,

n1p̂1 = 248 , n1(1 − p̂1) = 152 , n2p̂2 = 260 , n2(1 − p̂2) = 240

They are all much bigger than 5, so the normal approximation is a reasonable option.

The reliability factor at the 90% confidence level is

z(α/2) = z(0.05) = 1.645

The estimate of the standard error is

s(p̂1−p̂2) = √( p̂1q̂1/n1 + p̂2q̂2/n2 ) = √( 0.62 × 0.38 / 400 + 0.52 × 0.48 / 500 ) ≈ 0.0330

Therefore, the 90% confidence interval is

(p̂1 − p̂2) ± z(α/2) s(p̂1−p̂2) = (0.62 − 0.52) ± 1.645 × 0.033 = (0.0457, 0.1543)

It implies that, with 90% confidence, the proportion of believers in the greenhouse effect
has decreased by 4.6 to 15.4 percentage points.¹²

In R, this confidence interval can be generated like the one in Exercise 3. The

table(X1)
table(X2)

commands return the frequencies:

X1
0 1
152 248

and

X2
0 1
240 260

There are 248 successes in the first sample and 260 in the second.

The confidence interval for the difference between the two population proportions is
provided by the prop.test function, but this time its x and n arguments (the number of
successes and the sample size) have two elements and they need to be specified by
the c() function as c(x1, x2) and c(n1, n2), respectively.

¹² Recall that the confidence interval has been developed for the change in the proportion of believers between
the first and second surveys, so a positive p1 − p2 value indicates that the true proportion of believers was
larger two years ago than it is today.

prop.test(x = c(248,260), n = c(400,500), conf.level = 0.90)

generates the following printout:

2-sample test for equality of proportions with continuity correction


data: c(248, 260) out of c(400, 500)
X-squared = 8.6369, df = 1, p-value = 0.003294
alternative hypothesis: two.sided
90 percent confidence interval:
0.04348977 0.15651023
sample estimates:
prop 1 prop 2
0.62 0.52

The reported 90% confidence interval, (0.0435, 0.1565), is almost identical to the one
we obtained manually. The difference between them is due to the continuity correction
that is referred to in the heading of the printout. The prop.test function applies this
correction, called Yates' correction for continuity, by default, though it only makes a
practical difference when the sample sizes are small (i.e. when some of n1p1, n1q1, n2p2
and n2q2 are smaller than 5). To see its impact, add the correct = FALSE argument to the
previous command,

prop.test(x = c(248,260), n = c(400,500), conf.level = 0.90, correct = FALSE)

The new printout is


2-sample test for equality of proportions without continuity correction
data: c(248, 260) out of c(400, 500)
X-squared = 9.039, df = 1, p-value = 0.002643
alternative hypothesis: two.sided
90 percent confidence interval:
0.04573977 0.15426023
sample estimates:
prop 1 prop 2
0.62 0.52

The 90% confidence interval is (0.0457, 0.1543), the same as the one we obtained
manually.

b) Can we infer at the 10% significance level that there has been a decrease in belief in
the greenhouse effect?

Since X1 denotes the belief in the greenhouse effect in the first sample (drawn two years
ago) and X2 denotes the belief in the greenhouse effect in the second sample (drawn
this year), the hypotheses are

H0: p1 − p2 = 0 , HA: p1 − p2 > 0

In this case D0 = 0, so the common population proportion is estimated from the pooled
sample,

p̂ = (f1 + f2) / (n1 + n2) = (248 + 260) / (400 + 500) ≈ 0.564

The estimate of the standard error is

s(p̂1−p̂2) = √( p̂q̂ (1/n1 + 1/n2) ) = √( 0.564 × 0.436 × (1/400 + 1/500) ) ≈ 0.0333

and the test statistic is

zobs = (p̂1 − p̂2) / s(p̂1−p̂2) = (0.62 − 0.52) / 0.0333 ≈ 3.003

The critical value is

z(α) = z(0.10) = 1.282

and since it is smaller than the test statistic, we reject the null hypothesis and conclude
at the 10% significance level that there has been a decrease in belief in the greenhouse
effect.

To perform this test in R, execute the following command:

prop.test(x = c(248,260), n = c(400,500),
          alternative = "greater", conf.level = 0.90, correct = FALSE)

It returns

2-sample test for equality of proportions without continuity correction


data: c(248, 260) out of c(400, 500)
X-squared = 9.039, df = 1, p-value = 0.001321
alternative hypothesis: greater
90 percent confidence interval:
0.05772434 1.00000000
sample estimates:
prop 1 prop 2
0.62 0.52

The reported test statistic is a chi-square random variable, χ² = 9.039. How does it
compare to the standard normal test statistic we calculated manually, z = 3.003? As you
can see on the previous printout, the degrees of freedom of this chi-square random
variable is one, and you learnt in the Week 4 lecture that the square of a standard normal
random variable is a chi-square random variable with df = 1. And indeed, as you can
check easily, apart from some rounding error,

z²obs = 3.003² = 9.018 ≈ χ²obs
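You can also verify this relationship numerically in R: the right-tail chi-square probability of z² with df = 1 equals the two-tail z probability, so halving it gives the one-tail p-value reported by prop.test (apart from rounding):

```r
# Relationship between the z and chi-square(1) test statistics
z <- 3.003
z^2                                           # approximately the chi-square statistic
pchisq(z^2, df = 1, lower.tail = FALSE) / 2   # one-tail p-value via chi-square
1 - pnorm(z)                                  # the same p-value via z
```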

For the sake of comparison, we applied the prop.test function without continuity
correction. Note, however, that in general there is no reason to override the default
option. If you run

prop.test(x = c(248,260), n = c(400,500),
          alternative = "greater", conf.level = 0.90)

you get

  2-sample test for equality of proportions with continuity correction


data: c(248, 260) out of c(400, 500)
X-squared = 8.6369, df = 1, p-value = 0.001647
alternative hypothesis: greater
90 percent confidence interval:
0.05547434 1.00000000
sample estimates:
prop 1 prop 2
0.62 0.52

The new test statistic is smaller and hence its p-value is larger, but since the sample
sizes are fairly large, the differences are negligible.

Exercises for Assessment

Exercise 5

In Exercise 2 of Tutorial 4, we first developed a confidence interval for the difference
between the mean ages of purchasers and non-purchasers of a particular brand of
toothpaste (part a), and then performed a t-test to see whether there was sufficient evidence
to conclude that there was a difference in the mean age of purchasers and non-purchasers.
Based on the sample variances, in both cases we assumed that the two unknown population
variances are different. Using the same data,

a) Estimate the ratio of the two population variances with 95% confidence.

b) Can we conclude at the 5% significance level that the population variances differ? What
do you conclude if the significance level is increased to 10%?

In parts (a) and (b) alike, do the calculations both manually and with R.

Exercise 6 (Selvanathan, p. 558, ex. 13.58)

In a public opinion survey, 60 out of a sample of 100 high-income voters and 40 out of a
sample of 75 low-income voters supported the introduction of a new national security tax.
Can we conclude at the 5% level of significance that there is a difference in the proportion
of high- and low-income voters favouring a new national security tax? Do the calculations
both manually and with R.

