
Solution to SGTA week 3 (questions in Lohr’s book)

Based on Lohr’s 2.12 Exercises (pages 61-72)

8. Discuss whether an SRS would be appropriate for the following situations. What
other design might be used (if you know)?

a. For an email survey of students, you have a sampling frame that contains a list of
email addresses for all students.

Maybe.

It depends on which variable is of particular interest in the survey. At a minimum, we
need to consider the following question: is the population fairly homogeneous with
respect to the variable of interest?

c. You want to estimate the percentage of topics in a medical website that have
errors.

May not be possible.

First of all, we may not have a list (or lists) to use as the sampling frame.

Also, we need to define “errors” more specifically.

d. A county election official wants to assess the accuracy of the machine that counts
the ballots by taking a random sample of the paper ballots and comparing the
estimated vote tallies for candidates from the sample to the machine counts.

Seems ok here.

11. Mayr et al. (1994) took an SRS of 240 children aged 2 to 6 years who visited their
pediatric outpatient clinic. They found the following frequency distribution for free
(unassisted) walking among the children:

Age (months)         9   10   11   12   13   14   15   16   17   18   19   20
Number of children  13   35   44   69   36   24    7    3    2    5    1    1

a. Construct a histogram of the distribution of age at walking. Is the shape normally
distributed? Do you think the sampling distribution of the sample average will be
normally distributed? Why or why not?

x <- c(9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)  # Age (months)
freq <- c(13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1)    # Number of children
data <- rep(x, freq)  # expand the frequency table into 240 individual ages
hist(data, main = "Distribution of age at walking", xlab = "Age (months)",
     ylab = "Number of children")

The histogram appears skewed to the right. With a mildly skewed distribution, a sample
of size 240 should be large enough for us to assume that the sample mean is
approximately normally distributed, according to the Central Limit Theorem (CLT).
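One informal way to check this claim is to resample the observed data and look at the shape of the resulting means. The sketch below is only illustrative: it uses a bootstrap (resampling the 240 observed ages with replacement) as a stand-in for repeated sampling from the unknown population, and the seed and the 2000 replications are arbitrary choices, not part of the original solution.

```r
# Rebuild the 240 observations from the frequency table in the question.
age  <- 9:20
freq <- c(13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1)
walk <- rep(age, freq)

set.seed(1)  # arbitrary seed, used only so the run is reproducible

# Bootstrap: draw 2000 resamples of size 240 with replacement and
# record the mean of each resample (2000 is an assumed, arbitrary number).
boot_means <- replicate(2000, mean(sample(walk, length(walk), replace = TRUE)))

# If the CLT approximation is adequate, the histogram should look roughly
# symmetric and the points in the Q-Q plot should lie close to the line.
hist(boot_means, main = "Bootstrap distribution of the sample mean",
     xlab = "Mean age at walking (months)")
qqnorm(boot_means)
qqline(boot_means)
```

A roughly bell-shaped histogram and a near-linear Q-Q plot would support using the normal approximation for the sample mean here.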

b. Find the mean, standard error, and a 95% CI for the average age for onset of free
walking.

ybar <- mean(data)  # Sample mean (named ybar to avoid masking the mean() function)
ybar

## [1] 12.07917

s2 <- var(data)     # Sample variance
s2

## [1] 3.705003

$\bar{y} = 12.079$; $s^2 = 3.705$; $\mathrm{SE}(\bar{y}) = \sqrt{s^2/n} = \sqrt{3.705/240} = 0.12425$

(Since we don’t know the population size N, we ignore the fpc, i.e., we assume N is very
large, at the risk of slightly overestimating the standard error.)

A 95% confidence interval (using the z critical value, as the sample size is large here)
is

12.08 ± 1.96(0.12425) = [11.84, 12.32].
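The arithmetic above can be reproduced directly in R. This is a minimal sketch that rebuilds the data from the frequency table and ignores the fpc, as in the solution; the variable names are illustrative, and `qnorm(0.975)` is used in place of the rounded value 1.96, which does not change the interval at two decimals.

```r
# Rebuild the 240 ages at walking from the frequency table.
walk <- rep(9:20, c(13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1))

n    <- length(walk)
ybar <- mean(walk)
se   <- sqrt(var(walk) / n)     # fpc ignored because N is unknown
z    <- qnorm(0.975)            # about 1.96
ci   <- ybar + c(-1, 1) * z * se

round(c(mean = ybar, se = se, lower = ci[1], upper = ci[2]), 3)
```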

c. Suppose the researchers want to do another study in a different region and want a
95% CI for the mean age of onset of walking to have margin of error 0.5. Using the
estimated standard deviation for these data, what sample size would they need to
take?

$n = \dfrac{1.96^2 \cdot 3.705}{0.5^2} = 56.93$, rounded up to 57, as an approximation; more accurately,

$n = N\left[1 + N\,\dfrac{0.5^2}{1.96^2 \cdot 3.705}\right]^{-1} = \ ?$

Without knowing N, you can only obtain the approximate sample size, which is 57.
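The two formulas can be sketched in R as follows. The population size N is not given in the exercise, so the value N = 1000 below is purely hypothetical and is included only to show how the fpc-adjusted size would be computed if N were known; `n_fpc` is an illustrative helper name.

```r
z  <- 1.96     # critical value used in the solution
s2 <- 3.705    # sample variance from part b
e  <- 0.5      # desired margin of error

# Approximate sample size, ignoring the fpc.
n0 <- z^2 * s2 / e^2
ceiling(n0)    # 57

# fpc-adjusted size: n = N * (1 + N / n0)^(-1), equivalently n0 / (1 + n0 / N).
n_fpc <- function(N, n0) n0 / (1 + n0 / N)

# N = 1000 is an assumed value, not taken from the exercise.
ceiling(n_fpc(1000, n0))
```

For any finite N the adjusted size is smaller than n0, so 57 is a conservative choice.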

15. The data set agsrs.csv, available on the unit's iLearn site, was collected from an SRS
of size 300 (n) from a population of 3078 (N) counties and includes a number of
variables. For each of the following variables, plot the data and estimate the
population mean for that variable, along with its standard error. Give a 95% CI for
your estimate.

# Read the data set
LGA <- read.csv("agsrs.csv", fileEncoding = "UTF-8-BOM")
head(LGA)  # first 6 observations

# 4 figures arranged in 2 rows and 2 columns
par(mfrow = c(2, 2))
hist(LGA$acres87, xlab = "Number of acres devoted to farms in 1987")
hist(LGA$farms92, xlab = "Number of farms, 1992")
hist(LGA$largef92, xlab = "Number of farms with 1000 acres or more, 1992")
hist(LGA$smallf92, xlab = "Number of farms with 9 acres or fewer, 1992")

mean(LGA$acres87)

## [1] 301953.7

var(LGA$acres87)

## [1] 118907450529

mean(LGA$farms92)

## [1] 599.06

var(LGA$farms92)

## [1] 161795.4

mean(LGA$largef92)

## [1] 56.59333

var(LGA$largef92)

## [1] 5292.73

mean(LGA$smallf92)

## [1] 46.82333

var(LGA$smallf92)

## [1] 4398.199

All four distributions are quite skewed, as the histograms show, so a large sample is
needed for the normal approximation. With n = 300 here, the sample should be large enough.

a. Number of acres devoted to farms in 1987

$\bar{y} = 301{,}953.7$, $s^2 = 118{,}907{,}450{,}529$;

95% CI: $301953.7 \pm 1.96\sqrt{\dfrac{s^2}{300}\left(1 - \dfrac{300}{3078}\right)} = (264883,\ 339025)$.

b. Number of farms, 1992

Similar to the previous part.

Check answer:

$\bar{y} = 599.06$, $s^2 = 161795.4$; 95% CI: (556, 642).

c. Number of farms with 1000 acres or more, 1992

Check answer:

$\bar{y} = 56.593$, $s^2 = 5292.73$; 95% CI: (48.8, 64.4).

d. Number of farms with 9 acres or fewer, 1992

Check answer:

$\bar{y} = 46.823$, $s^2 = 4398.199$; 95% CI: (39.7, 54.0).
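The four intervals in parts a to d all follow the same recipe, so they can be checked with one small helper. The sketch below is an illustration, not the original solution code: `srs_ci` is a hypothetical function name, and its inputs are the rounded means and variances printed in the R output earlier, so the last digit can differ slightly from intervals computed from the unrounded data.

```r
# 95% CI for a population mean under SRS without replacement,
# with the finite population correction (1 - n/N).
srs_ci <- function(ybar, s2, n, N, z = 1.96) {
  se <- sqrt((1 - n / N) * s2 / n)
  c(mean = ybar, se = se, lower = ybar - z * se, upper = ybar + z * se)
}

n <- 300   # sample size
N <- 3078  # population size

results <- rbind(
  acres87  = srs_ci(301953.7, 118907450529, n, N),
  farms92  = srs_ci(599.06,   161795.4,     n, N),
  largef92 = srs_ci(56.59333, 5292.73,      n, N),
  smallf92 = srs_ci(46.82333, 4398.199,     n, N)
)
print(results)
```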


Additional question:

Suppose a population has five elements {2, 5, 4, 8, 6}.

a) Calculate the population mean and the population standard deviation, using
the formulas on Slides 14 & 15 in Week 2 lecture.

For this finite population, N = 5, µ = (2+5+4+6+8)/5 = 5

$\sigma^2 = \dfrac{1}{5-1}\left\{\left[2^2 + 5^2 + 4^2 + 6^2 + 8^2\right] - 5 \times (5)^2\right\} = 5$

Thus, $\sigma = \sqrt{5} = 2.236$.
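These values can be checked in R. Note that the formula used here divides by N − 1, which is also what R's `var()` does, so the built-in function gives the same answer; the variable names below are illustrative.

```r
pop <- c(2, 5, 4, 8, 6)
N   <- length(pop)

mu     <- mean(pop)                             # population mean
sigma2 <- (sum(pop^2) - N * mu^2) / (N - 1)     # formula with divisor N - 1
sigma  <- sqrt(sigma2)

c(mu = mu, sigma2 = sigma2, sigma = sigma)
all.equal(sigma2, var(pop))   # var() also uses the N - 1 divisor
```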

b) List all possible simple random samples of size 4 that could be selected from the
population above without replacement, and for each possible sample provide
its probability (chance) of being selected and calculate its average/mean.

There are five distinct samples when the order of the values is ignored.
Each is equally likely, with a probability of 1 in 5, i.e., 0.2.

Sample i   Sample          Probability   Sample mean
1          {2, 5, 4, 6}    0.2           4.25
2          {2, 5, 4, 8}    0.2           4.75
3          {2, 4, 6, 8}    0.2           5
4          {2, 5, 6, 8}    0.2           5.25
5          {5, 4, 6, 8}    0.2           5.75
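The same table can be generated with `combn()`, which enumerates every subset of a given size. This is a small sketch with illustrative variable names; the rows come out in a different order from the table above, but the samples and means are the same.

```r
pop <- c(2, 5, 4, 8, 6)

# Each column of the 4 x 5 matrix is one possible sample of size 4.
samples <- combn(pop, 4)
means   <- colMeans(samples)

# Under SRS without replacement every sample is equally likely.
prob <- rep(1 / ncol(samples), ncol(samples))

data.frame(
  sample      = apply(samples, 2, paste, collapse = ", "),
  probability = prob,
  mean        = means
)
```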

c) Based on the information in part b, determine the expected value and
variance of sample mean as a random variable, using the formulas on Slides 3
& 10 in Week 2 lecture.

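$$
\begin{aligned}
E(\bar{Y}) &= \sum_{i=1}^{5} \bar{y}_i \cdot \Pr(\bar{Y} = \bar{y}_i) \\
&= 4.25 \times 0.2 + 4.75 \times 0.2 + 5 \times 0.2 + 5.25 \times 0.2 + 5.75 \times 0.2 \\
&= (4.25 + 4.75 + 5 + 5.25 + 5.75) \times 0.2 \\
&= 25 \times 0.2 \\
&= 5 = \mu
\end{aligned}
$$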

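$$
\begin{aligned}
E(\bar{Y}^2) &= \sum_{i=1}^{5} \bar{y}_i^2 \cdot \Pr(\bar{Y} = \bar{y}_i) \\
&= 4.25^2 \times 0.2 + 4.75^2 \times 0.2 + 5^2 \times 0.2 + 5.25^2 \times 0.2 + 5.75^2 \times 0.2 \\
&= 25.25
\end{aligned}
$$

$$
\mathrm{Var}(\bar{Y}) = E(\bar{Y}^2) - \left[E(\bar{Y})\right]^2 = 25.25 - 5^2 = 0.25
$$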
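As a quick numerical check of part c, the expected value and variance can be computed from the five sample means and their probabilities. The names below are illustrative.

```r
ybar <- c(4.25, 4.75, 5, 5.25, 5.75)  # sample means from part b
p    <- rep(0.2, 5)                   # each sample has probability 1/5

e_y  <- sum(ybar * p)       # E(Ybar)   = 5
e_y2 <- sum(ybar^2 * p)     # E(Ybar^2) = 25.25
v_y  <- e_y2 - e_y^2        # Var(Ybar) = 0.25

c(E = e_y, E2 = e_y2, Var = v_y)
```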

d) Calculate the variance of the sample mean (not using the information in part
b) using the formula derived in Week 2 (see Slide 39), and compare it with the
variance obtained in part c.

$$
\mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n}(1 - f) = \frac{5}{4}\left(1 - \frac{4}{5}\right) = 0.25
$$

This is identical to the variance of the sample mean found in c), as expected.
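The agreement between the formula and the enumeration can also be confirmed in R. In this sketch (illustrative names only), the formula uses the population variance with divisor N − 1, while the enumeration averages the squared deviations of the five equally likely sample means from the population mean.

```r
pop <- c(2, 5, 4, 8, 6)
N   <- length(pop)
n   <- 4

# Formula: Var(ybar) = (1 - f) * sigma^2 / n, with f = n / N.
v_formula <- (1 - n / N) * var(pop) / n

# Enumeration: mean squared deviation of the five possible sample means.
sample_means <- colMeans(combn(pop, n))
v_enum       <- mean((sample_means - mean(pop))^2)

c(formula = v_formula, enumeration = v_enum)
```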

e) Do you think that the variance of the mean of a sample of size 4 drawn
with replacement will be greater than that for sampling
without replacement? Do not carry out any calculation, but give your answer
and the reasons for it.

Yes. Sampling with replacement allows more extreme samples, e.g., sample
(2, 2, 2, 2) with a mean of 2, or sample (8, 8, 8, 8) with a mean of 8, so the
sample mean can be very small or very large. This greater variability gives a
larger variance.
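Although the question asks for reasoning only, the answer can be verified afterwards by enumerating every ordered sample drawn with replacement. The sketch below is an optional check rather than part of the expected answer, and its variable names are illustrative.

```r
pop <- c(2, 5, 4, 8, 6)
n   <- 4

# All 5^4 = 625 equally likely ordered samples drawn with replacement.
wr       <- expand.grid(rep(list(pop), n))
wr_means <- rowMeans(wr)

# Variance of the sample mean under each design.
var_with    <- mean((wr_means - mean(pop))^2)
var_without <- (1 - n / length(pop)) * var(pop) / n

c(with_replacement = var_with, without_replacement = var_without)
range(wr_means)   # sample means now span the full range of the population
```

With replacement, the sample means range from 2 to 8, and the variance of the sample mean is larger than the 0.25 found for sampling without replacement.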
