
Homework 1

Statistics 109

Due February 17, 2019 at 11:59pm EST

Homework policies. Please provide concise, clear answers for each question. Note that only writing the result of
a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving R, include the code in
your solution, along with any plots.
Please submit your homework assignment via Canvas as a PDF.
We encourage you to discuss problems with other students (and, of course, with the course head and the TFs), but
you must write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If you
do collaborate with classmates on a problem, please list your collaborators on your solution.

The following problems are from the book Using R for Introductory Statistics, available in the
pset directory on rstudio.cloud. For the data sets in the book, you will need to run
library(UsingR) to get access to them; then use the data(datasetname) command to load the
data into R. For example, data(lawsuits) will load the lawsuits data into R from the package.
1) 5.29, page 160
1-pnorm(0.550, mean=0.5, sd=sqrt(.5*.5/1000))

## [1] 0.0007827011
The probability that 550 or more people in the random sample of 1000 will be in favor of the
issue is approximately 0.0008. For a sequence of iid Bernoulli random variables, the CLT says
that the sample proportion of successes is approximately normal for large n, so we can use the
pnorm function to find the probability that at least 550 people in the random sample are in
favor of the issue (equivalently, 1 minus the probability that 549 or fewer people are in favor
of the issue).
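
As a sanity check (not required by the problem), the normal approximation can be compared
with the exact binomial probability; for n = 1000 the two should be close:
# exact P(X >= 550) for X ~ Binomial(1000, 0.5), for comparison with the CLT answer
pbinom(549, size=1000, prob=0.5, lower.tail=FALSE)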
2) 5.30, page 160
1-pnorm(3500/15, mean=180, sd=25/sqrt(15))

## [1] 1.110223e-16
If the central limit theorem applies, it would be extremely unusual for an elevator holding
15 people to carry over 3500 pounds. The central limit theorem for means states that the sample
mean weight is approximately normally distributed as n grows large, and the probability that
the mean weight exceeds 3500/15 ≈ 233.3 pounds is extremely small and very close to 0. However,
we should consider whether 15 is a large enough sample size for the central limit theorem to
apply.
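
Equivalently (a sketch using the same CLT approximation), the calculation can be done on the
total-weight scale, since the sum of 15 weights has mean 15 · 180 and standard deviation 25√15:
# same probability, computed for the total weight of 15 people
1 - pnorm(3500, mean=15*180, sd=25*sqrt(15))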
3) 6.2, page 174

library(UsingR)
data(lawsuits)
set.seed(109)
res10 = replicate(1000,mean(sample(lawsuits,size=10,replace=TRUE)))
hist(res10)
[Figure: histogram of res10]

res30 = replicate(1000,mean(sample(lawsuits,size=30,replace=TRUE)))
hist(res30)

[Figure: histogram of res30]

res100 = replicate(1000,mean(sample(lawsuits,size=100,replace=TRUE)))
hist(res100)
[Figure: histogram of res100]
res90 = replicate(1000,mean(sample(lawsuits,size=90,replace=TRUE)))
hist(res90)
[Figure: histogram of res90]

For n=10 and n=30, there is an obvious long right tail and the histograms do not look normal.
For n=100, however, the histogram appears much more bell-shaped; there is still a slight right
tail in some replications, but overall it looks quite normal. At n=80 (not shown), only around
one in ten replications looks normal, while at n=90 roughly half look approximately normal and
the other half still show an obvious right skew.
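
To supplement the histograms, Q-Q plots give a sharper view of normality (a sketch; not part
of the original exercise):
# Q-Q plots of the simulated sampling distributions; points near the
# reference line indicate approximate normality
qqnorm(res10);  qqline(res10)
qqnorm(res100); qqline(res100)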
4) 6.8, page 175
xbar = c(); std = c()
for(i in 1:500) {
  sam = rnorm(10)
  xbar[i] = mean(sam); std[i] = sd(sam)
}
plot(xbar, std)

[Figure: scatterplot of std against xbar, normal samples]

cor(xbar,std)

## [1] -0.07164655
xbar = c(); std = c()
for(i in 1:500) {
  sam = rt(n=100, df=3)
  xbar[i] = mean(sam); std[i] = sd(sam)
}
plot(xbar, std)

[Figure: scatterplot of std against xbar, t-distributed samples]

cor(xbar,std)

## [1] 0.002133794
xbar = c(); std = c()
for(i in 1:500) {
  sam = rexp(n=100, rate=1)
  xbar[i] = mean(sam); std[i] = sd(sam)
}
plot(xbar, std)

[Figure: scatterplot of std against xbar, exponential samples]

cor(xbar,std)

## [1] 0.7066733
The correlation between the sample mean and the sample standard deviation is near zero for
the normal and t distributions, consistent with the two statistics being approximately
independent (for normal samples they are in fact exactly independent). For the exponential
distribution, however, there is a strong positive correlation of about 0.71 between the mean
and the standard deviation, so these statistics are clearly dependent.
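
Since the three simulations differ only in the sampling distribution, they could be wrapped
in a single helper (a sketch; sim_cor and the anonymous generators are names I am introducing,
not part of the exercise):
# correlation between sample mean and sample SD over many simulated samples
sim_cor = function(rdist, n, reps = 500) {
  sims = replicate(reps, { s = rdist(n); c(mean(s), sd(s)) })
  cor(sims[1, ], sims[2, ])
}
sim_cor(rnorm, 10)
sim_cor(function(n) rt(n, df = 3), 100)
sim_cor(function(n) rexp(n, rate = 1), 100)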
5) 7.6 (page 185) (compare exact and asymptotic methods)
prop.test(5, 100)

##
## 1-sample proportions test with continuity correction
##
## data: 5 out of 100, null probability 0.5
## X-squared = 79.21, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.01855256 0.11829946
## sample estimates:
## p
## 0.05
binom.test(5, 100)

##
## Exact binomial test
##

## data: 5 and 100
## number of successes = 5, number of trials = 100, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.01643188 0.11283491
## sample estimates:
## probability of success
## 0.05
For both the exact method based on the binomial distribution and the asymptotic method based
on the normal approximation, the 95% confidence interval contains p = 1/10. The confidence
interval for the exact method is (0.016, 0.113) and the confidence interval for the asymptotic
method is (0.019, 0.118).
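
The two intervals can also be pulled out directly for a side-by-side comparison (both
functions return an htest object with a conf.int component):
# extract just the confidence intervals from each test
prop.test(5, 100)$conf.int
binom.test(5, 100)$conf.int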
6) 7.18 (page 191)
library(UsingR)
data(normtemp)
qqnorm(normtemp$temperature)
[Figure: normal Q-Q plot of normtemp$temperature]

t.test(normtemp$temperature, mu=98.6, conf.level=0.9)

##
## One Sample t-test
##
## data: normtemp$temperature
## t = -5.4548, df = 129, p-value = 2.411e-07
## alternative hypothesis: true mean is not equal to 98.6

## 90 percent confidence interval:
## 98.14269 98.35577
## sample estimates:
## mean of x
## 98.24923
The data appear to come from a normal distribution, as the Q-Q plot shows an approximately
linear relationship between the theoretical quantiles and the sample quantiles. The 90%
confidence interval for the mean normal body temperature is (98.143, 98.356), which does not
include 98.6, so the data are inconsistent with the traditional value of 98.6°F.
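
Adding a reference line makes the Q-Q plot easier to judge (a minor sketch, not required by
the exercise):
# reference line through the first and third quartiles
qqnorm(normtemp$temperature)
qqline(normtemp$temperature)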
7) 7.21 (page 192) (comment on the results)
n = 10
boxplot(rt(1000,df= n-1),rnorm(1000))
[Figure: boxplots of rt(1000, df = 9) and rnorm(1000)]

x = seq(0,1,length=150)
plot(qt(x,df=n-1), qnorm(x));abline(0,1)

[Figure: Q-Q plot of t quantiles (df = 9) against normal quantiles]

curve(dnorm(x),-3.5,3.5)
curve(dt(x,df=n-1), lty=2, add=TRUE)
[Figure: normal density with t density (df = 9) overlaid]

n = 3
boxplot(rt(1000,df= n-1),rnorm(1000))

[Figure: boxplots of rt(1000, df = 2) and rnorm(1000)]

x = seq(0,1,length=150)
plot(qt(x,df=n-1), qnorm(x));abline(0,1)
[Figure: Q-Q plot of t quantiles (df = 2) against normal quantiles]

curve(dnorm(x),-3.5,3.5)
curve(dt(x,df=n-1), lty=2, add=TRUE)

[Figure: normal density with t density (df = 2) overlaid]

n = 25
boxplot(rt(1000,df= n-1),rnorm(1000))
[Figure: boxplots of rt(1000, df = 24) and rnorm(1000)]

x = seq(0,1,length=150)
plot(qt(x,df=n-1), qnorm(x));abline(0,1)

[Figure: Q-Q plot of t quantiles (df = 24) against normal quantiles]

curve(dnorm(x),-3.5,3.5)
curve(dt(x,df=n-1), lty=2, add=TRUE)
[Figure: normal density with t density (df = 24) overlaid]

n = 50
boxplot(rt(1000,df= n-1),rnorm(1000))

[Figure: boxplots of rt(1000, df = 49) and rnorm(1000)]

x = seq(0,1,length=150)
plot(qt(x,df=n-1), qnorm(x));abline(0,1)
[Figure: Q-Q plot of t quantiles (df = 49) against normal quantiles]

curve(dnorm(x),-3.5,3.5)
curve(dt(x,df=n-1), lty=2, add=TRUE)

[Figure: normal density with t density (df = 49) overlaid]

n = 100
boxplot(rt(1000,df= n-1),rnorm(1000))
[Figure: boxplots of rt(1000, df = 99) and rnorm(1000)]

x = seq(0,1,length=150)
plot(qt(x,df=n-1), qnorm(x));abline(0,1)

[Figure: Q-Q plot of t quantiles (df = 99) against normal quantiles]

curve(dnorm(x),-3.5,3.5)
curve(dt(x,df=n-1), lty=2, add=TRUE)
[Figure: normal density with t density (df = 99) overlaid]

Over the range we are plotting, -3.5 to 3.5, the two distributions appear approximately the
same by n=50. At n=3, the t-distribution obviously has a lower peak and thicker tails; at
n=10, this is much less pronounced, but the difference is still clearly visible. At n=25, the
difference is slight but noticeable, and at n=50 and n=100 the two curves are nearly
indistinguishable.
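
A rough numeric summary of the visual comparison (a sketch, not part of the exercise) is the
largest gap between the two density curves over the plotted range:
# largest pointwise difference between the t and normal densities on [-3.5, 3.5]
xs = seq(-3.5, 3.5, length=200)
for (n in c(3, 10, 25, 50, 100)) {
  cat("n =", n, " max |dt - dnorm| =", max(abs(dt(xs, df=n-1) - dnorm(xs))), "\n")
}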
8) 7.25 (page 200)
cocktaila=c(3.1, 3.3, 1.7, 1.2, 0.7, 2.3, 2.9)
cocktailb=c(1.8, 2.3, 2.2, 3.5, 1.7, 1.6, 1.4)
t.test(cocktaila, cocktailb, var.equal=FALSE, conf.level=0.8)

##
## Welch Two Sample t-test
##
## data: cocktaila and cocktailb
## t = 0.21592, df = 10.788, p-value = 0.8331
## alternative hypothesis: true difference in means is not equal to 0
## 80 percent confidence interval:
## -0.5322402 0.7322402
## sample estimates:
## mean of x mean of y
## 2.171429 2.071429
An 80% confidence interval for the difference of means is (-0.532, 0.732). Since this interval
contains 0 (and the p-value is 0.83), there is no evidence of a difference between the two
cocktails. We are assuming that the populations each sample is drawn from (in this case, the
number of years until failure for each AIDS cocktail) are normally distributed.
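
With only seven observations per group the normality assumption is hard to verify, but Q-Q
plots give a quick informal check (a sketch, not required by the exercise):
# informal normality checks for each small sample
qqnorm(cocktaila); qqline(cocktaila)
qqnorm(cocktailb); qqline(cocktailb)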
The following two problems are from the book Mathematical Statistics with Resampling and
R, available in the pset directory on rstudio.cloud. There is a data package for the book; run
library(resampledata), then load a particular data set with data(); for example, to load the
data set FlightDelays, run data(FlightDelays).
9) Exercise 12 page 132
library(resampledata)
data("FishMercury")
hist(FishMercury$Mercury)

[Figure: histogram of FishMercury$Mercury]

Upon making a histogram of Mercury, I see that all but one of the observations are between 0
and 0.5; there is one outlier between 1.5 and 2.0.
boots = replicate(1000, mean(sample(FishMercury$Mercury, replace=TRUE)))
quantile(boots, c(0.025, 0.975))

## 2.5% 97.5%
## 0.1103325 0.3041133

sd(boots)

## [1] 0.0571765

mean(boots)

## [1] 0.1785431
The 95% bootstrap percentile confidence interval is (0.110, 0.304), the bootstrap mean is
0.179, and the bootstrap standard deviation is 0.057.
mercurynooutliers = subset(FishMercury$Mercury, FishMercury$Mercury < 1)
boots = replicate(1000, mean(sample(mercurynooutliers, replace=TRUE)))
quantile(boots, c(0.025, 0.975))
sd(boots)
mean(boots)
The 95% bootstrap confidence interval is now (0.109, 0.139), the bootstrap mean is 0.123, and
the bootstrap standard deviation is 0.008.
Removing the outlier makes the confidence interval much narrower, especially on the upper end.
The mean is also much lower (0.123 versus 0.179), and the standard deviation shrinks by an
order of magnitude, from 0.057 to 0.008.
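
For comparison (a sketch, not required by the exercise), the formula-based t interval on the
full data can be pulled out directly; with the outlier present it may differ noticeably from
the bootstrap percentile interval:
# classical t confidence interval on the full data, for comparison
t.test(FishMercury$Mercury)$conf.int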
10) Exercise 14 parts a and b page 132
library(resampledata)
data("Girls2004")
Girls2004WY <- subset(Girls2004, State=="WY")
Girls2004AK <- subset(Girls2004, State=="AK")
mean(Girls2004WY$Weight)

## [1] 3207.9
sd(Girls2004WY$Weight)

## [1] 418.3184
mean(Girls2004AK$Weight)

## [1] 3516.35
sd(Girls2004AK$Weight)

## [1] 578.8336
The mean weight of baby girls born in Wyoming is 3207.9 grams with standard deviation 418.32
grams, and the mean weight of baby girls born in Alaska is 3516.35 grams with standard
deviation 578.83 grams.
N <- 10^4
diff.mean <- numeric(N)
for (i in 1:N) {
  WYSample <- sample(Girls2004WY$Weight, replace=TRUE)
  AKSample <- sample(Girls2004AK$Weight, replace=TRUE)
  diff.mean[i] <- mean(WYSample) - mean(AKSample)
}
hist(diff.mean)

[Figure: histogram of diff.mean]

quantile(diff.mean, c(0.025, 0.975))

## 2.5% 97.5%
## -529.70250 -90.97437

mean(diff.mean)

## [1] -309.1314

sd(diff.mean)

## [1] 112.1939
The bootstrap mean of the difference (Wyoming minus Alaska) is -309.13 grams, and the bootstrap
standard deviation is 112.19 grams. The 95% bootstrap percentile interval is (-529.70, -90.97),
which means that if we were to repeat this procedure many times, approximately 95% of the
resulting intervals would contain the true difference in mean birth weight between Wyoming and
Alaska. Since the interval does not contain 0, this is strong evidence that the mean weights of
girls born in Alaska and Wyoming are different.
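
As a cross-check (a sketch, not part of the exercise), the classical two-sample t interval for
the same difference can be compared with the bootstrap percentile interval; the two should be
broadly similar here:
# Welch two-sample t interval for the difference in means (WY - AK)
t.test(Girls2004WY$Weight, Girls2004AK$Weight)$conf.int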
11) As we have seen, bootstrap distributions are generally symmetric and bell-shaped and
centered at the value of the original sample statistic. However, strange things can happen
when the sample size is small and there is an outlier present. Create a bootstrap distribution
for the standard deviation based on the following data:
x=c(8,10,7,12,13,8,10,50)
boots = replicate(1000, sd(sample(x, replace=TRUE)))
hist(boots)

[Figure: histogram of boots]

Describe the shape of the distribution. Is it appropriate to construct a confidence interval
from this distribution? Explain why the distribution might have the shape it does.
The distribution appears to be bimodal, with one peak between 0 and 5 and another peak around
15-20. It is not appropriate to construct a confidence interval from this distribution: the
bootstrap distribution is far from symmetric and bell-shaped, with a large gap between about
5 and 12. Because the sample size is so small and 50 is such a large outlier, resamples that
contain 50 have a much larger standard deviation, resamples that do not contain 50 have a much
smaller one, and the standard deviation is almost never in between.
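
The bimodality can be checked directly (a sketch): the probability that a resample of size 8
includes the outlier at least once is 1 - (7/8)^8 ≈ 0.66, so roughly two-thirds of the
resampled standard deviations should fall in the upper mode.
# probability a resample of size 8 contains the outlier 50 at least once
1 - (7/8)^8
# simulation check
mean(replicate(10000, 50 %in% sample(x, replace=TRUE)))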
12) For this problem we are going to build our own normality test using the bootstrap. To do so,
we first need to know how to extract the bootstrap confidence interval.
We want to build a function called mynormtest that takes as input a vector of numbers and
either returns “I think the data is normally distributed” or “I don’t think the data is normally
distributed”. The general format of a function is as follows:
myfun = function(x) {
  result = sum(x)
  if (result > 0) cat("The sum is positive\n")
  if (result < 0) cat("The sum is negative\n")
}
# an example run
myfun(rnorm(100))

## The sum is negative


The R package moments gives us two very useful functions: skewness and kurtosis. If data
is truly normal, it should have a skewness value of 0 and a kurtosis value of 3. Write an R
function that conducts a normality test as follows: it takes as input a data set, calculates a
bootstrap confidence interval for the skewness, calculates a bootstrap confidence interval for
the kurtosis, then sees if 0 is in the skewness interval and 3 is in the kurtosis interval. If so,
your routine prints that the data is normally distributed, otherwise your routine should print
that the data is not normally distributed. Test your routine on random data from normal
(rnorm), uniform (runif), and exponential (rexp), with sample sizes of n = 10, 30, 70, and 100.
An example code fragment is below:
library(moments)
mynormtest = function(x) {
  bootsskew = replicate(1000, skewness(sample(x, replace=TRUE)))
  bootskurt = replicate(1000, kurtosis(sample(x, replace=TRUE)))
  skew.lower = quantile(bootsskew, 0.025)
  skew.upper = quantile(bootsskew, 0.975)
  kurt.lower = quantile(bootskurt, 0.025)
  kurt.upper = quantile(bootskurt, 0.975)
  if (skew.lower < 0 & skew.upper > 0 & kurt.lower < 3 & kurt.upper > 3) {
    cat("I think the data is normally distributed\n")
  } else {
    cat("I don't think the data is normally distributed\n")
  }
}

mynormtest(rnorm(10))

## I think the data is normally distributed


mynormtest(rnorm(30))

## I don't think the data is normally distributed


mynormtest(rnorm(70))

## I think the data is normally distributed


mynormtest(rnorm(100))

## I think the data is normally distributed


mynormtest(runif(10))

## I think the data is normally distributed


mynormtest(runif(30))

## I don't think the data is normally distributed


mynormtest(runif(70))

## I don't think the data is normally distributed


mynormtest(runif(100))

## I don't think the data is normally distributed

mynormtest(rexp(10))

## I think the data is normally distributed


mynormtest(rexp(30))

## I don't think the data is normally distributed


mynormtest(rexp(70))

## I don't think the data is normally distributed


mynormtest(rexp(100))

## I don't think the data is normally distributed
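
Note that the test is sometimes fooled in both directions: it called one rnorm(30) sample
non-normal, and it called the runif(10) and rexp(10) samples normal. To quantify how often this
happens (a sketch; mynormtest2 is a hypothetical variant I am introducing, not part of the
assignment), the function can return a logical instead of printing, and the detection rate can
be estimated by repetition:
# variant of mynormtest that returns TRUE/FALSE instead of printing
mynormtest2 = function(x) {
  bootsskew = replicate(1000, skewness(sample(x, replace=TRUE)))
  bootskurt = replicate(1000, kurtosis(sample(x, replace=TRUE)))
  si = quantile(bootsskew, c(0.025, 0.975))
  ki = quantile(bootskurt, c(0.025, 0.975))
  si[1] < 0 & si[2] > 0 & ki[1] < 3 & ki[2] > 3
}
# proportion of exponential samples of size 30 that are (incorrectly) called normal
mean(replicate(100, mynormtest2(rexp(30))))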
