Statistics Workshop 1: Introduction to R.

Tuesday May 26, 2009

Assignments
Generally speaking, there are three basic forms of assigning data. The first case is the single atom,
i.e., a single number. Assigning a number to an object in this case is quite simple: we use
<- or = to assign a number or an atom to a name. In the following, >
refers to the prompt in R.
The second form is the vector. In this form, we assign a name to an array of numbers.
This can be done with the command c, which stands for concatenation. The useful fact
is that we can extract any member of the vector, replace that member with a new
value, or perform various arithmetic operations on the whole vector, as shown below.
Finally, the third form of storing data is the matrix, created with the command
matrix. First we input the data set of interest, then tell R the dimensions of the matrix. For example, we can put an array of 9
numbers into a matrix with 3 rows and 3 columns. We demonstrate all of these below.

Atoms, Vectors and Matrices


(a) Atoms:
> sam=2
> sam
[1] 2
> sam+sam
[1] 4
> (2*sam*2)/2
[1] 4
> sam^(1/3)
[1] 1.259921

> sqrt(sam)
[1] 1.414214
> abs(-sam)
[1] 2

(b) Vectors
> class.age=c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)
> class.age
[1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> class.age[3]
[1] 36
> class.age[1:5]
[1] 35 35 36 37 37
> class.age[-5]
[1] 35.0 35.0 36.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> class.age[-c(2,7)]
[1] 35.0 36.0 37.0 37.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

> class.age*2
 [1]  70  70  72  74  74  76  76  78  81  86  88  89 100  38
> sqrt(class.age)
 [1] 5.916080 5.916080 6.000000 6.082763 6.082763 6.164414 6.164414 6.244998
 [9] 6.363961 6.557439 6.633250 6.670832 7.071068 4.358899
> class.age^(-1)
 [1] 0.02857143 0.02857143 0.02777778 0.02702703 0.02702703 0.02631579
 [7] 0.02631579 0.02564103 0.02469136 0.02325581 0.02272727 0.02247191
[13] 0.02000000 0.05263158

> class.age*class.age
 [1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00
> class.age^2
 [1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00
> mean(class.age)
[1] 38.28571
> median(class.age)
[1] 38
> class.age=(class.age)/2
> class.age
 [1] 17.50 17.50 18.00 18.50 18.50 19.00 19.00 19.50 20.25 21.50 22.00 22.25 25.00  9.50
> class.age=class.age*2

Often it is useful to create an empty vector. Here is the way this is done:
> hi=numeric(10)
> hi
[1] 0 0 0 0 0 0 0 0 0 0
Vectors do not have to be numerical. We can create a vector of characters:
> hi=c("hello","whasup","longday")
> hi
[1] "hello"   "whasup"  "longday"
Later, it becomes useful to ask R the length of a vector:
> length(class.age)
[1] 14
(c) Matrices
> sam = matrix(nrow=3,ncol=4)
> sam
     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
[3,]   NA   NA   NA   NA
> sam = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,byrow=T)
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

> sam<-matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,byrow=F)
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

> sally=c(1,2,3,4,5,6,7,8,9,10,11,12)
> sam=matrix(sally,nrow=3,byrow=T)
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> v1=c(1,2,3,4)
> v2=c(5,6,7,8)
> v3=c(9,10,11,12)
> sam=matrix(c(v1,v2,v3),nrow=3,byrow=T)
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

> sam[1,]
[1] 1 2 3 4
> sam[,2]
[1] 2 6 10
> sam[1,3]
[1] 3

> sam[3,]<-v2
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    5    6    7    8
> sam[1,]<-log(v1)
> sam
     [,1]      [,2]     [,3]     [,4]
[1,]    0 0.6931472 1.098612 1.386294
[2,]    5 6.0000000 7.000000 8.000000
[3,]    5 6.0000000 7.000000 8.000000

(d) Lists
R provides an additional, powerful storage structure called list. Its importance lies
in the fact that we can store objects of different natures, such as matrices, vectors, or
atoms, in a single object, and then call the different parts of that object separately.
Let's assume that we would like to store the following three objects in a list
called s:
> s1=3
> s2=seq(1,10,2)
> s3=matrix(c(1:9),nrow=3)
> s1
[1] 3
> s2
[1] 1 3 5 7 9
> s3
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> s<-list(s1,s2,s3)
> s
[[1]]
[1] 3

[[2]]
[1] 1 3 5 7 9

[[3]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

> s[[1]]
[1] 3
> s[[2]]
[1] 1 3 5 7 9
> s[[3]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

The for loop


Oftentimes it becomes necessary to repeat a certain calculation a number of times.
This is done in R with a simple command called for. Here are some examples:

> for(i in 1:3)


+ {
+ print("sam")
+ }
[1] "sam"
[1] "sam"
[1] "sam"
> s=matrix(c(1,2,3,4,5,6,7,8,9),nrow=3)
> for(i in 1:3)
+ print(s)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
or:
> for(i in 1:3)
+ { print(s) }
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

> for(i in 1:3)
+ {
+ print(s[i,])
+ }
[1] 1 4 7
[1] 2 5 8
[1] 3 6 9

Functions
In principle, there are two sorts of functions in R. The most common and useful ones
are the library functions, i.e., the commands that are already written. For example, mean and
sd are commands that calculate the average and the standard deviation of an object,
say a vector. Here are a couple of examples:
> s2<-seq(1,10,2)
> s2
[1] 1 3 5 7 9
> mean(s2)
[1] 5
> var(s2)
[1] 10
> sd(s2)
[1] 3.162278
> median(s2)
[1] 5

The second type of functions are those that users of R create. These functions
remain in the memory of the session unless you delete them or overwrite
them. As you might expect, the command to create a function is function.
Here is an example of a function that takes a matrix and calculates the standard
deviation divided by the mean for each of its rows. This measurement is called the coefficient
of variation. Note that in writing this function, I use the commands mean and sd.

In general, any time you are not sure what an R command does, or to learn about its
specifics, just type a question mark followed by the command name at the prompt (e.g., ?mean).

> m.cv<-function(mat)
{
u=nrow(mat)
t=numeric(u)
for(i in 1:u)
{
t[i]<-sd(mat[i,])/mean(mat[i,])
}
return(t)
}
> sm3<-matrix(c(1:9),nrow=3)
> sm3
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> m.cv(sm3)
[1] 0.75 0.60 0.50
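As an aside, the same row-by-row calculation can be written without an explicit for loop by using apply, which maps a function over the rows (MARGIN = 1) of a matrix. This is only an alternative sketch; the name m.cv2 is our own:

```r
# Coefficient of variation of each row: sd divided by mean
m.cv2 <- function(mat) apply(mat, 1, function(x) sd(x) / mean(x))

sm3 <- matrix(c(1:9), nrow = 3)
m.cv2(sm3)   # 0.75 0.60 0.50, matching m.cv(sm3) above
```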

Visualizing Data: Pie Charts, Stem plots, Histograms


Categorical Data
For categorical data, we keep track of counts or relative frequencies of each group. Therefore,
a schematic presentation should reflect the percentage of occurrences in each category. This
is usually done via two types of graphs: 1- Pie charts, and 2- Barplots. Both graphs are easy
to create in R. An important issue here is that in most cases, it would make sense to label
the categories. We show you how to do this below.
Example 1. The counts and the percentages of the marital status of American women
were collected by the Current Population Survey in 1995 as follows:
Marital Status   Count (millions)   Percent
Never Married              43.9       22.9
Married                   116.7       60.9
Widowed                    13.4        7.0
Divorced                   17.6        9.2
Here are the commands to provide the pie-chart for these data (figure 1):
> married<-c(43.9,116.7,13.4,17.6)
> married.code<-as.factor(c(1,2,3,4))
> pie(married,married.code)

Alternatively, we could label each piece of pie by creating a factor vector that contains
the names of each pie (figure 2):
> married.code<-c("never married","married","widowed","divorced")
> pie(married,married.code)

To create a barplot for the married data, it is sufficient to execute the following function
(figure 3):
> married<-c(22.9,60.9,7,9.2)
> barplot(married,names.arg=married.code)


Figure 1: Pie chart for the Married data.

Stemplots and Histograms


For quantitative data, stemplots and histograms are the useful visual tools.
Example 2. Let's revisit the class-age data we introduced previously. To create the stemplot
for these data, we can do the following:

> class.age<-c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)
> class.age
[1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> stem(class.age)
The decimal point is 1 digit(s) to the right of the |

  1 | 9
  2 |
  3 | 55677889
  4 | 1345
  5 | 0

Figure 2: Pie chart for the Married data with labels.
> stem(class.age,scale=2)
The decimal point is 1 digit(s) to the right of the |

  1 | 9
  2 |
  2 |
  3 |
  3 | 55677889
  4 | 134
  4 | 5
  5 | 0

> test<-c(0,0.01,0.22,0.34,0.36,0.31,0.36,0.45,0.4,0.55,0.65)
> stem(test)

Figure 3: Barplot for the Married data with labels.

The decimal point is 1 digit(s) to the left of the |

  0 | 01
  2 | 21466
  4 | 055
  6 | 5

> stem(test,scale=2)
The decimal point is 1 digit(s) to the left of the |

  0 | 01
  1 |
  2 | 2
  3 | 1466
  4 | 05
  5 | 5
  6 | 5

To create a histogram for the class-age data, it is sufficient to use the hist command
(figure 4):
> hist(class.age)

Figure 4: The histogram of class.age.

We can make the bars finer. Here is a simple trick (figure 5):
> b1<-seq(15,50,3)
> b1
[1] 15 18 21 24 27 30 33 36 39 42 45 48
> b1<-seq(15,50,3)+2
> hist(class.age,breaks=b1)


Figure 5: The histogram of class.age with finer classes.

Measuring Center: The Mean, The Median, and the Quartiles

Measures of centrality play a fundamental role in understanding statistical distributions. The most important ones are the mean, the median, and the other quantiles.

> mean(class.age)
[1] 38.28571
> median(class.age)
[1] 38
> quantile(class.age)
    0%    25%    50%    75%   100%
19.000 36.250 38.000 42.375 50.000
> quantile(class.age,prob=0.66)
  66%
39.87

Comparing Mean and Median


The Symmetric Case
For symmetric distributions such as the one in figure 6, the median and the mean are close
to each other.

Figure 6: A symmetric distribution. Mean= 100.07, Median= 99.71.

For distributions that are skewed to the left, such as the one in figure 7, the mean is
smaller than the median (why?).
Finally, for right-skewed distributions, the mean is larger than the median (figure 8).
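A quick simulation makes this skewness effect concrete: generate skewed data and compare the two measures directly. The beta parameters below are our own choices for illustration:

```r
set.seed(1)                      # for reproducibility
left  <- rbeta(100000, 12, 3)    # left-skewed: long tail toward 0
right <- rbeta(100000, 3, 12)    # right-skewed: long tail toward 1

mean(left) < median(left)        # TRUE: the left tail pulls the mean down
mean(right) > median(right)      # TRUE: the right tail pulls the mean up
```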

Measuring Spread: The Standard Deviation

The standard deviation reflects the typical distance of the observations from the mean. For example, the
standard deviations for the data in figures 6, 7, and 8 are 10.03, 0.19, and 0.19 respectively.
To calculate the variance and the standard deviation for the class.age data, we can type:

Figure 7: Left-skewed Distribution. Mean= 0.75 , Median= 0.79.

> var(class.age)
[1] 49.1044
> sd(class.age)
[1] 7.007453

Project 1. Visualizing Grades.


First, read the file grades.txt from the webpage. To do this, run the following code in R:
> grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)
This will generate a data frame of size 100 x 3 called grades in R for you. Rows are
students, and columns represent the Verbal SAT score, the Math SAT score, and the GPA
for each student. To examine the dimensionality of this object you can type:
> dim(grades)
[1] 100   3
which confirms what we planned initially.
Now, we are at a position to answer the following questions:
Figure 8: Right-skewed Distribution. Mean= 0.24, Median= 0.19.

(1.) Create barplots, dotplots, stemplots, and histograms for the three variables of interest.
To make life easier, use the command attach to make the data file grades your
working data. Next, proceed by just typing the name of the column of interest. Here
is what I mean:
> attach(grades)
> GPA
  [1] 2.6 2.3 2.4 3.0 3.1 2.9 3.1 3.3 2.3 3.3 2.6 3.3 2.0 3.0 1.9 2.7 2.0 3.3
 [19] 2.0 2.3 3.3 2.8 1.7 2.4 3.4 2.8 2.4 1.9 2.5 2.3 3.4 2.8 1.9 3.0 3.7 2.3
 [37] 2.9 3.3 2.1 1.2 3.3 2.0 3.1 2.6 2.4 2.4 2.3 3.0 2.9 3.4 2.3 1.4 2.8 2.4
 [55] 3.4 2.5 3.6 2.6 3.6 2.9 2.6 3.8 3.0 2.5 3.5 2.0 3.0 2.0 1.8 2.3 2.1 3.0
 [73] 3.3 3.0 3.2 2.3 3.3 3.3 3.9 2.1 2.6 2.4 3.3 3.1 3.6 2.9 2.4 1.8 2.4 2.9
 [91] 3.5 3.4 2.3 2.9 1.8 2.8 2.3 2.5 2.4 2.9

(2.) Calculate the min, the max, the mean, the median, first quartile, third quartile, and
the standard deviation of each variable.
A good chunk of that information may be obtained through using the command summary:
> summary(GPA)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.200   2.300   2.750   2.706   3.125   3.900

(3.) Report your findings in detail. Compare the verbal scores with the math scores. Comment on the symmetry, measures of centrality, measures of spread, and the potential
outliers in each distribution. Make sure to comment on the statistical features of the
GPA as well.

Boxplots
Boxplots are efficient tools for representing data distributions. The five-number summary
can be traced on a boxplot. Additionally, we can identify outliers with boxplots.
Recall the three distributions in figures 6, 7 and 8. Note that these distributions are
symmetric, left-skewed, and right-skewed respectively. Here is how I created figure 9 below:
> n1=rnorm(100000,2,3)
> n2=rpois(100000,3)
> n3=rbeta(100000,12,3)
> par(mfrow=c(3,2))
> hist(n1)
> boxplot(n1)
> hist(n2)
> boxplot(n2)
> hist(n3)
> boxplot(n3)

Note that the command par(mfrow=c(3,2)) creates a 3 by 2 grid in the graphics area.
Side-by-side boxplots are very helpful in comparing two or more distributions. For
example, figure 10 shows side-by-side boxplots for two of the distributions of figure 9.
> boxplot(n1,n2)

Linear Transformations and Their Effect on the Mean and Standard Deviation:

Standardization
To experiment with the idea of linear transformations, let's go back to the first dataset n1
and calculate its mean and standard deviation:

> mean(n1)
[1] 2.000206

Figure 9: Histograms along with boxplots for the three simulated datasets.

> sd(n1)
[1] 2.999017

Let us perform the following simple linear transformation on these numbers: subtract the
mean of n1 from each element and divide by the standard deviation, i.e.,

z = (n1 - mean(n1)) / sd(n1)
Here is the code:


> z<-(n1-mean(n1))/sd(n1)
> mean(z)
[1] 7.186852e-17
> sd(z)
[1] 1
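As a side note, R's built-in scale() function performs the same standardization, subtracting the mean and dividing by the standard deviation by default. A brief sketch (the rnorm call simply regenerates a dataset like n1):

```r
set.seed(1)
n1 <- rnorm(100000, 2, 3)        # a dataset like the one in the text

z  <- (n1 - mean(n1)) / sd(n1)   # manual standardization, as above
z2 <- as.vector(scale(n1))       # built-in equivalent

all.equal(z, z2)                 # TRUE
```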
Figure 10: Side-by-side boxplots for two of the distributions of figure 9.

> hist(z)

Figure 11: Standardized version of n1. Note that the mean is roughly 0, and the standard deviation
is 1.

Verification of the 68% - 95% - 99.7% Rule


To verify the rule, we reconsider the standardized vector z. Next, we count the number of
elements of z whose values are between -1 to 1, -2 to 2 and -3 to 3 respectively:
> length(z[z>-1 & z<1])/length(z)
[1] 0.68157
> length(z[z>-2 & z<2])/length(z)
[1] 0.95452
> length(z[z>-3 & z<3])/length(z)
[1] 0.99747
This rule holds for any normal distribution not just the standardized ones:
> u=rnorm(10000,3,4)
> m=mean(u)
> s=sd(u)

> length(u[u>m-s&u<m+s])/length(u)
[1] 0.6811
> length(u[u>m-2*s&u<m+2*s])/length(u)
[1] 0.9552
> length(u[u>m-3*s&u<m+3*s])/length(u)
[1] 0.9971
Here is another way of doing this (without simulating data):
> pnorm(1,0,1)-pnorm(-1,0,1)
[1] 0.6826895
> pnorm(2,0,1)-pnorm(-2,0,1)
[1] 0.9544997
> pnorm(3,0,1)-pnorm(-3,0,1)
[1] 0.9973002

Areas Under Normal Distribution: General


So, the command pnorm provides the area below a given point for any normal distribution.
Suppose we know that grades in statistics follow a normal distribution with a mean of 78
and a standard deviation of 7. We would like to know where a grade of 83 stands:
> pnorm(83,78,7)
[1] 0.7624747
Roughly 76% of all grades are below 83. We can also do the reverse calculation. Suppose that
we want to recover the same grade, this time knowing the area below it:
> qnorm(0.7624747,78,7)
[1] 83
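The same idea gives the area between any two grades: subtract the two cumulative probabilities. For instance, for the proportion of grades between 71 and 85 (two illustrative values of our own, one standard deviation on either side of the mean):

```r
# P(71 < X < 85) for X ~ Normal(mean = 78, sd = 7)
pnorm(85, 78, 7) - pnorm(71, 78, 7)   # about 0.68, the one-sigma band
```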

Assessing Normality: Normal Quantile Plots or QQplots


Quantile-Quantile plots are among the most powerful tools for assessing the normality of a
dataset. The idea is relatively simple: we ask whether the empirical quantiles
of our data match the theoretical quantiles of a standard normal. The data will form
a straight line if normality holds. R provides QQ-plots through the command
qqnorm. Here is the QQ-plot for the dataset n1 (figure 12):
> qqnorm(n1)
Below, we demonstrate the qqplots of a symmetric, a left-skewed, and a right-skewed
distribution. The code shows how we generated the next figure.

Figure 12: qq-plot for n1.

> n4=rnorm(1000,3,5)
> n5=rpois(1000,3)
> n6=rbeta(1000,8,3)
> par(mfrow=c(3,2))
> hist(n4)
> qqnorm(n4)
> hist(n5)
> qqnorm(n5)
> hist(n6)
> qqnorm(n6)

Importing Text-files
Reminder:
> grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)

Figure 13: The Normal Quantile plots for symmetric and asymmetric distributions.

> attach(grades)
> dim(grades)
[1] 100   3
> Verbal
  [1] 623 454 643 585 719 693 571 646 613 655 662 585 580 648 405 506 669 558
 [19] 577 487 682 565 552 567 745 610 493 571 682 600 740 593 488 526 630 586
 [37] 610 695 539 490 509 667 597 662 566 597 604 519 643 606 500 460 717 592
 [55] 752 695 610 620 682 524 552 703 584 550 659 585 578 533 532 708 537 635
 [73] 591 552 557 599 540 752 726 630 558 646 643 606 682 565 578 488 361 560
 [91] 630 666 719 669 571 520 571 539 580 629

> hist(Verbal)
> summary(Verbal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  361.0   552.0   592.5   598.5   649.8   752.0

> stem(Verbal)
The decimal point is 2 digit(s) to the right of the |

  3 | 6
  4 | 1
  4 | 5699999
  5 | 0112223334444
  5 | 55556666777777778888889999999
  6 | 000001111112233334444
  6 | 5556666777788889
  7 | 000122234
  7 | 555

Scatterplots and Pearson's Correlation

To create scatterplots, the command is simply plot.
> plot(Verbal,Math)
To calculate the correlation coefficient between any two variables, we can use the
cor command:
> cor(Verbal,Math)
[1] 0.4306938
> cor(Verbal,GPA)
[1] 0.4847681
> cor(Math,GPA)
[1] 0.2236183
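When every column of a data frame is numeric, cor can also be applied to the whole object at once, returning the matrix of all pairwise correlations. Here is a sketch with a small made-up data frame standing in for grades (with the real data, cor(grades) would give the full 3 x 3 matrix):

```r
# Hypothetical values for illustration only, not the actual grades data
df <- data.frame(Verbal = c(623, 454, 643, 585, 719),
                 Math   = c(660, 570, 700, 610, 690),
                 GPA    = c(2.6, 2.3, 2.4, 3.0, 3.1))

cor(df)   # symmetric matrix with 1s on the diagonal
```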

Central Limit Theorem

The central limit theorem says that if x ~ D(μ, σ), where D is a probability density or mass
function (regardless of the form of the distribution), then when the sample size is large enough,

(x̄ - μ) / (σ/√n) ~ N(0, 1), approximately.

When the data come from a normal distribution with mean μ and standard
deviation σ, (x̄ - μ)/(σ/√n) ~ N(0, 1) holds for any sample size.

To verify these results, let's look at figure 14, which shows a clearly right-skewed
population with μ = 2.029 and σ = 1.382. For 1000 times, we take samples of size 2
from this population and obtain the 1000 sample averages associated with those random
samples. Then we repeat this process for samples of sizes 3, 6, 10, 20, and 100, and each time
we keep track of the mean and the distribution of those 1000 sample averages. We also
plot histograms and qq-plots for each scenario. It turns out that as the sample size increases,
the distribution of the 1000 sample averages converges to normality (figures 15 and 16). Also, the
standard deviations of those sampling distributions get closer to σ/√n (table 1).
Next, I sample from a normal population with μ = 99.602 and σ = 10.2211 (figure 17).
Then I repeat the same procedure for the sample averages. It turns out that regardless of the sample
size, the results associated with the central limit theorem hold (figures 18, 19, and table 2).


Sample Size                 2          3          6         10         20        100
Mean                   2.0945   2.061667      2.016     2.0301     2.0186    2.02654
Standard Deviation  0.9860487    0.78204  0.5660369   0.453773  0.3010642  0.1300328

Table 1. Results for population 1. The mean and standard deviations of the sample mean with
different sample sizes.

Sample Size                 2          3          6         10         20        100
Mean                 99.95793   99.30017    99.5832    99.5639     99.653   99.57605
Standard Deviation   7.199265    5.83941   4.132356   3.290563   2.240824  0.9427503

Table 2. Results for population 2. The mean and standard deviations of the sample mean with
different sample sizes.

Figure 14: Case one: Population distribution. The distribution is Skewed to the right.


Figure 15: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.

R-codes For Simulation


Population 1: Right-Skewed Distribution
We can simulate from a Poisson distribution:
> test1=rpois(1000,2)
> hist(test1)
> mean(test1)
[1] 2.029
> sd(test1)
[1] 1.382777
Population 1: Obtaining 1000 Samples With Size 2, 3, 6, 10, 20, 100
Here is the case for Size 2. Others are similar.

Figure 16: QQ-plots for the sample mean distributions for different sample sizes.

> test=matrix(nrow=1000,ncol=2)
> for(i in 1:1000)
+ {
+ test[i,]=sample(test1,2)
+ }
> mean.size2=apply(test,1,mean)
> mean(mean.size2)
[1] 2.0945
> sd(mean.size2)
[1] 0.9860487
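The matrix-plus-loop pattern above can also be written in a single line with replicate, which repeats an expression a given number of times and collects the results. Here is a sketch under the same setup (the exact numbers differ slightly from the text because the random samples differ):

```r
set.seed(1)
test1 <- rpois(1000, 2)   # the population from the text

# 1000 sample means of samples of size 2, without an explicit matrix
mean.size2 <- replicate(1000, mean(sample(test1, 2)))

length(mean.size2)   # 1000
```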


Figure 17: Case two: Population distribution. The distribution is symmetric.

Population 2: Obtaining 1000 Samples With Size 2, 3, 6, 10, 20, 100
Again, only the case for size 2 is included.

> test=matrix(nrow=1000,ncol=2)
> for(i in 1:1000)
+ {
+ test[i,]=sample(test2,2)
+ }
> mean.size2.norm=apply(test,1,mean)
> mean(mean.size2.norm)
[1] 99.95793
> sd(mean.size2.norm)
[1] 7.199265

Figure 18: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.

Population 1: Plotting Histograms and QQ-plots


> par(mfrow=c(3,2))
> hist(mean.size2)
> hist(mean.size3)
> hist(mean.size6)
> hist(mean.size10)
> hist(mean.size20)
> hist(mean.size100)


Figure 19: QQ-plots for the sample mean distributions for different sample sizes.

> par(mfrow=c(3,2))
> qqnorm(mean.size2)
> qqnorm(mean.size3)
> qqnorm(mean.size6)
> qqnorm(mean.size10)
> qqnorm(mean.size20)
> qqnorm(mean.size100)


SAT Scores Again


Verbal: t-test and Confidence Interval

> mean(Verbal)
[1] 598.49
> t.test(Verbal,mu=600)

	One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49
Verbal: Two-sided Versus One-sided Tests
> t.test(Verbal,mu=600,alternative="two.sided")

	One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

> t.test(Verbal,mu=600,alternative="less")

	One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.4215
alternative hypothesis: true mean is less than 600
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49
> t.test(Verbal,mu=600,alternative="greater")

	One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.5785
alternative hypothesis: true mean is greater than 600
95 percent confidence interval:
 585.8662      Inf
sample estimates:
mean of x
   598.49

Verbal: Changing μ
> t.test(Verbal,mu=620)

	One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.00565
alternative hypothesis: true mean is not equal to 620
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

> t.test(Verbal,mu=620,alternative="less")

	One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.002825
alternative hypothesis: true mean is less than 620
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49

> t.test(Verbal,mu=620,alternative="greater")

	One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.9972
alternative hypothesis: true mean is greater than 620


Math: Confidence Interval and t-test


> t.test(Math)

	One Sample t-test

data:  Math
t = 99.6647, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 641.0874 667.1326
sample estimates:
mean of x
   654.11

Two-sample t-test for Math and Verbal


> t.test(Math,Verbal,mu=0)

	Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 9.88e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=0,alternative="greater")

	Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 4.94e-08
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 39.02003      Inf
sample estimates:
mean of x mean of y
   654.11    598.49
> t.test(Math,Verbal,mu=0,alternative="less")

	Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 1
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 72.21997
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=50)

	Welch Two Sample t-test

data:  Math and Verbal
t = 0.5595, df = 193.867, p-value = 0.5764
alternative hypothesis: true difference in means is not equal to 50
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49


Project 2
(1) (Moore and McCabe, 1998) Crop researchers plant 15 plots with a new variety of corn.
The yields in bushels per acre are:
138.0 139.1 113.0 132.5 140.7 109.7 118.9 134.8
109.6 127.3 115.6 130.4 130.2 111.7 105.5
Assume that the population of yields is normal.
(a) Find the 90% confidence interval for the mean yield for this variety of corn.
(b) Find the 95% confidence interval.
(c) Find the 99% confidence interval.
(d) How do margin of error (sampling error) in (a), (b), and (c) change as confidence
level increases?
(2) (Moore and McCabe, 1998) The table below gives the pretest and posttest scores on
MLA listening test in Spanish for 20 high school Spanish teachers who attended an
intensive summer course in Spanish.
Subject  Pretest  Posttest     Subject  Pretest  Posttest
   1       30       29           11       30       32
   2       28       30           12       29       28
   3       31       32           13       31       34
   4       26       30           14       29       32
   5       20       16           15       34       32
   6       30       25           16       20       27
   7       34       31           17       26       28
   8       15       18           18       25       29
   9       28       33           19       31       32
  10       20       25           20       29       32
Give a 90% confidence interval for the mean increase in listening score due to attending
the summer institute.
(3) Download the dataset grades.txt from the course webpage. Build 95% confidence
intervals for the population means of the Math and Verbal scores. Are there overlaps? Interpret your findings.
(4) Bonus: Consider the Verbal scores in the grades dataset. First, show that the Verbal
scores follow a normal distribution. Then, construct a 95% confidence interval for the
population mean of the Verbal scores.
An alternative way of constructing a 95% confidence interval is to use the quantiles of
the data: consider (grades_0.025, grades_0.975), where grades_0.025 and grades_0.975 are simply the 2.5% and the 97.5% percentiles of the dataset. Does this confidence interval
agree with the confidence interval you constructed before? Why?
Note that x̄ = 598.49 and s = 76.029 for the Verbal scores. Generate 100,000 normal values from Normal(598, 76/√100). Create a quantile confidence interval for the latter
dataset. Does this confidence interval agree with the first confidence interval? Can
you think of an explanation for this agreement (disagreement)?
Hint: look at the command quantile.
