Statistics Workshop 1: Introduction to R.

Tuesday May 26, 2009

Assignments
Generally speaking, there are three basic forms of assigning data. The first case is the single atom,
i.e., a single number. Assigning a number to an object in this case is quite simple: we use
<- or = to assign a number or an atom to a name. In the following, >
refers to the prompt in R.
The second form is the vector. In this form, we assign a name to an array of numbers.
This can be done with the command c, which stands for concatenation. The useful fact
is that we can extract any member of the vector, replace that member with a new
value, or perform various arithmetic operations on the whole vector, as shown below.
Finally, the third form of storing data is the matrix, created with the command
matrix. First we input the data set of interest, then tell R the dimensions of the matrix. For example, we can put an array of 9
numbers into a matrix with 3 rows and 3 columns. We demonstrate all of these below.

Atoms, Vectors and Matrices


(a) Atoms:
> sam=2
> sam
[1] 2
> sam+sam
[1] 4
> (2*sam*2)/2
[1] 4
> sam^(1/3)
[1] 1.259921

> sqrt(sam)
[1] 1.414214
> abs(-sam)
[1] 2

(b) Vectors
> class.age=c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)
> class.age
[1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> class.age[3]
[1] 36
> class.age[1:5]
[1] 35 35 36 37 37
> class.age[-5]
[1] 35.0 35.0 36.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> class.age[-c(2,7)]
[1] 35.0 36.0 37.0 37.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

> class.age*2
 [1]  70  70  72  74  74  76  76  78  81  86  88  89 100  38
> sqrt(class.age)
 [1] 5.916080 5.916080 6.000000 6.082763 6.082763 6.164414 6.164414 6.244998
 [9] 6.363961 6.557439 6.633250 6.670832 7.071068 4.358899
> class.age^(-1)
 [1] 0.02857143 0.02857143 0.02777778 0.02702703 0.02702703 0.02631579
 [7] 0.02631579 0.02564103 0.02469136 0.02325581 0.02272727 0.02247191
[13] 0.02000000 0.05263158

> class.age*class.age
 [1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00
> class.age^2
 [1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00
> mean(class.age)
[1] 38.28571
> median(class.age)
[1] 38
> class.age=(class.age)/2
> class.age
 [1] 17.50 17.50 18.00 18.50 18.50 19.00 19.00 19.50 20.25 21.50 22.00 22.25 25.00  9.50
> class.age=class.age*2

Often it is useful to create an empty vector. Here is the way this is done:
> hi=numeric(10)
> hi
[1] 0 0 0 0 0 0 0 0 0 0
Vectors do not have to be numerical. We can create a vector of characters:
> hi=c("hello","whasup","longday")
> hi
[1] "hello"   "whasup"  "longday"
Later, it becomes useful to ask R the length of a vector:
> length(class.age)
[1] 14
(c) Matrices
> sam = matrix(nrow=3,ncol=4)
> sam
     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
[3,]   NA   NA   NA   NA
> sam = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,byrow=T)
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

> sam<-matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,byrow=F)
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

> sally=c(1,2,3,4,5,6,7,8,9,10,11,12)
> sam=matrix(sally,nrow=3,byrow=T)
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> v1=c(1,2,3,4)
> v2=c(5,6,7,8)
> v3=c(9,10,11,12)
> sam=matrix(c(v1,v2,v3),nrow=3,byrow=T)
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

> sam[1,]
[1] 1 2 3 4
> sam[,2]
[1] 2 6 10
> sam[1,3]
[1] 3

> sam[3,]<-v2
> sam
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    5    6    7    8
> sam[1,]<-log(v1)
> sam
     [,1]      [,2]     [,3]     [,4]
[1,]    0 0.6931472 1.098612 1.386294
[2,]    5 6.0000000 7.000000 8.000000
[3,]    5 6.0000000 7.000000 8.000000

(d) Lists
R provides an additional, powerful storage structure called list. Its importance lies
in the fact that we can store objects of different natures, such as matrices, vectors, or
atoms, in a single object, and then call the different parts of that object separately.
Let's assume that we would like to store the following three objects in a list
called s:
> s1=3
> s2=seq(1,10,2)
> s3=matrix(c(1:9),nrow=3)
> s1
[1] 3
> s2
[1] 1 3 5 7 9
> s3
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> s<-list(s1,s2,s3)
> s
[[1]]
[1] 3

[[2]]
[1] 1 3 5 7 9

[[3]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

> s[[1]]
[1] 3
> s[[2]]
[1] 1 3 5 7 9
> s[[3]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

The for loop


Oftentimes it becomes necessary to repeat a certain calculation a number of times.
This is done in R with a simple command called for. Here are some examples:

> for(i in 1:3)


+ {
+ print("sam")
+ }
[1] "sam"
[1] "sam"
[1] "sam"
> s=matrix(c(1,2,3,4,5,6,7,8,9),nrow=3)
> for(i in 1:3)
+ print(s)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
or:
> for(i in 1:3)
+ { print(s) }
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

> for(i in 1:3)
+ {
+ print(s[i,])
+ }
[1] 1 4 7
[1] 2 5 8
[1] 3 6 9

Functions
In principle, there are two sorts of functions in R. The most common and useful ones
are the library functions, i.e., the commands that are already written. For example, mean and
sd are commands that calculate the average and the standard deviation of an object,
say a vector. Here are a couple of examples:
> s2<-seq(1,10,2)
> s2
[1] 1 3 5 7 9
> mean(s2)
[1] 5
> var(s2)
[1] 10
> sd(s2)
[1] 3.162278
> median(s2)
[1] 5

The second type of functions are those that users of R create. These functions
remain in the memory of the session unless you delete them or overwrite
them. As you might expect, the command to create a function is function.
Here is an example of a function that takes a matrix and calculates the standard
deviation divided by the mean for each of its rows. This measurement is called the coefficient
of variation. Note that in writing this function, I use the commands mean and sd.

In general, any time you are not sure what an R command does, or to learn about its
specifics, just type a question mark followed by the command name at the prompt (e.g., ?mean).

> m.cv<-function(mat)
{
u=nrow(mat)
t=numeric(u)
for(i in 1:u)
{
t[i]<-sd(mat[i,])/mean(mat[i,])
}
return(t)
}
> sm3<-matrix(c(1:9),nrow=3)
> sm3
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> m.cv(sm3)
[1] 0.75 0.60 0.50
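As an aside, the same row-by-row calculation can be written without an explicit for loop by using apply, which maps a function over the rows (MARGIN = 1) of a matrix. This is only an alternative sketch; the name m.cv2 is our own:

```r
# Coefficient of variation of each row: sd divided by mean
m.cv2 <- function(mat) apply(mat, 1, function(x) sd(x) / mean(x))

sm3 <- matrix(c(1:9), nrow = 3)
m.cv2(sm3)   # 0.75 0.60 0.50, matching m.cv(sm3) above
```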

Visualizing Data: Pie Charts, Stem plots, Histograms


Categorical Data
For categorical data, we keep track of counts or relative frequencies of each group. Therefore,
a schematic presentation should reflect the percentage of occurrences in each category. This
is usually done via two types of graphs: 1- Pie charts, and 2- Barplots. Both graphs are easy
to create in R. An important issue here is that in most cases, it would make sense to label
the categories. We show you how to do this below.
Example 1. The counts and the percentages of the marital status of American women
were collected by the Current Population Survey in 1995 as follows:
Marital Status   Count (millions)   Percent
Never Married              43.9       22.9
Married                   116.7       60.9
Widowed                    13.4        7.0
Divorced                   17.6        9.2
Here are the commands to provide the pie-chart for these data (figure 1):
> married<-c(43.9,116.7,13.4,17.6)
> married.code<-as.factor(c(1,2,3,4))
> pie(married,married.code)

Alternatively, we could label each piece of pie by creating a factor vector that contains
the names of each pie (figure 2):
> married.code<-c("never married","married","widowed","divorced")
> pie(married,married.code)

To create a barplot for the married data, it is sufficient to execute the following function
(figure 3):
> married<-c(22.9,60.9,7,9.2)
> barplot(married,names.arg=married.code)


Figure 1: Pie chart for the Married data.

Stemplots and Histograms


For quantitative data, stemplots and histograms are the useful visual tools.
Example 2. Let's revisit the class-age data we introduced previously. To create the stemplot
for these data, we can do the following:

> class.age<-c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)
> class.age
[1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> stem(class.age)
The decimal point is 1 digit(s) to the right of the |

  1 | 9
  2 |
  3 | 55677889
  4 | 1345
  5 | 0

Figure 2: Pie chart for the Married data with labels.
> stem(class.age,scale=2)
The decimal point is 1 digit(s) to the right of the |

  1 | 9
  2 |
  2 |
  3 |
  3 | 55677889
  4 | 134
  4 | 5
  5 | 0

> test<-c(0,0.01,0.22,0.34,0.36,0.31,0.36,0.45,0.4,0.55,0.65)
> stem(test)

Figure 3: Barplot for the Married data with labels.

The decimal point is 1 digit(s) to the left of the |

  0 | 01
  2 | 21466
  4 | 055
  6 | 5

> stem(test,scale=2)
The decimal point is 1 digit(s) to the left of the |

  0 | 01
  1 |
  2 | 2
  3 | 1466
  4 | 05
  5 | 5
  6 | 5

To create a histogram for the class-age data, it is sufficient to use the hist command
(figure 4):
> hist(class.age)

Figure 4: The histogram of class.age.

We can make the bars finer. Here is a simple trick (figure 5):
> b1<-seq(15,50,3)
> b1
[1] 15 18 21 24 27 30 33 36 39 42 45 48
> b1<-seq(15,50,3)+2
> hist(class.age,breaks=b1)


Figure 5: The histogram of class.age with finer classes.

Measuring Center: The Mean, The Median, and the Quartiles

Measures of centrality play a fundamental role in understanding statistical distributions. The most important ones are the mean, the median, and the other quantiles.

> mean(class.age)
[1] 38.28571
> median(class.age)
[1] 38
> quantile(class.age)
    0%    25%    50%    75%   100%
19.000 36.250 38.000 42.375 50.000
> quantile(class.age,prob=0.66)
  66%
39.87

Comparing Mean and Median


The Symmetric Case
For symmetric distributions such as the one in figure 6, the median and the mean are close
to each other.

Figure 6: A symmetric distribution. Mean= 100.07, Median= 99.71.

For distributions that are skewed to the left, such as the one in figure 7, the mean is
smaller than the median (why?).
Finally, for right-skewed distributions, the mean is larger than the median (figure 8).
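A quick simulation makes this skewness effect concrete: generate skewed data and compare the two measures directly. The beta parameters below are our own choices for illustration:

```r
set.seed(1)                      # for reproducibility
left  <- rbeta(100000, 12, 3)    # left-skewed: long tail toward 0
right <- rbeta(100000, 3, 12)    # right-skewed: long tail toward 1

mean(left) < median(left)        # TRUE: the left tail pulls the mean down
mean(right) > median(right)      # TRUE: the right tail pulls the mean up
```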

Measuring Spread: The Standard Deviation

The standard deviation reflects the typical distance of the observations from the mean. For example, the
standard deviations for the data in figures 6, 7, and 8 are 10.03, 0.19, and 0.19 respectively.
To calculate the variance and the standard deviation for the class.age data, we can type:

Figure 7: Left-skewed Distribution. Mean= 0.75 , Median= 0.79.

> var(class.age)
[1] 49.1044
> sd(class.age)
[1] 7.007453

Project 1. Visualizing Grades.


First, read the file grades.txt from the webpage. To do this, run the following code in R:
> grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)
This will generate a data frame of size 100 x 3 called grades in R for you. Rows are
students, and columns represent the Verbal SAT score, the Math SAT score, and the GPA
for each student. To examine the dimensionality of this object you can type:
> dim(grades)
[1] 100   3
which confirms what we planned initially.
Now, we are at a position to answer the following questions:
Figure 8: Right-skewed Distribution. Mean= 0.24, Median= 0.19.

(1.) Create barplots, dotplots, stemplots, and histograms for the three variables of interest.
To make life easier, use the command attach to make the data file grades your
working data. Next, proceed by just typing the name of the column of interest. Here
is what I mean:
> attach(grades)
> GPA
  [1] 2.6 2.3 2.4 3.0 3.1 2.9 3.1 3.3 2.3 3.3 2.6 3.3 2.0 3.0 1.9 2.7 2.0 3.3
 [19] 2.0 2.3 3.3 2.8 1.7 2.4 3.4 2.8 2.4 1.9 2.5 2.3 3.4 2.8 1.9 3.0 3.7 2.3
 [37] 2.9 3.3 2.1 1.2 3.3 2.0 3.1 2.6 2.4 2.4 2.3 3.0 2.9 3.4 2.3 1.4 2.8 2.4
 [55] 3.4 2.5 3.6 2.6 3.6 2.9 2.6 3.8 3.0 2.5 3.5 2.0 3.0 2.0 1.8 2.3 2.1 3.0
 [73] 3.3 3.0 3.2 2.3 3.3 3.3 3.9 2.1 2.6 2.4 3.3 3.1 3.6 2.9 2.4 1.8 2.4 2.9
 [91] 3.5 3.4 2.3 2.9 1.8 2.8 2.3 2.5 2.4 2.9

(2.) Calculate the min, the max, the mean, the median, first quartile, third quartile, and
the standard deviation of each variable.
A good chunk of that information may be obtained through using the command summary:
> summary(GPA)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.200   2.300   2.750   2.706   3.125   3.900

(3.) Report your findings in detail. Compare the verbal scores with the math scores. Comment on the symmetry, measures of centrality, measures of spread, and the potential
outliers in each distribution. Make sure to comment on the statistical features of the
GPA as well.

Boxplots
Boxplots are efficient tools for representing data distributions. The five-number summary
can be traced on a boxplot. Additionally, we can identify outliers with boxplots.
Recall the three distributions in figures 6, 7 and 8. Note that these distributions are
symmetric, left-skewed, and right-skewed respectively. Here is how I created figure 9 below:
> n1=rnorm(100000,2,3)
> n2=rpois(100000,3)
> n3=rbeta(100000,12,3)
> par(mfrow=c(3,2))
> hist(n1)
> boxplot(n1)
> hist(n2)
> boxplot(n2)
> hist(n3)
> boxplot(n3)

Note that the command par(mfrow=c(3,2)) creates a 3 by 2 grid in the graphics area.
Side-by-side boxplots are very helpful in comparing two or more distributions. For
example, figure 10 shows side-by-side boxplots for two of the distributions of figure 9.
> boxplot(n1,n2)

Linear Transformations and Their Effect on the Mean and Standard Deviation:

Standardization
To experiment with the idea of linear transformations, let's go back to the first dataset n1
and calculate its mean and standard deviation:

> mean(n1)
[1] 2.000206

Figure 9: Histograms along with boxplots for the three simulated datasets.

> sd(n1)
[1] 2.999017

Let us perform the following simple linear transformation on these numbers: subtract the
mean of n1 from each element and divide by the standard deviation, i.e.,

z = (n1 - mean(n1)) / sd(n1)
Here is the code:


> z<-(n1-mean(n1))/sd(n1)
> mean(z)
[1] 7.186852e-17
> sd(z)
[1] 1
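As a side note, R's built-in scale() function performs the same standardization, subtracting the mean and dividing by the standard deviation by default. A brief sketch (the rnorm call simply regenerates a dataset like n1):

```r
set.seed(1)
n1 <- rnorm(100000, 2, 3)        # a dataset like the one in the text

z  <- (n1 - mean(n1)) / sd(n1)   # manual standardization, as above
z2 <- as.vector(scale(n1))       # built-in equivalent

all.equal(z, z2)                 # TRUE
```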
Figure 10: Side-by-side boxplots for two of the distributions of figure 9.

> hist(z)

Figure 11: Standardized version of n1. Note that the mean is roughly 0, and the standard deviation
is 1.

Verification of the 68% - 95% - 99.7% Rule


To verify the rule, we reconsider the standardized vector z. Next, we count the number of
elements of z whose values are between -1 to 1, -2 to 2 and -3 to 3 respectively:
> length(z[z>-1 & z<1])/length(z)
[1] 0.68157
> length(z[z>-2 & z<2])/length(z)
[1] 0.95452
> length(z[z>-3 & z<3])/length(z)
[1] 0.99747
This rule holds for any normal distribution not just the standardized ones:
> u=rnorm(10000,3,4)
> m=mean(u)
> s=sd(u)

> length(u[u>m-s&u<m+s])/length(u)
[1] 0.6811
> length(u[u>m-2*s&u<m+2*s])/length(u)
[1] 0.9552
> length(u[u>m-3*s&u<m+3*s])/length(u)
[1] 0.9971
Here is another way of doing this (without simulating data):
> pnorm(1,0,1)-pnorm(-1,0,1)
[1] 0.6826895
> pnorm(2,0,1)-pnorm(-2,0,1)
[1] 0.9544997
> pnorm(3,0,1)-pnorm(-3,0,1)
[1] 0.9973002

Areas Under Normal Distribution: General


So, the command pnorm provides the area below a given point for any normal distribution.
Suppose we know that grades in statistics follow a normal distribution with a mean of 78
and a standard deviation of 7. We would like to know where a grade of 83 stands:
> pnorm(83,78,7)
[1] 0.7624747
Roughly 76% of all grades are below 83. We can also do the reverse calculation. Suppose that
we want to recover the same grade, this time knowing the area below it:
> qnorm(0.7624747,78,7)
[1] 83
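The same idea gives the area between any two grades: subtract the two cumulative probabilities. For instance, for the proportion of grades between 71 and 85 (two illustrative values of our own, one standard deviation on either side of the mean):

```r
# P(71 < X < 85) for X ~ Normal(mean = 78, sd = 7)
pnorm(85, 78, 7) - pnorm(71, 78, 7)   # about 0.68, the one-sigma band
```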

Assessing Normality: Normal Quantile Plots or QQplots


Quantile-Quantile plots are among the most powerful tools for assessing the normality of a
dataset. The idea is relatively simple: we ask whether the empirical quantiles
of our data match the theoretical quantiles of a standard normal. The data will form
a straight line if normality holds. R provides QQ-plots through the command
qqnorm. Here is the QQ-plot for the dataset n1 (figure 12):
> qqnorm(n1)
Below, we demonstrate the qqplots of a symmetric, a left-skewed, and a right-skewed
distribution. The code shows how we generated the next figure.

Figure 12: qq-plot for n1.

> n4=rnorm(1000,3,5)
> n5=rpois(1000,3)
> n6=rbeta(1000,8,3)
> par(mfrow=c(3,2))
> hist(n4)
> qqnorm(n4)
> hist(n5)
> qqnorm(n5)
> hist(n6)
> qqnorm(n6)

Importing Text-files
Reminder:
> grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)

Figure 13: The Normal Quantile plots for symmetric and asymmetric distributions.

> attach(grades)
> dim(grades)
[1] 100   3
> Verbal
  [1] 623 454 643 585 719 693 571 646 613 655 662 585 580 648 405 506 669 558
 [19] 577 487 682 565 552 567 745 610 493 571 682 600 740 593 488 526 630 586
 [37] 610 695 539 490 509 667 597 662 566 597 604 519 643 606 500 460 717 592
 [55] 752 695 610 620 682 524 552 703 584 550 659 585 578 533 532 708 537 635
 [73] 591 552 557 599 540 752 726 630 558 646 643 606 682 565 578 488 361 560
 [91] 630 666 719 669 571 520 571 539 580 629

> hist(Verbal)
> summary(Verbal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  361.0   552.0   592.5   598.5   649.8   752.0

> stem(Verbal)
The decimal point is 2 digit(s) to the right of the |

  3 | 6
  4 | 1
  4 | 5699999
  5 | 0112223334444
  5 | 55556666777777778888889999999
  6 | 000001111112233334444
  6 | 5556666777788889
  7 | 000122234
  7 | 555

Scatterplots and Pearson's Correlation

To create scatterplots, the command is simply plot.
> plot(Verbal,Math)
To calculate the correlation coefficient between any two variables, we can use the
cor command:
> cor(Verbal,Math)
[1] 0.4306938
> cor(Verbal,GPA)
[1] 0.4847681
> cor(Math,GPA)
[1] 0.2236183
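When every column of a data frame is numeric, cor can also be applied to the whole object at once, returning the matrix of all pairwise correlations. Here is a sketch with a small made-up data frame standing in for grades (with the real data, cor(grades) would give the full 3 x 3 matrix):

```r
# Hypothetical values for illustration only, not the actual grades data
df <- data.frame(Verbal = c(623, 454, 643, 585, 719),
                 Math   = c(660, 570, 700, 610, 690),
                 GPA    = c(2.6, 2.3, 2.4, 3.0, 3.1))

cor(df)   # symmetric matrix with 1s on the diagonal
```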

Central Limit Theorem

The central limit theorem says that if x ~ D(μ, σ), where D is a probability density or mass
function (regardless of the form of the distribution), then when the sample size is large enough,

(x̄ - μ) / (σ/√n) ~ N(0, 1), approximately.

When the data come from a normal distribution with mean μ and standard
deviation σ, (x̄ - μ)/(σ/√n) ~ N(0, 1) holds for any sample size.

To verify these results, let's look at figure 14, which shows a clearly right-skewed
population with μ = 2.029 and σ = 1.382. For 1000 times, we take samples of size 2
from this population and obtain the 1000 sample averages associated with those random
samples. Then we repeat this process for samples of sizes 3, 6, 10, 20, and 100, and each time
we keep track of the mean and the distribution of those 1000 sample averages. We also
plot histograms and qq-plots for each scenario. It turns out that as the sample size increases,
the distribution of the 1000 sample averages converges to normality (figures 15 and 16). Also, the
standard deviations of those sampling distributions get closer to σ/√n (table 1).
Next, I sample from a normal population with μ = 99.602 and σ = 10.2211 (figure 17).
Then I repeat the same procedure for the sample averages. It turns out that regardless of the sample
size, the results associated with the central limit theorem hold (figures 18, 19, and table 2).


Sample Size                 2          3          6         10         20        100
Mean                   2.0945   2.061667      2.016     2.0301     2.0186    2.02654
Standard Deviation  0.9860487    0.78204  0.5660369   0.453773  0.3010642  0.1300328

Table 1. Results for population 1. The mean and standard deviations of the sample mean with
different sample sizes.

Sample Size                 2          3          6         10         20        100
Mean                 99.95793   99.30017    99.5832    99.5639     99.653   99.57605
Standard Deviation   7.199265    5.83941   4.132356   3.290563   2.240824  0.9427503

Table 2. Results for population 2. The mean and standard deviations of the sample mean with
different sample sizes.

Figure 14: Case one: Population distribution. The distribution is Skewed to the right.


Figure 15: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.

R-codes For Simulation


Population 1: Right-Skewed Distribution
We can simulate from a Poisson distribution:
> test1=rpois(1000,2)
> hist(test1)
> mean(test1)
[1] 2.029
> sd(test1)
[1] 1.382777
Population 1: Obtaining 1000 Samples With Size 2, 3, 6, 10, 20, 100
Here is the case for Size 2. Others are similar.

Figure 16: QQ-plots for the sample mean distributions for different sample sizes.

> test=matrix(nrow=1000,ncol=2)
> for(i in 1:1000)
+ {
+ test[i,]=sample(test1,2)
+ }
> mean.size2=apply(test,1,mean)
> mean(mean.size2)
[1] 2.0945
> sd(mean.size2)
[1] 0.9860487
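The matrix-plus-loop pattern above can also be written in a single line with replicate, which repeats an expression a given number of times and collects the results. Here is a sketch under the same setup (the exact numbers differ slightly from the text because the random samples differ):

```r
set.seed(1)
test1 <- rpois(1000, 2)   # the population from the text

# 1000 sample means of samples of size 2, without an explicit matrix
mean.size2 <- replicate(1000, mean(sample(test1, 2)))

length(mean.size2)   # 1000
```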


Figure 17: Case two: Population distribution. The distribution is symmetric.

Population 2: Obtaining 1000 Samples With Size 2, 3, 6, 10, 20, 100
Again, only the case for size 2 is included.

> test=matrix(nrow=1000,ncol=2)
> for(i in 1:1000)
+ {
+ test[i,]=sample(test2,2)
+ }
> mean.size2.norm=apply(test,1,mean)
> mean(mean.size2.norm)
[1] 99.95793
> sd(mean.size2.norm)
[1] 7.199265

Figure 18: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.

Population 1: Plotting Histograms and QQ-plots


> par(mfrow=c(3,2))
> hist(mean.size2)
> hist(mean.size3)
> hist(mean.size6)
> hist(mean.size10)
> hist(mean.size20)
> hist(mean.size100)


Figure 19: QQ-plots for the sample mean distributions for different sample sizes.

> par(mfrow=c(3,2))
> qqnorm(mean.size2)
> qqnorm(mean.size3)
> qqnorm(mean.size6)
> qqnorm(mean.size10)
> qqnorm(mean.size20)
> qqnorm(mean.size100)


SAT Scores Again


Verbal: t-test and Confidence Interval

> mean(Verbal)
[1] 598.49
> t.test(Verbal,mu=600)

	One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49
Verbal: Two-sided Versus One-sided Tests
> t.test(Verbal,mu=600,alternative="two.sided")

	One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

> t.test(Verbal,mu=600,alternative="less")

	One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.4215
alternative hypothesis: true mean is less than 600
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49
> t.test(Verbal,mu=600,alternative="greater")

	One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.5785
alternative hypothesis: true mean is greater than 600
95 percent confidence interval:
 585.8662      Inf
sample estimates:
mean of x
   598.49

Verbal: Changing μ
> t.test(Verbal,mu=620)

	One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.00565
alternative hypothesis: true mean is not equal to 620
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

> t.test(Verbal,mu=620,alternative="less")

	One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.002825
alternative hypothesis: true mean is less than 620
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49

> t.test(Verbal,mu=620,alternative="greater")

	One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.9972
alternative hypothesis: true mean is greater than 620


Math: Confidence Interval and t-test


> t.test(Math)

	One Sample t-test

data:  Math
t = 99.6647, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 641.0874 667.1326
sample estimates:
mean of x
   654.11

Two-sample t-test for Math and Verbal


> t.test(Math,Verbal,mu=0)

	Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 9.88e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=0,alternative="greater")

	Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 4.94e-08
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 39.02003      Inf
sample estimates:
mean of x mean of y
   654.11    598.49
> t.test(Math,Verbal,mu=0,alternative="less")

	Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 1
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 72.21997
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=50)

	Welch Two Sample t-test

data:  Math and Verbal
t = 0.5595, df = 193.867, p-value = 0.5764
alternative hypothesis: true difference in means is not equal to 50
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49


Project 2
(1) (Moore and McCabe, 1998) Crop researchers plant 15 plots with a new variety of corn.
The yields in bushels per acre are:
138.0 139.1 113.0 132.5 140.7 109.7 118.9 134.8
109.6 127.3 115.6 130.4 130.2 111.7 105.5
Assume that the population of yields is normal.
(a) Find the 90% confidence interval for the mean yield for this variety of corn.
(b) Find the 95% confidence interval.
(c) Find the 99% confidence interval.
(d) How do margin of error (sampling error) in (a), (b), and (c) change as confidence
level increases?
(2) (Moore and McCabe, 1998) The table below gives the pretest and posttest scores on
MLA listening test in Spanish for 20 high school Spanish teachers who attended an
intensive summer course in Spanish.
Subject  Pretest  Posttest     Subject  Pretest  Posttest
   1       30       29           11       30       32
   2       28       30           12       29       28
   3       31       32           13       31       34
   4       26       30           14       29       32
   5       20       16           15       34       32
   6       30       25           16       20       27
   7       34       31           17       26       28
   8       15       18           18       25       29
   9       28       33           19       31       32
  10       20       25           20       29       32
Give a 90% confidence interval for the mean increase in listening score due to attending
the summer institute.
(3) Download the dataset grades.txt from the course webpage. Build 95% confidence
intervals for the population means of the Math and Verbal scores. Are there overlaps? Interpret your findings.
(4) Bonus: Consider the Verbal scores in the grades dataset. First, show that the Verbal
scores follow a normal distribution. Then, construct a 95% confidence interval for the
population mean of the Verbal scores.
An alternative way of constructing a 95% confidence interval is to use the quantiles of
the data: consider (grades_0.025, grades_0.975), where grades_0.025 and grades_0.975 are simply the 2.5% and the 97.5% percentiles of the dataset. Does this confidence interval
agree with the confidence interval you constructed before? Why?
Note that x̄ = 598.49 and s = 76.029 for the Verbal scores. Generate 100,000 normal values from Normal(598, 76/√100). Create a quantile confidence interval for the latter
dataset. Does this confidence interval agree with the first confidence interval? Can
you think of an explanation for this agreement (disagreement)?
Hint: look at the command quantile.
