
PROJECT 4: AN INTRODUCTION TO THE FOR-LOOP AND SAMPLING

DISTRIBUTIONS

STEFFEN GRØNNEBERG

NOTE: The file “4.R” contains useful code.

Assignment 1. We here consider the Beauty-dataset introduced in Lecture 4.


The data-file and a text file with a basic description of the variables are uploaded
to Itslearning.
[Figure: scatter plot of exper (horizontal axis, 0–40) against lwage (vertical axis, 0–4).]

Figure 1. R-plot.

Date: September 19, 2021.



(A) Consider Figure 1. It gives a scatter plot of experience versus log-wage. The
red curve is a smooth spline curve, and the green squares are the average
log wage for each observed experience category. Producing the scatter plot
and the smooth spline curve is easy, using the following commands.
rm(list = ls())
beauty <- read.csv("beauty.csv")
with(beauty, plot(exper, lwage))
with(beauty, lines(smooth.spline(exper, lwage), col = "red", lwd = 3))

In contrast, there is no standard command in R to calculate and plot the
green squares, so we have to write code to plot them ourselves. This is
common in applied data analysis problems.
Let us first try to do this with the commands that we have studied so
far, together with a few new ones: unique and sort, explained in
the text. Run through the following code, making sure that you understand
every step.
rm(list = ls())
beauty <- read.csv("beauty.csv")
dim(beauty)
with(beauty, plot(exper, lwage))
with(beauty, lines(smooth.spline(exper, lwage), col = "red", lwd = 3))

head(beauty$exper, 20)
# Gives the first 20 observations:
# [1] 30 28 35 38 27 20 12  5  5 12  3  6 19  8 12 17  7 12 10  7

classes <- unique(beauty$exper)  # unique gives all -unique- values of a vector.
classes
# Gives -unsorted- numbers:
#  [1] 30 28 35 38 27 20 12  5  3  6 19  8 17  7 10 33 32 24 29 41 40
# [22] 43 18 37 31  9 42 14  4 13 23 16 26 15 11 25  1 36 21 22 34 44
# [43]  2 39 48 45  0 47 46
max(classes)
# Do we have observations from all integers from 0 to 48?
d <- length(classes)
d  # gives 49
# Yes, we have observations from all integers from 0 to 48.

# Let us create a new vector, repeating (rep) the NA value
# (NA = "Not Available"). We will put the averages for
# each experience class into this vector.
expClassAverage <- rep(NA, d)

# Identify the indices which correspond to all individuals
# with zero years of experience.
expClass <- which(beauty$exper == 0)
# We then access the log-wages for these individuals
# by writing
beauty$lwage[expClass]
# Gives
# [1] 1.0577903 0.2390169 0.9669839 0.8754688

# Let us now compute the average of these log-wages,
# and save the result in the first element of
# expClassAverage.
expClassAverage[1] <- mean(beauty$lwage[expClass])
expClassAverage
# gives
# [1] 0.784815 NA NA NA ... NA   (the remaining 48 elements are NA)
# so the first element is filled up. Let's do the same for the rest:

expClass <- which(beauty$exper == 1)  # next category is 1, etc:
expClassAverage[2] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 2)
expClassAverage[3] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 3)
expClassAverage[4] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 4)
expClassAverage[5] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 5)
expClassAverage[6] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 6)
expClassAverage[7] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 7)
expClassAverage[8] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 8)
expClassAverage[9] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 9)
expClassAverage[10] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 10)
expClassAverage[11] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 11)
expClassAverage[12] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 12)
expClassAverage[13] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 13)
expClassAverage[14] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 14)
expClassAverage[15] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 15)
expClassAverage[16] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 16)
expClassAverage[17] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 17)
expClassAverage[18] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 18)
expClassAverage[19] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 19)
expClassAverage[20] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 20)
expClassAverage[21] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 21)
expClassAverage[22] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 22)
expClassAverage[23] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 23)
expClassAverage[24] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 24)
expClassAverage[25] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 25)
expClassAverage[26] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 26)
expClassAverage[27] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 27)
expClassAverage[28] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 28)
expClassAverage[29] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 29)
expClassAverage[30] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 30)
expClassAverage[31] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 31)
expClassAverage[32] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 32)
expClassAverage[33] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 33)
expClassAverage[34] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 34)
expClassAverage[35] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 35)
expClassAverage[36] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 36)
expClassAverage[37] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 37)
expClassAverage[38] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 38)
expClassAverage[39] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 39)
expClassAverage[40] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 40)
expClassAverage[41] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 41)
expClassAverage[42] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 42)
expClassAverage[43] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 43)
expClassAverage[44] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 44)
expClassAverage[45] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 45)
expClassAverage[46] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 46)
expClassAverage[47] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 47)
expClassAverage[48] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 48)
expClassAverage[49] <- mean(beauty$lwage[expClass])

points((0:48), expClassAverage, lwd = 3, pch = 5, col = "green")

This method works, but it is clearly a terrible solution: it is time-consuming
to write, and although the process is simple (a repetition of the same type
of commands again and again with minor variations), it is difficult to error-check
(did we really get all the numbers? did we skip some? etc.). Further,
it clearly does not scale: suppose we instead had 200 categories, or
10 000 categories, and so on.
Instead, we want the computer to do this in an automated manner. The
following for-loop does exactly the same as the code above, but in a much
smarter way.
expClassAverage <- NULL
for (i in (0:48)) {
  expClass <- which(beauty$exper == i)
  expClassAverage[i + 1] <- mean(beauty$lwage[expClass])
}

Note that in the first line, we initialize expClassAverage in a different way than
earlier: we assign it the value NULL, which means that expClassAverage
is empty. When we later fill in values in expClassAverage,
R automatically expands its size, eventually producing a vector with 49
elements.
Verify that you get the same plot using the for-loop code.
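The automatic growth from NULL can be seen in a tiny standalone sketch (the vector v and the squared values below are illustrations only, not part of the Beauty data):

```r
# Start from NULL; assigning to position i+1 extends the vector automatically.
v <- NULL
for (i in (0:4)) {
  v[i + 1] <- i^2
}
v          # a numeric vector of length 5: 0 1 4 9 16
length(v)  # 5
```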
(B) Use a for-loop to make histograms of the wage distribution for each beauty
category. Base your work on the following code. Note that we access the
observations fulfilling looks == 1 directly with wage[looks == 1], avoiding
which. This is often simpler.
par(mfrow = c(3, 2))
with(beauty, hist(wage[looks == 1], xlab = "wages", xlim = c(0, max(wage))))
with(beauty, hist(wage[looks == 2], xlab = "wages", xlim = c(0, max(wage))))
with(beauty, hist(wage[looks == 3], xlab = "wages", xlim = c(0, max(wage))))
with(beauty, hist(wage[looks == 4], xlab = "wages", xlim = c(0, max(wage))))
with(beauty, hist(wage[looks == 5], xlab = "wages", xlim = c(0, max(wage))))
par(mfrow = c(1, 1))
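As a small aside on the indexing used above: subsetting with a logical condition and subsetting via which select exactly the same elements. The vector x below is a made-up illustration:

```r
x <- c(5, 2, 9, 2)
x[x == 2]          # direct logical indexing: 2 2
x[which(x == 2)]   # the same result via which(): 2 2
```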

Hint 1: Base your work on the following code.

for (i in (1:5)) {
  print(i)
}

Instead of printing i, you use with(beauty, hist(wage[looks == 1], xlab="wages",
xlim=c(0,max(wage)))) appropriately modified. Be sure to put par(mfrow=c(3,2))
before your for-loop, and par(mfrow=c(1,1)) after the for-loop has finished
(recall that par(mfrow=c(3,2)) splits the plotting window into a 3 × 2
matrix, while par(mfrow=c(1,1)) restores the default behaviour of plotting
commands).
Hint 2: Do not try to get the captions of the plots to vary with i, but just
leave them as they are by default. It is easy to fix this with some additional
commands, but we will not do that here.
(C) In Project 1, we used R to verify the formula
\[
\sum_{i=0}^{n} x^i = \frac{1 - x^{n+1}}{1 - x},
\]
which holds for any number x ≠ 1 and any non-negative integer n. We used the
following code.
n <- 15
x <- 0.5
series <- x^(0:n)
sum(series)
(1 - x^(n + 1)) / (1 - x)

Let us now compute the sum without using the sum-command. One way to
do this is to use the following code (here n = 5).
n <- 5
x <- 0.5
result <- 0

result <- result + x^0
result <- result + x^1
result <- result + x^2
result <- result + x^3
result <- result + x^4
result <- result + x^5

result
(1 - x^(n + 1)) / (1 - x)

Re-implement this code using a for-loop. Explain why it works.
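One possible for-loop re-implementation is sketched below; it is only a sketch, using the same accumulator pattern as the unrolled code above:

```r
n <- 5
x <- 0.5
result <- 0
for (i in (0:n)) {
  result <- result + x^i  # add the i-th term of the series
}
result                     # 1.96875
(1 - x^(n + 1)) / (1 - x)  # 1.96875, the closed-form value
```

It works because each pass through the loop adds one term x^i, so after the final pass result holds the full sum from i = 0 to n.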


(D) The formula for an arithmetic sum is
\[
\sum_{i=1}^{n} i = \frac{n(n+1)}{2}.
\]
In "intro.R", worked with in the first lecture, we verified this formula using
the following code.
n <- 500
X <- (1:n)
sum(X)
n * (n + 1) / 2

Modify this code so that you compute the sum using a for-loop, as in the
previous task.
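One possible loop version, in the same spirit as the previous task (a sketch; the name total is chosen here to avoid masking R's built-in sum):

```r
n <- 500
total <- 0
for (i in (1:n)) {
  total <- total + i  # running sum 1 + 2 + ... + i
}
total            # 125250
n * (n + 1) / 2  # 125250, the closed-form value
```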

Assignment 2. We here study census data from the 1990 US census. The dataset
contains information from approximately 1% of the total population. We will only
study one of the variables. The whole dataset (a rather big file!) is found at
http://archive.ics.uci.edu/ml/machine-learning-databases/census1990-mld/USCensus1990raw.data.txt
See http://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29 for more information.
The variable we will work with, for illustration, is "years in school", whose variable
name is iyearsch in the original dataset.
(A) Un-zip the file yrsSchool.zip to extract yrsSchool.csv.
(B) Load yrsSchool.csv and save the result in the variable yrsSchool. The file
contains only one variable, and does not start with a variable name (specify
this when reading the file, using header = FALSE). To read the data file
correctly in R, we can use
yrsSchool <- read.table("yrsSchool.csv", header = FALSE)
yrsSchool <- yrsSchool$V1

The second line, which may be a bit surprising, is included because after
executing the first line, yrsSchool is a data.frame containing a single
variable (by default called V1, hence accessed via yrsSchool$V1). We extract
this single variable and save it in yrsSchool. This is done so that we can
directly write e.g. mean(yrsSchool), which would otherwise have
given an error message (since the mean function cannot be applied to a data
frame directly) unless we instead wrote mean(yrsSchool$V1). After executing
this second line, we are able to successfully execute mean(yrsSchool)
without error. Verify this. Also, find out how many observations there are
in total.
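The data.frame-versus-vector distinction can be seen in a self-contained sketch, where the toy data frame df below stands in for the loaded file:

```r
df <- data.frame(V1 = c(10, 12, 14))  # mimics the result of read.table()
class(df)    # "data.frame"
v <- df$V1   # extract the single column as a plain numeric vector
mean(v)      # 12
```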
(C) Make a bar plot of yrsSchool. Make a mental note of the fact that the
distribution is not normal. Why would it be inappropriate to use a histogram?
Hint: Recall that a bar plot of X can be made using barplot(table(X)).
(D) What is the number of years of schooling of person 1000 in the dataset?
(E) What is the number of years of schooling of persons 1000, 1001, 1002, 1003,
1004, 1005? To answer this, use the following procedure: put the integers
1000, 1001, 1002, 1003, 1004, 1005 into a vector called indices. Then
extract the years of schooling for those persons and place them into a vector
called X using the command

X <- yrsSchool[indices]

then print X.
(F) What is the average number of years of schooling for persons 1000, 1001,
1002, 1003, 1004, 1005?
(G) What is the average number of years of schooling for persons 1000, 1001,
1002, 1003, 1004, 1005, . . ., 2000? Make sure to program this in a manner
that involves practically no extra work compared to solving the previous
problem.
(H) The functions sample or sample.int can be used to sample at random from
a population. We will here use sample.int:
n <- 10
N <- length(yrsSchool)
indices <- sample.int(N, size = n)

The third line samples n = 10 numbers from 1 to N, giving 10 random
indices that we will think of as being "randomly selected" people.
Identify these ten randomly sampled people's yrsSchool values, and compute
their average years of schooling.
(I) Sample n = 100 people, and compute their average years of school using a
single line of code (not counting respecifying the value of n).
(J) In the next assignment we will want to repeat this many times, i.e., repeatedly
sample 100 people and compute their average years of schooling. We
also want to save the averages in a vector. Create a for-loop which does
this 1 000 times, and create a density histogram of the resulting averages.
Base your work on the following code.
averages <- NULL
for (i in (1:1000)) {
  averages[i] <- ## TO FILL IN
}
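For orientation, the filled-in loop might look as follows. This is a sketch on a synthetic stand-in population (the rpois draws below are an assumption for illustration, not the census data; with yrsSchool loaded you would sample from it instead):

```r
set.seed(1)                              # for reproducibility
population <- rpois(10000, lambda = 12)  # synthetic stand-in for yrsSchool
n <- 100
averages <- NULL
for (i in (1:1000)) {
  indices <- sample.int(length(population), size = n)  # draw a random sample
  averages[i] <- mean(population[indices])             # save its average
}
hist(averages, freq = FALSE)             # density histogram of the averages
```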

Assignment 3. Suppose we are drawing a simple random sample of size n from a
variable X in a large population (where the population has size N, a number much
bigger than n). From this simple random sample, we get values X1, X2, . . . , Xn.
If n is sufficiently large, one can show mathematically that the sampling distribution
of the average X̄n is approximately normally distributed, and indeed is close
to the specific distribution
\[
N\left(\mu, \frac{\sigma^2}{n}\right),
\]
where µ is the population average (the expectation of X) and σ² is the population
variance of X, computed as the empirical variance of the whole population.1
This phenomenon is called the central limit theorem, and it is rather surprising,
since it holds for any population we sample from. The variable within this population
is just a collection of numbers, and these numbers know nothing about
the normal distribution. Yet when we sample from them, there is a regularity within the
sampling mechanism which turns the average into something very well-behaved.
This regularity is what is explored when studying the phenomenon from a more
mathematical perspective (we will not do this here).
The practical importance of the central limit theorem is immense. We will start
to appreciate some of its consequences next week.
(A) We now continue working with the dataset from Assignment 2. We will
consider this large dataset as the population, and will sample randomly
from yrsSchool. Find N, µ and σ². Approximately which distribution
should the sample average based on n = 100 observations be close to?
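The population quantities can be computed directly; the sketch below uses a small made-up vector pop (for the census data you would replace pop with yrsSchool):

```r
pop <- c(0, 4, 8, 12, 12, 16)   # hypothetical mini-population
N <- length(pop)                # population size
mu <- mean(pop)                 # population average
sigmaSq <- mean((pop - mu)^2)   # empirical (population) variance
```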
(B) The distribution of X̄n can be found as explained in the lecture: we draw
from X̄n many times and make a density histogram. Since this density
histogram is close to the actual distribution of X̄n, this procedure allows us
to test whether the introductory text to this assignment is correct: is the
distribution of X̄n close to the N(µ, σ²/n) distribution? We can check this
by comparing the density histogram with the density curve of the N(µ, σ²/n)
distribution. Recall that this correspondence is an approximation, which
holds with better and better quality as n, the sample size of the random
sample, grows large.
In order to draw from X̄n once, we need to draw a whole random sample
of n people. Their X values are then combined into a single number, which
is their average. We then need to do this many times. Recall that this is
exactly what we did in Assignment 2 J.
Use the code from Assignment 2 J to repeat the following step 1 000 times:
Sample 100 people at random, and then compute the sample’s average years
of schooling. Then, make a density histogram of the resulting 1 000 averages.
Based on your computations in (A), draw the relevant density curve to assess
whether the sampling distribution of X̄n is indeed close to the N(µ, σ²/n)
distribution, whose density is given by
\[
f(x) = \frac{1}{\sqrt{2\pi}\,(\sigma/\sqrt{n})}\, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2/n}}
     = \frac{1}{\sqrt{2\pi}\,(\sigma/\sqrt{n})} \exp\left(-\frac{1}{2}\,\frac{(x-\mu)^2}{\sigma^2/n}\right).
\]
To add the relevant density curve (points adds to the current plot; with
type="l" it draws a connected line rather than individual points), use the
following code, where I assume sigmaSq = σ² and mu = µ. Explain
why I chose x to range over the interval given in the code.
hist(averages, freq = FALSE)
x <- seq(-3 * sqrt(sigmaSq / n) + mu, 3 * sqrt(sigmaSq / n) + mu, length.out = 400)
y <- (1 / (sqrt(2 * pi) * sqrt(sigmaSq / n))) * exp(-0.5 * (x - mu)^2 / (sigmaSq / n))
points(x, y, type = "l", col = "red")

(C) Repeat the test of the central limit theorem with n = 2, 5, 10, 1 000. Com-
ment.
(D) (Optional, and will not be assigned as homework if not solved during class)
Take a moment to marvel at the central limit theorem. The dataset, in its
pure innocence, has no idea of even the existence of the normal distribution,
yet is somehow forced to adhere to its regularity when we consider the
sampling distribution of averages from it. Note again that yrsSchool is not
normally distributed: it is the sampling distribution of the average that
is (approximately) normal, not the variable itself.

1 Strictly speaking, the empirical variance should be multiplied by a factor (N − 1)/N, for
reasons not covered by this project. This has no practical influence on our experiments, since
when N is large, as it is here, the factor (N − 1)/N is practically equal to 1.

Assignment 4. (Optional)
As we have seen, yrsSchool takes on only certain integers. It turns out, however,
that the average of many randomly chosen integers can approach any decimal
number. When we studied the sampling distribution of the average of yrsSchool,
we treated the average as a continuous variable, i.e., a variable that can take on any
decimal value within some interval.
This had the consequence that we made histograms of its distribution, and not
bar plots, etc. Strictly speaking, the average is not quite continuous, but for large
samples, its distribution becomes more and more continuous. We will here briefly
investigate this phenomenon in the simplest possible case, namely for binary
variables. The general argument is similar, though it involves more complex
manipulations of sums.
(A) Recall that we denote the average of the n observations X1, X2, . . . , Xn by
X̄n. That is,
\[
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.
\]
Suppose we observe zero/one variables, i.e., we observe numbers X1, X2, . . . , Xn
that are either zero or one. Show that
\[
\bar{X}_n = \frac{\#\{X_i = 1\}}{n}, \tag{1}
\]
where "#{Xi = 1}" means the number of observations that are equal to 1.
(B) Show that 0 ≤ X̄n ≤ 1. Practically speaking, what do X̄n = 0 and X̄n = 1
mean?
(C) Suppose n = 1. How many values can X̄n take on? What are those values?
(D) Suppose n = 2. How many values can X̄n take on? What are those values?
What about n = 3 and n = 4?
(E) For general n, how many values can X̄n take on?
(F) From eq. (1), we see that the possible values X̄n can take on are of the form
k/n for k = 0, 1, 2, . . . , n. For any number x ∈ [0, 1], explain why the distance
between x and the nearest attainable value of X̄n is less than 1/n.

Department of Economics, BI Norwegian Business School, Nydalsveien 37, 0484 Oslo, Norway
Email address: Steffen.Gronneberg@bi.no
