For-Loop and Sampling Distributions in R
STEFFEN GRØNNEBERG
Figure 1. R-plot.
(A) Consider Figure 1. It shows a scatter plot of experience versus log-wage. The
red curve is a smooth spline, and the green squares are the average
log wage for each observed experience category. Producing the scatter plot
and the smooth spline curve is easy, using the following commands.
rm(list = ls())
beauty <- read.csv("beauty.csv")
with(beauty, plot(exper, lwage))
with(beauty, lines(smooth.spline(exper, lwage), col = "red", lwd = 3))
expClassAverage[4] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 4)
expClassAverage[5] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 5)
expClassAverage[6] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 6)
expClassAverage[7] <- mean(beauty$lwage[expClass])
# ... the same two-line pattern repeats for exper == 7, 8, ..., 40 ...
expClass <- which(beauty$exper == 41)
expClassAverage[42] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 42)
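The repetitive pattern above is exactly what a for-loop removes. Below is a minimal sketch of the same computation as a loop; the data frame is simulated here so the snippet is self-contained (with the real data, beauty would instead come from read.csv("beauty.csv")).

```r
# Sketch: the repeated which()/mean() pairs collapsed into one for-loop.
# The data frame is a simulated stand-in for beauty.csv.
set.seed(1)
beauty <- data.frame(exper = sample(0:42, 500, replace = TRUE),
                     lwage = rnorm(500, mean = 1.6, sd = 0.5))

expClassAverage <- numeric(43)          # one slot per exper value 0, 1, ..., 42
for (k in 0:42) {
  expClass <- which(beauty$exper == k)  # rows with exactly k years of experience
  expClassAverage[k + 1] <- mean(beauty$lwage[expClass])
}
```

The class averages could then be added to the scatter plot with, e.g., points(0:42, expClassAverage, col = "green", pch = 15).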
Let us now compute the sum without using the sum command. One way to
do this is with the following code (here n = 5).
n <- 5
x <- 0.5
result <- 0

result <- result + x^0
result <- result + x^1
result <- result + x^2
result <- result + x^3
result <- result + x^4
result <- result + x^5

result
(1 - x^(n + 1)) / (1 - x)
Modify this code so that you compute the sum using a for-loop, as in the
previous task.
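One possible loop version might look like the following (a sketch of the kind of rewrite the task asks for; the variable names follow the listing above).

```r
n <- 5
x <- 0.5
result <- 0

for (k in 0:n) {            # add the terms x^0, x^1, ..., x^n one at a time
  result <- result + x^k
}

result                      # 1.96875
(1 - x^(n + 1)) / (1 - x)   # 1.96875, the closed-form value for comparison
```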
Assignment 2. We here study census data from the 1990 US census. The data set
contains information from approximately 1% of the total population. We will only
study one of the variables. The whole data set (a rather big file!) is found at http://archive.ics.uci.edu/ml/machine-learning-databases/census1990-mld/USCensus1990raw.data.txt. See http://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29 for more information.
The variable we will work with, for illustration, is “years in school”, whose variable
name is iyearsch in the original dataset.
(A) Un-zip the file yrsSchool.zip to extract yrsSchool.csv.
(B) Load yrsSchool.csv and save the result in the variable yrsSchool. The file
contains only one variable, and does not start with a variable-name (specify
this when reading the file, using header = FALSE). To read the data-file
correctly in R, we can use
yrsSchool <- read.table("yrsSchool.csv", header = FALSE)
yrsSchool <- yrsSchool$V1
The second line, which may be a bit surprising, is included because after
executing the first line, yrsSchool is a data.frame containing a single
variable (by default called V1, hence accessed via yrsSchool$V1). We
extract this single variable and save it back in yrsSchool. This is done so that
we can directly write e.g. mean(yrsSchool), which would otherwise have
given an error message (since the mean function cannot be applied to a data
frame directly) unless we wrote mean(yrsSchool$V1) instead. After execut-
ing the second line, we are able to successfully execute mean(yrsSchool)
without error. Verify this. Also, find out how many observations there are
in total.
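The check described in (B) can be sketched as follows. Since the census file is not available here, a simulated data frame stands in for the result of read.table; only the last few lines are the point.

```r
# Stand-in for read.table("yrsSchool.csv", header = FALSE):
set.seed(2)
yrsSchool <- data.frame(V1 = sample(0:17, 1000, replace = TRUE))

class(yrsSchool)      # "data.frame": mean(yrsSchool) would not work here
yrsSchool <- yrsSchool$V1
class(yrsSchool)      # now a plain vector
mean(yrsSchool)       # runs without error
length(yrsSchool)     # the total number of observations
```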
(C) Make a bar-plot of yrsSchool. Make a mental note of the fact that the dis-
tribution is not normal. Why would it be inappropriate to use a histogram?
Hint: Recall that a barplot of X can be made using barplot(table(X)).
(D) What is the number of years of schooling of person 1000 in the dataset?
(E) What is the number of years of schooling of persons 1000, 1001, 1002, 1003
1004, 1005? To answer this, use the following procedure: Put the integers
1000, 1001, 1002, 1003, 1004, 1005 into a vector called indices. Then
extract the years of schooling for those people and place them into a vector
called X using the command
X <- yrsSchool[indices]
then print X.
(F) What is the average number of years of schooling for persons 1000, 1001,
1002, 1003, 1004, 1005?
(G) What is the average number of years of schooling for persons 1000, 1001,
1002, 1003, 1004, 1005, . . ., 2000? Make sure to program this in a manner
that involves practically no extra work compared to solving the previous
problem.
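Tasks (E) to (G) might be sketched like this, using a simulated stand-in for yrsSchool; note that only the definition of indices changes between (F) and (G).

```r
set.seed(3)
yrsSchool <- sample(0:17, 5000, replace = TRUE)  # stand-in for the census variable

indices <- 1000:1005          # persons 1000, 1001, ..., 1005
X <- yrsSchool[indices]
X                             # (E): print the selected values
mean(X)                       # (F): their average

indices <- 1000:2000          # (G): the only line that changes
mean(yrsSchool[indices])
```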
(H) The functions sample or sample.int can be used to sample at random from
a population. We will here use sample.int:
n <- 10
N <- length(yrsSchool)
indices <- sample.int(N, size = n)
(C) Repeat the test of the central limit theorem with n = 2, 5, 10, and 1000.
Comment.
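One way such a test could be organized is sketched below: for each n, draw many random samples, record each sample average, and inspect the histogram of the averages. The population is again a simulated stand-in for yrsSchool, and nSim is an arbitrary choice.

```r
set.seed(4)
yrsSchool <- sample(0:17, 50000, replace = TRUE)  # stand-in population
N <- length(yrsSchool)

nSim <- 2000                            # number of repeated samples per n
for (n in c(2, 5, 10, 1000)) {
  averages <- numeric(nSim)
  for (b in 1:nSim) {
    indices <- sample.int(N, size = n)  # a random sample of size n
    averages[b] <- mean(yrsSchool[indices])
  }
  hist(averages, main = paste("n =", n))  # bell shape emerges as n grows
}
```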
(D) (Optional, and will not be assigned as homework if not solved during class)
Take a moment to marvel at the central limit theorem. The dataset, in its
pure innocence, has no idea of even the existence of the normal distribution,
yet is somehow forced to adhere to its regularity when we consider the
sampling distribution of averages from it. Note again that yrsSchool is not
normally distributed; it is the sampling distribution of the average that
is (approximately) normal, not the variable itself.

1 Strictly speaking, the empirical variance should be multiplied by a factor (N − 1)/N, for
reasons not covered by this project. This will have no practical influence on our experiments, since
when N is large, which it is here, the factor (N − 1)/N is practically equal to 1.
Assignment 4. (Optional)
As we have seen, yrsSchool takes on only certain integers. It turns out, how-
ever, that the average of many randomly chosen integers can approach any decimal
number. When we studied the sampling distribution of the average of yrsSchool,
we treated the average as a continuous variable, i.e., a variable that can take on any
decimal value within some interval.
As a consequence, we made histograms of its distribution rather than
bar-plots. Strictly speaking, the average is not quite continuous, but for large
samples its distribution becomes more and more like that of a continuous variable.
We will here briefly investigate this phenomenon in the simplest possible case,
namely for binary variables. The general argument is similar, though it involves
more complex manipulations of sums.
(A) Recall that we denote the average of the n observations X1, X2, . . . , Xn by
X̄n. That is,

    X̄n = (1/n) (X1 + X2 + · · · + Xn).
Suppose we observe zero/one variables, i.e., we observe numbers X1, X2, . . . , Xn
that are either zero or one. Show that

    X̄n = (#Xi = 1)/n,    (1)

where “#Xi = 1” means the number of observations that equal 1.
(B) Show that 0 ≤ X̄n ≤ 1. Practically speaking, what do X̄n = 0 and X̄n = 1
mean?
(C) Suppose n = 1. How many values can X̄n take on? What are those values?
(D) Suppose n = 2. How many values can X̄n take on? What are those values?
What about n = 3 and n = 4?
(E) For general n, how many values can X̄n take on?
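Questions (C) to (E) can also be checked numerically. A small sketch that enumerates every possible zero/one sample of size n and collects the distinct averages (the function name attainable is ad hoc):

```r
# All 2^n zero/one samples of size n, and the distinct averages they produce.
attainable <- function(n) {
  samples <- expand.grid(rep(list(0:1), n))  # one row per possible sample
  sort(unique(unname(rowMeans(samples))))
}
attainable(1)  # two values: 0 and 1
attainable(2)  # three values: 0, 1/2, 1
attainable(4)  # n + 1 = 5 values: 0, 1/4, 1/2, 3/4, 1
```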
(F) From eq. (1), we see that the possible values X̄n can take on are of the form
k/n for k = 0, 1, 2, . . . , n. For any number x ∈ [0, 1], explain why the distance
between x and the nearest attainable value of X̄n is less than 1/n.