For-Loop and Sampling Distributions in R
STEFFEN GRØNNEBERG
Figure 1. R-plot.
(A) Consider Figure 1. It shows a scatter plot of experience versus log-wage. The
red curve is a smooth spline, and the green squares are the average
log wage for each observed experience category. Producing the scatter plot
and the smooth spline curve is easy, using the following commands.
rm(list = ls())
beauty <- read.csv("beauty.csv")
with(beauty, plot(exper, lwage))
with(beauty, lines(smooth.spline(exper, lwage), col = "red", lwd = 3))
expClassAverage[4] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 4)
expClassAverage[5] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 5)
expClassAverage[6] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 6)
expClassAverage[7] <- mean(beauty$lwage[expClass])
# ... the same two-line pattern repeats for exper == 7, 8, ..., 40 ...
expClass <- which(beauty$exper == 41)
expClassAverage[42] <- mean(beauty$lwage[expClass])
expClass <- which(beauty$exper == 42)
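The repetitive pattern above is exactly what a for-loop removes. Below is a minimal sketch of the same computation as a loop; the data frame is simulated here so the snippet is self-contained (with the real data, beauty would instead come from read.csv("beauty.csv")).

```r
# Sketch: the repeated which()/mean() pairs collapsed into one for-loop.
# The data frame is a simulated stand-in for beauty.csv.
set.seed(1)
beauty <- data.frame(exper = sample(0:42, 500, replace = TRUE),
                     lwage = rnorm(500, mean = 1.6, sd = 0.5))

expClassAverage <- numeric(43)          # one slot per exper value 0, 1, ..., 42
for (k in 0:42) {
  expClass <- which(beauty$exper == k)  # rows with exactly k years of experience
  expClassAverage[k + 1] <- mean(beauty$lwage[expClass])
}
```

The class averages could then be added to the scatter plot with, e.g., points(0:42, expClassAverage, col = "green", pch = 15).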
Let us now compute the sum without using the sum command. One way to
do this is with the following code (here n = 5).
n <- 5
x <- 0.5
result <- 0

result <- result + x^0
result <- result + x^1
result <- result + x^2
result <- result + x^3
result <- result + x^4
result <- result + x^5

result
(1 - x^(n + 1)) / (1 - x)
Modify this code so that you compute the sum using a for-loop, as in the
previous task.
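One possible loop version might look like the following (a sketch of the kind of rewrite the task asks for; the variable names follow the listing above).

```r
n <- 5
x <- 0.5
result <- 0

for (k in 0:n) {            # add the terms x^0, x^1, ..., x^n one at a time
  result <- result + x^k
}

result                      # 1.96875
(1 - x^(n + 1)) / (1 - x)   # 1.96875, the closed-form value for comparison
```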
Assignment 2. We here study census data from the 1990 US census. The data set
contains information from approximately 1% of the total population. We will only
study one of the variables. The whole data set (a rather big file!) is found at http://archive.ics.uci.edu/ml/machine-learning-databases/census1990-mld/USCensus1990raw.data.txt. See http://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29 for more information.
The variable we will work with, for illustration, is “years in school”, whose variable
name is iyearsch in the original dataset.
(A) Un-zip the file yrsSchool.zip to extract yrsSchool.csv.
(B) Load yrsSchool.csv and save the result in the variable yrsSchool. The file
contains only one variable, and does not start with a variable-name (specify
this when reading the file, using header = FALSE). To read the data-file
correctly in R, we can use
yrsSchool <- read.table("yrsSchool.csv", header = FALSE)
yrsSchool <- yrsSchool$V1
The second line, which may be a bit surprising, is included because after
executing the first line, yrsSchool is a data.frame containing a single
variable (by default called V1, hence accessed via yrsSchool$V1). We
extract this single variable and save it back in yrsSchool. This is done so that
we can directly write e.g. mean(yrsSchool), which would otherwise have
given an error message (since the mean function cannot be applied to a data
frame directly) unless we wrote mean(yrsSchool$V1) instead. After execut-
ing the second line, we are able to successfully execute mean(yrsSchool)
without error. Verify this. Also, find out how many observations there are
in total.
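The check described in (B) can be sketched as follows. Since the census file is not available here, a simulated data frame stands in for the result of read.table; only the last few lines are the point.

```r
# Stand-in for read.table("yrsSchool.csv", header = FALSE):
set.seed(2)
yrsSchool <- data.frame(V1 = sample(0:17, 1000, replace = TRUE))

class(yrsSchool)      # "data.frame": mean(yrsSchool) would not work here
yrsSchool <- yrsSchool$V1
class(yrsSchool)      # now a plain vector
mean(yrsSchool)       # runs without error
length(yrsSchool)     # the total number of observations
```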
(C) Make a bar-plot of yrsSchool. Make a mental note of the fact that the dis-
tribution is not normal. Why would it be inappropriate to use a histogram?
Hint: Recall that a barplot of X can be made using barplot(table(X)).
(D) What is the number of years of schooling of person 1000 in the dataset?
(E) What is the number of years of schooling of persons 1000, 1001, 1002, 1003
1004, 1005? To answer this, use the following procedure: Put the integers
1000, 1001, 1002, 1003, 1004, 1005 into a vector called indices. Then
extract the years of schooling for those people and place them into a vector
called X using the command
X <- yrsSchool[indices]
then print X.
(F) What is the average number of years of schooling for persons 1000, 1001,
1002, 1003, 1004, 1005?
(G) What is the average number of years of schooling for persons 1000, 1001,
1002, 1003, 1004, 1005, . . ., 2000? Make sure to program this in a manner
that involves practically no extra work compared to solving the previous
problem.
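Tasks (E) to (G) might be sketched like this, using a simulated stand-in for yrsSchool; note that only the definition of indices changes between (F) and (G).

```r
set.seed(3)
yrsSchool <- sample(0:17, 5000, replace = TRUE)  # stand-in for the census variable

indices <- 1000:1005          # persons 1000, 1001, ..., 1005
X <- yrsSchool[indices]
X                             # (E): print the selected values
mean(X)                       # (F): their average

indices <- 1000:2000          # (G): the only line that changes
mean(yrsSchool[indices])
```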
(H) The functions sample or sample.int can be used to sample at random from
a population. We will here use sample.int:
n <- 10
N <- length(yrsSchool)
indices <- sample.int(N, size = n)
(C) Repeat the test of the central limit theorem with n = 2, 5, 10, and 1000.
Comment.
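One way such a test could be organized is sketched below: for each n, draw many random samples, record each sample average, and inspect the histogram of the averages. The population is again a simulated stand-in for yrsSchool, and nSim is an arbitrary choice.

```r
set.seed(4)
yrsSchool <- sample(0:17, 50000, replace = TRUE)  # stand-in population
N <- length(yrsSchool)

nSim <- 2000                            # number of repeated samples per n
for (n in c(2, 5, 10, 1000)) {
  averages <- numeric(nSim)
  for (b in 1:nSim) {
    indices <- sample.int(N, size = n)  # a random sample of size n
    averages[b] <- mean(yrsSchool[indices])
  }
  hist(averages, main = paste("n =", n))  # bell shape emerges as n grows
}
```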
(D) (Optional, and will not be assigned as homework if not solved during class)
Take a moment to marvel at the central limit theorem. The dataset, in its
pure innocence, has no idea of even the existence of the normal distribution,
yet is somehow forced to adhere to its regularity when we consider the
sampling distribution of averages from it. Note again that yrsSchool is not
normally distributed; it is the sampling distribution of the average that
is (approximately) normal, not the variable itself.

1 Strictly speaking, the empirical variance should be multiplied by a factor (N − 1)/N, for
reasons not covered by this project. This will have no practical influence on our experiments, since
when N is large, which it is here, the factor (N − 1)/N is practically equal to 1.
Assignment 4. (Optional)
As we have seen, yrsSchool takes on only certain integers. It turns out, how-
ever, that the average of many randomly chosen integers can approach any decimal
number. When we studied the sampling distribution of the average of yrsSchool,
we treated the average as a continuous variable, i.e., a variable that can take on any
decimal value within some interval.
As a consequence, we made histograms of its distribution rather than
bar-plots. Strictly speaking, the average is not quite continuous, but for large
samples its distribution becomes more and more like that of a continuous variable.
We will here briefly investigate this phenomenon in the simplest possible case,
namely for binary variables. The general argument is similar, though it involves
more complex manipulations of sums.
(A) Recall that we denote the average of the n observations X1, X2, . . . , Xn by
X̄n. That is,

    X̄n = (1/n) (X1 + X2 + · · · + Xn).
Suppose we observe zero/one variables, i.e., we observe numbers X1, X2, . . . , Xn
that are either zero or one. Show that

    X̄n = (#Xi = 1)/n,    (1)

where “#Xi = 1” means the number of observations that equal 1.
(B) Show that 0 ≤ X̄n ≤ 1. Practically speaking, what do X̄n = 0 and X̄n = 1
mean?
(C) Suppose n = 1. How many values can X̄n take on? What are those values?
(D) Suppose n = 2. How many values can X̄n take on? What are those values?
What about n = 3 and n = 4?
(E) For general n, how many values can X̄n take on?
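Questions (C) to (E) can also be checked numerically. A small sketch that enumerates every possible zero/one sample of size n and collects the distinct averages (the function name attainable is ad hoc):

```r
# All 2^n zero/one samples of size n, and the distinct averages they produce.
attainable <- function(n) {
  samples <- expand.grid(rep(list(0:1), n))  # one row per possible sample
  sort(unique(unname(rowMeans(samples))))
}
attainable(1)  # two values: 0 and 1
attainable(2)  # three values: 0, 1/2, 1
attainable(4)  # n + 1 = 5 values: 0, 1/4, 1/2, 3/4, 1
```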
(F) From eq. (1), we see that the possible values X̄n can take on are of the form
k/n for k = 0, 1, 2, . . . , n. For any number x ∈ [0, 1], explain why the distance
between x and the nearest attainable value of X̄n is less than 1/n.