July 6, 2016
List of Figures
1 Uniform and Normal distribution
2 Bimodal distribution
3 Written PDF graphic file from R
4 Screen shot of Rcmdr window
5 Graph of relationship between height and weight
6 Temperature vs gas consumption
7 Graph showing temperatures and gas consumption before and after insulation
8 Various possible plots
9 Plot of Cars93 Data
10 Plot of Cars93 Data with labels, lines, points, texts and legend
11 Adding regression lines
12 Multiple Histograms for Cars93 data
13 Boxplots
14 Q-Q plots
15 Pie chart representing car types
16 3-D Pie chart in R
17 A 2-D plot of volcano data in R
18 3-D plot of Volcano data
Computer Interactive statistics Jane A. Aduda
List of Tables
1 Measurement of Variables
2 Types of errors
Contents
List of Figures
List of Tables
Course Outline
4 Plotting
4.1 Adding titles, lines, points
4.2 Adding regression lines
4.3 Histograms
4.4 Boxplot
4.5 Normal probability (Q-Q) plots
4.6 Pie Charts
4.7 2-D and 3-D plots
4.8 Exercises
4.8.1 Exercise 8
Pre-Requisites:
STA 2102: Information Technology for Statistics & STA 2107: Database Man-
agement
Co-Requisites:
STA 2200 Probability and Statistics II
(a) Course Purpose
To give students hands-on experience with statistical software to perform
data exploration and analysis as well as to write and run simple programs
that can be used to solve financial problems using a high level programming
language.
(b) Learning outcomes
By the end of this course the student should be able to:
(1) Describe the basic concepts of modern statistics
(2) Demonstrate good understanding of statistical reports
(3) Recognize the presence of errors or misleading quantitative information
(4) Conduct a robust and in-depth exploratory data analysis
(5) Conduct point and confidence estimation, and hypothesis testing
(6) Perform high level language programming.
(c) Course Description Basic concepts of modern statistics. Generation and
understanding of statistical reports. Exploratory data analysis, statistical
graphics, sampling variability, point and confidence interval estimation, and
hypothesis testing. Recognition of accuracy or misleading quantitative in-
formation. S-plus/R will be used throughout.
(d) Teaching Methodology
Lectures, Practicals, Tutorials, Self-study, Discussions and Student Presen-
tations.
(e) Instructional Material and Equipment
Black or White Boards, Chalk or White Board Markers, Dusters, Computer
and Projector.
[1] Crawley M., Statistics: An Introduction Using R, John Wiley & Sons,
ISBN-10: 0470022981, 2005.
[2] Uppal, S. M., Odhiambo, R. O. & Humphreys, H. M. Introduction to
Probability and Statistics. JKUAT Press, ISBN 9966923950, 2005.
[3] I. Miller & M. Miller, John E. Freund's Mathematical Statistics with
Applications, 7th ed., Pearsons Education, Prentice Hall, New Jersey,
2003 ISBN-10: 0131427067
[4] RV Hogg, JW McKean & AT Craig Introduction to Mathematical Statis-
tics, 6th ed., Prentice Hall, 2003 ISBN 0-13-177698-3
[5] HJ Larson Introduction to Probability Theory and Statistical Infer-
ence. 3rd ed., Wiley, 1982 ISBN-13: 978-0471059097
[1] Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978), Statistics for
Experimenters: An Introduction to Design, Data Analysis, and Model
Building, John Wiley and Sons. ISBN-13: 978-0471093152
[2] Du Toit, Steyn, and Stumpf (1986), Graphical Exploratory Data Anal-
ysis, Springer-Verlag ISBN 10: 0387963138 ISBN 13: 9780387963136.
[3] Cleveland, William (1993), Visualizing Data, Hobart Press ISBN-10:
0963488406/ISBN-13: 978-0963488404
The branch of statistics used to interpret or draw inferences about a set of ob-
servations is referred to as inferential statistics.
Descriptive statistics include things such as means, medians, modes, and per-
centages, and they are everywhere, for example you might hear of
– a study showing that, as people age, women’s brains shrink 67% less than
men’s brains do.
– a meteorologist reporting that the average temperature for the past 7 days has
been over 24 °C.
What makes descriptive statistics really important is that they take what could
be an extremely large and cumbersome set of observations and boil them down
to one or two highly representative numbers.
Imagine a sportscaster trying to tell us exactly how well Michael Olunga has
been scoring this season without using any descriptive statistics: “he scored a
goal, then he scored another one and another one . . .”.
One useful way to get a feel for a set of observations is to arrange them in order
from the lowest to the highest and to graph them pictorially so that taller parts
of the graph represent more frequently occurring scores (or, in the case of a
theoretical or ideal distribution, more probable scores).
The first graph in Figure 1 shows a uniform distribution – every possible outcome
has an equal chance, or likelihood, of occurring. For example, when rolling
a fair die, each of the six sides has an equal probability of turning up,
namely 1/6 (about 16.7%).
Usually, about 68% of a set of normally distributed observations will fall within
one standard deviation of the mean. About 95% of a set of normally distributed
observations will fall within two standard deviations of the mean, and well over
99% of a set of normally distributed observations (about 99.7%) will fall
within three standard deviations of the mean.
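These percentages can be checked against the standard normal distribution. The following is a minimal sketch, assuming base R's pnorm() (the normal cumulative distribution function); the variable names are illustrative only.

```r
# Probability that a normally distributed observation lies within
# k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(100 * within_k_sd, 1)   # 68.3 95.4 99.7
```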
For example in a crash test, 11 cars were tested to determine what impact speed
was required to obtain minimal bumper damage. Find the mode of the speeds
given in miles per hour as 24, 15, 18, 20, 18, 22, 24, 26, 18, 26, 24.
Answer: Since both 18 and 24 occur three times, the modes are 18 and 24 miles
per hour. This data set is bimodal.
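The same answer can be obtained in R by tabulating the speeds. This is a minimal sketch using base R's table(); the names speeds, counts and modes are illustrative.

```r
speeds <- c(24, 15, 18, 20, 18, 22, 24, 26, 18, 26, 24)
counts <- table(speeds)    # frequency of each distinct speed
# keep every value whose frequency equals the highest frequency
modes <- as.numeric(names(counts)[counts == max(counts)])
modes                      # 18 24
```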
Bimodal distributions are relatively rare, and they usually reflect the fact that a
sample is composed of two meaningful sub samples.
Researchers need to move beyond the data themselves in the hopes of drawing
general inferences about observations. To do this, researchers rely on inferential
statistics.
The basic idea behind inferential statistical testing is that decisions about what
to conclude from a set of research findings need to be made in a logical, unbiased
fashion.
One of the most highly developed forms of logic is mathematics, and statistical
testing involves the use of objective, mathematical decision rules to determine
whether an observed set of research findings is “real”.
The opposite of the null hypothesis is the alternative hypothesis. This is the
hypothesis that any observed difference between the experimental and the con-
trol group is real.
In a simple, two–groups experiment, this would mean assuming that the experi-
mental group and the control group are not really different after the manipulation–
and that any apparent difference between the two groups is simply due to luck
(i.e., to a failure of random assignment).
The null hypothesis is very much like the presumption of innocence in the court-
room. Jurors in a courtroom are instructed to assume that they are in court
because an innocent person had the bad luck of being falsely accused of a crime.
Statistical testing tells researchers exactly how likely it is that a given research
finding would occur on the basis of luck alone (if nothing interesting is really
going on).
Researchers conclude that there is a true association between the variables they
have manipulated or measured only if the observed association would rarely
have occurred on the basis of chance.
After beginning with the presumption of innocence, jurors are instructed to
examine all the evidence presented in a completely rational, unbiased fashion. The
statistical equivalent of this is to examine all the evidence collected in a study
on a purely objective, mathematical basis.
After examining the evidence against the defendant in a careful, unbiased fash-
ion, jurors are further instructed to reject the presumption of innocence (to
vote guilty) only if the evidence suggests beyond a reasonable doubt that the
defendant committed the crime in question.
The statistical equivalent of the principle of reasonable doubt is the alpha level
agreed upon by most statisticians as the reasonable standard for rejecting the
null hypothesis. In most cases, the accepted probability value at which alpha is
set is .05.
Researchers may reject the null hypothesis and conclude that their hypothesis
(the alternative) is correct only when findings as extreme as those observed in
the study (or more extreme) would have occurred by chance alone less than 5%
of the time.
Analysing and interpreting the data from most real empirical investigations re-
quire extensive calculations, but of course these labour-intensive calculations are
usually carried out by computers. In fact, a great deal of your training in this
course will involve getting a computer to crunch numbers for you using the
statistical software packages R and SPSS.
Probability theory deals with the mathematical rules and procedures used to
predict and understand chance events.
By making use of some basic concepts in probability theory, along with our
knowledge of what a distribution of observations should look like when noth-
ing funny is going on (e.g., when we are merely flipping a fair coin 10 times
at random, when we are simply randomly assigning 20 people to either an ex-
perimental or a control condition), we can use inferential statistics to figure out
exactly how likely it is that a given set of usual or not so–usual observations
Unless it is pretty darn unlikely that a set of findings would have been observed
by chance, the logic of statistical hypothesis testing requires us to conclude that
the set of findings represents a chance outcome.
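As a small illustration of this logic, take the coin example mentioned above: if the coin is fair, how likely are 9 or more heads in 10 flips? This is a minimal sketch, assuming base R's dbinom() for binomial probabilities. The cut-off of 9 heads and the one-sided comparison are chosen here purely for illustration.

```r
n_flips <- 10
p_fair <- 0.5
# P(9 or 10 heads in 10 flips) when the coin is fair
p_value <- sum(dbinom(9:10, size = n_flips, prob = p_fair))
p_value           # 11/1024, about 0.0107
p_value < 0.05    # TRUE: rarer than the conventional alpha level of 0.05
```

Because this probability is below 0.05, such an outcome would lead us to reject the hypothesis that only chance is at work; 6 heads out of 10, by contrast, would not.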
– Used for labelling variables, without any quantitative value. All the
scales are mutually exclusive (no overlap) and none of them have any
numerical significance.
– No inherent ordering to the categories.
– You can code the five genotypes with numbers if you want, but the
order is arbitrary and any calculations (for example, computing an
average) would be meaningless.
• Ordinal
– Order matters but not the difference between values, e.g., you may
ask patients to express the amount of pain they feel on a scale of 1
to 10. A score of 7 means more pain than 5, and 5 means more than 3. But
the difference between 7 and 5 may not be the same as that between
5 and 3.
• Interval
– Numeric scales in which we know not only the order, but also the
exact differences between the values.
– Example is Celsius temperature because the difference between each
value is the same. For example, the difference between 60 and 50
degrees is a measurable 10 degrees, as is the difference between 80
and 70 degrees. Time is another example where the increments are
known, consistent, and measurable.
– One problem: they don’t have a “true zero”, e.g., there is no such
thing as “no temperature.” Without a true zero, it is impossible to
compute ratios.
• Ratio
– They tell us about the order, they tell us the exact value between
units, AND they also have an absolute zero which allows for a wide
range of both descriptive and inferential statistics to be applied.
– A ratio scale is one in which the answers are real numbers, and an
answer of zero means what it says. "How old are you?", "How tall
are you?", "How many children do you have?"
– These variables can be meaningfully added, subtracted, multiplied,
divided (ratios). Central tendency and measures of dispersion can
be obtained.
– All source code is published, so you can see the exact algorithms being
used; also, expert statisticians can make sure the code is correct;
– Most programs written for the commercial S-PLUS program will run un-
changed, or with minor changes, in R
R is a suite of software facilities for reading and manipulating data, computa-
tion, conducting statistical analyses and displaying the results.
The base distribution already comes with some high–priority add–on packages,
namely
KernSmooth MASS boot class
cluster foreign lattice mgcv
nlme nnet rpart spatial
survival base datasets grDevices
graphics grid methods splines
stats stats4 tcltk tools
utils
These packages listed here implement standard statistical functionality, for ex-
ample linear models, classical tests, a huge collection of high-level plotting func-
tions or tools for survival analysis.
2.1 Starting R
R can be started in the usual way by double-clicking on the R icon on the desk-
top.
dedicated folder for each separate project – called the working folder.
– Create the directory/folder that will be used as the working folder, e.g.
create a folder on your desktop titled Your_name by right-clicking, then
clicking New > Folder.
– In the working folder, right-click and click Paste. The R icon will appear
in the folder.
– In the Start in box type the location of the working directory, e.g.
"C:\User\Jane\Desktop\CIS_R"
– Now when you double-click on the shortcut, it will start R in the directory
of your choice. So, you can set up a different shortcut for each of
your projects.
Commands are recorded in a .Rhistory file and can be recalled and reissued or
edited using the up- and down-arrow keys.
We can also copy and paste from a “script” file or the history window to recall
several commands at once.
Users are expected to type input (commands) into R in the console window.
When R is ready for input, it prints out its prompt, a ">".
Users enter a line with a command after the prompt and press Enter. The pro-
gramme carries out the command and prints the result if relevant. For example,
if the expression 2+2 is typed in, the following is printed in the R console:
> 2+2
[1] 4
>
The prompt > indicates that R is ready for another command. If a command is
incomplete at the end of a line, the prompt + is displayed on subsequent lines
until the command is syntactically complete.
You can later review your saved graphics with programs such as Windows Pic-
ture Editor. If you want to add other graphical elements, you may want to save
as a PNG or JPEG; however in most cases it is cleaner to add annotations within
R itself.
You can also review graphics within the Windows R GUI itself. Create the
first graph, bring the graphics window to foreground, and then select the menu
command History | Recording. After this all graphs are automatically saved
within R, and you can move through them with the up and down arrow keys.
You can also write your graphics commands directly to a graphics file in many
formats, e.g. PDF or JPEG. You do this by opening a graphics device, writ-
ing the commands, and then closing the device. You can get a list of graphics
devices (formats) available on your system with ?Devices (note the upper-case
D).
For example, to write a PDF file, we open a PDF graphics device with the pdf
function, write to it, and then close it with the dev.off function:
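A minimal sketch of this open, draw, close sequence follows. The file name hist_example.pdf and the use of a temporary directory are assumptions made for illustration; the histogram of 100 random normal values is just an example plot.

```r
out_file <- file.path(tempdir(), "hist_example.pdf")  # hypothetical file name
pdf(out_file)        # open the PDF graphics device
hist(rnorm(100))     # the plot is drawn into the file, not on the screen
dev.off()            # close the device so that the file is completed
file.exists(out_file)
```

Nothing appears on screen while the device is open; the file is only complete once dev.off() has been called.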
> require("Rcmdr")
As it is loaded, it starts up in another window, with its own menu system. You
can run commands from these menus, but you can also continue to type com-
mands at the R prompt. Figure 4 shows an R Commander screen shot.
[Figure 3: Written PDF graphic file from R: histogram of rnorm(100), with Frequency on the vertical axis]
To use Rcmdr, you first import or activate a dataset using one of the commands
on Rcmdr’s Data menu; then you can use procedures in the Statistics, Graphs,
and Models menus. You can also create and graph probability distributions
with the Distributions menu.
When using Rcmdr, observe the commands it formats in response to your menu
and dialog box choices. Then you can modify them yourself at the R command
line or in a script.
Rcmdr also provides some nice graphics options, including scatterplots (2D and
3D) where observations can be coloured by a classifying factor.
> 2 + 2
[1] 4
It also knows how to do other standard calculations. For instance, here is how
to compute $e^{-2}$:
> exp(-2)
[1] 0.1353353
The [1] in front of the result is part of R’s way of printing numbers and vectors.
It is not useful here, but it becomes so when the result is a longer vector. The
number in brackets is the index of the first number on that line. Consider the
case of generating 15 random numbers from a normal distribution:
> rnorm(15)
[1] 1.12076979 -0.09523206 1.45992818 0.69681119 -1.08466231 -1.66039570
[7] 0.01289009 1.70036979 0.62992239 0.10174016 -1.11853814 0.04482635
[13] 0.20740422 1.91260482 -0.74134190
Here, for example, the [7] indicates that 0.01289009 is the seventh element in
the vector.
2.6 Assignments
It is often necessary to store intermediate results so that they do not need to be
re-typed over and over again. R, like other computer languages, has symbolic
variables, that is, names that can be used to represent values. To assign a value of 10 to
the variable x, type:
> x <-10
> x
[1] 10
<- is the assignment operator in R. After the assignment, x takes the value 10
and can be used for various operations.
> x<-10
> x+x
[1] 20
> sqrt(x)
[1] 3.162278
Variable names can be chosen quite freely in R. They can be built from letters,
digits, the period (.) and the underscore (_), with the limitation that the name must not
start with a digit, an underscore, or a period followed by a digit.
Names that start with a period are special and should be avoided.
Some names are already used by the system and can cause some confusion if
used for other purposes. The worst cases are the single-letter names c, q, t,
C, D, F, I, and T, but there are also diff, df, and pt, for example. Most
of these are functions and do not usually cause trouble when used as variable
names.
Also F and T are the standard abbreviations for FALSE and TRUE and no longer
work as such if redefined.
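The effect of redefining T can be seen in a short sketch. The variable names t_after_redefine and t_after_rm are illustrative; isTRUE() and rm() are base R functions.

```r
T <- 0                          # T is an ordinary variable, so this is allowed
t_after_redefine <- isTRUE(T)   # FALSE: T no longer stands for TRUE
rm(T)                           # remove our copy; the built-in T is visible again
t_after_rm <- isTRUE(T)         # TRUE
c(t_after_redefine, t_after_rm)
```

Writing TRUE and FALSE in full avoids the problem, since those two names cannot be reassigned.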
> weight <- c(40, 60, 72, 57, 90, 95, 72)
> weight
[1] 40 60 72 57 90 95 72
You can do calculations with vectors just like ordinary numbers, as long as they
are of the same length.
Suppose that we also have the heights that correspond to the weights above.
The body mass index (BMI) is defined for each person as the weight in kilo-
grams divided by the square of the height in meters. This could be calculated as
follows:
> height <- c(1.55, 1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
> bmi <- weight/height^2
> bmi
[1] 16.64932 19.59184 22.22222 20.93664 24.93075 31.37799 19.73630
These conventions for vectorized calculations make it very easy to specify typ-
ical statistical calculations. Consider, for instance, the calculation of the mean
and standard deviation of the weight variable.
First, calculate the mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where in this case $n = 7$:
> sum(weight)
[1] 486
> sum(weight)/length(weight)
[1] 69.42857
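For the steps that follow it is convenient to store the mean in a variable. This is a minimal sketch, assuming the name xbar (the name used in the text below) and repeating the weight vector so that the block is self-contained; deviations is an illustrative name.

```r
weight <- c(40, 60, 72, 57, 90, 95, 72)
xbar <- sum(weight) / length(weight)   # sample mean, about 69.43
deviations <- weight - xbar            # deviation of each weight from the mean
deviations
```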
Notice how xbar, which has length 1, is recycled and subtracted from each
element of weight. The squared deviations will be
> (weight - xbar)^2
[1] 866.040816 88.897959 6.612245 154.469388 423.183673 653.897959 6.612245
The sum of squared deviations and the standard deviation become
> sum((weight - xbar)^2)
[1] 2199.714
> sqrt(sum((weight - xbar)^2)/(length(weight) - 1))
[1] 19.1473
Of course, since R is a statistical program, such calculations are already built
into the program, and you get the same results just by entering
> mean(weight)
[1] 69.42857
> sd(weight)
[1] 19.1473
If you want to investigate the relation between weight and height, the first
idea is to plot one versus the other. This is done by
> plot(height,weight,pch=3)
and the resultant graph is a simple x-y plot shown in figure 5. In the R com-
mand, pch=3 (“plotting character") is for changing the plotting symbol.
2.8 Objects
During an R session, objects are created and stored by name. The command
> ls()
displays all currently-stored objects (workspace).
> ls()
[1] "bmi" "c" "d"
[4] "data" "dataTS" "DeltaCF"
[7] "DeltaCS" "DeltaGF" "DeltaGS"
[10] "DeltaHF" "DeltaHS" "external.regressors"
> help(sqrt) # or
> ? sqrt
You can also obtain help on features specified by special characters, but they
must be enclosed in single or double quotes (e.g. "[[")
> help("[[")
> help.start()
> ? help
2.10 Packages in R
R contains one or more libraries of packages. Packages contain various functions
and data sets for numerous purposes, e.g. the survival package, the genetics package,
the fda package, etc.
Some packages are part of the basic installation. Others can be downloaded
from CRAN.
To access all of the functions and data sets in a particular package, it must be
loaded into the workspace. For example, to load the fda package:
> library(fda)
Note that if you terminate your session and start a new session with the saved
workspace, you must load the packages again.
To check what packages are currently loaded into the workspace, use the com-
mand
> search()
> search()
[1] ".GlobalEnv" "package:parallel" "package:Rcmdr"
[4] "package:RcmdrMisc" "package:sandwich" "package:car"
[7] "package:fda" "package:Matrix" "package:splines"
[10] "package:stats" "package:graphics" "package:grDevices"
[13] "package:utils" "package:datasets" "package:methods"
[16] "Autoloads"
> detach("package:fda")
For the purposes of this session, a data set already stored in R will be used. To
access this data, you must first load the package containing the data. (R has many
packages containing various functions that can be used to analyse data, e.g. if
you want to analyse your data using splines, you need to load the splines package.)
In this example, the data is stored in the MASS package. This is loaded with the
command
> library(MASS)
Now you have access to all functions and data sets stored in this package.
We will work with the data set titled “whiteside”. To display the data:
> library(MASS)
> whiteside
Insul Temp Gas
1 Before -0.8 7.2
2 Before -0.7 6.9
3 Before 0.4 6.4
> whiteside$Temp
[1] -0.8 -0.7 0.4 2.5 2.9 3.2 3.6 3.9 4.2 4.3 5.4 6.0 6.0 6.0 6.2
[16] 6.3 6.9 7.0 7.4 7.5 7.5 7.6 8.0 8.5 9.1 10.2 -0.7 0.8 1.0 1.4
[31] 1.5 1.6 2.3 2.5 2.5 3.1 3.9 4.0 4.0 4.2 4.3 4.6 4.7 4.9 4.9
[46] 4.9 5.0 5.3 6.2 7.1 7.2 7.5 8.0 8.7 8.8 9.7
A plot of gas consumption versus temperature is now created as shown in Figure
6. The argument main= adds a title to the graph.
> plot(Gas ~ Temp, data=whiteside, pch=16, main="Gas consumption Whiteside")
You can produce separate graphs for gas consumption versus temperature before
insulation was used and after insulation was used as shown in figure 7. This
requires the use of xyplot() available in the lattice package.
> library(lattice) # Loads the lattice package
> ? xyplot # Gives more information on xyplot()
> xyplot(Gas ~ Temp | Insul, whiteside)
Figure 7: Graph showing temperatures and gas consumption before and after
insulation
Use the data.frame() command and name the data frame elasticband.
2.13 Exercises
2.13.1 Exercise 1
1. Create summary statistics for the elastic band data.
3. Use the help() command to find more information about the hist()
command.
5. The following data are on snow cover for Eurasia in the years 1970-1979.
year 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
snow.cover 6.5 12.0 14.9 10.0 10.7 7.9 21.9 12.5 14.5 9.2
(a) Enter the data into R. To save keystrokes, enter the successive years
as 1970:1979.
(b) Take the logarithm of snow cover.
(c) Plot snow cover versus year.
6. Display all objects in the workspace and then remove the data frame elas-
ticband.
2.13.2 Exercise 2
1. Assume that we have registered the height and weight for four people:
Heights in cm are 180, 165, 160, 193; weights in kg are 87, 58, 65, 100.
Make two vectors, height and weight, with the data. The body mass index
(BMI) is defined as
$$\mathrm{BMI} = \frac{\text{weight in kg}}{(\text{height in m})^2}.$$
Make a vector with the BMI values for the four people, and a vector with
the natural logarithm to the BMI values. Finally make a vector with the
weights for those people who have a BMI larger than 25.
2. In an experiment the dry weight has been measured for 8 plants grown
under certain conditions. The weights are given as 77, 93, 92, 68, 88, 75,
100. Create a vector called dry to hold the data and write a formula to
calculate its mean and standard deviation.
4. Assume that you are interested in cone-shaped structures, and have mea-
sured the height and radius of 6 cones. Make vectors with these values as
follows:
Recall that the volume of a cone with radius R and height H is given by
$\frac{1}{3}\pi R^2 H$. Make a vector with the volumes of the 6 cones.
5. Compute the mean, median and standard deviation of the cone volumes.
Compute also the mean of volume for the cones with a height less than
8.5.
3.1.1 Numeric
Decimal values are called numerics in R. It is the default computational data
type. If we assign a decimal value to a variable x as follows, x will be of numeric
type.
> x<-3.5
> x
[1] 3.5
> class(x)
[1] "numeric"
Notice that, if we assign an integer to a variable k, it is still saved as a numeric
value.
> k<-2
> k
[1] 2
> class(k)
[1] "numeric"
The fact that k is not an integer can be confirmed with the is.integer func-
tion.
> is.integer(k) # is k an integer?
[1] FALSE
3.1.2 Integer
In order to create an integer variable in R, we invoke the as.integer func-
tion. We can be assured that y is indeed an integer by applying the is.integer
function.
> y<-as.integer(3)
> y
[1] 3
> class(y)
[1] "integer"
> is.integer(y)
[1] TRUE
Using the as.integer function, we can coerce a numeric value into an integer.
And we can parse a string for decimal values in much the same way.
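A minimal sketch of both conversions follows; the input values 3.14 and "5.27" and the variable names are illustrative assumptions.

```r
from_numeric <- as.integer(3.14)    # numeric to integer: the fractional part is dropped
from_string <- as.integer("5.27")   # the string is parsed as a number, then truncated
from_numeric                        # 3
from_string                         # 5
```

Note that as.integer() truncates towards zero; it does not round.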
A string that does not represent a number cannot be parsed; the result is NA, with a warning:
> as.integer("Jane")
[1] NA
Warning message:
NAs introduced by coercion
3.1.3 Complex
A complex value in R is defined via the pure imaginary value i.
> z = 1 + 2i # create a complex number
> z # print the value of z
[1] 1+2i
> class(z) # print the class name of z
[1] "complex"
The following returns NaN with a warning rather than a complex result, as -1 is not a complex value.
> sqrt(-1) # square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
Instead, we have to use the complex value −1 + 0i.
> sqrt(-1+0i) # square root of -1+0i
[1] 0+1i
Alternatively, we can coerce -1 into a complex value.
> sqrt(as.complex(-1))
[1] 0+1i
3.1.4 Logical
A logical value is often created via comparison between variables. It is binary,
i.e. two possible values represented by TRUE and FALSE
> x = c(3, 7, 1, 2)
> x>2
[1] TRUE TRUE FALSE FALSE
> x<2
[1] FALSE FALSE TRUE FALSE
> x==2
[1] FALSE FALSE FALSE TRUE
> x<3
[1] FALSE FALSE TRUE TRUE
> which(x>2)
[1] 1 2
Another example
> x<-5
> y<-7
> z<-x>y
> z
[1] FALSE
> class(z)
[1] "logical"
For example
> u=TRUE
> v=FALSE
> u&v
[1] FALSE
> u|v
[1] TRUE
> !u
[1] FALSE
Note that there is a difference between operators that act on entries within a
vector and on the whole vector:
> a = c(TRUE,FALSE)
> b = c(FALSE,FALSE)
> a|b
[1] TRUE FALSE
> a[1]||b[1] # || compares single values only; since R 4.3.0, a||b on longer vectors is an error
[1] TRUE
> xor(a,b)
[1] TRUE FALSE
3.1.5 Character
A character object is used to represent string values in R. We convert objects
into character values with the as.character() function:
> x = as.character(3.14)
> x # print the character string
[1] "3.14"
> class(x) # print the class name of x
[1] "character"
> fname="Jane"
> lname="Aduda"
> paste(fname,lname)
[1] "Jane Aduda"
And to replace the first occurrence of the word "little" by another word "big"
in the string, we apply the sub function.
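A minimal sketch of that replacement follows, using base R's sub(), which replaces only the first match. The example sentence is an assumption made for illustration, and gsub() is shown only for comparison.

```r
sentence <- "Mary has a little lamb and a little dog."  # illustrative string
first_only <- sub("little", "big", sentence)    # replaces the first occurrence only
all_matches <- gsub("little", "big", sentence)  # replaces every occurrence
first_only
all_matches
```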
1. Vectors
2. Factors
3. Matrices
4. Arrays
5. Lists
6. Data frames
3.2.1 Vectors
Vectors are the simplest type of object in R. There are 3 main types of vectors:
1. Numeric vectors
2. Character vectors
3. Logical vectors
To set up a numeric vector x consisting of 5 numbers, 10.4, 5.6, 3.1, 6.4, 21.7,
use
> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
> x[4]
[1] 6.4
> y =c(x,0,x)
> y
[1] 10.4 5.6 3.1 6.4 21.7 0.0 10.4 5.6 3.1 6.4 21.7
> 2/x
[1] 0.1923077 0.3571429 0.6451613 0.3125000 0.0921659
> z=x+y
Warning message:
In x + y : longer object length is not a multiple of shorter object length
Some functions take vectors of values and produce results of the same length:
sin, cos, tan, asin, acos, atan, log, exp, ...
> cos(x)
[1] -0.5609843 0.7755659 -0.9991352 0.9931849 -0.9579148
> exp(x)
[1] 3.285963e+04 2.704264e+02 2.219795e+01 6.018450e+02 2.655769e+09
> sum(x)
[1] 47.2
> length(x)
[1] 5
> sum(x)/length(x)
[1] 9.44
> mean(x)
[1] 9.44
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
The colon operator (:) has higher priority than the arithmetic operators *, /, + and -, e.g. 2*1:10 is
equivalent to 2*(1:10).
> 2*1:10
[1] 2 4 6 8 10 12 14 16 18 20
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(from=1,to=10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(to=10,from=1)
[1] 1 2 3 4 5 6 7 8 9 10
We can also specify a step size (using by=value) or a length (using length=value)
for the sequence.
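A minimal sketch of both options follows; the end points and variable names are illustrative. The full argument name is length.out, and length=value works because R matches argument names partially.

```r
by_step <- seq(from = 0, to = 1, by = 0.25)           # fixed step size
by_length <- seq(from = 1, to = 10, length.out = 4)   # fixed number of points
by_step      # 0.00 0.25 0.50 0.75 1.00
by_length    # 1 4 7 10
```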
In some cases the entire contents of a vector may not be known. For example,
there could be missing data from a particular data set. A place can be reserved
for this by assigning it the special value NA. We can check for NA values in a
vector x using the command
> is.na(x)
This returns a logical vector the same length as x with a value TRUE if that
particular element is NA. For example
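The example data are not shown here, so the sketch below assumes a hypothetical vector w with two missing values:

```r
# Hypothetical vector with two missing values (not the data used in the notes)
w <- c(12, NA, 7, 3, NA, 9)
is.na(w)        # TRUE exactly where w is missing
sum(is.na(w))   # number of missing values
```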
We have already seen how to access single elements of a vector. Subsets of a vector
may also be selected using a similar approach.
This command stores the elements of the vector w that do NOT have the value
NA, into ind1.
Selects the first 3 elements of the vector w and stores them in the new vector ind2.
Using the - sign indicates that these elements should be excluded. This command
excludes the first 4 elements of w.
In this case only the 1st and 4th elements of w are excluded.
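The four commands described above are not reproduced in this copy. A sketch of what they could look like, assuming a hypothetical vector w (the names ex4 and ex14 are also made up for illustration):

```r
w <- c(12, NA, 7, 3, NA, 9)   # hypothetical data
ind1 <- w[!is.na(w)]          # elements that are NOT NA
ind2 <- w[1:3]                # first 3 elements
ex4  <- w[-(1:4)]             # everything except the first 4 elements
ex14 <- w[-c(1, 4)]           # everything except the 1st and 4th elements
```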
> x
[1] 10.4 5.6 3.1 6.4 21.7
> x[1]<-5
> x
[1] 5.0 5.6 3.1 6.4 21.7
In this case we have replaced the NA (missing) values in the vector w with the
value 0.
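A minimal sketch of such a replacement, again assuming a hypothetical vector w:

```r
w <- c(12, NA, 7, 3, NA, 9)   # hypothetical data
w[is.na(w)] <- 0              # assign 0 to every position where w is NA
w
```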
3.2.2 Factor
A factor is a special type of vector used to represent categorical data, e.g. gender,
social class, etc. It is a type of vector containing a set of numeric codes with
character-valued levels. Factor variables are stored internally as a numeric vector
with values 1, 2, ..., k, where k is the number of levels.
We can have either ordered or unordered factors. A factor with k levels is stored
internally as 2 items:
1. a vector of integer codes, one per observation, each between 1 and k
2. a character vector holding the k level labels
Example
Consider a survey that has data on 200 females and 300 males. If the
first 200 values are from females and the next 300 values are from males, one
way of representing this is to create a vector
1 female
2 male
Each category, i.e. female and male, is called a level of the factor. To determine
the levels of a factor the function levels() can be used:
> levels(gender)
[1] "female" "male"
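The construction of gender is not shown in this copy. One way to build it that matches the description (200 females followed by 300 males) is sketched below:

```r
# 200 females followed by 300 males, as in the example
gender <- factor(c(rep("female", 200), rep("male", 300)))
levels(gender)                  # "female" "male"
table(gender)                   # counts per level
as.integer(gender)[c(1, 201)]   # internal codes: 1 for female, 2 for male
```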
Example
Five people are asked to rate the performance of a product on a scale of 1-5,
with 1 representing very poor performance and 5 representing very good
performance. The following data were collected.
The first line creates a numeric vector containing the satisfaction levels of the 5
people. This is a categorical variable.
The second line creates a factor. The levels=1:5 argument indicates that there
are 5 levels of the factor.
Finally the last line sets the names of the levels to the specified character strings.
> sat
[1] 1 3 4 2 2
> fsat
[1] very poor average good poor poor
Levels: very poor poor average good very good
> levels(fsat)
[1] "very poor" "poor" "average" "good" "very good"
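The three lines described in this example can be reconstructed from the printed output. A sketch that should reproduce sat and fsat:

```r
sat <- c(1, 3, 4, 2, 2)              # ratings given by the 5 people
fsat <- factor(sat, levels = 1:5)    # factor with 5 possible levels
levels(fsat) <- c("very poor", "poor", "average", "good", "very good")
fsat
```

Levels that never occur in the data ("very good" here) are still kept, which is why levels=1:5 is given explicitly.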
3.3 Matrices
A matrix is a two-dimensional array of numbers. It has rows and columns and
is used for many purposes in statistics. In R, matrices are represented as vectors
with dimensions.
> A<-rnorm(15) # creates a vector of 15 standard normal random numbers
> A
[1] -0.2199310 -0.1558994 0.2503376 0.9532383 0.1044002 1.5693812
[7] 0.9335553 -0.6430651 -1.0046298 -0.3206042 0.8627723 0.1836219
[13] -0.9662111 -0.7285974 -0.2556030
> dim(A)<-c(5,3)
> A
[,1] [,2] [,3]
[1,] -0.2199310 1.5693812 0.8627723
[2,] -0.1558994 0.9335553 0.1836219
[3,] 0.2503376 -0.6430651 -0.9662111
[4,] 0.9532383 -1.0046298 -0.7285974
[5,] 0.1044002 -0.3206042 -0.2556030
Note that the storage is carried out by filling in the columns first, then the rows.
> B<-rnorm(15)
> matrix(B, nrow=5, ncol=3, byrow=T)
[,1] [,2] [,3]
[1,] 0.6133484 0.1548929 -0.09625237
[2,] -0.8585939 -2.0814610 -1.27274845
[3,] 0.2588589 0.6999445 -0.90617682
[4,] 0.4095178 0.7336874 0.24013239
[5,] -0.2409285 -0.3444688 -1.37569853
The byrow=T argument causes the matrix to be filled in row by row rather than
column by column.
Recall the last command and change byrow=T to byrow=F. Notice the difference
between the two outputs. This time the matrix is filled in column by
column.
Some useful functions for matrices include nrow(), ncol(), t(), rownames(),
colnames().
> nrow(A)
[1] 5
> rownames(A)<-c("a","b","c","d","e")
> A
[,1] [,2] [,3]
a -0.2199310 1.5693812 0.8627723
b -0.1558994 0.9335553 0.1836219
c 0.2503376 -0.6430651 -0.9662111
d 0.9532383 -1.0046298 -0.7285974
e 0.1044002 -0.3206042 -0.2556030
The t() function is the transposition function (rows become columns and vice
versa).
> t(A)
a b c d e
[1,] -0.2199310 -0.1558994 0.2503376 0.9532383 0.1044002
[2,] 1.5693812 0.9335553 -0.6430651 -1.0046298 -0.3206042
[3,] 0.8627723 0.1836219 -0.9662111 -0.7285974 -0.2556030
We can also merge vectors and matrices together, column-wise or row-wise, using
rbind() (add on rows) or cbind() (add on columns).
When using rbind() - if combining matrices with other matrices, the matrices
must have the same number of columns. If combining vectors with other
vectors or vectors with matrices, the vectors can have any length but will be
lengthened/shortened accordingly if of differing lengths.
When using cbind() - if combining matrices with other matrices, the matrices
must have the same number of rows. If combining vectors with other vectors
or vectors with matrices, the vectors can have any length but will be
lengthened/shortened accordingly if of differing lengths.
> rbind(A,C)
[,1] [,2] [,3]
[1,] -0.2199310 1.5693812 0.8627723
[2,] -0.1558994 0.9335553 0.1836219
[3,] 0.2503376 -0.6430651 -0.9662111
[4,] 0.9532383 -1.0046298 -0.7285974
[5,] 0.1044002 -0.3206042 -0.2556030
[6,] 0.6343716 -1.7484903 0.5230774
[7,] -1.8148107 1.5492271 -0.5589112
[8,] -1.8660952 -1.1374160 0.1503507
[9,] -0.7046708 -0.4577602 -0.9357541
[10,] -1.0268098 -1.1896584 -0.1886360
> cbind(A,C)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.2199310 1.5693812 0.8627723 0.6343716 -1.7484903 0.5230774
[2,] -0.1558994 0.9335553 0.1836219 -1.8148107 1.5492271 -0.5589112
[3,] 0.2503376 -0.6430651 -0.9662111 -1.8660952 -1.1374160 0.1503507
[4,] 0.9532383 -1.0046298 -0.7285974 -0.7046708 -0.4577602 -0.9357541
[5,] 0.1044002 -0.3206042 -0.2556030 -1.0268098 -1.1896584 -0.1886360
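The dimension rules for rbind() and cbind() are easier to check on small matrices. The sketch below uses two hypothetical 2 x 3 matrices, M and N, rather than the A and C from the session:

```r
M <- matrix(1:6, nrow = 2)     # 2 x 3 (hypothetical small matrices)
N <- matrix(7:12, nrow = 2)    # 2 x 3
dim(rbind(M, N))               # 4 3: rows stacked, column counts must match
dim(cbind(M, N))               # 2 6: columns joined, row counts must match
rbind(M, 0)                    # the single value 0 is recycled to fill a whole row
```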
For matrix multiplication, use the %*% operator:
> D<-matrix(1:9, nrow=3,ncol=3)
> D
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> F
[,1] [,2] [,3]
[1,] 138 174 210
[2,] 171 216 261
[3,] 204 258 312
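The definition of the second matrix and the multiplication itself are missing from this copy. As an assumption, E <- matrix(10:18, nrow=3, ncol=3) is one choice that reproduces the printed F exactly when multiplied by D:

```r
D <- matrix(1:9, nrow = 3, ncol = 3)
E <- matrix(10:18, nrow = 3, ncol = 3)   # assumed; chosen to reproduce the printed F
F <- D %*% E                             # matrix product (this masks the shorthand F for FALSE)
F
D * E                                    # by contrast, * multiplies element by element
```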
> X[,c(2,4)]
B D
a 0.67723505 -0.8757583
b -0.95659291 -0.8038683
c 1.69335129 -0.3667330
d 0.25310854 -1.8813818
e 0.03015866 -1.2360666
> X[,c(2:4)]
B C D
a 0.67723505 0.2404820 -0.8757583
b -0.95659291 1.2656448 -0.8038683
c 1.69335129 1.9689660 -0.3667330
d 0.25310854 -0.3830079 -1.8813818
e 0.03015866 -1.8363220 -1.2360666
> X[2:4,]
A B C D E F
b -0.08013522 -0.9565929 1.2656448 -0.8038683 0.92483273 0.1256199
c -0.25061175 1.6933513 1.9689660 -0.3667330 -0.32647029 -1.6338905
d -0.22283307 0.2531085 -0.3830079 -1.8813818 0.04918283 -1.4148502
3.4 Arrays
An array can have multiple dimensions. A matrix is a special case of an array (a
2-d array).
We can construct an array from a vector z containing 300 elements using the
dim() function (as for matrices).
This creates a 3-d array with dimensions 10*6*5 (like storing 5 matrices, each
with 10 rows and 6 columns).
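A minimal sketch of this construction. The contents of z are not given, so random numbers are assumed here:

```r
z <- rnorm(300)          # any vector of 300 elements will do
dim(z) <- c(10, 6, 5)    # reshape in place into a 10 x 6 x 5 array
dim(z)
z[1, 2, 3]               # row 1, column 2 of the 3rd "matrix"
dim(z[, , 1])            # the first 10 x 6 slice
```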
3.5 Lists
A list is an ordered collection of components.
The components may be arbitrary R objects (data frames, vectors, lists, etc.).
They are constructed using the function list().
$wife
[1] "Mary"
$no.children
[1] 3
$child.ages
[1] 4 7 9
Each component of the list is given a name (i.e. name, wife, no.children,
child.ages).
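The list() calls are not shown in this copy. The sketch below reconstructs them from the printed components: L1 with names, and L2 holding the same values without names.

```r
# Named components, reconstructed from the printed output of L1
L1 <- list(name = "Joseph", wife = "Mary",
           no.children = 3, child.ages = c(4, 7, 9))
# The same components without names (printed as [[1]], [[2]], ...)
L2 <- list("Joseph", "Mary", 3, c(4, 7, 9))
names(L1)
names(L2)    # NULL, because no names were given
```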
[[2]]
[1] "Mary"
[[3]]
[1] 3
[[4]]
[1] 4 7 9
R uses single bracket notation for sublists and double bracket notation for
individual components.
> L1$name
[1] "Joseph"
> L1[["name"]]
[1] "Joseph"
> L1[[1]]
[1] "Joseph"
> L1[1]
$name
[1] "Joseph"
> L1[2]
$wife
[1] "Mary"
> names(L1)
[1] "name" "wife" "no.children" "child.ages"
> names(L2)
NULL
We can set the names for the list components after the list has been created.
$name.wife
[1] "Mary"
$no.child
[1] 3
$child.age
[1] 4 7 9
> names(L2)
[1] "name.hus" "name.wife" "no.child" "child.age"
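A sketch of how the names could be set after creation, using the names printed above. The concatenation with c() is an assumption about how a combined eight-component list might be produced; L3 is a hypothetical name.

```r
L1 <- list(name = "Joseph", wife = "Mary", no.children = 3, child.ages = c(4, 7, 9))
L2 <- list("Joseph", "Mary", 3, c(4, 7, 9))
names(L2) <- c("name.hus", "name.wife", "no.child", "child.age")
L3 <- c(L1, L2)      # concatenating two lists gives one list of 8 components
names(L3)
```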
[1] "Joseph"
$wife
[1] "Mary"
$no.children
[1] 3
$child.ages
[1] 4 7 9
$name.hus
[1] "Joseph"
$name.wife
[1] "Mary"
$no.child
[1] 3
$child.age
[1] 4 7 9
A data frame comprises a list of vectors and/or factors of the same length with a
unique set of row names.
In the previous exercises you created a list called mylist. To convert this to a
data frame:
> new.data <- as.data.frame(mylist)
> new.data
year mean_weight Gender mean_height
1 1980 71.5 M 179.3
2 1988 72.1 M 179.9
3 1996 73.7 F 180.5
4 1998 74.3 F 180.1
5 2000 75.2 M 180.3
6 2002 74.7 M 180.4
As with lists, individual components (columns) can be accessed using the $
notation.
> new.data$year
[1] 1980 1988 1996 1998 2000 2002
> new.data[3,2]
[1] 73.7
> new.data[,2]
[1] 71.5 72.1 73.7 74.3 75.2 74.7
> new.data[3,]
year mean_weight Gender mean_height
3 1996 73.7 F 180.5
We can select all data for cases that satisfy some criterion, such as the data for all
males.
To select only the weight and height of females born after 1996 use:
Replacing the & with a | selects the rows that satisfy EITHER condition.
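The selection commands are missing from this copy. A sketch of logical row selection follows, with the data frame rebuilt from the values printed earlier so that the block runs on its own (males, f96 and either are names chosen for illustration):

```r
# The data frame printed above, rebuilt so the sketch is self-contained
new.data <- data.frame(year = c(1980, 1988, 1996, 1998, 2000, 2002),
                       mean_weight = c(71.5, 72.1, 73.7, 74.3, 75.2, 74.7),
                       Gender = c("M", "M", "F", "F", "M", "M"),
                       mean_height = c(179.3, 179.9, 180.5, 180.1, 180.3, 180.4))
males <- new.data[new.data$Gender == "M", ]            # all columns, male rows only
f96 <- new.data[new.data$Gender == "F" & new.data$year > 1996,
                c("mean_weight", "mean_height")]       # both conditions must hold
either <- new.data[new.data$Gender == "F" | new.data$year > 1996, ]  # either condition
```

The comma before the closing bracket matters: the condition selects rows, and leaving the column position empty keeps every column.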
We can shorten this command by attaching the data so that we don’t have to
use $
> attach(new.data)
The following objects are masked _by_ .GlobalEnv:
4 74.3 180.1
5 75.2 180.3
6 74.7 180.4
> detach(new.data)
> search()
The apply family includes lapply, sapply, tapply and apply:
– lapply: takes any structure and gives a list of results (hence the ‘l’)
– sapply: like lapply, but tries to simplify the result to a vector or matrix
if possible (hence the ‘s’)
– tapply: allows you to create tables (hence the ‘t’) of values from subgroups
defined by one or more factors
– apply: applies a function over the rows or columns of a matrix
Let us use a built-in data set called trees, which gives girth, height and volume
measurements for 31 trees.
> trees
$Height
[1] 76
$Volume
[1] 30.17097
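The $Height and $Volume values above are the column means of trees. The calls that produced them are not shown; a sketch that should give the same numbers with lapply, together with the sapply equivalent:

```r
res_list <- lapply(trees, mean)   # a list with one mean per column
res_vec  <- sapply(trees, mean)   # the same values simplified to a named vector
res_list$Height
res_vec
```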
For tapply, let us use the Cars93 dataset in the MASS package. An excerpt of
the data is shown below
> library(MASS)
> Cars93
Manufacturer Model Type Min.Price Price Max.Price MPG.city
1 Acura Integra Small 12.9 15.9 18.8 25
2 Acura Legend Midsize 29.2 33.9 38.7 18
3 Audi 90 Compact 25.9 29.1 32.3 20
4 Audi 100 Midsize 30.8 37.7 44.6 19
5 BMW 535i Midsize 23.7 30.0 36.2 22
6 Buick Century Midsize 14.2 15.7 17.3 22
7 Buick LeSabre Large 19.9 20.8 21.7 19
8 Buick Roadmaster Large 22.6 23.7 24.9 16
9 Buick Riviera Midsize 26.3 26.3 26.3 19
Manufacturer is a factor
> is.factor(Cars93$Manufacturer)
[1] TRUE
and we want to calculate the average price of a car for each manufacturer.
We can also calculate the average price of a car for each type:
> is.factor(Cars93$Type)
[1] TRUE
> tapply(Cars93$Price, Cars93$Type, mean)
Compact Large Midsize Small Sporty Van
18.21250 24.30000 27.21818 10.16667 19.39286 19.10000
For apply, let's create a matrix and sum the rows, then sum the columns, and then
get the mean of each column.
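The matrix used in the notes is not shown, so the sketch below assumes a small 2 x 3 matrix m:

```r
m <- matrix(1:6, nrow = 2)   # hypothetical 2 x 3 matrix
apply(m, 1, sum)             # row sums:     9 12
apply(m, 2, sum)             # column sums:  3 7 11
apply(m, 2, mean)            # column means: 1.5 3.5 5.5
```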
– read.table() - can be used to read data frames from formatted text files
– read.csv() can be used to read data frames from comma-separated values
(.csv) files.
– When reading from Excel files, the simplest method is to save each worksheet
separately as a .csv file and use read.csv() on each.
Save the dataset "CDA data" in your working folder so that the path used
when importing the data is short.
The row.names=FALSE argument ensures that the row numbers are not saved
in the file.
The sep=" " argument ensures that the output is separated by a space. This can
be changed using sep=",", and the output will then be separated by commas.
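The export command itself is missing here. A round-trip sketch with write.table() and the read functions follows, using a made-up data frame and temporary files in place of a real path:

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.5, 1.75))     # hypothetical data
f <- tempfile(fileext = ".txt")                           # temporary file standing in for a real path
write.table(df, file = f, row.names = FALSE, sep = " ")   # space-separated, no row numbers
back <- read.table(f, header = TRUE)                      # first line holds the column names

f2 <- tempfile(fileext = ".csv")
write.table(df, file = f2, row.names = FALSE, sep = ",")  # comma-separated version
back2 <- read.csv(f2)
```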
3.9 Exercises
3.9.1 Exercise 3
1. Create the following vectors
Patient 1 2 3 4 5 6
Pain level 0 3 1 2 1 2
3.9.2 Exercise 4
1. Construct a matrix A with values 10, 20, 30, 50 in column 1, values 1, 4,
2, 3 in column 2 and values 15, 11, 19, 5 in column 3, i.e. a 4 × 3 matrix.
Also construct a vector B with values 2.5, 3.5, 1.75. Check your results to
ensure that they are correct.
2. Combine A and B into a new matrix C using cbind().
3. Combine A and B into a new matrix H using rbind().
4. Determine the dimensions of C and H using dim() function.
5. Calculate the following:
\[
\begin{pmatrix} 1 & 4 & 3 \\ 0 & -2 & 8 \end{pmatrix}
\times
\begin{pmatrix} 1 & 9 \\ 2 & 17 \\ -6 & 3 \end{pmatrix}
\]
3.9.3 Exercise 5
1. Create 4 vectors Year, mean_weight, Gender and mean_height with the
following entries:
2. Create a list called mylist consisting of the above vectors giving each com-
ponent of the list a name.
3.9.4 Exercise 6
1. Create a data frame called club.points with the following data.
3. Store the data for females only into a data frame called fpoints.
4. The age for Jerry Burke was entered incorrectly. Change his age to 28.
6. Extract the data for people who have more than 100 points and are over the
age of 30.
3.9.5 Exercise 7
1. Download the dataset at http://www.contextures.com/xlSampleData01.html
and save it in your work folder.
4. Print out the total number of pencils, pens, pen sets and binders sold.
5. Find the mean, standard deviation, minimum and maximum total sales
for each region using the smallest number of commands possible.
4 Plotting
– For simple plotting, use plot, hist, pairs, boxplot,...
– To add to existing plots, use points, lines, abline, legend, title,
mtext,...
– For interacting with graphics, use locator, identify
– For three dimensional data, use contour, image, persp, ...
– To see the many possibilities that R offers, see
> demo(graphics)
Figure 10: Plot of Cars93 Data with labels, lines, points, texts and legend
levels(Cars93$Origin)
[1] "USA" "non-USA"
4.3 Histograms
par(mfrow =c(2,2))
# To create a histogram of the car weights from the Cars93 data set
hist(Cars93$Weight, xlab="Weight", main="Histogram of Weight", col="red")
4.4 Boxplot
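par(mfrow =c(2,2))
boxplot(Cars93$Weight)
boxplot(Cars93$EngineSize)
boxplot(Cars93$Weight ~ Cars93$Origin)
# Split the weights by origin (assumed definitions; they are not shown in the source)
USA.weight <- Cars93$Weight[Cars93$Origin == "USA"]
nonUSA.weight <- Cars93$Weight[Cars93$Origin == "non-USA"]
boxplot(USA.weight, nonUSA.weight,names=c("USA", "non-USA"))
par(mfrow=c(1,1))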
For a normal distribution, we want all of the points to lie approximately on a
straight line (along the 45° dotted line).
qqnorm(Cars93$Weight)
qqline(Cars93$Weight)
To plot a 3-D pie chart, use the pie3D() function in the plotrix package, which
provides 3-D exploded pie charts:
library(plotrix)
pie3D(Type.freq, labels = paste(names(Type.freq), pct,sep=" "),radius = 0.8,
col=rainbow(length(Type.freq)),explode=0.05,
main="Pie Chart of Car Types")
? volcano
data(volcano)
x <- 10*(1:nrow(volcano))
y <- 10*(1:ncol(volcano))
# Creates a 2-D image of x and y co-ordinates.
image(x, y, volcano, col = terrain.colors(100),
axes = FALSE)
# Adds a title.
title(main = "Maunga Whau Volcano", font.main = 4)
library(lattice)
data(volcano)
dim(volcano)
# Creates a data frame from all combinations of the
# supplied vectors or factors.
vdat <- expand.grid(x = x, y = y)
vdat$v <- as.vector(volcano)
wireframe(v ~ x*y, vdat, drape=TRUE, col.regions = rainbow(100))
4.8 Exercises
4.8.1 Exercise 8
1. Create a vector x of the values from 1 to 20.
to fit a linear regression model. Add the estimated regression line to the
current plot and make it the colour blue. Write the equation of the line.
7. Extract the values of the residuals using resids <- resid(fm). Check
that the residuals are normally distributed by creating a Q-Q plot.
8. The airquality data set in the datasets package has columns Ozone, Solar.R,
Wind, Temp, Month and Day. Plot Ozone against Solar.R for each of
THREE temperature ranges and each of THREE wind ranges. (Hint:
Use coplot.)* Difficult.
R has many built-in functions, and you can access many more by installing new
packages, so there is no doubt you already use functions. We have already seen
functions in R, e.g.
mean(x)
sd(x)
plot(x, y, ...)
lm(y ~ x, ...)
Functions have a name and a list of arguments or input objects. For example,
the argument to the function mean() is the vector x.
Functions also have a list of output objects, i.e. objects that are returned once
the function has been run (called).
Functions are typically written if we need to compute the same thing for several
data sets and what we want to calculate is not already implemented in the
available software.
Inside the parentheses you list the input objects required and decide what to
call them.
The commands occur inside the { }. This makes the function body.
The name of whatever output you want goes at the end of the function.
The procedure for writing any other functions is similar, involving three key
steps:
Whatever value is input for x will be squared and the result (output) printed to
the screen.
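The function definition being described is not reproduced in this copy. A minimal sketch of a squaring function follows; the name sq is hypothetical.

```r
# General pattern: name <- function(inputs) { commands; output }
sq <- function(x) {   # "sq" is a hypothetical name
  x^2                 # the last evaluated expression is the value returned
}
sq(3)
sq(1:3)               # works element by element on a vector
```

When called at the prompt without assignment, the returned value is printed to the screen.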
The values input for a1, a2, (a3) will be squared, summed and the square root
of the sum calculated and stored in x.
The return command specifies what the function returns, here the value of x.
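A sketch of a function matching this description. The name root.sum.sq is made up, and giving a3 a default of 0 is an assumption based on the parentheses around a3 in the text.

```r
# "root.sum.sq" is a hypothetical name; a3 defaults to 0 so it can be left out
root.sum.sq <- function(a1, a2, a3 = 0) {
  x <- sqrt(a1^2 + a2^2 + a3^2)   # square, sum, then take the square root
  return(x)                       # the value of x is what the function returns
}
root.sum.sq(3, 4)      # 5
root.sum.sq(2, 3, 6)   # 7
```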
Name matching happens first, then positional matching is used for any
unmatched arguments.
If a value for the argument pow is not specified in the function call, a value of 2
is used.
mypower(4)
[1] 16
mypower(4, 3)
[1] 64
mypower(pow=5, x=2)
[1] 32
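The definition of mypower is not shown in this copy. A sketch consistent with the three calls above, assuming arguments x and pow with pow defaulting to 2:

```r
mypower <- function(x, pow = 2) {   # pow defaults to 2
  x^pow
}
mypower(4)                # 16, default power used
mypower(4, 3)             # 64, matched by position
mypower(pow = 5, x = 2)   # 32, matched by name, so order does not matter
mypower(pow = 3, 4)       # 64: pow is matched by name first, then 4 goes to x
```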
If we have a function which performs multiple tasks and therefore has multiple
results to report, then we have to include a return statement inside the function
that returns them together, for example as a list, in order to see all the results.
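One common way to report several results is to return them in a named list. The function below is a hypothetical illustration, not the example from the notes:

```r
# Hypothetical function with two results, returned together in a named list
summarise.vec <- function(v) {
  m <- mean(v)
  s <- sd(v)
  return(list(mean = m, sd = s))
}
out <- summarise.vec(c(2, 4, 6))
out$mean    # 4
out$sd      # 2
```

Each result can then be accessed separately with $ or with double brackets, e.g. out[[1]].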
5.2 if Statement
Often, you want to make choices and take action dependent on a certain value.
If this condition is true, then carry out a certain task. Logical operators are used
as the conditions in the if statement.
Syntax of if statement
if (test_expression) {
statement
}
[Flowchart: start; evaluate the test expression; if TRUE, execute the if statement; if FALSE, do nothing; stop]
If the test_expression is TRUE, the statement gets executed. But if it’s FALSE,
nothing happens.
x<-5
if(x > 0){
print("Positive number")
}
[1] "Positive number"
if (test_expression) {
statement1
} else {
statement2
}
[Flowchart: start; evaluate the test expression; if TRUE, execute statement1; if FALSE, execute statement2; stop]
x <- -5
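The statement that followed x <- -5 is not shown in this copy. A minimal if...else sketch in the style of the earlier example (the variable msg and the second message are assumptions):

```r
x <- -5
if (x > 0) {
  msg <- "Positive number"
} else {
  msg <- "Not a positive number"   # this branch runs because x > 0 is FALSE
}
print(msg)
```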
The function below demonstrates the use of && in the condition. This means
that both conditions must be met before a value of TRUE is returned.
com3 <- function(number)
{
if ( (number > 0) && (number < 10) )
{
cat(number,"is between 0 and 10\n")
}
}
> com3(8)
8 is between 0 and 10
The syntax is
ifelse( condition, true expr, false expr )
If condition == TRUE, the true expr is carried out. If
condition == FALSE, the false expr is carried out.
Example
x <- rnorm(20, mean=15, sd=5)
ifelse(x >= 17, sqrt(x), NA)
[1] NA NA NA NA NA NA NA NA
[9] NA NA NA NA NA NA 4.603291 NA
[17] NA 4.387977 NA 4.801747
for(i in 1:5){
print(sqrt(i))
}
[1] 1
[1] 1.414214
[1] 1.732051
[1] 2
[1] 2.236068
n <- 20
p <- 5
value <- vector(mode="numeric", length=n)
rand.nums <- matrix(rnorm(n*p), nrow=n)
for(i in 1:length(value)){
value[i] <- max(rand.nums[i,])
print(sum(value))
}
– Then create a numeric vector (of zeros) called value with length 20
– The for loop performs 20 loops and stores the maximum value from each
row of rand.nums into position i of the vector value. The sum of the
current numbers in value is also printed to the screen.
[1] 15.59315
[1] 17.79503
[1] 19.18108
[1] 19.29292
[1] 19.72578
[1] 20.78777
[1] 23.7152
[1] 26.09935
– See the value of the vector value now
> value
[1] -0.1229413 -0.1874345 2.1021781 1.8068138 1.4860620 1.2272390
[7] 1.0060012 1.7117409 1.4911224 1.3217960 1.8077148 1.0744179
[13] 0.8684422 2.2018768 1.3860512 0.1118424 0.4328577 1.0619847
[19] 2.9274367 2.3841454
Another example
u1 <- rnorm(30) # create a vector filled with random normal values
print("This loop calculates the square of the first 10 elements of vector u1")
[1] "This loop calculates the square of the first 10 elements of vector u1"
usq<-0
for(i in 1:10)
{
usq[i]<-u1[i]*u1[i] # i-th element of u1 squared into i-th position of usq
print(usq[i])
}
[1] 0.01080562
[1] 0.1109192
[1] 0.2443853
[1] 0.07218485
[1] 4.803868
[1] 1.961641
[1] 1.338433
[1] 0.0007863216
[1] 0.7498973
[1] 0.05495671
> print(i)
[1] 10
for loops and multiply nested for loops are generally avoided when possible in
R as they can be quite slow.
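As an illustration, the squaring loop from the previous example can be written as a single vectorised expression. The seed is set only to make the sketch reproducible; usq_loop and usq_vec are names chosen here.

```r
set.seed(1)                 # seed chosen only to make the sketch reproducible
u1 <- rnorm(30)
usq_loop <- numeric(10)
for (i in 1:10) usq_loop[i] <- u1[i] * u1[i]   # element by element
usq_vec <- u1[1:10]^2                          # the whole slice in one vectorised call
all.equal(usq_loop, usq_vec)
```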
while (condition){
command
command
}
niter <- 0
num <- sample(1:100, 1)
while(num != 20) {
num <- sample(1:100, 1)
niter <- niter + 1
}
niter
Another example
i <- 0
while (i < 4) {
i <- i+1
print (i)
}
[1] 1
[1] 2
[1] 3
[1] 4
The break statement completely terminates a loop. Useful if you want a loop
to end if an error is found.
The repeat loop uses next and break. The only way to end this type of loop is
to use the break statement.
i <- 0
repeat {
i <- i+1
print (i)
if (i == 4) break
}
[1] 1
[1] 2
[1] 3
[1] 4
If no break is given, the loop runs forever!
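The next statement is mentioned above but not demonstrated. A small sketch contrasting it with break inside a for loop (the vector name odd is arbitrary):

```r
odd <- c()
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers and move to the next iteration
  if (i > 7) break        # leave the loop completely
  odd <- c(odd, i)
}
odd    # 1 3 5 7
```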
apply()
apply(data, margin, function)
> a <- matrix (1:10 , nrow =2)
> apply (a ,1, mean ) # 1 = by rows
[1] 5 6
> apply (a ,2, mean ) # 2 = by columns
[1] 1.5 3.5 5.5 7.5 9.5
lapply() and sapply()
> a <- matrix (2:11 , nrow =2)
> b <- matrix (1:10 , nrow =2)
> c <- list (a,b)
> lapply (c, mean )
[[1]]
[1] 6.5
[[2]]
[1] 5.5
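The sapply() counterpart is not shown here. Applied to the same two matrices it should return a plain vector instead of a list; the list is named lst below only to avoid masking the function c().

```r
a <- matrix(2:11, nrow = 2)
b <- matrix(1:10, nrow = 2)
lst <- list(a, b)          # named lst here to avoid masking the function c()
sapply(lst, mean)          # 6.5 5.5 as a numeric vector
```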
mapply()
Like sapply() but applies the function over the first elements of each argument,
then the second elements, and so on
[[2]]
[1] 3.141593 3.141593
[[3]]
[1] 3.141593
# equivalent to:
rep (pi , 3)
rep (pi , 2)
rep (pi , 1)
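The mapply() call that produced this output is missing from this copy. Judging from the "equivalent to" lines, it was probably of the following form, which is offered here as an assumption:

```r
res <- mapply(rep, pi, 3:1)   # calls rep(pi, 3), rep(pi, 2), rep(pi, 1) in turn
res
lengths(res)                  # 3 2 1
```

Because the three results have different lengths, they cannot be simplified and are returned as a list.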
tapply()
Run a function on each group of values specified by a factor. Requires a vector,
factor and function.
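A small sketch with made-up data, showing the three required pieces (vector, factor, function):

```r
# Hypothetical data: four measurements in two groups
values <- c(10, 20, 30, 40)
group  <- factor(c("a", "b", "a", "b"))
grp_means <- tapply(values, group, mean)   # mean per group: a = 20, b = 30
grp_means
```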
5.6 Exercises
5.6.1 Exercise 9
1. Write a function that when passed a number, returns the number squared,
the number cubed, and the square root of the number. Also access each
result separately by using the list indices
2. Write a function that when passed a numeric vector, prints the value of
the mean and standard deviation to the screen and creates a histogram of
the data.
4. For each of the following code sequences, predict the result. Then do the
computation:
5. Add up all the numbers from 1 to 100 in two different ways: using a for
loop and using sum.
6. Create a vector x <- seq(0, 1, 0.05). Plot x versus x and use type="l".
Label the y-axis "y". Add the lines x versus x^j where j can have
values 3 to 5 using either a for loop or a while loop.
– What is the level of uncertainty associated with our estimate of the mean
value? (Confidence interval) (Hypothesis test)
Researchers retain or reject hypotheses based on measurements of observed samples.
The decision is often based on a statistical mechanism called hypothesis
testing.
A type I error (also known as a “false positive”) is the mistake of rejecting
a null hypothesis when the null hypothesis is true. The probability of committing
a type I error is called the significance level of the hypothesis test, and
is denoted by the Greek letter α. It occurs when we observe a difference
when in truth there is none (or, more specifically, no statistically significant
difference).
Decision        | Truth: True H0 | Truth: False H0
Reject          | Type I         | Accurate
Fail to reject  | Accurate       | Type II
To ensure that our analysis is correct we need to check for outliers in the data
and we also need to check whether the data are normally distributed or not.
Graphical methods are often used to check that the data being analysed are normally
distributed. We can use a histogram, box plot, normal probability (Q-Q)
plot, etc.
When the sample size is sufficiently large, we can test our hypothesis by conducting
a z-test using the standard normal distribution. We conduct a one-sample
z-test when we want to test whether the mean of the population (from
which we have a random sample) is equal to the hypothesized value.
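We test the null hypothesis
\[ H_0 : \mu = \mu_0 \]
against the alternative
\[ H_1 : \mu \neq \mu_0 \]
(two-sided), or against
\[ H_1 : \mu > \mu_0 \quad \text{or} \quad H_1 : \mu < \mu_0 \]
(one-sided).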
We calculate the critical value for the α significance level by qnorm(1 - α/2) for
a two-sided test, and qnorm(1 - α) or qnorm(α) for a one-sided test, depending
on the direction of the prior knowledge.
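A worked sketch of a two-sided one-sample z-test. All numbers are hypothetical, the population standard deviation sigma is assumed known, and the test statistic uses the standard formula z = (xbar - mu0) / (sigma / sqrt(n)), which is not stated in the text above.

```r
# Hypothetical numbers: n = 36 observations with sample mean 52,
# testing mu0 = 50 with known population standard deviation 6
xbar <- 52; mu0 <- 50; sigma <- 6; n <- 36; alpha <- 0.05
z <- (xbar - mu0) / (sigma / sqrt(n))   # test statistic
crit <- qnorm(1 - alpha / 2)            # two-sided critical value, about 1.96
p.value <- 2 * pnorm(-abs(z))           # two-sided p-value
abs(z) > crit                           # TRUE means H0 is rejected at level alpha
```

For a one-sided alternative, the critical value would be qnorm(1 - alpha) for H1: µ > µ0 or qnorm(alpha) for H1: µ < µ0.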