
INTRODUCTION TO COMPUTER INTERACTIVE STATISTICS

LECTURE NOTES

JANE ADUDA

July 6, 2016

List of Figures
1 Uniform and Normal distribution . . . . . . . . . . . . . . . . 4
2 Bimodal distribution . . . . . . . . . . . . . . . . . . . . . . . 5
3 Written PDF graphic file from R . . . . . . . . . . . . . . . . . 16
4 Screen shot of Rcmdr window . . . . . . . . . . . . . . . . . . 17
5 Graph of relationship between height and weight . . . . . . . . 22
6 Temperature vs gas consumption . . . . . . . . . . . . . . . . . 26
7 Graph showing temperatures and gas consumption before and
after insulation . . . . . . . . . . . . . . . . . . . . . . . . . . 27
8 Various possible plots . . . . . . . . . . . . . . . . . . . . . . . 64
9 Plot of Cars93 Data . . . . . . . . . . . . . . . . . . . . . . . . 65
10 Plot of Cars93 Data with labels, lines, points, texts and legend . 66
11 Adding regression lines . . . . . . . . . . . . . . . . . . . . . . 67
12 Multiple Histograms for Cars93 data . . . . . . . . . . . . . . . 68
13 Boxplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
14 Q-Q plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
15 Pie chart representing car types . . . . . . . . . . . . . . . . . 71
16 3-D Pie chart in R . . . . . . . . . . . . . . . . . . . . . . . . . 72
17 A 2-D plot of volcano data in R . . . . . . . . . . . . . . . . . 73
18 3-D plot of Volcano data . . . . . . . . . . . . . . . . . . . . . 73

Computer Interactive statistics Jane A. Aduda

List of Tables
1 Measurement of Variables . . . . . . . . . . . . . . . . . . . . 10
2 Types of errors . . . . . . . . . . . . . . . . . . . . . . . . . . 93

Contents
List of Figures i

List of Tables i

Course Outline ii

1 Basic concepts of Modern statistics 1


1.1 Some Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Central Tendency and Dispersion . . . . . . . . . . . . . . . . 2
1.4 The shape of distributions . . . . . . . . . . . . . . . . . . . . 2
1.5 Inferential Statistics . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Types of Variables & Measurement Scales . . . . . . . . . . . . 8

2 Getting R and Getting Started 11


2.1 Starting R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 How R works . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Saving your work . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 The Rcmdr GUI . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 An overgrown calculator . . . . . . . . . . . . . . . . . . . . . 16
2.6 Assignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 Vectorized arithmetic . . . . . . . . . . . . . . . . . . . . . . . 19
2.8 Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.9 Getting help in R . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10 Packages in R . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.11 Interactive session . . . . . . . . . . . . . . . . . . . . . . . . 24
2.12 Simple Data Entry . . . . . . . . . . . . . . . . . . . . . . . . 27
2.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.13.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . 28
2.13.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . 29


3 Data types, Objects and simple manipulations 30


3.1 Basic Data types in R . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.1 Numeric . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Integer . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.3 Complex . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.4 Logical . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.5 Character . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Data objects in R . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Data frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Reading Data from Files . . . . . . . . . . . . . . . . . . . . . 57
3.8 Writing Data to Files . . . . . . . . . . . . . . . . . . . . . . . 58
3.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.9.1 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.9.2 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.9.3 Exercise 5 . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.9.4 Exercise 6 . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.9.5 Exercise 7 . . . . . . . . . . . . . . . . . . . . . . . . . 62

4 Plotting 63
4.1 Adding titles, lines, points . . . . . . . . . . . . . . . . . . . . 64
4.2 Adding regression lines . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Normal probability (Q-Q) plots . . . . . . . . . . . . . . . . . 68
4.6 Pie Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7 2-D and 3-D plots . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.8.1 Exercise 8 . . . . . . . . . . . . . . . . . . . . . . . . . 72

5 Looping and Functions 75


5.1 User Defined Functions (UDF) in R . . . . . . . . . . . . . . . 75
5.2 if Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 ifelse Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 84


5.4 for Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84


5.4.1 Nested for loops . . . . . . . . . . . . . . . . . . . . . 88
5.5 while loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5.1 next, break, repeat Statements . . . . . . . . . . . . . . 89
5.5.2 The apply() commands . . . . . . . . . . . . . . . . . 90
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.6.1 Exercise 9 . . . . . . . . . . . . . . . . . . . . . . . . . 91

6 One Sample Hypothesis Tests 93


STA 2202 INTRODUCTION TO COMPUTER INTERACTIVE STATISTICS
(45 Contact Hours)

Pre-Requisites:
STA 2102: Information Technology for Statistics & STA 2107: Database Man-
agement

Co-Requisites:
STA 2200 Probability and Statistics II
(a) Course Purpose
To give students hands on experience with statistical software to perform
data exploration and analysis as well as to write and run simple programs
that can be used to solve financial problems using a high level programming
language.
(b) Learning outcomes
By the end of this course the student should be able to:
(1) Describe the basic concepts of modern statistics
(2) Demonstrate good understanding of statistical reports
(3) Recognize the presence of errors or misleading quantitative information
(4) Conduct a robust and in-depth exploratory data analysis
(5) Conduct point and confidence estimation, and hypothesis testing
(6) Perform high level language programming.
(c) Course Description Basic concepts of modern statistics. Generation and
understanding of statistical reports. Exploratory data analysis, statistical
graphics, sampling variability, point and confidence interval estimation, and
hypothesis testing. Recognition of inaccurate or misleading quantitative
information. S-Plus/R will be used throughout.
(d) Teaching Methodology
Lectures, Practicals, Tutorials, Self-study, Discussions and Student Presen-
tations.
(e) Instructional Material and Equipment
Black or White Boards, Chalk or White Board Markers, Dusters, Computer
and Projector.


(f) Course Assessment


Assignments (5%), Practicals (10%), CATs (15%), End of Semester Exami-
nation (70%)

(g) Course Text Books

[1] Crawley M., Statistics: An Introduction Using R, John Wiley & Sons,
ISBN-10: 0470022981, 2005.
[2] Uppal, S. M., Odhiambo, R. O. & Humphreys, H. M. Introduction to
Probability and Statistics. JKUAT Press, ISBN 9966923950, 2005.
[3] I. Miller & M. Miller, John E. Freund's Mathematical Statistics with
Applications, 7th ed., Pearson Education, Prentice Hall, New Jersey,
2003. ISBN-10: 0131427067.
[4] RV Hogg, JW McKean & AT Craig Introduction to Mathematical Statis-
tics, 6th ed., Prentice Hall, 2003 ISBN 0-13-177698-3
[5] HJ Larson Introduction to Probability Theory and Statistical Infer-
ence. 3rd ed., Wiley, 1982 ISBN-13: 978-0471059097

(h) Course Journals

[1] Statistical Methodology (Stat. Methodol.), ISSN: 1572-3127.

[2] Statistical Methods and Applications (Stat. Methods Appl.), ISSN:
1618-2510; 1613-981X.
[3] Journal of Statistical Computation and Simulation (J. Stat. Comput.
Simul.), ISSN: 0094-9655.

(i) Reference Text Books

[1] Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978), Statistics for
Experimenters: An Introduction to Design, Data Analysis, and Model
Building, John Wiley and Sons. ISBN-13: 978-0471093152
[2] Du Toit, Steyn, and Stumpf (1986), Graphical Exploratory Data Anal-
ysis, Springer-Verlag ISBN 10: 0387963138 ISBN 13: 9780387963136.
[3] Cleveland, William (1993), Visualizing Data, Hobart Press ISBN-10:
0963488406/ISBN-13: 978-0963488404

(j) Reference Journals


[1] Computational Statistics & Data Analysis, ISSN: 0167-9473.


[2] Journal of Statistical and Econometric Methods, ISSN: 2241-0384, ISSN:
2241-0376.
[3] Statistics and Computing, ISSN: 1573-1375.


STA 2202 Course Outline


The applications of R will be explored by considering several case studies in
statistics, each motivated by a scientific question that needs to be answered.
The cases are grouped by broad statistical topics, namely:
– data analysis and applied probability,
– statistical inference and regression methods.
Week 1-5 Language essentials:
– Basic concepts of modern statistics.
– Data types in R
– Objects; functions, vectors, missing values, matrices and arrays, fac-
tors, lists, data frames.
– Indexing, sorting and implicit loops.
– Logical operators.
– Packages and libraries.
– Recognition of inaccurate or misleading quantitative information.
– CAT I
Week 6-7 Flow control:
– for, while, if/else, repeat, break.
Week 8-9 Probability distributions:
– Built-in distributions in R; densities, cumulatives, quantiles, random
numbers.
Week 10-12 Statistical graphics:
– Graphical devices.
– High level plots.
– Low level graphics functions.
– CAT II
Week 13-14 Statistical functions:
– One and two-sample inference, regression and correlation, tabular
data, power, sample size calculations.


1 Basic concepts of Modern statistics


1.1 Some Basics
The purpose of most research is to assess relationships among a set of variables.
To do so, one must measure the variables in some manner. Measurement involves
error, and the need for statistical design and analysis emanates from the presence
of such errors. In statistics, there are two kinds of statistical inference:
1. Estimation: quantifying the characteristics and strength of relationships.

2. Testing: specifying hypotheses about relationships, making statements of
probability about the appropriateness of such hypotheses, and then drawing
practical conclusions based on such statements.
Note that in this course, we focus on regression and correlation methods involving
one response variable and one or more predictor variables.

A response (dependent) variable is predicted from other variables, called
predictor (independent) variables, as represented by the model
Y = β0 + β1 X + e
where Y is the dependent variable, β0 is the Y-intercept, β1 is the gradient, X
is the independent variable and e is the error term.
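As a preview of later chapters, this model can be fitted in R with the lm
function; a small sketch, using invented data for illustration:

```r
# Hypothetical data: X is the predictor, Y the response
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit Y = beta0 + beta1*X + e by least squares
fit <- lm(Y ~ X)
coef(fit)   # estimated intercept (beta0) and gradient (beta1)
```

The estimated gradient is close to 2 here, since Y was constructed to increase
by roughly 2 for each unit increase in X.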

1.2 Descriptive Statistics


Statistics is a set of mathematical procedures for summarizing and interpreting
observations. The observations are typically numerical or categorical facts about
specific people or things, and are usually referred to as data.

The most fundamental branch of statistics is descriptive statistics, i.e., statistics


used to summarize or describe a set of observations.

The branch of statistics used to interpret or draw inferences about a set of ob-
servations is referred to as inferential statistics.

Descriptive statistics include things such as means, medians, modes, and per-
centages, and they are everywhere, for example you might hear of


– a study showing that, as people age, women’s brains shrink 67% less than
men’s brains do.

– a meteorologist reporting that the average temperature for the past 7 days
has been over 24°C.

What makes descriptive statistics really important is that they take what could
be an extremely large and cumbersome set of observations and boil them down
to one or two highly representative numbers.

Imagine a sportscaster trying to tell us exactly how well Michael Olunga has
been scoring this season without using any descriptive statistics: “he scored a
goal, then he scored another one, and another one . . . ".

1.3 Central Tendency and Dispersion


The descriptive statistics used by laypeople are typically incomplete in a very
important respect: they make frequent use of descriptive statistics that
summarize the central tendency (loosely speaking, the average) of a set of
observations, for example:

– “A follow-up study revealed that women also happen to be exactly 67%


less likely than men to spend their weekends watching football and drink-
ing beer"

– “My old pal Michael Jordan once averaged 32 points in a season"

A second, equally useful and important category of descriptive statistics consists
of statistics that summarize the dispersion, or variability, of a set of scores or
observations. These are important not only in their own (descriptive) right, but
also because they play a very important role in inferential statistics.

1.4 The shape of distributions


This is a third statistical property of a set of observations, which is a little more
difficult to quantify than measures of central tendency or dispersion.


One useful way to get a feel for a set of observations is to arrange them in order
from the lowest to the highest and to graph them pictorially so that taller parts
of the graph represent more frequently occurring scores (or, in the case of a
theoretical or ideal distribution, more probable scores).

The first graph in Figure 1 shows a uniform distribution: every possible outcome
has an equal chance, or likelihood, of occurring. For example, when rolling a
fair die, each of the six sides has an equal chance of turning up, with a
probability of 1/6 (about 16.7%).
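This can be illustrated by simulation in R: over a large number of rolls of a
fair die, each face turns up in roughly 1/6 of the rolls.

```r
set.seed(1)                                  # make the simulation reproducible
rolls <- sample(1:6, 60000, replace = TRUE)  # 60,000 fair die rolls
table(rolls) / length(rolls)                 # observed proportions, all near 1/6
```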

The second graph in Figure 1 shows a normal distribution: a symmetrical,


bell-shaped distribution in which most observations cluster near the mean and
in which scores become increasingly rare as they become increasingly divergent
from this mean.

Many things that can be quantified are normally distributed. Distributions of


height, weight, extroversion, self-esteem, and the age at which infants begin to
walk are all examples of approximately normal distributions.

Usually, about 68% of a set of normally distributed observations will fall within
one standard deviation of the mean. About 95% of a set of normally distributed
observations will fall within two standard deviations of the mean, and well over
99% of a set of normally distributed observations (99.7%, to be more exact) will
fall within three standard deviations of the mean.
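These percentages can be verified in R with pnorm, the cumulative distribution
function of the normal distribution:

```r
# Probability that a standard normal observation falls within
# k standard deviations of the mean, for k = 1, 2, 3
pnorm(1) - pnorm(-1)   # about 0.683
pnorm(2) - pnorm(-2)   # about 0.954
pnorm(3) - pnorm(-3)   # about 0.997
```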

Figure 2 shows a bimodal distribution: a distribution with two modes, in which
two distinct ranges of scores are more common than any others. This occurs
when two data observations tie for having the highest frequency.

For example, in a crash test, 11 cars were tested to determine what impact speed
was required to obtain minimal bumper damage. Find the mode of the speeds,
given in miles per hour as 24, 15, 18, 20, 18, 22, 24, 26, 18, 26, 24.

Solution: Ordering the data from least to greatest, we get:


15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26.

Answer: Since both 18 and 24 occur three times, the modes are 18 and 24 miles
per hour. This data set is bimodal.
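Base R has no built-in function for the statistical mode, but the modes can be
found with table; a sketch for the crash-test data:

```r
speeds <- c(24, 15, 18, 20, 18, 22, 24, 26, 18, 26, 24)
counts <- table(speeds)                  # frequency of each speed
modes  <- as.numeric(names(counts[counts == max(counts)]))
modes                                    # 18 and 24, each occurring three times
```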


Figure 1: Uniform and Normal distribution

Bimodal distributions are relatively rare, and they usually reflect the fact that a
sample is composed of two meaningful subsamples.

As we have seen, descriptive statistics provide researchers with an enormously


powerful tool for organizing and simplifying data. At the same time, descriptive
statistics give only half of the picture.

1.5 Inferential Statistics


In addition to simplifying and organizing the data they collect, researchers also
need to draw conclusions about populations from their sample data.

Researchers need to move beyond the data themselves in the hopes of drawing
general inferences about observations. To do this, researchers rely on inferential
statistics.

Figure 2: Bimodal distribution

The basic idea behind inferential statistical testing is that decisions about what
to conclude from a set of research findings need to be made in a logical, unbiased
fashion.

One of the most highly developed forms of logic is mathematics, and statistical
testing involves the use of objective, mathematical decision rules to determine
whether an observed set of research findings is “real".

When conducting a statistical test to aid in the interpretation of a set of
experimental findings, researchers begin by assuming that the null hypothesis is
true. That is, they begin by assuming that their own predictions are wrong and
that the findings reflect chance variation and are not real.

The opposite of the null hypothesis is the alternative hypothesis. This is the


hypothesis that any observed difference between the experimental and the con-
trol group is real.

In a simple, two-group experiment, this would mean assuming that the
experimental group and the control group are not really different after the
manipulation, and that any apparent difference between the two groups is
simply due to luck (i.e., to a failure of random assignment).

The null hypothesis is very much like the presumption of innocence in the court-
room. Jurors in a courtroom are instructed to assume that they are in court
because an innocent person had the bad luck of being falsely accused of a crime.

In the context of an experiment, the main thing statistical hypothesis testing


tells us is exactly how likely it is that someone would
get results as impressive as, or more impressive than, those actually observed in
an experiment if chance alone (and not an effective manipulation) were at work
in the experiment.

If a researcher correlates a person’s height with that person’s level of education


and observes a modest positive correlation (such that taller people tend to be
better educated), it is always possible, out of dumb luck, that the tall people in
this specific sample just happen to have been more educated than the short
people.

Statistical testing tells researchers exactly how likely it is that a given research
finding would occur on the basis of luck alone (if nothing interesting is really
going on).

Researchers conclude that there is a true association between the variables they
have manipulated or measured only if the observed association would rarely
have occurred on the basis of chance.

Just as defendants are considered “innocent until proven guilty," researchers’


claims about the relation between the variables they have examined are consid-
ered incorrect unless the results of the study strongly suggest otherwise (“null
until proven alternative," you might say).

After beginning with the presumption of innocence, jurors are instructed to
examine all the evidence presented in a completely rational, unbiased fashion. The
statistical equivalent of this is to examine all the evidence collected in a study
on a purely objective, mathematical basis.

After examining the evidence against the defendant in a careful, unbiased fash-
ion, jurors are further instructed to reject the presumption of innocence (to
vote guilty) only if the evidence suggests beyond a reasonable doubt that the
defendant committed the crime in question.

The statistical equivalent of the principle of reasonable doubt is the alpha level
agreed upon by most statisticians as the reasonable standard for rejecting the
null hypothesis. In most cases, the accepted probability value at which alpha is
set is 0.05.

Researchers may reject the null hypothesis and conclude that their hypothesis
(the alternative) is correct only when findings as extreme as those observed in
the study (or more extreme) would have occurred by chance alone less than 5%
of the time.

Analysing and interpreting the data from most real empirical investigations re-
quire extensive calculations, but of course these labour-intensive calculations are
usually carried out by computers. In fact, a great deal of your training in this
course will involve getting a computer to crunch numbers for you using the
statistical software packages R and SPSS.

1.6 Probability Theory


All inferential statistics are grounded firmly in the logic of probability theory.

Probability theory deals with the mathematical rules and procedures used to
predict and understand chance events.

By making use of some basic concepts in probability theory, along with our
knowledge of what a distribution of observations should look like when noth-
ing funny is going on (e.g., when we are merely flipping a fair coin 10 times
at random, when we are simply randomly assigning 20 people to either an ex-
perimental or a control condition), we can use inferential statistics to figure out
exactly how likely it is that a given set of usual or not-so-usual observations


would have been observed by chance.

Unless it is pretty darn unlikely that a set of findings would have been observed
by chance, the logic of statistical hypothesis testing requires us to conclude that
the set of findings represents a chance outcome.

The logic underlying virtually all inferential statistical tests is as follows:

1. A researcher makes a set of observations

2. These observations are compared with what we would expect to observe
if nothing unusual were happening in the experiment (i.e., if the
researcher’s hypothesis were incorrect), namely, the probability that the
researcher would have observed a set of results at least this consistent
with his or her hypothesis if the hypothesis were incorrect.

3. If this probability is sufficiently low, we conclude that the researcher’s


hypothesis is probably correct.
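This logic can be tried out in R. Suppose a supposedly fair coin is flipped 10
times and lands heads 9 times; binom.test computes how likely a result at least
this extreme would be under the null hypothesis that P(heads) = 0.5:

```r
# Two-sided test of H0: P(heads) = 0.5, given 9 heads in 10 flips
result <- binom.test(9, 10, p = 0.5)
result$p.value   # about 0.021, below the conventional 0.05 level
```

Since the p-value falls below 0.05, the null hypothesis of a fair coin would be
rejected here.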

1.7 Types of Variables & Measurement Scales


There are four measurement scales (or types of data): nominal, ordinal, interval
and ratio. These are simply ways to categorize different types of variables.

• Nominal (also called a categorical variable)

– Used for labelling variables, without any quantitative value. All the
scales are mutually exclusive (no overlap) and none of them have any
numerical significance.
– No inherent ordering to the categories.
– You can code the categories with numbers if you want, but the
order is arbitrary and any calculations (for example, computing an
average) would be meaningless.

• Ordinal

– Typically measures of non-numeric concepts like satisfaction, happi-


ness, discomfort, etc.


– Order matters but not the difference between values, e.g., you may
ask patients to express the amount of pain they feel on a scale of 1
to 10. A score of 7 means more pain than a score of 5, and 5 means
more pain than 3. But the difference between 7 and 5 may not be
the same as that between 5 and 3.

• Interval

– Numeric scales in which we know not only the order, but also the
exact differences between the values.
– An example is Celsius temperature, because the difference between each
value is the same. For example, the difference between 60 and 50
degrees is a measurable 10 degrees, as is the difference between 80
and 70 degrees. Time is another example where the increments are
known, consistent, and measurable.
– One problem: they don’t have a “true zero", e.g., there is no such
thing as “no temperature". Without a true zero, it is impossible to
compute ratios.

• Ratio

– They tell us about the order, they tell us the exact value between
units, AND they also have an absolute zero which allows for a wide
range of both descriptive and inferential statistics to be applied.
– A ratio scale is one in which the answers are real numbers, and an
answer of zero means what it says. "How old are you?", "How tall
are you?", "How many children do you have?"
– These variables can be meaningfully added, subtracted, multiplied,
divided (ratios). Central tendency and measures of dispersion can
be obtained.

In summary: nominal scales simply “name" categories. Ordinal scales provide
information about the order of choices. Interval scales give the order of values
plus the ability to quantify the difference between each one. Ratio scales give
the ultimate: order, interval values, plus the ability to calculate ratios, since a
“true zero" can be defined. Table 1 gives a summary of these measurement
scales.
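In R, nominal variables are represented as factors and ordinal variables as
ordered factors; a small sketch (the colour and pain data are invented):

```r
# Nominal: car colour has no inherent ordering
colour <- factor(c("red", "blue", "red", "green"))
levels(colour)        # the categories, listed alphabetically

# Ordinal: pain categories with a meaningful order
pain <- factor(c("mild", "severe", "moderate", "mild"),
               levels  = c("mild", "moderate", "severe"),
               ordered = TRUE)
pain[2] > pain[1]     # TRUE: "severe" represents more pain than "mild"
```

Comparisons such as > are meaningful for ordered factors but produce NA with a
warning for plain (nominal) factors, mirroring the distinction between the
scales.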


Provides                                     Nominal  Ordinal  Interval  Ratio
“Counts" or frequency of distribution           X        X        X        X
Median, mode & percentiles                               X        X        X
Known order of values                                    X        X        X
Difference between each value quantifiable                        X        X
Values can be added or subtracted                                 X        X
Mean, standard deviation, standard error                          X        X
  of the mean
Values can be multiplied or divided                                        X
Ratio, or coefficient of variation                                         X
Has “true zero"                                                            X

Table 1: Measurement of Variables


2 Getting R and Getting Started


R is a flexible and powerful open-source implementation of the language S (for
statistics) developed by John Chambers and others at Bell Labs. R has eclipsed
S and the commercially available S-Plus program for many reasons. R is free,
and has a variety of (over 4,000) contributed packages, most of which are also
free. R works on Macs, PCs, and Linux systems.

Although R is initially harder to learn and use than a spreadsheet or a dedicated


statistics package, it’s a very effective statistics tool in its own right, and is well
worth the effort to learn.

Here are some compelling reasons to learn and use R.


– R is open source and completely free. It is the de facto standard and
preferred program of many professional statisticians and researchers in a
variety of fields. R community members regularly contribute packages to
increase R’s functionality.
– It is the product of international collaboration between top computa-
tional statisticians and computer language designers;
– R is as good as (often better than) commercially available statistical pack-
ages like SPSS, SAS, and Minitab.
– R has extensive statistical and graphing capabilities. R provides hundreds
of built-in statistical functions as well as its own built-in programming
language.
– R is used in teaching and performing computational statistics. It is the
language of choice for many academics who teach computational statis-
tics.
– Getting help from the R user community is easy. There are readily avail-
able online tutorials, data sets, and discussion forums about R.
– R is a programming environment for data analysis and graphics and a
platform for development and implementation of new algorithms.
– It stimulates critical thinking about problem-solving rather than a “push
the button" mentality.


– It can work on objects of unlimited size and complexity with a consistent,


logical expression language

– All source code is published, so you can see the exact algorithms being
used; also, expert statisticians can make sure the code is correct;

– Most programs written for the commercial S-PLUS program will run
unchanged, or with minor changes, in R.

R is a suite of software facilities for reading and manipulating data,
computation, conducting statistical analyses and displaying the results.

Software and packages can be downloaded from www.cran.r-project.org.


Click on Download R for Windows ⇒ base ⇒ Download R(version) for win-
dows (32/64 bit depending) ⇒ Run and follow the instructions to install the
programme.

The base distribution already comes with some high–priority add–on packages,
namely
KernSmooth MASS boot class
cluster foreign lattice mgcv
nlme nnet rpart spatial
survival base datasets grDevices
graphics grid methods splines
stats stats4 tcltk tools
utils
These packages listed here implement standard statistical functionality, for ex-
ample linear models, classical tests, a huge collection of high-level plotting func-
tions or tools for survival analysis.
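An add-on package from this list is attached with the library function; for
example, the recommended package MASS supplies data sets and modelling
functions:

```r
library(MASS)       # attach the MASS package shipped with R
# MASS provides data sets such as Animals (body and brain weights of species)
colnames(Animals)   # "body" "brain"
```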

2.1 Starting R
R can be started in the usual way by double-clicking on the R icon on the desk-
top.

An important concept in R is the workspace, which contains the local data


and procedures for a given statistics project. Under Windows this is usually
determined by the folder from which R is started. R works best if you have a


dedicated folder for each separate project – called the working folder.

X Create the directory/folder that will be used as the working folder, e.g.
create a folder on your desktop titled Your_name by right-clicking, then
clicking New > Folder.

X Right-click on an existing R icon and click Copy.

X In the working folder, right-click and click Paste. The R icon will appear
in the folder.

X Right-click on the R icon and click Properties.

X In the Start in box type the location of the working directory, e.g.
"C:\User\Jane\Desktop\CIS_R"

X Click Apply, then Ok.

X Now when you double-click on the shortcut, it will start R in the direc-
tory of your choice. So, you can set up a different shortcut for each of
your projects.

2.2 How R works


R creates its objects in memory and saves them in a single file called .RData (by
default)

Commands are recorded in a .Rhistory file and can be recalled and reissued or
edited using up- and down-arrow

Flawed commands may be abandoned by pressing <Esc>

We can Copy-and-paste from a “script" file or the history window used for re-
calling several commands at once

To end your session type q() or just kill the window.


There are a number of drop-down menus in the R Gui (File, Edit, View, Pack-
ages, Help).


Users are expected to type input (commands) into R in the console window.
When R is ready for input, it prints out its prompt, a ">".

The commands consist of expressions or assignments and are separated by a


semi-colon (;) or by a newline. They can be grouped together using braces
({ and } ) and comments can be included and are indicated with a hash (#).
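For instance, two commands can share a line when separated by a semicolon, a
braced group evaluates to its last expression, and anything after # is ignored:

```r
x <- 2; y <- 3   # two assignments on one line, separated by a semicolon
z <- {           # expressions grouped with braces ...
  x + y          # ... the group evaluates to its last expression
}
z                # 5
```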

Users enter a line with a command after the prompt and press Enter. The pro-
gramme carries out the command and prints the result if relevant. For example,
if the expression 2+2 is typed in, the following is printed in the R console:

> 2+2
[1] 4
>

The prompt > indicates that R is ready for another command. If a command is
incomplete at the end of a line, the prompt + is displayed on subsequent lines
until the command is syntactically complete.
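As a small illustration (the variable names and values here are arbitrary), commands can be put on one line separated by semicolons, and a braced group returns the value of its last expression:

```r
# Two commands on one line, separated by a semicolon
x <- 2; y <- 3

# A braced group evaluates its commands in order and
# returns the value of the last expression
z <- {
  s <- x + y   # 5
  s^2          # this value is returned
}
z   # 25
```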

2.3 Saving your work


For your analysis steps, the File | Save to file... menu command will
save the entire console contents, i.e. both your commands and R’s response,
to a text file, which you can later review and edit with any text editor. This is
useful for cutting-and-pasting into your reports or thesis, and also for writing
scripts to repeat procedures.
For graphics, in the Windows version of R, you can save any graphical output
for insertion into documents or printing. If necessary, bring the graphics win-
dow to the front (e.g. click on its title bar), select menu command File |
Save to file..., and then one of the formats.

– Most useful for insertion into MS-Word documents is Metafile;

– most useful for LaTeX is PostScript;

– most useful for PDFLaTeX and stand-alone printing is PDF.

You can later review your saved graphics with programs such as Windows Pic-
ture Editor. If you want to add other graphical elements, you may want to save
as a PNG or JPEG; however in most cases it is cleaner to add annotations within


R itself.

You can also review graphics within the Windows R GUI itself. Create the
first graph, bring the graphics window to foreground, and then select the menu
command History | Recording. After this all graphs are automatically saved
within R, and you can move through them with the up and down arrow keys.

You can also write your graphics commands directly to a graphics file in many
formats, e.g. PDF or JPEG. You do this by opening a graphics device, writ-
ing the commands, and then closing the device. You can get a list of graphics
devices (formats) available on your system with ?Devices (note the upper-case
D).

For example, to write a PDF file, we open a PDF graphics device with the pdf
function, write to it, and then close it with the dev.off function:

> pdf("figure2.pdf", h=6, w=6)


> hist(rnorm(100), main="100 random values from N[0,1]")
> dev.off()

This produces the graph in figure 3


At the end of each R session, you are prompted to save your workspace. If you
click Yes, all objects are written to the .RData file. When R is re-started, it
reloads the workspace from this file, and the command history stored in .Rhistory
is also reloaded.

2.4 The Rcmdr GUI


The Rcmdr add-on package, written by John Fox of McMaster University, pro-
vides a GUI for common data management and statistical analysis tasks. It is
loaded like any other package, with the require function:

> require("Rcmdr")

As it is loaded, it starts up in another window, with its own menu system. You
can run commands from these menus, but you can also continue to type com-
mands at the R prompt. Figure 4 shows an R Commander screen shot.


Figure 3: Written PDF graphic file from R — a histogram titled "100 random
values from N[0,1]" (x-axis: rnorm(100); y-axis: Frequency)

To use Rcmdr, you first import or activate a dataset using one of the commands
on Rcmdr’s Data menu; then you can use procedures in the Statistics, Graphs,
and Models menus. You can also create and graph probability distributions
with the Distributions menu.

When using Rcmdr, observe the commands it formats in response to your menu
and dialog box choices. Then you can modify them yourself at the R command
line or in a script.

Rcmdr also provides some nice graphics options, including scatterplots (2D and
3D) where observations can be coloured by a classifying factor.

2.5 An overgrown calculator


One of the simplest possible tasks in R is to enter an arithmetic expression and
receive a result. (The second line is the answer from the machine.)


Figure 4: Screen shot of Rcmdr window


> 2 + 2
[1] 4

or

> 1000*(1 + 0.075)^5 - 1000


[1] 435.6293

It also knows how to do other standard calculations. For instance, here is how
to compute e^-2:

> exp(-2)
[1] 0.1353353

or

> pi # R knows about pi


[1] 3.141593

The [1] in front of the result is part of R’s way of printing numbers and vectors.
It is not useful here, but it becomes so when the result is a longer vector. The
number in brackets is the index of the first number on that line. Consider the
case of generating 15 random numbers from a normal distribution:

> rnorm(15)
[1] 1.12076979 -0.09523206 1.45992818 0.69681119 -1.08466231 -1.66039570
[7] 0.01289009 1.70036979 0.62992239 0.10174016 -1.11853814 0.04482635
[13] 0.20740422 1.91260482 -0.74134190

Here, for example, the [7] indicates that 0.01289009 is the seventh element in
the vector.

2.6 Assignments
It is often necessary to store intermediate results so that they do not need to be
re-typed over and over again. R, like other computer languages, has symbolic
variables, that is, names that can be used to represent values. To assign the
value 10 to the variable x, type:


> x <-10
> x
[1] 10

<- is the assignment operator in R. After the assignment, x takes the value 10
and can be used in various operations.

> x<-10
> x+x
[1] 20
> sqrt(x)
[1] 3.162278

Variable names can be chosen quite freely in R. They can be built from letters,
digits, and the period (.) symbol, with the limitation that a name must not
start with a digit or with a period followed by a digit.

Names that start with a period are special and should be avoided.

A typical variable name could be height.1yr, which might be used to describe
the height of a child at the age of 1 year. Names are case-sensitive: WT, Wt, wT
and wt do not refer to the same variable.

Some names are already used by the system and can cause some confusion if
used for other purposes. The worst cases are the single-letter names c, q, t,
C, D, F, I, and T, but there are also diff, df, and pt, for example. Most
of these are functions and do not usually cause trouble when used as variable
names.

Also F and T are the standard abbreviations for FALSE and TRUE and no longer
work as such if redefined.
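For example (a sketch with made-up variable names), case-sensitivity and the danger of redefining T can be seen as follows:

```r
wt <- 70
Wt <- 80
wt == Wt     # FALSE: wt and Wt are different variables
T            # TRUE, the built-in abbreviation
T <- 0       # legal, but T no longer means TRUE
T == TRUE    # now FALSE
rm(T)        # remove the redefinition; T means TRUE again
```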

2.7 Vectorized arithmetic


You cannot do much statistics on single numbers! One strength of R is that it
can handle entire data vectors as single objects. A data vector is simply an array
of numbers, and a vector variable can be constructed like this:

> weight <- c(40, 60, 72, 57, 90, 95, 72)


> weight
[1] 40 60 72 57 90 95 72

The construct c(...) is used to define vectors.

You can do calculations with vectors just like ordinary numbers, as long as they
are of the same length.

Suppose that we also have the heights that correspond to the weights above.
The body mass index (BMI) is defined for each person as the weight in kilo-
grams divided by the square of the height in meters. This could be calculated as
follows:

> height <- c(1.55, 1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
> bmi <- weight/height^2
> bmi
[1] 16.64932 19.59184 22.22222 20.93664 24.93075 31.37799 19.73630

These conventions for vectorized calculations make it very easy to specify typ-
ical statistical calculations. Consider, for instance, the calculation of the mean
and standard deviation of the weight variable.
First, calculate the mean x̄ = (Σᵢ₌₁ⁿ xᵢ)/n, and for this case, n = 7.

> sum(weight)
[1] 486
> sum(weight)/length(weight)
[1] 69.42857

Then save the mean in a variable xbar and proceed with the calculation of the
standard deviation, SD = √( Σ(xᵢ − x̄)² / (n − 1) ).

> xbar <- sum(weight)/length(weight)


> xbar
[1] 69.42857
> weight - xbar
[1] -29.428571 -9.428571 2.571429 -12.428571 20.571429 25.571429 2.571429


Notice how xbar, which has length 1, is recycled and subtracted from each
element of weight. The squared deviations will be
> (weight - xbar)^2
[1] 866.040816 88.897959 6.612245 154.469388 423.183673 653.897959 6.612245
The sum of squared deviations and the standard deviation becomes
> sum((weight - xbar)^2)
[1] 2199.714
> sqrt(sum((weight - xbar)^2)/(length(weight) - 1))
[1] 19.1473
Of course, since R is a statistical program, such calculations are already built
into the program, and you get the same results just by entering
> mean(weight)
[1] 69.42857
> sd(weight)
[1] 19.1473
If you want to investigate the relation between weight and height, the first
idea is to plot one versus the other. This is done by
> plot(height,weight,pch=3)
and the resultant graph is a simple x-y plot shown in figure 5. In the R com-
mand, pch=3 (“plotting character") is for changing the plotting symbol.

2.8 Objects
During an R session, objects are created and stored by name. The command
> ls()
displays all currently-stored objects (workspace).
> ls()
[1] "bmi" "c" "d"
[4] "data" "dataTS" "DeltaCF"
[7] "DeltaCS" "DeltaGF" "DeltaGS"
[10] "DeltaHF" "DeltaHS" "external.regressors"


Figure 5: Graph of relationship between height and weight

[13] "garch" "height" "m.order"


[16] "model" "newdata" "order"
[19] "p" "q" "smaller.aic"
[22] "smallest.aic" "spec" "weight"
[25] "xbar"
They can also be accessed from the menu bar by Misc| List objects. Ob-
jects can be removed using
> rm(x, xbar, weight, height)
> x
Error: object ’x’ not found
The command
> rm(list=ls())
removes all of the objects in the workspace. Objects can also be removed from
the menu bar using the commands Misc| Remove all objects


2.9 Getting help in R


R has a built-in help facility. To get more information on any specific function,
e.g. sqrt(), the command is

> help(sqrt) # or
> ? sqrt

You can also obtain help on features specified by special characters, but they
must be enclosed in single or double quotes (e.g. "[[")

> help("[[")

Help is also available in HTML format by running

> help.start()

For more information use

> ? help

2.10 Packages in R
“R" contains one or more libraries of packages. Packages contain various func-
tions and data sets for numerous purposes, e.g. survival package, genetics pack-
age, fda package, etc.

Some packages are part of the basic installation. Others can be downloaded
from CRAN.

To access all of the functions and data sets in a particular package, it must be
loaded into the workspace. For example, to load the fda package:

> library(fda)

Note that if you terminate your session and start a new session with the saved
workspace, you must load the packages again.

To check what packages are currently loaded into the workspace, use the com-
mand

> search()


and they get displayed as:

> search()
[1] ".GlobalEnv" "package:parallel" "package:Rcmdr"
[4] "package:RcmdrMisc" "package:sandwich" "package:car"
[7] "package:fda" "package:Matrix" "package:splines"
[10] "package:stats" "package:graphics" "package:grDevices"
[13] "package:utils" "package:datasets" "package:methods"
[16] "Autoloads"

To remove a package you have loaded use:

> detach("package:fda")

2.11 Interactive session


Create a folder called Session 1 and copy an R shortcut into this folder. Right-
click on this shortcut and go to Properties. Change the address in the Start
In box to the location of your folder.

For the purposes of this session, a data set already stored in R will be used. To
access this data, you must first load the package containing it. (R has many
packages containing various functions that can be used to analyse data; e.g. if
you want to analyse your data using splines, you need to load the splines package.)

In this example, the data is stored in the MASS package. This is loaded with the
command

> library(MASS)

Now you have access to all functions and data sets stored in this package.

We will work with the data set titled “whiteside". To display the data:

> library(MASS)
> whiteside
Insul Temp Gas
1 Before -0.8 7.2
2 Before -0.7 6.9
3 Before 0.4 6.4


4 Before 2.5 6.0


5 Before 2.9 5.8
6 Before 3.2 5.8
7 Before 3.6 5.6
8 Before 3.9 4.7
9 Before 4.2 5.8
10 Before 4.3 5.2
11 Before 5.4 4.9
12 Before 6.0 4.9
13 Before 6.0 4.3
This excerpt of the data is displayed as a data frame.

To get a full description of the data, use the command


> ? whiteside
In part it gives the explanation “Mr Derek Whiteside of the UK Building Re-
search Station recorded the weekly gas consumption and average external tem-
perature at his own house in south-east England for two heating seasons, one
of 26 weeks before, and one of 30 weeks after cavity-wall insulation was in-
stalled. The object of the exercise was to assess the effect of the insulation on
gas consumption."
To remind ourselves of column names:
> names(whiteside)
[1] "Insul" "Temp" "Gas"
We can obtain some summary statistics using
> summary(whiteside)

Insul Temp Gas


Before :26 Min. :-0.800 Min. :1.300
After :30 1st Qu. : 3.050 1st Qu. :3.500
Median : 4.900 Median :3.950
Mean : 4.875 Mean :4.071
3rd Qu. : 7.125 3rd Qu. :4.625
Max. :10.200 Max. :7.200

To access the data in a particular column


> whiteside$Temp
[1] -0.8 -0.7 0.4 2.5 2.9 3.2 3.6 3.9 4.2 4.3 5.4 6.0 6.0 6.0 6.2
[16] 6.3 6.9 7.0 7.4 7.5 7.5 7.6 8.0 8.5 9.1 10.2 -0.7 0.8 1.0 1.4
[31] 1.5 1.6 2.3 2.5 2.5 3.1 3.9 4.0 4.0 4.2 4.3 4.6 4.7 4.9 4.9
[46] 4.9 5.0 5.3 6.2 7.1 7.2 7.5 8.0 8.7 8.8 9.7
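As an illustrative aside (not part of the original worksheet), logical indexing on a column gives, for example, the mean gas consumption in each period:

```r
library(MASS)   # provides the whiteside data set

# Mean weekly gas consumption before and after insulation
mean(whiteside$Gas[whiteside$Insul == "Before"])
mean(whiteside$Gas[whiteside$Insul == "After"])
```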
A plot of gas consumption versus temperature is now created, as shown in figure
6. The argument main= adds a title to the graph:
> plot(Gas ~ Temp, data=whiteside, pch=16, main="Gas consumption Whiteside")

Figure 6: Temperature vs gas consumption

You can produce separate graphs for gas consumption versus temperature before
insulation was used and after insulation was used as shown in figure 7. This
requires the use of xyplot() available in the lattice package.
> library(lattice) # Loads the lattice package
> ? xyplot # Gives more information on xyplot()
> xyplot(Gas ~ Temp | Insul, whiteside)


Figure 7: Graph showing temperatures and gas consumption before and after
insulation

2.12 Simple Data Entry


Create a data frame with 2 columns. The following data gives, for each amount
by which an elastic band is stretched over the end of a ruler, the distance that
the band moved when released:

Stretch (mm) Distance (cm)


46 148
54 182
48 173
50 166
44 109
42 141
52 166

Use the data.frame() command and name the data frame elasticband.


> elasticband <- data.frame(stretch = c(46,54,48,50,44,42,52),


+ distance=c(148,182,173,166,109,141,166))
> elasticband
stretch distance
1 46 148
2 54 182
3 48 173
4 50 166
5 44 109
6 42 141
7 52 166

2.13 Exercises
2.13.1 Exercise 1
1. Create summary statistics for the elastic band data.

2. Create a plot of distance versus stretch.

3. Use the help() command to find more information about the hist()
command.

4. Create a histogram of the distance using hist().

5. The following data are on snow cover for Eurasia in the years 1970-1979.

year 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
snow.cover 6.5 12.0 14.9 10.0 10.7 7.9 21.9 12.5 14.5 9.2

(a) Enter the data into R. To save keystrokes, enter the successive years
as 1970:1979.
(b) Take the logarithm of snow cover.
(c) Plot snow cover versus year.

6. Display all objects in the workspace and then remove the data frame elas-
ticband.


2.13.2 Exercise 2
1. Assume that we have registered the height and weight for four people:
Heights in cm are 180, 165, 160, 193; weights in kg are 87, 58, 65, 100.
Make two vectors, height and weight, with the data. The body mass index
(BMI) is defined as
BMI = (weight in kg) / (height in m)²
Make a vector with the BMI values for the four people, and a vector with
the natural logarithm to the BMI values. Finally make a vector with the
weights for those people who have a BMI larger than 25.

2. In an experiment the dry weight has been measured for 8 plants grown
under certain conditions. The weights are given as 77, 93, 92, 68, 88, 75,
100. Create a vector called dry to hold the data and write a formula to
calculate its mean and standard deviation.

3. Assume that we have the following six observations of temperature: 23 ◦ C,


27 ◦ C, 19 ◦ C, 30 ◦ C, 37 ◦ C and 0 ◦ C. Make a vector with these values. The
relation between the Celsius and Fahrenheit temperature scale is
degrees in Fahrenheit = degrees in Celsius × 9/5 + 32
Make a new vector with the temperatures in Fahrenheit.

4. Assume that you are interested in cone-shaped structures, and have mea-
sured the height and radius of 6 cones. Make vectors with these values as
follows:

R <- c(2.27, 1.98, 1.69, 1.88, 1.64, 2.14)


H <- c(8.28, 8.04, 9.06, 8.70, 7.58, 8.34)

Recall that the volume of a cone with radius R and height H is given by
(1/3)πR²H. Make a vector with the volumes of the 6 cones.
5. Compute the mean, median and standard deviation of the cone volumes.
Compute also the mean of volume for the cones with a height less than
8.5.


3 Data types, Objects and simple manipulations


3.1 Basic Data types in R
R supports a few basic data types, namely:
1. Numeric
2. Integer
3. Complex
4. Logical
5. Character/String
6. Factor

3.1.1 Numeric
Decimal values are called numerics in R. It is the default computational data
type. If we assign a decimal value to a variable x as follows, x will be of numeric
type.
> x<-3.5
> x
[1] 3.5
> class(x)
[1] "numeric"
Notice that, if we assign an integer to a variable k, it is still saved as a numeric
value.
> k<-2
> k
[1] 2
> class(k)
[1] "numeric"
The fact that k is not an integer can be confirmed with the is.integer func-
tion.
> is.integer(k) # is k an integer?
[1] FALSE


3.1.2 Integer
In order to create an integer variable in R, we invoke the as.integer func-
tion. We can be assured that y is indeed an integer by applying the is.integer
function.

> y<-as.integer(3)
> y
[1] 3
> class(y)
[1] "integer"
> is.integer(y)
[1] TRUE

Using the as.integer function, we can coerce a numeric value into an integer.

> as.integer(pi) # coerce a numeric value


[1] 3
> as.integer(3.14) # coerce a numeric value
[1] 3
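Note that as.integer() truncates towards zero; if rounding is wanted instead, base R also provides round(), floor() and ceiling(). A small sketch:

```r
as.integer(6.75)   # 6: as.integer() truncates towards zero
as.integer(-6.75)  # -6, not -7
round(6.75)        # 7: rounds to the nearest integer
floor(6.75)        # 6: largest integer not above the value
ceiling(6.75)      # 7: smallest integer not below the value
```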

And we can parse a string for decimal values in much the same way.

> as.integer("6.25") # coerce a decimal string


[1] 6

On the other hand, trying to parse a non-numeric string is erroneous:

> as.integer("Jane")
[1] NA
Warning message:
NAs introduced by coercion

Often it is useful to perform arithmetic on logical values. As in the C language,
TRUE has the value 1, while FALSE has the value 0.

> as.integer(TRUE) # the numeric value of TRUE


[1] 1
> as.integer(FALSE) # the numeric value of FALSE
[1] 0


3.1.3 Complex
A complex value in R is defined via the pure imaginary value i.
> z = 1 + 2i # create a complex number
> z # print the value of z
[1] 1+2i
> class(z) # print the class name of z
[1] "complex"
The following gives NaN with a warning, as the numeric -1 is not a complex value.
> sqrt(-1) # square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
Instead, we have to use the complex value −1 + 0i.
> sqrt(-1+0i) # square root of -1+0i
[1] 0+1i
Alternatively, we can coerce -1 into a complex value.
> sqrt(as.complex(-1))
[1] 0+1i
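Base R also provides functions for the components of a complex number; for example:

```r
z <- 1 + 2i
Re(z)     # 1, the real part
Im(z)     # 2, the imaginary part
Mod(z)    # the modulus, sqrt(1^2 + 2^2) = sqrt(5)
Conj(z)   # 1-2i, the complex conjugate
```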

3.1.4 Logical
A logical value is often created via a comparison between variables. It is binary,
i.e. it takes one of two possible values, represented by TRUE and FALSE:
> x = c(3, 7, 1, 2)
> x>2
[1] TRUE TRUE FALSE FALSE
> x<2
[1] FALSE FALSE TRUE FALSE
> x==2
[1] FALSE FALSE FALSE TRUE
> x<3
[1] FALSE FALSE TRUE TRUE
> which(x>2)
[1] 1 2


Another example

> x<-5
> y<-7
> z<-x>y
> z
[1] FALSE
> class(z)
[1] "logical"

The standard logical operations are summarized below.

<         less than
>         greater than
<=        less than or equal to
>=        greater than or equal to
==        equal to
!=        not equal to
|         entry-wise or
||        or
!         not
&         entry-wise and
&&        and
xor(a,b)  exclusive or

For example

> u=TRUE
> v=FALSE
> u&v
[1] FALSE
> u|v
[1] TRUE
> !u
[1] FALSE

Note that there is a difference between operators that act on entries within a
vector and on the whole vector:

> a = c(TRUE,FALSE)


> b = c(FALSE,FALSE)
> a|b
[1] TRUE FALSE
> a||b
[1] TRUE
> xor(a,b)
[1] TRUE FALSE

3.1.5 Character
A character object is used to represent string values in R. We convert objects
into character values with the as.character() function:

> x = as.character(3.14)
> x # print the character string
[1] "3.14"
> class(x) # print the class name of x
[1] "character"

Two character values can be concatenated with the paste function.

> fname="Jane"
> lname="Aduda"
> paste(fname,lname)
[1] "Jane Aduda"

However, it is usually more convenient to create a readable string with the
sprintf function, which has a C language syntax.

> sprintf("%s does not have %d shillings", "Jane", 1000)


[1] "Jane does not have 1000 shillings"
> sprintf("%s %d", "test", 1:5) # re-cycle arguments
[1] "test 1" "test 2" "test 3" "test 4" "test 5"

To extract a substring, we apply the substr function. Here is an example show-
ing how to extract the substring between the third and twelfth positions in a
string.

> substr("Mary had a little lamb.", start=3, stop=12)


[1] "ry had a l"


And to replace the first occurrence of the word "little" by another word "big"
in the string, we apply the sub function.

> sub("little", "big", "Mary had a little lamb.")


[1] "Mary had a big lamb."

More functions for string manipulation can be found in the R documentation.
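A few of them, shown on the same example string:

```r
s <- "Mary had a little lamb."
nchar(s)               # 23: the number of characters in the string
toupper(s)             # "MARY HAD A LITTLE LAMB."
strsplit(s, " ")[[1]]  # split the string into words on spaces
```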

3.2 Data objects in R


The following data objects exist in R:

1. Vectors

2. Factors

3. Matrices

4. Arrays

5. Lists

6. Data frames

3.2.1 Vectors
Vectors are the simplest type of object in R. There are 3 main types of vectors:

1. Numeric vectors

2. Character vectors

3. Logical vectors

4. (Complex number vectors)

To set up a numeric vector x consisting of 5 numbers, 10.4, 5.6, 3.1, 6.4, 21.7,
use

> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)


> x
[1] 10.4 5.6 3.1 6.4 21.7


or

> assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))

x is a numeric vector. To extract the fourth element, for example, we use

> x[4]
[1] 6.4

We can also do further assignments:

> y =c(x,0,x)
> y
[1] 10.4 5.6 3.1 6.4 21.7 0.0 10.4 5.6 3.1 6.4 21.7

creates a vector y with 11 entries (two copies of x with a 0 in the middle).

Computations are performed element-wise

> 2/x
[1] 0.1923077 0.3571429 0.6451613 0.3125000 0.0921659

and short vectors are “recycled" to match long ones

> z=x+y
Warning message:
In x + y : longer object length is not a multiple of shorter object length

Some functions take vectors of values and produce results of the same length:
sin, cos, tan, asin, acos, atan, log, exp, ...

> cos(x)
[1] -0.5609843 0.7755659 -0.9991352 0.9931849 -0.9579148
> exp(x)
[1] 3.285963e+04 2.704264e+02 2.219795e+01 6.018450e+02 2.655769e+09

Some functions return a single value: sum, mean, max, min, prod, ...

> sum(x)
[1] 47.2
> length(x)


[1] 5
> sum(x)/length(x)
[1] 9.44
> mean(x)
[1] 9.44

R has a number of ways to generate sequences of numbers. These include:

X the colon ":", e.g.

1:10
[1] 1 2 3 4 5 6 7 8 9 10

This operator has the highest priority within an expression, e.g. 2*1:10 is
equivalent to 2*(1:10).

> 2*1:10
[1] 2 4 6 8 10 12 14 16 18 20

X the seq() function.

> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(from=1,to=10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(to=10,from=1)
[1] 1 2 3 4 5 6 7 8 9 10

We can also specify a step size (using by=value) or a length (using length=value)
for the sequence.

> s1 <- seq(1,10, by=0.5)


> s1
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
[16] 8.5 9.0 9.5 10.0
> s2 <- seq(1,10, length=5)
> s2
[1] 1.00 3.25 5.50 7.75 10.00


X the rep() function - replicates objects in various ways.

> s3 <- rep(x, 2)


> s3
[1] 10.4 5.6 3.1 6.4 21.7 10.4 5.6 3.1 6.4 21.7
> s4 <- rep(c(1,4),c(10,15))
> s4
[1] 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

To set up a character/string vector z consisting of 5 place names use


> z <- c("Nairobi", "Mombasa", "Nakuru", "Kisumu","Eldoret")
> z
[1] "Nairobi" "Mombasa" "Nakuru" "Kisumu" "Eldoret"
This vector can be extended by concatenation using c():
> z1<-c(z,"Malindi", "Kajiado")
> z1
[1] "Nairobi" "Mombasa" "Nakuru" "Kisumu" "Eldoret" "Malindi" "Kajiado"
A logical vector is a vector whose elements are TRUE, FALSE or NA, generated
by conditions e.g.
> x
[1] 10.4 5.6 3.1 6.4 21.7
> temp <- x > 13
> temp
[1] FALSE FALSE FALSE FALSE TRUE
It takes each element of the vector x and compares it to 13 and returns a vector
the same length as x, with a value TRUE when the condition is met and FALSE
when it is not.

In some cases the entire contents of a vector may not be known. For example,
there could be missing data from a particular data set. A place can be reserved
for this by assigning it the special value NA. We can check for NA values in a
vector x using the command
> is.na(x)


This returns a logical vector the same length as x with a value TRUE if that
particular element is NA. For example

> w <- c(1:10, rep(NA,4), 22)


> w
[1] 1 2 3 4 5 6 7 8 9 10 NA NA NA NA 22
> is.na(w)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[13] TRUE TRUE FALSE
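Missing values propagate through arithmetic, but many summary functions take an na.rm argument to ignore them. Using the vector w from above (re-created here so the sketch is self-contained):

```r
w <- c(1:10, rep(NA, 4), 22)
mean(w)                # NA: arithmetic with NA gives NA
mean(w, na.rm = TRUE)  # 7: the mean of the 11 non-missing values
sum(is.na(w))          # 4: the number of missing values
```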

We have already seen how to access single elements of a vector. Subsets of a
vector may also be selected using a similar approach.

> ind1 <- w[!is.na(w)]


> ind1
[1] 1 2 3 4 5 6 7 8 9 10 22

This command stores the elements of the vector w that do NOT have the value
NA, into ind1.

> ind2 <- w[1:3]


> ind2
[1] 1 2 3

This selects the first 3 elements of the vector w and stores them in the new
vector ind2.

> ind3 <- w[-(1:4)]


> ind3
[1] 5 6 7 8 9 10 NA NA NA NA 22

Using the - sign indicates that these elements should be excluded. This com-
mand excludes the first 4 elements of w.

> ind4 <- w[-c(1,4)]


> ind4
[1] 2 3 5 6 7 8 9 10 NA NA NA NA 22

In this case only the 1st and 4th elements of w are excluded.

To modify the contents of a vector, similar methods can be used.


> x
[1] 10.4 5.6 3.1 6.4 21.7
> x[1]<-5
> x
[1] 5.0 5.6 3.1 6.4 21.7

In this case we have modified the 1st element of x to 5.

> w[is.na(w)] <- 0


> w
[1] 1 2 3 4 5 6 7 8 9 10 0 0 0 0 22

In this case we have replaced the NA (missing) values in the vector w with the
value 0

Consider the following

> y <- c(-1, -2, rep(0, 3), 7, 8, 9)


> y
[1] -1 -2 0 0 0 7 8 9
> y[y<0]
[1] -1 -2

We replace any elements of y having a negative value with the corresponding
positive value, as shown below. (Note this is equivalent to using the built-in
abs() function.)

> y[y < 0] <- -y[y < 0]


> y
[1] 1 2 0 0 0 7 8 9

The same can be achieved as shown below

> y <- c(-1, -2, rep(0, 3), 7, 8, 9)


> y
[1] -1 -2 0 0 0 7 8 9
> abs(y)
[1] 1 2 0 0 0 7 8 9


3.2.2 Factor
A factor is a special type of vector used to represent categorical data, e.g. gen-
der, social class, etc. It is a type of vector containing a set of numeric codes with
character-valued levels. Factor variables are stored internally as a numeric vector
with values 1, 2, . . . , k, where k is the number of levels.

Factors can be either ordered or unordered. A factor with k levels is stored
internally as 2 items:
1. a vector of k integers

2. a character vector containing strings describing what the k levels are.


Example – for a family of two girls (1) and four boys (0),
> children = factor(c(1,0,1,0,0,0), levels = c(0, 1), labels = c("boy", "girl"))
> children
[1] girl boy girl boy boy boy
Levels: boy girl
> class(children)
[1] "factor"
> mode(children)
[1] "numeric"
Regardless of the levels/labels of the factor, the numeric storage is an integer
vector with 1 corresponding to the first level (in alphabetical order).
> children+1
[1] NA NA NA NA NA NA
Warning message:
In Ops.factor(children, 1) : ‘+’ not meaningful for factors
> as.numeric(children)
[1] 2 1 2 1 1 1
> 1+as.numeric(children)
[1] 3 2 3 2 2 2
> summary(children)
boy girl
4 2
Create another variable, children2:


> children2 = factor(c("boy","girl","boy","girl","boy","boy"))


> children2
[1] boy girl boy girl boy boy
Levels: boy girl
> as.numeric(children2)
[1] 1 2 1 2 1 1
> summary(children2)
boy girl
4 2

Example
Consider a survey that has data on 200 females and 300 males. If the
first 200 values are from females and the next 300 values are from males, one
way of representing this is to create a vector

gender <- c(rep("female", 200), rep("male", 300))

To change this into a factor

> gender <- factor(gender)

The factor gender is stored internally as

1 female
2 male

Each category, i.e. female and male, is called a level of the factor. To determine
the levels of a factor the function levels() can be used:

> levels(gender)
[1] "female" "male"

Example
Five people are asked to rate the performance of a product on a scale of 1-
5, with 1 representing very poor performance and 5 representing very good
performance. The following data were collected.

> sat <- c(1, 3, 4, 2, 2)


> fsat <- factor(sat, levels=1:5)
> levels(fsat) <- c("very poor", "poor", "average","good", "very good")


The first line creates a numeric vector containing the satisfaction levels of the 5
people. This is a categorical variable.

The second line creates a factor. The levels=1:5 argument indicates that there
are 5 levels of the factor.

Finally the last line sets the names of the levels to the specified character strings.
> sat
[1] 1 3 4 2 2
> fsat
[1] very poor average good poor poor
Levels: very poor poor average good very good
> levels(fsat)
[1] "very poor" "poor" "average" "good" "very good"

3.3 Matrices
A matrix is a two-dimensional array of numbers; it has rows and columns, and it
is used for many purposes in statistics. In R, matrices are represented as vectors
with dimensions.
> A<-rnorm(15) # creates a vector of 15 standard normal random numbers
> A
[1] -0.2199310 -0.1558994 0.2503376 0.9532383 0.1044002 1.5693812
[7] 0.9335553 -0.6430651 -1.0046298 -0.3206042 0.8627723 0.1836219
[13] -0.9662111 -0.7285974 -0.2556030

> dim(A)<-c(3,5) # The dim function sets the dimensions of A.


> A
[,1] [,2] [,3] [,4] [,5]
[1,] -0.2199310 0.9532383 0.9335553 -0.3206042 -0.9662111
[2,] -0.1558994 0.1044002 -0.6430651 0.8627723 -0.7285974
[3,] 0.2503376 1.5693812 -1.0046298 0.1836219 -0.2556030

> dim(A)<-c(5,3)
> A
[,1] [,2] [,3]
[1,] -0.2199310 1.5693812 0.8627723


[2,] -0.1558994 0.9335553 0.1836219


[3,] 0.2503376 -0.6430651 -0.9662111
[4,] 0.9532383 -1.0046298 -0.7285974
[5,] 0.1044002 -0.3206042 -0.2556030

Note that the storage is carried out by filling in the columns first, then the rows.

Another way to create a matrix is to use the matrix() function.

> B<-rnorm(15)
> matrix(B, nrow=5, ncol=3, byrow=T)
[,1] [,2] [,3]
[1,] 0.6133484 0.1548929 -0.09625237
[2,] -0.8585939 -2.0814610 -1.27274845
[3,] 0.2588589 0.6999445 -0.90617682
[4,] 0.4095178 0.7336874 0.24013239
[5,] -0.2409285 -0.3444688 -1.37569853

The byrow=T command causes the matrix to be filled in row by row rather than
column by column.

Recall the last command and change byrow=T to byrow=F. Notice the dif-
ference between the two outputs: this time the matrix is filled in column by
column.

> matrix(B, nrow=5, ncol=3, byrow=F)


[,1] [,2] [,3]
[1,] 0.61334843 -1.2727484 0.7336874
[2,] 0.15489291 0.2588589 0.2401324
[3,] -0.09625237 0.6999445 -0.2409285
[4,] -0.85859394 -0.9061768 -0.3444688
[5,] -2.08146099 0.4095178 -1.3756985

Some useful functions for matrices include nrow(), ncol(), t(), rownames(),
colnames().

> nrow(A)
[1] 5


> rownames(A)<-c("a","b","c","d","e")
> A
[,1] [,2] [,3]
a -0.2199310 1.5693812 0.8627723
b -0.1558994 0.9335553 0.1836219
c 0.2503376 -0.6430651 -0.9662111
d 0.9532383 -1.0046298 -0.7285974
e 0.1044002 -0.3206042 -0.2556030

The t() function is the transposition function (rows become columns and vice
versa).

> t(A)
a b c d e
[1,] -0.2199310 -0.1558994 0.2503376 0.9532383 0.1044002
[2,] 1.5693812 0.9335553 -0.6430651 -1.0046298 -0.3206042
[3,] 0.8627723 0.1836219 -0.9662111 -0.7285974 -0.2556030

We can also merge vectors and matrices together, column-wise or row-wise us-
ing rbind() (add on rows) or cbind() (add on columns).

When using rbind() - if combining matrices with other matrices, the matri-
ces must have the same number of columns. If combining vectors with other
vectors or vectors with matrices the vectors can have any length but will be
lengthened/shortened accordingly if of differing lengths.

When using cbind() - if combining matrices with other matrices, the matrices
must have the same number of rows. If combining vectors with other vectors
or vectors with matrices, the vectors can have any length but will be length-
ened/shortened accordingly if of differing lengths.
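A small sketch of this recycling behaviour (the matrix here is purely illustrative):

```r
M <- matrix(1:6, nrow = 2)  # a 2 x 3 matrix
rbind(M, 0)                 # the single 0 is recycled across all 3 columns
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
# [3,]    0    0    0
cbind(M, 7:8)               # a length-2 vector matches the 2 rows exactly
#      [,1] [,2] [,3] [,4]
# [1,]    1    3    5    7
# [2,]    2    4    6    8
```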

> C<-matrix(B, nrow=5, ncol=3, byrow=T)


> C
[,1] [,2] [,3]
[1,] 0.6343716 -1.7484903 0.5230774
[2,] -1.8148107 1.5492271 -0.5589112
[3,] -1.8660952 -1.1374160 0.1503507
[4,] -0.7046708 -0.4577602 -0.9357541
[5,] -1.0268098 -1.1896584 -0.1886360


> rbind(A,C)
[,1] [,2] [,3]
[1,] -0.2199310 1.5693812 0.8627723
[2,] -0.1558994 0.9335553 0.1836219
[3,] 0.2503376 -0.6430651 -0.9662111
[4,] 0.9532383 -1.0046298 -0.7285974
[5,] 0.1044002 -0.3206042 -0.2556030
[6,] 0.6343716 -1.7484903 0.5230774
[7,] -1.8148107 1.5492271 -0.5589112
[8,] -1.8660952 -1.1374160 0.1503507
[9,] -0.7046708 -0.4577602 -0.9357541
[10,] -1.0268098 -1.1896584 -0.1886360

> cbind(A,C)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.2199310 1.5693812 0.8627723 0.6343716 -1.7484903 0.5230774
[2,] -0.1558994 0.9335553 0.1836219 -1.8148107 1.5492271 -0.5589112
[3,] 0.2503376 -0.6430651 -0.9662111 -1.8660952 -1.1374160 0.1503507
[4,] 0.9532383 -1.0046298 -0.7285974 -0.7046708 -0.4577602 -0.9357541
[5,] 0.1044002 -0.3206042 -0.2556030 -1.0268098 -1.1896584 -0.1886360
For matrix multiplication
> D<-matrix(1:9, nrow=3,ncol=3)
> D
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

> E<-matrix(10:18, nrow=3, ncol=3)


> E
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18

> F<- D%*%E # Calculates the product of two matrices, F = DE


> F
[,1] [,2] [,3]
[1,] 138 174 210
[2,] 171 216 261
[3,] 204 258 312

> G<- D * D # Calculates element by element products


> G
[,1] [,2] [,3]
[1,] 1 16 49
[2,] 4 25 64
[3,] 9 36 81

Other functions to work on matrices include:


crossprod(A, B) # = t(A) %*% B
diag(n) # Creates a diagonal matrix with the values in the vector n on
# the diagonal
solve(C) # Calculates the inverse of C
C^(-1) # Calculates 1/c_ij
eigen(C) # Calculates the eigenvalues and eigenvectors of C
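These can be checked on a small invertible matrix (the 2 × 2 diagonal matrix below is just an illustration):

```r
S <- matrix(c(2, 0, 0, 3), nrow = 2)  # a diagonal 2 x 2 matrix
solve(S)                              # the matrix inverse
#      [,1]      [,2]
# [1,]  0.5 0.0000000
# [2,]  0.0 0.3333333
S %*% solve(S)                        # multiplying back recovers the identity
eigen(S)$values                       # eigenvalues, largest first
# [1] 3 2
```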
Let's say we have a 5×6 matrix
> X <- matrix(rnorm(30), nrow=5)
> dimnames(X) <- list(letters[1:5], LETTERS[1:6])
> X
A B C D E F
a 1.41346352 0.67723505 0.2404820 -0.8757583 0.89448789 0.4045111
b -0.08013522 -0.95659291 1.2656448 -0.8038683 0.92483273 0.1256199
c -0.25061175 1.69335129 1.9689660 -0.3667330 -0.32647029 -1.6338905
d -0.22283307 0.25310854 -0.3830079 -1.8813818 0.04918283 -1.4148502
e -0.29963721 0.03015866 -1.8363220 -1.2360666 -1.12169062 -0.4777748
We can access the value in row 4, column 3 by position or by name:
> X[4,3]
[1] -0.3830079
> X["d","C"]
[1] -0.3830079


For elements in columns 2 and 4

> X[,c(2,4)]
B D
a 0.67723505 -0.8757583
b -0.95659291 -0.8038683
c 1.69335129 -0.3667330
d 0.25310854 -1.8813818
e 0.03015866 -1.2360666

For elements in columns 2 to 4

> X[,c(2:4)]
B C D
a 0.67723505 0.2404820 -0.8757583
b -0.95659291 1.2656448 -0.8038683
c 1.69335129 1.9689660 -0.3667330
d 0.25310854 -0.3830079 -1.8813818
e 0.03015866 -1.8363220 -1.2360666

and elements in rows 2 to 4

> X[2:4,]
A B C D E F
b -0.08013522 -0.9565929 1.2656448 -0.8038683 0.92483273 0.1256199
c -0.25061175 1.6933513 1.9689660 -0.3667330 -0.32647029 -1.6338905
d -0.22283307 0.2531085 -0.3830079 -1.8813818 0.04918283 -1.4148502

3.4 Arrays
An array can have multiple dimensions. A matrix is a special case of an array (a
2-d array).

We can construct an array from a vector z containing 300 elements using the
dim() function (as for matrices).

> z <- rnorm(300)


> dim(z) <- c(10, 6, 5) # like 5 matrices of 10 rows and 6 columns


This creates a 3-d array with dimensions 10*6*5 (like storing 5 matrices, each
with 10 rows and 6 columns).

We can also use the array() function.

> A1 <- array(0, c(2, 2, 3)) # Creates an array of zeros


> a <- rnorm(50)
> A2 <- array(a, c(5, 5, 2)) # Creates an array from vector a

Use ?array to find out more about arrays.

Elements of multi-dimensional arrays can be extracted using similar techniques.


For example

> ar.1 <- array(1:24, dim=c(4,2,3))


> ar.1[2,,] # Extracts the data in row 2 of the 3 ‘matrices’.
> ar.1[,2,] # Extracts the data in column 2 of the 3 ‘matrices’.
> ar.1[,,1] # Extracts the data in the first ‘matrix’.
> ar.1[1,2,3] # Extracts the data in the row 1, column 2 of the third ‘matrix’.
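Because ar.1 is filled column-first within each 'matrix', these extractions can be checked by hand:

```r
ar.1 <- array(1:24, dim = c(4, 2, 3))
ar.1[1, 2, 3]     # row 1, column 2 of the third 'matrix'
# [1] 21
dim(ar.1[2, , ])  # fixing one index leaves a 2 x 3 matrix
# [1] 2 3
```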

3.5 Lists
Lists are an ordered collection of components.

The components may be arbitrary R objects (data frames, vectors, lists, etc. ).
They are constructed using the function list().

A simple example of a list is as follows:

> L1 <- list(name="Joseph", wife="Mary", no.children=3, child.ages=c(4,7,9))


> L1
$name
[1] "Joseph"

$wife
[1] "Mary"

$no.children
[1] 3


$child.ages
[1] 4 7 9

Each component of the list is given a name (i.e. name, wife, no.children,
child.ages).

Construct a second list omitting the component names:

> L2 <- list("Joseph", "Mary", 3, c(4,7,9))


> L2
[[1]]
[1] "Joseph"

[[2]]
[1] "Mary"

[[3]]
[1] 3

[[4]]
[1] 4 7 9

R uses single bracket notation for sublists and double bracket notation for indi-
vidual components.

To access the component "Joseph", we use the command

> L1$name
[1] "Joseph"

> L1[["name"]]
[1] "Joseph"

> L1[[1]]
[1] "Joseph"

A sublist consisting of the first or second component only


> L1[1]
$name
[1] "Joseph"

> L1[2]
$wife
[1] "Mary"

The names of each component of the list can be accessed using

> names(L1)
[1] "name" "wife" "no.children" "child.ages"

> names(L2)
NULL

We can set the names for the list components after the list has been created.

> names(L2) <- c("name.hus", "name.wife", "no.child", "child.age")


> L2
$name.hus
[1] "Joseph"

$name.wife
[1] "Mary"

$no.child
[1] 3

$child.age
[1] 4 7 9

> names(L2)
[1] "name.hus" "name.wife" "no.child" "child.age"

Lists can also be concatenated

> L3 <- c(L1, L2)


> L3
$name


[1] "Joseph"

$wife
[1] "Mary"

$no.children
[1] 3

$child.ages
[1] 4 7 9

$name.hus
[1] "Joseph"

$name.wife
[1] "Mary"

$no.child
[1] 3

$child.age
[1] 4 7 9

3.6 Data frames


A data frame can be thought of as a data matrix or data set. It is a generalisation of a matrix.

It comprises a list of vectors and/or factors of the same length with a unique set
of row names.

Data frames can be created from pre-existing variables.

> data <- data.frame(year, mean_weight, Gender, mean_height)


> data
year mean_weight Gender mean_height
1 1980 71.5 M 179.3
2 1988 72.1 M 179.9
3 1996 73.7 F 180.5


4 1998 74.3 F 180.1


5 2000 75.2 M 180.3
6 2002 74.7 M 180.4
This is the same as re-typing
> data <- data.frame(year=c(1980,1988,1996,1998,2000,2002),
+ mean_weight=c(71.5,72.1,73.7,74.3,75.2,74.7),
+ Gender=c("M", "M", "F", "F", "M", "M"),
+ mean_height = c(179.3, 179.9, 180.5, 180.1, 180.3, 180.4))
We can also convert other objects (e.g. lists, matrices) into a data frame.

In the previous exercises you created a list called mylist. To convert this to a
data frame:
> new.data <- as.data.frame(mylist)
> new.data
year mean_weight Gender mean_height
1 1980 71.5 M 179.3
2 1988 72.1 M 179.9
3 1996 73.7 F 180.5
4 1998 74.3 F 180.1
5 2000 75.2 M 180.3
6 2002 74.7 M 180.4
As with lists, individual components (columns) can be accessed using the $
notation.
> new.data$year
[1] 1980 1988 1996 1998 2000 2002

> new.data[3,2]
[1] 73.7

> new.data[,2]
[1] 71.5 72.1 73.7 74.3 75.2 74.7

> new.data[3,]
year mean_weight Gender mean_height
3 1996 73.7 F 180.5


We can select all data for cases that satisfy some criterion, such as the data for all males.

> new.data[new.data$Gender == "M",]


year mean_weight Gender mean_height
1 1980 71.5 M 179.3
2 1988 72.1 M 179.9
5 2000 75.2 M 180.3
6 2002 74.7 M 180.4

To select only the weight and height of females for years after 1996, use:

> new.data[new.data$Gender == "F" & new.data$year > 1996, c(2,4)]


mean_weight mean_height
4 74.3 180.1

Replacing the & with a | selects the rows that satisfy EITHER condition.

> new.data[new.data$Gender == "F" | new.data$year > 1996, c(2,4)]


mean_weight mean_height
3 73.7 180.5
4 74.3 180.1
5 75.2 180.3
6 74.7 180.4

We can shorten this command by attaching the data so that we don't have to use the $ notation.

> attach(new.data)
The following objects are masked _by_ .GlobalEnv:

Gender, mean_height, mean_weight, year

The following objects are masked from new.data (pos = 3):

Gender, mean_height, mean_weight, year

> new.data[Gender == "F" | year > 1996, c(2,4)]


mean_weight mean_height
3 73.7 180.5


4 74.3 180.1
5 75.2 180.3
6 74.7 180.4

To detach the data frame

> detach(new.data)
> search()
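As an alternative to attach(), the with() and subset() functions evaluate an expression inside a data frame without attaching it, avoiding the masking messages seen above. A sketch on a small illustrative data frame:

```r
df <- data.frame(Gender = c("M", "F", "F"), year = c(1990, 1998, 2000))
with(df, year[Gender == "F"])            # evaluated inside df, no attach() needed
# [1] 1998 2000
subset(df, Gender == "F" & year > 1998)  # rows satisfying both conditions
#   Gender year
# 3      F 2000
```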

We can apply a function to each element of a vector/data frame/list/array.

The apply family has four members, namely, lapply, sapply, tapply, apply

– lapply: takes any structure and gives a list of results (hence the ‘l’)

– sapply: like lapply, but tries to simplify the result to a vector or matrix
if possible (hence the ‘s’)

– tapply: allows you to create tables (hence the ‘t’) of values from sub-
groups defined by one or more factors.

– apply: only used for arrays/matrices

Let us use an in-built data set called trees, which gives girth, height and volume
measurements for 31 trees.

To view the dataset, type

> trees

Let us calculate the mean of each variable in trees

> lapply(trees, mean)


$Girth
[1] 13.24839

$Height
[1] 76

$Volume
[1] 30.17097


> sapply(trees, mean)


Girth Height Volume
13.24839 76.00000 30.17097

For tapply, let us use the Cars93 dataset in the MASS package. An excerpt of
the data is shown below

> library(MASS)
> Cars93
Manufacturer Model Type Min.Price Price Max.Price MPG.city
1 Acura Integra Small 12.9 15.9 18.8 25
2 Acura Legend Midsize 29.2 33.9 38.7 18
3 Audi 90 Compact 25.9 29.1 32.3 20
4 Audi 100 Midsize 30.8 37.7 44.6 19
5 BMW 535i Midsize 23.7 30.0 36.2 22
6 Buick Century Midsize 14.2 15.7 17.3 22
7 Buick LeSabre Large 19.9 20.8 21.7 19
8 Buick Roadmaster Large 22.6 23.7 24.9 16
9 Buick Riviera Midsize 26.3 26.3 26.3 19

Manufacturer is a factor

> is.factor(Cars93$Manufacturer)
[1] TRUE

and we want to calculate the average price of a car for each manufacturer.

> tapply(Cars93$Price, Cars93$Manufacturer, mean)


Acura Audi BMW Buick Cadillac
24.90000 33.40000 30.00000 21.62500 37.40000
Chevrolet Chrylser Chrysler Dodge Eagle
18.18750 18.40000 22.65000 15.70000 15.75000
Ford Geo Honda Hyundai Infiniti
14.96250 10.45000 16.46667 10.47500 47.90000
Lexus Lincoln Mazda Mercedes-Benz Mercury
31.60000 35.20000 17.60000 46.90000 14.50000
Mitsubishi Nissan Oldsmobile Plymouth Pontiac
18.20000 17.02500 17.50000 14.40000 16.14000
Saab Saturn Subaru Suzuki Toyota


28.70000 11.10000 12.93333 8.60000 17.27500


Volkswagen Volvo
18.02500 24.70000

We can also calculate the average price of a car for each type

> is.factor(Cars93$Type)
[1] TRUE
> tapply(Cars93$Price, Cars93$Type, mean)
Compact Large Midsize Small Sporty Van
18.21250 24.30000 27.21818 10.16667 19.39286 19.10000

For apply, let's create a matrix and sum the rows, then sum the columns, then
get the mean of each column.

> X1 <- matrix(1:12, nrow=3)


> X1
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
> apply(X1, 1, sum)
[1] 22 26 30

> apply(X1, 2, sum)


[1] 6 15 24 33

> apply(X1, 2, mean)


[1] 2 5 8 11
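apply() also accepts user-written functions. For example, the range (maximum minus minimum) of each column of X1, repeating the definition of X1 so the chunk is self-contained:

```r
X1 <- matrix(1:12, nrow = 3)
apply(X1, 2, function(col) max(col) - min(col))  # range of each column
# [1] 2 2 2 2
```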

3.7 Reading Data from Files


R provides several functions to read data in from files such as text files, Excel
spreadsheets, etc.

– scan() - offers a low-level reading facility

– read.table() - can be used to read data frames from formatted text files


– read.csv() can be used to read data frames from comma-separated value
(CSV) files.

– When reading from Excel files, the simplest method is to save each work-
sheet separately as a .csv file and use read.csv() on each.

Save the dataset "CDA data" in your working folder so that the path when
importing the data is short.

> Dataset <- read.table("CDA data.txt", header=TRUE)


> Dataset
Patient Sex Age Counsellor Sessions Satisfaction
1 1 male 21 John 8 5
2 2 male 22 John 8 4
3 3 male 25 John 9 7

> Dataset1 <- readXL("CDA data.xlsx", header=TRUE)


> Dataset1
Patient Sex Age Counsellor Sessions Satisfaction
1 1 male 21 John 8 5
2 2 male 22 John 8 4
3 3 male 25 John 9 7

> Dataset2 <- read.table("CDA data.csv", header=TRUE)


> Dataset2
Patient.Sex.Age.Counsellor.Sessions.Satisfaction
1 1.00,male,21.00,John,8.00,5.00
2 2.00,male,22.00,John,8.00,4.00
3 3.00,male,25.00,John,9.00,7.00
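The Dataset2 result above shows what happens when read.table() (whose default separator is whitespace) is given a comma-separated file: each row is read as one long string. read.csv() assumes comma separators. A self-contained sketch using a temporary file (the file contents here are illustrative):

```r
# Write a small comma-separated file, then read it back correctly.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Patient,Age", "1,21", "2,22"), tmp)
d <- read.csv(tmp, header = TRUE)
d
#   Patient Age
# 1       1  21
# 2       2  22
```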

Saving as .txt (tab delimited) is the ideal format!

3.8 Writing Data to Files


You may also want to save your output in an external file. Use write.table() or
write.csv().

> write.table(Dataset, "CDA data.txt", row.names=FALSE, sep=" ")


This writes the data in Dataset to a text file called "CDA data.txt" in the same
folder as your R session. (You can also specify a full file path.)

The row.names=FALSE command ensures that the row numbers are not saved
in the file.

The sep=" " command ensures that the output is separated by a space. Change
this to sep="," and the output will be separated by commas instead.
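The full round trip can be sketched with write.csv() and read.csv() on a temporary file (the data frame here is illustrative):

```r
out <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), out, row.names = FALSE)
back <- read.csv(out)  # reads the file straight back into a data frame
back$x
# [1] 1 2 3
```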

3.9 Exercises
3.9.1 Exercise 3
1. Create the following vectors

(a) (1, 2, 3, ..., 19, 20)

(b) (20, 19, ..., 2, 1)

(c) (1, 2, 3, ..., 19, 20, 19, 18, ..., 2, 1)

(d) (0.1^3 * 0.2^1, 0.1^6 * 0.2^4, ..., 0.1^36 * 0.2^34)

(e) (2, 2^2/2, 2^3/3, ..., 2^25/25)
2. Create a vector of the values of e^x * cos(x) at x = 3, 3.1, 3.2, ..., 6 and plot
these values against x.

3. Create a vector x with the following entries


3 4 1 1 2 1 4 2 1 1 5 3 1 1 1 2 4 5 5 3
Check which elements of x are equal to 1 (Hint use == operator), then
modify x so that all of the 1’s are changed to 0’s.

4. Create a vector y containing the elements of x that are greater than 1.

5. Concatenate x and y into a vector called newVec.

6. Create a sequence of numbers from 1 to 20 in steps of 0.2 and store.

7. Create a sequence of 6 numbers from 8 to 20.


8. Display all objects in the workspace and then remove newVec.


9. Six patients were asked to rate their pain from 0 to 3, with 0 representing
‘no pain’, 1 representing ‘mild’ pain, 2 representing ‘medium’ pain and 3
representing ‘severe’ pain. The following results were obtained:

Patient 1 2 3 4 5 6
Pain level 0 3 1 2 1 2

Create a factor fpain to represent the above data.


10. The data y<-c(33,44,29,16,25,45,33,19,54,22,21,49,11,24,56) con-
tain sales of milk in litres for 5 days in three different shops (the first 3
values are for shops 1,2 and 3 on Monday, etc.) Produce a statistical sum-
mary of the sales for each day of the week and also for each shop.

3.9.2 Exercise 4
1. Construct a matrix A with values 10, 20, 30, 50 in column 1, values 1, 4,
2, 3 in column 2 and values 15, 11, 19, 5 in column 3, i.e. a 4 × 3 matrix.
Also construct a vector B with values 2.5, 3.5, 1.75. Check your results to
ensure that they are correct.
2. Combine A and B into a new matrix C using cbind().
3. Combine A and B into a new matrix H using rbind().
4. Determine the dimensions of C and H using dim() function.
5. Calculate the following matrix product:

   [ 1  4  3 ]   [  1   9 ]
   [ 0 -2  8 ] * [  2  17 ]
                 [ -6   3 ]

6. Create a 4×4×2 array arr using the values 1 to 32.


7. Print out the value in row 1, column 3 of the first ‘matrix’.
8. Print out the value in row 2, column 4 of the second ‘matrix’.
9. Add these two values together.


3.9.3 Exercise 5
1. Create 4 vectors Year, mean_weight, Gender and mean_height with the
following entries:

Year 1980 1988 1996 1998 2000 2002


mean_weight 71.5 72.1 73.7 74.3 75.2 74.7
Gender M M F F M M
mean_height 179.3 179.9 180.5 180.1 180.3 180.4

2. Create a list called mylist consisting of the above vectors giving each com-
ponent of the list a name.

3. Use 3 different ways to access the 4th element of the list.

3.9.4 Exercise 6
1. Create a data frame called club.points with the following data.

Firstname Lastname Age Gender Points


Alice Ryan 37 F 278
Paul Collins 34 M 242
Jerry Burke 26 M 312
Thomas Dolan 72 M 740
Marguerite Black 18 F 177
Linda McGrath 24 F 195

2. Calculate the average number of points received.

3. Store the data for females only into a data frame called fpoints.

4. The age for Jerry Burke was entered incorrectly. Change his age to 28.

5. Determine the maximum age of the males.

6. Extract the data for people with more than 100 points and are over the
age of 30.


3.9.5 Exercise 7
1. Download the dataset at http://www.contextures.com/xlSampleData01.html
and save it in your work folder.

2. Read this data into R.

3. Print out the data for each sales representative.

4. Print out the total number of pencils, pens, pen sets and binders sold.

5. Find the mean, standard deviation, minimum and maximum total sales
for each region using the smallest number of commands possible.


4 Plotting
– For simple plotting, use plot, hist, pairs, boxplot,...
– To add to existing plots, use points, lines, abline, legend, title,
mtext,...
– For interacting with graphics, use locator, identify
– For three dimensional data, use contour, image, persp, ...
– To see the many possibilities that R offers, see

> demo(graphics)

The basic plotting function is plot(). Possible arguments to plot() include:


– x, y– basic arguments (y may be omitted)
– xlim = c(lo, hi), ylim = c(lo, hi) - make sure the plotted axes ranges
include these values
– xlab = "x", ylab = "y" - labels for x- and y-axes respectively
– type = "c" - type of plot (p, l, b, h, S, . . . )
– lty = n - line type (if lines used)
– lwd = n - line width
– pch = v - plotting character(s)
– col = v - colour to be used for everything.
The commands
> x <- seq(0,1,length=40)
> par(mfrow =c(2,2)) # Will create 4 plots on the same page.
> plot(sin(2*pi*x)) # default type="p" (points)
> plot(sin(2*pi*x), type="l", col="blue")
> plot(sin(2*pi*x), type="S", col= "red")
> plot(sin(2*pi*x), type="h", col="blue")
produce the plots in figure 8


Figure 8: Various possible plots

4.1 Adding titles, lines, points


> dev.off() # closes the graphics device (and so resets par(mfrow))
null device
1
> library(MASS) # loads MASS package
> # Colour points and choose plotting symbols according to the
> # levels of a factor
> plot(Cars93$Weight, Cars93$EngineSize, col=as.numeric(Cars93$Type),
+ pch=as.numeric(Cars93$Type))

> # Adds x and y axes labels and a title.


> plot(Cars93$Weight, Cars93$EngineSize, ylab="Engine Size",
+ xlab="Weight", main="My plot")

> # Add lines to the plot.


Figure 9: Plot of Cars93 Data

> lines(x=c(min(Cars93$Weight), max(Cars93$Weight)), y=c(min(Cars93$EngineSize),


+ max(Cars93$EngineSize)), lwd=4, lty=3, col="green")
> abline(h=3, lty=2) # horizontal line
> abline(v=1999, lty=4) # vertical line

> # Add points to the plot.


> points(x=min(Cars93$Weight), y=min(Cars93$EngineSize), pch=16, col="red")

> # Add text to the plot.


> text(x=2000, y=5, "some text")

> # Add text under main title.


> mtext(side=3, "sub-title", line=0.45)

> # Add a legend


> legend("bottomright", legend=c("Data Points"), pch="o")


Figure 10: Plot of Cars93 Data with labels, lines, points, texts and legend

4.2 Adding regression lines


The function lm is used to fit linear models. It can be used to carry out regres-
sion.
lm(formula, data, subset, weights, ...)

levels(Cars93$Origin)
[1] "USA" "non-USA"

plot(Cars93$Weight, Cars93$EngineSize, pch = (1:2)[Cars93$Origin],


col = (2:3)[Cars93$Origin], xlab="Weight", ylab="Engine Size")
legend("topleft", legend=levels(Cars93$Origin), pch=1:2, col=2:3)

fm1 <- lm(EngineSize ~ Weight, Cars93, subset = Origin == "USA")


abline(coef(fm1), lty=4, col="blue")

fm2 <- lm(EngineSize ~ Weight, Cars93, subset = Origin == "non-USA")


abline(coef(fm2), lty=4, col="black")


Figure 11: Adding regression lines

4.3 Histograms
par(mfrow =c(2,2))
# To create a histogram of the car weights from the Cars93 data set
hist(Cars93$Weight, xlab="Weight", main="Histogram of Weight", col="red")

# To change bin sizes


hist(Cars93$Weight, breaks=c(1500, 2050, 2300, 2350, 2400,
2500, 3000, 3500, 3570, 4000, 4500), xlab="Weight",
main="Histogram of Weight")

# Histograms for multiple groups.


USA.weight <- Cars93$Weight[Cars93$Origin == "USA"]
nonUSA.weight <- Cars93$Weight[Cars93$Origin == "non-USA"]

hist(USA.weight, breaks=10, xlim=c(1500,4500), col="grey")


hist(nonUSA.weight, breaks=10, xlim=c(1500,4500), col="green")
par(mfrow=c(1,1))


Figure 12: Multiple Histograms for Cars93 data

4.4 Boxplot
par(mfrow =c(2,2))
boxplot(Cars93$Weight)
boxplot(Cars93$EngineSize)
boxplot(Cars93$Weight ~ Cars93$Origin)
boxplot(USA.weight, nonUSA.weight,names=c("USA", "non-USA"))
par(mfrow=c(1,1))

4.5 Normal probability (Q-Q) plots


To check that data are normally distributed:

For a normal distribution, we want all of the points to lie on an approximate
straight line (along the 45° dotted line).

qqnorm(Cars93$Weight)
qqline(Cars93$Weight)


Figure 13: Boxplots

4.6 Pie Charts


Cars93$Type
Type <- Cars93$Type
Type.freq <- table(Type) # get frequency for each category
pct <- round(Type.freq/sum(Type.freq)*100) # calc percentages
pct <- paste(pct,"%",sep="")
pie(Type.freq, labels = paste(names(Type.freq), pct,sep=" "),radius = 0.8,
col=rainbow(length(Type.freq)))

To plot a 3-D pie chart, the pie3D() function in the plotrix package provides
3D exploded pie charts

library(plotrix)
pie3D(Type.freq, labels = paste(names(Type.freq), pct,sep=" "),radius = 0.8,
col=rainbow(length(Type.freq)),explode=0.05,
main="Pie Chart of Car Types")


Figure 14: Q-Q plots

4.7 2-D and 3-D plots


We use an existing dataset in R, volcano

? volcano
data(volcano)
x <- 10*(1:nrow(volcano))
y <- 10*(1:ncol(volcano))
# Creates a 2-D image of x and y co-ordinates.
image(x, y, volcano, col = terrain.colors(100),
axes = FALSE)

# Adds contour lines to the current plot.


contour(x, y, volcano, levels = seq(90, 200, by=5),
add = TRUE, col = "peru")

# Adds x and y axes to the plot.


axis(1, at = seq(100, 800, by = 100))


Figure 15: Pie chart representing car types

axis(2, at = seq(100, 600, by = 100))

# Draws a box around the plot.


box()

# Adds a title.
title(main = "Maunga Whau Volcano", font.main = 4)

For a 3-D plot

library(lattice)
data(volcano)
dim(volcano)
# Creates a data frame from all combinations of the
# supplied vectors or factors.
vdat <- expand.grid(x = x, y = y)
vdat$v <- as.vector(volcano)
wireframe(v ~ x*y, vdat, drape=TRUE, col.regions = rainbow(100))


Figure 16: 3-D Pie chart in R

4.8 Exercises
4.8.1 Exercise 8
1. Create a vector x of the values from 1 to 20.

2. Create a vector w <- 1 + sqrt(x)/2.

3. Create a data frame called dummy, with columns x = x and y = x +


rnorm(x)*w. To ensure we all get the same values, set the seed to 0.

4. Create a histogram and a boxplot of y and plot them side-by-side on the


same graphing region. Label the axes accordingly. Save your results as a
Jpeg file. The histogram should be green in colour.

5. Plot y versus x using an appropriate plotting command. Put a title on the


graph and labels on the axes.

6. Enter the command


Figure 17: A 2-D plot of volcano data in R

Figure 18: 3-D plot of Volcano data


fm <- lm(y ~ x, data=dummy)

to fit a linear regression model. Add the estimated regression line to the
current plot and make it the colour blue. Write the equation of the line.

7. Extract the values of the residuals using resids <- resid(fm). Check
that the residuals are normally distributed by creating a Q-Q plot.

8. The airquality data set in the datasets package has columns Ozone, Solar.R,
Wind, Temp, Month and Day. Plot Ozone against Solar.R for each of
THREE temperature ranges and each of THREE wind ranges. (Hint:
Use coplot.)* Difficult.

9. Construct a histogram of Wind and overlay the density curve. (Hint:


Need hist, lines and density.)


5 Looping and Functions


5.1 User Defined Functions (UDF) in R
A function is formally a part of a computer program that performs some spe-
cific action, but is not itself a complete executable program. Functions may
perform the same things that complete programs do.

R has many built-in functions, and you can access many more by installing new
packages, so there's no doubt you already use functions. We have already seen
functions in R, e.g.
mean(x)
sd(x)
plot(x, y, ...)
lm(y ~ x, ...)
Functions have a name and a list of arguments or input objects. For example,
the argument to the function mean() is the vector x.

Functions also have a list of output objects, i.e. objects that are returned once
the function has been run (called).

A function must be written and loaded into R before it can be used.

Functions are typically written when we need to compute the same thing for sev-
eral data sets and what we want to calculate is not already implemented in R or
in an available package.

A simple function can be constructed as follows:


function_name <- function(arg1, arg2, ...){
commands
output
}
You decide on the name of the function.

The function keyword tells R that you are writing a function.


Inside the parentheses you outline the input objects required and decide what to
call them.

The commands occur inside the { }. These form the function body.

The name of whatever output you want goes at the end of the function.

Comment lines are denoted by #.

The procedure for writing any other functions is similar, involving three key
steps:

1. Define the function,

2. Load the function into the R session,

3. Use the function.

myf1 <- function(x){


x^2
}

This function is called myf1.

It has one argument, called x.

Whatever value is input for x will be squared and the result (output) printed to
the screen.

This function must be loaded into R and can then be called.

> myf1 <- function(x){


+ x^2
+ }
> x.sq <- myf1(x = 3:5)
> x.sq
[1] 9 16 25


myf2 <- function(a1, a2, a3){


x <- sqrt(a1^2 + a2^2 + a3^2)
return(x)
}
This function is called myf2 with 3 arguments.

The values input for a1, a2 and a3 will be squared, summed, and the square root
of the sum calculated and stored in x.

The return command specifies what the function returns, here the value of x.

Consider the function


sum.of.sqs <- function(x,y) {
x^2 + y^2
}
which requires two arguments and returns the sum of the squares of these argu-
ments. sum.of.sqs(3,4) will return 25
> sum.of.sqs <- function(x,y) {
+ x^2 + y^2
+ }
> sum.of.sqs(3,4)
[1] 25
It is often convenient to store the result of a function in an object.
myf2 <- function(a1, a2, a3){
x <- sqrt(a1^2 + a2^2 + a3^2)
return(x)
}
res <- myf2(2, 3, 4)
The arguments can be matched by position, using myf2(2, 3, 4), or by name,
using myf2(a1=2, a3=3, a2=4), myf2(a3=2, a1=3, a2=4), etc.

Name matching happens first, then positional matching is used for any un-
matched arguments.
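Repeating the definition of myf2 so the chunk is self-contained, all three call forms below return the same value:

```r
myf2 <- function(a1, a2, a3){
  sqrt(a1^2 + a2^2 + a3^2)
}
myf2(2, 3, 4)                 # purely positional matching
myf2(a3 = 4, a1 = 2, a2 = 3)  # purely name matching, in any order
myf2(2, a3 = 4, a2 = 3)       # a2 and a3 by name, then a1 by position
# all three return sqrt(29), about 5.385165
```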

We can also give some/all arguments default values.


mypower <- function(x, pow=2){


x^pow
}

If a value for the argument pow is not specified in the function call, a value of 2
is used.

mypower(4)
[1] 16

If a value for pow is specified, that value is used.

mypower(4, 3)
[1] 64
mypower(pow=5, x=2)
[1] 32

If we have a function which performs multiple tasks and therefore has multiple
results to report, then we have to include a return statement inside the function
in order to see all the results.

Furthermore, when we have a function which performs multiple tasks (i.e.
carries out several computations), it is often useful to save the results in a list.
We can then access each result separately using the list indices (double square
brackets) or by name using $.

The following function returns several values in the form of a list:

myfunc <- function(x)


{
# x is expected to be a numeric vector
# function returns the mean, sd, min, and max of the vector x
the.mean <- mean(x)
the.sd <- sd(x)
the.min <- min(x)
the.max <- max(x)
return(list(average=the.mean,stand.dev=the.sd,minimum=the.min,
maximum=the.max))
}
x <- rnorm(40)


res <- myfunc(x)


res
$average
[1] 0.29713
$stand.dev
[1] 1.019685
$minimum
[1] -1.725289
$maximum
[1] 2.373015
To access any particular value use:
res$average
[1] 0.29713
res$stand.dev
[1] 1.019685
Execute the following in R and then write a function that does it.
x <- c(0.458, 1.6653, 0.83112)
percent <- round(x * 100, digits = 1)
result <- paste(percent, "%", sep = "")
print(result)
[1] "45.8%" "166.5%" "83.1%"
Now the function
addPercent <- function(x){
percent <- round(x * 100, digits = 1)
result <- paste(percent, "%", sep = "")
return(result)
}
x <- c(0.458, 1.6653, 0.83112)
addPercent(x)
Another example
pow <- function(x, y) {
# function to print x raised to the power y
result <- x^y


print(paste(x,"raised to the power",y,"is",result))


}
> pow(8,2)
[1] "8 raised to the power 2 is 64"
> pow(2,8)
[1] "2 raised to the power 8 is 256"

5.2 if Statement
Often, you want to make choices and take action dependent on a certain value.
If this condition is true, then carry out a certain task. Logical operators are used
as the conditions in the if statement.

Syntax of if statement

if (test_expression) {
statement
}

[Flowchart: start -> evaluate the test expression; if TRUE, execute the if
statement; if FALSE, do nothing -> stop]

If the test_expression is TRUE, the statement gets executed. But if it’s FALSE,
nothing happens.

x<-5
if(x > 0){
print("Positive number")
}
[1] "Positive number"

Syntax of if...else statement

if (test_expression) {
statement1
} else {
statement2
}

(Flow chart: start → evaluate test_expression → if TRUE, execute statement1; if FALSE, execute statement2, the else branch → stop.)

The else part is optional and is evaluated if test_expression is FALSE.

x <- -5


if(x > 0){


print("Non-negative number")
} else {
print("Negative number")
}
[1] "Negative number"
The above conditional can also be written in a single line as follows.
> if(x > 0) print("Non-negative number") else print("Negative number")
[1] "Negative number"
Syntax of nested if...else statement
if ( test_expression1) {
statement1
} else if ( test_expression2) {
statement2
} else if ( test_expression3) {
statement3
} else
statement4
Only one statement will get executed depending upon the test_expressions.
x <- 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else
print("Zero")
[1] "Zero"
Another example
com1 <- function(number)
{
# if ... else
if (number != 1)
{


cat(number,"is not one\n")


}
else
{
cat(number,"is one\n")
}
}
com1(3)
3 is not one
> com1(1)
1 is one

com2 <- function(number)


{
if (number == 0)
{
cat(number,"equals 0\n")
}
else if (number > 0)
{
cat(number,"is positive\n")
}
else
{
cat(number,"is negative\n")
}
}
> com2(2)
2 is positive
> com2(-2)
-2 is negative
> com2(0)
0 equals 0

The function com3 below demonstrates the use of && in the condition. Both
conditions must evaluate to TRUE before the statement is executed.


com3 <- function(number)
{
if ( (number > 0) && (number < 10) )
{
cat(number,"is between 0 and 10\n")
}
}
> com3(8)
8 is between 0 and 10

5.3 ifelse Statement


A vectorised version of the if statement is ifelse. This is useful if you want to
perform some action on every element of a vector that satisfies some condition.

The syntax is
ifelse( condition, true expr, false expr )
For each element, if the condition is TRUE the true expr is used; if the
condition is FALSE, the false expr is used.

Example
x <- rnorm(20, mean=15, sd=5)
ifelse(x >= 17, sqrt(x), NA)
[1] NA NA NA NA NA NA NA NA
[9] NA NA NA NA NA NA 4.603291 NA
[17] NA 4.387977 NA 4.801747

5.4 for Loops


To loop/iterate through a certain number of repetitions a for loop is used. The
basic syntax is
for(variable_name in sequence) {
command
command
command
}


A simple example of a for loop is:

for(i in 1:5){
print(sqrt(i))
}
[1] 1
[1] 1.414214
[1] 1.732051
[1] 2
[1] 2.236068

Another example is:

n <- 20
p <- 5
value <- vector(mode="numeric", length=n)
rand.nums <- matrix(rnorm(n*p), nrow=n)
for(i in 1:length(value)){
value[i] <- max(rand.nums[i,])
print(sum(value))
}

– First create variables n and p with values 20 and 5 respectively.

– Then create a numeric vector (of zeros) called value with length 20

> value <- vector(mode="numeric", length=n)#initial vector


> value
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

– Next, create a matrix of 20*5=100 random numbers, called rand.nums, with 20 rows.

> rand.nums <- matrix(rnorm(n*p), nrow=n)


> rand.nums
[,1] [,2] [,3] [,4] [,5]
[1,] -0.87730496 -0.46488384 -0.4834062 -0.1229413 -0.57756428
[2,] -0.98041300 -0.26903889 -0.1874345 -0.5653059 -0.43128501
[3,] 0.02709582 -0.07843223 -0.9583897 2.1021781 0.47257098


[4,] 1.80681384 0.26171942 0.4127236 -1.8994535 0.54718825


[5,] -2.32776345 -0.07742778 0.2793574 0.6272354 1.48606203
[6,] 0.05212340 -0.69657145 -0.8007763 0.9722483 1.22723904
[7,] -0.50886040 -1.81070188 0.7565307 0.1582692 1.00600116
[8,] -0.47626183 0.76410344 1.2975840 -0.9178019 1.71174090
[9,] 1.49112244 0.85628299 0.3746094 -1.0491838 -0.18610686
[10,] 1.32179596 0.69579623 -0.1976340 0.5141740 -0.17731601
[11,] 0.82700774 0.41010326 0.5756402 1.8077148 0.04487817
[12,] -1.67733695 -1.79698080 -0.3162465 1.0744179 -1.08034742
[13,] -0.65306827 0.86844218 -0.3163877 -0.0432359 -0.69427903
[14,] -0.85122378 -0.36582098 -1.1491916 -0.7080450 2.20187680
[15,] -0.08079412 0.64780625 -0.3693236 1.3860512 -2.21165262
[16,] 0.11184236 -1.08740802 -0.9639199 -1.6628534 -1.80461710
[17,] -0.16610549 -0.53623433 0.4328577 -1.0426445 -1.25356698
[18,] 0.22799339 -0.66650309 1.0619847 -0.5713357 -0.60580924
[19,] -0.29728305 0.06645103 2.9274367 0.6688510 -0.40750491
[20,] 1.43643249 0.44238136 2.3841454 0.1663345 0.50310488

– The for loop performs 20 loops and stores the maximum value from each
row of rand.nums into position i of the vector value. The sum of the
current numbers in value is also printed to the screen.

> for(i in 1:length(value)){


+ value[i] <- max(rand.nums[i,])
+ print(sum(value))
+ }
[1] -0.1229413
[1] -0.3103758
[1] 1.791802
[1] 3.598616
[1] 5.084678
[1] 6.311917
[1] 7.317918
[1] 9.029659
[1] 10.52078
[1] 11.84258
[1] 13.65029
[1] 14.72471


[1] 15.59315
[1] 17.79503
[1] 19.18108
[1] 19.29292
[1] 19.72578
[1] 20.78777
[1] 23.7152
[1] 26.09935
– See the value of the vector value now
> value
[1] -0.1229413 -0.1874345 2.1021781 1.8068138 1.4860620 1.2272390
[7] 1.0060012 1.7117409 1.4911224 1.3217960 1.8077148 1.0744179
[13] 0.8684422 2.2018768 1.3860512 0.1118424 0.4328577 1.0619847
[19] 2.9274367 2.3841454
Another example
u1 <- rnorm(30) # create a vector filled with random normal values
print("This loop calculates the square of the first 10 elements of vector u1")
[1] "This loop calculates the square of the first 10 elements of vector u1"
usq<-0
for(i in 1:10)
{
usq[i]<-u1[i]*u1[i] # i-th element of u1 squared into i-th position of usq
print(usq[i])
}
[1] 0.01080562
[1] 0.1109192
[1] 0.2443853
[1] 0.07218485
[1] 4.803868
[1] 1.961641
[1] 1.338433
[1] 0.0007863216
[1] 0.7498973
[1] 0.05495671
> print(i)
[1] 10


5.4.1 Nested for loops


# nested for: multiplication table
mymat = matrix(nrow=30, ncol=30) # create a 30 x 30 matrix (of 30 rows
# and 30 columns)

for(i in 1:dim(mymat)[1]) # for each row


{
for(j in 1:dim(mymat)[2]) # for each column
{
mymat[i,j] = i*j # assign values based on position: product of two indexes
}
}
mymat[1:10,1:10]# show just the upper left 10x10 chunk
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 2 3 4 5 6 7 8 9 10
[2,] 2 4 6 8 10 12 14 16 18 20
[3,] 3 6 9 12 15 18 21 24 27 30
[4,] 4 8 12 16 20 24 28 32 36 40
[5,] 5 10 15 20 25 30 35 40 45 50
[6,] 6 12 18 24 30 36 42 48 54 60
[7,] 7 14 21 28 35 42 49 56 63 70
[8,] 8 16 24 32 40 48 56 64 72 80
[9,] 9 18 27 36 45 54 63 72 81 90
[10,] 10 20 30 40 50 60 70 80 90 100

for loops, and especially multiply nested for loops, are generally avoided when
possible in R because they can be quite slow; vectorised operations and the
apply() family of functions are usually faster.
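As an illustration, the multiplication table above can be produced in a single vectorised call with outer(), avoiding the nested loop entirely:

```r
mymat2 <- outer(1:30, 1:30)   # mymat2[i, j] = i * j, same as the nested loop
mymat2[1:5, 1:5]              # show the upper left 5x5 chunk
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    2    3    4    5
# [2,]    2    4    6    8   10
# [3,]    3    6    9   12   15
# [4,]    4    8   12   16   20
# [5,]    5   10   15   20   25
```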

5.5 while loops


The while loop can be used if the number of iterations required is not known
beforehand. For example, if we want to continue looping until a certain
condition is met, a while loop is useful. The following is the syntax for a
while loop:

while (condition){
command
command
}


The loop continues while condition == TRUE.

niter <- 0
num <- sample(1:100, 1)   # first draw
while(num != 20) {
num <- sample(1:100, 1)   # keep drawing until we get a 20
niter <- niter + 1
}
niter   # number of additional draws that were needed

another example

i <- 0
while (i < 4) {
i <- i+1
print (i)
}
[1] 1
[1] 2
[1] 3
[1] 4

Do NOT forget the increase of the counter variable i!

5.5.1 next, break, repeat Statements


The next statement can be used to discontinue one particular iteration of any
loop, i.e. this iteration is ended and the loop "skips" to the next iteration.
Useful if you want a loop to continue even if an error is found (error checking).
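For instance, next can skip selected iterations; this short sketch skips the even numbers:

```r
for(i in 1:5){
  if(i %% 2 == 0) next   # skip even values; jump straight to the next iteration
  print(i)
}
# [1] 1
# [1] 3
# [1] 5
```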

The break statement completely terminates a loop. Useful if you want a loop
to end if an error is found.

The repeat loop uses next and break. The only way to end this type of loop is
to use the break statement.

i <- 0
repeat {


i <- i+1
print (i)
if (i == 4) break
}
[1] 1
[1] 2
[1] 3
[1] 4
If no break is given, the loop runs forever!

5.5.2 The apply() commands


As previously discussed, these commands allow functions to be applied over the elements of matrices, lists and vectors without writing an explicit loop.
apply() Function used on matrix
tapply() table grouped by factors
lapply() on lists and vectors; returns a list
sapply() like lapply(), returns vector/matrix
mapply() multivariate sapply()

apply()
apply(data, margin, function)
> a <- matrix(1:10, nrow=2)
> apply(a, 1, mean)  # 1 = by rows
[1] 5 6
> apply(a, 2, mean)  # 2 = by columns
[1] 1.5 3.5 5.5 7.5 9.5
lapply() and sapply()
> a <- matrix(2:11, nrow=2)
> b <- matrix(1:10, nrow=2)
> c <- list(a, b)
> lapply(c, mean)
[[1]]
[1] 6.5

[[2]]
[1] 5.5


> sapply(c, mean)


[1] 6.5 5.5

mapply()
Like sapply(), but applies the function to the first elements of each argument, then to the second elements, and so on.

# mapply(FUNCTION, list, list, list, ...)
> mapply(rep, pi, 3:1)
[[1]]
[1] 3.141593 3.141593 3.141593

[[2]]
[1] 3.141593 3.141593

[[3]]
[1] 3.141593
# equivalent to:
rep(pi, 3)
rep(pi, 2)
rep(pi, 1)

tapply()
Run a function on each group of values specified by a factor. Requires a vector,
factor and function.
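A small sketch (the scores and group labels here are invented for illustration): compute the mean of a vector within each level of a factor:

```r
scores <- c(70, 85, 90, 60, 75, 95)
group  <- factor(c("A", "B", "A", "B", "A", "B"))
tapply(scores, group, mean)   # one mean per factor level
#        A        B
# 78.33333 80.00000
```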

5.6 Exercises
5.6.1 Exercise 9
1. Write a function that when passed a number, returns the number squared,
the number cubed, and the square root of the number. Also access each
result separately by using the list indices

2. Write a function that when passed a numeric vector, prints the value of
the mean and standard deviation to the screen and creates a histogram of
the data.

3. Management requires a function that calculates the sum of the lengths of


3 vectors. Write the function.


4. For each of the following code sequences, predict the result. Then do the
computation:

(a) answer <- 0


for(j in 3:5) { answer <- j + answer }
(b) answer <- 10
j <- 0
while(j < 5) {
j <- j + 1
if(j == 3)
next
else
answer <- answer + j*answer
}

5. Add up all the numbers for 1 to 100 in two different ways: using a for
loop and using sum.

6. Create a vector x <- seq(0, 1, 0.05). Plot x versus x and use type="l".
Label the y-axis "y". Add the lines x versus x^j where j can have
values 3 to 5 using either a for loop or a while loop.


6 One Sample Hypothesis Tests


If we have a single sample we might want to answer several questions:
– What is the mean value?

– Is the mean value significantly different from current theory? (Hypothesis test)

– What is the level of uncertainty associated with our estimate of the mean
value? (Confidence interval)
Researchers retain or reject hypotheses based on measurements of observed
samples. The decision is often based on a statistical mechanism called
hypothesis testing.

A type I error (also known as a "false positive") is the mishap of falsely
rejecting a null hypothesis when the null hypothesis is true. The probability
of committing a type I error is called the significance level of the hypothesis
test, and is denoted by the Greek letter α. It occurs when we observe a
difference when in truth there is none (or, more specifically, no statistically
significant difference).

A type II error (also known as a "false negative") is the failure to reject a
null hypothesis that is in fact false. The probability of avoiding a type II
error is called the power of the hypothesis test, and is denoted by the
quantity 1 − β. Plainly speaking, a type II error occurs when we fail to
observe a difference when in truth there is one.

                              Truth
Decision            True H0          False H0
Reject              Type I error     Accurate
Fail to reject      Accurate         Type II error

Table 2: Types of errors

To ensure that our analysis is correct we need to check for outliers in the data
and we also need to check whether the data are normally distributed or not.


Graphical methods are often used to check that the data being analysed are
normally distributed. We can use a histogram, box-plot, normal probability
(Q-Q) plot, etc.
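For a numeric vector x, these checks can be drawn as follows (the data here are simulated purely for illustration):

```r
x <- rnorm(50, mean = 10, sd = 2)   # simulated sample

hist(x)       # histogram: look for a roughly bell-shaped outline
boxplot(x)    # box-plot: check symmetry and flag outliers
qqnorm(x)     # normal probability (Q-Q) plot
qqline(x)     # reference line; points should lie close to it
```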

When the sample size is sufficiently large, we can test our hypothesis by
conducting a z-test using the standard normal distribution. We conduct a
one-sample z-test when we want to test whether the mean of the population
(from which we have a random sample) is equal to the hypothesized value.

The null hypothesis for a two-sided test takes the form

H0 : µ = µ0

where µ is the population mean and µ0 is its hypothesized value.

When we know the direction of the difference between µ and µ0, we conduct a
one-sided test.

The alternative hypothesis changes from

H1 : µ ≠ µ0

(two-sided) to
H1 : µ > µ0
or
H1 : µ < µ0
(one-sided).

We calculate the critical value for the α significance level by qnorm(1 − α/2) for
a two-sided test, and qnorm(1 − α) or qnorm(α) for a one-sided test, depending
on the direction of the prior knowledge.

We can also calculate the p-value using 2*pnorm(abs(z), lower.tail=FALSE)
for a two-sided test, and pnorm(z) or pnorm(z, lower.tail=FALSE) for a
one-sided test, where z is the z-statistic.
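Putting these pieces together, a two-sided one-sample z-test can be sketched by hand; the sample values, hypothesized mean and (assumed known) population standard deviation below are invented for illustration:

```r
x     <- c(9.8, 10.4, 10.1, 9.9, 10.6, 10.2, 9.7, 10.3)  # hypothetical sample
mu0   <- 10      # hypothesized mean under H0
sigma <- 0.5     # population sd, assumed known for a z-test
alpha <- 0.05

z    <- (mean(x) - mu0) / (sigma / sqrt(length(x)))  # z-statistic
crit <- qnorm(1 - alpha/2)                           # two-sided critical value
p    <- 2 * pnorm(abs(z), lower.tail = FALSE)        # two-sided p-value

abs(z) > crit   # reject H0 if TRUE (equivalently, if p < alpha)
```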
