Professional Documents
Culture Documents
For Beginners: Emmanue L Paradis
For Beginners: Emmanue L Paradis
R for beginners
Emmanuel Paradis
1 What is R ? 3
6
3 Data with R 8
8
"
!
$
)
(
11
3.4.2 Random sequences
)
(
13
3.5 Manipulating objects 13
(
" "
( $ (
"
! !
(
!
, (
-
4 Graphics with R 18
. /
0 0
1
0
18
4.1.2 Partitioning a graphic window 18
0 0
19
2
(
! !
0 $
!
4 4
5
6
7
26
(
( 0 (
<
:
. .
8 Index 31
2
4 < < 4 4
The goal of the present document is to give a starting point for people newly interested in R. I
= @
/ . 5 > 5 . 5 / 5 . / / 5 5 . 5
7 ? 6
tried to simplify as much as I could the explanations to make them understandables by all,
/ /
6 7 7 7 6
while giving useful details, sometimes with tables. Commands, instructions and examples are
/
? 7 7 7
I thank Julien Claude, Christophe Declercq, Friedrich Leisch and Mathieu Ros for their
B
/
7 7 7
comments and suggestions on an earlier version of this document. I am also grateful to all the
7 ? 7 7
members of the R Development Core Team for their considerable efforts in developing R and
/ /
? ?
animating the discussion list ‘r-help’. Thanks also to the R users whose questions or
/
7 7 7
4
> > 5 . .
7
1 What is R ?
E
F H
R is a statistical analysis system created by Ross Ihaka & Robert Gentleman (1996, J.
I
6 6 6
M
Comput. Graph. Stat., 5: 299-314). R is both a language and a software; its most remarkable
N
7
J K L K L
features are:
5 5 5 5 .
? 6
< < 4 4 4
• a suite of operators for calculations on arrays, matrices, and other complex operations,
/ . . . 5 5 . . > . 5 . > / / . 5
7 7 6 O
/ /
7 7 7 6
/ /
? 7 7 6
4 4 < 4 4 4
5 5 . 5 .
7 7 6
4 4 < 4 <
. . . > > .
? 6
the conceptions of R and S, but they are not of interest to us here: those who want to know
/
7 6 7
more on this point can read the paper by Gentleman & Ihaka (1996) or the R-FAQ
/ / /
6
V
U U U U U
/ / /
6 7
software.
R is freely distributed on the terms of the GNU Public Licence of the Free Software
W S D R X
. . 5 . > 5 . .
6 7 7
< < 4
Foundation (for more information: http://www.gnu.org/); its development and distribution are
U U U
X
/
6 ? ? 6
/ /
? ? ?
R is available in several forms: the sources written in C (and some routines in Fortran77)
? ? 7 7
ready to be compiled, essentially for Unix and Linux machines, or some binaries ready for use
S
/
6 6 7 6 7
< 4 4 < 4 4 4
. . 5 5 . 5 5
? 6 6
c d
]
Y [ Y [ ^ [ _ ` a a [ a
\ b
"
0
k
( ,
j j
# # f f % m
PPC M a c OS
o 1 #
n n +
o %
LinuxPPC 5.0 (
# %
(
# h m
(
< 4 4 4 V <
The files to install these binaries are at http://cran.r-project.org/bin/ (except for the Macintosh
U U U U
= T
5 5 . . / . 5 . / . . 5 / . 5
O
version 1) where you can find the installation instructions for each operating system as well.
/
? 6 7 7 6
R is a language with many functions for statistical analyses and graphics; the latter are
/
7 6 7 6
visualized immediately in their own window and can be saved in various formats (for
? 7 6 ? ? 7
4 V <
example, jpg, png, bmp, eps, or wmf under Windows, ps, bmp, pictex under Unix). The
S =
4 < 4 4 4 4
results from a statistical analysis can be displayed on the screen, some intermediate results (P
q
-values, regression coefficients) can be written in a file or used in subsequent analyses. The R
=
. . 5 5 5 . 5 5 . 5 5 5
? 7 7 7 7 6
4 4 4 < 4 < 4 4
language allows the user, for instance, to program loops of commands to successively analyse
" k
" " "
1 The Macintosh port of R has just been finished by Stefano Iacus <jago@mclink.it>, and should be available
#
+ e
r
s t
!
( ( ( $
soon on CRAN.
o *
g
4
several data sets. It is also possible to combine in a single program different statistical
/ /
?
functions to perform more complex analyses. The R users may benefit of a large number of
=
routines written for S and available on internet (for example: http://stat.cmu.edu/S/), most of
U U U U
4
. 5 5 .
7 7 6
At first, R could seem too complex for a non-specialist (for instance, a biologist). This may
P
=
not be true actually. In fact, a prominent feature of R is its flexibility. Whereas a classical
@
5 . 5 / . > 5 5 . .
7 7 6 7 O 6
< 4 4 4 4 4 < 4
software (SAS, SPSS, Statistica, ...) displays (almost) all the results of an analysis, R stores
P
D
. / > . 5 5 .
6 7 6
these results in an object, so that an analysis can be done with no result displayed. The user
u
/
7 6 7 6 7
may be surprised by thus, but such a feature is very useful. Indeed, the user can extract only
/
6 7 6 7 7 7 7 ? 6 7 7 7 6
the part of the results which is of interest. For example, if one runs a series of 20 regressions
/ /
7 7
and wants to compare the different regression coefficients, R can display only the estimated
5 5 > / . . 5 . . 5 5 5 / 5 >
6 6
coefficients: thus the results will take 20 lines, whereas a classical software could well open
5 . 5 . . / 5
7 7 7
4 4 4 4 4 <
20 results windows. One could cite many other examples illustrating the superiority of a
. 5 5 > 5 . > / . 5 / . .
7 7 6 O 7 7 6
4 4 < 4 4 <
system such as R compared to classical softwares; I hope the reader will be convinced of this
@
> > / . . / . . 5 5
6 7 ?
<
. . 5 > 5
7
w w x w H F
4 4 < 4
Once R is installed on your computer, the software is accessed by launching the corresponding
5 5 5 . > / . . 5 5 . . / 5 5
6 7 7 6 7
4
executable (RGui.exe ou Rterm.exe under Windows, R under Unix). The prompt ‘>’ indicates
y y y S =
. > 5 . 5 5 . 5 / . > / 5
O 7 7 O 7 O 7 7 O
6 7
Under Windows, some commands related to the system (accessing the on-line help, opening
S
/ /
6
< 4 4 4 <
files, ...) can be executed via the pull-down menus, but most of them must heve to be typed on
the keyboard. We shall see in a first step three aspects of R: creating and modifying elements,
. 5 . / . / . 5 5 > 5 > 5
6 6
4 4 V 4 4
listing and deleting objects in memory, and accessing the on-line help.
5 5 5 5 > > . 5 5 5 5 /
6
| }
R is an object-oriented language: the variables, data, matrices, functions, results, etc. are
u
7 ? 7 7
V
stored in the active memory of the computer in the form of objects which have a name: one
/
? 6 7 ?
V < V 4 4 < V
has just to type the name of the object to display its content. For example, if an object n has
X
/ 5 > / 5 5 . > / 5
7 6 6 O
< 4
for value 10 : .
? 7
> n
[1] 10
The digit 1 within brackets indicates that the display starts at the first element of n (see §
/
6
V
3.4.1). The symbol assign is used to give a value to an object. This symbol is written with a
6 7 ? ? 7 6
L
v v 4 4
bracket (< or >) together with a sign minus so that they make a small arrow which can be
< 4 <
. . > . . . .
?
> n <- 15
> n
[1] 15
> 5 -> n
> n
[1] 5
4 4 <
5 5 . 5 . > / . 5
? 7 ? 7 O
[1] 12
4 4 V 4
Note that you can simply type an expression without assigning its value to an object, the result
W
5 > / / 5 / . 5 5 5 5 .
6 7 6 6 O 7 ? 7 7
/
7 6 7 6
> (10+2)*5
[1] 60
{
}
V V
The function ls() lists simply the objects in memory: only the names of the objects are
/
7 6 6 6
displayed. /
6
> name <- "Laure"; n1 <- 10; n2 <- 100; m <- 0.5
A
> ls()
[1] "m" "n1" "n2" "name"
Note the use of the semi-colon ";" to separate distinct commands on the same line. If there are
W
/
7
6
V
a lot of objects in memory, it may be useful to list those which contain given character in their
6 6 7 7 ?
name: this can be done with the option pattern (which can be abbreviated with pat) :
5 > 5 5 / 5 5 .
?
> ls(pat="m")
V
If we want to restrict the list of objects whose names strat with this character:
> ls(pat="^m")
[1] "m"
4 4 V <
To display the details on the objects in memory, we can use the function ls.str():
=
/ 5 5 > > . 5 5 5
6 6 7 7
> ls.str()
m: num 0.5
n1: num 10
4 < 4 <
The option pattern can also be used as described above. Another useful option of ls.str()
P
=
/ 5 5 . 5 . / 5
7 ? 7 7
is max.level which specifies the level of details to be displayed for composite objects. By
Q
/ / . > /
? 6 6
default, ls.str() displays the details of all objects in memory, including the columns of data
/ 5 > > . 5 5 > 5
7 6 6 7 7
< 4 4 4 4 4 4 4
frames, matrices and lists, which can result in a very long display. One can avoid to display all
. > > . 5 5 . 5 . 5 / 5 5 /
7 ? 6 6 ? 6
> ls.str(pat="M")
$ n1: num 10
A
$ m: num 0.5
A
V V
To delete objects in memory, we use the function rm(): rm(x) deletes the object x, rm(x,y)
6 7 7
V V
both objects x et y, rm(list=ls()) deletes objects in memory; the same options mentioned
/
6 6
< < 4 4 4 V
for the function ls() can then be used to delete selectively some objects:
. 5 5 5 5 >
7 7 ? 6
rm(list=ls(pat="m")).
{ {
|
The on-line help of R gives some very useful informations on how to use the functions. The
= =
5 5 / > . 5 . > 5 5 5 5
? ? 6 7 7 7 7
4 4 < 4 4
/ 5 > . > / 5
6 6
> help.start()
v 4 4 4 < <
A search with key-words is possible with this html help. A search of functions can be done
P P
. . / > / . 5 5 5 5
6 7
with apropos("what") which lists the functions with "what" in their name:
> apropos("anova")
A
The help is available in text format for a given function, for instance:
/
? ? 7
7
9
> ?lm
displays the help file for the function lm(). The function help(lm) or help("lm") has the
/ /
6 7 7
same effect. This last function must be used to access the help with non-conventional
/
7 7 7 ?
characters:
. .
> ?*
Error: syntax error
> help("*")
A
Arithmetic Operators
¡
...
8
¢
3 Data with R F
{
v V 4 4
R works with objects which all have two intrinsic attributes: mode and length. The mode is
=
. 5 . 5 . 5 >
? 7
the kind of elements of an object; there are four modes: numeric, character, complex, and
5 > 5 5 . . . > 5
7
A
<
logical. There are modes but these do not characterize data, for instance: function,
=
. . > 5 . . . 5 5
7
A
4 < 4 < V
expression, or formula. The length is the total number of elements of the object. The
= =
. 5 > . > 5
7
A
V
/
7 7 6
object
¦
¨ ¨ ¨
^ § a a [ « § [ a a [ [ ® _ « § [ a ^ § a a [ ¯
° ±
©
[ a _ [ § [ Y
(
j j j
(
(
j j j
(
j j j
j
j
j
Yes
j
j
j
Yes
list
" " "
! !
( , ( ,
j j j j j j
formula
(
4 4 < 4 4
4 ³ 4 < ³
An array is a table with k dimensions, a matrix being a particular case of array with k = 2.
P
5 > 5 5 > . 5 / . .
O 7
Note that the elements of an array or of a matrix are all of the same mode. A data.frame is a
W
table composed with several vectors all of the same length but possibly of different modes. A
/ /
? ? 7 6
ts time-series data set and so contains supplementary attributes such as the frequency and the
/ /
7 6 7 7 7 6
dates.
< V v
Among the non-intrinsic attributes of an object, one is to be kept in mind: it is dim which
P
> 5 5 5 5 . 5 . 5 5 / 5 > 5
7
V
corresponds to the dimensions of a multivariate object. For example, a matrix with 2 lines and
/ /
7 ?
2 columns has for dim the pair of values [2,2], but its length is 4.
/
7 ? 7 7
It is useful to know that R discriminates, for the names of the objects, the upper-case
@
5 . > 5 . 5 > / / .
7 7 7
< 4
characters from the lower-case ones (i.e., it is case-sensitive), so that x and X can be used to
´
. . . > . 5 5 5 5
? 7
V
5 > 5 5 5 . 5
? 7
¶
| } ¶ ¶
¶
¸
< 4 <
R can read data stored in text (ASCII) files; three functions can be used: read.table()
P
@ @
5 . . 5 . 5 5 5
O 7 7
(which has two variants: read.csv() and read.csv2()), scan() and read.fwf(). For
X
. 5 5 5 .
?
º
4 < < 4 V
> / 5 5 /
O ? 7 6
2
9
mydata will then be a data.frame, and each variable will be named, by default, V1, V2, ...
» »
? 6 7
4 4 4
¼
¼
.
6
¼
mydata["V2"], ..., or, still another solution, by mydata[,1], mydata[,2], etc2. There are
4 4 4
¼
. 5 . 5 . .
7 6
several options available for the function read.table() which values by default (i.e. those
/
? ? 7 ? 7 6 7
used by R if omitted by the user) and other details are given in the following table:
7 6 6 7 ?
½ ¾ ¿
A º
½ ¾ ¿ À ¾ Á Â ¿
strip.white=FALSE)
file
the name of the file (within ""), possibly with its path (the symbol \ is not allowed and must
h m h Ã
0 0 0 (
"
$ ( 0
header
" "
"
"
a logical (FALSE or TRUE) indicating if the file contains the names of the variables on its
h * # m
Ä 3 Å p Å
!
$
"
first line
sep
"
" "
the field separator used in the file, for instance sep="\t" if it is a tabulation
( (
quote
( $
dec
(
row.names
a vector with the names of the lines which can be a vector of mode character, or the number
$ 0 0 $ (
"
" "
!
$ (
j j j
col.names
a vector with the names of the variables (by default: V1, V2, V3, ...)
h m
Æ Æ Æ
$ 0 $ (
j j j
as.is
controls the conversion of character variables as factors (if FALSE) or keep them as
h # m
$ $
characters (TRUE)
h m
p Å
na.strings
$ ( $ $
skip
(
check.names
$ $
strip.white
" "
(conditional to sep) if TRUE, scan deletes extra spaces before and after the character
h m
p Å
,
"
variables
Two variants of read.table() are useful because they have different by default options:
/
? 7 7 7 6 ? 6 7
Á Â ¿
A
Á Â ¿
A
< < 4 4
The function scan() is more flexible than read.table() and has more options. The main
= =
5 5 > . 5 5 > . / 5 > 5
7 O
difference is that it is possible to specify the mode of the variables, for example :
. 5 / / > . . > /
6 ? O
reads in the file data.dat three variables, the first is of mode character and the next two are of
. 5 . . . > . . 5 5 .
? O
/
7
A A
À ¾ ½ ¾ ¿
A
strip.white=FALSE, quiet=FALSE)
A
file
h m h Ã
the name of the file (within ""), possibly with its path (the symbol \ is not allowed and must be
0 0 0 (
"
k
replaced by /, even under Windows); if file="", the data are input with the keyboard (the
f m h
8
$ ( 0 ( 0
j j
" k "
!
0
what
"
2 Nevertheless, there is a difference: mydata$V1 and mydata[,1] are vectors whereas mydata["V1"] is a
g
$ $ 0
data.frame.
!
%
10
nmax
h
the number of data to read, or, if what is a list, the number of lines to read (by default, scan
( ( (
j j j j
m
(
n
h m
( (
sep
"
"
(
quote
"
!
( $
dec
"
!
(
skip
" k
!
(
nlines
"
!
(
na.string
"
!
$ ( $ $
flush
" "
"
"
a logical, if TRUE, scan goes to the next line once the number of columns has been reached
p Å
! !
, ( (
j j
" "
"
! !
0 (
strip.white (conditional to sep) if TRUE, scan deletes extra spaces before and after the character
" "
h m
p Å
,
"
variables
quiet
a logical, if FALSE, scan displays a line showing which fields have been read
#
0 0 $
j j
The function read.fwf() can be used to read in a file some data in fixed width format:
=
5 5 5 . 5 > 5 . >
7 7 O
col.names)
The options are the same than for read.table() except widths which specifies the width of
/ / /
the fields. For example, if the file data.txt has the following data:
A1.501.2
¾
A1.551.3
B1.601.4
Ç
B1.651.5
C1.701.6
C1.751.7
5 5 . >
º º
> mydata
V1 V2 V3
¼ ¼ ¼
1 A 1.50 1.2
2 A 1.55 1.3
¾
3 B 1.60 1.4
Ç
4 B 1.65 1.5
5 C 1.70 1.6
6 C 1.75 1.7
È
} É } }
< V
5 5 . 5 . > . . 5
7 ? O
º
array) in the file data.txt. There are two options: nc (or ncol) which defines the number of
=
. . 5 . . / 5 . 5 5 > .
6 O 7
columns in the file (by default nc=1 if x is of mode character, nc=5 for the other modes), and
7 6 7
append (a logical) to add the data without erasing those possibly already present in the file
/ /
7 6 6
< 4 4
. . 5
7 ? 7
< < 4
5 5 . 5 / 5 .
7
º
Â
A
º
11
sep
(
col.names
a logical indicating whether the names of the columns are written in the file
0 ( 0
row.names
"
!
quote
" "
"
a logical or a numeric vector; if TRUE, the variables of mode character are quoted with ""; if a
p Å
! ! )
( $ $ ( 0
"
"
numeric vector, its elements gives the indices of the columns to be quoted with "". In both cases,
e
! ! ! )
( $ $ ( ( 0
j j
"
" "
the names of the lines and of the columns are also quoted with "" if they are written.
! ! )
( ( 0 0
dec
"
!
(
na
!
(
eol
"
( (
To record a group of objects in a binary form, we can use the function save(x, y, z,
=
. . . / 5 5 . . > 5 5 5
7 6 7 7
A
option ascii=TRUE can be used. The data (which are now called image) can be loaded later in
Â
/
7
< <
Á Ê
> > . 5 5 . .
6 7 7
A
save(list=ls(all=TRUE), file=".RData").
Á Â ¿ Á Ê
Ë Ì
¶ } } }
Í Ï Ð
[ ` _ ® a [ Ò [ ¯ Y [ a
\ \
Î Î
A regular sequence of integers, for example from 1 to 30, can be generated with:
/
7 7
The resulting vector x has 30 éléments. The operator ‘:’ has priority on the arithmetic
/ /
7 ? 6
/ /
> 1:10-1
[1] 0 1 2 3 4 5 6 7 8 9
> 1:(10-1)
[1] 1 2 3 4 5 6 7 8 9
< 4 < 4 4
5 5 5 5 . 5 . 5 > .
7 7 7
where the first number indicates the start of the sequence, the second one the end, and the
. . 5 > . 5 . 5 5 5 5 5
7 7
third one the increment to be used to generate the sequence. One can use also:
7 7 7
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
It is also possible to type directly the values using the function c() :
/ /
6 6 ? 7 7 7
which gives exactly the same result, but is obviously longer. We shall see late that the
> . 5 .
? O 6 7 7 ? 7 6
function c() is more useful in other situations. The function rep() creates a vector with
7 7 7 7 7 ?
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
12
The function sequence() creates a series of sequences of integers each ending by the
=
5 5 . . 5 5 . 5 5
7 7 6
A
5 > . 5 . > 5
7 ? 7
> sequence(4:5) A
[1] 1 2 3 4 1 2 3 4 5
> sequence(c(10,5)) A
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5
The function gl() is very useful because it generates regular series of factor variables. The
7 ? 6 7 7 7 7 ?
usage of this fonction is gl(k, n) where k is the number of levels (or classes), and n is the
7 7 ?
< 4 4 4 <
number of replications in each level. Two options may be used: length to specify the number
=
5 > . . / 5 5 / 5 > / 5 > .
7 ? 6 7 6 7
of data produced, and labels to specify the names of the factors. Examples:
C
/ . 5 / 5 > . > /
7 6 O
> gl(3,5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> gl(3,5,30)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> gl(2,8, label=c("Control","Treat"))
[1] Control Control Control Control Control Control Control Control Treat
6 ?
given as arguments:
? 7
½
º
h w sex
1 60 100 Male
2 70 100 Male
3 80 100 Male
4 60 200 Male
5 70 200 Male
6 80 200 Male
7 60 300 Male
8 70 300 Male
9 80 300 Male
10 60 100 Female
11 70 100 Female
½
12 80 100 Female
13 60 200 Female
½
14 70 200 Female
15 80 200 Female
16 60 300 Female
½
17 70 300 Female
18 80 300 Female
½
Í Í
¬
Î
_
4
§
a
4
[ Ò
\
[ Y [ a
< 4 <
It is classical in statistics to generate random data, and R can do it for a large number of
@
5 5 . . 5 > 5 5 . . 5 > .
7
probability density functions. These functions are built on the following form:
=
/ . 5 5 5 5 5 . 5 5 . >
6 6 7 7 7
13
whre func indicates the law of probability, n the number of data to generate and p[1], p[2],
/
6 7
... are the values for the parameters of the law. The following table gives the details for each
=
. . / . > . 5 .
? 7 ?
law, and the possible default values (if none default value is indicated, this means that the
5 / 5 5 5 > 5
7 ? 7 7 ? 7
/ /
7 6 7
loi commande
ª ¬
§ Y § « « _ ¯ [
Gaussian (normal)
(
rexp(n, rate=1)
"
exponential
,
gamma
Poisson
rpois(n, lambda)
rweibull(n, shape, scale=1)
" "
Weibull
8
Cauchy
o
(
beta
rt(n, df)
‘Student’ (t)
# h m
(
Fisher (F)
h m
Ä
Ô
binomial !
rgeom(n, prob)
geometric
!
rhyper(nn, m, n, k)
hypergeometric
!
logistic
lognormal
negative binomial
$
uniform
4 4 < 4 4 4
Note all these functions can be used by replacing the letter r with d, p or q to get, respectively,
W
5 5 5 . / 5 . . . /
7 7 6 ? 6
4 4 4
the probability density (dfunc(x)), the cumulative probability density (pfunc(x)), and the
/ . 5 > / . 5 5
6 6 7 7 ? 6 6
4 < 4
5
? 7 7
~ ~
}
| }
¸
Í °
©
Y Y [ a a ` _ ^ _ Y _ _ [ § _ § [ Y
\ \
Î Î
To access, for example, the third value of a vector x, we just type x[3]. If x is a matrix or a
= @
. > / . . / > . .
O ? 7 ? O 7 6 O O
4 < 4 V 4 4 4
data.frame, the value of the ith line and jth column is accessed with x[i,j]. To change all
=
5 5 > 5 5
? 7 7
/
? 7 7 6
This indexing system is easily generalised to arrays, with as many indices as the number of
=
dimensions of the array (for example, a three dimensional array: x[i,j,k], x[,,3], ...). It is
/
6 6
useful to keep in mind that indexing is made with straight brackets [], whereas parentheses
/ /
7 7
. . . > 5 5 5
7 7 7
> x(1)
A A
4 4 4 4 4 4
Indexing can be used to suppress one or several lines or columns. For examples, x[-1,] will
@ X
5 5 5 / / . 5 . . 5 . > 5 . > /
O 7 7 ? 7 O
< 4 4 4 < 4
suppress the first line, or x[-c(1,15),] will do the same for the 1st and 15th lines.
/ / . . 5 . > . 5 5
7
14
4 4 < 4
For vectors, matrices and arrays, it is possible to access the values of an element with a
X
. . > . 5 . . / 5 > 5
? 6 ? 7
<- 20
> x
[1] 1 2 3 4 20 20 20 20 20 20
> x[x == 1]
<- 25
> x
[1] 25 2 3 4 20 20 20 20 20 20
The six comparaison operators used by R are: < (lesser than), > (greater than), <= (lesser than
/ /
7 6
or equal to), >= (greater than or equal to), == (equal to), et != (different from). Note that these
W
7 7 7
4 < 4 4
/ . . . . 5 . > .
7 ?
Í
¬
[ Y a _ a ^ [ a Y § a
\
Î Î
There are numerous functions in R to manipulate data. We have already seen the simplest one,
/ /
7 7 7 7 ? 6
V 4 4
5 5 5 / . 5 . > /
O
. 5 > 5 / . > / . 5
7 O
> z <- x + y
> z
[1] 2.0 3.0 4.0 5.0
< < < 4 4
Vectors of different lengths can be added; in this case, the shortest vector is recycled.
»
. . 5 5 5 5 . . .
? 6
Examples:
C
> /
O
> y
<- c(1,2)
> z <- x+y
> z
[1] 2 4 4 6
Warning message:
longer object length
A
> z
[1] 2 4 4
Note that R has given a warning message and not an error message, thus the operation has
W
/
? 7
been done. If we want to add (or multiply) the same value to all the elements of a vector:
/
7 6 ? 7 ?
> x
<- c(1,2,3,4)
> a <- 10
> z <- a*x
> z
[1] 10 20 30 40
<
. > / . . . 5 . / .
15
Two other useful operators are x %% y for “x modulo y”, and x %/% y for integer divisions
=
. / . . . . > 5 . 5 . 5
7 7 O 7 6 ?
< <
. . 5 5 . / . 5
7 ? O 6 6
The functions available in R are too many to be listed here. One can find all basic
7 ? 6
mathematical functions (log, exp, log10, log2, sin, cos, tan, asin, acos, atan, abs, sqrt,
...), special functions (gamma, digamma, beta, besselI, ...), as well as diverse functions useful
/ 5 5 . 5 5
7 ? 7 7 7
5 > 5 5 . 5 5
7
sum(x)
"
! !
( ,
prod(x)
(
which.max(x)
(
which.min(x)
(
(
length(x)
number of elements in x
(
mean(x)
"
! !
,
median(x)
"
! !
,
" " "
! !
( $ , ( , ,
×
" "
!
$ $ , (
cor(x)
"
! !
, , , , $
"
"
var(x,y) ou covariance between x and y, or between the columns of x and the columns of y if they
! !
( $ 0 , 0 ( , (
cov(x,y)
! !
cor(x,y)
" " "
! !
0 , ,
data.frames
!
These functions return a single value (thus a vector of length one), except range() which
/
7 7 ? 7 7 ?
< 4
returns a vector of length two, and var(), cov() and cor() which may return a matrix. The
=
. . 5 . 5 5 5 > . . 5 > .
7 ? 6 7 O
< 4 4 < 4 4
5 5 5 . . 5 > . > / .
7 7 O 7
round(x,n)
" "
! !
( ,
rev(x)
"
!
$ ,
sort(x)
log(x,base)
( 0
pmin(x,y,...) a vector which ith element is the minimum of x[i], y[i], ...
$
0
j j
! ! !
, (
cumsum(x)
"
! ! !
$ 0 (
cumprod(x)
(
cummin(x)
cummax(x)
(
match(x,y)
h m
returns a vector of same length than x with the elements of x which are in y (else NA)
g
( $ 0 0
which(x==a)
h m
returns a vector of the indices of x if the comparison operation is true (TRUE), i.e. the
p
( $ (
values of i for which x[i]==a (or x!=a; the argument of this function must be a variable of
h
$ ( 0 ( ( ( $
which(x!=a)
mode logical)
m
...
16
choose(n,k)
Ø Ø Ø
f h m
( $
× × ×
na.omit(x)
h m h
suppresses the observations with missing data (NA) (suppresses the corresponding line if x
g
( $ 0 (
is a matrix or a data.frame)
na.fail(x)
!
( ,
table(x)
"
" " "
returns a table with the numbers of the differents values of x (typically for integers or
h
!
( 0 ( $ ( ,
factors)
subset(x,...) returns a selection of x with respect to criteria (...) depending on the mode of x (typically
"
" "
h m h
!
( 0
" "
comparisons: x$V1 < 10); if x is a data.frame, the option select allows the user to
m
!
0 (
h m
$ ( (
Í Í
_ Y § ^ _ §
\
Î Î
< 4 < 4
R has facilities for matric computation and manipulation. A matrix can be created with the
P
<
function matrix():
5 5
7
[,1] [,2]
[1,] 5 5
[2,] 5 5
> matrix(1:6, nr=2, nc=3)
The functions rbind() and cbind() bind matrices with respect to the lines or the columns,
=
5 5 5 5 > . . / 5 . > 5
7 7
respectively:
/
? 6
> rbind(m1,m2)
[,1] [,2]
[1,] 1 1
[2,] 1 1
[3,] 2 2
[4,] 2 2
> cbind(m1,m2)
The operator for the product of two matrices is ‘%*%’. For example, considering the two
= X
/ . . . / . > . . > / 5 . 5
7 O
[,1] [,2]
[1,] 10 10
[2,] 10 10
< < < 4
The transposition of a matrix is done with the function t(); this function also with a
=
. 5 / 5 > . 5 5 5 5 5
O 7 7
data.frame.
The function diag() can be used to extract or modify the diagonal of a matrix, or to build
7 7 6 7
diagonal matrix.
17
9
> diag(m1)
[1] 1 1
> diag(rbind(m1,m2) %*% cbind(m1,m2))
[1] 2 2 8 8
> m1
[,1] [,2]
[1,] 10 1
[2,] 1 10
> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
18
Ú
4 Graphics with R
G
R offers a remarkable variety of graphics. To get an idea, one can type demo(graphics). It is
/ /
? 6 6
not possible to detail here the possibilities of R in terms of graphics, particularly each graphic
/ / / / /
7 6
function has a large number of options making the production of graphics very flexible. I will
/ / /
7 7 7 ? 6
< < 4
. 5 > 5 . / 5
?
{
Ë
} } ¶ } |
Ï Ð Ð ] ¨ ¨ ¨
^ [ ¯ ¯ ` a [ [ ® _ ` ® _ ^ Y ¯ § a
Ý Ý
Î Î
When a graphic function is typed, a graphic window is open with the graph required. It is
/ / / / /
7 6 7
/ / /
6 6
> x11()
The window so open becomes the active window, and the subsequent graphs will be displayed
/ / /
? 7 7 6
v 4
5 5 . / 5 . . . 5 / 5
7 6
> dev.list()
windows windows
2 3
The figures displayed under windows are the numbers of the windows which can be used to
/
7 6 7 7 7
> dev.set(2)
windows
2
¬
_ § ` _ ` _ ^ Y §
Ý Ý
Î Î
<
The function split.screen() partitions the active graphic window. For instance,
= X
5 5 / . 5 . / 5 . 5 5
7 ?
4
split.screen(c(1,2)) divide the window in two parts which can be selected with
5 5 / . 5
?
The function layout() allows more complex partitions: it partitions the active graphic
/ / / /
7 ?
A
4 4 4 4 4 4
window in several parts where the graphs will be displayed successively. For example, to
X
5 5 . / . . . / / . > /
? 6 7 ? 6 O
< 4
5 5 . / .
? 7 7
A
< <
where the vector gives the numbers of the sub-windows, and the two figures 2 indicates that
. . 5 > . 5 5 . 5
? ? 7 7 7
the window will be divided in two rows and two columns. The command:
? 7
A
4 4 4
will create six sub-windows, three in row, and two in column, whereas:
. 5 . 5 . 5 5 > 5 .
O 7 7
A
4 4 4 4
will also create six sub-windows, but two in row, and three in column. The sub-windows may
=
. 5 5 . 5 . 5 > 5 5 >
O 7 7 7 7 6
be of different sizes:
A
2
19
will open two sub-windows in row in the left half of the window, and a third sub-window in
/
7 7
4 < 4 4 4
. 5 . 5 5 5 . /
6
A
4 <
the vectors c(3,1) and c(1,3) giving the relative dimensions of the sub-windows.
. 5 5 . > 5 5 5
? ? ? 7
4 <
To visualize the partition created by layout() before drawing the graphs, we can use the
=
/ . 5 . . . 5 . / 5
? 7 6 7
A
function layout.show(2), if, for example, two sub-windows have been defined.
/
7 7 ?
A
{ ¹
Ë
¶ } |
¸
/
? ? 7
plot(x)
h m
$ (
Þ ß
plot(x,y)
h m h m
$
ß Þ
sunflowerplot(x,y)
" "
id. than plot() but the points with similar coordinates are drawn as flowers which
!
( 0 0 0 0
"
!
!
( (
piechart(x)
"
circular pie-chart
(
boxplot(x)
k "
“box-and-whiskers” plot
, 0
stripplot(x)
"
" " " " " "
plot the values of x on a line (an alternative to boxplot()for small sample sizes)
h m
! !
$ ( , $ à
coplot(x~y|z)
"
"
$ , $ ( à à
interaction.plot
if f1 and f2 are factors, plots the means of y (on the y-axis) with respect to the values
h m
0 $ (
Þ
(f1,f2,x)
of f1 (on the x-axis) and of f2 (different curves) ; the option fun= allows to choose
h
ß
,
m
h
(
$
m
" "
0
"
! !
( (
matplot(x,y)
bivariate plot of the first column of x vs. the first one of y, the second one of x vs. the
$ (
á â á â
ã j ã
dotplot(x)
$
column-by-column)
m
( (
pairs(x)
" " " "
!
, , 0 $ 0
"
columns of x
(
!
,
plot.ts(x)
if x is an object of class ts, plot of x with respect to time, x may be multivariate but
0 ( $ (
j j
)
( $ (
ts.plot(x)
id. but if x is multivariate the series may have different dates and must have the same
( ( $ $ ( $
frequency3 )
(
)
(
$
(
qqnorm(x)
)
( 0 $ ( ( 0
qqplot(x,y) )
0
)
(
contour(x,y,z)
h m
creates a contour plot (data are interpolated to draw the curves), x and y must be
( 0 ( $ (
$ (
image(x,y,z)
h m
( 0 ( (
persp(x,y,z)
h m
( (
For each function, the options may be found with the on-line help in R. Some of these options
X
. 5 5 / 5 > 5 5 5 / 5 > / 5
7 6 7
are identical for several graphic functions; here are the main ones (with their possible default
. 5 . . . / 5 5 . . > 5 5 . /
? 7 7
values): 7
k
"
3 The function ts.plot() is in the package ts and not in base as for the other graphic functions listed in the
( (
" " k
r
%
20
< 4 <
add=FALSE
½ ¾ ¿
/ . / / 5 / . 5
7 ? 7 O
<
axes=TRUE
Á Â ¿
5 .
O
specifies the type of plot, "p": points, "l": lines, "b": points connected by
type="p"
/ / / / /
6 6
lines, "o": id. but the lines are over the points, "h": vertical lines , "s": steps,
/ /
7 ? ?
< 4 4
the data are represented by the top of the vertical lines, "S": id. but the data
. . / . 5 / . 5
6 ? 7
< 4 4
. . / . 5 > . 5
6 ?
4 <
xlab=, ylab= annotates the axes, must be variables of mode character (either a character
5 5 > . > . . .
O 7 ?
main=
7 ?
4 4 4 <
sub=
. 5 5 > . 5
7
É | }
R has a set of graphic functions which affect an already existing graph: they are called low-
. / 5 5 5 . 5 . / .
7 6 O 6
4 4 4
points(x,y)
(
lines(x,y)
"
( 0
text(x,y,labels,...)
"
adds text given by labels at coordinates (x, y); a typical usage is:
h m
, $ , (
plot(x,y,type="n"); text(x,y,names)
segments(x0,y0,x1,y1)
h m h m
0
ß Þ ß Þ
å å
j j
arrows(x0,y0,x1,y1,
h m h m
id. with an arrow at point (x0,y0) if code=2, at point (x1,y1) if code=1, or at both
0 0
ß Þ ß Þ
å å
j j j j
angle=30, code=2)
points if code=3; angle controls the angle from the shaft of the arrow to the
0
0
abline(a,b)
0
abline(h=y)
" "
0 à
abline(v=x)
" "
0 $
abline(lm.obj)
0 $
rect(x1,y1,x2,y2)
draws a rectangle which left, right, bottom, and top limits are x1, x2, y1, and y2,
0 0
j j j j j j j
"
respectively
$
polygon(x,y)
" " k
0 0 $ ,
legend(x,y,legend)
adds the legend at the point (x,y) with symbols given by legend
h m
0 $
title()
(
axis(side,vect)
h m h m h m
adds an axis at the bottom (side=1), on the left (2), at the top (3), or on the right
j j j
h m h m h m
(4); vect (optional) gives the abcissa (or ordinates) where tick-marks are drawn
$ 0 0
rug(x)
0 $
ß
locator(n, type="n",
" k
"
returns the coordinates (x,y) after the user has clicked n times on the plot with the
h m
!
( ( 0
ß Þ
...) !
0
!
"
h m
"
h m
0
"
"
!
( 0
/ / /
6
< <
. 5 5 . 5 . > . > 5 5
7 7
4 4
mathematical equation according to a coding used in the type-setting TeX. For example,
= ç X
4 4 4
Â
/
6
4 < 4 4 <
5 / 5 5 / 5 . 5
7
è é
1
Uk 37
S
ê ë
T
ð ñ ò ó
1 e ì í î ï
21
4 4 <
To include in an expression a variable we can use the function substitute() together with
=
5 5 5 / . 5 . 5 5 5 .
7 O ? 7 7
A A
5 5 . > / 5 / . > /
7 O 7 ? 7 ? 7 6 7
V
Á Á
A A A
4 4 4 4 <
/ 5 / / 5 . 5
6 O 6
R2 = 0.9856298
4 4 4 < < 4 4
/ 5 . > 5 >
6 6 6
A A A A
4 4 4
. 5
7
R2 = 0.986
4 4 4 4
5 . 5 . > > 5 5 5
6 ?
>text(x, y, as.expression(substitute(italic(R)^2==r,
Á
A A
list(r=round(Rsquared,3)))))
A A
R2 = 0.986
ô
} | | } }
/ / / /
? ?
with graphic parameters. They can be used either as options of graphic functions (but it does
/ / / /
6 7 7 7
not work for all), oe with the function par() to change permanently the graphic parameters,
/ / /
7 6
4 <
i.e. the subsequent plots with respect to the parameters specified by the user. For instance,
X
5 / . / / . > . / . . 5 5
7 7 6 7
4
l’instruction suivante:
5 . 5 5
7 7 ?
> par(bg="yellow")
4 4 4 4 4 4 4 v
will draw all subsequent plots with a yellow background. There are 68 graphic parameters,
=
. 5 / . 5 . . . / / . > .
7 7 6 7
some of them have very close functions. The exhaustive list of graphic parameters can be read
=
4 4 4 < 4 4 4 4
with ?par; I will limit the following table to the most usual ones.
@
> 5 > 5
7 7
adj
h % % m
( ( (
j j
bg
"
k
"
" "
specifies the colour of the background (e.g.: bg="red", bg="blue", ... the list of the 657 available
h
9
( ( $
j j
" "
( 0
bty
"
" " " "
controls the type of box drawn around the plot, allowed values are: "o", "l", "7", "c", "u" or "]"
, 0 ( 0 $ (
j j j j j
" k " k
(the box looks like the corresponding character); if bty="n" the box is not drawn
h m
, , 0
cex
a value controling the size of texts and symbols with respect to the default; the following parameters
$ ( 0 ( 0
"
have the same control for numbers on the axes, cex.axis, annotations on the axes, cex.lab, the
! !
$ ( , ,
j j j j
"
"
(
j j j
col
controls the colour of symbols; as for cex there are: col.axis, col.lab, col.title, col.sub
(
j j j
font
h % m
an integer which controls the style of text (0: normal, 1: italics, 2: bold, 3: bold italics); as for cex
0
j j j
j j j
22
las
h %
an integer which controls the orientation of annotations on the axes (0: parallel to the axes, 1:
0
m
( $
j j
lty
h
controls the type of lines, can be an integer (1: solid, 2: dashed, 3: dotted, 4: dotdash, 5: longdash, 6:
j j j j j j
" "
twodash), or a string of up to eight characters (between "0" and "9") which specifies alternatively the
m h m
0 ( 0 0 $
"
"
"
" k " " "
length, in points or pixels, of the drawn elements and the blanks, for example lty="44" will have the
! !
, 0 , 0 $
j j j
!
lwd
( 0 0
mar
a vector of 4 numeric values which control the space between the axes and the border of the figure of
$ ( $ ( 0 0 (
the form c(bottom, left, top, right), the default values are c(5.1, 4.1, 4.1, 2.1)
( $ (
mfcol
a vector of the form c(nr,nc) which partitions the graphic window as a matrix of nr lines and nc
$ 0 0 0
h m
( 0 (
mfrow
h m
( 0 0
pch
controls the type of symbol, either an integer between 1 and 25, or any single character within ""
0 0
j j
ps
"
"
!
0 à ,
pty
" "
a character which specifies the type of the plotting region, "s": square, "m": maximal
) ! !
0 ( ,
j j
tck
"
"
k k
" "
a value which specifies the length of tick-marks on the axes as a fraction of the smallest of the width
! !
$ ( 0 , 0
"
0
tcl
a value which specifies the length of tick-marks on the axes as a fraction of the height of a line of text
$
(
0
"
(
xaxt
if xaxt="n" the x-axis is set but not drawn (useful in conjonction with axis(side=1, ...))
h m
( 0 ( ( 0
ß
yaxt
h m
if yaxt="n" the y-axis is set but not drawn (useful in conjonction with axis(side=2, ...))
( 0 ( ( 0
Þ
23
ö
G G
H F F H w
Even more than for graphics, it is impossible here to go in the details of the possibilities
/ / /
?
offered by R with respect to statistical analyses. A wide range of functions is available in the
P
. . / 5 . 5 5 5 5
6 6 7 ?
v
/ 5 5 . .
7
4 v 4 < 4
Several contributed packages increase the potentialities of R. They are distributed separately
=
. 5 . / 5 . / 5 . . / .
? 7 6 7 6
4 4 <
U U
/ / /
V
U U U
gee
7
4 4 4 4
multiv
> . 5 5 . . / 5 5 5
7 ? 6 7 6
A
5 . .
6 7
nlme
survival5A
survival analyses; 7 ? ? 6
tree
4
time-series analyses
tseries
> . 5
6
> . > 5 .
7
V
U U U U
/ /
6 7 7 6 ?
interesting packages:
dna
> 5 / 5 > . 5 5 / . 5 /
7 7 7 7 7
4 4 4
gnlm
5
5
. 5 . >
stable
/
6 7 7
growth
/
7
repeated
/
7
event
/ /
7
4 < 4
rmutil
. 5 5 5 . . . 5 5 . / > .
7
A
“An Introduction to R” (pp 51-63) gives an excellent introduction to statistical models with R.
/ /
7 ? 7
Only some points are given here in order that a new user can make his first steps. There are
/ /
6 ? 7
/
? 7
4 4
lm
4 4 4
glm
nlm
7
4 < <
For example, if we have two vectors x and y each with five observations, and we wish to
X
. > / . 5 . 5 5
O ? ? O 6 ? ?
> lm(y~x)
Call:
lm(formula = y ~ x)
A
Coefficients:
24
(Intercept) x
0.2252 0.1809
< < 4 < V
. 5 5 5 5 . 5 / 5 5
6 7 7
if we type mymodel, the display will be the same than previously. Several functions allow the
/ / /
6 6 ? 7 6 ? 7
4 4 4 4 4 < 4
user to display details relative to a statistical model, among the useful ones summary()
. / . > > 5 5
7 6 ? 7 7
A
4 4 4 < 4 < 4
displays details on the results of a model fitting procedure (statistical tests, ...), residuals()
/ 5 . > 5 / . .
6 7 7
displays the regression residuals, predict() displays the values predicted by the model, and
/ / /
6 7 6 ? 7 6
/ /
6 ?
> summary(mymodel) A
Call:
lm(formula = y ~ x) A
Residuals:
Á
1 2 3 4 5
1.0070 -1.0711 -0.2299 -0.3550 0.6490
Coefficients:
A
A
A A A A
A
> residuals(mymodel) A
1 2 3 4 5
1.0070047 -1.0710587 -0.2299374 -0.3549681 0.6489594
> predict(mymodel)
1 2 3 4 5
0.4061329 0.5870257 0.7679186 0.9488115 1.1297044
> coef(mymodel)
(Intercept) x
0.2252400 0.1808929
It may be useful to use these values in subsequent computations, for instance, to compute
/ /
6 7 7 7 ? 7 7 7 7 7
/
? 7 6
º
> a + b*newdata
To list the elements in the results of an analysis, we can use the function names(); in fact, this
7 6 7 7
V
7 6 7 6
> names(mymodel)
A
A A
25
> names(summary(mymodel)) A
A
"r.squared" A
"adj.r.squared" A
A
4 < 4 4
> 5 > . 5 5
6 O 6
> summary(mymodel)["r.squared"] A
A
$r.squared A
[1] 0.09504547
Formulae are a key-ingrediant in statistical analyses with R: the notation used is actually the
7 6 6 7 7 6
same for (almost) all functions. A formula is typically of the form y ~ model where y is the
/
7 7 6 6
analysed response and model is a set of terms for which some parameters are to be estimated.
/ /
6
4 4
These terms are separated with arithmetic symbols but they have here a particular meaning.
=
a+b
a:b
identical to a+b+a:b
a*b
4 4 <
/ 5 > / .
6 7
4 4 4 4 4 4
^n
5 5 . 5 / 5
7 7 ?
a+b+c+a:b+a:c+b:c
b%in%a
a-b
. > . > / 5
? O
< <
a+b+c+a:c+b:c, y~x-1 forces the regression through the origin (id. for y~x+0, or
. . . 5 . . 5 . .
7
0+y~x)
We see that arithmetic operators of R have in a formula a different meaning than the one they
/
? 7 6
have in a classical expression. For example, the formula y~x1+x2 defines the model y = β1x1 +
/ /
? 7
é è