P. 1
LearnR

# LearnR

|Views: 60|Likes:

See more
See less

11/11/2012

pdf

text

original

## Sections

• Introduction
• 1.1 Why learn R?
• 1.2.1 What is R?
• 1.2.2 What are the diﬀerences between S-Plus and R?
• 1.3 Basic R commands
• Basic arithmetic in R
• Getting help in R
• 3.1 Help in R itself
• 3.2 Help outside R
• 3.3.2 Another look
• 4.1 Introduction
• 4.2 Entering data directly
• 4.5 Data frames
• 4.6 Examining data
• 4.6.1 Initial examination of the data frame
• 4.6.2 Initial examination of the variables
• 4.6.3 Initial plotting of the data
• Matrices in R
• Plotting and graphics in R
• 6.1 Plotting data
• 6.2 Plotting functions
• Simple analyses
• 7.1 Formula notation
• 7.2 Function output
• How to run your code in R
• More on plotting
• 9.1 Introduction
• 9.3. CHANGING THE PLOTTING CHARACTERS 43
• 9.3 Changing the plotting characters
• 9.4 Changing colours
• 9.5 Changing the format of the axes
• 9.6 Selecting parts of a data frame
• 9.8 Saving ﬁgures
• 9.9 Other options

# Learning R

Peter K Dunn

2

Contents
1 Introduction 1.1 Why learn R? . . . . . . . . . . . . . . . . . . . . . 1.2 About R . . . . . . . . . . . . . . . . . . . . . . . . 1.2.1 What is R? . . . . . . . . . . . . . . . . . . 1.2.2 What are the diﬀerences between S-Plus and 1.3 Basic R commands . . . . . . . . . . . . . . . . . . 2 Basic arithmetic in R 3 Getting help in R 3.1 Help in R itself . . . . 3.2 Help outside R . . . . 3.3 Using R functions . . . 3.3.1 An initial look . 3.3.2 Another look . 1 1 2 2 3 3 5 9 9 9 10 10 12 15 15 15 16 17 19 20 20 21 22 27

. . . . . . R? . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

4 Loading data 4.1 Introduction . . . . . . . . . . . . . . . . . . 4.2 Entering data directly . . . . . . . . . . . . 4.3 Loading text data ﬁles . . . . . . . . . . . . 4.4 More on loading data . . . . . . . . . . . . . 4.5 Data frames . . . . . . . . . . . . . . . . . . 4.6 Examining data . . . . . . . . . . . . . . . . 4.6.1 Initial examination of the data frame 4.6.2 Initial examination of the variables . 4.6.3 Initial plotting of the data . . . . . . 5 Matrices in R

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

6 Plotting and graphics in R 29 6.1 Plotting data . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 i

ii 6.2

CONTENTS Plotting functions . . . . . . . . . . . . . . . . . . . . . . . . . 31

7 Simple analyses 33 7.1 Formula notation . . . . . . . . . . . . . . . . . . . . . . . . . 33 7.2 Function output . . . . . . . . . . . . . . . . . . . . . . . . . . 35 8 How to run your code in R 9 More on plotting 9.1 Introduction . . . . . . . . . . . . 9.2 Adding labels . . . . . . . . . . . 9.3 Changing the plotting characters 9.4 Changing colours . . . . . . . . . 9.5 Changing the format of the axes . 9.6 Selecting parts of a data frame . . 9.7 Adding a legend . . . . . . . . . . 9.8 Saving ﬁgures . . . . . . . . . . . 9.9 Other options . . . . . . . . . . . 39 41 41 42 43 43 43 44 45 46 47

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

Chapter 1 Introduction
1.1 Why learn R?

 r is available for most operating systems (including Windows, Mac, and most ﬂavours of Linux including bsd)1 .  r is open source.  Because r is open source, the code is freely available to anyone. This has the advantages that:

– Anyone can examine the code to ﬁnd (and ﬁx) errors; – Unlike some statistical (and mathematical) packages, it is possible to know exactly what the software does by looking at the code. This means the software has been examined by many experts to ensure quality of the statistical procedures.
 Because r is free, it is easy and cheap to ensure you have the most up-to-date versions.  r is very powerful.  r is extendable, with numerous add-on packages available.  Because r is open source and used by many active and renowned scientists and statisticians, r is often the software of choice for implementing new ideas.
For example, spss is not available for Macs, and only for some ﬂavours of Linux, and not for bsd
1

1

visualise. a very high level language and an environment for data analysis and graphics. notably bioinformatics (using the BioConductor package for r). Open Source implementation of S. with conceptual integrity. It is a command-line package. or a statistics environment. you can program r to do it. Chambers. the Association for Computing Machinery (ACM) presented its Software System Award to John M.2. In 1998. . r is (or is becoming) the standard environment for analysis. The two main expression of the S language are:  S-Plus.  r is an object-oriented language. and manipulate data. . and  r.  In many areas.1 About R What is R? r is a statistics package. and eﬀort of John Chambers. widely accepted.  S is an elegant. taste. thanks to the insight. the principal designer of S. r is based on the S language.  r is actively supported by many well-known and accomplished statisticians and scientists. universities and commercial companies use r. 1.2 1. unlike many popular statistics packages.  r is actively supported through a variety of means (see Section 3).2 CHAPTER 1. which has forever altered the way people analyse. in a similar way that MATLAB is command-line driven. an free. . a commercial (and often expensive) implementation of S.  r is programmable: if r can’t do a particular task. and enduring software system. INTRODUCTION  Many research institutions.  r produces publication quality graphics. for  the S system.

and for most purposes they can be considered together apart from the interface. S-Plus has some functionality not found in r.  r has some useful functions not found in S-Plus (such as adding Greek letters to plots). r is free and Open Source.  There are a few diﬀerences in the language also.  S-Plus has a fancier gui. The functionality of r is growing rapidly.3 Basic R commands  Once r has been started. or are free in r. BASIC R COMMANDS 3 1.  S-Plus has many (additional cost) add-ons that extend its functionality beyond r.  r is command-line based (like.1. however:  S-Plus is commercial. Likewise. Matlab).  r is available “out of the box” on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux).  In the command window.  S-Plus often has fancier graphics capabilities. S-Plus is available for Windows and some ﬂavours of Linux. not point-and-click (like. say.3. a > character at the far left (the ‘command prompt’) indicates r is awaiting your instructions. but not MacOS. spss or Word). Low-level users might never notice the diﬀerences.2 What are the diﬀerences between S-Plus and R? Since both S-Plus and r are implementations of S. There are some diﬀerences. . they are very similar. 1. In contrast. It also compiles and runs on Windows and MacOS. say. you can quit by typing q().2.

4 CHAPTER 1. INTRODUCTION .

Chapter 2 Basic arithmetic in R  Arithmetic is performed as one would expect: > 3^2 * exp(2) .57128 > 2 * pi [1] 6.2i ) * (-3 1i) You must tell r you wish to do complex arithmetic.283185  r can do complex arithmetic. try. This means that abc.5015 > 67/(log(2) + 1) [1] 39. compare the output of > sqrt(-1) [1] NaN with the output of sqrt(-1 + 0i) [1] 0+1i  r is case-sensitive. 5 .1 [1] 65. for example. (1 . ABC and aBc are diﬀerent.

model1. this syntax was also valid: x _ 3 which is very hard to read. which is scary as well as not what we want). square brackets are used for indexing vectors and matrices. The following are all valid variable names: x3. This syntax is being phased out (thankfully).x. It is common in r to use the period in variable names. but can contain numbers and dots (and. usually.3 (Best) – x = 3 (OK) – 3 -> x (OK) – Not so many years ago.residual is a valid variable name.2  Variables can be deﬁned using the assignment operator <> > > > radius <. Variables must start with a letter.  Round brackets are used for function arguments. though <. for example.(3).6 CHAPTER 2. even if there are no arguments to the function the parentheses are necessary (otherwise we see the function deﬁnition. All these are the same – x <. 2 is the same as x <-  Assignment is also allowed using =. So use spaces wisely and carefully! 1 .  R works happily with variable names.3 circumference <.pi * radius^2 area [1] 28.is preferred. underscores). But note: x<-3 can be interpretted two ways: x < (-3) or x <. D. except as the ﬁrst character.2 * pi * radius area <.2. However. Numbers can be used in variable names. as the example above for q() shows. x2. But. these parentheses contain arguments to the function.27433 > circumference [1] 18. BASIC ARITHMETIC IN R  All commands must be follows by parentheses. more recently.2.84956 Note that spacing is not important: x <2. use spaces wisely for readability1 .

00 0.0625 4.75 1.0625 2.25 -1.00 -2.2500 1. and so instructions generally work with vectors as easily as scalars: > x <.0000 > x + 10 .0000 1.  r is a vector-based language. try typing the incomplete command 5 + log( and then pressing Return.75 3.title <. 3.50 0.0625 0.2500 3.0000 0.2500 7.pi * radius^2 # Circle area  Two or more commands can be entered on a line at once if separated by a semi-colon. For example.75 2.25 1.50 2.0625 6.0625 0. try entering this all on one line in r: length <.  Strings are usually given in double quotes (though single quotes work in more recent versions too).0000 7.50 [12] -0.25 0.75 -2.0000 [10] 0.7 You can even do things like this: x <. This is useful for explaining what your commands do.5.2500 0. by = 0.5625 1.0000 5.  Commands not completed are automatically continued on to the next line with the + prompt.2500 0.25 0.50 1.75 -0.5625 9.2.50 -1. width <.z <.5625 6. The command prompt will change to + when you should type 1) so r will have a complete command.2500 5. length * width  In the command window. it will then give you an answer.25) > x [1] -3."Title for my plot"  Comments are indicated by the # character: Any text following this character is ignored by R.5625 0.5625 1.00 2.50 -2.2 -> y (but this looks awful) and x <.00 > x^2 [1] 9.00 -1.25 [23] 2. brackets are unmatched).5625 [19] 2.00 1.seq(-3. area <.0625 4.2. for example: plot.00 -0.0000 3. a + character means r is waiting for you to complete an instructions that is unﬁnished (for example.y <.75 -1.25 -2.

50 [12] 9.2. y.00 10.75 10.8 CHAPTER 2.50 11.  When calling functions.00 9. BASIC ARITHMETIC IN R [1] 7.25 8. It is good practice to remove unwanted variables using rm(variable.50 10.00 12.75 13.25 7.50 8.25 [23] 12.3.00 7.50 12.25 10. For example.00 11.75 11.75 12. .50 7. plot(x.25 9. More on this in Section 3. the names of the input variable can be used explicitly (this is unlike MATLAB).00  A list of the current variables can be found using objects() or ls().name). xlab="The x-axis label") explicitly assigns a value to xlab wherever it appears in the deﬁnition of the function plot.75 9.00 8.75 8.25 11.

demo(graphics). RSiteSearch("regression"). example(lm).1 Help in R itself  Open an html search page using help.  If you do not know the function name.org/ – The Comprehensive R Archive Network (CRAN) at http://cran. or r manuals and help pages. au.r-project. Type demo() for a list of available demonstrations. help. help("lm") or ?lm  You can also try using example.html 9  You can search the web pages: .org and enables them to be viewed in a web browser. for example.r-project. 3.  Demonstration of r capabilities are given using demo(). another way is to try.org/ – Search the mailing list of three r groups: http://cran. for example.Chapter 3 Getting help in R 3.search("regression"). This function searches for key words or phrases in the r-help mailing list archives.r-project. org/search. using the search engine at http://search. try.2 Help outside R – The R Project at http://www.r-project.au.start()  If you do not know the function name. for example. for example. try.  If you do know the function name. for example.

and what they mean. Melvin.3 3. these commands are executed. 1. these example are for non-beginners. 2. 5. like: 0.). (519. To ﬁnd such a function. Andreas and Olson.5028553 Ven) 3.1 Using R functions An initial look Suppose we wish to create a sequence (like −1. . see Section 1.  Lines 38–42: Other function that might be relevant.10 CHAPTER 3. and Ripley. When the command example(seq) is given. W.2).50285 Spl/Kra). We learn that seq is for generating regular sequences. N. Cambridge University Press. D. – Venables. Sometimes. (Most books about S-Plus are also relevant.50285 Mai) – Krause. 0. obtained by typing ?seq at the r prompt. Modern applied statistics with S-PLUS. 1. Consider the following: . 1999. (519. Part of an example help page for seq.  Lines 44–63: Some examples of using the function. . 3. New York: Springer. GETTING HELP IN R  The library has books on r. is shown in Figure 3. type help.1. 2002.3. New York: Springer. New York: Springer. (519. 2. John and Braun.  Lines 20–32: The argument to the function. Some useful books include (in subjective rank order from very useful to useful): – Dalgaard. This book is also available as an e-book through the library catalogue. 1997. John. Data Analysis and Graphics Using r: An Example-based Approach. Introductory statistics with r. 4. . 2003. The basics of S and S-Plus. Here is a brief explanation:  Line 5: A brief description of what the function does  Lines 9–18: How to use the function. Peter.search("seq") seq appears to be the appropriate function. B. (519.5 Dal) – Maindonald.

length.6. ’rep’. ’interaction’.with= ) seq(from) Arguments: from: starting value of sequence. by = 2) # match seq(1. length. 5. ’sequence’.b: ’factor’s of same length. by = 3) seq(1.out: desired length of the sequence.9.125. by= ) seq(from. to) seq(from. by = pi)# stay below seq(1. --<snip> --package:base R Documentation 11 See Also: The method ’seq.with: take the length from the length of this argument.POSIXt’.3. ’col’.9.1.05) seq(17) # same as 1:17 <snip> .3.out= ) seq(along. length=11) seq(rnorm(20)) seq(1. along. to. ’row’. to. by=0. to: (maximal) end value of the sequence. USING R FUNCTIONS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 seq Sequence Generation Description: Generate regular sequences. a. by: increment of the sequence. Usage: from:to a:b seq(from.575. As an alternative to using ’:’ for factors. Examples: 1:4 pi:6 # float 6:pi # integer seq(0.

by = 1. the input arguments are named : > seq(from = 0. 5. for example set. use: > rnorm(n = 2. by = 1) [1] 0 1 2 3 4 5  If the arguments appear in the order given. If you want to set the random number seed (and so have reproducible ‘random’ numbers). and you will see diﬀerent values if you type these commands in yourself. GETTING HELP IN R  Generally. they can appear in any order: > seq(to = 5. from = 0) [1] 0 1 2 3 4 5  A variation is: > seq(0. sd = 1) [1] -0. Observe the following1 :  To generate two random numbers from N (0. to = 5.3. use set. This help page also discusses the help for other functions related to the Normal distribution.12 CHAPTER 3. they need not be named: > seq(0. each time the function is run produces a diﬀerent value. 1).2. we can also use: > 0:5 [1] 0 1 2 3 4 5 3. . 5. by = 1) [1] 0 1 2 3 4 5  If the arguments are named. mean = 0.091054 Note that.549638 1 1. and place some integer in the parentheses. since random numbers are used. see Figure 3.seed(). for generating random numbers from a Normal distribution. length = 6) [1] 0 1 2 3 4 5  Since this quite a common command.seed(518768).2 Another look Consider the help for the r function rnorm.

rnorm(n.tail = TRUE. qnorm(p. Arguments: x. mean: vector of means.q: vector of quantiles. log.3. if TRUE. Details: If ’mean’ or ’sd’ are not specified they assume the default values of ’0’ and ’1’. distribution function.p = FALSE) sd=1) Figure 3.3. mean=0.tail = TRUE. n: number of observations. probabilities p are given as log(p). lower. log.p: logical.p = FALSE) sd=1. sd=1. lower. --. p: vector of probabilities. lower. log = FALSE) sd=1. log. mean=0. quantile function and random generation for the normal distribution with mean equal to ’mean’ and standard deviation equal to ’sd’.2: Part of the help for the function rnorm . ’qnorm’ gives the quantile function.<snip> --Value: ’dnorm’ gives the density. and ’rnorm’ generates random deviates. P[X > x]. probabilities are P[X <= x]. the length is taken to be the number required. sd: vector of standard deviations. mean=0.tail: logical. USING R FUNCTIONS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 Normal package:stats R Documentation 13 The Normal Distribution Description: Density. log. respectively. mean=0. pnorm(q. Usage: dnorm(x. otherwise. if TRUE (default). If ’length(n) > 1’. ’pnorm’ gives the distribution function.

n = 2.78126982 [16] -0.70889985 -1.51943184 1.74996129 0.15463424 0.36243538 -0.29019398 -2.40957238 [21] 1.87884444 -1.94138513 -0.9705545 -0.14622623 1.27981588 0.30466297 [26] -0. of course.55097480 -0.54943140 0. GETTING HELP IN R  In the deﬁnition.25422449 1.1378004  We could. we could use: > rnorm(2) [1] 0.1697033 1.03265396 0.91306984 -0. .14058509 [6] -0.1318273 If we produce a lot of random numbers you see how r displays long vectors: > rnorm(30) [1] 0.59014791 1.10732266 [11] 0. type: > rnorm(mean = 0. so we could use: > rnorm(n = 2) [1] 0. sd = 1) [1] -0. the default values for the mean and standard deviation are speciﬁed as 0 and 1 (see Line 16) respectively.14372175 -0.14 CHAPTER 3.6397777 1.44129222 -2.08767391 The numbers in brackets on the far-left indicate the element number in the vector.52921818 0.24283860 -0.02429531 2.0425760  Since arguments need not be named if given in the speciﬁed order.15992394 -0.13591141 -0.83093656 1.

Therefore. Systat. S. 10. spss. 15. as well as data saved by statistical packages such as Minitab. 13. it is vitally important to be able to load data.dbf (dBase) ﬁles.Chapter 4 Loading data 4.1 Introduction The main purpose of r is analysis of data. Certain functions can be used to read and write data stored in a variety of text formats. sas. r is very ﬂexible about importing data. Data can also be entered directly into r.scan() 9 10 15 13 12 Read 5 items > attendance [1] 9 10 15 13 12 15 . 4.2 Entering data directly Data can be entered directly into r using the function c() (for concatenate): > attendance <. and for reading and writing . 12) > attendance [1] 9 10 15 13 12 The function scan() can also be used to read the data line-by-line from the console (input is terminated by entering a blank line): > attendance <. Fortunately.c(9. Stata.

table("http://www.data Gender Age Height Income Male 34 173 34556 Male 45 184 67780 Female NA 175 56449 Male 24 NA 22089 Female 45 169 34402 Female 21 172 26709 Female 26 166 45221 Male NA 189 56223 1 2 3 4 5 6 7 8 So we have a data frame called dummy. but it can contain numerical as well as textual variables.read. it is stored as an object called a data frame. such as  Quantitative variables (such as heights).data. When data is loaded into r. sep = "\t") > dummy.data\$Income) Min. .  Text (or string) variables (such as people’s names). 67780 Incidentally. > dummy. For example.  Categorical variables (such as gender).4.usq.dat". r also handles missing values well.data <.5 Data frames Data ﬁles may contain diﬀerent types of variables.au/staff/dunn/Datasets/datafile1.data\$Gender [1] Male Male Female Male Levels: Female Male > summary(dummy. 1st Qu. A data frame is like a matrix. and many others. this example demonstrates that r is object oriented : the summary command produces diﬀerent output depending of what type of input it is asked to summarize. 42930 56280 Max. + header = TRUE.edu.data\$Gender) Female 4 Male 4 Female Female Female Male > summary(dummy.5.sci. 22090 32480 1 Median 39890 Mean 3rd Qu. How do we just access one variable in that data frame? As follows1 : > dummy. DATA FRAMES 19 4.

: 6.:16. like the mean and median: > mean(Length) [1] 15.0 1 21 1 2 3 4 5 6 > tail(jf) Breadth Length Site 19 20 2 15 21 2 16 21 2 21 21 2 19 22 2 20 22 2 41 42 43 44 45 46 > summary(jf) Breadth Min.000 1st Qu.34 3rd Qu.80 3rd Qu.000 Max.00 Max.000 Mean :1.00 1st Qu.25 Median :15. :21.:19.25 .00 Max.:2.00 Mean :13.00 Site Min.0 1 8.000 Notice that summary produces diﬀerent information for diﬀerent sort of variables: again.0 1 7.00 Length Min.2 Initial examination of the variables All the usual statistical summaries of data are available in r. with quantitative variables:  measures of centre. :22.5 1 7.0 9.00 1st Qu.0 1 6.4. :2.5 8.0 1 6. this is the object-oriented aspect of r.000 Median :2.0 9.:13. 4. :1.5 9.25 Mean :15.:10.6.6.522 3rd Qu.0 10. EXAMINING DATA > head(jf) Breadth Length Site 6.80435 > median(Length) [1] 16.:1.00 Median :16. for example. : 8.0 9.

0 1. 1. 1. 15.000 Median 2.000 Median 16. 1st Qu.4 1.80 19.000 Mean 3rd Qu.00 13.8 2.000 Max.6 1. 8. 1st Qu.522 2. variance and interquartile range: > sd(Length) [1] 4.00 4.6.0 1.2 1.000 1.00 Max. much like we saw above: > summary(Length) Min.25 Mean 3rd Qu. like the standard deviation.3 Initial plotting of the data We should also do some plots of the data: > plot(jf) 8 10 14 18 22 q q q q q q q q q q q Breadth q qq q q q q q q q 22 q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q qqq q q 14 Length q q q q q q q q q q q q q q q q q 8 10 18 q q q q q q q q q q qq q qqq q q q q q q qqq q q q q q q q q q qq q qqq q q q q q q 10 15 20 1.2 1.0 1. 2.0 10 q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q 15 20 .167971 > var(Length) [1] 17.6 1.37198 > IQR(Length) [1] 6 A simpler overview of a variable is provided using summary.4 Site 1.8 2.22 CHAPTER 4. 22. LOADING DATA  measures of spread.00 > summary(Site) Min.

2 .4. 2)) hist(Breadth) hist(Length) boxplot(Length) plot(Site) par(mfrow = c(1.c) ) splits the graphics windows into r rows and c columns. we can examine each variable separately2 : > > > > > > par(mfrow = c(2.6. here is an alternative plot: The command par( mfrow=c(r.8 0 10 20 30 40 Index Relationships may also be examined: > plot(Breadth.screen.0 1.4 qqqqqqqqqqq qqqqqqqqqqq 8 1. EXAMINING DATA Even better. 1)) Histogram of Breadth 15 Frequency Frequency 12 23 Histogram of Length 10 5 0 10 15 Breadth 20 0 8 4 8 12 16 Length 20 qqqqqqqqqqqq qqqqqqqqqqqq 18 Site 12 1. Also see the help (or run the example) for layout and split. Length) 22 q q q q q q q q q q q q q q q q q q q q q q Length 14 16 18 20 q q q q q q 12 q q q q q 10 q q q q q 8 10 Breadth 15 20 Some markers represent more than one observation.

1. col = c("red". This shows us that the data for each Site is diﬀerent. And remember: > detach(jf) .24 > sunflowerplot(Breadth. pch = 15. even though the linear relationship looks quite similar: Salamander Bay jellyﬁsh are generally larger. "Salamander Bay")) The ﬁnal plot is shown in Figure 4. First. using a diﬀerent plotting character (pch) and colour (col): > points(Breadth[Site == 1] ~ Length[Site == 1]. use type="n"): > plot(Breadth ~ Length. Perhaps some kind of linear model is appropriate. + col = "red") > points(Breadth[Site == 2] ~ Length[Site == 2]. LOADING DATA 22 q q q q q q q q 20 q q q q q q q 18 q q q q q q 16 Length q 14 q q q q q q 12 q q q 10 q q q q q q 8 q 10 Breadth 15 20 In either case. add a legend: > legend("topleft". "blue"). plot the data but don’t display the points (that is. + col = "blue") If you want to be really clever. We can also be fancier. Length) CHAPTER 4. there appears to be a linear relationship between Length and Breadth. 15). + legend = c("Dangar Island". type = "n") Then plot Breadth against Length diﬀerently for each Site. pch = 19. pch = c(19.

1: Plotting the jellyﬁsh data.4.6. using diﬀerent plotting characters for each location . EXAMINING DATA 25 q 20 Dangar Island Salamander Bay q 15 q q q q q q q Breadth 10 q q q q q q q q q q q 8 10 12 14 16 18 20 22 Length Figure 4.

byrow = TRUE) > matrix2 <.1] 65 22 152 [1.] [2.Chapter 5 Matrices in R  Matrix arithmetic requires special operators: > matrix1 <.]  Element-by-element multiplication uses no special operators: > matrix1 * matrix1 [. ncol = 1. 22). 1.] [2.1] [. + byrow = TRUE) > matrix1 [.] [3.] 27 . data = c(1. 5).] [2.] [3.1] [. 2.matrix(nrow = 2.] [2. + 0.] [3. ncol = 2. 2.2] 1 4 0 1 4 25 [1.] > matrix1 %*% matrix2 [. data = c(21.matrix(nrow = 3.] > matrix2 [.2] 1 2 0 1 2 5 [1.1] 21 22 [1.

solve(XtX) > XtX.1] [.] 4.] [2. MATRICES IN R [1.inverse <.] [2.440892e-16 1.inverse %*% XtX) [.1] [.] .1] [.] [2.]  Transpose of matrices are found using t() > XtX <.1] [.]  Inverses of matrices are found using solve(): > XtX.] 1.2] 1 0 0 1 [1.2] 1 4 0 1 4 25 CHAPTER 5.2] [1.t(matrix1) %*% matrix1 > XtX [.881784e-15 [2.000000e+00 > zapsmall(XtX.28 > matrix1^2 [.inverse %*% XtX [.2] 5 12 12 30 [1.000000e+00 8.] [3.

use > plot(disp. using formula notation.1 Plotting data First. Individual variables can be plotted also: > plot(mpg) 29 . r plots using points as this is usually what is needed for plotting data. mpg) or. > plot(mpg ~ disp) By default.Chapter 6 Plotting and graphics in R 6. load some data: > data(mtcars) > attach(mtcars) The following object(s) are masked from mtcars ( position 3 ) : am carb cyl disp drat gear hp mpg qsec vs wt > names(mtcars) [1] "mpg" "cyl" "disp" "hp" [10] "gear" "carb" "drat" "wt" "qsec" "vs" "am" To plot one variable against another.

Perhaps better is one of the following > hist(mpg) Histogram of mpg 12 Frequency 0 10 2 4 6 8 10 15 20 mpg 25 30 35 > boxplot(mpg) . PLOTTING AND GRAPHICS IN R q q 30 q q q q 25 q q q q q q q q q mpg 20 q q q q q q q q q q q q 15 q q q 10 q q 0 5 10 15 Index 20 25 30 But that’s not very useful.30 CHAPTER 6.

2. Since r is a statistical package. > x <.x^2 > plot(x. see Figure 6. 3. (The function seq is explained in Section 3.6. Lines can be speciﬁed by plot( x.1). type="l") where the type is the letter ‘ell’ for lines.1. points are the default plot (since it is most useful for plotting data). y) This produces the plot in Figure 6.2 for a variation of Figure 6.3. You do not need to know how to produce this plot.seq(-3.05) > y <.2 10 15 20 25 30 Plotting functions  Since r is a vector-based language.1. PLOTTING FUNCTIONS 31 6.  r can produce publication-quality graphics. plots of functions are easily produced. And don’t forget: > detach(mtcars) . by = 0. but it is important to know that high-quality plots are possible. y.

2 .1: An example plot: y = x2 . PLOTTING AND GRAPHICS IN R q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q qq q qq qq qqq qq qqqqqqq qqqqq y 0 2 4 6 8 −3 −2 −1 0 x 1 2 3 Figure 6.2: A variation of Figure 6.32 CHAPTER 6. An example of quality graphics y = x2 y = (x − 1)2 8 6 y = x2 4 2 0 −3 −2 −1 0 The value of x 1 2 3 Figure 6.

usq.1 Formula notation In this section. Formula notation is used frequently in r.sci.lm(Breadth ~ Length) The formula Breadth ~ Length works as follows: 33 .Chapter 7 Simple analyses From the previous Chapter. a linear relationship is not unreasonable between the length and breadth of jellyﬁsh. It is best seen in practice: > jf. Perhaps it will be useful to look at ?lm now.dat". 7. use the lm function. To ﬁt a linear model. we use the jellyﬁsh data again: > jf <. + header = TRUE) > attach(jf) The following object(s) are masked from jf ( position 3 ) : Breadth Length Site The following object(s) are masked from jf ( position 4 ) : Breadth Length Site > names(jf) [1] "Breadth" "Length" "Site" To use this function.lm <. which stands for linear model.table("http://www.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/jelly-R. we need to understand r’s formula notation.read.

we will see this later.1) Coefficients: Length 0.1) Call: lm(formula = Breadth ~ Length .  On the left of the equation is the response variable. and is often more natural: > sunflowerplot(Breadth ~ Length) . or the ‘X’.34 CHAPTER 7. or the independent variables. So the above model ﬁts a linear regression model of the form ˆ ˆ Breadth = β0 + β1 Length. or the dependent variable.  On the right of the tilde are the predictors or the covariates. To explicitly omit the constant term. or the ‘Y ’ variable.8497 or > lm(Breadth ~ Length + 0) Call: lm(formula = Breadth ~ Length + 0) Coefficients: Length 0. use > lm(Breadth ~ Length . SIMPLE ANALYSES  The tilde ~ means ‘described by’ or ‘modelled by’. This right-hand side can be quite complex. Notice the constant term is ﬁtted by default as this is nearly always what is wanted.8497 Note that plotting can also use formula notation.

lm.4423 Length 0.lm(Breadth ~ Length) What is jf.lm <.lm: > jf.05847 3Q 0.lm) Call: lm(formula = Breadth ~ Length) Residuals: Min 1Q -3.80440 .70711 Max 2.lm? It is an r object. FUNCTION OUTPUT 35 q 20 q q q q q q q q q q q q q q q q q q 15 q Breadth q q q q q q q 10 q q q q q q q q q q q 8 10 12 14 16 18 20 22 Length 7. In fact. But what does this object contain? Let’s have a look: > jf. this is only a brief overview of the object jf.2.7. the output was directed to a variable called jf.9351 In fact.lm Call: lm(formula = Breadth ~ Length) Coefficients: (Intercept) -1.19560 -0. almost everything in r is called an object.2 Function output In the above call to lm.81180 Median 0. We learn more by typing > summary(jf.

001 ^˘¨**^˘´ 0.) The same format can be used to see all the other components also: try > jf.lm) (Intercept) -1.4422622 Length 0.) This summary provides quite a lot of information: a brief summary of the residuals.lm\$coefficients (Intercept) -1.75199 -1.9351363 > resid(jf. codes: 0 ^˘¨***^˘´ 0. summary does something diﬀerent. jf.2e-16 (Recall we used summary before to summarize data objects. we can see the coeﬃcients of the model by typing > jf. Error t value Pr(>|t|) (Intercept) -1.lm is a diﬀerent type of object.287 on 44 degrees of freedom Multiple R-Squared: 0.4422622 Length 0. the corresponding t-ratios and corresponding P -values. the parameter estimates.0616 .lm\$residuals Typing jf.residual" "model" So.9351363 (Note this is similar to how we accessed variables within a data frame if it is not attach-ed.lm\$residuals becomes a bit cumbersome after a while. Since jf.^˘´ 0.9014 F-statistic: 412.values" "assign" [9] "xlevels" "call" "effects" "qr" "terms" "rank" "df. Adjusted R-squared: 0. SIMPLE ANALYSES Coefficients: Estimate Std. plus some more.44226 0.01 ^˘¨*^˘´ 0. for example.311 <2e-16 *** --Signif.05 ^˘¨. their standard errors.1 ^˘¨ ^˘´ 1 aAY aAZ aAY aAZ aAY aAZ aAY aAZ aAY aAZ Residual standard error: 1.lm) [1] "coefficients" "residuals" [5] "fitted. for the commonly requested components.lm is actually an object that contains a lot of information: > names(jf. there are usually shortcuts: > coef(jf.lm)[1:5] . Length 0.9036.04604 20.36 CHAPTER 7. p-value: < 2.918 0.93514 0.5 on 1 and 44 DF.

this demonstrates r’s object oriented approach: when you call plot.) You can also plot objects: > par(mfrow = c(2.08 Fitted values Leverage (Again. r determines what type of object you wish to plot.55846774 37 (Note the [1:5 ] means to extract just the ﬁrst ﬁve residuals.46117214 -0.5 6 8 12 16 0. 2)) > plot(jf.2. 1)) Residuals vs Fitted q q q qq q qqq q qq qq qq Standardized residuals Normal Q−Q 44 22 q q q qqq qq qq q qq qq qq qq qq qq qq q qq q qq qq qq qq qq qqq q qq q 42 −3 42 q 6 8 12 16 −2 44 q q 22 qq q q qqqqq q qqq q q qq q q Residuals 0 2 0 2 −2 −1 0 1 2 Fitted values Theoretical Quantiles Standardized residuals Scale−Location q 22 Standardized residuals Residuals vs Leverage q q qq q qq qq q q q qq qq q q q q 44 q q q q q q q q q qq q q q q q 1.lm) > par(mfrow = c(1.0 0.02603587 5 0. FUNCTION OUTPUT 1 2 3 0.00 0.47396413 4 0.97396413 -0.0 −3 q q q q −1 qqq q qq qq qq qq qq 1 q 42 q 44 q q q qq qq q q q q qqq qq q q 43 Cook's distance q 42 0.) And don’t forget: > detach(jf) .04 0. and plots sensible things accordingly.7.

SIMPLE ANALYSES .38 CHAPTER 7.

R  Ensure the working directory of r is set to where this ﬁle is located (for example.R"). source-ing a ﬁle also appears in the File menu item. for example.  In r. In Windows.r or whatever.  Type your r commands into this text editor. The extension can be anything you like: . such as Notepad. Wordpad.Chapter 8 How to run your code in R Typing a series of commands at the r prompt is useful for small analyses. However. using getwd() and setwd()). you can make them in the . Kwrite. for example. source("analysis.  If you have any changes you need to make.txt.R. Emacs. protocol and tradition encourage one to use the .R extension.R ﬁle and run the code again quite easily. analysis. 1 39 . This approach has at least two signiﬁcant advantages over working at the command prompt alone:  You keep a permanent copy of what you have done. . But what if you have a lot of analyses to do? Or what if you want to keep track of what you are doing? Or you want to keep repeating very similar but long commands till you get thing just right? The best way is to proceed as follows:  Start your favourite text editor. type.  Save the ﬁle somewhere with the extension1 .

you can get extra functionality using ESS (Emacs Speaks Statistics). In the following Chapters. we will assume you are using scripts.40 CHAPTER 8. . HOW TO RUN YOUR CODE IN R If you use Emacs.

22 1 0 3 1 "drat" "wt" "qsec" "vs" "am" Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant 41 .7 8 360 175 3.90 2.1 6 225 105 2.15 3.61 1 1 4 1 21.85 2.0 6 160 110 3.Chapter 9 More on plotting 9.4 6 258 110 3. concerning fuel consumption and ten aspects of automobile design and performance for 32 US automobiles (1973–74 models).02 0 1 4 4 22.320 18.02 0 0 3 2 18.1 Introduction To learn more about plotting.215 19.8 4 108 93 3.44 1 0 3 1 18.08 3.76 3.620 16. This data set comes with r.0 6 160 110 3.440 17. we will use the data set mtcars.875 17. Load this data and have a brief look at the data as follows: > data(mtcars) > attach(mtcars) The following object(s) are masked from mtcars ( position 3 ) : am carb cyl disp drat gear hp mpg qsec vs wt The following object(s) are masked from mtcars ( position 6 ) : am carb cyl disp drat gear hp mpg qsec vs wt > names(mtcars) [1] "mpg" "cyl" "disp" "hp" [10] "gear" "carb" > head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb 21.46 0 1 4 4 21.460 20.90 2.

42 CHAPTER 9. Let’s initially examine the relationship between fuel consumption (mpg. xlab = "Weight (in pounds)". You might like to make changes to the plotting commands in a . MORE ON PLOTTING As with any data analysis. as explained above. + main = "Fuel consumption against Weight") If the main title is too long. we like to start with plots. This plot is pretty basic. or miles per gallon) and vehicle weight (wt): > plot(mpg ~ wt) q q 30 qq q q 25 q q q q q q q q q mpg 20 q q q q q q q q q q qq q q 15 q 10 q q 2 3 wt 4 5 Note that the formula notation has been used.2 Adding labels We can alter the axis labels as follows: > plot(mpg ~ wt. This is much easier than retyping similar commands over and over. ylab = "Consumption (miles per gallon)". xlab = "Weight (in pounds)". xlab = "Weight (in pounds)". where the tilde ~ means ‘is modelled by’ or ‘is described by’. ylab = "Consumption (miles per gallon)") We can also add a major title: > plot(mpg ~ wt. + main = "Fuel consumption against Weight") . In this section. but still quite decent for a default plot. ylab = "Consumption (miles per gallon)". and then source this ﬁle. we discuss plotting options. you can split it across lines: > plot(mpg ~ wt.R ﬁle. 9.

but I usually ﬁnd it makes the graphics too cluttered and looks too much like an extension of the horizontalaxis label. xlab = "Weight (in pounds)". + ylab = "Consumption (miles per gallon)". pch = "@". main = "Fuel consumption against Weight") . xlab = "Weight (in pounds)". + ylab = "Consumption (miles per gallon)". xlab = "Weight (in pounds)". CHANGING THE PLOTTING CHARACTERS 43 There is also a sub-title that can be set. (The standard x11 colour names are used. pch = 19. deﬁned by using pch. Try this: > plot(mpg ~ wt. las = 1.5 Changing the format of the axes I prefer my axis labels to always be horizontal. + ylab = "Consumption (miles per gallon)". main = "Fuel consumption against Weight") Or we can use a predeﬁned list (pch=19 is one of my favourites): > plot(mpg ~ wt. main = "Fuel consumption against Weight") Colours may be deﬁned explicitly as above. 9. main = "Fuel consumption against Weight") To see the list of predeﬁned plotting characters. col = "red". xlab = "Weight (in pounds)". We can specify the character directly: > plot(mpg ~ wt. titles can be added using the title() command. col = "red". More on this later.3. pch = 19. 9. see example(points).3 Changing the plotting characters Points can be plotted using many diﬀerent types of characters. or numerically using a colour code.9. After you have drawn a ﬁgure. In r. + ylab = "Consumption (miles per gallon)".4 Changing colours Diﬀerent colours can be used also: > plot(mpg ~ wt.) 9. sometimes this is almost essential if you make small graphics. pch = 19.

main = "Fuel consumption against Weight for\ This sets up the axes. To do this. pch = 1. a simple scatterplot hides information. MORE ON PLOTTING 30 Consumption (miles per gallon) q q 25 q q q q q q q q q 20 q q q q q q q q q q 15 qq q q q 10 2 3 4 5 q q Weight (in pounds) 9. the number of cylinders is probably quite important. col = "blue") points(mpg[cyl == 8] ~ wt[cyl == 8]. > + > > > plot(mpg ~ wt. col = "green") . main = "Fuel consumption against Weight for\ points(mpg[cyl == 4] ~ wt[cyl == 4].) > plot(mpg ~ wt. axis labels. type = "n".6 Selecting parts of a data frame Often. col = "red") points(mpg[cyl == 6] ~ wt[cyl == 6]. in the relationship between fuel consumption and weight. but separately for the number of cylinders. but doesn’t plot any data. type = "n". las = 1. In the data frame.44 Fuel consumption against Weight q q qq CHAPTER 9. xlab = "Weight (in pounds)". First. is numeric. For some purposes. we plot the data invisibly (using type="n". las = 1. it is useful to think of the number of cylinders as a factor or categorical variable. For example. xlab = "Weight (in pounds)". we select subsets of the data frame to plot (and note the use of == rather than =). Consider this: > coplot(mpg ~ wt | factor(cyl). the number of cylinders. Now we want to add data points. cyl. pch = 3. where n means ‘no plot at all (please)’. ylab = "Consumption (miles per gallon)". and so forth. + ylab = "Consumption (miles per gallon)". columns = 3) But we can do something similar with the standard scatterplot also. pch = 19.

xlab = "Weight (in pounds)". col = "green") legend("topright". 3). "6 cylinders".7. > + > > > > + plot(mpg ~ wt. las = 1.66364 > median(mpg[cyl == 8]) [1] 15. pch = 19.9. ylab = "Consumption (miles per gallon)". ADDING A LEGEND Fuel consumption against Weight for a sample of 32 US vehicles q q qq 45 30 Consumption (miles per gallon) q q 25 q q q q q q q q q 20 q q q 15 10 2 3 4 5 Weight (in pounds) Note that this way of selecting part of a data frame applies in general: > mean(mpg[cyl == 4]) [1] 26. pch = 3. legend = c("4 cylinders". "8 cylinders")) . main = "Fuel consumption against Weight for\n a samp points(mpg[cyl == 4] ~ wt[cyl == 4]. type = "n". pch = 1. col = "blue") points(mpg[cyl == 8] ~ wt[cyl == 8]. "blue". pch = c(19.7 Adding a legend Finally. we might add a legend.2 9. col = c("red". col = "red") points(mpg[cyl == 6] ~ wt[cyl == 6]. 1. "green").

print(png. pdf or another r ‘device’ and plot directly to this device: I don’t use Windows. 1 . in general. in reality.png") dev. another function is usually used with sensible defaults set: hist( wt ) dev.file="PracticePlot.) This is so common.eps") To save as a png or jpeg ﬁle. Noentheless. everything in this section applies to all operating systems. MORE ON PLOTTING 4 cylinders 6 cylinders 8 cylinders 30 Consumption (miles per gallon) q q 25 q q q q q q q q q 20 q q q 15 10 2 3 4 5 Weight (in pounds) Now it is clear that cars with more cylinders are. Perhaps you can right-click on the Windows and save it in some formats also. the picture is saved in a ﬁle. eps. you can use the menu to save a ﬁgure in some formats1 Easiest is to save in (encapsulated) postscript: dev.copy2eps(file="PracticePlot.46 Fuel consumption against Weight for a sample of 32 US vehicles q q qq q q CHAPTER 9. in reality. so I know little about it.8 Saving ﬁgures After a picture is produced.) There other method is to open a png. use hist(wt) dev.file="PracticePlot. heavier and use more fuel.print(postscript. the picture is saved in a ﬁle. In Windows.print(jpeg.jpg") (Again otice this is called ‘print-ing’. 9.file="PracticePlot. it can be saved in a variety of formats.eps") (Notice this is called ‘print-ing’ in r. jpg.

off() before the ﬁle is created.9. above we have explored some of the more common options so we didn’t need to examine the plethora of options! But it’s now time to look at the available options. 9.9 Other options There are many options that can be set. Once again. A list is found by typing: ?par And don’t forget: > detach(mtcars) .9. for a beginner the list is intimidating. you can specify a diﬀerent folder/directory by explicitly giving it.pdf") hist(wt) dev. Nothing useful appears on the screen.off() 47 Note the device must be turned oﬀ using dev. In fact. OTHER OPTIONS pdf(file="PracticePlot.

MORE ON PLOTTING .48 CHAPTER 9.

49 . F Daly. A Handbook of Small Data Sets. London. K Y McConway. Chapman and Hall. and E Ostrowski. 1996.Bibliography [1] D J Hand. A D Lunn.

scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->